Privacy-preserving deduplication to enhance federated learning

4 March 2025 | By: Dr Aydin Abadi | 2 min read

As machine learning models become increasingly reliant on large-scale, distributed datasets, ensuring data quality and privacy has never been more critical.

Deduplication is a vital preprocessing step that enhances machine learning model performance and saves training time and energy. However, enhancing federated learning through deduplication poses a new set of challenges, especially regarding scalability and privacy.

Dr Aydin Abadi, Lecturer in Cybersecurity at our School of Computing, discusses these challenges, a solution, and real-world implications.

What is federated learning?

Federated learning (FL) is a machine learning technique that offers a decentralised approach to training models, enabling multiple devices to contribute to model improvement without sharing raw data. However, the presence of duplicated data across devices introduces inefficiencies, such as increased training time, higher resource consumption, and compromised model accuracy.
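As a rough illustration of the idea (not the authors' implementation), federated averaging lets each client run local training and share only model parameters with a server, which averages them. The one-dimensional "model" and function names below are assumptions for the sketch.

```python
# Toy sketch of federated averaging (FedAvg): each client trains
# locally and only model parameters, never raw data, reach the server.
# The 1-D "model" and squared-error objective are illustrative choices.

def local_update(weights, data, lr=0.1):
    """One local gradient step on a squared-error objective."""
    grad = sum(2 * (weights - x) for x in data) / len(data)
    return weights - lr * grad

def federated_round(global_w, client_datasets):
    """Clients train locally; the server averages the returned weights."""
    updates = [local_update(global_w, d) for d in client_datasets]
    return sum(updates) / len(updates)  # simple unweighted average

clients = [[1.0, 2.0], [3.0, 4.0], [2.0, 2.0]]
w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(round(w, 2))  # converges toward the average of the client means
```

With equally sized clients, repeated rounds drive the global weight toward the average of the clients' local optima, without any raw data point ever leaving its device.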

In our latest research, we proposed a novel protocol to address these challenges.

The problem: duplicated data in federated learning

Federated learning allows decentralised devices to collaboratively train machine learning models without sharing their raw data. However, duplicated data across devices can hinder model performance and increase training time. Imagine multiple users typing similar phrases on their phones – their local datasets may contain overlapping information, leading to inefficiencies in the federated training process.

The challenge lies in deduplicating the data, that is, eliminating duplicated entries, without violating the privacy of participating devices. Traditional methods require direct data sharing, which contradicts the core privacy-preserving nature of federated learning.
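To see why a naive workaround also fails, consider clients exchanging plain hashes of their records to spot overlaps. Because typed phrases have low entropy, an observer can recover them by hashing a guess list. This toy example (not part of EP-MPD) shows the leak:

```python
# Why naive deduplication leaks: if clients exchange plain hashes of
# their records, anyone can brute-force low-entropy entries offline.
# Purely illustrative; the strings and guess list are made up.
import hashlib

def digest(item: str) -> str:
    return hashlib.sha256(item.encode()).hexdigest()

client_a = {digest(s) for s in ["hello there", "meet at 5pm"]}

# An eavesdropper with a guess list recovers the underlying text:
guesses = ["hello there", "goodbye", "meet at 5pm"]
recovered = [g for g in guesses if digest(g) in client_a]
print(recovered)  # ['hello there', 'meet at 5pm']
```

This dictionary-attack weakness is exactly what keyed, private set intersection techniques are designed to remove.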

Our solution: Efficient Privacy-Preserving Multi-Party Deduplication

As a solution, we introduced Efficient Privacy-Preserving Multi-Party Deduplication (EP-MPD): a protocol designed to remove duplicates from datasets across multiple devices while maintaining strict privacy guarantees. EP-MPD leverages advanced cryptographic techniques, specifically two novel variants of the Private Set Intersection (PSI) protocol:

  1. Efficient Group PSI, variant I (EG-PSI I): built on symmetric-key cryptography for very high computational efficiency.
  2. Efficient Group PSI, variant II (EG-PSI II): based on oblivious pseudorandom functions (OPRFs), offering stronger privacy at the cost of increased computation.
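To give a flavour of the symmetric-key approach, the minimal sketch below tags items with a keyed PRF (HMAC-SHA256) under a key the clients share, so that matching tags reveal common items while the tags themselves are useless without the key. This is a simplification under assumed key setup, not the actual EG-PSI construction:

```python
# Toy flavour of symmetric-key PSI: clients tag items with a keyed
# PRF (here HMAC-SHA256); equal tags expose only the intersection,
# and tags cannot be brute-forced without the key.
# A simplification, not the actual EG-PSI protocol.
import hmac, hashlib

def tag(key: bytes, item: str) -> str:
    return hmac.new(key, item.encode(), hashlib.sha256).hexdigest()

shared_key = b"distributed-by-a-helper"  # assumption: a key-setup phase exists

alice = ["the cat sat", "on the mat", "hello world"]
bob   = ["hello world", "federated", "the cat sat"]

alice_tags = {tag(shared_key, x): x for x in alice}
bob_tags   = {tag(shared_key, x) for x in bob}

duplicates = sorted(v for t, v in alice_tags.items() if t in bob_tags)
print(duplicates)  # ['hello world', 'the cat sat']
```

Because the comparison operates on keyed tags rather than plain hashes, the dictionary attack that breaks naive hashing no longer applies to anyone who lacks the key.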

By combining these PSI variants with other techniques, EP-MPD enables devices to collaboratively deduplicate data without revealing their raw datasets to one another.

Different devices securely interact with a central server to collaboratively train a model. AI-generated image: DALL-E

Key results: improved performance, accuracy, and privacy

Our extensive experiments demonstrate the significant benefits of EP-MPD:

  • Accuracy improvement: By removing redundant data, EP-MPD improves the accuracy of the resulting federated learning model, ensuring better generalisation to unseen data.
  • Performance gains: We observed up to a 19% improvement (reduction) in model perplexity and a 27% reduction in training time when applying deduplication to federated learning of large language models.
  • Scalability: EP-MPD scales effectively, handling datasets with millions of entries and tens of participating clients.
  • Privacy assurance: The protocol ensures that only the necessary intersection information is revealed, and no raw data is exposed.

Technical highlights: how EP-MPD works

The protocol operates in several phases:

  1. Local preparation: Each client encrypts its dataset using cryptographic keys.
  2. Private set intersection: Clients engage in secure multi-party computation to identify duplicates without exposing their data.
  3. Deduplication: After identifying duplicates, each client removes redundant entries locally.

EP-MPD’s modular design ensures flexibility, allowing it to be integrated with various federated learning frameworks.
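The three phases can be sketched end to end for two clients sharing a PRF key. This is illustrative only: the real protocol pairs clients in a richer group structure and never centralises the tags as this demo does.

```python
# End-to-end sketch of the three phases for two clients.
# Illustrative simplification of EP-MPD, with an assumed shared PRF key.
import hmac, hashlib

def prepare(key: bytes, dataset: list) -> dict:
    """Phase 1 (local preparation): each client tags its records locally."""
    return {hmac.new(key, x.encode(), hashlib.sha256).hexdigest(): x
            for x in dataset}

def find_duplicates(tags_a: dict, tags_b: dict) -> set:
    """Phase 2 (private set intersection): compare tags only;
    raw records are never exchanged."""
    return tags_a.keys() & tags_b.keys()

def deduplicate(tags: dict, dupes: set, keep: bool) -> list:
    """Phase 3 (deduplication): one designated client keeps each
    duplicate; the other removes it locally."""
    return [x for t, x in tags.items() if keep or t not in dupes]

key = b"demo-key"
a = prepare(key, ["alpha", "beta", "gamma"])
b = prepare(key, ["beta", "delta"])
dupes = find_duplicates(a, b)
print(sorted(deduplicate(a, dupes, keep=True)))   # ['alpha', 'beta', 'gamma']
print(sorted(deduplicate(b, dupes, keep=False)))  # ['delta']
```

After deduplication, exactly one copy of each shared record survives across the group, which is what restores training efficiency without any client disclosing its dataset.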

Real-world impact

Our research addresses critical challenges in industries where federated learning is prevalent, such as:

  • Healthcare: Hospitals can collaboratively train models on patient data without sharing sensitive information.
  • Smart cities: Devices in smart cities can improve model accuracy for traffic management systems by removing redundant data.
  • Natural language processing: Companies developing large language models can reduce costs and environmental impact by minimising duplication in training datasets.

Conclusion

Our EP-MPD protocol not only reduces the runtime of federated learning by eliminating redundant data but also improves the overall accuracy of the trained model while maintaining strict privacy guarantees. Our work has been accepted for presentation at the Network and Distributed System Security (NDSS) Symposium 2025.

Our work is a step toward more efficient, privacy-preserving federated learning, and we invite researchers and practitioners to explore the potential applications of EP-MPD in their domains. We look forward to engaging with the community, advancing research in this field, and sharing our ongoing research and future developments.
