Data Privacy: Synthetic Data vs GDPR-Compliant Pseudonymisation
One of the buzzwords in the tech and data privacy industries at the moment is synthetic data. Synthetic data is created artificially to address some of the privacy challenges that arise when trying to capture value from personal data. However, while synthetic data can be useful in some niche applications such as software testing, it is far from the panacea it is sometimes portrayed to be, and it has a number of problems that limit its broader use. One of the biggest is that a synthetic data set is tied to the data it was generated from: when new data is added to the original data set, the synthetic data based on that set has to be re-created. While this is just one of the issues with synthetic data, it will in many cases make synthetic data sets impractical to use. It also highlights a common challenge with many of the “hot” data privacy technologies: while they might be technically useful, in practice they are frequently slow, inefficient, of low utility, or impractical to use in a production environment.
On the other hand, GDPR-compliant Pseudonymisation (see www.pseudeonymisation.com) transforms personal data into privacy-respectful data sets that retain high utility, without many of the drawbacks that synthetic data and other “new” data protection technologies suffer from.
First, I’ll provide a quick primer on what synthetic data is and the situations in which it is usually used. Next, I’ll go into more detail on some of the challenges involved in using synthetic data. Finally, I’ll explain how Pseudonymisation approaches these issues differently.
What is Synthetic Data, and Why is it Used?
Synthetic data is artificial data created for a specific purpose, containing all of the characteristics and complexities of a real data set, without personally-identifying information. For synthetic data to be useful, it should closely match the real data, including relationships between data in the set.
There are two main ways to create synthetic data:
- Drawing numbers from real-world statistical distributions to create a new data set; or
- Creating a model that explains the behavior that you have observed, and then going backwards and reproducing new, artificial “behavior” data following the model you created.
In addition, you can input raw data into a machine learning model, which then learns the patterns in the data set. Then, the model can generate artificial data points that have no relationship to the original ones, other than that they are statistically similar.
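As a minimal sketch of the first approach above (drawing from fitted statistical distributions), the Python below estimates the means, standard deviations, and correlation of two numeric columns, then samples new rows from a bivariate normal distribution with those parameters. The toy `real_rows` values, the function names, and the choice of a Gaussian model are all assumptions made for illustration; real data sets usually need richer models.

```python
import math
import random
import statistics

# Hypothetical "real" data: (age, annual income in thousands). Illustrative only.
real_rows = [(23, 31.0), (35, 48.5), (41, 52.0), (29, 39.0),
             (52, 61.5), (47, 58.0), (33, 44.0), (60, 66.0)]

def fit_gaussian_pair(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(params, n, seed=0):
    """Draw n artificial rows from a bivariate normal with the fitted parameters."""
    rng = random.Random(seed)
    rho = params["rho"]
    # Clamp guards against tiny floating-point overshoot when |rho| is near 1.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        # Mixing z1 into y reproduces the correlation between the columns.
        y = params["my"] + params["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

params = fit_gaussian_pair(real_rows)
synthetic_rows = sample_synthetic(params, 5000, seed=1)
```

None of the generated rows is copied from `real_rows`, yet refitting the model on `synthetic_rows` should recover roughly the same means and correlation. Note that the fitted parameters depend entirely on the rows available at fitting time.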
Synthetic data is often used to:
- Perform calculations on data sets that would otherwise contain sensitive or personally-identifying information (potentially alleviating some legal and compliance issues).
- Conduct machine learning and AI development processes.
- Create unusual or atypical data sets, so that testing can be conducted on unexpected situations or outliers.
- Establish baselines for models using data that is representative of authentic data.
- Re-use and repurpose data without having to ask for consents again.
- Support software development, testing, and modelling.
These use cases have driven the growth of synthetic data in areas such as fraud detection modelling and visual AI and ML, for example training autonomous vehicles to drive on simulated roads.
What are the Challenges with Synthetic Data?
While synthetic data can be useful for many situations, there are also several challenges.
First, the algorithm used to produce synthetic data still needs some real data as input to train the model. This means that the real data still needs to be treated with privacy controls to fit within GDPR and other privacy law requirements. As a result, synthetic data does not completely sidestep slowdowns that may come from legal and compliance areas.
In addition, complex data sets can be difficult to reproduce in synthetic form, as doing so requires a large amount of computing power (or a large amount of data, leading to additional privacy issues). When the source data is complex and noisy, the generating model can in some cases overfit, reproducing noise from the original data rather than capturing the important characteristics that predict future patterns. Complex data sets can also be costly and slow to turn into accurate synthetic representations. On the flip side, small data sets may not justify the cost and time of synthetic data generation, and they may not contain enough data points to produce reliable synthetic data. This means that the scope for synthetic data can be somewhat limited: it may be most applicable to large data sets that are not too complex and are not sensitive or private in nature.
In all kinds of synthetic data set production, it’s possible for biases in the original source data to become amplified. This depends to some extent on how the synthetic data production model is trained. Biases can also be introduced by the creators of the synthetic data, though this is hopefully avoided by clear business and data handling policies.
One issue that can come up when synthetic data is used for machine learning and iterative processing is the later combination of synthetic data sets, or the addition of new data to the original source. If you have two or more tables of data you want to protect using synthetic data creation, all of the tables need to be ready and joined before the data can be generated. This is because the statistical relationships between variables within and between tables need to be replicated. If you need to update or supplement data in the source tables, or need to add new tables, the synthetic data creation process needs to begin again.
This becomes a problem when conducting iterative ML and AI development, as new data is constantly added to the data set in these types of processes. For analytics, different kinds of analyses may need to be performed on different sets of data, which can require the re-creation of synthetic data sets multiple times.
Are there Other Options?
There are other options for protecting data in use, such as differential privacy and fully homomorphic encryption (FHE). However, these techniques also have some issues when adding new data to the data set.
With both of these approaches, when data sets are combined or when multiple operations are performed on the data, the data can become vulnerable to the Mosaic Effect, in which pieces of information that are harmless on their own reveal identities once combined. This can result in the re-identification of individuals or a higher likelihood of personal data being exposed.
Like synthetic data generation, FHE is costly in terms of time and computing power, which can limit its usefulness in real-world settings. If techniques such as FHE and synthetic data creation are applied in fast-moving analytics, ML, and AI settings, these inefficiencies can slow projects down and reduce data utility and value.
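To make the general idea of the Mosaic Effect concrete, here is a minimal Python sketch of a linkage between two plain tables: a de-identified table and a public register that share a few indirect identifiers. The data, field names, and `link_records` helper are invented for illustration, and the sketch deliberately uses simple tables rather than differentially private or encrypted outputs.

```python
def link_records(deidentified, public, keys):
    """Attach a name to each de-identified row that matches exactly one person."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)
    linked = []
    for row in deidentified:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match is a re-identification
            linked.append({**row, "name": candidates[0]["name"]})
    return linked

# Hypothetical data sets. Neither contains a name next to a diagnosis.
health = [
    {"postcode": "AB1", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "AB1", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "CD2", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]
register = [
    {"name": "Alice", "postcode": "AB1", "birth_year": 1980, "sex": "F"},
    {"name": "Bob", "postcode": "AB1", "birth_year": 1975, "sex": "M"},
    {"name": "Carol", "postcode": "CD2", "birth_year": 1990, "sex": "F"},
    {"name": "Dana", "postcode": "CD2", "birth_year": 1990, "sex": "F"},
]

reidentified = link_records(health, register, ["postcode", "birth_year", "sex"])
```

In this toy case the first two health records are linked to a single named person each, while the third stays ambiguous because two people in the register share the same indirect identifiers.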
How Can Pseudonymisation and Variant Twins Solve these Problems?
While there are some problems with synthetic data (and other techniques), there is a potential solution: Pseudonymisation. This is not the pseudonymisation that you’ve known from the past. Rather, it is a term newly defined in the GDPR (Article 4(5)), and it is used in Anonos’ state-of-the-art technology platform, BigPrivacy. BigPrivacy uses a combination of data protection techniques (including hashing, tokenisation, generalization, masking, binning, rounding, etc.) alongside GDPR-compliant Pseudonymisation, and protects both direct and indirect identifiers using dynamic tokenisation at the data element level. The combination of these techniques is known as a Privacy Action, and source data can be transformed using these Privacy Actions into data assets known as Variant Twins.
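The sketch below shows, in generic Python, what combining several of the named techniques on one record could look like: keyed tokenisation of a direct identifier, binning of age, masking of a postcode, and rounding of a numeric value. It is not BigPrivacy's API or implementation; every function name, field name, and parameter is hypothetical, and it uses static keyed tokens rather than the dynamic tokenisation described above.

```python
import hashlib
import hmac

def tokenise(value, key):
    """Keyed hash (HMAC-SHA256): the same key and value always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def bin_age(age, width=10):
    """Generalise an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def mask_postcode(postcode, keep=2):
    """Keep the first `keep` characters and mask the rest."""
    return postcode[:keep] + "*" * (len(postcode) - keep)

def apply_privacy_action(record, key):
    """Hypothetical transformation: one protected output row per input record."""
    return {
        "person_token": tokenise(record["email"], key),  # direct identifier
        "age_band": bin_age(record["age"]),              # indirect identifier
        "postcode": mask_postcode(record["postcode"]),   # indirect identifier
        "spend": round(record["spend"], -1),             # rounded to nearest 10
    }

# The key stands in for the additional information that has to be
# kept separately from the protected data.
key = b"hypothetical-secret-held-separately"
record = {"email": "jane@example.com", "age": 37,
          "postcode": "SW1A1AA", "spend": 123.45}
protected = apply_privacy_action(record, key)
```

Without the key, the token cannot be recomputed from an email address, which is the role the separately held additional information plays in pseudonymisation.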
The use of Variant Twins can help to alleviate the issues above with synthetic data, from several different perspectives.
First, by using a method that employs Pseudonymisation, data processing is carried out within the framework of the GDPR. This reduces privacy and compliance concerns, as data can be processed lawfully. Synthetic data still faces some legal and compliance complications around the real data used to create it, which means that if appropriate measures are not in place when that data is used to generate the synthetic data set, the resulting data sets may be limited or risky to use. Pseudonymisation can solve this problem for synthetic data by transforming the original data set into Variant Twins before using them to produce the synthetic data.
Introduced biases can also be alleviated, as the use of Variant Twins prevents data controllers and data privacy officers from seeing which values or identifiers they are handling during processing.
Pseudonymisation allows data to be processed without loss of data utility, which can help to ensure that models are trained on representative data. This also allows more data to be processed, which can reduce overfitting and underfitting and make results more reliable.
With Variant Twins, you avoid having to refresh every table when you add new data. You can protect and use tables as you need them, and by using the same Privacy Action configurations, you can join the tables later. All data that has the same Privacy Action applied is “matchable” and combinable, so you can bring in new data, or update your Variant Twins, after the initial processing or combination without regenerating all of the tables or starting again. This makes Pseudonymisation-enabled Variant Twins a better choice for complex and evolving ML and AI analytics, which are primarily iterative and discovery-driven.
In general, the use of Variant Twins and Pseudonymisation solves many of the problems that you run into with synthetic data. While the name might sound catchy, synthetic data cannot live up to its hype in the practical business world.
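The “matchable” property can be illustrated with a small Python sketch, assuming that applying the same configuration means, at minimum, tokenising the shared identifier with the same secret key. The helper names, tables, and key are hypothetical, and static keyed tokens stand in for whatever the platform actually does.

```python
import hashlib
import hmac

def tokenise(value, key):
    """Keyed hash: identical key and value always produce the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def protect(rows, id_field, key):
    """Replace the identifier in each row with a token; other fields pass through."""
    return [
        {**{k: v for k, v in row.items() if k != id_field},
         "token": tokenise(row[id_field], key)}
        for row in rows
    ]

def join_on_token(left, right):
    """Inner join of two protected tables on their shared token."""
    by_token = {row["token"]: row for row in right}
    return [{**row, **by_token[row["token"]]}
            for row in left if row["token"] in by_token]

key = b"hypothetical-secret-held-separately"
customers = [{"email": "a@example.com", "segment": "retail"},
             {"email": "b@example.com", "segment": "business"}]
orders = [{"email": "a@example.com", "amount": 40},
          {"email": "b@example.com", "amount": 75}]

# Each table is protected independently, at different times if need be.
protected_customers = protect(customers, "email", key)
protected_orders = protect(orders, "email", key)

# A new order arrives later: only the new row is transformed and appended.
protected_orders += protect([{"email": "a@example.com", "amount": 15}], "email", key)

joined = join_on_token(protected_orders, protected_customers)
```

The previously protected rows are left untouched when the new order is added, and the join still lines up because the token for a given identifier does not change.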
This article originally appeared on LinkedIn. All trademarks are the property of their respective owners. All rights reserved by the respective owners.