According to Gartner®, “through 2030, for data used to train artificial intelligence (AI) models, synthetic tabular data will grow at least three times as fast as real structured data.” 1
Synthetic data has been expanding rapidly in the last few years. To use data to benefit your business, products or research, you need to consider leveraging synthetic data for its many advantages while also avoiding potential risks.
In this blog post, we examine the adoption of synthetic data across industries, its benefits, and provide an overview of the major use cases and recommendations for data, privacy and business leaders.
Here's what you'll learn:
Here's what you'll learn:
- What synthetic data is, a high-level overview;
- Synthetic data's privacy-enhancing benefits;
- The major use cases for synthetic data according to Gartner analysts;
- What to consider when starting a synthetic data project.
What is synthetic data?
The term synthetic data refers to data that has been generated artificially. In synthetic tabular data generation, the goal is to use real world data to train and create synthetic data with similar statistical properties. This newly created synthetic dataset is comparable to the original dataset in terms of quality and statistical distribution. Thus, synthetic data appears and behaves exactly like the source data, but it does not have any direct relation to actual individuals, places or objects. The more realistic synthetic data is, the greater its analytical value.
In most cases, it is created by using machine learning or artificial intelligence algorithms that learn the distributions and relationships within the original dataset.
In most cases, it is created by using machine learning or artificial intelligence algorithms that learn the distributions and relationships within the original dataset.
Synthetic data overcomes many pitfalls, allowing for faster, less expensive, and more scalable access to data that is representative of the underlying source and privacy-preserving.” 2
A synthetic dataset can be generated from real world events, financial transactions, customer, medical, insurance data. Synthetic data is used for a variety of academic and business purposes.
As Gartner points out, “In some cases, real data is too expensive or time-consuming to collect at scale, especially for the types of analysis that customers hope to achieve. Additionally, the real data that clients collect may have imbalances (e.g., lacking “edge cases” or demographic diversity) or face regulatory or company-imposed privacy restrictions that prevent its use outside specific contexts.” 3
As Gartner points out, “In some cases, real data is too expensive or time-consuming to collect at scale, especially for the types of analysis that customers hope to achieve. Additionally, the real data that clients collect may have imbalances (e.g., lacking “edge cases” or demographic diversity) or face regulatory or company-imposed privacy restrictions that prevent its use outside specific contexts.” 3
Does synthetic data enhance privacy?
Synthetic data is gaining popularity as a privacy-enhancing technology (PET). The synthetic data does not have a 1:1 relationship with the original real world data. It is possible to ensure that synthetic data is privacy-preserving and does not leak personal information by providing additional privacy measures, such as training the model using Differential Privacy during the synthesis, or testing the final output against various privacy attacks. If you would like to know more about the technical details of synthetic data generation and how generative adversarial networks are used to create synthetic datasets, read this article.
Of course, privacy preservation is critical for such industries as healthcare, insurance, finance and banking because of the volume of sensitive and personal data involved.
Strict regulations govern the processing of health data in both Europe and the United States, including the EU General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA) and the Health Information Technology for Economic and Clinical Health Act (HITECH).
Switzerland recently updated its Federal Act on Data Protection (FADP), reinforcing privacy obligations and increasing the requirements for documenting processing activities and implementing governance procedures.
GDPR requirements apply to all financial organizations in Europe as of 2018. Financial data is also subject to strict regulations in the United States. Financial institutions are required to comply with federal laws such as the California Consumer Privacy Act (CCPA). Furthermore, these companies are subject to industry-specific standards, for example, the Payment Card Industry Data Security Standard (PCI DSS) or the Gramm-Leach-Bliley Act (GLBA).
In addition, we cannot overlook the Trans-Atlantic Data Privacy Framework, which is set to regulate the flow of data between the EU and the U.S. This framework carries strong obligations for companies handling data transferred from the EU, including the requirement that they self-certify that they adhere to the Principles through the Department of Commerce.
Of course, privacy preservation is critical for such industries as healthcare, insurance, finance and banking because of the volume of sensitive and personal data involved.
Strict regulations govern the processing of health data in both Europe and the United States, including the EU General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA) and the Health Information Technology for Economic and Clinical Health Act (HITECH).
Switzerland recently updated its Federal Act on Data Protection (FADP), reinforcing privacy obligations and increasing the requirements for documenting processing activities and implementing governance procedures.
GDPR requirements apply to all financial organizations in Europe as of 2018. Financial data is also subject to strict regulations in the United States. Financial institutions are required to comply with federal laws such as the California Consumer Privacy Act (CCPA). Furthermore, these companies are subject to industry-specific standards, for example, the Payment Card Industry Data Security Standard (PCI DSS) or the Gramm-Leach-Bliley Act (GLBA).
In addition, we cannot overlook the Trans-Atlantic Data Privacy Framework, which is set to regulate the flow of data between the EU and the U.S. This framework carries strong obligations for companies handling data transferred from the EU, including the requirement that they self-certify that they adhere to the Principles through the Department of Commerce.
Sharing data across jurisdictions and legal entities, according to vendor Statice, is one use case that it comes across in its work with clients.” 4
The use of synthetic data as a PET is promising because privacy-preserving synthetic data does not contain personal information. As long as there is no real world personal information present, personal data regulations do not apply, allowing the synthetic data to be used freely for secondary analysis, shared internally and externally and monetized while remaining compliant and protecting your sensitive data.
The major use cases for synthetic data
For many industries, initiatives focus primarily on internal data. But in others, such as healthcare, the focus is on external collaboration and external data. Both can be enhanced by using synthetic data, and its use cases are numerous, including product development, gaining access to testing data, validating external vendors and partners, and many others.
This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
Machine learning training and testing
A typical machine learning project faces three major challenges: the lack of real-world data, the quality of the data, and the inability to unlock the data to train machine learning algorithms.
Externally or even within the same organization, sharing sensitive datasets can take a significant amount of time. Unlike sensitive datasets, properly anonymized synthetic data does not require lengthy access requests, thus improving time to value. Furthermore, using synthetic data can enhance the quality of the data. It is the ability to have control over the output that gives synthetic data its potential to produce a more balanced, clean, and useful dataset for training machine learning models.
If you lack adequate data, it is too expensive, or there is no consent to use it in a machine learning project, generating synthetic data can fill this gap.
Externally or even within the same organization, sharing sensitive datasets can take a significant amount of time. Unlike sensitive datasets, properly anonymized synthetic data does not require lengthy access requests, thus improving time to value. Furthermore, using synthetic data can enhance the quality of the data. It is the ability to have control over the output that gives synthetic data its potential to produce a more balanced, clean, and useful dataset for training machine learning models.
If you lack adequate data, it is too expensive, or there is no consent to use it in a machine learning project, generating synthetic data can fill this gap.
Synthetic data can alleviate this bottleneck — with a relatively small portion of “real” data, many vendors can generate synthetic data that is statistically similar to the underlying data at scale and use automation to save the company’s resources." 5
Privacy-compliant internal and external data sharing
To open up real world data or obtain secondary consent to use it for Machine Learning purposes is not an easy task. Compliance verification processes can take months. Creating synthetic data with the appropriate privacy protection levels can simplify this process. For instance, using anonymized synthetic data for a new machine learning project does not require secondary consent.
Synthetic data generation also allows you to safeguard the privacy of your customers, thereby reducing their exposure to risk. Thus, you are able to conduct experiments using simulated datasets, test different machine learning models, and process the data without privacy concerns.
Additionally, the use of synthetic data opens up new opportunities for cooperation. You may work with a third party to test a Proof of Concept (PoC) before implementing it on a large scale.
Synthetic data generation also allows you to safeguard the privacy of your customers, thereby reducing their exposure to risk. Thus, you are able to conduct experiments using simulated datasets, test different machine learning models, and process the data without privacy concerns.
Additionally, the use of synthetic data opens up new opportunities for cooperation. You may work with a third party to test a Proof of Concept (PoC) before implementing it on a large scale.
Software product testing and development
Furthermore, synthetic data is increasingly used for the testing of software applications due to its ability to generate large amounts of "fake" data that is realistic and can be used in the same manner as real data without posing compliance risks.
Synthetic test data should not be confused with mock or dummy data, which is also widely used in testing. The major difference is that synthetic data preserves the statistical properties of the real data that it is derived from, thus making it more useful for analysis.
Product professionals can generate synthetic data to gain the ability to utilize data freely to create better products, collaborate with external partners, run MVPs, scale or pivot quickly, and use the cloud more effectively.
Synthetic test data should not be confused with mock or dummy data, which is also widely used in testing. The major difference is that synthetic data preserves the statistical properties of the real data that it is derived from, thus making it more useful for analysis.
Product professionals can generate synthetic data to gain the ability to utilize data freely to create better products, collaborate with external partners, run MVPs, scale or pivot quickly, and use the cloud more effectively.