In the ever-evolving data ecosystem, data privacy and security have become crucial aspects of effective modern data architectures. With businesses increasingly shifting towards multi-party and cloud-based platforms, the management of data protection for complex processing presents a multi-faceted dilemma.
Secondary data use - repurposing data originally collected for a different reason, such as for AI application development - often yields significant business benefits. Yet, enterprises frequently continue to employ methods that do not sufficiently safeguard sensitive data when pursuing these secondary use cases. Fortunately, there are alternatives that do provide effective protection while meeting the utility requirements of secondary data uses.
In this blog post, we will discuss:
- Data encryption, masking, and tokenization as data protection techniques and their limitations
- The advantages of a toolbox approach versus a single-tool strategy in opening up advanced data use cases
- An introduction to pseudonymized Variant Twins, state-of-the-art protected data outputs embodying a novel approach to preserving sensitive data privacy and integrity
The Three States of Data
It is common to classify data as being in one of three states:
At Rest: This refers to data that is stored on physical or digital media. This is data that isn't actively being moved or processed.
In Transit: When data is being moved from one location to another, it is considered to be 'in transit'.
In Use: This refers to data while it is being processed or consumed by applications, users, or transactions.
In a typical project scenario, the data, which often includes personally identifiable information, at first resides in storage, such as in a database. When an analysis is conducted, this data is first retrieved and sent to an analytics engine. While it is being sent, it is in transit. Once there, the data is processed during computation (in use) to derive valuable insights.
The paramount goal of data protection is to ensure privacy and security in each of these states, with an emphasis on the 'in use' state.
This is because while in use, sensitive data becomes most vulnerable to data breaches as it is exposed to applications, systems, or users. Ensuring robust security during the 'in use' state is not just a recommendation, but a necessity in the modern world of data privacy protection.
Companies frequently employ a combination of access controls - to limit who can gain access to the original sensitive data - and encryption to ensure data security while it is at rest or in transit. Nevertheless, access controls do not protect data when in use, and encrypted data must be decrypted to enable use. To protect the decrypted data during processing, companies often resort to simple data masking tools or tokenization of data, even though these approaches diminish the utility of the data and do not adequately protect privacy in complex, multi-party/data source processing environments.
What are Data Encryption, Masking, and Tokenization
Data encryption, masking, and tokenization are the most widespread techniques used for data protection.
What is Data Encryption
Data encryption transforms data into a code or cipher that can only be accessed by those possessing the correct decryption key or code. Classic examples of encryption are password-protected documents and email services that encrypt messages during transit.
The primary role of encryption is to secure sensitive data, ensuring that, whether at rest or in transit, it is not accessible or readable without the decryption key.
To process the data, the decryption key is applied to reveal unprotected cleartext.
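To make the mechanics concrete, here is a deliberately simplified sketch of symmetric encryption in Python. It XORs the plaintext with a keystream derived from the key, so the same function both encrypts and decrypts. This is a toy for illustration only; production systems should use vetted ciphers such as AES-GCM from an established cryptography library.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Derive a pseudo-random keystream by hashing the key with a counter.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    # XOR each plaintext byte with the keystream; unreadable without the key.
    return bytes(p ^ k for p, k in zip(plaintext, keystream(key, len(plaintext))))

decrypt = encrypt  # XOR stream ciphers are symmetric: the same op reverses itself
```

Note that anyone holding the key recovers full cleartext, which is exactly why encryption alone cannot protect data in use.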
What is Data Masking
Data masking is a method where data is obfuscated, typically with a single recurring character, in a way that preserves the original data's format. For instance, a masked card number may appear as **** **** **** 1234, keeping the original data structure intact while protecting sensitive information.
The purpose of using masking solutions is to ensure that sensitive data is not exposed to non-authorized users, while still allowing non-sensitive parts of the data to be used for purposes such as verification, testing, development, or data analysis.
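A masking routine of this kind can be sketched in a few lines of Python. The function name and behavior here are our own assumptions for illustration, not a reference to any particular masking product:

```python
def mask_card_number(card_number: str) -> str:
    """Mask all but the last four digits, preserving the original format."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    seen = 0
    out = []
    for ch in card_number:
        if ch.isdigit():
            seen += 1
            # Keep only the final four digits; mask everything else.
            out.append(ch if seen > total_digits - 4 else "*")
        else:
            out.append(ch)  # spaces and dashes pass through, keeping the format
    return "".join(out)
```

Because spacing and punctuation survive, downstream systems that validate format still work, even though the sensitive digits are gone.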
What is Data Tokenization
Tokenized data is the product of data tokenization, a technique that protects a sensitive data value by replacing it with a randomly generated token value. Tokenizing data requires a system that can map token values back to the original sensitive data.
For example, when tokenized, the credit card number "1234 5678 9012 3456" might be replaced with a token like "ABCD EFGH IJKL MNOP". Anybody seeing the tokenized credit card data would not be able to reverse-engineer the actual credit card number without additional information. This often takes the form of a mapping table relating original data values to their assigned tokens. Access to this table is controlled, allowing only authorized systems or users to retrieve the actual sensitive data when necessary.
An example of credit card processing using tokenization during a purchase.
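A minimal sketch of this pattern in Python, assuming a simple in-memory mapping table (real tokenization systems use hardened, access-controlled vaults):

```python
import secrets

class TokenVault:
    """Minimal in-memory tokenization vault, for illustration only."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Static tokenization: reuse the token if the value was seen before.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # random, carries no information itself
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Access to this mapping must be restricted to authorized systems.
        return self._token_to_value[token]
```

The token alone reveals nothing; only a caller with access to the vault's mapping can recover the original value.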
Limitations of Encryption, Masking, and Tokenization
While data tokenization, masking, and encryption can be effective for certain use cases like testing or protecting data at rest or in transit, they fall short when looking to protect data in use.
As data use cases have evolved, so too have the challenges associated with data privacy and security, along with the accompanying regulatory requirements. As a result, traditional data protection methods such as data encryption, masking, and tokenization no longer provide effective protection:
Data encryption protects data at rest and in transit, but not in use. To use the data, it must be decrypted, thereby exposing it to leaks and breaches. This risk is amplified in a multi-party environment where no one organization manages all the control mechanisms.
Data masking, while useful in certain narrow contexts, diminishes data utility by stripping away meaningful information. It can also inadvertently lead to re-identification of data subjects through the "Mosaic Effect", wherein individually masked datasets are combined to reveal private information.
Data tokenization also degrades data utility, as the randomly generated tokens are not useful for analytics. Static tokenization, where the same token replaces a particular value each time it occurs in order to preserve analytic utility, leaves data vulnerable to re-identification via the "Mosaic Effect".
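The static-tokenization weakness is easy to demonstrate. In this hypothetical sketch, two datasets are tokenized independently with the same static scheme, and anyone holding both can link a person's records without ever seeing the identifier itself:

```python
import hashlib

def static_token(value: str, salt: bytes = b"shared-salt") -> str:
    # Static tokenization: the same input always maps to the same token.
    return hashlib.sha256(salt + value.encode()).hexdigest()[:12]

# Two datasets "protected" independently, but with the same static scheme:
medical = {static_token("alice@example.com"): "diagnosis: X"}
purchases = {static_token("alice@example.com"): "purchase: item Y"}

# An adversary holding both can join records on the token alone,
# without ever seeing the underlying email address.
linked = {t: (medical[t], purchases[t]) for t in medical if t in purchases}
```

One linked pair is all it takes: combine enough such fragments and the mosaic resolves into an identifiable individual.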
These techniques were once sufficient, back when the auxiliary data an adversary would need for unauthorized re-identification was scarce and expensive to obtain. Now that such data is inexpensive and easily acquired, these techniques fall short of the protection and utility requirements of today’s secondary uses of data. A new approach is needed.
The Toolbox Approach
As discussed earlier, real value often comes from secondary uses of data - repurposing it for objectives other than those for which it was originally collected. This could be for developing AI and machine learning models, data sharing, data enrichment, or data monetization.
These secondary uses of data each have very different requirements for both protection and utility. No single data protection technique can meet the needs of all of them, and, as noted above, traditional techniques don’t get the job done even in combination. What’s needed is a toolbox of state-of-the-art Privacy-Enhancing Technologies (PETs).
When evaluating PETs, it's essential to consider their ability to meet the needs of your organization, the level of data protection provided, the preservation of data utility, scalability, and compatibility with your existing architecture. Going beyond simple masking and static tokenization solutions, the toolbox approach introduces newer, more powerful PETs and provides the flexibility and robustness required.
To better illustrate the toolbox approach, let’s consider how a variety of PETs can serve the following seven universal secondary data uses.
Data for application development and testing: Industry standards and regulations often prohibit the use of production data in application development and testing. Simulated data (sometimes called mock or fake data) is artificially generated data that “looks” realistic but isn’t actually drawn from real data, so there are no privacy risks associated with its use.
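As a sketch of the idea, a simulated-data generator can be as simple as the following. Every field name and value range here is invented for illustration:

```python
import random

random.seed(0)  # reproducible fake data for repeatable test runs

FIRST_NAMES = ["Ana", "Ben", "Chen", "Dara", "Eve"]

def fake_customer(i: int) -> dict:
    # Every field is generated from scratch: no real person behind it.
    return {
        "id": i,
        "name": random.choice(FIRST_NAMES),
        "card": "".join(str(random.randint(0, 9)) for _ in range(16)),
        "balance": round(random.uniform(0, 10_000), 2),
    }

test_rows = [fake_customer(i) for i in range(100)]
```

The records exercise the same code paths as production data (formats, lengths, numeric ranges) while carrying zero real personal information.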
Internal data sharing: Various departments within a company may need to share data to coordinate day-to-day business operations. Encryption would be standard practice to safeguard the data during storage and transit. In many cases, simple techniques like masking, tokenization and generalization may provide sufficient risk mitigation.
Data analytics, machine learning, and AI model building: Though novel and cutting edge just a decade ago, machine learning, AI and other advanced analytics are now widely used across all industries and organizational sizes.
Synthetic data stands as perhaps the best PET for the model development or training phase of these tools:
Synthetic data, because it generates a new set of records with no one-to-one relationship with the source records from which they are derived, provides protection against the risk of unauthorized reidentification.
Model building rests primarily on discovering and quantifying statistical relationships between elements in a data set. Synthetic data is designed to do exactly that by mimicking the statistical relationships in a dataset through a completely new set of generated records.
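The points above can be sketched with a toy example: fit the statistical relationship in a source dataset, then sample entirely new records from the fitted model. Real synthetic data generators are far more sophisticated; this illustrates only the core idea, with invented numbers throughout:

```python
import random

random.seed(42)

# Toy source data: (age, income) pairs with a built-in linear relationship.
source_ages = [random.randint(25, 60) for _ in range(500)]
source = [(a, 20_000 + 800 * a + random.gauss(0, 5_000)) for a in source_ages]

ages = [a for a, _ in source]
incomes = [i for _, i in source]

def mean(xs):
    return sum(xs) / len(xs)

ma, mi = mean(ages), mean(incomes)
# Fit income ~ age by least squares to capture the statistical relationship.
slope = sum((a - ma) * (i - mi) for a, i in source) / sum((a - ma) ** 2 for a in ages)
intercept = mi - slope * ma
residuals = [i - (intercept + slope * a) for a, i in source]
noise_sd = (sum(r * r for r in residuals) / (len(residuals) - 1)) ** 0.5
age_sd = (sum((a - ma) ** 2 for a in ages) / (len(ages) - 1)) ** 0.5

def synthetic_record():
    # A brand-new record: mimics the statistics, maps to no source row.
    age = random.gauss(ma, age_sd)
    return (age, intercept + slope * age + random.gauss(0, noise_sd))

synthetic = [synthetic_record() for _ in range(500)]
```

A model trained on the synthetic records learns roughly the same age-income relationship, yet no synthetic row corresponds to any real individual.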
Data analytics, machine learning, and AI model deployment: After the AI and machine learning models have been developed, a company wants to deploy them to improve business operations. Once deployed, synthetic data no longer fits the bill, as the models need to generate insights and inferences from real data. To avoid the risks associated with processing unprotected cleartext, while preserving its utility, a PET known as statutory pseudonymization is a great choice.
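Statutory pseudonymization, as defined in regulations such as GDPR Article 4(5), involves much more than a single function: the additional information needed for re-identification must be kept separately under technical and organizational controls. The following toy sketch illustrates only the core idea of keyed, use-case-specific pseudonyms (all names and key material are hypothetical):

```python
import hashlib
import hmac

def pseudonymize(value: str, key: bytes, context: str) -> str:
    # Keyed pseudonym: deterministic within one context (so joins still
    # work for that use case), different across contexts (which blunts the
    # Mosaic Effect), and re-linkable only via the separately held key.
    mac = hmac.new(key, f"{context}:{value}".encode(), hashlib.sha256)
    return mac.hexdigest()[:16]

key = b"held-separately-by-the-controller"  # hypothetical key material
p1 = pseudonymize("alice@example.com", key, "analytics-2024")
p2 = pseudonymize("alice@example.com", key, "ml-scoring")
# p1 != p2: the same person carries different pseudonyms per use case.
```

Because the pseudonyms differ per context, datasets released for different purposes cannot be trivially joined, unlike with static tokenization.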
Sharing data with service providers: Once the company shares data with third-party service providers, it must ensure the data is protected not only at rest or in transit but also in use. Depending on the specific requirements of the use case, synthetic data or statutory pseudonymization will be good choices.
Sharing data for enrichment: Data sets increase in value and utility as they become more comprehensive. Accordingly, companies frequently want to share or exchange data to enrich already existing data. This includes sharing data across jurisdictional boundaries in compliance with data sovereignty requirements. Depending on use case requirements and regulatory restrictions, aggregated, synthetic, or statutorily pseudonymized data are likely the right choice.
Sharing data for monetization: When done in compliance with relevant regulations, data can be sold to other organizations that derive value from it. As with the two previous secondary uses, either synthetic or statutorily pseudonymized data might be the appropriate choice.
Data Embassy: A Toolbox of Modern, Usable PETs
Anonos Data Embassy software enables a patented toolbox of PETs that can be combined and configured to meet the needs of every company’s unique data journey. Whether using data for testing and development, internal data sharing, AI and machine learning model building and deployment, or external data sharing, Data Embassy provides the data protection techniques needed.
Data Embassy creates non-identifiable, high-utility transformed versions of source data, known as Variant Twins. Variant Twins are the result of applying a combination of suitable PETs for a particular secondary use of data, de-risking the data without degrading its analytical value. This data can be used within and beyond trusted environments, enabling internal and external data sharing.
Anonos Data Embassy Variant Twins offer key advantages over other alternatives:
Protecting data in use: Variant Twins can be applied to protect data as early as possible in the data lifecycle, ensuring data protection in use.
Preserving analytical value: Variant Twins maintain the essential statistical properties of the original data, ensuring the analytical value remains intact.
Facilitating secondary uses of data: Variant Twins can accommodate a wide range of secondary data uses, from simple application testing to advanced applications such as AI and data enrichment.
Ensuring compliance: Variant Twins technologically enforce data protection policies according to use case specifics and regulatory requirements, enabling companies to utilize data responsibly and reduce the data breach risk.
Case Study: Pharmaceutical Company Accelerates Time-to-Data from 4 months to 4 days.
A leading pharmaceutical company faced an all-too-common data challenge: a data asset of immense potential value that was unavailable for use due to complex internal approval procedures and privacy risk concerns. Each use case involving the data required an extensive approval process lasting over four months. Moreover, the company’s existing data protection methods degraded utility, delivering less than 80% accuracy relative to processing cleartext data.
Using Anonos Data Embassy software, the company created Variant Twins delivering high utility that embedded organizational privacy policies and security standards directly into the data, facilitating rapid approval of data uses and use cases.
The results were transformative. By employing Anonos' Variant Twins technology, the company was able to protect sensitive data and accelerate data project approvals from over four months to just four days. This speed fostered an environment where secure data sharing is a seamless process for both internal and external uses.
Don’t miss out on transformative data use cases. Integrate Variant Twins into your data privacy strategy.
SCHEDULE A DEMO