Tokenization vs Pseudonymization: Key Differences

In March this year, data from 73 million AT&T customers was published on the dark web. This included passwords, social security numbers, and addresses. Last year, 23andMe also experienced a large data breach. In this case, it exposed highly personal information such as genetic ancestry results. As a result, 23andMe suffered serious reputational damage, their stock price plummeted, and the company made many layoffs.

The Wall Street Journal also reports that the number of data breaches is increasing, with data breaches in the U.S. reaching "3,205 in 2023, up 78% from 2022 and 72% from the previous high-water mark in 2021."

This is why protecting customer and enterprise data is so important. When (not if) a breach occurs, proper protection means that no personal and/or proprietary data is exposed. In addition, privacy laws such as the GDPR require a certain level of protection to even lawfully collect and process data in the first place. This protection can be established with several different techniques and approaches.

Tokenization and pseudonymization are two particularly widespread approaches that can be used to protect personal data. However, there are significant differences between the two. To protect your data while preserving its analytical value, you must understand the benefits and drawbacks of each. We’ll discuss those here.

More specifically, this blog post will cover:

What tokenization and pseudonymization are
The key differences between the two
The business applications & limitations
Real-world examples of each approach in action


First, to understand the differences between tokenization and pseudonymization, let's define them. Then, we'll go into some business applications of each. Let's start with tokenization.

What is Tokenization?

Data tokenization is a technique used to protect a sensitive data value by replacing it with a token value. Tokenizing data requires a system that stores the relationship between each token and the original sensitive value, so that the original can be retrieved when needed.

For example, when tokenized, the credit card number "1234 5678 9012 3456" might be replaced with a token like "ABCD EFGH IJKL MNOP". The image below shows how a card number might be tokenized during an online consumer purchase.

Anybody seeing the tokenized credit card number or other tokenized data wouldn’t be able to reverse-engineer it to get the actual values without additional supplemental information.

This supplemental information could come from seeing the same token reused in other transactions, which can make it possible to infer the original data values from the assigned tokens.

This ability to re-identify data is referred to as the Mosaic Effect. It occurs when a person becomes indirectly identifiable through linkage attacks: a dataset is combined with other datasets known to relate to the same individual, which allows that individual to be distinguished from others.
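To make the linkage idea concrete, here is a minimal sketch in Python. All of the data, field names, and the link function are made up for illustration. It assumes a dataset in which names were tokenized but a few indirect identifiers (ZIP code, birth year, gender) were left readable, plus a second dataset that an attacker could plausibly obtain.

```python
# Hypothetical "protected" dataset: names are tokenized,
# but indirect identifiers are left in the clear.
tokenized_purchases = [
    {"customer_token": "TKN-7F3A", "zip": "80301", "birth_year": 1984, "gender": "F"},
    {"customer_token": "TKN-19BC", "zip": "80302", "birth_year": 1990, "gender": "M"},
]

# Hypothetical second dataset that relates to the same people.
public_records = [
    {"name": "Alice Example", "zip": "80301", "birth_year": 1984, "gender": "F"},
    {"name": "Bob Example", "zip": "80302", "birth_year": 1990, "gender": "M"},
    {"name": "Carol Example", "zip": "80303", "birth_year": 1975, "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def link(protected, public, keys=QUASI_IDENTIFIERS):
    """Map a token to a name when exactly one public record shares its indirect identifiers."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = {}
    for row in protected:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match singles the person out
            matches[row["customer_token"]] = candidates[0]
    return matches


reidentified = link(tokenized_purchases, public_records)
print(reidentified)
```

In this toy example both tokens are matched to a name even though the attacker never saw the token lookup table. Real attacks work on much larger and messier data, but the principle is the same.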

It’s important to note that the tokenized credit card data is also known as a token. The token is the collection of characters that replace the personal data, such as in the above example "ABCD EFGH IJKL MNOP". This token is a representation of information but contains no direct link to the original data in itself. In the above example, this token represents the credit card number.
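As a rough illustration of the mechanics, the sketch below shows a very small in-memory token store in Python. The TokenVault class and its method names are hypothetical and are not taken from any specific product. It assumes random uppercase tokens grouped in blocks of four, to mirror the "ABCD EFGH IJKL MNOP" example.

```python
import secrets
import string


class TokenVault:
    """Toy token store: random tokens plus a lookup table back to the original value."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Static behavior: the same value always receives the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = self._new_token(len(value.replace(" ", "")))
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        # Only someone with access to the vault can perform this step.
        return self._token_to_value[token]

    def _new_token(self, length):
        while True:
            raw = "".join(secrets.choice(string.ascii_uppercase) for _ in range(length))
            token = " ".join(raw[i:i + 4] for i in range(0, length, 4))
            if token not in self._token_to_value:  # avoid reusing a token for another value
                return token


vault = TokenVault()
token = vault.tokenize("1234 5678 9012 3456")
print(token)                    # a random value in the same "XXXX XXXX XXXX XXXX" shape
print(vault.detokenize(token))  # the original card number
```

The token is generated randomly, so it carries no information about the card number by itself. The link exists only in the lookup table. Note that this sketch reuses the same token for the same value, which is exactly the kind of reuse that can enable the linkage attacks described above.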

You may have also heard of tokenization in the context of AI. Let’s cover it now.

What is Tokenization in Generative AI?

When generative AI processes information, it must be broken down into smaller pieces. Otherwise, the AI cannot understand it. Tokenization in the context of AI is the process of breaking down prompts, or sentences, that have been fed into the AI.

The sentences are broken down into individual words or even parts of words. The AI model then processes these components. AI models, such as Large Language Models (LLMs), rely on tokenization to process language. It is a standard step in Natural Language Processing (NLP).
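For a feel of what this looks like, here is a toy Python sketch. It is not how any particular model tokenizes text: production LLMs use learned subword vocabularies with tens of thousands of entries, while the tiny VOCAB set and the greedy longest-match rule below are assumptions made purely for illustration.

```python
import re

# Tiny hypothetical vocabulary of word pieces.
VOCAB = {"token", "ization", "protect", "s", "data", "."}


def split_words(text):
    """Split a prompt into lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def split_subwords(word, vocab=VOCAB):
    """Greedily cut a word into the longest pieces found in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Unknown character: keep it as a single-character piece.
            pieces.append(word[start])
            start += 1
    return pieces


def tokenize_prompt(text):
    return [piece for word in split_words(text) for piece in split_subwords(word)]


tokens = tokenize_prompt("Tokenization protects data.")
print(tokens)  # ['token', 'ization', 'protect', 's', 'data', '.']
```

Here the goal is to make text digestible for a model. Nothing is hidden or protected, which is the main difference from tokenization for privacy.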

Tokenizing data in the context of privacy is a little different than in the context of LLMs and NLP.

For privacy purposes, you will want to control who can reverse any tokens that have been created.

For example, suppose a bank has tokenized credit card numbers, names, and ID numbers. Unless the bank controls how often these tokens are assigned and reused, undesired parties may be able to reverse the tokenization and see which information is connected with which bank account or customer.

This is where pseudonymization comes in.

What is Pseudonymization?

In contrast to basic tokenization, pseudonymization controls whether, and by whom, the replacement can be reversed and the personal data reconnected with the individual. The GDPR specifically notes pseudonymization as a valuable approach for data protection. Under Article 4(5) of the GDPR, to achieve effective pseudonymization, you must process personal data "in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information."

This additional information is like a key that would re-connect a token to the personal information, showing that "ABCD EFGH IJKL MNOP" actually relates to the credit card number of "1234 5678 9012 3456."

The GDPR states that this "additional information" must be kept separately and be subject to technical and organizational measures. This means the additional information must be well protected, using both organizational policies and technical protections.

What is Statutory Pseudonymization?

In addition, pseudonymization under the GDPR is not as simple as it may seem. The GDPR sets a relatively high legal standard, meaning that the basic tokenization process alone is not enough to meet the requirements.

Pseudonymization under the GDPR is also known as "statutory pseudonymization." To achieve this standard, there are five elements that your pseudonymization process should meet:

Protection of all data elements, including direct and indirect identifiers.
Protection against singling-out attacks, by using approaches such as K-anonymity and aggregation.
The use of dynamic tokens, i.e. different tokens at different times for different purposes and at different locations.
The use of non-algorithmic lookup tables for reconnecting tokens with personal information.
Controlled re-linkability by keeping the "additional information" separate and protected. This makes sure that only authorized people for authorized purposes have access.

By applying these approaches to your pseudonymization process, you can ensure that you have met the GDPR standard.
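Three of these elements (dynamic tokens, non-algorithmic lookup tables, and controlled re-linkability) can be sketched in a few lines of Python. The DynamicPseudonymizer class, its method names, and the purpose labels are hypothetical, and the sketch is far from a complete or legally sufficient implementation. It only shows the shape of the idea.

```python
import secrets


class DynamicPseudonymizer:
    """Toy example: a different random token per value and purpose, reversible only via a lookup table."""

    def __init__(self):
        # The lookup table is the "additional information". In practice it would be
        # stored separately from the pseudonymized data, under strict access control.
        self._lookup = {}   # token -> (value, purpose)
        self._issued = {}   # (value, purpose) -> token

    def pseudonymize(self, value, purpose):
        key = (value, purpose)
        if key not in self._issued:
            token = secrets.token_hex(8)   # random, not derived from the value
            while token in self._lookup:   # guard against an unlikely collision
                token = secrets.token_hex(8)
            self._issued[key] = token
            self._lookup[token] = key
        return self._issued[key]

    def relink(self, token, purpose, authorized_purposes):
        value, issued_for = self._lookup[token]
        if purpose != issued_for or purpose not in authorized_purposes:
            raise PermissionError("re-linking is not authorized for this purpose")
        return value


p = DynamicPseudonymizer()
t_marketing = p.pseudonymize("alice@example.com", "marketing")
t_support = p.pseudonymize("alice@example.com", "support")

print(t_marketing != t_support)                  # True: tokens differ by purpose
print(p.relink(t_support, "support", {"support"}))  # alice@example.com
```

Because the two tokens are unrelated, a marketing dataset and a support dataset cannot be joined on them without the lookup table. Because the tokens are random rather than computed from the value, there is no algorithm for an attacker to reverse.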

Deciding whether to use pseudonymization or tokenization to protect personal and proprietary information depends on what business case you would like to implement.

Let's now look at the two approaches and their differing applications.

Key Differences Between Tokenization and Pseudonymization

While tokenization and pseudonymization are related, they have several key differences.

The table below highlights some of the key aspects of each approach and compares them directly with each other.

Aspects	Tokenization	Pseudonymization
Method	Individual values are replaced with a token that does not immediately reveal the replaced data value.	Direct and indirect identifiers are replaced with a dynamic pseudonym, which can be re-linked to the original information only under controlled authorized conditions.
Original data retained	Yes	Yes
Reversibility	Yes (uncontrollably, through attacks)	Yes (controllably)
Protects data in transit, in storage, or in use?	In transit or storage	In transit, storage, and in use
GDPR status	Not mentioned in the GDPR	Explicitly mentioned in the GDPR as a method of data protection
Personal data	No	Yes
Common uses	Banking, sharing information with third parties, credit cards, insurance, some healthcare	Healthcare information, customer information, finance and banking, marketing, customer service

While tokenization and pseudonymization may appear very similar, this comparison table shows how different they are. Their use in real business cases shows even more clearly what each method can do.

Let's examine how tokenized data and pseudonymized data can each be applied to real business operations and the limitations and benefits of each.

Business Applications of Tokenization

Tokenization is commonly used in business applications, particularly finance and fintech use cases.

One classic example would be a bank's call center, where customer support staff may need to access certain information about a bank account. However, they shouldn’t be able to see all of the customer’s information.

Tokenization can also mask only some of the relevant numbers, leaving some available to act as an identifier when necessary. This makes tokenization a useful approach for several business cases, such as controlled information processing and banking transactions.
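A minimal Python sketch of this partial protection might look like the following. The function name and the choice to keep the last four digits visible are assumptions, and for simplicity the hidden digits are replaced with a mask character rather than with token characters from a vault.

```python
def mask_card_number(card_number, visible=4, mask_char="*"):
    """Hide all but the last `visible` digits, keeping spaces and other separators."""
    digits_total = sum(ch.isdigit() for ch in card_number)
    digits_to_mask = max(digits_total - visible, 0)

    out = []
    for ch in card_number:
        if ch.isdigit() and digits_to_mask > 0:
            out.append(mask_char)
            digits_to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


masked = mask_card_number("1234 5678 9012 3456")
print(masked)  # **** **** **** 3456
```

A call center agent could use the visible digits to confirm they are looking at the right card without ever seeing the full number.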

Let’s take a look at the benefits in more detail.

Benefits of Tokenization in Business Cases

Enhanced security: By replacing sensitive data with non-sensitive tokens, tokenization ensures that even if a breach occurs, hackers only gain access to the tokens instead of immediately identifiable personal information.
Regulatory compliance: Tokenization helps enterprises to adhere to some regulatory requirements and mitigate the risk of certain non-compliance penalties - typically those that involve the controlled use of the data within protected perimeters.
Increased customer trust: Your enterprise can show a commitment to improved security when using approaches like tokenization (this applies even more so to pseudonymization, which controls relinkability).

Now let’s take a look at the limitations.

Limitations of Tokenization in Business Cases

While data tokenization can be effective for certain use cases, such as protecting data at rest or in transit, it doesn't perform well if you want to protect data in use. It can also reduce granularity and utility for certain use cases. Here’s why.

When data is tokenized, the personal or proprietary data is protected only if the distribution and use of the data are controlled to prevent undesired re-identification via the Mosaic Effect.

But to actually process this data, i.e. to use it, either a loss of granularity results (from processing data with tokenized values), or you have to leave values without protection to ensure their utility.

For protecting data in use, it can be a better choice (as explained below) to achieve Statutory Pseudonymization, which retains the flexibility to controllably reverse tokens later for the relevant values.

In addition, as data use cases have evolved, so have the challenges associated with data privacy and security. The accompanying regulatory requirements have also become stricter.

As a result, traditional data protection methods such as tokenization are, in many cases, no longer sufficient to meet regulatory requirements on their own. They may not provide effective protection unless used in combination with other techniques.

In the image below, you can see that only three identifiers can be used to identify a large percentage of people just by combining data sets (linkage attacks). This is only one of the many attacks that your data protection approaches may be subject to.
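One way to gauge this exposure in your own data is to count how many records share each combination of indirect identifiers. The Python sketch below is illustrative only: the records are invented, and the three identifiers used (ZIP code, birth date, gender) are an assumption chosen for the example.

```python
from collections import Counter

# Invented records containing only indirect identifiers.
records = [
    {"zip": "80301", "birth_date": "1984-02-11", "gender": "F"},
    {"zip": "80301", "birth_date": "1984-02-11", "gender": "F"},
    {"zip": "80302", "birth_date": "1990-07-30", "gender": "M"},
    {"zip": "80303", "birth_date": "1975-12-05", "gender": "F"},
]


def group_sizes(rows, keys):
    """Count how many rows share each combination of the given identifiers."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def k_anonymity(rows, keys):
    """Size of the smallest group: every row is hidden among at least k rows."""
    return min(group_sizes(rows, keys).values())


def fraction_unique(rows, keys):
    """Share of rows whose identifier combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    unique = sum(1 for row in rows if sizes[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


keys = ("zip", "birth_date", "gender")
k = k_anonymity(records, keys)
share = fraction_unique(records, keys)
print(k, share)  # 1 0.5
```

A result of k = 1 means at least one person can be singled out using these three fields alone, regardless of whether their name or card number was tokenized.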

In addition, many modern use cases aim for granular, detailed information, which tokenization can, at times, obscure.

This is because data tokenization degrades data utility: randomly generated tokens are not useful for analytics. Static tokenization, which replaces a particular value with the same token each time so that analytics remain possible, instead leaves data vulnerable to re-identification through linkage attacks.
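The trade-off can be seen in a short Python sketch. The visit log and the helper functions are hypothetical. It compares counting visits per customer when each value always gets the same token (static) against issuing a fresh random token on every occurrence.

```python
import secrets
from collections import Counter

visits = ["alice", "bob", "alice", "alice", "bob"]  # invented visit log

issued = set()
static_map = {}


def fresh_token():
    """Return a random token that has not been issued before."""
    while True:
        token = secrets.token_hex(8)
        if token not in issued:
            issued.add(token)
            return token


def static_token(value):
    """Same value, same token every time."""
    if value not in static_map:
        static_map[value] = fresh_token()
    return static_map[value]


def random_token(_value):
    """A new token on every occurrence, unrelated to earlier ones."""
    return fresh_token()


static_counts = Counter(static_token(v) for v in visits)
random_counts = Counter(random_token(v) for v in visits)

print(sorted(static_counts.values()))  # [2, 3]: visits per customer survive
print(sorted(random_counts.values()))  # [1, 1, 1, 1, 1]: the pattern is gone
```

Static tokens keep the per-customer pattern, which is useful for analytics but also lets anyone who holds the data follow one token across records. Fully random tokens remove that link and the analytical value along with it.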

If you want to perform a detailed analysis of customer information for analytics purposes, for example, tokenization won’t be a good choice. This is because the utility of the data will be heavily reduced.

In addition, for applications such as health data, medical providers usually want the information to stay associated with the applicable patient.

The same applies to financial and banking information. These applications may often require re-linking. Once tokenization has occurred and the data has been shared, transferred, or processed, banks often want the data to be re-linked to the individual it was associated with.

This is where pseudonymization comes in.

Business Applications of Pseudonymization

Statutory pseudonymization is well-suited to applications in which re-linking is required. It is also much more suitable for protecting data in use than tokenization. In addition, it supports a wide range of data sharing and multi-cloud processing use cases.

Using statutory pseudonymization is one way to give your enterprise speed and scale in data use cases. This is because it preserves both data privacy and data utility.

Benefits of Pseudonymization in Business Cases

In contrast to tokenization, statutory pseudonymization is an excellent method for protecting data in use while preserving data utility. There are several benefits of statutory pseudonymization, such as:

High level of GDPR-compliant protection: Statutory pseudonymization is one of the few data protection approaches mentioned by the GDPR, European regulators, and other data protection bodies such as the European Data Protection Board.
Increased speed of project approvals: When using a highly-regarded and regulator-approved data protection approach, your projects are less likely to face pushback from legal and privacy teams. This can fast-track approvals for data projects.
Expansion of business use cases: Statutory pseudonymization allows your enterprise to carry out use cases that other data protection methods do not enable.
Controlled re-linking: Re-connecting personal information with other data, such as the results of analytics or machine learning algorithms, can allow you to undertake detailed research. You can then also re-link the results to an individual so that they can benefit from this analysis.
Ability to transfer and process data in the cloud: When data is well protected in transit and in use, you can transfer data to the cloud and process it without contravening data protection regulations or facing data sovereignty issues.

Now let’s take a look at these benefits in action.

Pseudonymization: An Enterprise Use Case

One good illustration of pseudonymization is in the case of customer data and third-party service providers.

Situation: A hardware provider wants to outsource its help desk to a third party in another jurisdiction and apply data minimization to control the exposure of its customer information.
Challenge: The company doesn’t want to share its source data with a third party unless specifically requested by the customers.
Requirements: The source data must not be exposed unless requested.

The hardware company can use Statutory Pseudonymization to create on-demand data relinking for its entire database. This means it can build an application that prompts the service team to verify customers who call in by confirming certain checks.

When the customer confirms these checks (e.g. the help desk asks for the customer to provide their name, account ID, and ticket number), the service team can re-link the pseudonyms and see the relevant information. They can then provide the necessary assistance.

When the customer provides this permission, they consent to their information being revealed. This complies with the privacy regulations, keeping data protected until controlled re-linking is required. No unnecessary data is revealed to the outsourced help desk.
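The flow described above could be sketched roughly as follows in Python. Everything here is hypothetical: the RELINK_TABLE contents, the field names, the pseudonym format, and the relink_on_verification function. It assumes the hardware company keeps the re-linking table on its own side and exposes only this check to the outsourced help desk.

```python
import hmac

# Hypothetical re-linking table, held by the hardware company and not by the help desk.
RELINK_TABLE = {
    "PSN-001": {
        "name": "Alice Example",
        "account_id": "ACC-1001",
        "ticket": "T-555",
        "device": "Router X2",
        "address": "1 Main St",
    },
}

REQUIRED_CHECKS = ("name", "account_id", "ticket")


def relink_on_verification(pseudonym, answers, fields_needed):
    """Reveal only the requested fields, and only if every verification check passes."""
    record = RELINK_TABLE.get(pseudonym)
    if record is None:
        return None
    for check in REQUIRED_CHECKS:
        supplied = str(answers.get(check, ""))
        # Constant-time comparison avoids leaking how close a guess was.
        if not hmac.compare_digest(supplied.encode(), record[check].encode()):
            return None
    return {field: record[field] for field in fields_needed if field in record}


answers = {"name": "Alice Example", "account_id": "ACC-1001", "ticket": "T-555"}
revealed = relink_on_verification("PSN-001", answers, ["device"])
print(revealed)  # {'device': 'Router X2'}
```

The agent receives only the field needed for the ticket, in line with data minimization. A failed check returns nothing at all.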

In general, pseudonymization enables a wide range of business cases and has particular strengths when you need both high utility and high privacy.

In the image below you can see the ways in which tokenization compares to pseudonymization, as well as other data protection techniques.


Using Tokenization and Pseudonymization for Enterprise Success

Both tokenization and pseudonymization are useful approaches for protecting data and limiting the consequences of a breach.

If your business use case does not require re-linking, you mainly want to protect data from being accessed, and processing is limited and takes place within a protected perimeter (such as obscuring a credit card number), tokenization may be the right approach.

On the other hand, if your enterprise wants to undertake more complex analytics or re-link data, such as in healthcare use cases, pseudonymization makes more sense. Anonos Data Embassy specializes in Statutory Pseudonymization but also supports tokenization, so you can combine them or choose between them as needed for your use case.
