General Data Protection Regulation (GDPR)
General Provisions
Chapter 5

PSEUDONYMISATION TECHNIQUES

Following the adversary models and types of attacks described in Chapter 4, this Chapter presents in brief the most common pseudonymisation techniques and policies today. For a more detailed analysis on the cryptographic primitives please refer to [1].

In principle, a pseudonymisation function maps identifiers to pseudonyms. There is one fundamental requirement for a pseudonymisation function. Let us consider two different identifiers 𝐼𝑑1 and 𝐼𝑑2 and their corresponding pseudonyms 𝑝𝑠𝑒𝑢𝑑𝑜1 and 𝑝𝑠𝑒𝑢𝑑𝑜2. A pseudonymisation function must verify that 𝑝𝑠𝑒𝑢𝑑𝑜1 is different than 𝑝𝑠𝑒𝑢𝑑𝑜2. Otherwise, the recovery of the identifier could be ambiguous: the pseudonymisation entity cannot determine if 𝑝𝑠𝑒𝑢𝑑𝑜1 corresponds to 𝐼𝑑1 or 𝐼𝑑2. However, a single identifier 𝐼𝑑 can be associated to multiple pseudonyms (𝑝𝑠𝑒𝑢𝑑𝑜1, 𝑝𝑠𝑒𝑢𝑑𝑜2…) as long as it is possible for the pseudonymisation entity to invert this operation. In all cases, according to the definition of pseudonymisation (see Chapter 2), there exists some additional information that allows the association of the pseudonyms with the original identifiers; this is the pseudonymisation secret. The simplest case of pseudonymisation secret is the pseudonymisation mapping table.

In the following sections, the main options available to pseudonymise a single identifier are first defined. The different policies available for pseudonymisation are then described, comparing their implementation characteristics. A reference to the main criteria that a controller may use to select a pseudonymisation technique is also made. Last, the possibilities of recovery of pseudonymisation by the pseudonymisation entity are discussed.

5.1 SINGLE IDENTIFER PSEUDONYMISATION

Starting from the pseudonymisation of a single identifier, a list of possible approaches is presented below, together with relevant advantages and constraints.

5.1.1 Counter

Counter is the simplest pseudonymisation function. The identifiers are substituted by a number chosen by a monotonic counter. First, a seed 𝑠 is set to 0 (for instance) and then it is incremented. It is critical that the values produced by the counter never repeat to prevent any ambiguity.

The advantages of the counter rest with its simplicity, which make it a good candidate for small and not complex datasets. In terms of data protection, the counter provides for pseudonyms with no connection to the initial identifiers (although the sequential character of the counter can still provide information on the order of the data within a dataset). This solution, however, may have implementation and scalability issues in cases of large and more sophisticated datasets, as the complete pseudonymisation mapping table needs to be stored.

5.1.2 Random number generator (RNG)

RNG is a mechanism that produces values in a set that have an equal probability of being selected from the total population of possibilities and, hence, are unpredictable18. This approach is similar to the counter with the difference that a random number is assigned to the identifier.

Two options are available to create this mapping: a true random number generator or a cryptographic pseudo-random generator (see [11] for exact definitions). It should be noted that in both cases, without due care, collisions can occur19. A collision is the case of two identifiers being associated to the same pseudonym. The probability that a collision will appear is related to the well-known birthday paradox [12].

RNG provides strong data protection (as, contrary to the counter, a random number is used to create each pseudonym, thus it is difficult to extract information regarding the initial identifier, unless the mapping table is compromised). Collisions may be an issue as mentioned earlier, as well as scalability (the complete pseudonymisation mapping table must be stored), depending on the implementation scenario.

5.1.3 Cryptographic hash function

A cryptographic hash function takes input strings of arbitrary length and maps them to fixed length outputs [13] [14]. It satisfies the following properties:

  • One-way: it is computationally infeasible to find any input that maps to any pre-specified output.
  • Collision free: it is computationally infeasible to find any two distinct inputs that map to the same output.

A cryptographic hash function is directly applied to the identifier to obtain the corresponding pseudonym: 𝑃𝑠𝑒𝑢𝑑𝑜 = 𝐻(𝐼𝑑). The domain of the pseudonym depends on the length of the digest produced by the function.

As mentioned in [1], while a hash function can significantly contribute towards data integrity, it is generally considered weak as a pseudonymisation technique as it is prone to brute force and dictionary attacks. Specific examples of this weakness are provided in Chapters 6 and 7 below.

5.1.4 Message authentication code (MAC)

This primitive can be seen as a keyed-hash function. It is very similar to the previous solution except that a secret key is introduced to generate the pseudonym. Without the knowledge of this key, it is not possible to map the identifiers and the pseudonyms. HMAC [15] [16] is by far the most popular design of message authentication code used in Internet protocols.

As mentioned in [1], MAC is generally considered as a robust pseudonymisation technique from a data protection point of view, since reverting the pseudonym is infeasible, as long as the key has not be compromised. Different variations of the method may apply with different utility and scalability requirements of the pseudonymisation entity (see more specific examples in Chapters 6 and 7 below).

5.1.5 Encryption

This report mainly considers symmetric (deterministic) encryption and in particular block ciphers like the AES and their modes of operation [11]. The block cipher is used to encrypt an identifier using a secret key, which is both the pseudonymisation secret and the recovery secret. Using block ciphers for pseudonymisation requires to deal with the block size. The size of the identifiers can be smaller or larger than the input block size of block cipher. If the identifiers’ size is smaller, padding [11] must be considered. In the case where the identifiers’ size is larger than the block size, there are two options that can be used to solve this problem; the identifiers can be compressed into something smaller than the block size; if compression is not an option available, a mode of operation (like the counter mode CTR) can be used. However, this last option requires managing an extra parameter, the initialisation vector.

As mentioned in [1] encryption may also be a robust pseudonymisation technique, with several properties similar to MAC. Specific examples are discussed in Chapters 6 and 7.

Although this report mainly focuses on deterministic encryption schemes, probabilistic encryption is another alternative, which could be used especially in cases where there is need to derive different pseudonyms for the same identifier (see also fully-randomized pseudonymisation policy below). For further information, see also in [1]. 

5.2 PSEUDONYMISATION POLICIES

While the choice of the pseudonymisation technique is essential, the policy (or mode) of implementation of pseudonymisation is equally important to its practical application.

This part considers the more general problem of the pseudonymisation of a database or any document which contains 𝑘 identifiers. Let us consider an identifier 𝐼𝑑 which appears several times in two datasets 𝐴 and 𝐵. After pseudonymisation, the identifier 𝐼𝑑 is substituted with respect to one of the following policies: deterministic pseudonymisation, document-randomized pseudonymisation and fully-randomized pseudonymisation.

5.2.1 Deterministic pseudonymisation

In all the databases and each time it appears, 𝐼𝑑 is always replaced by the same pseudonym 𝑝𝑠𝑒𝑢𝑑𝑜. It is consistent within a database and between different databases. The first step to implement this policy is to extract the list of unique identifiers contained in the database. Then, this list is mapped to the pseudonyms and finally the identifiers are substituted to the pseudonyms in the database (see Figure 8).

Figure 8: Deterministic pseudonymisation

All techniques mentioned in Chapter 5.1 can be directly used to implement deterministic pseudonymisation.

5.2.2 Document-randomized pseudonymisation

Each time 𝐼𝑑 appears in a database, it is substituted with a different pseudonym (𝑝𝑠𝑒𝑢𝑑𝑜1, 𝑝𝑠𝑒𝑢𝑑𝑜2,...). However, 𝐼𝑑 is always mapped to the same collection of ( 𝑝𝑠𝑒𝑢𝑑𝑜1, 𝑝𝑠𝑒𝑢𝑑𝑜2) in the dataset 𝐴 and 𝐵.

Figure 9: Document-randomized pseudonymisation

The pseudonymisation is only consistent between different databases in this case. The mapping table is created this time using all the identifiers contained in the database. Each occurrence of a given identifier (i.e., Alice in Figure 9) is treated independently.

5.2.3 Fully-randomized pseudonymisation

Finally, for any occurrences of 𝐼𝑑 within a database 𝐴 or 𝐵, 𝐼𝑑 is replaced by a different pseudonym ( 𝑝𝑠𝑒𝑢𝑑𝑜1, 𝑝𝑠𝑒𝑢𝑑𝑜2). This case is fully-randomized pseudonymisation. This policy can be viewed as a further extension of document randomized pseudonymisation. In fact, the two policies have the same behaviour when they are applied on a single document. However, if the same document is pseudonymised twice with fully-randomized pseudonymisation, two different outputs are obtained. With document-randomized pseudonymisation, the same output would have been obtained twice. In other words, in document-randomized pseudonymisation the randomness is selective (e.g. only for Alice), whereas in fully-randomized pseudonymisation randomness is global (it applies to any record)

5.3 CHOOSING A PSEUDONYMISATION TECHNIQUE AND POLICY

The choice of a pseudonymisation technique and policy depends on different parameters, primarily the data protection level and the utility of the pseudonymised dataset (that the pseudonymisation entity wishes to achieve). In terms of protection, as discussed in the previous sections, RNG, message authentication codes and encryption are stronger techniques as they thwart by design exhaustive search, dictionary search and guesswork. Still, utility requirements might lead the pseudonymisation entity towards a combination of different approaches or variations of a selected approach. Similarly, with regard to pseudonymisation policies, fullyrandomized pseudonymisation offers the best protection level but prevents any comparison between databases. Document-randomized and deterministic functions provide utility but allow linkability between records. Specific solutions might be applicable, depending on the identifiers that need to be pseudonymised (see Chapters 6 and 7 for more specific examples).

In addition, the pseudonymisation entity may be concerned by the complexity associated to a certain scheme in terms of implementation and scalability: is it simple to apply pseudonymisation to the identifiers and does pseudonymisation impact the database size?

Table 3: Comparison of different techniques in terms of flexibility (identifier format) and pseudonym size

Method Identifier size Pseudonym size m in bits
Counter Any 𝑚 = 𝑙𝑜𝑔2k
Random Number Generator Any 𝑚 ≫ 2𝑙𝑜𝑔2𝑘
Hash function Any Fixed or 𝑚 ≫ 2𝑙𝑜𝑔2𝑘
Message Auth. Codes Any Fixed or 𝑚 ≫ 2𝑙𝑜𝑔2𝑘
Encryption Fixed20 Fixed or same as identifier

Most solutions can be applied on identifiers of variable size except for certain choices in the case of encryption. The size of the pseudonym depends on 𝑘, the number of the identifiers contained in the database. For random number generator, hash function and message authentication code, there is a probability of collision: the size of the pseudonym must be chosen carefully (see birthday paradox). Hash functions and message authentication codes are suitably designed so as to ensure that the digest size prevents any risks of collision. Finally, the size of the pseudonyms produced by an encryption scheme can be fixed or equal to the size of the original identifier. Table 3 presents the scalability of the aforementioned approaches with regards to the recovery function.

5.4 RECOVERY

As, by definition, the use of additional information is central to pseudonymisation, the pseudonymisation entity must implement a recovery mechanism. This mechanism can be more or less complex depending on the pseudonymisation function. In general, they consist of the use of a pseudonym 𝑝𝑠𝑒𝑢𝑑𝑜 and a pseudonymisation secret 𝑆 to recover the corresponding identifier 𝐼𝑑. This case can occur for example when the pseudonymisation entity has detected an anomaly in its system and needs to contact the designated entities. The “anomaly” can be for instance a data breach and the pseudonymisation entity needs to notify the data subjects under GDPR. In addition, the recovery mechanism might be necessary in order to allow for the exercise of data subjects rights (under articles 12-21 GDPR).

Table 4: Comparison of different techniques with regard to recovery mechanism

Method Recovery based on pseudonym
Counter Mapping table
Random Number Generator Mapping table
Hash function Mapping table
Message Auth. Codes Mapping table
Encryption Decryption

Most methods described previously require the pseudonymisation entity to keep the mapping table between the identifiers and the pseudonyms to perform identifier recovery with the exception of encryption (Table 4). Indeed, decryption can be directly applied on the identifier.

5.5 PROTECTION OF THE PSEUDONYMISATION SECRET

In order for pseudonymisation to be efficient, the pseudonymisation entity must always protect the pseudonymisation secret by proper technical and organisational measures. This clearly depends on the specific pseudonymisation scenario (see Chapter 3).

Firstly, the pseudonymisation secret must be isolated from the dataset, i.e. the pseudonymisation secret and the dataset must never be handled in the same file (otherwise, it will be too easy for an adversary to recover the identifiers). Secondly, the pseudonymisation secret must be securely deleted from any insecure media (memory storage and systems). Thirdly, strong access control policies must ensure that only authorised entities have access to this secret. A secure logging system must keep track of all the access requests made to the secret. Finally, the pseudonymisation secret must be encrypted if it is stored on a computer, which in turn necessitates a proper key management and storage for this encryption.

5.6 ADVANCED PSEUDONYMISATION TECHNIQUES

Beyond the pseudonymisation techniques listed above, there exists a plethora of other, more advanced pseudonymisation techniques, suited for multiple different contexts. Explaining each of these in detail would exceed the scope of this report, so some of these techniques are briefly listed here, for interested readers to follow.

Apart from plain hashing of data, more advanced structures like Merkle trees [17, 18] utilise hashes of sets of hashes, e.g. h3=hash(h1,h2), to achieve structured pseudonyms that can be uncovered only partially instead of completely. Similarly, hash chains [19] rely on repeatedly hashing the hash values of hash values, e.g. h4=h3(h2(h1(x))), to produce a value that requires multiple hash inversions to re-identify the original data of a given pseudonym. One example for such a hashing technique, a pseudonymisation chain, involves several pseudonymisation entities that subsequently take the pseudonyms created by the previous pseudonymisation entity as input to create new pseudonyms (e.g. by applying another layer of hashing). Such a chain will hold even if an adversary manages to uncover all but one of the pseudonymisations applied in the total chain, making it a very robust pseudonymisation technique. It is common practice e.g. for clinical trials. 

If the input domain spans over multiple dimensions (see Chapter 8 for an example), bloom filters [20], apart from being used as an anonymisation technique, can be utilised to efficiently perform computationally feasible pseudonymisation over all possible combinations of input values on the different domains, despite the state explosion problem.

Linkable transaction pseudonyms and/or controlled pseudonym linkability with the option of step-wise re-identification can also constitute another interesting approach [21].

Finally, all techniques that can effectively be utilised to increase anonymisation can also be useful for pseudonymisation, such as the common techniques for k-anonymity [3, 22, 23] or differential privacy [24] and beyond [25]. See also relevant descriptions in [2]. Zero-knowledge proof [26] and the broader area of attribute-based credentials can provide interesting solutions as well [2].