This Chapter addresses some techniques that may be utilised for pseudonymisation, describing their main characteristics, advantages and/or limitations. To this end, Section 3.1 sets some basic design goals and discusses the topic of data separation, which is essential to the notion of pseudonymisation. The different techniques are then described, in particular hashing without key (Section 3.2), hashing with key or salt (Section 3.3), encryption as a pseudonymisation technique (Section 3.4), other cryptographic techniques (Section 3.5), tokenization (Section 3.6) and other approaches (Section 3.7).
It should be noted that the various pseudonymisation techniques are presented in the context of deriving pseudonyms from initial identifiers that are associated with individuals (see also the discussion on identifiers in Section 2.1.3). Although, for reasons of simplicity, the focus is on cases where a unique identifier associated with an individual (e.g. a device identifier) is transformed into a pseudonym, the presented techniques can also be applied to more complex cases (e.g. where there is a combination of identifiers).
Note that the same terminology as presented in Chapter 2 is also used in this Chapter, including the notion of identifiers and pseudonyms, which are core in the descriptions that will follow.
3.1 Design goals
As mentioned earlier, pseudonymisation may contribute towards hiding an individual’s real identity, as well as supporting unlinkability across different data processing domains. When examining different pseudonymisation techniques, it is important to assess whether the aforementioned purposes can be met and to what extent. In this regard, note should be taken of the fact that, as mentioned in Section 2.2, not all pseudonymisation techniques would fall under the stricter definition of GDPR, which requires that pseudonymised data can no longer be attributed to a specific data subject without the use of additional information. Therefore, the choice of the pseudonymisation technique is essential for controllers.
To this end, the following design goals can be set by the data controllers towards adopting an optimal technique, taking into account the risks of the specific data processing operation to the rights and freedoms of individuals:
- D1) the pseudonyms should not allow an “easy” re-identification by any third party (i.e. any other than the controller or processor) within a specific data processing context (so as to “hide” the initial identifiers in a specific context).
- D2) it should not be trivial for any third party (i.e. any other than the controller or processor) to reproduce the pseudonyms (so as to avoid the usage of the same pseudonyms across different data processing domains – unlinkability across domains).
The aforementioned goals are based on the assumption that a data controller will be able to re-identify the data subjects after a pseudonymisation process (as it has access to the additional information). A data processor might also have this possibility under the instructions of the controller. This is clearly not the case for third parties, against whom the data are actually protected.
There are also cases in which there is no need for the controller to associate the pseudonymised data with specific initial identifiers. For instance, a controller may only need to perform tracking of individuals, i.e. to be able to distinguish any individual from others within a specific processing context, without actually having knowledge of the individual’s real identity or, more generally, his or her initial identifiers 26. Again, pseudonymisation may be the vehicle for fulfilling such a requirement, by appropriately employing a technique that ensures that the same pseudonym will always be assigned to the same individual. As will be discussed next, the choice of a proper pseudonymisation technique is strongly contingent on whether there is a possibility, in the context of the data processing, for the controller to refrain from storing the initial identifiers and only track the data subjects on the basis of pseudonyms.
It should also be pointed out that, as mentioned earlier, a pseudonymisation approach may also yield additional data protection gains in terms of data accuracy. This adds a third design goal, which should also be considered by data controllers. For instance, there exist pseudonymisation techniques generating pseudonyms that are mathematically bound to the initial identifiers and, thus, these pseudonyms may suffice to allow verification of data subjects’ identities under specific frameworks.
Furthermore, together with the aforementioned design goals, another important aspect for controllers to consider is that of data separation, i.e. separation of the pseudonymised data from the additional information (the two distinct outputs of pseudonymisation). Indeed, since the notion of pseudonymisation implies an association between pseudonyms and the initial identifiers (the additional information), a mapping table or other relevant structure that allows for this association (e.g. a key, as discussed next) would probably need to be in place.
Depending on the data processing operation, the data controller may employ different security measures for the protection of additional information, like physical separation of identity/pseudonym mappings in conjunction with strict access control mechanisms and/or other security techniques. In some cases, the data processor may undertake the storage of the additional information under the instructions of the controller (although this approach could in certain implementations raise privacy concerns, especially if the security of the additional information is not under the direct control of the data controller). For special cases, a trusted third party could also be employed for the storage of the additional information (e.g. an authority providing guarantees for such a role). Finally, as also stated above, there are also cases in which the data controller needs only to track the users or where the need for re-identification occurs in special cases (i.e. for a subset of the pseudonymised data). In such cases, more sophisticated approaches may be employed for de-centralised storage of the additional information, e.g. the generation of the pseudonyms can be mounted on the user’s environments, without necessitating a central point for storing the identifiers/pseudonyms mappings.
All the aforementioned aspects need to be carefully considered by the data controller before selecting a specific pseudonymisation technique. In the next Sections, we shall present some relevant techniques, assess them with regard to the above design goals (D1 and D2) and describe their relative advantages and disadvantages. Although in the descriptions and relevant examples we focus explicitly on cases where a unique identifier associated with an individual is transformed into a pseudonym, the same techniques can also be applied to more complex cases. An example is a combination of identifiers, where the initial identifiers are concatenated to form a new “generalised” identifier, which in turn is the basis for generating the corresponding pseudonym (recall also the relevant discussion in Section 2.1.3).
3.2 Hashing without key
Hashing is a technique that can be used to derive pseudonyms, but, as will be shown later in this Section, has some serious drawbacks with regard to the design goals set in Section 3.1. Still, it is a starting point for understanding other stronger techniques in the field and this is why we present it first. Moreover, hashing can be a useful tool to support data accuracy.
A cryptographic hash function h is a function with specific properties (as described next) which transforms any input message m of arbitrary length into a fixed-size output h(m) (e.g. of size 256 bits, that is 32 bytes), called the hash value or message digest.
The message digest satisfies the following properties [Menezes, 1996]: i) given h(m), it is computationally infeasible 27 to compute the unknown m, and this holds for any output h(m) – i.e. the function h is mathematically irreversible (pre-image resistance), ii) for any given m, it is computationally infeasible to find another m’≠ m such that h(m’)=h(m) (2nd pre-image resistance), iii) it is computationally infeasible to find any two distinct inputs m, m’ (free choice) such that h(m’)=h(m) (collision resistance). Clearly, if a function is collision-resistant, then it is 2nd pre-image resistant too 28.
Figure 1: Operation of a cryptographic hash function
In other words, a cryptographic hash algorithm is one that generates a unique digest 29 (which is also usually called fingerprint) of a fixed size for any single block of data of arbitrary size (e.g. an initial identifier of any kind). Note that for any given hash function, the same unique digest is always produced for the same input (same block of data).
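The fixed output size and the determinism described above can be observed with a short sketch. It assumes SHA-256 from Python's standard `hashlib` module; the helper name `sha256_digest` and the sample inputs are chosen for illustration only.

```python
import hashlib


def sha256_digest(message: str) -> str:
    """Return the SHA-256 digest of `message` as a hexadecimal string."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


# Inputs of very different lengths (illustrative values).
short_digest = sha256_digest("a")
long_digest = sha256_digest("a" * 10_000)

# Both digests are 256 bits long: 32 bytes, printed as 64 hexadecimal characters.
print(len(bytes.fromhex(short_digest)), len(bytes.fromhex(long_digest)))

# Hashing the same input again yields the same digest.
print(sha256_digest("a") == short_digest)
```

The same determinism that makes a digest usable as a fingerprint is what allows anyone to recompute it, which matters for the pseudonymisation discussion that follows.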
It is important to point out that state-of-the-art hash functions should be chosen; therefore, commonly used hash functions such as MD5 and SHA-1 [Menezes, 1996] with known vulnerabilities – with respect to the probability of finding collisions – should be avoided (see [Wang, 2005], [Dougherty, 2008], [Stevens, 2017a], [Stevens, 2017b]). Instead, cryptographically resistant hash functions should be preferred, e.g. SHA-2 and SHA-3 are currently considered state-of-the-art [FIPS, 2012], [FIPS, 2015].
The above properties of hash functions allow them to be used in several applications, including data integrity 30 and entity authentication 31. For instance, once an app market has a hash server storing hash values of app source codes, any user can verify whether the source code has been modified or not via a simple validation of its hash value – since any modification of the code would lead to a different hash value (see, e.g., [Jeun, 2011]). Similarly, recalling the discussion in Section 3.1 on data accuracy, a pseudonym that is generated by hashing a user’s identifiers may be a convenient way for a data controller to verify a user’s identity.
However, when it comes to pseudonymisation, despite the aforementioned properties of a cryptographic hash function, simple hashing of data subjects’ identifiers to provide pseudonyms has major drawbacks.
More precisely, with regard to the aforementioned D1 and D2 design goals, we have the following:
- The D2 property does not hold, since any third party that applies the same hash function to the same identifier gets the same pseudonym 32.
- In relation to the above observation, the D1 property also does not necessarily hold, since it is trivial for any third party to check, for a given identifier, whether a pseudonym corresponds to this identifier (i.e. through hashing the identifier 33).
Therefore, a reversal of pseudonymisation is possible whenever such an approach is adopted, as having a list of the (possible) initial identifiers is sufficient for any third party to associate these identifiers with the corresponding pseudonyms, with no other association being in place 34. In fact, following the GDPR definition of pseudonymisation, one could argue that hashing is a weak pseudonymisation technique as it can be reversed without the use of additional information. Relevant examples are provided in [Demir, 2018] (and in references therein), where the researchers refer to the Gravatar service 35 and describe how users’ email addresses can be derived from their hash value, which is shown in the URL that corresponds to the gravatar of the user, without any additional information.
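The reversal described above can be sketched as a simple dictionary attack. The example assumes that pseudonyms were produced with unkeyed SHA-256 and that the third party holds a list of candidate identifiers; the function names and the e-mail addresses are hypothetical.

```python
import hashlib


def unkeyed_pseudonym(identifier: str) -> str:
    """Derive a pseudonym by plain (unkeyed) hashing of the identifier."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def reverse_by_dictionary(pseudonyms, candidate_identifiers):
    """Map each pseudonym back to an identifier by hashing every candidate."""
    lookup = {unkeyed_pseudonym(c): c for c in candidate_identifiers}
    return {p: lookup[p] for p in pseudonyms if p in lookup}


# Pseudonyms as a third party might observe them (illustrative data).
published = [
    unkeyed_pseudonym("alice@example.com"),
    unkeyed_pseudonym("bob@example.com"),
]

# The third party only needs a list of plausible identifiers, no secret.
candidates = ["carol@example.com", "alice@example.com", "dave@example.com"]
recovered = reverse_by_dictionary(published, candidates)
print(recovered)
```

Only the pseudonym whose identifier appears in the candidate list is reversed, which is why the size and guessability of the identifier space largely determine the risk.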
Hence, hash functions are generally not recommended for pseudonymisation of personal data, although they can still contribute to enhancing security in specific contexts with negligible privacy risks and when the initial identifiers cannot be guessed or easily inferred by a third party. For the vast majority of the cases, such pseudonymisation technique does not seem to be sufficient as a data protection mechanism [Demir, 2018]. However, a simple hashing procedure may still have its own importance in terms of data accuracy, as stated previously.
3.3 Hashing with key or salt
A robust approach to generate pseudonyms is based on the use of keyed hash functions – i.e. hash functions whose output depends not only on the input but on a secret key too; in cryptography, such primitives are being called message authentication codes (see, e.g., [Menezes, 1996]).
The main difference from conventional hash functions is that, for the same input (a data subject’s identifier), several different pseudonyms can be produced, according to the choice of the specific key – and, thus, the D2 property is ensured. Moreover, the D1 property also holds, as long as any third party, i.e. other than the controller or the processor (e.g. an adversary), does not have knowledge of the key and, thus, is not in a position to verify whether a pseudonym corresponds to a specific known identifier. Clearly, if the data controller needs to assign the same pseudonym to the same individual, then the same secret key should be used.
To ensure the aforementioned properties, a secure keyed hash function, with properly chosen parameters, is needed. A well-known standard of this kind is HMAC [FIPS, 2008], whose strength is contingent on the strength of the underlying simple hash function (and, thus, incorporating SHA-2 or SHA-3 in HMAC is currently an appropriate choice). Moreover, the secret key needs to be unpredictable and of sufficient length, e.g. 256 bits, which could be considered adequate even for the post-quantum era 36. If the secret key is disclosed to a third party, then the keyed hash function effectively becomes a conventional hash function in terms of evaluating its pseudonymisation strength. Hence, recalling the definition of pseudonymisation in the GDPR, the data controller should keep the secret key securely stored separately from other data, as it constitutes the additional information, i.e. it provides the means for associating the individuals – i.e. the original identifiers – with the derived pseudonyms.
Figure 2: Operation of a keyed hash function
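A minimal sketch of keyed hashing for pseudonym generation is given below. It assumes HMAC with SHA-256 from Python's standard `hmac` and `hashlib` modules and a 256-bit key drawn from `secrets`; the helper names and the device identifier are illustrative, and secure storage of the key is outside the scope of the sketch.

```python
import hashlib
import hmac
import secrets


def generate_key() -> bytes:
    """Generate an unpredictable 256-bit (32-byte) secret key."""
    return secrets.token_bytes(32)


def keyed_pseudonym(key: bytes, identifier: str) -> str:
    """Derive a pseudonym as HMAC-SHA256(key, identifier), hex encoded."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


key_domain_a = generate_key()
key_domain_b = generate_key()

# Same key and identifier: the same pseudonym every time.
p1 = keyed_pseudonym(key_domain_a, "device-1234")
p2 = keyed_pseudonym(key_domain_a, "device-1234")

# A different key yields an unrelated pseudonym for the same identifier.
p3 = keyed_pseudonym(key_domain_b, "device-1234")

print(p1 == p2, p1 == p3)
```

Using one key per processing domain, as in this sketch, is one way to prevent a party in one domain from reproducing the pseudonyms of another.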
Keyed hash functions are especially applicable as pseudonymisation techniques in cases where a data controller needs – in a specific data processing context – to track the individuals without, however, storing their initial identifiers (see also [Digital Summit, 2017]). Indeed, if the data controller applies – always with the same secret key – a keyed hash function to a data subject’s identifier to produce a pseudonym, without storing the user’s initial identifier, then we have the following outcomes:
- The same pseudonym will always be computed for each data subject (i.e. allowing tracking of the data subject).
- Associating a pseudonym to the initial identifier is practically not feasible (provided that the controller does not have knowledge of the initial identifiers).
Therefore, if only tracking of data subjects is required, the controller needs to have access to the key but does not need to have access to the initial identifiers, after pseudonymisation has been performed. This is an important consideration that adheres to the principle of data minimization and should be considered by the controller as a data protection by design aspect.
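The tracking scenario above can be sketched as follows. The example assumes HMAC-SHA256 as the keyed hash function and a hypothetical stream of incoming device identifiers; each identifier is used only transiently to derive the pseudonym and is never stored.

```python
import hashlib
import hmac
import secrets
from collections import Counter


def keyed_pseudonym(key: bytes, identifier: str) -> str:
    """Derive a pseudonym as HMAC-SHA256(key, identifier), hex encoded."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def count_visits(key: bytes, identifiers) -> Counter:
    """Count events per pseudonym; the identifiers themselves are not kept."""
    visits = Counter()
    for identifier in identifiers:
        visits[keyed_pseudonym(key, identifier)] += 1
    return visits


tracking_key = secrets.token_bytes(32)

# Hypothetical incoming events, each carrying an initial identifier.
incoming = ["device-A", "device-B", "device-A", "device-A"]
visits = count_visits(tracking_key, incoming)

# Two distinct individuals are distinguished, with three and one events.
print(sorted(visits.values()))
```

The resulting table contains only pseudonyms and counts, so the controller can distinguish individuals while holding the key but not the identifiers.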
Moreover, a keyed hash function also has the following property: if the secret key is securely destroyed and the hash function is cryptographically strong, it is computationally hard, even for the data controller, to reverse the pseudonym to the initial identifier, even if the controller has knowledge of the initial identifiers. Therefore, the usage of a keyed hash function may allow for subsequent anonymisation of data, if necessary, since deleting the secret key effectively deletes any association between the pseudonyms and the initial identifiers. More generally, using a keyed hash function to generate a pseudonym and subsequently deleting the secret key is roughly equivalent to generating random pseudonyms, without any connection with the initial identifiers.
Another approach that is often presented as an alternative to the keyed hash function is the usage of an unkeyed (i.e. conventional) hash function with a so-called “salt” – that is, the input to the hash function is augmented by adding auxiliary random-looking data called the “salt”. Again, if such a technique is appropriately applied, for the same identifier, several different pseudonyms can be produced, according to the choice of the salt – and, thus, the D2 property is ensured, whilst the D1 property also holds with regard to third parties provided that they do not have knowledge of the salt. Of course, this conclusion is valid only as long as the salt is appropriately secured and separated from the hash. Note that, as in the case of the keyed hash, the same salt should be used by the controller in cases where there is a need to always assign the same pseudonym to the same individual 37. Moreover, salted hash functions can be utilized in cases where the controller does not need to store the initial identifiers, while still being able to track the data subjects. Last, if the salt is securely destroyed by the controller, it is not trivial to restore the association between pseudonyms and identifiers.
However, it should be stressed that in several typical cases employing salts for protecting hashes has some serious drawbacks:
- On the one hand, the salt does not share the same unpredictability properties as secret keys (e.g. a salt may consist of 8 characters, i.e. 64 bits, as in the cases of protecting users’ passwords in some Linux systems). More generally, from a cryptographic point of view, a keyed hash function is considered a more powerful approach than a salted hash function 38. There exist, though, several cryptographically strong techniques for generating salted hashes, which in turn could be considered appropriate candidates for generating pseudonyms – a notable example being bcrypt [Provos, 1999].
- Moreover, salts in most common scenarios are generally stored together with corresponding hash values, thus seriously weakening protection. The alternative use of the so-called peppers, which are hidden protected salts and are separately stored from hashes, can provide an enhanced alternative. A pseudonymisation approach that is based on salted/peppered hash values (namely, the case of Entertain TV) is described in [Digital Summit, 2017].
It is, therefore, recommended that salted hashes are used with caution for pseudonymisation and in accordance with available best-practices in the field.
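The effect of storing the salt alongside the hash can be illustrated with a short sketch. It assumes the simplest construction, SHA-256 over the salt concatenated with the identifier, and a third party holding a list of candidate identifiers; the helper names and the e-mail addresses are hypothetical, and this construction is shown only to illustrate the weakness, not as a recommended design.

```python
import hashlib
import secrets


def salted_pseudonym(salt: bytes, identifier: str) -> str:
    """Derive a pseudonym as SHA-256(salt || identifier), hex encoded."""
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


def guess_identifier(pseudonym: str, salt_guess: bytes, candidates):
    """Try each candidate identifier with the salt the attacker believes in."""
    for candidate in candidates:
        if salted_pseudonym(salt_guess, candidate) == pseudonym:
            return candidate
    return None


salt = secrets.token_bytes(16)
pseudonym = salted_pseudonym(salt, "alice@example.com")
candidates = ["bob@example.com", "alice@example.com"]

# Salt stored next to the hash: the attacker knows it and the guess succeeds.
leaked = guess_identifier(pseudonym, salt, candidates)

# Salt kept secret and separate (a "pepper"): the attacker's guess fails.
protected = guess_identifier(pseudonym, secrets.token_bytes(16), candidates)

print(leaked, protected)
```

Once the salt is known, the salted hash offers no more protection than the unkeyed hash of Section 3.2, which is why separation of the salt from the pseudonymised data is decisive.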
3.4 Encryption as a pseudonymisation technique
Symmetric encryption of data subjects’ identifiers is also considered as an efficient method to obtain pseudonyms. In a typical case, the original identifier of a data subject can be encrypted through a symmetric encryption algorithm (e.g. the AES, being the encryption standard [FIPS, 2001]), thus providing a ciphertext that is to be used as a pseudonym; the same secret key is needed for the decryption.
Figure 3: Operation of symmetric encryption
Such a pseudonym satisfies the D2 property, as well as the D1 property as long as no third party, i.e. any other than the controller or processor, has access to the encryption key and under the assumption that state-of-the-art algorithms and sufficient key lengths are used (see also [ENISA, 2014b]). For symmetric encryption algorithms, similarly to keyed hash functions, a key size of 256 bits is currently considered adequate for security, even for the post-quantum era, as has also been stated above [Bernstein, 2017].
The main difference of encryption with respect to keyed hash functions – in terms of pseudonymisation – is that the secret key owner (i.e. the data controller) may always obtain the data subjects’ initial identifiers, through a simple decryption process 39. On the contrary, as explained earlier, keyed hash functions allow data controllers to track the individuals without having knowledge of (i.e. without storing) the initial identifiers. This is not the case with encryption (as a pseudonymisation method), where the initial identifiers may always be known to the controller.
Aside from this aspect, symmetric encryption has other properties – in terms of pseudonymisation – similar to those of keyed hash functions, namely: i) the same secret key should be used to provide the same pseudonym for the same identifier, ii) if the key is destroyed, it is not trivial to associate a pseudonym with the initial identifier, even if the initial identifiers are being stored by the data controller.
Hence, symmetric encryption can generally be employed (as a pseudonymisation technique) in cases that a data controller needs not only to track the data subjects but also to know their initial identifiers (see also [Digital Summit, 2017]). Traceability is grounded on a deterministic nature of the encryption method, i.e. encrypting the same identifier with the same key always yields the same pseudonym. Re-identifiability (of initial identifiers) rests, as explained above, with the very nature of symmetric encryption.
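The two properties named above, determinism and reversibility under the key, can be illustrated with a toy sketch. Python's standard library contains no AES implementation, so the example swaps in a four-round Feistel construction whose round function is HMAC-SHA256; this is an assumption made purely for illustration and is not a substitute for a vetted cipher such as AES in a deterministic mode. All names and the 32-byte block size are hypothetical.

```python
import hashlib
import hmac
import secrets

BLOCK = 32         # fixed plaintext block size in bytes (assumption of this sketch)
HALF = BLOCK // 2
ROUNDS = 4


def _round(key: bytes, index: int, half: bytes) -> bytes:
    """Keyed round function: truncated HMAC-SHA256 over round index and half-block."""
    return hmac.new(key, bytes([index]) + half, hashlib.sha256).digest()[:HALF]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_identifier(key: bytes, identifier: str) -> str:
    """Deterministically encrypt an identifier into a hex-encoded pseudonym."""
    data = identifier.encode("utf-8")
    if len(data) > BLOCK - 1:
        raise ValueError("identifier too long for this sketch")
    # One length byte followed by the zero-padded identifier.
    block = bytes([len(data)]) + data.ljust(BLOCK - 1, b"\x00")
    left, right = block[:HALF], block[HALF:]
    for i in range(ROUNDS):
        left, right = right, _xor(left, _round(key, i, right))
    return (left + right).hex()


def decrypt_pseudonym(key: bytes, pseudonym: str) -> str:
    """Recover the initial identifier from a pseudonym using the same key."""
    block = bytes.fromhex(pseudonym)
    left, right = block[:HALF], block[HALF:]
    for i in reversed(range(ROUNDS)):
        left, right = _xor(right, _round(key, i, left)), left
    plain = left + right
    return plain[1:1 + plain[0]].decode("utf-8")


key = secrets.token_bytes(32)
pseudonym = encrypt_identifier(key, "device-1234")
print(decrypt_pseudonym(key, pseudonym))
```

Unlike the keyed hash, the holder of the key can always recover the identifier from the pseudonym alone, which is the distinction drawn in this Section.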
Apart from symmetric encryption algorithms, asymmetric (i.e. public key) encryption algorithms may be also used in specific cases for pseudonymisation purposes. The main characteristic of public key encryption is that each entity participating in the scheme has a pair of keys, i.e. the public and the private key. The public key of an entity can be used by anyone to encrypt data but only the specific entity can decrypt these data with the use of its private key. Although the two keys are necessarily mathematically related, knowledge of the public key does not allow determining the private key. To provide the so-called ciphertext indistinguishability property, the public key algorithms may be appropriately implemented in a probabilistic form by introducing randomness in the encryption process. This means that randomly chosen values are being used at each encryption cycle. In this way, if the same message is encrypted twice with the same public key each time, the corresponding two ciphertexts will be different, without affecting the decryption capability for the holder of the decryption key.
Due to its aforementioned properties, public key encryption may serve as an instrument for pseudonymisation in some specific contexts. For example, it might be desirable for the data controller that the entity (e.g. role or team) authorized to perform the pseudonymisation (within the same controller) is not the same as the one that is authorized to perform the re-identification. The usage of asymmetric encryption can facilitate this (i.e. by using the public key of the entity that is authorized to perform re-identification to generate the pseudonyms 40), thus allowing for separation of duties, especially in complex or high-risk environments [Elger, 2010]. There have been relevant applications of this approach in the health sector (see, e.g., [Aamot, 2013], [Verheul, 2016]).
Moreover, as mentioned earlier, asymmetric encryption in probabilistic form allows for generating different pseudonyms for the same individual (with the same public key) 41. Hence, it may also find application in cases where a data controller needs to assign each time a different pseudonym for the same identifier (data subject), especially when there is no need to track the data subjects (still being able to reidentify them). Note, however, that in such cases, both the public and the private key rest with the data controller (as there is no need for the public key to be accessible to other parties). Moreover, it should be stressed that appropriate implementations in symmetric ciphers may also yield probabilistic encryption (see also [Digital Summit, 2017]).
Figure 4: Operation of public key (asymmetric) encryption
The properties of asymmetric encryption may also be used in several other contexts that are related to obscuring the individuals’ identities. For instance, in distributed ledger technologies 42 in which the users do not reveal their real identities, their corresponding unique addresses may be obtained through their public keys [ENISA,2016]; such a user’s address allows the other users to verify his or her digital signature – that is to verify that the data have been indeed signed by the user with the claimed address.
Having said that, it should be stressed that asymmetric key algorithms necessitate the usage of very large keys, which in turn may give rise to implementation restrictions – e.g. keys of 3072 bits are needed in RSA (see, e.g., [Bernstein, 2017]). Even if elliptic curve cryptography is considered, which offers much smaller key sizes than RSA as well as faster computation (see, e.g., [Gura, 2004]), it is still less efficient than symmetric key algorithms. Moreover, one should keep in mind that the most widely known and used public key algorithms today – including RSA and elliptic curve cryptographic algorithms – will not be strong in the post-quantum era [Bernstein, 2017] 43.
3.5 Other cryptography-based techniques
Appropriate combination of several cryptographic schemes may also provide robust pseudonymisation approaches, for example by the use of techniques such as secure multi-party computation and homomorphic encryption (see [ENISA, 2014a] and the references therein). Several advanced cryptography-based pseudonymisation solutions have been proposed to alleviate data protection issues, especially in cases of personal data processing that present very high risks – e.g. in wide scale e-health systems. As a recent characteristic example, we refer to the polymorphic encryption and pseudonymisation technique proposed in [Verheul, 2016]. In this method, each user (i.e. patient in the case of an e-health system) has a different, cryptographically generated pseudonym at each party. For instance, the patient has different pseudonyms at doctors X, Y, Z, and at medical research groups U, V, W – that is, domain-specific pseudonyms are produced 44. Based on these so-called polymorphic techniques, a pilot example in the health sector is already in place 45.
Another important cryptographic approach for deriving pseudonyms rests on appropriately realising a decentralized solution, which allows the participating users to generate their own pseudonyms and subsequently keep these pseudonyms in their own custody [Lehnhardt, 2011]. Such a design is not a trivial task since several crucial issues need to be resolved – e.g. the pseudonym generation process should avoid duplicates, whereas each user should be able to unambiguously prove, whenever he or she wants, that he or she is the owner of a specific pseudonym. All these approaches necessitate the appropriate use of several cryptographic primitives (see, e.g., [Schartner, 2005], [Lehnhardt, 2011] – the latter has been applied by a large producer of healthcare information systems located in Germany, as is stated therein). For example, the approach in [Lehnhardt, 2011] rests on the usage of public key cryptography – and, more precisely, of elliptic curve cryptography – in a way that each user computes his or her own pseudonym based on a secret that he or she acquires. These ideas introduce a fundamental property in the overall concept of pseudonymisation, since the additional information that is needed to re-identify each user is solely under the control of the user himself or herself and not of the data controller, whose role is to provide such a decentralized pseudonymisation technique. Therefore, these approaches – although costly – seem to be the best options in cases where the data protection by design principle requires that the data controller should not have a priori knowledge of the data subject’s identity, unless the data subject chooses to prove his or her identity at any time.
As a final point, it should be noted that a common challenge for most cryptographic techniques is key management, which is usually not trivial, depending also on the overall scale of application, as well as the specific technique chosen.
3.6 Tokenization
Tokenisation refers to the process in which the data subjects’ identifiers are replaced by randomly generated values, known as tokens, that have no mathematical relationship with the original identifiers. Hence, knowledge of a token is of no use to a third party, i.e. any other than the controller or processor (see, e.g., [WP29, 2014]). Tokenisation is commonly used to protect financial transactions 46, but it is not limited to such applications 47.
Clearly, the tokenization system should be appropriately designed to ensure that indeed there is no mathematical relationship between pseudonyms and the original identifiers. Moreover, other restrictions should also be taken into account, depending on the context of the overall processing – e.g. if tokenisation is being used to pseudonymise credit card numbers in payment systems, the randomly generated tokens should not have any possibility of matching real card numbers (such a risk could possibly exist in cases of format-preserving tokenisation, i.e. in cases where the tokens have the same format as the initial data). Due to the random hidden mapping from original data to a token, it becomes evident that tokenisation satisfies both the D1 and D2 pseudonymisation properties (see Section 3.1). Since there is an entity which stores this hidden mapping (i.e. a token server in the tokenisation system), re-identification of data subjects by the data controller will be possible in all cases. This also includes tracking, as long as there is only one mapping for each identifier.
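A token server with a hidden mapping can be sketched in a few lines. The example assumes an in-memory mapping and 128-bit random tokens drawn from Python's `secrets` module; the class name `TokenVault` and its methods are hypothetical, and persistence, access control and synchronisation are deliberately left out.

```python
import secrets


class TokenVault:
    """Toy token server: stores the hidden identifier/token mapping in memory."""

    def __init__(self):
        self._token_by_identifier = {}
        self._identifier_by_token = {}

    def tokenise(self, identifier: str) -> str:
        """Return the token for an identifier, creating a random one if needed."""
        existing = self._token_by_identifier.get(identifier)
        if existing is not None:
            return existing  # one mapping per identifier, so tracking is possible
        token = secrets.token_hex(16)
        while token in self._identifier_by_token:  # avoid duplicate tokens
            token = secrets.token_hex(16)
        self._token_by_identifier[identifier] = token
        self._identifier_by_token[token] = identifier
        return token

    def detokenise(self, token: str) -> str:
        """Re-identify: only possible for the holder of the mapping."""
        return self._identifier_by_token[token]


vault = TokenVault()
token = vault.tokenise("4678 3412 5100 5239")
print(vault.detokenise(token) == "4678 3412 5100 5239")
```

Because the token is drawn at random rather than computed from the identifier, the mapping table itself is the additional information that has to be protected and kept separate.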
However, it should be noted that, despite the efficiency of tokenization, its deployment may be, depending on the context, quite challenging, e.g. synchronization of tokens across several systems may be needed in several applications. Therefore, previously mentioned approaches that employ keyed hash functions or encryption algorithms could be preferable with regard to reducing complexity and storage.
3.7 Other approaches
Several other well-known techniques, such as masking, scrambling and blurring, can also be considered in the context of pseudonymisation, having though restrictions with regard to their possible applications, whilst all of them mainly focus on pseudonymising data being at rest (i.e. data that are being stored in a file/database).
Masking refers to the process of hiding part of an individual’s identifier with random characters or other data. For example, a masked credit card number may be transformed in the following way:
4678 3412 5100 5239 -> XXXX XXXX XXXX 5239
Clearly, such an approach cannot ensure that the D1 and D2 pseudonymisation properties are always satisfied. For example, masking the IP addresses of computers lying in the same LAN may allow for reidentifying the original IP address (once a third party is able to find out the entire space of the available IP addresses in this LAN). Moreover, there are also risks, if masking is not carefully designed, to assign the same pseudonym to different users, therefore potentially leading to collisions.
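The masking example and the collision risk mentioned above can be reproduced with a small sketch. The function name `mask_card_number` and its parameters are hypothetical; the sketch keeps the last four digits visible and replaces the remaining digits with a fixed character.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "X") -> str:
    """Replace all but the last `visible` digits with `mask_char`, keeping spaces."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


masked_a = mask_card_number("4678 3412 5100 5239")
print(masked_a)  # XXXX XXXX XXXX 5239

# A different card ending in the same four digits gets the same masked value.
masked_b = mask_card_number("1111 2222 3333 5239")
print(masked_a == masked_b)
```

The second call shows the collision: two distinct identifiers end up with the same pseudonym because only the visible part distinguishes them.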
Scrambling refers to general techniques for mixing or obfuscating the characters. The process can be reversible, according to the chosen technique. For example, a simple permutation of characters may be such a scrambling, e.g. a credit card number may be transformed as follows:
4678 3412 5100 5239 -> 0831 6955 0734 4122
Evidently, scrambling can be considered a simple form of symmetric encryption, which would not satisfy either the D1 or the D2 property – e.g. a simple permutation of characters may allow re-identification in specific cases (see again the previous example with the IP addresses of a LAN).
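Scrambling by permutation, and its reversibility, can be sketched as follows. The example assumes a fixed permutation of character positions derived from a seed with Python's `random` module, which is not a cryptographic generator; the function names and the seed value are hypothetical.

```python
import random


def make_permutation(length: int, seed: int) -> list:
    """Build a fixed permutation of positions 0..length-1 from a seed."""
    positions = list(range(length))
    random.Random(seed).shuffle(positions)
    return positions


def scramble(text: str, permutation: list) -> str:
    """Output character j is the input character at position permutation[j]."""
    return "".join(text[i] for i in permutation)


def unscramble(scrambled: str, permutation: list) -> str:
    """Invert the permutation to restore the original character order."""
    original = [""] * len(permutation)
    for target, source in enumerate(permutation):
        original[source] = scrambled[target]
    return "".join(original)


card = "4678341251005239"
permutation = make_permutation(len(card), seed=2024)
scrambled = scramble(card, permutation)

# Anyone who learns the permutation restores the identifier directly.
print(unscramble(scrambled, permutation) == card)
```

The scrambled value contains exactly the same characters as the original, so the permutation is the only thing standing between a third party and the identifier.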
Generally, both masking and scrambling are in fact weak pseudonymisation techniques and their use is generally not recommended as a good practice in personal data processing. However, despite their limitations, they may be utilized to provide a level of protection in specific contexts (for instance, masked telephone numbers can be used for displaying, for billing purposes, the telephone calls made from business premises).
Blurring is another technique, which aims to use an approximation of data values, so as to reduce the precision of the data, reducing the possibility of identification of individuals. For instance, appropriate round functions can be used to transform numerical values into new ones. Blurring can be also applied to individuals’ pictures (i.e. image obfuscation) as a part of a pseudonymisation process; recent research though illustrates that image recognition techniques based on artificial neural networks may recover the hidden information from such blurred images [McPherson, 2016].
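Blurring of numerical values by rounding can be sketched as follows. The bucket width for ages and the number of decimals for coordinates are illustrative assumptions, as are the function names; suitable precision depends on the processing context.

```python
def blur_to_bucket(value: int, bucket: int) -> int:
    """Round an integer down to the start of its bucket, e.g. 37 -> 30 for bucket 10."""
    return (value // bucket) * bucket


def blur_coordinate(value: float, decimals: int) -> float:
    """Reduce the precision of a coordinate to the given number of decimals."""
    return round(value, decimals)


# Ages within the same decade become indistinguishable.
print(blur_to_bucket(37, 10), blur_to_bucket(31, 10))

# A location is reported with two decimals instead of five.
print(blur_coordinate(48.85837, 2))
```

The wider the bucket, the more individuals share the same blurred value, at the cost of less precise data.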
Other known techniques that can be referenced in this context are those of barcodes, QR codes or similar methods, which, however, aim mainly towards supporting data accuracy, rather than providing a data pseudonymisation solution.