General Data Protection Regulation (GDPR)
General Provisions
Chapter 7

EMAIL ADDRESS PSEUDONYMISATION

In this Chapter, the pseudonymisation of email addresses is considered as one more specific use case of the techniques presented earlier in the document.

An electronic mail (e-mail) address constitutes a typical identifier of an individual. An e-mail address has the form local@domain, where the local part corresponds to the user that owns the address and the domain corresponds to the mail service provider. E-mail addresses are generally used in several applications; for example, they may form the main identifier of an individual that registers to an electronic service. Moreover, e-mail addresses are typically present in many databases, in which other identifiers - such as individuals' names - may also be present.

Users tend to use the same e-mail address for different applications, sharing it with various organisations, e.g. when they sign up for online accounts. Moreover, e-mail addresses are often published online, and it has been shown that they can be easily found or guessed. Due to these special characteristics, when e-mail addresses are used as identifiers, their protection is especially important.

In this use case, email addresses are treated as identifiers (e.g. in a database or online service), and the application of different pseudonymisation techniques to them is analysed. It is assumed throughout that the pseudonymisation process is performed by a pseudonymisation entity (e.g. the data controller) as part of the operation/provision of a service.

7.1 COUNTER AND RANDOM NUMBER GENERATOR

Considering the descriptions in Chapter 5, both a counter and an RNG can be used for the pseudonymisation of emails with the use of a mapping table, such as the one shown in the example of Table 12. Clearly, pseudonymisation is strong as long as the mapping table is secured and stored separately from the pseudonymised data.

Table 12: Example of email address pseudonymisation with RNG or counter (full pseudonymisation)

| E-mail address | Pseudonym (random number generator) | Pseudonym (counter generator) |
|---|---|---|
| alice@abc.eu | 328 | 10 |
| bob@wxyz.com | 105 | 11 |
| eve@abc.eu | 209 | 12 |
| john@qed.edu | 83 | 13 |
| alice@wxyz.com | 512 | 14 |
| mary@clm.eu | 289 | 15 |
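A mapping table of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `MappingTablePseudonymiser`, its parameters (`use_counter`, `start`, `rng_bound`) and the in-memory dictionary are hypothetical and not part of the original text. In practice the table would be stored securely and separately from the pseudonymised data.

```python
import secrets


class MappingTablePseudonymiser:
    """Hypothetical helper: deterministic pseudonyms backed by a mapping table."""

    def __init__(self, use_counter=False, start=10, rng_bound=10**6):
        self.table = {}            # e-mail address -> pseudonym (keep secret and separate)
        self.used = set()          # pseudonyms already assigned (RNG mode only)
        self.use_counter = use_counter
        self.next_value = start    # next counter value (counter mode only)
        self.rng_bound = rng_bound

    def pseudonymise(self, email):
        # Same address always yields the same pseudonym (deterministic scenario).
        if email in self.table:
            return self.table[email]
        if self.use_counter:
            pseudonym = self.next_value
            self.next_value += 1
        else:
            # Draw random numbers until one is found that is not yet in use.
            pseudonym = secrets.randbelow(self.rng_bound)
            while pseudonym in self.used:
                pseudonym = secrets.randbelow(self.rng_bound)
            self.used.add(pseudonym)
        self.table[email] = pseudonym
        return pseudonym

    def recover(self, pseudonym):
        # Reversal is only possible for whoever holds the mapping table.
        for email, assigned in self.table.items():
            if assigned == pseudonym:
                return email
        return None


counter = MappingTablePseudonymiser(use_counter=True, start=10)
print(counter.pseudonymise("alice@abc.eu"))   # first counter value
print(counter.pseudonymise("bob@wxyz.com"))   # next counter value
```

The counter variant makes the sequential nature of the pseudonyms visible, while the RNG variant assigns values that carry no ordering information.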

In the example of Table 12, both the counter and the RNG result in pseudonyms that do not reveal any information on the initial identifiers (email addresses) and do not allow any further analysis (e.g. statistical analysis) on the pseudonyms. In order to increase utility, it is possible to apply pseudonymisation only to a part of the email address, e.g. the local part (without affecting the domain part - see Table 13).

Table 13: Example of email address pseudonymisation with RNG or counter (only local part pseudonymisation)

| E-mail address | Pseudonym (random number generator) | Pseudonym (counter generator) |
|---|---|---|
| alice@abc.eu | 328@abc.eu | 10@abc.eu |
| bob@wxyz.com | 105@wxyz.com | 11@wxyz.com |
| eve@abc.eu | 209@abc.eu | 12@abc.eu |
| john@qed.edu | 83@qed.edu | 13@qed.edu |
| alice@wxyz.com | 512@wxyz.com | 14@wxyz.com |
| mary@clm.eu | 289@clm.eu | 15@clm.eu |

As shown in Table 13, while the emails are pseudonymised, the domain remains visible and relevant analysis can still be conducted (e.g. counting the email users originating from the same domain). As discussed earlier in the document, a counter may be weaker in terms of protection, as its sequential nature allows for predictions (e.g. where email addresses come from the same domain, the use of a counter may reveal information regarding the order in which the different email users were entered in the database).
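The local-part-only approach of Table 13, and the kind of domain analysis it still permits, could look roughly as follows. This is a minimal sketch: the function name `pseudonymise_local_part` and the bound `rng_bound` are assumptions made for illustration, and the address is split at the last `@` character.

```python
import secrets
from collections import Counter


def pseudonymise_local_part(emails, rng_bound=10**6):
    """Hypothetical helper: replace only the local part by a random number."""
    table = {}    # original address -> pseudonymised address (mapping table)
    used = set()  # random numbers already assigned
    for email in emails:
        if email in table:
            continue
        _local, _, domain = email.rpartition("@")
        number = secrets.randbelow(rng_bound)
        while number in used:
            number = secrets.randbelow(rng_bound)
        used.add(number)
        table[email] = f"{number}@{domain}"  # the domain is kept in clear
    return table


emails = ["alice@abc.eu", "bob@wxyz.com", "eve@abc.eu",
          "john@qed.edu", "alice@wxyz.com", "mary@clm.eu"]
table = pseudonymise_local_part(emails)

# Analysis on the pseudonyms alone: number of users per domain.
domain_counts = Counter(p.rpartition("@")[2] for p in table.values())
print(domain_counts)
```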

Starting from this simple case, depending on the level of data protection and utility that the pseudonymisation entity needs to achieve, different variations might be possible by retaining different levels of information in the pseudonyms (e.g. on identical domains, local parts, etc.).

Table 14: Examples of email address pseudonymisation with RNG - various utility levels

| E-mail address | Pseudonym (RNG) retaining the info on identical domain | Pseudonym (RNG) retaining also the info on identical country/extension | Pseudonym (RNG) retaining the info on identical local parts and domains | Pseudonym (RNG) retaining the info on identical country/extension, domains and local parts |
|---|---|---|---|---|
| alice@abc.eu | 328@1051 | 328@1051.3 | 328@1051 | 328@1051.3 |
| bob@wxyz.com | 105@833 | 105@833.7 | 105@833 | 105@833.7 |
| eve@abc.eu | 209@1051 | 209@1051.3 | 209@1051 | 209@1051.3 |
| john@qed.edu | 83@420 | 83@420.8 | 83@420 | 83@420.8 |
| alice@wxyz.com | 512@833 | 512@833.7 | 328@833 | 328@833.7 |
| mary@clm.eu | 289@2105 | 289@2105.3 | 289@2105 | 289@2105.3 |
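The last column of Table 14 can be approximated by keeping one mapping table per component of the address. The sketch below is an assumption about how such a scheme might be organised, not a description of a specific implementation: the class `ComponentPseudonymiser` and its three tables (local part, full domain, extension) are hypothetical names.

```python
import secrets


class ComponentPseudonymiser:
    """Hypothetical helper: separate RNG mapping tables per address component."""

    def __init__(self, bound=10**6):
        self.bound = bound
        # One table each for local parts, full domains and extensions.
        self.tables = {"local": {}, "domain": {}, "extension": {}}

    def _lookup(self, kind, value):
        table = self.tables[kind]
        if value not in table:
            used = set(table.values())
            candidate = secrets.randbelow(self.bound)
            while candidate in used:
                candidate = secrets.randbelow(self.bound)
            table[value] = candidate
        return table[value]

    def pseudonymise(self, email):
        local, _, domain = email.rpartition("@")
        extension = domain.rpartition(".")[2]   # e.g. "eu" for "abc.eu"
        return "{}@{}.{}".format(
            self._lookup("local", local),        # identical local parts match
            self._lookup("domain", domain),      # identical domains match
            self._lookup("extension", extension) # identical extensions match
        )


pseudonymiser = ComponentPseudonymiser()
for address in ["alice@abc.eu", "eve@abc.eu", "alice@wxyz.com", "mary@clm.eu"]:
    print(address, "->", pseudonymiser.pseudonymise(address))
```

Every additional piece of information retained in the pseudonym requires its own table and its own lookups, which is one way to see why complexity grows with the utility level.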

The main pitfall of both the counter and the RNG is the scalability of the technique for large datasets, especially if the same pseudonym must always be assigned to the same address (i.e. in a deterministic pseudonymisation scenario as in Table 12). Indeed, in such a case, the pseudonymisation entity needs to perform a cross-check throughout the whole pseudonymisation table whenever a new entry is to be pseudonymised. Complexity increases in more sophisticated implementations such as those shown in Table 14 (e.g. when the pseudonymisation entity needs to classify email addresses with the same domain or the same country without revealing this domain/country).

7.2 CRYPTOGRAPHIC HASH FUNCTION

As stated in [34], the total number of email accounts worldwide is roughly estimated at 4.7 billion ≈ 2^32 (although the space of valid email addresses is, in theory, practically infinite, existing addresses lie in a much smaller space). This fact, as also mentioned earlier in the Chapter, makes email addresses easy to find or guess, thus rendering cryptographic hash functions a weak technique for pseudonymisation [34]. Indeed, it is trivial for any insider or external adversary with access to a pseudonymised list of email addresses to perform a dictionary attack (Figure 10). This observation is relevant to all pseudonymisation scenarios presented in Chapter 3 (independently of whether the pseudonymisation entity is the controller, the processor or a trusted third party).

Figure 10: Reversing an e-mail address from its hash value
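The dictionary attack described above can be illustrated with a short sketch. SHA-256 is used here only as an example of a cryptographic hash function, and the function names and candidate list are assumptions made for illustration. A real adversary would work from a far larger list of known or guessed addresses.

```python
import hashlib


def sha256_hex(email):
    # Unkeyed hash: anyone can recompute it for any candidate address.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def dictionary_attack(hashed_pseudonyms, candidate_emails):
    """Map each hash back to an address when the address is among the candidates."""
    lookup = {sha256_hex(candidate): candidate for candidate in candidate_emails}
    return {h: lookup[h] for h in hashed_pseudonyms if h in lookup}


# Pseudonymised list as it might be shared with a third party.
pseudonymised = [sha256_hex("alice@abc.eu"), sha256_hex("john@qed.edu")]

# Addresses the adversary already knows or can guess.
candidates = ["alice@abc.eu", "bob@wxyz.com", "eve@abc.eu"]

recovered = dictionary_attack(pseudonymised, candidates)
print(recovered)  # only the hash of alice@abc.eu is reversed
```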

Despite the aforementioned pitfalls of cryptographic hash functions, it should be pointed out that, as indicated in [35], service providers often share email addresses with third parties after simply hashing them. A concrete example is the operation of so-called custom audience lists, which allow companies to compare hashed values of customers’ email addresses in order to define common lists of customers.

Notwithstanding the above significant data-protection risks, cryptographic hash values could still be of some use under certain conditions, e.g. for internal coding of email addresses (for example in the context of research activities) and as a validation/integrity mechanism for a data controller (see also [1]). Hash functions could also be used to pseudonymise parts of an email address (e.g. only the domain part), thus allowing some utility on the derived pseudonyms; if the remaining part is pseudonymised by a stronger method (e.g. MAC), then the risk of reversing the whole initial e-mail address is significantly reduced.

7.3 MESSAGE AUTHENTICATION CODE

Compared to simple hashing, a message authentication code (MAC) provides significant data protection advantages also for email address pseudonymisation, as long as the secret key is securely stored. Moreover, the pseudonymisation entity may use different secret keys, for different sectors, to generate for example different sector-based pseudonyms for the same email address. A MAC can also be used to restrict the controller from having access to the email addresses in cases where access to the pseudonyms is sufficient for the particular purpose of processing (e.g. under scenarios 5 and 6 in Chapter 3). Such a case could be, for example, in interest-based display advertising, in which the advertisers need to associate a unique pseudonym for each individual but without being able to reveal the user's original identity [36].
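Sector-based pseudonyms of this kind can be sketched with HMAC from the Python standard library. HMAC-SHA256 is chosen here only as one example of a MAC; the function name `mac_pseudonym`, the sector names and the in-memory key store are assumptions for illustration. In practice the keys would be generated and stored under the pseudonymisation entity's key management.

```python
import hashlib
import hmac
import secrets


def mac_pseudonym(email, key):
    # Keyed pseudonym: cannot be recomputed without knowledge of the key.
    return hmac.new(key, email.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical sectors, each with its own secret key.
sector_keys = {
    "billing": secrets.token_bytes(32),
    "marketing": secrets.token_bytes(32),
}

email = "alice@abc.eu"
billing_pseudonym = mac_pseudonym(email, sector_keys["billing"])
marketing_pseudonym = mac_pseudonym(email, sector_keys["marketing"])

# Same address, different sectors: the pseudonyms cannot be linked without the keys.
print(billing_pseudonym)
print(marketing_pseudonym)
```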

As with the previous techniques, different implementation scenarios are possible in practice in order to increase the utility of the pseudonyms. For example, one possible approach would be to apply the MAC separately to different parts of the e-mail address (e.g. local and domain parts), using the same secret key. A characteristic example is shown in Figure 11: using the same key for each MAC generates the same sub-pseudonyms for the corresponding domain parts (shown in green in the figure) whenever the email address domains are identical. However, since the output of a MAC has a fixed size, which is generally much larger than the size of the initial e-mail address, the resulting pseudonyms may be quite large (and even larger if different parts are pseudonymised separately).

Figure 11: Using MAC to generate pseudonymised e-mail addresses with some utility
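The per-part approach can be sketched as follows, again with HMAC-SHA256 as an assumed example of a MAC and with hypothetical function names. The local and domain parts are processed independently with the same key, so identical domains yield identical domain sub-pseudonyms.

```python
import hashlib
import hmac
import secrets


def mac_hex(value, key):
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mac_pseudonym_by_part(email, key):
    """Apply the MAC separately to the local part and to the domain part."""
    local, _, domain = email.rpartition("@")
    return f"{mac_hex(local, key)}@{mac_hex(domain, key)}"


key = secrets.token_bytes(32)  # secret key, kept by the pseudonymisation entity
p_alice = mac_pseudonym_by_part("alice@abc.eu", key)
p_eve = mac_pseudonym_by_part("eve@abc.eu", key)

# Identical domains produce identical domain sub-pseudonyms.
print(p_alice.split("@")[1] == p_eve.split("@")[1])

# Two hexadecimal SHA-256 outputs plus the '@' separator.
print(len(p_alice))
```

The length of the result, two fixed-size MAC outputs for a twelve-character address, illustrates the size increase mentioned above.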

One important aspect of the practical implementation of a MAC is recovery. It should be stressed that even the pseudonymisation entity, which has access to the secret key, is not able to reverse the pseudonyms directly; reversal can be obtained only indirectly, by recomputing the pseudonym of each known e-mail address and looking for matches in the pseudonymised list. Clearly, if a pseudonymisation mapping table is available, reversing pseudonyms is trivial, but in such a case the storage requirements also increase. For these reasons, MAC is probably not the most practical pseudonymisation technique in cases where the data controller needs to be able to map pseudonyms to e-mail addresses easily (e.g. in some cases under Chapters 3.1 and 3.2).
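The indirect recovery described above amounts to recomputing the MAC over every known address and matching the results. A minimal sketch, assuming HMAC-SHA256 and hypothetical function names:

```python
import hashlib
import hmac
import secrets


def mac_pseudonym(email, key):
    return hmac.new(key, email.encode("utf-8"), hashlib.sha256).hexdigest()


def recover_by_recomputation(pseudonyms, known_emails, key):
    """Recompute the MAC of each known address and match it against the list."""
    recomputed = {mac_pseudonym(email, key): email for email in known_emails}
    # Pseudonyms whose address is not among the known ones stay unresolved (None).
    return {p: recomputed.get(p) for p in pseudonyms}


key = secrets.token_bytes(32)
pseudonymised_list = [mac_pseudonym("alice@abc.eu", key),
                      mac_pseudonym("mary@clm.eu", key)]
known = ["alice@abc.eu", "bob@wxyz.com"]

resolved = recover_by_recomputation(pseudonymised_list, known, key)
print(resolved)
```

The cost grows with the number of known addresses, and a pseudonym whose address is not in that list cannot be reversed at all, even by the key holder.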

7.4 ENCRYPTION

An alternative to MAC is encryption, applied especially in a deterministic way, i.e. by utilising a secret key to produce a pseudonym for each e-mail address (symmetric encryption). Deployment is more practical in such case, since there is no need to provide for a pseudonymisation mapping table: recovery is directly possible through the decryption process [37].

Note that, although some asymmetric (public key) cryptographic algorithms can be implemented in a deterministic way, they are not recommended for the pseudonymisation of e-mail addresses (or for other data types, see also [1]). For example, let us assume that the pseudonymisation entity needs to generate, for each e-mail address, different pseudonyms for different – internal or external – users/recipients (with the assumption that each recipient will be able to re-identify his or her own data but not the pseudonymised data of other recipients). One possibility to achieve this goal would be to encrypt the e-mail addresses with the public key of each recipient, thus allowing only the specific recipient to perform the decryption. However, since public keys are in principle available to anyone, any adversary may mount a dictionary attack based on known (or guessed) e-mail addresses (like the one shown in Figure 10, with public key encryption under a known public key used instead of a hash function).

By its nature, encryption does not by default allow for any utility of the pseudonymised data. Encrypting the parts of an e-mail address separately may alleviate this issue, similarly to message authentication codes (see Figure 11, in which the MAC can be replaced by an encryption algorithm). More generally, specific cryptographic techniques can be used to allow pseudonyms to carry some useful information; an illustrative example is given next with format preserving encryption.

FORMAT PRESERVING ENCRYPTION (FPE)

A database schema might expect a particular data type for specific fields. For example, an email address is expected to contain a local part (info), followed by an @ symbol, which in turn is followed by a domain. If the data controller does not need to retain the initial e-mail addresses but still needs to keep a pseudonymised list that preserves the structure of the database, format preserving encryption is a suitable candidate. There are several known implementations of format-preserving encryption, based on known encryption schemes. In any case, any (pseudo)random substitution of characters by other characters lying in the same alphabet - i.e. the set of alphanumeric characters enriched by special characters appearing in local parts of e-mail addresses - suffices to ensure that the derived pseudonym has the desired form. The difference between FPE and conventional cryptography is illustrated in Figure 12.

Figure 12: Conventional vs. format preserving encryption to derive pseudonym from e-mail address

Note that, in Figure 12, a symmetric stream cipher has been used for the conventional encryption, in order to ensure that the derived pseudonym has the same length as the initial address (the characters of the derived pseudonym are non-alphanumeric and are therefore given in hexadecimal form).

It should be noted that, depending on the case, it might be needed to appropriately engineer FPE implementations, in order to avoid the emergence of patterns that may leak information on the individuals’ identities.
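The idea of a keyed substitution of characters within the same alphabet can be illustrated with a toy sketch. This is not one of the known format-preserving encryption schemes and should not be used in practice: the alphabet, the function names and the way the per-position shift is derived from HMAC-SHA256 are all assumptions made for illustration. The `@` and `.` characters are left untouched so that the pseudonym keeps the shape of an e-mail address, and the input is lower-cased before substitution.

```python
import hashlib
import hmac
import string

# Assumed alphabet: lower-case letters, digits and a few special characters.
ALPHABET = string.ascii_lowercase + string.digits + "_-+"


def _shift(key, position):
    # Pseudo-random shift for a given character position, derived from the key.
    digest = hmac.new(key, str(position).encode("ascii"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % len(ALPHABET)


def _transform(text, key, direction):
    out = []
    for position, char in enumerate(text):
        if char in ALPHABET:
            index = ALPHABET.index(char) + direction * _shift(key, position)
            out.append(ALPHABET[index % len(ALPHABET)])
        else:
            out.append(char)  # '@', '.' and other characters are kept as-is
    return "".join(out)


def toy_fpe_encrypt(email, key):
    return _transform(email.lower(), key, +1)


def toy_fpe_decrypt(pseudonym, key):
    return _transform(pseudonym, key, -1)


key = b"illustrative-key-not-for-real-use"
pseudonym = toy_fpe_encrypt("alice@abc.eu", key)
print(pseudonym)                        # same length, '@' and '.' in the same places
print(toy_fpe_decrypt(pseudonym, key))  # recovers the original address
```

Because the shift depends only on the character position, two addresses sharing a character at the same position also share it in their pseudonyms. This is exactly the kind of pattern that a carefully engineered FPE implementation needs to avoid.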