As mentioned in Chapter 3, the primary goal of pseudonymisation is to limit the linkability between a pseudonymised dataset and the holders of the pseudonyms and to thereby protect the identity of the data subjects. This type of protection is typically intended to counter the efforts of an adversary to perform a re-identification attack.
This Chapter considers the possible adversarial models and different types of re-identification attacks that are important to pseudonymisation. To this end, the notions of insider and external adversaries are addressed, while examining their possible roles under the pseudonymisation scenarios discussed earlier in the report. Understanding these topics is an essential element for further analysing the use of pseudonymisation techniques in the following Chapters.
4.1 INSIDER ADVERSARIES
According to the common understanding of the term in IT security, an insider is an adversary with specific knowledge, capabilities, or permissions (with regard to the target of the adversary)15. In the context of pseudonymisation, this implies that the adversary is in a position to gain information about the pseudonymisation secret and/or other relevant significant information.
For example, considering the scenarios in Chapter 3, an insider could be on the controller’s side (e.g. an employee working for the controller) under scenarios 1, 2, 3 and 4. It could also rest on the processor’s side (e.g. a malicious employee of a contractor) under scenarios 2 and 4. Last, in the case of scenario 5, the insider adversary could lie within the trusted third party (acting in this scenario as the pseudonymisation entity). By default third parties that might legitimately have access to the personal data (e.g. a supervisory or law enforcement authority) are not considered as adversaries16.
4.2 EXTERNAL ADVERSARIES
In contrast to the insider adversary, an external adversary does not have direct access to the pseudonymisation secret or other relevant information. However, this type of adversary may have access to a pseudonymised dataset, and may also be able to perform the task of pseudonymisation to arbitrary input data values chosen by the adversary (e.g. by having access to a black box implementation of the pseudonymisation function, or by being able to force the pseudonymisation entity to pseudonymise arbitrary inputs). The goal of an external adversary is to increase his or her own information on the pseudonymised dataset, e.g. by learning the identity behind a given pseudonym (and acquiring further information on that identity from the additional data found in the dataset for the given pseudonym).
Considering the scenarios of Chapter 3, by definition any actor who acts maliciously in all scenarios and is not part of the pseudonymisation entity or working on behalf of the pseudonymisation entity should be considered as an external adversary. A (malicious) data controller could take the role of an external adversary under scenario 5 or 6. A (malicious) data processor could also take this role under scenario 3.
4.3 GOALS OF ATTACK ON PSEUDONYMISATION
Depending on the context and the pseudonymisation method, the adversary can have different goals that he or she wants to achieve against pseudonymised data, i.e. retrieval of the pseudonymisation secret, complete re-identification, or discrimination. While most of the examples described in the next paragraphs focus on uncovering the “real” identity of the data subjects, it should be noted that a successful attack is not (only) a matter of reverse engineering, but more the capability of singling out a specific individual from a group (even if the “real” identity is not revealed).
4.3.1 Pseudonymisation secret
In this case, the adversary focuses on discovering the pseudonymisation secret (i.e. when the pseudonymisation secret is used). This attack is the most severe one, as with the use of the pseudonymisation secret, the adversary is able to re-identify any pseudonym in the dataset (complete re-identification or discrimination), as well as to perform further pseudonymisation processes on the dataset.
4.3.2 Complete re-identification
When the aim of the attack is complete re-identification, the adversary wishes to achieve the linking of one or more pseudonyms back to the identity of the pseudonym holders. This type of adversary has largely been discussed in the literature (see e.g.   ).
The most severe complete re-identification attack consists of the re-identification of all pseudonyms. The adversary can use two strategies to achieve this goal: recovering each identifier from the corresponding pseudonym independently; or recovering the pseudonymisation secret (see in 4.3.1). The least severe form of complete re-identification attacks involves an adversary who can only re-identify a subset of pseudonyms in the dataset. For example, consider a pseudonymised dataset of student grades of a university course. Each entry of the dataset contains a pseudonym corresponding to the identity of the student (name and surname) and a second pseudonym on the student's gender (e.g. by mapping male students to odd numbers and female students to even numbers). An adversary succeeds in a complete re-identification attack if he/she recovers the name, surname and gender of a student.
The goal of the discrimination attack is to identify properties of a pseudonym holder (at least one). These properties may not directly lead to uncovering the identity of the pseudonym holder, but may suffice to discriminate him or her in some way.
Considering the student grades example that was presented before, the dataset of student grades may contain two even numbers among many odd numbers as pseudonyms. Even numbers correspond to female students while odd numbers correspond to male students (this fact is known to the attacker). Both even numbers have scored 100% as result at the final exam. Further, let us assume that there are no other students that scored 100% in the pseudonymised dataset. If an adversary gains additional knowledge that a certain student scored 100% in this course, the attacker immediately learns that this student was female. Vice versa, if the adversary learns that a student of that course was female, the adversary immediately learns that this student had scored 100%. It is important to understand that the adversary does not learn the identity of a pseudonym holder here, but only learns some property (i.e. gender or grade value) of the holder. Given that several students share the same combination of property values, the adversary is not able to pinpoint the exact data record of a particular pseudonym holder. However, this extra information gained may already suffice for purposes of discrimination that the adversary intends to perform, or it may be utilised in a subsequent background knowledge attack to uncover the identity behind a pseudonym.
4.4 MAIN ATTACKING TECHNIQUES
There are three main generic techniques to break a pseudonymisation function: brute force attacks (exhaustive search), dictionary search, and guesswork17. The effectiveness of these attacks depends on several parameters, including:
The amount of pseudonym holder (data subject) information contained in the pseudonym.
- The background knowledge of the adversary.
- The size of the identifier domain.
- The size of the pseudonym domain.
- The choice and configuration of the pseudonymisation function utilised (this also includes the size of the pseudonymisation secret).
These attacking techniques are briefly described next.
4.4.1 Brute force attack
The practicality of this attack technique is conditioned on the adversary’s ability to compute the pseudonymisation function (i.e. there is no pseudonymisation secret) or his/her access to a “black box” implementation of the pseudonymisation function. Depending on the goal of the attack, extra conditions may apply. If the brute force attack is used to achieve complete reidentification (i.e. restoration of the original identity), the identifier domain must be finite and relatively small. For each pseudonym encountered by the adversary, he/she can attempt to recover the original identifier by applying the pseudonymisation function on each value of the identifier domain until he/she finds a match.
Table 1: Pseudonymisation of month of birth
|Month of birth
||Month of birth
Let us consider the pseudonymisation of a month of birth in a dataset. The size of the identifier domain is 12, thus an adversary can enumerate quickly all the possibilities. The pseudonyms associated to each month are computed in this case as the sum of the ASCII code of the first three letters of the month of birth (with the first being a capital one). Let us consider that an adversary has encountered the pseudonym 301. He or she can apply on each month of birth the pseudonymisation function until he/she finds the month which corresponds to the value 301. Table 1 shows the computations made by the adversary to re-identify pseudonym 301 resulting in the mapping table of the pseudonymisation function.
Obviously, the size of the identifiers domain is critical to successfully mount this attack. For small identifiers domains, like in the example presented above, a brute force attack is trivially feasible. If the identifiers-domain size is infinite, the brute force attack typically becomes infeasible. If the size of the identifiers domain is too large, complete re-identification is extremely difficult, leaving the adversaries with the potential of a discrimination attack.
Indeed, in such case the adversary can consider a subdomain of the identifiers domain for which he or she can compute all the pseudonyms. Let us come back to the example of Table 1, while assuming that the domain is small. Let us assume that the adversary wants to discriminate people with a month of birth starting with the letter J from those starting with a different letter. This subdomain contains January, June and July. The adversary can mount an exhaustive search on this subdomain by computing the pseudonyms corresponding to January, June and July. If he or she finds a pseudonym different from 281, 301 and 299 then he or she knows that the month of birth is not starting with the letter J.
In the case where a pseudonymisation secret is used, even a small identifier domain may not allow mounting such an attack (since the attacker is not able to compute the pseudonymisation function and provided that there is no access to a “black box” implementation of this function). In such a case, a brute force attack can be mounted over the entire space of pseudonymisation secrets – namely the attacker exhaustively checks all the possible secrets and, for each of them, he or she computes the recovery function. This attack will be successful if the attacker correctly guesses the pseudonymisation secret, regardless of the size of the identifier domain. Therefore, to thwart such an attack, the number of possible pseudonymisation secrets should be sufficiently large so as to render the attack practically impossible.
4.4.2 Dictionary search
Dictionary search is an optimisation of brute force attack, which can save computation costs. The adversary has to deal with a large amount of pseudonyms to carry out complete reidentification or discrimination. Therefore, he or she precomputes a (huge) set of pseudonyms and saves the result in a dictionary. Each entry in the dictionary contains a pseudonym and the corresponding identifier or information. Each time the adversary needs to re-identify a pseudonym, he/she is going to search into the dictionary. This search has a pre-computation cost of an exhaustive search and stores the result in large memory. The re-identification of a pseudonym has only the cost of a lookup in the dictionary. The dictionary search is essentially the computation and storage of the mapping table. Time/memory trade-offs are even possible using Hellman tables  or rainbow tables  to further extend the range. However, there are specific variants of this attack that utilise additional knowledge on the way the pseudonymisation function works. Such attacks may even work for infinite input domains
This type of attack utilises some background knowledge (such as probability distribution or any other side information), which the adversary may have on some (or all) of the pseudonym holders, the pseudonymisation function, or the dataset. Implicitly, exhaustive search and dictionary search assume that all the identifiers have the same probability or frequency of occurrences. However, some identifiers may be more frequent than others. Exploiting the statistical characteristics of the identifiers is known as guesswork    and is widely applied in the password-cracking community. It is important to notice that guesswork can still be applied even when the identifiers domain is huge. The adversary does not necessarily need to have access to the pseudonymisation function (since discrimination is possible by simply performing a frequency analysis of the observed pseudonyms).
Let us consider a case dealing with pseudonyms corresponding to ‘first names’. The domain of ‘first names’ is difficult to explore in its entirety. However, the adversary knows which ‘first names’ are the most popular (Table 2). The adversary can mount an exhaustive search or dictionary search on the domain of the most popular ‘first names’ and achieve discrimination.
Table 2: A list of popular first names
|Most popular first names
Let us assume a similar case but with an infinite size of identifiers domain. A finite subdomain of identifiers, which are included in the dataset, can be defined. If the adversary can guess this subdomain, he/she can mount an exhaustive search (see Chapter 6 for relevant use case on email address pseudonymisation). Depending on the amount of background information or metadata that the adversary possesses, and the amount of linkable information found in the pseudonymised dataset, this type of attack may lead to uncovering the identity of a single pseudonym holder, a fraction of them, or the entirety of the dataset. Especially for small datasets, such attacks may be feasible with high success rates.
4.5 UTILITY AND DATA PROTECTION
Depending on the choice of pseudonymisation function, a pseudonym may contain some information on the original identifier. Therefore, every such type of pseudonym carries the risk of being subject to a re-identification attack as those described above. For example, an attacker with sufficient background knowledge might be able to link the pseudonym back to its identifier by performing a guesswork.
Figure 7: Utility and data protection
However, in many cases, the additional information on the original identifier contained in the pseudonym is kept for linkage among pseudonyms themselves, to be performed by a valid subsequent data controller. For instance, a pseudonym may keep the year of birth from a person's birth date as part of the pseudonym (e.g. ‘’AAAA1999’’). This way, it is feasible to categorise pseudonyms based on their year of birth, e.g. concerning their age, their legal status (child or adult), their life conditions (schoolchild/working/retired), or similar. This may be an intentional feature of the pseudonymisation function utilised, allowing controllers to perform such sort of classification even on the pseudonymised data.
Clearly, the choice of the pseudonymisation function may allow for some utility of the pseudonyms created, taking into account the potential loss of protection caused by this pseudonymisation approach. Hence, a trade-off between utility and data protection can be considered (see Figure 7). When considering the application of pseudonymisation to real-world scenarios, this trade-off should be analysed carefully, so as to optimize utility for the intended purposes while keeping the protection of the pseudonym holders (data subjects) as strong as possible.