Chapter 8

PSEUDONYMISATION IN PRACTICE: A MORE COMPLEX SCENARIO

As can be seen from the previous two use cases in Chapters 6 and 7, pseudonymisation of even the simplest data types, like IP addresses or e-mail addresses, is a challenging and error-prone task. In real-world systems, however, it is often not the choice of pseudonymisation technique for one or two specific identifiers that causes the most problems; it is the implicit linkability among a set of pseudonyms and other data values that are joined into a more complex data structure. The most common example is that of an online service that creates user profiles on registration and enriches these profiles with personal information on the user whenever new data becomes available. Here, even if the user’s e-mail address and all IP addresses found in the user’s access logs are pseudonymised rigorously as discussed above, there remains a substantial threat of re-identification or discrimination based on the pseudonymised data structure itself. In this section, these more complex cases of data pseudonymisation are discussed.

8.1 A MOCK-UP EXAMPLE

For the sake of discussion, let us assume an example scenario that closely resembles commonly found real-world settings: an online social network. The imaginary operator, SocialNetwork Inc. (dubbed SN hereafter), acts as the data controller and allows its users (assumed to be human individuals only) to register for an account that is stored in the datacentre of SN. With that account, users can make use of a set of functions that e.g. allow linking to other users, organisations, or topics of interest. On registration, users of SN have to provide their real name (first and last name), a nickname, their birthdate and gender, a set of optional personal information (location, interests, biometrics, etc.), as well as a valid e-mail address. Whenever users access any of the services of SN, their interaction is logged and added to their user profile, including the timestamp and IP address of access.

In order to improve compliance with the GDPR, the management of SN decided to pseudonymise the IP addresses in the access logs according to the techniques discussed in Chapter 6. The remaining information is kept in plain text, as it needs to be presented to the user on the websites of SN where necessary, or to perform checks and validations (e.g. the birthdate is needed to calculate the age and verify that the user is older than 16 years when accessing special services). Pseudonymisation of the e-mail address is not feasible here, as SN needs to be able to send e-mails with notifications (and other content) to the users.
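As a minimal illustration, the following Python sketch shows how such a log pseudonymisation could look, assuming a keyed hash (HMAC-SHA-256) as one of the techniques from Chapter 6; the key, field names, and values are hypothetical.

```python
import hmac
import hashlib

# Hypothetical secret key, held by SN only (never shared with OSS).
PSEUDONYMISATION_KEY = b"replace-with-a-secret-key-held-by-SN"

def pseudonymise_ip(ip_address: str) -> str:
    """Map an IP address to a deterministic pseudonym via HMAC-SHA-256."""
    return hmac.new(PSEUDONYMISATION_KEY, ip_address.encode(), hashlib.sha256).hexdigest()

def pseudonymise_log_entry(entry: dict) -> dict:
    """Replace the IP address; all other fields stay in plain text."""
    result = dict(entry)
    result["ip"] = pseudonymise_ip(entry["ip"])
    return result

entry = {"ip": "198.51.100.23", "timestamp": "2019-11-05T08:14:02Z"}
print(pseudonymise_log_entry(entry))
```

Note that this pseudonymisation is deterministic: the same IP address always maps to the same pseudonym, which preserves linkability across log entries and is precisely what the following sections exploit.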

Assume a second imaginary organisation, Online Security Services Corp. (dubbed OSS hereafter), which acts as a data processor on behalf of SN, with the task of maintaining storage and security services for parts of the user database of SN. In this position, OSS has access to the pseudonymised log files of SN, i.e. to the pseudonymised IP addresses and timestamps of all website accesses, but not to the original IP addresses themselves. In such a setting, OSS cannot re-identify the users behind an IP address, because that data is stored in a different database at SN that is not accessible to OSS. Thus, with respect to pseudonymisation, this corresponds to the scenario from Chapter 3.3, with SN as data controller and OSS as subsequent data processor.

8.2 DATA-INHERENT INFORMATION

At first glance, OSS is not able to break the pseudonymisation of IP addresses performed by SN, assuming SN utilised a sufficiently strong pseudonymisation function. Depending on the pseudonymisation function, and especially on the pseudonymisation policy (cf. Chapter 5.2), OSS might still be able to infer whether a certain pseudonym occurs frequently, rarely, only once, or not at all in the database. This by itself might not suffice to uncover an identity, but it can already be utilised to identify frequently accessing users. If an access record contains a pseudonym with a high frequency of occurrence, OSS can infer that it probably belongs to a heavy user of SN. Conversely, if a pseudonym occurs for the first time in the dataset, most likely this user just registered with SN and accessed the user account for the first time, or the IP address of a registered user changed (which can happen frequently, making all of these observations probabilistic).
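The following illustrative sketch shows how such a frequency analysis could be carried out on the pseudonymised log alone; all pseudonyms and field names are invented.

```python
from collections import Counter

# OSS sees only pseudonyms, never plain IP addresses, yet can still count
# how often each pseudonym occurs in the log it stores for SN.
log = [
    {"ip_pseudonym": "a3f1", "timestamp": "2019-11-04T08:01:00Z"},
    {"ip_pseudonym": "a3f1", "timestamp": "2019-11-04T12:30:00Z"},
    {"ip_pseudonym": "77c2", "timestamp": "2019-11-04T09:15:00Z"},
]

frequency = Counter(entry["ip_pseudonym"] for entry in log)

heavy_users = [p for p, n in frequency.items() if n >= 2]   # likely persistent users
first_timers = [p for p, n in frequency.items() if n == 1]  # likely new registrations
print(heavy_users, first_timers)  # ['a3f1'] ['77c2']
```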

This sort of data-inherent information can already be useful to OSS, e.g. to learn how many of the users of SN are persistent users, and how many register once and never return (with some probabilistic degree of error due to changing IP addresses). This information can already be critical in the business relationship between SN and OSS.

Beyond this data-inherent information, the fact that OSS has continuous access to the database of SN allows for another type of information gathering: by continuously monitoring the dataset stored for SN, OSS learns how the dataset changes over time. Trivially, this includes the total number of accesses to the website of SN, but it can also be utilised e.g. to count the number of new user registrations (first-time pseudonyms occurring) per day or month. While still mostly of a statistical nature, this information can already be utilised to stage real discrimination attacks (i.e. to devise different impacts on different groups of users): OSS learns on which day each new user’s pseudonym first shows up, allowing OSS to monitor the amount of interaction this specific user has with SN. This may easily become an issue of data subject protection, as will be shown later.
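A possible sketch of this monitoring, assuming OSS can take periodic snapshots of the growing log; the data values are invented.

```python
from collections import defaultdict

# Day on which each pseudonym was first observed (a likely registration day).
first_seen = {}

def process_snapshot(day, pseudonyms):
    """Record the first day each pseudonym appears in the log."""
    for p in pseudonyms:
        first_seen.setdefault(p, day)

process_snapshot("2019-11-04", ["a3f1", "77c2"])
process_snapshot("2019-11-05", ["a3f1", "9d40"])  # '9d40' is new on the 5th

# Derive the number of (probable) new registrations per day.
registrations_per_day = defaultdict(int)
for day in first_seen.values():
    registrations_per_day[day] += 1
print(dict(registrations_per_day))  # {'2019-11-04': 2, '2019-11-05': 1}
```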

8.3 LINKED DATA

In the mock-up scenario, the data accessible to OSS conveys more information than just the IP addresses: each log entry stores the timestamp of access as well. Hence, instead of frequently monitoring changes in the database at SN, OSS can simply rely on the timestamps linked to each pseudonym to perform the same type of user discrimination as before. The timestamps are stored along with the pseudonymised IP addresses, and hence are directly linked to that information one to one. Based on this linked data, OSS can substantially increase its knowledge of specific users of SN: does a specific user access SN more in the morning, at lunch break, or in the evening? Only or mostly on Sundays? Only on religious holidays of the orthodox calendar? Only during the time periods of school holidays in Denmark?
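The following sketch shows how a temporal profile per pseudonym could be derived from the linked timestamps; the dates are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Build a per-pseudonym temporal profile (hour of day, day of week) that can
# then be matched against background knowledge such as holiday calendars.
def temporal_profile(timestamps):
    hours, weekdays = Counter(), Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        hours[dt.hour] += 1
        weekdays[dt.strftime("%A")] += 1
    return hours, weekdays

hours, weekdays = temporal_profile(["2019-11-03T09:12:00", "2019-11-10T09:47:00"])
print(weekdays)  # Counter({'Sunday': 2}) - this pseudonym accesses SN only on Sundays
```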

Each such additional type of characterisation allows OSS to get closer to a breach of pseudonymisation, based solely on the stored timestamps and the ability to link different data records with identical pseudonyms. As can be seen, this sort of information starts to provide a characterisation of the users of SN that can be considered personal information. However, the linkage requires additional information to be linked to the structured datasets themselves, such as the orthodox calendar or the Danish school holidays. Hence, these can be considered background knowledge attacks as discussed in Chapter 4, but with varying complexity of the background knowledge necessary. Moreover, such extracted information is of a statistical nature, hence not fully reliable but only valid with a certain probability. Here, the more data entries the database contains, the more reliable (or falsifiable) a linkage hypothesis gets. Thus, the bigger the social network of SN, the easier it gets for OSS to perform such discrimination or even re-identification attacks.

This example included just a pseudonymised IP address and timestamp. It would hold true as well, even more reliably, with a pseudonymised e-mail address instead of a pseudonymised IP address, as e-mail addresses tend to change less frequently and thus act more like a unique identifier for a human individual.

8.4 MATCHING DISTRIBUTION OF OCCURRENCES

The data structures of the example above are quite small and simplistic: just an IP address and a timestamp. Still, they can suffice for discrimination or even re-identification attacks, given enough background information. In addition, real-world data entries typically store more information than just these two values, hence the data records hold more details that can be utilised for uncovering the pseudonyms.

Consider that SN stores more than just timestamp and pseudonymised IP address in each data record, e.g. it also stores the type and version of the browser31 utilised by that user, the set and preferences of natural languages the user speaks (as defined in the browser settings), the operating system version of the user’s computer, etc. As was uncovered by the Electronic Frontier Foundation in the Panopticlick project32, this combination of browser settings alone can already be sufficient to uniquely identify a certain browser – and hence user – of an online website. If SN now stores all this information for each access to its website, OSS may have access to it.

Even if SN performs some sort of pseudonymisation on each of these configurations (e.g. by storing only a keyed hash of the Browser Version string received from the user’s browser), OSS can still see all of the pseudonymised Browser Version strings, calculate statistics on how often each hash value appears in the total database of SN, and compare that distribution of values to the publicly available statistics gathered at the Panopticlick website. This way, OSS can uncover the true Browser Version string behind each hash value, despite the proper utilisation of the pseudonymisation function. The mere fact that the statistical distribution of the pseudonyms matches the statistical distribution of their assumed plaintexts may suffice to uncover those pseudonyms, with a high probability of success.
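The following sketch illustrates the principle of such a distribution-matching attack by rank-matching pseudonym frequencies against a public frequency table; all figures and values are invented, not actual Panopticlick statistics.

```python
from collections import Counter

# Hypothetical public statistics: relative frequency of plaintext
# Browser Version strings among web users.
public_stats = {
    "Chrome 78 on Windows": 0.31,
    "Safari 13 on iOS": 0.22,
    "Firefox 70 on Linux": 0.09,
}

# Pseudonymised Browser Version values as seen by OSS in the database of SN.
pseudonym_counts = Counter(["h1", "h1", "h1", "h2", "h2", "h3"])

# Sort both sides by frequency and pair them up rank by rank: the most
# frequent pseudonym is guessed to be the most frequent plaintext, and so on.
ranked_pseudonyms = [p for p, _ in pseudonym_counts.most_common()]
ranked_plaintexts = sorted(public_stats, key=public_stats.get, reverse=True)

guesses = dict(zip(ranked_pseudonyms, ranked_plaintexts))
print(guesses)
# {'h1': 'Chrome 78 on Windows', 'h2': 'Safari 13 on iOS', 'h3': 'Firefox 70 on Linux'}
```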

This is, of course, greatly dependent on the selected pseudonymisation approach. If an appropriate engineering approach is applied, adding metadata to the argument of the pseudonymisation function can offer more protection against this kind of reverse engineering.
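A minimal sketch of this idea, assuming record-specific metadata (here a hypothetical user identifier) is mixed into the input of a keyed hash:

```python
import hmac
import hashlib

KEY = b"secret-key-held-by-SN"  # hypothetical key

# Because a user identifier is part of the hash input, the same browser
# string yields a different pseudonym per user, so the global frequency
# distribution of pseudonyms no longer mirrors the plaintext distribution.
def pseudonymise_with_metadata(value: str, user_id: str) -> str:
    return hmac.new(KEY, f"{user_id}|{value}".encode(), hashlib.sha256).hexdigest()

p1 = pseudonymise_with_metadata("Chrome 78 on Windows", "user-001")
p2 = pseudonymise_with_metadata("Chrome 78 on Windows", "user-002")
assert p1 != p2  # identical plaintexts are no longer linkable across users
```

The trade-off is that this also destroys the ability to compute statistics over the pseudonymised field, which may or may not be acceptable for the purpose of processing.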

8.5 ADDITIONAL KNOWLEDGE

If OSS has additional knowledge of a certain user’s characteristics, and is trying to uncover that user’s data records from the pseudonymised database it receives from SN, every piece of additional information may become critical. If OSS knows that the specific target user is male and utilises the Chrome browser on an iPad, this information alone significantly narrows down the set of candidate user profiles seen by OSS. Each of these data values, even if pseudonymised, reduces the set of possibilities, i.e. the set of user profiles contained in the SN database that may belong to the specific target user searched for by OSS. The browser information can be addressed with the distribution probability attack outlined in Section 8.4, removing the large portion of user profiles whose browser pseudonyms have far too many or far too few occurrences to match the occurrence probability of the specific “Chrome on an iPad” configuration.

From the remaining profiles, a trivial brute-force attack or statistical distribution attack reveals to OSS which pseudonym maps to which gender, eliminating about half of the remaining user profiles. If all of the remaining user profiles now have in common that their first access to SN was between May and July 2018, OSS has already learned something about that specific user: he or she registered with SN in that time period. This is a successful inference attack. Analysing the remaining user profiles further, OSS may discover a specific pattern in the timestamps of SN utilisation for two of those user profiles, such that they match the assumed utilisation pattern of the target individual (which OSS was able to observe on some occasions in the past). Hence, the target search set is reduced to only two user profiles.
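The first steps of this inference chain can be sketched as successive filtering of the candidate set; the profiles, pseudonyms, and attribute names below are invented.

```python
# Candidate user profiles as seen by OSS (all identifying values pseudonymised).
profiles = [
    {"id": 1, "browser_pseudonym": "h7", "gender_pseudonym": "g1", "first_seen": "2018-06-03"},
    {"id": 2, "browser_pseudonym": "h7", "gender_pseudonym": "g1", "first_seen": "2018-05-21"},
    {"id": 3, "browser_pseudonym": "h2", "gender_pseudonym": "g2", "first_seen": "2017-01-10"},
]

# Step 1: browser pseudonym uncovered via the distribution attack
# ("h7" assumed to correspond to "Chrome on an iPad").
candidates = [p for p in profiles if p["browser_pseudonym"] == "h7"]

# Step 2: gender pseudonym uncovered via brute force ("g1" assumed male).
candidates = [p for p in candidates if p["gender_pseudonym"] == "g1"]

# Step 3: inference - anything the remaining candidates have in common
# must also hold for the target individual.
shared_months = {p["first_seen"][:7] for p in candidates}
print(len(candidates), shared_months)  # 2 candidates, registered May-June 2018
```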

Every piece of information that both of these profiles have in common must then hold true for the specific target individual as well, probably already telling OSS quite a lot about its search target. To eliminate the remaining false candidate, OSS may simply monitor the utilisation of SN by these two profiles specifically, and on the next access validate whether that access could have originated from the target individual or not (based on additional background knowledge derived from the facts OSS has already learned about the target). In the end, OSS is able to link the user profile to the target identity. Thereby, OSS is also able to uncover all pseudonymisations performed on that individual’s data values, potentially allowing OSS to uncover or discriminate against other user profiles as well.

Still, it should be noted that the problem of additional available information is “orthogonal” to pseudonymisation, being primarily a data protection by design issue. Therefore, as also mentioned earlier in the report, on top of pseudonymisation one can consider injecting noise into the arguments of the pseudonymisation function, or using generalisation, in order to make brute-force attacks less effective (see also Chapter 5.6). This degree of freedom is a way to further strengthen pseudonymisation and protect against the relevant attacks.
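A minimal sketch of generalisation as one such measure, coarsening timestamps to the hour before storage; the format choice is illustrative.

```python
from datetime import datetime

# Coarsening timestamps to the hour removes the fine-grained temporal
# patterns exploited above, at the cost of some utility in the log data.
def generalise_timestamp(ts: str) -> str:
    return datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:00")

print(generalise_timestamp("2019-11-05T08:14:02"))  # 2019-11-05T08:00
```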

8.6 LINKAGE AMONG MULTIPLE DATA SOURCES

Beyond the above scenario of SN and OSS, an even more challenging pseudonymisation scenario emerges when not just two organisations (SN and OSS) participate, but a large-scale marketplace of pseudonymised data is assumed. In such scenarios, multiple different organisations share pseudonymised datasets of personal data, with the intention of allowing some utility (e.g. creating profiles for marketing purposes) while protecting the identity of the data subjects themselves. The often-heard argument in such scenarios is that the pseudonymisation prevents re-identification of data subjects, thus legitimising such data sharing. This report does not argue for or against the legitimacy of sharing pseudonymised datasets, but discusses the issues of properly applying pseudonymisation in such a setting.

Assume a set of companies A to E, all of which collect personal data on their users, such as the data gathered by SN in the previous example. Linkage of user profiles across companies could be performed by comparing the e-mail addresses utilised by the respective users. If two user profiles found at, say, companies B and D were registered with exactly the same e-mail address, they most likely belong to the same data subject. However, the e-mail address itself obviously is personal data, as discussed in Chapter 7. It thus becomes necessary to apply pseudonymisation to the e-mail addresses in the datasets of B and D before sharing them among A, B, C, D, and E.

The challenge here is that all participants want to keep the utility of the pseudonymised data for linking profiles belonging to the same person, without reducing the protection of that user’s identity. Hence, all five companies need to apply the very same pseudonymisation, utilising the very same pseudonymisation function and pseudonymisation secret, in order to be able to compare and link data records from different datasets. Here, there is a clear discrepancy between the utility (of linking the pseudonymised e-mail addresses) and the protection (of the users of those e-mail addresses). In other words, B and D should be able and allowed to learn that their particular data records share the same e-mail address, and hence belong to the same user, but should not be able to learn which e-mail address – and hence which data subject – that is.
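A sketch of such shared deterministic pseudonymisation, assuming a hypothetical key agreed among the five companies; note that the very sharing of this key is part of what enables the attacks described next.

```python
import hmac
import hashlib

# Hypothetical secret shared among companies A to E so that their
# pseudonyms are comparable across datasets.
SHARED_KEY = b"key-agreed-among-companies-A-to-E"

def shared_pseudonym(email: str) -> str:
    """Deterministic pseudonym: equal addresses map to equal pseudonyms."""
    return hmac.new(SHARED_KEY, email.lower().encode(), hashlib.sha256).hexdigest()

# B and D can detect a common user without seeing the address in plain text:
assert shared_pseudonym("alice@example.com") == shared_pseudonym("Alice@example.com")
```

The design choice is the crux of the discrepancy described above: determinism across all parties provides the linkage utility, but any party holding the shared secret (or any adversary who obtains it) can mount dictionary attacks against every pseudonym in every shared dataset.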

As discussed in Chapter 7, the use of weak pseudonymisation functions (like plain hashing) in such scenarios allows for the trivial brute-force, guesswork, or probability distribution attacks discussed above. Enriched with the additional (non-personal) data contained in the shared data records, and perhaps with some additional background knowledge, these attacks must be considered practical and largely successful in many scenarios. Even worse, the more companies share information on a particular data subject’s attributes, the more information is available to an intentional adversary against the pseudonymisation utilised, and hence the more likely such attacks are to succeed.
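A minimal sketch of such a dictionary attack against plain (unkeyed) hashing; the addresses are invented, and the “observed” pseudonym is computed in place only to keep the example self-contained.

```python
import hashlib

# In a real attack the observed pseudonym would come from a shared dataset;
# here it is generated locally so the example runs on its own.
observed = {hashlib.sha256(b"bob@example.com").hexdigest()}

# The adversary hashes candidate addresses (from leaks, crawls, guesses)
# and checks for matches against the observed pseudonyms.
candidates = ["alice@example.com", "bob@example.com", "carol@example.com"]
for email in candidates:
    if hashlib.sha256(email.encode()).hexdigest() in observed:
        print("uncovered:", email)  # uncovered: bob@example.com
```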

Privacy risks may occur even in the more general scenario in which the organisations apply different (and even strong) pseudonymisation techniques to their users’ identifiers (e.g. e-mail or IP address). Let us assume that the aforementioned companies A to E provide such pseudonymous data to OSS in order to obtain, e.g., statistical services. If the provided pseudonyms are accompanied by information on the users’ browsers/devices as described in Section 8.4 (browser settings, operating system, etc.), and recalling that any such device information is expected to be unique for each device33, then OSS may trivially link different pseudonyms, provided by different companies, that correspond to the same user.
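A sketch of such linkage via the accompanying device information; the records and fingerprints are invented.

```python
# Each company pseudonymises identifiers differently and strongly, yet ships
# the (near-unique) device fingerprint alongside each record.
records_b = [{"pseudonym_b": "x91", "fingerprint": "fp-4421"}]
records_d = [{"pseudonym_d": "q07", "fingerprint": "fp-4421"}]

# OSS joins the two datasets on the fingerprint field.
index = {r["fingerprint"]: r for r in records_b}
for r in records_d:
    match = index.get(r["fingerprint"])
    if match:
        # Two unrelated pseudonyms are now linked to one and the same user.
        print(match["pseudonym_b"], "<->", r["pseudonym_d"])
```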

8.7 COUNTERMEASURES

As discussed in Chapter 5, techniques of (document- or fully-) randomised pseudonymisation reduce the linkability between different pseudonyms from different datasets, and hence may mitigate or even eliminate statistical characteristics of the pseudonymised databases. At the same time, they limit the ability to link different data records (potentially spread over many organisations) to one user profile. Even if randomised pseudonymisation is applied, OSS might still be able to perform the attacks outlined above if it is able to uncover whether two different pseudonyms belong to the same identifier. Similarly, B and D may successfully re-identify the data subject behind the shared user profiles. Here, the trade-off between protection and utility becomes evident again.

So, how can one defend against such types of attacks on pseudonymisation in a reliable way?

Following the analysis in this report, the best approach to pseudonymisation is to:

  • Consider the whole dataset available.
  • Learn about input domain sizes of individual data values.
  • Apply pseudonymisation onto all data values in such a way that brute force and dictionary attacks become infeasible.
  • Eliminate any option for background knowledge or statistical distribution attacks.
  • Design the resulting large-scale pseudonymisation function in such a way that the pseudonymised dataset keeps only the type of utility necessary for the purpose of processing, and removes all other utility.

For the example scenario in this chapter, SN may utilise a pseudonymisation scheme that pseudonymises not just the IP addresses themselves, but all possible combinations of IP address and timestamp. Then, linking the timestamp to any external data source becomes infeasible, as this information is no longer available to OSS. For a successful re-identification, OSS would need to know (or guess) the exact combination of IP address and timestamp. In general, the pseudonymisation of a combination of data inputs cannot reasonably be uncovered without knowing (or guessing) all of the input data in plaintext. In this setting, such a pseudonymisation would block any attempt by OSS to uncover a given pseudonym far more robustly.
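A minimal sketch of this countermeasure, again assuming a keyed hash with a hypothetical key: the pseudonym is computed over the combination of both inputs, so neither is available individually.

```python
import hmac
import hashlib

KEY = b"secret-key-held-by-SN"  # hypothetical key

# The pseudonym covers the combination of IP address and timestamp, so no
# timestamp is left in the record for OSS to link against external sources.
def pseudonymise_access(ip: str, timestamp: str) -> str:
    return hmac.new(KEY, f"{ip}|{timestamp}".encode(), hashlib.sha256).hexdigest()

record = {"access_pseudonym": pseudonymise_access("198.51.100.23", "2019-11-05T08:14:02Z")}
# OSS sees only record["access_pseudonym"]; uncovering it requires knowing
# (or guessing) both the IP address and the exact timestamp at once.
```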

Examples of basic techniques for robust pseudonymisation functions have already been discussed in Chapter 5, along with an in-depth discussion of their resilience against the attacks on pseudonymisation outlined in Chapter 4. In order to extend these to structured data records, it is often sufficient to consider the whole data record as the input, and apply a tailored combination of keyed hash functions and techniques common to anonymisation in general. More advanced techniques of pseudonymisation have briefly been discussed in Chapter 5.6 and in a previous report by ENISA [2].