CHAPTER 2

THE NOTION OF PSEUDONYMISATION

This Chapter provides an analysis of the notion of pseudonymisation and its overall role in the protection of personal data. In particular, Section 2.1 starts with a definition of pseudonymisation, covering both its technical description and its definition under GDPR. Based on this analysis, Section 2.2 discusses the difference between pseudonymisation and anonymisation. Section 2.3 elaborates on the core data protection benefits of pseudonymisation, while Section 2.4 examines its role in GDPR.

For the discussions in this Chapter, we use the following terminology, derived from the relevant GDPR definitions:

  • Data controller is the entity that determines the purposes and means of the processing of personal data (article 4(7) GDPR). The data controller is responsible for the data processing and may employ pseudonymisation as a technical measure for the protection of personal data.
  • Data processor is the entity that processes personal data on behalf of the controller (article 4(8) GDPR). The processor may apply pseudonymisation techniques to the personal data, following relevant instructions from the controller.
  • Data subject is a natural person whose personal data are processed and may be subject to pseudonymisation. The term individual is also used in the text to refer to a data subject. Moreover, the term user is utilised in the same sense, especially when discussing online/mobile systems and services.
  • Third party is any entity other than the data subject, controller or processor (article 4(10) GDPR).

Any examples presented in the text are only meant to support the underlying technical description and not to provide a legal interpretation of the relevant cases.

2.1 Definition of pseudonymisation

2.1.1 Technical description

In broad terms, pseudonymisation refers to the process of de-associating a data subject’s identity from the personal data being processed for that data subject. Typically, such a process is performed by replacing one or more personal identifiers, i.e. pieces of information that can allow identification (e.g. name, email address, social security number), relating to a data subject with so-called pseudonyms, such as randomly generated values.
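As a rough illustration of this replacement step, the following Python sketch (hypothetical field names; a sketch, not a production implementation) substitutes a personal identifier with a randomly generated pseudonym, while keeping the resulting mapping as a separate lookup table:

```python
import secrets

def pseudonymise(records, identifier_field):
    """Replace each record's identifier with a random pseudonym.

    The same identifier always receives the same pseudonym, so the
    pseudonymised dataset stays coherent (all records of one data subject
    remain linked). The returned mapping (identifier -> pseudonym) is the
    'additional information' and must be kept separately and secured.
    """
    mapping = {}
    pseudonymised = []
    for record in records:
        identifier = record[identifier_field]
        if identifier not in mapping:
            mapping[identifier] = secrets.token_hex(8)  # random value, unrelated to the identifier
        pseudonymised.append({**record, identifier_field: mapping[identifier]})
    return pseudonymised, mapping

# Example: the (hypothetical) 'email' field acts as the personal identifier.
records = [
    {"email": "alice@example.com", "diagnosis": "A12"},
    {"email": "alice@example.com", "diagnosis": "C56"},
    {"email": "bob@example.com", "diagnosis": "B34"},
]
data, additional_info = pseudonymise(records, "email")
# 'data' no longer contains the original emails; only the combination of
# 'data' and 'additional_info' allows reversal of the pseudonymisation.
```

Note that anyone holding both outputs can trivially reverse the process, which is why the mapping must be stored apart from the pseudonymised dataset.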

To this end, the ISO/TS 25237:2017 standard defines pseudonymisation as a ‘particular type of deidentification that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms’ 6 [ISO, 2017]. Deidentification, according to the same standard, is a ‘general term for any process of reducing the association between a set of identifying data and the data subject’. A pseudonym is in turn defined as ‘a personal identifier that is different from the normally used personal identifier and is used with pseudonymised data to provide dataset coherence linking all the information about a data subject, without disclosing the real world person identity’. As a note to the latter definition, ISO/TS 25237:2017 states that pseudonyms are usually restricted to mean an identifier that does not allow the direct derivation of the normal personal identifier. They can either be derived from the normally used personal identifier in a reversible or irreversible way, or be totally unrelated.

Another technical definition of pseudonymisation is provided by the ISO/IEC 20889:2018 standard as a ‘deidentification technique that replaces an identifier (or identifiers) for a data principal with a pseudonym in order to hide the identity of that data principal 7 ’ [ISO, 2018]. A pseudonym is subsequently defined as a ‘unique identifier created for a data principal to replace the commonly used identifier or identifiers for that data principal’. Relevant definitions can also be found in [Pfitzmann, 2010], where a pseudonym is considered as ‘an identifier of a data subject other than one of the subject’s real names’ and the notion of pseudonymity is defined as ‘the use of pseudonyms as identifiers’.

Despite the different terminology used, it is clear from all the aforementioned definitions that pseudonymisation is expected to take out of sight, or ‘hide’, the identifying information (i.e. personal identifiers) relating to data subjects by replacing it with pseudonyms, while maintaining an association between the two (personal identifiers and pseudonyms) that allows for re-identification when needed. Clearly, in order to provide a high level of protection of data subjects’ identities and to render pseudonymisation a realistic choice, such an association should be secured and not obvious to anyone having access only to the pseudonymised data. This association falls under the concept of ‘additional information’ introduced by GDPR and will be discussed in Section 2.1.2.

Note that for the remainder of the document we use interchangeably the terms personal identifier or initial identifier or identifier to refer to any piece of information that can be used to identify a data subject (see also Section 2.1.3). The term pseudonym is used to refer to a piece of information that replaces a personal identifier as the result of a pseudonymisation process.

2.1.2 Pseudonymisation in GDPR

Pseudonymisation is defined in article 4(5) of the GDPR as: ‘the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person’.

In GDPR, the data controller is the entity responsible for deciding whether and how pseudonymisation will be implemented. Data processors, acting under the instructions of controllers, may also have an important role in implementing pseudonymisation. Recital (28) GDPR states that the application of pseudonymisation to personal data can reduce the risks to the data subjects concerned and help data controllers and processors to meet their data protection obligations. Moreover, recital (29) states that measures of pseudonymisation should, whilst allowing general analysis, be possible within the same controller, when that controller has taken appropriate technical and organisational measures and the additional information for attributing the personal data to a specific data subject is kept separately.

The GDPR definition of pseudonymisation, while in accordance with the aforementioned technical descriptions, provides a stricter framework for implementation as it states that ‘personal data can no longer be attributed to a specific data subject without the use of additional information’. Following the definition of personal data in article 4(1) GDPR, i.e. any information relating to an identified or identifiable person, this would mean in practice that pseudonymised data should not allow for any direct or indirect identification of data subjects (without the use of additional information). Therefore, pseudonymisation under GDPR goes beyond the protection of ‘the real world person identity’ to also cover the protection of indirect identifiers relating to a data subject (e.g. online unique identifiers – see also Section 2.1.3). Moreover, the reversal of pseudonymisation should not be trivial for any third parties that do not have access to the ‘additional information’. This is clearly relevant to the pseudonymisation technique applied, which should also be in accordance with the GDPR data protection principles (article 5 GDPR).

Moreover, the GDPR definition of pseudonymisation puts a lot of emphasis on the protection of the additional information, which, taking into account the technical meaning of pseudonymisation, would in practice refer to the association between the initial identifiers of the data subjects and the pseudonyms. According to GDPR, this association needs to be secured and kept separate from the pseudonymised data (by the data controller). Indeed, if anyone with access to pseudonymised data also had access to the additional information, then he/she would trivially be able to reverse the pseudonymisation, i.e. to identify the individuals whose data are processed. There is no specific reference in the GDPR as to whether such a data separation should be logical or physical. In any case, if a data controller performs pseudonymisation, it is evident that appropriate measures need to be implemented to prevent access to the associations between pseudonyms and initial identifiers (e.g. by putting them into a different database or entrusting them to a trusted third party). Clearly, destroying such associations, in cases where preserving them is not required by the controller, may add an additional layer of protection.

It should be pointed out, though, that there might be cases in which a third party (i.e. other than the controller or processor) could possibly be able to re-identify an individual from pseudonymised data, even without access to the additional information kept by the data controller. For instance, this may occur where the pseudonymisation technique is not strong enough, for example because the pseudonyms are “trivially” generated from personal data that are publicly available (note, however, that such a technique would probably not fall under the strict definition of GDPR in the first place). In addition, there is always the risk that the pseudonymised dataset still contains fields (e.g. a street address) or combinations of fields that, when correlated with other information, could allow for the re-identification of individuals (see also the relevant discussion on anonymisation in Section 2.2). For example, free text fields with a message and a greeting line could potentially allow linking to a specific individual even when the data are pseudonymised (i.e. personal identifiers have been removed). The characteristics of the dataset could play an important role to this end, as they could potentially facilitate inference of individuals’ identities from the pseudonymised data (e.g. if a dataset relates to a small/specialised group of persons, certain attributes may immediately reveal the identities of specific individuals within this group, even when personal identifiers have been removed). This risk is further accentuated by the fact that, even if re-identification is not possible at a certain point in time, accumulating additional data associated with a pseudonym could possibly allow for re-identification in the future.
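To illustrate the linkage risk described above, the following toy Python sketch (all names and data are fabricated) joins a pseudonymised dataset against a public register on the remaining quasi-identifiers:

```python
def link(pseudonymised, public_register):
    """Attempt re-identification by matching the remaining quasi-identifiers
    (here: street and birth year) against a public source. A unique match
    re-identifies a record without any access to the mapping table."""
    matches = {}
    for row in pseudonymised:
        candidates = [
            entry["name"] for entry in public_register
            if (entry["street"], entry["birth_year"]) == (row["street"], row["birth_year"])
        ]
        if len(candidates) == 1:  # singled out -> re-identified
            matches[row["pseudonym"]] = candidates[0]
    return matches

# Fabricated data: personal identifiers were replaced, but quasi-identifiers remain.
pseudonymised = [
    {"pseudonym": "p1", "street": "Elm St 5", "birth_year": 1980, "diagnosis": "A12"},
    {"pseudonym": "p2", "street": "Oak Ave 9", "birth_year": 1975, "diagnosis": "B34"},
]
public_register = [
    {"name": "Alice", "street": "Elm St 5", "birth_year": 1980},
    {"name": "Bob", "street": "Oak Ave 9", "birth_year": 1975},
]
reidentified = link(pseudonymised, public_register)
# reidentified == {"p1": "Alice", "p2": "Bob"}: both subjects singled out
```

The attack needs neither the mapping table nor the pseudonymisation method; the fewer people share a given combination of attributes, the more likely a unique match becomes.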

To this end, data controllers should have a clear understanding of the scope of data pseudonymisation and select a technique appropriate for that scope. As mentioned earlier, an inadequate level of pseudonymisation would probably not meet the requirements laid down by the data protection principles of GDPR (article 5), even if it falls under the broader technical meaning of pseudonymisation.

2.1.3 The notion of identifiers

We have already referred in several instances to the notion of identifiers (or personal identifiers or initial identifiers) and their central role in pseudonymisation. In this Section, we elaborate further on this important matter.

According to the Article 29 Working Party [WP29, 2007], identifiers are pieces of information holding a particularly privileged and close relationship with an individual that allow for identification, while the extent to which certain identifiers suffice to achieve identification depends on the context of the specific personal data processing. Hence, identifiers may be single pieces of information (e.g. name, email address, social security number) but also more complex data. For instance, although a person’s name is one of the most common identifiers, complex data (e.g. photos, fingerprint templates) or combinations of data (e.g. street address, date of birth and sex) may also play the role of identifiers. Moreover, the possibility that an identifier leads to the identification of a specific data subject depends heavily on the particular context in which it is used, which in practice means that the same identifier might provide different levels of identifiability of the same data subjects in different contexts. For example, even an individual’s name may not always suffice to uniquely identify the individual unless additional information is considered: a very common family name will not be sufficient to identify someone (i.e. to single someone out) from the whole of a country’s population [WP29, 2007]; however, this might become possible if the name is combined with other data, such as a telephone number or email address.

Another important aspect to this end is that, when considering whether a piece of information could qualify as a personal identifier, the possibility of both direct and indirect identification of the data subject by the data controller needs to be taken into account, which broadens the overall notion of identifiers. This aspect is especially relevant to the use of online and mobile services, where a multitude of device and application identifiers are utilised (by the device/service/application providers) to single out specific individuals (i.e. the users of the relevant devices or applications). For instance, in the mobile ecosystem, the usage of unique device identifiers may have a significant impact on the private lives of the users [WP29, 2013], allowing for extensive tracking and profiling.

Therefore, pseudonymisation should not necessarily be interpreted as a technique applying to a single simple attribute/identifier, since there may be cases that necessitate the application of pseudonymisation techniques to a bundle of attributes of an individual (e.g. name, location, timestamp) in order to bring data protection benefits. Hence, depending on the context, different requirements with respect to pseudonymisation may arise. In this document, we use the term identifier (or personal identifier or initial identifier) to refer to all possible cases (complex or not).

2.1.4 Pseudonymisation and self-chosen pseudonyms

It is important to stress that the notion of data pseudonymisation should not be confused with the practice of self-chosen pseudonyms, i.e. pseudonyms that individuals might select and apply themselves, such as for example nicknames of users in online blogs or forums 8. The latter is a well-known practice that can contribute to ‘hiding’ an individual’s real name but it is based on the choice of the individuals themselves and does not rely on a process applied by a data controller 9.

Although self-chosen pseudonyms might contribute to reducing the exposure of an individual’s identity in specific contexts, the mere collection and storage of such data by a data controller (e.g. the provider of an online blog or forum) does not constitute pseudonymised data processing in the meaning of the GDPR. In fact, self-chosen pseudonyms actually play the role of identifiers and can be used to single out specific individuals, especially in correlation with other relevant data (e.g. posts in a blog) or even from the chosen pseudonyms themselves.

Note, however, that the aforementioned concept of self-chosen pseudonyms, should not be confused with cases of pseudonymisation where the pseudonym (as a part of the whole pseudonymisation process performed by a data controller) is generated locally in the data subject’s environment (e.g. user’s device via a cryptographic technique). Such cases do fall under the definition of pseudonymisation and can even constitute best practices in the field, as described in Chapters 2 and 3.

2.2 Pseudonymisation and anonymisation

There is often some confusion between the notion of pseudonymisation and that of anonymisation and their application in practice. However, as discussed in this Section, these two notions are clearly different and attention should be paid so as not to perceive pseudonymised data as anonymised.

ISO/TS 25237:2017 defines anonymisation as a ‘process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party’ [ISO, 2017]. Similarly, NIST refers to anonymisation 10 as a ‘process that removes the association between the identifying dataset and the data subject’. In simple words, an anonymised dataset does not allow the identification of any individual, either by the controller or by a third party. Therefore, anonymised data do not qualify as personal data.

Clearly, removing any personal identifier is a prerequisite for anonymisation, but in general this does not suffice to ensure anonymity, and more sophisticated approaches need to be adopted. Several anonymisation techniques are known [WP29, 2014], of which two main approaches are the so-called randomisation and generalisation techniques: the former aims to “randomly” alter the data (e.g. via noise addition) in order to remove the link between the dataset and the individual, whilst the latter aims to appropriately modify the scale or order of magnitude of specific attributes (e.g. an exact age can be replaced by an age range) in order to prevent singling out. In general, though, no anonymisation technique should be considered a panacea 11.
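As a minimal sketch of these two approaches (hypothetical helper names, not a reference implementation), consider:

```python
import random

def generalise_age(age, width=10):
    """Generalisation: replace an exact age with a coarser age range,
    so that individuals with similar ages become indistinguishable."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def randomise(value, scale=2.0, rng=random):
    """Randomisation: perturb a numeric value with bounded additive noise,
    weakening the link between the published value and the individual."""
    return value + rng.uniform(-scale, scale)

print(generalise_age(37))  # -> "30-39": the exact age is hidden
print(randomise(37.0))     # a value in [35.0, 39.0]; varies per call
```

Neither transformation alone guarantees anonymity; whether the result resists singling out depends on the remaining attributes and the dataset as a whole.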

As already mentioned, a common mistake is that pseudonymised data are perceived as anonymous data 12. However, this is not the case; recalling the relevant definitions, pseudonymisation relies on the existence of an association between personal identifiers and pseudonyms, whilst in anonymisation such an association must not be available by any means. Hence, re-identification is possible (and even required for the data controller) in pseudonymisation, whereas in anonymisation this is in principle not the case. In other words, pseudonymised data are still personal data, while anonymised data are not.

The GDPR also explicitly clarifies this distinction. More precisely, as stated in its Recital (26), anonymous information refers to information which does not relate to an identified or identifiable natural person – and, thus, anonymous data are not considered personal data (in such a case, the legal framework on personal data protection does not apply) 13. On the contrary, pseudonymised data, which can be attributed to a natural person by the data controller with the use of additional information, are personal data and all relevant data protection principles apply to them.

Still, the term “anonymous” is often used in common language to describe cases where the identities of the data subjects are merely hidden (but the data are not truly anonymised). For example, there exist several so-called “anonymous” social network applications that typically do not require their users to create profiles and collect very limited information about them. In this way, users are supposedly able to express their beliefs and opinions freely without exposing their identities. However, many of these applications process an identifier of the user’s device – for instance, to send notifications to users whenever other “anonymous” users like their posts, or to provide information on nearby “anonymous” users of the same network. As also stated earlier, device identifiers should in principle be considered personal data, since they are associated with the device’s users. This is especially the case for permanent identifiers. Still, even if a non-permanent device ID is used by such “anonymous” applications, there might still exist an association between this identifier and the device, which in turn poses risks for the users’ privacy [Chatzistefanou, 2017], e.g. by potentially facilitating device fingerprinting 14.

It should be pointed out that even in the absence of personal identifiers, data are not necessarily anonymous. For example, in the previous case of the anonymous social networks, a user of such a network might be identified by, e.g., his/her posts and/or other activities, without the use of any device identifier. Similarly, as shown in [Su, 2017], simple browsing histories could be linked to social network profiles such as Twitter or Facebook accounts, owing to the fact that users are more likely to click on links posted by accounts that they follow. In the same context, it was shown in [Kurtz, 2016] that users of iOS devices could be singled out through their personalised device configurations, despite the fact that third-party apps had no access to any device hardware identifiers. Moreover, as stated in [Zhou, 2013], personal data could be inferred from publicly available information in earlier versions of the Android system. This shows the difficulty of rendering data anonymous, while widening the notion of pseudonymised data 15.

However, it should be noted that, despite the distinction between pseudonymisation and anonymisation, the former often relies on techniques of the latter in order to enhance its efficiency. For instance, in some cases it might be a good practice to involve certain anonymisation techniques (e.g. attributes generalisation) in the pseudonymisation process, so as to reduce the possibility of third parties to infer personal data.

2.3 Data protection benefits of pseudonymisation

Following the GDPR definition of pseudonymisation, one important remark is that pseudonymisation starts with a single input (original dataset) and results in a pair of outputs (pseudonymised dataset, additional information) that together can reconstruct the original input. The same logic also resides behind the technical definitions of pseudonymisation, as the pseudonymised dataset is actually a modified version of the original dataset where data subjects’ identifiers have been replaced by pseudonyms, whereas the additional information provides the link between the pseudonyms and the identifiers. Therefore, pseudonymisation in fact separates the original dataset into two parts, where each part has a meaning with regard to specific individuals only in combination with the other. This decoupling is essential in understanding the notion of pseudonymisation and the benefits that it brings with regard to data protection.

To start with, the first and obvious benefit of pseudonymisation, directly derived from its definition, is to hide the identity of the data subjects from any third party (i.e. other than the controller or processor) in the context of a specific data processing operation, thus enhancing their security and privacy protection. Indeed, if by means of security measures (e.g. access control, chain of custody) the data controller can keep the two distinct outputs of pseudonymisation separate, then any recipient or other third party having access to the pseudonymised data cannot trivially derive the original dataset and, thus, the identity of the data subjects.

To this end, pseudonymisation can actually go beyond the hiding of real identities in a specific data processing context to support the data protection goal of unlinkability 16 [ENISA, 2014a], i.e. reducing the risk that privacy-relevant data can be linked across different data processing domains. Indeed, when data are pseudonymised, it is more difficult for a third party to link them to other personal data that might relate to the same data subject (again, without the use of additional information) 17. Unlinkability is closely related to the fundamental data protection principles of necessity and data minimisation.

Furthermore, it is important to consider that there might be cases where the controller does not need to have access to the real identities of data subjects in the context of its specific processing; for example, it might be sufficient for the controller only to trace/track the data subjects without storing their initial identifiers 18. Certain pseudonymisation techniques, on the basis of the decoupling mentioned above, can facilitate this goal, thus supporting the overall concept of data protection by design (e.g. by technically using the least possible personal data for a given purpose of processing).

Last, recalling the role of decoupling in pseudonymisation, another important benefit of this process that should not be underestimated is that of data accuracy. Indeed, if a data controller has in its possession the two outputs of pseudonymisation, the integrity of the original dataset (which can only be reconstructed on the basis of both these outputs) cannot be contested. This can be a useful tool for the data controller, contributing to the data protection principle of accuracy.

Following the above elements, pseudonymisation, if properly applied, can be a useful tool for the data controller that not only enhances the security of personal data but also supports its overall compliance with the GDPR data protection principles. At the same time, pseudonymisation is also beneficial for the data subjects, whose personal data protection is enhanced, thus further contributing to building trust between controllers and data subjects, which is an essential element for digital services.

2.4 The role of pseudonymisation in GDPR

Recognising the possible benefits of pseudonymisation, the GDPR refers to it approximately fifteen (15) times in several forms, including the following:

  • According to article 25(1) GDPR, pseudonymisation may be an appropriate technical and organisational measure towards implementing data protection principles in an effective manner and integrating the necessary safeguards into the processing (data protection by design);
  • According to article 32(1) GDPR, pseudonymisation – as well as encryption – may be an appropriate technical and organisational measure towards ensuring an appropriate level of security (security of processing).

For both the above cases, the GDPR explicitly mentions that the choice of pseudonymisation as an appropriate measure is contingent on the cost of implementation and the nature, scope, context and purposes of processing, as well as the relevant risks for the rights and freedoms of natural persons. Therefore, a decision on whether pseudonymisation should take place or not rests with the associated data protection risks: there are cases where pseudonymisation stands as a prerequisite (e.g. whenever it is needed to ensure that the processing is proportionate to the purpose it is meant to address), but also cases where pseudonymisation may not be necessary. Even in cases where pseudonymisation needs to take place, the data controller should proceed one step further and examine which approach is optimal, taking into account all the aforementioned factors. This is a direct consequence of the fact that not all pseudonymisation techniques are equally effective, nor do they have the same requirements in terms of implementation. Therefore, even if a specific pseudonymisation approach suffices to address the data protection risks in one case of data processing, it may not be appropriate for a different data processing operation.

Moreover, it should be pointed out that pseudonymisation, according to the GDPR provisions, serves as a vehicle to “relax” some of the data controller’s obligations. For instance, personal data may be further processed, in accordance with the principle of purpose limitation, for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, if these data are subject to specific safeguards, including pseudonymisation (articles 5(1)(b) and 89(1) GDPR). Along the same lines, article 6(4) GDPR states that, when deciding whether any new purpose (other than that for which the data have been collected) is compatible with the initial one, several factors should be considered – including the possible existence of pseudonymisation as an appropriate safeguard. In addition, appropriately implemented pseudonymisation can reduce the likelihood of individuals being identified in the event of a data breach, which is a factor to be considered by the data controller when assessing the risks of the breach and deciding whether data subjects should be informed [WP29, 2018] (although, as also explicitly stated in [WP29, 2018], pseudonymisation techniques alone cannot be regarded as making the data unintelligible).

Finally, in cases where the data controller processes personal data in a way that does not allow the identification of individuals (e.g. where the additional information allowing for re-identification has been deleted by the controller 19), additional exemptions might apply to the controller, on the basis of articles 11 and 12(2) GDPR. More precisely, articles 15 to 20 (i.e. the data subjects’ rights concerning access to the data, rectification and erasure of data, restriction of processing and data portability) do not apply whenever the controller is provably unable to identify the data subjects 20 (unless the data subjects themselves provide additional information enabling their identification 21). However, this “relaxation” of the controller’s obligations requires that the controller establish well-determined procedures for proving that such re-identification is indeed impossible on the basis of the data that are being processed. To achieve this goal, a prerequisite is the transparency of the overall pseudonymisation process, which raises another important aspect of pseudonymisation that should be considered by controllers.

Evidently, the aforementioned “relaxation” of data controllers’ obligations holds only if a proper pseudonymisation approach has been adopted, so as to ensure that the risks to the rights and freedoms of data subjects are indeed reduced. The above discussion therefore further illustrates the importance of choosing appropriate pseudonymisation techniques.

2.4.1 Pseudonymisation and encryption

It should be noted that there is often confusion among data controllers around the notions of encryption and pseudonymisation, both referenced in GDPR as security measures (article 32). However, despite some common elements, the main goals of these techniques are actually different. We will briefly discuss this difference in the next paragraphs.

With regard to pseudonymisation, it is evident from the previous discussion that it mainly focuses on protecting the identities of individuals (for anyone without access to the additional information). Yet, pseudonymised data do provide some legible information and, thus, a third party (i.e. other than the controller or processor) may still understand the semantics (structure) of the data 22, despite the fact that these data cannot be associated with an individual.

On the other hand, encryption aims at ensuring – via the appropriate use of mathematical techniques – that the whole dataset being encrypted 23 is unintelligible to anyone but specifically authorised users, who are allowed to reverse this unintelligibility (i.e. to decrypt) 24. To this end, encryption is a main instrument for achieving confidentiality of personal data, by hiding the whole dataset and making it unintelligible to any unauthorised party (as long as state-of-the-art algorithms and key lengths are used and the encryption key is appropriately protected).

Moreover, as mentioned earlier, pseudonymisation rests on decoupling, i.e. it turns one single input (the initial dataset) into a dual output (pseudonymised data and additional information). Reversal of pseudonymisation is possible for anyone who can retrieve the additional information, or who can link the pseudonymised data to the initial data with the use of any other information. On the contrary, encryption generates, from a single input (the initial data), a single output (the encrypted data), and its reversal mainly rests on unauthorised access to the decryption key 25.

However, despite the aforementioned distinction, it is important to state that encryption may also be used as a pseudonymisation technique (whereas the opposite is impossible). Cryptographic primitives can in general be used in pseudonymisation techniques to generate pseudonyms with desired properties.
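As a minimal sketch of this idea (one possible construction, not a prescribed scheme), a keyed hash such as HMAC can derive a stable pseudonym from an identifier; without the secret key, the pseudonym can be neither reversed nor recomputed:

```python
import hashlib
import hmac

def pseudonym_from_identifier(identifier: str, key: bytes) -> str:
    """Derive a deterministic pseudonym from an identifier using HMAC-SHA256.

    The same identifier always maps to the same pseudonym (preserving
    dataset coherence), but without the secret key the mapping cannot be
    reversed or recomputed. The key plays the role of the 'additional
    information' and must be stored separately from the pseudonymised data.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"secret key, kept separately from the data"  # hypothetical key
p1 = pseudonym_from_identifier("alice@example.com", key)
p2 = pseudonym_from_identifier("alice@example.com", key)
assert p1 == p2  # deterministic: the subject's records remain linkable
```

Note the design choice: unlike a random lookup table, no identifier-to-pseudonym mapping needs to be stored at all, since only the key must be protected; changing the key yields an entirely different, unlinkable set of pseudonyms.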