General Data Protection Regulation (GDPR)
General Provisions
Chapter 6

IP Address Pseudonymisation

This chapter applies the techniques and information presented earlier in the document to a specific use case: the pseudonymisation of IP addresses.

An IP address uniquely identifies a device on an IP network. There are two versions of IP addresses: IPv4 [27] and IPv6 [28]. This use case focuses on IPv4, as it is still the most commonly used; extending the concepts described earlier to IPv6 would be quite complex and is beyond the scope of this document. An IPv4 address consists of 32 bits (128 bits for IPv6), divided by a subnet mask into a network prefix (most significant bits) and a host identifier (least significant bits). IPv4 addresses are usually written in dotted decimal format: four decimal numbers between 0 and 255 separated by dots, such as 127.0.0.1. The sizes of the network prefix and the host identifier depend on the size of the CIDR block (Classless Inter-Domain Routing [29]). In addition, some IP addresses are special, such as 127.0.0.1 (localhost) or 224.0.0.1 (multicast). These special addresses are all defined in [30] and are categorised into 15 classes.
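The split between network prefix and host identifier can be illustrated with a short Python sketch based on the standard `ipaddress` module. The helper name `split_address` and the /24 and /16 prefix lengths are assumptions chosen for illustration; they are not taken from the report.

```python
import ipaddress


def split_address(address: str, prefix_len: int) -> tuple[int, int]:
    """Split an IPv4 address into (network prefix, host identifier) integers.

    prefix_len is the number of most significant bits that form the
    network prefix, as in CIDR notation (e.g. 24 for a /24 block).
    """
    value = int(ipaddress.IPv4Address(address))           # 32-bit integer
    host_bits = 32 - prefix_len
    host_mask = (1 << host_bits) - 1                      # low-order bits set
    return value >> host_bits, value & host_mask


# Hypothetical example: the client address used later in this chapter,
# assumed to sit in a /24 block.
prefix, host = split_address("145.254.160.237", 24)

# The standard library also recognises some of the special address classes.
loopback = ipaddress.ip_address("127.0.0.1").is_loopback
multicast = ipaddress.ip_address("224.0.0.1").is_multicast
```

With a /24 block, the first three decimal numbers form the network prefix and the last one (237) is the host identifier.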

The Internet Assigned Numbers Authority (IANA) manages the whole IP address space with the help of five regional Internet registries (RIRs). They allocate subsets of IP addresses to local organisations such as Internet Service Providers, which in turn assign addresses to the devices of end-users. Each IP address assignment is documented by the corresponding RIR in the so-called WHOIS database. The assignment can be static or dynamic (for instance using the Dynamic Host Configuration Protocol, DHCP).

From a legal perspective, the status of IP addresses has been discussed by the Court of Justice of the European Union in case C-582/14, Breyer v Bundesrepublik Deutschland. Static or dynamic IP addresses are considered personal data. This was also confirmed by Opinion 4/2007 of the Article 29 Data Protection Working Party on the concept of personal data [31]. Therefore, databases or network traces containing IP addresses must be protected, and pseudonymisation is an obvious protection measure: it allows the use of IP addresses while preventing their linkability to specific individuals. That being said, choosing an appropriate pseudonymisation technique for IP addresses means finding a good trade-off between utility and data protection. Indeed, the data controller may still need to compute statistics or detect patterns (e.g. to spot a misconfigured device or to monitor quality of service) in the pseudonymised database. Utility and data protection cannot be treated independently in practice; they are separated in the following sections only for ease of understanding.

6.1 PSEUDONYMISATION AND DATA PROTECTION LEVEL

The main characteristic of the IP address pseudonymisation problem is the size of the input space (identifier domain): there are only 2^32 possible IP addresses. This makes exhaustive and dictionary searches available to an adversary to mount complete re-identification or discrimination attacks if the pseudonymisation function is not properly chosen.

Given this characteristic, cryptographic hash functions are especially vulnerable in this use case. As an example, consider an IP address pseudonymised with the hash function SHA-256. An adversary who holds a pseudonym (digest) can use existing tools to perform an exhaustive search. Table 5 shows the duration of this search on a single ordinary laptop running an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz processor (8 cores), together with the size of the corresponding dictionary. Even in the worst case, it takes only about 2 minutes to recover the IP address belonging to a given pseudonym.

Table 5: Practical costs of attacks against hash function pseudonymisation

IP class         Number of possible IPs   Time of exhaustive search   Dictionary size
145.254.160.X    256                      200 ms                      8 KB
145.254.X.X      65536                    200 ms                      2 MB
145.X.X.X        16777216                 2 s                         512 MB
X.X.X.X          4294967296               2 min 16 s                  128 GB

Furthermore, let us assume that the adversary wishes to determine whether a pseudonym corresponds to a special address [30]. This discrimination attack does not need to be performed on the 2^32 possible IP addresses, but only on the 588,518,401 possible special IP addresses.
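A discrimination attack of this kind can be sketched as a dictionary lookup: the adversary precomputes the digests of the addresses in the class of interest and then only tests membership. The sketch below is an illustration under assumptions: the two /24 blocks are small excerpts chosen to keep the example fast, not the full list of special addresses defined in [30], and the pseudonym is assumed to be the SHA-256 digest of the dotted decimal string.

```python
import hashlib
import ipaddress

# Illustrative excerpts only (assumption): small parts of the loopback
# and multicast ranges, not the complete set of special addresses.
ILLUSTRATIVE_BLOCKS = ["127.0.0.0/24", "224.0.0.0/24"]


def sha256_pseudonym(ip: str) -> str:
    """Unkeyed pseudonym: SHA-256 of the dotted decimal string (assumed encoding)."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def build_dictionary(blocks: list[str]) -> set[str]:
    """Precompute the digests of every address in the given CIDR blocks."""
    return {
        sha256_pseudonym(str(address))
        for block in blocks
        for address in ipaddress.ip_network(block)
    }


special_digests = build_dictionary(ILLUSTRATIVE_BLOCKS)


def looks_special(pseudonym: str) -> bool:
    """Decide class membership without recovering the address itself."""
    return pseudonym in special_digests
```

The adversary learns whether the pseudonym belongs to the class even without looking up which address it is.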

This simple case demonstrates that pseudonymisation of IP addresses using only cryptographic hash functions fails. Therefore, for data protection, other pseudonymisation functions must be preferred, such as message authentication codes (MAC), encryption with a secret key generated ad hoc, or random number generators (RNG). As discussed earlier in the report, an adversary cannot mount the same attacks because these methods rely on a secret key (MAC and encryption) or on a source of randomness (RNG). Counters can also be used, but one must be cautious about possible predictions arising from their sequential nature.
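As an illustration of the keyed alternative, the following sketch derives pseudonyms with HMAC-SHA256 from the Python standard library. The function name, the 32-byte key length and the choice of HMAC-SHA256 are assumptions made for this example; the report does not prescribe a specific MAC.

```python
import hashlib
import hmac
import secrets


def mac_pseudonym(ip: str, key: bytes) -> str:
    """Keyed pseudonym: HMAC-SHA256 of the dotted decimal string."""
    return hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()


# The key is generated once and kept secret by the pseudonymisation entity.
key = secrets.token_bytes(32)
other_key = secrets.token_bytes(32)

pseudonym = mac_pseudonym("145.254.160.237", key)
```

Without the key, an adversary cannot recompute candidate pseudonyms, so the exhaustive search over the 2^32 addresses no longer applies. The protection then rests entirely on keeping the key secret.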

6.2 PSEUDONYMISATION AND UTILITY LEVEL

As already mentioned, in the case of IP addresses, utility might be an essential requirement for the pseudonymisation entity, e.g. for the calculation of statistics or network security. Therefore, the approach applied (independently of the chosen technique) should allow for adequate protection, while preserving some basic useful information (arising from the IP addresses). In this section, two different dimensions towards this issue are considered: first, the possibility to minimise the level/scope of pseudonymisation of the IP address; and second, the choice of the pseudonymisation policy (mode).

6.2.1 Pseudonymisation level

In the previous section, pseudonymisation was applied to the complete IP address (32 bits). However, in order to increase utility, it is possible to apply it only to the least significant bits of the address (host identifier) and to preserve the network prefix. This technique is called prefix-preserving pseudonymisation [32]. It allows the global origin of a packet (the network) to be identified without revealing which device within the network actually sent it. It is critical to understand how many devices exist for a given prefix, since the devices sharing a prefix are the only candidates an adversary needs to consider. Table 5 lists the number of possible addresses for different prefix sizes. This technique is already used by several service providers to pseudonymise IP addresses (see e.g. [33]).
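The idea of pseudonymising only the host identifier can be sketched as follows. This is a simplified illustration and not the construction of [32]: it keeps a fixed-length prefix in clear and replaces the host identifier through a keyed permutation, obtained by ranking all host values of the prefix by their HMAC tag. The function name, the default /24 prefix and the use of HMAC-SHA256 are assumptions, and the ranking step is only practical for small host parts.

```python
import hashlib
import hmac
import ipaddress


def pseudonymise_host(ip: str, key: bytes, prefix_len: int = 24) -> str:
    """Keep the network prefix and replace the host identifier.

    The host identifier is mapped through a keyed permutation of all
    host values within the prefix, so two distinct hosts never collide.
    """
    value = int(ipaddress.IPv4Address(ip))
    host_bits = 32 - prefix_len
    host_mask = (1 << host_bits) - 1
    prefix = value & ~host_mask        # network prefix, host bits zeroed
    host = value & host_mask

    def tag(candidate: int) -> bytes:
        message = prefix.to_bytes(4, "big") + candidate.to_bytes(4, "big")
        return hmac.new(key, message, hashlib.sha256).digest()

    # Rank every host value of this prefix by its keyed tag; the position
    # of the real host in that order is its pseudonymised host identifier.
    order = sorted(range(1 << host_bits), key=tag)
    new_host = order.index(host)
    return str(ipaddress.IPv4Address(prefix | new_host))


example_key = b"\x01" * 32  # fixed key for illustration only
masked = pseudonymise_host("145.254.160.237", example_key)
```

The output still starts with 145.254.160, so per-network statistics remain possible, while the last number no longer points to a specific device for anyone who does not hold the key.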

6.2.2 Choice of pseudonymisation mode

The choice of the pseudonymisation mode has a strong impact on the utility and on the data protection level, independently of the choice of a certain pseudonymisation technique. In this section, this relation is further explored with a specific example.

Let us consider the pseudonymisation of the source and destination IP addresses in a network trace. Table 6 provides the source and destination addresses of the first packets of an HTTP request between a client (145.254.160.237) and a server (65.208.228.223).

Table 6: Source and destination of an HTTP request

Packet number   Source            Destination
Packet 1        145.254.160.237   65.208.228.223
Packet 2        65.208.228.223    145.254.160.237
Packet 3        145.254.160.237   65.208.228.223
Packet 4        145.254.160.237   65.208.228.223
Packet 5        65.208.228.223    145.254.160.237

Let us apply deterministic pseudonymisation to this example, for instance using an RNG. Each IP address is associated with a single pseudonym, which is reused every time the address appears. The resulting mapping table is given in Table 7. After deterministic pseudonymisation, Table 8 is obtained.

Table 7: Mapping table for deterministic pseudonymisation

IP address Pseudonym
145.254.160.237 238
65.208.228.223 47

Table 8: Source and destination addresses transformed using deterministic pseudonymisation

Packet number Source Destination
Packet 1 238 47
Packet 2 47 238
Packet 3 238 47
Packet 4 238 47
Packet 5 47 238

Let us compare the information that can be extracted from the original network trace (Table 6) and Table 8. As can be seen from this comparison, from both traces (original and pseudonymised), it is possible to infer the total number of IP addresses involved and how many packets were sent by each address during the communication. Therefore, while the IP addresses in Table 8 are pseudonymised, the same level of statistical analysis (and, thus, utility) is possible on the IP addresses. 
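The deterministic mode can be sketched with a mapping table filled from a random number generator, in the spirit of Table 7. The names used below and the pseudonym space of 256 values are assumptions for illustration; the pseudonyms drawn will differ from those in Table 7 because they are random.

```python
import secrets
from collections import Counter

# Source and destination addresses of Table 6.
TRACE = [
    ("145.254.160.237", "65.208.228.223"),
    ("65.208.228.223", "145.254.160.237"),
    ("145.254.160.237", "65.208.228.223"),
    ("145.254.160.237", "65.208.228.223"),
    ("65.208.228.223", "145.254.160.237"),
]


def deterministic_pseudonymise(trace, space=256):
    """Map each address to one random pseudonym, reused at every occurrence.

    Assumes fewer distinct addresses than `space`, otherwise no unused
    pseudonym would be left to draw.
    """
    mapping = {}

    def lookup(ip):
        if ip not in mapping:
            candidate = secrets.randbelow(space)
            while candidate in mapping.values():   # keep pseudonyms unique
                candidate = secrets.randbelow(space)
            mapping[ip] = candidate
        return mapping[ip]

    return [(lookup(src), lookup(dst)) for src, dst in trace], mapping


pseudo_trace, mapping_table = deterministic_pseudonymise(TRACE)

# Packets sent per source, before and after pseudonymisation.
original_counts = sorted(Counter(src for src, _ in TRACE).values())
pseudo_counts = sorted(Counter(src for src, _ in pseudo_trace).values())
```

Both counts give the same distribution (one address sent 3 packets, the other 2), which is the statistical utility preserved by the deterministic mode.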

Now, let us consider the case of document-randomized pseudonymisation with an RNG. Each time an IP address is encountered, it is transformed into a different pseudonym. For instance, IP address 145.254.160.237 is associated with 5 pseudonyms, namely 39, 71, 48, 136 and 120 (Table 9). After applying document-randomized pseudonymisation, Table 10 is obtained.

Table 9: Mapping table for document-randomized pseudonymisation

IP address Pseudonym
145.254.160.237 39,71,48,136,120
65.208.228.223 23,30,60,160,231

Table 10: Source and destination addresses transformed using document-randomized pseudonymisation

Packet number Source Destination
Packet 1 39 23
Packet 2 30 71
Packet 3 48 60
Packet 4 136 160
Packet 5 231 120

As Table 10 shows, while it was possible to count 2 IP addresses in Table 6 and Table 8, this is no longer the case: 10 IP addresses appear to be involved. Therefore, the level of utility has been reduced (while the level of protection has increased). Obviously, the application of fully-randomized pseudonymisation has an even stronger impact on utility. Table 11 compares the different modes of IP pseudonymisation in this respect.
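The per-occurrence behaviour shown in Table 9 and Table 10 can be sketched as follows. The names and the pseudonym space of 256 values are assumptions for illustration, and the sketch covers a single trace only: it does not model how the same collection of pseudonyms would be reused across different traces.

```python
import secrets
from collections import defaultdict

# Source and destination addresses of Table 6.
TRACE = [
    ("145.254.160.237", "65.208.228.223"),
    ("65.208.228.223", "145.254.160.237"),
    ("145.254.160.237", "65.208.228.223"),
    ("145.254.160.237", "65.208.228.223"),
    ("65.208.228.223", "145.254.160.237"),
]


def randomized_pseudonymise(trace, space=256):
    """Give every occurrence of an address a fresh, unused pseudonym.

    Assumes the trace holds fewer address occurrences than `space`.
    """
    mapping = defaultdict(list)   # address -> list of its pseudonyms
    used = set()

    def fresh(ip):
        candidate = secrets.randbelow(space)
        while candidate in used:  # never reuse a pseudonym
            candidate = secrets.randbelow(space)
        used.add(candidate)
        mapping[ip].append(candidate)
        return candidate

    pseudo = [(fresh(src), fresh(dst)) for src, dst in trace]
    return pseudo, dict(mapping)


pseudo_trace, mapping_table = randomized_pseudonymise(TRACE)

# Number of distinct "addresses" an analyst would count in the output.
apparent_addresses = len({p for pair in pseudo_trace for p in pair})
```

Only the holder of the mapping table can still tell that the 10 pseudonyms belong to 2 addresses; an analyst working on the pseudonymised trace alone cannot.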

Table 11: Mode of pseudonymisation and utility

                                      Mode of pseudonymisation
Utility                               Deterministic   Document-randomized   Fully-randomized
Statistics (count...)                 YES             NO                    NO
Protocol semantics                    YES             NO                    NO
Comparison between different traces   YES             YES                   NO

Clearly, there is no single solution to this problem, and the final choice always depends on the utility and protection requirements of the pseudonymisation entity.