Anonos | CPDP 2018 The importance of technical solutions such as dynamic pseudonymous data

CPDP 2018 Webinar

Presentation Transcript

Gary LaFever

CEO at Anonos
Former Partner at Hogan Lovells

Malte Beyer-Katzenberger

Policy Officer at European Commission

Gwendal Le Grand

Commission Nationale de l'Informatique et des Libertés

Dr. Alison Knight

Head of Research Integrity and Governance at University of Southampton

CPDP 2018 - Data Protection by Design and by Default - The importance of technical solutions such as dynamic pseudonymous data

Gary

[00:07] I want to start by saying it truly is an honor to be here, and I want to highly recommend to everyone in the audience the papers and the research that Sophie and Alison have done on the need for dynamism for data protection. It is definitely worth looking at. And in fact, what I'm here to talk about is a particular approach to data protection by design and by default that is predicated on what is referred to as Dynamic Pseudonymisation. And to highlight the difference between dynamic and static, the uses of data today are dynamic, and they're challenging traditional static approaches to data protection.

[00:49] There is a study that many of you are probably aware of where in the US several data scientists secured copies of anonymous databases from the census. One was zip code, one was age, and one was gender. In each one of those three datasets, everyone was anonymous. The problem is, the same token was used to replace people's names. So, if it was me, I would be ABCD in the first dataset, ABCD in the second dataset, and ABCD in the third dataset. When you combine the three datasets, you actually reduce the uncertainty or entropy to the level where there are claims that up to 87% of the US population are re-identifiable by name.
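The linkage problem described above can be sketched in a few lines of Python. This is only an illustration with made-up data: the four tokens, the attribute values, and the helper name `singled_out` are assumptions, not taken from the study. It shows that no one stands out in any single dataset, yet joining on the shared static token gives every person a unique profile.

```python
from collections import Counter

# Hypothetical extracts: each one is "anonymous" on its own, but all three
# reuse the same static token for the same person.
zip_codes = {"ABCD": "80301", "EFGH": "80301", "IJKL": "10001", "MNOP": "10001"}
ages = {"ABCD": 52, "EFGH": 31, "IJKL": 52, "MNOP": 31}
genders = {"ABCD": "M", "EFGH": "F", "IJKL": "F", "MNOP": "M"}


def singled_out(table):
    """Return the tokens whose value (or combined profile) appears exactly once."""
    counts = Counter(table.values())
    return sorted(token for token, value in table.items() if counts[value] == 1)


# Join the three datasets on the static token they share.
profiles = {
    token: (zip_codes[token], ages[token], genders[token]) for token in zip_codes
}

print(singled_out(zip_codes), singled_out(ages), singled_out(genders))
print(singled_out(profiles))
```

In this toy data each attribute value is shared by two people, so the single datasets single out nobody, while the combined profiles single out all four tokens.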

[01:40] And so, the dynamic use of data threatens and challenges traditional static approaches to protection and that's why we believe the dynamic approach to data protection by design and by default using Pseudonymisation makes sense because then you never claim the data is outside of the GDPR. Rather, you show that the technical and organizational measures are in place that it's permitted use under the GDPR. Next slide, please.

[02:06] So, the relevancy and timeliness of this issue is really highlighted by recent guidance that came out from the Article 29 Working Party. It came out in late November, and I think it was initially published when I first noticed it in mid-December. And in essence, it deals with, and this is the focus I will add, traditional data, historical data, legacy data that's collected up through May 25th based on consent that does not satisfy the new, stricter requirements.

[02:35] And so, if you don't have consent as your legal basis, what can you do to retain access and use of this data? And typically, we've heard lawyers advising clients they have two choices. They can delete the data, which means they never have access to it again. Or they can anonymize the data, and truly anonymize it. And if they do so, it will have residual value. It will have general statistical relevance that can be used for a number of purposes, but they can never relink it back to individuals or identifying data by definition.

[03:07] The reality is there is a third approach. And that third approach is if you satisfy the requirements of Pseudonymisation. For those of you who were here at the prior panel, legitimate interest is available. It's not a catchall. It's not something you just throw things at. But one of the ways you can support legitimate interest is through Pseudonymisation.

GDPR Technical and Organizational Requirements

[03:31] And so, this slide is intended to show shock and horror. It's because it should be no surprise that existing governance systems were not designed to support data protection by design and by default. It didn't exist before the GDPR. Privacy by design did. But if you read Article 25, it's very clear by default the data is protected. That's contrary to how data is used today. And by design only that data that is necessary to support an authorized purpose is revealed.

[04:03] And so, that is a requirement that goes to caring for the data. Data sharing is data caring. I have to remember that, Gwendal. That’s a good one. And it's true though. If you design the engineering capabilities to protect the data through use, you actually can make greater use. So, it should be no shock that existing GRC solutions were not designed to support this. The next slide, if you would, please.

Conflicting Compliance and Business Drivers

[04:26] This tension here is not new. It's an age-old tension. And you can replace business on the right hand side with innovation in general. There has traditionally been this tension between the people whose job it is to protect data and those whose job it is to maximize the innovation, but they've been at odds and the next slide will show why.

[04:45] If the approach of governance, risk, and compliance (GRC) solutions is to protect only, then literally the best you can hope to do is to get up to that wall of absolute complete protection, but you're not going to be able to cross the innovation barrier.

[05:04] However, if you have incorporated the legal, technical, and organizational requirements into the system at the data element level that protects the data in a way that respects and honors the rights of the data subjects, you can remove just those blocks that are appropriate at any given situation that enables you to have both defensive and offensive GRC.

Taking the 'Personal' Out of Personal Data

[05:32] So, this is one approach to data protection by design and by default. This is the approach that BigPrivacy takes. In this instance, there's three steps. Step one, you de-link every single data element. What does that mean? It means you pseudonymise it. You're replacing the data with non-algorithmically derived tokens. There's no means to reverse engineer the value of a token. It's the relationship only. It’s a dereference. So, number one was Pseudonymisation.
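A minimal sketch of what replacing data with non-algorithmically derived tokens could look like, assuming a simple in-memory lookup table. The `TokenVault` class and the `delink` function are hypothetical names used for illustration and do not describe how BigPrivacy is implemented. The point is that each token is random, so the only way back to the value is the stored relationship.

```python
import secrets


class TokenVault:
    """Hypothetical lookup table holding the token-to-value relationships."""

    def __init__(self):
        self._values = {}  # token -> original value, stored apart from the data

    def tokenize(self, value):
        # The token is random, not computed from the value, so there is
        # nothing to reverse engineer; only the lookup table links the two.
        token = secrets.token_hex(8)
        while token in self._values:
            token = secrets.token_hex(8)
        self._values[token] = value
        return token

    def lookup(self, token):
        return self._values[token]


def delink(record, vault):
    """Replace every data element of a record with its own random token."""
    return {field: vault.tokenize(value) for field, value in record.items()}


vault = TokenVault()
record = {"name": "John Smith", "zip": "80301", "age": 52}
protected = delink(record, vault)
restored = {field: vault.lookup(token) for field, token in protected.items()}
```

Because `tokenize` issues a fresh token on every call, the same value never maps to the same token twice, which is the property the later steps rely on.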

[06:03] Number two is you de-risk the data. There's a principle under the EU Data Protection Law of functional separation. If you can functionally separate the information value of data from means of re-identification, you're given certain incentives. In fact, our interpretation of Pseudonymisation under the GDPR is very conservative. It literally says: “On the one hand, you have information value. On the other hand, you have the means to re-identification. And a key is required to join the two.”

[06:23] The reality is our opinion is if you use the same token again and again, a key is not even necessary. As indicated by the first example, if you're using static tokenization, you don't need a key. You can correlate the sets of records and actually figure out someone's identity. So, the second step of de-risking literally is functionally separating the information value from the means of re-identification.

[06:56] And then, step three is critical and we believe it is what data protection by design and by default is all about, which is you only provide that level of identifiability that is necessary for the authorized purpose. Most processing actually doesn't want identifiable data. The issue is traditional technologies have actually provided that and there hasn't been a distinction.

[07:17] So, let me draw what we see as the differences with data by design and by default. Number one, you separate the information value from the identifiability. You don't make it so that if you want information value, you have to take identifiable data. That's point one. The second is the access that you get to data is focused on specific use cases. And the third is re-linking is possible in controlled authorized conditions.
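One way to picture access that is protected by default and opened only for specific use cases is a purpose-based release function. The sketch below is an assumption-laden illustration: the `POLICY` table, the purposes, and the field names are invented, and a real system would also record the token relationships so that authorized re-linking remains possible.

```python
import secrets

# Hypothetical policy set by the data controller:
# purpose -> fields that may be released in the clear for that purpose.
POLICY = {
    "analytics": {"age_band", "region"},
    "billing": {"name", "account"},
}


def release(record, purpose, policy=POLICY):
    """Protected by default: any field not allowed for the purpose is tokenized."""
    allowed = policy.get(purpose, set())
    return {
        field: (value if field in allowed else secrets.token_hex(4))
        for field, value in record.items()
    }


record = {
    "name": "John Smith",
    "account": "ACCT-001",
    "age_band": "50-59",
    "region": "West",
}
for_analytics = release(record, "analytics")
for_unknown = release(record, "marketing")  # no rule defined, so nothing is revealed
```

The design choice worth noting is the default: a purpose with no rule reveals nothing, rather than everything.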

[07:47] And I want to make it very clear, this is not an as-a-service offering. This is a technology offering that's available to data controllers so they can exercise their stewardship rights. And what this does is it enables you to have greater legal uses of data internally because there are incentives in the GDPR, as Gwendal mentioned, for Pseudonymisation. And it also increases the opportunity for sharing externally.

[08:12] So, you've taken the tension between the data users, business innovation on the right hand side, and the compliance people - those who are protecting the rights of the data subjects, and you've made it not a tug of war anymore because the governance controls established by compliance are technologically enforced to enable use of pseudonymous data. And only in those situations where identifiability is required and authorized, and for which an appropriate legal basis exists, is that provided. But oftentimes, pseudonymous data is more than sufficient for the need.

[08:45] The next slide, which is my favorite slide because it blinks, on the left hand side, the bright vivid blue represents usable data. Below that to the right is the dark gray, and this is meant to highlight that in a static approach to data protection using static identifiers, there's a confusion between identity and information. I get one, I get the other. I can't split them apart. And you actually lose protection at scale, which I've mentioned earlier as you start to combine additional datasets.

[09:18] And lastly, the whole value proposition is deterministic where you rely on knowing that an individual is that specific individual. And if you go to the right hand side, what that's meant to highlight is each individual square is an individual cell of data. Sometimes it's needed. Sometimes it's not needed. Why is it even in the conversation? So, you've dynamically changed identifiers. You will notice on the left, I call them static identifiers. On the right, they're dynamic de-identifiers.

[09:49] What that means is because the identifier is changing, so if your name appeared - I had the prior example of three datasets that was ABCD, ABCD, and ABCD. Why isn't it ABCD, Q99 and DDID? Now each of the three datasets is anonymous. And unless you have permission to know that ABCD equals Q99 equals DDID, you don't know that. But you haven't lost the ability to re-link. And so, you actually are separating value from identity. And the more data that's added, the protection actually increases. And for most use cases, it's probabilistic. It's not deterministic, which is more than necessary for the desired use.
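The ABCD, Q99, DDID idea can be sketched as follows, assuming random tokens and a key table held by the data controller. The function names `issue_dynamic_tokens` and `relink`, the person identifiers, and the boolean `authorized` flag are all hypothetical simplifications of whatever authorization a real system would apply.

```python
import secrets


def issue_dynamic_tokens(person_ids, dataset_names):
    """Give each person a different random token in every dataset.

    Returns (tokens, key). `tokens[dataset][person]` is the token used in that
    dataset; `key[token]` maps back to the person and stays with the controller.
    """
    used = set()
    tokens = {name: {} for name in dataset_names}
    key = {}
    for name in dataset_names:
        for person in person_ids:
            token = secrets.token_hex(4)
            while token in used:  # keep tokens unique across all datasets
                token = secrets.token_hex(4)
            used.add(token)
            tokens[name][person] = token
            key[token] = person
    return tokens, key


def relink(key, token, authorized):
    """Re-linking works only with the key and only when authorized."""
    if not authorized:
        raise PermissionError("re-linking requires authorization")
    return key[token]


people = ["person-1", "person-2", "person-3"]
tokens, key = issue_dynamic_tokens(people, ["zip", "age", "gender"])

# No token appears in more than one dataset, so a join on tokens finds nothing.
shared = (
    set(tokens["zip"].values())
    & set(tokens["age"].values())
    & set(tokens["gender"].values())
)
```

Without the key, the three datasets cannot be joined on their tokens; with the key and authorization, the controller can still get back to the individual.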

Reconciling Data Use and Data Protection

[10:29] So, the next slide just highlights the difference between the different types of protection and what they were designed to do. So, the three columns are scalable protection against linkage attacks or the mosaic effect, the ability to relink, and the last one is increased sharing opportunities. We believe what the GDPR provides in data protection by design and by default actually is the only approach that hits all three, which is why we think it's the way to increase innovation and sharing.

4 Steps to Greater Sharing & Collaboration

[10:57] And the next slide highlights the most popular use case. This is oftentimes referred to as multi-party computing or MPC. And when people think of MPC, they think of homomorphic encryption or maybe differential privacy. But the reality is that can very quickly explain real world commercial use cases of data protection by design and by default. So, you have on the left hand side in step one, a blue individual and a green individual. The dark blue boxes represent identifying data. The dark green boxes represent identifying data. The two parties agree in advance to a schema, a generalized probabilistically modified version of their data that is still granular, and they can agree upon the cluster size.

[11:42] In many industries, a cluster of five is held to be satisfactory. But whatever that size happens to be, you don't have an individual that is identified in the two datasets. So I don't know that John Smith is a customer of yours. What I know is there are five people in a category and in a data cluster that has similar characteristics. One of them may be John Smith. None of them may be John Smith. But I have reduced my data to a level of identifiability that doesn't reveal who people are but reveals valuable information about small clusters. Why is that relevant? Because if they both apply the same schema to their data, you can now combine those datasets and have greater knowledge about the combination of the datasets in a non-identifying manner. This uses k-anonymity, l-diversity, and a number of different algorithmic means to ensure that the risk is small enough.
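The shared-schema idea can be illustrated with a small sketch, assuming a schema of ten-year age bands and three-digit zip prefixes and a minimum cluster size of five. The schema, the field names, and the sample records are invented for illustration. The code only applies a k-anonymity-style size threshold; it does not implement l-diversity or any of the other measures mentioned.

```python
from collections import defaultdict
from statistics import mean

K = 5  # minimum cluster size, following the "cluster of five" mentioned above


def generalize(record):
    """Hypothetical shared schema: ten-year age band plus three-digit zip prefix."""
    decade = record["age"] // 10 * 10
    return (f"{decade}-{decade + 9}", record["zip"][:3] + "**")


def to_clusters(records, value_field, k=K):
    """Aggregate one party's records per schema cell, dropping cells smaller than k."""
    groups = defaultdict(list)
    for record in records:
        groups[generalize(record)].append(record[value_field])
    return {
        cell: {"n": len(values), "mean": mean(values)}
        for cell, values in groups.items()
        if len(values) >= k
    }


def combine(clusters_a, clusters_b):
    """Join the two parties' aggregates on the cells they have in common."""
    shared_cells = clusters_a.keys() & clusters_b.keys()
    return {cell: (clusters_a[cell], clusters_b[cell]) for cell in shared_cells}


party_a = [{"age": 30 + i, "zip": "80301", "spend": 100 + 10 * i} for i in range(5)]
party_a.append({"age": 45, "zip": "10001", "spend": 500})  # cell of one, suppressed
party_b = [{"age": 31 + i, "zip": "80302", "visits": i + 1} for i in range(6)]

combined = combine(to_clusters(party_a, "spend"), to_clusters(party_b, "visits"))
```

Neither party learns whether a particular person is the other's customer; each sees only aggregates for cells that hold at least five records on both sides.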

[12:34] And so, in doing so, you enable two parties to exchange data, enrich the data, have better knowledge that data is an asset to them, and you have never jeopardized the individual rights of data subjects. And then in step four, with appropriate legal basis and perhaps even consent, you have the ability to get the benefits of this enhanced information back to the original data subjects so you don't have to conflate or confuse identity and data value.

[13:05] So, the next slide shows that you can actually through data protection by design and by default leverage non-identifying data. But you can also overcome the limitations on data linking and consent by having controlled relinkability. Again, it's a balancing of interest. This is not a silver bullet. There is no such thing as a silver bullet. And in doing so, you take advantage, to the benefit of both the data subjects and the data controller, of incentives built into the GDPR.

[13:37] The very last slide simply highlights and enforces that this distinction is timely because data controllers have a one-time opportunity. The guidance that came out says that all the data that you've collected through May 25th, if it's based on broad based consent, is no longer permissible to possess or process. And so, you have to do something. One of the things companies can consider is whether they have a legitimate interest in maintaining the data for statistical purposes, and whether Pseudonymisation can enable them in doing so. Thank you.

Sophie

[14:13] Thank you very much, Gary. If I were to sum that up for you, your answer is both using technology and adding some controls, including legal controls.

Gary

[14:28] Yes. So, in order for this approach to work, you have to have a foundation of knowledge on the compliance side when you set the tools in place so that the individual data controller is the one who's setting the rules. But then the technology automatically enforces those rules. And so, the reason the compliance person in the second slide where they used to tell you more had a smile on his face is because he knows the data that leaves his premises, whether it's to an internal user or an external user, cannot violate the rules he imposed on it. So, it's a combination of both legal and technical.
