When we talk about Privacy Enhancing Techniques (PETs) we often don’t consider what we really mean by “privacy”. We often miss the nuance that “privacy” can come in many different shapes and sizes. Understanding different types of privacy and how to measure them allows us to ensure our data is secure for a given application and can help us select the right PETs to achieve appropriate levels of privacy
Privacy is not a binary concept where a data set is either “private” or not but a much more nuanced idea. Luckily we can measure and quantify different types of privacy. To achieve both high levels of utility and privacy in your data, you need to understand the levels of privacy achieved with each PET application. This allows the trade-off between privacy and utility to be optimised for a given application. While there are large amounts of ways to analyse and quantify the privacy of data sets, there are a few key privacy models that are important to understand.
Our data set
To further explain these privacy models we are going to demonstrate them on an example data set. We can classify each "column" or attribute of this data set as either being a sensitive attribute (SA), identifying (ID), quasi-identifying (QID), or non-sensitive (NS). Generally, SAs are the thing that we are trying to protect, but are often also the attributes that we are hoping to analyse. IDs are direct identifiers (passport numbers, full names, etc.). QIDs on the other hand are attributes such as age and gender that relate to a person. While QIDs are not directly identifying, if a number of them are combined they can be used to identify an individual. For example, it has been shown that it is possible to identify about 87% of the US population by only using their zip code, age, and gender.
Our data set shows what political parties an individual voted for. To improve the privacy of our data we have applied a masking PET to the "Names" and a generalisation PET to the "Age".
We now also consider a possible attack on the protected data using linkage. This is where the attacker knows an individual is included in the data set and has some information about that individual. This sort of information can be gathered from all kinds of public data sources. In this instance, this information might have come from a voter registry.
K-anonymity states that each individual is hidden in a group of k-1 other individuals with the same QID values. The larger the number of individuals in each QID group the higher the privacy as it is hard for the attacker to link any individual to a sensitive attribute.
Considering our data we can determine our k value by grouping our data by our QIDs.
The smallest QID group has only one person in it giving this data set a k value of 1, making this data set 1-anonymous.
If we now consider Ellie from the table the attacker has, we see that she is in this QID group of one.
We can therefore say with 100% certainty that Ellie voted for the Green Party and her privacy has been breached.
If we remove her entry from the protected data we can recalculate the number of people present in each QID group.
There are now at least 5 people in each QID group. This data set is now 5-anonymous so even though we know the age and gender of Anita and Bill we can not determine which specific entry they are in the protected data as they are hidden in a group of 5 people.
l-diversity is the extension of k-anonymity. It states that each QID group must have at least l distinct and well-represented values.
Let's look at our QID groups again, and in particular, at the group Bill, a 23 year old male, belongs to.
Bill’s information is hidden in a group of 5 and while we can not determine which person he is in that group we might still be able to determine which political party he voted for. If we look at the QID group Bill is in we can see that all 5 people in that QID group voted for the same party. While we still don't know which of those entries belongs to Bill, it doesn’t matter as we can say with certainty that he voted for the Green Party.
Therefore this data set has an l value of 1 as the QID group of males, 20-30 only has one distinct value.
We can now remove all values of this QID group from our data set and take another look at how many distinct values are in each QID group.
We now have at least 2 distinct values for each QID group. This data set is now 2-diverse. As with k-anonymity, a high value represents high privacy.
While l-diversity has ensured that each QID group contains distinct values it does not address the fact that an attacker can still make a very good guess at linking an individual to a sensitive attribute depending on the distribution of sensitive attributes in that QID group.
t-closeness states that the distribution of sensitive attributes in each QID group can not be too far from the distribution of those values in the rest of the data set. This is quantified by the value t. Unlike l-diversity and k-anonymity, the greater the t value is, the lower the privacy.
The t-closeness of a QID group can be calculated using the earth mover’s distance (EMD) which gives the distance between probability distributions. Depending on the type of data, i.e. numerical or categorical, the t will be calculated in a different way. For categorical data, we can calculate the EMD of two distributions, P and Q, using the variational distance given below.
We will first calculate the distribution of Q, which we will take to be our whole data set. The probability of any value is given by the number of times it appears divided by the total number of values. E.g. The probability of any person in the dataset voting for the Red party is 8⁄15 as 8 people out of the total 15 voted for the red party. Q
Q =( Green party, Red Party)
Q =( 7⁄15, 8⁄15)
We will now also calculate the distribution of two of our QID groups, Female and Male 10-20 which will be given by Pfemale and Pmale.
Pmale = ( 3⁄5, 2⁄5)
Pfemale = ( 1⁄5, 4⁄5)
We then calculate the t-closeness value for those two QID groups.
E(P male , Q) = 1⁄2 [ | 3⁄5 -7⁄15 | + | 2⁄5 - 8⁄15 | ] = 0.13
E(P female , Q) = 1⁄2 [ | 1⁄5 -7⁄15 | + | 4⁄5 - 8⁄15 | ] = 0.27
These numbers indicate that privacy is greater in the 10-20 Male QID group is greater than it is in the 10-20 Female QID group.
In simple terms, if we look at the whole data set we have close to a 50/50 split in what political parties were voted for. If we consider the Female 10-20 QID group this is however not true as 4⁄5 voted for the Red party and only one person voted for the Green party.
A fundamental limitation of l-diversity is that changing distributions of the data usually greatly degrades the quality of the data, reducing the potential for further analysis.
Differential privacy differs from the previous privacy models in terms of the type of attack it is trying to protect against. k-annonymity protected against linking to a particular record in the data set, also known as a record linkage attack. l-diversity protects against record linkage too but also seeks to prevent an attacker linking an individual to an sensitive attribute, even if they can not link them to a specific record. This attack method is known as attribute linkage and is the same type of attack that that t-closeness protects against.
These previous models all required us to consider some aspects of the attackers method. In reality this is not a feasible way to ensure robust and long term privacy. With differential privacy it is possible to mathematically ensure the privacy when giving an output of an operation that was performed on the data.
While we have considered various attack methods we have not considered a key weakness. That is the attacker's ability to infer information about an individual by simply knowing they are included in a dataset. For example, imagine if the "Green party" from our example data set kept their own records of people who are registered members. Even if you were not able to link an individual to a record or an attribute in that dataset you have still been able to determine their political affiliation just by knowing they are included in that data set.
Differential privacy is focused on preventing an attacker from knowing if an individual is even in a given data set. Differential privacy states that if you have two data sets that vary by a single person then the outcome of any operation on either data set will basically be the same. Ɛ-differential is a concept that allows us to quantify the difference in these outcomes. The formal definition of differential privacy is given here.
P[ A(D1)=O ] ≤ eε . P[ A(D2)=O ]
We can break this equation down in simple terms: So we have two data sets D2 and D1 that vary by one individual. If we only consider the left hand side, P[ A(D1)=O ], it is the probability, P, of an operation, A, on the data set D1 being O. And this probability should be very close to that of the same operation on the data set D2 . To generate results of operations that adhere to this model noise is added to the system.
There are two main models for implementing differential privacy. While the formal definition is agreed upon, people often mean different things when they discuss implementations of differential privacy.
Local differential privacy
is a more challenging form of differential privacy to implement. In this model noise is added to the data before it is stored. This means no trusted party is needed to hold the data in its original form which ensures a high level of security. A common example of a naive implementation uses a coin flip.
In our example using political parties we could consider a case where we want to ask people who they voted for but keep that information private. Each person is asked to flip a coin. If the coin lands on heads they answer truthfully. If the coin lands on tails they flip the coin again. This time if the coin lands on heads they answer "Red party" and tails "Green party".
So for each person there is a 50% chance that their answer will be random.
We can now quantify this level of privacy. If we consider someone who actually voted for the "Green party" there is a 75% chance that their answer will be "Green party" once they have completed the coin flips and a 25% chance their answer will be the "Red party". This ensures data scientists are still able to gain insight from the results given enough entries but there is no certainty surrounding the answer of any individual.
Global differential privacy
Relies on a trusted party that holds the confidentiality data in its raw form. This might be a government for example. Once the data has been aggregated it is then possible to release differentially private metrics. These may be counts, summations etc. One of the reasons this model can be simpler to implement is that all the data is known meaning noise levels can be chosen appropriately.
Let's again consider our example data set. Let's consider an attacker who knows all the entries in the data set voted for apart from that of the first person (who voted for "Green party"). The total data set has 21 entries with 13 votes for the "Green party" and 8 for the "Red party". As the attacker knows who everyone in the data set has voted for apart from the first entry they can with 100% certainty say the first entry voted for the "Green party" by simply looking at the summary data. Differential privacy adds noise to these summary results in a way that prevents the attacker from significantly improving their best guess of who the first entry voted for ( in this case a 50/50 chance between two parties). A differentially private summary for instance may have said 14 people voted for the "Green party" and 6 for the "Red party". While this can give a good summary, especially when considering bigger numbers, it tells us nothing about any single indiviual.