

Protecting data is essential, but it can have the unwanted side effect of preventing you from using it. There is a trade-off between data privacy and utility, which must also be balanced against the operational cost of protecting the data. This balancing act can't be predetermined; it has to be optimized for each individual application. By automating this optimization, you can gain more insight from your data while keeping it private.

Privacy vs utility

To demonstrate this trade-off, we consider a small data set that gives the political party a person has voted for, as well as their name, age, and gender. This is obviously private and sensitive information that needs to be protected. But what is the best way to protect this sensitive data if we want to share it with a data scientist to investigate the relationship between gender, age, and political leaning?

First, we mask the names, which removes the direct identifiers (IDs). This, however, is far from sufficient: people can still easily be identified using a combination of quasi-identifiers (QIDs) such as age and gender. This is known as a linkage attack.
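To make the linkage attack concrete, here is a minimal Python sketch on a hypothetical masked dataset (the records and field names are illustrative, not the demo's actual data). Even with names removed, every record is still uniquely identifiable from age and gender alone:

```python
from collections import Counter

# Hypothetical voter records with the names already masked;
# the quasi-identifiers (age, gender) are left in the clear.
records = [
    {"age": 34, "gender": "F", "party": "A"},
    {"age": 38, "gender": "F", "party": "B"},
    {"age": 51, "gender": "M", "party": "A"},
    {"age": 55, "gender": "M", "party": "C"},
    {"age": 67, "gender": "F", "party": "B"},
]

# A linkage attack succeeds wherever an (age, gender) combination is
# unique: anyone who knows a person's age and gender learns their vote.
qid_counts = Counter((r["age"], r["gender"]) for r in records)
unique = [qid for qid, n in qid_counts.items() if n == 1]
print(f"{len(unique)} of {len(records)} records are uniquely identifiable")
```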

We will employ two further privacy-enhancing techniques (PETs) to increase privacy. Record suppression removes specific records that are at risk of identification; this increases privacy at the cost of data quantity. Generalization replaces specific ages with age ranges, which can prevent attackers from identifying individuals but also reduces data quality.
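A sketch of how these two PETs might be applied (the data and helper names are hypothetical, not the demo's actual code): generalize ages into bands, then suppress any record whose quasi-identifier group is still too small.

```python
from collections import Counter

def generalize_age(age, width):
    # Generalization: replace an exact age with an age band of the
    # given width, e.g. 34 -> "30-39" for width 10.
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def suppress_below_k(records, k):
    # Record suppression: drop any record whose quasi-identifier
    # group (age band, gender) has fewer than k members.
    counts = Counter((r["age"], r["gender"]) for r in records)
    return [r for r in records if counts[(r["age"], r["gender"])] >= k]

records = [
    {"age": 34, "gender": "F"}, {"age": 38, "gender": "F"},
    {"age": 51, "gender": "M"}, {"age": 55, "gender": "M"},
    {"age": 67, "gender": "F"},
]

banded = [{**r, "age": generalize_age(r["age"], 10)} for r in records]
safe = suppress_below_k(banded, k=2)  # the lone "60-69" record is dropped
```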

If we want to optimize something, we need to be able to measure and quantify outcomes. For the purposes of this demo, we use k-anonymity as our privacy metric. We measure utility in two ways, one for each PET. For record suppression, we can simply take the percentage of records removed as a proxy for the loss of utility. For generalization, we can intuitively see that the larger the age range we generalize to (e.g. 20-year intervals vs. 2-year intervals), the lower the accuracy of any downstream analysis. Generalized information loss is a more formal expression of this intuition, which we will use as our second measure of utility.
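These metrics are straightforward to compute. The sketch below is a simple illustration, not the demo's implementation: the generalized-information-loss formula is one common formulation (interval width normalized by the total domain width, here an assumed age domain of 18-90), and other definitions exist.

```python
from collections import Counter

def k_anonymity(records, qids=("age", "gender")):
    # k is the size of the smallest group of records that share
    # the same quasi-identifier values.
    counts = Counter(tuple(r[q] for q in qids) for r in records)
    return min(counts.values())

def suppression_loss(n_original, n_kept):
    # Utility proxy for record suppression: fraction of records removed.
    return 1 - n_kept / n_original

def generalized_info_loss(width, age_min=18, age_max=90):
    # Utility proxy for generalization: the share of the age domain each
    # interval spans (0 = exact ages, 1 = fully generalized).
    return (width - 1) / (age_max - age_min)

rows = [{"age": "30-39", "gender": "F"}, {"age": "30-39", "gender": "F"},
        {"age": "50-59", "gender": "M"}]
print(k_anonymity(rows))  # the lone 50-59 male makes k = 1
```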


Utility vs Privacy, Organisational Cost


High k-anonymity = high privacy

High percentage suppressed = low utility

High generalized information loss = low utility

How would you set the data to give maximum privacy and utility?
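One way to automate this balancing act is a simple parameter sweep: for each candidate age-band width, generalize, suppress any group smaller than k = 2, and record the resulting privacy and utility scores. The sketch below uses hypothetical data and a simple generalized-information-loss formulation over an assumed 18-90 age domain; a real optimizer would search jointly over all PET parameters.

```python
from collections import Counter

records = [  # hypothetical masked voter records
    {"age": 34, "gender": "F"}, {"age": 38, "gender": "F"},
    {"age": 51, "gender": "M"}, {"age": 55, "gender": "M"},
    {"age": 67, "gender": "F"},
]

def band(age, width):
    # Generalize an exact age into a band, e.g. 34 -> "30-39" for width 10.
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

for width in (1, 5, 10, 20):
    banded = [(band(r["age"], width), r["gender"]) for r in records]
    counts = Counter(banded)
    kept = [b for b in banded if counts[b] >= 2]   # suppress groups below k=2
    k = min(Counter(kept).values()) if kept else 0
    suppressed = 1 - len(kept) / len(records)
    info_loss = (width - 1) / (90 - 18)            # assumed age domain 18-90
    print(f"width={width:2d}  k={k}  suppressed={suppressed:.0%}"
          f"  info_loss={info_loss:.2f}")
```

On this toy data, 10-year bands dominate 20-year bands: both reach k = 2 with 20% of records suppressed, but the narrower bands lose less information. That kind of comparison is exactly what an automated optimizer performs at scale.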

