By Frederick Purcell
It is easy to discuss privacy in data and Privacy Enhancing Techniques (PETs) without fully agreeing on what we mean by ‘Privacy’. This lack of clarity can lead to the wrong technologies being used, resulting in data not being protected in an appropriate way. Equally, we want to protect data only to the level required, so that valuable insights can still be gained.
The only way to optimise this balance between privacy and utility is to be able to measure the effect of applying PETs.
PETs are applied to data sets with many different privacy outcomes in mind: are you trying to prevent individuals from being identified as belonging to a data set, or are you trying to hide their personal information within a set? When you are considering which PETs to use for a particular application, having a clear understanding of how they will affect the privacy of your data is vital. There are many useful ways of defining privacy, depending on both the data and its intended use. Here we briefly discuss some different ways to think about data privacy and how each one protects your data.
While this is by no means an exhaustive list, its purpose is to demonstrate that ‘private’ does not always mean the same thing. Data might be deemed ‘private’ by one definition but not by another, making this consideration vitally important.
One of the most intuitive ways to consider privacy is by looking at data similarity. This class of privacy measures focuses directly on the data. The main concern is similarity within a data set and how a lack of it can leave individuals vulnerable to identification. For example, consider a data set in which everyone is aged 20-30 except for a single individual aged 70-80: the uniqueness of that entry reduces privacy. If, on the other hand, there were 100 individuals in the 70-80 age group, it would be much harder to link an individual to any single entry, even if you knew their age and knew that they were contained within the data set.
As such we consider data to have high privacy if values are well represented within the data set. Data similarity metrics make up many of the established privacy metrics such as k-anonymity and l-diversity.
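As a rough illustration of a data similarity metric, the sketch below computes the k of k-anonymity for a toy data set: the size of the smallest group of records that share the same values for a chosen set of attributes. The function name, the `age_band` field and the records themselves are hypothetical and chosen only to mirror the age example above.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for the given attributes (the k in k-anonymity)."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())

# Hypothetical toy data: five people aged 20-30 and one person aged 70-80.
records = [{"age_band": "20-30"} for _ in range(5)] + [{"age_band": "70-80"}]
k_unique = k_anonymity(records, ["age_band"])

# Adding 99 more people aged 70-80 means that entry is no longer unique.
balanced = records + [{"age_band": "70-80"} for _ in range(99)]
k_balanced = k_anonymity(balanced, ["age_band"])
```

With the single 70-80 entry the data set is only 1-anonymous. Once that age group is well represented, the smallest group becomes the five people aged 20-30.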
Another privacy model that has attracted a lot of interest is differential privacy. It follows the idea that the results of the same query on two data sets should be indistinguishable even if the data sets differ by a single individual. When two outcomes cannot be told apart, privacy is high, because it is not possible to determine with high certainty whether any one person is included in the data set. Differential privacy usually achieves this by adding noise to the data or to query results, which has the downside that the results are no longer exact. Applying techniques such as differential privacy correctly can be challenging, however, as there are many pitfalls that can break the privacy guarantee. If correctly applied, it does have the potential to guarantee privacy while maintaining utility.
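One common way of adding such noise is the Laplace mechanism, shown here as a minimal sketch for a counting query. It is an illustration under stated assumptions, not a production implementation: the function names, the `epsilon` value and the ages are hypothetical, and real deployments need to deal with pitfalls (such as privacy budget accounting and floating-point issues) that this example ignores.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    """Noisy count of the values matching `predicate`.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical ages; the query asks how many people are over 65.
ages = [23, 25, 27, 29, 71, 74]
rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy = private_count(ages, lambda age: age > 65, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy, while a larger `epsilon` gives answers closer to the true count. This is the privacy and utility trade-off in a single parameter.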
Uncertainty is another intuitive way to think about data privacy. The greater the uncertainty in an attacker's guess, the greater the privacy of a data set. A common metric using this definition of privacy is entropy. Entropy metrics measure how hard it is to predict the value of a random variable. The less likely an attacker is to correctly predict a variable, the higher the privacy. While this logic holds, it does not account for the fact that even uncertain guesses can be correct.
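As a small sketch of an entropy metric, the code below computes the Shannon entropy (in bits) of the values in a single column. The function name and the example values are illustrative assumptions.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the observed distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally likely values: an attacker has to guess among all of them.
uniform_entropy = shannon_entropy(["a", "b", "c", "d"])

# A single repeated value: nothing is left to guess.
constant_entropy = shannon_entropy(["a", "a", "a", "a"])
```

The uniform column gives 2 bits of entropy and the constant column gives 0. Note that the attacker facing the uniform column still guesses correctly one time in four, which is the limitation described above.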
Error-based metrics consider the attacker directly, as they measure privacy in terms of how correct an attacker's estimate is. Mean Squared Error, for example, determines privacy to be high if there is a large error between the adversary's estimate and the ground truth. While these metrics can provide useful insights, they require us to make assumptions about our attacker. As an attacker's resources might change, these assumptions are not guaranteed to hold. As with uncertainty metrics, it must also be considered that, while the attacker may not have complete certainty, they may still be able to infer significant amounts of information.
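A minimal sketch of the Mean Squared Error calculation is given below. The attacker's estimates are hypothetical numbers; in practice they would come from whatever attack model you assume.

```python
def mean_squared_error(estimates, truth):
    """Average squared difference between an attacker's estimates and
    the true values. A larger result means higher privacy."""
    if len(estimates) != len(truth):
        raise ValueError("estimates and truth must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)

# Hypothetical example: an attacker estimates the ages of two individuals.
true_ages = [32, 44]
attacker_estimates = [30, 40]
mse = mean_squared_error(attacker_estimates, true_ages)
```

Here the result is 10.0, the average of the squared errors 4 and 16. The figure only means something relative to the attacker you assumed, so a better-resourced attacker could produce a much lower error on the same protected data.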
As with error-based metrics, information gain and loss metrics take the attacker into consideration and not just the data. For example, it is possible to define privacy in terms of how much information an attacker could gain. The less information is leaked, the higher the privacy of the protected data. The amount of leaked information metric quantifies this simply by counting the number of users that can be compromised by an attacker. The smaller the number of identifiable users, the higher the privacy.
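One simple way to sketch this count is to compare an attacker's re-identification guesses against the true identities behind each record. Everything here is an assumption for illustration: the record keys, the names and the idea that the attacker's guesses are available as a mapping.

```python
def leaked_users(true_identities, attacker_guesses):
    """Count the records whose owner the attacker identified correctly.

    Both arguments map a record key to a user identifier. Records the
    attacker did not guess at all are treated as not compromised.
    """
    return sum(
        1
        for record, user in true_identities.items()
        if record in attacker_guesses and attacker_guesses[record] == user
    )

# Hypothetical example: three protected records and two attacker guesses.
true_identities = {"r1": "alice", "r2": "bob", "r3": "carol"}
attacker_guesses = {"r1": "alice", "r2": "dave"}
compromised = leaked_users(true_identities, attacker_guesses)
```

In this example only one of the three users is compromised. Dividing by the total number of users would turn the count into a proportion that can be compared across data sets of different sizes.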
This is a broad overview of a few ways to think about privacy, but it can be helpful to consider what type of privacy you are looking for and what type of data breaches you are protecting against. Understanding your privacy needs is a crucial part of managing and utilising your data. Through proper management, it is possible to keep your data inherently private while allowing your teams to gain meaningful insight, giving you the competitive edge.
Organisations will find that, without a unified approach to navigating their wealth of PETs, their architecture and data strategy will suffer from unnecessary complexity and computational demands. The DataSecOps platform removes this challenge, empowering data and streamlining intense processes.