It is important to understand the fundamentals of individual Privacy Enhancing Techniques (PETs). However, when considering the application of PETs to a data set, understanding the data that it contains is equally, if not more, important. Only by understanding the privacy needs of your data set can you correctly choose and apply appropriate PETs.
When we consider a data set it is useful to classify each attribute as identifying (ID), quasi-identifying (QID), sensitive (SA) or non-sensitive (NSA). ID values are direct identifiers such as names or passport numbers. A QID, on the other hand, is a personal trait such as age or gender. On its own a QID will not allow identification of an individual, but if enough QIDs are combined they could be used to identify an individual. The SA is the data we are trying to protect and might be something like an individual's health status, income or drug use history. An NSA is any remaining attribute that is neither identifying nor sensitive. In most cases we are trying to analyse the SA while preventing that information from being linked to an individual.
Sample Data Set
To demonstrate some of the PETs discussed here we consider an example data set.
Non-perturbation techniques are a broad class of privacy enhancing techniques that either replace or remove the original value to increase the privacy of the data set. These techniques can offer high levels of privacy but often result in a loss of utility of the data. It is important to note that none of these techniques ensures anonymity by itself.
Masking generally describes any technique that replaces the values of a certain attribute without any consideration of the original value. This ensures that nothing about the original values can be inferred from the masked values. There are a number of different ways to perform masking, with substitution and nulling out being the most common.
Substitution: Substitution masking replaces the values of an attribute with values selected at random from a predetermined substitution table. For example when protecting date of birth a random selection of dates can be used as a substitution table. As such this technique allows specific data types to be maintained, making it particularly useful when using the protected data to test applications.
Example data: In our example data we use substitution to remove the direct identifier of the individual's name, replacing it with a random value from our substitution group. In this example we use “Andy”, “Rebecca”, and “Chris” as our substitution group.
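Substitution masking can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the records, the `substitute` helper and the fixed random seed are assumptions made for the example, and only the substitution group comes from the text above.

```python
import random

# Hypothetical records; the values are invented for illustration.
records = [
    {"name": "Gwen", "age": 34, "income": 41000},
    {"name": "Peter", "age": 52, "income": 38000},
    {"name": "Julia", "age": 29, "income": 45000},
]

# Substitution group taken from the example in the text.
SUBSTITUTION_TABLE = ["Andy", "Rebecca", "Chris"]


def substitute(rows, attribute, table, rng):
    """Return copies of rows with `attribute` replaced by a random table entry.

    The replacement is chosen without looking at the original value.
    """
    return [{**row, attribute: rng.choice(table)} for row in rows]


# The seed is fixed only so that the sketch is reproducible.
masked = substitute(records, "name", SUBSTITUTION_TABLE, random.Random(0))
for row in masked:
    print(row)
```

Because the replacement values are still plausible names, the masked records keep the type and shape an application under test would expect.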
Nulling out: Nulling out, sometimes referred to as masking out, is another powerful protection technique where the attribute value is replaced with a fixed value (e.g. turning a credit card number into xxxx-xxxx-xxxx-xxxx). Again this method allows length or type to be maintained and effectively signals to any data user that this data has been protected. It can also be applied to only part of the data, for example leaving the last four digits of a credit card number visible (xxxx-xxxx-xxxx-1234).
Example data: We can also improve the privacy of this data set by nulling out the ID values.
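As a rough sketch of nulling out, the hypothetical helper below replaces every letter and digit with a fixed character while keeping separators, and can optionally leave the last few characters visible. The function name, its parameters and the card number are assumptions made for illustration.

```python
def null_out(value, keep_last=0, mask_char="x"):
    """Replace letters and digits with mask_char, keeping separators.

    The final `keep_last` letters or digits are left visible.
    """
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - keep_last, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


card = "1234-5678-9012-3456"  # invented card number
print(null_out(card))               # xxxx-xxxx-xxxx-xxxx
print(null_out(card, keep_last=4))  # xxxx-xxxx-xxxx-3456
```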
Local suppression is the process of applying masking on particular values to increase privacy without having to mask all values of that attribute. Usually it is applied to rare values in a dataset that can increase the risk of re-identification. For example when looking at people's income, very high earners and very low earners could have their income masked as these values will be less common.
Example data: If we further consider the data set with the direct identifiers nulled out, we can see there are a few outliers in terms of earnings and age. This puts these individuals at increased risk of re-identification. We can remove these values using local suppression.
Record suppression involves masking the values for particular people within a data set. This could, for example, be because they have opted out of having their data used.
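One possible way to express local suppression is to mask only the values that fall outside a chosen range. The records, the `suppress_outside` helper and the bounds below are illustrative assumptions, not recommended thresholds.

```python
# Hypothetical records with the names already nulled out; values are invented.
records = [
    {"name": "xxxx", "age": 34, "income": 41000},
    {"name": "xxxx", "age": 91, "income": 38000},
    {"name": "xxxx", "age": 29, "income": 250000},
    {"name": "xxxx", "age": 45, "income": 43000},
]


def suppress_outside(rows, attribute, low, high, marker=None):
    """Mask only the values of `attribute` that fall outside [low, high]."""
    return [
        {**row, attribute: row[attribute] if low <= row[attribute] <= high else marker}
        for row in rows
    ]


# The bounds are assumptions chosen to catch the outliers in this toy data.
protected = suppress_outside(records, "age", 18, 80)
protected = suppress_outside(protected, "income", 20000, 100000)
for row in protected:
    print(row)
```

Only the rare values are masked, so the common values of each attribute remain available for analysis.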
Example data: All the people, with the exception of Gwen and Peter, may have given permission for the data to be used. As such record suppression can be used to remove those individuals' entries.
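Record suppression amounts to filtering out whole entries. In the sketch below the records and the `suppress_records` helper are hypothetical; the opt-out list follows the example in the text.

```python
# Hypothetical records; the values are invented for illustration.
records = [
    {"name": "Gwen", "age": 34, "income": 41000},
    {"name": "Peter", "age": 52, "income": 38000},
    {"name": "Julia", "age": 29, "income": 45000},
    {"name": "Rose", "age": 45, "income": 43000},
]

# People who have not given permission for their data to be used.
OPTED_OUT = {"Gwen", "Peter"}


def suppress_records(rows, key, excluded):
    """Drop every row whose `key` value is in the excluded set."""
    return [row for row in rows if row[key] not in excluded]


released = suppress_records(records, "name", OPTED_OUT)
for row in released:
    print(row)
```

Note that this step relies on the identifier, so in a pipeline it would have to run before the name attribute is masked.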
The process of anatomisation separates an attribute from the main data. This removes the link between a sensitive attribute and data that could be used to identify the person it relates to. This method can offer both high utility and privacy but does mean the link between an attribute and other information in the data set is lost.
Example data: We can consider an example where a data scientist is interested in investigating gender pay gaps. The information required for this analysis allows anatomisation to be used to isolate the attributes that are of interest.
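A minimal sketch of anatomisation for the pay gap example splits the records into two tables that share no key. The records and the `anatomise` helper are assumptions; shuffling the extracted rows is an extra precaution added here so that row position cannot be used to rejoin the tables.

```python
import random

# Hypothetical records; the values are invented for illustration.
records = [
    {"name": "Gwen", "age": 34, "gender": "F", "income": 41000},
    {"name": "Peter", "age": 52, "gender": "M", "income": 38000},
    {"name": "Julia", "age": 29, "gender": "F", "income": 45000},
    {"name": "Rose", "age": 45, "gender": "F", "income": 43000},
]


def anatomise(rows, attributes, rng):
    """Split rows into (extracted, remainder) tables with no shared key."""
    extracted = [{a: row[a] for a in attributes} for row in rows]
    rng.shuffle(extracted)  # assumption: remove the positional link as well
    remainder = [
        {k: v for k, v in row.items() if k not in attributes} for row in rows
    ]
    return extracted, remainder


# The seed is fixed only so that the sketch is reproducible.
pay_gap_table, remainder = anatomise(records, ["gender", "income"], random.Random(0))
print(pay_gap_table)
print(remainder)
```

The analyst receives only `pay_gap_table`, which is enough to compare incomes by gender but cannot be tied back to a name or an age.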
When considering a data set, privacy can often be improved but utility retained by reducing specificity of the data. Generalisation achieves this by replacing a value with a range (e.g. age ranges). While it can easily be applied to numerical data types, generalisation can also be applied to other data. For example towns can be generalised to countries. The purpose of techniques like this is to make any combination of identifying features more common, thereby reducing the risk of re-identification.
Example data: After nulling out the name attribute, privacy can be further increased by reducing the specificity of the age QID attribute. In this example a 10-year age range is used. By increasing or decreasing the age range the specificity of the data can be further controlled.
The process of pseudonymisation replaces values with a pseudonym. A key differentiator of this technique is that it is reversible. The link between pseudonyms and the original values is kept in a separate location and can later be used to reverse the pseudonymisation process. Any data set that has been pseudonymised cannot, however, be considered anonymised and as such requires different management. Pseudonymisation can also be applied with different policies, resulting in different levels of privacy and utility. The pseudonymisation policies are:
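Generalising an age into a range can be sketched as follows. The `generalise_age` helper and the sample ages are hypothetical; the `width` parameter plays the role of the adjustable age range described above.

```python
def generalise_age(age, width=10):
    """Replace an exact age with the `width`-year range that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


ages = [34, 52, 29, 45]  # invented values
print([generalise_age(a) for a in ages])           # ['30-39', '50-59', '20-29', '40-49']
print([generalise_age(a, width=20) for a in ages])  # ['20-39', '40-59', '20-39', '40-59']
```

A wider range puts more people in each group, which raises privacy at the cost of detail.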
Deterministic: Each value is mapped to a pseudonym and replaced by that pseudonym every time it appears in the data set. This retains information such as distribution of values. While this can offer analyst insight into the data, the same information about distribution can also be used to re-identify values.
Document randomised: In this policy pseudonyms are assigned randomly, so the same value will not always map to the same pseudonym within the data set. However, if the same data set is pseudonymised again using this policy, the output will always be the same.
Fully randomised: This policy behaves much the same way as the document randomised policy with the key difference that each time a data set is pseudonymised it will generate a different output. While it further increases privacy it prevents any single value from being tracked, which may be required for tasks such as application testing.
There are a number of different ways to implement pseudonymisation depending on the use case:
Encryption-based pseudonymisation creates a pseudonym for each value by encrypting it. The behaviour can be set to be deterministic, document randomised or fully randomised by using a different salt before encryption. It is important to note that the use of deterministic encryption might not meet the privacy requirements of all regulatory bodies.
Counter-based pseudonymisation is a simple technique, but it can be highly effective. Each value is replaced by a count (1, 2, 3, etc.). The nature of this technique means it is best suited to either the deterministic or the document randomised policy.
Random number generator
This technique is better suited to the fully randomised policy and replaces each value with a randomly generated number.
Example data: Deterministic Counter: Here we can still see that Julia and Rose appear twice within our data set as Rose is pseudonymised by ‘3’ and Julia by ‘2’.
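A deterministic counter can be sketched as a dictionary that hands out the next integer the first time each value is seen. The order of the names and the helper name are assumptions chosen so that Julia and Rose receive the pseudonyms mentioned above; the returned mapping stands in for the lookup that would be stored separately.

```python
def pseudonymise_counter(values):
    """Replace each value with a counter; repeated values reuse their pseudonym.

    Returns the pseudonymised list and the mapping needed to reverse it.
    """
    mapping = {}
    pseudonyms = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
        pseudonyms.append(mapping[value])
    return pseudonyms, mapping


names = ["Gwen", "Julia", "Rose", "Julia", "Peter", "Rose"]  # invented order
pseudonyms, mapping = pseudonymise_counter(names)
print(pseudonyms)  # [1, 2, 3, 2, 4, 3]

# Reversal needs the mapping, which must be kept apart from the released data.
reverse = {pseudonym: name for name, pseudonym in mapping.items()}
restored = [reverse[p] for p in pseudonyms]
```

The repeated 2 and 3 in the output show how a deterministic policy preserves the distribution of the original values.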
As the name suggests perturbation methods “perturb” the data in some form. This can be done by either moving the values in relation to each other or by changing the data itself. Perturbation methods are extremely powerful and can offer the highest level of privacy while also retaining high levels of utility. They are however more challenging to apply correctly and it can often be difficult to measure the level of privacy achieved.
Noise addition changes values by adding random noise, often drawn from a known distribution such as a Gaussian. The technique is well suited to numerical data and can achieve high levels of privacy. Determining the right level of noise to add to the data is, however, a challenging problem. Too much noise and all meaning is lost, whereas too little noise will result in a high risk of re-identification.
Example data: In the example data noise addition was used to randomly change the income values. When combined with PETs such as masking on other attributes this can generate very high levels of privacy.
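Adding Gaussian noise to a numeric attribute might look like the sketch below. The income values, the `add_gaussian_noise` helper and the standard deviation of 2000 are assumptions; choosing a suitable noise level for real data needs its own analysis.

```python
import random


def add_gaussian_noise(values, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to each value."""
    return [value + rng.gauss(0, sigma) for value in values]


incomes = [41000, 38000, 45000, 43000]  # invented values

# The seed is fixed only so that the sketch is reproducible.
noisy = add_gaussian_noise(incomes, 2000, random.Random(0))
print([round(value) for value in noisy])
```

Because the noise has zero mean, aggregate statistics over a large data set stay close to the originals while individual values are changed.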
Permutation techniques relate to data being moved within a data set. The simplest form is shuffling. In this process all entries of a specific attribute are randomly shuffled, meaning no meaningful connection can be made between an individual in the data set and the shuffled attribute. This can give very high utility as it leaves the original data visible.
Example data: In this example both the values in the Age and Income columns were shuffled. This still allows analysis to be performed on the data while making it more challenging to identify any one individual or link an individual to their sensitive attributes.
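Shuffling can be sketched by permuting one column at a time. The records and the `shuffle_attribute` helper are hypothetical, and the seed is fixed only for reproducibility.

```python
import random

# Hypothetical records; the values are invented for illustration.
records = [
    {"name": "xxxx", "age": 34, "income": 41000},
    {"name": "xxxx", "age": 52, "income": 38000},
    {"name": "xxxx", "age": 29, "income": 45000},
    {"name": "xxxx", "age": 45, "income": 43000},
]


def shuffle_attribute(rows, attribute, rng):
    """Randomly reassign the values of one attribute across the rows."""
    values = [row[attribute] for row in rows]
    rng.shuffle(values)
    return [{**row, attribute: value} for row, value in zip(rows, values)]


rng = random.Random(0)
shuffled = shuffle_attribute(records, "age", rng)
shuffled = shuffle_attribute(shuffled, "income", rng)
for row in shuffled:
    print(row)
```

Each column keeps exactly the same set of values, so per-attribute statistics are unchanged. Shuffling the columns independently does, however, break the relationship between age and income.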
Generally micro-aggregation can be thought of as a clustering technique where values of a specific attribute are grouped. Every value in that group is then assigned the same value, for example the mean of the group. This has the effect of increasing the number of occurrences of any combination of identifying attributes, thereby increasing privacy.
Example data: The income values were grouped and assigned their mean value. While this has reduced the number of unique values, reducing the risk of identification, the outliers in this data set were not protected well.
Synthetic data generation is the process of generating a data set that follows the distribution of the original data. It is also possible to use a known distribution to generate the data or a mix of a known distribution and the original distribution. Using a known distribution offers the highest privacy, but with careful application a mix of distributions can offer both high privacy and utility.
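One simple way to sketch micro-aggregation is to sort the values, cut them into groups of at least k, and replace each value with its group mean. The grouping rule, the helper name and the income values are assumptions; real implementations often use more careful clustering.

```python
def micro_aggregate(values, k):
    """Replace each value with the mean of its group of at least k neighbours.

    Groups are formed from consecutive values in sorted order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Merge a short trailing group so that every group has at least k members.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    result = [None] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


incomes = [38000, 41000, 45000, 39000, 250000, 43000]  # invented values
print(micro_aggregate(incomes, k=2))
# [38500.0, 42000.0, 147500.0, 38500.0, 147500.0, 42000.0]
```

The 250000 outlier pulls its group mean far from the rest, which illustrates why outliers are not well protected by this approach.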
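As a very rough sketch, a synthetic numeric attribute can be drawn from a distribution fitted to the original values. The example assumes that a normal distribution is an adequate model for income, which will often not hold for real data; the helper name and values are hypothetical.

```python
import random
import statistics


def generate_synthetic(values, n, rng):
    """Draw n synthetic values from a normal distribution fitted to `values`."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


incomes = [38000, 41000, 45000, 39000, 43000]  # invented values

# The seed is fixed only so that the sketch is reproducible.
synthetic = generate_synthetic(incomes, 100, random.Random(0))
print(round(statistics.mean(synthetic)), round(statistics.stdev(synthetic)))
```

No synthetic value corresponds to a real individual, but the fitted parameters are still derived from the original data. Replacing them with parameters from a known distribution would correspond to the higher-privacy option described above.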
Cryptographic techniques are powerful tools that can make protecting and re-identifying data sets simple. Many of these techniques can be implemented as part of other PETs. For example, deterministic encryption can be used as part of pseudonymisation. As with other PETs it is again important to note that no single cryptographic technique can ensure anonymity of a data set. Encrypted data is sometimes treated differently under certain regulations as it is a reversible process if access to the cryptographic key is gained. As such, key management becomes an incredibly important part of any cryptographic PET. While the field of cryptography is vast, we are focusing on techniques that allow some form of analysis to be performed on the data in its protected state.
Under deterministic encryption if the same value is encrypted it will result in the same cipher text (encrypted value). While the cipher text has no intrinsic meaning this technique allows for exact matches to be found and distributions to be analysed through statistical analysis.
When considering certain data sets, order can often tell us a lot about the data (e.g. earns more or earns less). When we encrypt an attribute using order-preserving encryption the cipher text that replaces the original values will be ordered in the same way as the original values. This again allows analysis to be performed on the protected data. For example it can be searched over particular ranges of the protected attribute.
For many applications data is protected to give realistic datasets that can be used in test environments. Likewise data scientists often want to place protected data in existing applications that might have format requirements. Format-preserving encryption retains the format of the original data. For example a 16 digit bank card would be encrypted to another 16 digit number.
Homomorphic encryption transforms the sensitive data into randomised encrypted values. The power of homomorphic encryption is that it is possible to perform operations on the encrypted value and then decrypt the solution. This solution will be the same as if the operations were performed on the original data. For example you might want to protect people's income but want to work out the average income of a group of people. Those encrypted income values can then be summed and the result decrypted, allowing analysis without exposing any individual's income.
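The income example can be illustrated with a toy version of the Paillier scheme, which supports addition on encrypted values. The text does not name a scheme, so Paillier is used here as a stand-in. The tiny primes, helper names and income values are assumptions for illustration only and offer no real security.

```python
import math
import random

# Toy parameters: real deployments use primes of a thousand bits or more.
P, Q = 1009, 1013
N = P * Q            # public modulus; plaintext sums must stay below N
N_SQ = N * N
G = N + 1            # standard simple choice of generator
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private key: lcm(P-1, Q-1)
MU = pow(LAMBDA, -1, N)  # modular inverse (Python 3.8+)


def encrypt(m, rng):
    """Encrypt integer m (0 <= m < N) with fresh randomness r."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c):
    """Recover the plaintext from ciphertext c using the private key."""
    return ((pow(c, LAMBDA, N_SQ) - 1) // N) * MU % N


def add_encrypted(c1, c2):
    """Multiplying ciphertexts yields an encryption of the sum of the plaintexts."""
    return (c1 * c2) % N_SQ


rng = random.Random(0)  # fixed seed for reproducibility; not secure randomness
incomes = [41000, 38000, 45000]  # invented values
ciphertexts = [encrypt(m, rng) for m in incomes]

encrypted_total = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_total = add_encrypted(encrypted_total, c)

total = decrypt(encrypted_total)
print(total, total / len(incomes))
```

The party summing the ciphertexts never sees an individual income. Only the holder of the private key can decrypt the total and divide by the group size to obtain the average.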
PETs make up a small part of the broader subject of data privacy. For more information on other topics, please look at our other resources.