How do we obfuscate or de-identify data to make it anonymous and share it publicly?

Right now, I am preparing a small dataset for public release by removing sensitive information. While working on it, I wondered: what are the best practices for dealing with private or sensitive polynomial attributes in a dataset?(*) I have heard that anonymity can be achieved through de-identification, obfuscation, anonymization, or permutation.

However, I would like to learn more about this topic in data science/analysis. I am particularly interested in the packages and concepts that can be used in R.

(*) That is, besides the obvious solutions of completely removing the sensitive attribute or using hashcode encryption. I am somewhat familiar with how this problem can be complicated by the ability to correlate attributes.

I have had a couple of talks about this, and I think you should know about three different concepts.

Different attributes

Attributes can be divided into three categories. Hard identifiers refer directly to a specific person, e.g. full name or passport number. Soft identifiers (often called quasi-identifiers) do not identify a person on their own, but a combination of them can, because the combination may be unique. Sensitive attributes are the ones whose disclosure infringes privacy and that should be protected.

Measuring privacy

There are several measures of how anonymous your dataset is. These measures help you focus on the attributes that need to be altered. In order of increasingly strong requirements, you have:

  1. A dataset is $k$-anonymous if every combination of soft identifiers occurs at least $k$ times.
  2. A dataset is $l$-diverse if for every combination of soft identifiers there are at least $l$ different values of the sensitive attributes.
  3. A dataset satisfies $t$-closeness if for every combination of soft identifiers, the distribution of sensitive attributes differs by no more than $t$ (measured by a distance such as the Earth Mover's Distance or KL-divergence) from the global distribution of sensitive attributes.
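
The three measures above can be computed from simple grouped counts. Here is a minimal sketch in plain Python (not R, purely for illustration), assuming a toy dataset; the records, column names, and helper functions are all hypothetical, and the third measure uses KL-divergence as the distance, which is only one possible choice.

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical toy records: "age" and "zip" are soft identifiers,
# "diagnosis" is the sensitive attribute.
records = [
    {"age": "20-30", "zip": "12**", "diagnosis": "flu"},
    {"age": "20-30", "zip": "12**", "diagnosis": "asthma"},
    {"age": "20-30", "zip": "12**", "diagnosis": "flu"},
    {"age": "30-40", "zip": "34**", "diagnosis": "flu"},
    {"age": "30-40", "zip": "34**", "diagnosis": "diabetes"},
]
SOFT = ("age", "zip")
SENSITIVE = "diagnosis"


def group_by_soft(rows, soft=SOFT, sensitive=SENSITIVE):
    """Map each combination of soft identifiers to its sensitive values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in soft)].append(row[sensitive])
    return groups


def anonymity(rows):
    """Largest k for which the rows are k-anonymous (smallest group size)."""
    return min(len(values) for values in group_by_soft(rows).values())


def diversity(rows):
    """Smallest number of distinct sensitive values within any group."""
    return min(len(set(values)) for values in group_by_soft(rows).values())


def closeness_kl(rows, sensitive=SENSITIVE):
    """Largest KL-divergence between a group's sensitive distribution
    and the global one (smaller means closer to the global distribution)."""
    overall = Counter(row[sensitive] for row in rows)
    total = len(rows)
    worst = 0.0
    for values in group_by_soft(rows).values():
        local = Counter(values)
        kl = sum(
            (count / len(values))
            * log((count / len(values)) / (overall[value] / total))
            for value, count in local.items()
        )
        worst = max(worst, kl)
    return worst


print(anonymity(records), diversity(records), round(closeness_kl(records), 3))
```

A low value from `anonymity` or `diversity`, or a high value from `closeness_kl`, points at the groups whose soft identifiers or sensitive values need further treatment.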

Privacy protecting modifications

Hard identifiers can be treated with masking or pseudonymization. Masking refers to deleting the hard identifier. A less drastic approach is pseudonymization, where each record's hard identifier is replaced by a pseudonym (e.g. a hash) whose link to the original is only known by the dataset creator.
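
As an illustration of both options, here is a small sketch in plain Python. It assumes a keyed hash (HMAC-SHA256) as the pseudonym, so that only someone holding the key can reproduce the mapping; the field names, the `mask` and `pseudonymize` helpers, and the key handling are hypothetical.

```python
import hashlib
import hmac
import secrets

# Secret key known only to the dataset creator. Generating it at run time is
# an assumption made for this sketch; in practice it would be stored securely.
SECRET_KEY = secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a hard identifier with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated only for readability


def mask(row: dict, hard_identifiers=("name", "passport")) -> dict:
    """Masking: drop the hard identifiers entirely."""
    return {k: v for k, v in row.items() if k not in hard_identifiers}


row = {"name": "Alice Example", "passport": "X1234567", "age": 19}
masked = mask(row)
pseudonymized = {**masked, "pid": pseudonymize(row["passport"])}
print(masked)
print(pseudonymized)
```

A keyed hash is used here rather than a plain hash because identifiers with a small value space (such as passport numbers) can otherwise be recovered by hashing every candidate value.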

Soft identifiers can be treated with suppression, generalisation or randomisation. Suppressing the values of a soft identifier refers to replacing them by a default value (e.g. '*' or the mode). Another strategy, called generalisation, is to group values that are alike (e.g. 'age 10-20' instead of age 19). A third way is to randomise quantitative variables by adding zero-mean noise.
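
The three treatments can be sketched in a few lines of plain Python. The function names, the bin width, and the choice of Gaussian noise are assumptions for illustration, not a prescribed method.

```python
import random


def suppress(value, placeholder="*"):
    """Suppression: replace the value by a default placeholder."""
    return placeholder


def generalise_age(age: int, width: int = 10) -> str:
    """Generalisation: replace an exact age by the bin that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def add_noise(value: float, sd: float = 1.0, rng=random) -> float:
    """Randomisation: add zero-mean Gaussian noise with standard deviation sd."""
    return value + rng.gauss(0.0, sd)


rng = random.Random(0)  # seeded only to make the sketch reproducible
print(suppress("1234AB"))
print(generalise_age(19))
print(add_noise(52000.0, sd=500.0, rng=rng))
```

Note that the bins are half-open here, so an age of 20 falls into '20-30' rather than '10-20'. The noise scale trades privacy against how useful the released values remain.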

Sensitive attributes can be treated with permutation or randomisation. Permutation refers to permuting the sensitive attributes within every combination of soft identifiers (within a dataset that is at least $2$-anonymous). Like soft identifiers, quantitative sensitive attributes can also be randomised by adding noise.
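
Permutation within groups could look like the following sketch in plain Python. The toy rows, the column names, and the `permute_sensitive` helper are hypothetical; the idea is only that sensitive values are shuffled among records that share the same soft identifiers.

```python
import random
from collections import defaultdict


def permute_sensitive(rows, soft, sensitive, rng=random):
    """Shuffle sensitive values among rows with identical soft identifiers."""
    index_groups = defaultdict(list)
    for i, row in enumerate(rows):
        index_groups[tuple(row[c] for c in soft)].append(i)

    result = [dict(row) for row in rows]  # copy, leave the input untouched
    for indices in index_groups.values():
        values = [rows[i][sensitive] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            result[i][sensitive] = value
    return result


rows = [
    {"age": "20-30", "zip": "12**", "salary": 31000},
    {"age": "20-30", "zip": "12**", "salary": 45000},
    {"age": "20-30", "zip": "12**", "salary": 38000},
    {"age": "30-40", "zip": "34**", "salary": 52000},
    {"age": "30-40", "zip": "34**", "salary": 61000},
]
shuffled = permute_sensitive(rows, ("age", "zip"), "salary", random.Random(1))
for row in shuffled:
    print(row)
```

Because values only move within a group, per-group statistics of the sensitive attribute are preserved, while the link between an individual record and its value is broken. In a group of size one the shuffle changes nothing, which is why the groups need at least two records.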

P.S. As an R package, you could consider sdcMicro, and this guideline.

