Data anonymization in Python

I am working on an industrial project that involves real data. The data contains sensitive information about company operations which cannot be disclosed publicly. As a result, I need to anonymize the original data before applying any machine learning algorithms. The anonymization includes changing:

  • the names of persons,

  • places,

  • geographical locations, etc.

I would like to know the best practices for anonymizing datasets. Ideally, I should be able to recover the original data after performing analysis on the anonymized dataset.

I went through the literature and looked over some answered questions. They are all based on cybersecurity concepts such as encryption and decryption algorithms, which I am not familiar with. Is there any way to slightly change the data without digging into cryptographic algorithms?

Topic data anonymization python data-cleaning machine-learning

Category Data Science


In general, I'd say the HIPAA standards are a good start. That would include separating the personally identifiable information (PII) from what doesn't have to be kept private. [1].

In all honesty, there aren't great standards for anonymizing geolocation data in a way that both protects privacy and allows for data analysis, and it's an area of interest for NIST. In fact, it was one of the subjects of the 2018 Unlinkable Data Challenge.

A detailed set of approaches can be found here.

Past that, I would refer you to what are known as the Cryptographic Right Answers: hash immediately, don't use MD5 or SHA-1, etc.
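As a minimal sketch of that advice, using only the Python standard library (the key and field values below are hypothetical): keyed hashing with HMAC-SHA-256 maps each sensitive value to a stable opaque token, which keeps joins and group-bys working on the anonymized data while avoiding the broken MD5/SHA-1 hashes and plain dictionary attacks.

```python
import hashlib
import hmac

# Secret key, stored separately from the dataset (hypothetical value).
SECRET_KEY = b"replace-with-a-randomly-generated-key"

def pseudonymize(value: str) -> str:
    """Deterministically map a sensitive value to an opaque token.

    The same input always yields the same token, so the anonymized
    column is still usable for analysis, but the token cannot be
    reversed without the secret key and a lookup table.
    """
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

token = pseudonymize("Acme Corp")
assert pseudonymize("Acme Corp") == token       # deterministic
assert pseudonymize("Acme Inc") != token        # distinct inputs, distinct tokens
```

To make this reversible, as the question asks, you would keep a private table mapping each token back to its original value.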


As far as I know, text anonymization is mostly considered a manual pre-processing step; I'm not aware of any reliable, fully automated method. The reliability of the process is usually crucial for legal and ethical reasons, which is why some amount of manual work is required.

That being said, the process can be made semi-automatic, especially if the scope of the information to be obfuscated is not too large. In your case a named-entity (NE) tagger could probably be applied to capture a large proportion of the entities.

Once all the entities have been annotated in the original data, it's straightforward to replace them automatically with placeholders. This can be done while keeping the original and anonymized versions aligned (typically by using a unique id for every entity).
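A minimal sketch of this replacement step, assuming the entities have already been identified (a hand-written list stands in here for a tagger's output, and the function names are illustrative):

```python
def anonymize(text, entities):
    """Replace each annotated entity with a unique placeholder.

    `entities` is an iterable of surface strings found by a tagger
    (or by hand). Returns the anonymized text plus the mapping
    needed to restore the original later.
    """
    mapping = {}
    # Replace longer strings first so "Acme" never clobbers "Acme Corp".
    for i, entity in enumerate(sorted(set(entities), key=len, reverse=True)):
        placeholder = f"ENT_{i:04d}"
        mapping[placeholder] = entity
        text = text.replace(entity, placeholder)
    return text, mapping

def deanonymize(text, mapping):
    """Restore the original text from the private placeholder table."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text

original = "Alice met Bob at Acme Corp in Berlin."
anon, table = anonymize(original, ["Alice", "Bob", "Acme Corp", "Berlin"])
assert deanonymize(anon, table) == original
```

The mapping table is what makes the process reversible: share only the anonymized text for analysis, and keep the table private to recover the original afterwards.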
