Name Anonymization Software

Although I have seen a few good questions asked about data anonymization, I was wondering if there were answers to this more specific variant.

I am seeking a tool (or to design one) that will anonymize human names from a specific country, particularly first names in unstructured text. Many of the tools I have seen address the wider dimensions of data anonymization, with an equal focus on dates of birth, addresses, etc.

A key requirement is near-absolute recall. The major pitfalls, as far as I can see, are diminutive variants ("Tommy" instead of "Thomas", "Ben" instead of "Benjamin", etc.) and typos. These two factors rule out a simple regex built from a database of names (drawn from censuses, etc.).

Topic anonymization text-mining

Category Data Science


You have a few problems here. The first is cleaning your data. That's a whole separate issue from anonymization and belongs in another question if you're still having problems with it.

The second is your anonymization. Once you have an identifier you're satisfied with (it sounds like you're using people's real names), try hashing the names to generate a new id. This id is useful because the same name always maps to the same id, so you can look up the id for any name you already know, but you can't derive the real name from the hashed id alone (provided your hashing algorithm is good).
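As a rough illustration of the hashing idea, here is a minimal Python sketch using only the standard library. It makes assumptions the answer does not state: it uses a keyed hash (HMAC-SHA256) rather than a plain hash, because first names come from a small, guessable set and an unkeyed hash could be reversed by hashing a census list. The names `SECRET_KEY` and `pseudonym_id` are hypothetical, as are the normalisation step and the 16-character truncation.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be loaded from secure storage.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonym_id(name: str, key: bytes = SECRET_KEY) -> str:
    """Map a name to a stable identifier that cannot be reversed without the key."""
    # Normalise case and surrounding whitespace so "Thomas" and " thomas " match.
    normalised = name.strip().casefold().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).hexdigest()
    # Truncated for readability; a shorter id raises the collision risk.
    return digest[:16]


print(pseudonym_id("Thomas"))
print(pseudonym_id(" thomas "))  # same id as the line above
print(pseudonym_id("Tommy"))     # different id: the diminutive is a different string
```

Note that a diminutive or a typo hashes to an unrelated id, so any grouping of variants has to happen before hashing.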



I don't think you really need special software, but rather to employ existing tools, such as encryption algorithms.

Why not just encrypt the names with any key-based algorithm and store the key securely?

If you didn't need to be able to recover the names, but just to identify variation to the level of differences in diminutives, then you could simply use hashing rather than encryption.
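A small sketch of how hashing could replace names inside free text, again in Python with only the standard library. Everything named here is an assumption for illustration: `KNOWN_NAMES` stands in for a real census-derived list that already includes diminutives, and `SECRET_KEY`, `token_for` and `mask_names` are hypothetical helpers. The lookup is exact, so it does not catch typos.

```python
import hashlib
import hmac
import re

# Hypothetical key and name list; a real list would come from census data.
SECRET_KEY = b"replace-with-a-long-random-secret"
KNOWN_NAMES = {"thomas", "tommy", "benjamin", "ben"}


def token_for(name: str) -> str:
    """Return a placeholder token derived from a keyed hash of the name."""
    data = name.casefold().encode("utf-8")
    digest = hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()
    return f"<NAME_{digest[:8]}>"


def mask_names(text: str) -> str:
    """Replace every word found in KNOWN_NAMES with its token."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return token_for(word) if word.casefold() in KNOWN_NAMES else word

    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    return re.sub(r"[^\W\d_]+", replace, text)


print(mask_names("Tommy met Thomas, and later tommy called Ben."))
```

Because each spelling is hashed as-is, "Tommy" and "Thomas" receive different tokens, which keeps the variation visible without storing the names themselves.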

I'm not sure what environment you want to carry this out in, but a language such as R, or an SQL/NoSQL database, could easily do this programmatically.
