How to protect data from internal data scientists?

In our company we want to protect data privacy internally. That is, we want to find a way to anonymize the data so that the data science team members cannot expose it, yet can still use it for modelling.

I googled and read about pseudonymization. But does it destroy the data? I didn't find any reliable source showing how to use it in practice.



Your question:
You seem to be asking a managerial/policy question phrased like a data-science question. The policy question is "how do I keep customer data private from internal data scientists without harming its usability".

The data-science question is something like "how do I transform data so that the privacy and identifiability of its original form cannot be deduced, while not disabling other analytical processes". This is the seed of the zero-information paradox.

tl;dr
I think your policy-person is asking a question equivalent to "how do I make my computer hacker-proof", where the only perfect answer is not to have the computer. There are going to be levels of "resistant" but there is no such thing as "hacker-proof".

Problem proposition:
One of the problems with this question is that the vast majority of policy-askers have almost no specialized technical expertise compared to the people you are trying to "selectively impede". An answer explained at a level they can understand might keep an idiot out, but it doesn't actually stop data exfiltration.

Consider how data aggregation with cell phones works.
https://eclecticlight.co/2015/08/24/data-aggregation-how-it-can-break-privacy/

Many policy folks asking the question can get an answer they think means "yes" when in fact it means "no", and a persistent or clever data person can figure that out while the policy person can't.

Simple example:
Let's make a process where we replace each last name with a number. "Smith" becomes 1, "Jones" becomes 2, and so on. Is that process reversible using only the output? Given only a list of numbers, can I get back to the names? Yes, though it varies. If I look at the frequency of last names and compare it with the frequency of the numbers, I should be able to do a decent job of de-anonymizing the common names. To say this another way: if 15% of last names are "Smith" and 15% of my output list of numbers are "1", then there is a really good chance that 1 means Smith.
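As a rough sketch of that frequency attack (the counts and name frequencies below are made up, and pandas is used purely for illustration):

    import pandas as pd

    # Toy "anonymized" output: last names were replaced by opaque numeric codes.
    anonymized_codes = pd.Series([1, 2, 1, 3, 1, 2, 1, 4, 1, 2])

    # Publicly known relative frequencies of last names (illustrative numbers).
    public_name_freq = pd.Series({"Smith": 0.40, "Jones": 0.30, "Brown": 0.20, "Lee": 0.10})

    # Observed frequency of each code in the "anonymized" output.
    code_freq = anonymized_codes.value_counts(normalize=True)

    # Match by rank: the most frequent code probably maps to the most frequent name.
    guess = dict(zip(code_freq.sort_values(ascending=False).index,
                     public_name_freq.sort_values(ascending=False).index))
    print(guess)  # e.g. {1: 'Smith', 2: 'Jones', 3: 'Brown', 4: 'Lee'}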

That is a toy example, but the MAC address of your cell phone is known and sold. If all the data in the world is anonymized except the MAC, and I can go to a third party and buy a list of MAC-to-identity mappings, then your data isn't anonymized at all. You missed the baby in that bathwater.
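A similarly hedged sketch of that linkage problem: if even one quasi-identifier such as the MAC survives, a single join against a purchased mapping re-identifies the records (all data below is invented):

    import pandas as pd

    # "Anonymized" dataset: names stripped, but the device MAC address remains.
    anonymized = pd.DataFrame({
        "mac": ["aa:bb:cc:01", "aa:bb:cc:02", "aa:bb:cc:03"],
        "visits_per_week": [3, 7, 1],
    })

    # Mapping bought from a hypothetical third-party data broker.
    purchased = pd.DataFrame({
        "mac": ["aa:bb:cc:01", "aa:bb:cc:02"],
        "identity": ["Alice Smith", "Bob Jones"],
    })

    # One join and the "anonymized" rows are tied back to real identities.
    reidentified = anonymized.merge(purchased, on="mac", how="left")
    print(reidentified)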


To add to the other answers:

Pseudonymization is a way to anonymize specific data, mostly customer data, by removing personally identifying variables such as names and email addresses and adding a randomized but unique ID.

This process allows you to add and share data via a joint ID while keeping knowledge of the other data entirely on one side.

A common use case is market research, where the client company only has its own customer data plus the pseudonymous ID, and the research company only has the survey data plus the pseudonymous ID. They can then share only the parts of the data they need to share without exposing more sensitive data.

This could be helpful in your case if your DS team needs to model with customer data but should not know the customers' names, etc., while still being able to report, for example, individual lead scores back to you.

You do not destroy the data! You simply add a new unique but meaningless identifier.
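A minimal sketch of that workflow, assuming made-up column names, with uuid providing the random but unique identifier:

    import uuid
    import pandas as pd

    # Original customer data with direct identifiers (column names are assumptions).
    customers = pd.DataFrame({
        "name": ["Alice Smith", "Bob Jones"],
        "email": ["alice@example.com", "bob@example.com"],
        "lifetime_value": [1200.0, 340.0],
    })

    # Assign a random but unique pseudonymous ID to each customer.
    customers["pseudo_id"] = [str(uuid.uuid4()) for _ in range(len(customers))]

    # The mapping table stays with the data owner only.
    id_mapping = customers[["pseudo_id", "name", "email"]]

    # Only this pseudonymized view is shared with the data science team;
    # results (e.g. lead scores) come back keyed on pseudo_id and are
    # re-joined to identities on the owner's side.
    shared = customers.drop(columns=["name", "email"])
    print(shared)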


Members of your data science team should be familiar with various forms of data anonymization. Depending on the nature of your data, it usually involves removing or obfuscating all data that has the potential to identify a person/client/other. Feature scaling, encoding, and name swapping (as @I_Play_With_Data had mentioned) can help reduce the possibility of revealing personal data or identifying the input source (individual persons or other entities).

While there's usually data that can be dropped or obfuscated entirely without impacting the results (such as encoding or removing a client's SSN from a dataset), there are often features that are more difficult to handle in the correct manner. If you decide to encode categorical data, there are multiple ways this can be done, and the data scientists will need to be made aware of any assumptions made during the process (e.g. that null values were encoded, or that a certain set of columns represents a single original feature). There are a number of things that can go wrong if you're too aggressive in your attempts to anonymize data, so it's often best to delegate the task to people with experience on both ends of the business, for example a member of the compliance department who is familiar with data science, or a member of the data science team who is also a subject matter expert.
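As a small, made-up illustration of the kind of encoding assumption that needs documenting (nulls get their own indicator column, and several dummy columns jointly represent one original feature):

    import pandas as pd

    # Made-up categorical feature with a missing value.
    df = pd.DataFrame({"region": ["north", "south", None, "north"]})

    # One-hot encode; dummy_na=True gives missing values their own column,
    # an assumption the data science team needs to know about.
    encoded = pd.get_dummies(df, columns=["region"], dummy_na=True)

    # The columns region_north, region_south and region_nan together
    # represent the single original "region" feature.
    print(encoded)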


If your data is completely numeric, have you considered removing the column names from the data? It's entirely possible that your staff could carry out their modeling work without having to know what the numbers represent at any stage. You would have to do some data prep to make sure that correlated columns have been accounted for, but even that could still be addressed with the "anonymous" columns.

If you give your staff a dataset with randomized column names, you would still preserve your desired privacy, and the data would not be effectively useless for modeling.
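A minimal sketch of that idea, with invented column names; the owner keeps the mapping and hands over only the renamed data:

    import pandas as pd

    # Numeric data whose real column names would reveal business meaning.
    df = pd.DataFrame({
        "monthly_spend": [120.0, 45.5, 300.2],
        "support_calls": [1, 4, 0],
        "tenure_months": [12, 3, 30],
    })

    # Replace column names with opaque tokens; the owner keeps the mapping.
    mapping = {col: f"col_{i}" for i, col in enumerate(df.columns)}
    anonymous = df.rename(columns=mapping)

    print(anonymous.columns.tolist())  # ['col_0', 'col_1', 'col_2']
    print(mapping)                     # kept on the owner's side only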


You could check out OpenMined's PySyft library, a library for encrypted, privacy-preserving deep learning built on top of PyTorch. PySyft decouples private data from model training.

GitHub link to the PySyft library: https://github.com/OpenMined/PySyft
