Anonymizing data

Question

william007

February 6, 2017 07:52

The data page at https://www.kaggle.com/c/santander-product-recommendation/data mentions that:

Please note: This sample does not include any real Santander Spain customers, and thus it is not representative of Spain's customer base.

In what ways can Santander anonymize its customer data while keeping the solutions produced on Kaggle useful to the bank?

Topic anonymization

Category Data Science

Icyblade · Accepted Answer · February 6, 2017 07:52

I think data masking is the technique you are looking for.

The main reason for applying masking to a data field is to protect data that is classified as personally identifiable data, personal sensitive data or commercially sensitive data; however, the data must remain usable for the purposes of undertaking valid test cycles.

Take your Santander problem as an example: there is an age feature in the original dataset. As we all know, $1\leqslant \text{age}\leqslant 200$ (nobody survives 201 years, right?). If we apply something like $\text{new_age}:=(\ln{(\pi \cdot \text{age}+\sqrt{2})})^2$, new_age is kind of "encrypted": the actual age of John Doe is no longer visible, while the ordering of ages is preserved because the transformation is strictly increasing on this range.
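The masking formula above can be tried out in a few lines of Python. This is only a sketch: the function names `mask_age` and `unmask_age` and the sample ages are made up for illustration and are not taken from the Santander dataset. It also shows that anyone who knows the formula can invert it, so this kind of masking hides values from casual inspection rather than providing real encryption.

```python
import math


def mask_age(age: float) -> float:
    """Apply the transformation from the answer: (ln(pi * age + sqrt(2)))**2."""
    return math.log(math.pi * age + math.sqrt(2)) ** 2


def unmask_age(new_age: float) -> float:
    """Invert mask_age; possible only for someone who knows the formula."""
    return (math.exp(math.sqrt(new_age)) - math.sqrt(2)) / math.pi


# Hypothetical ages, already sorted in ascending order.
ages = [23, 35, 35, 51, 78]
masked = [mask_age(a) for a in ages]

# The masked values are still in ascending order, so order-based
# models (for example decision trees) see the same structure.
print(masked == sorted(masked))

# Whoever holds the formula can recover the original ages.
recovered = [round(unmask_age(m)) for m in masked]
print(recovered)
```

Because the mapping is deterministic, equal ages always map to the same masked value, which keeps group statistics intact but also means the masking offers no protection once the formula leaks.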

References

https://en.wikipedia.org/wiki/Data_masking

Spacedman · Accepted Answer · February 6, 2017 07:43

If a model predicts useful information for a class of customers, say customers over 50, or those with more than 1000 EUR, then that's useful even without knowing who the individuals in the model are.
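As a rough illustration of working with classes of customers rather than individuals, the sketch below groups a handful of invented records by the two thresholds mentioned above (age over 50, balance over 1000 EUR) and computes a purchase rate per class. The records, field layout, and the `customer_class` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (customer_id, age, balance_eur, bought_product)
customers = [
    ("c1", 62, 2500.0, True),
    ("c2", 34, 300.0, False),
    ("c3", 55, 800.0, True),
    ("c4", 41, 1500.0, False),
    ("c5", 29, 4000.0, True),
]


def customer_class(age: int, balance_eur: float) -> tuple[str, str]:
    """Map an individual to a coarse class that carries no identity."""
    age_band = "over_50" if age > 50 else "50_or_under"
    balance_band = "over_1000_eur" if balance_eur > 1000 else "1000_eur_or_less"
    return age_band, balance_band


# class -> [number of buyers, number of customers]
totals = defaultdict(lambda: [0, 0])
for _customer_id, age, balance, bought in customers:
    key = customer_class(age, balance)
    totals[key][0] += int(bought)
    totals[key][1] += 1

# Purchase rate per class; the customer IDs are never needed here.
rates = {key: buyers / count for key, (buyers, count) in totals.items()}
for key, rate in sorted(rates.items()):
    print(key, rate)
```

The per-class rates can be handed back to the bank and applied to its real customers, since each real customer falls into one of the same classes.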

The actual data doesn't seem to be anonymized, though; the note implies it is synthetic data or possibly data from another bank altogether ("does not include any real Santander Spain customers").
