I think data masking is the technique you are looking for.

The main reason for applying masking to a data field is to protect data that is classified as personally identifiable data, sensitive personal data, or commercially sensitive data. However, the data must remain usable for the purposes of undertaking valid test cycles.

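Take your Santander problem as an example: there is an age feature in the original dataset. As we all know, $1\leqslant \text{age}\leqslant 200$ (nobody lives to 201, right?). If we apply something like $\text{new_age}:=(\ln{(\pi* \text{age}+\sqrt{2})})^2$, new_age is kind of "encrypted". The transformation is strictly increasing on this range, so the ordering of customers by age is preserved and a model can still learn from it, but without knowing the formula no one can read off the actual age of John Doe.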
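As a rough illustration, the masking formula above could be sketched in Python as follows. The function names `mask_age` and `unmask_age` and the sample ages are made up for this example; this is only a sketch of the idea, not how the competition data was actually prepared.

```python
import math


def mask_age(age: float) -> float:
    """Apply the illustrative mask new_age = (ln(pi * age + sqrt(2)))^2."""
    if age < 1:
        # The formula is only assumed to be used for 1 <= age <= 200.
        raise ValueError("age must be at least 1")
    return math.log(math.pi * age + math.sqrt(2)) ** 2


def unmask_age(new_age: float) -> float:
    """Invert the mask; only possible for someone who knows the formula."""
    return (math.exp(math.sqrt(new_age)) - math.sqrt(2)) / math.pi


# Hypothetical sample values, already sorted in ascending order.
ages = [18, 35, 50, 72]
masked = [mask_age(a) for a in ages]

for age, new_age in zip(ages, masked):
    print(f"age={age:3d} -> new_age={new_age:.4f}")
```

Because the mapping is strictly increasing, the masked column keeps the same ordering as the original one, which is why tree-based models in particular are largely unaffected by it. Note that this is obfuscation rather than real encryption: anyone who learns or guesses the formula can invert it, as `unmask_age` shows.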

References

https://en.wikipedia.org/wiki/Data_masking


If a model predicts useful information for a class of customers, say customers over 50 or those with more than 1000 EUR, then it is valuable even without knowing who the individuals in the data are.

The actual data doesn't seem to be anonymized data, though. It is implied to be synthetic data, or possibly data from another bank altogether ("does not include any real Santander Spain customers").
