How to release datasets with fingerprinting
I intend to monetise some large datasets. The datasets are anonymised and released to (paying) clients via a web API. Are there any standard algorithms for altering each client's copy of the data so that, if a dataset is intentionally leaked publicly, the responsible party can be identified, while the data remains practically useful?
A few approaches come to mind, such as making every client's copy very slightly different in known ways. For example, in spatial data, every lon/lat pair could be shifted by the same very small per-client vector (see the sketch below). My worry is that if the client perturbs or re-anonymises the data again before leaking it, a naive scheme like this could easily be circumvented.
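To make the idea concrete, here is a minimal sketch of that kind of per-client perturbation, not a robust fingerprinting scheme. Everything in it is an illustrative assumption: the secret key, the client IDs, the ~1e-4 degree offset scale, and the naive attribution step that assumes the leaked points line up index-for-index with the originals.

```python
# Minimal sketch (Python 3.10+): deterministic per-client offset fingerprinting.
# All constants and names are illustrative assumptions, not a recommended design.
import hmac
import hashlib

SECRET_KEY = b"server-side secret"   # assumed: kept private by the data owner
OFFSET_SCALE = 1e-4                  # assumed: roughly a 10 m shift, small enough to stay useful

def client_offset(client_id: str) -> tuple[float, float]:
    """Derive a deterministic pseudo-random (dlon, dlat) vector for a client."""
    digest = hmac.new(SECRET_KEY, client_id.encode(), hashlib.sha256).digest()
    # Map the first 8 bytes of the digest to two values in [-1, 1), then scale down.
    dlon = (int.from_bytes(digest[:4], "big") / 2**31 - 1) * OFFSET_SCALE
    dlat = (int.from_bytes(digest[4:8], "big") / 2**31 - 1) * OFFSET_SCALE
    return dlon, dlat

def fingerprint(points: list[tuple[float, float]], client_id: str) -> list[tuple[float, float]]:
    """Shift every (lon, lat) pair by the client's offset before delivery."""
    dlon, dlat = client_offset(client_id)
    return [(lon + dlon, lat + dlat) for lon, lat in points]

def identify_leaker(original: list[tuple[float, float]],
                    leaked: list[tuple[float, float]],
                    client_ids: list[str]) -> str | None:
    """Attribute a leak by finding the client whose offset best matches the observed shift.

    Simplifying assumption: leaked points correspond index-wise to the originals.
    """
    n = min(len(original), len(leaked))
    obs_dlon = sum(l[0] - o[0] for o, l in zip(original, leaked)) / n
    obs_dlat = sum(l[1] - o[1] for o, l in zip(original, leaked)) / n
    best, best_err = None, float("inf")
    for cid in client_ids:
        dlon, dlat = client_offset(cid)
        err = (obs_dlon - dlon) ** 2 + (obs_dlat - dlat) ** 2
        if err < best_err:
            best, best_err = cid, err
    return best
```

The sketch also illustrates the worry raised above: because the fingerprint is a single constant vector, a client who re-noises, rounds, or snaps the coordinates to a grid before leaking could wash it out or frame another client, so something more robust is presumably needed.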
(I am not a data scientist, so I'm not sure what the correct jargon is for what I'm looking for.)
Topic: data anonymization
Category: Data Science