How to release datasets with fingerprinting

I intend on monetising some large datasets. These datasets are anonymised and released to (paying) clients via a web api. Are there any standard algorithms such that if the datasets are intentionally leaked publicly, the data can be altered such that the responsible party can be identified, while at the same time the data remains practically useful?

There are certain approaches which come to mind, such as every client's data being very slightly different with known changes. For example in spatial data, every lon/lat pair is altered by the same very small vector. My worry is that if the data is anonymised again by the client before being leaked, a naive attempt might easily be circumvented.

(I am not a data scientist so I'm not really sure what the correct jargon is for what I am looking for)

Topic data anonymization

Category Data Science


"Digital watermarking" is a set of techniques that might be useful in this context.

From the wikipedia page:

"Watermarking" is the process of hiding digital information in a carrier signal... Digital watermarks may be used to verify the authenticity or integrity of the carrier signal or to show the identity of its owners. [Emphasis added]

To address your requirement, you would insert a unique watermark for each client that receives your data. Watermarking techniques address requirements such as robustness to modification and imperceptibility.

As an example, this paper talks about watermarking numerical data: "Watermarking Numerical Data in the Presence of Noise", pdf

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.