How do you choose an appropriate $k$ to achieve $k$-anonymity for data?

How do you choose an appropriate $k$ to achieve $k$-anonymity for a dataset? What methods exist that are agnostic to the business context of the problem?

Topic anonymization

Category Data Science


In most cases $k$ emerges from the volume and nature of the data, plus the anonymisation method used. One rarely has explicit control over $k$; it is set only implicitly through these choices.

Think of $k$ as a score instead of as a parameter.

For example, some records may have higher $k$-anonymity than others, because they share their identifying attributes with more records. In that case what counts is either the average $k$ or, more conservatively, the minimum.
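As a minimal sketch of treating $k$ as a score, the snippet below counts, for each record, how many records share the same values on a chosen set of identifying columns, then reports the minimum and the average. The column names (`age`, `zip`, `diagnosis`), the sample rows, and the helper `class_sizes` are hypothetical and only serve the illustration.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """Return, for each record, the number of records (itself included)
    that share its values on the given quasi-identifier columns."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]

# Hypothetical toy data: "age" and "zip" are treated as identifying columns.
records = [
    {"age": 34, "zip": "12345", "diagnosis": "flu"},
    {"age": 34, "zip": "12345", "diagnosis": "cold"},
    {"age": 34, "zip": "12345", "diagnosis": "flu"},
    {"age": 51, "zip": "54321", "diagnosis": "asthma"},
    {"age": 51, "zip": "54321", "diagnosis": "flu"},
]

sizes = class_sizes(records, ["age", "zip"])
min_k = min(sizes)               # the conservative score: worst-protected record
avg_k = sum(sizes) / len(sizes)  # the average score over all records

print(sizes, min_k, avg_k)
```

Here the first three records each have a score of 3 and the last two a score of 2, so the minimum is 2 even though the average is higher.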

If anonymity is a hard requirement, then the highest achievable value of $k$ is what is needed. Each record is indistinguishable from only $k-1$ others, so an attacker can exhaustively test the candidates in a group to recover the anonymised information. A higher $k$ slows this search down and, when large enough, makes it practically impossible.

Of course, the maximum $k$ is achieved when all data columns are anonymised, but this leaves the data useless. The trade-off between useful data and maximum anonymity therefore results in a range of acceptable $k$ values, and that range depends on the actual nature and volume of the data.
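The trade-off can be illustrated with a small sketch that coarsens two identifying columns step by step and records the minimum $k$ at each step. The generalisation scheme (age bins and masked zip digits), the sample data, and the helper names `generalize` and `min_k` are assumptions made for this example, not a prescribed method.

```python
from collections import Counter

def min_k(rows):
    """Smallest group size among identical rows."""
    return min(Counter(rows).values())

def generalize(age, zip_code, age_bucket, zip_digits):
    """Coarsen one record: bin the age (None suppresses it entirely)
    and keep only the first `zip_digits` digits of the zip code."""
    age_part = "*" if age_bucket is None else (age // age_bucket) * age_bucket
    zip_part = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (age_part, zip_part)

# Hypothetical (age, zip) pairs.
people = [
    (31, "12345"), (34, "12349"), (38, "12388"),
    (52, "12911"), (57, "12950"), (59, "12977"),
]

# Each level is (age bin width, zip digits kept), from raw to fully suppressed.
levels = [(1, 5), (10, 3), (None, 0)]
results = {
    level: min_k([generalize(age, z, *level) for age, z in people])
    for level in levels
}

for level, k in results.items():
    print(level, k)
```

With this toy data the raw columns give a minimum $k$ of 1, ten-year age bins with three zip digits give 3, and full suppression gives 6, the size of the dataset, at which point the two columns carry no information.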
