Noisification of categorical data proportions for privacy preservation

Imagine I'm conducting an ongoing poll asking people their favourite animal from a list [cat, dog, penguin, chimpanzee, ...].

I want to provide an interface that lets people query this poll data to see the relative popularity of each animal by different demographics. For example, querying the general population might reveal that the plurality of respondents (36%) prefer penguins, but querying the 18-25 age bracket might reveal that the plurality of respondents in that cohort (41%) prefer cats.

It's desirable to preserve the privacy of my respondents' animal preferences as much as possible. However, an attacker may be able to use prior knowledge of a given respondent to deduce their response by asking a sufficiently specific series of queries.

I wish to limit an attacker's ability to do this by noisifying the data presented to queriers. As such, I want a procedure that pseudorandomly perturbs each category's proportion by a fraction of a percentage point in either direction, while preserving the categories' relative ordering. I also wish this procedure to be deterministic over the same set of data (though this can easily be achieved by using a fixed seed in the pseudorandom procedure).

Formally, I want

$$f : \mathbb{R}_{\ge 0}^n \rightarrow \mathbb{R}_{\ge 0}^n;\;\; f(\mathbf{x}) = \mathbf{y};\;\; \lVert \mathbf{x} \rVert_1 = \lVert \mathbf{y} \rVert_1 = 1$$

where $\mathbf{x}$ is the vector of proportions of each category and $\mathbf{y}$ is its noised counterpart.
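For concreteness, here is a minimal sketch of the kind of $f$ I have in mind, assuming Python/NumPy; the function name, the noise scale, the data-derived seed, and the handling of zeros are all illustrative choices rather than a proposed solution (in particular, pinning zeros is exactly the kind of unprincipled fix I describe below):

```python
import hashlib
import numpy as np

def noisify(proportions, scale=0.005):
    """One ad hoc candidate for f: deterministic over the same data,
    ordering-preserving, and renormalised to sum to 1."""
    x = np.asarray(proportions, dtype=float)

    # Determinism: derive a fixed seed from the data itself, so the
    # same input always produces the same output.
    seed = int.from_bytes(hashlib.sha256(x.tobytes()).digest()[:8], "big")
    rng = np.random.default_rng(seed)

    # Perturb each entry by up to `scale` (half a percentage point here).
    y = x + rng.uniform(-scale, scale, size=x.shape)

    # Restore the original relative ordering: give the k-th smallest
    # noised value to the entry that had rank k in the input.
    ranks = np.argsort(np.argsort(x))
    y = np.sort(y)[ranks]

    # Pin exact zeros at zero, clamp any remaining negatives, renormalise.
    y[x == 0] = 0.0
    y = np.clip(y, 0.0, None)
    return y / y.sum()
```

Re-sorting the noised values into the input's rank order is what guarantees the ordering constraint; the zero-pinning and clamping steps are ad hoc, which is the crux of the problems below.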

One naive way of going about this would be to simply add a pseudorandom Gaussian noise vector to the original vector and then renormalise (see the sketch after this list). This poses at least two problems:

1) the "zero problem": if a cohort has zero people who like cats, how should the noisification procedure treat this? I'm inclined to say it should maintain the value at zero, but I can't think of a principled way of achieving this

2) the variance of the noise should ideally be the same for all elements in the vector, but any obvious procedure for forcing positivity typically results in a smaller effective noise variance for smaller values, so the noisification would end up making large values larger and small values smaller after renormalisation.
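To illustrate both problems numerically, here is a sketch of this naive approach, again assuming Python/NumPy, with clipping at zero as the positivity fix (clipping is my illustrative choice; other obvious fixes behave similarly):

```python
import numpy as np

def naive_noisify(proportions, sigma=0.005, seed=0):
    """Naive approach: add Gaussian noise, force positivity by
    clipping at zero, then renormalise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(proportions, dtype=float)
    y = np.clip(x + rng.normal(0.0, sigma, size=x.shape), 0.0, None)
    return y / y.sum()

# Problem 1: a true zero leaks to a positive value whenever its noise
# draw happens to be positive (roughly half of all seeds).
leaks = sum(naive_noisify([0.0, 0.4, 0.6], seed=s)[0] > 0 for s in range(1000))
print(f"zero entry became positive in {leaks}/1000 runs")

# Problem 2: clipping truncates the downward noise of near-zero
# entries, so their effective noise variance is smaller.
draws = np.array([naive_noisify([0.002, 0.398, 0.6], seed=s) for s in range(1000)])
print(draws.std(axis=0))  # the first entry's spread is visibly smaller
```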

I feel like this should be a problem people have encountered before, but I can't find it in the literature.
