Is it possible to use a generative model to "share" private data?

Let's say we have a data set with many instances $X$ and a target $y$. If it matters, you may assume it is a "real life" data set: medium sized, with significant correlations, an unbalanced $y$, etc.

Let's also say this data set is of some interest: the research field is rather active, but there are few (if any) publicly available data sets. So we are considering publishing ours.

However, direct publication isn't possible because of privacy concerns. Basic approaches have been considered (pseudonymisation, grouping instances to achieve statistical disclosure control, publishing an older version of the dataset), but mostly set aside.

The research field concentrates on discriminative approaches for learning $y$ from $X$. However, I recently read about generative approaches, so I am curious (1) whether there are generative techniques that could be used to share information about $(X, y)$ without anyone being able to recognise individual instances.

And (2): I think the usefulness of sharing a generative model (or its output) with people who are trying to build discriminative models may be limited by the inherent performance of the generative model. Is there more to evaluating it than fitting a SOTA discriminative model twice (once on real data, once on generated data) and comparing performance? Wouldn't it be better to just share the SOTA model fitted on our dataset?
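For concreteness, here is the kind of comparison I have in mind — a rough sketch, where `X_real`/`y_real` stand for our data and `X_synth`/`y_synth` would come from whatever generative model we end up using (all names and the choice of classifier are illustrative):

```python
# Sketch: compare a discriminative model trained on real vs. generated data,
# both evaluated on the same held-out real test set ("train on synthetic,
# test on real"). X_real, y_real, X_synth, y_synth are assumed to exist.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.3, stratify=y_real, random_state=0)

model_real = GradientBoostingClassifier().fit(X_train, y_train)
model_synth = GradientBoostingClassifier().fit(X_synth, y_synth)

# If the gap between the two scores is small, the generative model has
# preserved most of the signal a discriminative learner can exploit.
auc_real = roc_auc_score(y_test, model_real.predict_proba(X_test)[:, 1])
auc_synth = roc_auc_score(y_test, model_synth.predict_proba(X_test)[:, 1])
print(f"AUC trained on real data:      {auc_real:.3f}")
print(f"AUC trained on generated data: {auc_synth:.3f}")
```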

Topic: generative-models, privacy, dataset

Category: Data Science


This is an active area of research, and there are some results demonstrating that this is possible for medical data: https://arxiv.org/abs/1807.10225. You are correct that the performance of the generative model will be a limiting factor (it is not possible to learn more than the generative model can encode), but with a powerful enough generative model you can still draw meaningful insights.
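As a toy illustration of the general idea (not the method from the linked paper): you fit a generative model to the private data and release samples from it instead of the raw records. The class-conditional Gaussian mixture below is only a stand-in for a more powerful model such as a deep generative network; `X` and `y` are assumed to be a numeric feature matrix and its labels.

```python
# Toy sketch: fit a simple class-conditional generative model on the private
# data and release samples drawn from it rather than the records themselves.
# X (numeric features) and y (labels) are assumed inputs; a Gaussian mixture
# is only a placeholder for whatever generative model you would actually use.
import numpy as np
from sklearn.mixture import GaussianMixture

synthetic_parts = []
for label in np.unique(y):
    gm = GaussianMixture(n_components=5, random_state=0).fit(X[y == label])
    n = int((y == label).sum())
    X_s, _ = gm.sample(n)                      # draw synthetic feature vectors
    synthetic_parts.append((X_s, np.full(n, label)))

X_synth = np.vstack([p[0] for p in synthetic_parts])
y_synth = np.concatenate([p[1] for p in synthetic_parts])
```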


Yes it is.

That is, in theory at least. We already have mathematical tools to prove whether privacy is preserved, and how much (that's the parameter $\epsilon$).

It's called differential privacy. I highly recommend this non-technical introduction.

Long story short, we could let a generative model learn from the data through a mechanism that guarantees privacy. Remember: theory and practice are the same in theory, but not in practice...
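To give a feel for what $\epsilon$ means, here is a minimal sketch of the Laplace mechanism applied to a counting query. This is not a full differentially private training procedure, just the core idea that noise calibrated to the query's sensitivity and $\epsilon$ bounds what the released answer can reveal about any one individual:

```python
# Minimal illustration of epsilon-differential privacy: the Laplace mechanism
# applied to a counting query. Adding or removing any single individual's
# record changes the true count by at most 1 (the sensitivity), so adding
# Laplace(sensitivity / epsilon) noise makes the answer epsilon-DP.
import numpy as np

def dp_count(values, predicate, epsilon, rng=np.random.default_rng()):
    true_count = sum(predicate(v) for v in values)
    sensitivity = 1.0                  # one person changes the count by <= 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: how many patients are over 65? Smaller epsilon -> more noise,
# stronger privacy guarantee, less accurate released answer.
ages = [34, 71, 68, 45, 80, 59, 66, 23]
for eps in (0.1, 1.0, 10.0):
    print(eps, dp_count(ages, lambda a: a > 65, eps))
```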


Unfortunately, I don't think a generative model can prevent private information from the original dataset from leaking.

Like any other kind of model, a generative model is based on the values in its training data. The idea of using such a model in "generation mode" is indeed interesting, since it would make it difficult to trace the instances it generates back to real individuals. Difficult, but not impossible: by re-connecting pieces of information or exploiting rare (distinctive) cases, somebody could still recover at least partial personal information from the generated instances.
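To see the "rare cases" risk concretely, one quick check is to count how many released records (real or generated) are unique on a set of quasi-identifiers. The column names below are purely illustrative, not from the original dataset:

```python
# Sketch: estimate re-identification risk by counting records that are unique
# on a combination of quasi-identifiers. This applies to generated records
# too: if a generated row reproduces a rare combination that exists in the
# real population, it can still point to a real person.
import pandas as pd

quasi_identifiers = ["zip_code", "birth_year", "sex"]   # illustrative names

def k_anonymity_report(df, qi):
    group_sizes = df.groupby(qi).size()
    print("smallest group size (k):", group_sizes.min())
    print("records unique on the quasi-identifiers:", int((group_sizes == 1).sum()))

# df_released could be the synthetic sample drawn from the generative model:
# k_anonymity_report(df_released, quasi_identifiers)
```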

Additionally, the design of the generative model itself introduces a strong bias into the data: the distribution of the generated instances follows this design, which may or may not accurately represent the real distribution. This significantly lowers the value of exploiting the generated instances, since they are essentially artificial data.
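One way to quantify that drift is a per-feature two-sample test between the real and generated data. A sketch, assuming numeric features and the illustrative array names `X_real` / `X_synth`:

```python
# Sketch: per-feature check of how well the generated data reproduces the
# real marginal distributions, using a two-sample Kolmogorov-Smirnov test.
# X_real and X_synth are assumed numeric arrays with matching columns.
from scipy.stats import ks_2samp

for j in range(X_real.shape[1]):
    stat, p_value = ks_2samp(X_real[:, j], X_synth[:, j])
    print(f"feature {j}: KS statistic = {stat:.3f}, p = {p_value:.3g}")
```

Note that matching the marginals is only a necessary condition; joint correlations can still be off.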

For the record, I think I've seen research on using distributed ML methods (e.g. federated learning) to overcome the privacy issue. As far as I understand, the idea is to keep every individual/institution in control of their own data, while allowing specific automated methods to access it in some kind of safe way.
