Is it possible to use a generative model to "share" private data?
Let's say we have a data set with many instances $X$ and a target $y$. If it matters, you may assume it is a "real life" data set: medium-sized, with important correlations, an unbalanced $y$, etc.
Let's also say this data set is of some interest: the field of research is rather active, but there are no (or only a few) publicly available data sets. So we are considering publishing ours.
However, direct publication isn't possible due to privacy concerns. Basic approaches have been considered (pseudonymisation, grouping instances to achieve statistical disclosure control, publishing an old data set), but mostly set aside.
The research field concentrates on discriminative approaches for learning $y$ from $X$. However, I recently came to read about generative approaches, so I am curious: are there generative techniques that could be used to share information about $(X, y)$ without it being possible to recognise any individual instance?
I think the usefulness of sharing a generative model (or the output of a generative model) with people who are trying to build discriminative models may be limited by the inherent performance of the generative model. Is there more to evaluating this than calibrating a SOTA discriminative model twice (once on real data, once on generated data) and comparing performances? Wouldn't it be better to just share the SOTA algorithm calibrated on our data set?
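To make the "calibrate twice and compare" idea concrete, here is a minimal sketch of that evaluation protocol (sometimes called TRTR vs. TSTR: train-on-real-test-on-real vs. train-on-synthetic-test-on-real). The data set, the choice of a per-class Gaussian mixture as the generative model, and logistic regression as the discriminative model are all illustrative assumptions, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the private data set (unbalanced y).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative generative model: one Gaussian mixture fitted per class,
# then sampled to produce a synthetic (X, y) of the same size.
synth_X, synth_y = [], []
for c in np.unique(y_tr):
    Xc = X_tr[y_tr == c]
    gm = GaussianMixture(n_components=5, random_state=0).fit(Xc)
    samples, _ = gm.sample(len(Xc))
    synth_X.append(samples)
    synth_y.append(np.full(len(Xc), c))
synth_X, synth_y = np.vstack(synth_X), np.concatenate(synth_y)

# Same discriminative model calibrated twice, evaluated on held-out real data.
real_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
synth_clf = LogisticRegression(max_iter=1000).fit(synth_X, synth_y)
print("train-on-real  accuracy:", real_clf.score(X_te, y_te))
print("train-on-synth accuracy:", synth_clf.score(X_te, y_te))
```

The gap between the two scores quantifies how much discriminative performance is lost by working from the generative model's output instead of the real data, which is exactly the limitation I'm asking about.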
Tags: generative-models, privacy, dataset
Category Data Science