What is the best way to generate synthetic data while maintaining privacy?
We are working as third-party contractors on a project where the company needs to share some datasets with us for data science work. The company cannot share the real data, because doing so would be a privacy issue.
We are exploring ways for the company either to share the data while maintaining privacy, or to generate fake data that matches the statistics and demographics of the actual data.
We are currently looking at a couple of options:
- Using differential privacy to add noise to the data and then sharing the transformed data with us. Can this approach still lead to a privacy issue? I am concerned about the noise being reverse engineered. Does the privacy budget apply here, and if so, how should it be managed?
- Using an encoder-decoder neural network to learn a vector embedding of the real data. Once the embedding is learned, the decoder can be destroyed and only the encoder's output shared with us.
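To make the first option concrete, here is a minimal sketch of how we currently understand the Laplace mechanism, applied to a single released statistic (a bounded mean). The function names, the clipping bounds, and the choice of a mean query are our own assumptions for illustration, not something the company has implemented. As far as we understand, each released statistic spends part of the privacy budget, and under basic sequential composition the epsilons of all releases add up.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng=None):
    """Hypothetical helper: epsilon-DP mean of values clipped to [lower, upper].

    Assumes the number of records is public and that neighbouring
    datasets differ by replacing one record.
    """
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    true_mean = sum(clipped) / n
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

# Illustrative data only.
ages = [23, 35, 41, 29, 52, 38, 47, 31]
noisy = private_mean(ages, lower=18, upper=90, epsilon=0.5)
print(noisy)
```

A smaller epsilon means more noise and stronger protection. Answering the same query repeatedly with fresh noise would let us average the noise away, which is why we assume the total budget across all releases has to be capped.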
Is there any other approach to synthetic data generation that produces data resembling the actual data in terms of demographics and statistics? Failing that, what would be the best way to access the real data without violating privacy?
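For reference, the simplest baseline we can think of for "fake data that matches the statistics" is to fit each column separately and sample from the fitted distributions. The sketch below assumes rows are plain dictionaries, that numeric columns are roughly normal, and uses hypothetical helper names (`fit_marginals`, `sample_rows`). It ignores correlations between columns and, as far as we can tell, carries no formal privacy guarantee on its own, which is part of why we are asking for better approaches.

```python
import random
import statistics
from collections import Counter

def fit_marginals(rows, numeric_cols, categorical_cols):
    """Summarise each column independently (hypothetical helper)."""
    model = {}
    for col in numeric_cols:
        vals = [r[col] for r in rows]
        # Assumption: a normal distribution is an acceptable fit.
        model[col] = ("numeric", statistics.mean(vals), statistics.stdev(vals))
    for col in categorical_cols:
        counts = Counter(r[col] for r in rows)
        model[col] = ("categorical", list(counts.keys()), list(counts.values()))
    return model

def sample_rows(model, n, rng=None):
    """Draw n synthetic rows, sampling every column independently."""
    rng = rng or random.Random()
    out = []
    for _ in range(n):
        row = {}
        for col, spec in model.items():
            if spec[0] == "numeric":
                row[col] = rng.gauss(spec[1], spec[2])
            else:
                row[col] = rng.choices(spec[1], weights=spec[2], k=1)[0]
        out.append(row)
    return out

# Illustrative data only.
real = [
    {"age": 23, "region": "north"},
    {"age": 35, "region": "south"},
    {"age": 41, "region": "north"},
    {"age": 29, "region": "east"},
]
model = fit_marginals(real, numeric_cols=["age"], categorical_cols=["region"])
synthetic = sample_rows(model, n=5)
print(synthetic)
```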
Topic sequence-to-sequence privacy autoencoder dataset
Category Data Science