I have a very large dataset from the NLP area and I want to anonymize it. Is there any way to check whether my pre-processing is correct? More generally, is there any way to evaluate how good the pre-processing is with respect to anonymity? I want to mention that the dataset is really huge, so it cannot be checked manually.
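One automated check that scales to large datasets is to measure how many records remain unique on a set of quasi-identifier columns; a high uniqueness rate means many records can still be singled out. A minimal sketch with pandas, where the column names and toy data are hypothetical:

    import pandas as pd

    def uniqueness_rate(df, quasi_identifiers):
        """Fraction of records that are unique on the given quasi-identifier columns."""
        sizes = df.groupby(quasi_identifiers).size()
        return (sizes == 1).sum() / len(df)

    # Hypothetical example: 'age' and 'zip_code' act as quasi-identifiers.
    df = pd.DataFrame({"age": [34, 34, 51, 51, 29],
                       "zip_code": ["10115", "10115", "80331", "80331", "50667"]})
    print(uniqueness_rate(df, ["age", "zip_code"]))  # 0.2 -> one record is unique

The groupby is the only pass over the data, so this stays feasible even when manual review is not.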
I remember reading a story in which journalists were able to figure out the health records of some individual (I think it was a senator, but I'm not sure) by combining different data sets. It was given as an example showing that data anonymization alone is not sufficient for data privacy. However, I can't find this story on the Internet. Has anyone come across it? Update: I found it.
I have tried a simple algorithm to anonymize my data using a de-identification technique, but the code doesn't work for me. I want to anonymize the data by slightly changing the values. The data sample is available here

    import pandas as pd
    import uuid as u
    import datetime as dt

    # Generate a sequence of pseudo-identifiers using the Python uuid library.
    def uuid_generator(length):
        uuid_list = []
        for _ in range(length):
            uuid_list.append(u.uuid4())
        return uuid_list

    # Import the original dataset
    dataset …
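If the aim is simply to swap a direct identifier column for the generated pseudo-identifiers, a self-contained sketch could look like the following (the customer_id column name and sample values are hypothetical, not from the original data sample):

    import uuid
    import pandas as pd

    # Hypothetical sample; 'customer_id' stands in for the real identifier column.
    dataset = pd.DataFrame({"customer_id": ["A17", "B22", "C03"], "orders": [5, 2, 9]})
    dataset["customer_id"] = [str(uuid.uuid4()) for _ in range(len(dataset))]
    print(dataset)

Note that uuid4() draws fresh random identifiers, so the mapping back to the original IDs is lost unless you store it separately.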
I am using a dataset from the marketing and sales department. The dataset contains the customer name (company name), company address, pin code, number of orders placed, revenue generated from that customer, etc. My question is whether I should hide/mask/anonymize the customer name, address, and so on. Of course, the insights we generate will be used by business users from the sales and marketing teams. So, should we use a duplicate identifier (a mapping sheet) to stand in for the customer names and addresses? For example: …
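A mapping sheet in pandas could be as simple as the following sketch; the column names and sample rows are illustrative only, and the mapping table itself would be kept in a restricted location:

    import pandas as pd

    # Hypothetical sales data.
    sales = pd.DataFrame({"customer_name": ["Acme Corp", "Beta Ltd", "Acme Corp"],
                          "revenue": [1200, 800, 450]})

    # Build the mapping sheet: one stable alias per distinct customer.
    names = sales["customer_name"].drop_duplicates().reset_index(drop=True)
    mapping = pd.DataFrame({"customer_name": names,
                            "alias": [f"CUST-{i:04d}" for i in range(len(names))]})

    # Replace real names with aliases for the analysis copy.
    anonymized = sales.merge(mapping, on="customer_name").drop(columns="customer_name")
    print(anonymized)

Because the alias is stable per customer, aggregations such as revenue per customer still work for the business users.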
There are GDPR articles that relate to a person's ownership of their data, e.g., Art. 17 GDPR, Right to erasure (‘right to be forgotten’), and Art. 20 GDPR, Right to data portability. If one were to anonymize the data without a way to "restore" the relation to the person (name + e-mail address), which in turn would be needed to handle that person's specific data, I'd say this would conflict with these GDPR articles. Are there data anonymization techniques that allow one to …
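One family of techniques here is reversible (keyed) pseudonymization: the identifier is encrypted rather than deleted, so the relation can be restored, but only by whoever holds the key. A minimal sketch using the third-party cryptography package (my choice of library, not something the question specifies):

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # must be stored securely, separately from the data
    f = Fernet(key)

    # Pseudonymize: the token replaces the e-mail in the working dataset.
    token = f.encrypt(b"alice@example.com")

    # Restore the relation, e.g. to honour an Art. 17 erasure request.
    email = f.decrypt(token).decode()
    print(email)  # alice@example.com

Destroying the key later turns the pseudonymized data into effectively anonymized data, which is one common way to reconcile the two requirements.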
A regular digital questionnaire can be completely anonymous: send out a non-personalized URL for the questionnaire and do not ask for or store identifiable information (such as the user's IP address, or questions about date of birth, etc.). By this I mean that, as the researcher, I am unable to later identify who filled out a questionnaire, even if I wanted to. I now have a longitudinal study with 4 waves of questionnaires, one year apart each. Consecutive waves are required …
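One common way to link waves without storing identities is a self-generated code: each participant derives the same code in every wave from stable personal facts. A hedged sketch (the prompts and hashing scheme are my assumptions, not from the study; in practice the code would be computed client-side so the raw answers never reach the researcher):

    import hashlib

    def wave_code(first_pet, mother_initials, birth_city_letter):
        """Participants answer the same stable prompts each wave; only the hash is kept."""
        raw = "|".join(s.strip().lower() for s in (first_pet, mother_initials, birth_city_letter))
        return hashlib.sha256(raw.encode()).hexdigest()[:12]

    # Same answers in wave 1 and wave 2 -> same code, so responses can be linked.
    print(wave_code("Rex", "MK", "b"))

Be aware that hashing low-entropy answers provides linkage, not strong anonymity, so the prompts should be facts the researcher could never plausibly collect elsewhere.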
How do you choose an appropriate $k$ to achieve $k$-anonymity for a dataset? What methods exist that are agnostic to the business context of the problem?
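For intuition: the $k$ a given release actually achieves is just the size of its smallest equivalence class over the quasi-identifiers, and one context-agnostic way to compare candidate values of $k$ is to count how many records each would force you to suppress or generalize. A pandas sketch with hypothetical columns:

    import pandas as pd

    # Toy release with two quasi-identifier columns (hypothetical data).
    df = pd.DataFrame({"age_band": ["30-39", "30-39", "30-39", "50-59", "50-59"],
                       "zip3":     ["101",   "101",   "101",   "803",   "803"],
                       "diagnosis": ["flu", "flu", "cold", "flu", "cold"]})
    qi = ["age_band", "zip3"]

    # k is the size of the smallest equivalence class over the quasi-identifiers.
    class_sizes = df.groupby(qi)["diagnosis"].transform("size")
    print("achieved k:", class_sizes.min())  # 2 here

    # Cost of each candidate k: records falling in classes smaller than k.
    for k in (2, 3, 5):
        print(f"k={k}: {(class_sizes < k).sum()} records need suppression/generalization")

Plotting that suppression count against $k$ gives a purely structural utility-vs-privacy curve, independent of the business context.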
Is it possible to automatically detect fields holding personal information (name, phone, address, SSN, passport, gov ID, ...) from their names, using Python, in order to upload datasets to the cloud after encrypting or anonymizing the PII fields? I am open to building my own model by training it on a dataset that holds thousands of fields, each classified as personal or not. But apparently I can't find any related datasets.
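Absent a labelled dataset, a keyword/regex baseline over field names is often the first step before any trained model; a minimal sketch where the pattern list is my assumption and deliberately not exhaustive:

    import re

    # Hypothetical keyword list; extend for your domain and languages.
    PII_PATTERNS = [r"name", r"phone", r"addr", r"ssn", r"passport",
                    r"e?mail", r"birth", r"gov(ernment)?_?id"]

    def looks_like_pii(field_name):
        norm = field_name.lower()
        return any(re.search(p, norm) for p in PII_PATTERNS)

    fields = ["cust_name", "order_total", "home_address", "ssn_hash", "created_at"]
    print([f for f in fields if looks_like_pii(f)])
    # ['cust_name', 'home_address', 'ssn_hash']

Such a baseline also gives you a cheap way to bootstrap labels for the classifier you describe: run it over your thousands of fields, then manually correct its output.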
In our company we want to protect data privacy internally. Meaning, we want to find a way to anonymize the data so that the data science team members cannot expose it, yet can still use it for modelling. I googled and read about pseudonymization. But does it destroy the data's utility? I couldn't find any reliable source on using it in practice.
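Pseudonymization need not destroy utility for modelling: replacing an identifier with a keyed hash keeps it deterministic, so joins, deduplication, and group-level features still work even though the team never sees the real value. A minimal sketch with the standard library (the secret key and e-mail column are my assumptions):

    import hashlib
    import hmac

    SECRET = b"kept-outside-the-data-science-environment"  # hypothetical key

    def pseudonymize(value):
        # Same input always maps to the same token, so joins/aggregations survive.
        return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

    print(pseudonymize("alice@example.com"))
    print(pseudonymize("alice@example.com"))  # identical token
    print(pseudonymize("bob@example.com"))    # different token

Using a keyed HMAC rather than a bare hash matters: without the key, insiders could re-identify people by hashing guessed e-mail addresses and comparing.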
I am working on an industrial project that involves real data. The data contains sensitive information about company operations that cannot be disclosed publicly. As a result, I need to anonymize the original data before applying any machine learning algorithms. The data anonymization includes changing the names of persons, places, geographical locations, etc. I would like to know the best practices for anonymizing datasets. Ideally, I should be able to get the original data …
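One common pattern that keeps the original recoverable is consistent fake replacement with a stored reverse map; a sketch using the third-party faker package (my choice for illustration, not something the question names):

    from faker import Faker

    fake = Faker()
    forward, reverse = {}, {}

    def replace_name(real):
        # Consistent: the same real name always gets the same fake name.
        if real not in forward:
            alias = fake.name()
            forward[real], reverse[alias] = alias, real
        return forward[real]

    anonymized = [replace_name(n) for n in ["John Smith", "Mary Jones", "John Smith"]]
    print(anonymized)  # two distinct aliases, repeated consistently
    # `reverse` is stored separately and lets you map results back to the originals.

Consistency is what preserves the statistical structure (repeat customers stay repeat customers), while the separately stored reverse map gives you the round trip back to the original data.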
I intend on monetising some large datasets, which are anonymised and released to (paying) clients via a web API. Are there standard algorithms for altering the data so that, if a dataset is intentionally leaked publicly, the responsible party can be identified, while the data remains practically useful? Certain approaches come to mind, such as making every client's copy very slightly different, with known changes. For example, in …
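This is usually called dataset fingerprinting or watermarking. A toy sketch of the per-client-perturbation idea mentioned above, using a seeded generator so each client's copy is reproducible and can later be matched against a leak (all values hypothetical):

    import numpy as np

    base = np.array([100.0, 250.0, 175.0, 320.0])  # hypothetical numeric column

    def client_copy(client_id):
        rng = np.random.default_rng(client_id)            # seed = client identity
        return base + rng.normal(0, 0.01, size=base.shape)  # tiny, known perturbation

    def identify_leaker(leaked, client_ids):
        # The client whose fingerprint best matches the leak is the likely source.
        return min(client_ids, key=lambda c: np.abs(leaked - client_copy(c)).sum())

    leaked = client_copy(42)
    print(identify_leaker(leaked, [7, 13, 42, 99]))  # 42

A real scheme would need to survive collusion between clients and post-processing of the leak, which is where the academic fingerprinting literature comes in.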
Right now, I am working on preparing a small dataset for public release by removing sensitive information. While working on it, I wondered: what are the best practices for dealing with private or sensitive personal attributes in a dataset? (*) I have heard that anonymity is achieved through de-identification, obfuscation, or permutation. However, I would like to learn more about this topic in data science/analysis. I am particularly interested in the packages and concepts that …
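As a concrete example of the permutation approach: shuffling a sensitive column preserves its marginal distribution (so column-level statistics survive) while breaking the link to individual rows. A minimal sketch with hypothetical columns:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"dept": ["A", "A", "B", "B"], "salary": [50, 70, 60, 90]})

    rng = np.random.default_rng(0)
    df["salary"] = rng.permutation(df["salary"].to_numpy())
    print(df)  # same salary values overall, no longer tied to the original rows

The trade-off is that any cross-column relationship involving the permuted attribute is destroyed, which is why permutation is usually applied only to attributes you don't need to correlate.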
In https://www.kaggle.com/c/santander-product-recommendation/data it is mentioned that: "Please note: This sample does not include any real Santander Spain customers, and thus it is not representative of Spain's customer base." In what ways can Santander anonymize their customers' data while the solutions produced on Kaggle remain useful to them?
Motivation I work with datasets that contain personally identifiable information (PII) and sometimes need to share part of a dataset with third parties, in a way that doesn't expose PII and subject my employer to liability. Our usual approach here is to withhold data entirely, or in some cases to reduce its resolution; e.g., replacing an exact street address with the corresponding county or census tract. This means that certain types of analysis and processing must be done in-house, even …
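The resolution-reduction approach described above is usually called generalization; a minimal sketch of the same idea in pandas, with hypothetical columns and illustrative bins:

    import pandas as pd

    df = pd.DataFrame({"age": [23, 37, 61], "zip_code": ["94103", "10115", "60614"]})

    # Coarsen exact values to ranges/prefixes before sharing.
    df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                            labels=["<30", "30-49", "50+"])
    df["zip3"] = df["zip_code"].str[:3]
    shared = df.drop(columns=["age", "zip_code"])
    print(shared)

Keeping the binning rules in one reviewed place makes it easier to reason about how much re-identification risk each shared extract carries.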
Although I have seen a few good questions about data anonymization, I was wondering whether there were answers to this more specific variant. I am seeking a tool (or to design one) that will anonymize human names from a specific country, particularly first names, in unstructured text. Many of the tools I have seen take a wider view of data anonymization, with equal focus on dates of birth, addresses, etc. An imperative aspect is that it …
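A common starting point for names in unstructured text is named-entity recognition. A hedged sketch with spaCy (assumes the en_core_web_sm English model is installed; restricting to one country's first names would additionally need a country-specific name list or a fine-tuned model):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # pip install spacy; python -m spacy download en_core_web_sm

    def redact_names(text):
        doc = nlp(text)
        out = text
        # Replace PERSON spans from the end so character offsets stay valid.
        for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
            if ent.label_ == "PERSON":
                out = out[:ent.start_char] + "[NAME]" + out[ent.end_char:]
        return out

    print(redact_names("Yesterday Maria met with Dr. John Smith in Berlin."))

Intersecting the NER hits with a country-specific first-name gazetteer is one way to get the narrower behaviour the question asks for, at the cost of missing names absent from the list.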