Evaluation of the preprocessing to make a dataset anonymous

I have a very large dataset from the NLP area and I want to make it anonymous. Is there any way to check whether my pre-processing is correct? More generally, is there any way to evaluate how well the pre-processing achieves anonymity? I want to mention that the dataset is really huge, so it cannot be checked manually.
Category: Data Science
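
One way to approach the question above is to scan a random sample of the processed text for identifiers that survived. The following is a minimal sketch, not a complete audit: the two regular expressions, the `PII_PATTERNS` table, and the function names are illustrative assumptions, and pattern matching will miss names and other free-form identifiers.

```python
import random
import re

# Illustrative patterns only (assumption): a real audit needs patterns or an
# NER model suited to the language and identifier types in the corpus.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def find_residual_pii(text):
    """Return the names of the patterns that still match the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

def estimate_leak_rate(documents, sample_size, seed=0):
    """Scan a random sample and return the fraction of documents with a match."""
    rng = random.Random(seed)
    sample = rng.sample(documents, min(sample_size, len(documents)))
    flagged = [doc for doc in sample if find_residual_pii(doc)]
    return len(flagged) / len(sample) if sample else 0.0

# Hypothetical already-processed documents.
docs = [
    "Contact <EMAIL> for details.",
    "Please write to jane.doe@example.com.",
    "The meeting went well.",
    "Call +1 555 010 9999 tomorrow.",
]
print(estimate_leak_rate(docs, sample_size=4))
```

The leak rate on a sample only gives a lower bound on what slipped through, so it is probably best combined with a small manually reviewed subset.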

Data privacy breach example (data anonymisation)

I remember reading a story in which journalists were able to identify the health records of an individual (I think it was a senator, but I am not sure) by combining different data sets. It was an example showing that data anonymization alone is not sufficient for data privacy. However, I can't find this story on the Internet. Has anyone come across it? Update: I found it.
Category: Data Science

How to write custom de-identification algorithm in Python?

I have tried a simple algorithm to anonymize my data using a de-identification technique, but the code doesn't work for me. I want to anonymize the data by slightly changing the values. The data sample is available here.

    import pandas as pd
    import uuid as u
    import datetime as dt

    # generate a pseudo-identifier sequence using Python's uuid library
    def uudi_generator(length):
        uudi_list = list()
        i = 0
        while i < length:
            uudi_list.append(u.uuid4())
            i += 1
        return uudi_list

    # import original dataset
    dataset …
Category: Data Science
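
As a rough illustration of the two ideas in the question above (UUID pseudo-identifiers and slightly changed values), here is a standard-library sketch. It is not the asker's code: `pseudonymize_ids`, `perturb`, and the sample columns are hypothetical, and small random noise on its own gives no formal privacy guarantee.

```python
import random
import uuid

def pseudonymize_ids(values):
    """Map each distinct identifier to one random UUID, reused on repeats."""
    mapping = {}
    result = []
    for value in values:
        if value not in mapping:
            mapping[value] = str(uuid.uuid4())
        result.append(mapping[value])
    return result, mapping

def perturb(values, relative_noise=0.05, seed=None):
    """Scale each number by a random factor within +/- relative_noise."""
    rng = random.Random(seed)
    return [v * (1 + rng.uniform(-relative_noise, relative_noise)) for v in values]

# Hypothetical columns standing in for the real dataset.
names = ["alice", "bob", "alice"]
ages = [34.0, 51.0, 34.0]

pseudo_ids, mapping = pseudonymize_ids(names)
noisy_ages = perturb(ages, relative_noise=0.05, seed=42)
```

Keeping one UUID per distinct identifier preserves joins and group-bys, which a fresh UUID per row would break. The `mapping` dictionary is itself sensitive and would need to be stored separately or discarded.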

Data Anonymization for all domains?

I am using a dataset from the marketing and sales department. The dataset contains the customer name (company name), company address, pincode, number of orders placed, revenue generated from that customer, etc. My question is whether I should hide, mask, or anonymize the customer name, address, and similar fields. Of course, the insights that we generate will be used by the business users from the sales and marketing team. So, should we use a duplicate identifier (mapping sheet) to indicate the customer names, addresses, etc.? For ex: …
Category: Data Science

Does data anonymization conflict with GDPR rules?

There are GDPR articles that relate to a person's ownership of their data e.g., Art. 17 GDPR Right to erasure (‘right to be forgotten’) and Art. 20 GDPR Right to data portability. In case one would anonymize the data without a way to "restore" the relation between the person (name + e-mail address) (which in turn would allow handling of the person-specific data), I'd say this would conflict with these GDPR articles. Are there data anonymization techniques that allow to …
Category: Data Science

Can longitudinal studies be completely anonymous?

A regular digital questionnaire can be completely anonymous, by sending out a non-personalized URL for the questionnaire and not asking for or storing identifiable information (such as the user's IP address or answers to questions about date of birth, etc.). By this I mean that, as the researcher, I am unable to later identify who filled out a questionnaire, even if I wanted to. I now have a longitudinal study, with 4 waves of questionnaires, one year apart each. Consecutive waves are required …
Category: Data Science
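
One approach sometimes used for linking waves without storing identities is a self-generated code: each respondent answers a few stable, non-identifying questions in every wave, and only a salted hash of those answers is stored. The sketch below assumes such questions exist for the study; `STUDY_SALT`, `linking_code`, and the example answers are hypothetical, and with few possible answer combinations the hash could be brute-forced, so this reduces rather than eliminates re-identification risk.

```python
import hashlib

# Hypothetical secret; in practice it would be kept out of the stored data.
STUDY_SALT = "replace-with-a-secret-study-salt"

def linking_code(answers, salt=STUDY_SALT):
    """Hash normalised answers to stable questions into a wave-linking code."""
    normalised = "|".join(a.strip().lower() for a in answers)
    digest = hashlib.sha256((salt + "|" + normalised).encode("utf-8")).hexdigest()
    return digest[:16]

# The same respondent in two waves, with slightly different typing.
wave1 = linking_code(["M", "blue", "7"])
wave2 = linking_code([" m", "Blue ", "7"])
# A different respondent.
other = linking_code(["K", "green", "2"])
```

Records from different waves can then be joined on the code alone, without a name, e-mail address, or personalized URL.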

How to identify a field as holding personally identifiable information from the name of the field itself using an ML model in Python?

Is it possible to automatically detect fields holding personal information (name, phone, address, SSN, passport, gov ID...) from their names, using Python, in order to upload datasets to the cloud after encrypting or anonymizing the PII fields? I am open to building my own model by training it on a dataset that holds thousands of fields, each classified as personal or not. But apparently I can't find any related datasets.
Category: Data Science
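
In the absence of a labelled dataset, a keyword baseline over tokenized field names can serve as a starting point, and its output could later be used to bootstrap labels for a trained model. This is a sketch under that assumption, not an ML model: the `PII_KEYWORDS` set and both function names are hypothetical and would need tuning for a real schema.

```python
import re

# Hypothetical keyword list; extend it for the schemas at hand.
PII_KEYWORDS = {"name", "phone", "address", "ssn", "passport", "email", "birth", "dob"}

def tokenize_field_name(field_name):
    """Split snake_case, kebab-case and camelCase names into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", field_name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]

def looks_like_pii(field_name):
    """Flag a field when any of its tokens is a known PII keyword."""
    return any(token in PII_KEYWORDS for token in tokenize_field_name(field_name))

for field in ["customerName", "phone_number", "order_total", "SSN"]:
    print(field, looks_like_pii(field))
```

Matching whole tokens rather than substrings avoids flagging names such as `filename`, at the cost of missing unusual abbreviations.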

How to protect data from internal data scientists?

In our company we want to protect data privacy internally. That is, we want to find a way to anonymize the data so that the data science team members cannot expose it, yet can still use it for modelling. I searched and read about pseudonymization, but does it destroy the data? I didn't find any reliable source describing its use in practice.
Category: Data Science
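
To make the pseudonymization idea above concrete, here is a minimal sketch using a keyed hash (HMAC) from the Python standard library. The key, the record layout, and the `pseudonymize` helper are assumptions for illustration; the point is that identifiers are replaced consistently while the other columns stay intact for modelling.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash; the same input gives the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:20]

# Hypothetical key, held by the data owner and never shared with analysts.
key = b"held-by-the-data-owner-only"

# Hypothetical records.
records = [
    {"customer": "Alice Smith", "spend": 120.0},
    {"customer": "Bob Jones", "spend": 75.5},
    {"customer": "Alice Smith", "spend": 30.0},
]

safe_records = [{**r, "customer": pseudonymize(r["customer"], key)} for r in records]
```

Because the mapping is deterministic, grouping and joining by customer still work on `safe_records`. Quasi-identifiers left in other columns can still allow re-identification, so this is pseudonymization rather than full anonymization.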

Data anonymization in Python

I am working on an industrial project which involves real data. The data contains sensitive information about company operations which cannot be disclosed publicly. As a result, I need to anonymize the original data before implementing the machine learning algorithms. The data anonymization includes: changing the names of persons, places, geographical locations, etc. I would like to know what are the best practices for anonymizing datasets? Ideally, I should be able to get the original data …
Category: Data Science
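
If the goal above includes recovering the original values later, one simple option is a lookup-table pseudonymizer that keeps both directions of the mapping. This is a sketch under that assumption; the `ReversiblePseudonymizer` class, the token format, and the example names are hypothetical.

```python
import itertools

class ReversiblePseudonymizer:
    """Replace values with sequential tokens and keep a table to reverse them."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._forward = {}   # original value -> token
        self._reverse = {}   # token -> original value
        self._counter = itertools.count(1)

    def encode(self, value):
        """Return the token for a value, creating one on first sight."""
        if value not in self._forward:
            token = f"{self.prefix}_{next(self._counter):04d}"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def decode(self, token):
        """Return the original value for a token issued by this instance."""
        return self._reverse[token]

persons = ReversiblePseudonymizer("PERSON")
token = persons.encode("Maria Lopez")
original = persons.decode(token)
```

One instance per entity type (persons, places, and so on) keeps tokens readable. The lookup tables are as sensitive as the original data and would need to be stored separately from the anonymized dataset.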

How to release datasets with fingerprinting

I intend to monetise some large datasets. These datasets are anonymised and released to (paying) clients via a web API. Are there any standard algorithms such that, if the datasets are intentionally leaked publicly, the data can be altered such that the responsible party can be identified, while at the same time the data remains practically useful? There are certain approaches which come to mind, such as every client's data being very slightly different with known changes. For example in …
Category: Data Science
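
The "slightly different with known changes" idea mentioned above can be sketched with per-client keyed offsets on a numeric column. This is an illustrative toy, not a standard fingerprinting algorithm: the function names, the client keys, and the `scale` parameter are assumptions, and the scheme is not robust to rounding, re-noising, or clients colluding to average their copies.

```python
import hashlib
import hmac

def fingerprint_offset(client_key, row_id, scale=0.01):
    """Derive a small deterministic offset in [-scale, scale) for a client and row."""
    digest = hmac.new(client_key, str(row_id).encode("utf-8"), hashlib.sha256).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return (2 * fraction - 1) * scale

def fingerprint(rows, client_key, scale=0.01):
    """rows is a list of (row_id, value); return the client-specific copy."""
    return [(rid, value + fingerprint_offset(client_key, rid, scale)) for rid, value in rows]

def identify_leaker(leaked_rows, original_rows, client_keys, scale=0.01):
    """Return the client whose expected copy is closest to the leaked rows."""
    leaked = dict(leaked_rows)

    def distance(client):
        expected = fingerprint(original_rows, client_keys[client], scale)
        return sum(abs(leaked[rid] - value) for rid, value in expected if rid in leaked)

    return min(client_keys, key=distance)

# Hypothetical data and client keys.
original = [(i, 100.0 + i) for i in range(20)]
keys = {"client_a": b"key-a", "client_b": b"key-b", "client_c": b"key-c"}

leaked_copy = fingerprint(original, keys["client_b"])[:10]  # partial leak
print(identify_leaker(leaked_copy, original, keys))
```

The `scale` value controls the trade-off between traceability and how much the released values deviate from the originals.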

How do we obfuscate or de-identify data to make it anonymous and share it publicly?

Right now, I am working on preparing a small dataset for release to the public by getting rid of sensitive information. While working on it, I wondered... what are the best practices for dealing with private or sensitive polynomial attributes in a dataset?(*) I have heard that anonymity can be achieved through de-identification, obfuscation, permutation, or anonymization. However, I would like to learn more about this topic in data science/analysis. I am particularly interested in the packages and concepts that …
Category: Data Science

Anonymizing data

In https://www.kaggle.com/c/santander-product-recommendation/data it mentions: "Please note: This sample does not include any real Santander Spain customers, and thus it is not representative of Spain's customer base." In what ways can Santander anonymize its customers' data so that the solutions produced on Kaggle are still useful to them?
Category: Data Science

How can I transform names in a confidential data set to make it anonymous, but preserve some of the characteristics of the names?

Motivation: I work with datasets that contain personally identifiable information (PII) and sometimes need to share part of a dataset with third parties, in a way that doesn't expose PII and subject my employer to liability. Our usual approach here is to withhold data entirely, or in some cases to reduce its resolution; e.g., replacing an exact street address with the corresponding county or census tract. This means that certain types of analysis and processing must be done in-house, even …
Category: Data Science

Name Anonymization Software

Although I have seen a few good questions asked about data anonymization, I was wondering if there were answers to this more specific variant. I am seeking a tool (or to design one) that will anonymize human names from a specific country: particularly first names in unstructured text. Many of the tools that I have seen have considered the wider dimensions of data anonymization; with an equal focus on dates of birth, addresses, etc. An imperative aspect is that it …
Category: Data Science
