How to generate a test set with no data leakage using multiple columns

I am developing a fraud detection algorithm. Among other things, my dataset contains the phone number, email address and a few other fields that should uniquely identify a user (let's call them "unique fields"). In order to prevent data leakage between my training and test set, I want to be sure that my test set contains only users that are completely new, meaning that it should not contain any user whose unique fields match any unique field of any user …
Category: Data Science
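One way to approach the question above is to treat rows that share *any* unique-field value as one user group (via union-find), then split by group so no group straddles the train/test boundary. A minimal sketch with toy data and hypothetical column names:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: rows sharing a phone OR an email belong to the same user group.
df = pd.DataFrame({
    "phone": ["111", "111", "222", "333", "333", "444"],
    "email": ["a@x", "b@x", "b@x", "c@x", "d@x", "e@x"],
    "label": [0, 1, 0, 1, 0, 1],
})

# Union-find: link rows that share any "unique field" value.
parent = list(range(len(df)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

for col in ["phone", "email"]:
    first_seen = {}
    for i, v in enumerate(df[col]):
        if v in first_seen:
            union(i, first_seen[v])
        else:
            first_seen[v] = i

df["group"] = [find(i) for i in range(len(df))]

# GroupShuffleSplit keeps every group entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["group"]))
train_groups = set(df.loc[train_idx, "group"])
test_groups = set(df.loc[test_idx, "group"])
assert train_groups.isdisjoint(test_groups)
```

Note the transitive linking: if two rows share a phone and one of them shares an email with a third row, all three land in the same group, which is exactly what "no unique field overlaps" requires.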

K-Fold cross validation and data leakage

I want to do K-Fold cross validation, and I also want to do normalization or feature scaling within each fold. So let's say we have k folds. At each step we take one fold as the validation set and the remaining k-1 folds as the training set. Now I want to do feature scaling and data imputation on that training set and then apply the same transformation to that validation set. I want to do this at each step. I am trying …
Category: Data Science
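The per-fold refitting described above is what scikit-learn's `Pipeline` does inside `cross_val_score`: the imputer and scaler are fit on the k-1 training folds only and then applied unchanged to the held-out fold. A sketch (the 5% injected missing values are just for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.05] = np.nan  # inject missing values for the demo

# The pipeline re-fits the imputer and scaler on each fold's k-1 training
# folds only, then applies the same fitted transform to the held-out fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
```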

Is data leakage giving me misleading results? Independent test set says no!

TLDR: I evaluated a classification model using 10-fold CV with data leakage between the training and test folds. The results were great. I then fixed the data leakage and the results were garbage. I then tested the model on an independent new dataset and the results were similar to the evaluation performed with data leakage. What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance? Extended version: I'm developing …
Category: Data Science

Why leaky features are problematic

I want to know why leaky features are problematic in machine learning/data science. I'm reading a book that uses the Titanic dataset for illustration. It says that the column body (Body Identification Number) leaks data, since if we are creating a model to predict whether a passenger would die, knowing a priori that they had a body identification number would tell us they were already dead. Logically, this makes sense. But assuming I don't have any knowledge about this …
Topic: data-leakage
Category: Data Science

Is it safe to use labels created from unsupervised model to train a supervised model using the same data?

I have a dataset where I have to detect anomalies. Now, I use a subset of the data (let's call that subset A) and apply the DBSCAN algorithm to detect anomalies on set A. Once the anomalies are detected, using the DBSCAN labels I create a label variable (anomaly: 1, non-anomaly: 0) in dataset A. Now, I train a supervised algorithm on dataset A to predict the anomalies, using the label as the dependent/target variable, and finally use the trained supervised model to …
Category: Data Science
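The pipeline the question describes can be sketched as follows (toy 2-D data; DBSCAN's noise points, labeled -1, are treated as the anomalies). This only illustrates the mechanics, not whether the resulting evaluation is trustworthy:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Subset A: one dense cluster plus a few scattered points (the anomalies).
dense = rng.normal(0.0, 0.3, size=(200, 2))
sparse = rng.uniform(-5.0, 5.0, size=(10, 2))
A = np.vstack([dense, sparse])

# Unsupervised step: DBSCAN marks low-density points as noise (-1).
db = DBSCAN(eps=0.5, min_samples=5).fit(A)
labels = (db.labels_ == -1).astype(int)  # noise -> anomaly = 1

# Supervised step: train on the SAME data using the derived labels.
clf = RandomForestClassifier(random_state=0).fit(A, labels)
```

A caveat worth keeping in mind: evaluating `clf` on data that DBSCAN also labeled tells you how well the classifier imitates DBSCAN, not how well it detects true anomalies.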

How does a data leakage work?

I'm working with panel data: every row represents a timestamp (observation) and there are multiple rows for a single timestamp (around 20 rows each). I have a total of 8719 unique timestamps. Obs_temp is the target column. The "1" column represents the hour. Every timestamp has 20 different observations (with different feature values but the same target value). When I randomly split the data into train & test and predict, Random Forest and KNN scored 0.55 and 0.0002 MAE respectively. (Baseline …
Category: Data Science
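With ~20 rows per timestamp sharing one target, a random row-level split puts siblings of the same timestamp in both train and test, which lets the model memorize the shared target. A leak-free sketch groups the split by timestamp (toy data standing in for the panel described above):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy panel: 4 rows per timestamp, identical target within a timestamp.
n_ts, rows_per_ts = 50, 4
ts = np.repeat(np.arange(n_ts), rows_per_ts)
rng = np.random.default_rng(0)
X = rng.normal(size=(n_ts * rows_per_ts, 3))
y = np.repeat(rng.normal(size=n_ts), rows_per_ts)

# GroupKFold keeps all rows of a timestamp in the same fold, so the model
# never sees a test timestamp's (shared) target during training.
gkf = GroupKFold(n_splits=5)
leak_free = all(
    set(ts[tr]).isdisjoint(ts[te]) for tr, te in gkf.split(X, y, groups=ts)
)
```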

Can I create a new target value based on the average target value of same data points for regression?

I am trying to predict the profit of retail stores. The original dataframe looks like this:

Store No  feature A  feature B  year  profit
A         1          2          2016  20000
A         1          2          2017  40000
B         4          3          2017  50000
B         4          3          2018  40000
C         5          6          2015  80000
C         5          6          2016  90000

In production, information about profit and year will not be available. Since year is not available, we have the same data points with different target values. So I …
Category: Data Science

Joining of Technical replicates with experimental data

I have a task in which I need to join data collected from non-destructive biological sensor analyses with data collected from various microbiological "wet-lab" methods, e.g. colony counting, on the observation/sample names, which represent various environmental conditions, for the purposes of generating machine learning models for the prediction of microbiological status based on the aforementioned sensor output. However, I am considering how to proceed with dealing with technical duplicates/repeats, i.e. additional plates from the same biological sample, re-runs/re-evaluation of samples …
Category: Data Science

How to fit Word2Vec on test data?

I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:

# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=69, stratify=y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF LISTS OF TOKENS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in …
Category: Data Science

Frequency/Count encoding

How do I perform frequency/count encoding for a train and test set? The implementations of this encoding I've seen simply frequency-encode the categorical variables on a single dataset (with no separate train and test encoding transformation). For instance: dataset.groupby("cat_column").size()/len(dataset). In my case I have a train and a test set. [First option] Is it okay for me to use frequency encoding on the whole dataset, or would that cause leakage? OR [Second option] I should take into consideration train, …
Category: Data Science
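The second option from the question above is the leak-free one: learn the frequency map from the training set only, then apply it to the test set, with a fallback for categories never seen in training. A sketch with a hypothetical `cat_column`:

```python
import pandas as pd

train = pd.DataFrame({"cat_column": ["a", "a", "b", "c", "a", "b"]})
test = pd.DataFrame({"cat_column": ["b", "d", "a"]})  # "d" never seen in train

# Fit the frequency map on the TRAINING set only.
freq = train.groupby("cat_column").size() / len(train)

# Apply the same map to both sets; unseen test categories get 0.
train["cat_freq"] = train["cat_column"].map(freq)
test["cat_freq"] = test["cat_column"].map(freq).fillna(0.0)
```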

Preprocessing for the final model to be deployed

Typically for an ML workflow, we import the data (X and y), split X and y into train, valid and test, preprocess the data for train, valid and test (scale, encode, impute NaN values etc.), perform HP tuning, and after getting the best model with the best HPs, we fit the final model to the whole dataset (i.e. X and y). Now the issue here is that X and y are not preprocessed, as only the train, valid and test …
Category: Data Science
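A common resolution to the issue described above is to put all preprocessing inside a `Pipeline`: tuning then happens leak-free on the training data, and refitting the winning pipeline on the full dataset automatically re-learns the preprocessing parameters from all of X. A sketch:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Because preprocessing lives inside the pipeline, "fit" re-learns the
# scaler parameters from whatever data it is given.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)          # tune on the training data only

final_model = search.best_estimator_  # carries the best HPs
final_model.fit(X, y)                 # refit preprocessing + model on ALL data
```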

proximity matrix of random forest and data leakage

My objective is to train a random forest classifier on a binary set of data and use the resulting proximity matrix to understand the sub-populations in the data. I have read some papers on this subject, but I find it difficult to develop a pipeline that is robust and does not leak data. I really want to determine a stable matrix over many iterations so I can be sure it will generalize. For example, I may do something like this: …
Category: Data Science

Does this cause data leakage in time series? (need help understanding time series data)

Does this cause data leakage in time series? I already read this: "data leakage when scaling time series". Data leakage is when information from outside the training dataset is used to create the model. Assume the past-day window is 3 and the predicting window is 2. Does this lead to data leakage in time series? I am not sure about this. Considering both figures, both test Y come after the train/valid Y, but test X overlaps with train/valid …
Category: Data Science

What can I do when my test and validation scores are good, but the submission is terrible?

This is a very broad question, I understand, and I'm totally fine if someone believes it's not appropriate to ask it. But it's killing me not to understand this... Here's the thing: I'm building a machine learning model to predict the tweet topic. I'm participating in this competition. So this is what I've done in order to ensure I'm not overfitting: I separated 10% of my training data and called it the validation set, and I used the rest (90%) to …
Category: Data Science

Splitting before tfidf or after?

When should I perform preprocessing and matrix creation of text data in NLP, before or after train_test_split? Below is my sample code where I have done preprocessing and matrix creation (tfidf) before train_test_split. I want to know: will there be data leakage?

corpus = []
for i in range(0, len(data1)):
    review = re.sub('[^a-zA-Z]', ' ', data1['features'][i])
    review = review.lower()
    review = review.split()
    review = [stemmer.stem(j) for j in review if j not in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
from …
Category: Data Science
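For the question above, the leak-free order is: split first, then fit the tf-idf vectorizer (vocabulary and idf weights) on the training texts only, and only `transform` the test texts. A minimal sketch with toy reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

corpus = ["good movie", "bad movie", "great film", "awful film",
          "good film", "bad acting", "great story", "awful plot"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split FIRST, then fit tf-idf on the training texts only.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    corpus, labels, test_size=0.25, random_state=0)

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train_txt)  # vocabulary + idf from train only
X_test = tfidf.transform(X_test_txt)        # same transform, no refitting
```

If tf-idf is fit on the full corpus before splitting, the idf weights (and vocabulary) are influenced by the test documents, which is exactly the leakage the question worries about.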

Time-Series Cross-Validation for LSTM

Is it at all possible to separate my data into train/test sets with cross validation for time series data? I am experimenting with a LSTM model. Also, I am hoping to prevent data leakage/peaking in cross validating the X and y sets, can I manually purge where the data overlaps? This is for a financial prediction problem. I will need to perform a MinMaxScaler transformation in addition to the cross validation and I am unsure where to perform this if …
Category: Data Science
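A common pattern for the setup above is `TimeSeriesSplit`, whose training indices always precede the test indices, combined with fitting the `MinMaxScaler` on each training window only. A sketch on synthetic "price-like" data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)).cumsum(axis=0)  # a drifting, price-like series
y = X[:, 0] + rng.normal(scale=0.1, size=100)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Train indices always precede test indices: no peeking at the future.
    assert train_idx.max() < test_idx.min()
    # Fit the scaler on the training window only, then transform both.
    scaler = MinMaxScaler().fit(X[train_idx])
    X_train_s = scaler.transform(X[train_idx])
    X_test_s = scaler.transform(X[test_idx])
```

Note that the scaled test window may fall outside [0, 1] when the series drifts past the training range; that is expected, not a bug, since the scaler must not see the future.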

Can I apply feature selection before splitting by requiring selection occurs > 90% of time

I want to move the feature selection step to before splitting, to save time and allow a bigger input dataset. If, in repeated subsamples, a feature is selected in over X percent of cases I will keep it. Alternatively, use a very low X to remove features that will clearly never be selected. I have read warnings against doing this, including on this forum, because of information leakage ("Feature selection: Information leaking if done before CV-split?"). But if the feature would have …
Category: Data Science
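The standard leak-free alternative to the idea above is to nest the selection step inside the cross-validation pipeline, so each fold's held-out data never votes on which features survive. A sketch using `SelectKBest` as the selection rule (the question's repeated-subsample rule would slot in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# Selection happens inside each fold, so held-out data never influences
# which features are kept for that fold's model.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
```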

Can data leakage be sometimes acceptable?

I have recently started using Kaggle and I have stumbled on a few examples of practices I would consider to be data leakage. Many of them were done by people well established on the platform, and I could tell from their notebooks that they knew what they were doing. As one example, I have seen someone fix skewness on the whole dataset before any train-test split. As another, I have seen multiple people impute missing data not only based on …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.