How to generate a test set with no data leakage using multiple columns

I am developing a fraud detection algorithm. Among other things, my dataset contains the phone number, email address and a few other fields that should uniquely identify a user (let's call them "unique fields"). In order to prevent data leakage between my training and test set, I want to be sure that my test set contains only users that are completely new, meaning that it should not contain any user whose unique fields match any unique field of any user …
Category: Data Science
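One way to approach the question above is to treat rows that share *any* unique-field value as one user group (via union-find), then split by group so no group straddles the train/test boundary. A minimal sketch with toy data and hypothetical column names:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: rows sharing a phone OR an email belong to the same user group.
df = pd.DataFrame({
    "phone": ["111", "111", "222", "333", "333", "444"],
    "email": ["a@x", "b@x", "b@x", "c@x", "d@x", "e@x"],
    "label": [0, 1, 0, 1, 0, 1],
})

# Union-find: link rows that share any "unique field" value.
parent = list(range(len(df)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

for col in ["phone", "email"]:
    first_seen = {}
    for i, v in enumerate(df[col]):
        if v in first_seen:
            union(i, first_seen[v])
        else:
            first_seen[v] = i

df["group"] = [find(i) for i in range(len(df))]

# GroupShuffleSplit keeps every group entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["group"]))
train_groups = set(df.loc[train_idx, "group"])
test_groups = set(df.loc[test_idx, "group"])
assert train_groups.isdisjoint(test_groups)
```

Note the transitive linking: if two rows share a phone and one of them shares an email with a third row, all three land in the same group, which is exactly what "no unique field overlaps" requires.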

K-Fold cross validation and data leakage

I want to do K-Fold cross validation, and I also want to do normalization or feature scaling within each fold. So let's say we have k folds. At each step we take one fold as the validation set and the remaining k-1 folds as the training set. Now I want to do feature scaling and data imputation on that training set and then apply the same transformation to that validation set. I want to do this at each step. I am trying …
Category: Data Science
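The per-fold refitting described above is what scikit-learn's `Pipeline` does inside `cross_val_score`: the imputer and scaler are fit on the k-1 training folds only and then applied unchanged to the held-out fold. A sketch (the 5% injected missing values are just for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.05] = np.nan  # inject missing values for the demo

# The pipeline re-fits the imputer and scaler on each fold's k-1 training
# folds only, then applies the same fitted transform to the held-out fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
```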

Is data leakage giving me misleading results? Independent test set says no!

TLDR: I evaluated a classification model using 10-fold CV with data leakage between the training and test folds. The results were great. I then fixed the data leakage and the results were garbage. I then tested the model on an independent new dataset and the results were similar to the evaluation performed with data leakage. What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance? Extended version: I'm developing …
Category: Data Science

Why leaky features are problematic

I want to know why leaky features are problematic in machine learning/data science. I'm reading a book that uses the Titanic dataset for illustration. It says that the column body (Body Identification Number) leaks data, since if we are creating a model to predict whether a passenger would die, knowing a priori that they had a body identification number would tell us they were already dead. Logically, this makes sense. But assuming I don't have any knowledge about this …
Topic: data-leakage
Category: Data Science

Is it safe to use labels created from unsupervised model to train a supervised model using the same data?

I have a dataset where I have to detect anomalies. Now, I use a subset of the data (let's call that subset A) and apply the DBSCAN algorithm to detect anomalies on set A. Once the anomalies are detected, using the DBSCAN labels I create a label variable (anomaly: 1, non-anomaly: 0) in dataset A. Now, I train a supervised algorithm on dataset A to predict the anomalies, using the label as the dependent/target variable, and finally use the trained supervised model to …
Category: Data Science
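The pipeline the question describes can be sketched as follows (toy 2-D data; DBSCAN's noise points, labeled -1, are treated as the anomalies). This only illustrates the mechanics, not whether the resulting evaluation is trustworthy:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Subset A: one dense cluster plus a few scattered points (the anomalies).
dense = rng.normal(0.0, 0.3, size=(200, 2))
sparse = rng.uniform(-5.0, 5.0, size=(10, 2))
A = np.vstack([dense, sparse])

# Unsupervised step: DBSCAN marks low-density points as noise (-1).
db = DBSCAN(eps=0.5, min_samples=5).fit(A)
labels = (db.labels_ == -1).astype(int)  # noise -> anomaly = 1

# Supervised step: train on the SAME data using the derived labels.
clf = RandomForestClassifier(random_state=0).fit(A, labels)
```

A caveat worth keeping in mind: evaluating `clf` on data that DBSCAN also labeled tells you how well the classifier imitates DBSCAN, not how well it detects true anomalies.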

How does a data leakage work?

I'm working with panel data: every row represents a timestamp (observation) and there are multiple rows for a single timestamp (around 20 rows each). I have a total of 8719 unique timestamps. Obs_temp is the target column. The "1" column represents the hour. Every timestamp has 20 different observations (with different feature values but the same target value). When I randomly split the data into train & test and predict, Random Forest and KNN scored 0.55 and 0.0002 MAE respectively. (Baseline …
Category: Data Science
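With ~20 rows per timestamp sharing one target, a random row-level split puts siblings of the same timestamp in both train and test, which lets the model memorize the shared target. A leak-free sketch groups the split by timestamp (toy data standing in for the panel described above):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy panel: 4 rows per timestamp, identical target within a timestamp.
n_ts, rows_per_ts = 50, 4
ts = np.repeat(np.arange(n_ts), rows_per_ts)
rng = np.random.default_rng(0)
X = rng.normal(size=(n_ts * rows_per_ts, 3))
y = np.repeat(rng.normal(size=n_ts), rows_per_ts)

# GroupKFold keeps all rows of a timestamp in the same fold, so the model
# never sees a test timestamp's (shared) target during training.
gkf = GroupKFold(n_splits=5)
leak_free = all(
    set(ts[tr]).isdisjoint(ts[te]) for tr, te in gkf.split(X, y, groups=ts)
)
```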

Can I create a new target value based on the average target value of same data points for regression?

I am trying to predict the profit of retail stores. The original dataframe looks like this:

Store No  feature A  feature B  year  profit
A         1          2          2016  20000
A         1          2          2017  40000
B         4          3          2017  50000
B         4          3          2018  40000
C         5          6          2015  80000
C         5          6          2016  90000

In production, information about profit and year will not be available. Since year is not available, we have the same data points with different target values. So I …
Category: Data Science

Joining of Technical replicates with experimental data

I have a task in which I need to join data collected from non-destructive biological sensor analyses with data collected from various microbiological "wet-lab" methods, e.g. colony counting, on the observation/sample names, which represent various environmental conditions, for the purposes of generating machine learning models for the prediction of microbiological status based on the aforementioned sensor output. However, I am considering how to proceed with dealing with technical duplicates/repeats, i.e. additional plates from the same biological sample, re-runs/re-evaluation of samples …
Category: Data Science

How to fit Word2Vec on test data?

I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:

# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=69, stratify=y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF LISTS OF TOKENS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in …
Category: Data Science

Frequency/Count encoding

How do I perform frequency/count encoding for a train and test set? The implementations of this encoding I've seen simply frequency-encode the categorical variables on a single dataset (with no separate train and test encoding transformation). For instance: dataset.groupby("cat_column").size()/len(dataset). In my case I have a train and a test set. [First option] Is it okay for me to use frequency encoding on the whole dataset, or would that cause leakage? OR [Second option] I should take into consideration train, …
Category: Data Science
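The second option from the question above is the leak-free one: learn the frequency map from the training set only, then apply it to the test set, with a fallback for categories never seen in training. A sketch with a hypothetical `cat_column`:

```python
import pandas as pd

train = pd.DataFrame({"cat_column": ["a", "a", "b", "c", "a", "b"]})
test = pd.DataFrame({"cat_column": ["b", "d", "a"]})  # "d" never seen in train

# Fit the frequency map on the TRAINING set only.
freq = train.groupby("cat_column").size() / len(train)

# Apply the same map to both sets; unseen test categories get 0.
train["cat_freq"] = train["cat_column"].map(freq)
test["cat_freq"] = test["cat_column"].map(freq).fillna(0.0)
```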

Preprocessing for the final model to be deployed

Typically for an ML workflow, we import the data (X and y), split X and y into train, valid and test, preprocess the data for train, valid and test (scale, encode, impute NaN values etc.), perform HP tuning, and after getting the best model with the best HPs, we fit the final model to the whole dataset (i.e. X and y). Now the issue here is that X and y are not preprocessed, as only the train, valid and test …
Category: Data Science
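A common resolution to the issue described above is to put all preprocessing inside a `Pipeline`: tuning then happens leak-free on the training data, and refitting the winning pipeline on the full dataset automatically re-learns the preprocessing parameters from all of X. A sketch:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Because preprocessing lives inside the pipeline, "fit" re-learns the
# scaler parameters from whatever data it is given.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)          # tune on the training data only

final_model = search.best_estimator_  # carries the best HPs
final_model.fit(X, y)                 # refit preprocessing + model on ALL data
```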

proximity matrix of random forest and data leakage

My objective is to train a random forest classifier on a binary set of data and use the resulting proximity matrix to understand the sub-populations in the data. I have read some papers on this subject, but I find it difficult to develop a pipeline that is robust and does not leak data. I really want to determine a stable matrix over many iterations so I can be sure it will generalize. For example, I may do something like this: …
Category: Data Science

Does this cause data leakage in time series? (need help understanding time series data)

Does this cause data leakage in time series? I already read this: "data leakage when scaling time series". Data leakage is when information from outside the training dataset is used to create the model. Assume the past-day window is 3 and the predicting window is 2. Does this lead to data leakage in time series? I am not sure about this. Considering both figures, both test Y come after the train/valid Y, but test X overlaps with train/valid …
Category: Data Science

What can I do when my test and validation scores are good, but the submission is terrible?

This is a very broad question, I understand, and I'm totally fine if someone believes it's not appropriate to ask it. But it's killing me not to understand this... Here's the thing: I'm building a machine learning model to predict the tweet topic. I'm participating in this competition. So this is what I've done in order to ensure I'm not overfitting: I separated 10% of my training data and called it the validation set, and I used the rest (90%) to …
Category: Data Science

Splitting before tfidf or after?

When should I perform preprocessing and matrix creation of text data in NLP, before or after train_test_split? Below is my sample code where I have done preprocessing and matrix creation (tfidf) before train_test_split. I want to know: will there be data leakage?

corpus = []
for i in range(0, len(data1)):
    review = re.sub('[^a-zA-Z]', ' ', data1['features'][i])
    review = review.lower()
    review = review.split()
    review = [stemmer.stem(j) for j in review if j not in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
from …
Category: Data Science
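For the question above, the leak-free order is: split first, then fit the tf-idf vectorizer (vocabulary and idf weights) on the training texts only, and only `transform` the test texts. A minimal sketch with toy reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

corpus = ["good movie", "bad movie", "great film", "awful film",
          "good film", "bad acting", "great story", "awful plot"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split FIRST, then fit tf-idf on the training texts only.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    corpus, labels, test_size=0.25, random_state=0)

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train_txt)  # vocabulary + idf from train only
X_test = tfidf.transform(X_test_txt)        # same transform, no refitting
```

If tf-idf is fit on the full corpus before splitting, the idf weights (and vocabulary) are influenced by the test documents, which is exactly the leakage the question worries about.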

Time-Series Cross-Validation for LSTM

Is it at all possible to separate my data into train/test sets with cross validation for time series data? I am experimenting with a LSTM model. Also, I am hoping to prevent data leakage/peaking in cross validating the X and y sets, can I manually purge where the data overlaps? This is for a financial prediction problem. I will need to perform a MinMaxScaler transformation in addition to the cross validation and I am unsure where to perform this if …
Category: Data Science
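A common pattern for the setup above is `TimeSeriesSplit`, whose training indices always precede the test indices, combined with fitting the `MinMaxScaler` on each training window only. A sketch on synthetic "price-like" data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)).cumsum(axis=0)  # a drifting, price-like series
y = X[:, 0] + rng.normal(scale=0.1, size=100)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Train indices always precede test indices: no peeking at the future.
    assert train_idx.max() < test_idx.min()
    # Fit the scaler on the training window only, then transform both.
    scaler = MinMaxScaler().fit(X[train_idx])
    X_train_s = scaler.transform(X[train_idx])
    X_test_s = scaler.transform(X[test_idx])
```

Note that the scaled test window may fall outside [0, 1] when the series drifts past the training range; that is expected, not a bug, since the scaler must not see the future.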

Can I apply feature selection before splitting by requiring selection occurs > 90% of time

I want to move the feature selection step to before splitting, to save time and allow a bigger input dataset. If, in repeated subsamples, a feature is selected in over X percent of cases I will keep it. Alternatively, use a very low X to remove features that will clearly never be selected. I have read warnings against doing this, including on this forum, because of information leakage ("Feature selection: Information leaking if done before CV-split?"). But if the feature would have …
Category: Data Science
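The standard leak-free alternative to the idea above is to nest the selection step inside the cross-validation pipeline, so each fold's held-out data never votes on which features survive. A sketch using `SelectKBest` as the selection rule (the question's repeated-subsample rule would slot in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# Selection happens inside each fold, so held-out data never influences
# which features are kept for that fold's model.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
```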

Can data leakage be sometimes acceptable?

I have recently started using Kaggle and I have stumbled on a few examples of practices I would consider to be data leakage. Many of them were done by people well established on the platform, and I could tell from their notebooks that they knew what they were doing. As one example, I have seen someone fix skewness on the whole dataset before any train-test split. As another, I have seen multiple people impute missing data not only based on …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.