I want to ask why label encoding before the train/test split is considered data leakage. From my point of view, it is not: for example, you encode "good" as 2, "neutral" as 1 and "bad" as 0, and the mapping will be the same for both the train and test sets. So why do we have to split first and then do label encoding?
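To make the order concrete, here is a minimal sketch of the split-then-encode version I keep seeing recommended (the column values are just an illustration):

    from sklearn.preprocessing import LabelEncoder

    # the split happens first; these two lists stand in for the categorical column of each split
    train_quality = ["good", "bad", "neutral", "good", "bad"]
    test_quality = ["neutral", "good"]

    le = LabelEncoder()
    train_encoded = le.fit_transform(train_quality)  # mapping learned from the train split only
    test_encoded = le.transform(test_quality)        # same mapping reused; a label unseen in train would raise an error here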
I'm working on a dataset that isn't split into test and train sets by default, and I'm a bit concerned about the imbalance between the 'label' distributions in the two sets and how it might affect the trained model's performance. Note that I use deep neural networks and the prediction task is regression. By sequentially splitting the samples into test/train (20/80) I get the following distributions, respectively. Since model performance is not improving by tuning hyperparameters, I'm worried that I'm …
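In case it matters, this is roughly how I could make the split respect the label distribution: bin the continuous label and pass the bins to stratify (the column names and data below are made up):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x": rng.normal(size=1000),
                       "label": rng.gamma(2.0, 2.0, size=1000)})  # skewed continuous target

    # bin the continuous label so train_test_split can stratify on it,
    # which keeps the label distribution similar in both splits
    bins = pd.qcut(df["label"], q=10, labels=False)
    train, test = train_test_split(df, test_size=0.2, random_state=42, stratify=bins)

    # quick check: compare the label's summary statistics in each split
    print(train["label"].describe())
    print(test["label"].describe())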
My teacher did this in class, and I'm wondering: is it OK to use .fit_transform with xtest? Shouldn't it just be poly.transform(xtest)?

Teacher's code:

    from sklearn.preprocessing import PolynomialFeatures
    poly = PolynomialFeatures(degree=3)
    xtrain_poly = poly.fit_transform(xtrain)
    xtest_poly = poly.fit_transform(xtest)

As I think it should be:

    from sklearn.preprocessing import PolynomialFeatures
    poly = PolynomialFeatures(degree=3)
    xtrain_poly = poly.fit_transform(xtrain)
    xtest_poly = poly.transform(xtest)

As an optional question, what do fit() and transform() do in PolynomialFeatures? transform() scales the data based on some value(s) returned by fit(), such …
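For reference, a small sketch of what I currently believe fit() and transform() do here (my understanding, not confirmed): fit() records the number of input features and the exponent combinations to generate, and transform() builds those polynomial columns; it does not scale anything.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[1.0, 2.0], [3.0, 4.0]])

    poly = PolynomialFeatures(degree=2)
    poly.fit(X)                            # learns the input width and which feature combinations to build
    print(poly.get_feature_names_out())    # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
    print(poly.transform(X))               # builds those columns for any data with the same number of features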
It is considered best practice to split your data into a train and test set at the start of a data science / machine learning project (and then to split your train set further into a validation set for hyperparameter optimisation). If it turns out that the distribution in your train set isn't the same as in your test set, perhaps one group is completely missing from the test set, or a group is over-represented in the test set, for example, what …
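As a sketch of the kind of mismatch I mean, and of the usual fix for a categorical group (the column names are illustrative): a plain random split can leave a rare group out of the test set entirely, whereas stratifying on the group keeps each group's share roughly equal in both splits.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({"feature": range(100),
                       "group": ["A"] * 80 + ["B"] * 15 + ["C"] * 5})

    # stratify on the group column so even the rare group appears in both splits
    train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df["group"])

    print(train["group"].value_counts(normalize=True))
    print(test["group"].value_counts(normalize=True))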
I am starting out in machine learning, and I have doubts about some concepts. I've read that we need to split our dataset into training, validation and test sets. I'll ask four questions related to them. 1 - Training set: Is it used in .fit() so our model can learn parameters such as the weights in a neural network? 2 - Validation set: Can it also be used in .fit()? The validation set is used so we can validate our model at the end …
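In case my questions are unclear, here is a minimal Keras-style sketch of how I imagine the three sets being used (the data, shapes and model are made up):

    import numpy as np
    from tensorflow import keras

    x_train, y_train = np.random.rand(800, 10), np.random.rand(800)
    x_val, y_val = np.random.rand(100, 10), np.random.rand(100)
    x_test, y_test = np.random.rand(100, 10), np.random.rand(100)

    model = keras.Sequential([keras.Input(shape=(10,)),
                              keras.layers.Dense(16, activation="relu"),
                              keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

    # the training set is what the weights are fitted on; the validation set is
    # passed to fit() but only evaluated at the end of each epoch, not trained on
    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, verbose=0)

    # the test set is held back until the very end, for a single unbiased estimate
    print(model.evaluate(x_test, y_test, verbose=0))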
I am trying to design an algorithm that calculates the relevance of test data to a trained model. This can be done by checking whether the predictor variables have a different distribution in the train and test data (covariate shift). Main idea: if there is a covariate shift, then after mixing the train and test sets we will be able to classify the origin of each data point (whether it comes from the test or the train set) with good accuracy. I define a 'relevance …
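To make the main idea concrete, here is a sketch of how I plan to implement it (the classifier choice and the AUC-based reading are my own assumptions): label every row by its origin, mix the two sets, and check whether a classifier can separate them better than chance.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # hypothetical train/test feature matrices; the test set is shifted on purpose
    rng = np.random.default_rng(0)
    X_train = rng.normal(loc=0.0, size=(500, 5))
    X_test = rng.normal(loc=0.5, size=(200, 5))

    # label each row by its origin (0 = train, 1 = test) and mix the two sets
    X_mix = np.vstack([X_train, X_test])
    origin = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

    # an AUC well above 0.5 means the origins are separable, i.e. covariate shift
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    auc = cross_val_score(clf, X_mix, origin, cv=5, scoring="roc_auc").mean()
    print(f"origin-classification AUC: {auc:.3f}")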