Right order for Data preparation in Machine Learning

For the data preparation steps listed below

  • Outlier detection/treatment
  • Data imputation
  • Data scaling/standardisation
  • Class balancing

There are two sub-questions:

  1. Should each of these steps be performed after the train/test split?
  2. Should each step also be applied to the test data?

I would appreciate an explanation for each step individually.

We split the data into a training and a test set because we want to mimic the "real world": that is, how our model will perform when it encounters new/unseen examples.

The test set no longer mimics the real world once we use it to develop our model, even if we only use it to fit a scaler or to estimate moments for imputation. Our model effectively gains extra information, which can inflate its apparent performance on the test set. This phenomenon is known as data leakage, and it is one reason why many models do poorly when faced with genuinely new examples.

As a best practice, split your data into a training and test set from the beginning, put the test set away, and do not touch it until you are ready to evaluate the model. Any transformation applied to the test set should be based only on what was observed in the training set: for example, standardise a variable in both the train and test sets using the moments estimated from the training set alone. This will give you a better indication of how your model will perform on unseen examples.
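A rough illustration of that workflow, assuming scikit-learn and some made-up numeric data (nothing here comes from the question itself); the scaler's moments are estimated on the training set only:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Made-up numeric data: 100 rows, 3 features.
    rng = np.random.default_rng(0)
    X = rng.normal(loc=50, scale=10, size=(100, 3))
    y = rng.integers(0, 2, size=100)

    # Split first and set the test data aside.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Estimate the moments (mean, standard deviation) on the training set only...
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    # ...and reuse those same moments to transform the test set.
    X_test_scaled = scaler.transform(X_test)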


Data preparation is an art: you need your gut or intuition to build a good model. You have to perform data cleaning first, because garbage in = garbage out; if you don't clean the data, your model will be garbage itself. Data scientists spend 20%-30% of their time cleaning data. There is no fixed rule for cleaning data, but you can follow a few guidelines.

  1. Remove duplicate values:- You need a good eye to recognise duplicate data. Drop the duplicate records and keep only one record from each group of duplicates; pandas.DataFrame.drop_duplicates will do this for you.
  2. Data integrity:- Make sure the data is accurate. If some fields of the Age column contain "Male" or "Female", that makes no sense; or the Age column may contain 200 years for a person who is actually 20. You could call this outlier detection as well, but an invalid value is not necessarily an outlier.
  3. Null values (data imputation):- You can use the sklearn or pandas library for this; each comes with its own features. Tip: use the KNNImputer, which completes missing values using k-Nearest Neighbors and often gives far better results. (A short sketch of all three steps follows this list.)
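A minimal sketch of these three cleaning steps, assuming pandas and scikit-learn; the column names and values below are made up purely for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    # Made-up raw data with a duplicate row, an impossible age and missing values.
    df = pd.DataFrame({
        "age":    [20, 20, 200, 35, np.nan],
        "income": [30000, 30000, 45000, np.nan, 52000],
    })

    # 1. Remove duplicate records, keeping one copy of each.
    df = df.drop_duplicates()

    # 2. Data integrity: values that cannot be valid (e.g. age above 120)
    #    are set to NaN so they can be imputed in the next step.
    df.loc[df["age"] > 120, "age"] = np.nan

    # 3. Impute the remaining missing values with k-Nearest Neighbors.
    imputer = KNNImputer(n_neighbors=2)
    df_clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)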

PERFORM THE SPLIT NOW:-

This has to be done here to avoid data leakage: standardising the data before the split means your training data contains information about your test data.

  1. Column standardisation: The data needs to be scaled. You transform the data so that each column has a mean of zero and a variance of one. Many algorithms assume or benefit from scaled inputs; scaling helps the model converge faster and eliminates problems related to differing scales. (Every column is on the same scale, so the model doesn't give preference to a column simply because it has large values.)
  2. Encode categorical features.

NOTE:- The split is performed before feature scaling (standardisation): StandardScaler is fit_transformed on the training data and only transformed on the test data, as in the sketch below.
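A minimal sketch of that order, assuming scikit-learn's ColumnTransformer for the scaling and encoding steps; the columns and values are made up for illustration:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Made-up cleaned data: two numeric columns, one categorical column, one target.
    df = pd.DataFrame({
        "age":    [22, 35, 58, 41, 29, 63, 47, 33],
        "income": [28000, 52000, 61000, 45000, 39000, 72000, 50000, 43000],
        "city":   ["A", "B", "A", "C", "B", "A", "C", "B"],
        "target": [0, 1, 1, 0, 0, 1, 1, 0],
    })
    X, y = df.drop(columns="target"), df["target"]

    # Split first, so the fitted transformers never see the test data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Standardise the numeric columns and one-hot encode the categorical one.
    preprocess = ColumnTransformer([
        ("scale", StandardScaler(), ["age", "income"]),
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])

    X_train_prep = preprocess.fit_transform(X_train)  # fit on training data only
    X_test_prep = preprocess.transform(X_test)        # reuse the fitted parameters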

