How to do feature engineering on test data after deployment?

I am somewhat confused about feature engineering. I am building a web app where people can upload test data as a CSV. I am not sure how to do feature engineering after the app is deployed, especially how to handle outliers and missing values.

  1. Suppose I want to replace all outliers in the test data with the Q3 + (1.5 * IQR) value. Should I use the Q3 + (1.5 * IQR) value calculated from the training dataset, or should I calculate Q3 + (1.5 * IQR) separately on the test data and use that to replace its outliers?
  2. The same confusion applies to missing values. Should I impute them with the training dataset's mean/median/mode or with the test data's mean/median/mode?
  3. I know that any transformer which is fit on the training data should only be used to transform the test data. But suppose I apply a plain transformation that has no fit step, just to make a feature more Gaussian-like; for example, I used np.cos() on one of the features. Should I apply np.cos() to that feature in the test data as well?

My overall confusion is how to do feature engineering after deploying the model, where the test data can be anything: a single row or multiple rows, with missing values, outliers, and so on.

Topic: transformation, feature-engineering

Category: Data Science


OK, so the main rule of thumb is: use only what you would actually have available in real time; anything else is wrong to use, because you would not be able to provide it at prediction time. If people upload test data, you assume it is unseen, so you should not fit anything on its values.

I will answer your questions assuming a real-time scenario where you receive new data frequently and need to decide how to handle it as it arrives.

1+2. Regarding questions 1 and 2: it depends on the training distribution vs. the test distribution. You can only use data that would be available in a real-time scenario; if the data has dates, use train + test data up to the date of the instance currently being predicted (a rolling window). The right treatment always depends on the data's properties. For example, data with an inherent trend would make outlier clipping fail, and you would be better off recomputing the distribution properties (IQR, Q3, etc.) very frequently, or using a rolling window (only recent data) for the computation. The same goes for missing values. On the other hand, if you expect the distributions to be similar, you can use the training data alone.
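
For the common case where you trust the training distribution, a minimal sketch of the fit-on-train / apply-on-test pattern could look like the following (pandas assumed; the column name "amount" and the function names are hypothetical):

```python
import pandas as pd

# Fit: compute the statistics once, on the training data only.
def fit_stats(train: pd.DataFrame, col: str) -> dict:
    q1, q3 = train[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return {
        "lower": q1 - 1.5 * iqr,          # clipping bounds for outliers
        "upper": q3 + 1.5 * iqr,
        "median": train[col].median(),    # value used to impute missing entries
    }

# Transform: apply the saved training statistics to any uploaded data,
# whether it is a single row or many rows.
def transform(test: pd.DataFrame, col: str, stats: dict) -> pd.DataFrame:
    test = test.copy()
    test[col] = test[col].fillna(stats["median"])
    test[col] = test[col].clip(lower=stats["lower"], upper=stats["upper"])
    return test

# stats = fit_stats(train_df, "amount")            # at training time, then persist (e.g. JSON/pickle)
# clean = transform(uploaded_df, "amount", stats)  # at inference time, on the uploaded CSV
```

In a rolling-window setup you would call fit_stats on only the recent window instead of the full training set, on whatever schedule suits the data.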

  3. Any transformation that does not depend on the data distribution can and should be applied identically to both the train and test sets (see the sketch below).
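
For instance, a stateless transformation like np.cos() learns no parameters from the data, so the exact same function call is used at training time and at inference time. A minimal sketch (the column name "angle" is hypothetical):

```python
import numpy as np
import pandas as pd

# A stateless transformation learns nothing from the data,
# so the identical function is applied to train and test.
def add_cos_feature(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["angle_cos"] = np.cos(df["angle"])
    return df

# train_df = add_cos_feature(train_df)        # at training time
# uploaded_df = add_cos_feature(uploaded_df)  # at inference time, identical call
```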
