Can I create a new target value based on the average target value of same data points for regression?

Question

freshst4r

March 13, 2022, 13:05

I am trying to predict the profit of retail stores. The original dataframe looks like this:

Store No	feature A	feature B	year	profit
A	1	2	2016	20000
A	1	2	2017	40000
B	4	3	2017	50000
B	4	3	2018	40000
C	5	6	2015	80000
C	5	6	2016	90000

In production, information about profit and year will not be available. Since year is not available, we have identical data points with different target values. So I thought of adding the average profit for every store, since the input features stay the same, then dropping the old target value and year and removing the duplicates. The result looks like this:

Store No	feature A	feature B	Average profit
A	1	2	30000
B	4	3	45000
C	5	6	85000
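
The aggregation described above could be sketched in pandas roughly as follows. The column names (`store_no`, `feature_a`, `feature_b`, `average_profit`) are assumptions chosen for illustration; the values are the sample rows from the first table.

```python
import pandas as pd

# Sample rows from the original dataframe (column names are illustrative assumptions).
df = pd.DataFrame({
    "store_no": ["A", "A", "B", "B", "C", "C"],
    "feature_a": [1, 1, 4, 4, 5, 5],
    "feature_b": [2, 2, 3, 3, 6, 6],
    "year": [2016, 2017, 2017, 2018, 2015, 2016],
    "profit": [20000, 40000, 50000, 40000, 80000, 90000],
})

# Drop year, then collapse rows with identical features into one row
# whose target is the mean profit of that group.
aggregated = (
    df.drop(columns="year")
      .groupby(["store_no", "feature_a", "feature_b"], as_index=False)["profit"]
      .mean()
      .rename(columns={"profit": "average_profit"})
)

print(aggregated)
```

Grouping on the feature columns does the averaging and the de-duplication in one step, so no separate duplicate removal is needed in this sketch.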

Can I use 'Average profit' as my new target for regression models, or will this create data leakage, since the average is not what we predict in production (we predict the store's profit, not the average, and independently of the year)?

Or is this step completely unnecessary, since this is how the regression models work mathematically?

Thanks in advance.

Edit: I edited the sample set, since the profit can decrease over time. In any case, the information about year is not available, so there is no temporal dependency.

Topic data-leakage supervised-learning regression data-cleaning

Category Data Science

Kasra Manshaei · Accepted Answer · March 13, 2022, 12:39

Your solution makes total sense, and if you do not have temporal data in production, this is the better way to do it. I would just add two small points:

Data leakage does not happen when you transform solely based on targets or solely based on features, so you are safe here as far as data leakage is concerned.
There might be significant dispersion in your targets, e.g. you may have to predict $75k$ as the mean of two target values of $100k$ and $50k$. I suggest you also compute a dispersion measure (variance, standard deviation, etc.) as another target and train the model to learn both a central measure (e.g. mean) and a dispersion measure (variance or standard deviation). This gives you a better understanding of how "good" your "good" predictions are: in the example above, predicting around $75k$ is good from a machine-learning perspective, but the true statistics of your data show that it is still far from both real values. Learning a dispersion measure captures this.
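
As a rough illustration of the second point, the two targets (a central measure and a dispersion measure) could be built per store as in the sketch below. It assumes pandas and uses the sample rows from the question; the column names `profit_mean` and `profit_std` are hypothetical, and the standard deviation is the sample standard deviation (pandas default, `ddof=1`).

```python
import pandas as pd

# Sample rows from the question (column names are illustrative assumptions).
df = pd.DataFrame({
    "store_no": ["A", "A", "B", "B", "C", "C"],
    "feature_a": [1, 1, 4, 4, 5, 5],
    "feature_b": [2, 2, 3, 3, 6, 6],
    "profit": [20000, 40000, 50000, 40000, 80000, 90000],
})

# One row per distinct feature combination, with two candidate targets:
# the mean profit and the sample standard deviation of profit.
targets = (
    df.groupby(["store_no", "feature_a", "feature_b"], as_index=False)
      .agg(profit_mean=("profit", "mean"), profit_std=("profit", "std"))
)

print(targets)
```

Both columns can then be used as targets, for example with a model that supports multiple outputs or with two separate regressors. Note that a group with a single row has an undefined sample standard deviation (NaN in pandas), which would need handling before training.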
