Can I create a new target value based on the average target value of same data points for regression?

I am trying to predict the profit of retail stores. The original dataframe looks like this:

Store No  feature A  feature B  year  profit
A         1          2          2016  20000
A         1          2          2017  40000
B         4          3          2017  50000
B         4          3          2018  40000
C         5          6          2015  80000
C         5          6          2016  90000

In production, neither profit nor year will be available. Because year is missing, we end up with identical data points that have different target values. So I thought of computing the average profit for every store, since the input features stay the same, then dropping the old target value and the year and removing the duplicates. The result looks like this:

Store No  feature A  feature B  Average profit
A         1          2          30000
B         4          3          45000
C         5          6          85000
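
For reference, a minimal pandas sketch of this aggregation step. The column names (`store`, `feature_a`, `feature_b`, `average_profit`) are assumptions made for the example, not the names in the real dataframe:

```python
import pandas as pd

# Sample data from the table above; column names are hypothetical.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C", "C"],
    "feature_a": [1, 1, 4, 4, 5, 5],
    "feature_b": [2, 2, 3, 3, 6, 6],
    "year": [2016, 2017, 2017, 2018, 2015, 2016],
    "profit": [20000, 40000, 50000, 40000, 80000, 90000],
})

# Group rows that share the same store and features, then average the profit.
# Grouping already yields one row per store, so no separate
# drop_duplicates() call is needed, and "year" is dropped implicitly.
agg = (
    df.groupby(["store", "feature_a", "feature_b"], as_index=False)["profit"]
      .mean()
      .rename(columns={"profit": "average_profit"})
)

print(agg)
```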

Can I use 'Average profit' as my new target for regression models, or will this create data leakage, since the average is not what we predict in production? (We predict the store's profit, not the average, and independently of the year.)

Or is this step completely unnecessary, since this is how the regression models work mathematically?
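
To make this last question concrete for one specific case, here is a minimal sketch that fits ordinary least squares (squared-error loss) once on the raw duplicated rows and once on the averaged rows of the sample above. The helper `fit_ols` is a hypothetical name introduced for the example, and the result says nothing about other models or loss functions:

```python
import numpy as np

# Raw rows (features only, year dropped) with their individual profits.
X_raw = np.array([[1, 2], [1, 2], [4, 3], [4, 3], [5, 6], [5, 6]], dtype=float)
y_raw = np.array([20000, 40000, 50000, 40000, 80000, 90000], dtype=float)

# One row per store with the averaged profit.
X_avg = np.array([[1, 2], [4, 3], [5, 6]], dtype=float)
y_avg = np.array([30000, 45000, 85000], dtype=float)


def fit_ols(X, y):
    """Least-squares fit with an intercept column (hypothetical helper)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


coef_raw = fit_ols(X_raw, y_raw)
coef_avg = fit_ols(X_avg, y_avg)

# True when both fits give the same intercept and slopes.
same_fit = np.allclose(coef_raw, coef_avg)
print(coef_raw, coef_avg, same_fit)
```

In this sample every store has exactly two rows. If the number of rows per store differs, the fit on averaged rows only matches the fit on raw rows when each averaged row is weighted by its row count.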

Thanks in advance.

Edit: I changed the sample set, since profit can also decrease over time. Either way, the year is not available, so there is no temporal dependency.

Topic data-leakage supervised-learning regression data-cleaning

Category Data Science


Your solution makes sense, and if you have no temporal data in production, this is the better way to do it. I would just add two small points:

  • Data leakage does not arise from a transformation that is based solely on the targets or solely on the features, as yours is. So you are safe here as far as data leakage is concerned.
  • There might be significant dispersion in your targets, e.g. you may have to predict $75k$ as the mean of two target values of $100k$ and $50k$. I suggest you also compute a dispersion measure (variance, standard deviation, etc.) as a second target and train the model to learn both the central measure (e.g. the mean) and the dispersion measure (variance or standard deviation). This gives you a better understanding of how "good" your "good" predictions are: predicting around $75k$ in the example above is good in machine-learning terms, but the actual data show that it is still far from both real values. Learning a dispersion measure captures this.
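
The second point could be sketched as follows, using the sample data from the question. The column names are assumptions, and the plain least-squares fit is only a stand-in for whatever regression model is actually used:

```python
import numpy as np
import pandas as pd

# Sample data from the question; column names are hypothetical.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C", "C"],
    "feature_a": [1, 1, 4, 4, 5, 5],
    "feature_b": [2, 2, 3, 3, 6, 6],
    "profit": [20000, 40000, 50000, 40000, 80000, 90000],
})

# Two targets per store: a central measure and a dispersion measure.
# "std" is the sample standard deviation (ddof=1); it is NaN for a store
# with a single row, which would need separate handling.
targets = (
    df.groupby(["store", "feature_a", "feature_b"], as_index=False)
      .agg(mean_profit=("profit", "mean"), std_profit=("profit", "std"))
)

# Stand-in model: one least-squares fit with two output columns.
X = np.column_stack([
    np.ones(len(targets)),
    targets[["feature_a", "feature_b"]].to_numpy(dtype=float),
])
Y = targets[["mean_profit", "std_profit"]].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Column 0 is the predicted mean, column 1 the predicted standard deviation.
pred = X @ coef
print(targets)
print(pred)
```

A predicted standard deviation that is large relative to the predicted mean signals that an individual year's profit may sit far from the predicted average.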
