Can I use the average target value of identical data points as a new regression target?
I am trying to predict the profit of retail stores. The original dataframe looks like this:
Store No | feature A | feature B | year | profit |
---|---|---|---|---|
A | 1 | 2 | 2016 | 20000 |
A | 1 | 2 | 2017 | 40000 |
B | 4 | 3 | 2017 | 50000 |
B | 4 | 3 | 2018 | 40000 |
C | 5 | 6 | 2015 | 80000 |
C | 5 | 6 | 2016 | 90000 |
In production, profit and year will not be available. Without year, the training data contains identical data points with different target values. So I thought of replacing the target with the average profit per store, since the input features stay the same for each store, then dropping the original target and the year column and removing the duplicates. The result looks like this:
Store No | feature A | feature B | Average profit |
---|---|---|---|
A | 1 | 2 | 30000 |
B | 4 | 3 | 45000 |
C | 5 | 6 | 85000 |
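
For reference, here is a minimal pandas sketch of the aggregation described above. The column names (`store_no`, `feature_a`, `feature_b`, `average_profit`, `n_years`) are my own placeholders for the table headers, and the `n_years` count column is an addition that is not in the table; it only records how many rows were averaged per store.

```python
import pandas as pd

# Sample data from the first table; column names are placeholders.
df = pd.DataFrame({
    "store_no": ["A", "A", "B", "B", "C", "C"],
    "feature_a": [1, 1, 4, 4, 5, 5],
    "feature_b": [2, 2, 3, 3, 6, 6],
    "year": [2016, 2017, 2017, 2018, 2015, 2016],
    "profit": [20000, 40000, 50000, 40000, 80000, 90000],
})

# Drop year, then collapse duplicate feature rows into one row per store,
# keeping the mean profit and the number of rows that went into it.
aggregated = (
    df.drop(columns="year")
      .groupby(["store_no", "feature_a", "feature_b"], as_index=False)
      .agg(average_profit=("profit", "mean"), n_years=("profit", "count"))
)
print(aggregated)
```

Grouping and averaging in one step replaces the separate "add average, drop target, drop duplicates" steps and gives the same three rows.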
Can I use 'Average profit' as my new target for regression models, or will this create data leakage, since the average is not what we predict in production? (We predict a store's profit, not its average, and independently of the year.)
Or is this step completely unnecessary, since a regression model fitted on the duplicated rows would effectively learn the average anyway?
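
One way to check the second question numerically, as a rough sketch: assuming ordinary least squares (squared-error loss), fitting on the raw duplicated rows should give the same coefficients as fitting on the averaged rows weighted by the number of rows per store. The helper `fit_ols` is a hypothetical name of mine, and this says nothing about other losses or model families (for example absolute error or tree-based models), where the two setups may differ.

```python
import numpy as np

# Raw rows from the sample table with year dropped: [feature A, feature B]
X_raw = np.array([[1, 2], [1, 2], [4, 3], [4, 3], [5, 6], [5, 6]], dtype=float)
y_raw = np.array([20000, 40000, 50000, 40000, 80000, 90000], dtype=float)

# One row per store with the average profit and the row count per store
X_agg = np.array([[1, 2], [4, 3], [5, 6]], dtype=float)
y_agg = np.array([30000, 45000, 85000], dtype=float)
counts = np.array([2, 2, 2], dtype=float)

def fit_ols(X, y, weights=None):
    """Least-squares fit with intercept; optional per-row weights."""
    A = np.column_stack([np.ones(len(X)), X])
    if weights is not None:
        # Weighted least squares: scale each row by sqrt(weight)
        w = np.sqrt(weights)
        A = A * w[:, None]
        y = y * w
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

coef_raw = fit_ols(X_raw, y_raw)
coef_agg = fit_ols(X_agg, y_agg, weights=counts)

print(coef_raw)
print(coef_agg)
print(np.allclose(coef_raw, coef_agg))
```

In the sample every store has two rows, so the weights make no difference here; they would matter if stores had different numbers of years.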
Thanks in advance.
Edit: I changed the sample data, since profit can also decrease over time. Either way, the year is not available, so there is no temporal dependency to model.
Topic data-leakage supervised-learning regression data-cleaning
Category Data Science