Concatenating data from two years
I have to use a machine learning model to predict electricity consumption and carbon emissions from some features of buildings (area, year of construction, ...). Here is the link to the data. The problem is that I have data for two years, 2015 and 2016, and for each year I have a set of buildings with their mean consumption and emissions. I'm wondering what the best way to concatenate the data is, since some buildings are recorded in only one year (although the majority appear in both 2015 and 2016). These are the ideas I've come up with so far:
- Aggregate the data, i.e. keep only the buildings that are present in both years and average their consumption and emissions over the two years, so that each building has a single row. (I believe this is a good strategy, but we will have to deal with categorical features whose values change between years.)
- Keep all the data, but make sure when splitting for training that all rows of a given building end up either in the train set or in the test set, never in both. (This is the easiest approach, but splitting will be more challenging, especially if we want to use K-fold.)
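To make the two ideas concrete, here is a minimal sketch using pandas and scikit-learn. The column names (`building_id`, `area`, `consumption`, `emission`) and the toy rows are assumptions for illustration only, not the real data set. The second part uses scikit-learn's `GroupKFold`, which is one possible way to keep a building's rows in the same fold.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy stand-ins for the two yearly tables; all column names are hypothetical.
df_2015 = pd.DataFrame({
    "building_id": [1, 2, 3],
    "area": [100.0, 250.0, 80.0],
    "consumption": [10.0, 30.0, 8.0],
    "emission": [1.0, 3.0, 0.8],
})
df_2016 = pd.DataFrame({
    "building_id": [1, 2, 4],
    "area": [100.0, 250.0, 120.0],
    "consumption": [12.0, 28.0, 15.0],
    "emission": [1.2, 2.8, 1.5],
})

# Stack both years into one table, keeping the year as a column.
stacked = pd.concat(
    [df_2015.assign(year=2015), df_2016.assign(year=2016)],
    ignore_index=True,
)

# Idea 1: keep only buildings seen in both years and average the targets.
n_years = stacked.groupby("building_id")["year"].transform("nunique")
both_years = stacked[n_years == 2]
aggregated = both_years.groupby("building_id", as_index=False).agg(
    area=("area", "first"),            # assumes the feature did not change
    consumption=("consumption", "mean"),
    emission=("emission", "mean"),
)

# Idea 2: keep every row and split by building, so that no building
# appears in both the train and the test part of a fold.
features = stacked[["area", "year"]]
target = stacked["consumption"]
groups = stacked["building_id"]
folds = list(GroupKFold(n_splits=2).split(features, target, groups))

for train_idx, test_idx in folds:
    train_ids = set(groups.iloc[train_idx])
    test_ids = set(groups.iloc[test_idx])
    print("train buildings:", sorted(train_ids), "test buildings:", sorted(test_ids))
```

In the first idea, taking `"first"` for a feature only works if that feature is the same in both years; categorical features that change would need an explicit rule. In the second idea, buildings recorded in only one year are kept and simply form a group with a single row.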
What do you think is the best way to concatenate this data set? Thank you for your help.
Topic aggregation dataset data-cleaning
Category Data Science