Concatenating data from two years

I need to build a machine learning model that predicts electricity consumption and carbon emissions from building features (area, year of construction, ...). Here is the link to the data. The problem is that I have data for two years, 2015 and 2016; for each year I have a set of buildings with their mean consumption and emissions. I'm wondering what the best way to concatenate the data is, since some buildings are registered in only one year (although the majority appear in both 2015 and 2016). These are the ideas I've come up with so far:

  1. Aggregate the data: keep only the buildings that are present in both years and average their consumption and emissions, giving one mean per building across both years. (I believe this is a good strategy, but we would have to deal with categorical features that change between years.)
  2. Keep all the data, but make sure that, when splitting to train the models, all rows for a given building fall in either the train set or the test set, never both. (This is the simplest way to combine the data, but splitting becomes more challenging, especially with K-fold.)
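
For illustration, option 1 could be sketched in pandas roughly as follows. The column names (`building_id`, `property_type`, `consumption`, `emission`) and the toy values are assumptions, since the actual schema is not shown here, and taking the most recent categorical value is just one possible rule for features that change between years.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df_2015 = pd.DataFrame({
    "building_id": [1, 2, 3],
    "property_type": ["Office", "Hotel", "School"],
    "consumption": [100.0, 200.0, 300.0],
    "emission": [10.0, 20.0, 30.0],
})
df_2016 = pd.DataFrame({
    "building_id": [1, 2, 4],
    "property_type": ["Office", "Retail", "Office"],
    "consumption": [120.0, 180.0, 400.0],
    "emission": [12.0, 18.0, 40.0],
})

# Stack both years, remembering which year each row came from.
stacked = pd.concat(
    [df_2015.assign(year=2015), df_2016.assign(year=2016)],
    ignore_index=True,
)

# Keep only buildings that appear in both years.
n_years = stacked.groupby("building_id")["year"].transform("nunique")
both = stacked[n_years == 2]

# Average the targets; for the categorical feature, keep the most recent value
# (an arbitrary choice, other rules are possible).
aggregated = (
    both.sort_values("year")
        .groupby("building_id")
        .agg(
            property_type=("property_type", "last"),
            consumption=("consumption", "mean"),
            emission=("emission", "mean"),
        )
        .reset_index()
)

# Flag buildings whose categorical feature changed between the two years.
changed = both.groupby("building_id")["property_type"].nunique() > 1
```

The `changed` flag makes it easy to inspect how many buildings are affected by a changing categorical feature before settling on a rule.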

Which do you think is the best way to concatenate this data set? Thank you for your help.



If your goal is to predict yearly consumption, I'd go with option 2. You correctly pointed out the only real issue with this option: you have to make sure rows from the same building end up in the same set; otherwise the model is evaluated on buildings it has already seen, and you'd overestimate its performance.
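
One way to enforce this, assuming scikit-learn is available, is to pass the building identifier as the `groups` argument of a group-aware splitter. The sketch below uses `GroupShuffleSplit` for a single train/test split and `GroupKFold` for K-fold; the toy arrays stand in for the real features and targets.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Toy stand-in: one row per (building, year); `groups` holds the building id.
# Buildings 3 and 6 appear in only one year.
groups = np.array([1, 1, 2, 2, 3, 4, 4, 5, 5, 6])
X = np.arange(len(groups) * 2, dtype=float).reshape(len(groups), 2)
y = np.arange(len(groups), dtype=float)

# Single train/test split in which no building is on both sides.
gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# K-fold variant: each building falls in the validation part of exactly one fold.
gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, y, groups=groups))

for fold_train, fold_val in folds:
    # Sanity check: training and validation buildings never overlap.
    assert set(groups[fold_train]).isdisjoint(groups[fold_val])
```

The same `groups` array can be passed to `cross_val_score` or `GridSearchCV` together with `cv=GroupKFold(...)`, so the K-fold case is not much harder than a plain split.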

Another thing, not directly linked to your question: year of construction is not a good variable on its own. If a building was built in 2010 and you analyse its consumption in 2015, the building is fairly new (5 years old). If you apply the model to the same building in 2021, it is now 11 years old, but its year of construction is still 2010. It is better to use a variable such as 'Years Old' that counts how old the building is at the time the consumption is measured.
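
As a small sketch of that feature, assuming hypothetical columns `data_year` (the year the consumption was measured) and `year_built`, the age can be derived per row:

```python
import pandas as pd

# Hypothetical columns; the real dataset may name them differently.
df = pd.DataFrame({
    "building_id": [1, 1, 2],
    "data_year": [2015, 2016, 2016],
    "year_built": [2010, 2010, 1995],
})

# Age of the building at the time the consumption was measured.
df["building_age"] = df["data_year"] - df["year_built"]
```

Building 1 then gets an age of 5 in its 2015 row and 6 in its 2016 row, so the two rows of the same building are no longer identical in this feature.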
