Advantages to combining similarly-named columns for supervised ML?

Is there any benefit to combining similarly named columns either for an improvement in accuracy or for speeding up training/prediction in case of logistic regression, random forest or neural network models?

I have seen this done at times but wasn't sure if there was more than a heuristically-motivated reason for doing it.

E.g., converting this:

name col1 col2 col3 time
gina 5 12 20 30
john 6 7 43 40

to this:

name (col1,col2,col3) time
gina (5,12,20) 30
john (6,7,43) 40


What you are describing is a form of feature engineering. It is often done to reduce the dimensionality of the dataset: two or more features that carry the same information are combined into a single feature.

For example, I had a dataset where I had to predict the price of a used car. There were two features: month of registration and year of registration. I combined them into one feature, the age of the car, which reduced my dataset by one dimension.
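The car example above can be sketched as follows. This is only a minimal illustration, not the code I actually used: the function name, the sample rows, and the reference date (June 2020) are all assumptions made up for the sketch.

```python
def car_age_years(reg_year, reg_month, ref_year, ref_month):
    """Collapse registration year and month into one feature: age in years.

    The reference year/month (e.g. the date the data was collected) is an
    assumption; it is not part of the original dataset description.
    """
    months = (ref_year - reg_year) * 12 + (ref_month - reg_month)
    return months / 12.0


# Hypothetical rows: (year_of_registration, month_of_registration)
cars = [(2010, 6), (2015, 12)]

# Two columns become one numeric column, measured against June 2020.
ages = [car_age_years(year, month, 2020, 6) for year, month in cars]
print(ages)
```

The point is that the merged column is a single number a model can use directly, unlike a tuple such as (5,12,20), which would still have to be unpacked into separate numeric inputs.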

Keep in mind that this is mostly done for numerical features, not categorical ones. Combining categorical features this way tends to produce text-like values, which would then need NLP-style processing.
