Advantages to combining similarly-named columns for supervised ML?

Question

v81

October 14, 2021 09:19

Is there any benefit to combining similarly named columns, either in accuracy or in training/prediction speed, for logistic regression, random forest, or neural network models?

I have seen this done at times but wasn't sure if there was more than a heuristically-motivated reason for doing it.

E.g., converting this:

name	col1	col2	col3	time
gina	5	12	20	30
john	6	7	43	40

to this:

name	(col1,col2,col3)	time
gina	(5,12,20)	30
john	(6,7,43)	40

Topic data-wrangling supervised-learning accuracy

Category Data Science

spectre · Accepted Answer · October 14, 2021 09:19

What you are describing is a form of feature engineering. In this case it is done to reduce the dimensionality of the dataset: two or more features that provide the same information are combined into one feature.

For example, I had a dataset where I had to predict the price of a used car. There were two features, month of registration and year of registration, so I combined them into one feature called age of car. That way I removed one dimension from my dataset.
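
A minimal sketch of that kind of combination in plain Python. The column names (reg_year, reg_month, price), the values, and the reference date are all made up for illustration; they are not taken from the dataset described above, and measuring age in fractional years is just one possible choice.

```python
# Hypothetical rows; names and values are assumptions for illustration only.
rows = [
    {"reg_year": 2015, "reg_month": 3, "price": 9500},
    {"reg_year": 2018, "reg_month": 11, "price": 14200},
]


def car_age_years(reg_year, reg_month, ref_year, ref_month):
    """Fractional years between the registration month and a reference month."""
    months = (ref_year - reg_year) * 12 + (ref_month - reg_month)
    return months / 12


def combine_registration(rows, ref_year, ref_month):
    """Replace reg_year and reg_month with a single age_years feature."""
    combined = []
    for row in rows:
        # Keep every column except the two that are being merged.
        new_row = {k: v for k, v in row.items()
                   if k not in ("reg_year", "reg_month")}
        new_row["age_years"] = car_age_years(
            row["reg_year"], row["reg_month"], ref_year, ref_month
        )
        combined.append(new_row)
    return combined


# The reference date (October 2021) is an arbitrary assumption.
features = combine_registration(rows, ref_year=2021, ref_month=10)
```

Each row in features now carries one numeric age_years column in place of the two registration columns, so the model sees one fewer input dimension.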

Keep in mind that this is mostly done for numerical features, not categorical ones. Combining categorical features this way results in a textual dataset, which will then need NLP.
