How to build a multiple-variable regression with a mix of numerical and categorical features?

Question

Артём Ощепков

April 26, 2022, 22:02

There is a need to estimate Annual Average Daily Traffic Volume (AADT).

We have a large amount of data on vehicle speeds, collected over several years. We have noticed that AADT depends on the average number of such samples over a given period, so a regression model $Y = f(x_1)$ could help estimate the AADT.

The problem is that other features also affect this dependency, both numerical $(x_2, \ldots, x_k)$ and categorical $(c_1 = \text{data provider},\ c_2 = \text{road class}, \ldots, c_m)$.

We believe that $x_1$ affects the AADT much more than any of the other features, and that $x_1$ itself may also depend on them. That is why we would like to get a set of regressions $Y = f(x_1)$ that depend on $(x_2, \ldots, x_k,\ c_1, \ldots, c_m)$.

Both $k$ and $m$ are small.

—

Is it reasonable to first cluster the dataset by the features $(x_2, \ldots, x_k,\ c_1, \ldots, c_m)$ and then fit a regression $Y = f(x_1)$ within each cluster?

Or is it better to consider all the features $(x_1, x_2, \ldots, x_k,\ c_1, \ldots, c_m)$ together, with $x_1$ given more weight than the others?

Also note that the features for this multiple-variable regression are a mix of numerical and categorical.

Topic multivariate-distribution features regression categorical-data

Category Data Science

Brian Spiering · Accepted Answer · July 4, 2021, 15:53

Regression is a machine learning technique that learns the weights of features from the data. If $x_1$ is the most important feature, the model will learn to weight it the most.
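To make "weights" concrete, here is the model written out, assuming an ordinary linear regression (the answer does not name a specific model, so the linear form and the symbols $\beta_j$ and $\varepsilon$ are assumptions made for illustration):

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

Least-squares fitting picks the $\beta_j$ that minimize the sum of squared residuals over the data. The magnitudes $|\beta_j|$ are only comparable as a measure of importance when the features are on similar scales, for example after standardizing each $x_j$.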

There is no reason to cluster the data first.

Categorical features should be encoded as numbers. One-hot encoding is a common choice.
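A minimal sketch of one-hot encoding followed by a single regression over all features, using only NumPy. Everything here is an assumption for illustration: the data are synthetic and noise-free, `x2` stands in for an arbitrary second numerical feature, the category labels are made up, and `one_hot` is a hypothetical helper, not a library function.

```python
import numpy as np

def one_hot(values, drop_first=True):
    """Encode a list of category labels as 0/1 indicator columns."""
    levels = sorted(set(values))
    if drop_first:
        # Drop one reference level so the indicators are not
        # collinear with the intercept column.
        levels = levels[1:]
    columns = np.array([[float(v == lvl) for lvl in levels] for v in values])
    return columns, levels

# Synthetic data, made up for illustration only.
x1 = np.array([120.0, 300.0, 80.0, 410.0, 250.0, 150.0, 360.0, 90.0])
x2 = np.array([2.0, 4.0, 2.0, 6.0, 4.0, 2.0, 4.0, 2.0])
provider = ["A", "B", "A", "B", "A", "B", "A", "B"]
road_class = ["local", "highway", "local", "highway",
              "arterial", "local", "highway", "arterial"]

provider_cols, provider_levels = one_hot(provider)   # levels kept: ["B"]
road_cols, road_levels = one_hot(road_class)         # levels kept: ["highway", "local"]

# Design matrix: intercept, numerical features, then indicator columns.
X = np.column_stack([np.ones(len(x1)), x1, x2, provider_cols, road_cols])
names = (["intercept", "x1", "x2"]
         + ["provider=" + p for p in provider_levels]
         + ["road_class=" + r for r in road_levels])

# Hypothetical "true" weights used to generate a noise-free target.
true_weights = np.array([100.0, 20.0, 50.0, 300.0, 1000.0, 400.0])
y = X @ true_weights

# Ordinary least squares over numerical and encoded categorical features.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, w in zip(names, weights):
    print(f"{name:>20}: {w:10.2f}")
```

Each indicator weight shifts the intercept for its category, so the single fitted model already yields a separate line $Y = f(x_1)$ per provider and road class without clustering first. In practice, `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would usually do the encoding step.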
