How to build a multiple-variable regression with a mix of numerical and categorical features?

We need to estimate the Annual Average Daily Traffic Volume (AADT).

We have a large set of vehicle-speed samples collected over several years. We have noticed that AADT depends on the average number of such samples over a given period (call it $x_1$), so a regression model $Y = f(x_1)$ could help estimate the AADT.

The problem is that other features also affect this dependency, both numerical $(x_2, \dots, x_k)$ and categorical $(c_1 = \text{data provider},\ c_2 = \text{road class}, \dots, c_m)$.

We believe that $x_1$ affects the AADT much more than any of the other features, and that $x_1$ itself may also depend on them. That is why we would like to get a set of regressions $Y = f(x_1)$ depending on the values of $(x_2, \dots, x_k,\ c_1, \dots, c_m)$.

Both $k$ and $m$ are small.

Is it reasonable to cluster the dataset by the features $(x_2, \dots, x_k,\ c_1, \dots, c_m)$ first, and then fit a regression $Y = f(x_1)$ within each cluster?

Or is it better to consider all the features $(x_1, x_2, \dots, x_k,\ c_1, \dots, c_m)$ together, with $x_1$ given more weight than the others?

Also note that for a multiple-variable regression, the features are a mix of numerical and categorical ones.



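Regression is a machine learning technique that learns a weight (coefficient) for each feature from the data. If $x_1$ is the most important feature, the fitted model will give it the largest effect, provided the features are on comparable scales, so there is no need to assign a larger weight to $x_1$ by hand.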

There is no reason to cluster the data first.
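If the slope of $Y$ on $x_1$ is expected to differ between groups, one possible alternative to clustering (not mentioned above, so treat it as a suggestion) is to add an interaction term between $x_1$ and the categorical feature. A single model then yields a separate slope per category. Below is a minimal sketch using NumPy on synthetic data; the provider names, slopes, and intercepts are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical data: x1 is the sample count, provider is one categorical feature.
x1 = rng.uniform(0, 100, n)
provider = rng.choice(["A", "B"], n)

# Made-up ground truth: each provider has its own intercept and slope.
true_slope = {"A": 20.0, "B": 35.0}
true_intercept = {"A": 100.0, "B": 300.0}
y = np.array([true_intercept[p] + true_slope[p] * v for p, v in zip(provider, x1)])
y = y + rng.normal(0, 1, n)

# Indicator for provider B (provider A is the reference level).
is_b = (provider == "B").astype(float)

# Columns: intercept, x1, indicator, and the x1 * indicator interaction.
X = np.column_stack([np.ones(n), x1, is_b, x1 * is_b])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_a = beta[1]            # slope of Y on x1 for provider A
slope_b = beta[1] + beta[3]  # slope for provider B = base slope + interaction
print(slope_a, slope_b)
```

With every category fully interacted like this, the fitted lines match what separate per-group regressions would give, but all coefficients come from one model and can be compared directly.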

Categorical features must be encoded as numbers before fitting. One common choice is one-hot encoding, which creates one 0/1 indicator column per category level.
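As a rough illustration of one-hot encoding combined with numerical features, here is a minimal sketch using only NumPy. The data are synthetic, and the road-class names, coefficients, and offsets are assumptions made up for the example, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: two numerical features and one categorical feature.
x1 = rng.uniform(0, 100, n)
x2 = rng.uniform(0, 10, n)
road_class = rng.choice(["highway", "arterial", "local"], n)

# Made-up ground truth, used only to generate Y.
offset = {"highway": 500.0, "arterial": 200.0, "local": 50.0}
y = 30.0 * x1 + 5.0 * x2 + np.array([offset[c] for c in road_class])
y = y + rng.normal(0, 1, n)

# One-hot encode: one 0/1 column per category level (levels are sorted).
levels = np.unique(road_class)
one_hot = (road_class[:, None] == levels[None, :]).astype(float)

# No separate intercept column: the one-hot columns sum to 1 in every row,
# so they already act as per-category intercepts.
X = np.column_stack([x1, x2, one_hot])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(dict(zip(["x1", "x2", *levels], coef.round(2))))
```

In practice a library would usually handle the encoding, for example `OneHotEncoder` inside a `ColumnTransformer` in scikit-learn, or `pandas.get_dummies`. If an explicit intercept is included, one indicator column should be dropped to avoid perfect collinearity.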
