Dealing with diverse groups in regression

Question

Kira Bulatov

June 1, 2022 07:17

What happens if a certain dataset contains different "groups" that follow different linear models?

For example, suppose a scatterplot of a feature $x_i$ against $y$ shows that some points follow a linear relationship with coefficient $\beta_A < 0$ while other points clearly follow one with $\beta_B > 0$. We can infer that these points belong to two different populations: population $A$ responds negatively to high values of feature $x_i$, while population $B$ responds positively. We then create a categorical feature (or a one-hot encoding) to indicate which population each row belongs to.

Is splitting the dataset required, or can commonly used algorithms recognize that the relationship between a feature and the target differs across the levels of a categorical variable?

Topic missing-data linear-regression regression

Category Data Science

Brian Spiering · Accepted Answer · December 26, 2021 20:56

Options include segmented regression and decision tree regression. Both algorithms can learn to predict different target values conditional on feature values.
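As a rough illustration of the segmented regression idea, here is a minimal sketch using only the Python standard library. It assumes a single feature and a single breakpoint, and it fits an independent least-squares line on each side of every candidate split, keeping the split with the lowest total squared error. The function names (`fit_line`, `segmented_fit`) and the synthetic data are made up for this example, and the two segments are not constrained to meet at the breakpoint.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def segmented_fit(xs, ys, min_points=3):
    """Try every split of the x-sorted data; keep the one with the lowest total SSE."""
    pairs = sorted(zip(xs, ys))
    xs_sorted = [p[0] for p in pairs]
    ys_sorted = [p[1] for p in pairs]
    best = None
    for k in range(min_points, len(pairs) - min_points + 1):
        left = fit_line(xs_sorted[:k], ys_sorted[:k])
        right = fit_line(xs_sorted[k:], ys_sorted[k:])
        total = left[2] + right[2]
        if best is None or total < best["sse"]:
            best = {
                "breakpoint": xs_sorted[k],  # first x value of the right segment
                "left": left[:2],            # (intercept, slope) below the breakpoint
                "right": right[:2],          # (intercept, slope) from the breakpoint on
                "sse": total,
            }
    return best

# Synthetic data (an assumption for illustration): slope -2 below x = 5, slope +3 above.
rng = random.Random(0)
xs = [i * 0.25 for i in range(41)]  # 0.0, 0.25, ..., 10.0
ys = [(10 - 2 * x if x < 5 else 3 * x - 15) + rng.gauss(0, 0.2) for x in xs]

result = segmented_fit(xs, ys)
print(result["breakpoint"], result["left"], result["right"])
```

Note that this only separates the two regimes when they occupy different ranges of the feature. If the groups overlap in $x_i$, a split on the categorical feature (as a decision tree could make) is the closer analogue.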

Johannes · Accepted Answer · October 24, 2018 10:25

For the case of unobservable groups, you could use mixture models; in your case, a mixture of linear regression models. Mixture models identify latent (i.e., unobserved) clusters in the data such that all observations within a cluster share the same parameters in the regression part of the model. The textbook example is a Gaussian mixture, where each observation comes from a normal distribution but the mean differs between groups. In your case, a mixture model would infer clusters of individuals that share regression coefficients and estimate the coefficients for each cluster in one step.

For a basic introduction, see Grün, B., & Leisch, F. (2008). Finite mixtures of generalized linear regression models. In Recent Advances in Linear Models and Related Areas (pp. 205-230). Physica-Verlag HD.

Finite mixture models require the number of latent groups to be specified in advance (chosen, e.g., from domain knowledge or by cross-validation). Infinite mixture models infer a suitable number of groups from the data.

These models typically do not give you clear rules for why an individual belongs to a cluster, and consequently cannot be applied directly to new, unseen individuals. They could possibly be extended with a prior that explicitly models cluster probabilities based on observed data.
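To make the mixture-of-regressions idea concrete, here is a minimal sketch of expectation-maximization (EM) for two latent groups and a single feature, written with the Python standard library only. Everything in it is an assumption for illustration: the function names, the fixed number of components (two), the Gaussian noise model, the synthetic data, and in particular the crude initialization, which splits points by the sign of their residual from one pooled line. A real analysis would more likely use an established mixture-model package and several random restarts.

```python
import math
import random

def weighted_line(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b, variance)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = my - b * mx
    var = sum(w * (y - a - b * x) ** 2 for w, x, y in zip(ws, xs, ys)) / sw
    return a, b, max(var, 1e-6)  # floor keeps a component from collapsing to zero variance

def log_normal_pdf(y, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def em_two_lines(xs, ys, n_iter=200):
    """EM for a two-component mixture of simple linear regressions."""
    # Crude starting point (a heuristic, not part of the model): split the
    # data by the sign of the residual from one pooled least-squares line.
    a0, b0, _ = weighted_line(xs, ys, [1.0] * len(xs))
    resp = [[1.0, 0.0] if y > a0 + b0 * x else [0.0, 1.0] for x, y in zip(xs, ys)]
    for _ in range(n_iter):
        # M-step: refit each line using the current responsibilities as weights.
        params = [weighted_line(xs, ys, [r[k] for r in resp]) for k in range(2)]
        mix = [sum(r[k] for r in resp) / len(xs) for k in range(2)]
        # E-step: posterior probability that each point came from each line.
        for i, (x, y) in enumerate(zip(xs, ys)):
            logs = [
                math.log(mix[k])
                + log_normal_pdf(y, params[k][0] + params[k][1] * x, params[k][2])
                for k in range(2)
            ]
            top = max(logs)  # subtract the max before exponentiating to avoid underflow
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            resp[i] = [w / total for w in weights]
    return params, mix, resp

# Synthetic data (an assumption for illustration): the group label is used only
# to generate y and is never shown to the algorithm.
rng = random.Random(0)
xs, ys = [], []
for i in range(200):
    x = rng.uniform(0, 10)
    line = 10 - 2 * x if i % 2 == 0 else 4 + 3 * x
    xs.append(x)
    ys.append(line + rng.gauss(0, 0.5))

params, mix, resp = em_two_lines(xs, ys)
slopes = sorted(p[1] for p in params)
print(slopes, mix)
```

The responsibilities in `resp` depend on the observed $y$ of each point, which is why a model like this cannot assign a cluster to a new individual from its features alone.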

user2974951 · Accepted Answer · September 24, 2018 06:51

You can't really do that. There may be some factor that binds certain "groups" of data together, but there are many possible reasons for this: your relationship may be nonlinear, or the "groups" may represent subjects or objects within which a stronger correlation exists. Unless you know for a fact that these points belong to different populations, you shouldn't assign them to separate populations; instead, use the data that you have to model these groupings.
