Which algorithm to use to identify clusters with a similar value?

Here is an example of my problem: 10,000 observations of people with several features [age, gender, region, number of sons, ...] and a value to predict, "income". There is no general relationship between the features and income, so an ordinary regression gives poor results. Nevertheless, I want to identify specific patterns where such a relationship does exist. For instance: [young, woman, 2 sons] -> high income; [young, man] -> low income; ...

Maybe clustering on the features, and then a regression on each cluster? Or pattern recognition? Topic modeling?

Thank you in advance

Topic pattern-recognition noise prediction regression clustering

Category Data Science


Plot the distribution of income (a histogram) and check whether you see clusters there (i.e. whether it looks like a Gaussian mixture). If so, try fitting a separate regression for each cluster and see if it works.
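Besides looking at the histogram, you can check for a mixture by fitting a two-component Gaussian mixture to the income values. The sketch below is only a minimal illustration under stated assumptions: the incomes are synthetic, the helper name `fit_two_gaussians` is made up for this example, and the choice of two components is an assumption you would need to check on your own data. In practice a library implementation such as scikit-learn's `GaussianMixture` would be the usual choice.

```python
import math
import random


def fit_two_gaussians(values, n_iter=50):
    """Fit a two-component 1D Gaussian mixture with plain EM (illustrative)."""
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    mu = [min(values), max(values)]      # crude initial means
    var = [overall_var, overall_var]     # start both components wide
    weight = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        resp = []
        for v in values:
            dens = [
                weight[k]
                * math.exp(-((v - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / n
            mu[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var[k] = sum(r[k] * (v - mu[k]) ** 2 for r, v in zip(resp, values)) / nk
    return weight, mu, var, resp


# Synthetic incomes (in thousands): two made-up groups, for illustration only.
rng = random.Random(0)
incomes = [rng.gauss(20, 3) for _ in range(300)] + [rng.gauss(60, 5) for _ in range(300)]

weight, mu, var, resp = fit_two_gaussians(incomes)
# Hard-assign each observation to its most likely component.
labels = [0 if r[0] >= r[1] else 1 for r in resp]
print("estimated means:", [round(m, 1) for m in mu])
print("component sizes:", labels.count(0), labels.count(1))
```

If the estimated means are clearly separated, `labels` gives you the groups on which to try separate regressions.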

Example: if you want to predict recruitment based on CVs, then for a normal office job the target is almost independent of the input, since people with many different backgrounds can do that job, but for a technical expert position you see more correlation between features and target. The same might happen in your data.

A more precise way to do this is to discretize your numeric features and one-hot encode all the features. Then, if a correlation exists between the target and only some values of the features, you will be able to capture it.
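Here is a rough sketch of that idea in plain Python. Everything specific in it is an assumption for illustration: the records are toy, made-up data, the age cut points in `age_bin` are arbitrary, and comparing the mean income per indicator column is just one simple way to spot feature values that go with high or low income (a regression on the one-hot columns would be another).

```python
from statistics import mean

# Toy, made-up records purely for illustration (income in thousands).
people = [
    {"age": 24, "gender": "woman", "sons": 2, "income": 90},
    {"age": 27, "gender": "woman", "sons": 2, "income": 95},
    {"age": 25, "gender": "man", "sons": 0, "income": 20},
    {"age": 29, "gender": "man", "sons": 1, "income": 25},
    {"age": 52, "gender": "woman", "sons": 1, "income": 50},
    {"age": 61, "gender": "man", "sons": 3, "income": 55},
]


def age_bin(age):
    """Discretize age; the cut points are arbitrary assumptions."""
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"


def active_indicators(person):
    """Return the one-hot columns that are 1 for this person."""
    return {
        f"age={age_bin(person['age'])}",
        f"gender={person['gender']}",
        f"sons={person['sons']}",
    }


rows = [(active_indicators(p), p["income"]) for p in people]
columns = sorted(set().union(*(ind for ind, _ in rows)))

# Dense one-hot matrix: one row per person, one 0/1 column per feature value.
matrix = [[1 if c in ind else 0 for c in columns] for ind, _ in rows]

# Mean income among the people for whom each indicator is active.
effect = {c: mean(inc for ind, inc in rows if c in ind) for c in columns}
for column, value in sorted(effect.items(), key=lambda kv: kv[1]):
    print(f"{column:15s} {value:6.1f}")
```

Columns whose mean income sits far from the overall mean are the feature values worth a closer look.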


What you are describing is called ordinal regression. The target variable (income) is divided into discrete groups where the relative ordering between different groups is preserved (low, middle, high).

Binning a continuous variable has the advantage of handling noisy data better.
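As a small illustration of the binning step, the sketch below splits income into three ordered groups using tercile cut points. The income values are toy numbers, and both the number of groups and the use of quantiles for the cut points are assumptions made for this example.

```python
from bisect import bisect_left
from statistics import quantiles

# Toy income values (in thousands), for illustration only.
incomes = [12, 18, 25, 31, 40, 47, 55, 68, 90]
level_names = ["low", "middle", "high"]

# Two cut points that split the data into three roughly equal groups.
cut_points = quantiles(incomes, n=3)

# Ordered level per observation: 0 = low, 1 = middle, 2 = high.
levels = [bisect_left(cut_points, x) for x in incomes]

for income, level in zip(incomes, levels):
    print(income, level_names[level])
```

The integer `levels` keep the ordering low < middle < high, which is what an ordinal model needs as its target.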
