Influence of imbalanced features on prediction

I want to use XGBoost regression. The dataframe is conceptually similar to this table:


index    feature 1    feature 2    feature 3    encoded_1    encoded_2    encoded_3    y
0        0.213        0.542        0.125        0            0            1            0.432
1        0.495        0.114        0.234        1            0            0            0.775
2        0.521        0.323        0.887        1            0            0            0.691

My question is: what is the influence of having imbalanced observations of the encoded features? For example, if I have more rows with encoded_1 set than encoded_2 or encoded_3. Just to make it clear, I want to use regression, not classification.

If there is any material to read about it, please let me know.

Topics: imbalanced-data, xgboost, regression, python

Category: Data Science


As Erwan said, the imbalanced-dataset problem concerns the target variable, not the features.

But if your model favors one region of the regression target, you can study the distribution of the target variable and then, depending on its shape, apply a transformation (e.g. square root, log, or exp) to get a more uniform output.
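
Below is a minimal sketch of that idea; the synthetic data and hyperparameters are assumptions for illustration, not taken from the question. It fits on a log-transformed target and inverts the transform at prediction time:

    import numpy as np
    import pandas as pd
    from xgboost import XGBRegressor

    # Hypothetical right-skewed regression target on synthetic features.
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.random((500, 3)), columns=["f1", "f2", "f3"])
    y = np.exp(2 * X["f1"] + rng.normal(0, 0.2, 500))

    # Fit on log(1 + y) so the model sees a more symmetric target.
    model = XGBRegressor(n_estimators=200, max_depth=4)
    model.fit(X, np.log1p(y))

    # Invert the transform when predicting.
    preds = np.expm1(model.predict(X))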

Also, underfitting can be mistaken for a consequence of feature imbalance when the real issue is how representative your features are. You can add new features, or transformed versions of your current features, to capture non-linearity in your data, as in the sketch below.
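
For example (the column names are hypothetical, mirroring the question's table):

    import numpy as np
    import pandas as pd

    # Hypothetical frame with the question's continuous columns.
    df = pd.DataFrame({"feature 1": [0.213, 0.495, 0.521],
                       "feature 2": [0.542, 0.114, 0.323]})

    # Transformed copies of existing features can expose non-linear structure.
    df["feature 1 sq"] = df["feature 1"] ** 2               # squared term
    df["feature 1 log"] = np.log1p(df["feature 1"])         # log(1 + x)
    df["f1 x f2"] = df["feature 1"] * df["feature 2"]       # interaction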


It doesn't matter; it's just what the data is.

I assume you're thinking about issues related to an "imbalanced dataset", but this term refers only to imbalance in the values of the target variable (it's most commonly used for classification, but technically it's relevant to regression as well).

Features don't need to be balanced in any way, they just need to be good indicators for the target variable.
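
As a rough illustration (entirely synthetic data, assumed hyperparameters), a rare one-hot column can still be a strong predictor, and you can check how much the model relies on it through its feature importances:

    import numpy as np
    import pandas as pd
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    n = 1000
    X = pd.DataFrame({
        "feature_1": rng.random(n),
        "encoded_1": (rng.random(n) < 0.9).astype(int),  # common category
    })
    X["encoded_2"] = 1 - X["encoded_1"]                  # rare category (~10%)

    # The rare category shifts the target, so it is a good indicator.
    y = X["feature_1"] + 0.5 * X["encoded_2"] + rng.normal(0, 0.05, n)

    model = XGBRegressor(n_estimators=100, max_depth=3).fit(X, y)
    print(dict(zip(X.columns, model.feature_importances_)))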
