How to deal with missing values that are supposed to be missing?

Question

IcarusX

April 19, 2022, 22:16

I am trying to predict loan defaults with a moderately sized dataset. I will probably be using logistic regression and random forest.

I have around 35 variables and one of them classifies the type of the client: company or authorized individual.

The problem is that, for authorized individuals, some variables (such as turnover, assets, and liabilities) are missing, because an authorized individual should not have this stuff. Only a company can have turnover, assets, etc.

What do I do in this case? I cannot impute the missing values, but I also can't leave them empty. The dataset is about 80% companies and 20% authorized individuals. If I can't impute that data, should I just drop the rows for authorized individuals altogether? Or is there a more sophisticated method to make machine learning techniques (logistic regression and random forests) somehow ignore the empty values?

Topic missing-data decision-trees logistic-regression

Category Data Science

ralph · Accepted Answer · April 19, 2022, 22:16

Do not ignore missing values. In your case, they carry important information. Consider (1) binning the numeric variables, including a separate bin for 'missing', or (2) imputing the missing values with 0 and adding a dummy variable that flags when the variable is 'missing'.

Option (1) loses some information, but it is the most common approach and the easiest to interpret. Option (2) loses less information, but the imputed zeros can lead to bias. I would go with (1).
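For what it's worth, here is a rough sketch of both options in pandas. The column names (`client_type`, `turnover`), the toy values, the two quantile bins, and the helper function names are all made up for illustration; nothing here comes from the asker's actual data.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names; turnover is absent for individuals.
df = pd.DataFrame({
    "client_type": ["company", "company", "individual",
                    "company", "individual", "company"],
    "turnover": [120.0, 950.0, np.nan, 40.0, np.nan, 400.0],
})


def bin_with_missing(values, q=2, labels=("low", "high")):
    """Option (1): quantile-bin a numeric column and give NaN its own bin."""
    binned = pd.qcut(values, q=q, labels=list(labels))  # NaN stays NaN here
    return binned.cat.add_categories("missing").fillna("missing")


def zero_fill_with_flag(frame, column):
    """Option (2): fill NaN with 0 and add a 0/1 '<column>_missing' flag."""
    out = frame.copy()
    out[column + "_missing"] = out[column].isna().astype(int)
    out[column] = out[column].fillna(0.0)
    return out


# Option (1): one-hot encode the bins so either model can consume them.
turnover_bins = bin_with_missing(df["turnover"])
bin_dummies = pd.get_dummies(turnover_bins, prefix="turnover")

# Option (2): the original frame is left untouched; a modified copy is returned.
df_flagged = zero_fill_with_flag(df, "turnover")
```

One thing to watch if values are missing exactly when the client is an individual, as described in the question: the missing flag (or the 'missing' bin) then duplicates the client-type variable, so for logistic regression you would likely keep only one of the two to avoid perfect collinearity.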
