How to deal with missing values that are supposed to be missing?

I am trying to predict loan defaults with a moderately sized dataset. I will probably be using logistic regression and random forest.

I have around 35 variables and one of them classifies the type of the client: company or authorized individual.

The problem is that, for authorized individuals, some variables (such as turnover, assets, liabilities, etc.) are missing, because an authorized individual is not supposed to have them. Only a company can have turnover, assets, etc.

What do I do in this case? I cannot impute the missing values, but I also can't leave them empty. The dataset is about 80% companies and 20% authorized individuals. If I can't impute that data, should I just drop the rows for authorized individuals altogether? Or is there a more sophisticated method to make machine learning techniques (logistic regression and random forests) somehow ignore the empty values?

Topic missing-data decision-trees logistic-regression

Category Data Science


Do not ignore missing values. In your case, they carry important information. Consider (1) binning the numeric variables, including a separate bin for 'missing', or (2) imputing the missing values with 0 and introducing a dummy variable that flags when the variable is 'missing'.

Option (1) loses some information, because exact values are collapsed into bins, but it is the most common approach and the easiest to interpret. Option (2) reduces information loss, but can lead to bias. I would go with (1).
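
As a rough illustration of both options, here is a minimal sketch in pandas. The toy data, the column names (`client_type`, `turnover`) and the bin edges are all made up for the example and are not from the question; in practice you would pick bin edges from your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: authorized individuals have no turnover, so it is NaN by design.
df = pd.DataFrame({
    "client_type": ["company", "company", "individual", "company", "individual"],
    "turnover": [120.0, 950.0, np.nan, 40.0, np.nan],
})

# Option (1): bin the numeric variable and give missing values their own bin.
# The bin edges below are arbitrary and only serve the example.
bins = pd.cut(df["turnover"], bins=[0, 100, 500, np.inf],
              labels=["low", "medium", "high"])
df["turnover_bin"] = bins.cat.add_categories("missing").fillna("missing")
# One-hot encode the bins so that a model sees 'missing' as its own level.
binned = pd.get_dummies(df["turnover_bin"], prefix="turnover")

# Option (2): flag missingness with a dummy variable, then fill the gaps with 0.
df["turnover_is_missing"] = df["turnover"].isna().astype(int)
df["turnover_filled"] = df["turnover"].fillna(0.0)

print(binned)
print(df[["turnover_filled", "turnover_is_missing"]])
```

Note that if a variable is missing only for authorized individuals, the dummy in option (2) is identical to the client-type indicator, so for logistic regression you would keep only one of the two.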
