How to deal with missing values that are supposed to be missing?
I am trying to predict loan defaults with a moderately sized dataset. I will probably use logistic regression and random forest.
I have around 35 variables and one of them classifies the type of the client: company or authorized individual.
The problem is that, for authorized individuals, some variables (such as turnover, assets, liabilities, etc.) are missing, because these quantities do not apply to an authorized individual. Only a company can have turnover, assets, etc.
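To make the structure concrete, here is a minimal sketch of what the data looks like. The column names (`client_type`, `turnover`, `assets`, `liabilities`, `default`) and all values are invented for illustration and are not taken from my real dataset; the point is only that the missingness is fully determined by the client type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are made up for illustration only.
df = pd.DataFrame({
    "client_type": ["company", "company", "individual", "company", "individual"],
    "turnover":    [1.2e6, 3.4e5, np.nan, 8.9e5, np.nan],
    "assets":      [5.0e5, 1.1e5, np.nan, 2.3e5, np.nan],
    "liabilities": [2.0e5, 9.0e4, np.nan, 1.5e5, np.nan],
    "default":     [0, 1, 0, 0, 1],
})

# Variables that only make sense for companies.
company_only = ["turnover", "assets", "liabilities"]

# Share of missing values per client type for each company-only variable.
missing_by_type = df[company_only].isna().groupby(df["client_type"]).mean()
print(missing_by_type)
```

In this toy example the company-only columns are never missing for companies and always missing for authorized individuals, which is the pattern I see in the real data.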
What should I do in this case? Imputing these values makes no sense, since they are not unobserved but simply do not apply, yet I also can't leave them empty. About 80% of the rows in the dataset are companies and 20% are authorized individuals. If I can't impute that data, should I just drop the rows for authorized individuals altogether? Or is there a more sophisticated way to make machine learning techniques (logistic regression and random forests) ignore the empty values?
Topic missing-data decision-trees logistic-regression
Category Data Science