What to do when one feature has very large importance/weight?

Question

What to do when one feature has very large importance/weight?

Daria

June 3, 2022, 07:27

I am new to data science and am currently trying to predict customer churn for a company that offers subscription-based booking management software. Its customers are gyms. I have a small, imbalanced historical dataset (False: 670, True: 230) with two numerical predictors, age (days since subscription) and number of active days in the last month (days on which a customer (gym) had bookings), and one categorical predictor, logo (boolean: whether the customer uploaded a logo in the software).

The predictors are negatively correlated with churn, with the following magnitudes:

logo: 0.65
num_active_days_last_month: 0.40
age: 0.3

Feature importances look similar, with logo having the most weight.

When I predict, the model (logistic regression) classifies customers without a logo as churners, even though they are quite active.

For example, the following two customers have almost the same probability of churning:

Customer 1:

logo: True
num_active_days_last_month: 1
age: 30 days

Customer 2:

logo: False
num_active_days_last_month: 22
age: 250 days
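
To illustrate how a logistic regression can score these two customers almost identically, here is a minimal sketch. The coefficients are hypothetical, chosen only for illustration, and are not the values fitted by the model in the question. The point is that a single large coefficient on a boolean feature can offset many units of the numerical features on the log-odds scale.

```python
import math

# Hypothetical coefficients on the log-odds scale, chosen only for illustration.
# They are NOT the values fitted by the model described in the question.
COEF = {
    "intercept": 1.0,
    "logo": -2.5,                        # one-off shift when a logo is present
    "num_active_days_last_month": -0.1,  # per active day
    "age": -0.002,                       # per day since subscription
}


def churn_probability(customer, coef=COEF):
    """Return P(churn) from a logistic model: sigmoid(intercept + sum(coef * value))."""
    log_odds = coef["intercept"] + sum(
        coef[name] * value for name, value in customer.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


customer_1 = {"logo": 1, "num_active_days_last_month": 1, "age": 30}
customer_2 = {"logo": 0, "num_active_days_last_month": 22, "age": 250}

p1 = churn_probability(customer_1)
p2 = churn_probability(customer_2)
print(f"customer 1: {p1:.3f}, customer 2: {p2:.3f}")
```

With these made-up numbers, the logo term alone (-2.5) is roughly as large as the combined effect of 21 extra active days and 220 extra days of age (-2.54), so both customers end up with nearly the same probability.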

I understand that this is what the model learned from the dataset, but it just doesn't make sense to me to have such strong importance assigned to something like logo. Is there a way to avoid excluding logo from the predictors completely, perhaps by somehow decreasing its importance?

Thank you in advance for any help or suggestions I can get.

Topic data-science-model churn logistic-regression classification

Category Data Science

Nicolas Martin · Accepted Answer · June 3, 2022, 07:27

I don't understand why the logo is taken into account in your algorithm.

Generally speaking, a model should only use variables that make sense, either because their relevance is very obvious (which seems to be your case) or because you found no correlation between them and the other data (through a correlation analysis).

My suggestion is to remove logo from your model first. The two remaining variables might then not be enough to make predictions with a data science algorithm. Perhaps the number of active days in the last month is enough on its own?
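
One way to check this is to fit the same model on different feature subsets and compare the loss on held-out rows. The following is a minimal sketch, not a definitive implementation: the data is synthetic and invented for illustration (churn is generated from activity and age, and logo is only a proxy for activity), the helper names are hypothetical, and the logistic regression is a plain gradient-descent version written with the standard library only. In practice a library such as scikit-learn would normally be used instead.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_synthetic_customers(n=900, seed=0):
    # Invented data: churn is driven by activity and age; logo is only a proxy for activity.
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        active = rng.randint(0, 30)
        age = rng.randint(1, 720)
        logo = 1.0 if rng.random() < 0.2 + 0.6 * active / 30 else 0.0
        p_churn = sigmoid(1.0 - 0.12 * active - 0.003 * age)
        rows.append([logo, float(active), float(age)])
        labels.append(1.0 if rng.random() < p_churn else 0.0)
    return rows, labels


def standardize(train, test):
    # Scale both splits with statistics computed on the training split only.
    cols = list(zip(*train))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

    return scale(train), scale(test)


def predict(w, x):
    # w[0] is the intercept, w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def fit_logreg(rows, labels, lr=0.1, epochs=300):
    # Plain batch gradient descent on the mean log loss.
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def log_loss(w, rows, labels):
    eps = 1e-12
    total = 0.0
    for x, y in zip(rows, labels):
        p = min(max(predict(w, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(rows)


def compare(feature_sets, n_train=700):
    # Train on the first n_train rows, evaluate on the rest, once per feature subset.
    rows, labels = make_synthetic_customers()
    results = {}
    for name, cols in feature_sets.items():
        picked = [[r[c] for c in cols] for r in rows]
        train, test = standardize(picked[:n_train], picked[n_train:])
        w = fit_logreg(train, labels[:n_train])
        results[name] = log_loss(w, test, labels[n_train:])
    return results


# Column order: 0 = logo, 1 = num_active_days_last_month, 2 = age
results = compare({
    "with_logo": [0, 1, 2],
    "without_logo": [1, 2],
    "active_days_only": [1],
})
for name, loss in results.items():
    print(f"{name}: held-out log loss = {loss:.3f}")
```

Lower held-out log loss is better. If the loss barely changes when logo is dropped, removing it costs little; if it rises noticeably, logo carries information that the other two variables do not.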

Of course, customers who have a high age and were active in the last month are less likely to churn.

What could be interesting in your case is predicting when a customer is most likely to churn, using a model that recognizes time series patterns.

However, I'm afraid there are not enough variables to reach interesting results, nor enough data, because 1000 rows may not cover most scenarios and statistical sets.
