Why are my logistic regression coefficients not proportional to the bad rate?

I am building a logistic regression model in Python with statsmodels.api.Logit. The model contains 12 features that are one-hot encoded with pandas.get_dummies(). My final training dataset (xTrain) looks like this:

feature1_A  feature1_B  feature2_B  feature2_C  feature2_D
         0           1           0           1           0
         0           1           1           0           0
         1           0           0           1           0

feature1 is a categorical feature with 3 modalities (categories) A, B, and C (C is used as the base reference, so it does not appear in my training set), and feature2 has 4 modalities (with A as the base reference).
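For context, here is a minimal sketch of how the encoding is done (the raw data below is made up for illustration; only the column names match my setup):

import pandas as pd

# made-up raw data covering every modality of the two features
df = pd.DataFrame({
    "feature1": ["B", "B", "A", "C", "A"],
    "feature2": ["C", "B", "C", "A", "D"],
})

# one-hot encode, then drop the base references (C for feature1, A for feature2)
dummies = pd.get_dummies(df, columns=["feature1", "feature2"])
dummies = dummies.drop(columns=["feature1_C", "feature2_A"])
print(dummies)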

Here is how I train the model and print the coefficients:

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# df_final is the dummy-encoded dataset and `target` the name of the 0/1 label column;
# the label is dropped from the features so it does not leak into the model
xTrain, xTest, yTrain, yTest = train_test_split(
    df_final.drop(columns=[target]), df_final[target],
    test_size=0.30, random_state=48379,
)

logit_model = sm.Logit(yTrain.astype(float), sm.add_constant(xTrain.astype(float)))
result = logit_model.fit()
print(result.summary2())
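Since the coefficients are on the log-odds scale, they can also be exponentiated and read as odds ratios relative to each base reference (continuing from the fitted result above):

import numpy as np

# exp() turns the log-odds coefficients into odds ratios
# relative to each feature's base reference
odds_ratios = np.exp(result.params)
print(odds_ratios)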

But the issue is that my coefficients are not proportional to the bad rates of my modalities:

For feature2, the coefficients increase with the bad rate*, as expected:

  • Modality A has a bad rate of 42% and a coefficient of 0 (base reference)
  • Modality B has a bad rate of 52% and a coefficient of 0.19
  • Modality C has a bad rate of 57% and a coefficient of 0.28
  • Modality D has a bad rate of 60% and a coefficient of 0.55

But for feature1, the coefficients do not make sense:

  • Modality A has a bad rate of 64% and a coefficient of -0.06
  • Modality B has a bad rate of 42% and a coefficient of -0.48
  • Modality C has a bad rate of 49% and a coefficient of 0 (base reference)

Notice that modality A has a lower coefficient than modality C, even though modality A has a higher bad rate. If I'm not mistaken, this means the model considers modality A less risky than modality C, which does not make sense because its bad rate is actually higher.
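As a sanity check: in a model with feature1 as the only predictor, the coefficients would equal the marginal log-odds differences relative to the base C. A quick sketch using the bad rates above shows how far off the fitted values are:

import numpy as np

# marginal bad rates of feature1 (from the list above)
bad_rate = {"A": 0.64, "B": 0.42, "C": 0.49}

# log-odds of each modality; with feature1 as the ONLY predictor,
# the fitted coefficients would equal these differences vs the base C
log_odds = {k: np.log(p / (1 - p)) for k, p in bad_rate.items()}
for k in ("A", "B"):
    print(k, round(log_odds[k] - log_odds["C"], 3))
# A 0.615   (fitted: -0.06)
# B -0.283  (fitted: -0.48)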

Is there a reason I would get such an inversion, i.e. the model coefficients not being proportional to the bad rate? Are there any ways to fix this?

*bad rate: the number of observations with target = 1 (bad) divided by the total number of observations.
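For completeness, this is how the per-modality bad rates were computed (sketch on made-up data; `bad` stands in for my target column):

import pandas as pd

# made-up raw (pre-dummy) data; `bad` is the 0/1 target
raw = pd.DataFrame({
    "feature1": ["A", "A", "B", "C", "C"],
    "bad":      [1,   1,   0,   1,   0],
})

# bad rate per modality = mean of the 0/1 target within each group
print(raw.groupby("feature1")["bad"].mean())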

