Why are coefficients from logistic regression not proportional to the bad rate?
I am building a logistic regression model in Python with `statsmodels.api.Logit`. The model contains 12 features that are encoded using `pandas.get_dummies()`.
My final training dataset (`xTrain`) looks like this:
feature1_A | feature1_B | feature2_B | feature2_C | feature2_D |
---|---|---|---|---|
0 | 1 | 0 | 1 | 0 |
0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 1 | 0 |
feature1 is a categorical feature that contains 3 modalities (or categories) A, B, and C (C is used as a base reference so it does not appear in my training set) and feature2 contains 4 modalities (and A is the base reference).
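A minimal sketch of how such an encoding with an explicit base reference could be produced. The toy values and the use of `pd.Categorical` to fix the level order are assumptions for illustration, not my actual preprocessing; only the column names follow the table above.

```python
import pandas as pd

# Hypothetical toy data; the real values are not shown in the question.
df = pd.DataFrame({
    "feature1": ["B", "B", "A", "C", "A"],
    "feature2": ["C", "B", "C", "A", "D"],
})

# Put the intended reference level first in the category order,
# so that drop_first=True drops exactly that level.
df["feature1"] = pd.Categorical(df["feature1"], categories=["C", "A", "B"])
df["feature2"] = pd.Categorical(df["feature2"], categories=["A", "B", "C", "D"])

encoded = pd.get_dummies(
    df, columns=["feature1", "feature2"], drop_first=True, dtype=float
)
print(encoded.columns.tolist())
```

With this ordering, a row whose dummies for a feature are all 0 belongs to the reference modality (C for feature1, A for feature2).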
After the training I print the coefficients in my model:
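```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Exclude the target column from the predictors; df_final[target] is used as y,
# so df_final would otherwise leak the target into xTrain.
X = df_final.drop(columns=[target])
y = df_final[target]
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size=0.30, random_state=48379)

# statsmodels does not add an intercept by default, hence add_constant.
logit_model = sm.Logit(yTrain.astype(float), sm.add_constant(xTrain.astype(float)))
result = logit_model.fit()
print(result.summary2())
```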
But the issue is that my coefficients are not proportional to the bad rate of my modalities:
For `feature2`, the coefficients increase with the bad rate*:
- Modality A has a bad rate of 42% and a model coeff of 0
- Modality B has a bad rate of 52% and a model coeff of 0.19
- Modality C has a bad rate of 57% and a model coeff of 0.28
- Modality D has a bad rate of 60% and a model coeff of 0.55
But for `feature1`, the coefficients do not make sense:
- Modality A has a bad rate of 64% and a model coeff of -0.06
- Modality B has a bad rate of 42% and a model coeff of -0.48
- Modality C has a bad rate of 49% and a model coeff of 0
Here you will notice that modality A has a lower coefficient than modality C, even though modality A has a higher bad rate. If I'm not mistaken, this means that the model predicts modality A to be less risky than modality C, which does not make sense because its bad rate is actually higher.
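To make the comparison concrete, here is a small sketch that converts the bad rates above into the coefficients a model with feature1 as its only predictor would produce (the log-odds of each modality minus the log-odds of the reference C), next to the coefficients my 12-feature model reports. The helper `logit` and the dictionary names are just for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

# Bad rates for feature1 as listed above; C is the reference modality.
bad_rate = {"A": 0.64, "B": 0.42, "C": 0.49}

# In a model with feature1 as the only predictor, each dummy coefficient
# equals the log-odds of that modality minus the log-odds of the reference.
univariate_coef = {m: logit(p) - logit(bad_rate["C"]) for m, p in bad_rate.items()}

# Coefficients reported by the 12-feature model, for comparison.
fitted_coef = {"A": -0.06, "B": -0.48, "C": 0.0}

for m in bad_rate:
    print(f"{m}: bad-rate-implied {univariate_coef[m]:+.2f}, fitted {fitted_coef[m]:+.2f}")
```

The fitted values are estimated with the other 11 features in the model, whereas the bad rates are computed on feature1 alone, so the two columns are not measuring exactly the same thing.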
Is there a reason I would get such an inversion, i.e. model coefficients that are not proportional to the bad rate? Are there any ways to fix this?
*bad rate: number of observations with target = 1 (bad) divided by the total number of observations
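For reference, a minimal sketch of how the bad rate per modality can be computed from the data before dummy encoding. The frame `raw` and the column names `feature1` and `target` are hypothetical stand-ins, and the values are toy data.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
raw = pd.DataFrame({
    "feature1": ["A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "target":   [1,   0,   0,   1,   0,   1,   1,   0,   0],
})

# The mean of a 0/1 target within each modality is its bad rate;
# the count shows how many observations the rate is based on.
bad_rate = raw.groupby("feature1")["target"].agg(bad_rate="mean", n="count")
print(bad_rate)
```

Keeping the count next to the rate helps spot modalities whose bad rate rests on very few observations.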