Why do I get different coefficients from logistic regression in Python and SPSS?

I am a bit confused about the model coefficients calculated by SPSS and by sklearn's LogisticRegression.

I am getting different coefficients and intercepts from the two methods.

In Python, I am running the following code:

import numpy as np
from sklearn.linear_model import LogisticRegression

vals = [0.01, 0.04, 0.07, 0.08, 0.08, 0.09, 0.10, 0.15, 0.20,
        1.85, 1.93, 1.97, 2.02, 2.09, 2.12, 2.13, 2.21, 2.25]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

X = np.array(vals).reshape(-1, 1)
y = np.array(labels)

solvers = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']

# Fit the same data with each solver and compare the estimates
for solver in solvers:
    model = LogisticRegression(solver=solver, random_state=0).fit(X, y)
    print(f'intercept: {round(model.intercept_[0], 3)} coefficient: {round(model.coef_[0][0], 3)} ... solver: {solver}')

which outputs:

 intercept: -1.012 coefficient: 1.239 ... solver: liblinear
 intercept: -1.609 coefficient: 1.498 ... solver: newton-cg
 intercept: -1.609 coefficient: 1.498 ... solver: lbfgs
 intercept: -1.609 coefficient: 1.496 ... solver: sag
 intercept: -1.609 coefficient: 1.498 ... solver: saga

(In reality, my dataset is much larger of course)

In SPSS, if I use Analyze > Regression > Binary Logistic at the default settings, I get a different coefficient and intercept.

Is there a transformation of the coefficients that is done by sklearn but not by SPSS, or is the logistic regression method used by sklearn really that different from the one in SPSS? When I calculate the log-odds and the predicted probabilities from the equations derived in Python and in SPSS, I get quite similar results, but they are still slightly different.

Does anyone know why this difference occurs or whether there is a common method for the two that would yield similar results?

Thank you in advance.

Tags: spss, logistic-regression, scikit-learn, python

Category: Data Science


The LogisticRegression class from sklearn applies an L2 penalty by default as a regularization mechanism, which SPSS's binary logistic regression does not use. This has been a point of discussion before, and the main reason is that sklearn is focused on prediction and getting the best prediction results (where regularization is used to prevent overfitting), rather than on fitting the model to the data and finding which factors affect the outcome. The regularization also explains why liblinear disagrees with the other solvers in your output: liblinear penalizes the intercept term as well, while the other solvers do not. You can try disabling the penalty to see if that gives you the same results as SPSS; in recent scikit-learn versions this is penalty=None, while older versions accept the string 'none'.
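As a minimal sketch using the data from your question (note that penalty=None requires scikit-learn 1.2 or newer, while older versions expect penalty='none'):

import numpy as np
from sklearn.linear_model import LogisticRegression

vals = [0.01, 0.04, 0.07, 0.08, 0.08, 0.09, 0.10, 0.15, 0.20,
        1.85, 1.93, 1.97, 2.02, 2.09, 2.12, 2.13, 2.21, 2.25]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

X = np.array(vals).reshape(-1, 1)
y = np.array(labels)

# Disable the L2 penalty so the fit is a plain maximum-likelihood
# logistic regression, which is what SPSS computes by default.
# scikit-learn >= 1.2: penalty=None; older versions: penalty='none'.
model = LogisticRegression(penalty=None, solver='lbfgs').fit(X, y)
print(f'intercept: {model.intercept_[0]:.3f} coefficient: {model.coef_[0][0]:.3f}')

An equivalent workaround on versions where neither spelling is accepted is to keep the penalty but make it negligible by setting C to a very large value (e.g. C=1e9), since C is the inverse of the regularization strength.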
