Difference in statsmodels output vs direct linear algebra with a binary input variable
I was wondering why there might be a difference when I run a simple multiple linear regression with statsmodels OLS versus computing it directly with NumPy.
The results are identical in both cases as long as I don't include Sex (a binary variable) among the predictors. Why might that be, and which result should I prefer here? I also noticed that the statsmodels output lists Sex[T.1], whereas the other variables have nothing listed beside their names, which may be related: does statsmodels treat a binary variable specially?
I appreciate it!
Edit: here is the main part of the code:
```
import numpy as np
import pandas as pd

# Hand-built design matrix with an explicit constant column
X_s = a_new_training[['var1', 'var2']].astype(float)
X_s.insert(0, 'const', 1)
y_s = a_new_training[['y']].astype(float)

# Normal equations: beta = (X'X)^{-1} X'y
beta_estimated = np.linalg.inv(X_s.T @ X_s) @ X_s.T @ y_s
beta_estimated
```

```
           y
0  17.444400
1  -0.163070
2  -0.217814
```
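(Side note: explicitly inverting X'X can be numerically fragile; np.linalg.lstsq gives the same least-squares solution more stably. A sketch, using the same X_s and y_s as above:)

```
# Same least-squares solution via lstsq (avoids forming the explicit inverse)
beta_lstsq, *_ = np.linalg.lstsq(X_s.to_numpy(), y_s.to_numpy(), rcond=None)
beta_lstsq  # expected to match beta_estimated up to floating-point noise
```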
```
from statsmodels.formula.api import ols

res = ols('y ~ var1 + var2', a_new_training).fit()
res.summary()
```

```
               coef  std err    t  P>|t|  [0.025  0.975]
Intercept   17.4444      inf    0    nan     nan     nan
var1        -0.1631      inf   -0    nan     nan     nan
var2        -0.2178      inf   -0    nan     nan     nan
```
Both the above agree with each other.
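(As a further sanity check, fitting statsmodels directly on the explicit design matrix, i.e. the array interface with no formula, should reproduce the linear-algebra solution exactly, since no automatic coding is applied. A sketch:)

```
import statsmodels.api as sm

# Array interface: uses X_s as-is, no formula parsing or categorical coding
res_arr = sm.OLS(y_s, X_s).fit()
res_arr.params  # expected to match beta_estimated
```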
However:
```
# Same as before, but now including the binary Sex column directly
X_s = a_new_training[['var1', 'var2', 'Sex']].astype(float)
X_s.insert(0, 'const', 1)
y_s = a_new_training[['y']].astype(float)

beta_estimated = np.linalg.inv(X_s.T @ X_s) @ X_s.T @ y_s
beta_estimated
```

```
           y
0  12.906569
1  -0.019857
2  -0.760647
3   4.011057
```
```
res = ols('y ~ var1 + var2 + Sex', a_new_training).fit()
res.summary()
```

```
               coef  std err    t  P>|t|  [0.025  0.975]
Intercept    0.9352      inf    0    nan     nan     nan
Sex[T.1]     3.8787      inf    0    nan     nan     nan
var1         0.1353      inf    0    nan     nan     nan
var2        -0.7151      inf   -0    nan     nan     nan
```
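To see what the formula interface actually does with Sex, one can inspect the design matrix that patsy builds from the formula. If Sex has an object or categorical dtype, patsy treatment-codes it into a Sex[T.1] dummy (an indicator for level "1", with level "0" absorbed into the intercept). A sketch; the exact column names depend on the real dtypes in a_new_training:

```
from patsy import dmatrices

# Build the design matrix the ols() formula uses and inspect its columns
y_d, X_d = dmatrices('y ~ var1 + var2 + Sex', a_new_training,
                     return_type='dataframe')
X_d.columns
```

If X_d ends up with the same four columns as the hand-built X_s (up to naming), both fits should return identical coefficients.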
Tags: statsmodels, linear-regression, regression
Category: Data Science