Difference in statsmodels output vs direct linear algebra with a binary input variable
I was wondering why there might be a difference when I run a simple multiple linear regression with statsmodels OLS versus computing it directly with NumPy.
The results are identical in both cases as long as I don't include Sex (a binary variable) among the predictors. Why might that be, and which result should I prefer here? I also noticed that the statsmodels output lists Sex[T.1], whereas the other variables have nothing listed beside their names, which may be related: does statsmodels treat a binary variable specially?
I appreciate it!
Edit: here is the main part of the code:
```
import numpy as np
import pandas as pd

# Hand-built design matrix with an explicit constant column
X_s = a_new_training[['var1', 'var2']].astype(float)
X_s.insert(0, 'const', 1)
y_s = a_new_training[['y']].astype(float)

# Normal equations: beta = (X'X)^{-1} X'y
beta_estimated = np.linalg.inv(X_s.T @ X_s) @ X_s.T @ y_s
beta_estimated
```

```
           y
0  17.444400
1  -0.163070
2  -0.217814
```
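(Side note: explicitly inverting X'X can be numerically fragile; np.linalg.lstsq gives the same least-squares solution more stably. A sketch, using the same X_s and y_s as above:)

```
# Same least-squares solution via lstsq (avoids forming the explicit inverse)
beta_lstsq, *_ = np.linalg.lstsq(X_s.to_numpy(), y_s.to_numpy(), rcond=None)
beta_lstsq  # expected to match beta_estimated up to floating-point noise
```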
```
from statsmodels.formula.api import ols

res = ols('y ~ var1 + var2', a_new_training).fit()
res.summary()
```

```
               coef  std err    t  P>|t|  [0.025  0.975]
Intercept   17.4444      inf    0    nan     nan     nan
var1        -0.1631      inf   -0    nan     nan     nan
var2        -0.2178      inf   -0    nan     nan     nan
```
Both the above agree with each other.
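(As a further sanity check, fitting statsmodels directly on the explicit design matrix, i.e. the array interface with no formula, should reproduce the linear-algebra solution exactly, since no automatic coding is applied. A sketch:)

```
import statsmodels.api as sm

# Array interface: uses X_s as-is, no formula parsing or categorical coding
res_arr = sm.OLS(y_s, X_s).fit()
res_arr.params  # expected to match beta_estimated
```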
However:
```
# Same as before, but now including the binary Sex column directly
X_s = a_new_training[['var1', 'var2', 'Sex']].astype(float)
X_s.insert(0, 'const', 1)
y_s = a_new_training[['y']].astype(float)

beta_estimated = np.linalg.inv(X_s.T @ X_s) @ X_s.T @ y_s
beta_estimated
```

```
           y
0  12.906569
1  -0.019857
2  -0.760647
3   4.011057
```
```
res = ols('y ~ var1 + var2 + Sex', a_new_training).fit()
res.summary()
```

```
               coef  std err    t  P>|t|  [0.025  0.975]
Intercept    0.9352      inf    0    nan     nan     nan
Sex[T.1]     3.8787      inf    0    nan     nan     nan
var1         0.1353      inf    0    nan     nan     nan
var2        -0.7151      inf   -0    nan     nan     nan
```
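To see what the formula interface actually does with Sex, one can inspect the design matrix that patsy builds from the formula. If Sex has an object or categorical dtype, patsy treatment-codes it into a Sex[T.1] dummy (an indicator for level "1", with level "0" absorbed into the intercept). A sketch; the exact column names depend on the real dtypes in a_new_training:

```
from patsy import dmatrices

# Build the design matrix the ols() formula uses and inspect its columns
y_d, X_d = dmatrices('y ~ var1 + var2 + Sex', a_new_training,
                     return_type='dataframe')
X_d.columns
```

If X_d ends up with the same four columns as the hand-built X_s (up to naming), both fits should return identical coefficients.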
Tags: statsmodels, linear-regression, regression
Category: Data Science