Does statsmodels fully support MultiIndex?
The code snippet below shows that statsmodels appears to flatten MultiIndex column tuples by joining their levels with an underscore (_).
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS
K = 2
N = 10
ERROR_VOL = 1
np.random.seed(0)
X = np.random.rand(N, K)
coefs = np.linspace(0.1, 1, K)
noise = np.random.rand(N)
y = X @ coefs + noise * ERROR_VOL
index_ = pd.MultiIndex.from_tuples([('some_var','feature_0'), ('some_var','feature_1')])
df = pd.DataFrame(X, columns=index_)
ols_fit = OLS(y, df, hasconst=False).fit()
print(ols_fit.params)
The result is:
some_var_feature_0 0.230474
some_var_feature_1 1.646789
dtype: float64
Because of this flattening, the following operation, and similar ones that rely on label matching, fail:
params_stdzd = ols_fit.params * df.std()
ValueError: cannot join with no level specified and no overlapping names
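To isolate the cause, here is a minimal pandas-only sketch of the same mismatch. The values in col_std are made-up stand-ins for df.std(), and flat_params simply copies the labels and values printed above; neither comes from rerunning the model.

```python
import pandas as pd

# MultiIndex labels, like the columns of df in the question.
cols = pd.MultiIndex.from_tuples(
    [("some_var", "feature_0"), ("some_var", "feature_1")]
)
col_std = pd.Series([0.5, 2.0], index=cols)  # made-up stand-in for df.std()

# Flat string labels, like the index of ols_fit.params shown above.
flat_params = pd.Series(
    [0.230474, 1.646789],
    index=["some_var_feature_0", "some_var_feature_1"],
)

# The two indexes have different depths and share no labels.
print(flat_params.index.nlevels, col_std.index.nlevels)  # 1 2
shared = set(flat_params.index) & set(col_std.index)
print(shared)  # set()

# Label alignment therefore has nothing to match on.
try:
    flat_params * col_std
    alignment_failed = False
except ValueError as exc:
    alignment_failed = True
    print(f"ValueError: {exc}")
```

The exact wording of the error message may differ between pandas versions, but the underlying issue is the same: one operand is labelled with strings and the other with tuples.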
Questions
- Is there a way to get statsmodels to respect a pandas MultiIndex rather than flatten it?
If not:
- Is there a way to set the flattening character to something other than an underscore?
- Can I rely on OLS.params respecting the order of df.columns? If so, I could simply assign df.columns as the index of OLS.params to get a properly indexed params Series.
- Are there better ways to get MultiIndex interoperability with statsmodels?
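For reference, the relabelling workaround I have in mind would look roughly like the sketch below. relabel_params is a hypothetical helper of my own, not part of statsmodels or pandas. It assumes the params come back in column order and that the flattening joins levels with "_", as observed above, and it raises if the labels do not line up position by position. The col_std values are made-up stand-ins for df.std().

```python
import pandas as pd


def relabel_params(
    params: pd.Series, columns: pd.MultiIndex, sep: str = "_"
) -> pd.Series:
    """Return params re-labelled with `columns` (hypothetical helper).

    Assumes params are in the same order as `columns`; the check below
    raises if the flattened labels do not match position by position.
    """
    expected = [sep.join(map(str, col)) for col in columns]
    if list(params.index) != expected:
        raise ValueError(
            f"label mismatch: {list(params.index)} vs {expected}"
        )
    # Keep the values, swap only the labels.
    return pd.Series(params.to_numpy(), index=columns, name=params.name)


cols = pd.MultiIndex.from_tuples(
    [("some_var", "feature_0"), ("some_var", "feature_1")]
)
# Flat-labelled params, copied from the output printed in the question.
flat = pd.Series(
    [0.230474, 1.646789],
    index=["some_var_feature_0", "some_var_feature_1"],
)

params_mi = relabel_params(flat, cols)

col_std = pd.Series([0.5, 2.0], index=cols)  # made-up stand-in for df.std()
params_stdzd = params_mi * col_std  # now aligns on the MultiIndex
print(params_stdzd)
```

With the real model this would be called as relabel_params(ols_fit.params, df.columns). The position-by-position check is there so that a reordering, or a different separator, fails loudly instead of silently mislabelling coefficients.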