Retrieve dropped column names from `sklearn.impute.SimpleImputer`

The SimpleImputer class takes pandas DataFrames and returns unlabeled numpy arrays. This means that SimpleImputer can drop some features at will but has no way to communicate to the caller which features were dropped.

I've been trying to come up with a workaround, but everything I've tried is extremely hackish and unreliable. Is there something I'm missing?
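For reference, here is a minimal sketch of the behavior I mean (the toy frame is made up; column "b" is entirely missing and gets dropped silently):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# column "b" has no observed values at all
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan] * 3})

out = SimpleImputer(strategy="mean").fit_transform(df)
print(type(out), out.shape)  # a plain ndarray with shape (3, 1): "b" is gone
```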

Topic data-imputation scikit-learn

Category Data Science


In my experience, it didn't drop any columns; it replaced the actual column names with the default integer labels 0, 1, 2, ...

To put the actual column names back after imputation:

 import numpy as np
 import pandas as pd
 from sklearn.impute import SimpleImputer

 imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
 imp.fit(df)
 df = pd.DataFrame(imp.transform(df), columns=df.columns)
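Note that reattaching `df.columns` like this breaks if a column was dropped. One way to recover the surviving names is to inspect the fitted `statistics_` attribute: for numeric strategies such as 'mean', columns that were skipped because they are all-missing have NaN there. A sketch under that assumption (the example frame is mine):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(df)

# statistics_ has one entry per *input* column; all-missing columns get NaN
kept = df.columns[~np.isnan(imp.statistics_)]
df_imputed = pd.DataFrame(imp.transform(df), columns=kept)
print(list(df_imputed.columns))  # ['a']
```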

SimpleImputer drops columns consisting entirely of missing values. This is indeed unpleasant when you are trying to map the output back to the original columns; the sklearn developers have been discussing it:

https://github.com/scikit-learn/scikit-learn/issues/16426

Vincent's answer is good if you are working directly: just detect and remove the offending all-missing columns, since they don't contribute anything to your model. If you need something more automatic (e.g., you have a mostly-missing column that in cross-validation leads to some training folds being all-missing), then perhaps use a ColumnTransformer whose columns argument is a callable that checks for all-missing columns. Then you can use the ColumnTransformer's get_feature_names_out method (get_feature_names in older versions) to find out when/if a column was removed.


I ran into the same issue today, and it's a shame your post got no answers. I think this question is not well addressed in the sklearn documentation. Here is my workaround:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

headers = X.columns.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

empty_train_columns = []
for col in X_train.columns.values:
    # all the values for this feature are null in the training set
    if X_train[col].isnull().sum() == X_train.shape[0]:
        empty_train_columns.append(col)
print(empty_train_columns)

The idea is to keep all your column names and, after you split your data, check which of them are completely empty in your training set. If I'm not wrong, the imputer preserves column order, so, for example, you can match every feature with its importance if you are using decision-tree-based models.

I'm not satisfied with this ugly piece of code but I couldn't find a more elegant (and simple) solution.
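For what it's worth, the same check can be written in one line with pandas (equivalent logic; the toy frame is just for illustration):

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan]})

# columns where every training value is null
empty_train_columns = X_train.columns[X_train.isnull().all()].tolist()
print(empty_train_columns)  # ['b']
```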
