Retrieve dropped column names from `sklearn.impute.SimpleImputer`

The SimpleImputer class takes pandas DataFrames and returns unlabeled numpy arrays. This means that SimpleImputer can drop some features at will but has no way to communicate to the caller which features were dropped.

I've been trying to come up with a workaround, but everything I've tried is extremely hackish and unreliable. Is there something I'm missing?
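For reference, here is a minimal sketch of the behavior I mean (the toy frame is made up; column "b" is entirely missing and gets dropped silently):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# column "b" has no observed values at all
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan] * 3})

out = SimpleImputer(strategy="mean").fit_transform(df)
print(type(out), out.shape)  # a plain ndarray with shape (3, 1): "b" is gone
```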

Topic data-imputation scikit-learn

Category Data Science


In my experience, it didn't drop any columns; it replaced the actual column names with the default integer labels 0, 1, 2, ...

To put the actual column names back after imputation:

 import numpy as np
 import pandas as pd
 from sklearn.impute import SimpleImputer

 imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
 imp.fit(df)
 df = pd.DataFrame(imp.transform(df), columns=df.columns)
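Note that reattaching `df.columns` like this breaks if a column was dropped. One way to recover the surviving names is to inspect the fitted `statistics_` attribute: for numeric strategies such as 'mean', columns that were skipped because they are all-missing have NaN there. A sketch under that assumption (the example frame is mine):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(df)

# statistics_ has one entry per *input* column; all-missing columns get NaN
kept = df.columns[~np.isnan(imp.statistics_)]
df_imputed = pd.DataFrame(imp.transform(df), columns=kept)
print(list(df_imputed.columns))  # ['a']
```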

SimpleImputer drops columns consisting entirely of missing values. This is indeed unpleasant when you are trying to map the output back to the original columns; the sklearn developers have been discussing it:

https://github.com/scikit-learn/scikit-learn/issues/16426

Vincent's answer is good if you are working directly: just detect and remove the offending all-missing columns, since they don't contribute anything to your model. If you need something more automatic (e.g., you have a mostly-missing column that in cross-validation leads to some training folds being all-missing), then perhaps use a ColumnTransformer whose columns argument is a callable that checks for all-missing columns. Then you can use the ColumnTransformer's get_feature_names_out method (get_feature_names in older versions) to find out when/if a column was removed.


I ran into the same issue today, and it's a shame your post got no answers. I think this question is not well addressed in the sklearn documentation. Here is my workaround:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

headers = X.columns.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

empty_train_columns = []
for col in X_train.columns.values:
    # all the values for this feature are null in the training set
    if X_train[col].isnull().sum() == X_train.shape[0]:
        empty_train_columns.append(col)
print(empty_train_columns)

The idea is to keep all your column names and, after you split your data, check which of them are completely empty in your training set. If I'm not wrong, the imputer preserves column order, so, for example, you can match every feature with its importance if you are using decision-tree-based models.

I'm not satisfied with this ugly piece of code but I couldn't find a more elegant (and simple) solution.
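For what it's worth, the same check can be written in one line with pandas (equivalent logic; the toy frame is just for illustration):

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan]})

# columns where every training value is null
empty_train_columns = X_train.columns[X_train.isnull().all()].tolist()
print(empty_train_columns)  # ['b']
```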
