sklearn predict: IndexingError: ('Too many indexers', 'occurred at index <name>')

My goal is for the output to contain all of use_cols while the model is built only on categorical_features. The output will then be used to predict and to compare the prediction 'REVIEW_ACTION' against the actual 'REVIEWER_ACTION'. Ignoring for the moment why this is A BAD THING TO DO, can we focus on how to achieve this?

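from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

use_cols = ['FIRST_NAME', 'LAST_NAME', 'PERSON_STATUS', 'DIVISION_NAME',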
            'PERSON_TYPE', 'JOB_CHANGE', 'JOB_TRANSFER', 'IDENTIFY_DATE',
            'SSO', 'USER_ID', 'ASSET_ID', 'ROLE', 'HPA', 
            'LAST_LOGIN_DATE', 'ASSET_NAME', 'AUDIT_ID', 
            'REVIEWER_ACTION', 'USER ID', 'LABELED_ROLE', 'REVIEW_ACTION']
prediction_col = ['REVIEW_ACTION']

categorical_features = ['DIVISION_NAME', 'PERSON_TYPE', 'JOB_CHANGE', 'JOB_TRANSFER',
                       'SSO', 'USER_ID', 'ASSET_ID', 'ROLE', 'HPA', 'LAST_LOGIN_DATE',
                       'USER ID', 'LABELED_ROLE']

categorical_transformer = Pipeline(steps=[
    ('si', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

# remainder is a ColumnTransformer argument; Pipeline does not accept it
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)],
    remainder='passthrough')

rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('rfc', RandomForestClassifier(n_estimators=100))
                     ])

I then fit the data and try to run the prediction:

df['rf_prediction'] = df[categorical_features].apply(rf.predict)

and get the error:

IndexingError: ('Too many indexers', 'occurred at index DIVISION_NAME') 

I suspect this has something to do with the columns being passed through, but I'm not sure how to resolve it. I don't want to process some of these columns, but I do want them in the results when I write the file so that I can validate the predictions.
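One possible reading of the IndexingError, separate from the passthrough question: DataFrame.apply calls its function once per column and hands it a Series, which would explain the 'occurred at index DIVISION_NAME' suffix, since rf.predict expects a 2-D table rather than a single column. The sketch below only illustrates that apply behaviour; the toy frame and the record function are hypothetical and stand in for the real data and for rf.predict.

```python
import pandas as pd

# Hypothetical two-column frame standing in for df[categorical_features].
toy = pd.DataFrame({
    'DIVISION_NAME': ['Sales', 'IT'],
    'ROLE': ['admin', 'user'],
})

seen = []

def record(arg):
    # Stand-in for rf.predict: note what apply actually passes in.
    seen.append((type(arg).__name__, arg.name))
    return len(arg)

toy.apply(record)  # default axis=0: one call per column
print(seen)        # each entry is a Series named after a column
```

If that reading is right, calling predict once on the whole frame, rather than through apply, avoids the per-column calls.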

While debugging, I took a step back and now see a different error when I call fit. With X = data, I get:

ValueError: could not convert string to float: 'Herve'

Note that "Herve" is a value from a column that is passed through (remainder='passthrough'), so the model shouldn't be seeing it, or so I've been led to believe.
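A small sketch of what remainder='passthrough' does on a ColumnTransformer may help here: columns that are not listed are not hidden from later steps but are appended, untransformed, to the encoded output. The toy frame and the encode helper are hypothetical, and sparse_threshold=0 is used only to force a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: FIRST_NAME is not listed for encoding.
toy = pd.DataFrame({
    'FIRST_NAME': ['Herve', 'Ana', 'Li'],
    'ROLE': ['admin', 'user', 'admin'],
})

def encode(remainder):
    # sparse_threshold=0 forces a dense array so the result is easy to inspect.
    ct = ColumnTransformer(
        transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), ['ROLE'])],
        remainder=remainder,
        sparse_threshold=0)
    return ct.fit_transform(toy)

dropped = encode('drop')         # the default: unlisted columns are discarded
passed = encode('passthrough')   # unlisted columns are appended after the encoded ones

print(dropped.shape)   # one-hot columns for ROLE only
print(passed.shape)    # the same columns plus FIRST_NAME
print(passed[:, -1])   # raw names that the next pipeline step would receive
```

With passthrough, the raw names reach whatever step follows the preprocessor, which is consistent with a classifier failing to convert 'Herve' to a float.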


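Removing remainder='passthrough' resolved the problem: with it set, the untransformed string columns were handed on to the classifier. I then appended the predictions to 'X' as a new column and got what I needed, since scikit-learn 0.20 can deal with pandas DataFrames.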
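The approach can be sketched roughly as follows. This is a minimal illustration and not the original code: the toy data, the reduced column list, the results name, and the choice of REVIEWER_ACTION as the label are all assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; only a few of the original columns are shown.
df = pd.DataFrame({
    'FIRST_NAME': ['Herve', 'Ana', 'Li', 'Omar'],
    'DIVISION_NAME': ['Sales', 'IT', 'Sales', 'IT'],
    'ROLE': ['admin', 'user', 'user', 'admin'],
    'REVIEWER_ACTION': ['revoke', 'keep', 'keep', 'revoke'],
})
categorical_features = ['DIVISION_NAME', 'ROLE']

categorical_transformer = Pipeline(steps=[
    ('si', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

# No remainder='passthrough': unlisted columns are dropped before the classifier.
preprocessor = ColumnTransformer(
    transformers=[('cat', categorical_transformer, categorical_features)])

rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('rfc', RandomForestClassifier(n_estimators=100, random_state=0))])

# Assumption: REVIEWER_ACTION is the label being learned.
X = df.drop(columns=['REVIEWER_ACTION'])
y = df['REVIEWER_ACTION']
rf.fit(X, y)

# Predict once on the whole frame, then attach the result as a new column.
results = df.copy()
results['rf_prediction'] = rf.predict(X)
print(results)
```

The extra columns such as FIRST_NAME never reach the model, yet they remain in results next to the prediction, so the file written from it can still be used to validate each row.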
