sklearn predict: IndexingError: ('Too many indexers', 'occurred at index <name>')
I want the output to contain all of the use_cols, but the model should be trained only on categorical_features. The output will then be used to compare the predicted 'REVIEW_ACTION' with the actual 'REVIEWER_ACTION'. Setting aside for the moment why this is A BAD THING TO DO, can we focus on how to achieve it?
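from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Columns wanted in the final output file.
use_cols = ['FIRST_NAME', 'LAST_NAME', 'PERSON_STATUS', 'DIVISION_NAME',
            'PERSON_TYPE', 'JOB_CHANGE', 'JOB_TRANSFER', 'IDENTIFY_DATE',
            'SSO', 'USER_ID', 'ASSET_ID', 'ROLE', 'HPA',
            'LAST_LOGIN_DATE', 'ASSET_NAME', 'AUDIT_ID',
            'REVIEWER_ACTION', 'USER ID', 'LABELED_ROLE', 'REVIEW_ACTION']
prediction_col = ['REVIEW_ACTION']
# Columns the model should actually be trained on.
categorical_features = ['DIVISION_NAME', 'PERSON_TYPE', 'JOB_CHANGE', 'JOB_TRANSFER',
                        'SSO', 'USER_ID', 'ASSET_ID', 'ROLE', 'HPA', 'LAST_LOGIN_DATE',
                        'USER ID', 'LABELED_ROLE']

# Fill missing values with a constant, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('si', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

# remainder is a ColumnTransformer argument; Pipeline does not accept it.
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)],
    remainder='passthrough')

rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('rfc', RandomForestClassifier(n_estimators=100))
                     ])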
I then fit the model and try to run the prediction:
df['rf_prediction'] = df[categorical_features].apply(rf.predict)
and get the error:
IndexingError: ('Too many indexers', 'occurred at index DIVISION_NAME')
This seems to have something to do with the columns being passed through, but I'm not sure how to resolve it. I don't want to process some of these columns, but I do want them in the results when I write the file so that I can validate the predictions.
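For reference, here is a minimal sketch of the shape being aimed for, under some assumptions. The four-row frame and the names `toy`, `toy_features`, `toy_cat`, `toy_pre`, `toy_rf` and `result` are made up for illustration and are not the real data. The sketch leaves `remainder` at its default (`'drop'`), fits on the whole frame, and calls `predict` once on the whole frame. `DataFrame.apply` instead hands `rf.predict` one column at a time as a Series, which is a likely source of the "Too many indexers" error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the real frame.
toy = pd.DataFrame({
    'FIRST_NAME': ['Herve', 'Ann', 'Bob', 'Cy'],
    'DIVISION_NAME': ['A', 'B', 'A', 'B'],
    'ROLE': ['admin', 'user', 'user', 'admin'],
    'REVIEWER_ACTION': ['keep', 'revoke', 'revoke', 'keep'],
})
toy_features = ['DIVISION_NAME', 'ROLE']

toy_cat = Pipeline(steps=[
    ('si', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
# remainder defaults to 'drop', so only toy_features reach the classifier.
toy_pre = ColumnTransformer(transformers=[('cat', toy_cat, toy_features)])
toy_rf = Pipeline(steps=[
    ('preprocessor', toy_pre),
    ('rfc', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Fit on the full frame; the ColumnTransformer picks out the columns it needs.
toy_rf.fit(toy, toy['REVIEWER_ACTION'])

# One predict call on the whole frame returns one label per row.
result = toy.copy()
result['rf_prediction'] = toy_rf.predict(toy)
print(result)
```

Because the prediction is attached to a copy of the original frame, every original column is still available when the file is written, even though the model never saw the unprocessed ones.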
While debugging I took a step back, and I now see a different error when I call fit. If X = data I see:
ValueError: could not convert string to float: 'Herve'
Note that "Herve" is a value from a field that is passed through (remainder='passthrough'), so the model shouldn't be seeing it, or so I've been led to believe.
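A small check, again on made-up data, may help test that belief. It compares the output of a `ColumnTransformer` with `remainder='drop'` against one with `remainder='passthrough'`. The names `demo`, `drop_ct`, `pass_ct`, `dropped` and `passed` are hypothetical, and `sparse_threshold=0` is set only so that the output is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-row frame: one column to encode, one to leave alone.
demo = pd.DataFrame({
    'FIRST_NAME': ['Herve', 'Ann'],
    'ROLE': ['admin', 'user'],
})

# sparse_threshold=0 forces a dense array so the output is easy to inspect.
drop_ct = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['ROLE'])],
    remainder='drop', sparse_threshold=0)
pass_ct = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['ROLE'])],
    remainder='passthrough', sparse_threshold=0)

dropped = drop_ct.fit_transform(demo)
passed = pass_ct.fit_transform(demo)

print(dropped.shape)  # only the one-hot columns for ROLE
print(passed.shape)   # one-hot columns plus the untouched FIRST_NAME column
print(passed)
```

In this sketch the passed-through column is appended to the transformer's output unchanged, so whatever step comes next in a pipeline receives the raw strings. That would be consistent with the classifier failing to convert 'Herve' to a float.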
Topic pipelines scikit-learn
Category Data Science