Gridsearch ValueError: Input contains infinity or a value too large for dtype('float64'). - Using Pipeline

Update: I have no NaN values, so fillna is not the issue. The dataset is clean.

I'm getting this error when I try to predict using my grid's best params. I get a score when I fit on the training data, but the error appears when I try to predict on X_test. Very confused.

I'm attempting to use a Pipeline and GridSearchCV combined on my dataset. The code works up to the training and scoring part.

It's a clean dataset and has no NaN values.

My code is:

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

classifiers = [AdaBoostClassifier(),
               XGBClassifier(),
               LogisticRegression(),
               DecisionTreeClassifier(),
               RandomForestClassifier()]

num_cols = X_train.select_dtypes(include='number').columns
cat_cols = X_train.select_dtypes(include='object').columns

categorical_transformation = make_pipeline(MinMaxScaler(),
                                           VarianceThreshold(),
                                           PowerTransformer(method='yeo-johnson'))

integer_features = list(X_train.columns[X_train.dtypes == 'int64'])
continuous_features = list(X_train.columns[X_train.dtypes == 'float64'])

int_transformation = make_pipeline(MinMaxScaler(),
                                   VarianceThreshold(),
                                   PowerTransformer(method='yeo-johnson'))
float_transformation = make_pipeline(MinMaxScaler(),
                                     VarianceThreshold(),
                                     PowerTransformer(method='yeo-johnson'))

preprocessor = make_column_transformer((int_transformation, integer_features),
                                       (float_transformation, continuous_features))

for classifier in classifiers:
    pipe = make_pipeline(preprocessor, classifier)
    grid = GridSearchCV(pipe, cv=5, scoring='recall', param_grid={})
    grid.fit(X_train, y_train)
    
    print(classifier)
    print(grid.best_score_)
    # RandomForestClassifier()
    # 0.9996252992392879

pipe = make_pipeline(preprocessor, LogisticRegression())
param_grid_logreg = {'logisticregression__C': [0.1, 1, 10, 100, 1000]}

grid_logreg = GridSearchCV(estimator=pipe, param_grid=param_grid_logreg, cv=5)

grid_logreg.fit(X_train, y_train)

print('Best score:', grid_logreg.best_score_)
print('Best parameters:', grid_logreg.best_params_)
# Best score: 0.9337686658306279
# Best parameters: {'logisticregression__C': 0.1}

log_reg_best_model = grid_logreg.best_estimator_
log_reg_best_model.score(X_train, y_train)
# 0.9983211913323731

log_reg_best_model.predict(X_test)

Error:

ValueError: Input contains infinity or a value too large for dtype('float64').


I solved the issue in the end.

The issue was the order of my pipeline: I had placed the PowerTransformer at the end of the pipeline, which produced the infinite values. Moving the MinMaxScaler after it solved this :)
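
For reference, a minimal sketch of the reordered numeric transformation, assuming the same steps as in the question (the scaler now comes after the Yeo-Johnson transform, so its output is rescaled into a bounded range before reaching the classifier):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from sklearn.feature_selection import VarianceThreshold

# Drop zero-variance columns, apply Yeo-Johnson, then rescale the result;
# scaling last keeps the transformed values within [0, 1].
int_transformation = make_pipeline(VarianceThreshold(),
                                   PowerTransformer(method='yeo-johnson'),
                                   MinMaxScaler())
float_transformation = make_pipeline(VarianceThreshold(),
                                     PowerTransformer(method='yeo-johnson'),
                                     MinMaxScaler())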


The error is raised in both cases, i.e. whether you have NaNs or infinite values.

From the scikit-learn source code (this is the check that raises the error):

def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)

You could simply run X_test.describe() and check the max and min values; if you have -np.inf or np.inf as the min or max of a column respectively, you could replace them.
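
As a quick check, something like the following (a sketch, assuming X_test is a pandas DataFrame) shows whether any column contains infinite values and replaces them:

import numpy as np

# Inspect the min/max of every column; np.inf / -np.inf will show up here.
print(X_test.describe().loc[['min', 'max']])

# Replace infinite values, e.g. with NaN, so they can be imputed or dropped afterwards.
X_test = X_test.replace([np.inf, -np.inf], np.nan)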
