RandomizedSearchCV() not scoring all fits
I'm running into an issue where a RandomizedSearchCV search is not evaluating all of its fits: 50 of the 100 fits do not get scored (score=nan), so I'm worried I'm wasting a lot of time running the search. I've spent the past few days trying to troubleshoot this without finding anything, and I'm hoping the community can help me squash this bug. Now, the details:
I have constructed an XGBClassifier model as such:
import xgboost as xgb

xgb_clf = xgb.XGBClassifier(tree_method='exact', predictor='cpu_predictor', verbosity=1,
                            objective='binary:logistic', scale_pos_weight=1.64)
# my training set is imbalanced: ~85k majority class vs ~53k minority class
Currently, I am using the hashing trick to encode my categorical variables, since they are all nominal. I do this after splitting my data into X and y:
import category_encoders as ce

ce_hash = ce.HashingEncoder()
hashed_X = ce_hash.fit_transform(X)
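For reference, HashingEncoder hashes each categorical column into a fixed number of numeric components (8 by default, as I understand the category_encoders docs), so a quick sanity check on the transformed frame looks like this (just a sketch):

print(hashed_X.shape)
print(hashed_X.head())  # hashed features come out as col_0 ... col_7 by default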
I then do my train_test_split as normal, then instantiate a RandomizedSearchCV with a parameter grid:
from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = tts(hashed_X, y, test_size=0.25)
# create my classifier
xgb_clf = xgb.XGBClassifier(tree_method='exact', predictor='cpu_predictor', verbosity=1,
                            objective='binary:logistic', scale_pos_weight=4)
# Create parameter grid
params = {'learning_rate': [0.2, 0.1, 0.01, 0.001],
          'gamma': [10, 12, 14, 16],
          'max_depth': [2, 4, 7, 10, 13],
          'colsample_bytree': [0.8, 1.0, 1.2, 1.4],
          'subsample': [0.8, 0.85, 0.9, 0.95, 1, 1.1],
          'eta': [0.05, 0.1, 0.2],
          'reg_alpha': [1.5, 2, 2.5, 3],
          'reg_lambda': [0.5, 1, 1.5, 2],
          'min_child_weight': [1, 3, 5, 7],
          'n_estimators': [100, 250, 500]}
from sklearn.model_selection import RandomizedSearchCV
# Create RandomizedSearchCV Object
xgb_rscv = RandomizedSearchCV(xgb_clf, param_distributions=params, scoring='precision',
                              cv=10, verbose=3)
# Fit the search: the default n_iter=10 samples ten candidates, each
# cross-validated on ten folds, i.e. 100 individual fits.
model_xgboost = xgb_rscv.fit(X_train, y_train)
However, for 50 of the 100 fits, I get output that looks like this:
[CV] subsample=0.8, reg_lambda=2, reg_alpha=3, n_estimators=100, min_child_weight=3, max_depth=10, learning_rate=0.001, gamma=16, eta=0.1, colsample_bytree=1.4, **score=nan**, total= 0.1s
When this happens, it happens in blocks of ten: all 10 folds for one sampled parameter candidate come back with score=nan. Which candidates fail varies from run to run, but there are always 50 fits (five candidates) that don't get scored correctly.
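To pin down which candidates are failing, one thing I can do is dump the search results into a DataFrame (a minimal sketch; model_xgboost is the fitted search object from above):

import pandas as pd

# List the parameter combinations whose mean test score came back as nan
results = pd.DataFrame(model_xgboost.cv_results_)
failed = results[results['mean_test_score'].isna()]
print(failed['params'].tolist())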
Would anyone know how I can attempt to correct this and ensure that all 100 fits get scored? Is this happening because I'm using a hashed feature set?
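One troubleshooting step I'm considering (a sketch, not something I've confirmed fixes it): by default scikit-learn catches estimator exceptions during a search and records score=nan, but passing error_score='raise' should make the first failing fit raise its underlying exception, which would show why those candidates fail:

# Re-create the search with error_score='raise' so a failing fit surfaces
# the underlying XGBoost exception instead of being silently recorded as nan
xgb_rscv_debug = RandomizedSearchCV(xgb_clf, param_distributions=params,
                                    scoring='precision', cv=10, verbose=3,
                                    error_score='raise')
xgb_rscv_debug.fit(X_train, y_train)

(If I'm reading the changelog right, the default error_score changed from 'raise' to np.nan in scikit-learn 0.22, which would explain why these failures are silent.)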
Thanks!
Topic hashing-trick gridsearchcv python
Category Data Science