sklearn models Parameter tuning GridSearchCV

Dataframe:

id    review                                              name         label
1     it is a great product for turning lights on.        Ashley       1
2     plays music and have a good sound.                  Alex         1
3     I love it, lots of fun.                             Peter        0

The aim is to classify the text; if the review is about the functionality of the product (e.g. turn the light on, music), label=1, otherwise label=0.

I am running several sklearn models to see which one works best:

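from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Naïve Bayes: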
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

# Linear Support Vectors Classifier:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC(loss='hinge',
              penalty='l2', max_iter = 50))])

# SGDClassifier
text_clf_sgd = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SGDClassifier(loss='hinge',
              penalty='l2', alpha=1e-3, random_state=42, max_iter=50, tol=None))])

#Random Forest
text_clf_rf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', RandomForestClassifier())])

#neural network MLPClassifier
text_clf_mlp = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MLPClassifier())])

Problem: how do I tune these models using GridSearchCV? What I have so far:

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3) }
gs_clf = GridSearchCV(text_clf_nb, param_grid= parameters, cv=2,  scoring='roc_auc', n_jobs=-1)
gs_clf = gs_clf.fit((X_train, y_train))

This gives the following error on running gs_clf = gs_clf.fit((X_train, y_train)):

ValueError: Invalid parameter C for estimator Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.

I would appreciate any suggestions. Thanks.



Parameters of a step inside a Pipeline are addressed with a double underscore, in the form step_name__parameter_name. The first thing I noticed is this line:

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3) }

You are calling vect__ngram_range, but your pipeline has no step named vect. The vectorizer step is named tfidf, so this should be tfidf__ngram_range.
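The error message itself suggests checking estimator.get_params().keys(). A minimal sketch of that check, assuming the text_clf_nb pipeline from the question (the four names tested are only examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same pipeline as in the question: steps are named 'tfidf' and 'clf'.
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

# Every tunable name has the form "<step name>__<parameter name>".
valid_names = sorted(text_clf_nb.get_params().keys())

# Report which candidate grid keys the pipeline would accept.
for name in ('vect__ngram_range', 'tfidf__ngram_range', 'clf__alpha', 'clf__C'):
    print(name, name in valid_names)
```

Only tfidf__ngram_range and clf__alpha are reported as valid here: there is no vect step, and MultinomialNB has no C parameter.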

The vect name is not what the displayed error is about, though. The message complains about C, which is a parameter of an SVM, not of MultinomialNB, so it looks as if parts of your code got mixed up somewhere. Check that the pipeline and the parameter grid you are actually passing belong together: I suspect a grid written for the pipeline that contains the SVM is being used to tune the MultinomialNB pipeline.

So check whether a dictionary with the same name as this one:

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3) }

is also being created elsewhere for an SVM. If two grids share the name parameters, whichever was assigned last is the one GridSearchCV receives.
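One way to avoid that kind of mix-up is to give each pipeline its own, distinctly named grid and check the grid against the pipeline before searching. This is only a sketch: the names parameters_nb, parameters_lsvc and unknown_params are hypothetical, and the C values are placeholders rather than recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

# One grid per pipeline, each under its own name (hypothetical names).
parameters_nb = {'tfidf__ngram_range': [(1, 1), (1, 2)],
                 'tfidf__use_idf': (True, False),
                 'clf__alpha': (1e-2, 1e-3)}
parameters_lsvc = {'tfidf__ngram_range': [(1, 1), (1, 2)],
                   'tfidf__use_idf': (True, False),
                   'clf__C': (0.1, 1, 10)}  # placeholder values

def unknown_params(pipeline, grid):
    """Return the grid keys that the pipeline does not recognise."""
    valid = pipeline.get_params().keys()
    return [name for name in grid if name not in valid]

print(unknown_params(text_clf_nb, parameters_nb))    # []
print(unknown_params(text_clf_nb, parameters_lsvc))  # ['clf__C']
```

An empty list means the grid matches the pipeline. A non-empty list names the keys that would trigger the "Invalid parameter" ValueError.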

Finally, I would also replace these lines:

gs_clf = GridSearchCV(text_clf_nb, param_grid= parameters, cv=2,  scoring='roc_auc', n_jobs=-1)
gs_clf = gs_clf.fit((X_train, y_train))

with just this:

gs_clf = GridSearchCV(text_clf_nb, param_grid= parameters, cv=2,  scoring='roc_auc', n_jobs=-1).fit(X_train, y_train)

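I am not sure why you are passing a tuple to the fit method: fit((X_train, y_train)) hands the whole tuple over as X and leaves y unset, whereas fit expects X and y as two separate arguments.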
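Putting the points together, a corrected search for the Naïve Bayes pipeline might look like the sketch below. The eight reviews and their labels are made-up stand-ins for X_train and y_train, and n_jobs is left at its default to keep the example light.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up stand-in data: 1 = review about functionality, 0 = otherwise.
X_train = [
    "it is a great product for turning lights on",
    "plays music and has a good sound",
    "turns the lights off with a voice command",
    "sets timers and plays the radio",
    "I love it, lots of fun",
    "best gift ever, so happy",
    "my kids love it",
    "arrived quickly, very pleased",
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

# Keys use the pipeline's own step names: 'tfidf' and 'clf'.
parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

# X and y are passed to fit as two separate arguments, not as a tuple.
gs_clf = GridSearchCV(text_clf_nb, param_grid=parameters, cv=2,
                      scoring='roc_auc').fit(X_train, y_train)

print(gs_clf.best_params_)
print(gs_clf.best_score_)
```

After fitting, gs_clf.best_estimator_ holds the pipeline refitted on all of the training data with the best parameter combination.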
