CNN for subsets of a dataset - how to tune hyperparameters

I have a dataset and would like to train CNNs on subsets of the dataset of different sizes. I already have a CNN which classifies very well when I use the entire dataset. The question now is whether I should additionally try to optimize the hyperparameters of the CNN for the subsets, regardless of whether I do data augmentation or not. Does it really make sense to change the CNN model for the subsets by using …
Category: Data Science

How to perform Grid Search on NLP CRF model

I am trying to perform hyperparameter tuning on a sklearn_crfsuite.CRF model. When I execute the code below, it doesn't raise any exception, but it apparently fails to fit; as a result, trying to retrieve the best estimator from the grid search doesn't work. %%time # define fixed parameters and parameters to search crf = sklearn_crfsuite.CRF( algorithm='lbfgs', max_iterations=100, all_possible_transitions=True ) params_space = { "c1": [0, 0.05, 0.1, 0.25, 0.5, 1], "c2": [0, 0.05, 0.1, 0.25, 0.5, 1] } # use the same metric for evaluation f1_scorer …
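For reference, a minimal sketch of the usual pattern from the sklearn-crfsuite tutorial: a randomized search over c1/c2 with a flat F1 scorer. X_train and y_train (lists of token-feature dicts and label sequences) are placeholders for the question's data, and error_score='raise' makes per-fold failures visible instead of silently producing a search with no usable best estimator.

```python
import scipy.stats
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True,
)

# search the regularisation coefficients instead of fixing them
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# sequence-level (flat) F1 as the selection metric
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted')

rs = RandomizedSearchCV(crf, params_space, cv=3, n_iter=20,
                        scoring=f1_scorer, n_jobs=-1, verbose=1,
                        error_score='raise')
# rs.fit(X_train, y_train)   # X_train/y_train: feature dicts and label sequences
# print(rs.best_params_, rs.best_score_)
```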
Category: Data Science

I'm using GridSearchCV to find parameter C for the SVC() classifier in sklearn.svm, but I'm not getting the desired optimal result

This is a screenshot of my code. I used abc.best_estimator_ (my GridSearchCV model) to find the best results. As you can see, the grid has values of C=1 and C=100 along with other values. abc.best_estimator_ says C=1 is the best value. For cross-checking I tried different values of C, and here I'm getting a better score for C=100. I was getting similar results while tuning gamma too, but later I commented out gamma so as to focus on …
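For what it's worth, a hedged sketch of an apples-to-apples cross-check: score each candidate C with the same cross-validation splitter GridSearchCV used, rather than with a single train/test fit. The synthetic data is only a stand-in for the question's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# what GridSearchCV optimises: the mean validation-fold score
grid = GridSearchCV(SVC(kernel='rbf'), {'C': [1, 10, 100]}, cv=cv)
grid.fit(X, y)
print('best C according to CV:', grid.best_params_, grid.best_score_)

# manual cross-check on the *same* splits, not on a single held-out set
for C in (1, 100):
    scores = cross_val_score(SVC(kernel='rbf', C=C), X, y, cv=cv)
    print(C, scores.mean())
```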
Category: Data Science

Need an example of a custom class whose instance is fed to sklearn Pipeline / make_pipeline to use with GridSearchCV

According to the sklearn.pipeline.Pipeline documentation, the class whose instance is a pipeline element should implement fit() and transform(). I managed to create a custom class that has these methods and works fine with a single pipeline. Now I want to use that Pipeline object as the estimator argument for GridSearchCV. The latter requires the custom class to have a set_params() method, since I want to search over a range of custom-instance parameters, as opposed to using a single instance of my …
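Here is a minimal sketch of such a class, under the assumption that inheriting from BaseEstimator and TransformerMixin is acceptable: that gives get_params()/set_params() for free as long as __init__ only stores its arguments. The transformer and its power parameter are invented purely for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

class PowerFeatures(BaseEstimator, TransformerMixin):
    """Toy transformer: raises every feature to `power`."""
    def __init__(self, power=1):
        # only store constructor arguments; BaseEstimator derives
        # get_params()/set_params() from the __init__ signature
        self.power = power

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) ** self.power

X, y = make_classification(n_samples=300, n_features=5, random_state=0)  # stand-in data
pipe = make_pipeline(PowerFeatures(), LogisticRegression(max_iter=1000))

# pipeline parameters are addressed as <step name>__<param name>
param_grid = {'powerfeatures__power': [1, 2, 3],
              'logisticregression__C': [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```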
Category: Data Science

random_state on train_test_split() appears to have a large effect on performance metrics?

To summarize the problem: I have a data set with ~1450 samples, 19 features and a binary outcome where the classes are fairly balanced (0.51 to 0.49). I split the data into a train set and a test set using train_test_split(X, Y, test_size=0.30, random_state=42). I am using the train set to tune hyperparameters in algorithms optimizing for specificity, using GridSearchCV with a RepeatedStratifiedKFold (10 splits, 3 repeats) cross-validation and scoring=make_scorer(recall_score, pos_label=0). I am then using the predictions …
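One way to see whether this is a genuine model difference or just split variance (plausible with ~1450 samples) is to repeat the whole split-and-evaluate loop over many random states and look at the spread. A rough sketch; the dataset and classifier are stand-ins for the question's tuned pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples=1450, n_features=19, weights=[0.51, 0.49],
                           random_state=0)          # stand-in data
model = LogisticRegression(max_iter=1000)           # stand-in for the tuned model

specificities = []
for seed in range(30):
    X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.30,
                                              stratify=Y, random_state=seed)
    model.fit(X_tr, y_tr)
    # specificity = recall of the negative class
    specificities.append(recall_score(y_te, model.predict(X_te), pos_label=0))

print('mean', np.mean(specificities), 'std', np.std(specificities))
```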
Category: Data Science

How to refit GridSearchCV on Multiclass problem

I'm trying to use GridSearchCV for my multiclass problem. For starters, I wanted to test it on KNeighborsClassifier. First, here's the code where I define the function which uses GridSearchCV: from sklearn.model_selection import GridSearchCV from sklearn.model_selection import KFold def grid_search(estimator, parameters, X, y): scoring = ['accuracy', 'precision', 'recall'] kf = KFold(5) clf = GridSearchCV(estimator, parameters, cv=kf, scoring=scoring, refit="accuracy", n_jobs=-1) clf.fit(X, y) i = clf.best_index_ best_precision = clf.cv_results_['mean_test_precision'][i] best_recall = clf.cv_results_['mean_test_recall'][i] print('Best score (accuracy): {}'.format(clf.best_score_)) print('Mean precision: {}'.format(best_precision)) print('Mean recall: {}'.format(best_recall)) print('Best …
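One likely culprit, offered as a hedged guess: the plain 'precision' and 'recall' scorer strings are binary-only, so on a multiclass target those folds fail and the corresponding mean_test_* entries come back as NaN. A sketch of the same function with macro-averaged scorers and error_score='raise' so failures surface:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def grid_search(estimator, parameters, X, y):
    # macro-averaged variants work for multiclass targets
    scoring = ['accuracy', 'precision_macro', 'recall_macro']
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    clf = GridSearchCV(estimator, parameters, cv=kf, scoring=scoring,
                       refit='accuracy', n_jobs=-1, error_score='raise')
    clf.fit(X, y)
    i = clf.best_index_
    print('Best score (accuracy):', clf.best_score_)
    print('Mean macro precision:', clf.cv_results_['mean_test_precision_macro'][i])
    print('Mean macro recall:', clf.cv_results_['mean_test_recall_macro'][i])
    return clf

X, y = load_iris(return_X_y=True)  # small multiclass stand-in
grid_search(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}, X, y)
```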
Category: Data Science

What's the difference between the GridSearchCV cross-validation score and the score on the test set?

I'm doing classification using Python. I'm using the class GridSearchCV; this class has the attribute best_score_, defined as the "Mean cross-validated score of the best_estimator". With this class I can also compute the score over the test set using score(). Now, I understand the theoretical difference between the two values (one is computed in the cross-validation, the other is computed on the test set), but how should I interpret them? For example, if in case 1 I get these values (respectively …
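Concretely, the two numbers are computed on different data: best_score_ is the mean validation-fold score inside the training set, while score(X_test, y_test) evaluates the refitted best estimator on held-out data. A small self-contained sketch, with synthetic data and SVC used only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# mean cross-validated score of the best parameter setting (training data only)
print('CV score:  ', grid.best_score_)

# score of the refitted best estimator on data it has never seen
print('Test score:', grid.score(X_test, y_test))
```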
Category: Data Science

My own model trained on the full data is better than the best_estimator I get from GridSearchCV with refit=True?

I am using an XGBoost model to classify some data. I have cv splits (train, val) and a separate test set that I never use until the end. I have used GridSearchCV to determine the best parameters and fed my cv splits (5 folds) into it as well as set refit=True so that once it figures out the best hyperparameters it trains on the full data (all folds as opposed to just 4/5 folds) and returns the best_estimator. I then …
Category: Data Science

Overfitted model produces similar AUC on test set, so which model do I go with?

I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before the training folds are selected versus oversampled after the training folds are selected. The oversampling approach I used was random oversampling. I understand that the first approach is wrong, since observations that the model has seen bleed into the test set; I was just curious how much of a difference this causes. I generated a binary classification dataset with the following: # Generate binary classification dataset with 5% minority class, …
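For the "oversample after the training folds are selected" variant, the usual trick is an imbalanced-learn pipeline, so that RandomOverSampler is applied to the training folds only, inside GridSearchCV. A sketch under the assumption that the imbalanced-learn package is available; the classifier and grid are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # resampling-aware pipeline

# binary dataset with a 5% minority class
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([('ros', RandomOverSampler(random_state=0)),
                 ('clf', LogisticRegression(max_iter=1000))])

grid = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, cv=5, scoring='roc_auc')
# oversampling now happens per training fold, never before the split
grid.fit(X_train, y_train)
print(grid.best_score_, grid.score(X_test, y_test))
```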
Category: Data Science

Query regarding surprising spike in accuracy of ML model

I implemented all the major ML models (Logistic Regression, Naive Bayes, SVM, KNN, Decision Tree, Random Forest, AdaBoost & XGBoost) on my dataset. My stratified cross-validation scores are between 70% and 80%. When I implemented my models using grid search, my accuracies shot up and now lie between 90% and 95%. Is this drastic increase in accuracy abnormal and fishy? My GridSearchCV code for Logistic Regression: from sklearn.datasets import make_blobs, make_classification from sklearn.model_selection import GridSearchCV scaled_inputs, targets = …
Category: Data Science

Multiple values for a single parameter in the mlflow run command

How do I pass multiple values to each parameter in the mlflow run command? The objective is to pass a dictionary to GridSearchCV as param_grid to perform cross-validation. In my main code, I retrieve the command-line parameters using argparse, and by adding nargs='+' to add_argument() I can write space-separated values for each hyperparameter and then apply vars() to create the dictionary. See the code below: import argparse # Build the parameters for the command-line param_names = list(RandomForestClassifier().get_params().keys()) …
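A rough sketch of the argparse side, with a couple of made-up hyperparameter names: each flag takes several space-separated values, and vars() turns the namespace into a dict that GridSearchCV accepts as param_grid. Whether the values arrive as separate arguments depends on how the MLproject command template interpolates the parameter (it must be left unquoted).

```python
import argparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parser = argparse.ArgumentParser()
# nargs='+' collects several values per flag, e.g.
#   python train.py --n_estimators 100 200 500 --max_depth 5 10
parser.add_argument('--n_estimators', type=int, nargs='+', default=[100])
parser.add_argument('--max_depth', type=int, nargs='+', default=[5])
args = parser.parse_args()

param_grid = vars(args)   # {'n_estimators': [...], 'max_depth': [...]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
# grid.fit(X, y)           # X, y: your training data
print(param_grid)
```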
Category: Data Science

Why is GridSearchCV.best_estimator_.score giving me R² even though I specified MAE as my main scoring metric?

I have a Lasso regression model with the following definition: import sklearn from sklearn.model_selection import train_test_split from sklearn.preprocessing import MinMaxScaler from sklearn.preprocessing import PolynomialFeatures from sklearn.preprocessing import scale from sklearn.feature_selection import RFE from sklearn.linear_model import LinearRegression, Lasso from sklearn.svm import SVR from sklearn.model_selection import cross_val_score from sklearn.model_selection import KFold from sklearn.model_selection import GridSearchCV from sklearn.pipeline import make_pipeline from sklearn.metrics import r2_score folds = KFold(n_splits = 5, shuffle = True, random_state = 100) # specify range of hyperparameters hyper_params = …
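The short answer is that best_estimator_.score() is the estimator's own score method, which for any regressor is R², and it ignores the scoring argument passed to GridSearchCV; to see the MAE you either use the search object's score (which does use the scorer) or compute it directly. A self-contained sketch with synthetic data standing in for the question's setup:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model_cv = GridSearchCV(Lasso(max_iter=5000), {'alpha': [0.01, 0.1, 1.0]}, cv=5,
                        scoring='neg_mean_absolute_error')
model_cv.fit(X_train, y_train)

# R^2 -- Lasso.score() is always R^2, regardless of GridSearchCV's `scoring`
print(model_cv.best_estimator_.score(X_test, y_test))

# the metric you actually tuned for
print(-model_cv.score(X_test, y_test))                        # via the search's scorer
print(mean_absolute_error(y_test, model_cv.predict(X_test)))  # computed directly
```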
Category: Data Science

Worse performance after Hyperparameter tuning

I first construct a base model (using default parameters) and obtain MAE. # BASELINE MODEL rfr_pipe.fit(train_x, train_y) base_rfr_pred = rfr_pipe.predict(test_x) base_rfr_mae = mean_absolute_error(test_y, base_rfr_pred) MAE = 2.188 Then I perform GridSearchCV to get best parameters and get the average MAE. # RFR GRIDSEARCHCV rfr_param = {'rfr_model__n_estimators' : [10, 100, 500, 1000], 'rfr_model__max_depth' : [None, 5, 10, 15, 20], 'rfr_model__min_samples_leaf' : [10, 100, 500, 1000], 'rfr_model__max_features' : ['auto', 'sqrt', 'log2']} rfr_grid = GridSearchCV(estimator = rfr_pipe, param_grid = rfr_param, n_jobs = -1, …
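One thing worth double-checking, offered as a guess: the grid's min_samples_leaf values (10 to 1000) never include the default of 1 that the baseline used, so the tuned model can only choose among settings that may all be worse than the baseline. A sketch of a grid that includes the defaults and is scored on MAE; the data and pipeline are stand-ins mirroring the question's rfr_pipe:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data and pipeline mirroring the question's rfr_pipe
X, y = make_regression(n_samples=1000, n_features=8, noise=15, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)
rfr_pipe = Pipeline([('scaler', StandardScaler()),
                     ('rfr_model', RandomForestRegressor(random_state=0))])

# grid that *includes* the defaults the baseline used (min_samples_leaf=1, all features)
rfr_param = {'rfr_model__n_estimators': [100, 500],
             'rfr_model__min_samples_leaf': [1, 5, 10],
             'rfr_model__max_features': [1.0, 'sqrt']}

rfr_grid = GridSearchCV(rfr_pipe, rfr_param, cv=5, n_jobs=-1,
                        scoring='neg_mean_absolute_error')
rfr_grid.fit(train_x, train_y)
print(mean_absolute_error(test_y, rfr_grid.predict(test_x)))
```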
Category: Data Science

GridSearchCV best coefficients do not match well with the expected line

I wrote a program to find the best combination of coefficients to describe a variable. However, the coefficients from GridSearchCV do not match well with the expected line. This is a sample of my data: pipe = make_pipeline(process, SelectKBest(f_regression), model) gs = GridSearchCV(pipe, params, n_jobs=-1, cv=5, return_train_score=False); gs.fit(x_train, y_train) fin = gs.best_estimator_.steps[2][1]; coef = fin.coef_; intercept = fin.intercept_ and these are the coefficients given. Then if I plot the line with the coefficients: xplot = 16.15589 + 1.13934372*df_loc.ChargeAmount + 1.605411*df_loc.PatientPrice + 6.81365603*df_loc.LastCost …
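One common gotcha here, offered as a hedged guess at the cause: the final step's coef_ lives in the transformed space, i.e. after the `process` step (scaling) and after SelectKBest has dropped columns, so multiplying it directly against the raw dataframe columns gives the wrong line. A self-contained sketch of mapping the coefficients back to column names; StandardScaler stands in for the unknown `process` step and the data is synthetic:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data with named columns
X, y = make_regression(n_samples=300, n_features=6, noise=5, random_state=0)
x_train = pd.DataFrame(X, columns=[f'f{i}' for i in range(6)])

pipe = make_pipeline(StandardScaler(), SelectKBest(f_regression), LinearRegression())
gs = GridSearchCV(pipe, {'selectkbest__k': [2, 3, 4]}, cv=5)
gs.fit(x_train, y)

best = gs.best_estimator_
selector = best.steps[1][1]   # fitted SelectKBest
model = best.steps[2][1]      # fitted linear model

# coef_ refers only to the selected, scaled features -- map it back to column names
selected = x_train.columns[selector.get_support()]
print(pd.Series(model.coef_, index=selected))

# to get predictions on raw data, run it through the whole pipeline rather than
# hand-multiplying coefficients (which would ignore the scaling step)
print(gs.predict(x_train)[:5])
```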
Category: Data Science

Parameter tuning of sklearn models with GridSearchCV

Dataframe:

id  review                                          name    label
1   it is a great product for turning lights on.    Ashley  1
2   plays music and have a good sound.              Alex    1
3   I love it, lots of fun.                          Peter   0

The aim is to classify the text: if the review is about the functionality of the product (e.g. turn the light on, music), label=1, otherwise label=0. I am running several sklearn models to see which one works best: # Naïve Bayes: text_clf_nb = Pipeline([('tfidf', …
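A minimal sketch of one of these pipelines plus a small grid over it; the other models follow the same pattern with the classifier swapped out. The grid values are arbitrary, and df stands for the full review dataframe:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Naive Bayes pipeline: TF-IDF features + classifier
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', MultinomialNB())])

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2],
    'clf__alpha': [0.1, 1.0],
}

grid = GridSearchCV(text_clf_nb, param_grid, cv=5, scoring='f1')
# grid.fit(df['review'], df['label'])   # df: the full review dataframe
# print(grid.best_params_, grid.best_score_)
```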
Category: Data Science

Does tuning a Decision Tree and then using it in AdaBoost, versus tuning both simultaneously, yield the same results?

My predicament here is as follows: I performed hyperparameter tuning on a standalone Decision Tree classifier and got the best results; now comes the turn of standalone AdaBoost. Here is where my problem lies: if I use the tuned Decision Tree from earlier as the base_estimator in AdaBoost and then perform hyperparameter tuning on AdaBoost only, will it yield the same results as performing hyperparameter tuning on an untuned AdaBoost and an untuned Decision Tree as a …
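In general it won't be the same: the depth that is best for a standalone tree is usually not the best depth for AdaBoost's weak learners. The "simultaneous" option means putting the tree's parameters into the AdaBoost grid via the nested base_estimator__ prefix, roughly as in the sketch below (parameter names are for scikit-learn < 1.2; newer releases use estimator / estimator__max_depth):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())

# tune the tree and the booster together instead of one after the other
param_grid = {
    'base_estimator__max_depth': [1, 2, 3],
    'n_estimators': [50, 200, 500],
    'learning_rate': [0.1, 0.5, 1.0],
}

grid = GridSearchCV(ada, param_grid, cv=5, n_jobs=-1)
# grid.fit(X, y)   # X, y: your training data
```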
Category: Data Science

How to pick the best model based on Accuracy and Recall in a GridSearchCV when you have already set scoring = custom_scorer?

This is a binary classification problem. I am using GridSearchCV from sklearn to find the best model; here is the GridSearchCV call I am using: scoring = {'AUCe': 'roc_auc', 'Accuracy': 'accuracy', 'prec': 'precision', 'rec': 'recall', 'f1s': 'f1', 'spec': make_scorer(recall_score, pos_label=0)} grid_search = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv, scoring=scoring, refit='Accuracy') All is fine, but my problem is that I want the model to be picked based on both the highest Accuracy and Recall. I know that in order to sort the values and pick the …
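One option, sketched below: when scoring is a dict, refit may also be a callable that receives cv_results_ and returns the index of the candidate to refit, so accuracy and recall can be combined however you like (here, a plain average). The estimator, grid and cv value are stand-ins for the question's objects:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

scoring = {'AUCe': 'roc_auc', 'Accuracy': 'accuracy', 'prec': 'precision',
           'rec': 'recall', 'f1s': 'f1',
           'spec': make_scorer(recall_score, pos_label=0)}

def refit_acc_and_recall(cv_results):
    # rank candidates by the mean of accuracy and recall across folds
    combined = (np.asarray(cv_results['mean_test_Accuracy'])
                + np.asarray(cv_results['mean_test_rec'])) / 2
    return int(np.argmax(combined))

model = LogisticRegression(max_iter=1000)      # stand-in estimator
param_grid = {'C': [0.1, 1, 10]}               # stand-in grid

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1,
                           cv=5, scoring=scoring, refit=refit_acc_and_recall)
# grid_search.fit(X, y)   # X, y: your training data
```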
Category: Data Science

n_jobs=-1 or n_jobs=1?

I am confused about the n_jobs parameter used in some models and for CV. I know it is used for parallel computing, where it uses the number of processors specified in the n_jobs parameter. So if I set the value to -1, it will use all the cores and their threads for faster computation. But this article (https://machinelearningmastery.com/multi-core-machine-learning-in-python/#comment-617976) states that using all cores for training, evaluation and hyperparameter tuning is a bad idea. The crux of the article is as follows: …
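For what it's worth, the practical upshot is usually to parallelise at one level only, so the worker processes don't oversubscribe the cores; one common combination is a parallel search over a single-threaded estimator, sketched below:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# parallelise the hyperparameter search, keep the estimator single-threaded
model = RandomForestClassifier(n_jobs=1)
grid = GridSearchCV(model, {'n_estimators': [100, 300, 500]}, cv=5, n_jobs=-1)
# grid.fit(X, y)   # X, y: your training data
```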
Category: Data Science

Can I run this job quicker for GridSearchCV?

I am using GridSearchCV to optimise my predictions, and the process has now been running for 5 hours. I am running a fairly large dataset and I am afraid I have not optimised the parameters enough. df_train.describe():

       Unnamed: 0    col1           col2           col3           col4           col5
count  8.886500e+05  888650.000000  888650.000000  888650.000000  888650.000000  888650.000000
mean   5.130409e+05  2.636784       3.845549       4.105381       1.554918       1.221922
std    2.998785e+05  2.296243       1.366518       3.285802       1.375791       1.233717
min    4.000000e+00  1.010000       1.010000       1.010000       0.000000       0.000000
25%    2.484332e+05  1.660000       3.230000       2.390000       1.000000       0.000000
…
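A few standard ways to cut the runtime without changing the model, sketched below with a placeholder estimator and grid: sample candidates with RandomizedSearchCV (or prune them early with the experimental HalvingGridSearchCV), use n_jobs=-1, and keep cv small:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_halving_search_cv  # noqa: enables HalvingGridSearchCV
from sklearn.model_selection import HalvingGridSearchCV, RandomizedSearchCV

param_dist = {'n_estimators': [100, 300, 500],
              'max_depth': [None, 10, 20],
              'min_samples_leaf': [1, 5, 10]}

# samples n_iter candidates instead of exhausting the full grid
search = RandomizedSearchCV(RandomForestRegressor(), param_dist,
                            n_iter=10, cv=3, n_jobs=-1, random_state=0)

# alternative: successive halving discards weak candidates on small data budgets first
# search = HalvingGridSearchCV(RandomForestRegressor(), param_dist, cv=3, n_jobs=-1)

# search.fit(X_train, y_train)   # X_train, y_train: your training data
```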
Category: Data Science
