I have a dataset and would like to train CNNs on subsets of the dataset of different sizes. I already have a CNN that classifies very well when I use the entire dataset. Now the question arises whether I should additionally optimize the hyperparameters of the CNN for the subsets, regardless of whether I do data augmentation or not. Does it really make sense to try to change the CNN model for the subsets by using …
I am trying to perform hyperparameter tuning on an sklearn_crfsuite.CRF model. When I try to execute the code below, it doesn't raise any exception, but it probably fails to perform the fit, and as a result, if I try to get the best estimator from the grid search, it doesn't work.

%%time
# define fixed parameters and parameters to search
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    "c1": [0, 0.05, 0.1, 0.25, 0.5, 1],
    "c2": [0, 0.05, 0.1, 0.25, 0.5, 1]
}
# use the same metric for evaluation
f1_scorer …
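For reference, a minimal self-contained sketch of this kind of search, with toy sequences standing in for the real data and the flat F1 scorer from the sklearn-crfsuite tutorial filling in for the truncated scorer (everything below is a placeholder, not the asker's actual setup). It is also worth noting that some combinations of sklearn-crfsuite and newer scikit-learn releases reportedly have trouble cloning the CRF inside GridSearchCV, which is one thing to rule out first:

import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# toy sequence data: each sentence is a list of per-token feature dicts with aligned labels
X_toy = [[{'w': 'john'}, {'w': 'runs'}], [{'w': 'mary'}, {'w': 'sleeps'}]] * 10
y_toy = [['B-PER', 'O'], ['B-PER', 'O']] * 10
labels = ['B-PER']                      # score F1 on the entity labels only, as in the tutorial

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100, all_possible_transitions=True)
params_space = {'c1': [0, 0.05, 0.1, 0.25, 0.5, 1], 'c2': [0, 0.05, 0.1, 0.25, 0.5, 1]}
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=labels)

gs = GridSearchCV(crf, params_space, cv=3, scoring=f1_scorer, verbose=1, n_jobs=-1)
gs.fit(X_toy, y_toy)
print(gs.best_params_, gs.best_score_)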
This is a screenshot of my code. I used abc.best_estimator_ (my GridSearchCV model) to find the best results. As you can see, the grid has values of C=1 and C=100 along with other values. abc.best_estimator_ says C=1 is the best value. For cross-checking, I tried using different values of C, and here I'm getting a better score for C=100. I was getting similar results while searching for gamma as well, but later on I commented out gamma so as to focus on …
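One way to make that cross-check on an equal footing is to read the per-candidate scores out of cv_results_, so the C=1 and C=100 rows are compared on exactly the same folds. A sketch with synthetic data, where abc stands for the fitted GridSearchCV object as in the question:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)

abc = GridSearchCV(SVC(kernel='rbf'), {'C': [0.1, 1, 10, 100]}, cv=5)
abc.fit(X, y)

# mean CV score for every C on the same folds; best_estimator_ is simply the argmax of this column
report = pd.DataFrame(abc.cv_results_)[['param_C', 'mean_test_score', 'std_test_score']]
print(report.sort_values('mean_test_score', ascending=False))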
According to the sklearn.pipeline.Pipeline documentation, the class whose instance is a pipeline element should implement fit() and transform(). I managed to create a custom class that has these methods and works fine with a single pipeline. Now I want to use that Pipeline object as the estimator argument for GridSearchCV. The latter requires the custom class to have a set_params() method, since I want to search over the range of custom instance parameters, as opposed to using a single instance of my …
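A common pattern for this (a sketch, with a made-up transformer and parameter) is to inherit from BaseEstimator and TransformerMixin: BaseEstimator supplies get_params()/set_params() automatically, provided __init__ stores each argument under an attribute of the same name:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class Clipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that clips features at +/- threshold."""
    def __init__(self, threshold=3.0):
        self.threshold = threshold          # same name as the __init__ argument

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.clip(-self.threshold, self.threshold)

X, y = make_classification(n_samples=200, random_state=0)
pipe = Pipeline([('clip', Clipper()), ('clf', LogisticRegression(max_iter=1000))])

# the step-name__parameter-name syntax reaches into the custom transformer
grid = GridSearchCV(pipe, {'clip__threshold': [1.0, 2.0, 3.0], 'clf__C': [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)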
To summarize the problem: I have a data set with ~1450 samples, 19 features and a binary outcome where classes are fairly balanced (0.51 to 0.49). I split the data into a train set and a test set using train_test_split(X, Y, test_size = 0.30, random_state = 42). I am using the train set to tune hyper-parameters in algorithms optimizing for specificity, using GridSearchCV, with a RepeatedStratifiedKFold (10 splits, 3 repeats) cross-validation, and scoring=make_scorer(recall_score, pos_label=0). I am then using the predictions …
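For concreteness, a sketch of that tuning setup with a placeholder estimator and grid, and synthetic data of roughly the described shape:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split

X, Y = make_classification(n_samples=1450, n_features=19, weights=[0.51, 0.49], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=42)

# specificity = recall of the negative class
specificity = make_scorer(recall_score, pos_label=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    {'max_depth': [3, 5, None]},
                    scoring=specificity, cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)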
I'm trying to use GridSearchCV for my multiclass problem. For starters, I wanted to test it on KNeighborsClassifier. First, here's the code where I define the function which uses GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

def grid_search(estimator, parameters, X, y):
    scoring = ['accuracy', 'precision', 'recall']
    kf = KFold(5)
    clf = GridSearchCV(estimator, parameters, cv=kf, scoring=scoring, refit="accuracy", n_jobs=-1)
    clf.fit(X, y)
    i = clf.best_index_
    best_precision = clf.cv_results_['mean_test_precision'][i]
    best_recall = clf.cv_results_['mean_test_recall'][i]
    print('Best score (accuracy): {}'.format(clf.best_score_))
    print('Mean precision: {}'.format(best_precision))
    print('Mean recall: {}'.format(best_recall))
    print('Best …
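One detail worth checking, if the target really is multiclass: the plain 'precision' and 'recall' scorer names are defined for binary targets only, and a common workaround is to use the averaged variants. A sketch on toy data:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# macro-averaged precision/recall are defined for multiclass targets
scoring = ['accuracy', 'precision_macro', 'recall_macro']
clf = GridSearchCV(KNeighborsClassifier(),
                   {'n_neighbors': [3, 5, 7]},
                   cv=KFold(5, shuffle=True, random_state=0),
                   scoring=scoring, refit='accuracy', n_jobs=-1)
clf.fit(X, y)

i = clf.best_index_
print(clf.best_score_,
      clf.cv_results_['mean_test_precision_macro'][i],
      clf.cv_results_['mean_test_recall_macro'][i])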
I'm doing classification using Python. I'm using the class GridSearchCV; this class has the attribute best_score_, defined as the "mean cross-validated score of the best_estimator". With this class I can also compute the score over the test set using score. Now, I understand the theoretical difference between the two values (one is computed in the cross-validation, the other is computed on the test set), but how should I interpret them? For example, if in case 1 I get these values (respectively …
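For concreteness, the two numbers come from different data. A small sketch with a placeholder model, grid, and synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5).fit(X_tr, y_tr)
print(grid.best_score_)        # mean CV score of the best C, computed on the training folds only
print(grid.score(X_te, y_te))  # score of the refit best estimator on the held-out test set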
I am using an XGBoost model to classify some data. I have CV splits (train, val) and a separate test set that I never touch until the end. I used GridSearchCV to determine the best parameters, fed my CV splits (5 folds) into it, and set refit=True so that once it finds the best hyperparameters it trains on the full data (all 5 folds, as opposed to just 4 of 5) and returns the best_estimator. I then …
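A minimal sketch of that workflow, with placeholder data and grid, assuming the xgboost package for XGBClassifier:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(XGBClassifier(),
                    {'max_depth': [3, 5], 'n_estimators': [100, 300]},
                    cv=5, refit=True, n_jobs=-1)
grid.fit(X_tr, y_tr)                       # refit=True: the best params are refit on all of X_tr

final_model = grid.best_estimator_          # already trained on the full training data
print(final_model.score(X_te, y_te))        # the untouched test set is used exactly once, at the end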
I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before the training folds are selected versus oversampled after the training folds are selected. The oversampling approach I used was random oversampling. I understand that the first approach is wrong, since observations that the model has seen bleed into the test set; I was just curious about how much of a difference this causes. I generated a binary classification dataset with the following:

# Generate binary classification dataset with 5% minority class, …
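For reference, the leak-free variant is usually expressed with imblearn's Pipeline, so the oversampler is refit on the training folds only inside each split. A sketch with a placeholder estimator, grid, and a synthetic 5% minority dataset:

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline          # imblearn's Pipeline, which accepts samplers as steps
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

pipe = Pipeline([('ros', RandomOverSampler(random_state=0)),
                 ('clf', RandomForestClassifier(random_state=0))])

grid = GridSearchCV(pipe, {'clf__max_depth': [5, 10, None]},
                    cv=StratifiedKFold(5), scoring='f1', n_jobs=-1)
grid.fit(X, y)   # oversampling is redone inside each training fold; validation folds stay untouched
print(grid.best_score_)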
I implemented all the major ML models (Logistic Regression, Naive Bayes, SVM, KNN, Decision Tree, Random Forest, AdaBoost & XGBoost) on my dataset. My stratified cross-validation scores are between 70% & 80%. When I implemented my models using grid search, my accuracies shot up and now lie between 90% & 95%. Is this drastic increase in accuracy abnormal and fishy? My GridSearchCV code for Logistic Regression:

from sklearn.datasets import make_blobs, make_classification
from sklearn.model_selection import GridSearchCV

scaled_inputs, targets = …
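One sanity check (sketched below with synthetic data; variable names follow the snippet) is to score the default model and the tuned model with the same cross-validation object, so the two accuracy figures are guaranteed to come from comparable evaluations:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_inputs, targets = make_classification(n_samples=1000, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scaling inside the pipeline keeps the preprocessing out of the validation folds
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

baseline = cross_val_score(pipe, scaled_inputs, targets, cv=cv).mean()

grid = GridSearchCV(pipe, {'logisticregression__C': [0.01, 0.1, 1, 10]}, cv=cv)
grid.fit(scaled_inputs, targets)

print(baseline, grid.best_score_)   # both are mean CV accuracies on the same folds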
How can I pass multiple values to each parameter in the mlflow run command? The objective is to pass a dictionary to GridSearchCV as a param_grid to perform cross-validation. In my main code, I retrieve the command-line parameters using argparse. By adding nargs='+' to add_argument(), I can write space-separated values for each hyperparameter and then apply vars() to create the dictionary. See the code below:

import argparse

# Build the parameters for the command-line
param_names = list(RandomForestClassifier().get_params().keys())
…
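A stripped-down sketch of the argparse side, with an illustrative pair of parameters and a type= hint so the values arrive as numbers rather than strings (how the values are quoted on the mlflow run command line is left out here):

import argparse
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parser = argparse.ArgumentParser()
# nargs='+' turns "--n_estimators 100 200" into [100, 200]; type= keeps the values numeric
parser.add_argument('--n_estimators', nargs='+', type=int, default=[100])
parser.add_argument('--max_depth', nargs='+', type=int, default=[None])
args = parser.parse_args()

param_grid = vars(args)                  # e.g. {'n_estimators': [100, 200], 'max_depth': [None]}

X, y = make_classification(n_samples=500, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)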
I have a lasso regression model with the following definition:

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import scale
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

folds = KFold(n_splits = 5, shuffle = True, random_state = 100)

# specify range of hyperparameters
hyper_params = …
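A sketch of how such a search over the Lasso regularisation strength is typically completed; the grid values and synthetic data below are placeholders, not the asker's truncated hyper_params:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=300, n_features=20, noise=10, random_state=100)

folds = KFold(n_splits=5, shuffle=True, random_state=100)
hyper_params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10]}   # placeholder grid

model_cv = GridSearchCV(estimator=Lasso(max_iter=10000),
                        param_grid=hyper_params,
                        scoring='r2',
                        cv=folds,
                        return_train_score=True,
                        n_jobs=-1)
model_cv.fit(X, y)
print(model_cv.best_params_)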
I first construct a baseline model (using default parameters) and obtain its MAE.

# BASELINE MODEL
rfr_pipe.fit(train_x, train_y)
base_rfr_pred = rfr_pipe.predict(test_x)
base_rfr_mae = mean_absolute_error(test_y, base_rfr_pred)

MAE = 2.188

Then I perform GridSearchCV to get the best parameters and the average MAE.

# RFR GRIDSEARCHCV
rfr_param = {'rfr_model__n_estimators' : [10, 100, 500, 1000],
             'rfr_model__max_depth' : [None, 5, 10, 15, 20],
             'rfr_model__min_samples_leaf' : [10, 100, 500, 1000],
             'rfr_model__max_features' : ['auto', 'sqrt', 'log2']}
rfr_grid = GridSearchCV(estimator = rfr_pipe, param_grid = rfr_param, n_jobs = -1, …
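To compare like with like, the search can score on MAE directly and the refit best model can then be evaluated on the same held-out test set as the baseline. A sketch in which the pipeline internals and data are assumptions, with the regressor step named 'rfr_model' to match the param grid above:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=10, noise=5, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

# assumed pipeline layout, with the regressor step named 'rfr_model' as in the param grid
rfr_pipe = Pipeline([('scale', StandardScaler()),
                     ('rfr_model', RandomForestRegressor(random_state=0))])

rfr_param = {'rfr_model__n_estimators': [100, 500],
             'rfr_model__max_depth': [None, 10]}
rfr_grid = GridSearchCV(rfr_pipe, rfr_param, n_jobs=-1, cv=5,
                        scoring='neg_mean_absolute_error')    # CV scores MAE, negated by convention
rfr_grid.fit(train_x, train_y)

print(-rfr_grid.best_score_)                                  # cross-validated MAE of the best params
print(mean_absolute_error(test_y, rfr_grid.predict(test_x)))  # test-set MAE, comparable to the baseline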
I wrote a program to find the best combination of coefficients to describe a variable. However, the coefficients from GridSearchCV do not match the expected line well. This is a sample of my data:

pipe = make_pipeline(process, SelectKBest(f_regression), model)
gs = GridSearchCV(pipe, params, n_jobs=-1, cv=5, return_train_score=False)
gs.fit(x_train, y_train)
fin = gs.best_estimator_.steps[2][1]
coef = fin.coef_
intercept = fin.intercept_

and these are the coefficients given. Then if I plot the line with the coefficients:

xplot = 16.15589 + 1.13934372*df_loc.ChargeAmount + 1.605411*df_loc.PatientPrice + 6.81365603*df_loc.LastCost …
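One likely cause worth checking, sketched below with placeholder steps and synthetic data: fin.coef_ lives in the transformed feature space, i.e. after the preprocessing step and SelectKBest, so it cannot be combined directly with the raw columns:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=8, noise=5, random_state=0)

# StandardScaler and LinearRegression stand in for the 'process' and 'model' objects in the question
pipe = make_pipeline(StandardScaler(), SelectKBest(f_regression), LinearRegression())
params = {'selectkbest__k': [3, 5, 8]}
gs = GridSearchCV(pipe, params, n_jobs=-1, cv=5).fit(X, y)

scaler, selector, fin = (step for _, step in gs.best_estimator_.steps)
kept = selector.get_support(indices=True)        # which original columns survived SelectKBest
print(kept, fin.coef_, fin.intercept_)
# note: fin.coef_ applies to the *scaled* versions of the kept columns, so reproducing the
# fitted line on raw features also requires undoing the scaler (divide by scaler.scale_[kept]
# and adjust the intercept accordingly)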
Are GridSearchCV and RandomizedSearchCV always/necessarily supposed to give more accurate results after hyperparameter tuning, compared to plain K-Fold cross-validation?
Dataframe:

id  review                                         name    label
1   it is a great product for turning lights on.   Ashley  1
2   plays music and have a good sound.              Alex    1
3   I love it, lots of fun.                          Peter   0

The aim is to classify the text: if the review is about the functionality of the product (e.g. turning the light on, music), label=1; otherwise label=0. I am running several sklearn models to see which one works best:

# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', …
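A sketch of how one of those pipelines is typically wired up and tuned; the grid values are placeholders, and the toy rows above are replicated so cross-validation has enough samples per class:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'review': ['it is a great product for turning lights on.',
               'plays music and have a good sound.',
               'I love it, lots of fun.'],
    'label': [1, 1, 0],
})
df = pd.concat([df] * 10, ignore_index=True)    # replicate the toy rows for cross-validation

text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', MultinomialNB())])

param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],   # placeholder grid
              'clf__alpha': [0.1, 1.0]}
gs = GridSearchCV(text_clf_nb, param_grid, cv=3, scoring='accuracy')
gs.fit(df['review'], df['label'])
print(gs.best_params_, gs.best_score_)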
So, my predicament here is as follows: I performed hyperparameter tuning on a standalone Decision Tree classifier and got the best results. Now comes the turn of standalone AdaBoost, and here is where my problem lies. If I use the tuned Decision Tree from earlier as the base_estimator in AdaBoost and then perform hyperparameter tuning on AdaBoost only, will it yield the same results as performing hyperparameter tuning on an untuned AdaBoost and an untuned Decision Tree as a …
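One way to probe the joint question directly is to tune the tree's parameters through AdaBoost in a single search using the nested parameter syntax. A sketch with placeholder data and grids (recent scikit-learn calls the argument estimator, older releases call it base_estimator, and the grid keys change accordingly):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# recent scikit-learn uses estimator=...; older releases call it base_estimator=
# (and the grid keys become base_estimator__max_depth etc.)
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=0), random_state=0)

param_grid = {
    'estimator__max_depth': [1, 2, 3],         # the tree's hyperparameters, tuned through AdaBoost
    'n_estimators': [50, 100, 200],            # AdaBoost's own hyperparameters
    'learning_rate': [0.5, 1.0],
}
grid = GridSearchCV(ada, param_grid, cv=5, n_jobs=-1).fit(X, y)
print(grid.best_params_)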
This is a binary classification problem, and I am using GridSearchCV from sklearn to find the best model. Here are the GridSearch lines I am using:

scoring = {'AUCe': 'roc_auc', 'Accuracy': 'accuracy', 'prec': 'precision',
           'rec': 'recall', 'f1s': 'f1', 'spec': make_scorer(recall_score, pos_label=0)}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1,
                           cv=cv, scoring=scoring, refit='Accuracy')

All is fine, but my problem is that I want the model to be picked based on both the highest Accuracy and Recall. I know that in order to sort the values and pick the …
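For what it's worth, refit can also be given a callable that receives cv_results_ and returns the index of the candidate to refit, which allows a hand-written joint Accuracy/Recall criterion. A sketch with a placeholder model and grid, where the combination rule is purely illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

scoring = {'Accuracy': 'accuracy', 'rec': 'recall',
           'spec': make_scorer(recall_score, pos_label=0)}

def pick_best(cv_results):
    # illustrative rule: maximise the sum of mean accuracy and mean recall across folds
    combined = (np.asarray(cv_results['mean_test_Accuracy'])
                + np.asarray(cv_results['mean_test_rec']))
    return int(np.argmax(combined))

model = LogisticRegression(max_iter=1000)
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(model, param_grid, n_jobs=-1, cv=5,
                           scoring=scoring, refit=pick_best)
grid_search.fit(X, y)
print(grid_search.best_index_, grid_search.best_params_)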
I am confused about the n_jobs parameter used in some models and for CV. I know it is used for parallel computing, where it sets the number of processors to use. So if I set the value to -1, it will use all the cores and their threads for faster computation. But this article (https://machinelearningmastery.com/multi-core-machine-learning-in-python/#comment-617976) states that using all cores for training, evaluation and hyperparameter tuning is a bad idea. The crux of the article is as follows: …
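The compromise that article argues for can be written down explicitly: parallelise at one level only, so the estimator's workers and the search's workers do not multiply. A sketch with a placeholder model and grid:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# parallelise the cross-validation, not the model, to avoid oversubscribing the cores
model = RandomForestClassifier(n_jobs=1, random_state=0)
grid = GridSearchCV(model, {'n_estimators': [100, 300], 'max_depth': [5, None]},
                    cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)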
I am using GridSearchCV to optimise my predictions, and it's been 5 hours now that the process has been running. I am working with a fairly large dataset and I am afraid I have not optimised the parameters enough.

df_train.describe():

           Unnamed: 0           col1           col2           col3           col4           col5
count    8.886500e+05  888650.000000  888650.000000  888650.000000  888650.000000  888650.000000
mean     5.130409e+05       2.636784       3.845549       4.105381       1.554918       1.221922
std      2.998785e+05       2.296243       1.366518       3.285802       1.375791       1.233717
min      4.000000e+00       1.010000       1.010000       1.010000       0.000000       0.000000
25%      2.484332e+05       1.660000       3.230000       2.390000       1.000000       0.000000
…
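Not an answer to the parameter choices themselves, but a common way to cut runtime on data of this size is to swap GridSearchCV for RandomizedSearchCV, which fits a fixed number of sampled candidates instead of the full grid. A sketch with a placeholder estimator and ranges:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=50000, n_features=5, random_state=0)

param_dist = {'n_estimators': randint(100, 500),
              'max_depth': randint(3, 20)}

# n_iter bounds the number of fitted candidates, so the cost no longer grows with the grid size
search = RandomizedSearchCV(RandomForestClassifier(n_jobs=1, random_state=0),
                            param_dist, n_iter=20, cv=3, n_jobs=-1,
                            random_state=0, verbose=1)
search.fit(X, y)
print(search.best_params_)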