Why does a LightGBM model produce different results while testing?

Using the LightGBM regressor, I have trained my data and, using grid search, I got the best parameters, but while testing with the best parameters I am getting different results each time, which means the model produces different results for each test iteration. I ran LightGBM twice with the same parameters but got different results in validation. The only random seed parameter I found was baggingSeed. After fixing baggingSeed, the problem still occurred. Should I fix any …
Category: Data Science
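
A minimal sketch of one way to pin down the remaining randomness, assuming LightGBM's scikit-learn API and a toy dataset (the parameter values are illustrative, not tuned):

    from lightgbm import LGBMRegressor
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # random_state seeds the booster; the extra *_seed parameters and the
    # deterministic/force_row_wise flags remove the other sources of run-to-run variation
    model = LGBMRegressor(
        random_state=42,
        bagging_seed=42,
        feature_fraction_seed=42,
        deterministic=True,
        force_row_wise=True,
        n_jobs=1,
    )
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))   # should now repeat exactly across runs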

Grid search - optimal weighting of classifiers

I am using three different off-the-shelf classifiers. It's a three-class classification task. I want to calculate the optimal weights (c1weight, c2weight, c3weight) for each classifier (the real task has more classifiers and also weights for each class). Maybe a simple grid-search approach or an sklearn ensemble classifier could do that.

    vc = VotingClassifier(estimators=[('gbc', GradientBoostingClassifier()),
                                      ('rf', RandomForestClassifier()),
                                      ('svc', SVC(probability=True))],
                          voting='soft', n_jobs=-1)
    params = {'weights': [[1, 2, 3], [2, 1, 3], [3, 2, 1]]}
    grid_Search = GridSearchCV(param_grid=params, estimator=vc)
    grid_Search.fit(X_new, y)
    print(grid_Search.best_score_)

I don't understand how to implement this for the following code. def get_classification(text, c1weight, …
Category: Data Science
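
A minimal sketch of how the weight search could look end to end, assuming the same three estimators and a synthetic three-class dataset; itertools.product enumerates every weight triple from a small candidate set instead of the three hand-picked lists:

    import itertools
    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier, RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)

    vc = VotingClassifier(
        estimators=[('gbc', GradientBoostingClassifier()),
                    ('rf', RandomForestClassifier()),
                    ('svc', SVC(probability=True))],
        voting='soft', n_jobs=-1)

    # every combination of weights drawn from {1, 2, 3} for the three classifiers
    params = {'weights': list(itertools.product([1, 2, 3], repeat=3))}

    grid_search = GridSearchCV(estimator=vc, param_grid=params, cv=3)
    grid_search.fit(X, y)
    print(grid_search.best_params_, grid_search.best_score_)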

MLP classifier GridSearchCV parameters to tune?

I'm looking to tune the parameters for sklearn's MLP classifier but don't know which to tune or how many options to give them. An example is the learning rate: should I give it [.0001, .001, .01, .1, .2, .3]? Or is that too many, too few, etc.? I have no basis to know what a good range is for any of the parameters. Processing power is limited, so I can't just test the full range. If anyone has a general guide of which are the most important to tune and …
Category: Data Science
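
As a rough starting point, a minimal sketch of a small, log-spaced grid over the parameters that usually matter most for an MLP (hidden_layer_sizes, alpha, learning_rate_init), assuming scaled inputs and sklearn's digits dataset as a stand-in:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_digits(return_X_y=True)

    pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))

    # a handful of log-spaced values per parameter is usually enough for a first pass;
    # refine around the winner afterwards rather than gridding the full range
    param_grid = {
        'mlpclassifier__hidden_layer_sizes': [(50,), (100,), (50, 50)],
        'mlpclassifier__alpha': [1e-4, 1e-3, 1e-2],
        'mlpclassifier__learning_rate_init': [1e-3, 1e-2, 1e-1],
    }

    search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_)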

Random search grid not displaying scoring metric

I want to do a grid search of a few hyperparameters through an XGBClassifier on a binary class, but whenever I run it the score value (roc_auc) is not being displayed. I read in another question that this can be related to some error in model training, but I am not sure which one it is in this case. My model training data X_train is a np.array of (X, 19) and my y_train is a numpy.ndarray of shape (X, ), which …
Category: Data Science
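
A minimal sketch of how to make the ROC AUC visible, assuming a RandomizedSearchCV over XGBClassifier with synthetic data shaped like the question's (the parameter grid is illustrative): scoring must be passed explicitly, verbose=3 prints the fold scores during the search, and the final scores live in best_score_ and cv_results_:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    X_train, y_train = make_classification(n_samples=500, n_features=19, random_state=0)

    param_dist = {'max_depth': [3, 5, 7],
                  'n_estimators': [100, 200],
                  'learning_rate': [0.05, 0.1]}

    search = RandomizedSearchCV(
        XGBClassifier(eval_metric='logloss'),
        param_distributions=param_dist,
        n_iter=5,
        scoring='roc_auc',   # explicit scorer; otherwise the estimator default is used
        cv=3,
        verbose=3,           # prints the per-fold score as each candidate is evaluated
        random_state=0,
    )
    search.fit(X_train, y_train)
    print(search.best_score_)                     # mean ROC AUC of the best candidate
    print(search.cv_results_['mean_test_score'])  # per-candidate mean ROC AUC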

Optimizing decision threshold on model with oversampled/imbalanced data

I'm working on developing a model with a highly imbalanced dataset (0.7% minority class). To remedy the imbalance, I was going to oversample using algorithms from the imbalanced-learn library. I had a workflow in mind which I wanted to share and get an opinion on whether I'm heading in the right direction or maybe I missed something.

1. Split Train/Test/Val
2. Set up a pipeline for GridSearch and optimize hyper-parameters (the pipeline will only oversample the training folds)
3. Scoring metric will be AUC as the training set is …
Category: Data Science
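
A minimal sketch of the second step, assuming SMOTE and a logistic regression as placeholders: an imblearn Pipeline only resamples during fit, so inside GridSearchCV the oversampling touches the training folds but never the validation folds:

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.99, 0.01], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    pipe = Pipeline([('smote', SMOTE(random_state=0)),
                     ('clf', LogisticRegression(max_iter=1000))])

    param_grid = {'clf__C': [0.01, 0.1, 1, 10]}
    search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)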

Unbalanced data set - how to optimize hyperparams via grid search?

I would like to optimize the hyperparameters C and gamma of an SVC by using grid search for an unbalanced data set. So far I have used class_weight='balanced' and selected the best hyperparameters based on the average of the f1-scores. However, the data set is very unbalanced, i.e. if I choose GridSearchCV with cv=10, then some minority classes are not represented in the validation data. I'm thinking of using SMOTE, but I see the problem here that I would have …
Category: Data Science
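
Before reaching for SMOTE, a minimal sketch of the simpler fix, assuming a synthetic imbalanced three-class set: pass a StratifiedKFold splitter to GridSearchCV so every validation fold keeps the class proportions, and keep class_weight='balanced' on the SVC:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                               weights=[0.8, 0.15, 0.05], random_state=0)

    # stratification guarantees the minority classes appear in every validation split
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
    search = GridSearchCV(SVC(class_weight='balanced'), param_grid,
                          scoring='f1_macro', cv=cv)
    search.fit(X, y)
    print(search.best_params_)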

How to refit GridSearchCV on Multiclass problem

I'm trying to use GridSearchCV for my multiclass problem. For starters, I wanted to test it on KNeighborsClassifier. First, here's the code where I define the function that uses GridSearchCV:

    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import KFold

    def grid_search(estimator, parameters, X, y):
        scoring = ['accuracy', 'precision', 'recall']
        kf = KFold(5)
        clf = GridSearchCV(estimator, parameters, cv=kf, scoring=scoring, refit="accuracy", n_jobs=-1)
        clf.fit(X, y)
        i = clf.best_index_
        best_precision = clf.cv_results_['mean_test_precision'][i]
        best_recall = clf.cv_results_['mean_test_recall'][i]
        print('Best score (accuracy): {}'.format(clf.best_score_))
        print('Mean precision: {}'.format(best_precision))
        print('Mean recall: {}'.format(best_recall))
        print('Best …
Category: Data Science
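
A minimal sketch of a multiclass-safe variant, assuming the iris dataset as a stand-in: the plain 'precision'/'recall' scorers only work for binary targets, so the *_macro (or *_weighted) versions are used instead, and the cv_results_ keys change accordingly:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    scoring = ['accuracy', 'precision_macro', 'recall_macro']

    clf = GridSearchCV(KNeighborsClassifier(),
                       {'n_neighbors': [3, 5, 7, 9]},
                       cv=StratifiedKFold(5),
                       scoring=scoring,
                       refit='accuracy',
                       n_jobs=-1)
    clf.fit(X, y)

    i = clf.best_index_
    print('Best score (accuracy):', clf.best_score_)
    print('Mean precision:', clf.cv_results_['mean_test_precision_macro'][i])
    print('Mean recall:', clf.cv_results_['mean_test_recall_macro'][i])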

What's the difference between the GridSearchCV cross-validation score and the score on the test set?

I'm doing classification using Python. I'm using the class GridSearchCV; this class has the attribute best_score_, defined as the "Mean cross-validated score of the best_estimator". With this class I can also compute the score over the test set using score. Now, I understand the theoretical difference between the two values (one is computed in the cross-validation, the other is computed on the test set), but how should I interpret them? For example, if in case 1 I get these values (respectively …
Category: Data Science
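
A minimal sketch that puts the two numbers side by side, assuming a toy dataset: best_score_ is the mean validation-fold score inside the training set, while score(X_test, y_test) evaluates the refit best estimator on data the search never saw:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
    search.fit(X_train, y_train)

    print(search.best_score_)             # mean cross-validated score of the best candidate
    print(search.score(X_test, y_test))   # held-out test score of the refit estimator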

XGBoost Log Loss different from GridSearchCV Log Loss

I have a classification problem where I am trying to predict if the data returns a 1 or 0, so your classic binary classification. I have split my data into the independent variables (the ones I am training on) and the dependent variable (my target that I am predicting, either a 0 or a 1). I am using log loss as the scoring metric for my model. Firstly, I am using the cv function in xgboost to …
Category: Data Science
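
A minimal sketch of the usual source of the mismatch, assuming a synthetic binary problem: xgb.cv reports raw logloss (lower is better) while GridSearchCV's 'neg_log_loss' is the negated value (higher is better), and the two also use different fold splits, so they agree only roughly even after flipping the sign:

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    # native xgboost cross-validation: raw logloss
    dtrain = xgb.DMatrix(X, label=y)
    cv_res = xgb.cv({'objective': 'binary:logistic', 'eval_metric': 'logloss'},
                    dtrain, num_boost_round=50, nfold=5, seed=0)
    print(cv_res['test-logloss-mean'].iloc[-1])

    # GridSearchCV: negated logloss, so flip the sign before comparing
    search = GridSearchCV(xgb.XGBClassifier(n_estimators=50, eval_metric='logloss'),
                          {'max_depth': [3]}, scoring='neg_log_loss', cv=5)
    search.fit(X, y)
    print(-search.best_score_)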

How to choose the best hyper-parameter when it is directly influenced by the random_state?

While trying to evaluate my Ridge Regression model and using GridSearchCV to find the best parameters, I noticed that the best estimator changes every time I change the random_state in my KFold object (the cv parameter). With this in mind, how do I choose the optimal hyper-parameter to implement my model?
Category: Data Science
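
One way to make the choice less dependent on a single shuffle is to average over many splits; a minimal sketch, assuming Ridge on sklearn's diabetes data with RepeatedKFold:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, RepeatedKFold

    X, y = load_diabetes(return_X_y=True)

    # 5 folds repeated 10 times = 50 validation scores per alpha, so the winner
    # is far less sensitive to any particular random_state
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

    search = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1, 10, 100]},
                          cv=cv, scoring='neg_mean_squared_error')
    search.fit(X, y)
    print(search.best_params_)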

Voting classifier using grid search for Time Series

I have three models: ARIMA, auto-ARIMA, and double exponential smoothing. I would like to apply an ensemble method - a voting method - and allow the classifier to learn weights for these three models. I have checked the VotingClassifier present in scikit-learn. It requires fit(X, y) to run, but a time series held in a Series object doesn't have a y. How do you apply a voting classifier and learn weights through grid search?
Category: Data Science
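
Since these forecasters are not sklearn classifiers, one option is a hand-rolled weight search on a hold-out window; a minimal sketch, assuming f1/f2/f3 are each model's forecasts for that window and actual holds the observed values (all four arrays below are hypothetical placeholders):

    import itertools
    import numpy as np

    f1 = np.array([10.2, 11.0, 12.1, 12.9])   # ARIMA forecast (placeholder)
    f2 = np.array([10.5, 11.2, 11.9, 13.1])   # auto-ARIMA forecast (placeholder)
    f3 = np.array([ 9.8, 10.9, 12.3, 12.7])   # double exponential smoothing (placeholder)
    actual = np.array([10.0, 11.1, 12.0, 13.0])

    best = None
    for w in itertools.product(np.linspace(0, 1, 11), repeat=3):
        if not np.isclose(sum(w), 1.0):        # only keep weights that sum to 1
            continue
        blend = w[0] * f1 + w[1] * f2 + w[2] * f3
        mse = np.mean((blend - actual) ** 2)
        if best is None or mse < best[0]:
            best = (mse, w)

    print('best weights:', best[1], 'validation MSE:', best[0])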

How to plot GridSearchCV cv_results_?

How can I plot my results from GridSearchCV's cv_results_?

    clf = GridSearchCV(pipeline, parameters, cv=3, return_train_score=True)
    clf.fit(x, y)
    df = pd.DataFrame(clf.cv_results_)

I'm trying to get a plot similar to the one here: https://matthewbilyeu.com/blog/2019-02-05/validation-curve-plot-from-gridsearchcv-results , but that example uses the grid search object, and I have tried and failed to get the same using just the grid search DataFrame (from above). Can anybody help with how I go about this?
Category: Data Science
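
A minimal sketch of a validation-curve plot built from the DataFrame alone, assuming the grid varied a single parameter whose column is named 'param_clf__C' (a hypothetical name; substitute whatever param_* column your grid produced). It needs return_train_score=True, which the code above already sets:

    import matplotlib.pyplot as plt

    def plot_validation_curve(df, param_col):
        d = df.sort_values(param_col)
        x = d[param_col].astype(float)
        plt.plot(x, d['mean_test_score'], marker='o', label='validation')
        plt.plot(x, d['mean_train_score'], marker='o', label='train')
        # shaded band = +/- one standard deviation of the validation score
        plt.fill_between(x, d['mean_test_score'] - d['std_test_score'],
                            d['mean_test_score'] + d['std_test_score'], alpha=0.2)
        plt.xlabel(param_col)
        plt.ylabel('score')
        plt.legend()
        plt.show()

    # plot_validation_curve(df, 'param_clf__C')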

Query regarding surprising spike in accuracy of ML model

I implemented all the major ML models (Logistic Regression, Naive Bayes, SVM, KNN, Decision Tree, Random Forest, AdaBoost & XGBoost) on my dataset. My stratified cross-validation scores are between 70% & 80%. When I implemented my models using grid search, my accuracies shot up & now lie between 90% & 95%. Is this drastic increase in accuracy abnormal & fishy? My GridSearchCV code for Logistic Regression:

    from sklearn.datasets import make_blobs, make_classification
    from sklearn.model_selection import GridSearchCV
    scaled_inputs, targets = …
Category: Data Science

GridSearch CV: Suitable scoring metrics for Imbalanced data sets

I am new to machine learning. This is my first machine learning project and I am working on classification on an imbalanced dataset. There are also multiple classes in the target variable. I would like to know what the most suitable metric is for scoring performance in GridSearchCV. I think roc_auc is sometimes used for imbalanced datasets, but there are several: 'roc_auc', 'roc_auc_ovo', 'roc_auc_ovr'. Which should I use? Alternatively, precision-recall AUC is also used. But I can't seem to find …
Category: Data Science
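
A minimal sketch of a few scorers that handle an imbalanced multiclass target, assuming a synthetic dataset and a random forest as the estimator; any of the string names in the comment can be dropped into the scoring argument:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                               weights=[0.8, 0.15, 0.05], random_state=0)

    # common choices for imbalanced multiclass problems:
    #   'f1_macro'          - unweighted mean F1 over classes (minority classes count equally)
    #   'balanced_accuracy' - mean recall over classes
    #   'roc_auc_ovr'       - one-vs-rest AUC averaged over classes (needs predict_proba)
    #   'roc_auc_ovo'       - one-vs-one AUC averaged over class pairs
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          {'max_depth': [3, 5, None]},
                          scoring='f1_macro', cv=5)
    search.fit(X, y)
    print(search.best_params_)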

Gridsearch ValueError: Input contains infinity or a value too large for dtype('float64'). - Using Pipeline

Update: I have no NaN values, so fillna is not an issue. Clean dataset. This error occurs when I try to predict using my grid's best params. I get a score when I fit it on the training data; I get this error, however, when I try to predict on X_test. Very confused. I'm attempting to use a pipeline and grid search combined for my dataset. The code works up to the training part and the score. It's a clean dataset …
Category: Data Science
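
When the training data is clean but predict fails, the culprit is often a transform in the pipeline (a log, a ratio, a division) producing inf or overflow values on X_test; a small hypothetical helper to locate them before predicting:

    import numpy as np
    import pandas as pd

    def report_non_finite(X, name='X'):
        # counts inf / -inf / NaN and, for DataFrames, names the offending columns
        arr = np.asarray(X, dtype=float)
        mask = ~np.isfinite(arr)
        print(f'{name}: {mask.sum()} non-finite values')
        if mask.any() and isinstance(X, pd.DataFrame):
            print('columns affected:', X.columns[mask.any(axis=0)].tolist())

    # report_non_finite(X_test, 'X_test')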

Determine model hyper-parameter values for grid search

I built machine learning models for Ridge, Lasso, Elastic Net, and linear regression, and used grid search for the parameter tuning. I want to know how to choose the value range for params_Ridge in the code below. For example, for the alpha parameter I used 1, 0.1, 0.01, 0.001, 0.0001, and 0, but I have no idea how these values should be determined for each model (Ridge/Lasso/Elastic Net). Can someone explain?

    from sklearn.linear_model import Ridge
    ridge_reg = Ridge()
    from sklearn.model_selection import GridSearchCV
    params_Ridge = {'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 0], "fit_intercept": [True, False], …
Category: Data Science
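
A minimal sketch of the usual convention, assuming sklearn's diabetes data as a stand-in: regularisation strengths are searched on a logarithmic scale (np.logspace) rather than hand-picked, and the same alphas can be reused for Ridge, Lasso, and Elastic Net, with l1_ratio added for the latter:

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import ElasticNet, Lasso, Ridge
    from sklearn.model_selection import GridSearchCV

    X, y = load_diabetes(return_X_y=True)

    alphas = np.logspace(-4, 2, 13)    # 1e-4 ... 1e2, evenly spaced in log space

    grids = {
        'ridge':   (Ridge(), {'alpha': alphas}),
        'lasso':   (Lasso(max_iter=10000), {'alpha': alphas}),
        'elastic': (ElasticNet(max_iter=10000),
                    {'alpha': alphas, 'l1_ratio': [0.1, 0.5, 0.9]}),
    }

    for name, (model, grid) in grids.items():
        search = GridSearchCV(model, grid, cv=5, scoring='neg_mean_squared_error')
        search.fit(X, y)
        print(name, search.best_params_)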

Brute-force feature selection and cross-validation

There is an existing score made of 10 parameters; each parameter is equally weighted & the total score is found by summing the score for each parameter. I want to try to reduce the number of parameters in this score but keep them equally weighted. I have data on 500 people with the score & two outcomes of interest. As the number of parameters is small, I started with a brute-force approach to look at all the possible combinations …
Category: Data Science
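
A minimal sketch of the brute-force loop with cross-validation folded in, assuming a synthetic stand-in for the 10 equally weighted parameters and one binary outcome: each candidate subset is summed into an unweighted score and evaluated by cross-validated AUC:

    from itertools import combinations

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    results = []
    for k in range(1, 11):
        for cols in combinations(range(10), k):
            score = X[:, list(cols)].sum(axis=1)   # equally weighted sum of the chosen parameters
            auc = cross_val_score(LogisticRegression(), score.reshape(-1, 1), y,
                                  cv=5, scoring='roc_auc').mean()
            results.append((auc, cols))

    print(max(results))    # best cross-validated AUC and the subset that achieved it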

Fashion MNIST: Is there an easy way to extract only 1% of the data to do a minimal gridsearch?

I am trying to implement several models on Fashion-MNIST. I have imported the data according to the tf.keras tutorial:

    import tensorflow as tf
    from tensorflow import keras
    import sklearn
    import numpy as np

    f_mnist = keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = f_mnist.load_data()
    class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                   'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

    print(train_images.shape)
    print(train_labels.shape)
    >>(60000, 28, 28)
    >>(60000,)
    print(test_images.shape)
    print(test_labels.shape)
    >>(10000, 28, 28)
    >>(10000,)

    # Need to concatenate as GridSearchCV takes the entire set as input
    all_images = …
Category: Data Science
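
A minimal sketch of the extraction step, assuming all_images / all_labels were concatenated as in the question (random arrays of the same shape stand in for them here): train_test_split with stratify keeps the class balance while carving off 1%:

    import numpy as np
    from sklearn.model_selection import train_test_split

    all_images = np.random.rand(70000, 28, 28)      # stand-in for the concatenated images
    all_labels = np.random.randint(0, 10, 70000)    # stand-in for the concatenated labels

    # keep 1% of the data, preserving the class proportions of the full set
    _, small_images, _, small_labels = train_test_split(
        all_images, all_labels, test_size=0.01, stratify=all_labels, random_state=0)

    # flatten to 2-D for a classical sklearn estimator inside GridSearchCV
    X_small = small_images.reshape(len(small_images), -1)
    print(X_small.shape, small_labels.shape)        # (700, 784) (700,)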

How to loop through multiple lists/dict?

I have the following code, which finds the best value of the k parameter in the KNNImputer. Basically it loops through the list k_value and, for each element, fits the KNNImputer into the model, and at the end appends the result to an empty dataframe.

    lire_model = LinearRegression()
    k_value = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
    k_value_results = pd.DataFrame(columns=['k', 'mse', 'rmse', 'mae', 'r2'])
    scoring_list = ['neg_mean_squared_error', 'neg_root_mean_squared_error',
                    'neg_mean_absolute_error', 'r2']

    for s in k_value:
        imputer = …
Category: Data Science
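
One alternative to the manual loop is to let GridSearchCV sweep n_neighbors and collect all four metrics at once; a minimal sketch, assuming a pipeline of KNNImputer plus LinearRegression and some artificially injected missing values:

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.impute import KNNImputer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, y = load_diabetes(return_X_y=True)
    X = X.copy()
    X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan   # inject ~10% missing values

    pipe = Pipeline([('imputer', KNNImputer()), ('model', LinearRegression())])

    param_grid = {'imputer__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
    scoring = ['neg_mean_squared_error', 'neg_root_mean_squared_error',
               'neg_mean_absolute_error', 'r2']

    search = GridSearchCV(pipe, param_grid, scoring=scoring, refit='r2', cv=5)
    search.fit(X, y)
    print(search.best_params_)
    # per-k metrics for every scorer are all in pd.DataFrame(search.cv_results_)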

GridSearchCV not performing well on ML models

    from sklearn.model_selection import GridSearchCV

    svm2 = SVC()
    grid = {
        'C': [0.1, 1, 10, 100, 1000],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'gamma': [1, 0.1, 0.01, 0.001, 0.0001]
    }
    svm_grid = GridSearchCV(estimator=svm2, param_grid=grid, cv=3, n_jobs=-1)
    svm_grid.fit(xtrain, ytrain)
    svm_grid.best_params_

OUTPUT

    {'C': 1, 'gamma': 1, 'kernel': 'rbf'}

CODE

    svm_grid.score(xtrain, ytrain)
    0.9884434814012278
    svm_grid.score(xtest, ytest)
    0.8513708513708513

My question is: even after performing GridSearch, why is the model still overfitting, and how can I further increase the accuracy and combat overfitting? I am facing the same issues with RandomForest in GridSearch:

    grid = {
        'n_estimators': [10, 20, 40, …
Category: Data Science
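
One thing worth trying before widening the grid is putting a scaler in front of the SVC, since the RBF kernel is very sensitive to feature scale and unscaled inputs often show exactly this train/test gap; a minimal sketch with a toy dataset standing in for the question's data, reusing the same kind of grid:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

    # scaling goes inside the pipeline so it is refit on each CV training fold
    pipe = make_pipeline(StandardScaler(), SVC())
    grid = {'svc__C': [0.01, 0.1, 1, 10, 100],
            'svc__gamma': ['scale', 0.001, 0.0001]}

    svm_grid = GridSearchCV(pipe, grid, cv=5, n_jobs=-1)
    svm_grid.fit(xtrain, ytrain)
    print(svm_grid.best_params_)
    print(svm_grid.score(xtrain, ytrain), svm_grid.score(xtest, ytest))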
