Here is the list of hyperparameters that I used: params = { 'scale_pos_weight': [1.0], 'eta': [0.05, 0.1, 0.15, 0.9, 1.0], 'max_depth': [1, 2, 6, 10, 15, 20], 'gamma': [0.0, 0.4, 0.5, 0.7] } The dataset is imbalanced, so I used the scale_pos_weight parameter. After 5-fold cross-validation, the F1 score I got is 0.530726530426833.
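For reference, a hedged sketch of how a grid like this could be wired up end to end with GridSearchCV and 5-fold CV scored on F1 (X and y are placeholders for the data; learning_rate is the sklearn-API name for eta):

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# X, y are assumed to already exist (features and binary labels).
params = {
    'scale_pos_weight': [1.0],            # often set near n_negative / n_positive for imbalance
    'learning_rate': [0.05, 0.1, 0.15, 0.9, 1.0],   # sklearn-API alias for eta
    'max_depth': [1, 2, 6, 10, 15, 20],
    'gamma': [0.0, 0.4, 0.5, 0.7],
}

search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=params,
    scoring='f1',                         # score each candidate on F1
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```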
I'm fairly new to machine learning. I'm aware of the concept of hyperparameter tuning for classifiers, and I've come across a couple of examples of this technique. However, I'm trying to use sklearn's naive Bayes classifier for a task, and I'm not sure which parameter values I should try. What I want is something like this, but for the GaussianNB() classifier instead of an SVM: from sklearn.model_selection import GridSearchCV C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1] gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0] kernel=['rbf','linear'] hyper={'kernel':kernel,'C':C,'gamma':gamma} gd=GridSearchCV(estimator=svm.SVC(),param_grid=hyper,verbose=True) gd.fit(X,Y) print(gd.best_score_) print(gd.best_estimator_) …
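GaussianNB exposes very few tunable parameters; var_smoothing is the usual candidate. A minimal sketch mirroring the SVM snippet above (X and Y are the same placeholder arrays):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# var_smoothing adds a fraction of the largest feature variance to all variances
# for numerical stability; a log-spaced grid is a common starting point.
hyper = {'var_smoothing': np.logspace(-11, 0, num=12)}

gd = GridSearchCV(estimator=GaussianNB(), param_grid=hyper, verbose=True)
gd.fit(X, Y)
print(gd.best_score_)
print(gd.best_estimator_)
```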
I am currently using transaction amount as a feature in an XGBoost classification model designed to identify fraudulent transactions. Furthermore, transaction amount is bounded for this problem between 0 and 500. Using transaction amount as a feature does improve target class separability. However, I can't help but wonder if there is a better way to use this variable. To explain, I care more about getting the high transaction amount values correct than I do the low ones. However, the model …
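One hedged option (a modelling choice, not the only answer) is to leave the feature as-is but up-weight high-amount rows through sample_weight, so errors on expensive transactions cost more during training. A sketch assuming hypothetical X_train/y_train and an 'amount' column:

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical training data; X_train contains an 'amount' column bounded in [0, 500].
amount = X_train['amount'].to_numpy()

# Simple linear scaling so a 500-unit transaction weighs twice as much as a 0-unit one;
# the exact scaling scheme is an assumption to be tuned or replaced.
weights = 1.0 + amount / 500.0

model = XGBClassifier()
model.fit(X_train, y_train, sample_weight=weights)
```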
I have a data set labelled with binary classes. I calculated the principal components from the data and then applied the PC transformation. The goal is to find an optimal number of PCs so that the binary classification accuracy is good enough. I trained a binary classifier, sklearn.linear_model.LogisticRegressionCV (default parameters), on the PC-transformed data, with the number of PCs as the (hyper-)parameter being varied. I cannot interpret the resulting accuracy vs. #PCs graph; why is it so strange? For …
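For reference, a sketch of how the accuracy vs. #PCs curve is often generated, with the PCA fitted inside each CV fold so the hold-out folds don't leak into the projection (X and y are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scores = []
n_components_grid = range(1, X.shape[1] + 1)
for k in n_components_grid:
    pipe = make_pipeline(StandardScaler(),
                         PCA(n_components=k),
                         LogisticRegression(max_iter=1000))
    scores.append(cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean())

best_k = list(n_components_grid)[int(np.argmax(scores))]
print(best_k, max(scores))
```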
Despite having done it a few times, I'm still slightly confused by the use of a validation set for hyperparameter tuning. As far as I can tell, I choose a model, train it on the training data, assess performance on the training data, then do hyperparameter tuning by assessing model performance on the validation data, then choose the best model and test it on the test data. In order to do this, I basically need to pick a model at random for the training data. …
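A minimal sketch of that workflow, assuming placeholder X, y, an arbitrary candidate grid, and a 60/20/20 split: train each candidate on the training set, select on the validation set, and report once on the test set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_score, best_model = -1.0, None
for n in [50, 100, 200]:                       # candidate hyperparameter values (illustrative)
    model = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))   # compare candidates on validation data
    if score > best_score:
        best_score, best_model = score, model

print(accuracy_score(y_test, best_model.predict(X_test)))  # final, unbiased estimate on test data
```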
I am working on hyper-tuning a random forest classifier with the following parameters in randomized search CV: # defining model Model = RandomForestClassifier(random_state=1) # Parameter grid to pass to RandomizedSearchCV param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1)} # Calling RandomizedSearchCV randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs=-1, scoring=metrics.make_scorer(metrics.recall_score)) # Fitting parameters in RandomizedSearchCV randomized_cv.fit(X_train, y_train) print("Best parameters are {} with CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_)) …
When performing any hyperparameter tuning, say random search for simplicity, where I want to search from a minimum to a maximum number of units/nodes per layer and a minimum to a maximum number of layers, are there rules to guide what counts as a "large enough" range for my search? Currently all I have is "that should be good enough/large enough, let's search in there". I could be searching a space that isn't large enough, or searching a space that's far too large …
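There is no hard rule, but for concreteness this is roughly what "bounding the space" looks like in KerasTuner; the ranges here are purely illustrative assumptions, meant to be widened or narrowed after a first pass:

```python
import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    model = keras.Sequential()
    # Illustrative bounds only: 1-5 layers, 32-512 units per layer in steps of 32.
    for i in range(hp.Int('num_layers', min_value=1, max_value=5)):
        model.add(keras.layers.Dense(
            units=hp.Int(f'units_{i}', min_value=32, max_value=512, step=32),
            activation='relu'))
    model.add(keras.layers.Dense(1))
    model.compile(optimizer='adam', loss='mse')
    return model

tuner = kt.RandomSearch(build_model, objective='val_loss', max_trials=50)
```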
There are quite a few libraries for hyperparameter optimization that are specific to Keras or other deep learning libraries, like Hyperas or Talos. My question is: what's the main benefit of using these libraries compared to, for example, sklearn.model_selection.GridSearchCV() or sklearn.model_selection.RandomizedSearchCV()?
I am struggling to understand why I am getting such high loss/val_loss values during training. I am training a regression network. I've normalized the input data to the range -1 to 1 and left the output data unaltered; its range is approximately -100 to 100. I chose to normalize the input so that I could use tanh as the activation function, since it outputs within this range. The neural network consists of 3 layers. print "Model definition!" …
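One thing worth trying in this setup (an assumption about the cause, not a diagnosis) is scaling the targets as well and keeping the output layer linear, so the network isn't asked to produce values around ±100 out of saturating activations. A sketch with placeholder X_train, y_train, X_test:

```python
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Hypothetical data: X_train already scaled to [-1, 1], y_train roughly in [-100, 100].
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))

model = keras.Sequential([
    keras.layers.Dense(64, activation='tanh', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(64, activation='tanh'),
    keras.layers.Dense(1, activation='linear'),      # linear output head for regression
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train_scaled, epochs=100, validation_split=0.2)

# Predictions are mapped back to the original target scale afterwards.
y_pred = y_scaler.inverse_transform(model.predict(X_test))
```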
I have a set of inputs, let's call them 'I', that can be fed through a complicated group of functions to produce/calculate a wide variety of outputs (let's call them 'O'). I want to find a subset of outputs (let's call it 'O-prime') within 'O' that contains sufficient information to form a basis for finding/reconstructing a point in 'I'-space accurately. In other words, I want to pick 'O-prime' such that I am able to uniquely identify any …
When I set up my neural networks, I really have very little idea in advance of what I'm doing. It may just be a bit of educated guesswork, such as "it may need a few layers only" or "this activation function could be useful for this type of problem". This kind of thinking can be quite useful, but it could also lead me astray by developing a loose framework that isn't quite suitable. I may be thinking of doing a hyper …
I would like to use the (Keras/TensorFlow) Hyperband tuning algorithm rather than Keras random search, for instance, when testing hyperparameters. With random search I can set max trials and get a rough guess of how long it will run (max_trials*epochs, probably with an order of magnitude of uncertainty). With Hyperband I don't know how long it will take, or whether I'm setting up a search that's going to be really limited. Is there a way to make sense …
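For reference, KerasTuner's Hyperband budget is controlled by max_epochs, factor and hyperband_iterations rather than by max_trials; a sketch (build_model is assumed to be defined as for RandomSearch):

```python
import keras_tuner as kt

tuner = kt.Hyperband(
    build_model,
    objective='val_loss',
    max_epochs=100,           # the most epochs any single trial can receive
    factor=3,                 # reduction factor between successive halving brackets
    hyperband_iterations=1,   # how many times the full bracket schedule is repeated
)
# tuner.search(...) then runs the bracketed schedule; the total budget scales with
# max_epochs, factor and hyperband_iterations rather than a max_trials setting.
```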
I am trying to predict a time series based on 150 features. When I plot the correlation of these features, I get about 20 features with more or less importance, but every model I use is completely dominated by only one feature, which is completely in sync with the predicted output but not the actual output. Please refer to the image below. The green line is the prediction, which is completely in sync with one of the features. And for every valley in the actual …
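One way to quantify how heavily the model leans on that single feature, rather than eyeballing the plot, is permutation importance; a sketch assuming a fitted model and a hypothetical hold-out split X_val, y_val:

```python
from sklearn.inspection import permutation_importance

# model, X_val, y_val are placeholders: any fitted regressor and a hold-out set.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

# Print the ten features whose shuffling hurts the score the most.
for idx in result.importances_mean.argsort()[::-1][:10]:
    print(X_val.columns[idx], result.importances_mean[idx])
```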
If anyone is around to answer these, that would be great. I'm in the midst of a final year project on LSTMs. Currently, I'm stuck and confused over the LSTM code. There are 4 hyperparameters that I can play around with: look-back, batch size, LSTM units, and number of epochs. Can you explain what will happen to my results if I tune each of these hyperparameters? Also, is it common to get different results each time we run the code?
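For orientation, a sketch of where each of those four knobs enters a minimal Keras LSTM, with the random seeds fixed (unfixed seeds are the usual reason runs differ); shapes and values here are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

look_back, batch_size, lstm_units, n_epochs = 10, 32, 50, 100   # the four hyperparameters
n_features = 1                                                   # placeholder feature count

# Fixing the seeds makes repeated runs comparable; otherwise random weight
# initialisation and shuffling give slightly different results each time.
np.random.seed(0)
tf.random.set_seed(0)

# X is assumed to be windowed into shape (samples, look_back, n_features), y into (samples,).
model = keras.Sequential([
    keras.layers.LSTM(lstm_units, input_shape=(look_back, n_features)),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=n_epochs, batch_size=batch_size)
```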
If I have a model, say: def build_model(self, hp): model = Sequential() model.add(Dense(hp.Choice('units', [12,16,20,24]), hp.Choice("activation", ["elu", "exponential", "gelu", "hard_sigmoid", "linear", "relu", "selu", "sigmoid", "softmax", "softplus", "softsign", "swish", "tanh"]))) model.add(Dense(4, hp.Choice("activation", ["elu", "exponential", "gelu", "hard_sigmoid", "linear", "relu", "selu", "sigmoid", "softmax", "softplus", "softsign", "swish", "tanh"]))) optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5) model.compile(loss='mse', optimizer=optimizer, metrics=['mse']) return model and I want to span the space where the activation functions change on each layer, I believe that hp.Choice will choose only one activation function for the whole model each time …
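A sketch of one way to let every layer draw its own activation: give each hp.Choice a distinct name (written here as a standalone build function; the activation list is the one from the snippet above):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

ACTIVATIONS = ["elu", "exponential", "gelu", "hard_sigmoid", "linear", "relu", "selu",
               "sigmoid", "softmax", "softplus", "softsign", "swish", "tanh"]

def build_model(hp):
    model = Sequential()
    # Distinct names ('activation_0', 'activation_1') give each layer its own hyperparameter;
    # reusing the single name "activation" ties every layer to the same sampled value.
    model.add(Dense(hp.Choice('units', [12, 16, 20, 24]),
                    activation=hp.Choice('activation_0', ACTIVATIONS)))
    model.add(Dense(4, activation=hp.Choice('activation_1', ACTIVATIONS)))
    model.compile(loss='mse',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
                  metrics=['mse'])
    return model
```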
Is it correct to add dropout to each layer, and is it done as in the example below? class MyHyperModel(kt.HyperModel): def build_model(self, hp): model = Sequential() for i in range(hp.Int('dense_layers',1,4)): model.add(Dense(hp.Choice('units', choice_units), hp.Choice("activation", ["elu", "exponential", "relu"]))) **model.add(layers.Dropout(hp.Choice('rate',[0.0,0.05,0.10,0.15,0.25])))** model.add(Dense(1, hp.Choice("activation", ["elu", "relu"]))) optimizer=tf.keras.optimizers.SGD(hp.Float('learning_rate',min_value=1e-6, max_value=1e-3,default=1e-5)) model.compile(loss='mse', optimizer=optimizer, metrics=['mse']) return model I.e., by adding model.add(layers.Dropout(hp.Choice('rate',[0.0,0.05,0.10,0.15,0.25]))) after each Dense layer, it will add dropout after each new Dense layer. Is this true? And if I wanted to vary the choice of dropout layer …
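Along the same lines, a sketch of a per-layer dropout rate: indexing the hp.Choice names by layer lets the tuner pick a different rate for each Dropout, whereas a single shared name ties them together (choice_units is a placeholder for the list from the snippet above):

```python
import tensorflow as tf
from tensorflow.keras import Sequential, layers
from tensorflow.keras.layers import Dense

choice_units = [12, 16, 20, 24]   # placeholder for the question's choice_units

def build_model(hp):
    model = Sequential()
    for i in range(hp.Int('dense_layers', 1, 4)):
        model.add(Dense(hp.Choice(f'units_{i}', choice_units),
                        activation=hp.Choice(f'activation_{i}', ["elu", "exponential", "relu"])))
        # Indexed name: every Dense layer gets its own tunable dropout rate.
        model.add(layers.Dropout(hp.Choice(f'rate_{i}', [0.0, 0.05, 0.10, 0.15, 0.25])))
    model.add(Dense(1, activation=hp.Choice('activation_out', ["elu", "relu"])))
    model.compile(loss='mse',
                  optimizer=tf.keras.optimizers.SGD(
                      hp.Float('learning_rate', min_value=1e-6, max_value=1e-3, default=1e-5)),
                  metrics=['mse'])
    return model
```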
Rather than a hyperparameter optimisation with, say, kt.tuners.RandomSearch that runs one whole experiment (option A) of X model trials (e.g. 100) with Y epochs each (say 100, so a total of 10,000 epochs across all models), where Y would be 'enough epochs per experiment to give good estimates for each model', would it be more appropriate to split the experiment into two parts (option B): run X*5 model trials (200) with Y/10 epochs each (say 25). (Thus we scan many …
When performing HO, should I train each model (each with different hyperparameter values, e.g. with RandomSearch picking those values) on the training data and then pick the best one based on its training performance? Or should I choose between them based on their performance on the validation set?
Question 1: In the example of logistic regression, I often see the regularization constant and penalty methods being tuned by a grid search. However, it seems like there are a lot more options for tuning: classifier_os.get_params() gives: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, ... and many more! So my question is: Are these other parameters typically not worth tuning, or are they left out in examples for another reason? For example, I changed to solver='liblinear' and got sub-par …
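Most of the remaining get_params() entries (fit_intercept, tol, dual, ...) are rarely swept; when examples do go beyond C and penalty, they usually tune solver/penalty combinations, which have compatibility constraints. A hedged sketch using a list of sub-grids so only valid combinations are tried (X, y are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# A list of sub-grids keeps solver/penalty pairs valid: liblinear supports l1/l2,
# lbfgs supports l2 only, and saga is needed for elasticnet (which adds l1_ratio).
param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10]},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.01, 0.1, 1, 10]},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.2, 0.5, 0.8],
     'C': [0.01, 0.1, 1, 10]},
]

search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```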
I've been experimenting with transformer networks like BERT for some simple classification tasks. My tasks are binary classification, the datasets are relatively balanced, and the corpus consists of abstracts from PubMed. The median number of tokens after pre-processing is about 350, but I'm finding a strange result as I vary the sequence length. While using too few tokens hampers BERT in a predictable way, BERT doesn't do better with more tokens. It looks like the optimal number of tokens is about …
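For context, the sequence-length knob is normally applied at tokenisation time; a sketch with the Hugging Face tokenizer, where the model name, the max_length value and the abstracts list are all illustrative assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# max_length is the hyperparameter being varied: longer abstracts are truncated,
# shorter ones padded up to the chosen length.
encoded = tokenizer(
    abstracts,              # placeholder: list of PubMed abstract strings
    max_length=256,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
```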