I was trying to figure out if the test set from a competition is similar to the train set. This was in an NLP competition in which I had two columns, tweet and type, and I needed to predict the type of crime the tweet was reporting. So I decided to check whether the train set is too different from the test set. This is what I've done so far: # drop the target column from the training data …
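For reference, a minimal adversarial-validation sketch of what such a check could look like (my own illustration, not the asker's code; train_df, test_df and the column names are assumed):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Drop the target column and label each row by its origin: 0 = train, 1 = test
    combined = pd.concat([
        train_df.drop(columns=["type"]).assign(is_test=0),
        test_df.assign(is_test=1),
    ], ignore_index=True)

    X = TfidfVectorizer(max_features=5000).fit_transform(combined["tweet"])
    y = combined["is_test"]

    # If a classifier cannot tell the two sets apart, the ROC AUC stays near 0.5,
    # which suggests train and test tweets come from a similar distribution.
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc").mean()
    print(f"adversarial validation AUC: {auc:.3f}")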
I have been training my CNN with 200 images per class for a classification problem. The problem is a binary classification one. And with the amount of test data (25 per class), I am getting good accuracy, precision and recall values. Does that mean my model is actually good?
I am new to making recommendation systems. I am using the surpriselib library to evaluate my recommendations. All the accuracy metrics are well supported in this library. But I also want to compute the hit rate of my top-n recommender system. I know the formula for hit rate is: (no. of items users have already purchased) / (no. of users). But this does not make sense to me, because to train and test the user vs item ratings I have only …
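As context, a rough leave-one-out hit-rate sketch with the surprise library (my own outline; SVD, the variable `data` and the helper are assumptions, not the asker's code): each user has exactly one held-out rating, and a hit is counted when that item appears in the user's top-n.

    from collections import defaultdict
    from surprise import SVD
    from surprise.model_selection import LeaveOneOut

    def top_n(predictions, n=10):
        # Collect the n highest-estimated items per user
        recs = defaultdict(list)
        for uid, iid, _, est, _ in predictions:
            recs[uid].append((iid, est))
        for uid in recs:
            recs[uid] = [iid for iid, _ in
                         sorted(recs[uid], key=lambda x: x[1], reverse=True)[:n]]
        return recs

    # Hold out exactly one rating per user
    loo = LeaveOneOut(n_splits=1, random_state=42)
    trainset, left_out = next(loo.split(data))

    algo = SVD()
    algo.fit(trainset)

    # Rank every item the user has not rated in the trainset (can be large)
    recs = top_n(algo.test(trainset.build_anti_testset()), n=10)

    # Hit rate = fraction of users whose held-out item shows up in their top-n
    hits = sum(1 for uid, iid, _ in left_out if iid in recs.get(uid, []))
    print("hit rate:", hits / len(left_out))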
I want to perform k-fold cross-validation for a setting where I have a training dataset consisting of a sequential time series that is fully benign, and a test dataset (also a sequential time series) which contains labeled anomalies. I already took a look at this post, but as my data is sequential, the answer doesn't work out. I am especially stuck on the fact that for k-fold cross-validation, you use (k-1)/k parts of your data for training and 1/k parts …
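For illustration, a minimal sketch of how sklearn's TimeSeriesSplit keeps the temporal order, as one standard alternative to plain k-fold (the toy array is a stand-in for the benign series, not the asker's data):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(20).reshape(-1, 1)   # stand-in for the benign sequential series

    # Each split trains only on earlier observations and validates on the block
    # that immediately follows, so the temporal order is never broken.
    for fold, (train_idx, val_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
        print(f"fold {fold}: train {train_idx.min()}-{train_idx.max()}, "
              f"validate {val_idx.min()}-{val_idx.max()}")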
I am wondering whether I can perform any kind of cross-validation or GridSearchCV for unsupervised learning. The thing is that I do have the ground-truth labels (but since the approach is unsupervised, I just drop them for training and then reuse them for measuring accuracy, AUC, AUCPR and F1-score over the test set). Is there any way to do this?
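One hedged way to set this up (a sketch under my own assumptions: IsolationForest and the F1 metric are placeholders, and X, y are NumPy arrays with 0/1 labels): fit the unsupervised model on each training fold without labels, and use the held-back labels only to score the validation fold.

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import f1_score
    from sklearn.model_selection import KFold, ParameterGrid

    param_grid = {"n_estimators": [100, 200], "contamination": [0.05, 0.1]}

    best_score, best_params = -np.inf, None
    for params in ParameterGrid(param_grid):
        fold_scores = []
        for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
            model = IsolationForest(random_state=0, **params).fit(X[train_idx])  # no labels used
            pred = (model.predict(X[val_idx]) == -1).astype(int)  # -1 means anomaly -> positive class
            fold_scores.append(f1_score(y[val_idx], pred))        # labels used only for scoring
        if np.mean(fold_scores) > best_score:
            best_score, best_params = np.mean(fold_scores), params

    print(best_params, best_score)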
I'm testing the Multinomial NB and Bernoulli NB on my dataset and I'm using the cross-validation score to better understand which of the two algorithms works better. This is the first classifier:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score  # needed for cross_val_score below

    clf_multinomial = MultinomialNB()
    clf_multinomial.fit(X_train, y_train)
    y_predicted = clf_multinomial.predict(X_test)
    score = clf_multinomial.score(X_test, y_test)
    scores = cross_val_score(clf_multinomial, X_train, y_train, cv=5)
    print(scores)
    print(score)

And these are the scores:

    [0.75 0.875 0.66666667 0.95833333 0.86956522]
    0.8637666498061035

This is the second classifier:

    from sklearn.naive_bayes import BernoulliNB
    clf_multivariate = BernoulliNB() …
Suppose I'm training a linear regression model using k-fold cross-validation. I'm training K times, each time with a different training and test data set. So each time I train, I get different parameters (feature coefficients in the linear regression case), and I will have K sets of parameters at the end of cross-validation. How do I arrive at the final parameters for my model? If I'm using it to tune hyperparameters as well, do I have to do another cross-validation after fixing …
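A short sketch of the usual convention, as I read the question (placeholder names, not the asker's code): the K fits only estimate generalisation error, and the coefficients you keep come from one final refit on the whole training set.

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    model = LinearRegression()

    # The K fits inside cross_val_score are only used to estimate generalisation error
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    print("estimated R^2:", cv_scores.mean())

    # The coefficients you deploy come from one final fit on the whole training set
    final_model = model.fit(X_train, y_train)
    print(final_model.coef_)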
So to perform my feature selection I ran cross-validation over and over again, each time trying different subsets of my attributes, and repeated this until I got the best cross-validation score I could get. Is this alright to do, or am I creating a major bias? I suspect that this could cause a bias and possibly result in data leakage, because I am probably learning something about my test set by doing this, but how bad of a …
I know a similar post was made here, but I wanted to ask some follow-up questions. I am conducting a cross-validation search to find values of a set of hyper-parameters and need to normalise the data. If we split up the data as follows:
1. 'Training' (call this set 'A' for now) and testing data
2. Split the 'training' set into training (call this set 'B' for now) and validation sets
what parameters should be used when normalising the datasets? Am I …
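One hedged way to avoid leakage here (a sketch with placeholder names and an arbitrary estimator): put the scaler inside a Pipeline, so that for every CV split it is fit only on that split's training part (set B) and merely applied to the validation part; after the search, the scaler refit on all of A is reused unchanged on the test set.

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
    search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
    search.fit(X_A, y_A)          # set A: everything except the held-out test set

    # The best pipeline is refit on all of A; its scaler's mean/std from A are
    # then applied as-is to the test set when scoring.
    print(search.score(X_test, y_test))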
Given a database, I split the data into train and test sets. I want to use a decision-tree classifier (sklearn) for a binary classification problem. Considering that I have already found the best parameters for my model, if I run the model on the test set I obtain a different result at each run (with the same hyper-parameters). Why is that? Considering I am using the accuracy score as my metric, I see variations from 0.5 to 0.8. Which result should I take as correct, …
I have a severely skewed data set consisting of 20-something classes, where the smallest class contains on the order of 1,000 samples and the largest several million. Regarding the validation data, I understand that I should make sure it represents a class ratio similar to the one in my original raw data. Hence, I shouldn't do any under- or over-sampling on the validation data, but I can do it on the training data. Because I have such …
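A minimal sketch of that idea (placeholder names; RandomUnderSampler stands in for whichever resampler is actually used): the validation split keeps the original class ratios via stratify, and only the training portion is resampled afterwards.

    from sklearn.model_selection import train_test_split
    from imblearn.under_sampling import RandomUnderSampler

    # Stratified split: the validation set keeps the original class ratios
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Resample only the training portion; the validation data stays untouched
    X_train_res, y_train_res = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)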
There are quite a few questions regarding the optimisation of the binary threshold in a classification problem. However, I haven't found a single end-to-end solution to this problem. In an existing project, I have come up with the following pipeline to train a binary classifier:
1. Outer CV due to small to moderate data size
2. Inner CV to tune hyperparameters
3. Train model with tuned hyperparameters on the outer-CV train set
4. Predict on the outer-CV test set
5. Find optimal threshold using prediction probabilities
6. Get score converting prediction …
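A rough sketch of those steps as listed (hypothetical names; X and y are assumed to be NumPy arrays with 0/1 labels, and RandomForestClassifier and F1 are stand-ins), with the threshold chosen from the probabilities on each outer test fold as described:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    outer_scores = []

    for train_idx, test_idx in outer.split(X, y):
        # Inner CV: tune hyperparameters on the outer training fold only
        inner = GridSearchCV(RandomForestClassifier(random_state=0),
                             {"max_depth": [3, 5, None]}, cv=3, scoring="roc_auc")
        inner.fit(X[train_idx], y[train_idx])
        best = inner.best_estimator_          # refit on the whole outer training fold

        # Predict probabilities on the outer test fold
        proba = best.predict_proba(X[test_idx])[:, 1]

        # Find the threshold that maximises the score on those probabilities,
        # then convert probabilities to labels at that threshold and record the score
        thresholds = np.linspace(0.05, 0.95, 19)
        best_t = max(thresholds, key=lambda t: f1_score(y[test_idx], proba >= t))
        outer_scores.append(f1_score(y[test_idx], proba >= best_t))

    print("outer-CV F1:", np.mean(outer_scores))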
I'm working on developing a model with a highly imbalanced dataset (0.7% minority class). To remedy the imbalance, I was going to oversample using algorithms from the imbalanced-learn library. I had a workflow in mind which I wanted to share and get an opinion on whether I'm heading in the right direction or maybe missed something.
1. Split Train/Test/Val
2. Set up a pipeline for GridSearch and optimize hyper-parameters (the pipeline will only oversample the training folds)
3. The scoring metric will be AUC, as the training set is …
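A minimal sketch of step 2 with imbalanced-learn's Pipeline (placeholder names; SMOTE and logistic regression are just stand-ins): samplers in an imblearn Pipeline are applied only at fit time, so inside GridSearchCV only the training folds get oversampled, never the validation folds or the held-out test set.

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("clf", LogisticRegression(max_iter=1000))])

    search = GridSearchCV(pipe,
                          param_grid={"clf__C": [0.01, 0.1, 1, 10]},
                          scoring="roc_auc", cv=5)
    search.fit(X_train, y_train)     # only the CV training folds are oversampled

    print(search.best_params_, search.best_score_)
    print("held-out AUC:", search.score(X_test, y_test))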
I am doing hyperparameter tuning + cross-validation and I constantly get that the optimal leaf size is 1. Should I worry? Is this a sign of overfitting?
I am doing a few experiments on medical data. I am about to apply transfer learning with pretrained networks to my problem. Firstly, I have to pick a network architecture. Secondly, I would like to optimize its parameters/the parameters of the optimizer to get better performance. I would like to pick the network architecture based on 10-fold cross-validation of several architectures. I will perform cross-validation in such a way that I have the data split into train:test in an 80:20 manner, then train …
My training accuracy was better than my test accuracy, hence I thought my model was over-fitted and tried cross-validation. The model degraded further. Does that mean my input data needs to be sanitised further and be of better quality? Please share your thoughts on what could be going wrong here.

My data distribution:

Code snippets... My function get_score:

    from sklearn.metrics import accuracy_score  # missing import needed by get_score and the prints below

    def get_score(model, X_train, X_test, y_train, y_test):
        model.fit(X_train, y_train.values.ravel())
        pred0 = model.predict(X_test)
        return accuracy_score(y_test, pred0)

Logic:

    print('*TRAIN* Accuracy Score => ' + str(accuracy_score(y_train, m.predict(X_train))))  # LinearSVC() used
    print('*TEST* …
I use a Data Sampler widget to split a dataset (train and test) with the cross-validation selection. I wonder how it works, because some points did not seem clear to me. Question 1: As seen in the figure, I split the data into five subsets (each has 20 observations). Then, I selected one of the subsets (the remaining data) to test the models, which means the other four subsets are used for training. At this point, while the algorithms build a model, are …
I have numerical music data (2282 rows × 173 columns) to predict the target: sad, happy, angry or relaxed. Now, one of the attributes is segment_id, and I want to group the data according to segment_id and apply stratified CV. How can I do it? I have 26 segments, and each segment appears at least 50 times in the data set. I have no idea where to start. Could someone give me some hints? If you need further …
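One way to read this (a sketch with assumed names: df, a "target" column and the "segment_id" column; StratifiedGroupKFold needs scikit-learn >= 1.0): keep all rows of a segment in the same fold via groups, while keeping the class balance of sad/happy/angry/relaxed across folds via stratification.

    from sklearn.model_selection import StratifiedGroupKFold

    X = df.drop(columns=["target", "segment_id"])
    y = df["target"]
    groups = df["segment_id"]

    # Rows sharing a segment_id never end up in both train and validation,
    # and the target distribution is kept roughly equal across folds.
    cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y, groups)):
        print(f"fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")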
I have a highly imbalanced dataset (±5% positive instances) for which I am training binary classifiers. I am using nested 5-fold cross-validation with grid search for hyperparameter tuning. I want to avoid undersampling, so I have been looking into the class_weight hyperparameter. For sklearn's DecisionTreeClassifier, this works really well and is easily given as a hyperparameter. However, this is not an option for sklearn's neural network (multi-layer perceptron), as far as I can tell. I have been using …
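For context, a small sketch of the part described as working well: treating class_weight as a hyperparameter of the decision tree inside the inner grid search (the grid values, scoring metric and variable names here are illustrative, not the asker's).

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    param_grid = {
        "max_depth": [3, 5, 10, None],
        # class_weight searched alongside the other hyperparameters
        "class_weight": [None, "balanced", {0: 1, 1: 5}, {0: 1, 1: 19}],
    }

    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                          scoring="average_precision", cv=inner_cv)
    search.fit(X_train, y_train)
    print(search.best_params_)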
I have doubts about the differences between these three methods and I would like to clarify the following:
- Main differences
- Advantages of one over the other
- Context of use of each method
- etc.
If anyone could help me, I would appreciate it.