What to do if your adversarial validation shows different distributions for an NLP problem?

I was trying to figure out if the test set from a competition is similar to the train set. This was done in an NLP competition, in which I had two columns, tweet and type, and I needed to predict the type of crime the tweet was reporting. So I decided to check whether the train set is too different from the test set. This is what I've done so far: # drop the target column from the training data …
Category: Data Science
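A minimal sketch of the adversarial-validation idea for text data, assuming two DataFrames `train` and `test` with a column named `tweet` (names follow the question; the vectorizer and classifier are illustrative choices): label each row by its origin and check whether a classifier can tell the two sets apart. A cross-validated AUC near 0.5 suggests similar distributions; an AUC near 1.0 means they differ.

```python
# Adversarial-validation sketch (assumes DataFrames `train` and `test`
# with a text column named `tweet`; names are illustrative).
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

text = pd.concat([train["tweet"], test["tweet"]], ignore_index=True)
is_test = np.r_[np.zeros(len(train)), np.ones(len(test))]  # 0 = train, 1 = test

X = TfidfVectorizer(max_features=20000).fit_transform(text)
auc = cross_val_score(LogisticRegression(max_iter=1000), X, is_test,
                      cv=5, scoring="roc_auc").mean()
print(auc)  # ~0.5: similar distributions; close to 1.0: train and test differ
```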

Do high accuracy metrics with a small (but equally sampled) dataset mean a good model?

I have been training my CNN with 200 images per class for a classification problem. The problem is a binary classification one. With the amount of test data I have (25 per class), I am getting good accuracy, precision and recall values. Does that mean my model is actually good?
Category: Data Science
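With only 25 test images per class (50 in total), the point estimate of accuracy carries a lot of uncertainty. One rough way to see this is a binomial confidence interval on the observed accuracy; a sketch assuming statsmodels is installed and, purely for illustration, that 45 of the 50 test images were classified correctly:

```python
# Illustrative only: 45 correct out of 50 test images (numbers are made up).
from statsmodels.stats.proportion import proportion_confint

correct, total = 45, 50
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"Observed accuracy: {correct / total:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
# With n=50 the interval is wide (roughly 0.79 to 0.96), so good-looking metrics
# on such a small test set don't by themselves guarantee the model generalises.
```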

How do I perform Leave-One-Out Cross-Validation for Top-n Recommendation Systems?

I am new to building recommendation systems. I am using the surpriselib library to evaluate my recommendations. All the accuracy metrics are well supported in this library, but I also want to compute the hit rate of my top-n recommender system. I know the formula for hit rate is (number of items users have already purchased) / (number of users), but this does not make sense to me because to train and test the user vs item ratings I have only …
Category: Data Science
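A hedged sketch of the usual leave-one-out hit-rate recipe with the surprise library, assuming a loaded Dataset named `data` (the SVD algorithm and n=10 are illustrative): hold out one rated item per user, build each user's top-n list from items they have not rated in training, and count how often the held-out item appears in that list.

```python
# Leave-one-out hit rate with surprise (assumes a loaded Dataset `data`;
# the SVD algorithm and n=10 are illustrative choices).
from collections import defaultdict
from surprise import SVD
from surprise.model_selection import LeaveOneOut

loo = LeaveOneOut(n_splits=1, random_state=1)
for trainset, testset in loo.split(data):
    algo = SVD().fit(trainset)
    # Predict ratings for all user/item pairs NOT present in the training set.
    candidates = algo.test(trainset.build_anti_testset())

    top_n = defaultdict(list)
    for uid, iid, _, est, _ in candidates:
        top_n[uid].append((iid, est))
    for uid in top_n:
        top_n[uid] = [iid for iid, _ in sorted(top_n[uid], key=lambda x: -x[1])[:10]]

    # A "hit" means the single held-out item made it into the user's top-10.
    hits = sum(1 for uid, iid, _ in testset if iid in top_n.get(uid, []))
    print("Hit rate:", hits / len(testset))
```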

Cross-validation for anomaly detection on time series data

I want to perform k-fold cross-validation for the setting where I have a training dataset consisting of a sequential time series that is fully benign and a test dataset (also a sequential time series) which contains labeled anomalies. I already took a look at this post, but as my data is sequential, the answer doesn't work out. I am especially stuck on the fact that for k-fold cross-validation, you use (k-1)/k parts of your data for training and 1/k parts …
Category: Data Science
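For sequential data, the usual replacement for shuffled k-fold is forward chaining: each fold trains on the past and validates on the window that follows, so no future information leaks backwards. A minimal sketch with sklearn's TimeSeriesSplit (the toy array stands in for the benign training series):

```python
# Forward-chaining splits: every fold trains only on data that comes earlier in time.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in for the benign training series
for fold, (train_idx, val_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    print(f"fold {fold}: train indices 0..{train_idx.max()}, "
          f"validate indices {val_idx.min()}..{val_idx.max()}")
```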

Cross-Validation for Unsupervised Anomaly Detection with Isolation Forest

I am wondering whether I can perform any kind of cross-validation or GridSearchCV for unsupervised learning. The thing is that I do have the ground-truth labels (but since the task is unsupervised I just drop them for training and then reuse them for measuring accuracy, AUC, AUCPR, and F1-score on the test set). Is there any way to do this?
Category: Data Science
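One common pattern is to keep the labels out of fitting entirely and use them only for scoring: loop over a ParameterGrid, fit IsolationForest on the features alone, and evaluate each setting against the held-back ground truth. A sketch under the assumption that X is the feature matrix and y marks anomalies with 1 (the parameter grid is illustrative):

```python
# Unsupervised fit, supervised evaluation: labels are used only for scoring.
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.metrics import roc_auc_score

# X, y assumed: y == 1 marks an anomaly and is never shown to the model.
X_train, X_test, _, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = ParameterGrid({"n_estimators": [100, 300], "max_samples": [0.5, 1.0]})
for params in grid:
    iso = IsolationForest(random_state=0, **params).fit(X_train)
    # score_samples: higher means "more normal", so negate it as an anomaly score.
    auc = roc_auc_score(y_test, -iso.score_samples(X_test))
    print(params, f"AUC={auc:.3f}")
```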

Compare cross validation values of Bernoulli NB and Multinomial NB

I'm testing Multinomial NB and Bernoulli NB on my dataset and I'm using the cross-validation score to better understand which of the two algorithms works better. This is the first classifier:

    from sklearn.naive_bayes import MultinomialNB
    clf_multinomial = MultinomialNB()
    clf_multinomial.fit(X_train, y_train)
    y_predicted = clf_multinomial.predict(X_test)
    score = clf_multinomial.score(X_test, y_test)
    scores = cross_val_score(clf_multinomial, X_train, y_train, cv=5)
    print(scores)
    print(score)

And these are the scores:

    [0.75 0.875 0.66666667 0.95833333 0.86956522]
    0.8637666498061035

This is the second classifier:

    from sklearn.naive_bayes import BernoulliNB
    clf_multivariate = BernoulliNB()
    …
Category: Data Science
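When comparing the two models it helps to evaluate both on identical CV folds and to look at the mean and spread rather than a single number. A sketch reusing X_train and y_train from the question (the fold setup is illustrative):

```python
# Compare MultinomialNB and BernoulliNB on identical cross-validation folds.
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for both
for clf in (MultinomialNB(), BernoulliNB()):
    scores = cross_val_score(clf, X_train, y_train, cv=cv)
    print(type(clf).__name__, f"{scores.mean():.3f} +/- {scores.std():.3f}")
```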

How are parameters selected in cross-validation?

Suppose I'm training a linear regression model using k-fold cross-validation. I'm training K times, each time with a different training and test data set. So each time I train, I get different parameters (feature coefficients in the linear regression case), and I will have K sets of parameters at the end of cross-validation. How do I arrive at the final parameters for my model? If I'm using it to tune hyperparameters as well, do I have to do another cross-validation after fixing …
Category: Data Science
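The usual convention is that cross-validation estimates how well the training procedure generalises, while the final coefficients come from refitting that procedure on all of the training data; GridSearchCV does this refit automatically. A sketch for a regularised linear model, with X_train and y_train assumed and the alpha grid purely illustrative:

```python
# CV estimates performance and picks hyper-parameters; the final model is
# refit on ALL training data with the chosen settings.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)          # refit=True retrains on all of X_train
final_model = search.best_estimator_  # one set of coefficients, not K of them
print(search.best_params_, final_model.coef_[:5])
```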

Using cross validation score to perform feature selection

So to perform my feature selection, I ran cross-validation over and over again, each time trying a different subset of my attributes, and repeated this until I got the best cross-validation score I could. Is this alright to do, or am I creating a major bias? I suspect that this could cause a bias and possibly result in data leakage, because I am probably learning something about my test set by doing this, but how bad of a …
Category: Data Science
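Repeatedly choosing the subset that maximises the CV score does leak information from the validation folds into the selection. The usual remedy is to put the selection step inside a pipeline so it is refit on each training fold, and to judge the whole procedure with an outer CV. A sketch with X and y assumed (the selector, classifier, and k grid are illustrative):

```python
# Feature selection happens inside each training fold, so the reported score
# is not inflated by peeking at the validation folds.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Grid values are illustrative; keep k no larger than the number of features.
inner = GridSearchCV(pipe, {"select__k": [5, 10, 20]}, cv=5)  # picks k per fold
outer_scores = cross_val_score(inner, X, y, cv=5)             # unbiased estimate
print(outer_scores.mean())
```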

What parameters to use when normalising training, validation, and testing data?

I know a similar post was made here, but I wanted to ask some follow-up questions. I am conducting a cross-validation search to find values for a set of hyper-parameters and need to normalise the data. If we split up the data as follows:

1. 'Training' data (call this set 'A' for now) and testing data
2. Split the 'training' data into training (call this set 'B' for now) and validation sets

what parameters should be used when normalising the datasets? Am I …
Category: Data Science
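The short rule is that scaling parameters (mean/standard deviation, min/max) are learned only from the data the model is trained on and merely applied to validation and test sets; wrapping the scaler in a Pipeline enforces this inside the hyper-parameter search. A sketch where X_trainval plays the role of set 'A' and X_test is the held-out test data (the estimator and grid are illustrative):

```python
# The scaler is fit on the training portion of each CV fold only; the
# validation fold and the final test set are merely transformed.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
search.fit(X_trainval, y_trainval)   # set 'A' in the question's notation
print(search.score(X_test, y_test))  # test set scaled with A's parameters
```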

Decision trees change their result at every run; how can I trust my results?

Given a database, I split the data into train and test sets. I want to use a decision-tree classifier (sklearn) for a binary classification problem. Considering I have already found the best parameters for my model, if I run the model on the test set I obtain a different result at each run (with the same hyper-parameters). Why is that? Considering I am using accuracy as the metric, I see variations from 0.5 to 0.8. Which result should I take as correct, …
Category: Data Science
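sklearn's DecisionTreeClassifier permutes the features at each split, so equally good splits can be chosen differently from run to run unless random_state is fixed; repeated cross-validation then shows how much the score genuinely varies instead of relying on one lucky or unlucky split. A sketch with X and y assumed:

```python
# Fix random_state for a reproducible tree, and use repeated CV to see the
# genuine spread of the score rather than trusting a single run.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

clf = DecisionTreeClassifier(random_state=42)  # same tree every run
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```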

Restrictions on my skewed validation data

I have a severely skewed data set consisting of twenty-something classes, where the smallest class contains on the order of 1000 samples and the largest several million. Regarding the validation data, I understand that I should make sure it represents a class ratio similar to the one in my original raw data. Hence, I shouldn't do any under- or over-sampling on the validation data, but I can do it on the training data. Because I have such …
Category: Data Science
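One way to respect that constraint is to carve out the validation set with a stratified split, so it keeps the natural class ratios, and to resample only the remaining training portion. A sketch using imbalanced-learn, with X and y assumed and random undersampling as an illustrative choice:

```python
# Validation keeps the raw class proportions; resampling touches training only.
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # val mirrors raw ratios

X_train_res, y_train_res = RandomUnderSampler(random_state=0).fit_resample(
    X_train, y_train)                                  # only training is resampled
```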

binary classification pipeline to select threshold

There are quite a few questions regarding the optimisation of the binary threshold in a classification problem. However, I haven't found a single end-to-end solution to this problem. In an existing project, I have come up with the following pipeline to train a binary classifier:

1. Outer CV, due to the small-to-moderate data size.
2. Inner CV to tune hyperparameters.
3. Train the model with the tuned hyperparameters on the outer-CV train set.
4. Predict on the outer-CV test set.
5. Find the optimal threshold using the prediction probabilities.
6. Get the score by converting the prediction …
Category: Data Science
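For the threshold step, a common choice is to derive out-of-fold probabilities on the training side, pick the threshold that maximises the metric of interest there, and then apply that fixed threshold to the outer test fold. A sketch of just that step, with `model` standing in for the estimator tuned by the inner CV and F1 as the illustrative metric (X_tr, y_tr, X_te, y_te assumed):

```python
# Pick the probability threshold that maximises F1 on out-of-fold predictions
# from the TRAINING side, then reuse it unchanged on the held-out test fold.
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import cross_val_predict

oof_proba = cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
prec, rec, thr = precision_recall_curve(y_tr, oof_proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best_thr = thr[np.argmax(f1[:-1])]  # last precision/recall point has no threshold

model.fit(X_tr, y_tr)
test_pred = (model.predict_proba(X_te)[:, 1] >= best_thr).astype(int)
print(best_thr, f1_score(y_te, test_pred))
```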

Optimizing decision threshold on model with oversampled/imbalanced data

I'm working on developing a model with a highly imbalanced dataset (0.7% minority class). To remedy the imbalance, I was going to oversample using algorithms from the imbalanced-learn library. I had a workflow in mind which I wanted to share and get an opinion on whether I'm heading in the right direction or have missed something:

1. Split train/test/val.
2. Set up a pipeline for GridSearch and optimize hyper-parameters (the pipeline will only oversample the training folds).
3. The scoring metric will be AUC, as the training set is …
Category: Data Science
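The key detail in that workflow is using imbalanced-learn's Pipeline, which applies the oversampler only when fitting, so the validation folds inside the grid search and the final test set are never oversampled. A sketch with X_train and y_train assumed (SMOTE and the estimator are illustrative choices):

```python
# imblearn's Pipeline applies SMOTE during fit only, so CV validation folds
# and the final test set are never oversampled.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
search = GridSearchCV(pipe, {"clf__n_estimators": [200, 500]},
                      scoring="roc_auc", cv=5)
search.fit(X_train, y_train)  # oversampling happens inside each training fold
print(search.best_params_, search.best_score_)
```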

Choose CNN architecture first, then optimize parameters - validation vs test performance to pick architecture?

I am doing a few experiments on medical data. I am about to apply transfer learning with pretrained networks to my problem. Firstly, I have to pick a network architecture. Secondly, I would like to optimize its parameters / the parameters of the optimizer to get better performance. I would like to pick the network architecture based on 10-fold cross-validation of several architectures. I will perform cross-validation in such a way that I split the data into train:test in an 80:20 manner, then train …
Category: Data Science

Why does CV yield a lower score?

My training accuracy was better than my test accuracy, hence I thought my model was over-fitted and tried cross-validation. The score degraded further. Does that mean my input data needs to be sanitised further and be of better quality? Please share your thoughts on what could be going wrong here. My data distribution: Code snippets…

My function get_score:

    def get_score(model, X_train, X_test, y_train, y_test):
        model.fit(X_train, y_train.values.ravel())
        pred0 = model.predict(X_test)
        return accuracy_score(y_test, pred0)

Logic:

    print('*TRAIN* Accuracy Score => '+str(accuracy_score(y_train, m.predict(X_train))))  # LinearSVC() used
    print('*TEST* …
Category: Data Science
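Before blaming data quality, it can help to look at the train/validation gap fold by fold; cross_validate with return_train_score=True reports both in one call. A sketch with X and y assumed, using LinearSVC since the question mentions it:

```python
# cross_validate with return_train_score=True shows the train/validation gap
# per fold, which separates genuine over-fitting from one unlucky split.
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

res = cross_validate(LinearSVC(max_iter=5000), X, y, cv=5,
                     scoring="accuracy", return_train_score=True)
print("train:", res["train_score"].mean(), "validation:", res["test_score"].mean())
```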

How does the cross-validation option of Orange's Data Sampler widget work?

I use the Data Sampler widget to split a dataset (train and test) with the cross-validation selection. I wonder how it works, because some points are not clear to me. Question 1: As seen in the figure, I split the data into five subsets (each has 20 observations). Then, I selected one of the subsets (the remaining data) to test the models, which means the other four subsets are used for training. At this point, while the algorithms build a model, are …
Category: Data Science

stratified segment-grouped k-fold cross-validation

I have numerical music data (2282 rows × 173 columns) and want to predict the target: sad, happy, angry, or relaxed. One of the attributes is segment_id, and I want to group the data according to segment_id and apply stratified CV. How can I do it? I have 26 segments, and each segment appears at least 50 times in the data set. I have no idea where to start; could someone give me some hints? If you need further …
Category: Data Science
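Recent scikit-learn versions (1.0 and later) ship StratifiedGroupKFold, which keeps every segment_id group in a single fold while keeping the class proportions roughly balanced. A sketch assuming a DataFrame df with the 173 feature columns, a segment_id column, and a target column here called emotion (that column name and the classifier are illustrative):

```python
# Groups (segment_id) never straddle folds; class proportions stay roughly balanced.
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# df assumed: feature columns plus 'segment_id' and the (hypothetical) 'emotion' target.
X = df.drop(columns=["segment_id", "emotion"])
y = df["emotion"]            # sad / happy / angry / relaxed
groups = df["segment_id"]

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, groups=groups)
print(scores.mean())
```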

Using keras with sklearn: apply class_weight with cross_val_score

I have a highly imbalanced dataset (±5% positive instances), for which I am training binary classifiers. I am using nested 5-fold cross-validation with grid search for hyperparameter tuning. I want to avoid undersampling, so I have been looking into the class_weight hyperparameter. For sklearn's decision tree classifier, this works really well and is easily given as a hyperparameter. However, this is not an option for sklearn's neural network (multi-layer perceptron) as far as I can tell. I have been using …
Category: Data Science
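One approach that works with the scikit-learn wrappers for Keras is to compute the weights with sklearn and pass them through cross_val_score's fit_params, which forwards them to the wrapper's fit and from there to model.fit. A sketch assuming the legacy tf.keras wrapper (replaced by scikeras in newer TensorFlow releases), with X and y assumed and build_model a placeholder for the function that builds and compiles the network:

```python
# Balanced class weights computed once, forwarded to Keras' fit() via fit_params.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import cross_val_score
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  # legacy wrapper

weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight = dict(enumerate(weights))  # roughly {0: 0.53, 1: 10} for 5% positives

# build_model is a placeholder for the user's compile-and-return-model function.
clf = KerasClassifier(build_fn=build_model, epochs=20, batch_size=32, verbose=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1",
                         fit_params={"class_weight": class_weight})
print(scores.mean())
```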
