What to do if your adversarial validation shows different distributions for an NLP problem?

I was trying to figure out if the test set from a competition is similar to the train set. This was done in an NLP competition, in which I had two columns, tweet and type, and I needed to predict the type of crime the tweet was reporting. So I decided to check whether the train set is too different from the test set. This is what I've done so far: # drop the target column from the training data …
Category: Data Science
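A minimal sketch of the adversarial-validation idea for text data, assuming two DataFrames `train` and `test` with a column named `tweet` (names follow the question; the vectorizer and classifier are illustrative choices): label each row by its origin and check whether a classifier can tell the two sets apart. A cross-validated AUC near 0.5 suggests similar distributions; an AUC near 1.0 means they differ.

```python
# Adversarial-validation sketch (assumes DataFrames `train` and `test`
# with a text column named `tweet`; names are illustrative).
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

text = pd.concat([train["tweet"], test["tweet"]], ignore_index=True)
is_test = np.r_[np.zeros(len(train)), np.ones(len(test))]  # 0 = train, 1 = test

X = TfidfVectorizer(max_features=20000).fit_transform(text)
auc = cross_val_score(LogisticRegression(max_iter=1000), X, is_test,
                      cv=5, scoring="roc_auc").mean()
print(auc)  # ~0.5: similar distributions; close to 1.0: train and test differ
```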

Do high accuracy metrics with a small (but equally sampled) dataset mean a good model?

I have been training my CNN with 200 images per class for a classification problem. The problem is a binary classification one. With the amount of test data I have (25 per class), I am getting good accuracy, precision and recall values. Does that mean my model is actually good?
Category: Data Science
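With only 25 test images per class (50 in total), the point estimate of accuracy carries a lot of uncertainty. One rough way to see this is a binomial confidence interval on the observed accuracy; a sketch assuming statsmodels is installed and, purely for illustration, that 45 of the 50 test images were classified correctly:

```python
# Illustrative only: 45 correct out of 50 test images (numbers are made up).
from statsmodels.stats.proportion import proportion_confint

correct, total = 45, 50
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"Observed accuracy: {correct / total:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
# With n=50 the interval is wide (roughly 0.79 to 0.96), so good-looking metrics
# on such a small test set don't by themselves guarantee the model generalises.
```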

How do I perform Leave-One-Out Cross-Validation for Top-n Recommendation Systems?

I am new to building recommendation systems. I am using the surpriselib library to evaluate my recommendations. All the accuracy metrics are well supported in this library, but I also want to compute the hit rate of my top-n recommender system. I know the formula for hit rate is (number of items users have already purchased) / (number of users), but this does not make sense to me because to train and test the user vs item ratings I have only …
Category: Data Science
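A hedged sketch of the usual leave-one-out hit-rate recipe with the surprise library, assuming a loaded Dataset named `data` (the SVD algorithm and n=10 are illustrative): hold out one rated item per user, build each user's top-n list from items they have not rated in training, and count how often the held-out item appears in that list.

```python
# Leave-one-out hit rate with surprise (assumes a loaded Dataset `data`;
# the SVD algorithm and n=10 are illustrative choices).
from collections import defaultdict
from surprise import SVD
from surprise.model_selection import LeaveOneOut

loo = LeaveOneOut(n_splits=1, random_state=1)
for trainset, testset in loo.split(data):
    algo = SVD().fit(trainset)
    # Predict ratings for all user/item pairs NOT present in the training set.
    candidates = algo.test(trainset.build_anti_testset())

    top_n = defaultdict(list)
    for uid, iid, _, est, _ in candidates:
        top_n[uid].append((iid, est))
    for uid in top_n:
        top_n[uid] = [iid for iid, _ in sorted(top_n[uid], key=lambda x: -x[1])[:10]]

    # A "hit" means the single held-out item made it into the user's top-10.
    hits = sum(1 for uid, iid, _ in testset if iid in top_n.get(uid, []))
    print("Hit rate:", hits / len(testset))
```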

Cross-validation for anomaly detection on time series data

I want to perform k-fold cross-validation for the setting where I have a training dataset consisting of a sequential time series that is fully benign and a test dataset (also a sequential time series) which contains labeled anomalies. I already took a look at this post, but as my data is sequential, the answer doesn't work out. I am especially stuck on the fact that for k-fold cross-validation, you use (k-1)/k parts of your data for training and 1/k parts …
Category: Data Science
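For sequential data, the usual replacement for shuffled k-fold is forward chaining: each fold trains on the past and validates on the window that follows, so no future information leaks backwards. A minimal sketch with sklearn's TimeSeriesSplit (the toy array stands in for the benign training series):

```python
# Forward-chaining splits: every fold trains only on data that comes earlier in time.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in for the benign training series
for fold, (train_idx, val_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    print(f"fold {fold}: train indices 0..{train_idx.max()}, "
          f"validate indices {val_idx.min()}..{val_idx.max()}")
```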

Cross-Validation for Unsupervised Anomaly Detection with Isolation Forest

I am wondering whether I can perform any kind of cross-validation or GridSearchCV for unsupervised learning. The thing is that I do have the ground-truth labels (but since the task is unsupervised I just drop them for training and then reuse them for measuring accuracy, AUC, AUCPR, and F1-score on the test set). Is there any way to do this?
Category: Data Science
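One common pattern is to keep the labels out of fitting entirely and use them only for scoring: loop over a ParameterGrid, fit IsolationForest on the features alone, and evaluate each setting against the held-back ground truth. A sketch under the assumption that X is the feature matrix and y marks anomalies with 1 (the parameter grid is illustrative):

```python
# Unsupervised fit, supervised evaluation: labels are used only for scoring.
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.metrics import roc_auc_score

# X, y assumed: y == 1 marks an anomaly and is never shown to the model.
X_train, X_test, _, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = ParameterGrid({"n_estimators": [100, 300], "max_samples": [0.5, 1.0]})
for params in grid:
    iso = IsolationForest(random_state=0, **params).fit(X_train)
    # score_samples: higher means "more normal", so negate it as an anomaly score.
    auc = roc_auc_score(y_test, -iso.score_samples(X_test))
    print(params, f"AUC={auc:.3f}")
```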

Compare cross validation values of Bernoulli NB and Multinomial NB

I'm testing Multinomial NB and Bernoulli NB on my dataset and I'm using the cross-validation score to better understand which of the two algorithms works better. This is the first classifier:

    from sklearn.naive_bayes import MultinomialNB
    clf_multinomial = MultinomialNB()
    clf_multinomial.fit(X_train, y_train)
    y_predicted = clf_multinomial.predict(X_test)
    score = clf_multinomial.score(X_test, y_test)
    scores = cross_val_score(clf_multinomial, X_train, y_train, cv=5)
    print(scores)
    print(score)

And these are the scores:

    [0.75 0.875 0.66666667 0.95833333 0.86956522]
    0.8637666498061035

This is the second classifier:

    from sklearn.naive_bayes import BernoulliNB
    clf_multivariate = BernoulliNB()
    …
Category: Data Science
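When comparing the two models it helps to evaluate both on identical CV folds and to look at the mean and spread rather than a single number. A sketch reusing X_train and y_train from the question (the fold setup is illustrative):

```python
# Compare MultinomialNB and BernoulliNB on identical cross-validation folds.
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for both
for clf in (MultinomialNB(), BernoulliNB()):
    scores = cross_val_score(clf, X_train, y_train, cv=cv)
    print(type(clf).__name__, f"{scores.mean():.3f} +/- {scores.std():.3f}")
```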

How are parameters selected in cross-validation?

Suppose I'm training a linear regression model using k-fold cross-validation. I'm training K times, each time with a different training and test data set. So each time I train, I get different parameters (feature coefficients in the linear regression case), and I will have K sets of parameters at the end of cross-validation. How do I arrive at the final parameters for my model? If I'm using it to tune hyperparameters as well, do I have to do another cross-validation after fixing …
Category: Data Science
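The usual convention is that cross-validation estimates how well the training procedure generalises, while the final coefficients come from refitting that procedure on all of the training data; GridSearchCV does this refit automatically. A sketch for a regularised linear model, with X_train and y_train assumed and the alpha grid purely illustrative:

```python
# CV estimates performance and picks hyper-parameters; the final model is
# refit on ALL training data with the chosen settings.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)          # refit=True retrains on all of X_train
final_model = search.best_estimator_  # one set of coefficients, not K of them
print(search.best_params_, final_model.coef_[:5])
```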

Using cross validation score to perform feature selection

So to perform my feature selection, I ran cross-validation over and over again, each time trying a different subset of my attributes, and repeated this until I got the best cross-validation score I could. Is this alright to do, or am I creating a major bias? I suspect that this could cause a bias and possibly result in data leakage, because I am probably learning something about my test set by doing this, but how bad of a …
Category: Data Science
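Repeatedly choosing the subset that maximises the CV score does leak information from the validation folds into the selection. The usual remedy is to put the selection step inside a pipeline so it is refit on each training fold, and to judge the whole procedure with an outer CV. A sketch with X and y assumed (the selector, classifier, and k grid are illustrative):

```python
# Feature selection happens inside each training fold, so the reported score
# is not inflated by peeking at the validation folds.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Grid values are illustrative; keep k no larger than the number of features.
inner = GridSearchCV(pipe, {"select__k": [5, 10, 20]}, cv=5)  # picks k per fold
outer_scores = cross_val_score(inner, X, y, cv=5)             # unbiased estimate
print(outer_scores.mean())
```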

What parameters to use when normalising training, validation, and testing data?

I know a similar post was made here, but I wanted to ask some follow-up questions. I am conducting a cross-validation search to find values for a set of hyper-parameters and need to normalise the data. If we split up the data as follows:

1. 'Training' data (call this set 'A' for now) and testing data
2. Split the 'training' data into training (call this set 'B' for now) and validation sets

what parameters should be used when normalising the datasets? Am I …
Category: Data Science
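The short rule is that scaling parameters (mean/standard deviation, min/max) are learned only from the data the model is trained on and merely applied to validation and test sets; wrapping the scaler in a Pipeline enforces this inside the hyper-parameter search. A sketch where X_trainval plays the role of set 'A' and X_test is the held-out test data (the estimator and grid are illustrative):

```python
# The scaler is fit on the training portion of each CV fold only; the
# validation fold and the final test set are merely transformed.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
search.fit(X_trainval, y_trainval)   # set 'A' in the question's notation
print(search.score(X_test, y_test))  # test set scaled with A's parameters
```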

Decision trees change their result at every run; how can I trust my results?

Given a database, I split the data into train and test sets. I want to use a decision-tree classifier (sklearn) for a binary classification problem. Considering I have already found the best parameters for my model, if I run the model on the test set I obtain a different result at each run (with the same hyper-parameters). Why is that? Considering I am using accuracy as the metric, I see variations from 0.5 to 0.8. Which result should I take as correct, …
Category: Data Science
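sklearn's DecisionTreeClassifier permutes the features at each split, so equally good splits can be chosen differently from run to run unless random_state is fixed; repeated cross-validation then shows how much the score genuinely varies instead of relying on one lucky or unlucky split. A sketch with X and y assumed:

```python
# Fix random_state for a reproducible tree, and use repeated CV to see the
# genuine spread of the score rather than trusting a single run.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

clf = DecisionTreeClassifier(random_state=42)  # same tree every run
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```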

Restrictions on my skewed validation data

I have a severely skewed data set consisting of twenty-something classes, where the smallest class contains on the order of 1000 samples and the largest several million. Regarding the validation data, I understand that I should make sure it represents a class ratio similar to the one in my original raw data. Hence, I shouldn't do any under- or over-sampling on the validation data, but I can do it on the training data. Because I have such …
Category: Data Science
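One way to respect that constraint is to carve out the validation set with a stratified split, so it keeps the natural class ratios, and to resample only the remaining training portion. A sketch using imbalanced-learn, with X and y assumed and random undersampling as an illustrative choice:

```python
# Validation keeps the raw class proportions; resampling touches training only.
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # val mirrors raw ratios

X_train_res, y_train_res = RandomUnderSampler(random_state=0).fit_resample(
    X_train, y_train)                                  # only training is resampled
```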

binary classification pipeline to select threshold

There are quite a few questions regarding the optimisation of the binary threshold in a classification problem. However, I haven't found a single end-to-end solution to this problem. In an existing project, I have come up with the following pipeline to train a binary classifier:

1. Outer CV, due to the small-to-moderate data size.
2. Inner CV to tune hyperparameters.
3. Train the model with the tuned hyperparameters on the outer-CV train set.
4. Predict on the outer-CV test set.
5. Find the optimal threshold using the prediction probabilities.
6. Get the score by converting the prediction …
Category: Data Science
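For the threshold step, a common choice is to derive out-of-fold probabilities on the training side, pick the threshold that maximises the metric of interest there, and then apply that fixed threshold to the outer test fold. A sketch of just that step, with `model` standing in for the estimator tuned by the inner CV and F1 as the illustrative metric (X_tr, y_tr, X_te, y_te assumed):

```python
# Pick the probability threshold that maximises F1 on out-of-fold predictions
# from the TRAINING side, then reuse it unchanged on the held-out test fold.
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import cross_val_predict

oof_proba = cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
prec, rec, thr = precision_recall_curve(y_tr, oof_proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best_thr = thr[np.argmax(f1[:-1])]  # last precision/recall point has no threshold

model.fit(X_tr, y_tr)
test_pred = (model.predict_proba(X_te)[:, 1] >= best_thr).astype(int)
print(best_thr, f1_score(y_te, test_pred))
```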

Optimizing decision threshold on model with oversampled/imbalanced data

I'm working on developing a model with a highly imbalanced dataset (0.7% minority class). To remedy the imbalance, I was going to oversample using algorithms from the imbalanced-learn library. I had a workflow in mind which I wanted to share and get an opinion on whether I'm heading in the right direction or have missed something:

1. Split train/test/val.
2. Set up a pipeline for GridSearch and optimize hyper-parameters (the pipeline will only oversample the training folds).
3. The scoring metric will be AUC, as the training set is …
Category: Data Science
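The key detail in that workflow is using imbalanced-learn's Pipeline, which applies the oversampler only when fitting, so the validation folds inside the grid search and the final test set are never oversampled. A sketch with X_train and y_train assumed (SMOTE and the estimator are illustrative choices):

```python
# imblearn's Pipeline applies SMOTE during fit only, so CV validation folds
# and the final test set are never oversampled.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
search = GridSearchCV(pipe, {"clf__n_estimators": [200, 500]},
                      scoring="roc_auc", cv=5)
search.fit(X_train, y_train)  # oversampling happens inside each training fold
print(search.best_params_, search.best_score_)
```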

Choose CNN architecture first, then optimize parameters - validation vs test performance to pick architecture?

I am doing a few experiments on medical data. I am about to apply transfer learning with pretrained networks to my problem. Firstly, I have to pick a network architecture. Secondly, I would like to optimize its parameters / the parameters of the optimizer to get better performance. I would like to pick the network architecture based on 10-fold cross-validation of several architectures. I will perform cross-validation in such a way that I split the data into train:test in an 80:20 manner, then train …
Category: Data Science

Why does CV yield a lower score?

My training accuracy was better than my test accuracy, hence I thought my model was over-fitted and tried cross-validation. The score degraded further. Does that mean my input data needs to be sanitised further and be of better quality? Please share your thoughts on what could be going wrong here. My data distribution: Code snippets…

My function get_score:

    def get_score(model, X_train, X_test, y_train, y_test):
        model.fit(X_train, y_train.values.ravel())
        pred0 = model.predict(X_test)
        return accuracy_score(y_test, pred0)

Logic:

    print('*TRAIN* Accuracy Score => '+str(accuracy_score(y_train, m.predict(X_train))))  # LinearSVC() used
    print('*TEST* …
Category: Data Science
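Before blaming data quality, it can help to look at the train/validation gap fold by fold; cross_validate with return_train_score=True reports both in one call. A sketch with X and y assumed, using LinearSVC since the question mentions it:

```python
# cross_validate with return_train_score=True shows the train/validation gap
# per fold, which separates genuine over-fitting from one unlucky split.
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

res = cross_validate(LinearSVC(max_iter=5000), X, y, cv=5,
                     scoring="accuracy", return_train_score=True)
print("train:", res["train_score"].mean(), "validation:", res["test_score"].mean())
```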

How does the cross-validation option of Orange's Data Sampler widget work?

I use the Data Sampler widget to split a dataset (train and test) with the cross-validation selection. I wonder how it works, because some points are not clear to me. Question 1: As seen in the figure, I split the data into five subsets (each has 20 observations). Then, I selected one of the subsets (the remaining data) to test the models, which means the other four subsets are used for training. At this point, while the algorithms build a model, are …
Category: Data Science

stratified segment-grouped k-fold cross-validation

I have numerical music data (2282 rows × 173 columns) and want to predict the target: sad, happy, angry, or relaxed. One of the attributes is segment_id, and I want to group the data according to segment_id and apply stratified CV. How can I do it? I have 26 segments, and each segment appears at least 50 times in the data set. I have no idea where to start; could someone give me some hints? If you need further …
Category: Data Science
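Recent scikit-learn versions (1.0 and later) ship StratifiedGroupKFold, which keeps every segment_id group in a single fold while keeping the class proportions roughly balanced. A sketch assuming a DataFrame df with the 173 feature columns, a segment_id column, and a target column here called emotion (that column name and the classifier are illustrative):

```python
# Groups (segment_id) never straddle folds; class proportions stay roughly balanced.
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# df assumed: feature columns plus 'segment_id' and the (hypothetical) 'emotion' target.
X = df.drop(columns=["segment_id", "emotion"])
y = df["emotion"]            # sad / happy / angry / relaxed
groups = df["segment_id"]

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, groups=groups)
print(scores.mean())
```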

Using keras with sklearn: apply class_weight with cross_val_score

I have a highly imbalanced dataset (±5% positive instances), for which I am training binary classifiers. I am using nested 5-fold cross-validation with grid search for hyperparameter tuning. I want to avoid undersampling, so I have been looking into the class_weight hyperparameter. For sklearn's decision tree classifier, this works really well and is easily given as a hyperparameter. However, this is not an option for sklearn's neural network (multi-layer perceptron) as far as I can tell. I have been using …
Category: Data Science
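One approach that works with the scikit-learn wrappers for Keras is to compute the weights with sklearn and pass them through cross_val_score's fit_params, which forwards them to the wrapper's fit and from there to model.fit. A sketch assuming the legacy tf.keras wrapper (replaced by scikeras in newer TensorFlow releases), with X and y assumed and build_model a placeholder for the function that builds and compiles the network:

```python
# Balanced class weights computed once, forwarded to Keras' fit() via fit_params.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import cross_val_score
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  # legacy wrapper

weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight = dict(enumerate(weights))  # roughly {0: 0.53, 1: 10} for 5% positives

# build_model is a placeholder for the user's compile-and-return-model function.
clf = KerasClassifier(build_fn=build_model, epochs=20, batch_size=32, verbose=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1",
                         fit_params={"class_weight": class_weight})
print(scores.mean())
```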
