Dataset split for image classification

I am trying to do image classification for 14 categories (around 1000 images for each category), and I initially created two folders, one for training and one for validation. In this case, do I still need to set a validation split or a subset in the code, or can I use all the files as train_ds and val_ds by deleting those arguments? The folder names in the training and validation directories are the same. data_dir = 'trainingdatav1' data_val = 'Validationv1' train_ds = tf.keras.preprocessing.image_dataset_from_directory( data_dir, validation_split=0.1, #is …
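
For reference, a minimal sketch of loading the two pre-split folders directly, with no validation_split or subset arguments; the image size and batch size below are assumptions, not values from the question.

import tensorflow as tf

# Assumed image size and batch size; adjust to the real dataset.
data_dir = 'trainingdatav1'
data_val = 'Validationv1'

# The data is already split into two directories, so each one can be
# loaded as its own dataset with no validation_split or subset argument.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir, image_size=(224, 224), batch_size=32, seed=123)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_val, image_size=(224, 224), batch_size=32, seed=123)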
Category: Data Science

Is this XGBoost model tending to overfit?

Here is the list of hyperparameters that I used: params = { 'scale_pos_weight': [1.0], 'eta': [0.05, 0.1, 0.15, 0.9, 1.0], 'max_depth': [1, 2, 6, 10, 15, 20], 'gamma': [0.0, 0.4, 0.5, 0.7] } The dataset is imbalanced, so I used the scale_pos_weight parameter. After 5-fold cross-validation, the F1 score that I got is 0.530726530426833.
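
For context, a minimal sketch of how such a grid could be evaluated with scikit-learn's GridSearchCV so that the train/validation F1 gap (the usual overfitting signal) is visible. X and y are placeholders for the actual data, and 'eta' is written under its sklearn-wrapper name, learning_rate.

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Same grid as in the question, with 'eta' as learning_rate.
params = {
    'scale_pos_weight': [1.0],
    'learning_rate': [0.05, 0.1, 0.15, 0.9, 1.0],
    'max_depth': [1, 2, 6, 10, 15, 20],
    'gamma': [0.0, 0.4, 0.5, 0.7],
}

# return_train_score=True exposes the train/validation F1 gap,
# which is the usual overfitting signal in a CV search.
search = GridSearchCV(XGBClassifier(eval_metric='logloss'),
                      params, scoring='f1', cv=5, return_train_score=True)
search.fit(X, y)  # X, y: the imbalanced dataset (placeholders)

i = search.best_index_
print('train F1:', search.cv_results_['mean_train_score'][i])
print('CV F1:   ', search.cv_results_['mean_test_score'][i])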
Category: Data Science

Overfitting problem: high training accuracy and low validation accuracy for image classification

I want to define a model to predict 3 categories of images. I'm learning in the field :-) I have 1500 images (500 for each category) in 3 directories. I've read many suggestions in this blog: use a simple loss function, use dropout, use shuffle. I've applied these tricks but the model still overfits ... This is the code I'm using; any suggestion? dim_x = 500 dim_y = 200 dim_kernel = (3,3) data_gen = ImageDataGenerator(rescale=1/255,validation_split=0.3) data_dir = image_path train_data_generator=data_gen.flow_from_directory( data_dir, target_size=(dim_x,dim_y), …
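
A minimal sketch of one common variant of this setup, adding augmentation on the training split only and dropout in a deliberately small network; the augmentation values and layer sizes are assumptions, not the code from the question.

from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

dim_x, dim_y = 500, 200  # sizes from the question

# Augmentation on the training split only; the validation split stays rescale-only.
train_gen = ImageDataGenerator(rescale=1/255, validation_split=0.3,
                               rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)
val_gen = ImageDataGenerator(rescale=1/255, validation_split=0.3)

train_data = train_gen.flow_from_directory(image_path, target_size=(dim_x, dim_y),
                                           subset='training')
val_data = val_gen.flow_from_directory(image_path, target_size=(dim_x, dim_y),
                                       subset='validation')

# A deliberately small CNN with dropout before the 3-class output.
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=(dim_x, dim_y, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])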
Category: Data Science

How to reduce overfitting and improve the confusion matrix

I am trying to apply the following model to my data, which consists of 4030 samples across 5 classes. Each sample is a set of MFCC features extracted from a 20-second audio clip. When I apply classification I get very poor accuracy and I also have overfitting, even though I am using data augmentation and I have also tried Batch Normalization to reduce the overfitting; the result is still very bad. The model: …
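
A minimal sketch of one way BatchNormalization and dropout are commonly combined on MFCC input; the input shape and layer sizes are assumptions, only the 5-class output comes from the question.

from tensorflow.keras import layers, models

n_mfcc, n_frames = 40, 860  # assumed MFCC matrix shape for a 20-second clip

model = models.Sequential([
    layers.Input(shape=(n_mfcc, n_frames, 1)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.BatchNormalization(),      # normalize activations after each conv block
    layers.MaxPooling2D(),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),  # fewer parameters than Flatten + Dense
    layers.Dropout(0.5),
    layers.Dense(5, activation='softmax'),  # 5 classes as in the question
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])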
Category: Data Science

Training loss decreasing while Validation loss is not decreasing

I am wondering why the validation loss of this regression problem is not decreasing, even though I have tried several methods such as making the model simpler, adding early stopping, various learning rates, and also regularizers, but none of them have worked properly. Any suggestions would be appreciated. Here is my code and my outputs: optimizer = keras.optimizers.Adam(lr=1e-3) model = Sequential() model.add(LSTM(units=50, activation='relu', activity_regularizer=tf.keras.regularizers.l2(1e-2), return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))) model.add(Dropout(0.2)) model.add(LSTM(units=50, activation='relu', activity_regularizer=tf.keras.regularizers.l2(1e-2), return_sequences=False)) model.add(Dropout(0.2)) model.add(Dense(y_train.shape[1])) model.compile(optimizer=optimizer, loss='mae') callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3) history = …
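
One detail worth noting in the snippet: the EarlyStopping callback monitors the training loss ('loss'), so it never reacts to the validation curve. A minimal sketch of monitoring the validation loss instead; restore_best_weights and the validation_data placeholders are added assumptions.

import tensorflow as tf

# Stop on the validation loss rather than the training loss,
# and roll back to the weights from the best epoch.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                            restore_best_weights=True)
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),  # x_val, y_val are placeholders
                    epochs=100, callbacks=[callback])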
Category: Data Science

How solved "ValueError: y should be a 1d array, got an array of shape () instead."?

from tkinter import * from tkinter import ttk from tkmacosx import Button top = Tk() top.title("Jobs") top.geometry("1000x800") line1 = LabelFrame(top, text='') line1.pack(expand = 'yes', fill = 'both') n = StringVar() categorychoosen = ttk.Combobox(line1, width = 27, textvariable = n) # Adding combobox drop down list categorychoosen['values'] = ('Advocate','Arts','Automation Testing','Blockchain','Business Analyst', 'Web Designing') categorychoosen.place(x=50, y=150) categorychoosen.current() name=Label(line3,text="Welcom to ... company",font =("Arial", 10)) name.place(x=0, y=0) n1 = StringVar() sectionchoosen = ttk.Combobox(line3, width = 27, textvariable = n1) # Adding combobox drop down …
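
For context, the error in the title is raised by scikit-learn when the target passed to an estimator is not a flat 1-D array of labels; a minimal illustration, assuming a scikit-learn call sits somewhere behind this GUI.

import numpy as np

# scikit-learn raises "y should be a 1d array, got an array of shape ()"
# when a single scalar value reaches a call that expects one label per sample.
y_bad = np.array('Advocate')            # shape () -- one value, not a label vector
y_ok = np.array(['Advocate', 'Arts'])   # shape (2,) -- one label per sample

# If y comes as a one-column DataFrame or a nested list, flatten it first:
# y = np.asarray(y).ravel()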
Category: Data Science

How to train a keras model on both original and augmented data from ImageDataGenerator?

I have a dataset that contains about 87000 images in a directory, with each class in a separate subfolder. I've tried the ImageDataGenerator() class and the flow_from_directory() function for generating the images, and it worked completely fine, but I have a question: does flow_from_directory() only yield the augmented images? And if this is the case, how can I train my model (which has overfit the training set) on both the original and the augmented data? Thanks
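
For reference: with augmentation parameters set, flow_from_directory yields only randomly transformed images. A minimal sketch of one way to keep both the originals and augmented copies using tf.data; the directory name, image size, and augmentation layers are assumptions.

import tensorflow as tf

# Assumed directory, image size, and augmentation layers.
raw_ds = tf.keras.preprocessing.image_dataset_from_directory(
    'data_dir', image_size=(224, 224), batch_size=32)

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
])

# An augmented copy of the same images; concatenating gives the model
# both the originals and their randomly transformed versions in one dataset.
aug_ds = raw_ds.map(lambda x, y: (augment(x, training=True), y))
train_ds = raw_ds.concatenate(aug_ds).shuffle(100)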
Category: Data Science

How can I deal with this overfitting?

I trained my model for 40 epochs but finally got this shape. How can I deal with this problem? I used 30,000 samples for training and 5,000 for testing, and lr_schedule = keras.optimizers.schedules.ExponentialDecay( initial_learning_rate=4e-4, decay_steps=50000, decay_rate=0.5) Should I increase the amount of test data or make changes to the model? EDIT: After I added regularization I got this shape, and the loss started from a higher value than in the previous shape; is that normal? Is this …
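
For context, a minimal sketch of how L2 regularization is typically added alongside such a learning-rate schedule; the penalty terms are included in the reported training loss, which is one reason the loss can start higher once regularization is added. The layer sizes are placeholders, not the model from the question.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=4e-4, decay_steps=50000, decay_rate=0.5)

# The L2 penalty terms are added to the reported training loss,
# so the loss curve can legitimately start higher after they are introduced.
model = tf.keras.Sequential([
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])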
Category: Data Science

Minimum number of samples to train XGBoost without overfitting

When using Neural Networks for image processing I learned a rule of thumb: to avoid overfitting, supply at least 10 training examples for every neuron. Is there a similar rule of thumb for classifiers such as XGBoost, presumably taking into account the number of features and estimators? And, considering the 'curse of dimensionality' shouldn't the rule of thumb be that n_training is geometric in n_dimensions, and not linear?
Category: Data Science

Is my model overfitting? Training accuracy 93%, test accuracy 82%

I am using an LGBM model for binary classification. After hyper-parameter tuning I get: training accuracy 0.9340, test accuracy 0.8213. Can I say my model is overfitting? Or is it acceptable in the industry? Also, to add to this, when I increase num_leaves for the same model I am able to achieve: train accuracy 0.8675, test accuracy 0.8137. Which of these results is acceptable and can be reported?
Category: Data Science

Training Object Detection model on just 10 images

I am trying to train an object detection model using Mask-RCNN with ResNet50 as the backbone. I am using the pre-trained models from PyTorch's Torchvision library. I have only 10 images that I can use to train. Of those same 10 images, I am using 3 images for validation. For the evaluation, I am using the evaluation method used for the COCO dataset, which is also provided as .py scripts in TorchVision's GitHub repository. To have enough samples for training, I …
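
For reference, a minimal sketch of the usual Torchvision fine-tuning recipe for this model, replacing the box and mask heads for a custom number of classes; num_classes here is an assumption.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + 1 object class (assumption)

# Start from COCO-pretrained weights (newer torchvision uses weights=... instead).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the box-prediction head for the custom number of classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask-prediction head as well.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)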
Category: Data Science

SciKit-Learn Decision Tree Overfitting

We have a project to utilize a few algorithms we have learned so far. I've been using scikit-learn to apply these algorithms, but when it comes to decision trees I keep getting the feeling I am overfitting. I'm using a dataset about the weather, giving characteristics such as city, state, month, year, wind direction, wind speed, etc., where the target variable is the average temperature for the day. Now I know this is hard to classify, as it is pretty …
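
For context, a minimal sketch of constraining a tree and inspecting the train/CV gap with scikit-learn; since the target is a continuous average temperature, a regressor is used, and X, y are placeholders for the encoded weather data.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# X, y: placeholders for the encoded weather features and the average temperature.
grid = {'max_depth': [3, 5, 8, 12, None],
        'min_samples_leaf': [1, 5, 20, 50]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), grid, cv=5,
                      scoring='neg_mean_absolute_error', return_train_score=True)
search.fit(X, y)

# A large gap between train and CV error is the overfitting signal;
# shallower trees / larger leaves usually shrink it.
i = search.best_index_
print('train MAE:', -search.cv_results_['mean_train_score'][i])
print('CV MAE:   ', -search.cv_results_['mean_test_score'][i])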
Category: Data Science

Does the eval loss decreasing more slowly than the train loss indicate overfitting?

I am training a binary classifier using an efficientnetv2 model with a 1M image dataset where I do a 60/20/20 split. Does this graph mean that the model is over-fitting? I can see that the train loss is going down much faster than the eval loss but the eval loss is still going down and the accuracy is going up. Accuracy may seem to be low but it is actually a pretty decent amount for the problem I am working …
Category: Data Science

Overfitted model produces similar AUC on test set, so which model do I go with?

I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before the training folds are selected versus one oversampled after the training folds are selected. The oversampling approach I used was random oversampling. I understand that the first approach is wrong, since observations that the model has seen bleed into the test set; I was just curious how much of a difference this causes. I generated a binary classification dataset with the following: # Generate binary classification dataset with 5% minority class, …
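
A minimal sketch of the second (correct) setup, where random oversampling happens inside an imbalanced-learn pipeline so it is applied only to the training folds of each CV split; the classifier and grid below are placeholders.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Assumed dataset: 5% minority class, as described in the question.
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=42)

# With the sampler inside the pipeline, oversampling is fit and applied
# only on each training fold, so no duplicated observations reach the
# corresponding validation fold.
pipe = Pipeline([('ros', RandomOverSampler(random_state=42)),
                 ('clf', LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, scoring='roc_auc', cv=5)
search.fit(X, y)
print(search.best_score_)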
Category: Data Science

Low classification accuracy

I want to do a multi-class classification with 6 classes. The whole dataset has 12,750 samples and 56 features, so every class has 2,125 samples. Before prediction I reduced the number of outliers by winsorization (at the 1st and 99th percentiles) and I reduced the skewness of features whose skewness was greater than 1 or less than -1 using the Yeo-Johnson transformation, and I got this dataset: https://i.stack.imgur.com/miy8i.png Later, of course, I split the dataset into 80% training data and 20% test data and …
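
For reference, a minimal sketch of that preprocessing with SciPy and scikit-learn; X is assumed to be the (12750, 56) feature matrix, and the transform is applied to all columns rather than only the highly skewed ones.

import numpy as np
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import PowerTransformer

# Clip every feature at its 1st and 99th percentiles (per-column winsorization).
X_w = np.asarray(winsorize(X, limits=[0.01, 0.01], axis=0))

# Yeo-Johnson transform to reduce skewness (applied to all columns here,
# not only those with |skewness| > 1 as in the question).
X_t = PowerTransformer(method='yeo-johnson').fit_transform(X_w)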
Category: Data Science

Correctly evaluate model with oversampling and cross-validation

I'm dealing with a classic case of a dataset with a binary imbalanced target (event 3%, non-event 97%). My idea is to apply some sort of sampling (over/under, SMOTE, etc.) to address the issue. As I see it, the correct way of doing this is to sample ONLY the train set, in order to have a test performance that is more similar to reality. Moreover, I want to use CV for hyperparameter tuning. So the tasks, in order, are: divide the dataset into …
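
A minimal sketch of that workflow with imbalanced-learn, where the sampler sits inside the pipeline so it is fit only on the training folds during CV and never touches the held-out test set; the classifier, grid, and the choice of SMOTE are placeholders.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# X, y: placeholders for the imbalanced data (3% event rate).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE lives inside the pipeline, so during CV it is fit and applied only
# on each training fold; validation folds and the final test set keep the
# original class distribution.
pipe = Pipeline([('smote', SMOTE(random_state=42)),
                 ('clf', RandomForestClassifier(random_state=42))])
search = GridSearchCV(pipe, {'clf__max_depth': [5, 10, None]},
                      scoring='average_precision', cv=5)
search.fit(X_train, y_train)
print('held-out test score:', search.score(X_test, y_test))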
Category: Data Science

Multilabel Classification - Overfitting?

My task is the following: to take drug combinations as input and output the renal failure-related symptoms caused by the drug combinations. Both the drug combinations and the renal failure-related symptoms are represented as one-hot encodings (for example, someone getting symptom 1 and symptom 3 out of a total of 4 symptoms is represented as [1,0,1,0]). So far, I have run the data through the following models and they have produced this interesting graph. The left-hand graph depicts the training and validation loss of the …
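
For context, a minimal sketch of a standard multi-label setup for this kind of one-hot symptom target: independent sigmoid outputs with binary cross-entropy. The input size and hidden layer are assumptions; only the 4-symptom example comes from the question.

from tensorflow.keras import layers, models

n_drugs, n_symptoms = 200, 4  # assumed sizes; 4 symptoms as in the [1,0,1,0] example

# One independent sigmoid per symptom with binary cross-entropy, so several
# symptoms can be predicted at once from a one-hot drug-combination vector.
model = models.Sequential([
    layers.Input(shape=(n_drugs,)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(n_symptoms, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['binary_accuracy'])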
Category: Data Science

My k-fold validation is giving a lot of 100% values in the concatenated confusion matrix; is it because of overfitting?

The confusion matrix is a concatenated one from a 5-fold stratified cross-validation of my data set. I used an RBF kernel for the SVM classifier. Is it telling me the classifier is overfitting? Plus, when I plot the confusion matrix from a 70% training / 30% testing split, it gives pretty much the same confusion matrix as the cross-validation one. The unseen test set also gives pretty much the same confusion matrix. Should I worry about overfitting?
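
For reference, a minimal sketch of building one confusion matrix from out-of-fold predictions with scikit-learn; X and y are placeholders for the actual data.

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# X, y: placeholders for the actual features and labels.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = cross_val_predict(SVC(kernel='rbf'), X, y, cv=cv)

# Every prediction here is made on data the fold's model never saw, so a
# near-perfect matrix from these out-of-fold predictions is not by itself
# a sign of overfitting.
print(confusion_matrix(y, y_pred))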
Category: Data Science

Is data leakage giving me misleading results? Independent test set says no!

TLDR: I evaluated a classification model using 10-fold CV with data leakage in the training and test folds. The results were great. I then solved the data leakage and the results were garbage. I then tested the model on an independent new dataset and the results were similar to the evaluation performed with data leakage. What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance? Extended version: I'm developing …
Category: Data Science
