Methods for augmenting binary datasets

I have a small dataset (~100 samples) with roughly 20 features, most of which are binary and a few (~5) numeric. I want to augment the training set to see whether I can improve test accuracy. What methods or code can I use to augment a mostly binary dataset?
Category: Data Science
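For a mixed binary/numeric table like this, the usual answer is SMOTE-NC from imbalanced-learn. The sketch below (NumPy only; the function name and column-index arguments are mine, not from the question) shows the underlying idea: interpolate numeric columns between neighbours and copy binary columns from one of the two endpoints.

```python
import numpy as np

def augment_mixed(X, num_idx, bin_idx, n_new, k=5, rng=None):
    """SMOTE-NC-style synthetic rows for mixed numeric/binary features.

    Numeric columns are interpolated between a sample and one of its k
    nearest neighbours; binary columns are copied from either endpoint at
    random, so they stay strictly 0/1. The caller supplies which column
    indices are numeric (num_idx) and which are binary (bin_idx).
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # nearest neighbours by Euclidean distance on the numeric columns
        d = np.linalg.norm(X[:, num_idx] - X[i, num_idx], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nn)
        lam = rng.random()
        row = X[i].copy()
        row[num_idx] = X[i, num_idx] + lam * (X[j, num_idx] - X[i, num_idx])
        row[bin_idx] = np.where(rng.random(len(bin_idx)) < 0.5,
                                X[i, bin_idx], X[j, bin_idx])
        new_rows.append(row)
    return np.vstack([X, new_rows])
```

With ~100 samples, cross-validate carefully: augment only the training folds, never the held-out fold.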

How to train a keras model on both original and augmented data from ImageDataGenerator?

I have a dataset of about 87000 images in a directory, with each class in a separate subfolder. I've tried the ImageDataGenerator class and the flow_from_directory() function for generating the images, and it worked fine, but I have a question: does flow_from_directory() yield only the augmented images? And if so, how can I train my model, which has overfit the training set, on both the original and the augmented data? Thanks
Category: Data Science
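One way to guarantee the model sees both raw and transformed images is to build the batches yourself. This is a minimal NumPy stand-in (the function and the `augment` callback are mine, not part of the Keras API) that leaves half of every batch untouched:

```python
import numpy as np

def mixed_batches(X, y, augment, batch_size=32, rng=None):
    """Yield batches containing both original and augmented samples.

    `augment` is any function mapping a batch of images to a transformed
    batch (a stand-in for whatever ImageDataGenerator would do). The first
    half of each yielded batch is left untouched, so the model keeps
    seeing raw images alongside their augmented versions.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    while True:
        idx = rng.choice(n, size=batch_size, replace=False)
        xb, yb = X[idx].copy(), y[idx]
        half = batch_size // 2
        xb[half:] = augment(xb[half:])    # augment only the second half
        yield xb, yb

# usage sketch: model.fit(mixed_batches(X_train, y_train, aug_fn),
#                         steps_per_epoch=len(X_train) // 32, ...)
```

Note that with small random transforms this often matters less than it seems: the identity transform is inside the range of most augmentations anyway.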

Why is validation accuracy going down after augmentation?

My main question is about augmentation. I assumed that augmenting is always better than having less data, but in my case the validation accuracy went down:

- train: 7000 images, validation: 3000 images → validation accuracy: 0.89
- train: 40000 images, validation: 17990 images → validation accuracy: 0.85

My augmentation code:

```python
def data_augmentation_folder(trainImagesPath, saveDir):
    # X_train = load_training_data(trainImagesPath, "train")
    print("=====================================================")
    X_train = cleanData(trainImagesPath)
    X_train = np.array(X_train)
    print(X_train[0].shape)
    for i in range(5):
        # print(i)
        datagen = ImageDataGenerator(rotation_range=15,
                                     width_shift_range=0.1,
                                     height_shift_range=0.1,
                                     shear_range=0.01,
                                     zoom_range=[0.9, 1.25],
                                     horizontal_flip=True,
                                     vertical_flip=False,
                                     fill_mode='reflect',
                                     …
```
Category: Data Science

Same validation accuracy, different train accuracy for two neural networks models

I'm performing emotion classification on the FER2013 dataset. I'm comparing the performance of different models, and when I tried ImageDataGenerator with a model I had already used, I ran into the following situation:

- Model without data augmentation: train_accuracy = 0.76, val_accuracy = 0.70
- Model with data augmentation: train_accuracy = 0.86, val_accuracy = 0.70

As you can see, validation accuracy is the same for both models, but train accuracy is significantly different. In this case: Should I go with …
Category: Data Science

Keras data augmentation: length of the data

I'm confused: when I add data augmentation, should I end up with more data or the same amount? I checked the length of x_train to confirm, but I got the same length before and after augmentation. Is that correct, or should the dataset have doubled?

```python
print(len(x_train))
# output: 5484

# after augmentation:
datagen = ImageDataGenerator(
    featurewise_center=True,             # set input mean to 0 over the dataset
    samplewise_center=True,              # set each sample mean to 0
    featurewise_std_normalization=True,  # divide inputs by std
    …
```
Category: Data Science
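What the question observes is expected: Keras-style generators transform batches on the fly and never enlarge the underlying array. A minimal stand-in (plain NumPy, not the actual Keras API) makes the point:

```python
import numpy as np

x_train = np.random.rand(5484, 32, 32, 3)

def flow(x, transform, batch_size=32):
    """Stand-in for datagen.flow(): yields transformed batches forever,
    but never touches or enlarges the underlying array."""
    while True:
        idx = np.random.randint(0, len(x), size=batch_size)
        yield transform(x[idx])

gen = flow(x_train, transform=lambda b: np.flip(b, axis=2))  # horizontal flip
batch = next(gen)

print(len(x_train))   # still 5484: augmentation happens per batch, in memory
print(batch.shape)    # (32, 32, 32, 3)
```

The "more data" effect comes from the model seeing a differently transformed variant of each image every epoch, not from a bigger stored array.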

Non-Real Time Data Augmentation for CNN Classification. What are the drawbacks?

When people talk about and use data augmentation, are they mostly referring to real-time data augmentation? In image classification, that means augmenting the data right before fitting the model, with a new augmented image generated every epoch. In that case only augmented images are used to train the model and the raw image is never used, so the size of the input doesn't actually change. But what about non-real-time data augmentation? By this, I mean …
Category: Data Science
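For contrast with the real-time case described above, non-real-time (offline) augmentation applies the transforms once, up front, and stores the results. A hedged sketch (function name and transforms are my own choices for illustration):

```python
import numpy as np

def offline_augment(X, y, transforms, copies_per_transform=1):
    """Non-real-time augmentation: apply each transform up front and
    concatenate the results with the raw images, so the stored dataset
    really does grow (unlike per-batch, on-the-fly augmentation)."""
    parts_x, parts_y = [X], [y]
    for t in transforms:
        for _ in range(copies_per_transform):
            parts_x.append(t(X))
            parts_y.append(y)
    return np.concatenate(parts_x), np.concatenate(parts_y)

X = np.random.rand(100, 28, 28)
y = np.random.randint(0, 2, size=100)
X_aug, y_aug = offline_augment(
    X, y,
    transforms=[lambda b: np.flip(b, axis=2),         # horizontal flip
                lambda b: np.rot90(b, axes=(1, 2))])  # 90-degree rotation
print(X_aug.shape)  # (300, 28, 28): raw images plus two augmented copies
```

The usual drawbacks are storage cost and a fixed, finite set of variants: every epoch replays the same augmented copies instead of drawing fresh ones.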

What is the difference between Keras API augmentation and the definition of data augmentation?

Data augmentation is defined as increasing the number of images, using rotations, crops, and flips, to avoid overfitting. The Keras API applies augmentation but does not increase the number of images. So what does Keras augmentation actually do to the images? Is the API's augmentation a form of image preprocessing? Does augmentation replace the original images with the new augmented ones?
Category: Data Science
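The resolution of the apparent contradiction is that Keras applies a freshly randomised transform to the same stored images each epoch; nothing is replaced and nothing is added. A NumPy stand-in (not the Keras implementation; the random shift is just an example transform):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.random((100, 32, 32))

def random_augment(batch, rng):
    """Stand-in for what a Keras-style generator does each epoch: apply a
    randomly parameterised transform (here, a random horizontal shift) to
    the same underlying images. The originals are never replaced."""
    shift = rng.integers(-2, 3)
    return np.roll(batch, shift, axis=2)

epoch1 = random_augment(x_train, np.random.default_rng(1))

print(len(epoch1))                        # 100: the count never grows
print(np.shares_memory(epoch1, x_train))  # False: originals untouched
```

So the "increase" is effective rather than literal: over many epochs the model sees many distinct variants of each image.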

Are there techniques for creating synthetic data for a regression problem? I tried SMOTE and its variants, but those are for classification

In my data, "Volume" is the target variable and all the others are independent variables. I applied a LabelEncoder to Area_categ, wind_direction_labelencod, and current_label_encode, and now I want to apply a technique that increases my dataset's rows and columns, the way SMOTE balances classes in classification. Please suggest a solution; if it is possible with deep learning techniques, please do help us.
Category: Data Science
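SMOTE's interpolation idea does carry over to regression (this is the core of SMOTE-for-regression variants such as SMOTER): interpolate both the features and the continuous target between nearest neighbours. A NumPy sketch, with the function name and parameters my own:

```python
import numpy as np

def augment_regression(X, y, n_new, k=5, rng=None):
    """Interpolation-based synthetic rows for regression: pick a random
    row, pick one of its k nearest neighbours, and linearly interpolate
    both the feature vector and the continuous target between them."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X, float), np.asarray(y, float)
    new_X, new_y = [], []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])   # one of the k nearest rows
        lam = rng.random()
        new_X.append(X[i] + lam * (X[j] - X[i]))
        new_y.append(y[i] + lam * (y[j] - y[i]))
    return np.vstack([X, new_X]), np.concatenate([y, new_y])
```

Caveat: interpolating label-encoded categoricals (like the encoded wind direction) produces meaningless in-between codes, so those columns should be copied from an endpoint rather than interpolated.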

Data augmentation in images

Suppose there is an ML network that takes grayscale images as input, while the images I have are RGB. Instead of converting the RGB images to grayscale, I treat each individual colour band as a distinct input to the network; that is, instead of feeding RGB image A to the network, I feed the R matrix of A as the first input, followed by the G matrix and then the B matrix. This leads to 3 times more …
Category: Data Science
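The channel-splitting scheme described above is a couple of lines of array manipulation. A sketch (names are mine; it assumes channel-last images and that each band inherits the original label):

```python
import numpy as np

def split_channels(X_rgb, y):
    """Turn each (H, W, 3) RGB image into three single-channel samples,
    each inheriting the original label, tripling the dataset as the
    question describes."""
    # move the channel axis first, then flatten (N, 3, H, W) -> (3N, H, W)
    X_split = np.transpose(X_rgb, (0, 3, 1, 2)).reshape(-1, *X_rgb.shape[1:3])
    y_split = np.repeat(y, 3)                  # R, G, B share one label
    return X_split, y_split

X = np.random.rand(10, 64, 64, 3)
y = np.arange(10)
Xs, ys = split_channels(X, y)
print(Xs.shape, ys.shape)  # (30, 64, 64) (30,)
```

Whether this counts as useful augmentation is debatable: the three bands of one photo are highly correlated, so the effective information gain is much less than 3x.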

Should synthetic data be oversampled as well?

I'm building a binary text classifier where the ratio between positives and negatives is 1:100 (100 / 10000). Using back translation as augmentation, I was able to generate 400 more positives. Then I decided to upsample to balance the data. Should I include only the original positive data points (100), or also the 400 I generated? I will definitely try both, but I wanted to know if there is any rule of …
Category: Data Science
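Mechanically, the two options differ only in which pool the upsampling draws from. A sketch of the "include synthetic" option (function name and the random-feature placeholders are mine; real inputs would be text embeddings or token ids):

```python
import numpy as np

def oversample(pos, syn_pos, neg, rng=None):
    """Upsample the positive class to match the negatives, drawing with
    replacement from the pool of real + synthetic (back-translated)
    positives. Whether to include syn_pos in the pool is exactly the
    question's judgment call; this sketch includes them."""
    rng = np.random.default_rng(rng)
    pool = np.concatenate([pos, syn_pos])
    idx = rng.integers(0, len(pool), size=len(neg))
    X = np.concatenate([pool[idx], neg])
    y = np.r_[np.ones(len(neg)), np.zeros(len(neg))]
    return X, y

pos = np.random.rand(100, 8)        # 100 real positives (placeholder features)
syn = np.random.rand(400, 8)        # 400 back-translated positives
neg = np.random.rand(10000, 8)      # 10000 negatives
X_bal, y_bal = oversample(pos, syn, neg, rng=0)
print(X_bal.shape)                  # (20000, 8): classes now 1:1
```

Including the synthetic positives in the pool reduces how many exact duplicates of the 100 real positives end up in training, which is usually the argument for doing so.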

When using data augmentation, is it OK to validate only with the original images?

I'm working on a multi-class deep learning problem and was getting severe overfitting. My model is supposed to classify sunglasses into 17 different brands, but I only had around 400 images per brand, so I created a folder with the data augmented 3x, generating images with these parameters:

```python
datagen = ImageDataGenerator(
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')
```

After doing so I got these results: I don't know if it's correct to do the validation only using the …
Category: Data Science
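Validating on original images only is not just OK, it is the recommended practice, provided the split happens *before* augmentation so no augmented copy of a validation image leaks into training. A sketch of that ordering (names and the flip transform are mine):

```python
import numpy as np

def split_then_augment(X, y, augment, val_frac=0.2, copies=3, rng=None):
    """Hold out an untouched validation set *before* augmenting, so
    validation scores reflect real images only; the augmented copies go
    into the training split alone."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_frac)
    val, tr = idx[:n_val], idx[n_val:]
    X_tr = np.concatenate([X[tr]] + [augment(X[tr]) for _ in range(copies)])
    y_tr = np.tile(y[tr], copies + 1)          # labels repeat per copy
    return X_tr, y_tr, X[val], y[val]

X = np.random.rand(400, 32, 32, 3)
y = np.random.randint(0, 17, size=400)
X_tr, y_tr, X_val, y_val = split_then_augment(X, y, lambda b: np.flip(b, axis=2))
print(X_tr.shape, X_val.shape)  # (1280, 32, 32, 3) (80, 32, 32, 3)
```

Augmenting before splitting is the classic leakage mistake: near-duplicates of training images land in validation and the score becomes optimistic.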

How can I keep my CNN binary classification model from overfitting and underfitting?

I am working on the cats & dogs classification problem. My model is overfitting, and I have tried every technique I know to fix it, such as dropout, data augmentation, and L2 and L1 regularization, but nothing is working. Can you please help me? At the end of training, my train accuracy was 0.7868 and my validation accuracy was 0.7044. My images are 48x48 with 3 channels, and the batch size is 128. …
Category: Data Science

Data Augmentation Multi Outputs

This question has been asked several times here on SE, but I haven't been able to find the right answer. I'm trying to build a network with 1 input and 2 outputs. I don't have a lot of data, so I would like to use a generator for augmentation (preferably with imgaug). My code:

```python
seq = iaa.Sequential([
    ....
])
gen = ImageDataGenerator(preprocessing_function=seq.augment_image)
batch_size = 64

def generate_data_generator(generator, X, Y1, Y2):
    genX = gen.flow(X, batch_size=batch_size, seed=42)
    genY1 = gen.flow(Y1, batch_size=batch_size, seed=42)
    while …
```
Category: Data Science
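The key requirement is that the input batch and both label arrays stay aligned. A NumPy sketch of a 1-input / 2-output generator (my own names; it uses one shared index per batch, which is what the matching `seed=42` in the question's paired flows is trying to achieve):

```python
import numpy as np

def multi_output_batches(X, y1, y2, augment, batch_size=64, rng=None):
    """Generator for a 1-input / 2-output model: augment the input batch
    and yield it with *both* label arrays, kept aligned by drawing one
    shared index for all three."""
    rng = np.random.default_rng(rng)
    while True:
        idx = rng.choice(len(X), size=batch_size, replace=False)
        yield augment(X[idx]), [y1[idx], y2[idx]]

X = np.random.rand(200, 32, 32, 3)
y1 = np.random.randint(0, 5, size=200)   # e.g. a classification head
y2 = np.random.rand(200)                 # e.g. a regression head
gen = multi_output_batches(X, y1, y2, lambda b: np.flip(b, axis=2))
xb, (y1b, y2b) = next(gen)
print(xb.shape, y1b.shape, y2b.shape)  # (64, 32, 32, 3) (64,) (64,)
```

One caveat with the question's approach: running class labels through `gen.flow` only makes sense when Y1/Y2 are themselves images (e.g. masks); plain labels should be indexed, not augmented.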

Removing outliers from a multi-dimensional dataset & Data augmentation

Removing outliers from single-dimensional data is easy: drop the points that fall outside the IQR range. But how should outliers be detected and removed when the dataset has multiple dimensions? Here's my approach: the dataset consists of seven dimensions; laid out as a dataframe, that is seven columns, with each row holding the properties of a single data point. I looped …
Category: Data Science
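The per-column loop the question describes can be vectorised: compute each column's IQR fence and keep only rows that pass in every column. A sketch (function name is mine; this treats columns independently, so it will miss outliers that are only unusual as a *combination* of values, for which Mahalanobis distance or IsolationForest are the usual upgrades):

```python
import numpy as np

def iqr_filter(values):
    """Row-wise outlier removal over several columns: keep a row only if
    every one of its values lies inside that column's
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] range."""
    X = np.asarray(values, float)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = np.all((X >= lo) & (X <= hi), axis=1)
    return X[mask], mask

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
X[0] = 100.0                       # plant an obvious outlier row
clean, kept = iqr_filter(X)
print(X.shape, clean.shape)        # row 0 (and any natural outliers) dropped
```

Note that filtering column-by-column compounds: with seven columns, even well-behaved data loses a noticeable fraction of rows to chance exceedances.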

Baseline model and transfer learning

I've tried to find guidance on using transfer learning when building baseline models for ML projects (a CNN in my case), but found no clues about good practice on the matter. My reasoning is that a baseline model should not be pretrained, since pretraining complicates the baseline before there is any demonstrated need for it. But this wouldn't be the first time my logic was wrong when it comes to DS. …
Category: Data Science

Why do we call the Mixup method a data augmentation technique?

I am a bit confused about the Mixup data augmentation technique, so let me explain the problem briefly. What is Mixup? (For further detail you may refer to the original paper.) With classic augmentation techniques (e.g., jittering, scaling, magnitude warping) we double or quadruple the data: for instance, if the original dataset contained 4000 samples, there will be 8000 samples after augmentation. On the other hand, according to my understanding, in Mixup data augmentation we do …
Category: Data Science
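Mixup, as defined in the original paper (Zhang et al.), forms convex combinations of random pairs of samples and their labels with a Beta-distributed weight. It still counts as augmentation because each batch is mixed on the fly, so the model never trains on a stored, fixed dataset of raw samples, even though the sample count per batch never changes. A minimal sketch:

```python
import numpy as np

def mixup_batch(X, y_onehot, alpha=0.2, rng=None):
    """Mixup: convex-combine random pairs of samples and their one-hot
    labels with a Beta(alpha, alpha) weight. No new samples are stored;
    each training batch is mixed freshly, which is why Mixup is
    considered (real-time) data augmentation despite the count
    staying constant."""
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return X_mix, y_mix

X = np.random.rand(128, 32, 32, 3)
y = np.eye(10)[np.random.randint(0, 10, size=128)]
X_mix, y_mix = mixup_batch(X, y, rng=0)
print(X_mix.shape)             # (128, 32, 32, 3): batch size unchanged
```

So the contrast in the question is real: classic offline augmentation grows the stored dataset (4000 → 8000), while Mixup enlarges the *effective* training distribution without adding rows.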
