Methods for augmenting binary datasets

I have a small dataset (~100 samples) with roughly 20 features, most of which are binary and a few (~5) numeric. I want to augment the training set and see if I can get better test accuracy. What methods/code can I use for augmenting binary datasets?
Category: Data Science
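Since the features are mixed binary/numeric and the setting is supervised, one option is SMOTENC from imbalanced-learn, which interpolates the numeric columns but takes the nearest-neighbour majority value for the categorical ones, so synthetic rows keep valid 0/1 entries. A minimal sketch, assuming (hypothetically) that the first 15 columns are binary and using a sampling_strategy dict that roughly doubles each class:

    import numpy as np
    from collections import Counter
    from imblearn.over_sampling import SMOTENC

    # toy stand-in for the real data: 15 binary columns + 5 numeric ones
    rng = np.random.default_rng(0)
    X = np.hstack([rng.integers(0, 2, (100, 15)), rng.normal(size=(100, 5))])
    y = rng.integers(0, 2, 100)

    binary_cols = list(range(15))                            # indices of the binary features
    target = {cls: 2 * n for cls, n in Counter(y).items()}   # ~double each class

    sm = SMOTENC(categorical_features=binary_cols,
                 sampling_strategy=target, random_state=0)
    X_aug, y_aug = sm.fit_resample(X, y)
    print(X.shape, '->', X_aug.shape)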

What's the order for applying a SMOTE transformation in a pipeline?

Here's the thing: I have imbalanced data and was thinking about using the SMOTE transformation. However, when doing that in a sklearn pipeline, I get an error because of missing values. This is my code:

    from sklearn.pipeline import Pipeline

    # VARIABLE SELECTION
    categorical_features = ["MARRIED", "RACE"]
    continuous_features = ["AGE", "SALARY"]
    features = ["MARRIED", "RACE", "AGE", "SALARY"]

    # PIPELINE
    continuous_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("scaler", StandardScaler()),
        ]
    )
    categorical_transformer = Pipeline(
        steps=[ …
Category: Data Science
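The missing-value error usually means SMOTE runs before imputation; SMOTE cannot handle NaNs. A hedged sketch of one fix, reusing the question's column names but with an assumed classifier: put the ColumnTransformer (which imputes) before the SMOTE step, and use imbalanced-learn's Pipeline, since sklearn's cannot hold samplers:

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from imblearn.pipeline import Pipeline as ImbPipeline  # accepts samplers
    from imblearn.over_sampling import SMOTE

    continuous_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", continuous_transformer, ["AGE", "SALARY"]),
        ("cat", categorical_transformer, ["MARRIED", "RACE"]),
    ])

    model = ImbPipeline(steps=[
        ("preprocess", preprocess),        # imputation runs first, so SMOTE sees no NaNs
        ("smote", SMOTE(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),  # hypothetical classifier choice
    ])
    # model.fit(X_train, y_train)  # SMOTE is applied only during fit, never at predict time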

Optimizing decision threshold on model with oversampled/imbalanced data

I'm working on developing a model with a highly imbalanced dataset (0.7% minority class). To remedy the imbalance, I was going to oversample using algorithms from the imbalanced-learn library. I had a workflow in mind which I wanted to share and get an opinion on whether I'm heading in the right direction or maybe I missed something.

- Split Train/Test/Val
- Set up pipeline for GridSearch and optimize hyper-parameters (pipeline will only oversample training folds)
- Scoring metric will be AUC as training set is …
Category: Data Science
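One piece worth making explicit in such a workflow: tune the decision threshold on the untouched validation set, not on the oversampled training folds, since oversampling shifts the score distribution. A sketch, assuming a fitted pipeline model and held-out X_val/y_val, X_test/y_test (hypothetical names):

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    # probabilities on the validation set, which was never oversampled
    val_proba = model.predict_proba(X_val)[:, 1]

    precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    best = np.argmax(f1[:-1])          # the last PR point has no threshold
    threshold = thresholds[best]

    # apply the tuned threshold once, on the test set
    test_pred = (model.predict_proba(X_test)[:, 1] >= threshold).astype(int)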

Unbalanced data set - how to optimize hyperparams via grid search?

I would like to optimize the hyperparameters C and gamma of an SVC using grid search for an unbalanced data set. So far I have used class_weight='balanced' and selected the best hyperparameters based on the average of the f1-scores. However, the data set is very unbalanced, i.e. if I choose GridSearchCV with cv=10, then some minority classes are not represented in the validation data. I'm thinking of using SMOTE, but I see the problem here that I would have …
Category: Data Science
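One way to combine the two ideas, sketched under the question's setup (SVC, C and gamma): keep SMOTE inside an imbalanced-learn pipeline so it only touches the training folds, and use StratifiedKFold so every validation fold contains minority samples. The grid values here are illustrative:

    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE

    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),  # applied to the training folds only
        ("svc", SVC()),                    # SMOTE here replaces class_weight='balanced'
    ])
    param_grid = {
        "svc__C": [0.1, 1, 10, 100],
        "svc__gamma": [0.001, 0.01, 0.1, 1],
    }
    # stratification keeps each class's ratio in every fold; with tiny classes,
    # SMOTE's k_neighbors may also need lowering
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=cv)
    # search.fit(X, y)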

class imbalance - applied SMOTE - next steps

I am new to ML and have learnt a lot from your valuable posts. I need your advice on the following situation and guidance on whether these steps make sense. I have a binary classification problem; my dataset has a severe imbalance: approximately 2% positive cases (4,000 cases) out of a total of 200,000 cases. I separated my dataset into a train and a test set (80/20 stratified split). My train set now has a total of 160,000 cases (3,200 positive cases) and test …
Category: Data Science
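For the resampling step itself, the safe pattern is to fit SMOTE on the training split only and leave the test split untouched. A sketch with hypothetical names, assuming the 80/20 stratified split from the question has already been made:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    sm = SMOTE(random_state=0)
    X_train_res, y_train_res = sm.fit_resample(X_train, y_train)  # train only
    print(Counter(y_train), "->", Counter(y_train_res))

    clf = RandomForestClassifier(random_state=0)  # hypothetical model choice
    clf.fit(X_train_res, y_train_res)
    # the test set keeps the real ~2% prevalence, so its metrics stay honest
    print(classification_report(y_test, clf.predict(X_test)))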

Noise Elimination with majority vote filtering

I have a dataset with label noise which I want to clean with majority/consensus vote filtering. This means I will divide the data into K folds and train an ensemble of models. Then, using the predictions on the data, I will remove rows which are misclassified by most models (majority voting) or by all of them (consensus voting). I have a few questions to which I can't find the answers elsewhere:

- how to decide what models to use in the ensemble
- the dataset is very …
Category: Data Science
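On the mechanics (model choice aside; heterogeneous models with different biases are the usual recommendation), here is a minimal sketch of both filter variants, assuming a feature matrix X and noisy labels y:

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier

    def vote_filter(X, y, models, cv=5, consensus=False):
        # out-of-fold predictions: each row is predicted by models
        # that never saw it during training
        preds = np.array([cross_val_predict(m, X, y, cv=cv) for m in models])
        wrong = preds != y                          # shape (n_models, n_samples)
        if consensus:
            return wrong.all(axis=0)                # flagged by every model
        return wrong.sum(axis=0) > len(models) / 2  # flagged by a majority

    models = [LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=0),
              KNeighborsClassifier()]
    noisy = vote_filter(X, y, models)
    X_clean, y_clean = X[~noisy], y[~noisy]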

Solving multi-class imbalanced classification using SMOTE and OSS

I am trying to solve a multi-class imbalanced classification problem, using SMOTE for oversampling and OSS (One-Sided Selection) for under-sampling. But I have a doubt: since I am working on multi-class data, I have to convert it into binary classification, which can be done with OVA/OAA. So how can I use OVA/OAA with both under-sampling and oversampling on the same dataset?
Category: Data Science
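One hedged way to wire this up: let sklearn's OneVsRestClassifier do the OVA conversion. It clones its inner estimator once per class and fits each clone on binarized labels, so an imbalanced-learn pipeline holding both samplers gets resampled separately for every one-vs-all problem. A sketch with an assumed LinearSVC base model:

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import OneSidedSelection

    per_class = Pipeline([
        ("smote", SMOTE(random_state=0)),            # oversample the minority of each binary task
        ("oss", OneSidedSelection(random_state=0)),  # then clean the majority side
        ("clf", LinearSVC(max_iter=5000)),
    ])
    ova = OneVsRestClassifier(per_class)
    # ova.fit(X_train, y_train)  # y_train keeps the original multi-class labels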

Train score is much lower than test score, is that normal?

I am working on a very imbalanced dataset and used SMOTEENN (SMOTE + ENN) to rebalance it; the following test was made using a Random Forest classifier. My train and test scores before using SMOTEENN:

    print('Train Score: ', rf_clf.score(x_train, y_train))
    print('Test Score: ', rf_clf.score(x_test, y_test))
    Train Score: 0.92
    Test Score: 0.91

After using SMOTEENN:

    print('Train Score: ', rf_clf.score(x_train, y_train))
    print('Test Score: ', rf_clf.score(x_test, y_test))
    Train Score: 0.49
    Test Score: 0.85

Edit

    x_train, x_test, y_train, y_test = train_test_split(feats, targ, test_size=0.3, random_state=47)
    scaler = MinMaxScaler()
    scaler_x_train = scaler.fit_transform(x_train)
    scaler_x_test = scaler.transform(x_test)
    X …
Category: Data Science
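A likely explanation: after SMOTEENN, the training score is computed on the resampled (balanced) data, which is a different and harder distribution than the untouched test set. Scoring on the original training data makes the two numbers comparable again; a sketch reusing the question's variable names:

    from imblearn.combine import SMOTEENN

    sme = SMOTEENN(random_state=47)
    x_res, y_res = sme.fit_resample(scaler_x_train, y_train)  # resample train only
    rf_clf.fit(x_res, y_res)

    # score on the ORIGINAL (un-resampled) training data, so both numbers
    # are measured on the same distribution as the test set
    print('Train Score: ', rf_clf.score(scaler_x_train, y_train))
    print('Test Score: ', rf_clf.score(scaler_x_test, y_test))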

SMOTE for image dataset

I'm working on image augmentation with SMOTE. I'm confused about how SMOTE can be useful for an image dataset containing 5,955 images in four classes (2552, 227, 621, 2555). Could anyone please help me? It would be greatly appreciated!
Category: Data Science

Preferred approaches for imbalanced data

I am building a binary classification model with an imbalanced target variable (13% class 1 vs 87% class 0). I am considering the following three options to handle the imbalance:

Option 1: Create a balanced training dataset with a 50%/50% split of the target variable.
Option 2: Sample the dataset as-is (i.e., 87%/13% split) and use upsampling methods (e.g., SMOTE) to balance the target variable to a 50%/50% split.
Option 3: Use learning methods with appropriate …
Category: Data Science
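For Option 3, the cheapest baseline is usually cost-sensitive learning via class_weight, which reweights the loss instead of touching the data. A sketch, with the model choice being an assumption:

    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # 'balanced' sets each class weight to n_samples / (n_classes * class_count):
    # at an 87/13 split that is roughly 0.57 for class 0 and 3.85 for class 1
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    # or: clf = RandomForestClassifier(class_weight="balanced", random_state=0)
    # clf.fit(X_train, y_train)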

How does SMOTE work for dataset with only categorical variables?

I have a small dataset of 977 rows with a class proportion of 77:23. For the sake of metric improvement, I have kept my minority class ('default') as class 1 (and 'not default' as class 0). My input variables are categorical in nature. So, the below is what I tried (let's assume we don't have age and salary info):

a) Apply encoding like rare_encoding and ordinal_encoding to my dataset
b) Split into train and test sets (with stratify=y) …
Category: Data Science
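Worth noting for the all-categorical case: imbalanced-learn (>= 0.8) ships SMOTEN, a SMOTE variant for nominal-only data that replaces interpolation with a nearest-neighbour vote over category values, so no impossible in-between codes are created. A sketch, assuming the encoded training split from steps (a)/(b):

    from imblearn.over_sampling import SMOTEN

    sampler = SMOTEN(random_state=0)
    # X_train_enc / y_train are hypothetical names for the encoded training split
    X_train_res, y_train_res = sampler.fit_resample(X_train_enc, y_train)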

Why is SMOTE not used in prize-winning Kaggle solutions?

The Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method for tackling imbalanced datasets. There are many highly cited papers out there claiming that it is used to boost accuracy in unbalanced data scenarios. But when I look at Kaggle competitions, it is rarely used; to the best of my knowledge, there are no prize-winning Kaggle/ML competitions where it was used to achieve the best solution. Why is SMOTE not used in Kaggle? I even see applied research …
Category: Data Science

Why does class_weight usually outperform SMOTE?

I'm trying to figure out what exactly class_weight from sklearn does. When working with imbalanced datasets, I always use class_weight because the results are usually better than with SMOTE. However, I'm not sure why. I've tried to find an answer, but most of the answers regarding the subject are vague. For instance, the first answer here explains class_weight in a way that looks similar to SMOTE. This and this also didn't provide an answer. I read once that SMOTE is used …
Category: Data Science
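The mechanical difference is small but real: class_weight rescales each sample's contribution to the loss, while SMOTE adds synthetic rows that the model treats as real data. A tiny sketch of what 'balanced' actually computes:

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    y = np.array([0] * 90 + [1] * 10)  # toy 90/10 imbalance
    weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
    print(weights)  # [0.5556 5.0]: an error on class 1 costs 9x more
    # class_weight changes only the loss; SMOTE changes the data instead,
    # inserting interpolated minority rows between existing neighbours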

Follow up question regarding Upsampling for Imbalanced Data and the use of ADASYN instead of SMOTE

I have a follow-up question regarding this topic. I have been working on a project predicting success (1) or failure (0) for organizations using the Decision Tree and Random Forest algorithms. My dataset has a minority class of successes which I would like to upsample using SMOTE or ADASYN. I understand that the reasoning mentioned in this post applies to SMOTE and to random upsampling by duplication, but does this also apply to upsampling via ADASYN? As I understand, ADASYN introduces even …
Category: Data Science
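For the mechanics, both samplers share an interface, so they are easy to compare side by side; ADASYN skews generation toward minority points whose neighbourhoods are dominated by the majority class. A sketch with hypothetical train-split names:

    from collections import Counter
    from imblearn.over_sampling import SMOTE, ADASYN

    smote = SMOTE(random_state=0)
    adasyn = ADASYN(random_state=0)  # density-adaptive: more synthetics near the boundary

    X_sm, y_sm = smote.fit_resample(X_train, y_train)
    X_ad, y_ad = adasyn.fit_resample(X_train, y_train)
    print(Counter(y_sm), Counter(y_ad))  # ADASYN's counts are only approximately balanced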

Is it good practice to use SMOTE on an imbalanced data set when using a BERT model for text classification?

I had a question related to SMOTE. If you have a data set that is imbalanced, is it correct to use SMOTE when you are using BERT? I believe I read somewhere that you do not need to do this since BERT takes this into account, but I'm unable to find the article where I read that. Either from your own research or experience, would you say that oversampling with SMOTE (or some other algorithm) is useful when classifying using …
Category: Data Science
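One practical note: SMOTE interpolates feature vectors, so it cannot be applied to raw token sequences; a common alternative when fine-tuning BERT is to reweight the loss. A hedged PyTorch sketch, where train_labels is a hypothetical array of integer class labels:

    import numpy as np
    import torch
    from sklearn.utils.class_weight import compute_class_weight

    weights = compute_class_weight("balanced",
                                   classes=np.unique(train_labels), y=train_labels)
    loss_fn = torch.nn.CrossEntropyLoss(
        weight=torch.tensor(weights, dtype=torch.float))
    # use loss_fn(logits, labels) in the fine-tuning loop instead of resampling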

Train/Test Split after performing SMOTE

I am dealing with a highly unbalanced dataset, so I used SMOTE to resample it. After SMOTE resampling, I split the resampled dataset into training/test sets, using the training set to build a model and the test set to evaluate it. However, I am worried that some data points in the test set might actually be jittered copies of data points in the training set (i.e. information is leaking from the training set into the test set), so the test …
Category: Data Science
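The worry is justified: synthetic test points interpolated from training points inflate the scores. The standard fix is to split first and resample only the training portion, sketched here with hypothetical names:

    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # leaky order (what the question describes):
    #   X_res, y_res = SMOTE().fit_resample(X, y)
    #   ... then train_test_split(X_res, y_res)
    #   -> some test points are jittered copies of training points

    # safe order: split first, resample the training split only
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
    # X_test / y_test now contain only real, untouched samples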

SMOTE for multi-class balance changes the shape of my dataset

So I have a dataset of shape (430, 17) that consists of 13 (imbalanced) classes and 17 features. The end goal is to create a NN, which by the way works when I import the imbalanced dataset; however, when I try to over-sample the minority classes using SMOTE in a Jupyter notebook, the classes do get balanced but the shape changes too.

    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import OneHotEncoder
    from imblearn.pipeline import Pipeline

    steps = [('onehot', OneHotEncoder()), ('smt', SMOTE())]
    pipeline = Pipeline(steps=steps)
    X_res, …
Category: Data Science
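Both shape changes are expected rather than a bug: SMOTE adds rows until all 13 classes match the majority count, and OneHotEncoder widens the 17 features into one column per distinct value. A sketch of the question's pipeline making both effects visible, assuming X and y hold the (430, 17) data and its labels:

    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import OneHotEncoder

    steps = [('onehot', OneHotEncoder()), ('smt', SMOTE())]
    pipeline = Pipeline(steps=steps)
    X_res, y_res = pipeline.fit_resample(X, y)

    # rows grow to 13 * majority_count; columns grow with the one-hot expansion
    print(X.shape, '->', X_res.shape)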
