SMOTE vs SMOTE-NC for binary classifier with categorical and numeric data

I am using XGBoost for classification. My y is 0 or 1 (true or false). I have categorical and numeric features, so in theory I should use SMOTE-NC instead of SMOTE. However, I get better results with SMOTE. Could anyone explain why this happens? Also, if I use an encoder (BinaryEncoder, one-hot, etc.) for the categorical data, do I need to apply SMOTE-NC after encoding, or before? I copied my example code (x and y are after cleaning, include …
Category: Data Science
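
On the encoding-order part: SMOTE-NC is designed to see the raw categorical columns, so it is normally applied before any encoder, and the encoding happens on the resampled output. A minimal sketch on a toy DataFrame (column names and values are mine, not from the question):

import pandas as pd
from imblearn.over_sampling import SMOTENC

# Toy data: two numeric columns plus one raw categorical column (hypothetical).
X = pd.DataFrame({
    "age":    [23, 35, 41, 52, 29, 60, 33, 47],
    "income": [40, 55, 62, 80, 45, 90, 50, 70],
    "city":   ["A", "B", "A", "C", "B", "A", "C", "B"],
})
y = [0, 0, 0, 0, 0, 0, 1, 1]

# SMOTE-NC takes the positions of the raw categorical columns; it interpolates
# the numeric features and picks the most frequent neighbour category for "city".
smote_nc = SMOTENC(categorical_features=[2], k_neighbors=1, random_state=0)
X_res, y_res = smote_nc.fit_resample(X, y)

# Encode categoricals only after resampling.
X_encoded = pd.get_dummies(X_res, columns=["city"])

As for SMOTE scoring better: one frequent explanation (a hypothesis, not a guarantee) is that after one-hot encoding, plain SMOTE produces fractional dummy values that a tree ensemble like XGBoost can still split on, so the category-validity constraints SMOTE-NC enforces do not necessarily translate into better metrics.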

Balancing the dataset using imblearn undersampling, oversampling and combine?

I have the imbalanced dataset:

data['Class'].value_counts()
Out[22]:
0    137757
1      4905
Name: Class, dtype: int64

X_train, X_valid, y_train, y_valid = train_test_split(input_x, input_y, test_size=0.20, random_state=seed)
print(sorted(Counter(y_train).items()))
[(0, 110215), (1, 3914)]

I tried different imblearn functions:

from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour, EditedNearestNeighbours, RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN, InstanceHardnessThreshold, NeighbourhoodCleaningRule, TomekLinks

smote_enn = SMOTEENN(random_state=27)
smote_tomek = SMOTETomek(random_state=27)
adasyn = ADASYN(random_state=27)
borderline = BorderlineSMOTE(random_state=27)
ran_oversample = RandomOverSampler(random_state=27)
smote = SMOTE(random_state=27)
cnn = CondensedNearestNeighbour(random_state=27) …
Category: Data Science
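
A compact way to compare these samplers is to loop over them and inspect the class counts after fit_resample. A sketch on synthetic data with a similar (~3%) minority share; the dataset and the sampler shortlist are my choices, not the asker's:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import TomekLinks

# Synthetic stand-in for the ~137757:4905 imbalance in the question.
X, y = make_classification(n_samples=20000, weights=[0.97, 0.03], random_state=27)

samplers = [
    ("SMOTE", SMOTE(random_state=27)),
    ("RandomOverSampler", RandomOverSampler(random_state=27)),
    ("SMOTEENN", SMOTEENN(random_state=27)),
    ("SMOTETomek", SMOTETomek(random_state=27)),
    ("TomekLinks", TomekLinks()),
]
for name, sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, sorted(Counter(y_res).items()))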

SMOTE and oversampling with constraints

I'm trying to apply SMOTE to a dataset that has time constraints. I have information about users visiting a website. Some features have time constraints between them, e.g. the first visit and the last visit to the website: the first-visit timestamp is always less than or equal to the last-visit one. If I apply SMOTE (or SMOTE-NC for categorical features), I end up with synthetic samples in which the last visit occurred before the first visit. This leads to a sample that …
Category: Data Science
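
One common workaround (my suggestion, not something stated in the question) is to reparameterize the constrained pair as (first_visit, duration) with duration = last_visit - first_visit >= 0. SMOTE generates points as convex combinations of real minority samples, so a non-negative duration column stays non-negative and the reconstructed last visit can never precede the first. A sketch:

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
first = rng.uniform(0, 100, 200)
last = first + rng.uniform(0, 10, 200)   # last >= first by construction
y = np.r_[np.zeros(180, dtype=int), np.ones(20, dtype=int)]

# Encode the pair as (first, duration) so the constraint becomes duration >= 0.
X = np.c_[first, last - first]
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Reconstruct last_visit; the ordering constraint now holds for every sample.
first_res, last_res = X_res[:, 0], X_res[:, 0] + X_res[:, 1]
assert (last_res >= first_res).all()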

SMOTE-NC does not help to oversample my mixed continuous/categorical dataset

When I use SMOTE-NC to oversample three classes of a 4-class classification problem, the precision, recall, and F1 metrics for the minority classes are still VERY low (~3%). I have 32 categorical and 30 continuous variables in my dataset. All the categorical variables have been converted to binary columns using one-hot encoding. Also, before the oversampling step, I impute all missing values using IterativeImputer. As for classifiers, I am using logistic regression, random forest, and XGBoost. May I …
Category: Data Science
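
One thing worth checking (my assumption, since the code is truncated): SMOTE-NC expects the raw categorical columns, not their one-hot expansion. Once each category is its own binary column, SMOTE-NC treats the dummies as unrelated features and its categorical handling is largely defeated. A sketch of the alternative ordering, with hypothetical column names and a 4-class target:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "cont1": rng.normal(size=n),
    "cont2": rng.normal(size=n),
    "cat1":  rng.choice(["a", "b", "c"], n),
    "cat2":  rng.choice(["x", "y"], n),
})
y = rng.choice([0, 1, 2, 3], n, p=[0.7, 0.1, 0.1, 0.1])

# Keep each categorical as a single label-encoded column for SMOTE-NC;
# one-hot encode only after resampling (imputation still belongs before this).
X[["cat1", "cat2"]] = OrdinalEncoder().fit_transform(X[["cat1", "cat2"]])
smote_nc = SMOTENC(categorical_features=[2, 3], random_state=1)
X_res, y_res = smote_nc.fit_resample(X, y)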

Using SMOTENC in a pipeline

I am trying to figure out the appropriate way to build a pipeline to train a model that includes the SMOTENC algorithm. Given that the k-nearest-neighbors algorithm and Euclidean distance are used, should the data be normalized (input vectors scaled individually to unit norm) prior to applying SMOTENC in the pipeline? Can the algorithm handle missing values? If data imputation and outlier removal based on median and percentile values are performed prior to SMOTENC rather than after it, …
Category: Data Science
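
On the ordering questions, a sketch of one commonly used arrangement (column positions and the classifier are assumptions): imputation comes first because SMOTE-NC raises an error on missing values, the sampler sits in an imblearn Pipeline so it runs only at fit time and never on validation data, and scaling follows resampling. Scaling before or after resampling changes the neighbour distances, so both orders are defensible; this shows one of them:

from imblearn.pipeline import Pipeline           # not sklearn's Pipeline:
from imblearn.over_sampling import SMOTENC       # samplers need this variant
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline(steps=[
    # SMOTE-NC cannot handle NaN, so impute first.
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Hypothetical layout: columns 3 and 4 hold the (already ordinal) categoricals.
    ("smote_nc", SMOTENC(categorical_features=[3, 4], random_state=0)),
    # Scaling everything also rescales the categorical codes; a ColumnTransformer
    # could restrict this step to the numeric columns instead.
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# pipe.fit(X_train, y_train)  # resampling is applied to the training data only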

SMOTE for regression

I am looking into upsampling an imbalanced dataset for a regression problem (numerical target variable) in Python. I have attached a paper and an R package that implement SMOTE for regression; can anyone recommend a similar package in Python? Otherwise, what other methods can be used to upsample the numerical target variable?
SMOTE for Regression
smoteRegress: SMOTE algorithm for imbalanced regression problems
Update: I found the following Python library, which implements the Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise: smogn
Category: Data Science
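
A minimal sketch of the smogn package mentioned in the update (the dataset and subsampling are my choices; smogn.smoter is its documented entry point and rebalances rare regions of a continuous target, adding Gaussian noise to the synthetic rows):

import smogn
from sklearn.datasets import fetch_california_housing

# Small subsample: smogn can be slow, and it expects a default integer index.
housing = (fetch_california_housing(as_frame=True)
           .frame.sample(2000, random_state=0)
           .reset_index(drop=True))

# y names the continuous target column to rebalance.
housing_res = smogn.smoter(data=housing, y="MedHouseVal")
print(len(housing), "->", len(housing_res))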

How to use SMOTENC inside the Pipeline?

I would greatly appreciate it if you could let me know how to use SMOTENC. I wrote:

num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))

pipeline = Pipeline(steps=[
    # Categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices1)),
        # Numeric features
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices1)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])

Therefore, as indicated, I have 5 categorical features. In reality, indices 123 to 160 belong to one categorical feature with 37 possible values, which was converted into 37 …
Category: Data Science
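
A sketch of one way to restructure this (all of it assumed, since the full setup is truncated: the imblearn Pipeline replaces sklearn's so a sampler is allowed, the 37 one-hot columns are collapsed back to a single raw categorical column before resampling, a ColumnTransformer does the encoding and scaling afterwards, and RandomForestClassifier stands in for rg):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTENC
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Illustrative layout: 5 raw categorical positions, the rest numeric.
cat_idx = [94, 96, 98, 99, 123]
num_idx = [i for i in range(124) if i not in cat_idx]

pipeline = Pipeline(steps=[
    # Resample on raw columns so SMOTE-NC sees whole categories, not dummies.
    ("smote_nc", SMOTENC(categorical_features=cat_idx, random_state=27)),
    # Encode and scale only after resampling.
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_idx),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_idx),
    ])),
    ("clf", RandomForestClassifier(random_state=27)),
])
# pipeline.fit(X_train, y_train)  # X_train must still contain raw categoricals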

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques, and Linux security.