SMOTE vs SMOTE-NC for binary classifier with categorical and numeric data

I am using XGBoost for classification. My y is 0 or 1 (true or false). I have categorical and numeric features, so in theory I should use SMOTE-NC instead of SMOTE. However, I get better results with SMOTE. Could anyone explain why this happens? Also, if I use an encoder (BinaryEncoder, one-hot, etc.) for the categorical data, do I need to apply SMOTE-NC after encoding, or before? I copied my example code (x and y are after cleaning, include …
Category: Data Science
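
On the encoding-order part: SMOTE-NC is designed to see the raw categorical columns, so it is normally applied before any encoder, and the encoding happens on the resampled output. A minimal sketch on a toy DataFrame (column names and values are mine, not from the question):

import pandas as pd
from imblearn.over_sampling import SMOTENC

# Toy data: two numeric columns plus one raw categorical column (hypothetical).
X = pd.DataFrame({
    "age":    [23, 35, 41, 52, 29, 60, 33, 47],
    "income": [40, 55, 62, 80, 45, 90, 50, 70],
    "city":   ["A", "B", "A", "C", "B", "A", "C", "B"],
})
y = [0, 0, 0, 0, 0, 0, 1, 1]

# SMOTE-NC takes the positions of the raw categorical columns; it interpolates
# the numeric features and picks the most frequent neighbour category for "city".
smote_nc = SMOTENC(categorical_features=[2], k_neighbors=1, random_state=0)
X_res, y_res = smote_nc.fit_resample(X, y)

# Encode categoricals only after resampling.
X_encoded = pd.get_dummies(X_res, columns=["city"])

As for SMOTE scoring better: one frequent explanation (a hypothesis, not a guarantee) is that after one-hot encoding, plain SMOTE produces fractional dummy values that a tree ensemble like XGBoost can still split on, so the category-validity constraints SMOTE-NC enforces do not necessarily translate into better metrics.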

Balancing the dataset using imblearn undersampling, oversampling and combine?

I have the imbalanced dataset:

data['Class'].value_counts()
Out[22]:
0    137757
1      4905
Name: Class, dtype: int64

X_train, X_valid, y_train, y_valid = train_test_split(input_x, input_y, test_size=0.20, random_state=seed)
print(sorted(Counter(y_train).items()))
[(0, 110215), (1, 3914)]

I tried different imblearn functions:

from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour, EditedNearestNeighbours, RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN, InstanceHardnessThreshold, NeighbourhoodCleaningRule, TomekLinks

smote_enn = SMOTEENN(random_state=27)
smote_tomek = SMOTETomek(random_state=27)
adasyn = ADASYN(random_state=27)
borderline = BorderlineSMOTE(random_state=27)
ran_oversample = RandomOverSampler(random_state=27)
smote = SMOTE(random_state=27)
cnn = CondensedNearestNeighbour(random_state=27) …
Category: Data Science
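
A compact way to compare these samplers is to loop over them and inspect the class counts after fit_resample. A sketch on synthetic data with a similar (~3%) minority share; the dataset and the sampler shortlist are my choices, not the asker's:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import TomekLinks

# Synthetic stand-in for the ~137757:4905 imbalance in the question.
X, y = make_classification(n_samples=20000, weights=[0.97, 0.03], random_state=27)

samplers = [
    ("SMOTE", SMOTE(random_state=27)),
    ("RandomOverSampler", RandomOverSampler(random_state=27)),
    ("SMOTEENN", SMOTEENN(random_state=27)),
    ("SMOTETomek", SMOTETomek(random_state=27)),
    ("TomekLinks", TomekLinks()),
]
for name, sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, sorted(Counter(y_res).items()))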

SMOTE and oversampling with constraints

I'm trying to apply SMOTE to a dataset that has time constraints. I have information about users visiting a website. Some features have time constraints between them, e.g. the first visit and the last visit to the website: the first-visit timestamp is always less than or equal to the last-visit one. If I apply SMOTE (or SMOTE-NC for categorical features), I end up with synthetic samples in which the last visit occurred before the first visit. This leads to a sample that …
Category: Data Science
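
One common workaround (my suggestion, not something stated in the question) is to reparameterize the constrained pair as (first_visit, duration) with duration = last_visit - first_visit >= 0. SMOTE generates points as convex combinations of real minority samples, so a non-negative duration column stays non-negative and the reconstructed last visit can never precede the first. A sketch:

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
first = rng.uniform(0, 100, 200)
last = first + rng.uniform(0, 10, 200)   # last >= first by construction
y = np.r_[np.zeros(180, dtype=int), np.ones(20, dtype=int)]

# Encode the pair as (first, duration) so the constraint becomes duration >= 0.
X = np.c_[first, last - first]
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Reconstruct last_visit; the ordering constraint now holds for every sample.
first_res, last_res = X_res[:, 0], X_res[:, 0] + X_res[:, 1]
assert (last_res >= first_res).all()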

SMOTE-NC does not help to oversample my mixed continuous/categorical dataset

When I use SMOTE-NC to oversample three classes of a 4-class classification problem, the precision, recall, and F1 metrics for the minority classes are still VERY low (~3%). I have 32 categorical and 30 continuous variables in my dataset. All the categorical variables have been converted to binary columns using one-hot encoding. Also, before the oversampling step, I impute all missing values using IterativeImputer. As for classifiers, I am using logistic regression, random forest, and XGBoost. May I …
Category: Data Science
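
One thing worth checking (my assumption, since the code is truncated): SMOTE-NC expects the raw categorical columns, not their one-hot expansion. Once each category is its own binary column, SMOTE-NC treats the dummies as unrelated features and its categorical handling is largely defeated. A sketch of the alternative ordering, with hypothetical column names and a 4-class target:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "cont1": rng.normal(size=n),
    "cont2": rng.normal(size=n),
    "cat1":  rng.choice(["a", "b", "c"], n),
    "cat2":  rng.choice(["x", "y"], n),
})
y = rng.choice([0, 1, 2, 3], n, p=[0.7, 0.1, 0.1, 0.1])

# Keep each categorical as a single label-encoded column for SMOTE-NC;
# one-hot encode only after resampling (imputation still belongs before this).
X[["cat1", "cat2"]] = OrdinalEncoder().fit_transform(X[["cat1", "cat2"]])
smote_nc = SMOTENC(categorical_features=[2, 3], random_state=1)
X_res, y_res = smote_nc.fit_resample(X, y)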

Using SMOTENC in a pipeline

I am trying to figure out the appropriate way to build a pipeline to train a model that includes the SMOTENC algorithm. Given that the k-nearest-neighbors algorithm and Euclidean distance are used, should the data be normalized (input vectors scaled individually to unit norm) prior to applying SMOTENC in the pipeline? Can the algorithm handle missing values? If data imputation and outlier removal based on median and percentile values are performed prior to SMOTENC rather than after it, …
Category: Data Science
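
On the ordering questions, a sketch of one commonly used arrangement (column positions and the classifier are assumptions): imputation comes first because SMOTE-NC raises an error on missing values, the sampler sits in an imblearn Pipeline so it runs only at fit time and never on validation data, and scaling follows resampling. Scaling before or after resampling changes the neighbour distances, so both orders are defensible; this shows one of them:

from imblearn.pipeline import Pipeline           # not sklearn's Pipeline:
from imblearn.over_sampling import SMOTENC       # samplers need this variant
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline(steps=[
    # SMOTE-NC cannot handle NaN, so impute first.
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Hypothetical layout: columns 3 and 4 hold the (already ordinal) categoricals.
    ("smote_nc", SMOTENC(categorical_features=[3, 4], random_state=0)),
    # Scaling everything also rescales the categorical codes; a ColumnTransformer
    # could restrict this step to the numeric columns instead.
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# pipe.fit(X_train, y_train)  # resampling is applied to the training data only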

SMOTE for regression

I am looking into upsampling an imbalanced dataset for a regression problem (numerical target variable) in Python. I have attached a paper and an R package that implement SMOTE for regression; can anyone recommend a similar package in Python? Otherwise, what other methods can be used to upsample the numerical target variable?
SMOTE for Regression
smoteRegress: SMOTE algorithm for imbalanced regression problems
Update: I found the following Python library, which implements the Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise: smogn
Category: Data Science
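
A minimal sketch of the smogn package mentioned in the update (the dataset and subsampling are my choices; smogn.smoter is its documented entry point and rebalances rare regions of a continuous target, adding Gaussian noise to the synthetic rows):

import smogn
from sklearn.datasets import fetch_california_housing

# Small subsample: smogn can be slow, and it expects a default integer index.
housing = (fetch_california_housing(as_frame=True)
           .frame.sample(2000, random_state=0)
           .reset_index(drop=True))

# y names the continuous target column to rebalance.
housing_res = smogn.smoter(data=housing, y="MedHouseVal")
print(len(housing), "->", len(housing_res))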

How to use SMOTENC inside the Pipeline?

I would greatly appreciate it if you could let me know how to use SMOTENC. I wrote:

num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))

pipeline = Pipeline(steps=[
    # Categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices1)),
        # Numeric features
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices1)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])

Therefore, as indicated, I have 5 categorical features. In reality, indices 123 to 160 belong to one categorical feature with 37 possible values, which was converted into 37 …
Category: Data Science
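
A sketch of one way to restructure this (all of it assumed, since the full setup is truncated: the imblearn Pipeline replaces sklearn's so a sampler is allowed, the 37 one-hot columns are collapsed back to a single raw categorical column before resampling, a ColumnTransformer does the encoding and scaling afterwards, and RandomForestClassifier stands in for rg):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTENC
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Illustrative layout: 5 raw categorical positions, the rest numeric.
cat_idx = [94, 96, 98, 99, 123]
num_idx = [i for i in range(124) if i not in cat_idx]

pipeline = Pipeline(steps=[
    # Resample on raw columns so SMOTE-NC sees whole categories, not dummies.
    ("smote_nc", SMOTENC(categorical_features=cat_idx, random_state=27)),
    # Encode and scale only after resampling.
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_idx),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_idx),
    ])),
    ("clf", RandomForestClassifier(random_state=27)),
])
# pipeline.fit(X_train, y_train)  # X_train must still contain raw categoricals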

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques, and Linux security.