I am using XGBoost for classification. My y is 0 or 1 (true or false). I have both categorical and numeric features, so in theory I should use SMOTE-NC instead of SMOTE. However, I get better results with SMOTE. Could anyone explain why this is happening? Also, if I use an encoder (BinaryEncoder, one-hot, etc.) for the categorical data, should I apply SMOTE-NC after encoding, or before? I copied my example code (X and y are after cleaning, include …
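A toy sketch of the underlying difference (hand-rolled, not the imbalanced-learn implementation): plain SMOTE linearly interpolates every column, so a label-encoded categorical can land on a fractional, non-existent code, while a SMOTE-NC-style step instead copies the neighbours' majority category. This is also why SMOTE-NC is meant to run on the raw categorical columns, before any encoding.

```python
import random
from collections import Counter

random.seed(0)

def smote_style_sample(a, b):
    """Interpolate every column, SMOTE-style (wrong for categorical codes)."""
    t = random.random()
    return [a[i] + t * (b[i] - a[i]) for i in range(len(a))]

def smotenc_style_sample(a, b, neighbours_cat, cat_idx):
    """Interpolate numeric columns; take the neighbours' modal category."""
    s = smote_style_sample(a, b)
    s[cat_idx] = Counter(neighbours_cat).most_common(1)[0][0]
    return s

# Hypothetical rows: [numeric value, label-encoded categorical code].
a, b = [1.0, 0], [2.0, 2]
plain = smote_style_sample(a, b)
nc = smotenc_style_sample(a, b, neighbours_cat=[0, 0, 2], cat_idx=1)

print(plain[1])  # fractional code between 0 and 2 -> not a real category
print(nc[1])     # always one of the observed categories
```

The fractional code produced by plain interpolation may still "help" a tree model numerically, which is one plausible reason SMOTE can score better despite being formally wrong for categoricals.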
I'm trying to apply SMOTE to a dataset that has time constraints. I have information about users visiting a website. Some features carry time constraints, e.g. the first visit and the last visit to the website: the first visit (timestamp) is always less than or equal to the last visit. If I apply SMOTE (or SMOTE-NC for categoricals), I end up with synthetic samples in which the last visit occurred before the first visit. This leads to a sample that …
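One pragmatic option is to repair the synthetic rows after resampling so the ordering constraint holds again, e.g. by swapping the two timestamps whenever a generated sample violates it. A minimal sketch (the function name and the (first, last) tuple layout are assumptions for illustration):

```python
def repair_visit_order(rows):
    """Enforce first_visit <= last_visit on (first, last) timestamp pairs
    by swapping the two values when a synthetic sample violates the order."""
    fixed = []
    for first, last in rows:
        if first > last:
            first, last = last, first
        fixed.append((first, last))
    return fixed

# Second synthetic row violates the constraint (last visit before first).
synthetic = [(100, 250), (300, 120)]
print(repair_visit_order(synthetic))  # [(100, 250), (120, 300)]
```

An alternative with the same spirit is to resample on derived features that are constraint-free by construction (e.g. first visit plus a non-negative duration) and reconstruct the timestamps afterwards.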
When I use SMOTE-NC to oversample three classes of a 4-class classification problem, the precision, recall, and F1 metrics for the minority classes are still VERY low (~3%). I have 32 categorical and 30 continuous variables in my dataset. All the categorical variables have been converted to binary columns using one-hot encoding. Also, before the oversampling step, I impute all missing values using IterativeImputer. As for classifiers, I am using logistic regression, random forest, and XGBoost. May I …
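When all three metrics sit around 3%, it can help to recompute them by hand from the confusion counts to confirm the classifier is almost never predicting the minority class at all (as opposed to a scoring bug). A minimal sketch with toy counts (the numbers are made up for illustration):

```python
def prf(tp, fp, fn):
    """Per-class precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy minority class: 2 hits, 30 false alarms, 40 misses ->
# every metric lands in the low single-digit percent range.
print(prf(2, 30, 40))
```

If the counts confirm the class is essentially never predicted, the issue usually lies upstream of the metric (e.g. the one-hot-then-SMOTE-NC ordering, or the decision threshold), not in the scorer.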
I am trying to figure out the appropriate way to build a pipeline to train a model that includes the SMOTENC algorithm: Given that a k-nearest-neighbors search with Euclidean distance is used, should the data be normalized (scale input vectors individually to unit norm) prior to applying SMOTENC in the pipeline? Can the algorithm handle missing values? If data imputation and outlier removal based on median and percentile values are performed prior to SMOTENC rather than after it, …
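On the ordering question, a common line of reasoning is: the k-NN step cannot compute Euclidean distances over missing values, so imputation must come before SMOTENC, and unscaled features with large ranges would dominate the distance, so scaling also belongs before it. A stdlib-only sketch of those two preparatory steps in that order (simplified stand-ins, not the scikit-learn transformers):

```python
import math

def median_impute(col):
    """Fill NaNs with the column median; the k-NN search inside SMOTENC
    cannot tolerate missing values, so imputation must come first."""
    observed = sorted(v for v in col if not math.isnan(v))
    n = len(observed)
    med = (observed[n // 2] if n % 2 else
           (observed[n // 2 - 1] + observed[n // 2]) / 2)
    return [med if math.isnan(v) else v for v in col]

def min_max_scale(col):
    """Scale to [0, 1] so no single feature dominates Euclidean distance."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]

# Order sketched here: impute -> scale -> (then oversample -> fit model).
raw = [1.0, float("nan"), 3.0, 5.0]
print(min_max_scale(median_impute(raw)))  # [0.0, 0.5, 0.5, 1.0]
```

In a real imbalanced-learn pipeline the same ordering would apply, with the resampler placed after the preprocessing steps and before the classifier.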
I am looking into upsampling an imbalanced dataset for a regression problem (numerical target variable) in Python. I attached a paper and an R package that implement SMOTE for regression; can anyone recommend a similar package in Python? Otherwise, what other methods can be used to upsample a numerical target variable? SMOTE for Regression smoteRegress: SMOTE algorithm for imbalanced regression problems Update: I found the following Python library, which implements Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise: smogn
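The core idea behind the Gaussian-noise variant can be sketched in a few lines: duplicate rows whose target lies in the rare region and jitter both features and target with small Gaussian noise. This is a simplified stand-in for illustration, not the smogn library's actual algorithm or API:

```python
import random

def gaussian_oversample(X, y, rare_mask, n_new, sigma=0.05, seed=0):
    """Duplicate 'rare-target' rows and perturb features and target with
    Gaussian noise (SMOGN-like idea, heavily simplified)."""
    rng = random.Random(seed)
    rare = [(x, t) for x, t, m in zip(X, y, rare_mask) if m]
    X_new, y_new = list(X), list(y)
    for _ in range(n_new):
        x, t = rng.choice(rare)
        X_new.append([v + rng.gauss(0, sigma) for v in x])
        y_new.append(t + rng.gauss(0, sigma))
    return X_new, y_new

X = [[0.1], [0.2], [0.9]]
y = [1.0, 1.1, 5.0]              # 5.0 is the rare, extreme target
X2, y2 = gaussian_oversample(X, y, rare_mask=[False, False, True], n_new=4)
print(len(X2), len(y2))          # 7 7
```

Deciding which targets count as "rare" is the real design choice; the R smoteRegress work does this with a relevance function over the target distribution, whereas the mask above is supplied by hand.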
I would greatly appreciate it if you could let me know how to use SMOTENC. I wrote:

num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))

pipeline = Pipeline(steps=[
    # categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices1)),
        # numeric features
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices1)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])

As indicated, I have 5 categorical features. Actually, indices 123 to 160 correspond to one categorical feature with 37 possible values, which is converted into 37 …
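Since SMOTENC's `categorical_features` argument expects each categorical as a single raw column (not its one-hot expansion), one option when a 37-value feature already occupies 37 binary columns is to collapse that block back into one integer code before resampling. A hypothetical helper sketching the idea on a plain list row:

```python
def collapse_one_hot(row, start, stop):
    """Replace the one-hot block row[start:stop] with a single integer code
    (the position of the active column), since SMOTENC expects one raw
    categorical column rather than its one-hot expansion."""
    block = row[start:stop]
    code = block.index(1)            # position of the hot column
    return row[:start] + [code] + row[stop:]

# Toy row: two numeric values followed by a 4-way one-hot block.
row = [0.5, 1.2, 0, 0, 1, 0]
print(collapse_one_hot(row, 2, 6))   # [0.5, 1.2, 2]
```

After collapsing, the index of that single column (along with the other four categorical columns) is what would be passed to `SMOTENC(categorical_features=...)`, and any one-hot encoding would be reapplied downstream of the resampler.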