How to evaluate data imputation techniques

I have a dataset with 29 features, 8 of which have missing values. I've tried sklearn's SimpleImputer with all of its strategies, KNNImputer with several values of k, and IterativeImputer with all combinations of imputation order, estimators, and number of iterations. My question is how to evaluate these imputation techniques and choose the best one for my data. I can't run a baseline model and evaluate its performance because I'm not familiar with balancing the data and …
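A common way to compare imputers without a downstream model is to hide entries whose true values are known and score each imputer's reconstruction error on exactly those entries. A minimal sketch, with small synthetic data standing in for the real 29-feature set:

```python
# Sketch: hide a random 10% of known values, impute, and score reconstruction
# error (RMSE) on exactly the hidden entries. Toy data, not the real dataset.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 5))

mask = rng.random(X_full.shape) < 0.10   # entries to hide
X_missing = X_full.copy()
X_missing[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}
for name, imputer in imputers.items():
    X_imputed = imputer.fit_transform(X_missing)
    rmse = np.sqrt(np.mean((X_imputed[mask] - X_full[mask]) ** 2))
    print(f"{name}: masked-entry RMSE = {rmse:.3f}")
```

The imputer with the lowest masked-entry RMSE is a reasonable default choice, and this avoids needing a tuned baseline classifier.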
Category: Data Science

Handling Imbalanced Datasets in Orange

I work in the medical domain, so class imbalance is the rule and not the exception. While I know Python has packages for class imbalance, I don't see an equivalent option in Orange, e.g. a SMOTE widget. I have read other threads on Stack Exchange regarding this, but I have not found an answer to how to tackle class imbalance in Orange without resorting to Python programming. Thanks
Category: Data Science

Undersample to get a specific number of samples per class using Tomek links in imblearn

I have a dataset with the classes in my target column distributed as shown below.

    counts   percents
6     1507  27.045944
3     1301  23.348887
5      661  11.862886
4      588  10.552764
7      564  10.122039
8      432   7.753051
1      416   7.465901
2       61   1.094760
9       38   0.681981
10       4   0.071788

I would like to undersample my data and include only 588 samples per class at maximum, so that classes 6, 3 & 5 only have ~588 samples available after undersampling. Here's …
Category: Data Science

imblearn error importing SMOTE

I want to use SMOTE from the imblearn package, and I got the following error:

ImportError                               Traceback (most recent call last)
<ipython-input-10-77606507c62c> in <module>()
     66 len(data[data["num"]==0])
     67 #balancing dataset
---> 68 from imblearn.over_sampling import SMOTE
     69 import matplotlib.pyplot as plt
     70 sm = SMOTE(random_state=42)

~\Anaconda3\lib\site-packages\imblearn\__init__.py in <module>()
     33 """
     34
---> 35 from .base import FunctionSampler
     36 from ._version import __version__
     37

~\Anaconda3\lib\site-packages\imblearn\base.py in <module>()
     17 from sklearn.utils import check_X_y
     18
---> 19 from .utils import check_sampling_strategy, check_target_type
     20 from .utils.deprecation import …
Category: Data Science

Cross validation schema for imbalanced dataset

Based on a previous post, I understand the need to ensure that the validation folds during the CV process have the same imbalanced distribution as the original dataset when training a binary classification model on an imbalanced dataset. My question is regarding the best training schema. Let's assume that I have an imbalanced dataset with 5M samples, where 90% are the positive class vs 10% the negative class, and I am going to use 5-fold CV for model tuning. Also, let's assume I …
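Preserving the original class ratio in every fold is exactly what StratifiedKFold does. A minimal sketch on small synthetic data (standing in for the 5M-sample set):

```python
# Sketch: StratifiedKFold keeps the ~90/10 class ratio in every validation fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # each validation fold's minority fraction matches the full dataset's
    print(f"fold {fold}: minority fraction in validation = {y[val_idx].mean():.2f}")
```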
Category: Data Science

Preferred approaches for imbalanced data

I am building a binary classification model with an imbalanced target variable (13% class 1 vs 87% class 0). I am considering the following three options to handle the data imbalance.
Option 1: Create a balanced training dataset with a 50% / 50% split of the target variable.
Option 2: Sample the dataset as-is (i.e., 87% / 13% split) and use upsampling methods (e.g., SMOTE) to balance the target variable to a 50% / 50% split.
Option 3: Use learning methods with appropriate …
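Option 3 typically means cost-sensitive learning; many sklearn estimators expose this through the class_weight parameter, which reweights errors inversely to class frequency without touching the data. A hedged sketch on synthetic data with the same 87/13 ratio:

```python
# Sketch of Option 3: cost-sensitive learning via class_weight="balanced",
# evaluated with a ranking metric (ROC AUC) that is robust to imbalance.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.87, 0.13], random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=5)
print(f"mean ROC AUC: {scores.mean():.3f}")
```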
Category: Data Science

Class imbalance: Will transforming a multi-label (aka multi-task) problem into a multi-class problem help?

I noticed this and this question, but my problem is more about class imbalance. So now I have, say, 1000 targets and some input samples (with some feature vectors). Each input sample can have label '1' for many targets (currently tasks), meaning they interact. Label '0' means they don't interact (for each task, it is a binary classification problem).
Unbalanced data: my current issue is that for most targets there are <1% samples (perhaps 1 or 2) that are labelled 1. …
Category: Data Science

Give more weight to features based on distribution plot

I have a task to predict a binary variable purchase; the dataset is strongly imbalanced (10:100) and the models I have tried so far (mostly ensembles) fail. In addition, I have tried applying SMOTE to reduce the imbalance, but the outcome is pretty much the same. Analyzing each feature in the dataset, I have noticed that there are some clearly visible differences in the distribution of features between purchase: 1 and purchase: 0 (see images). My question is: how …
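Before deciding whether or how to weight features, the visual differences can be quantified per feature with a two-sample statistic such as Kolmogorov-Smirnov. A sketch on synthetic data, where only feature 0 is made to genuinely differ between the classes:

```python
# Sketch: rank features by how differently they are distributed across the
# two classes, using the two-sample KS statistic. Synthetic data only.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)
X[y == 1, 0] += 1.0  # shift feature 0 for the positive class

for j in range(X.shape[1]):
    stat, p_value = ks_2samp(X[y == 1, j], X[y == 0, j])
    print(f"feature {j}: KS statistic = {stat:.2f}")
```

Features with a large KS statistic are the ones whose distributions genuinely separate the classes, which helps confirm what the distribution plots suggest.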
Category: Data Science

Over-sampling when predicting a continuous variable

Let's say I am predicting house selling prices (continuous) and therefore have multiple independent variables (numerical and categorical). Is it common practice to balance the dataset when the categorical independent variables are imbalanced (ratio not higher than 1:100)? Or do I only balance the data when the dependent variable is imbalanced? Thanks
Category: Data Science

Explaining the logic behind the pipe_line method for cross-validation of imbalance datasets

Reading the following article: https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html
There is an explanation of how to use from imblearn.pipeline import make_pipeline in order to perform cross-validation on an imbalanced dataset while avoiding data leakage. Here I copy the code used in the notebook linked by the article:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)
rf = RandomForestClassifier(n_estimators=100, random_state=13)
imba_pipeline = make_pipeline(SMOTE(random_state=42),
                              RandomForestClassifier(n_estimators=100, random_state=13))
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, …
Category: Data Science

ColumnTransformer worse performance than sklearn pipeline

I have a pipeline model (unbalanced, binary data) consisting of two pipelines (preprocessing and the actual model). I wanted to include SimpleImputer in my preprocessing pipeline, and because I don't want to apply it to all columns I used ColumnTransformer; but now I see that the performance with ColumnTransformer is a lot worse than with the plain sklearn pipeline (AUC before was around 0.93, and with ColumnTransformer it's around 0.7). I filled the NaN values before the pipeline to check if the …
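One likely culprit (an assumption, not confirmed from the question): ColumnTransformer's default remainder="drop" silently removes every column not assigned to a transformer, shrinking the feature set the model sees. A minimal sketch of the difference:

```python
# Sketch: ColumnTransformer drops unlisted columns by default; pass
# remainder="passthrough" to keep them.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

ct_drop = ColumnTransformer([("imp", SimpleImputer(), ["a"])])  # drops "b"
ct_keep = ColumnTransformer([("imp", SimpleImputer(), ["a"])],
                            remainder="passthrough")            # keeps "b"

print(ct_drop.fit_transform(df).shape)  # (3, 1)
print(ct_keep.fit_transform(df).shape)  # (3, 2)
```

If the untouched columns carried most of the signal, dropping them would explain an AUC fall from 0.93 to 0.7.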
Category: Data Science

Oversampling on Sequence(Text) data

Has anyone been able to perform synthetic oversampling on sequential data? From what I've read and understand, the oversampling/undersampling techniques that are currently used are only applicable to structured, tabular data. But if I've got sequential data like this:

Sequence            Label
[1,2,3,5,0,0,0,0]   3
[4,5,2,3,5,0,0,0]   5
[3,4,0,0,0,0,0,0]   7

where each sequence consists of integer tokens and padding, how do I perform SMOTE or any other synthetic oversampling technique? I don't want to do random replication of examples, since that's not very …
Category: Data Science

Positively skewed target label in regression

I have a dataset where the target label is positively skewed and produces a long tail, and currently I have high residuals on these values when experimenting with linear, tree-based, and neural-network regression models. I see the same problem with the Boston Housing prediction dataset, along with recommendations to apply a log transformation to the target label. This has given some small improvement but not enough. Additionally, I've tried to randomly duplicate values within the tail to shift the …
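For reference, the log-transform approach can be wrapped in sklearn's TransformedTargetRegressor, which fits on log1p(y) and inverts with expm1 at prediction time, so metrics are still computed in the original target space. A sketch on a synthetic positively skewed target (not the real dataset):

```python
# Sketch: regress on a log-transformed target via TransformedTargetRegressor.
# The synthetic target is exponential in the features -> long right tail.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.expm1(X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=500))

model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   func=np.log1p, inverse_func=np.expm1)
model.fit(X, y)
print(f"R^2 in the original target space: {model.score(X, y):.3f}")
```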
Category: Data Science

How to print feature names in conjunction with feature importance using the imbalanced-learn library?

I used BalancedBaggingClassifier from the imblearn library to do an unbalanced classification task. How can I get the feature importance of the estimator in conjunction with feature names, especially when max_features is less than the total number of features? For example, in the following code the total number of features equals 20, but max_features is 8.

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedBaggingClassifier
from xgboost.sklearn import XGBClassifier

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, …
Category: Data Science

Balancing the dataset using imblearn undersampling, oversampling and combine?

I have an imbalanced dataset:

data['Class'].value_counts()
Out[22]:
0    137757
1      4905
Name: Class, dtype: int64

X_train, X_valid, y_train, y_valid = train_test_split(input_x, input_y, test_size=0.20, random_state=seed)
print(sorted(Counter(y_train).items()))
[(0, 110215), (1, 3914)]

I tried different imblearn functions:

from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour, EditedNearestNeighbours, RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN, InstanceHardnessThreshold, NeighbourhoodCleaningRule, TomekLinks

smote_enn = SMOTEENN(random_state=27)
smote_tomek = SMOTETomek(random_state=27)
adasyn = ADASYN(random_state=27)
borderline = BorderlineSMOTE(random_state=27)
ran_oversample = RandomOverSampler(random_state=27)
smote = SMOTE(random_state=27)
cnn = CondensedNearestNeighbour(random_state=27)
…
Category: Data Science

What does IBA mean in imblearn classification report?

imblearn is a Python library for handling imbalanced data. Code for generating a classification report is given below.

import numpy as np
from imblearn.metrics import classification_report_imbalanced

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report_imbalanced(y_true, y_pred, target_names=target_names))

The output for this is as follows:

          pre   rec   spe   f1    geo   iba   sup
class 0   0.50  1.00  0.75  0.67  0.87  0.77  1
class 1   0.00  0.00  0.75  0.00  0.00  …
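To the best of my understanding, iba stands for the Index of Balanced Accuracy: imblearn computes it as (1 + alpha * (recall - specificity)) * recall * specificity, with alpha = 0.1 by default, so it rewards high recall and specificity while penalizing a gap between them. A quick check against the class 0 row above:

```python
# Index of Balanced Accuracy: the product recall * specificity, weighted by
# the dominance (recall - specificity); imblearn's default alpha is 0.1.
alpha = 0.1
recall, specificity = 1.00, 0.75  # class 0 values from the report above
iba = (1 + alpha * (recall - specificity)) * recall * specificity
print(round(iba, 2))  # 0.77, matching the iba column for class 0
```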
Category: Data Science
