I'm trying to use grid search to find the best parameters for my model. Given that I have to apply the NearMiss undersampling method while doing cross-validation, should I fit my grid search on my undersampled dataset (whatever undersampling technique is used) or on my entire training data (the whole dataset) before using cross-validation?
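A common pattern (sketched below, assuming imbalanced-learn is installed; the estimator and parameter grid are purely illustrative) is to put NearMiss inside an imblearn Pipeline, so that the undersampling is re-fit on the training portion of each CV fold while the grid search itself is fitted on the full, un-resampled training data.

# Sketch: NearMiss applied only inside each training fold of the grid search.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import NearMiss
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("undersample", NearMiss()),                   # runs only on the training split of each fold
    ("clf", LogisticRegression(max_iter=1000)),    # illustrative estimator
])

param_grid = {
    "undersample__n_neighbors": [3, 5],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: the whole (not undersampled) training set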
I have a multi-label dataset whose label distribution looks something like this, with the label on the x-axis and the number of rows it occurs in on the y-axis.

## imports
import numpy as np
import pandas as pd
%matplotlib inline
from sklearn.datasets import make_multilabel_classification

## creating dummy data
X, y = make_multilabel_classification(n_samples=100_000, n_features=2, n_classes=100, n_labels=10, random_state=42)
X.shape, y.shape
((100000, 2), (100000, 100))

## making it a dataframe
final_df = pd.merge(left=pd.DataFrame(X), right=pd.DataFrame(y), left_index=True, right_index=True).copy()
final_df.rename(columns={'0_x':'input_1', '1_x':'input_2', '0_y':0, '1_y':1}, inplace=True)
final_df.columns = final_df.columns.astype(str)
…
I have a multi-label dataset with an extreme case of class imbalance. I was thinking of clustering the less populated classes into bigger clusters of size at least N. My question: is there a clustering algorithm that allows one to merge only the smaller, similar groups into clusters of size at least N? The idea is that the algorithm should ignore the labels that are already populated enough and focus on clustering together the labels that are still underrepresented. To give …
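As a rough illustration of this idea (a greedy sketch rather than an off-the-shelf algorithm; the function name, the use of cosine similarity between label columns, and the size bookkeeping are all my own assumptions): only labels whose count is below N are considered, and the smallest one is repeatedly merged with its most similar underrepresented neighbour.

import numpy as np

def merge_rare_labels(Y: np.ndarray, N: int) -> dict:
    # Y: (n_samples, n_labels) binary indicator matrix; returns {label index: group id}.
    counts = Y.sum(axis=0).astype(float)
    norms = np.linalg.norm(Y, axis=0) + 1e-12
    sim = (Y.T @ Y) / np.outer(norms, norms)        # cosine similarity between label columns
    np.fill_diagonal(sim, -np.inf)

    group = {j: j for j in range(Y.shape[1])}       # every label starts as its own group
    size = {j: counts[j] for j in range(Y.shape[1])}

    while True:
        small = [g for g in set(group.values()) if size[g] < N]
        if len(small) < 2:                          # well-populated labels are never touched
            break
        g = min(small, key=lambda k: size[k])       # smallest underrepresented group first
        target = max((o for o in small if o != g), key=lambda o: sim[g, o])
        for lbl, grp in list(group.items()):        # merge group g into its nearest small neighbour
            if grp == g:
                group[lbl] = target
        size[target] += size[g]                     # rough size; co-occurring rows are double-counted
    return group

# groups = merge_rare_labels(Y, N=500)   # e.g. require every merged group to cover at least 500 rows

For brevity the similarity matrix is not updated after a merge (the target label's original row keeps being used); a fuller implementation, or ordinary agglomerative clustering restricted to the rare labels, would recompute it.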
I'm using SimpleTransformers to train and evaluate a model. Since the dataset I am using is severely imbalanced, it is recommended that I assign weights to each label. An example of assigning weights for SimpleTransformers is given here. My question, however, is: how exactly do I choose the appropriate weight for each class? Is there a specific methodology, e.g., a formula that uses the ratio of the labels? Follow-up question: are the weights used for the same dataset "universal"? …
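One common heuristic (not specific to SimpleTransformers) is to weight each class inversely to its frequency, i.e. weight_c = n_samples / (n_classes * count_c), which is what scikit-learn's class_weight='balanced' computes; the toy labels below are illustrative. Since the weights come straight from the label counts, recomputing them on a different dataset (or after resampling) will generally give different values.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0]*8 + [1] + [2])                  # toy, severely imbalanced labels
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))               # rarer labels get larger weights (about 3.33 vs 0.42)

# The same numbers by hand: n_samples / (n_classes * count per class)
manual = len(y) / (len(classes) * np.bincount(y))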
What is the best way to categorize the approaches that have been developed to deal with the class imbalance problem? This article categorizes them into:
Preprocessing: includes oversampling, undersampling and hybrid methods,
Cost-sensitive learning: includes direct methods and meta-learning, the latter of which is further divided into thresholding and sampling,
Ensemble techniques: includes cost-sensitive ensembles and data preprocessing in conjunction with ensemble learning.
The second classification:
Data Pre-processing: includes distribution change and weighting the data space. One-class learning is considered a distribution change. …
I used XGBoost to predict company bankruptcy, which involves an extremely unbalanced dataset. Although I tried the weighting method as well as parameter tuning, the best result I could obtain is as follows:

Best Parameters: {'clf__gamma': 0.1, 'clf__scale_pos_weight': 30.736842105263158, 'clf__min_child_weight': 1, 'clf__max_depth': 9}
Best Score: 0.219278428798
Accuracy: 0.966850828729
AUC: 0.850038850039
F1 Measure: 0.4
Cohen Kappa: 0.383129792673
Precision: 0.444444444444
Recall: 0.363636363636
Confusion Matrix:
[[346   5]
 [  7   4]]

As the confusion matrix shows, my model cannot identify bankrupted companies very …
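For what it's worth, the XGBoost documentation suggests starting scale_pos_weight at the ratio of negative to positive training examples (likely close to the tuned value above) and tuning around it, while judging the model on a metric such as PR-AUC rather than accuracy. The data and names below are stand-ins.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 5))
y_train = rng.binomial(1, 0.03, size=2000)       # stand-in: ~3% bankrupt companies

ratio = float(np.sum(y_train == 0)) / np.sum(y_train == 1)   # negatives / positives

clf = xgb.XGBClassifier(
    scale_pos_weight=ratio,      # heuristic starting point; grid-search around it
    max_depth=9,
    min_child_weight=1,
    gamma=0.1,
    eval_metric="aucpr",         # PR-AUC is more informative than accuracy under heavy imbalance
)
clf.fit(X_train, y_train)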
I am using Orange to predict customer churn and compare different learners based on accuracy, F1, etc. As my problem is unbalanced (10% churn, 90% not churn), I want to oversample. However, in Orange it is not possible to do the oversampling within the cross-validation (the Test & Score widget). Therefore, I first want to generate 10 stratified folds from my input data, so that the 10% churn / 90% not-churn distribution is preserved in each fold. Then, …
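Outside Orange, the procedure described above can be sketched with scikit-learn and imbalanced-learn: build 10 stratified folds on the raw data, then oversample only the training part of each fold so the held-out part keeps the original 10%/90% distribution. The data and names are illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.binomial(1, 0.1, size=1000)              # stand-in: ~10% churn

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # oversample the training part only; the test part keeps the original class ratio
    X_tr, y_tr = RandomOverSampler(random_state=42).fit_resample(X[train_idx], y[train_idx])
    X_te, y_te = X[test_idx], y[test_idx]
    # train a learner on (X_tr, y_tr) and score it on (X_te, y_te)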
I am confronted with a binary classification machine learning task which is both slightly imbalanced and cost-sensitive. I wonder what is the best way (and where in the modeling pipeline, say, in sklearn) to take all these considerations into account. Class proportions: positive: 0.25, negative: 0.75. This could be addressed with sklearn.utils.class_weight.compute_class_weight:

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)

OK, but this only rebalances the class proportions; I should take the misclassification cost into consideration as well. Let's say that this is 10* larger …
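One possible way to combine the two (a sketch, not a definitive recipe; the cost factor, estimator and data are illustrative): start from the 'balanced' class weights and multiply the positive-class weight by the assumed 10x misclassification cost, then pass the result as class_weight to the estimator (or, equivalently, as per-sample weights in fit).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.binomial(1, 0.25, size=1000)             # stand-in: ~25% positives

classes = np.unique(y)
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weights = dict(zip(classes, balanced))
weights[1] *= 10                                  # assumed: misclassifying a positive is 10x as costly

clf = LogisticRegression(class_weight=weights, max_iter=1000).fit(X, y)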
I have a problem that involves identifying clusters of highly correlated items. I initially focused on building a model and features that put similar data items close to each other. The main challenge is that I have a case of imbalanced data, as follows: tens of millions of items are random and not necessarily correlated, while hundreds of clusters of items (composed of 10-1000s of elements) exist* or may emerge. *I do have partial ground truth for the existing ones. …