Undersampling a multi-label classification dataset

I have a multi-label dataset whose label distribution looks something like this, with the label on the x-axis and the number of rows it occurs in on the y-axis.

## imports
import numpy as np
import pandas as pd
%matplotlib inline
from sklearn.datasets import make_multilabel_classification

## creating dummy data
X, y = make_multilabel_classification(n_samples=100_000, n_features=2,
                                      n_classes=100, n_labels=10,
                                      random_state=42)
X.shape, y.shape
((100000, 2), (100000, 100))

## making it a dataframe
final_df = pd.merge(left=pd.DataFrame(X), right=pd.DataFrame(y),
                    left_index=True, right_index=True).copy()
final_df.rename(columns={'0_x': 'input_1', '1_x': 'input_2',
                         '0_y': 0, '1_y': 1}, inplace=True)
final_df.columns = final_df.columns.astype(str) …
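One way to undersample a multi-label dataset like this is a greedy pass that keeps a row only while at least one of its labels is still under a per-label cap. A minimal sketch, assuming a toy indicator matrix and an illustrative cap of 100 rows per label (the `undersample` helper and all numbers are illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy multi-label indicator matrix: 1000 rows, 10 labels,
# label 0 is very common (~90%), label 9 is rare (~1%)
y = (rng.random((1000, 10)) < np.linspace(0.9, 0.01, 10)).astype(int)

def undersample(y, max_per_label=100, seed=0):
    """Greedy undersampling: visit rows in random order and keep a row
    only if at least one of its labels is still below max_per_label.
    Note: common labels can end up above the cap, because a row is kept
    whenever ANY of its labels is still scarce."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    counts = np.zeros(y.shape[1], dtype=int)
    keep = []
    for i in order:
        labels = np.flatnonzero(y[i])
        if len(labels) and (counts[labels] < max_per_label).any():
            keep.append(i)
            counts[labels] += 1
    return np.sort(np.array(keep))

idx = undersample(y, max_per_label=100)
print(len(idx), y[idx].sum(axis=0))
```

The invariant is that every label retains min(its total count, the cap) rows, so rare labels are never thrown away while the dominant ones are trimmed.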
Category: Data Science

Clustering method that allows choosing cluster size

I have a multi-label dataset showing an extreme case of imbalance. I was thinking of clustering the less populated classes into bigger clusters of size at least N. My question: is there a clustering algorithm that allows one to merge only the smaller, similar groups into clusters of size at least N? The idea is that the algorithm should ignore the labels that are already populated enough and focus on clustering together the labels that are still underrepresented. To give …
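One way to get this behaviour is a greedy agglomerative pass: repeatedly take the smallest group below N and merge it into its nearest neighbour, leaving already-large groups untouched unless something merges into them. A minimal sketch with toy group sizes and random centroids (the `merge_small_groups` helper and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy setup: 12 label groups with very uneven sizes, each with a centroid
sizes = np.array([5000, 3000, 40, 35, 30, 25, 20, 15, 900, 10, 8, 700])
centroids = rng.random((len(sizes), 8))

def merge_small_groups(sizes, centroids, N=100):
    """Greedily merge groups smaller than N into their nearest group
    (by centroid distance). Groups already >= N are never split or
    moved; they only grow if a small group is absorbed into them."""
    clusters = [[i] for i in range(len(sizes))]
    sizes = sizes.astype(float).copy()
    cents = centroids.copy()
    active = list(range(len(sizes)))
    while True:
        small = [i for i in active if sizes[i] < N]
        if not small:
            break
        i = min(small, key=lambda k: sizes[k])           # smallest group
        others = [j for j in active if j != i]
        d = np.linalg.norm(cents[others] - cents[i], axis=1)
        j = others[int(np.argmin(d))]                     # nearest neighbour
        # size-weighted centroid update, then absorb i into j
        cents[j] = (sizes[i] * cents[i] + sizes[j] * cents[j]) / (sizes[i] + sizes[j])
        sizes[j] += sizes[i]
        clusters[j].extend(clusters[i])
        active.remove(i)
    return [clusters[i] for i in active], sizes[active]

groups, merged_sizes = merge_small_groups(sizes, centroids, N=100)
print(merged_sizes)
```

After the loop every surviving cluster has size at least N, and the well-populated groups are left alone, which matches the requirement in the question. For a more principled alternative, constrained clustering libraries support minimum-cluster-size constraints directly.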
Category: Data Science

Imbalanced Dataset (Transformers): How to Decide on Class Weights?

I'm using SimpleTransformers to train and evaluate a model. Since the dataset I am using is severely imbalanced, it is recommended that I assign weights to each label. An example of assigning weights for SimpleTransformers is given here. My question, however, is: how exactly do I choose the appropriate weight for each class? Is there a specific methodology, e.g., a formula that uses the ratio of the labels? Follow-up question: are the weights used for the same dataset "universal"? …
Category: Data Science

Categorization of approaches to deal with imbalanced classes

What is the best way to categorize the approaches that have been developed to deal with the imbalanced-class problem? This article categorizes them into: Preprocessing, which includes oversampling, undersampling, and hybrid methods; Cost-sensitive learning, which includes direct methods and meta-learning, the latter further divided into thresholding and sampling; and Ensemble techniques, which includes cost-sensitive ensembles and data preprocessing in conjunction with ensemble learning. The second classification: Data Pre-processing, which includes distribution change and weighting the data space; one-class learning is considered a distribution change. …
Category: Data Science

Unbalanced data classification

I used XGBoost to predict company bankruptcy, which is an extremely unbalanced dataset. Although I tried a weighting method as well as parameter tuning, the best result I could obtain is as follows:

Best Parameters: {'clf__gamma': 0.1, 'clf__scale_pos_weight': 30.736842105263158, 'clf__min_child_weight': 1, 'clf__max_depth': 9}
Best Score: 0.219278428798
Accuracy: 0.966850828729
AUC: 0.850038850039
F1 Measure: 0.4
Cohen Kappa: 0.383129792673
Precision: 0.444444444444
Recall: 0.363636363636
Confusion Matrix:
[[346   5]
 [  7   4]]

As the confusion matrix shows, my model cannot identify bankrupted companies very …
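The reported scores are consistent with the confusion matrix; working them out by hand makes the imbalance problem explicit (high accuracy driven almost entirely by the 351 negatives, while only 4 of 11 positives are caught):

```python
# confusion matrix from the question: rows = true (neg, pos), cols = predicted
tn, fp = 346, 5
fn, tp = 7, 4

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 350 / 362
precision = tp / (tp + fp)                     # 4 / 9
recall    = tp / (tp + fn)                     # 4 / 11
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# -> 0.9669 0.4444 0.3636 0.4, matching the reported metrics
```

This is why accuracy is a poor target here: predicting "not bankrupt" for every row would already score 351/362 ≈ 0.97.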
Category: Data Science

Stratified K Fold Cross Validation in Orange: python script

I am using Orange to predict customer churn and compare different learners based on accuracy, F1, etc. As my problem is unbalanced (10% churn, 90% not churn), I want to oversample. However, in Orange it is not possible to do the oversampling within cross-validation (the Test & Score widget). Therefore, based on my input data, I first want to generate 10 stratified folds in which the 10% churn / 90% not-churn distribution is preserved. Then, …
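Generating stratified fold assignments yourself is straightforward: shuffle the row indices within each class and deal them out round-robin across the folds, so every fold ends up with (almost) the same churn rate. A minimal sketch with a toy ~10% churn label vector (the `stratified_folds` helper is illustrative; scikit-learn's `StratifiedKFold` does the same job):

```python
import numpy as np

rng = np.random.default_rng(42)
y = (rng.random(1000) < 0.10).astype(int)   # ~10% churn, ~90% not churn

def stratified_folds(y, n_folds=10, seed=0):
    """Assign each row a fold id so every fold has (almost) the same
    class distribution: shuffle indices within each class, then deal
    them out round-robin across the folds."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold

fold = stratified_folds(y, n_folds=10)
# churn rate per fold stays close to the overall rate
rates = [y[fold == k].mean() for k in range(10)]
print([round(r, 2) for r in rates])
```

The resulting fold column can be fed back into Orange as an ordinary feature and used to split the data, with oversampling then applied inside each training split only.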
Category: Data Science

How and where to set weights in case of imbalanced cost sensitive learning in machine learning?

I am confronted with a binary classification machine learning task that is both slightly imbalanced and cost-sensitive. I wonder what (and where in the modeling pipeline, say, in sklearn) is the best way to take all these considerations into account. Class proportions: positive: 0.25, negative: 0.75. This could be addressed with sklearn.utils.class_weight.compute_class_weight:

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)

OK, but this only rebalances the proportions; I should take the misclassification cost into consideration as well. Let's say that this is 10× larger …
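One common way to combine the two concerns is to compute the "balanced" inverse-frequency weights first and then multiply each class's weight by its misclassification cost, passing the result per row as `sample_weight` to the estimator's `fit()`. A minimal sketch in plain numpy, assuming (illustratively) that a false negative is 10× as costly as a false positive:

```python
import numpy as np

y = np.array([1] * 250 + [0] * 750)   # 25% positive, 75% negative

# step 1: "balanced" class weights, w_c = n / (k * n_c)
classes, counts = np.unique(y, return_counts=True)
class_weight = {int(c): len(y) / (len(classes) * n)
                for c, n in zip(classes, counts)}

# step 2: fold in the misclassification cost -- the 10x for missing a
# positive is an assumed, illustrative number
cost = {0: 1.0, 1: 10.0}
combined = {c: class_weight[c] * cost[c] for c in class_weight}

# per-row sample_weight that most sklearn estimators accept in fit()
sample_weight = np.array([combined[int(c)] for c in y])
print(combined)
```

Keeping the rebalancing factor and the cost factor separate like this makes it easy to change the business cost later without retuning the imbalance correction.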
Category: Data Science

Clustering with imbalanced data and groups

I have a problem that is about identifying clusters of highly correlated items. I initially focused on building a model and features that put similar data items close to each other. The main challenge is that I have a case of imbalanced data, as follows: tens of millions of items are random and not necessarily correlated, while hundreds of clusters of items (each composed of 10 to 1000s of elements) exist* or may emerge. *I do have partial ground truth for the existing clusters. …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.