I use the "classification_report" from from sklearn.metrics import classification_report in order to evaluate the imbalanced binary classification Classification Report : precision recall f1-score support 0 1.00 1.00 1.00 28432 1 0.02 0.02 0.02 49 accuracy 1.00 28481 macro avg 0.51 0.51 0.51 28481 weighted avg 1.00 1.00 1.00 28481 I do not understand clearly what is the meaning of macro avg and weighted average? and how we can clarify the best solution based on how close their amount to one! …
I have a dataset intended for a classification problem. Due to the imbalance of the target Y, I chose to reframe it as an anomaly detection task. Should I use the Y I have as a feature inside the anomaly detection model? Is that an overfitting risk?
My dataset is text data consisting of path names. I am using a TF-IDF vectorizer and decision trees. The classes in my dataset are severely imbalanced: a few large classes have more than 500 samples each, while other minor classes have fewer than 100, and some even fewer than 20. This is real collected data, so the chance of the model seeing a minor class in actual deployment …
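A minimal sketch of one common setup for this kind of problem: a TF-IDF plus decision tree pipeline with class_weight="balanced", which reweights the split criterion by inverse class frequency so the small classes are not ignored (the path names and labels below are placeholders):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical path-name data standing in for the real dataset.
texts = ["/usr/local/bin/tool", "C:/Users/app/config.ini", "/var/log/syslog"]
labels = ["linux", "windows", "linux"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),   # character n-grams suit path strings
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])
clf.fit(texts, labels)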
I am working on a model that will run monthly on 8M users. I have snapshot-wise data in the training set, e.g.:

Jan 2021 snapshot: 8M total, 233 positives, rest negative
Feb 2021 snapshot: 8M total, 599 positives, rest negative
Mar 2021 snapshot: 8M total, 600 positives, rest negative
Apr 2021 snapshot: 8M total, 750 positives, rest negative

and similarly until March 2022. I'm keeping March 2022 as the test set, which has 2000 positive labels …
The target is a probability distribution over N classes; I don't want the model to predict the class with the highest probability but the 'actual' probability per class. For example:

|   | Class 1 | Class 2 | Class 3 |
|---|---------|---------|---------|
| 1 | 0.9     | 0.05    | 0.05    |
| 2 | 0.2     | 0.8     | 0       |
| 3 | 0.3     | 0.3     | 0.4     |
| 4 | 0.7     | 0       | 0.3     |
| + | …
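One common way to get this behaviour is to train on the soft probability targets directly: a softmax output with categorical cross-entropy accepts probability vectors as labels, not only one-hot classes. A minimal Keras sketch (the features are made up; only the target rows come from the table above):

import numpy as np
import tensorflow as tf

X = np.random.rand(4, 10).astype("float32")      # hypothetical features
Y = np.array([[0.9, 0.05, 0.05],                 # soft targets from the table above
              [0.2, 0.8,  0.0 ],
              [0.3, 0.3,  0.4 ],
              [0.7, 0.0,  0.3 ]], dtype="float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X, Y, epochs=5, verbose=0)
print(model.predict(X))                          # per-class probabilities, not hard labels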
I am working on a relation extraction and classification problem. The data is in the form of text files and is imbalanced. I want to use the focal loss function to address the class imbalance in the data. My question is: can focal loss be used for an extraction and classification task to increase accuracy? Focal loss has been applied to object detection and image classification tasks. The link is below. I want to use this on text …
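Focal loss operates at the loss-function level and is not tied to images, so it can in principle be dropped into a text classifier. A minimal PyTorch sketch of FL(p_t) = -(1 - p_t)^gamma * log(p_t) for multi-class classification (the batch below is made up; gamma=2 follows the common default):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Per-example cross-entropy, kept unreduced so each example can be reweighted.
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                     # predicted probability of the true class
    return ((1 - pt) ** gamma * ce).mean()  # down-weight easy, well-classified examples

# Hypothetical batch: 4 examples, 3 relation classes.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 0])
loss = focal_loss(logits, targets)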
I'm facing a classification problem on a dataset. The target variable is binary (2 classes, 0 and 1). I have 8,161 samples in the training dataset, distributed as follows:

class 0: 6,008 samples (73.6% of the total)
class 1: 2,153 samples (26.4%)

My questions are: in this case, should I consider this an imbalanced dataset? If so, should I process the data before using RandomForest to make a …
I am working on a multi-class classification project and have a question: I have three classes with proportions of 50%, 47%, and 3%. I decided to use the class_weight="balanced" parameter in the random forest classifier. Now I want to calculate accuracy. Should I use balanced accuracy, or can I use plain accuracy?
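A small illustrative sketch of the difference (the labels are made up to mirror the 50/47/3 proportions): plain accuracy can look excellent while the 3% class is missed entirely, whereas balanced accuracy, the unweighted mean of per-class recall, exposes it.

from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0]*50 + [1]*47 + [2]*3
y_pred = [0]*50 + [1]*47 + [1]*3   # every sample of the 3% class mispredicted as class 1

print(accuracy_score(y_true, y_pred))           # 0.97
print(balanced_accuracy_score(y_true, y_pred))  # ~0.67 = (1.0 + 1.0 + 0.0) / 3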
I have a severely skewed dataset consisting of twenty-something classes, where the smallest class contains on the order of 1,000 samples and the largest several million. Regarding the validation data, I understand that I should make sure it represents a class ratio similar to the one in my original raw data. Hence, I shouldn't do any under- or over-sampling on the validation data, but I can do it on the training data. Because I have such …
Hi, I am new to Python and deep learning. I am doing multiclass classification. My 3-class dataset is imbalanced; the classes take up about 50%, 40%, and 20%. I am trying to generate mini-batches with balanced classes. I am using class_weight to generate balanced batches in fit_generator(), but I doubt it actually works, because the batches generated by train_datagen.flow_from_directory() are not balanced: the generated batches have class proportions of around [0.43, 0.38, 0.19]. My code is as follows: …
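For context, class_weight does not change which samples the generator draws, so flow_from_directory() batches keep the natural class ratio; the weights only rescale each sample's contribution to the loss. A minimal sketch of computing and passing them (the per-class counts below are placeholders):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical per-class sample counts taken from the training directory.
labels = np.array([0]*500 + [1]*400 + [2]*200)

weights = compute_class_weight(class_weight="balanced", classes=np.unique(labels), y=labels)
class_weight = dict(enumerate(weights))   # e.g. {0: 0.73, 1: 0.92, 2: 1.83}

# Batches are still drawn at the natural class ratio, but minority samples
# count more in the loss:
# model.fit(train_generator, class_weight=class_weight, epochs=10)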
From my reading, I understand that when we have fewer positive class labels, it is better to use precision or recall as the evaluation metric. Which metric should I use when we have fewer negative samples? I'm looking for an approach other than switching the labels. Problem setting: I'm developing parametrized fragility functions for predicting damage to a structure (for example, trees). An example of a fragility function is here. The fragility function will estimate the probability of exceeding a damage state …
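One option sometimes used when negatives are the rare class is to report the negative-class counterparts of recall and precision, i.e. specificity and negative predictive value, computed directly from the confusion matrix without relabelling the data. A minimal sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]   # hypothetical: negatives are rare
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)   # "recall" of the negative class
npv = tn / (tn + fn)           # "precision" of the negative class
print(specificity, npv)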
When training a deep model for rare event detection (e.g. the sound of an alarm in a home device audio stream), is it best to use a balanced validation set (50% alarm, 50% normal) to determine early stopping etc., or a validation set representative of reality? If an unbalanced, realistic validation set is used, it may have to be huge to contain even a few positive event examples, so I'm wondering how this is typically dealt with. In the given example …
I have 3 classes with this distribution:

Class 0: 0.1169
Class 1: 0.7668
Class 2: 0.1163

I am using xgboost for classification. I know there is a parameter called scale_pos_weight, but how is it handled in the multiclass case, and how can I set it properly?
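scale_pos_weight is defined for binary problems only; a common multiclass workaround is to pass per-sample weights that are inversely proportional to class frequency. A minimal sketch (the data is synthetic, drawn with roughly the class proportions above):

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.choice([0, 1, 2], size=1000, p=[0.1169, 0.7668, 0.1163])

# One weight per training sample, inversely proportional to its class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y)

clf = XGBClassifier(objective="multi:softprob")
clf.fit(X, y, sample_weight=weights)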
I would like to optimize the hyperparameters C and gamma of an SVC using grid search on an unbalanced dataset. So far I have used class_weight='balanced' and selected the best hyperparameters based on the average of the F1 scores. However, the dataset is very unbalanced, i.e. if I choose GridSearchCV with cv=10, then some minority classes are not represented in the validation data. I'm thinking of using SMOTE, but I see the problem here that I would have …
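A common way to combine SMOTE with grid search without leaking synthetic samples into the validation folds is an imbalanced-learn Pipeline, which refits SMOTE on each training fold only, together with stratified folds so every class appears in each validation split. A minimal sketch (the grid values and fold count are placeholders; X, y are the unbalanced training data):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),   # applied to the training folds only
    ("svc", SVC(kernel="rbf")),
])

param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=cv)
# search.fit(X, y)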
I work in the medical domain, so class imbalance is the rule and not the exception. While I know Python has packages for class imbalance, I don't see an option in Orange, e.g. a SMOTE widget. I have read other threads on Stack Exchange regarding this, but I have not found an answer on how to tackle class imbalance in Orange without resorting to Python programming. Thanks
I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before the folds are selected versus oversampled after the training folds are selected. The oversampling approach I used was random oversampling. I understand that the first approach is wrong, since observations that the model has seen bleed into the validation folds; I was just curious how much of a difference this causes. I generated a binary classification dataset with the following: # Generate binary classification dataset with 5% minority class, …
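Not the original code, but a rough sketch of the two setups being compared (dataset parameters and the classifier are placeholders): oversampling before cross-validation lets duplicated minority rows appear in both training and validation folds, while putting the oversampler inside an imbalanced-learn pipeline keeps each validation fold at the original distribution.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

# Leaky setup: oversample first, then cross-validate; the CV score tends to be optimistic.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
leaky = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, scoring="f1")
leaky.fit(X_over, y_over)

# Correct setup: the oversampler is refit on each training fold inside the pipeline.
pipe = Pipeline([("ros", RandomOverSampler(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
correct = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, scoring="f1")
correct.fit(X, y)

print(leaky.best_score_, correct.best_score_)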
I have a highly imbalanced dataset (roughly 5% positive instances) for which I am training binary classifiers. I am using nested 5-fold cross-validation with grid search for hyperparameter tuning. I want to avoid undersampling, so I have been looking into the class_weight hyperparameter. For sklearn's decision tree classifier this works really well and is easily given as a hyperparameter. However, it is not an option for sklearn's neural network (multi-layer perceptron), as far as I can tell. I have been using …
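Since sklearn's MLPClassifier exposes neither class_weight nor per-sample weights in fit(), one workaround people use is a small Keras MLP, where a class_weight dict can be passed to fit(). A minimal sketch (the data is synthetic with ~5% positives; the architecture is a placeholder):

import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20)).astype("float32")
y = (rng.random(2000) < 0.05).astype("int32")

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = {0: weights[0], 1: weights[1]}   # positive class weighted ~10, negative ~0.5

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=5, class_weight=class_weight, verbose=0)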
I have a binary response variable (label) in a dataset with around 50,000 observations. The training set is somewhat imbalanced, with label=1 making up about 33% of the observations and label=0 making up about 67%. Right now with XGBoost I'm getting a ROC-AUC score of around 0.67. The response variable is binary, so the baseline is 50% in terms of chance, but at the same time the data is imbalanced, so if the model just guessed =0 …
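For a binary split like this, XGBoost's scale_pos_weight is commonly set to the negative/positive count ratio (here about 67/33, roughly 2). A minimal sketch with stand-in arrays; in practice the ratio would be recomputed from the real y_train:

import numpy as np
from xgboost import XGBClassifier

# Hypothetical labels with the stated 67/33 split.
y_train = np.array([0]*335 + [1]*165)
X_train = np.random.default_rng(0).normal(size=(500, 10))

neg, pos = (y_train == 0).sum(), (y_train == 1).sum()

clf = XGBClassifier(scale_pos_weight=neg / pos)   # ~2: up-weights the positive class in the loss
clf.fit(X_train, y_train)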
I have a dataset with the classes in my target column distributed as shown below.

    counts   percents
6     1507  27.045944
3     1301  23.348887
5      661  11.862886
4      588  10.552764
7      564  10.122039
8      432   7.753051
1      416   7.465901
2       61   1.094760
9       38   0.681981
10       4   0.071788

I would like to undersample my data and keep at most 588 samples per class, so that classes 6, 3 and 5 only have ~588 samples available after undersampling. Here's …
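One way to express "at most 588 per class" is a sampling_strategy dict for imbalanced-learn's RandomUnderSampler that lists only the classes above the cap. A minimal sketch (X and y below are synthetic stand-ins built from the counts above):

import numpy as np
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

counts = {6: 1507, 3: 1301, 5: 661, 4: 588, 7: 564, 8: 432, 1: 416, 2: 61, 9: 38, 10: 4}
y = np.concatenate([np.full(n, cls) for cls, n in counts.items()])
X = np.arange(len(y)).reshape(-1, 1)

cap = 588
# Downsample only the classes above the cap; smaller classes keep all their samples.
strategy = {cls: cap for cls, n in Counter(y).items() if n > cap}

X_res, y_res = RandomUnderSampler(sampling_strategy=strategy, random_state=0).fit_resample(X, y)
print(Counter(y_res))   # classes 6, 3 and 5 are now at 588; the rest are unchanged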
I am new to ML and have learnt a lot from your valuable posts. I need your advice on the following situation and guidance on whether the steps make sense. I have a binary classification problem; my dataset has a severe imbalance, with approximately 2% positive cases (4,000) out of a total of 200,000 cases. I separated my dataset into a train set and a test set (80/20 stratified split). My train set now has a total of 160,000 cases (3,200 positive cases) and the test …
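For reference, a minimal sketch of the stratified 80/20 split described (made-up arrays standing in for the real data); stratify=y preserves the ~2% positive rate in both splits:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 10))
y = np.zeros(200_000, dtype=int)
y[:4_000] = 1   # ~2% positives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_train.sum(), y_test.sum())   # 3,200 and 800 positives, the same 2% rate in each split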