macro average and weighted average meaning in classification_report

I use classification_report (from sklearn.metrics import classification_report) to evaluate an imbalanced binary classifier. Classification report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28432
           1       0.02      0.02      0.02        49

    accuracy                           1.00     28481
   macro avg       0.51      0.51      0.51     28481
weighted avg       1.00      1.00      1.00     28481

I do not clearly understand the meaning of macro avg and weighted avg, and how we can judge the best solution based on how close their values are to one. …
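Both averages can be reproduced by hand from the per-class scores. A minimal sketch, using the F1 values and supports from the report above:

```python
# Per-class F1 scores and supports, taken from the report above
f1 = [1.00, 0.02]
support = [28432, 49]

# macro avg: unweighted mean over classes (treats both classes equally)
macro_avg = sum(f1) / len(f1)

# weighted avg: mean weighted by support (dominated by the majority class)
weighted_avg = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_avg, 2))     # 0.51, matches the report
print(round(weighted_avg, 2))  # 1.0, matches the report
```

On imbalanced data the macro average is usually the more honest summary, since the weighted average is dominated by the majority class.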
Category: Data Science

Should I resample my dataset?

The dataset that I have is text data consisting of path names. I am using a TF-IDF vectorizer and decision trees. The classes in my dataset are severely imbalanced: there are a few big classes with more than 500 samples each and some minor classes with fewer than 100 samples. Some are even smaller (fewer than 20). This is real collected data, so the chance of the model seeing a minor class in actual implementation …
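Before resampling, cost-sensitive learning is often worth a try. A minimal sketch (the toy corpus and labels are made up for illustration) of a TF-IDF + decision-tree pipeline with class_weight="balanced", which up-weights minority-class errors without changing the data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy path-name corpus; label 1 stands in for a minor class
paths = ["usr/bin/python", "usr/bin/perl", "usr/lib/libc.so",
         "home/alice/notes.txt", "usr/bin/gcc", "var/log/syslog"]
labels = [0, 0, 0, 1, 0, 0]

# class_weight="balanced" reweights errors by inverse class frequency
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    DecisionTreeClassifier(class_weight="balanced", random_state=0))
clf.fit(paths, labels)

# An unpruned tree fits the training set exactly
pred = clf.predict(paths)
print(list(pred))
```

Character n-grams (analyzer="char_wb") tend to suit path names better than word tokens, since separators like "/" break ordinary word tokenization.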
Category: Data Science

Sampling Highly Imbalanced Large Dataset

I am working on a model which will run monthly on 8M users. I have snapshot-wise data in the training set, e.g.:

Jan 2021 snapshot: 8M total, 233 positives, rest negative
Feb 2021 snapshot: 8M total, 599 positives, rest negative
Mar 2021 snapshot: 8M total, 600 positives, rest negative
Apr 2021 snapshot: 8M total, 750 positives, rest negative

and similarly up to March 2022. I'm keeping March 2022 as the test set, which has 2000 positive labels …
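With only hundreds of positives among 8M rows per snapshot, one common option (an illustrative sketch, not necessarily the asker's pipeline) is to keep every positive and randomly undersample the negatives to a fixed ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for one monthly snapshot: 233 positives, rest negative
y = np.zeros(10_000, dtype=int)
y[:233] = 1

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep all positives; sample negatives at a 10:1 negative:positive ratio
neg_keep = rng.choice(neg_idx, size=10 * len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_keep])

print(len(keep))  # 233 positives + 2330 negatives = 2563 rows
```

If the classifier outputs probabilities, they then need recalibration (or a prior-correction term) before being interpreted at the original 8M-row base rate.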
Category: Data Science

How to weight imbalanced soft labels?

The target is a probability distribution over N classes. I don't want the model to predict the class with the highest probability but the 'actual' probability per class. For example:

     | Class 1 | Class 2 | Class 3 |
  1  |   0.9   |  0.05   |  0.05   |
  2  |   0.2   |   0.8   |    0    |
  3  |   0.3   |   0.3   |   0.4   |
  4  |   0.7   |    0    |   0.3   |
  +  | …
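One possibility, sketched here under the assumption that "imbalance" means some classes carry little total probability mass, is soft-label cross-entropy with per-class weights inversely proportional to each class's average target mass:

```python
import numpy as np

# Soft targets from the example table (rows = samples, columns = classes)
targets = np.array([[0.9, 0.05, 0.05],
                    [0.2, 0.8,  0.0 ],
                    [0.3, 0.3,  0.4 ],
                    [0.7, 0.0,  0.3 ]])

# Hypothetical model outputs (valid probability distributions per row)
preds = np.array([[0.8, 0.1, 0.1],
                  [0.3, 0.6, 0.1],
                  [0.4, 0.3, 0.3],
                  [0.5, 0.2, 0.3]])

# Weight each class by the inverse of its mean target probability
class_w = 1.0 / targets.mean(axis=0)
class_w /= class_w.sum()

# Weighted soft-label cross-entropy
loss = -(class_w * targets * np.log(preds + 1e-12)).sum(axis=1).mean()
print(loss)
```

The inverse-mass weighting is an assumption here; other choices (e.g. effective-number weighting) follow the same pattern by swapping the class_w line.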
Category: Data Science

focal loss function help

I am working on a relation extraction and classification problem. The data is in the form of text files, and it is imbalanced. I want to use the focal loss function to address the class imbalance problem. My question is: can focal loss be used for an extraction and classification task to increase accuracy? Focal loss has been applied to object detection and image classification tasks. The link is below. I want to use this on text …
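Focal loss only operates on predicted probabilities and labels, so nothing ties it to images. A NumPy sketch of the binary form FL(p_t) = -α_t (1 - p_t)^γ log(p_t) from Lin et al.:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss. p: predicted P(class=1), y: 0/1 labels."""
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing factor
    # (1 - p_t)**gamma down-weights well-classified (easy) examples
    return float(np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t + 1e-12)))

p = np.array([0.9, 0.6, 0.2])
y = np.array([1, 1, 0])
print(focal_loss(p, y))
# With gamma=0 and alpha=0.5 it reduces to half the plain cross-entropy
print(focal_loss(p, y, alpha=0.5, gamma=0.0))
```

For multi-class text classification the same modulating factor (1 - p_t)^γ is applied to the softmax probability of the true class.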
Category: Data Science

In which situation should we consider a dataset as imbalanced?

I'm facing a classification problem on a dataset. The target variable is binary (2 classes, 0 and 1). I have 8,161 samples in the training dataset, distributed as:

class 0: 6,008 samples (73.6% of the total)
class 1: 2,153 samples (26.4%)

My questions are: in this case, should I consider the dataset imbalanced? If so, should I process the data before using RandomForest to make a …
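A roughly 74/26 split is only mildly imbalanced, and class weighting is usually tried before any resampling. As a sketch, sklearn's "balanced" weights for exactly these counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 6008 + [1] * 2153)
w = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# "balanced" weight = n_samples / (n_classes * count(class))
print(np.round(w, 3))  # [0.679 1.895]
```

The same effect is obtained directly with RandomForestClassifier(class_weight="balanced"), with no preprocessing of the data.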
Category: Data Science

class weighted classification

I am working on my multi-class classification project and I have a question. I have three classes in the proportions 50%, 47%, and 3%. I decided to use the class_weight="balanced" parameter in the random forest classifier. Now I want to calculate accuracy: should I use balanced accuracy, or can I use ordinary accuracy?
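With a 3% class, ordinary accuracy can look fine while the rare class is ignored entirely. A sketch: a classifier that always predicts the biggest class scores 50% accuracy, but only ~33% balanced accuracy (the unweighted mean of per-class recalls):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels in the stated 50/47/3 proportions
y_true = np.array([0] * 50 + [1] * 47 + [2] * 3)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.5
print(balanced_accuracy_score(y_true, y_pred))  # 0.333..., mean of recalls (1, 0, 0)
```

Balanced accuracy therefore reflects performance on the 3% class, which ordinary accuracy hides.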
Category: Data Science

Restrictions on my skewed validation data

I have a severely skewed dataset consisting of twenty-odd classes, where the smallest class contains on the order of 1,000 samples and the largest several million. Regarding the validation data, I understand that I should make sure it represents a class ratio similar to the one in my original raw data. Hence, I shouldn't do any under- or over-sampling on the validation data, but I can do it on the training data. Because I have such …
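Keeping the validation class ratio representative is exactly what a stratified split guarantees. A sketch with sklearn's train_test_split(..., stratify=y), on toy labels standing in for the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy skewed labels: 1,000 minority vs 99,000 majority samples
y = np.array([0] * 99_000 + [1] * 1000)
X = np.arange(len(y)).reshape(-1, 1)  # dummy features

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# The validation split preserves the 1% minority ratio
print(y_val.mean())  # ≈ 0.01
```

Any over- or under-sampling is then applied to (X_tr, y_tr) only, leaving the validation ratio untouched.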
Category: Data Science

Generate a balanced batch with ImageDataGenerator() and flow_from_directory()

Hi, I am new to Python and deep learning. I am doing a multiclass classification. My 3-class dataset is imbalanced; the classes take up about 50%, 40%, and 20%. I am trying to generate mini-batches with balanced classes. I am using class_weight to generate a balanced batch in fit_generator(), but I doubt it actually works, because the batches generated by train_datagen.flow_from_directory() are not balanced: the generated batches have weights around [0.43, 0.38, 0.19]. My code is as follows: …
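Note that class_weight does not change which images flow_from_directory() puts in a batch; it rescales each sample's contribution to the loss, so batch proportions staying near the class frequencies is expected behavior. A sketch (the per-class counts are hypothetical) of building a Keras-style class_weight dict from directory counts:

```python
# Hypothetical per-class image counts read from the training directory
counts = {0: 500, 1: 400, 2: 100}
total = sum(counts.values())

# Inverse-frequency weights: a perfectly balanced dataset would give 1.0 each
class_weight = {c: total / (len(counts) * n) for c, n in counts.items()}
print(class_weight)  # {0: 0.666..., 1: 0.833..., 2: 3.333...}
```

For truly balanced batches one would instead need a balanced generator, e.g. sampling an equal number of file paths per class in a custom generator, rather than class_weight.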
Category: Data Science

Which metric should I use for classifying imbalanced data with fewer labels for the negative class?

From my reading, I understand that when we have fewer positive class labels, it is better to use precision or recall as the evaluation metric. Which metric should I use when we have fewer negative samples? I'm looking for an approach other than switching the labels. Problem setting: I'm developing parametrized fragility functions for predicting damage to a structure (for example, trees). An example of a fragility function is here. The fragility function will estimate the probability of exceeding a damage state …
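Most sklearn metrics let you treat the rare negative class as the scored class via the pos_label parameter, without relabeling anything. A toy sketch:

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: negatives (0) are the rare class here
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1]

# Recall of the negative class (a.k.a. specificity) and its precision
print(recall_score(y_true, y_pred, pos_label=0))     # 0.5
print(precision_score(y_true, y_pred, pos_label=0))  # 0.5
```

So precision/recall (or F1) computed with pos_label=0 gives the same rare-class focus without switching the labels.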
Category: Data Science

What's the best way to validate a rare event detection model during training?

When training a deep model for rare event detection (e.g. the sound of an alarm in a home device's audio stream), is it best to use a balanced validation set (50% alarm, 50% normal) to determine early stopping etc., or a validation set representative of reality? If an unbalanced, realistic validation set is used, it may have to be huge just to contain a few positive event examples, so I'm wondering how this is typically dealt with. In the given example …
Category: Data Science

Unbalanced data set - how to optimize hyperparams via grid search?

I would like to optimize the hyperparameters C and gamma of an SVC using grid search on an unbalanced dataset. So far I have used class_weight='balanced' and selected the best hyperparameters based on the average of the F1-scores. However, the dataset is very unbalanced, i.e. if I choose GridSearchCV with cv=10, then some minority classes are not represented in the validation data. I'm thinking of using SMOTE, but I see the problem here that I would have …
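The missing-minority-classes part is addressed by stratified folds, which keep every class's ratio in each fold; if SMOTE is used, it belongs inside a CV pipeline (e.g. imblearn's Pipeline) so it only ever sees the training folds. A sketch of the stratified grid search with sklearn alone, on synthetic data standing in for the real set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic unbalanced binary data (10% minority)
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Every fold keeps the original class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(SVC(class_weight="balanced"),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                    scoring="f1_macro", cv=cv)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

With very small minority classes, fewer folds (so each validation fold holds more minority samples) is often the more robust choice than cv=10.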
Category: Data Science

Handling Imbalanced Datasets in Orange

I work in the medical domain, so class imbalance is the rule and not the exception. While I know Python has packages for class imbalance, I don't see an option in Orange, e.g. a SMOTE widget. I have read other threads on Stack Exchange regarding this, but I have not found an answer on how to tackle class imbalance in Orange without resorting to Python programming. Thanks
Category: Data Science

Overfitted model produces similar AUC on test set, so which model do I go with?

I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before, versus after, the training folds are selected. The oversampling approach I used was random oversampling. I understand that the first approach is wrong, since observations that the model has seen bleed into the test set; I was just curious how much of a difference this causes. I generated a binary classification dataset with the following: # Generate binary classification dataset with 5% minority class, …
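For reference, the leakage-free order is: split first, then oversample only the training portion. A NumPy sketch of random oversampling applied after the split (toy data with a 5% minority class, as in the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy data: 5% minority class
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Random oversampling on the training split only: duplicate minority rows
min_idx = np.flatnonzero(y_tr == 1)
extra = rng.choice(min_idx, size=(y_tr == 0).sum() - len(min_idx), replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

print((y_bal == 0).sum(), (y_bal == 1).sum())  # classes now equal; test set untouched
```

Because the duplicated rows never cross the split boundary, the test AUC measured this way is an unbiased estimate, unlike oversampling before the split.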
Category: Data Science

Using keras with sklearn: apply class_weight with cross_val_score

I have a highly imbalanced dataset (±5% positive instances) for which I am training binary classifiers. I am using nested 5-fold cross-validation with grid search for hyperparameter tuning. I want to avoid undersampling, so I have been looking into the class_weight hyperparameter. For sklearn's decision tree classifier this works really well and is easily given as a hyperparameter. However, this is not an option for sklearn's neural network (multi-layer perceptron), as far as I can tell. I have been using …
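One workaround that carries over to any estimator (for a Keras model, the weights would go to model.fit(..., class_weight=...)) is a manual CV loop where balanced sample weights are computed per training fold. A sketch using a sklearn tree as a stand-in for the network:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic data with ~5% positives, as in the question
X, y = make_classification(n_samples=400, weights=[0.95, 0.05], random_state=0)

scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # Balanced weights computed from the training fold only (no leakage)
    sw = compute_sample_weight("balanced", y[tr])
    clf = DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr], sample_weight=sw)
    scores.append(clf.score(X[te], y[te]))

print(np.round(np.mean(scores), 3))
```

The same loop works for a Keras classifier by replacing the fit call; computing the weights per fold keeps the procedure valid inside nested CV.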
Category: Data Science

ROC-AUC Imbalanced Data Score Interpretation

I have a binary response variable (label) in a dataset with around 50,000 observations. The training set is somewhat imbalanced, with label=1 making up about 33% of the observations and label=0 about 67%. Right now, with XGBoost, I'm getting a ROC-AUC score of around 0.67. The response variable is binary, so the baseline is 50% in terms of chance, but at the same time the data is imbalanced, so if the model just guessed label=0 …
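One useful property here: ROC-AUC's chance baseline stays at 0.5 regardless of the 67/33 imbalance, because the metric conditions on both classes. A quick sketch with uninformative random scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Labels with the same ~33/67 imbalance as the question
y = (rng.random(20_000) < 0.33).astype(int)

# Scores that carry no information about the label
auc_random = roc_auc_score(y, rng.random(20_000))
print(round(auc_random, 2))  # ~0.5 despite the imbalance
```

So an AUC of 0.67 is genuinely above chance; always guessing label=0 would inflate accuracy to ~67%, but not AUC.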
Category: Data Science

Undersample to get a specific number of samples per class using Tomek links from imblearn

I have a dataset with the classes in my target column distributed as shown below.

class  count  percent
6      1507   27.045944
3      1301   23.348887
5       661   11.862886
4       588   10.552764
7       564   10.122039
8       432    7.753051
1       416    7.465901
2        61    1.094760
9        38    0.681981
10        4    0.071788

I would like to undersample my data to include at most 588 samples per class, so that classes 6, 3 & 5 have only ~588 samples available after undersampling. Here's …
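TomekLinks itself doesn't accept a target count (it only removes Tomek-link pairs), whereas imblearn's RandomUnderSampler takes a per-class dict such as {6: 588, 3: 588, 5: 588}. The capping itself is also easy to sketch in plain NumPy:

```python
import numpy as np

def cap_per_class(y, cap, seed=0):
    """Return indices keeping at most `cap` randomly chosen samples per class."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

# Toy labels mimicking the four largest classes from the table
y = np.array([6] * 1507 + [3] * 1301 + [5] * 661 + [4] * 588)
kept = cap_per_class(y, cap=588)
counts = dict(zip(*np.unique(y[kept], return_counts=True)))
print(counts)  # every class capped at 588
```

Classes already at or below the cap (class 4 here) are kept in full; only the oversized ones are randomly trimmed.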
Category: Data Science

class imbalance - applied SMOTE - next steps

I am new to ML and have learnt a lot from your valuable posts. I need your advice on the following situation, and guidance on whether the steps make sense. I have a binary classification problem; my dataset has a severe imbalance, approximately 2% positive cases (4,000 cases) out of a total of 200,000 cases. I separated my dataset into a train and a test set (80/20 stratified split). My train set now has a total of 160,000 cases (3,200 positive cases) and the test …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.