I use the "classification_report" from from sklearn.metrics import classification_report in order to evaluate the imbalanced binary classification Classification Report : precision recall f1-score support 0 1.00 1.00 1.00 28432 1 0.02 0.02 0.02 49 accuracy 1.00 28481 macro avg 0.51 0.51 0.51 28481 weighted avg 1.00 1.00 1.00 28481 I do not understand clearly what is the meaning of macro avg and weighted average? and how we can clarify the best solution based on how close their amount to one! …
I have a dataset intended for a classification problem. Due to the imbalance of the target Y, I chose to reframe it as an anomaly detection task. Should I use the Y I have as a feature inside the anomaly detection model? Is that an overfitting risk?
My dataset is text data consisting of path names. I am using a TF-IDF vectorizer and decision trees. The classes in my dataset are severely imbalanced: a few large classes have more than 500 samples each, while other minor classes have fewer than 100, and some even fewer than 20. This is real collected data, so the chance of the model seeing a minor class in actual deployment …
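A minimal sketch of one common setup for this kind of problem: a TF-IDF plus decision tree pipeline with class_weight="balanced", which reweights the split criterion by inverse class frequency so the small classes are not ignored (the path names and labels below are placeholders):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical path-name data standing in for the real dataset.
texts = ["/usr/local/bin/tool", "C:/Users/app/config.ini", "/var/log/syslog"]
labels = ["linux", "windows", "linux"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),   # character n-grams suit path strings
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])
clf.fit(texts, labels)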
I am working on a model that will run monthly on 8M users. I have snapshot-wise data in the training set, e.g.:

Jan 2021 snapshot: 8M total, 233 positives, rest negative
Feb 2021 snapshot: 8M total, 599 positives, rest negative
Mar 2021 snapshot: 8M total, 600 positives, rest negative
Apr 2021 snapshot: 8M total, 750 positives, rest negative

and similarly until March 2022. I'm keeping March 2022 as the test set, which has 2000 positive labels …
The target is a probability distribution over N classes; I don't want the model to predict the class with the highest probability but the 'actual' probability per class. For example:

|   | Class 1 | Class 2 | Class 3 |
|---|---------|---------|---------|
| 1 | 0.9     | 0.05    | 0.05    |
| 2 | 0.2     | 0.8     | 0       |
| 3 | 0.3     | 0.3     | 0.4     |
| 4 | 0.7     | 0       | 0.3     |
| + | …
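One common way to get this behaviour is to train on the soft probability targets directly: a softmax output with categorical cross-entropy accepts probability vectors as labels, not only one-hot classes. A minimal Keras sketch (the features are made up; only the target rows come from the table above):

import numpy as np
import tensorflow as tf

X = np.random.rand(4, 10).astype("float32")      # hypothetical features
Y = np.array([[0.9, 0.05, 0.05],                 # soft targets from the table above
              [0.2, 0.8,  0.0 ],
              [0.3, 0.3,  0.4 ],
              [0.7, 0.0,  0.3 ]], dtype="float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X, Y, epochs=5, verbose=0)
print(model.predict(X))                          # per-class probabilities, not hard labels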
I am working on a relation extraction and classification problem. The data is in the form of text files and is imbalanced. I want to use the focal loss function to address the class imbalance in the data. My question is: can focal loss be used for an extraction and classification task to increase accuracy? Focal loss has been applied to object detection and image classification tasks. The link is below. I want to use this on text …
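Focal loss operates at the loss-function level and is not tied to images, so it can in principle be dropped into a text classifier. A minimal PyTorch sketch of FL(p_t) = -(1 - p_t)^gamma * log(p_t) for multi-class classification (the batch below is made up; gamma=2 follows the common default):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Per-example cross-entropy, kept unreduced so each example can be reweighted.
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                     # predicted probability of the true class
    return ((1 - pt) ** gamma * ce).mean()  # down-weight easy, well-classified examples

# Hypothetical batch: 4 examples, 3 relation classes.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 0])
loss = focal_loss(logits, targets)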
I'm facing a classification problem on a dataset. The target variable is binary (2 classes, 0 and 1). I have 8,161 samples in the training dataset, distributed as follows:

class 0: 6,008 samples (73.6% of the total)
class 1: 2,153 samples (26.4%)

My questions are: in this case, should I consider this an imbalanced dataset? If so, should I process the data before using RandomForest to make a …
I am working on a multi-class classification project and have a question: I have three classes with proportions of 50%, 47%, and 3%. I decided to use the class_weight="balanced" parameter in the random forest classifier. Now I want to calculate accuracy. Should I use balanced accuracy, or can I use plain accuracy?
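A small illustrative sketch of the difference (the labels are made up to mirror the 50/47/3 proportions): plain accuracy can look excellent while the 3% class is missed entirely, whereas balanced accuracy, the unweighted mean of per-class recall, exposes it.

from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0]*50 + [1]*47 + [2]*3
y_pred = [0]*50 + [1]*47 + [1]*3   # every sample of the 3% class mispredicted as class 1

print(accuracy_score(y_true, y_pred))           # 0.97
print(balanced_accuracy_score(y_true, y_pred))  # ~0.67 = (1.0 + 1.0 + 0.0) / 3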
I have a severely skewed dataset consisting of twenty-something classes, where the smallest class contains on the order of 1,000 samples and the largest several million. Regarding the validation data, I understand that I should make sure it represents a class ratio similar to the one in my original raw data. Hence, I shouldn't do any under- or over-sampling on the validation data, but I can do it on the training data. Because I have such …
Hi, I am new to Python and deep learning. I am doing multiclass classification. My 3-class dataset is imbalanced; the classes take up about 50%, 40%, and 20%. I am trying to generate mini-batches with balanced classes. I am using class_weight to generate balanced batches in fit_generator(), but I doubt it actually works, because the batches generated by train_datagen.flow_from_directory() are not balanced: the generated batches have class proportions of around [0.43, 0.38, 0.19]. My code is as follows: …
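For context, class_weight does not change which samples the generator draws, so flow_from_directory() batches keep the natural class ratio; the weights only rescale each sample's contribution to the loss. A minimal sketch of computing and passing them (the per-class counts below are placeholders):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical per-class sample counts taken from the training directory.
labels = np.array([0]*500 + [1]*400 + [2]*200)

weights = compute_class_weight(class_weight="balanced", classes=np.unique(labels), y=labels)
class_weight = dict(enumerate(weights))   # e.g. {0: 0.73, 1: 0.92, 2: 1.83}

# Batches are still drawn at the natural class ratio, but minority samples
# count more in the loss:
# model.fit(train_generator, class_weight=class_weight, epochs=10)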
From my reading, I understand that when we have fewer positive class labels, it is better to use precision or recall as the evaluation metric. Which metric should I use when we have fewer negative samples? I'm looking for an approach other than switching the labels. Problem setting: I'm developing parametrized fragility functions for predicting damage to a structure (for example, trees). An example of a fragility function is here. The fragility function will estimate the probability of exceeding a damage state …
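One option sometimes used when negatives are the rare class is to report the negative-class counterparts of recall and precision, i.e. specificity and negative predictive value, computed directly from the confusion matrix without relabelling the data. A minimal sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]   # hypothetical: negatives are rare
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)   # "recall" of the negative class
npv = tn / (tn + fn)           # "precision" of the negative class
print(specificity, npv)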
When training a deep model for rare event detection (e.g. the sound of an alarm in a home device audio stream), is it best to use a balanced validation set (50% alarm, 50% normal) to determine early stopping etc., or a validation set representative of reality? If an unbalanced, realistic validation set is used, it may have to be huge to contain even a few positive event examples, so I'm wondering how this is typically dealt with. In the given example …
I have 3 classes with this distribution:

Class 0: 0.1169
Class 1: 0.7668
Class 2: 0.1163

I am using xgboost for classification. I know there is a parameter called scale_pos_weight, but how is it handled in the multiclass case, and how can I set it properly?
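scale_pos_weight is defined for binary problems only; a common multiclass workaround is to pass per-sample weights that are inversely proportional to class frequency. A minimal sketch (the data is synthetic, drawn with roughly the class proportions above):

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.choice([0, 1, 2], size=1000, p=[0.1169, 0.7668, 0.1163])

# One weight per training sample, inversely proportional to its class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y)

clf = XGBClassifier(objective="multi:softprob")
clf.fit(X, y, sample_weight=weights)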
I would like to optimize the hyperparameters C and gamma of an SVC using grid search on an unbalanced dataset. So far I have used class_weight='balanced' and selected the best hyperparameters based on the average of the F1 scores. However, the dataset is very unbalanced, i.e. if I choose GridSearchCV with cv=10, then some minority classes are not represented in the validation data. I'm thinking of using SMOTE, but I see the problem here that I would have …
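A common way to combine SMOTE with grid search without leaking synthetic samples into the validation folds is an imbalanced-learn Pipeline, which refits SMOTE on each training fold only, together with stratified folds so every class appears in each validation split. A minimal sketch (the grid values and fold count are placeholders; X, y are the unbalanced training data):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),   # applied to the training folds only
    ("svc", SVC(kernel="rbf")),
])

param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=cv)
# search.fit(X, y)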
I work in the medical domain, so class imbalance is the rule and not the exception. While I know Python has packages for class imbalance, I don't see an option in Orange, e.g. a SMOTE widget. I have read other threads on Stack Exchange regarding this, but I have not found an answer on how to tackle class imbalance in Orange without resorting to Python programming. Thanks
I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before the folds are selected versus oversampled after the training folds are selected. The oversampling approach I used was random oversampling. I understand that the first approach is wrong, since observations that the model has seen bleed into the validation folds; I was just curious how much of a difference this causes. I generated a binary classification dataset with the following: # Generate binary classification dataset with 5% minority class, …
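Not the original code, but a rough sketch of the two setups being compared (dataset parameters and the classifier are placeholders): oversampling before cross-validation lets duplicated minority rows appear in both training and validation folds, while putting the oversampler inside an imbalanced-learn pipeline keeps each validation fold at the original distribution.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

# Leaky setup: oversample first, then cross-validate; the CV score tends to be optimistic.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
leaky = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, scoring="f1")
leaky.fit(X_over, y_over)

# Correct setup: the oversampler is refit on each training fold inside the pipeline.
pipe = Pipeline([("ros", RandomOverSampler(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
correct = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, scoring="f1")
correct.fit(X, y)

print(leaky.best_score_, correct.best_score_)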
I have a highly imbalanced dataset (roughly 5% positive instances) for which I am training binary classifiers. I am using nested 5-fold cross-validation with grid search for hyperparameter tuning. I want to avoid undersampling, so I have been looking into the class_weight hyperparameter. For sklearn's decision tree classifier this works really well and is easily given as a hyperparameter. However, it is not an option for sklearn's neural network (multi-layer perceptron), as far as I can tell. I have been using …
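Since sklearn's MLPClassifier exposes neither class_weight nor per-sample weights in fit(), one workaround people use is a small Keras MLP, where a class_weight dict can be passed to fit(). A minimal sketch (the data is synthetic with ~5% positives; the architecture is a placeholder):

import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20)).astype("float32")
y = (rng.random(2000) < 0.05).astype("int32")

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = {0: weights[0], 1: weights[1]}   # positive class weighted ~10, negative ~0.5

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=5, class_weight=class_weight, verbose=0)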
I have a binary response variable (label) in a dataset with around 50,000 observations. The training set is somewhat imbalanced, with label=1 making up about 33% of the observations and label=0 making up about 67%. Right now with XGBoost I'm getting a ROC-AUC score of around 0.67. The response variable is binary, so the baseline is 50% in terms of chance, but at the same time the data is imbalanced, so if the model just guessed =0 …
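For a binary split like this, XGBoost's scale_pos_weight is commonly set to the negative/positive count ratio (here about 67/33, roughly 2). A minimal sketch with stand-in arrays; in practice the ratio would be recomputed from the real y_train:

import numpy as np
from xgboost import XGBClassifier

# Hypothetical labels with the stated 67/33 split.
y_train = np.array([0]*335 + [1]*165)
X_train = np.random.default_rng(0).normal(size=(500, 10))

neg, pos = (y_train == 0).sum(), (y_train == 1).sum()

clf = XGBClassifier(scale_pos_weight=neg / pos)   # ~2: up-weights the positive class in the loss
clf.fit(X_train, y_train)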
I have a dataset with the classes in my target column distributed as shown below.

    counts   percents
6     1507  27.045944
3     1301  23.348887
5      661  11.862886
4      588  10.552764
7      564  10.122039
8      432   7.753051
1      416   7.465901
2       61   1.094760
9       38   0.681981
10       4   0.071788

I would like to undersample my data and keep at most 588 samples per class, so that classes 6, 3 and 5 only have ~588 samples available after undersampling. Here's …
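One way to express "at most 588 per class" is a sampling_strategy dict for imbalanced-learn's RandomUnderSampler that lists only the classes above the cap. A minimal sketch (X and y below are synthetic stand-ins built from the counts above):

import numpy as np
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

counts = {6: 1507, 3: 1301, 5: 661, 4: 588, 7: 564, 8: 432, 1: 416, 2: 61, 9: 38, 10: 4}
y = np.concatenate([np.full(n, cls) for cls, n in counts.items()])
X = np.arange(len(y)).reshape(-1, 1)

cap = 588
# Downsample only the classes above the cap; smaller classes keep all their samples.
strategy = {cls: cap for cls, n in Counter(y).items() if n > cap}

X_res, y_res = RandomUnderSampler(sampling_strategy=strategy, random_state=0).fit_resample(X, y)
print(Counter(y_res))   # classes 6, 3 and 5 are now at 588; the rest are unchanged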
I am new to ML and have learnt a lot from your valuable posts. I need your advice on the following situation and guidance on whether the steps make sense. I have a binary classification problem; my dataset has a severe imbalance, with approximately 2% positive cases (4,000) out of a total of 200,000 cases. I separated my dataset into a train set and a test set (80/20 stratified split). My train set now has a total of 160,000 cases (3,200 positive cases) and the test …
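For reference, a minimal sketch of the stratified 80/20 split described (made-up arrays standing in for the real data); stratify=y preserves the ~2% positive rate in both splits:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 10))
y = np.zeros(200_000, dtype=int)
y[:4_000] = 1   # ~2% positives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_train.sum(), y_test.sum())   # 3,200 and 800 positives, the same 2% rate in each split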