Class weights for imbalanced data in multilabel problems

I am trying to train a CNN for a multi-class, multi-label classification task (20 classes; each sample can belong to one or more labels), and the dataset is highly imbalanced. In single-label cases I would use the compute_class_weight function from sklearn to calculate the class weights in order to help the optimizer account for the minority classes. However, for the multi-label case I feel it is not working as it is supposed to, because it considers as the number of samples the number of …
Category: Data Science
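A minimal sketch of one common workaround for the multilabel case (not necessarily what the asker ended up doing): treat each of the 20 labels as its own binary problem, compute a per-label positive weight from the label matrix, and feed it to a loss that accepts per-label weights such as PyTorch's BCEWithLogitsLoss. The label matrix Y below is synthetic.

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic multilabel target matrix: n_samples x n_labels (20 labels).
rng = np.random.default_rng(0)
Y = (rng.random((1000, 20)) < 0.1).astype(np.float32)

# Per-label positive weight = (# negatives) / (# positives), the convention
# expected by BCEWithLogitsLoss(pos_weight=...). Rare labels get large weights.
pos = Y.sum(axis=0)
neg = Y.shape[0] - pos
pos_weight = torch.tensor(neg / np.clip(pos, 1, None), dtype=torch.float32)

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Usage with dummy logits and targets of matching shape.
logits = torch.randn(8, 20)
loss = criterion(logits, torch.tensor(Y[:8]))
print(loss.item())
```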

Clustering with custom criterion (minimum cluster weight)

Edit: following a comment from @anony-mousse, I'm changing the question to search for a general clustering approach that matches this criterion (minimum weight per cluster). I am to use a clustering method on a set of $n$ weighted points:

---------------------------------------------
| id | weight | feature_1 | feature_2 | ... |
---------------------------------------------
| 1  | 4      | 0.2345    | -0.2345   | ... |
| 2  | 2      | 0.675     | 0.7433    | ... |
| 3  | 15     | -0.45     | 0.123     | …
Category: Data Science
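One general heuristic for the minimum-weight criterion (a sketch only, with made-up data and threshold): over-cluster with weighted k-means, then greedily merge any cluster whose total weight falls below the threshold into its nearest neighbour.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up weighted points: integer weights plus two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w = rng.integers(1, 20, size=200)
MIN_WEIGHT = 100  # assumed minimum total weight per cluster

# Start with more clusters than needed; KMeans accepts sample_weight directly.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X, sample_weight=w)
labels = km.labels_.copy()
centers = {c: km.cluster_centers_[c] for c in range(10)}

def cluster_weight(c):
    return w[labels == c].sum()

# Greedily merge under-weight clusters into the nearest remaining centroid.
merged = True
while merged:
    merged = False
    for c in list(centers):
        if cluster_weight(c) < MIN_WEIGHT and len(centers) > 1:
            nearest = min((o for o in centers if o != c),
                          key=lambda o: np.linalg.norm(centers[o] - centers[c]))
            labels[labels == c] = nearest
            mask = labels == nearest
            centers[nearest] = np.average(X[mask], axis=0, weights=w[mask])
            del centers[c]
            merged = True
            break

print({c: int(cluster_weight(c)) for c in centers})
```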

How to weight loss in regression

I've got a regression problem where a model is required to predict a value in the range [0, 1]. I've looked at the distribution of the data, and it seems that there are more examples with a low-value label ([0, 0.2]) than with higher-value labels ([0.2, 1]). When I try to train the model using the MAE metric, the model converges to a state where it has a very low loss, but it seems that the …
Category: Data Science
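One way to counteract the skewed label distribution (a sketch on synthetic data, mirroring sklearn's "balanced" class-weight idea by binning the continuous target): weight each sample inversely to the frequency of its target bin and pass the weights to the fit.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Synthetic skewed targets in [0, 1]: most labels fall in [0, 0.2].
rng = np.random.default_rng(0)
y = np.clip(rng.beta(1.2, 6.0, size=5000), 0, 1)
X = rng.normal(size=(5000, 5)) + y[:, None]   # toy features correlated with y

# Weight each sample inversely to the frequency of its target bin so that
# the rarer high-value labels contribute more to the (absolute-error) loss.
bins = np.linspace(0, 1, 11)
bin_idx = np.digitize(y, bins[1:-1])
bin_counts = np.bincount(bin_idx, minlength=10)
weights = len(y) / (10 * bin_counts[bin_idx])

model = HistGradientBoostingRegressor(loss="absolute_error", random_state=0)
model.fit(X, y, sample_weight=weights)
```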

Weighted loss functions vs weighted sampling?

For image classification tasks, is there a practical difference between using weighted loss functions vs. using weighted sampling? (I would appreciate theoretical arguments, experience or published papers, anything really.) Some details: By "weighted sampling", I mean attributing different sampling probabilities for each sample in the training set. By "weighted loss functions", I mean weighting error terms differently depending on the sample considered.
Category: Data Science
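For concreteness, here is how the two options typically look side by side in PyTorch (toy data; neither option is endorsed over the other by the question):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Toy imbalanced 3-class dataset.
labels = torch.tensor([0] * 900 + [1] * 80 + [2] * 20)
features = torch.randn(1000, 16)
dataset = TensorDataset(features, labels)

class_counts = torch.bincount(labels).float()
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

# Option A: weighted sampling -- minority samples are drawn more often, so each
# batch is roughly balanced and the loss stays unweighted.
sampler = WeightedRandomSampler(class_weights[labels],
                                num_samples=len(dataset), replacement=True)
loader_a = DataLoader(dataset, batch_size=64, sampler=sampler)
criterion_a = torch.nn.CrossEntropyLoss()

# Option B: weighted loss -- batches keep the natural class ratio, but errors
# on minority classes are penalized more heavily.
loader_b = DataLoader(dataset, batch_size=64, shuffle=True)
criterion_b = torch.nn.CrossEntropyLoss(weight=class_weights)
```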

How to apply class weight to a multi-output model?

I have a model with 2 categorical outputs. The first output layer can predict 2 classes: [0, 1] and the second output layer can predict 3 classes: [0, 1, 2]. How can I apply different class weight dictionaries for each of the outputs? For example, how could I apply the dictionary {0: 1, 1: 10} to the first output, and {0: 5, 1: 1, 2: 10} to the second output? I've tried to use the following class weights dictionary weight_class={'output1': …
Category: Data Science
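One commonly suggested workaround (a sketch with a hypothetical two-headed model; the weight values come from the question): since Keras's class_weight argument does not take per-output dictionaries, convert each class-weight dict into a per-sample weight vector and pass those via sample_weight keyed by output name.

```python
import numpy as np
from tensorflow import keras

# Hypothetical two-headed model: output1 has 2 classes, output2 has 3 classes.
inputs = keras.Input(shape=(8,))
x = keras.layers.Dense(16, activation="relu")(inputs)
out1 = keras.layers.Dense(2, activation="softmax", name="output1")(x)
out2 = keras.layers.Dense(3, activation="softmax", name="output2")(x)
model = keras.Model(inputs, [out1, out2])
model.compile(optimizer="adam",
              loss={"output1": "sparse_categorical_crossentropy",
                    "output2": "sparse_categorical_crossentropy"})

# Toy data.
X = np.random.rand(100, 8)
y1 = np.random.randint(0, 2, size=100)
y2 = np.random.randint(0, 3, size=100)

# Translate the per-output class-weight dicts into per-sample weights.
cw1 = {0: 1.0, 1: 10.0}
cw2 = {0: 5.0, 1: 1.0, 2: 10.0}
sw1 = np.vectorize(cw1.get)(y1).astype("float32")
sw2 = np.vectorize(cw2.get)(y2).astype("float32")

model.fit(X, {"output1": y1, "output2": y2},
          sample_weight={"output1": sw1, "output2": sw2},
          epochs=2, verbose=0)
```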

Assign more importance to recent records during training

My goal is to build a classification model in order to predict if a customer will buy a product or not (binary classification). Since I know that the company's advertising has changed a bit in the last few months (let's say 3-4), I want to put more emphasis on the newer records. I know that it is possible to specify the sample_weight parameter in most classification algorithms, but I don't know how to properly build these weights. …
Category: Data Science
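One simple recipe for such recency weights (a sketch on synthetic data; the half-life is an assumption to be tuned): decay each record's weight exponentially with its age and pass the result as sample_weight.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data with a timestamp per record.
rng = np.random.default_rng(0)
n = 5000
dates = pd.Timestamp("2024-01-01") + pd.to_timedelta(rng.integers(0, 365, n), unit="D")
X = rng.normal(size=(n, 4))
y = rng.integers(0, 2, n)

# Exponential decay: a record loses half its weight every `half_life` days.
half_life = 90  # assumed; roughly when the advertising changed
age_days = (dates.max() - dates).days.to_numpy()
sample_weight = 0.5 ** (age_days / half_life)

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=sample_weight)
```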

Training a model where each response in the observation data has a different known variance

I have a dataset where each response variable is the number of successes out of N Bernoulli trials, with N and p (the probability of success) being different for each observation. The goal is to train a model to predict p given the predictors. However, observations with a small N will have a higher variance than those with a higher N. Consider the following scenario to illustrate this better: assume coins with different pictures on them have a different bias, and that the bias is …
Category: Data Science
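One standard way to handle the unequal variances (a sketch on simulated coin flips): fit a binomial GLM on the (successes, failures) pairs, so the likelihood itself gives observations with larger N more weight instead of weighting residuals by hand.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: each row is N Bernoulli trials with k observed successes.
rng = np.random.default_rng(0)
n_obs = 500
X = rng.normal(size=(n_obs, 3))
N = rng.integers(1, 100, size=n_obs)                  # trials per observation
true_p = 1 / (1 + np.exp(-(0.5 * X[:, 0] - X[:, 1])))
k = rng.binomial(N, true_p)                           # observed successes

# A binomial GLM on (successes, failures) uses the full likelihood, so rows
# with large N automatically carry more information than rows with small N.
endog = np.column_stack([k, N - k])
exog = sm.add_constant(X)
result = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(result.summary())
```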

Understanding Weighted learning in Ensemble Classifiers

I'm currently studying Boosting techniques in Machine Learning and I happened to understand that in Algorithms like Adaboost, each of the training samples is given a weight depending on whether it was misclassified or not by the previous model in sequential boosting. Although I intuitively understand that by weighting examples, we are letting the model pay more attention to examples that were previously misclassified, I do not understand "how" the weights are taken into account by a machine learning algorithm. …
Category: Data Science
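To make the mechanics concrete, here is a toy run of discrete AdaBoost (a sketch, not a production implementation): the sample weights are passed straight into the base learner's fit(), where splits are chosen by weighted error/impurity, and are then updated from the weighted error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_signed = np.where(y == 1, 1, -1)

n = len(y)
w = np.full(n, 1 / n)                      # start with uniform sample weights

for t in range(5):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)       # the learner minimizes *weighted* error
    pred = np.where(stump.predict(X) == 1, 1, -1)

    err = np.sum(w[pred != y_signed])      # weighted error rate (w sums to 1)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))

    # Misclassified samples get their weight increased, correct ones decreased.
    w *= np.exp(-alpha * y_signed * pred)
    w /= w.sum()
    print(f"round {t}: weighted error = {err:.3f}, alpha = {alpha:.3f}")
```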

Assigning weights based on outcome probability

In a classification problem, is it suitable to assign sample weights based on their positive class probability? For example, if I am building a binary classification problem where one of the independent features has three possible values:
- a – 2% of the samples, probability for positive class = 90%
- b – 8% of the samples, probability for positive class = 40%
- c – 90% of the samples, probability for positive class = 5%
Can I assign the samples weights based …
Category: Data Science
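Purely to illustrate the mechanics being asked about (synthetic data matching the stated proportions; whether target-derived weights are a good idea is exactly the open question): the per-group positive rates can be turned into sample weights like this.

```python
import numpy as np
import pandas as pd

# Synthetic data matching the proportions in the question.
rng = np.random.default_rng(0)
group = rng.choice(["a", "b", "c"], size=10_000, p=[0.02, 0.08, 0.90])
pos_rate = {"a": 0.90, "b": 0.40, "c": 0.05}
y = (rng.random(10_000) < pd.Series(group).map(pos_rate).to_numpy()).astype(int)
df = pd.DataFrame({"group": group, "y": y})

# Empirical positive-class probability per feature value, relative to the
# overall rate, used directly as a sample weight (the scheme under discussion).
p_pos = df.groupby("group")["y"].mean()
overall = df["y"].mean()
df["weight"] = df["group"].map(p_pos / overall)
print(df.groupby("group")["weight"].first())
```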

XGBoost: How to obtain scale_pos_weight for multi classes?

I know there is a similar question at "Unbalanced multiclass data with XGBoost", but I don't understand the reply provided by @Esmailian. What is the actual formula to obtain 1, 0.333 and 0.167? For example, if we have three imbalanced classes with the ratios
- class A = 10%
- class B = 30%
- class C = 60%
their weights would be (dividing the smallest class by the others)
- class A = 1.000
- class B = 0.333
- class C = 0.167
Will I obtain …
Category: Data Science
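The arithmetic behind 1, 0.333 and 0.167 is just "smallest share divided by each class's share"; and since scale_pos_weight only applies to binary problems, one way to use these numbers for three classes is as per-row sample weights (a sketch on synthetic data):

```python
import numpy as np
from xgboost import XGBClassifier

# Divide the smallest class share by each class's own share.
shares = np.array([0.10, 0.30, 0.60])      # classes A, B, C
class_weights = shares.min() / shares
print(class_weights)                        # [1.0, 0.333..., 0.1666...]

# For multiclass, give each row its class's weight through sample_weight.
rng = np.random.default_rng(0)
y = rng.choice(3, size=3000, p=shares)      # 0 = A, 1 = B, 2 = C
X = rng.normal(size=(3000, 5))

model = XGBClassifier(eval_metric="mlogloss")
model.fit(X, y, sample_weight=class_weights[y])
```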

Ensemble/combining models weighted by number of observations?

Across a few different projects, I have hit a problem where I have two (or more) models:
- General-purpose model: a model based on a large amount of data not specifically relevant to my current classifier label/goal, but which predicts other labels using similar features.
- Cold-start model: a model trained on data specifically related to my current label/task, which initially starts with zero observations and goes up from there.
So then, my question: what is an appropriate way …
Category: Data Science
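One simple blending scheme for this situation (a sketch; the shrinkage constant k is a made-up knob): average the two models' predicted probabilities, with the cold-start model's weight growing with the number of task-specific observations.

```python
import numpy as np

def blend_probabilities(p_general, p_specific, n_obs, k=200):
    """Blend two models' predicted probabilities.

    The cold-start model's weight grows with the number of task-specific
    observations it was trained on: w = n / (n + k), where k (hypothetical)
    controls how quickly trust shifts from the general-purpose model.
    """
    w = n_obs / (n_obs + k)
    return (1 - w) * p_general + w * p_specific

# With 50 task observations the general model dominates; with 5000 it doesn't.
p_gen, p_spec = np.array([0.30, 0.70]), np.array([0.60, 0.10])
print(blend_probabilities(p_gen, p_spec, n_obs=50))
print(blend_probabilities(p_gen, p_spec, n_obs=5000))
```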

Logistic Regression: shouldn't weighting by the number of instances give the same result? What could explain the discrepancy?

I am performing a logistic regression in a standard supervised framework (dataset X, target y). The dataset X is composed of a handful of categorical variables (which I one-hot encode), so it contains a lot of redundant rows (thousands of unique rows over millions of initial rows). Having so many redundant rows, I was tempted to aggregate them, weight them by their count in the fit, and get approximately the same result. However, I was surprised to get variation …
Category: Data Science
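A small experiment along the lines of the question (a sketch on synthetic one-hot data): fit on all rows versus on unique rows weighted by their counts. With the same regularization strength and a tight solver tolerance, the two solutions should agree closely; loose tolerances are a common source of the observed discrepancy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic categorical data with heavily duplicated rows.
rng = np.random.default_rng(0)
raw = pd.DataFrame({"cat1": rng.choice(["a", "b", "c"], size=100_000),
                    "cat2": rng.choice(["x", "y"], size=100_000)})
X_full = pd.get_dummies(raw).to_numpy(dtype=float)
logit = X_full @ np.array([0.8, -0.2, 0.1, 0.5, -0.5])
y_full = (rng.random(100_000) < 1 / (1 + np.exp(-logit))).astype(int)

# Fit 1: all rows.
m1 = LogisticRegression(C=1.0, tol=1e-10, max_iter=10_000).fit(X_full, y_full)

# Fit 2: unique (row, label) combinations weighted by their counts.
df = pd.DataFrame(X_full)
df["y"] = y_full
grouped = df.groupby(list(df.columns), as_index=False).size()
X_u = grouped.iloc[:, :-2].to_numpy()
y_u = grouped["y"].to_numpy()
m2 = LogisticRegression(C=1.0, tol=1e-10, max_iter=10_000).fit(
    X_u, y_u, sample_weight=grouped["size"].to_numpy())

print(np.abs(m1.coef_ - m2.coef_).max())   # should be very small
```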

Weight for Samples on SVM

There is a sample_weight option in the fit(X[, y, sample_weight]) method (OneClassSVM, sklearn library). If I use the sample_weight option, I can give more weight to some points (those that are likely to be more normal), right? Otherwise, what does sample_weight mean? Link: https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM.score_samples
Category: Data Science
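The scikit-learn documentation describes sample_weight as rescaling each point's penalty, so up-weighted points are more expensive to leave outside the learned region and effectively count as "more normal". A small sketch on made-up data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Made-up 2-D data: a dense blob plus a few distant points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(6, 0.5, size=(5, 2))])

# Up-weighting the distant points tells the model to treat them as normal.
weights = np.ones(len(X))
weights[-5:] = 20.0

unweighted = OneClassSVM(nu=0.05, gamma="scale").fit(X)
weighted = OneClassSVM(nu=0.05, gamma="scale").fit(X, sample_weight=weights)

print("unweighted predictions for distant points:", unweighted.predict(X[-5:]))
print("weighted predictions for distant points:  ", weighted.predict(X[-5:]))
```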

Implementing class weighting in Faster RCNN

I have a dataset (around 45,000 screenshots) of UI elements (UI trees containing element types and bounding boxes) and the associated screenshots. The dataset is highly imbalanced, with the button element heavily overrepresented. When training on my local machine on a tiny subset of the data (900 screenshots for training, 100 for testing) for 10 epochs, my results aren't bad. I then trained the model on Azure ML with 25,000 screenshots for 13 epochs (which took about 3 days), and my …
Category: Data Science
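One approach people sometimes use here (a sketch only, assuming a recent torchvision where the ROI-head classification loss is a plain cross-entropy computed in roi_heads.fastrcnn_loss; the weight values and num_classes=5 are hypothetical): replace that loss with a class-weighted version that down-weights the over-represented button class.

```python
import torch
import torch.nn.functional as F
from torchvision.models.detection import fasterrcnn_resnet50_fpn, roi_heads

# Hypothetical per-class weights: index 0 is background; the over-represented
# "button" class (index 1 here) gets a weight below 1.
class_weights = torch.tensor([1.0, 0.2, 1.0, 1.0, 1.0])

def weighted_fastrcnn_loss(class_logits, box_regression, labels, regression_targets):
    """Drop-in replacement for roi_heads.fastrcnn_loss with weighted CE."""
    labels = torch.cat(labels, dim=0)
    regression_targets = torch.cat(regression_targets, dim=0)

    # The only change vs. torchvision: per-class weights in the cross-entropy.
    classification_loss = F.cross_entropy(
        class_logits, labels, weight=class_weights.to(class_logits.device))

    # Box-regression part kept as in torchvision (smooth L1 on positives only).
    pos_inds = torch.where(labels > 0)[0]
    labels_pos = labels[pos_inds]
    N = class_logits.shape[0]
    box_regression = box_regression.reshape(N, box_regression.size(-1) // 4, 4)
    box_loss = F.smooth_l1_loss(
        box_regression[pos_inds, labels_pos],
        regression_targets[pos_inds],
        beta=1 / 9, reduction="sum") / labels.numel()

    return classification_loss, box_loss

roi_heads.fastrcnn_loss = weighted_fastrcnn_loss   # patch before training
model = fasterrcnn_resnet50_fpn(weights=None, num_classes=5)
```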

CNN - imbalanced classes, class weights vs data augmentation

I have a dataset with a few strongly imbalanced classes, e.g. the smallest class is about 54 times smaller than the largest. Therefore, data augmentation in order to equalize the size of the classes seems like a bad idea to me (in the example above, each image would have to be augmented 54 times on average). So I thought that I could do less augmentation of the minority classes and then use class weights in the loss function. Is this approach better …
Category: Data Science
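For the "moderate augmentation plus class weights" combination, a minimal sketch in Keras (synthetic labels reproducing the ~54x imbalance; layer choices are arbitrary):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow import keras

# Synthetic labels: the minority class is 54x rarer than the majority class.
y_train = np.concatenate([np.zeros(5400), np.ones(100)]).astype(int)

# "Balanced" class weights from sklearn, as the dict Keras expects.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): w for c, w in zip(classes, weights)}
print(class_weight)   # the minority class gets a much larger weight

# Modest augmentation applied to every class; the residual imbalance is
# handled by the loss via class_weight rather than by 54x oversampling.
augment = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),
])

# model.fit(train_images, y_train, class_weight=class_weight, ...)
```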

Are there R functions that allow testing for overdispersion when fitting a model with a survey design?

I realized I need to use the package survey to be able to include sample weights in my regression analysis. Initially, I wanted to use a negative binomial regression on each one of my outcomes as count data is more often than not overdispersed, so I tried using svyglm.nb. However, for one of the outcomes which has small values, svyglm.nb makes my program crash, so I think there might be some convergence issue. I thought using a Poisson regression might …
Category: Data Science

Weighting the loss function based on previous seen true positive rates

Similar to class imbalance, there is always something I would call "learnability imbalance" in multi-class classification. What I mean by that: even when the classes are evenly distributed in the dataset, some classes will be classified more easily by the model than others. An example would be a CNN model that classifies dog, cat and car. Dog and cat will most likely have a lower true positive rate than car because cats and dogs look more similar to each other. …
Category: Data Science
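One way to turn that idea into a loss (a sketch; the confusion matrix and the eps/normalization choices are made up): recompute per-class recall after each epoch and weight the cross-entropy inversely to it.

```python
import torch
import torch.nn as nn

def recall_based_weights(confusion, eps=0.05):
    """Per-class weights inversely proportional to last epoch's recall.

    `confusion` is a (num_classes x num_classes) matrix with true classes on
    the rows. Classes the model already gets right (e.g. 'car') are
    down-weighted; confusable ones ('dog', 'cat') are boosted.
    """
    recall = confusion.diag().float() / confusion.sum(dim=1).clamp(min=1).float()
    weights = 1.0 / (recall + eps)                    # eps keeps weights finite
    return weights * len(weights) / weights.sum()     # normalize to mean 1

# Toy confusion matrix for dog / cat / car from a previous epoch.
confusion = torch.tensor([[60, 35,  5],
                          [30, 65,  5],
                          [ 2,  3, 95]])
weights = recall_based_weights(confusion)
criterion = nn.CrossEntropyLoss(weight=weights)       # rebuild each epoch
print(weights)
```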
