Can the attention mask hold values between 0 and 1?

I am new to attention-based models and wanted to understand more about the attention mask in NLP models. attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying length sentences. So a normal attention mask is supposed to look …
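For reference, a minimal sketch (with hypothetical sequence lengths) of what that standard 0/1 mask looks like in PyTorch for a padded batch:

```python
import torch

# Hypothetical batch of three sequences with lengths 5, 3, and 2, padded to length 5.
lengths = torch.tensor([5, 3, 2])
max_len = int(lengths.max())

# Standard 0/1 mask: 1 = real token, 0 = padding added to equalize lengths.
attention_mask = (torch.arange(max_len)[None, :] < lengths[:, None]).long()
print(attention_mask)
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 1, 0, 0],
#         [1, 1, 0, 0, 0]])
```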
Category: Data Science

Modeling a classification problem with only categorical variables as input features: comparing model performance

I have input data with 100k rows and 8 input features, and I'm trying to predict y (binary 1/0). All of the X variables are categorical (strictly nominal, not ordinal), some with 8 levels, some with 20. The data is highly imbalanced: only 0.5% of y is 1. I have cleaned up the data and applied one-hot encoding to all 8 input variables. I looked up some papers and saw examples using MCA, but since the input dimensions are small, I don't …
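A rough sketch of one baseline setup, assuming the data lives in a pandas DataFrame with hypothetical column names x1..x8 and target "y" (not taken from the question):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

cat_cols = [f"x{i}" for i in range(1, 9)]   # placeholder names for the 8 nominal columns
pre = ColumnTransformer([("ohe", OneHotEncoder(handle_unknown="ignore"), cat_cols)])

# class_weight="balanced" counteracts the 0.5% positive rate without resampling.
clf = Pipeline([("pre", pre),
                ("lr", LogisticRegression(class_weight="balanced", max_iter=1000))])
# clf.fit(df[cat_cols], df["y"])   # df is a placeholder for the cleaned DataFrame
```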
Category: Data Science

How to tackle imbalanced regression?

I've recently encountered a problem where I want to fit a regression model on data whose target variable is about 75% zeros, with the rest being continuous. This makes it a regression problem; however, the non-zero values also have a very high variance: they can take values anywhere from 1 to 105 million. What would be an effective approach to such a problem? Due to the high variance, I keep getting regressors that fit too much to the …
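One common approach to this kind of target is a two-stage (hurdle) model; a sketch under the assumption that X and y are numpy arrays and roughly 75% of y is exactly zero:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def fit_hurdle(X, y):
    is_nonzero = (y > 0).astype(int)
    clf = GradientBoostingClassifier().fit(X, is_nonzero)                    # stage 1: zero vs non-zero
    reg = GradientBoostingRegressor().fit(X[y > 0], np.log1p(y[y > 0]))      # stage 2: amount, on log scale
    return clf, reg

def predict_hurdle(clf, reg, X):
    p_nonzero = clf.predict_proba(X)[:, 1]
    return p_nonzero * np.expm1(reg.predict(X))   # expected value = P(nonzero) * E[y | nonzero]
```

The log1p transform is there to tame the heavy tail; it is an assumption, not something prescribed by the question.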
Category: Data Science

My semantic segmentation model classifies everything as background

So, I am working on a semantic segmentation task using U-Net. The dataset is very unbalanced, with the background being by far the most common, and the last class being very scarce. First I trained it using Categorical Cross Entropy as the loss function, and in the end it simply classified everything as background (I used IoU as a measurement of success, and the confusion matrix had non-null values only on the first column, which can only mean that). I …
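A common first fix is to weight the loss per class; a minimal PyTorch sketch assuming 4 classes with class 0 as the dominant background (the weight values are illustrative only):

```python
import torch
import torch.nn as nn

class_weights = torch.tensor([0.1, 1.0, 1.0, 5.0])   # down-weight background, up-weight the rare class
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(2, 4, 64, 64)            # (batch, classes, H, W) logits from the U-Net
targets = torch.randint(0, 4, (2, 64, 64))    # ground-truth class indices per pixel
loss = criterion(logits, targets)
```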
Category: Data Science

Complex balanced dataloading from multiple imbalanced datasets?

The setting: Let's suppose that I have an imbalanced dataset. For training purposes, I want to implement a dataloading scheme that samples from this dataset in a more balanced way. I want to leverage existing metadata for this purpose. Each instance in my dataset belongs to either category $A$ or category $B$. Similarly, each category can be subdivided into several subcategories, namely, $A_1$, $A_2$, $A_3$, $A_4$, ..., $A_N$ and $B_1$, $B_2$, $B_3$, $B_4$, ..., $B_M$. How I want the dataloading …
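One way to get this behaviour is PyTorch's WeightedRandomSampler; a sketch assuming a list `subcategory_ids` holding each instance's subcategory label (the data and `my_dataset` are placeholders):

```python
from collections import Counter
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

subcategory_ids = ["A_1", "A_1", "A_2", "B_1", "B_3", "A_1"]   # hypothetical metadata
counts = Counter(subcategory_ids)

# Weight each instance by the inverse frequency of its subcategory, so every
# subcategory is drawn with roughly equal probability.
weights = torch.tensor([1.0 / counts[s] for s in subcategory_ids], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(my_dataset, batch_size=32, sampler=sampler)
```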
Category: Data Science

Measuring performance of customer purchase predictions

My goal is to develop a model that predicts the next customer purchase in USD (update: during the time period of the dataset, if no purchase was made by the customer, the next-purchase label is set to zero). I am trying to determine the most effective metric for measuring the model's performance. Results look like this:
y_true_usd   y_predicted_usd
1.2          0.8
0            0.3
0            1.1
0            0
0            0.1
5.3          4.3
At first I thought about going with RMSE, …
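A sketch of how such predictions could be scored, using the sample values above; splitting the evaluation into "did they buy" and "how much" is one option among several, not the definitive answer:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, roc_auc_score

y_true = np.array([1.2, 0.0, 0.0, 0.0, 0.0, 5.3])
y_pred = np.array([0.8, 0.3, 1.1, 0.0, 0.1, 4.3])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# Optional split: treat "purchase vs no purchase" as a classification problem ...
auc = roc_auc_score((y_true > 0).astype(int), y_pred)
# ... and measure the USD error only on rows where a purchase actually happened.
rmse_buyers = np.sqrt(mean_squared_error(y_true[y_true > 0], y_pred[y_true > 0]))
```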
Category: Data Science

How to define minority/majority classes in a multi-class classification task

I am studying classification on imbalanced datasets and I am learning about under/over sampling strategies as a way to address the issue. While the literature agrees one needs to oversample 'minority' classes and undersample 'majority' classes, I have not been able to find a clear definition of how minority/majority is defined or measured. While this is not much of an issue in a binary classification task, my problem is a multi-class one, where there are over 200 classes, some have tens of thousands of …
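One pragmatic (not universal) definition is to compare each class count to the count it would have under perfect balance; a sketch with toy labels:

```python
import numpy as np

labels = np.random.choice(list("ABCDE"), size=1000, p=[0.5, 0.3, 0.1, 0.07, 0.03])  # toy data
classes, counts = np.unique(labels, return_counts=True)

balanced_share = counts.mean()        # count each class would have if perfectly balanced
minority = classes[counts < balanced_share]
majority = classes[counts >= balanced_share]
print(dict(zip(classes, counts)), minority, majority)
```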
Category: Data Science

Class imbalance: Will transforming a multi-label (aka multi-task) problem into a multi-class one help?

I noticed this question and this one, but my problem is more about class imbalance. So now I have, say, 1000 targets and some input samples (with some feature vectors). Each input sample can have label '1' for many targets (currently tasks), meaning they interact. Label '0' means they don't interact (for each task, it is a binary classification problem). Unbalanced data: my current issue is that for most targets, <1% of samples (perhaps 1 or 2) are labelled 1. …
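Rather than converting to multi-class, one alternative is to keep the multi-label setup and weight rare positives per target; a PyTorch sketch with made-up counts, assuming a BCE-with-logits loss:

```python
import torch
import torch.nn as nn

num_targets = 1000
pos_counts = torch.randint(1, 20, (num_targets,)).float()   # hypothetical positives per target
neg_counts = 5000 - pos_counts                               # hypothetical total of 5000 samples

# pos_weight > 1 makes the rare positive labels count more in that target's loss term.
criterion = nn.BCEWithLogitsLoss(pos_weight=neg_counts / pos_counts)

logits = torch.randn(8, num_targets)                 # model outputs for a batch of 8 samples
labels = torch.randint(0, 2, (8, num_targets)).float()
loss = criterion(logits, labels)
```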
Category: Data Science

Should I use "sample_weights" on a calibrator if I already used them while training the model (imbalanced dataset)?

I was wondering about the right way to proceed when you are dealing with an imbalanced dataset and want to use a calibrator. When I work with a single model and imbalanced datasets I usually pass "sample_weights" to the model, but I don't know whether "sample_weights" should be passed to the calibrator as well.
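Mechanically, scikit-learn's CalibratedClassifierCV does accept sample weights at fit time (whether you should pass them is exactly the question); a sketch on toy imbalanced data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                      # toy features
y = (rng.random(1000) < 0.05).astype(int)      # ~5% positives
weights = compute_sample_weight("balanced", y)

calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="sigmoid", cv=5)
# fit() forwards sample_weight to the inner model and, where supported, to the calibration step.
calibrated.fit(X, y, sample_weight=weights)
```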
Category: Data Science

What is the best practice to normalize/standardize imbalanced data for an outlier detection or binary classification task?

I'm researching anomaly/outlier/fraud detection, and I'm looking for the best practice for pre-processing synthetic, imbalanced data. I have reviewed the normalization/standardization methods that are not sensitive to the presence of outliers and fit this case study. The scikit-learn 0.24.2 example 'Compare the effect of different scalers on data with outliers' states: If some outliers are present in the set, robust scalers or transformers are more appropriate. I'm using the CTU-13 dataset, …
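A small sketch of the difference on a toy feature with one extreme value: RobustScaler centers and scales with the median and IQR, so outliers distort it far less than StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])   # toy feature with one extreme value

print(StandardScaler().fit_transform(X).ravel())   # mean/std are dominated by the outlier
print(RobustScaler().fit_transform(X).ravel())     # inliers keep a sensible spread
```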
Category: Data Science

Can a dataset with a numeric (cardinal) dependent variable be unbalanced?

Can a dataset be unbalanced if the dependent variable is numerical (cardinal scale)? Or does the question of whether a dataset is (un)balanced only matter for datasets with a categorical dependent variable? I have a dataset with a metric dependent variable and about 1230 observations; about 200 of them have an extremely low dependent-variable value and about 650 an extremely high value. The rest of the observations are distributed pretty evenly in between. I'm wondering whether this dataset must …
Category: Data Science

Over-sampling when predicting a continuous variable

Let's say I am predicting house selling prices (continuous) and have multiple independent variables (numerical and categorical). Is it common practice to balance the dataset when the categorical independent variables are imbalanced (ratio not higher than 1:100)? Or do I only balance the data when the dependent variable is imbalanced? Thanks
Category: Data Science

Why does class_weight usually outperform SMOTE?

I'm trying to figure out what exactly class_weight from sklearn does. When working with imbalanced datasets, I always use class_weight because the results are usually better than with SMOTE. However, I'm not sure why. I've tried to find an answer, but most answers on the subject are vague. For instance, the first answer here explains class_weight in a way that looks similar to SMOTE. This and this also didn't provide an answer. I read once that SMOTE is used …
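For concreteness, a sketch contrasting the two options on the same toy data: class_weight reweights the loss on the original samples, while SMOTE (from imbalanced-learn) synthesizes new minority samples by interpolating between neighbours.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Option 1: keep the data as-is, make minority-class errors cost more in the loss.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: create synthetic minority points, then train on the resampled data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```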
Category: Data Science

Rough ideas of expected performance boost from over-sampling techniques?

I am trying to train a classifier for a multi-class classification task. However, the dataset is very imbalanced. About half of the roughly 160 unique labels have only 10 or fewer samples each, and about 20 labels occur exactly once. So the dataset contains a few classes that are well represented and a very long, skinny tail of rare labels. There are around 50 features (both numerical …
Category: Data Science

Can an imbalanced dataset be an opportunity for transfer learning with neural networks?

While solving classification tasks on imbalanced datasets with neural networks (NN), there are two general ways of handling the imbalance: A. Resample the data, either with over- or undersampling, until it is balanced. B. Compute a weight per sample, to weigh the losses according to class occurrence. I thought about a third way that might be possible: C. Train an autoencoder on all the input data, and use its encoder part (the first layers) for the actual classifier. This way, the information …
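A compact PyTorch sketch of idea C, with made-up layer sizes: pre-train an autoencoder on all inputs, then reuse its encoder as the first layers of the classifier.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Linear(32, 8))
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 50))
autoencoder = nn.Sequential(encoder, decoder)
# ... train `autoencoder` on all X with an MSE reconstruction loss (no labels needed) ...

classifier = nn.Sequential(encoder, nn.Linear(8, 2))   # encoder weights are shared / pre-trained
# ... fine-tune `classifier` on the labelled (imbalanced) data, e.g. with a weighted loss ...
```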
Category: Data Science

Influence of imbalanced features on prediction

I want to use XGB regression. The dataframe is conceptually similar to this table:
index  feature 1  feature 2  feature 3  encoded_1  encoded_2  encoded_3  y
0      0.213      0.542      0.125      0          0          1          0.432
1      0.495      0.114      0.234      1          0          0          0.775
2      0.521      0.323      0.887      1          0          0          0.691
My question is: what is the influence of having imbalanced observations of the encoded features? For example, if I have more observations that are "encoded_1" compared to "encoded_2" or …
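A small sketch, assuming the table above lives in a pandas DataFrame (column names adapted with underscores) and that "XGB" means the xgboost sklearn wrapper; it also prints how skewed each dummy column is.

```python
import pandas as pd
from xgboost import XGBRegressor

df = pd.DataFrame({
    "feature_1": [0.213, 0.495, 0.521],
    "feature_2": [0.542, 0.114, 0.323],
    "feature_3": [0.125, 0.234, 0.887],
    "encoded_1": [0, 1, 1],
    "encoded_2": [0, 0, 0],
    "encoded_3": [1, 0, 0],
    "y":         [0.432, 0.775, 0.691],
})

X, y = df.drop(columns="y"), df["y"]

# Share of rows where each encoded (dummy) column equals 1, i.e. how imbalanced it is.
print(X[["encoded_1", "encoded_2", "encoded_3"]].mean())

model = XGBRegressor(n_estimators=50).fit(X, y)
print(model.predict(X))
```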
Category: Data Science

Improving text classification & labeling in imbalanced dataset

I am trying to classify text titles (NLP) into categories. Let us say I have 6K titles that should fall into four categories. My questions: I do not understand why, in some ML techniques, categories are converted into numerical values ("transforming the prediction target"). Will this impact the model accuracy compared to using nominal values? My data is severely imbalanced towards some categories, e.g. CAT A has 4K titles and CAT B has 500 titles. So oversampling or undersampling …
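On the first question, a sketch of what "transforming the prediction target" usually means: category names are mapped to integer ids purely for the algorithm's benefit, and the mapping is lossless, so by itself it does not change accuracy (the example labels are hypothetical).

```python
from sklearn.preprocessing import LabelEncoder

categories = ["CAT A", "CAT B", "CAT A", "CAT C", "CAT D"]
le = LabelEncoder()
y = le.fit_transform(categories)        # e.g. array([0, 1, 0, 2, 3])
print(le.inverse_transform(y))          # reversible: recovers the original category names
```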
Category: Data Science

Determining if a dataset is balanced

I'm learning about training sets and I have been provided with a set of labelled customer data that segments customers into one of two classes: A or B. The dataset also contains gender, age and profession attributes for each customer. The distribution of classes in the dataset is like this: 92% of customers are class A and 8% of customers are class B. Based on my understanding, this is an unbalanced dataset because the distribution of classes is not equal. However, …
Category: Data Science

How to increase the accuracy (not precision) on an imbalanced dataset?

There's an imbalanced dataset in a Kaggle competition I'm attempting. The target variable of the dataset is binary and biased towards 0: 0 - 70%, 1 - 30%. I tried several machine learning algorithms like Logistic Regression, Random Forest, Decision Trees, etc., but all of them give an accuracy of around 70%. It seems that the models always tend to predict 0. So I tried several methods to get an unbiased dataset, like the following: upsampling the dataset …
Category: Data Science
