Can the attention mask hold values between 0 and 1?

I am new to attention-based models and wanted to understand more about the attention mask in NLP models. attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying length sentences. So a normal attention mask is supposed to look …
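For reference, a minimal sketch (with hypothetical sequence lengths) of what that standard 0/1 mask looks like in PyTorch for a padded batch:

```python
import torch

# Hypothetical batch of three sequences with lengths 5, 3, and 2, padded to length 5.
lengths = torch.tensor([5, 3, 2])
max_len = int(lengths.max())

# Standard 0/1 mask: 1 = real token, 0 = padding added to equalize lengths.
attention_mask = (torch.arange(max_len)[None, :] < lengths[:, None]).long()
print(attention_mask)
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 1, 0, 0],
#         [1, 1, 0, 0, 0]])
```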
Category: Data Science

Modeling a classification problem with only categorical variables as input features: comparing model performance

I have input data with 100k rows and 8 input features, and I'm trying to predict y (binary 1/0). All of the X variables are categorical (strictly nominal, not ordinal), some with 8 levels, some with 20. The data is highly imbalanced: only 0.5% of y is 1. I have cleaned up the data and applied one-hot encoding to all 8 input variables. I looked up some papers and saw examples using MCA, but since the input dimensions are small, I don't …
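A rough sketch of one baseline setup, assuming the data lives in a pandas DataFrame with hypothetical column names x1..x8 and target "y" (not taken from the question):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

cat_cols = [f"x{i}" for i in range(1, 9)]   # placeholder names for the 8 nominal columns
pre = ColumnTransformer([("ohe", OneHotEncoder(handle_unknown="ignore"), cat_cols)])

# class_weight="balanced" counteracts the 0.5% positive rate without resampling.
clf = Pipeline([("pre", pre),
                ("lr", LogisticRegression(class_weight="balanced", max_iter=1000))])
# clf.fit(df[cat_cols], df["y"])   # df is a placeholder for the cleaned DataFrame
```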
Category: Data Science

How to tackle imbalanced regression?

I've recently encountered a problem where I want to fit a regression model on data whose target variable is about 75% zeros, with the rest being continuous. This makes it a regression problem; however, the non-zero values also have a very high variance: they can take values anywhere from 1 to 105 million. What would be an effective approach to such a problem? Due to the high variance, I keep getting regressors that fit too much to the …
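One common approach to this kind of target is a two-stage (hurdle) model; a sketch under the assumption that X and y are numpy arrays and roughly 75% of y is exactly zero:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def fit_hurdle(X, y):
    is_nonzero = (y > 0).astype(int)
    clf = GradientBoostingClassifier().fit(X, is_nonzero)                    # stage 1: zero vs non-zero
    reg = GradientBoostingRegressor().fit(X[y > 0], np.log1p(y[y > 0]))      # stage 2: amount, on log scale
    return clf, reg

def predict_hurdle(clf, reg, X):
    p_nonzero = clf.predict_proba(X)[:, 1]
    return p_nonzero * np.expm1(reg.predict(X))   # expected value = P(nonzero) * E[y | nonzero]
```

The log1p transform is there to tame the heavy tail; it is an assumption, not something prescribed by the question.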
Category: Data Science

My semantic segmentation model classifies everything as background

So, I am working on a semantic segmentation task using U-Net. The dataset is very unbalanced, with the background being by far the most common, and the last class being very scarce. First I trained it using Categorical Cross Entropy as the loss function, and in the end it simply classified everything as background (I used IoU as a measurement of success, and the confusion matrix had non-null values only on the first column, which can only mean that). I …
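A common first fix is to weight the loss per class; a minimal PyTorch sketch assuming 4 classes with class 0 as the dominant background (the weight values are illustrative only):

```python
import torch
import torch.nn as nn

class_weights = torch.tensor([0.1, 1.0, 1.0, 5.0])   # down-weight background, up-weight the rare class
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(2, 4, 64, 64)            # (batch, classes, H, W) logits from the U-Net
targets = torch.randint(0, 4, (2, 64, 64))    # ground-truth class indices per pixel
loss = criterion(logits, targets)
```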
Category: Data Science

Complex balanced dataloading from multiple imbalanced datasets?

The setting: Let's suppose that I have an imbalanced dataset. For training purposes, I want to implement a dataloading scheme that samples from this dataset in a more balanced way. I want to leverage existing metadata for this purpose. Each instance in my dataset belongs to either category $A$ or category $B$. Similarly, each category can be subdivided into several subcategories, namely, $A_1$, $A_2$, $A_3$, $A_4$, ..., $A_N$ and $B_1$, $B_2$, $B_3$, $B_4$, ..., $B_M$. How I want the dataloading …
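One way to get this behaviour is PyTorch's WeightedRandomSampler; a sketch assuming a list `subcategory_ids` holding each instance's subcategory label (the data and `my_dataset` are placeholders):

```python
from collections import Counter
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

subcategory_ids = ["A_1", "A_1", "A_2", "B_1", "B_3", "A_1"]   # hypothetical metadata
counts = Counter(subcategory_ids)

# Weight each instance by the inverse frequency of its subcategory, so every
# subcategory is drawn with roughly equal probability.
weights = torch.tensor([1.0 / counts[s] for s in subcategory_ids], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(my_dataset, batch_size=32, sampler=sampler)
```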
Category: Data Science

Measuring performance of customer purchase predictions

My goal is to develop a model that predicts the next customer purchase in USD (update: during the time period of the dataset, if no purchase was made by the customer, the next-purchase label is set to zero). I am trying to determine the most effective metric for measuring the model's performance. Results look like this:
y_true_usd   y_predicted_usd
1.2          0.8
0            0.3
0            1.1
0            0
0            0.1
5.3          4.3
At first I thought about going with RMSE, …
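A sketch of how such predictions could be scored, using the sample values above; splitting the evaluation into "did they buy" and "how much" is one option among several, not the definitive answer:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, roc_auc_score

y_true = np.array([1.2, 0.0, 0.0, 0.0, 0.0, 5.3])
y_pred = np.array([0.8, 0.3, 1.1, 0.0, 0.1, 4.3])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# Optional split: treat "purchase vs no purchase" as a classification problem ...
auc = roc_auc_score((y_true > 0).astype(int), y_pred)
# ... and measure the USD error only on rows where a purchase actually happened.
rmse_buyers = np.sqrt(mean_squared_error(y_true[y_true > 0], y_pred[y_true > 0]))
```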
Category: Data Science

How to define minority/majority classes in a multi-class classification task

I am studying classification on imbalanced datasets and I am learning about under/over sampling strategies as a way to address the issue. While the literature agrees one needs to oversample 'minority' classes and undersample 'majority' classes, I have not been able to find a clear definition of how minority/majority is defined or measured. While this is not much of an issue in a binary classification task, my problem is a multi-class one, where there are over 200 classes, some have tens of thousands of …
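One pragmatic (not universal) definition is to compare each class count to the count it would have under perfect balance; a sketch with toy labels:

```python
import numpy as np

labels = np.random.choice(list("ABCDE"), size=1000, p=[0.5, 0.3, 0.1, 0.07, 0.03])  # toy data
classes, counts = np.unique(labels, return_counts=True)

balanced_share = counts.mean()        # count each class would have if perfectly balanced
minority = classes[counts < balanced_share]
majority = classes[counts >= balanced_share]
print(dict(zip(classes, counts)), minority, majority)
```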
Category: Data Science

Class imbalance: Will transforming a multi-label (aka multi-task) problem into a multi-class one help?

I noticed this question and this one, but my problem is more about class imbalance. So now I have, say, 1000 targets and some input samples (with some feature vectors). Each input sample can have label '1' for many targets (currently tasks), meaning they interact. Label '0' means they don't interact (for each task, it is a binary classification problem). Unbalanced data: my current issue is that for most targets, <1% of samples (perhaps 1 or 2) are labelled 1. …
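Rather than converting to multi-class, one alternative is to keep the multi-label setup and weight rare positives per target; a PyTorch sketch with made-up counts, assuming a BCE-with-logits loss:

```python
import torch
import torch.nn as nn

num_targets = 1000
pos_counts = torch.randint(1, 20, (num_targets,)).float()   # hypothetical positives per target
neg_counts = 5000 - pos_counts                               # hypothetical total of 5000 samples

# pos_weight > 1 makes the rare positive labels count more in that target's loss term.
criterion = nn.BCEWithLogitsLoss(pos_weight=neg_counts / pos_counts)

logits = torch.randn(8, num_targets)                 # model outputs for a batch of 8 samples
labels = torch.randint(0, 2, (8, num_targets)).float()
loss = criterion(logits, labels)
```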
Category: Data Science

Should I use "sample_weights" on a calibrator if I already used them while training the model (imbalanced dataset)?

I was wondering about the right way to proceed when you are dealing with an imbalanced dataset and want to use a calibrator. When I work with a single model and imbalanced datasets I usually pass "sample_weights" to the model, but I don't know whether "sample_weights" should be passed to the calibrator as well.
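Mechanically, scikit-learn's CalibratedClassifierCV does accept sample weights at fit time (whether you should pass them is exactly the question); a sketch on toy imbalanced data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                      # toy features
y = (rng.random(1000) < 0.05).astype(int)      # ~5% positives
weights = compute_sample_weight("balanced", y)

calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="sigmoid", cv=5)
# fit() forwards sample_weight to the inner model and, where supported, to the calibration step.
calibrated.fit(X, y, sample_weight=weights)
```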
Category: Data Science

What is the best practice to normalize/standardize imbalanced data for an outlier detection or binary classification task?

I'm researching anomaly/outlier/fraud detection, and I'm looking for the best practice for pre-processing synthetic, imbalanced data. I have reviewed the normalization/standardization methods that are not sensitive to the presence of outliers and fit this case study. The scikit-learn 0.24.2 example 'Compare the effect of different scalers on data with outliers' states: If some outliers are present in the set, robust scalers or transformers are more appropriate. I'm using the CTU-13 dataset, …
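A small sketch of the difference on a toy feature with one extreme value: RobustScaler centers and scales with the median and IQR, so outliers distort it far less than StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])   # toy feature with one extreme value

print(StandardScaler().fit_transform(X).ravel())   # mean/std are dominated by the outlier
print(RobustScaler().fit_transform(X).ravel())     # inliers keep a sensible spread
```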
Category: Data Science

Can a dataset with a numeric (cardinal) dependent variable be unbalanced?

Can a dataset be unbalanced if the dependent variable is numerical (cardinal scale)? Or does the question of whether a dataset is (un)balanced only matter for datasets with a categorical dependent variable? I have a dataset with a metric dependent variable and about 1230 observations; about 200 of them have an extremely low dependent-variable value and about 650 an extremely high value. The rest of the observations are distributed pretty evenly in between. I'm wondering whether this dataset must …
Category: Data Science

Over-sampling when predicting a continuous variable

Let's say I am predicting house selling prices (continuous) and have multiple independent variables (numerical and categorical). Is it common practice to balance the dataset when the categorical independent variables are imbalanced (ratio not higher than 1:100)? Or do I only balance the data when the dependent variable is imbalanced? Thanks
Category: Data Science

Why does class_weight usually outperform SMOTE?

I'm trying to figure out what exactly class_weight from sklearn does. When working with imbalanced datasets, I always use class_weight because the results are usually better than with SMOTE. However, I'm not sure why. I've tried to find an answer, but most answers on the subject are vague. For instance, the first answer here explains class_weight in a way that looks similar to SMOTE. This and this also didn't provide an answer. I read once that SMOTE is used …
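For concreteness, a sketch contrasting the two options on the same toy data: class_weight reweights the loss on the original samples, while SMOTE (from imbalanced-learn) synthesizes new minority samples by interpolating between neighbours.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Option 1: keep the data as-is, make minority-class errors cost more in the loss.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: create synthetic minority points, then train on the resampled data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```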
Category: Data Science

Rough ideas of expected performance boost from over-sampling techniques?

I am trying to train a classifier for a multi-class classification task. However, the dataset is very imbalanced. About half of the roughly 160 unique labels have only 10 or fewer samples each, and about 20 labels occur exactly once. So the dataset contains a few classes that are well represented and a very long, skinny tail of rare labels. There are around 50 features (both numerical …
Category: Data Science

Can an imbalanced dataset be an opportunity for transfer learning with neural networks?

While solving classification tasks on imbalanced datasets with neural networks (NN), there are two general ways of handling the imbalance: A. Resample the data, either with over- or undersampling, until it is balanced. B. Compute a weight per sample, to weigh the losses according to class occurrence. I thought about a third way that might be possible: C. Train an autoencoder on all the input data, and use its encoder part (the first layers) for the actual classifier. This way, the information …
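A compact PyTorch sketch of idea C, with made-up layer sizes: pre-train an autoencoder on all inputs, then reuse its encoder as the first layers of the classifier.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Linear(32, 8))
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 50))
autoencoder = nn.Sequential(encoder, decoder)
# ... train `autoencoder` on all X with an MSE reconstruction loss (no labels needed) ...

classifier = nn.Sequential(encoder, nn.Linear(8, 2))   # encoder weights are shared / pre-trained
# ... fine-tune `classifier` on the labelled (imbalanced) data, e.g. with a weighted loss ...
```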
Category: Data Science

Influence of imbalanced features on prediction

I want to use XGB regression. The dataframe is conceptually similar to this table:
index  feature 1  feature 2  feature 3  encoded_1  encoded_2  encoded_3  y
0      0.213      0.542      0.125      0          0          1          0.432
1      0.495      0.114      0.234      1          0          0          0.775
2      0.521      0.323      0.887      1          0          0          0.691
My question is: what is the influence of having imbalanced observations of the encoded features? For example, if I have more observations that are "encoded_1" compared to "encoded_2" or …
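A small sketch, assuming the table above lives in a pandas DataFrame (column names adapted with underscores) and that "XGB" means the xgboost sklearn wrapper; it also prints how skewed each dummy column is.

```python
import pandas as pd
from xgboost import XGBRegressor

df = pd.DataFrame({
    "feature_1": [0.213, 0.495, 0.521],
    "feature_2": [0.542, 0.114, 0.323],
    "feature_3": [0.125, 0.234, 0.887],
    "encoded_1": [0, 1, 1],
    "encoded_2": [0, 0, 0],
    "encoded_3": [1, 0, 0],
    "y":         [0.432, 0.775, 0.691],
})

X, y = df.drop(columns="y"), df["y"]

# Share of rows where each encoded (dummy) column equals 1, i.e. how imbalanced it is.
print(X[["encoded_1", "encoded_2", "encoded_3"]].mean())

model = XGBRegressor(n_estimators=50).fit(X, y)
print(model.predict(X))
```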
Category: Data Science

Improving text classification & labeling in imbalanced dataset

I am trying to classify text titles (NLP) into categories. Let us say I have 6K titles that should fall into four categories. My questions: I do not understand why, in some ML techniques, categories are converted into numerical values ("transforming the prediction target"). Will this impact the model accuracy compared to using nominal values? My data is severely imbalanced towards some categories, e.g. CAT A has 4K titles and CAT B has 500 titles. So oversampling or undersampling …
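On the first question, a sketch of what "transforming the prediction target" usually means: category names are mapped to integer ids purely for the algorithm's benefit, and the mapping is lossless, so by itself it does not change accuracy (the example labels are hypothetical).

```python
from sklearn.preprocessing import LabelEncoder

categories = ["CAT A", "CAT B", "CAT A", "CAT C", "CAT D"]
le = LabelEncoder()
y = le.fit_transform(categories)        # e.g. array([0, 1, 0, 2, 3])
print(le.inverse_transform(y))          # reversible: recovers the original category names
```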
Category: Data Science

Determining if a dataset is balanced

I'm learning about training sets and I have been provided with a set of labelled customer data that segments customers into one of two classes: A or B. The dataset also contains gender, age and profession attributes for each customer. The distribution of classes in the dataset is like this: 92% of customers are class A and 8% of customers are class B. Based on my understanding, this is an unbalanced dataset because the distribution of classes is not equal. However, …
Category: Data Science

How to increase the accuracy (not precision) on an imbalanced dataset?

There's an imbalanced dataset in a Kaggle competition I'm attempting. The target variable of the dataset is binary and biased towards 0: 0 - 70%, 1 - 30%. I tried several machine learning algorithms like Logistic Regression, Random Forest, Decision Trees, etc., but all of them give an accuracy of around 70%. It seems that the models always tend to predict 0. So I tried several methods to get an unbiased dataset, like the following: upsampling the dataset …
Category: Data Science
