Sampling Highly Imbalanced Large Dataset

I am working on a model that will run monthly on 8M users. My training set is organized as monthly snapshots, e.g.:

Jan 2021 snapshot: 8M total, 233 positives, rest negative
Feb 2021 snapshot: 8M total, 599 positives, rest negative
Mar 2021 snapshot: 8M total, 600 positives, rest negative
Apr 2021 snapshot: 8M total, 750 positives, rest negative

and so on up to March 2022. I'm keeping the March 2022 snapshot as the test set, which has 2000 positive labels …
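One common starting point for imbalance this extreme (a few hundred positives per 8M rows) is to keep every positive and randomly undersample the negatives within each snapshot, then correct for the sampling with weights or a probability adjustment at scoring time. A minimal pandas sketch, assuming a DataFrame df with hypothetical "label" and "snapshot" columns:

import pandas as pd

def undersample_negatives(df, label_col="label", snapshot_col="snapshot",
                          neg_per_pos=50, seed=42):
    # Keep all positives; sample a fixed number of negatives per positive within each snapshot.
    parts = []
    for _, snap in df.groupby(snapshot_col):
        pos = snap[snap[label_col] == 1]
        neg = snap[snap[label_col] == 0]
        n_neg = min(len(neg), neg_per_pos * max(len(pos), 1))
        parts.append(pd.concat([pos, neg.sample(n=n_neg, random_state=seed)]))
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle rows

# If negatives were kept with sampling rate q, a model trained on this sample
# overstates P(y=1); multiplying the predicted odds by q maps scores back to the
# full 8M-user population.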
Category: Data Science

Keras Binary Classification - Maximizing Recall

Let me start by saying my machine learning experience is... dangerous at this stage. I'm still a beginner. I have a binary classification dataset of about 100,000 records; 10% of the records are positive and the rest are negative, so it is a highly skewed dataset. It is extremely important to maximize the positive (true positive) prediction accuracy (recall), even at the expense of negative (true negative) prediction accuracy. Thus, I would rather have an overall 70% accuracy if positive …
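Two levers commonly pulled for this goal are class weights during training and a lower decision threshold afterwards. A minimal Keras sketch along those lines (the synthetic data below is only a stand-in for the real 100,000-record dataset):

import numpy as np
import tensorflow as tf

# Stand-in data: 20 features, ~10% positives (replace with the real dataset).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 20)).astype("float32")
y_train = (rng.random(10_000) < 0.10).astype("float32")
X_val = rng.normal(size=(2_000, 20)).astype("float32")
y_val = (rng.random(2_000) < 0.10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Recall(name="recall"),
                       tf.keras.metrics.Precision(name="precision")])

# Weight the rare positive class more heavily (90/10 split -> roughly 9x).
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=10, batch_size=256, class_weight={0: 1.0, 1: 9.0})

# Trade precision for recall by lowering the decision threshold from 0.5.
probs = model.predict(X_val).ravel()
preds = (probs >= 0.3).astype(int)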
Category: Data Science

How to use confidence labels?

I have 2 sets of training data in CSV files. The training data have class labels: 1 for memorable and 0 for not memorable. There is also a confidence label for each sample. The class labels were assigned based on decisions from 3 people viewing the photos. When they all agreed, the class label could be considered certain, and a confidence of 1 was written down. If they didn't all agree, then the classification decided on by the …
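One straightforward way to use a per-sample confidence value is to pass it as a sample weight, so unanimous examples influence the loss more than contested ones. A minimal scikit-learn sketch, assuming the CSV has hypothetical column names feature_*, label, and confidence:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.csv")           # assumed file name
X = df.filter(like="feature_")          # assumed feature columns
y = df["label"]                         # 1 = memorable, 0 = not memorable
w = df["confidence"]                    # e.g. 1.0 when all 3 raters agreed

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=w)          # many sklearn estimators accept sample_weight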
Category: Data Science

Binary document classification using keywords for a very small dataset

I have a set of 150 documents with their assigned binary class. I also have 1000 unlabeled documents. Each document is about the length of a journal paper. Each class has 15 associated keywords. I want to be able to predict the assigned class of the documents using this information. Does anyone have any ideas of how I could approach this problem?
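With only 150 labeled documents, one reasonable baseline is a TF-IDF model combined with simple keyword-count features built from the 15 keywords per class; the 1000 unlabeled documents could later be folded in via self-training. A hedged sketch, assuming lists labeled_texts, labels, class0_keywords, and class1_keywords (hypothetical names):

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def keyword_counts(texts, keywords):
    # Count case-insensitive keyword occurrences per document.
    return np.array([[t.lower().count(k.lower()) for k in keywords] for t in texts])

vec = TfidfVectorizer(stop_words="english", min_df=2)
X_tfidf = vec.fit_transform(labeled_texts)
X_kw = csr_matrix(keyword_counts(labeled_texts, class0_keywords + class1_keywords))
X = hstack([X_tfidf, X_kw]).tocsr()

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=5, scoring="f1"))  # sanity check with only 150 docs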
Category: Data Science

LGBM model predicting only a single class on unseen data!

I have built a LightGBM-based machine learning model on data of molecules from two classes. The distribution is as follows: class 0 has 5933 data points and class 1 has 4696. The train and test accuracy I get on this data are around 87% and 82% respectively, and the roc_auc_score is around 81.5%. But when I try to evaluate model performance on an entirely new dataset, which the model has never seen before, with class labels 0 and 1 both having 94 …
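Before concluding the model is broken, it often helps to look at the raw predicted probabilities on the new molecules rather than the hard labels; if they cluster just on one side of 0.5, the issue may be threshold placement or a shift in the feature distribution. A small diagnostic sketch, where lgbm_model, X_new, and X_train are hypothetical names for the fitted classifier, the unseen molecules, and the original training features:

import numpy as np

probs = lgbm_model.predict_proba(X_new)[:, 1]   # probability of class 1
print("min/median/max:", probs.min(), np.median(probs), probs.max())
print("predicted positives at 0.5:", (probs >= 0.5).sum(), "of", len(probs))

# Compare with the score distribution on the training data to spot covariate shift.
train_probs = lgbm_model.predict_proba(X_train)[:, 1]
print("train median score:", np.median(train_probs))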
Category: Data Science

ML model shooting up prediction probabilities

I have a scikit-learn logistic regression binary classifier and tried training it on my dataset. My model does extremely well at a threshold of 0.95 instead of 0.5, and all my predictions on example cases are above 0.8 for both classes. I cannot figure out why my machine learning model is inflating its predicted probabilities so much. I would appreciate some potential workarounds for this.
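A useful first diagnostic here is a calibration (reliability) curve: it shows whether scores like 0.8 really correspond to an 80% positive rate, and whether the separation between classes justifies a 0.95 threshold. A minimal sketch with scikit-learn, assuming a fitted clf and a held-out X_test/y_test (hypothetical names):

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

probs = clf.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.show()

If the curve sits far from the diagonal, wrapping the model in CalibratedClassifierCV or checking for leakage and duplicated features would be natural next steps.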
Category: Data Science

Aggregated probability based on multiple predictions on independent samples using the same classifier

I have an understanding question regarding the interpretation of an aggregation of a machine learning classifier. Let's assume I have trained a binary classifier and it was validated with an accuracy of 70% (the dataset is always balanced). My question is now: if this probability seems too low for me, and I were to search for ways to improve it without any readjustments to the classifier, would the following idea be valid? The classifier predicts three independent samples (always with …
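As a rough sanity check of the idea (and assuming the three samples really are independent and each individual prediction is correct with probability 0.7), a majority vote over the three predictions is correct with probability $0.7^3 + 3 \cdot 0.7^2 \cdot 0.3 = 0.343 + 0.441 = 0.784$, i.e. about 78% rather than 70%; whether the independence assumption holds for real samples scored by the same classifier is the crux.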
Category: Data Science

Does eval loss decreasing slower than train loss indicate overfitting?

I am training a binary classifier using an EfficientNetV2 model with a 1M-image dataset where I do a 60/20/20 split. Does this graph mean that the model is overfitting? I can see that the train loss is going down much faster than the eval loss, but the eval loss is still going down and the accuracy is going up. Accuracy may seem to be low, but it is actually a pretty decent amount for the problem I am working …
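One way to make this call automatically rather than by eyeballing the curves is early stopping on the validation loss: as long as the eval loss keeps improving, training continues; once it stalls for a few epochs, the best weights are restored. A minimal Keras sketch (callback only; model, train_ds, and val_ds are assumed to exist):

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the eval/validation loss
    patience=5,                 # allow 5 epochs without improvement
    restore_best_weights=True,  # roll back to the best checkpoint
)

model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])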
Category: Data Science

Meaningfully compare target vs observed TPR & FPR

Suppose I have a binary classifier $f$ which acts on an input $x$. Given a threshold $t$, the predicted binary output is defined as: $$ \widehat{y} = \begin{cases} 1, & f(x) \geq t \\ 0, & f(x) < t \end{cases} $$ I then compute the $TPR$ (true positive rate) and $FPR$ (false positive rate) metrics on the hold-out test set (call it $S_1$): $TPR_{S_1} = \Pr(\widehat{y} = 1 | y = 1, S_1)$ $FPR_{S_1} = \Pr(\widehat{y} = 1 | y …
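For reference, both rates are easy to compute empirically on the hold-out set at a given threshold. A small sketch, where scores stands for the values of $f(x)$ on $S_1$ and y for the true labels (the random data below is only a stand-in):

import numpy as np

def tpr_fpr(scores, y, t):
    # Empirical TPR and FPR on a hold-out set at threshold t.
    y_hat = (scores >= t).astype(int)
    tpr = np.mean(y_hat[y == 1] == 1)   # P(y_hat = 1 | y = 1)
    fpr = np.mean(y_hat[y == 0] == 1)   # P(y_hat = 1 | y = 0)
    return tpr, fpr

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = np.clip(y * 0.3 + rng.random(1000) * 0.7, 0, 1)
print(tpr_fpr(scores, y, t=0.5))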
Category: Data Science

ROC-AUC Imbalanced Data Score Interpretation

I have a binary response variable (label) in a dataset with around 50,000 observations. The training set is somewhat imbalanced, with label=1 making up about 33% of the observations and label=0 making up about 67%. Right now with XGBoost I'm getting a ROC-AUC score of around 0.67. The response variable is binary, so the baseline is 50% in terms of chance, but at the same time the data is imbalanced, so if the model just guessed label=0 …
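One point worth keeping in mind: ROC-AUC already conditions on the true class, so a constant "always predict 0" strategy scores 0.5 AUC regardless of the 67/33 imbalance; 0.67 is therefore better than chance, but not by much. A quick baseline comparison with scikit-learn, assuming X_train, y_train, X_test, and y_test already exist (hypothetical names):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
print("baseline AUC:", roc_auc_score(y_test, dummy.predict_proba(X_test)[:, 1]))  # ~0.5

# For comparison, score the XGBoost model the same way:
# print("model AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))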
Category: Data Science

Predictive Maintenance Question (binary classification)

I have a question regarding "Predictive Maintenance". This tutorial, https://docs.microsoft.com/en-us/learn/modules/predictive-maintenance-model-builder/3-choose-scenario-data, says: "Choosing a scenario for predictive maintenance Depending on what your data looks like, the predictive maintenance problem can be modeled through different tasks. For your use case, because the label is a binary value (0 or 1) that describes whether a machine is broken or not, the data classification scenario is appropriate." Now, how can this be used to predict machine failure BEFORE it gets broken? …
Category: Data Science

How do I choose the right parameters for just plain old simple standard deviation?

I am evaluating different models that do binary classification and basically generate trade signals: they predict either buy or sell for the next day. I look at 10 different underlying assets and have 3 different variations of data that I train the models with, and I evaluated 12 different types of models. That leaves me with 10 x 3 x 12 = 360 different models/predictions. I backtested the trade signals they generate: most of them do not really …
Category: Data Science

How to predict an outcome of the game (next row) based on all previous games (rows)?

I'm a data science student and I've come across a fairly unusual dataset (to me, which explains the vague title). It's of the following form:

STAT_1  STAT_2  ...  HOME    AWAY    NEXT_HOME  NEXT_AWAY  NEXT_RESULT
15      11      ...  Team A  Team B  Team C     Team D     1
11      18      ...  Team C  Team D  Team E     Team F     0
...     ...     ...  ...     ...     ...        ...        ...
10      11      ...  Team W  Team X  Team Y     Team Z     1

Basically, the rows …
Category: Data Science

Text Classification misclassifying?

I am trying to solve a binary classification problem. My labels are abusive (1) and non-abusive (0). My dataset was imbalanced (more 0s than 1s) and I used oversampling of the minority label (i.e. 1) to balance my dataset. I have also done pre-processing and feature engineering using TF-IDF, and then fed the dataset into a pipeline using 3 classification algorithms, namely Logistic Regression, SVM, and Decision Tree. My evaluation metrics are: Logistic Regression: [[376 33] [ 18 69]] precision recall …
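One common pitfall with this setup is oversampling before the train/test split, which lets duplicated minority examples leak into the test fold and inflates the metrics. Keeping the oversampling inside a pipeline so it is applied only to training folds avoids that. A hedged sketch using imbalanced-learn, assuming lists texts and labels (hypothetical names):

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("oversample", RandomOverSampler(random_state=42)),  # applied to training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

preds = cross_val_predict(pipe, texts, labels, cv=5)
print(classification_report(labels, preds))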
Category: Data Science

When would $\Theta_{Bayes}$ be on the Equal Error Rate curve

If we use classic Bayesian classification for a two-class problem and classify by comparing the likelihood ratio $LR(x) = \frac{p(x|s=1)}{p(x|s=2)}$ to $\Theta_{Bayes} = \frac{P(s=2)}{P(s=1)}$, when would this $\Theta_{Bayes}$ produce equal false positive and false negative rates? My intuition is: if the classes have equal priors and $\Theta_{Bayes} = 1$. Is this the case?
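For reference (treating class $s=1$ as the positive class), the two error rates at this threshold are $FPR = P\left(LR(x) \geq \Theta_{Bayes} \mid s=2\right)$ and $FNR = P\left(LR(x) < \Theta_{Bayes} \mid s=1\right)$, so the threshold sits on the equal error rate curve exactly when these two tail probabilities coincide; equal priors give $\Theta_{Bayes} = 1$, but whether that also equalizes the two probabilities depends on the class-conditional distributions of $LR(x)$.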
Category: Data Science

Loss drops to NaN after a short time for a time series classification

Here is my model code for a binary classification of a time series:

def make_model(feature_columns):
    feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
    feature_layer_outputs = feature_layer(feature_layer_inputs)
    feature_layer_outputs = tf.expand_dims(feature_layer_outputs, 1)

    conv = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same",
                               kernel_regularizer=keras.regularizers.l1_l2(l1=0.01, l2=0.01))(feature_layer_outputs)
    conv = keras.layers.BatchNormalization()(conv)
    conv = keras.layers.ReLU()(conv)

    conv = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same",
                               kernel_regularizer=keras.regularizers.l1_l2(l1=0.01, l2=0.01))(conv)
    conv = keras.layers.BatchNormalization()(conv)
    conv = keras.layers.ReLU()(conv)

    conv = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same",
                               kernel_regularizer=keras.regularizers.l1_l2(l1=0.01, l2=0.01))(conv)
    conv = keras.layers.BatchNormalization()(conv)
    conv = keras.layers.ReLU()(conv)
    conv = keras.layers.Dropout(0.25)(conv)

    gap = keras.layers.GlobalAveragePooling1D()(conv)
    output_layer = keras.layers.Dense(1, activation="Softmax")(gap)

    return keras.models.Model(inputs=[v for v in feature_layer_inputs.values()], outputs=output_layer)

So I …
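For what it's worth, one detail that often causes trouble in a setup like this is the output layer: a single-unit softmax always outputs 1.0, so binary cross-entropy gets no useful gradient and the loss can blow up on negative examples. A hedged sketch of the usual alternative, reusing the gap and feature_layer_inputs names from the code above (assumptions), with a sigmoid output and gradient clipping as an extra guard against NaN losses:

import tensorflow as tf
from tensorflow import keras

output_layer = keras.layers.Dense(1, activation="sigmoid")(gap)   # single-unit sigmoid for binary labels

model = keras.models.Model(inputs=[v for v in feature_layer_inputs.values()],
                           outputs=output_layer)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),  # clip gradient norm
    loss="binary_crossentropy",
    metrics=["accuracy"],
)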
Category: Data Science

How should I keep my CNN binary classification model from overfitting and underfitting

I am trying to do the cats & dogs classification problem. The problem is that my model is overfitting, and I have tried all the techniques I know to fix it, such as dropout, data augmentation, and L2 and L1 regularization, but nothing is working. Can you please help me? At the end of training, my train accuracy was 0.7868 and my validation accuracy was 0.7044. My image size is (h=48, w=48) with 3 channels, and the batch size is 128 …
Category: Data Science
