I am working on a model that will run monthly on 8M users. My training set consists of monthly snapshots, e.g.: Jan 2021 snapshot: 8M total, 233 positives, rest negative; Feb 2021 snapshot: 8M total, 599 positives, rest negative; March 2021 snapshot: 8M total, 600 positives, rest negative; April 2021 snapshot: 8M total, 750 positives, rest negative; and similarly until March 2022. I'm keeping March 2022 as the test set, which has 2000 positive labels …
Let me start by saying my machine learning experience is... dangerous at this stage. I'm still a beginner. I have a binary classification dataset of about 100,000 records. 10% of the records are positive and the rest are negative, so the dataset is highly skewed. It is extremely important to maximize accuracy on the positive class (true positives, i.e. recall), even at the expense of accuracy on the negative class (true negatives). Thus, I would rather have an overall 70% accuracy if positive …
I have 2 sets of training data in csv files. The training data have class labels, 1 for memorable, and 0 for not memorable. In addition, there is also a confidence label for each sample. The class labels were assigned based on decisions from 3 people viewing the photos. When they all agreed, the class label could be considered certain, and a confidence of 1 was written down. If they didn't all agree, then the classification decided on by the …
I have a set of 150 documents with their assigned binary class. I also have 1000 unlabeled documents. Each document is about the length of a journal paper. Each class has 15 associated keywords. I want to be able to predict the assigned class of the documents using this information. Does anyone have any ideas of how I could approach this problem?
I have built a LightGBM-based machine learning model on data of molecules from two classes. The distribution is as follows: class 0 has 5933 data points and class 1 has 4696. The train and test accuracies I get on this data are around 87% and 82%, respectively. The roc_auc_score is around 81.5%. But when I try to evaluate model performance on an entirely new dataset that the model has never seen before, with class labels 0 and 1 both having 94 …
I have a scikit-learn logistic regression binary classifier and tried training it on my dataset. My model does extremely well at a threshold of 0.95 instead of 0.5, and all my predictions on example cases are above 0.8 for both classes. I cannot figure out why my model is pushing its predictions up so much. I would appreciate some potential workarounds for this.
I have a conceptual question about interpreting an aggregation of a machine learning classifier's predictions. Let's assume I have trained a binary classifier and it was validated with an accuracy of 70% (the dataset is always balanced). My question is: if this accuracy seems too low to me, and I want to improve it without any changes to the classifier, would the following idea be valid? The classifier predicts three independent samples (always with …
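The excerpt is cut off, but if the idea is a majority vote over three independent predictions that each have a 70% chance of being correct (an assumption about the truncated text, not something the question confirms), the arithmetic can be sketched as follows. The function name `majority_vote_accuracy` is hypothetical, and the result only holds when the errors really are independent.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int = 3) -> float:
    """Probability that a strict majority of n independent votes is correct,
    when each vote is correct with probability p (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority always exists")
    needed = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


# Assumed setting: three independent predictions, each 70% accurate.
# 3 * 0.7^2 * 0.3 + 0.7^3 = 0.441 + 0.343 = 0.784
print(round(majority_vote_accuracy(0.7, 3), 3))
```

If the three predictions are correlated (for example, the same classifier applied to near-identical inputs), the gain would be smaller than this independent-vote figure.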
I am training a binary classifier using an efficientnetv2 model with a 1M image dataset where I do a 60/20/20 split. Does this graph mean that the model is over-fitting? I can see that the train loss is going down much faster than the eval loss but the eval loss is still going down and the accuracy is going up. Accuracy may seem to be low but it is actually a pretty decent amount for the problem I am working …
Suppose I have a binary classifier $f$ which acts on an input $x$. Given a threshold $t$, the predicted binary output is defined as: $$ \widehat{y} = \begin{cases} 1, & f(x) \geq t \\ 0, & f(x) < t \end{cases} $$ I then compute the $TPR$ (true positive rate) and $FPR$ (false positive rate) metrics on the hold-out test set (call it $S_1$): $TPR_{S_1} = \Pr(\widehat{y} = 1 | y = 1, S_1)$ $FPR_{S_1} = \Pr(\widehat{y} = 1 | y …
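As a minimal sketch of the definitions above, the following computes $TPR$ and $FPR$ at a threshold $t$ from classifier scores $f(x)$ and true labels $y$. The function name `tpr_fpr` and the toy scores are made up for illustration and do not come from the question.

```python
def tpr_fpr(scores, labels, t):
    """Return (TPR, FPR) when predicting 1 for every score >= t."""
    tp = fp = fn = tn = 0
    for score, y in zip(scores, labels):
        y_hat = 1 if score >= t else 0  # thresholding rule from the question
        if y == 1 and y_hat == 1:
            tp += 1
        elif y == 1 and y_hat == 0:
            fn += 1
        elif y == 0 and y_hat == 1:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")  # Pr(y_hat = 1 | y = 1)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")  # Pr(y_hat = 1 | y = 0)
    return tpr, fpr


# Toy hold-out set (illustrative values only).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
tpr, fpr = tpr_fpr(scores, labels, t=0.5)
```

Because both rates are conditioned on the true label, neither changes if the class proportions of the evaluation set change while the score distribution within each class stays the same.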
I have a binary response variable (label) in a dataset with around 50,000 observations. The training set is somewhat imbalanced, with label = 1 making up about 33% of the observations and label = 0 making up about 67%. Right now, with XGBoost, I'm getting a ROC-AUC score of around 0.67. The response variable is binary, so the baseline is 50% in terms of chance, but at the same time the data is imbalanced, so if the model just guessed label = 0 …
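The two baselines mentioned here measure different things, which a small sketch can make concrete. It assumes a toy label vector with the same 33%/67% split; the `roc_auc` helper is a hypothetical pairwise implementation written for illustration, not a library call.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels with the same class balance as in the question (assumed data).
labels = [1] * 33 + [0] * 67

# A model that always guesses the majority class 0.
constant_scores = [0.0] * len(labels)

majority_accuracy = sum(y == 0 for y in labels) / len(labels)  # accuracy baseline
constant_auc = roc_auc(constant_scores, labels)                # ROC-AUC baseline
print(majority_accuracy, constant_auc)
```

Under these assumptions the always-0 guesser reaches 67% accuracy but only 0.5 ROC-AUC, so a ROC-AUC of 0.67 is compared against 0.5 rather than against the majority-class share.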
I have a question regarding "Predictive Maintenance": in this tutorial here: https://docs.microsoft.com/en-us/learn/modules/predictive-maintenance-model-builder/3-choose-scenario-data It says: "Choosing a scenario for predictive maintenance Depending on what your data looks like, the predictive maintenance problem can be modeled through different tasks. For your use case, because the label is a binary value (0 or 1) that describes whether a machine is broken or not, the data classification scenario is appropriate" Now, how can this be used to predict machine failure BEFORE it gets broken? …
I am working on a binary classification problem, and my classification report generated with scikit-learn looks like the image below. I am confused because I have two sets of precision and recall values, one for class 0 and the other for class 1. Which values should I report when writing up my results?
I am evaluating different models that do binary classifications and basically generate trade signals. They make a prediction of either buy or sell for the next day. I look at 10 different underlying assets and have 3 different variations of data that I train the models with. I evaluated 12 different types of models. That leaves me with 10 x 3 x 12 = 360 different models/predictions. I backtested those trade signals they generate: Most of them do not really …
I'm a data science student and I've come across a fairly unusual dataset (to me, which explains the vague title). It's of the following form:

| STAT_1 | STAT_2 | ... | HOME | AWAY | NEXT_HOME | NEXT_AWAY | NEXT_RESULT |
|---|---|---|---|---|---|---|---|
| 15 | 11 | ... | Team A | Team B | Team C | Team D | 1 |
| 11 | 18 | ... | Team C | Team D | Team E | Team F | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10 | 11 | ... | Team W | Team X | Team Y | Team Z | 1 |

Basically, the rows …
I am trying to solve a binary classification problem. My labels are abusive (1) and non-abusive (0). My dataset was imbalanced (more 0s than 1s), so I oversampled the minority label (i.e. 1) to balance it. I have also done pre-processing and feature engineering using TF-IDF, and then fed the dataset into a pipeline using three classification algorithms: Logistic Regression, SVM, and Decision Tree. My evaluation metrics are: Logistic Regression: [[376 33] [ 18 69]] precision recall …
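For reference, per-class precision and recall can be recomputed from the Logistic Regression confusion matrix quoted above. This sketch assumes the scikit-learn convention (rows are true labels, columns are predicted labels, classes ordered 0 then 1), which the excerpt does not state explicitly; the helper name `precision_recall` is hypothetical.

```python
# Confusion matrix from the question, assumed layout: cm[true][predicted].
cm = [[376, 33],
      [18, 69]]


def precision_recall(cm, cls):
    """Precision and recall for class `cls` from a square confusion matrix."""
    tp = cm[cls][cls]
    fn = sum(cm[cls]) - tp                 # true `cls` predicted as another class
    fp = sum(row[cls] for row in cm) - tp  # other classes predicted as `cls`
    return tp / (tp + fp), tp / (tp + fn)


precision_1, recall_1 = precision_recall(cm, 1)  # abusive class
precision_0, recall_0 = precision_recall(cm, 0)  # non-abusive class
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)

print(round(precision_1, 3), round(recall_1, 3))  # 0.676 0.793
```

Under that assumed layout the test split contains 409 non-abusive and 87 abusive examples, so class 1 is the minority class at evaluation time.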
If we use classic Bayesian classification for a two-class problem and classify by comparing the likelihood ratio $LR(x) = \frac{p(x|s=1)}{p(x|s=2)}$ to a threshold $\Theta_{Bayes} = \frac{P(s=2)}{P(s=1)}$, when would this $\Theta_{Bayes}$ produce equal false positive and false negative rates? My intuition is that this happens if the classes have equal priors, so that $\Theta_{Bayes} = 1$. Is this the case?
I have a binary classification problem with a large dataset of dimensions (1155918, 55). The dataset is fairly balanced, with 67% class 0 and 33% class 1. I am getting an accuracy of 73% on the test set, the AUC score is 50%, and recall is 0.02 for class 1. I am using logistic regression and have also tried PyCaret's classification algorithms.
I am trying to solve the cats & dogs classification problem. The problem is that my model is overfitting, and I have tried all the techniques I know to fix it, such as dropout, data augmentation, and L1 and L2 regularization, but nothing is working. Can you please help me? At the end of training, my train accuracy was 0.7868 and my validation accuracy was 0.7044. (My image size is h=48, w=48 with 3 channels, and batch size = 128) …