Estimating class prevalence in unlabelled data after predicting labels with a binary classifier

I'm looking to get an estimate of the prevalence of 1's (i.e. the rate of positive labels) in a very large dataset that I have. However, I am hoping to report this percentage as a 95% credible interval instead of as an exact estimate of rate, taking into account the model uncertainties. These are the steps I'm hoping to perform: Train a binary classifier on labelled training data. Use a labelled test set to estimate the specificity and sensitivity of …
Category: Data Science
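
One way to approach the prevalence question above is sketched below. It is an assumption, not the asker's method: it applies the Rogan-Gladen correction to the predicted-positive rate and propagates uncertainty by sampling sensitivity, specificity and the apparent rate from Beta posteriors with uniform priors. The function names and all counts are hypothetical, and draws in which the classifier is no better than chance are simply discarded.

```python
import random


def rogan_gladen(apparent_rate, sensitivity, specificity):
    """Correct an observed positive rate for classifier error, clipped to [0, 1]."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("need sensitivity + specificity > 1")
    return min(1.0, max(0.0, (apparent_rate + specificity - 1.0) / denom))


def prevalence_interval(tp, fn, tn, fp, n_pred_pos, n_total, draws=20000, seed=0):
    """Approximate 95% credible interval for the true prevalence (Monte Carlo)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        # Beta(successes + 1, failures + 1) posteriors under uniform priors.
        sens = rng.betavariate(tp + 1, fn + 1)
        spec = rng.betavariate(tn + 1, fp + 1)
        rate = rng.betavariate(n_pred_pos + 1, n_total - n_pred_pos + 1)
        if sens + spec <= 1.0:
            continue  # assumption: skip draws where the correction is undefined
        samples.append(rogan_gladen(rate, sens, spec))
    samples.sort()
    lower = samples[int(0.025 * len(samples))]
    upper = samples[int(0.975 * len(samples)) - 1]
    return lower, upper


# Hypothetical counts: test-set confusion matrix, then predictions on the large dataset.
lower, upper = prevalence_interval(tp=90, fn=10, tn=180, fp=20,
                                   n_pred_pos=3000, n_total=10000)
print(f"95% interval for prevalence: [{lower:.3f}, {upper:.3f}]")
```

With a small labelled test set, the width of the interval is driven mostly by the uncertainty in sensitivity and specificity rather than by the size of the unlabelled dataset.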

Neural network / machine learning approach to model specific sequencing-classification problem in industry

I am working on a project that involves developing a machine learning/deep learning model for an application in the roll-to-roll industry. For a long time, I have been looking for similar problems to get some guidance, but I was never able to find anything related. Basically, the problem can be seen as follows: An industrial machine is producing a roll of some material, which tends to have visible defects throughout the roll. I already have available a machine …
Category: Data Science

macro average and weighted average meaning in classification_report

I use the "classification_report" function (from sklearn.metrics import classification_report) to evaluate an imbalanced binary classification.

Classification Report:

                  precision    recall  f1-score   support

               0       1.00      1.00      1.00     28432
               1       0.02      0.02      0.02        49

        accuracy                           1.00     28481
       macro avg       0.51      0.51      0.51     28481
    weighted avg       1.00      1.00      1.00     28481

I do not clearly understand what macro avg and weighted avg mean, or how to identify the best solution based on how close these values are to one. …
Category: Data Science
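
As a worked illustration of the two averages in the classification_report question above: the macro average is the unweighted mean of the per-class scores, while the weighted average weights each class score by its support. The sketch below recomputes both from the rounded precision values shown in the report, so the results are approximate; the helper names are hypothetical.

```python
def macro_average(scores):
    """Unweighted mean of per-class scores: every class counts equally."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean of per-class scores weighted by the number of samples per class."""
    total = sum(supports)
    return sum(score * n for score, n in zip(scores, supports)) / total


# Per-class precision and support, copied from the report in the question.
precision = [1.00, 0.02]
support = [28432, 49]

macro = macro_average(precision)
weighted = weighted_average(precision, support)
print(f"macro avg:    {macro:.2f}")     # prints 0.51
print(f"weighted avg: {weighted:.2f}")  # prints 1.00
```

Because class 0 makes up almost all of the 28481 samples, the weighted average is dominated by it and hides the poor score on class 1, which the macro average exposes.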

Subsampling the “right” amount of data to train an ML model

I am training a machine learning model (i.e., a classifier) on a large dataset. I know that I can get the same results using less data (about 30%), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain from the dataset. Of course, I could write a script that automatically tries different thresholds, but I was wondering if there is any principled way of doing this. It seems strange that …
Category: Data Science

Clustering time series data using dynamic time warping

I would like to cluster/group the curves in the attached picture with Python. The data is already normalized and my approach would be to use dtw (dynamic time warping) to calculate the distance and with that feature use a clustering algorithm (like kmeans or DBSCAN) to classify them. Do I pick out one trajectory as a starting curve to compare the other curves to, or do I calculate an 'average' curve of all curves and use that as the starting …
Category: Data Science
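
For the DTW clustering question above, one common option (an assumption here, not something the question states) is to avoid choosing a reference curve at all and instead compute the DTW distance between every pair of curves. The resulting matrix can be handed to a clustering algorithm that accepts precomputed distances, which DBSCAN does and standard k-means does not. A minimal pure-Python sketch with hypothetical helper names:

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def pairwise_dtw(curves):
    """Symmetric matrix of DTW distances between all pairs of curves."""
    k = len(curves)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(curves[i], curves[j])
    return dist


# Hypothetical toy curves: the first two have the same shape at different speeds.
curves = [[1, 2, 3], [1, 2, 2, 3], [3, 2, 1]]
for row in pairwise_dtw(curves):
    print(row)
```

This implementation is quadratic in curve length and in the number of curves, so for many long series a dedicated DTW library or a windowed variant would be the more practical choice.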

What to do when one feature has very large importance/weight?

I am new to Data Science and am currently trying to predict customer churn for a company that offers subscription-based booking management software. Its customers are gyms. I have a small unbalanced dataset of historical data (False 670, True 230) with 2 numerical predictors: age (days since subscription) and number of active days in the last month (days on which a customer (gym) had bookings), and 1 categorical predictor: logo (boolean, whether a customer uploaded a logo in the software). Predictors have following …
Category: Data Science

How to generate a rule-based system based on binary data?

I have a dataset where each row is a sample and each column is a binary variable. The meaning of $X_{i, j} = 1$ is that we've seen feature $j$ for sample $i$. $X_{i, j} = 0$ means that we haven't seen this feature yet, but we still might. We have around $1000$ binary variables and around $200k$ samples. The target variable, $y$, is categorical. What I'd like to do is to find subsets of variables that precisely predict some $y_k$. …
Category: Data Science

GridSearch multiplying the number of trees in XGBoost?

I'm having an issue: after running XGBoost in a HalvingGridSearchCV, I receive a certain number of estimators (50, for example), but the number of trees is constantly being multiplied by 3. I don't understand why. Here is the code:

    model = XGBClassifier(objective='multi:softprob', subsample=0.9, colsample_bytree=0.5, num_class=3)
    md = [3, 6, 10, 15]
    lr = [0.1, 0.5, 1]
    g = [0, 0.25, 1]
    rl = [0, 1, 10]
    spw = [1, 3, 5]
    ns = [5, 10, 20]
    …
Category: Data Science

Classification Produces too Many False Positives or False Negatives

I am trying to use this data set (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) to classify whether a patient is at risk of having a stroke. As the title says, whatever test I run to classify the patients, I keep ending up with too many false positives or too many false negatives. The data itself is severely imbalanced (95% 0s to 5% 1s (had a stroke)) and in spite of doing various things to try and balance it or compensate for it, I keep …
Category: Data Science

When to use Random Forest over SVM and vice versa?

When would one use Random Forest over SVM and vice versa? I understand that cross-validation and model comparison are important aspects of choosing a model, but here I would like to learn more about rules of thumb and heuristics for the two methods. Can someone please explain the subtleties, strengths, and weaknesses of the classifiers, as well as the problems that are best suited to each of them?
Category: Data Science

How to export shap waterfall values to dataframe?

I am working on a binary classification using a random forest model and neural networks, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below:

    row_to_show = 20
    data_for_prediction = ord_test_t.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
    data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
    rf_boruta.predict_proba(data_for_prediction_array)
    explainer = shap.TreeExplainer(rf_boruta)
    # Calculate Shap values
    shap_values = explainer.shap_values(data_for_prediction)
    shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0], ord_test_t.iloc[row_to_show])

This generated the plot as …
Category: Data Science

How to decide who to market? Clustering or Decision Tree?

I am working with a dataset that has enough observations and ~10 variables: half of the variables are numeric; the other half are categorical with 2-3 levels (demographics); one is an ID variable; and the last one holds the sales value, 0 for no sale and the bill amount for a sale. Using this information, I want to understand which segments of my customers to market to. I am using R for code but that's not relevant here. :) I am confused about …
Category: Data Science

Identify optimal thresholds for one-vs-one/one-vs-rest ROC-curve for multiclass classification

Say I have a multiclass classification problem with N classes. I have trained a classifier on a training set, and I use a validation set and a one-vs-rest ROC curve to give me N ROC curves. Since the ROC curve is created from different thresholds for when we classify a sample as $C_i$ or not $C_i$, we can then choose our optimal FPR/TPR ratio and get the threshold (t), e.g. with t=0.6 we classify a sample as $C_i$ if model_score>=0.6 else …
Category: Data Science
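
To make the thresholding step in the ROC question above concrete, here is a minimal sketch of picking a one-vs-rest threshold for a single class. It assumes Youden's J statistic (TPR minus FPR) as the optimality criterion, which is only one of several possible choices and is not named in the question; the function name and the toy scores are hypothetical.

```python
def best_threshold(scores, is_class):
    """Return (threshold, J) maximising Youden's J = TPR - FPR for one class.

    scores   -- model scores for class C_i on the validation set
    is_class -- 1 if the sample truly belongs to C_i, else 0
    """
    pos = sum(is_class)
    neg = len(is_class) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    best_t, best_j = None, -1.0
    # Every distinct score is a candidate threshold (predict C_i if score >= t).
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, is_class) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, is_class) if not y and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical validation scores for one class versus the rest.
scores = [0.10, 0.40, 0.35, 0.80]
labels = [0, 0, 1, 1]
print(best_threshold(scores, labels))  # (0.35, 0.5)
```

Repeating this per class gives N independent thresholds; how to resolve samples that pass several thresholds, or none, is a separate design decision that this sketch does not address.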

Image Classification problem for minute defect detection

I am tasked with the problem of finding defects in a compressor wheel. Here is what a good wheel looks like: Here is what a defective wheel looks like (I have drawn a box around the defective area): I have a continuous video feed of the rotating wheels as a data set. I tried training the "goodness" of a wheel using a fasterrcnn_resnet50_fpn model in pytorch. But the results were inaccurate. This is what I fed in the training data with …
Category: Data Science

Binary Classification Comparing two time series of variable length

Is there a machine learning model (something like LSTM or 1D-CNN) that takes two time series of variable length as input and outputs a binary classification (True/False whether time series are of same label)? So the data would look something like the following:

    date        value  label
    2020-01-01  2      0      # first input time series
    2020-01-02  1      0      # first input time series
    2020-01-03  1      0      # first input time series
    2020-01-01  3      1      # second input time series
    2020-01-03  1      …
Category: Data Science

I have data with customer personal information and customer transaction. I cannot figure out how to use the data for training my model?

Customer information attributes: ID, Age, Gender, State, etc. Customer transaction: ID, Store ID, No. of items bought, State, etc. Store info: Store ID, State, Daily revenue, Store size, etc. I want to predict whether a customer will buy at a particular store or not. So can I build the training data from, say, 5 different stores per customer where the customer shops, and then predict for another store?
Category: Data Science

seasonality in classification model

I am building a classification model to predict customer status a year from a given time. There seems to be some seasonality; for example, more changes occur in summer than in winter, so my dataset (mainly the labels) would change depending on how I define the prediction time (e.g. 2020 Jan) and the predicting time (e.g. 2021 Jan). Let's say there are 100 customers and I could make 1,200 entries (100 per month for every month in 2020, where labels are from …
Category: Data Science
