I'm looking to get an estimate of the prevalence of 1's (i.e. the rate of positive labels) in a very large dataset that I have. However, I am hoping to report this percentage as a 95% credible interval instead of as a point estimate of the rate, taking the model uncertainties into account. These are the steps I'm hoping to perform:

1. Train a binary classifier on labelled training data.
2. Use a labelled test set to estimate the specificity and sensitivity of …
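To make the interval concrete, here is a minimal sketch of the kind of computation I am after (toy counts; combining the pieces by simulating sensitivity/specificity posteriors and inverting the predicted-positive rate is my assumption, not an established recipe I am following):

```python
import numpy as np

rng = np.random.default_rng(0)

# test-set confusion counts (toy numbers)
tp, fn = 90, 10    # positives: sensitivity ~ 0.9
tn, fp = 950, 50   # negatives: specificity ~ 0.95

# predicted-positive rate on the large unlabelled dataset (toy numbers)
n_total, n_pred_pos = 1_000_000, 120_000

draws = 20_000
# posterior draws with flat Beta(1, 1) priors
sens = rng.beta(tp + 1, fn + 1, draws)
spec = rng.beta(tn + 1, fp + 1, draws)
p_pos = rng.beta(n_pred_pos + 1, n_total - n_pred_pos + 1, draws)

# invert p_pos = prevalence * sens + (1 - prevalence) * (1 - spec)
prevalence = np.clip((p_pos - (1 - spec)) / (sens + spec - 1), 0, 1)

print(np.percentile(prevalence, [2.5, 97.5]))  # 95% credible interval
```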
I am working on a project which involves developing a machine learning/deep learning for an application in a roll-to-roll industry. For a long time, I have been looking for similar problems as a way to get some guidance but I was never able to find anything related. Basically, the problem can be seen as follows: An industrial machine is producing a roll of some material, which tends to have visible defects throughout the roll. I have already available a machine …
I use classification_report from sklearn.metrics (from sklearn.metrics import classification_report) in order to evaluate an imbalanced binary classification:

```
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28432
           1       0.02      0.02      0.02        49

    accuracy                           1.00     28481
   macro avg       0.51      0.51      0.51     28481
weighted avg       1.00      1.00      1.00     28481
```

I do not clearly understand the meaning of macro avg and weighted avg, and how we can judge which solution is best based on how close these averages are to one. …
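For reference, the two rows are just different averaging rules applied to the per-class scores; using the precision column of the report above:

```python
# macro avg: unweighted mean of the per-class scores
macro_precision = (1.00 + 0.02) / 2  # = 0.51, matches the report

# weighted avg: mean weighted by each class's support (28432 vs 49)
weighted_precision = (1.00 * 28432 + 0.02 * 49) / 28481  # ~ 1.00

print(macro_precision, round(weighted_precision, 2))
```

Because class 0 dominates the support, the weighted average hides the poor performance on class 1, while the macro average exposes it.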
I am training a machine learning model (i.e., a classifier) on a large dataset. I know that I can get the same results using less data (about 30%), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain from the dataset. Of course I can create a script which automatically tries different thresholds, but I was wondering if there is any principled way of doing this. It seems strange that …
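One candidate for such a principled approach that I have come across is a learning curve; a minimal sketch with a stand-in model and dataset (not my real ones):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# stand-in for the large dataset
X, y = make_classification(n_samples=10_000, random_state=0)

# cross-validated score at increasing fractions of the training data
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

# the sample size where the validation score flattens out is roughly
# the amount of data the model actually needs
for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(n, round(s, 3))
```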
I would like to cluster/group the curves in the attached picture with Python. The data is already normalized, and my approach would be to use DTW (dynamic time warping) to calculate the distances and then use those distances with a clustering algorithm (like k-means or DBSCAN) to group the curves. Do I pick out one trajectory as a starting curve to compare the other curves to, or do I calculate an 'average' curve of all curves and use that as the starting …
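A sketch of the pairwise variant I have in mind (assuming the dtaidistance package for DTW; 'curves' stands in for my normalized trajectories, and eps would need tuning):

```python
import numpy as np
from dtaidistance import dtw
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
curves = [rng.standard_normal(50).cumsum() for _ in range(20)]  # stand-in data

# full pairwise DTW distance matrix instead of one reference curve
n = len(curves)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw.distance(curves[i], curves[j])

# DBSCAN accepts the precomputed matrix directly, so no single
# 'starting curve' or average curve is required
labels = DBSCAN(eps=5.0, min_samples=2, metric='precomputed').fit_predict(dist)
print(labels)
```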
I am new to Data Science and am currently trying to predict customer churn for a company that offers subscription-based booking-management software. Its customers are gyms. I have a small, unbalanced dataset of historical data (670 False, 230 True) with 2 numerical predictors: age (days since subscription) and number of active days in the last month (days on which a customer (gym) had bookings), and 1 categorical predictor: logo (boolean, whether a customer uploaded a logo in the software). The predictors have the following …
I have a dataset where each row is a sample and each column is a binary variable. The meaning of $X_{i, j} = 1$ is that we've seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen this feature yet, but we might in the future. We have around $1000$ binary variables and around $200k$ samples. The target variable, $y$, is categorical. What I'd like to do is to find subsets of variables that precisely predict some $y_k$. …
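To make "precisely predict" concrete, the kind of search I am picturing scores simple conjunctive rules by their precision for one target class (a toy sketch; in reality $X$ is the $200k \times 1000$ binary matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 20)).astype(bool)  # stand-in for the real matrix
y = np.where(X[:, 3] & X[:, 7], 'k', 'other')         # planted rule for class k

target = (y == 'k')

# precision of every single-feature rule "X_j = 1 -> y = k"
for j in range(X.shape[1]):
    covered = X[:, j]
    if covered.sum() == 0:
        continue
    precision = (covered & target).sum() / covered.sum()
    if precision > 0.9:
        print(f"feature {j}: precision={precision:.2f}, coverage={covered.sum()}")

# conjunctions of features are scored the same way
covered = X[:, 3] & X[:, 7]
print("rule (3 AND 7):", (covered & target).sum() / covered.sum())  # 1.0 here
```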
I have a multi-class classifier where my model is habitually confusing one class with another. What techniques can I use to reduce this type of error (even at the cost of other types of errors)?
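One family of techniques I have been considering is reweighting, so that errors on the habitually-confused class cost more; a minimal sketch (the class indices and weights are placeholders, not from my real problem):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_classes=3, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# say class 2 keeps getting predicted as class 1: make class-2 errors
# 5x as expensive, accepting more errors elsewhere in exchange
clf = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 1, 2: 5})
clf.fit(X_tr, y_tr)

print(confusion_matrix(y_te, clf.predict(X_te)))
```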
I'm having an issue: after running an XGBoost model in a HalvingGridSearchCV, I receive a certain number of estimators (50, for example), but the number of trees is constantly being multiplied by 3, and I don't understand why. Here is the code:

```python
model = XGBClassifier(objective='multi:softprob', subsample=0.9,
                      colsample_bytree=0.5, num_class=3)
md = [3, 6, 10, 15]  # max_depth grid
lr = [0.1, 0.5, 1]   # learning_rate grid
g = [0, 0.25, 1]     # gamma grid
rl = [0, 1, 10]      # reg_lambda grid
spw = [1, 3, 5]      # scale_pos_weight grid
ns = [5, 10, 20]     # n_estimators grid
…
```
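For what it's worth, this is how I am counting the trees (a toy sketch, not my real pipeline; get_dump() lists the individual trees in the fitted booster):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)

clf = XGBClassifier(objective='multi:softprob', n_estimators=50)
clf.fit(X, y)

# with 3 classes this prints 150, i.e. n_estimators * num_class
print(len(clf.get_booster().get_dump()))
```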
I am trying to classify this dataset (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) to predict whether a patient is at risk of having a stroke. As the title says, whatever test I run to classify the patients, I keep running into the final results having too many false positives or too many false negatives. The data itself is severely imbalanced (95% 0s to 5% 1s (had a stroke)), and in spite of doing various things to try to balance it or compensate for it, I keep …
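Two of the things I have tried, in sketch form (stand-in data with the same 95/5 imbalance; the real pipeline has more preprocessing):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# stand-in for the imbalanced stroke data
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) class weights instead of resampling
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_tr, y_tr)

# 2) move the decision threshold instead of keeping the default 0.5,
#    trading false positives against false negatives explicitly
proba = clf.predict_proba(X_te)[:, 1]
prec, rec, thr = precision_recall_curve(y_te, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best = np.argmax(f1[:-1])  # thr is one entry shorter than prec/rec
print(f"threshold={thr[best]:.2f} precision={prec[best]:.2f} recall={rec[best]:.2f}")
```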
When would one use Random Forest over SVM, and vice versa? I understand that cross-validation and model comparison are an important aspect of choosing a model, but here I would like to learn more about rules of thumb and heuristics for the two methods. Can someone please explain the subtleties, strengths, and weaknesses of the classifiers, as well as the problems each of them is best suited to?
In a data classification problem (with supervised learning), what should be the ideal difference between the training set accuracy and the testing set accuracy? What should be the ideal range? Is a difference of 5% between the accuracy of the training and testing sets okay, or does it signify overfitting?
I am working on a binary classification using a random forest model and neural networks, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the below code to get the waterfall plot shown below:

```python
import shap

row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)

explainer = shap.TreeExplainer(rf_boruta)
# Calculate SHAP values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0],
                                       shap_values[0],
                                       ord_test_t.iloc[row_to_show])
```

This generated the plot as …
I am working with a dataset that has enough observations and ~10 variables:

- half of the variables are numeric
- the other half of the variables are categorical with 2-3 levels (demographics)
- one ID variable
- one last variable that has the sales value: 0 for no sale, the bill amount for a sale

Using this information, I want to understand which segments of my customers to market to. I am using R for the code, but that's not relevant here. :) I am confused about …
Say I have a multiclass classification problem with $N$ classes. I have trained a classifier on a training set, and I use a validation set and a one-vs-rest ROC curve to give me $N$ ROC curves. Since each ROC curve is created from different thresholds for classifying a sample as $C_i$ or not $C_i$, we can then choose (our) optimal FPR/TPR ratio and get the corresponding threshold $t$; e.g., say $t=0.6$: we classify a sample as $C_i$ if model_score >= 0.6, else …
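A sketch of the per-class threshold selection I am describing (toy data; the "closest to the top-left corner" criterion is just one possible choice of optimal FPR/TPR trade-off):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_val)  # one column of scores per class

thresholds = {}
for i in range(3):
    # one-vs-rest ROC for class C_i
    fpr, tpr, thr = roc_curve(y_val == i, scores[:, i])
    # pick the threshold closest to the top-left corner (FPR=0, TPR=1)
    thresholds[i] = thr[np.argmin(fpr**2 + (1 - tpr)**2)]

print(thresholds)  # classify as C_i only when scores[:, i] >= thresholds[i]
```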
I am tasked with the problem of finding defects in a compressor wheel. Here is what a good wheel looks like: Here is what a defective wheel looks like (I have drawn a box around the defective area): I have a continuous video feed of the rotating wheels as a dataset. I tried training the "goodness" of a wheel using a fasterrcnn_resnet50_fpn model in PyTorch, but the results were inaccurate. This is what I fed in the training data with …
Is there a machine learning model (something like an LSTM or a 1D-CNN) that takes two time series of variable length as input and outputs a binary classification (True/False: whether the two time series have the same label)? So the data would look something like the following:

```
date        value  label
2020-01-01  2      0      # first input time series
2020-01-02  1      0      # first input time series
2020-01-03  1      0      # first input time series
2020-01-01  3      1      # second input time series
2020-01-03  1      …
```
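The shape I am imagining is a siamese encoder; a minimal PyTorch sketch (all names and sizes are mine, just to illustrate the idea; real use would need padding/packing for batches of variable-length series):

```python
import torch
import torch.nn as nn

class SiameseGRU(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.encoder = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)

    def encode(self, x):
        # x: (batch, seq_len, 1); the final hidden state summarizes the
        # series as a fixed-size vector regardless of its length
        _, h = self.encoder(x)
        return h[-1]

    def forward(self, a, b):
        # shared weights: both series go through the same encoder
        za, zb = self.encode(a), self.encode(b)
        return torch.sigmoid(self.head(torch.cat([za, zb], dim=1)))

model = SiameseGRU()
s1 = torch.tensor([[2.], [1.], [1.]]).unsqueeze(0)  # first series, length 3
s2 = torch.tensor([[3.], [1.]]).unsqueeze(0)        # second series, length 2
print(model(s1, s2))  # probability that the two series share a label
```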
Customer information attributes:

- ID
- Age
- Gender
- State
- etc.

Customer transaction:

- ID
- Store ID
- No. of items bought
- State
- etc.

Store info:

- Store ID
- State
- Daily revenue
- Store size
- etc.

I want to predict whether a customer will buy at a particular store or not. So can I have the training data with, say, 5 different stores for every customer where the customer shops, and then predict on other stores?
I am building a classification model to predict customer status a year from a given time. There seems to be some seasonality (for example, more changes occur in Summer than in Winter), so my dataset (mainly the labels) would change depending on how I define the time I predict from (e.g. 2020 Jan) and the time I predict for (e.g. 2021 Jan). Let's say there are 100 customers: I could make 1,200 entries (100 per month for every month in 2020, where labels are from …