Validating classification results

I created a model for only 2 classes and the classification report was: Although the accuracy looks good, I don't think this model is good. The original data has 522 records of class 1 and 123 of class 2, so I think the model is mostly guessing the majority class (class 1). When I applied the model to the original data, it predicted 585 records as class 1 and 60 as class 2. When I balanced the classes, the results were: The …
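A minimal sketch of the kind of check being described: compare the trained model against a majority-class baseline and report a class-balanced metric. The arrays below only reuse the counts mentioned in the question; their row-by-row alignment is an assumption for illustration.

```python
# Minimal sketch (not the asker's code): contrast plain accuracy with a
# majority-class baseline and balanced accuracy on an imbalanced 2-class task.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, classification_report

# Illustrative arrays built from the counts in the question; the alignment
# of predictions to true labels is assumed, not taken from the real data.
y_true = np.array([1] * 522 + [2] * 123)
y_pred = np.array([1] * 585 + [2] * 60)

# Plain accuracy can look good just by favouring class 1 ...
majority_baseline = np.full_like(y_true, 1)
print("majority-baseline accuracy:", (majority_baseline == y_true).mean())

# ... so also report a metric that weights both classes equally.
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
```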
Category: Data Science

Cluster Evaluation with Jaccard and Rand Index

I've clustered my data according to 3 criteria into 3 groups. I used k-means to obtain those clusters, so the label of each cluster is arbitrary and changes on each script run. To evaluate the consistency of my clusters I decided to use the Jaccard index, but I can't understand how to apply it properly. Let's say I have this data, where alpha, beta, and gamma are the 3 methods, and the Cluster Index is the value returned by K-means for …
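Since the k-means label IDs are arbitrary, pair-counting indices are a natural fit: they only ask whether two points land in the same cluster, so relabelling has no effect. A small sketch with assumed label vectors:

```python
# Minimal sketch (assumed label vectors, not the asker's data): compare two
# clusterings with pair-counting indices, which ignore the arbitrary cluster
# IDs that k-means assigns on each run.
import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score, rand_score
from sklearn.metrics.cluster import contingency_matrix

labels_alpha = np.array([0, 0, 1, 1, 2, 2, 2, 0])
labels_beta  = np.array([2, 2, 0, 0, 1, 1, 1, 2])  # same partition, relabelled

def pair_jaccard(a, b):
    """Jaccard index over pairs of points placed in the same cluster."""
    c = contingency_matrix(a, b)
    same_both = comb(c, 2).sum()              # pairs together in both clusterings
    same_a = comb(c.sum(axis=1), 2).sum()     # pairs together in a
    same_b = comb(c.sum(axis=0), 2).sum()     # pairs together in b
    return same_both / (same_a + same_b - same_both)

print("Rand index:", rand_score(labels_alpha, labels_beta))            # 1.0
print("Adjusted Rand:", adjusted_rand_score(labels_alpha, labels_beta))
print("Pairwise Jaccard:", pair_jaccard(labels_alpha, labels_beta))    # 1.0
```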
Category: Data Science

Song playlist recommendation system

I want to build a recommender system to suggest similar songs to continue a playlist (similar to what Spotify does by recommending similar songs at the end of a playlist). I want to build two models: one based on collaborative filtering and another one, a content-based model, to compare their results and choose the best one. Now, I have two questions: Where can I find a dataset with useful data for this type of work? How can I measure the …
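A hedged sketch of the content-based half of the comparison, under assumed data: each song is a vector of audio features and the playlist is continued with the songs closest to the playlist centroid. All names and shapes here are hypothetical.

```python
# Minimal content-based sketch (assumed data): recommend the songs most
# similar to the centroid of the current playlist.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
song_features = rng.random((1000, 8))        # 1000 songs x 8 assumed audio features
playlist_ids = [3, 17, 256]                  # songs already in the playlist

centroid = song_features[playlist_ids].mean(axis=0, keepdims=True)
scores = cosine_similarity(centroid, song_features).ravel()
scores[playlist_ids] = -np.inf               # don't recommend what is already there

top_k = np.argsort(scores)[::-1][:10]
print("recommended song indices:", top_k)
```

For measuring results, a common setup is to hold out the last few songs of each playlist and check how often they appear in the top-k recommendations (precision@k / recall@k style), which works for both the collaborative and the content-based model.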
Category: Data Science

Meaningfully compare target vs observed TPR & FPR

Suppose I have a binary classifier $f$ which acts on an input $x$. Given a threshold $t$, the predicted binary output is defined as: $$ \widehat{y} = \begin{cases} 1, & f(x) \geq t \\ 0, & f(x) < t \end{cases} $$ I then compute the $TPR$ (true positive rate) and $FPR$ (false positive rate) metrics on the hold-out test set (call it $S_1$): $TPR_{S_1} = \Pr(\widehat{y} = 1 | y = 1, S_1)$ $FPR_{S_1} = \Pr(\widehat{y} = 1 | y …
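A small sketch of the definitions above, computed on a hold-out set $S_1$; the score and label arrays are placeholders.

```python
# Sketch of TPR/FPR at a fixed threshold t on a hold-out set S1 (placeholder data).
import numpy as np

def tpr_fpr(scores, y, t):
    """TPR = P(y_hat=1 | y=1), FPR = P(y_hat=1 | y=0) at threshold t."""
    y_hat = (scores >= t).astype(int)
    tpr = y_hat[y == 1].mean()   # fraction of positives predicted positive
    fpr = y_hat[y == 0].mean()   # fraction of negatives predicted positive
    return tpr, fpr

scores_s1 = np.array([0.9, 0.8, 0.35, 0.6, 0.2, 0.1])  # f(x) on S1
y_s1      = np.array([1,   1,   1,    0,   0,   0  ])
print(tpr_fpr(scores_s1, y_s1, t=0.5))   # -> (0.666..., 0.333...)
```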
Category: Data Science

Bias-variance trade-off and model evaluation

Suppose that we have trained a model (as defined by its hyperparameters) and we evaluated it on a test set using some performance metric (say $R^2$). If we now train the same model (as defined by its hyperparameters) on different training data, we will (probably) get a different value for $R^2$. If $R^2$ depends on the training set, then we will obtain a normal distribution of $R^2$ values around some mean. Shouldn't we therefore average the $R^2$ from the various …
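A quick sketch of the experiment the question describes, on synthetic data: refit the same model specification on different training draws and look at the spread of the test-set $R^2$.

```python
# Sketch (synthetic data): same hyperparameters, different training sets,
# collect the distribution of test R^2.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

r2s = []
for seed in range(30):                                  # 30 different splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)            # same hyperparameters each time
    r2s.append(r2_score(y_te, model.predict(X_te)))

print(f"R^2 mean = {np.mean(r2s):.3f}, std = {np.std(r2s):.3f}")
```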
Category: Data Science

Is data leakage giving me misleading results? Independent test set says no!

TLDR: I evaluated a classification model using 10-fold CV with data leakage between the training and test folds. The results were great. I then fixed the data leakage and the results were garbage. I then tested the model on an independent new dataset and the results were similar to the evaluation performed with data leakage. What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance? Extended version: I'm developing …
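For context, a sketch of one very common leakage pattern in CV (not necessarily the one in this question): preprocessing fitted on the whole dataset before cross-validation versus refitted inside each fold through a pipeline.

```python
# Sketch of a typical CV leakage pattern: scaler fitted on all data vs.
# scaler refit inside each fold via a Pipeline.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Leaky: the scaler sees the test folds before cross-validation starts.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_leaky, y, cv=cv)

# Leak-free: scaling is refit on the training portion of every fold.
clean = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=cv)

print("leaky CV accuracy:", leaky.mean())
print("pipeline CV accuracy:", clean.mean())
```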
Category: Data Science

How is model evaluation and re-training done after deployment without ground truth labels?

Suppose I deployed a model after manually labeling the ground truth for my training data, as the use case is such that there's no way to get ground truth labels without humans. Once the model is deployed, if I want to evaluate how the model is doing on live data, how can I evaluate it without sampling some of that live data (which doesn't come with ground truth labels) and manually labeling it? And …
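A minimal sketch of the sampling approach the question mentions, with placeholder arrays: label only a small random sample of live predictions and estimate live accuracy from it, with a rough confidence interval.

```python
# Sketch (placeholder data): estimate live accuracy from a small
# human-labeled random sample of live predictions.
import numpy as np

rng = np.random.default_rng(0)
live_predictions = rng.integers(0, 2, size=10_000)   # model outputs on live data

sample_idx = rng.choice(len(live_predictions), size=200, replace=False)
human_labels = rng.integers(0, 2, size=200)           # stand-in for manual labels

acc = (live_predictions[sample_idx] == human_labels).mean()
stderr = np.sqrt(acc * (1 - acc) / len(sample_idx))
print(f"estimated live accuracy: {acc:.3f} +/- {1.96 * stderr:.3f} (95% CI)")
```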
Category: Data Science

Is there a Mean Average Recall for Item Retrieval/ Recommendation Systems?

Mean Average Precision for Information retrieval is computed using Average Precision @ k (AP@k). AP@k is measured by first computing Precision @ k (P@k) and then averaging the P@k only for the k's where the document in position k is relevant. I still don't understand why the remaining P@k's are not used, but that is not my question. My question is: is there an equivalent Mean Average Recall (MAR) and Average Recall @ k (AR@k)? Recall @ k (R@k) is …
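A sketch of AP@k alongside a symmetric AR@k, using one common convention (averaging only over the positions where the item is relevant, as described above); the relevance list is illustrative.

```python
# Sketch: AP@k and a symmetric "AR@k", averaging P@k / R@k over the
# positions k where the ranked item is relevant.
def precision_at_k(rel, k):
    return sum(rel[:k]) / k

def recall_at_k(rel, k, n_relevant):
    return sum(rel[:k]) / n_relevant

def average_precision_at_k(rel, k, n_relevant):
    hits = [precision_at_k(rel, i + 1) for i in range(k) if rel[i]]
    return sum(hits) / min(n_relevant, k) if hits else 0.0

def average_recall_at_k(rel, k, n_relevant):
    hits = [recall_at_k(rel, i + 1, n_relevant) for i in range(k) if rel[i]]
    return sum(hits) / min(n_relevant, k) if hits else 0.0

rel = [1, 0, 1, 0, 0, 1]          # relevance of the ranked results
print(average_precision_at_k(rel, k=6, n_relevant=3))
print(average_recall_at_k(rel, k=6, n_relevant=3))
```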
Category: Data Science

Repeatability tests for machine learning models (in the sense of measurement system analysis)

To analyze a machine learning model, we usually calculate performance metrics (such as accuracy...) and, during the validation step, make sure that the model has not overfitted. We can consider a machine learning model (for example, a machine vision model) that is deployed to an industrial system to perform a classification (e.g., defect detection) task as a measurement device. From this point of view, I would like to know whether performing "measurement system analysis", and specifically repeatability tests, is necessary. …
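A hedged sketch of what a repeatability check in the MSA sense could look like for such a classifier: the same physical parts are measured (e.g., re-imaged and re-classified) several times, and we look at how consistently each part receives the same label. The data here is hypothetical.

```python
# Sketch of a repeatability check (hypothetical data): rows are repeated
# measurement runs of the same parts, columns are parts, values are the
# predicted classes from the deployed classifier.
import numpy as np

predictions = np.array([
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1],
])

def per_part_agreement(preds):
    """Fraction of repeats matching the most frequent label for each part."""
    agree = []
    for col in preds.T:
        counts = np.bincount(col)
        agree.append(counts.max() / len(col))
    return np.array(agree)

agreement = per_part_agreement(predictions)
print("per-part agreement:", agreement)           # e.g. [1. 1. 0.67 1. 1.]
print("repeatability score:", agreement.mean())
```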
Category: Data Science

Comparison of performance of regression models for multi-regression tasks

I have a sample time-series dataset of shape (23, 14291), a pivot table of counts over 24 hours for some users. After pre-processing, I have a dataset of shape (23, 200). I filtered out columns/features that don't have a time-series nature, keeping meaningful ones either with PCA (to retain those carrying most of the variance) or with a correlation matrix (to exclude highly correlated columns/features). I took advantage of MultiOutputRegressor() and predicted all columns for a certain range of time …
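A sketch of one way to compare candidate regressors for such a multi-output task, on synthetic data: wrap each base estimator in MultiOutputRegressor, fit on the same split, and report per-output and mean R². The estimators and shapes are placeholders.

```python
# Sketch (synthetic data): compare base estimators wrapped in
# MultiOutputRegressor using per-output R^2 on a shared split.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X, Y = make_regression(n_samples=200, n_features=30, n_targets=5, noise=5.0,
                       random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, base in candidates.items():
    model = MultiOutputRegressor(base).fit(X_tr, Y_tr)
    per_output = r2_score(Y_te, model.predict(X_te), multioutput="raw_values")
    print(name, "per-output R^2:", np.round(per_output, 3),
          "mean:", round(per_output.mean(), 3))
```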
Category: Data Science

Uncertainty about shape of ROC curve

I am working on a binary classification problem, and the ROC curves that I am plotting for evaluation, together with AUC, seem strange to me. Here is an example. I understand that the ROC curve is a visual representation of the true positive rate versus the false positive rate. When plotting the confusion matrix I can see there is a significant number of false negatives and false positives alike: I fail to understand how it is possible that the ROC curve only …
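For reference, a sketch (synthetic data) of how the curve is normally built: it is traced by sweeping a threshold over continuous scores (predict_proba / decision_function), so it has many points; building it from hard 0/1 predictions collapses it to very few points, which can make the plot look odd. Whether that applies here is only a guess.

```python
# Sketch (synthetic data): ROC from continuous scores vs. from hard labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
print("points on the curve:", len(thresholds), "AUC:", roc_auc_score(y_te, scores))

# Using hard predictions instead collapses the curve to very few points.
fpr_h, tpr_h, thr_h = roc_curve(y_te, clf.predict(X_te))
print("points with hard labels:", len(thr_h))
```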
Category: Data Science

Baseline result is much better than state-of-the-art model

I am researching Deep Learning based Intrusion Detection Systems. I found a paper in a well-known journal that is considered a state-of-the-art method in this research area because it has many citations. In the paper, they proposed using Inception Resnet v4 to solve the problem and obtained the lowest error rate compared to other studies. I am developing a new method using their data pre-processing idea. First, I built a baseline, which is a very simple and shallow …
Category: Data Science

Evaluation Metric for Imbalanced and Ordinal Classification

I'm looking for an ML evaluation metric that would work well with imbalanced and ordinal multiclass datasets. Imagine you want to predict the severity of a disease that has 4 grades of severity, where 1 is mild and 4 represents the worst outcome. Realistically, this dataset would have the vast majority of patients in the mild zone (classes 1 or 2) and fewer in classes 3 and 4 (an imbalanced/skewed dataset). Now in the example, a classifier that predicts a …
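One candidate in this family (not the only option) is Cohen's kappa with quadratic weights: it respects the ordering by penalising a 1-vs-4 confusion far more than a 2-vs-3 confusion, and it corrects for chance agreement on a skewed label distribution. A tiny sketch with illustrative labels:

```python
# Sketch of one candidate metric for ordered severity grades:
# quadratically weighted Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

y_true = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 2, 2, 1, 3, 4, 3, 4]   # illustrative predictions

print("quadratic weighted kappa:",
      cohen_kappa_score(y_true, y_pred, weights="quadratic"))
```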
Category: Data Science

Choose ROC/AUC vs. precision/recall curve?

I am trying to get a clear understanding on various classification metrics, including knowing when to choose ROC/AUC as opposed to opting for the Precision/Recall curve. I am reading Aurélien Géron's Hands-On Machine Learning with Scikit-Learn and TensorFlow book (page 92), where the following is stated: Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve …
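To see the contrast the book refers to, it can help to compute both curves on the same imbalanced problem; a sketch on synthetic data:

```python
# Sketch (synthetic, imbalanced data): ROC AUC vs. PR-based average precision
# for the same classifier and test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)
prec, rec, _ = precision_recall_curve(y_te, scores)
print("ROC AUC:", roc_auc_score(y_te, scores))                 # can look optimistic
print("average precision (PR AUC):", average_precision_score(y_te, scores))
```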
Category: Data Science

How to evaluate model accuracy at tail of empirical distribution?

I am fitting a nonlinear regression on a stationary dependent variable and I want to precisely forecast extreme values of this variable. So when my model predicts extreme values, I want them to be highly accurate. Less extreme forecasts (e.g., those near the mean) don't need to be as accurate. What are some useful metrics with favorable statistical properties for comparing multiple models when tail accuracy matters?
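A sketch of one possible tail-focused metric (an assumption, not something from the question): an error measure restricted to observations beyond a high quantile of the target, so that only the extreme region contributes.

```python
# Sketch (synthetic data): RMSE computed only on the upper tail of the target.
import numpy as np

def tail_rmse(y_true, y_pred, q=0.95):
    """RMSE restricted to observations above the q-th quantile of y_true."""
    cutoff = np.quantile(y_true, q)
    mask = y_true >= cutoff
    return np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2))

rng = np.random.default_rng(0)
y_true = rng.standard_t(df=3, size=5000)            # heavy-tailed target
y_pred = y_true + rng.normal(scale=0.5, size=5000)  # stand-in forecasts

print("overall RMSE:", np.sqrt(np.mean((y_true - y_pred) ** 2)))
print("tail RMSE (top 5%):", tail_rmse(y_true, y_pred))
```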
Category: Data Science

Quantitative measure of the smoothness of learning curves

$\DeclareMathOperator{\loss}{loss}$ $\DeclareMathOperator{\AvgVar}{AvgVar}$ Let's say we have some deep learning task. We have our model and two sets of hyperparameters $A$ and $B$. We train both systems for 10000 mini-batches and obtain two learning curves (losses on these training batches). Is there any quantitative measure of the smoothness of a learning curve? I have seen a few times in articles that the authors just overlay the two curves to show that one is smoother than the other, but obviously it would be …
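A sketch of one simple, assumed smoothness measure in the spirit of the $\AvgVar$ operator above: the average variance of the loss inside a sliding window, compared across the two runs (lower means smoother). The curves below are synthetic.

```python
# Sketch: average sliding-window variance of the training loss as a
# smoothness measure for two synthetic learning curves.
import numpy as np

def avg_var(losses, window=50):
    """Mean of the loss variance over a sliding window; lower = smoother."""
    losses = np.asarray(losses, dtype=float)
    return np.mean([losses[i:i + window].var()
                    for i in range(len(losses) - window + 1)])

rng = np.random.default_rng(0)
steps = np.arange(10_000)
curve_a = np.exp(-steps / 3000) + rng.normal(scale=0.02, size=steps.size)
curve_b = np.exp(-steps / 3000) + rng.normal(scale=0.10, size=steps.size)

print("AvgVar A:", avg_var(curve_a))   # smoother run
print("AvgVar B:", avg_var(curve_b))   # noisier run
```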
Category: Data Science

How can I adapt the accuracy metric for multiclass classification?

I have a multiclass problem with 4 classes. I would like a custom metric to assess the model where predicting class 3 as class 2, or class 2 as class 3 (i.e., the classes in the middle), is penalized less than other errors. How can I do this by adapting the sklearn accuracy_score metric or similar? e.g. comparing: predicted_labels = [1,3,0,0,2..] actual = [0,0,2,1,3,3...]
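A sketch of one way to do this (the weighting scheme is an assumption): score each prediction through a credit matrix that gives full credit for exact matches and partial credit only for 2↔3 confusions, then average. The label lists below are illustrative, not the truncated ones from the question.

```python
# Sketch: accuracy with partial credit for confusing classes 2 and 3.
import numpy as np
from sklearn.metrics import make_scorer

# credit[i, j] = score awarded when true class i is predicted as class j
credit = np.eye(4)
credit[2, 3] = credit[3, 2] = 0.5      # the "middle" confusion is penalised less

def soft_accuracy(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return credit[y_true, y_pred].mean()

predicted_labels = [1, 3, 0, 0, 2, 3]  # illustrative labels
actual           = [0, 0, 2, 1, 3, 3]
print(soft_accuracy(actual, predicted_labels))   # 0.25 for these labels

# Can be plugged into sklearn model selection if needed:
soft_accuracy_scorer = make_scorer(soft_accuracy)
```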
Category: Data Science

How to calculate mAP for multi-label classification using output predictions?

I have a model which predicts the actions happening in a video clip. Once I get these predictions, I use some rules (a set of if-else conditions) to come up with composite labels, e.g. action1_before_action2, action4_during_action5, etc. I also have the ground truth for these composite labels. How do I calculate the mAP score using my composite predictions? Notice that for my composite predictions I do not have sigmoid values. More details: I have an action classification model that outputs the …
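For reference, a sketch (assumed data) of how mAP is usually computed for multi-label outputs: the mean of per-label average precision, which requires a confidence score per composite label (e.g., some combination of the underlying action probabilities) rather than hard 0/1 rule outputs.

```python
# Sketch (assumed data): mAP as the mean of per-label average precision,
# using hypothetical confidence scores for each composite label.
import numpy as np
from sklearn.metrics import average_precision_score

# rows = clips, columns = composite labels (e.g. action1_before_action2, ...)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.6],
                    [0.1, 0.8, 0.3],
                    [0.7, 0.4, 0.2],
                    [0.2, 0.1, 0.9]])   # hypothetical per-label confidences

ap_per_label = [average_precision_score(y_true[:, j], y_score[:, j])
                for j in range(y_true.shape[1])]
print("AP per composite label:", np.round(ap_per_label, 3))
print("mAP:", np.mean(ap_per_label))
```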
Category: Data Science

Assess feature importance in Keras for one-hot-encoded categorical features

An important aspect of tuning a model is assessing feature importance. In Keras, how can I assess the importance of a categorical feature that is one-hot encoded? E.g., if a categorical feature is ice_cream_colour with a cardinality of 12, then I can assess the individual importances of ice_cream_colour_blue, ice_cream_colour_red, etc., but how do I do it for the entire ice_cream_colour feature? A naïve approach would be to sum all individual importances, but this assumes that the relationship between distinct feature importances is …
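One model-agnostic option is group permutation importance: shuffle all one-hot columns of the categorical feature together and measure the drop in the evaluation metric. A sketch under assumed data; `predict_fn` can be any model's predict (e.g., a Keras model's `model.predict` with an argmax), while a sklearn classifier stands in here so the snippet runs without TensorFlow, and the column indices are hypothetical.

```python
# Sketch: permutation importance for a whole group of one-hot columns.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def group_permutation_importance(predict_fn, X, y, group_cols, metric,
                                 n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict_fn(X))
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        perm = rng.permutation(len(X))
        # permute the whole group of columns together, keeping rows intact
        X_perm[:, group_cols] = X_perm[perm][:, group_cols]
        drops.append(baseline - metric(y, predict_fn(X_perm)))
    return np.mean(drops)

X, y = make_classification(n_samples=500, n_features=15, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

one_hot_cols = [3, 4, 5]   # hypothetical columns of one one-hot-encoded feature
imp = group_permutation_importance(clf.predict, X, y, one_hot_cols, accuracy_score)
print("importance of the whole categorical feature:", imp)
```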
Category: Data Science
