How do I conduct a t-test for comparing the accuracy of two binary classifiers?

I am using two binary classifiers and measuring their prediction accuracy on the samples of a dataset. Accuracy is defined as the ratio of correct predictions to all predictions. Do I need to collect accuracies over multiple experiments and use them as the data for the t-test? Can someone please explain? Also, what will the result of the t-test convey? Thanks in advance.
Category: Data Science
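
One common way to approach the classifier question above is a paired t-test on accuracies measured over the same set of runs or folds. The sketch below is only an illustration under that assumption: the accuracy values are made up, and `paired_t_statistic` is a hypothetical helper name. It uses only the Python standard library and returns the t statistic and degrees of freedom; a p-value would come from a t distribution table or a library routine such as `scipy.stats.ttest_rel`.

```python
import math
import statistics

# Hypothetical accuracies of two classifiers on the same 5 folds
# (made-up numbers for illustration, not taken from the question).
acc_a = [0.81, 0.79, 0.84, 0.80, 0.83]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.80]


def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for a paired t-test on x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    # Work on the per-fold differences, since the folds are shared.
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


t_stat, dof = paired_t_statistic(acc_a, acc_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
# With 4 degrees of freedom, the two-sided 5% critical value is about 2.776,
# so |t| above that would suggest the mean accuracies differ.
```

A large |t| only says that the mean difference in accuracy is unlikely to be zero under the test's assumptions; accuracies from overlapping cross-validation folds are not fully independent, so the result is best read with some caution.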

Can experiments that use a confidence interval be called a statistical test?

I am working on an algorithm whose results are compared with another model's using a 90% confidence interval. Can this be called a statistical test? I read an article that mentioned a statistical test with some confidence level. Is a confidence level the same as a confidence interval in statistical tests?
Category: Data Science

What kind of statistical test can be performed on a recommender system dataset used to predict movie ratings?

The dataset consists of thousands of users, and each row consists of a user_id, a movie_id, and the rating the user gave to the movie, e.g. 1,56,5. In my experiment I am calculating the MSE and precision using a collaborative filtering model. The error comes from the difference between the predicted and actual ratings. I now want to conduct a statistical test. Which statistical test should be performed, and how? Thanks in advance.
Category: Data Science

A/B Testing (Binomial Distribution vs Random Distribution)

When performing an A/B test on the number of clicks from users viewing two variants of an ad (each view is an impression), a binomial distribution can be assumed where each variant has a constant click-through rate. Example: two ads. Ad one has 1000 impressions and 20 clicks, so its CTR is 2%; ad two has 900 impressions and 30 clicks, so its CTR is 3.3%. Test whether there is a difference in click-through rate (CTR) between ads one and two. …
Category: Data Science
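
For the two-ad example above, one standard option is a pooled two-proportion z-test, which relies on the normal approximation to the binomial. The sketch below is a minimal illustration of that choice, not necessarily the method the question goes on to use; `two_proportion_z_test` is a hypothetical helper name, and the inputs are the counts quoted in the excerpt.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(clicks_1, views_1, clicks_2, views_2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1 = clicks_1 / views_1
    p2 = clicks_2 / views_2
    # Under the null hypothesis both ads share one click-through rate.
    pooled = (clicks_1 + clicks_2) / (views_1 + views_2)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / views_1 + 1 / views_2)
    )
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Counts from the example: 20 clicks / 1000 views vs 30 clicks / 900 views.
z, p_value = two_proportion_z_test(20, 1000, 30, 900)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

With these counts the p-value comes out near 0.07, so the difference would not be called significant at the usual 5% level, although it would be at 10%.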

PSI: when not to use it

From what I understand, PSI is used for continuous data. Generally, equal-sized bins are created to compare two data sets, and the number of buckets is usually 10. Is there a reason for that? Why 10 buckets? Also, I was wondering whether PSI can be used for categorical data with fewer than 10 values. For categorical variables, what approach would be best for estimating the shift in the population?
Category: Data Science
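
As a rough illustration of the categorical case raised above, the population stability index can be computed with each category acting as its own bucket, using the usual formula PSI = sum((actual - expected) * ln(actual / expected)) over the buckets. The sketch below assumes that setup; the function name `psi`, the sample data, and the small `eps` floor for categories missing from one sample are all illustrative assumptions.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population stability index with one bucket per category.

    `eps` is an assumed floor that avoids log(0) when a category
    is absent from one of the two samples.
    """
    exp_counts = Counter(expected)
    act_counts = Counter(actual)
    total = 0.0
    for category in set(expected) | set(actual):
        e = max(exp_counts[category] / len(expected), eps)
        a = max(act_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


# Made-up samples: category shares move from 50/30/20 to 40/30/30.
baseline = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
current = ["A"] * 40 + ["B"] * 30 + ["C"] * 30

print(f"PSI = {psi(baseline, current):.4f}")
```

Because every category is a bucket, nothing here depends on choosing 10 bins; the choice of `eps` does affect the result when a category is missing, so it is worth stating whenever a PSI value is reported.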

Drastic drop in Somers' D? Why?

I am trying to find the correlation between the ratings assigned by two coaches to the same group of 40 players. I have tabulated the results as below: The Somers' D is 50%. However, for the case below, the Somers' D is 94.7%. My question is: when both scenarios have 2 deviations, why does the first scenario have a much lower Somers' D than the second?
Category: Data Science

Analysis for basic weight training analysis?

TL;DR: I'm doing a fairly basic project which involves exercise. It seems that descriptive statistics and basic data vis (ex: line graph) would be most appropriate for this project, but I wonder if you have any recommendations for analyses. For this project, I am performing the same set of 15 single-joint exercises each week (we'll call these "Exercises"). Every 4 weeks, I'm performing 3 different multi-joint exercises (we'll call these "Lifts"). My goals are to: Track my progress (strength gains) …
Category: Data Science

Insights between two columns/variables in a DataFrame

I have data in two columns: one is a range of old credit scores (Input score range) and the other is the new credit score (cvsc100). How do I find insights from both of them, given that the old column is a range of values and the other column (CVSC100) is not? I know how to calculate the Pearson correlation between two DataFrame columns in Python, but I believe that is not sufficient. How should I proceed? Please advise.
Category: Data Science

When do I need statistical significance testing and when not?

Hi there, I have a handful of questions regarding statistical significance testing. As a newcomer, I sometimes run into topics that I do not entirely understand. One of them is checking for statistical significance. For example, when I do A/B testing, I understand that I have to check whether my results are statistically significant (p-value test) before looking at effect sizes. Question 1: Do I only do statistical significance tests in the context of hypothesis testing? …
Category: Data Science

p-value and effect size

Is it correct to say that the lower the p-value, the larger the difference between the means of the two groups in the t-test? For example, if I apply the t-test to two groups of measurements A and B, and then to two groups of measurements B and C, and I find that the p-value in the first case is lower than in the second, could one possible interpretation be that the difference between the …
Category: Data Science
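
To make the p-value versus effect size question above concrete, the sketch below computes a two-sample t statistic and Cohen's d from summary statistics for two equally sized groups with a common standard deviation. All numbers are made up and `t_and_cohens_d` is a hypothetical helper; the point is only that the same mean difference can give very different t statistics, and hence p-values, depending on sample size.

```python
import math


def t_and_cohens_d(mean_diff, pooled_sd, n_per_group):
    """Return (t statistic, Cohen's d) for two equal-sized groups.

    Assumes both groups share the same standard deviation `pooled_sd`.
    """
    standard_error = pooled_sd * math.sqrt(2 / n_per_group)
    t_stat = mean_diff / standard_error
    cohens_d = mean_diff / pooled_sd  # effect size, independent of n
    return t_stat, cohens_d


# Same difference in means (0.5) and spread (2.0), different sample sizes.
t_small, d_small = t_and_cohens_d(0.5, 2.0, 20)
t_large, d_large = t_and_cohens_d(0.5, 2.0, 2000)

print(f"n = 20:   t = {t_small:.3f}, d = {d_small:.2f}")
print(f"n = 2000: t = {t_large:.3f}, d = {d_large:.2f}")
```

The effect size is identical in both cases, while the t statistic grows with the square root of the sample size, so a lower p-value alone does not show that the difference between the means is larger.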

How do you determine if a value is statistically significant?

I have collected some data I need to analyze. The data is the result of a survey in which I asked approx. 180 sellers at a bazaar, how important a certain buyer's characteristic is in relation to their price setting on a scale from '1 = absolutely unimportant' to '10 = extremely important' (for instance, how important is a buyer's nationality in relation to the price a merchant is offering his goods). I now have analyzed my results and clustered …
Category: Data Science

How do I handle string features when building a model?

I have data that looks like this: shift_id user_id status organization_id location_id department_id open_positions city zip role_id specialty_id latitude longitude years_of_experience 2 9 S 1 1 19 1 brooklyn 48001 2 9 42.643 -82.583 6 60 S 12 19 20 1 test 68410 3 7 40.608 -95.856 9 61 S 12 19 20 1 new york 48001 1 7 42.643 -82.583 10 60 S 12 19 20 1 test 68410 3 7 40.608 -95.856 21 3 S 1 1 19 …
Category: Data Science

Calculate rate from related datasets

I have the monthly sales rate for various products. The products are sold in different countries. I'm looking for a meaningful way to calculate the sales rate for each country. The sales rate indicated below is across all countries.

Product   Global Sales Rate
Pen       9
Pencil    4

Product   Country Sold
Pen       India
Pen       Australia
Pencil    Italy
Pencil    Japan

When there is a new product launch, the business team creates an opportunity including products similar to the one being launched. I know …
Category: Data Science

evaluation metrics for multiple values per session

I have an application that executes my foo() function several times for each user session. There are 2 alternative algorithms that I can implement as the "foo" function, and my goal is to evaluate them based on execution delay. The number of times foo() is called per user session is variable but will not exceed 10000. Say the delay values are: Algo1: [ [12, 30, 20, 40, 24, 280] , [13, 14, 15, 100], [20, 40] ] Algo2: [ [1, 10, …
Category: Data Science

Comparing data sets with different measurements

I'm currently writing a thesis on cyber crime; however, I'm unsure of the proper way to compare/analyse my data sets in order to discuss them in my thesis. One source (https://www.pandasecurity.com/mediacenter/src/uploads/2014/07/Pandalabs-2015-anual-EN.pdf on page 9) states that the 'infection rate' of Sweden is 20.88% (bottom 3 ranking), the USA 29.48% (middle ranking), and China (first rank) 57.24%. Another (http://www.virusradar.com/en/home/world) uses a different measurement from the one above to define the 'threat rates', which has …
Category: Data Science

Time Series Analysis for Categorical Data Output

Suppose I have a dataset that consists of a date column, a categorical column containing 4 different fruits, and an output column of 0s and 1s indicating whether the particular fruit was sold at that time. Based on this, can I predict a pattern, such as what the selling status of a particular fruit will be after some years? How do I do time series analysis for such categorical data? Any …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.