A/B testing with non-Gaussian distributions

I have two sets of samples (A, B) with a relatively high number (~10,000) and I want to see if a factor has affected sample B or not. Naturally, I should use A/B testing. The problem is, the distributions are not normal and I'm interested in the maximum change, not the mean values! So if all you know is how CLT is gonna make everything Gaussian, this is a good point to stop and move on to the next question. …
Category: Data Science

A/B Testing (Binomial Distribution vs Random Distribution)

When performing an A/B test for the number of clicks for users viewing (each view is an impression) two variants of an ad, a binomial distribution can be assumed where each variant has a constant click-through rate. Example with two ads: Ad one has 1000 impressions and 20 clicks, so its CTR is 2%; Ad two has 900 impressions and 30 clicks, so its CTR is 3.3%. Test whether the click-through rate (CTR) differs between Ads one and two. …
Category: Data Science
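
One common way to test the CTR example above is a pooled two-proportion z-test. The sketch below is only an illustration using the counts quoted in the question (20/1000 and 30/900); the function name is made up here, and the normal approximation is an assumption that is usually considered reasonable at these sample sizes.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided pooled z-test for a difference between two click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both ads share one CTR.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts taken from the example in the question.
z_stat, p_val = two_proportion_z_test(20, 1000, 30, 900)
print(f"z = {z_stat:.2f}, p = {p_val:.3f}")
```

With these counts the z statistic comes out a little above 1.8, so the difference would fall just short of significance at the usual 5% level under this test.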

Hypothesis test for classification model

I have a model that outputs 0 or 1 for interest/not-interest in a job. I'm doing an A/B/C test comparing two models (treatment groups) and none (control group). ANOVA for hypothesis testing and t-test with Bonferroni correction for posthoc testing is my plan. But both tests assume normality. Can we have normality for 0 and 1? If so, how? If not, what's the best test (including posthoc)?
Category: Data Science
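
For a 0/1 outcome across three groups, a chi-square test of independence is one frequently suggested alternative to ANOVA, with pairwise two-proportion z-tests and a Bonferroni correction as the post hoc step. The sketch below assumes that approach; the counts are hypothetical (they are not from the question) and the function names are invented for illustration.

```python
import math
from itertools import combinations

# Hypothetical counts, NOT from the question: (interested, total) per group.
groups = {"control": (100, 1000), "model_1": (120, 1000), "model_2": (150, 1000)}

def chi_square_three_groups(groups):
    """Chi-square test of independence for 3 groups with a binary outcome."""
    assert len(groups) == 3, "the closed-form p-value below assumes 2 degrees of freedom"
    overall = sum(pos for pos, _ in groups.values()) / sum(n for _, n in groups.values())
    stat = 0.0
    for pos, n in groups.values():
        exp_pos, exp_neg = n * overall, n * (1 - overall)
        stat += (pos - exp_pos) ** 2 / exp_pos + ((n - pos) - exp_neg) ** 2 / exp_neg
    # For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    return stat, math.exp(-stat / 2)

def pairwise_p_value(pos_a, n_a, pos_b, n_b):
    """Two-sided pooled two-proportion z-test p-value."""
    pooled = (pos_a + pos_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (pos_b / n_b - pos_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

chi2_stat, chi2_p = chi_square_three_groups(groups)

# Bonferroni: divide the 5% level by the number of pairwise comparisons.
pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)
significant_pairs = [
    (a, b) for a, b in pairs if pairwise_p_value(*groups[a], *groups[b]) < alpha
]
print(chi2_stat, chi2_p, significant_pairs)
```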

A/B test results contradictory with offline machine learning model performance

This seems to be a common problem when bringing machine learning models to production. Let's say we have an optimized machine learning model which achieves a decent performance metric on the unseen test dataset. We are quite satisfied with that, and decide to bring the model online. Then we use an A/B test to compare our website performance (e.g., revenue, customer engagement, etc.) with and without the new model. Somehow, our new model is not a clear winner or even a clear …
Category: Data Science

Can I use multi armed bandits to optimize how much both algorithms are weighted when creating a composite score?

So, I'm aware that multi-armed bandits are great for evaluating multiple models and, from what I understand, they are mainly used to pick a specific model. I would still like to evaluate two models but I want to do it differently. Take a look at this simple equation: W_A * RecoScore_A + W_B * RecoScore_B = CompScore. Rather than optimize for a specific model for a given user, I'd like to optimize for a given set of weights. I'm wondering …
Category: Data Science

Significance testing - repeated observation over multiple days

I work in mobile gaming, and want to analyze A/B test groups, but I believe I'm introducing errors in my calculations. The metric I'm looking at is: number of unique players who engaged in battle that day / number of unique players who were active that day. My data currently has one row per active player per day: date, player_id, group (A/B), and a boolean (0/1) for whether the player engaged in battle that day. I split the groups A/B and take …
Topic: ab-test
Category: Data Science

A/B test on model - split on results

I developed a predictive model that assigns the best product (P1, P2, P3) for each customer. I wanted to compare the conversion rate using this model vs. the as-is deterministic assignment, so I applied an A/B test: I chose the product among P1, P2, P3 using the model on 50% of my users and using the deterministic rules on the other 50%, and then I compared the different conversion rates. My question is: is it correct to split the analysis on …
Category: Data Science

What is the minimum size of the test set?

The mean of a population of binary values can be sampled with about 1000 samples at 95% confidence, and 3000 samples at 99% confidence. Assuming a binary classification problem, why is the 80/20% rule always used, and not the fact that with a few thousand samples the mean accuracy can be estimated with > 95% confidence?
Category: Data Science
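
The sample counts quoted above depend on a margin of error that the question leaves unstated. As a rough sketch, the standard normal-approximation formula n = z^2 * p * (1 - p) / e^2 can be evaluated for an assumed margin; the ±3% margin and the worst-case p = 0.5 below are assumptions made for illustration, not values from the question.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence, p=0.5):
    """Samples needed to estimate a binary mean within +/- margin.

    Uses the normal approximation; p = 0.5 is the worst case (largest variance).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Assumed margin of error of 3 percentage points.
n_95 = sample_size_for_proportion(margin=0.03, confidence=0.95)
n_99 = sample_size_for_proportion(margin=0.03, confidence=0.99)
print(n_95, n_99)
```

At a ±3% margin the 95% figure lands close to the "about 1000" mentioned in the question; a tighter margin or a higher confidence level raises the requirement quickly, because n grows with 1 / e^2.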

Causal Inference where the treatment assignment is randomized

I have mostly worked with observational data where the treatment assignment was not randomized. In the past, I have used PSM and IPTW to balance the groups and then calculate the ATE. My problem: I am now working on a case where the treatment assignment is randomized, meaning there won't be a confounding effect, but the treatment and control groups have different sizes (a bucket imbalance). Should I just analyze the data as it is and run statistical significance and statistical power …
Category: Data Science
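
Under randomization, a plain difference in means is an unbiased ATE estimate even when the buckets differ in size; the imbalance mainly affects the standard error. The sketch below assumes a Welch-style (unequal variance) standard error with a normal approximation for the p-value. The data are simulated and the function name is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance

def difference_in_means(treated, control):
    """ATE estimate with an unequal-variance standard error (normal approximation)."""
    effect = mean(treated) - mean(control)
    # Each group contributes its own variance divided by its own size,
    # so unequal bucket sizes are handled without any reweighting.
    std_err = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    z = effect / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return effect, std_err, p_value

# Simulated, imbalanced buckets (hypothetical data): 200 treated vs 1800 control.
rng = random.Random(0)
treated = [rng.gauss(10.5, 2.0) for _ in range(200)]
control = [rng.gauss(10.0, 2.0) for _ in range(1800)]
effect, std_err, p_value = difference_in_means(treated, control)
print(f"ATE estimate = {effect:.3f}, SE = {std_err:.3f}, p = {p_value:.4f}")
```

The smaller bucket dominates the standard error, which is why an imbalanced split has less power than an even split of the same total size.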

AB testing split algorithm

I want to understand what is the most effective algorithm for splitting. I have ids of users and I want to split them into 2 groups. Now I have 2 variants: Modulo approach - let's say we will place all even ids into one group, odd numbers into another. Pros - for any sequence we will have a uniform distribution of users. So for any day or hour, users that registered during that time will be equally divided between 2 …
Category: Data Science
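
The modulo approach described above can be sketched next to a salted-hash split, which is another commonly used option. The hash variant is added here for comparison only and is not necessarily the second variant the question goes on to describe; the function names and the salt value are hypothetical.

```python
import hashlib

def modulo_group(user_id: int) -> str:
    """Even ids go to group A, odd ids to group B."""
    return "A" if user_id % 2 == 0 else "B"

def hashed_group(user_id: int, salt: str = "experiment-1") -> str:
    """Deterministic split based on a salted hash of the id.

    Changing the salt per experiment reshuffles users, so the same people
    do not always end up in the same group across experiments.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

ids = range(10_000)
share_a_modulo = sum(modulo_group(i) == "A" for i in ids) / len(ids)
share_a_hashed = sum(hashed_group(i) == "A" for i in ids) / len(ids)
print(share_a_modulo, share_a_hashed)
```

Both functions are deterministic, so a user keeps the same group on every visit. The modulo split gives an exact 50/50 balance on consecutive ids, while the hashed split is only approximately balanced but does not depend on how ids were issued.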

Multivariate testing

I'm going to run a test with 4 different variants (3 variants and a control group), and we want to find the variant with the highest conversion. Are there any resources/methods in R/python to: Perform a test to tell if a variant converts significantly better than the others? Calculate sample size before performing this test? Either frequentist or Bayesian methods work for me, thanks! The context is that the amount of data is not huge, I have around 5000 users …
Category: Data Science

Intragroup independence in two groups analysis

I am working on an experiment in which I want to analyze the impact of a treatment on two different groups of customers. Most of the analysis methods I have checked (for example, the t-test) assume intragroup and crossgroup independence. I can assume the crossgroup independence because the two groups are randomly split, but I have some doubts about the meaning of the intragroup independence. We can assume that there is no causal effect of …
Category: Data Science

How do I conduct an experiment on the new pricing if it's impossible to conduct an A/B test?

We want to introduce a new price list for the customers of our international SaaS company. Beforehand we want to test this new price list in several countries. A/B test cannot be conducted here because it is forbidden to have different prices for different customers of the same country. Thus we want to introduce the new pricing policy in several countries and then figure out whether the new one is better than the old one. My questions are: How to …
Category: Data Science

Practical constraints in A/B testing

I saw an article about an A/B test that Google had performed way back. They wanted to decide what shade of blue a button should be and how that affects click-through rate. They divided users randomly into 100 buckets - each corresponding to a shade of blue they wanted to check (so the color is a factor with 100 levels). Now this is all well and good if all the buckets (or "treatment groups") sufficiently represent the target population. In …
Category: Data Science

How to create A/B test segments for highly variable data

I have data with a high degree of variability. My objective is to run an A/B test to check the behavior change due to new changes. Historically, all samples have shown both high and low performance. This means that if I take any 2 cohorts randomly, they show vast differences in their historical comparison. Following is an example of the weekly comparison. The same behavior holds true for monthly and daily comparisons too. W1: -10.04% W2: 3.9% W3: -4.2% W4: -3.7% W5: 5.4% …
Topic: ab-test
Category: Data Science

What is the right approach to bucket users for algorithms with different coverage for A/B testing

I have a couple of recommendation algorithms that I want to A/B test. Algorithm A has 90% user coverage and algorithm B has 95% user coverage. That means if the algorithms are asked to provide recommendations for 1000 users, algorithm A can provide them for 900 of the users and algorithm B for 950 of the users. Say, for example, that out of these 1000 users 87% have recommendations from both algorithms, 3% have recommendations from only algorithm A, and 8% …
Category: Data Science

Recommend System AB test metric events

I built a personal recommendation system for choosing games. On the website's main page, a dedicated spot shows a collection of personal game recommendations. After an A/B test (between 2 recommender systems), I don't understand which events I should collect: only events after a click on a recommendation icon, or all events (recommendation events plus events where the user did not choose a recommended game, since a user can choose a game in other places such as the finder)? For example, one of the metrics is the sum of payments per user per game. Should I collect …
Category: Data Science

A/B testing: How to calculate p-value on post test segments?

My question on A/B testing is about doing post-test segmentation analysis. For example: I run an A/B test on my website to track bounce rate. For the treatment group, I put up a video to explain my company. For the control group, I put up just plain text. I pick a segment of users who are first-time users from the USA to be split 50/50 into the 2 groups. The metric that I am tracking is average bounce rate (assume 20%). Power …
Category: Data Science

Time duration for ML models A/B testing

I am going to perform A/B tests for ML models. However, I am not sure how long I should run them online in order to see a significant difference. What would be the right time frame, and what is the reasoning behind it? The A/B test will run against the non-ML systems. Usually we run tests for non-ML features for 2 weeks max. Thank you
Category: Data Science

Treatment and Control selection in A/B Testing

I'm hoping to get a better understanding of A/B Testing design. In particular, I'm interested in understanding how treatment and control units are selected. I read that these 2 groups are selected randomly (for example, here), but then there are also approaches where after picking the treatment (either randomly or not) the control is selected based on "similarity" to the treatment group. Are both approaches valid and what's the rationale for picking one or the other? For example, Alteryx has …
Category: Data Science
