A/B testing: How to calculate p-value on post test segments?

My question on A/B testing is about doing post test segmentation analysis.

For example:

I run an A/B test on my website to track bounce rate. For the treatment group, I show a video that explains my company. For the control group, I show plain text. I pick a segment of users, first-time users from the USA, and split them 50/50 into the two groups.

The metric I am tracking is the average bounce rate (assume a baseline of 20%).
Power: 0.8
Expected effect size: a 10% relative reduction, so the bounce rate should fall to 20% - 0.10 * 20% = 18%.
The calculated sample size is, say, 1000 for each group.
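
For reference, the sample size for this setup can be sketched with the standard normal-approximation formula for comparing two proportions. The significance level (alpha = 0.05, two-sided) is an assumption, since the question does not state it, and the function name is made up for illustration. With these inputs the approximation gives a figure in the thousands per group, so the 1000 above is best read as a placeholder.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    alpha=0.05 is an assumed significance level (not given in the question).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided test
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


# Baseline bounce rate 20%, expected bounce rate 18% (10% relative reduction)
n_per_group = sample_size_two_proportions(0.20, 0.18)
print(n_per_group)
```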

Say I run the test for the correct amount of time. At the end of the test, I get a p-value of 0.06, so I do not reject the null hypothesis.

However, when I do post-test segmentation analysis, I see, for example, that 44% of the users who signed up for a free trial played the video.

In this case, how do I calculate whether the 44% is significant (while taking the multiple comparison problem into account)? For example, in the Airbnb experiment, they did post-test segmentation analysis by browser type and were able to calculate a p-value.

My approach

Does this mean that for every segment that I want to analyze, I need to have at least 1000 samples? Also, how would I recalculate the p-value, given that the p-value of this A/B test was already calculated above as 0.06?

Topic hypothesis-testing ab-test experiments statistics

Category Data Science


I recently wrote about this in a blog post. Given that this is a rate evaluation metric, you will want to use the z-test. The basic steps are as follows (more details in the blog post):

  1. calculate the pooled standard error of your pairwise comparison
  2. calculate the z-statistic by normalizing the delta or lift by the standard error
  3. look up the CDF value of the normalized delta under the standard normal distribution
  4. p-value = 1 - cdf(z) for a one-sided test, or 2 * (1 - cdf(|z|)) for a two-sided test
  5. given that this also involves multiple comparisons (as in an A/B/n test), you should also apply a multiple testing correction, using the Bonferroni procedure or the Benjamini-Hochberg procedure, when evaluating significance
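
The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the code from the blog post: the function names are made up, the test is assumed to be two-sided, and the counts and segment p-values at the bottom are hypothetical numbers chosen only to show the calls.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x_control, n_control, x_treatment, n_treatment):
    """Two-sided z-test for a difference in rates, using the pooled standard error."""
    p_control = x_control / n_control
    p_treatment = x_treatment / n_treatment
    p_pool = (x_control + x_treatment) / (n_control + n_treatment)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))  # step 1
    z = (p_treatment - p_control) / se                                    # step 2
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))                          # steps 3-4
    return z, p_value


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjustment.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 200 bounces out of 1000 (control) vs 180 out of 1000 (treatment)
z, p = two_proportion_z_test(200, 1000, 180, 1000)

# Hypothetical raw p-values from four post-test segments
segment_p_values = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(segment_p_values))
print(benjamini_hochberg(segment_p_values))
```

Each segment gets its own z-test, and the correction is applied to the full list of segment p-values, including the segments that turned out not to be interesting.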

If you want to know whether a single segment reaches the same level, and you ignore how all other segments behave, then yes, this is the required number of samples for that segment (assuming the baseline performance of the segment is the same as that of the whole population).

As a warning, when you use too many segments, this can happen: https://xkcd.com/882/
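
To make the effect of segment size concrete, here is a rough sketch of the approximate power of a two-sided two-proportion z-test as a function of the per-group sample size. It uses the normal approximation with an unpooled standard error and ignores the negligible opposite tail. The alpha of 0.05 and the function name are assumptions, and the rates are the 20% and 18% from the question.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    se = sqrt((p_control * (1 - p_control) + p_treatment * (1 - p_treatment)) / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_control - p_treatment) / se - z_alpha)


# Power shrinks quickly as a segment gets smaller than the planned sample size
for n in (6000, 3000, 1000, 500):
    print(n, round(approx_power(0.20, 0.18, n), 2))
```

A segment that contains only a fraction of the users has correspondingly lower power to detect the same effect, which is why a non-significant result within a small segment says little on its own.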
