Can we use difference-in-differences with a biased A/B test?

We noticed we had a biased sample in our A/B test and were wondering whether difference-in-differences would help us draw valid conclusions from the data, or whether there is another way to proceed.

We ran a new experiment on our site in which we offered 50% of our users a new feature. We assigned users with odd ids to the experiment group and users with even ids to the control group, and then ran the experiment. However, we saw that even before the experiment started, there was a statistically significant difference between the two groups. We think this is because we have run many earlier experiments segmented on odd/even id, so people in the experiment group have already seen many treatments.

So lesson learned, we'll flip a coin next time instead of using the id. However, we'd like to see if we can still make inferences from the current experiment. I've heard of something called difference-in-differences. Would this work in this case, or is there a different approach that would work better? Ideally, we don't want to scrap the test and start over since 50% of our users have already seen the new feature.

Your treatment group and control group differ in an unknown number of ways, through a number of covariates. They were initially randomly selected, so differences in static covariates (age, demographics) won't be statistically significant (unless you're unlucky). However, their past exposure to different experiences with your company, and any resulting effects of that exposure, are real differences between the groups. Directly measuring and controlling for these differences (stratifying by them, adding them to a model, etc.) would help achieve ignorability (the condition that all covariate differences between the treatment and control groups are accounted for).

Difference in difference is a way to estimate the experiment's effect for a treatment and control group that have different initial values of the key metric, so they are different in at least some way. It accounts for the pre-treatment difference between the control group and the treatment group.

In the usual A/B test, your treatment effect is the post-treatment difference between treatment and control. With difference in difference, we subtract the pre-treatment difference between treatment and control from that post-treatment difference.

Another way to think of it is that we take each group's post-treatment value and subtract its own pre-treatment value before making the final comparison.
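To make the two descriptions above concrete, here is a minimal Python sketch. The function names and the numbers are invented for illustration and are not from your data; it only assumes you have per-user metric values for each group before and after launch.

```python
from statistics import mean

def did_from_gaps(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Post-treatment gap between groups minus pre-treatment gap."""
    post_gap = mean(treat_post) - mean(ctrl_post)
    pre_gap = mean(treat_pre) - mean(ctrl_pre)
    return post_gap - pre_gap

def did_from_changes(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Change in the treatment group minus change in the control group."""
    treat_change = mean(treat_post) - mean(treat_pre)
    ctrl_change = mean(ctrl_post) - mean(ctrl_pre)
    return treat_change - ctrl_change

# Hypothetical metric values (e.g. weekly sessions per user).
treat_pre, treat_post = [10, 12, 11, 13], [13, 15, 14, 16]
ctrl_pre, ctrl_post = [9, 10, 11, 10], [10, 11, 12, 11]

estimate_a = did_from_gaps(treat_pre, treat_post, ctrl_pre, ctrl_post)
estimate_b = did_from_changes(treat_pre, treat_post, ctrl_pre, ctrl_post)
print(estimate_a, estimate_b)
```

Both functions are algebraically the same calculation, so they return the same estimate; which one you use is a matter of which framing is easier to explain.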

The comparison is, of course, a statistical test on the difference between these adjusted values, hence the "difference in difference." You can also take a Bayesian testing approach to get a probability that one is superior to the other, rather than asking the frequentist question of statistical significance.
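As a rough sketch of the frequentist version: assuming you can compute each user's change (post minus pre), one option is a two-sample z-test on those changes. The function name and data below are hypothetical, and the normal approximation is only reasonable for large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def did_z_test(treat_changes, ctrl_changes):
    """Two-sample z-test on per-user changes (post minus pre).

    Returns (estimate, z, two_sided_p). Uses unpooled sample variances
    and a normal approximation, so it is meant for large samples.
    """
    estimate = mean(treat_changes) - mean(ctrl_changes)
    std_err = sqrt(
        variance(treat_changes) / len(treat_changes)
        + variance(ctrl_changes) / len(ctrl_changes)
    )
    z = estimate / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return estimate, z, p_value

# Hypothetical per-user changes; real samples would be far larger.
treat_changes = [3, 2, 4, 3, 3]
ctrl_changes = [1, 1, 2, 0, 1]
estimate, z, p_value = did_z_test(treat_changes, ctrl_changes)
print(estimate, z, p_value)
```

Working with per-user changes keeps each user's own baseline in the calculation, which is why it needs the same users observed before and after.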

However, the following assumption will still apply to either test:

Although difference in difference is a way of handling the pre-treatment difference, it assumes that the post-treatment difference would have been the same as the pre-treatment difference if there had been no treatment. Put another way, the key assumption in difference in difference is that the treatment group would have undergone the same CHANGE in its key metric as the control group did, had the treatment group not received the treatment. The difference between this hypothetical (counterfactual) outcome and the true outcome is taken to be the effect size.

So you need to evaluate this assumption in your context. Do you think your treatment group would have undergone the same change in its metric that the control group underwent? Without knowing much about the covariates that make those groups different, it's hard to say that with any confidence. But to be fair, assumptions like these are commonplace in causal inference.
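One way to probe that assumption, if you have data from two periods before the launch (an assumption on my part), is a placebo check: compute the same difference in difference across the two pre-launch periods, where the new feature cannot have had any effect. The names and numbers below are hypothetical.

```python
from statistics import mean

def placebo_did(treat_early, treat_late, ctrl_early, ctrl_late):
    """Difference in difference across two PRE-launch periods.

    A value near zero is consistent with the groups moving in parallel
    before the feature launched; it does not prove they would have
    continued to do so afterwards.
    """
    early_gap = mean(treat_early) - mean(ctrl_early)
    late_gap = mean(treat_late) - mean(ctrl_late)
    return late_gap - early_gap

# Hypothetical metric values from two weeks before the launch.
treat_week1, treat_week2 = [10, 12], [11, 13]
ctrl_week1, ctrl_week2 = [9, 11], [10, 12]

placebo = placebo_did(treat_week1, treat_week2, ctrl_week1, ctrl_week2)
print(placebo)
```

In your case the earlier odd/even experiments may make the placebo value clearly nonzero, which would itself be a warning that the groups were not moving in parallel.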

In the end, you have to make a business decision about whether you have the time to re-run the A/B test, and whether you can afford to hold off on acting in the meantime. Difference in difference will at least let you account for the initial difference between treatment and control, but you'll have to think about how likely it is that these two groups would have changed by the same amount over this period. If it's a short period and a stable metric, then maybe it's not such a bad assumption.

If the control group's metric didn't change over the time period, then you're effectively just looking at pre vs. post on the treatment group. From that perspective, your measure is more like the Average Treatment effect on the Treated (ATT) than the Average Treatment Effect (ATE) on all users. That is, your experiment will tell you how users like the odd-id users (who received the feature) react to the change, but it won't tell you how users like the even-id users would react to it.

Perhaps you can dive into the differences between even and odd users (e.g. purchasing behavior, etc.), to better understand what you've learned.

Why are you hesitant to start over? A/B testing is a double-edged sword: it is a relatively straightforward exercise (as these things go), but it has the power to completely pivot your business in a very short amount of time. Also keep in mind that the thresholds you are trying to clear can be "temperamental" (for lack of a better word): a z-value of, say, 1.92, just short of the usual 1.96 cutoff for a two-sided 5% test, can't be waved through as "oh, that's close enough."

As a data scientist, I would never want to sign my name to an experiment that I didn't have complete confidence in both the design and execution. If there is anything that can affect your final scores, you should absolutely consider scrapping it and starting over.

At this point in the experiment, nothing you can do can make the two groups equal. What you can do is make assumptions about how the two groups will behave.

If you can safely assume that previous treatments that you have applied using the even/odd sampling method will not interact with the new A/B test, then a difference-in-differences method would be appropriate.

As a simple example of when this assumption (and a difference-in-differences) is inappropriate: Let's say in your first test you send group A (the odds) a book titled "The importance of eating healthy." In the second test, you send group A (the odds) dietary supplements. Group A has been compromised by prior treatments that could impact the effects on the second test.

