Survival Analysis for A/B Tests


A/B tests are an effective way to validate hypotheses about web data and to optimize websites based on the results. An A/B test is essentially an experiment on visitors to a website in which each visitor is randomly assigned to a control or treatment condition. Depending on how the test is conducted, many different inferences can be drawn from the data.

A branch of statistics called Survival Analysis provides a set of methods for analyzing time-to-event data. Time-to-event data records whether a distinct (binary) event occurred and how long it took to occur. For example, suppose you owned a banana stand. You could track each person who visited the stand and record how long it took them to buy a banana. The binary event would be buying the banana or not, and the time-to-event would be the time it took them to buy the banana or to pass on it entirely. While this example is a bit contrived, it illustrates the structure of survival data: there is a binary event, and there is the time to that event.
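
To make the banana stand concrete, here is a minimal sketch of how such data might be represented in R with the survival package. The visitors, the column names `minutes` and `bought`, and the values are all made up for illustration. In standard survival terminology, a visitor who leaves without buying is a "censored" observation.

```r
library(survival)  # recommended package that ships with standard R installs

# Hypothetical banana-stand visitors: minutes observed and whether they bought.
visits <- data.frame(
  minutes = c(2, 5, 5, 8, 12, 15),
  bought  = c(1, 0, 1, 1, 0, 1)   # 1 = bought a banana, 0 = left without buying
)

# Surv() pairs each time with its event indicator. When printed, a trailing
# "+" marks visitors who left without buying (censored observations).
banana_surv <- Surv(time = visits$minutes, event = visits$bought)
print(banana_surv)
```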

Survival Analysis techniques are used in a wide array of applications, including machine failure and medical experiments. A/B testing is another area where Survival Analysis can be applied. A/B tests have treatment and control groups, and many of the events measured on a web page are time-to-event in nature. For example, one can test how two experiences influence site abandonment, or “bounce rate”. Survival Analysis allows one to compare how the outcome varies over time between groups and to partition that comparison along very specific lines, such as time of day or device.

In order to use Survival Analysis, we need individual-level data. This may be difficult to obtain depending on the testing platform being used, but individual-level data provides much deeper insight than grouped data does. Time-to-event data and data indicating which group (experimental or control) each participant was in are also necessary. Once these conditions are met, any additional data can be used to split the test into smaller groups for analysis.
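
As a rough illustration of these requirements, the sketch below builds a tiny individual-level table and checks that the required columns are present. Every column name (`visitor_id`, `group`, `seconds`, `bounced`, `device`) and every value is an assumption for the example, not the layout of the data used in this blog.

```r
# Hypothetical individual-level export: one row per visitor.
# All column names and values are assumptions for illustration.
ab_data <- data.frame(
  visitor_id = 1:6,
  group      = c("A", "B", "A", "B", "A", "B"),  # control vs. treatment
  seconds    = c(12, 45, 7, 60, 30, 90),         # time on site
  bounced    = c(1, 1, 1, 0, 1, 0),              # 1 = left the site, 0 = still there when observation ended
  device     = c("web", "mobile", "mobile", "web", "web", "mobile")
)

# The columns a survival analysis of an A/B test needs: group, time, and event.
required <- c("group", "seconds", "bounced")
stopifnot(all(required %in% names(ab_data)))

# Any remaining column is a candidate for splitting the test into segments.
segment_vars <- setdiff(names(ab_data), c("visitor_id", required))
print(segment_vars)
print(table(ab_data$group, ab_data$device))
```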

In this blog, I will provide some examples of how one can make these segmentations and comparisons in R, the statistical programming language. In each section, I refer to the code as a numbered Code Figure and to the plots as a numbered Figure. If you would like the complete code or data used in this example, please contact me.

The data used in this blog is from an A/B test with a sample size of 1,958. Metadata was collected on each user, including browser type and a time stamp of when they entered the site, among other metrics. For the purposes of this blog, I will only use the browser type and time stamp for analysis.

Main Test
In this test I want to show the difference in bounce rate between a control and a test group. This can be done with a few simple lines of code in R. In Code Figure 1, I fit a Kaplan-Meier curve and then plot the result, which gives a visual depiction of survival in the two groups. A Kaplan-Meier curve shows the estimated probability of surviving past each point in time. In this example, the curve gives the probability of surviving, or “sticking”, for each group at any specific point in time. A simple way to interpret the curve: the further a curve is pushed toward the top right-hand corner of the chart, the better that group is at sticking to the site. In Figure 1, the Kaplan-Meier curves show that experience B has the better stick rate. For the majority of the time, users assigned to experience B stick to the site more often than users assigned to experience A. It may be difficult to determine statistical significance just by looking at this plot. I elaborate on this in the Device section, where I share some statistical tests that may be helpful in determining it.
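
For readers who want to try this, a Kaplan-Meier fit along these lines might look like the sketch below. It runs on simulated data, not the data from this test. The column names, the exit rates, and the 120-second observation window are assumptions chosen only so the example runs on its own.

```r
library(survival)

set.seed(1)
n <- 200
# Simulated stand-in data: seconds on site before leaving.
# Experience B is given a lower exit rate purely for illustration.
sim <- data.frame(
  group = rep(c("A", "B"), each = n / 2),
  time  = c(rexp(n / 2, rate = 1 / 30), rexp(n / 2, rate = 1 / 45))
)
# Sessions still active at 120 seconds are treated as censored.
sim$bounced <- as.integer(sim$time <= 120)
sim$time    <- pmin(sim$time, 120)

# One Kaplan-Meier curve per experience.
km_fit <- survfit(Surv(time, bounced) ~ group, data = sim)

plot(km_fit, col = c("red", "blue"), lty = 1,
     xlab = "Seconds on site", ylab = "Probability of still being on site")
legend("topright", legend = names(km_fit$strata),
       col = c("red", "blue"), lty = 1)

# Estimated probability of still being on the site at 30 seconds, per group.
print(summary(km_fit, times = 30)$surv)
```

A curve that sits higher and further to the right corresponds to a group that sticks to the site longer.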


Time of day
Since a time stamp was captured when each visitor arrived at the site, it is possible to use these same methods to see whether people who visit early in the day behave differently from those who visit later. I create two groups based on arrival time: before 5 PM for “day” and after 5 PM for “night” (Code Figure 2). If more granularity is required, groups can be created for whatever time segments are desired using the same code; however, for the purposes of this blog, two groups will suffice. In Figure 2, there is an increased stick rate for experience B at night compared to the other three combinations. The other three combinations perform similarly, with A-Night performing slightly better.
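
A day/night split like this could be sketched as follows, again on simulated data. The arrival date, the column names (`arrived`, `daypart`), and the assumption that every simulated visitor eventually bounces are illustrative choices, not properties of the original data.

```r
library(survival)

set.seed(2)
n <- 400
sim <- data.frame(
  group   = sample(c("A", "B"), n, replace = TRUE),
  # Hypothetical arrival time stamps spread over a single day (UTC).
  arrived = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") +
            runif(n, 0, 24 * 3600 - 1),
  time    = rexp(n, rate = 1 / 30),  # seconds on site
  bounced = 1L                       # every simulated visitor eventually leaves
)

# Label arrivals before 17:00 as "day" and the rest as "night".
hour <- as.integer(format(sim$arrived, "%H"))
sim$daypart <- ifelse(hour < 17, "day", "night")

# One Kaplan-Meier curve per experience and time-of-day combination.
fit <- survfit(Surv(time, bounced) ~ group + daypart, data = sim)

plot(fit, col = 1:4, lty = 1,
     xlab = "Seconds on site", ylab = "Probability of still being on site")
legend("topright", legend = names(fit$strata), col = 1:4, lty = 1)
```

Finer segments would only require replacing the `ifelse()` call with something like `cut()` on the hour.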


Figure 2

Device
Visitors to a site often use different devices, including smartphones, tablets, and desktops. When capturing this data, these differences were simplified into two groups: “web” refers to desktop computer users and “mobile” refers to users on any other device. One could partition these groups into more granular categories, but for the purposes of this blog, this will suffice. In Figure 3, experience B shows a higher “stick rate” for both web and mobile users of the site. This might indicate that experience B is the optimal experience to roll out to the site overall. This finding is interesting and could be explained by a number of factors, including usability concerns and other data not captured in this study. To confirm any hypothesis about this data, further statistical tests are necessary. Tests that can be used to follow up on the findings here include the Wilcoxon test and the log-rank test.
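
As one possible way to run such tests, the sketch below uses `survdiff()` from the survival package on simulated data. The data, the column names, and the choice to stratify by device are assumptions for illustration. Note that `rho = 1` gives the Peto and Peto modification of the Gehan-Wilcoxon test, which is one of several Wilcoxon-type tests for survival data.

```r
library(survival)

set.seed(3)
n <- 400
sim <- data.frame(
  group  = sample(c("A", "B"), n, replace = TRUE),
  device = sample(c("web", "mobile"), n, replace = TRUE)
)
# Simulated seconds on site; experience B gets a lower exit rate for illustration.
sim$time    <- rexp(n, rate = ifelse(sim$group == "A", 1 / 30, 1 / 45))
sim$bounced <- 1L  # every simulated visitor eventually leaves

# Log-rank test (rho = 0, the default), comparing A and B within each device.
logrank <- survdiff(Surv(time, bounced) ~ group + strata(device), data = sim)

# rho = 1: Peto and Peto modification of the Gehan-Wilcoxon test,
# which gives more weight to differences early in the session.
wilcoxon <- survdiff(Surv(time, bounced) ~ group + strata(device),
                     data = sim, rho = 1)

print(logrank)
print(wilcoxon)

# p-values from the chi-square statistics (two groups, so 1 degree of freedom).
p_logrank  <- pchisq(logrank$chisq,  df = 1, lower.tail = FALSE)
p_wilcoxon <- pchisq(wilcoxon$chisq, df = 1, lower.tail = FALSE)
```

A small p-value would suggest that the difference in stick rate between the two experiences is unlikely to be due to chance alone.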

Figure 3

Survival Analysis provides a great way to dig deeper into testing data. With a few lines of code, one can uncover anomalies in the data that a testing platform is unlikely to surface on its own. There are additional statistical tests that can be done to determine whether there are any real differences between these groups, which I will cover in a future blog.
