The “Power” of A/B Testing

Introduction

Customers around the world flock to Zulily on a daily basis to discover and shop products that are curated specially for them. To serve our customers better, we are constantly innovating to improve the customer experience. To help us decide between different customer experiences, we primarily use split testing or A/B testing, the industry standard for making scientific decisions about products. In a previous post, we talked about the fundamentals of A/B testing and described our home-grown solution. In this article, we are going to explore one choice behind designing sound A/B tests: we will calculate the number of people that need to be included in your test.

To briefly recap the idea of A/B testing, in an A/B test we compare the performance of two variants of the same experience. For example, in a test we might compare two ways of ordering the navigation tabs on the website (see Figure 1). In the original version (variant A or control), the tabs are ordered such that events launched today (New Today) appear before events that are ending soon (Ends Soon). In the new version of the website (variant B or treatment), Ends Soon appears before New Today. The test is set up such that, for a pre-defined period of time, customers visiting the website are shown either variant A or variant B. Then, using statistical methods, we measure the incremental improvement in the experience of customers who were shown variant B over those who were shown variant A. Finally, if there is a statistically significant improvement, we might decide to change the order of the tabs on the website.

Since Zulily relies heavily on A/B testing to make business decisions, we are careful about avoiding common pitfalls of the method. The length of an A/B test ties in strongly with the success of the business for two reasons:

• If A/B tests run for more days than necessary, the pace of innovation at the company will slow down.
• If A/B tests run for fewer days than required to achieve statistically sound results, the results will be misguided.

To account for these issues, before making decisions based on the A/B test results, we run what’s called a ‘power analysis.’ A power analysis, which is the focus of this article, ensures that enough people have been captured in the test to confirm or deny whether variant B was an improvement over variant A. We also make sure that the test is run long enough so that short-term business cycles are accounted for. The number of people needed in a test is a function of three things: effect size, significance level ($\alpha$), and power ($1-\beta$).

To consult the statistician after an experiment is finished is often merely to ask [them] to conduct a post mortem examination. [They] can perhaps say what the experiment died of.

– Ronald Fisher, Statistician

Common Terms

Before we get into the mechanics of that calculation, let us familiarize ourselves with some common statistical terms. In an A/B test, we are trying to estimate how all the customers behave (population) by measuring the behavior of a subset of customers (sample). For this, we ensure that our sample is representative of our entire customer base. During the test, we measure the behavior of the customers in our sample. Measurements might include the number of items purchased, the time spent on the website, or the money spent on the website by each customer.

For example, to test whether variant B outperformed variant A, we might want to know if the customers exposed to variant B spent more money than the customers exposed to variant A. In this test, our default position is that variant B made no difference to the behavior of the customers when compared to variant A (null hypothesis). As more customers are exposed to these variants and start purchasing products, we collect measurements on more customers, which allows us to either reject or fail to reject this null hypothesis. The difference between the behavior of customers exposed to variant A and variant B is known as the effect size.

Further, there are a number of parameters set under the hood. Before starting the test, we assign a significance level ($\alpha$) of 0.05, which means that we might reject the null hypothesis when it is actually true in 5% of the cases (Type I error rate). We also assign a power ($1-\beta$) of 0.80, which means that when the null hypothesis does not hold, that is, when variant B changes the behavior of customers, the test will allow us to reject the null hypothesis 80% of the time. Importantly, these parameters need to be set at the beginning of the test and upheld for the duration of the test to avoid p-hacking, which leads to misguided results.
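To make the meaning of the significance level concrete, here is a minimal simulation sketch. It runs many simulated tests in which both variants share the same true revisit rate, so the null hypothesis is true by construction, and counts how often a pooled two-proportion z-test rejects it anyway. The 30% revisit rate, the group size, the number of simulated tests and the choice of z-test are all illustrative assumptions, not Zulily's actual setup.

```python
import random
from statistics import NormalDist


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (successes_b / n_b - successes_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate=0.30, n_per_variant=500,
                        n_tests=2000, alpha=0.05, seed=0):
    """Fraction of simulated tests that wrongly reject a true null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Both variants use the same true rate, so any "win" is a Type I error.
        a = sum(rng.random() < true_rate for _ in range(n_per_variant))
        b = sum(rng.random() < true_rate for _ in range(n_per_variant))
        if two_proportion_p_value(a, n_per_variant, b, n_per_variant) < alpha:
            rejections += 1
    return rejections / n_tests


# The printed fraction should land close to alpha = 0.05.
print(false_positive_rate())
```

The same harness can be pointed at power instead: give variant B a slightly higher true rate and the rejection fraction becomes an estimate of $1-\beta$.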

Estimating the Number of Customers for the A/B Test

For this exercise, let us revisit the previous example where we show customers two versions of the Zulily navigation bar. Let us say we want to see if this change makes customers less or more engaged on Zulily’s website. One metric that can capture this is the proportion of customers who revisit the website when shown variant B versus variant A. For this exercise, let us say that we are interested in a boost in this metric of at least 1% (effect size). If we see at least this effect size, we might implement variant B. The question is: how many customers should be exposed to each variant to allow us to confirm that this change of 1% exists?

Starting off, we define some parameters. First, we set the significance level at 0.05. Second, from the central limit theorem, we assume that the average of the metric across a group of customers (here, the proportion who revisit the website) is normally distributed. Third, we direct 50% of the customers visiting the site to variant A and 50% to variant B. These last two points greatly simplify the math behind the calculation. Now, we can estimate the number of people that need to be exposed to each variant.

$$n = \frac{2\sigma^2\left(z_{1-\frac{\alpha}{2}} + z_{1-\beta}\right)^2}{\delta^2}$$

where $\sigma$ is the standard deviation of the population, $\delta$ is the change we expect to see, and $z_{1-\frac{\alpha}{2}}$ and $z_{1-\beta}$ are quantile values calculated from a normal distribution. For the parameters defined above, a significance level of 0.05 and a power of 0.80, and if we wanted to detect a 1% change in the proportion of people revisiting the website, our formula would simplify to:

$$n \approx \frac{2\sigma^2\,(1.96 + 0.84)^2}{(0.01)^2} \approx \frac{15.7\,\sigma^2}{(0.01)^2}$$

This formula gives us the number of people that need to be exposed to one variant. Finally, since the customers were split evenly between variant A and variant B, we would need twice that number of people in the entire test. This estimate can change significantly if any of the parameters change. For example, to detect a larger difference at this significance level, we would need a much smaller sample. Further, if the observations are not normally distributed, then we would need a more complicated approach.
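As a rough illustration of this calculation, the sketch below uses the standard per-variant sample size formula for a two-sided test with a 50/50 split, $n = 2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \delta^2$. Two things here are assumptions for the sake of the example: the baseline revisit rate of 30% (used to derive $\sigma$ for a proportion) is hypothetical, and the 1% effect is read as one absolute percentage point ($\delta = 0.01$).

```python
import math
from statistics import NormalDist


def sample_size_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """People needed in ONE variant to detect a difference of delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    n = 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)


# Hypothetical baseline: 30% of customers revisit the website.
baseline = 0.30
# For a yes/no outcome, the standard deviation is sqrt(p * (1 - p)).
sigma = math.sqrt(baseline * (1 - baseline))

n = sample_size_per_variant(sigma, delta=0.01)
print("per variant:", n)          # roughly 33,000 under these assumptions
print("entire test:", 2 * n)

# Doubling the effect size cuts the required sample to about a quarter.
print("per variant, delta = 0.02:", sample_size_per_variant(sigma, delta=0.02))
```

Because $\delta$ is squared in the denominator, small changes in the effect we want to detect move the required sample size a great deal.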

Benefits of this calculation

In short, getting an estimate of the number of customers needed allows us to design our experiments better. We suggest conducting a power analysis both before and after starting a test for several reasons:

• Before starting the test – This gives us an estimate of how long our test should be run to detect the effect that we are anticipating. Ideally, this is done once to design the experiment and the results are tallied when the requisite number of people are exposed to both variants. However, the mean and standard deviation used in the calculation before starting the test are approximations to the actual values that we might see during the test. Thus, these a priori estimates might be off.
• After starting the test – As the test progresses, the mean and standard deviation converge to values representative of the current sample, which allows us to get more accurate estimates of the sample size. This is especially useful in cases where the new experience introduces unexpected changes in the behavior of the customers, leading to mean and standard deviation values that differ significantly from those estimated earlier.
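The two steps above can be sketched as follows: estimate the sample size once from a guessed standard deviation, then again from the standard deviation measured in the running test, and translate each into a test length. All numbers are made up for illustration (a 30% planned revisit rate, 40% observed, 20,000 visitors per day), and rounding up to whole weeks is one assumed way of covering short-term business cycles.

```python
import math
from statistics import NormalDist, stdev


def sample_size_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided test with a 50/50 split."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma ** 2 * z ** 2 / delta ** 2)


def days_to_run(n_per_variant, daily_visitors, cycle_days=7):
    """Days needed to fill both variants, rounded up to whole cycles."""
    days = math.ceil(2 * n_per_variant / daily_visitors)
    return math.ceil(days / cycle_days) * cycle_days


# Before the test: sigma guessed from a hypothetical 30% revisit rate.
planned_n = sample_size_per_variant(math.sqrt(0.30 * 0.70), delta=0.01)

# After the test starts: sigma measured from (made-up) observed outcomes,
# where 1 means the customer revisited and 0 means they did not.
observed = [1] * 400 + [0] * 600
updated_n = sample_size_per_variant(stdev(observed), delta=0.01)

print(planned_n, days_to_run(planned_n, daily_visitors=20_000))
print(updated_n, days_to_run(updated_n, daily_visitors=20_000))
```

In this made-up case the observed spread is larger than planned, so the updated estimate asks for more customers than the a priori one.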

Conclusion

At Zulily, we strive to make well-informed choices for our customers by listening to their voice through our A/B testing platform, among other channels, and ensuring that we are constantly serving the needs of the customers. While obtaining an accurate estimate of the number of people for the test is challenging, we hold it central to the process. Most people agree that the benefits of a well-designed, statistically sound A/B testing system far outweigh the benefits of obtaining quick but misleading numbers. Therefore, we aim for a high level of scientific rigor in our tests.

I would like to thank my colleagues in the data science team, Demitri Plessas and Pamela Moriarty, and my manager, Paul Sheets, for taking time to review this article. This article is possible due to the excellent work by the entire data science team of maintaining the A/B testing platform, and ensuring that experiments at Zulily are well-designed.

USING DATA TO DRIVE SUCCESS

Learn how Zulily and Sounders FC get the most out of their metrics!

On Tuesday, September 10th, Zulily was proud to partner with Seattle Sounders FC for a tech talk on data science, machine learning and AI. This exclusive talk was led by Olly Downs, VP of Data & Machine Learning at Zulily, and Ravi Ramineni, Director of Soccer Analytics at Sounders FC.

Zulily and Sounders FC both use deep analysis of data to improve the performance of their enterprises. At Zulily, applying advanced analytics and machine learning to the shopping experience enables us to better engage customers and drive daily sales. For Sounders FC, the metrics reflect how each player contributes to the outcome of each game; understanding the relationship between player statistics, training focus and performance on the field helps bring home the win. For both organizations, being intentional about the metrics we select and optimize for is critical to success.

We would like to thank everyone who attended the event for a great night of discussion and for developing new ties within the Seattle developer community. For any developers who missed this engaging discussion, we invite you to view the full presentation and audience discussion:

Acknowledgments:

Thanks to Olly Downs and Ravi Ramineni for presenting their talks, Sounders FC for hosting, and Luke Friang for providing a warm welcome. This would not have been possible without the many volunteers from Zulily and without Bellevue School of AI, which co-listed the event. Thanks as well to all the attendees for making the tech talk a success!

If you’d like to chat further with one of our recruiters regarding a position within data science, machine learning or any of our other existing roles, feel free to reach out directly to techjobs@zulily.com to get the conversation started. Also be sure to follow us on LinkedIn and Twitter.

AT THE TOP OF YOUR GAME

Seattle Female Leaders Discuss Their Paths to Success

Panel Highlights:

“As you grow in your career, you are being sought for your leadership and critical thinking skills, and for your ability to diagnose and solve problems, not regurgitate facts.” Kelly Wolf, VP of People at Zulily

“I wouldn’t be where I am if it wasn’t for my mentors. We need to push more, take more risk to support each other and come together as a community. It doesn’t matter if you’re a man or a woman, we all need to work together.” Kat Khosrowyar, Head Coach at Reign Academy, former Head Coach of Iran’s national soccer team, Chemical Engineer

“I am not a developer, but currently mentor a female developer. She drives the topic, and I act as a sounding board. Working on a predominately male team, she needed a different confidante to work through issues, approach, development ideas and career path goals.” Jana Krinsky, Director of Studio at Zulily

“During meetings, I sometimes tell myself, ‘should I be here? I’m in over my head.’ And I sort of have to call bull**** on myself. I think we all need to do that.” Angela Dunleavy-Stowell, CEO at FareStart, Co-Founder at Ethan Stowell Restaurants

“When you have confidence in yourself, when you think ‘I’m going to own it, this is going to happen because I’m going to make it happen,’ it matters. As women, we can’t use apologetic language like ‘Sorry, whenever you have a second, I would like to speak to you’; we don’t need to be sorry for doing our jobs. Women need to start changing those sentences to, ‘when would be a good time to talk about this project?’ and treating people as your equal, not as someone who’s above you.” Celia Jiménez Delgado, right wing-back for Reign FC + Spain’s national soccer team, Aerospace Engineer

“We all have to find our courage. Because if you want to grow and be in a leadership role, that’s going to be a requirement. I think identifying that early in your career is a great way to avoid some pitfalls, down the road.” Angela Dunleavy-Stowell, CEO at FareStart, Co-Founder at Ethan Stowell Restaurants

Acknowledgments:

Thanks to Celia Jiménez Delgado, Kat Khosrowyar, Jana Krinsky, Angela Dunleavy-Stowell and Kelly Wolf for this engaging panel! We’d also like to give a big thanks to FareStart, Reign FC and Reign Academy for supporting this event. This would not have been possible without the many volunteers from Zulily. Thanks as well to all the attendees for making the night a success!

Remember Bart Simpson’s punishment for being bad? He had to write the same thing on the chalkboard over and over again, and he absolutely hated it! We as humans hate repetitive actions, and that’s why we invented computers – to help us optimize our time to do more interesting work.

At zulily, our Marketing Specialists previously published ads to Facebook individually. However, they quickly realized that creating ads manually limited the scale they could reach in their work of acquiring new customers and retaining existing shoppers. So, in partnership with the marketing team, we built a solution that would help the team use resources efficiently.

At first, we focused on automating individual tasks. For instance, we wrote a tool that Marketing used to stitch images into a video ad. That was cool and saved some time but still didn’t necessarily allow us to operate at scale.

Now, we are finally at the point where the entire process runs end-to-end efficiently, and we are able to publish hundreds of ads per day, up from a handful.

Here’s how we engineered it.

The Architecture

Sales Events

Sales Events is an internal system at zulily that stores the data about all sales events we run. Typically, we launch 100+ sales events each day; they last three days and could include 9,000 products. Each event includes links to the appropriate products and product images. The system exposes the data through a REST API.

Evaluate an Event

This component holds the business logic that allows us to pick events that we want to advertise, using a rules-based system uniquely built for our high-velocity business. We implemented the component as an Airflow DAG that hits the Sales Events system multiple times a day for new events to evaluate. When a decision to advertise is made, the component triggers the next step.
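To illustrate what a rules-based evaluation of this kind might look like, here is a minimal sketch in plain Python. The event fields and the three rules (a minimum product count, a minimum image count, and enough time left before the event ends) are hypothetical placeholders, not zulily's actual business rules, and the Airflow scheduling around it is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SalesEvent:
    """Hypothetical subset of what the Sales Events API might return."""
    event_id: str
    product_count: int
    image_count: int
    hours_remaining: float


Rule = Callable[[SalesEvent], bool]

# Illustrative rules only; each one must pass for the event to be advertised.
RULES: List[Rule] = [
    lambda e: e.product_count >= 10,    # enough products to show
    lambda e: e.image_count >= 3,       # enough images to build a video
    lambda e: e.hours_remaining >= 24,  # ad has time to run before the sale ends
]


def should_advertise(event: SalesEvent, rules: List[Rule] = RULES) -> bool:
    """An event is picked only if every rule accepts it."""
    return all(rule(event) for rule in rules)


def pick_events(events: List[SalesEvent]) -> List[SalesEvent]:
    """Filter a batch of new events down to the ones worth advertising."""
    return [event for event in events if should_advertise(event)]


batch = [
    SalesEvent("summer-dresses", product_count=120, image_count=40, hours_remaining=70),
    SalesEvent("tiny-sale", product_count=4, image_count=4, hours_remaining=70),
    SalesEvent("ending-soon", product_count=80, image_count=20, hours_remaining=6),
]
print([event.event_id for event in pick_events(batch)])  # ['summer-dresses']
```

Keeping each rule as a separate function makes it straightforward to add, remove or reorder rules without touching the evaluation loop.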

Make Creatives

In this crucial next step, our zulily-built tool creates a video advertisement, which is uploaded to AWS S3 as an MP4 file. These creatives also include metadata used to match Creatives with Placements downstream.

Product Sort

A sales event at zulily could easily have dozens if not hundreds of products. We have a Machine Learning model that uses a proprietary algorithm to rank products for a given event. The Product Sort is available through a REST API, and we use it to optimize creative assets.

Match Creatives to Placements

A creative is a visual item that needs to be published so that a potential shopper on Facebook can see it. The resulting advertisement, as seen by the potential shopper, is described by a Placement. A Placement defines where on Facebook the ad will go and who the audience for the ad should be. We match creatives with placements using Match Filters defined by Marketing Specialists.

Define Match Filters

Match Filters allow Marketing Specialists to define rules that will pick a Placement for a new Creative.

These rules are based on the metadata of Creatives: “If a Creative has a tag X with the value Y, match it to the Placement Z.”
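A rule of the form "tag X with value Y goes to Placement Z" can be sketched as a small lookup over a Creative's metadata. The class name, field names and the example tags and placements below are hypothetical and only meant to show the shape of the matching logic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MatchFilter:
    """If a Creative has `tag` set to `value`, match it to `placement`."""
    tag: str
    value: str
    placement: str


def match_placements(creative_tags: Dict[str, str],
                     filters: List[MatchFilter]) -> List[str]:
    """Return every Placement whose filter matches the Creative's tags."""
    matched: List[str] = []
    for match_filter in filters:
        tag_matches = creative_tags.get(match_filter.tag) == match_filter.value
        if tag_matches and match_filter.placement not in matched:
            matched.append(match_filter.placement)
    return matched


# Hypothetical filters a Marketing Specialist might define.
filters = [
    MatchFilter(tag="category", value="kids", placement="parents-feed"),
    MatchFilter(tag="category", value="home", placement="home-decor-feed"),
    MatchFilter(tag="format", value="video", placement="video-stories"),
]

creative = {"category": "kids", "format": "video"}
print(match_placements(creative, filters))  # ['parents-feed', 'video-stories']
```

A Creative can match more than one Placement, which is why the function returns a list rather than a single value.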

MongoDB

Once we match a Creative with one or more Placements, we persist the result in MongoDB. We use a schemaless database rather than a SQL database because we want to be able to extend the schema of Creatives and Placements without having to update table definitions. MongoDB (version 3.6 and above) also gives us a change stream, which is essentially a log of changes happening to a collection. We rely on this feature to automatically kick off the next step.
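As a rough sketch of how a change stream can kick off a follow-up step, the code below separates the event handling from the MongoDB connection. It assumes the PyMongo driver and its `Collection.watch()` method; the database name `ads`, the collection name `matches` and the `publish` callback are hypothetical. Change streams require MongoDB to run as a replica set or sharded cluster.

```python
def handle_changes(change_stream, publish):
    """Call publish() for every newly inserted match document.

    `change_stream` is any iterable of change events, which are dicts
    carrying an "operationType" and, for inserts, a "fullDocument".
    """
    handled = 0
    for change in change_stream:
        if change.get("operationType") == "insert":
            publish(change["fullDocument"])
            handled += 1
    return handled


def watch_matches(uri, publish):
    """Block and react to inserts on a (hypothetical) matches collection."""
    from pymongo import MongoClient  # third-party driver, imported lazily

    collection = MongoClient(uri)["ads"]["matches"]  # hypothetical names
    # Ask the server to send only insert events.
    pipeline = [{"$match": {"operationType": "insert"}}]
    with collection.watch(pipeline) as stream:
        handle_changes(stream, publish)


# Dry run with fake events, so the handler can be tried without a database.
fake_events = [
    {"operationType": "insert", "fullDocument": {"creative": "c1", "placement": "p1"}},
    {"operationType": "update", "documentKey": {"_id": 1}},
    {"operationType": "insert", "fullDocument": {"creative": "c2", "placement": "p7"}},
]
published = []
print(handle_changes(fake_events, published.append), published)
```

Keeping `handle_changes` independent of the driver makes the reaction logic easy to exercise with plain dictionaries before pointing it at a live collection.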