# The “Power” of A/B Testing

## Introduction

Customers around the world flock to Zulily on a daily basis to discover and shop products that are curated specially for them. To serve our customers better, we are constantly innovating to improve the customer experience. To help us decide between different customer experiences, we primarily use split testing or A/B testing, the industry standard for making scientific decisions about products. In a previous post, we talked about the fundamentals of A/B testing and described our home-grown solution. In this article, we are going to explore one choice behind designing sound A/B tests: we will calculate the number of people that need to be included in your test.

To briefly recap the idea of A/B testing, in an A/B test we compare the performance of two variants of the same experience. For example, in a test we might compare two ways of ordering the navigation tabs on the website (see Figure 1). In the original version (variant A or control), the tabs are ordered such that events launched today (New Today) appear before events that are ending soon (Ends Soon). In the new version of the website (variant B or control), Ends Soon appears before New Today. The test is set up such that, for a pre-defined period of time, customers visiting the website would be shown either variant A or variant B. Then, using statistical methods, we would measure the incremental improvement in the experience of customers that were shown variant B over those who were shown variant A. Finally, if there was a statistically significant improvement, we might decide to change order of the tabs on the website.

Since Zulily relies heavily on A/B testing to make business decisions, we are careful about avoiding common pitfalls of the method. The length of an A/B test ties in strongly with the success of the business for two reasons:

• If A/B tests run for more days than necessary, the pace of innovation at the company will slow down.
• If A/B tests run for fewer days than required to achieve statistically sound results, the results will be misguided.

To account for these issues, before making decisions based on the A/B test results, we run what’s called a ‘power analysis.’ A power analysis ensures that a certain number of people have been captured in the test to confirm or deny whether variant B was an improvement over variant A, which is the focus of this article. We also make sure that the test is run long enough so that short-term business cycles are accounted for. The calculation for the number of people needed in a test is a function of three things, effect size, significance level ($\alpha$), and power ($1-\beta$).

To consult the statistician after an experiment is finished is often merely to ask [them] to conduct a post mortem examination. [They] can perhaps say what the experiment died of.

– Robert Fisher, Statistician

## Common Terms

Before we get into the mechanics of that calculation, let us familiarize ourselves with some common statistical terms. In an A/B test, we are trying to estimate how all the customers behave (population) by measuring the behavior of a subset of customers (sample). For this, we ensure that our sample is representative of our entire customer base. During the test, we measure the behavior of the customers in our sample. Measurements might include the number of items purchased, the time spent on the website, or the money spent on the website by each customer.

For example, to test whether variant B outperformed variant A, we might want to know if the customers exposed to variant B spent more money than the customers exposed to variant A. In this test, our default position is that variant B made no difference on the behavior of the customers when compared to variant A (null hypothesis). As more customers are exposed to these variants and start purchasing products, we collect measurements on more customers, which allows us to either accept or reject this null hypothesis. The difference between the behavior of customers exposed to variant A and variant B is known as the effect size.

Further, there are a number of parameters set under the hood. Before starting the test, we assign a significance level ($\alpha$) which means that we might reject the null hypothesis when it is actually true in 5% of the cases (Type I error rate). Further, we will assign a power ($1-\beta$) which means that when the null hypothesis does not hold, or variant B changes the behavior of customers, the test will allow us to reject the null hypothesis 80% of the time. Importantly, these parameters need to be set at the beginning of the test and upheld for the duration of the test to avoid p-hacking, which leads to misguided results.

## Estimating the Number of Customers for the A/B Test

For this exercise, let us revisit the previous example where we show customers two versions of the Zulily navigation bar. Let us say we want to see if this change makes customers less or more engaged on Zulily’s website. One metric that can capture this is the proportion of customers who revisit the website when shown variant B versus variant A. for this exercise, let us say that we are interested in a boost in this metric of at least 1 % (effect size). If we see at least this effect size, we might implement the variant B. Question is, how many customers should be exposed to each variant to allow us to confirm that this change of 1% exists?

Starting off, we define some parameters. First, we define the significance level at 0.05. Second, from the central limit theorem, we assume that the average money that a group of customers spend on the website is normally distributed. Third, we direct 50% the customers visiting the site to variant A and 50% to variant B. These last two points greatly simplify the math behind the calculation. Now, we can estimate the number of people that need to be exposed to each variant.

where, $\sigma$ is the standard deviation of the population, $\delta$ is the change we expect to see, the $z_{1-\frac{\alpha}{2}}$, $z_{1-\beta}$ are quantile values calculated from a normal distribution. For the case of the parameters defined above, significance level of 0.05 and power of 0.80, and if we wanted to detect a 1% change in the proportion of people revisiting the website, our formula would simplify to:

This formula gives us the number of people that need to be exposed to one variant. Finally, since the customers were split evenly between variant A and variant B, we would need twice the number of the people in the entire test. This estimate can change significantly if any of the parameters change. For example, to detect a larger difference at this significance level, we would need much smaller samples. Further, if the observations are not normally distributed, then we would need a more complicated approach.

## Benefits of this calculation

In short, getting as estimate of the number of customers needed allows us to design our experiments better. We suggest conducting a power analysis both before and after starting a test for several reasons:

• Before starting the test – This gives us an estimate of how long our test should be run to detect the effect that we are anticipating. Ideally, this is done once to design the experiment and the results are tallied when the requisite number of people are exposed to both variants. However, the mean and standard deviation used in the calculation before starting the test are approximations to the actual values that we might see during the test. Thus, these a priori estimates might be off.
• After starting the test – As the test progresses, the mean and standard deviation converge to values representative of the current sample which allows us to get more accurate estimates of the sample size. This is especially useful in cases where the new experience introduces unexpected changes in the behavior of the customers leading to significantly different mean and standard deviation values than those estimated earlier.

## Conclusion

At Zulily, we strive to make well-informed choices for our customers by listening to their voice through our A/B testing platform, among other channels, and ensuring that we are constantly serving the needs of the customers. While obtaining an accurate estimate of the number of people for the test is challenging, we hold it central to the process. Most people agree that the benefits of a well-designed, statistically sound A/B testing system far outweigh the benefits from obtaining quick, but misdirected numbers. Therefore, we aim for a high-level level of scientific rigor in our tests.

I would like to thank my colleagues in the data science team, Demitri Plessas and Pamela Moriarty, and my manager, Paul Sheets, for taking time to review this article. This article is possible due to the excellent work by the entire data science team of maintaining the A/B testing platform, and ensuring that experiments at Zulily are well-designed.

|

# The Thrill-Ride Ahead for zulily Engineers

zulily is an e-commerce company that is changing the retail landscape through our ‘browse and discover’ business model.  Long term, there are tremendous international expansion opportunities as we change the way people shop across the planet. At zulily we’re building a future where the simultaneous release of 9,000 product styles across 100+ events can occur seamlessly in multiple languages across multiple platforms. From an engineering perspective, as we expand globally the size and scope of our technical challenge is nothing short of localization’s Mount Everest climb.

Navigating steep localization challenges is not new to companies expanding globally yet zulily faces a particularly unique thrill-ride ahead. For me personally, this is incredibly exciting. Prior to coming to zulily one role I held at my former company was global readiness – to ensure the products and services that the company delivered to customers were culturally, politically and geographically appropriate. I was in the unique role of mitigating the company’s risk of negative press, boycotts, protests, lawsuits or being banned by governments. The content my team reviewed was never considered life-threatening until the 2005 incident when the cultural editor of Jyllands-Posten in Denmark commissions twelve cartoonists to draw cartoons of Islamic prophet Muhammad. Those cartoons were published and their publication led to the loss of life and property. Suddenly my team took notice. Having content thoughtfully considered and ready for global markets took on a grave seriousness moving from a ‘nice-to-have’ to a ‘must-have’ risk management function. While my role was to ensure the political correctness (neutrality) of content across 500+ product groups, the group I led rarely dealt with the size and scope of technical localization challenges that zulily is facing as we expand globally.

As companies expand globally it’s important for the employees to conceptually shift their mindset from a U.S. centric perspective to a global view. This paradigm leap in how the employees of an organization consider themselves is significant. When people are increasingly thinking globally about their role and the impact of their decisions on a global audience, specifically in the area of technology and how we enable our platforms, tools and systems to be ‘world-ready’, opportunities for growth and development naturally occur while cultural content risks reduce. Today all of the content on our eight country-specific sites is in English yet we are now thinking about how to tackle bigger challenges in the future, such as supporting multiple languages. The technical implementation for international expansion has enabled our developers and product managers to gain a new appreciation for the importance of global readiness and the challenge of ‘going international’.

zulily offers over 9,000 product styles through over 100 merchandising events on a typical day. As we expand into new markets around the globe we face an extraordinary challenge from both an engineering and operations perspective. Imagine, every day at 6 a.m. PT we’re publishing the content equivalent of a daily edition of The New York Times (about one half million new words per day or over 100 million new words per year).  Another way to conceptualize the technical problem – each day we offer roughly the same number of SKUs of what you would find in a typical Costco store. Launching a new Costco store every day is difficult enough in one language yet as we scale our offerings globally the complexity to simultaneously produce this extremely high volume of content in multiple languages grows exponentially. Arguably, no e-commerce company in the world is publishing the volume of text that zulily produces on a daily basis. Further, the technical challenge is amplified because the massive volume of content must be optimized to work across multiple platforms, e.g., iPhone, iPad, Android and web based devices. In fact, over 56% of our orders are placed on mobile devices. From a user-experience across platforms and languages, text expansion and contraction become a significant issue. European languages such as French, German and Italian may require up to 30% more space than English. Double byte characters such as Chinese and Japanese will require less space.

The content on our sites is produced in-house each day by our own talented copy writers and editors. Not unlike publishing a newspaper the deadline for a 6 a.m. launch of fresh, new product styles is typically the night before. Last minute edits can happen as late as midnight! While this makes for an exciting and dynamic environment, it requires some of the brightest engineering and operational minds on the planet to bring it all together with the quality and performance we expect.

From a technical perspective our recent global expansion to Mexico, Hong Kong and Singapore faced typical localization hurdles. We needed to implement standard solutions to accommodate additional address line fields in the shipping address; ensure proper currency symbols are displayed; coding rules that allowed us to process orders without postal codes which are not required in Hong Kong.