Building zulily’s big data platform

In July 2014 we started our journey to build a new data platform that would allow us to use big data to drive business decisions. I would like to start with a quote from our 2015 Q2 earnings call that was highlighted in various press outlets, and then share how we built a data platform that allows zulily to make decisions that were nearly impossible to make before.

“We compared two sets of customers from 2012 that both came in through the same channel, a display ad on a major homepage, but through two different types of ads,” [Darrell] Cavens said. “The first ad featured a globally distributed well-known shoe brand, and the second was a set of uniquely styled but unbranded women’s dresses. When we analyze customers coming in through the two different ad types, the shoe ad had more than twice the rate of customer activations on day one. But 2.5 years later, the spend from customers that came in through the women dresses ad was significantly higher than the shoe ad with the difference increasing over time.” – www.bizjournals.com

Our vision is for every decision, at every level in the organization, to be driven by data. In early 2014 we realized that the data platform we had (a SQL Server database used as a data warehouse, primarily for structured operational data, plus a Hadoop cluster for unstructured data) was too limiting. We started by defining core principles for our new data platform (v3).

Principles for new platform

  1. Break silos of unstructured and structured data: One of the key issues with our old infrastructure was that unstructured data lived in a Hadoop cluster and structured data lived in SQL Server, which was extremely limiting. Almost every major decision required analytics that correlated usage patterns on the site (unstructured) with demand, shipments or other transactions (structured). We needed a solution that allowed us to correlate these different types of datasets.
  2. High scale and lowest possible cost: We already had a Hadoop cluster, but our data was growing exponentially, and as storage needs grew we had to keep adding nodes to the cluster, which increased cost even when compute was not growing at the same pace. In short, we were paying at the rate of compute (which is far more expensive) while growing at the rate of storage. We had to break this coupling. Additionally, we wanted to build a highly scalable system that would support our growth.
  3. High availability or extremely low recovery time, with an eye on cost: Simply put, we wanted high availability without paying a premium for it.
  4. Enable real-time analytics: The existing data platform had no concept of real-time data processing, so any time users needed to analyze things in real time they had to be given access to production operational systems, which in turn increased risk. This had to change.
  5. Self service: Our tools, systems and services need to be easy for our users to use with no involvement from the data team.
  6. Protect the platform from users: Another challenge in the existing system was the lack of isolation between core data platform components and user-facing services. We wanted to make sure that in the new platform an individual user could not bring down the entire system.
  7. Service (data) level agreements (SLAs): We wanted to be able to give our users SLAs for our datasets, covering data accuracy and freshness. In the past this was a huge issue due to the design of the system and the lack of tools to monitor and report on SLAs.

zulily data platform (v3)

The zulily data platform includes multiple tools, platforms and products. The combination is what makes it possible to solve complex problems at scale.

There are six key components in the stack:

  1. Centralized big data storage built on Google Cloud Storage (GCS). This allows us to have all our data in one place (principle 1) and share it across all systems and layers of the stack.
  2. Big Data Platform is primarily for batch processing of data at scale. We use Hadoop on Google Compute Engine for our big data processing.
    • All our data is stored in GCS rather than HDFS, which enables us to decouple storage from compute and manage cost (principle 2).
    • Additionally, the Hadoop cluster is transient for us: it holds no data, so if the cluster completely goes away we can build a new one on the fly. We think about our cluster as a transient entity (principle 3).
    • We have started looking at Google Dataflow for specific scenarios, but more on that soon. Our goal is to make all data available for analytics anywhere from 30 minutes to a few hours, based on the SLA.
  3. Real-time data platform is a zulily-built stack for high-scale data collection and processing (principle 4). We now collect more than 10,000 events per second, and that number is growing fast as we see value through our decision portal and other scenarios.
  4. Business data warehouse, built on Google BigQuery, enables us to provide a highly scalable analytics service to hundreds of users in the company.
    • We push all our data, structured and unstructured, real-time and batch, into Google BigQuery, breaking down all data silos (principle 1).
    • It also enables us to keep our big data platform and data warehouse completely isolated (principle 6), making sure issues in one environment don't impact the other.
    • Keeping the interface (SQL) the same for analysts between the old and new platforms lowered the barrier to adoption (principle 5).
    • We also have a high-speed, low-latency data store that hosts part of our business data warehouse for access through APIs.
  5. Data Access and Visualization components enable business users to make decisions based on the information available.
    1. Our real-time decision portal, zuPulse, gives the business insights in real time.
    2. Tableau is our reporting and analytics visualization tool. Business users use it to create the reports they need for daily execution. This makes reporting self service (principle 5) and frees our engineering team to focus on core data and high-scale predictive scenarios.
    3. ZATA, our data access service, enables us to expose all our data through an API. Any data in the data warehouse or the low-latency data store can be automatically exposed through ZATA. This lets various internal applications at zulily show critical information to users, including vendors, who can see their historical sales on the vendor portal.
  6. Core data services are the backbone of everything we do – from moving data to the cloud to monitoring and making sure we abide by our SLAs. We continuously add new services and enhance existing ones. The goal of these services is to make developers and analysts across other areas more efficient.
    • zuSync is our data synchronization service, which enables us to get data from any source system and move it to any destination. A source or destination can be an operational database (SQL Server, MySQL, etc.), a queue (RabbitMQ), Google Cloud Storage (GCS), Google BigQuery, a service (HTTP/TCP/UDP), an FTP location…
    • zuflow is our workflow service built on top of Google BigQuery; it allows analysts to define and schedule query chains for batch analytics. It can also notify users and deliver data in different formats.
    • zuDataMon is our data monitoring service, which allows us to make sure data is consistent across all systems (principle 7). This helps us catch issues with our services or with the data early. Today more than 70% of issues are identified by our monitoring tool rather than reported by users; our goal is to get this number above 90%.

This is a very high-level view of what our data@zulily team does. Our goal is to share more often, with deeper dives into our various pieces of technology, in the coming weeks. Stay tuned…

Partnering with zulily – A Facebook Perspective

During F8, Facebook's global developer conference (March 25-26, 2015), Facebook announced a new program with zulily as one of the launch partners: Messenger for Business. Just a month later, the service was live and in production! We on the zulily engineering team truly enjoyed partnering with Facebook on this exciting product launch. It's been a few months since the launch, so we reached out to Rob Daniel, Senior Product Manager at Messenger, to share his team's experience collaborating with zulily.

Here’s what Rob had to say:

How Messenger for Business Started

Messenger serves over 700 million active people with the mission of reinventing and improving everyday communication. We invest in performance, reliability, and the core aspects of a trusted and fast communication tool. Furthermore, we build features that allow people to better express themselves to their friends and contacts, including through stickers, free video and voice calling, GIFs, and other rich media content. We build the product to be as inclusive as possible – even supporting people that prefer to register for Messenger without a Facebook account – because we know that the most important feature of a messaging app is that all your friends and anyone you’d want to speak with is reachable, instantly.

In that vein, we found that some of people's most important interactions are not just with friends and family, but with businesses they care about and interact with almost every day. However, until recently Messenger didn't fully support a robust channel for communicating with these businesses. And when we looked at how people interact with businesses today – email, phone (typically… IVR phone trees), text – there were obvious drawbacks. The channels were either noisy (email), lacking in richness (text), or just really heavyweight and inefficient (voice).

Reinventing How People Communicate with Businesses

When we started talking with zulily about a partnership to reinvent how people communicate with businesses, under the mission of creating a delightful, high-utility experience, we found an incredible match.

Why? Simply put, zulily thinks about Mom first. That’s the same people-centric approach to building products that’s at the heart of Messenger. And along the way, as we scoped out the interactions for launch, it was evident that we had the same mindset: build a product with incredible utility and value to people, establish a relationship and build trust, and over time build interactions that pay dividends via higher customer loyalty and lifetime value. And despite time pressures and the many opportunities to put business values over people values, we collectively held the line that people were first, business was second.

We also found that "zulily Speed" was no joke, and matched the Facebook mantra of "Move Fast." Up to and through our announcement at F8 (Facebook's two-day developer conference) and the launch later the next month, our two teams moved in sync at incredible speed. As an example, late one evening an issue was spotted by a Messenger engineer, and by 10:00 am the next morning the problem had been identified, fixed and pushed to production. That kind of turnaround time and coordination is just unheard of between companies at zulily and Facebook's scale.

Speaking to the strength of the individual engineers on both sides, team sizes were small. This led to quick decision making and efficient, frequent communications between teams, forming a unique bond of trust and transparency. Despite the distance between Seattle and Menlo Park, we rarely had timing miscues and latency was incredibly low.

This Journey is 1% Finished

At Facebook we say “this journey is 1% finished,” highlighting that regardless of our past accomplishments we have a long journey ahead, and that we owe people that use our services so much more. And in that same spirit, we respect that Messenger is only beginning to provide zulily and other businesses the communication capabilities they need to form lasting, trusted, and valuable relationships with their customers. But we’re thrilled to be teamed up with a company, like zulily, that has the commitment and vision alignment to help shape the experience along the way and see the opportunity along the path ahead.

Sampling keys in a Redis cluster

We love Redis here at zulily. We store hundreds of millions of keys across many Redis instances, and we built our own internal distributed cache on top of Redis which powers the shopping experience for zulily customers.

One challenge when running a large, distributed cache using Redis (or many other key/value stores for that matter) is the opaque nature of the key spaces. It can be difficult to determine the overall composition of your Redis dataset, since most Redis commands operate on a single key. This is especially true when multiple codebases or teams use the same Redis instance(s), or when sharding your dataset over a large number of Redis instances.

Today, we’re open sourcing a Go package that we wrote to help with that task: reckon.

reckon enables us to periodically sample random keys from Redis instances across our fleet, aggregate statistics about the data contained in them — and then produce basic reports and metrics.

While there are some existing solutions for sampling a Redis key space, the reckon package has a few advantages:

Programmatic access to sampling results

Results from reckon are returned in data structures, not just printed to stdout or a file. This is what allows a user of reckon to sample data across a cluster of redis instances and merge the results to get an overall picture of the keyspaces. We include some example code to do just that.

Arbitrary aggregation based on key and redis data type

reckon also allows you to define arbitrary buckets based on the name of the sampled key and/or the Redis data type (hash, set, list, etc.). During sampling, reckon compiles statistics about the various redis data types, and aggregates those statistics according to the buckets you defined.

Any type that implements the Aggregator interface can instruct reckon about how to group the Redis keys that it samples. This is best illustrated with some simple examples:

To aggregate only Redis sets whose keys start with the letter a:


func setsThatStartWithA(key string, valueType reckon.ValueType) []string {
  if strings.HasPrefix(key, "a") && valueType == reckon.TypeSet {
    return []string{"setsThatStartWithA"}
  }
  return []string{}
}

To aggregate sampled keys of any Redis data type that are longer than 80 characters:


func longKeys(key string, valueType reckon.ValueType) []string {
  if len(key) > 80 {
    return []string{"long-keys"}
  }
  return []string{}
}

HTML and plain-text reports

When you're done sampling, aggregating and/or combining the results produced by reckon, you can easily produce a report of the findings in either plain-text or static HTML. An example HTML report is shown below:

a sample report showing key/value size distributions

The report shows the number of keys sampled, along with some example keys and elements of those keys (the number of example keys/elements is configurable). Additionally, a distribution of the sizes of both the keys and elements is shown — in both standard and "power-of-two" form. The power-of-two form shows a more concise view of the distribution, using a concept borrowed from the original Redis sampler: each row shows a number p, along with the number of keys/elements that are <= p and > p/2.
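To make the power-of-two bucketing concrete, here is a small, language-agnostic illustration (written in Python; the sizes and buckets are just an example, not reckon's actual implementation):

from collections import Counter

def power_of_two_histogram(sizes):
  """Bucket each size s into the smallest power of two p such that p/2 < s <= p."""
  buckets = Counter()
  for s in sizes:
    p = 1
    while p < s:
      p *= 2
    buckets[p] += 1
  return dict(sorted(buckets.items()))

# Example: key lengths sampled from a Redis instance
print(power_of_two_histogram([3, 9, 12, 17, 20, 31, 33]))
# {4: 1, 16: 2, 32: 3, 64: 1}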

For instance, using the example report shown above, you can see that:

  • 68% of the keys sampled had key lengths between 8 and 16 characters
  • 89.69% of the sets sampled had between 16 and 32 elements
  • the mean number of elements in the sampled sets is 19.7

We have more features and refinements in the works for reckon, but in the meantime, check out the repo on github and let us know what you think. The codebase includes several example binaries to get you started that demonstrate the various usages of the package.

Pull requests are always welcome — and remember: Always be samplin’.

The way we Go(lang)

Here at zulily, Go is increasingly becoming the language of choice for many new projects, from tiny command-line apps to high-volume, distributed services. We love the language and the tooling, and some of us are more than happy to talk your ear off about it. Setting aside the merits and faults of the language design for a moment (over which much digital ink has already been spilled), it’s undeniable that Go provides several capabilities that make a developer’s life much easier when it comes to building and deploying software: static binaries and (extremely) fast compilation.

What makes a good build?

In general, the ideal software build should be:

  • fast
  • predictable
  • repeatable

Being fast allows developers to quickly iterate through the develop/build/test cycle, and predictable/repeatable builds allow for confidence when shipping new code to production, rolling back to a prior version or attempting to reproduce bugs.

Fast builds are provided by the Go compiler, which was designed such that:

It is possible to compile a large Go program in a few seconds on a single computer.

(There’s much more to be said on that topic in this interesting talk.)

We accomplish predictable and repeatable builds using a somewhat unconventional build tool: a Docker container.

Docker container as “build server”

Many developers use a remote build server or CI server in order to achieve predictable, repeatable builds. This makes intuitive sense, as the configuration and software on a build server can be carefully managed and controlled. Developer workstation setups become irrelevant since all builds happen on a remote machine. However, if you’ve spent any time around Docker containers, you know that a container can easily provide the same thing: a hermetically sealed, controlled environment in which to build your software, regardless of the software and configuration that exist outside the container.

By building our Go binaries using a Docker container, we reap the same benefits as a remote build server, while retaining the speed and short dev/build/test cycle that make working with Go so productive.

Our build container:

  • uses a known, pinned version of Go (v1.4.2 at the time of writing)
  • compiles binaries as true static binaries, with no cgo or dynamically-linked networking packages
  • uses vendored dependencies provided by godep
  • versions the binary with the latest git SHA in the source repo

This means that our builds stay consistent regardless of which version of Go is installed on a developer’s workstation or which Go packages happen to be on their $GOPATH! It doesn’t matter if the developer has godep or golint installed, whether they’re running an old version of Go, the latest stable version of Go or even a bleeding-edge build from source!

Git SHA as version number

godep is becoming a de facto standard for managing dependencies in Go projects, and vendoring (aka copying code into your project’s source tree) is the suggested way to produce repeatable Go builds. Godep vendors dependent code and keeps track of the git SHA for each dependency. We liked this approach, and decided to use git SHAs as versions for our binaries.

We accomplish this by “stamping” each of our binaries with the latest git SHA during the build process, using the ldflags option of the Go linker. For example:

go build -ldflags "-X main.BuildSHA ${GIT_SHA}"

This little gem sets the value of the BuildSHA variable in the main package to be the value of the GIT_SHA environment variable (which we set to the latest git SHA in the current repo). This means that the following Go code, when built using the above technique, will print the latest git SHA in its source repo:

package main

import "fmt"

var BuildSHA string // set by the linker at build time!

func main() {
  fmt.Printf("I'm running version: %s\n", BuildSHA)
}

Enter: boilerplate

Today, we’re open sourcing a simple project that we use for “bootstrapping” a new Go project that accomplishes all of the above. Enter: boilerplate

Boilerplate can be used to quickly set up a new Go project that includes:

  • a Docker container for performing Go builds as described above
  • a Makefile for building/testing/linting/etc. (because make is all you need)
  • a simple Dockerfile that uses the compiled binary as the container’s entrypoint
  • basic .gitignore and .dockerignore files

It even stubs out a Go source file for your binary’s main package.

You can find boilerplate on github. The project’s README includes some quick examples, as well as more details about the generated project.

Now, go forth and build! (pun intended)

zulily hosted Women Who Code Event

A sample of zulily’s “women in tech”

zulily knows that women in tech rock! On January 28, 2015 zulily hosted the Women Who Code Seattle chapter's kick-off meeting, bringing women in technology together to discuss their passions. zulily has built its business on providing great value to moms. Our female engineers are a big driving force in delivering on this expectation, so we were very excited to partner with Women Who Code.

Women Who Code is a global nonprofit organization that supports women in technology, helping them advance their careers. They provide supportive environments where women can learn from each other while offering mentorship and networking. 

The night started with mingling and a variety of technology discussions spanning software, robotics, data analysis and bioengineering! College students and seasoned professionals gathered to share experiences, answer questions and enjoy food and wine. The atmosphere was inspiring and empowering.

Kristin Smith, CEO of Code Fellows, shared with us her story and career path from Amazon to zulily to CEO of Code Fellows. Kristin talked about how women should be persistent in constantly learning new things and trying new technologies to empower their skills and overcome their self-doubt. “Embrace the ambiguity. Embrace the fear. That’s the time when things are going to explode!” 

Kristin Smith, CEO of Code Fellows

We continued the night with interesting tech talks from women who work in zulily tech. Jaime Dughi is Sr. Product Manager of zulily’s Mobile Development team. She walked us through zulily’s unique business model and core development principles: fearless innovation and moving fast. She described the future of the mobile app platform in zulily’s business and how her team is advancing mobile development to tackle the demands of zulily customers while embracing the ever-changing mobile app landscape. (slides)

Jaime Dughi – Sr. Product Manager, zulily Mobile Development Team

Our next speaker was Sara Adineh, Software Engineer on the Personalization Team. Sara remarked on her passion for bringing more women into tech, and shared her past volunteer activities: helping run a club for female students at her school to support women in the tech and science fields. She dove deep into what the Personalization Team does at zulily. "Personalization is the heart of zulily's business model, presenting something special every day for each member," said Sara. She described the mathematics and machine learning techniques that zulily uses to understand what zulily's customers need, so we can provide them something fresh every day. Sara also talked about one of the engineering principles of zulily's tech culture: zulily has open-sourced some of its projects as well as contributed to open-source technologies (such as Go, Kubernetes, and NSQ) used by the company. (Slides)

Sara Adineh – Software Engineer, zulily Personalization Team

Our last speaker was Echo Li, Data Engineer on the Business Intelligence and Data Services team at zulily. She started with "I'm a woman, I'm a mom, I'm an engineer and I'm proud!" and the crowd went wild! She continued by explaining her work as a data engineer at zulily, and spoke about how to understand data and get meaningful information out of it. She talked about the different layers of the data-processing and analytics pipeline here at zulily, and shared the growth of zulily's big data platform from SQL Server to Hadoop, Google Cloud and BigQuery. (Slides)

Echo Li – Sr. Data Warehouse Engineer, BI and Data Services

It was a night of thoughtful questions, engaged attendees and new friendships. We are looking forward to hosting the next Women in Tech event! Follow us if you want to hear about our future events!

Facebook–zulily Joint Hack Day

zulily launches over 9,000 unique styles – more than the size of an average Costco – every day. This causes unique time and scale challenges for the Marketing team as they rapidly create, place and manage ads. Facebook and zulily’s Business and Technology teams have been working together to build a platform – the Acquisition Marketing Platform (AMP) – to automate the ad management process on Facebook.

During this partnership, we realized that both companies share a passion for enabling the engineering teams to move fast and put new, exciting features in the hands of customers. In that spirit, we hosted a joint hack day to see how we can further advance the AMP toolset.

Joint FB-ZU Hack day participants: David Whitney (FB), Leo Hu (ZU) and Tim Moon (ZU). Not in picture: Omar Zayat (FB), Justin Sadoski (FB), Mason Bryant (ZU) and Rahul Srivastava (ZU)

The key wins for the hack day:

  • End-to-end automation: Currently, Marketing uses AMP to create the ad assets and Facebook’s PowerEditor to target and place the ads. The joint team built a prototype, utilizing the FB Graph and Ads APIs, to automate the ad process entirely within AMP. This is a huge win for the team! Marketing no longer has to switch between tools and now has a central place to manage all FB ads. This feature will further enable Marketing to scale up their ad creation process. The AMP Engineering team is now working on building a production version of the prototype.
  • Ad objects hierarchy management: FB ads have a hierarchy of campaign, ad-set and ad. The joint team spent time diving deep into the API to manage this hierarchy. Building this understanding is critical to automating management of campaigns, ad sets and reporting on key metrics.
  • API-driven ad targeting: There are many ways to define the target of an ad on FB: interest, website custom audience, custom audience, etc. The team used the Targeting API to dynamically change the target range of an ad. This enables zulily to build some really interesting scenarios that leverage real-time data in its systems; since we track the cost and conversion of every ad that is launched, we can now automate increasing or decreasing the targeting reach via API calls.
  • Account management: In October, FB released a new model to manage business ad accounts. The team spent time revamping the existing account model. This was a huge help and removed numerous access- and account-related roadblocks to further development. It would have taken us days of back-and-forth over email and conference calls to adopt the new model. During the hack day, this was completed in hours!

We had a fun and highly productive day-and-a-half of hacking. It was amazing to see the cool ideas and products that materialized from this event. The AMP team is super excited (and the Marketing team even more so!) to build all the features that are now on their backlog. We are looking forward to future FB-ZU hack days!

Simulating Decisions to Improve Them

One of the jobs of the Data Science team is to help zulily make better decisions through data. One way that manifests itself is via experimentation. Like most ecommerce sites, zulily continuously runs experiments to improve the customer experience. Our team’s contribution is to think about the planning and analysis of those tests to make sure that when the results are read they are trustworthy and that ultimately the right decision is made.

Coming in hot

As a running example throughout this post, consider a landing page experiment.  At zulily, we have several landing pages that are often the first thing a visitor sees after they click an advertisement on a third-party site. For example, if a person was searching for pet-related products, and they clicked on one of zulily’s ads, they might land here.  Note: while that landing page is real, all the underlying data in this post is randomly generated.

The experiment is to modify the landing page in some way to see if conversion rate is improved.  Hopefully data has been gathered to motivate the experiment but, please, just take this at face-value.

Any single landing page is not hugely critical, but in aggregate they're important for zulily, and small improvements in conversion rates or other metrics can have a large impact on the bottom line. In this example, the outcome metric (what is trying to be improved) is conversion rate, which is simply the number of conversions divided by the total number of visitors.

Thinking with the End in Mind

Planning an experiment consists of many things, but often the most opaque part is the set of implications associated with power. The simple definition of power: given some effect of the treatment, how likely is it that the effect will be detected. The implication is that the more confident one wants to be in the ability to detect the change, the longer the test needs to run. How long the test needs to run has a direct bearing on the number of tests a company can run as a whole and which tests should be given priority… unless you don't care about conflating treatments, but then you have bigger problems :).

Power analysis, however, is a challenge. For anything beyond a simple AB test, a lot needs to be thought through to determine the appropriate test. Therefore, it is often easier to think about the data, then work backwards through the analysis, and then the power.

Ultimately power analysis boils down to the simple question: for how long does a test need to run?

To illustrate this, consider the example experiment, where the underlying conversion rate for landing page A (the control) is 10%, and the expected conversion rate of the treatment is 10.5%. While these are the underlying conversion rates, due to randomness the realized conversion rate will likely be different, but hopefully close.

Imagine that each page is a coin, and each time a customer lands on the page the coin is flipped. Even though it's known a priori that the underlying conversion rate for page A is 10%, if the coin is flipped 1000 times, it's unlikely that it will come up "heads" exactly 100 times. If you were to run the same test twice, you would get two different results, even though all the variables are the same.
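A quick way to see this variability is to simulate a couple of runs of 1,000 visits against the same 10% rate (a small sketch; the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for the illustration

# Number of conversions in two independent runs of 1,000 visits, each with a true 10% rate
run_1 = rng.binomial(n=1000, p=0.10)
run_2 = rng.binomial(n=1000, p=0.10)
print(run_1, run_2)  # two different counts, each near (but rarely exactly) 100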

The rest of the post walks through the analysis of a single experiment, then describes how to expand that single experiment analysis into an analysis of the decision making process, and finally discusses a couple examples of complications that often arise in testing and how they can be incorporated into the analysis of the decision making.

A Single Experiment

For instance: flip the "landing page" coin, so to speak, 1000 times for each page, A and B. For this one experiment, the realized conversion rates (for fake data) are shown in the bar chart below.

[Figure: realized conversion rates for landing pages A and B]

That plot sure looks convincing, but just looking at the plot is not a sufficient way to analyze a test. Think back to the coin flipping example; since the difference in the underlying “heads” rate was only 0.5%, or 5 heads per 1000 flips, it wouldn’t be too surprising if A happened to have more heads than B in any given 1000 flips.

The good news is that statistical tools exist to help understand how likely it is that the observed difference is truly due to an underlying difference in rates rather than due to randomness.

The data collected would look something like:

[Table: sample of the collected data, with Treatment and Converted columns]

where Treatment is the landing page treatment, and Converted is 0 for a visit without a conversion and 1 for a visit with a conversion. For an experiment like this, which has a binary outcome, the statistical tool to choose is logistic regression.

Now we get to see some code!

Assuming the table from above is represented by the “visits” dataframe, the model is very simple to fit in statsmodels.
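A minimal sketch of that fit, assuming visits is a pandas DataFrame with the Treatment and Converted columns described above (the data generation here is illustrative):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate 1,000 visits per landing page with the underlying rates from the example
rng = np.random.default_rng(0)
visits = pd.DataFrame({
    'Treatment': ['A'] * 1000 + ['B'] * 1000,
    'Converted': np.concatenate([rng.binomial(1, 0.100, 1000),
                                 rng.binomial(1, 0.105, 1000)]),
})

# Logistic regression of conversion on treatment assignment
result = smf.logit('Converted ~ C(Treatment)', data=visits).fit()
print(result.summary())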

The output:

[Output: logistic regression summary from statsmodels]

This is a lot of information, but the decision will likely be based on only one number: in the second table, the value at the intersection of the "C(Treatment)[T.B]" row and the "P>|z|" column. This is the p-value: roughly, the probability of seeing a difference this large if the two conversion rates were actually the same. The convention is that if that value is less than .05, the difference is significant. Another number worth mentioning is the coefficient of the treatment, which is the estimated size of the change. It matters because even a significant result can come with a negative coefficient, in which case rolling out the treatment is worse than simply keeping the current page, since we would be accepting a worse landing page.

In this case the p-value is greater than .05, so the decision would be that the observed difference is not indicative of an actual difference in conversion rates. This is clearly the wrong decision. The conversion rates were specified as being different, but that difference could not be statistically detected.

This is ultimately the challenge with power and sample sizes. Had the experiment been run again, with a larger sample, it is possible that we would have detected the change and made the correct decision.  Unfortunately, the planning was done incorrectly and only 1000 samples were taken.

Always Be Sampling

Although the wrong decision was made in the last experiment, we want to improve our decision making.  It is possible to analyze our analysis through simulation. It is a matter of replicating — many times over — the analysis and decision process from earlier. Then it is possible to find out how often the correct decision would be made given the actual difference in conversion rate.

Put another way, the task is to:

  1. Generate a random dataset for the treatment and control groups based on the expected conversion rates.
  2. Fit the model that would have been from the example above.
  3. Measure the outcome based on the decision criteria; here it’ll just be a significant p-value < 0.05.

And now be prepared for the most challenging part: the simulation. These three steps are wrapped in a for loop, and the outcome is collected in an array. Here's a simple example in Python of how that could be carried out.
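A minimal sketch of that loop, following the three steps above (the function names are illustrative, and the decision criterion here also requires a positive coefficient, per the discussion of the summary output):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def run_experiment(n, rate_a=0.100, rate_b=0.105, alpha=0.05, rng=None):
    """Simulate one experiment with n visits per page; return True if the correct decision is made."""
    if rng is None:
        rng = np.random.default_rng()
    visits = pd.DataFrame({
        'Treatment': ['A'] * n + ['B'] * n,
        'Converted': np.concatenate([rng.binomial(1, rate_a, n),
                                     rng.binomial(1, rate_b, n)]),
    })
    result = smf.logit('Converted ~ C(Treatment)', data=visits).fit(disp=0)
    # Correct decision: a significant, positive lift for landing page B
    return (result.pvalues['C(Treatment)[T.B]'] < alpha and
            result.params['C(Treatment)[T.B]'] > 0)

decisions = [run_experiment(1000) for _ in range(500)]
print(np.mean(decisions))  # proportion of simulations where the correct decision was made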

Running that experiment 500 times, with a sample size of 1000, would yield a correct decision roughly 4.2% of the time.  Ugh.

This is roughly the power of the experiment at a sample size of 1000. If we did this experiment 500 times, we would rarely make the correct decision. To correct this, we need to change the experiment plan to generate more trials.

To get a sense for the power at different sample sizes, we choose several possible sample sizes, then run the above simulation for each of them. Now there are two for loops: one to iterate through the sample sizes, and one to carry out the analysis above.
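Reusing the run_experiment helper from the sketch above, the nested loops might look like this (the exact grid of sample sizes is an assumption):

import numpy as np

sample_sizes = [int(10 ** e) for e in (3.0, 3.5, 4.0, 4.5, 5.0)]

power_by_size = {}
for n in sample_sizes:
    decisions = [run_experiment(n) for _ in range(500)]
    power_by_size[n] = np.mean(decisions)

for n, power in sorted(power_by_size.items()):
    print(n, power)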

Here is the outcome of the decisions at various sample sizes.  The “Power” column is the proportion of time a correct decision was made at the specified sample size.

[Table: power (proportion of correct decisions) by sample size]

Not until 10^4.5 samples — roughly 31,000 — does the probability of making the correct decision become greater than 50%. It is now a matter of making the business decision about how important it is to detect the effect. Typically the target power is around 80%, in the same way that the significance level is conventionally around 5%. It would be easy to repeat this simulation for several intermediate sample sizes, between 10^4.5 and 10^5, to determine a sample size with a level of power the business is comfortable with.
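For this simple two-proportion case, a closed-form calculation can also serve as a rough sanity check on the simulated power; here's a sketch using statsmodels' power utilities:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h effect size for 10.5% vs 10.0% conversion
effect = proportion_effectsize(0.105, 0.100)

# Visits per page needed for 80% power at alpha = 0.05 (two-sided z-test)
n_per_page = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(round(n_per_page))  # a sample size on the order of tens of thousands of visits per page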

Uncertainty in Initial Conversion Rate

The outcome of the experiment was given a lot of scrutiny, but the underlying conversion rates were (more or less) taken for granted. The problem is that there's probably a lot of error in the estimated effect of the treatment before the experiment, and some error in the estimated effect of the control, since the control is based on past performance and the treatment is based on a combination of analysis and conjecture.

For example, say we had historical data indicating that 1,000 out of 10,000 people had converted for the control thus far, and we recently ran a test similar to the treatment, so we have some confidence that 105 out of 1,000 people would convert.

If that was the prior information for each page, the distribution of conversion rates for each landing page over 1,000 experiments could look like:

[Figure: distributions of conversion rates for landing pages A and B over 1,000 experiments]

Even though it appears that landing page B does have a higher conversion rate on average, its distribution around that average is much wider. To factor in that uncertainty, we can rerun the simulation, but instead of assuming a fixed conversion rate, we sample from the distribution of the conversion rate before each simulated experiment.
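A sketch of that modification, reusing the run_experiment helper from earlier and drawing each experiment's underlying rates from Beta distributions implied by the historical counts above (an assumption about how the sampling is done):

import numpy as np

rng = np.random.default_rng(1)

def sample_rates(rng):
    """Draw plausible underlying conversion rates from the historical evidence."""
    rate_a = rng.beta(1000, 9000)  # control: 1,000 conversions out of 10,000 visits
    rate_b = rng.beta(105, 895)    # treatment: 105 conversions out of 1,000 visits
    return rate_a, rate_b

def run_uncertain_experiment(n):
    rate_a, rate_b = sample_rates(rng)
    return run_experiment(n, rate_a=rate_a, rate_b=rate_b, rng=rng)

decisions = [run_uncertain_experiment(1000) for _ in range(500)]
print(np.mean(decisions))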

Sadly, our power was destroyed by the randomness associated with the uncertainty between landing pages. Here's the same power-by-sample-size table as above, showing the proportion of times we'd make the correct decision. For example, even at 10^5.0 samples, it is likely that in some simulated experiments the conversion rate for landing page B was actually less than landing page A's.

[Table: power by sample size, with uncertainty in the initial conversion rates]

An alternative route in a situation like this is to use a beta-binomial model to continue incorporating additional data into the initial conversion rate estimates.

More Complicated Experiments

The initial example was a very simple test, but more complex tests are often useful.  With more complex experiments, the framework for planning needs to expand to facilitate better decision making.

Consider a similar example to the original one, with an additional complication. Since the page is a landing page, the user had to come from somewhere. These sources of traffic are also sources of variation. Just as any realized experiment can vary from the expectation, any given source's underlying conversion rate can also vary from the expectation. In the face of that uncertainty, it would be a good idea to run the test across multiple ads.

To simplify our assumptions, consider that the expected change in conversion rate is still 0.5%, but across three ads the conversion rate varies individually by -0.01%, 0.00% and +0.01% due to the individual ad-level characteristics.

For example, this could be the outcome of one possible experiment with two landing pages and three ads.

[Table: example outcome of one experiment with two landing pages and three ads]

Thankfully statsmodels has a consistent API so just a few things need to change to fit this model:

  • Use gee instead of logit. GEE stands for generalized estimating equations; it enables a GLM to be fit while allowing for correlation within groups, or, for these purposes, a logistic regression with the group-level variance taken into consideration.
  • Pass the groups via the “groups” argument.
  • Specify the family of the GLM; here it’s binomial with a logit link function (the default argument).

Those changes would look like this:

[Code: fitting the GEE model in statsmodels]
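A sketch of how that fit might look, assuming the visits dataframe now also carries an Ad column identifying the traffic source:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Logistic regression with ad-level grouping, fit via generalized estimating equations
model = smf.gee('Converted ~ C(Treatment)', groups='Ad', data=visits,
                family=sm.families.Binomial())
result = model.fit()
print(result.summary())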

The decision criteria here are similar to the first case, so we cannot say anything about the effect of the landing page, and this test would likely not lead to a rollout. Now that the basic model is constructed, we follow the same process to estimate how much power the experiment would have at various sample sizes.

Conclusion

Experiments are challenging to execute well, even with these additional tools. The groups that are of sufficient size to necessitate testing are normally large and complex enough that wrong decisions can be made. Through simulation and thinking about the decision-making process, it is possible to quantify how often a wrong decision could occur, what its impact would be, and how best to mitigate the problem.

(By the way, zulily is actively looking for someone to make experimentation better, so if you feel that you qualify, please apply!)