The “Power” of A/B Testing


Customers around the world flock to Zulily on a daily basis to discover and shop products that are curated specially for them. To serve our customers better, we are constantly innovating to improve the customer experience. To help us decide between different customer experiences, we primarily use split testing or A/B testing, the industry standard for making scientific decisions about products. In a previous post, we talked about the fundamentals of A/B testing and described our home-grown solution. In this article, we are going to explore one choice behind designing sound A/B tests: calculating the number of people that need to be included in a test.

Figure 1. For example, in an A/B test we might compare two variants of the navigation bar on the website. Here, the New Today tab, displaying new events, appears either in position one (original version) or position two (test version).

To briefly recap the idea of A/B testing, in an A/B test we compare the performance of two variants of the same experience. For example, in a test we might compare two ways of ordering the navigation tabs on the website (see Figure 1). In the original version (variant A or control), the tabs are ordered such that events launched today (New Today) appear before events that are ending soon (Ends Soon). In the new version of the website (variant B or treatment), Ends Soon appears before New Today. The test is set up such that, for a pre-defined period of time, customers visiting the website would be shown either variant A or variant B. Then, using statistical methods, we would measure the incremental improvement in the experience of customers that were shown variant B over those who were shown variant A. Finally, if there was a statistically significant improvement, we might decide to change the order of the tabs on the website.

Since Zulily relies heavily on A/B testing to make business decisions, we are careful about avoiding common pitfalls of the method. The length of an A/B test ties in strongly with the success of the business for two reasons:

  • If A/B tests run for more days than necessary, the pace of innovation at the company will slow down.
  • If A/B tests run for fewer days than required to achieve statistically sound results, the results will be misguided.

To account for these issues, before making decisions based on the A/B test results, we run what’s called a ‘power analysis.’ A power analysis ensures that enough people have been captured in the test to confirm or deny whether variant B was an improvement over variant A; this calculation is the focus of this article. We also make sure that the test runs long enough to account for short-term business cycles. The number of people needed in a test is a function of three things: effect size, significance level (\alpha), and power (1-\beta).

To consult the statistician after an experiment is finished is often merely to ask [them] to conduct a post mortem examination. [They] can perhaps say what the experiment died of.

– Ronald Fisher, Statistician

Common Terms

Before we get into the mechanics of that calculation, let us familiarize ourselves with some common statistical terms. In an A/B test, we are trying to estimate how all the customers behave (population) by measuring the behavior of a subset of customers (sample). For this, we ensure that our sample is representative of our entire customer base. During the test, we measure the behavior of the customers in our sample. Measurements might include the number of items purchased, the time spent on the website, or the money spent on the website by each customer.

For example, to test whether variant B outperformed variant A, we might want to know if the customers exposed to variant B spent more money than the customers exposed to variant A. In this test, our default position is that variant B made no difference to the behavior of the customers when compared to variant A (null hypothesis). As more customers are exposed to these variants and start purchasing products, we collect measurements on more customers, which allows us to either reject or fail to reject this null hypothesis. The difference between the behavior of customers exposed to variant A and variant B is known as the effect size.

Figure 2. In A/B testing, different types of errors can occur, depending on where the results lie on this graph. Therefore, the parameters that we set under the hood, namely significance level and power, need to be set carefully, keeping our appetite for error in mind.

Further, there are a number of parameters set under the hood. Before starting the test, we assign a significance level (\alpha), here 0.05, which means that we are willing to reject the null hypothesis when it is actually true in 5% of cases (the Type I error rate). We also assign a power (1-\beta), here 0.80, which means that when the null hypothesis does not hold (that is, variant B changes the behavior of customers), the test will allow us to reject the null hypothesis 80% of the time. Importantly, these parameters need to be set at the beginning of the test and upheld for the duration of the test to avoid p-hacking, which leads to misguided results.


Estimating the Number of Customers for the A/B Test

For this exercise, let us revisit the previous example where we show customers two versions of the Zulily navigation bar. Let us say we want to see if this change makes customers less or more engaged on Zulily’s website. One metric that can capture this is the proportion of customers who revisit the website when shown variant B versus variant A. Let us say that we are interested in a boost in this metric of at least 1% (the effect size). If we see at least this effect size, we might implement variant B. The question is: how many customers should be exposed to each variant to allow us to confirm that this change of 1% exists?

Starting off, we define some parameters. First, we define the significance level at 0.05. Second, from the central limit theorem, we assume that the average money that a group of customers spend on the website is normally distributed. Third, we direct 50% of the customers visiting the site to variant A and 50% to variant B. These last two points greatly simplify the math behind the calculation. Now, we can estimate the number of people that need to be exposed to each variant.
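A standard two-sample formula for the required sample size per variant, consistent with these assumptions, is:

    n = \frac{2\sigma^2\left(z_{1-\frac{\alpha}{2}} + z_{1-\beta}\right)^2}{\delta^2}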

where \sigma is the standard deviation of the population, \delta is the change we expect to see, and z_{1-\frac{\alpha}{2}} and z_{1-\beta} are quantile values calculated from a normal distribution. For the case of the parameters defined above, a significance level of 0.05 and a power of 0.80, and if we wanted to detect a 1% change in the proportion of people revisiting the website, our formula would simplify to:

This formula gives us the number of people that need to be exposed to one variant. Finally, since the customers were split evenly between variant A and variant B, we would need twice that number of people in the entire test. This estimate can change significantly if any of the parameters change. For example, to detect a larger difference at this significance level, we would need much smaller samples. Further, if the observations are not normally distributed, then we would need a more complicated approach.
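To make this concrete, the calculation can be scripted directly. The sketch below uses the normal-approximation formula above and assumes a hypothetical baseline revisit rate of 10%; the actual baseline rate and variance would come from historical data.

    from scipy.stats import norm

    def sample_size_per_variant(baseline_rate, effect, alpha=0.05, power=0.80):
        """Approximate per-variant sample size for detecting an absolute change
        of `effect` in a proportion, using the normal approximation."""
        z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance quantile
        z_beta = norm.ppf(power)            # quantile for the desired power
        p_bar = baseline_rate + effect / 2  # rough pooled proportion
        sigma_sq = p_bar * (1 - p_bar)      # variance of a Bernoulli (revisit) metric
        return 2 * sigma_sq * (z_alpha + z_beta) ** 2 / effect ** 2

    # Hypothetical numbers: 10% of customers revisit today, and we want to
    # detect an absolute lift of 1 percentage point.
    n = sample_size_per_variant(baseline_rate=0.10, effect=0.01)
    print(f"~{n:,.0f} customers per variant, ~{2 * n:,.0f} in the entire test")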

Benefits of this calculation

In short, getting an estimate of the number of customers needed allows us to design our experiments better. We suggest conducting a power analysis both before and after starting a test for several reasons:

  • Before starting the test – This gives us an estimate of how long our test should be run to detect the effect that we are anticipating. Ideally, this is done once to design the experiment and the results are tallied when the requisite number of people are exposed to both variants. However, the mean and standard deviation used in the calculation before starting the test are approximations to the actual values that we might see during the test. Thus, these a priori estimates might be off.
  • After starting the test – As the test progresses, the mean and standard deviation converge to values representative of the current sample which allows us to get more accurate estimates of the sample size. This is especially useful in cases where the new experience introduces unexpected changes in the behavior of the customers leading to significantly different mean and standard deviation values than those estimated earlier.


At Zulily, we strive to make well-informed choices for our customers by listening to their voice through our A/B testing platform, among other channels, and ensuring that we are constantly serving the needs of the customers. While obtaining an accurate estimate of the number of people for the test is challenging, we hold it central to the process. Most people agree that the benefits of a well-designed, statistically sound A/B testing system far outweigh the benefits from obtaining quick, but misdirected numbers. Therefore, we aim for a high level of scientific rigor in our tests.

I would like to thank my colleagues in the data science team, Demitri Plessas and Pamela Moriarty, and my manager, Paul Sheets, for taking time to review this article. This article is possible due to the excellent work by the entire data science team in maintaining the A/B testing platform and ensuring that experiments at Zulily are well-designed.


Learn how Zulily and Sounders FC get the most out of their metrics!

On Tuesday, September 10th, Zulily was proud to partner with Seattle Sounders FC for a tech talk on data science, machine learning and AI. This exclusive talk was led by Olly Downs, VP of Data & Machine Learning at Zulily, and Ravi Ramineni, Director of Soccer Analytics at Sounders FC.

Zulily and Sounders FC both use deep analysis of data to improve the performance of their enterprises. At Zulily, applying advanced analytics and machine learning to the shopping experience enables us to better engage customers and drive daily sales. For Sounders FC, the metrics reflect how each player contributes to the outcome of each game; understanding the relationship between player statistics, training focus and performance on the field helps bring home the win. For both organizations, being intentional about the metrics we select and optimize for is critical to success.

We would like to thank everyone who attended the event for a great night of discussion and for developing new ties within the Seattle developer community. For any developers who missed this engaging discussion, we invite you to view the full presentation and audience discussion:


Thanks to Olly Downs and Ravi Ramineni for presenting their talks, Sounders FC for hosting, and Luke Friang for providing a warm welcome. This would not have been possible without the many volunteers from Zulily, Bellevue School of AI for co-listing the event, as well as all the attendees for making the tech talk a success!

For more information:

If you’d like to chat further with one of our recruiters regarding a position within data science, machine learning or any of our other existing roles, feel free to reach out directly to get the conversation started. Also be sure to follow us on LinkedIn and Twitter.


Seattle Female Leaders Discuss Their Paths to Success

On September 5th, Zulily was proud to partner with Reign FC for a thought leadership event, celebrating and highlighting leadership skills from Seattle leaders, including women in STEM, sports and business. This discussion was led by Kelly Wolf, with Celia Jiménez Delgado, Kat Khosrowyar, Jana Krinsky and Angela Dunleavy-Stowell. Our panel addressed a variety of topics, including mentorship, leadership, impostor syndrome, advocacy and unconventional advice. Following the discussion and audience Q&A, attendees had the opportunity to meet Celia Jiménez Delgado, Bev Yanez, Morgan Andrews, Bethany Balcer, Lauren Barnes, Michelle Betos, Darian Jenkins, Megan Oyster, Taylor Smith and Morgan Proffitt of Reign FC! Attendees were also able to take a professional headshot, courtesy of Zulily’s hardworking studio team. We’d like to thank all who were able to attend, as well as the Zulily staff whose efforts made this event a success.

Panel Highlights:

“As you grow in your career, you are being sought for your leadership and critical thinking skills, and for your ability to diagnose and solve problems, not regurgitate facts.” Kelly Wolf, VP of People at Zulily

“I wouldn’t be where I am if it wasn’t for my mentors. We need to push more, take more risk to support each other and come together as a community. It doesn’t matter if you’re a man or a woman, we all need to work together.” Kat Khosrowyar, Head Coach at Reign Academy, former Head Coach of Iran’s national soccer team, Chemical Engineer

“I am not a developer, but currently mentor a female developer. She drives the topic, and I act as a sounding board. Working on a predominately male team, she needed a different confidante to work through issues, approach, development ideas and career path goals.” Jana Krinsky, Director of Studio at Zulily

“During meetings, I sometimes tell myself, ‘should I be here? I’m in over my head.’ And I sort of have to call bull**** on myself. I think we all need to do that.” Angela Dunleavy-Stowell, CEO at FareStart, Co-Founder at Ethan Stowell Restaurants

“When you have confidence in yourself, when you think ‘I’m going to own it, this is going to happen because I’m going to make it happen,’ it matters. As women, we can’t use apologetic language like ‘Sorry, whenever you have a second, I would like to speak to you’ — we don’t need to be sorry for doing our jobs. Women need to start changing those sentences to, ‘when would be a good time to talk about this project?’ and treating people as your equal, not as someone who’s above you.” Celia Jiménez Delgado, right wing-back for Reign FC + Spain’s national soccer team, Aerospace Engineer

“We all have to find our courage. Because if you want to grow and be in a leadership role, that’s going to be a requirement. I think identifying that early in your career is a great way to avoid some pitfalls down the road.” Angela Dunleavy-Stowell, CEO at FareStart, Co-Founder at Ethan Stowell Restaurants


Thanks to Celia Jiménez Delgado, Kat Khosrowyar, Jana Krinsky, Angela Dunleavy-Stowell and Kelly Wolf for this engaging panel! We’d also like to give a big thanks to FareStart, Reign FC  and Reign Academy for supporting this event. This would not have been possible without the many volunteers from Zulily as well as all the attendees for making the night a success!

For more information:

If you’d like to chat further with one of our recruiters regarding a position within data science, machine learning or any of our other existing roles, feel free to reach out directly to get the conversation started. Also be sure to follow us on LinkedIn and Twitter.

Interning at Zulily

Hello, I’m Han. I am from Turkey but moved to the Greater Seattle area 5 years ago. I finished high school in Bellevue, went to Bellevue College, transferred to the University of Washington Computer Science department, and am now going into my senior year. This summer, for three months, I worked as an intern on Zulily’s Member Engagement Platform (MEP) team. This post is about my internship journey.

This was my first internship, so I came to my first day of work with one goal: learning. But I didn’t know it was going to become something bigger.

In my first month, I worked on a Java project in which I downloaded data from an outside source and used our customer data to map customers to their time zones. During that time, I learned about AWS services such as ECS, Lambda, Route53, Step Functions, etc. I learned about containerized deployments and created CI/CD and CloudFormation files.

In my second month, I got into working with the UI. Before my internship, I had never done any front-end UI work, but during that month I learned how to work on a React app and use JavaScript. I worked with engineering and Marketing to implement features in the MEP UI.

At the beginning of my third month, I was working on our Facebook Messenger bot, implementing features that users could use in the Messenger app. By then I was working on projects on both the front end and the back end, getting my hands dirty in every part of the stack. I was learning, deploying and helping.

In the beginning, either my manager or other engineers would assign me tasks. But after two months, I was picking up my own: choosing tasks, working with engineers and marketing, going to design meetings, and helping other engineers.

Before the start of the internship, my friends and my adviser told me that I should expect to be given one big project that I would work on somewhere in the corner by myself. They also warned me that my project would probably never be deployed or used. But here at Zulily that wasn’t the case at all. I was working with the team, as a part of the team, not as an outsider intern. I was deploying new features every week. Features I can look back at, show others and be proud of. I was coming to work every day with a goal to finish tasks and leaving with the feeling of accomplishment. I felt like I was part of a bigger family. In my team, everyone was helping each other, they were working together in order to succeed together, just like a team, just like a family.  

Now that I’m at the end of my internship, I’m leaving with a lot of accomplishments, a lot more knowledge, and going back to school to finish my last year. In conclusion, I believe interning at Zulily was the perfect internship. I, as an intern, learned about the company, the team, the workflow, the projects. I learned new engineering skills, learned how to work in a different environment with a team, accomplished a lot of tasks and finally contributed to the team and the company in general.  

My First Six Months at Zulily After a Bootcamp

Hi, I’m Mark — a mathematician turned coffeeshop owner turned actuary turned software engineer. When I discovered my passion for software development, I quit my day job, went through an immersive development bootcamp and got recruited by Zulily’s Member Engagement Platform team. This post is about my first six months here.

Things I had to learn on the go: 

  • Becoming an IntelliJ power user 
  • Navigating the code base with many unfamiliar projects 
  • Learning about containerized deployments and CI/CD 
  • Getting my hands dirty with Java, Scala, Python, JavaScript, shell scripting and beyond 
  • Reaching high into the AWS cloud: ECS, SQS, Kinesis, S3, Lambda, Step Functions, EMR, CloudWatch… 

Things that caught me by surprise: 

  • I shipped my code to production two weeks after my start date. 
  • Engineers had autonomy and trust. The managers removed obstacles. 
  • The amount of collective knowledge and experience around me was stunning. 
  • I was expected to work directly with the Marketing team to clarify requirements. No middlemen. 
  • Iteration was king! 

Things that I am still learning with the help from my team: 

  • Form validation in React; 
  • Designing and building high-scale distributed systems; 
  • Tuning AWS services to match our use cases; 
  • On-call rotations (the team owns several tier-1 services); 
  • And many more… 

There is one recurring theme in all these experiences. No matter how new, huge, or unfamiliar a task may seem at first, it’s just a matter of breaking it down methodically into manageable chunks that can then be understood, built, and tested. This process can be time consuming, and I feel lucky that my team understands that. It’s clear that engineers at Zulily are encouraged to take the time and build products that last.    

One of the things that I find truly amazing here is the breadth of work that is considered ‘full stack’. DevOps, Big Data, Micro Services, and React client apps are just some of the areas in which I have been able to expand my knowledge in the last several months. It may have been overwhelming at first, but among teammates with vast expertise, acquiring these new skills became an exciting part of my daily routine at Zulily.

It’s hard to compare what I know now to what I knew six months ago — my understanding has expanded in breadth and depth — and I’m excited to see what the next six months will bring.

Leveraging Serverless Tech without falling into the “Ownerless” trap

The appeal of serverless architecture sometimes feels like a silver bullet: Offload your application code to your favorite cloud providers’ serverless offering and enjoy the benefits of extreme scalability, rapid development, and pay-for-use pricing. At this point in time most cloud providers’ solutions and tooling are well defined, and you can get your code executing in a serverless fashion with little effort.

The waters start to get a little murkier when considering making an entire application serverless while maintaining the coding standards and maintainability of traditional server-based architecture. Serverless has a tendency to become ownerless (aka hard to debug and painful to work on) over time without the proper structure, tooling and visibility. Eventually these applications suffer due to code fragility and slow development time, despite how easy and fast they were to prototype.

On the pricing team at Zulily, our core services are written in Java using the Dropwizard framework. We host on AWS using Docker and Kubernetes and deploy using Gitlab’s CICD framework. For most of our use cases this setup works well, but it’s not perfect for everything. We do feel confident, though, that it provides the structure and tools that help us avoid the ownerless trap and achieve our goals for well-engineered service development. While this list of goals will differ between teams and projects, for our purposes it can be summarized below:


  • Testability (Tests & Local debugging & Multiple environments)
  • Observability (Logging & Monitoring)
  • Developer Experience (DRY & Code Reviews & Speed & Maintainability)
  • Deployments
  • Performance (Latency & Scalability)

We are currently redesigning our competitive shopping pipeline and thought this would be a good use case to experiment with a serverless design. For our use case we were most interested in its ability to handle unknown and intermittent load and were also curious to see if it improved our speed of development and overall maintainability. We think we have been able to leverage the benefits of a serverless architecture while avoiding some of the pitfalls. In this post, I’ll provide an overview of the choices we made and lessons we learned.


There are many serverless frameworks: some are geared towards being platform agnostic, others focus on providing better UIs, some make deployment a breeze, etc. We found most of these frameworks were abstraction layers on top of AWS/GCP/Azure, and since we are primarily an AWS-leaning shop at Zulily, we decided to stick with their native tools where we could, knowing that later we could swap out components if necessary. Some of these tools were already familiar to the team and we wanted to be able to take advantage of any new releases by AWS without waiting for an abstraction layer to implement said features.

Basic architecture of serverless stack:

Below is a breakdown of the major components/tools we are using in our serverless stack:

API Gateway:

What it is: Managed HTTP service that acts as a layer (gateway) in front of your actual application code.

How we use it: API gateway handles all the incoming traffic and outgoing responses to our service. We use it for parameter validation, request/response transformation, throttling, load balancing, security and documentation. This allows us to keep our actual application code simple while we offload most of the boilerplate API logic and request handling to API Gateway. API gateway also acts as a façade layer where we can provide clients with a single service and proxy requests to multiple backends. This is useful for supporting legacy Java endpoints while the system is migrated and updated.


Lambda:

What it is: Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources.

How we use it: We use Lambdas to run our application code using the API gateway Lambda proxy integration. Our Lambdas are written in Nodejs to reduce cold start time and memory footprint (compared to Java). We have a 1:1 relationship between an API endpoint and Lambda to ensure we can fine tune the memory/runtime per Lambda and utilize API Gateway to its fullest extent, for example, to bounce malformed requests at the gateway level and prevent Lambda execution. We also use Lambda layers, a feature introduced by AWS in 2018, to share common code between our endpoints while keeping the actual application logic isolated. This keeps our deployment size smaller per lambda and we don’t have to replicate shared code.
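To illustrate the proxy integration, here is a minimal handler sketch; it is shown in Python for brevity (as noted above, our production Lambdas are Node.js), and the path parameter name is hypothetical. API Gateway passes the whole request in as the event and expects a statusCode/body-shaped response.

    import json

    def handler(event, context):
        """Minimal API Gateway Lambda proxy handler (illustrative sketch only)."""
        # With the proxy integration, path and query parameters arrive on the event.
        product_id = (event.get("pathParameters") or {}).get("productId")
        if product_id is None:
            # Malformed requests are normally bounced at the gateway; this is a fallback.
            return {"statusCode": 400, "body": json.dumps({"error": "productId required"})}
        # ... application logic would go here ...
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"productId": product_id}),
        }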

SAM (Cloudformation):

What it is: AWS SAM is an open-source framework that you can use to build serverless applications on AWS. It is an extension of Cloudformation which lets you bundle and define serverless AWS resources in template form, create stacks in AWS, and enable permissions. It also provides a set of command line tools for running serverless applications locally.

How we use it: This is the glue for our whole system. We use SAM to describe and define our entire stack in config files, run the application locally, and deploy to integration and production environments. This really ties everything together and without this tool we would be managing all the different components separately. Our API gateway, Lambdas (+ layers) and permissions are described and provisioned in yaml files.

Example of defining a lambda proxy in template.yaml

Example of shared layers in template.yaml

Example of defining corresponding API gateway endpoint in template.yaml


DynamoDB:

What it is: A managed key value and document data store.

How we use it: We use DynamoDB for our ingestion pipeline getting data into our system. Dynamo was the initial reason we chose to experiment with serverless for this project. Your serverless application is limited by your database’s ability to scale fluidly. In other words, it doesn’t matter if your application can handle large spikes in traffic if your database layer is going to return capacity errors in response to the increased load. Our ingestion pipeline spikes daily from ~5 to ~700 read or write capacity units per second and is scheduled to increase, so we wanted to make sure throwing additional batches of reads or writes at our datastore from our serverless API wasn’t going to cause a bottleneck during a spike. In addition to being able to scale fluidly, Dynamo has also simplified our Lambda–datastore interaction: we don’t have the complexity overhead of trying to open and share database connections between lambdas, because Dynamo’s low-level API uses HTTP(S). This is not to say it’s impossible to use a different database (there are lots of examples out there of this), but it’s arguably simpler to use something like Dynamo.
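For illustration, writing one record from a Lambda can be as small as the sketch below; the table and attribute names are hypothetical, and each invocation simply constructs a client over HTTP(S) rather than managing a connection pool.

    from decimal import Decimal
    import boto3

    # No connection pool to manage: DynamoDB's low-level API is HTTP(S)-based,
    # so each Lambda invocation can simply construct a client and write.
    table = boto3.resource("dynamodb").Table("competitive-prices")  # hypothetical table name

    def record_price(item_id: str, price: float) -> None:
        """Persist one ingested competitive-price record (illustrative sketch)."""
        table.put_item(Item={"item_id": item_id, "price": Decimal(str(price))})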

Normal spikey ingestion traffic for our service with dynamo scaling

Other tools: Gitlab CICD, Cloudwatch, Elk stack, Grafana:

These are fairly common tools, so we won’t go into detail about what they are.

How we use it: We use Gitlab CICD to deploy all of our other services, so we wanted to keep that the same for our serverless application. Our Gitlab runner uses a Docker image that has AWS SAM for building, testing, deploying (in stages) and rolling back our serverless application. Our team already uses the Elk stack and Grafana for logging, visualization and alerting. For this service all of our logs from API Gateway, Lambda and Dynamo get picked up in Cloudwatch. We use Cloudwatch as a data source in Grafana and have a utility that migrates our Cloudwatch logs to Logstash. That way we can keep our normal monitoring and logging systems without having to go to separate tools just for this project.


So now that we have laid out our basic architecture and tooling: How well does this system address our goals mentioned at the start of this post? Below is a breakdown for how these have been addressed and a (yes, very subjective) grade for each.

Testability (Tests & Local debugging & Multiple environments)

Overall Grade: A-

Positives: The tooling for serverless has come a long way and impressed us in this category. One of the most useful features is the ability to start your serverless app locally by running

sam local start-api <OPTIONS>

This starts a local http server based on your AWS::Serverless::Api specification in your template.yaml. When you make a request to your local server, it reads the corresponding CodeUri property of your lambda (AWS::Serverless::Function) and starts a docker container (the same image AWS uses to run deployed lambdas) to run your lambda locally in conjunction with the request. We were able to write unit tests for all the lambdas and lambda layer code as well as deploy to specific integ/prod environments (discussed more below). There are additional console and CLI tools for triggering API gateway endpoints and lambdas directly.

Negatives: Most of the negatives for this category are nitpicks. Unit testing with layers took some tweaking and feels a little clumsy and sam local start-api doesn’t exactly mimic the deployed instance in how it handles errors and parameter validation. In addition, requests to the local instance were slow because it starts a docker container locally every time an endpoint is requested.

Observability (Logging & Monitoring)

Overall Grade: B

Positives: At the end of the day this works pretty well and mostly mimics our other services where we have our logs searchable in Kibana and our data visible in Grafana. The integration with Cloudwatch is pretty seamless and really easy to get started with.

Negatives: Observability in the serverless world is still tough. It tends to be distributed across lots of components and can be frustrating to track down where things unraveled. With the rapid development of tooling in this category, we don’t see this being a long-term problem. One tool that we have not tried, but could bump this grade up, is AWS X-ray which is a distributed tracing system for AWS components.

Developer Experience (DRY & Code Reviews & Speed & Maintainability)

Overall Grade: A-

Positives: Developer experience was good, bordering on great. Each endpoint is encapsulated in its own small codebase which makes adding new code or working on existing code really easy.  Lambda layers have solved a lot of the DRY issues. We share response, error and database libraries between the lambdas as well as NPM modules. All of our lambda code gets reviewed and deployed like our normal services.

Negatives: In our view the two biggest downsides have been unwieldy configuration and immature documentation. While defining everything in configuration has its benefits, and it’s great to be able to see the entire infrastructure of your application in code, it can be tough to jump into, and SAM/Cloudformation has a learning curve. Documentation is solid but could be better. Part of the issue is the rapid pace of feature releases and some confusion on best practices.


Deployments

Overall Grade: A

Positives: Deployments are awesome with SAM/Cloudformation. From our Gitlab runner we execute:

aws cloudformation package --template-file template.yaml   --output-template-file packaged_{env}.yaml --s3-bucket {s3BucketName}

which uploads template resources to s3 and replaces paths in our template. We can then run:

aws cloudformation deploy --template-file packaged_{env}.yaml   --stack-name {stackName} --capabilities CAPABILITY_IAM --region us-east-1   --parameter-overrides DeployEnv={env} 

 which creates a change set (shows effects of proposed deployment) and then creates the stack in AWS. This will create/update all the resources, IAM permissions and associations defined in template.yaml. This is really fast, a normal build and deployment of all of our Lambdas, API Gateway endpoints, permissions etc… takes ~90 seconds (not including running our tests).

Negatives: Unclear best practices for deploying to multiple environments. One approach is to use different API stages in API gateway and have your other resources versioned for those stages. Another approach (the way we chose) is to have completely different stacks for different environments and pass a DeployEnv variable into our Cloudformation scripts.

Performance (Latency & Scalability)

Overall Grade: A-

Positives: To test our performance and latency before going live we proxied all of our requests in parallel to both our existing (Dropwizard) API and new serverless API. It is important to consider the differences between the systems (data stores, caches etc..), but after some tweaking we were able to achieve equivalent P99, P95 and P50 response times between the two systems. In addition to the ability to massively scale out instances in response to traffic, the beauty of the serverless API is that you can fine tune performance on a function (endpoint) basis. CPU share is proportionally allocated depending on overall memory of each Lambda, so increasing the memory of each Lambda has the ability to directly decrease latency. When we first started routing traffic in parallel to our serverless API we noticed some of the bulk API endpoints were not performing as well as we had hoped. Instead of increasing the overall CPU/memory of the entire deployment, we just had to do this for the slow Lambdas.

Negatives: A well-known and often discussed issue with serverless is dealing with cold start times. This refers to the latency increase when an initial instance is brought up by your cloud provider to process a request to your serverless app. In practice this sometimes adds several hundred ms of latency on cold start for our endpoints. Once spun up, instances won’t get torn down immediately and subsequent requests (if routed to this same instance) won’t suffer from that initial cold start time. There are several strategies to avoid these cold starts, like “pre-warming” your API with dummy requests when you know traffic is about to spike. You can also configure your application to run with more memory, which persists for longer between requests (and increases CPU). Both of these strategies cost money, so the goal is to try to find the right balance for your service.
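As a hedged sketch of the pre-warming strategy, a small scheduled script can send dummy requests shortly before an expected spike; the endpoint URLs below are hypothetical placeholders.

    import requests

    # Hypothetical endpoints to keep warm ahead of a known traffic spike.
    ENDPOINTS = [
        "https://api.example.com/prices/health",
        "https://api.example.com/prices/bulk/health",
    ]

    def prewarm():
        """Send lightweight dummy requests so instances are already warm."""
        for url in ENDPOINTS:
            try:
                requests.get(url, timeout=2)
            except requests.RequestException:
                pass  # best effort; a failed warm-up ping is not critical

    if __name__ == "__main__":
        prewarm()  # e.g. run on a schedule shortly before expected load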

Example of scaling an arbitrary computation-heavy lambda (at ~14:40) from 128 MB to 512 MB, and the effect on response times

Overall latency with spikes over a ~3 hour period (most lambdas configured at 128 MB or 256 MB). Spikes are either cold start times, calls that involve several hundred queries (bulk APIs), or a combination. Note: this performance is tuned to be equivalent to the p99, p90 and p50 times of the old service.


At the end of the day this setup has been a fun experiment and for the most part satisfied our requirements. Now that we have set this up once, we will definitely use it again for suitable projects in the future. A few things we didn’t touch on in this post are pricing and security, two very important considerations for any project. Pricing is tricky to get an accurate comparison of because it depends on traffic, function size, function execution time and many other configurable options along the entire stack. It is worth mentioning, though, that if you have large and consistent traffic to your APIs, it is highly unlikely that serverless will be as cost effective as rolling your own (large) instance. That being said, if you want a low maintenance service that can scale on demand to handle spikey, intermittent traffic, it is definitely worth considering. Security considerations are also highly dependent on your use case. For us, this is an internal facing application, so strategies like an internal firewall, limited-resource IAM permissions and basic gateway token validation sufficed. For public facing applications there are a variety of security strategies to consider that aren’t mentioned in this post. Overall, we would argue that the tools are at a point where it does not feel like you are sacrificing in terms of development standards by choosing a serverless setup. Still not a silver bullet, but a great tool to have in the toolbox.



Improving Marketing Efficiency with Machine Learning


Here at Zulily, we offer thousands of new products to our customers at a great value every day. These products are available for about 72 hours; to inform existing and potential customers about our ever-changing offerings, the Marketing team launches new ads daily for these offerings on Facebook.

To get the biggest impact, we only run the best-performing ads. When done manually, choosing the best ads is time-consuming and doesn’t scale. Moreover, the optimization lags behind the continuously changing spend and customer activation data, which means wasted marketing dollars. Our solution to this problem is an automated, real-time ad pause mechanism powered by Machine Learning.

Predicting CpTA

Marketing uses various metrics to measure ad efficiency. One of them is CpTA or Cost per Total Activation (see this blog post for a deeper dive on how we calculate this metric). Lower CpTA means spending less money to get new customers so lower is better.

To pause ads with high CpTA, we trained a Machine Learning model to predict the next-hour CpTA using the historical performance data we have for ads running on Facebook. If the model predicts that the next-hour CpTA of an ad will exceed a certain threshold, that ad will be paused automatically. The marketing team is empowered to change the threshold at any time.

Ad Pause Service

We host the next-hour CpTA model as a service and have other wrapper microservices deployed to gather and pass along the real-time predictor data to the model. These predictors include both relatively static attributes about the ad and dynamic data such as the ad’s performance for the last hour. This microservice architecture allows us to iterate quickly when doing model improvements and allows for tight monitoring of the entire pipeline.

The end-to-end flow works as follows. We receive spend data from Facebook for every ad hourly. We combine that with activation and revenue data from the Zulily web site and mobile apps to calculate the current CpTA. Then we use CpTA threshold values set by Marketing and our next-hour CpTA prediction to evaluate and act on the ad. This automatic flow helps manage the large number of continuously changing ads.
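A simplified sketch of that hourly evaluation is below; the model client, threshold, and field names are hypothetical stand-ins for our internal services.

    def evaluate_ad(ad, predict_next_hour_cpta, pause_ad, cpta_threshold):
        """Pause a Facebook ad if its predicted next-hour CpTA exceeds the Marketing threshold."""
        spend = ad["hourly_spend"]              # hourly spend data from Facebook
        activations = ad["hourly_activations"]  # activations from the Zulily site and apps
        current_cpta = spend / activations if activations else float("inf")

        predicted = predict_next_hour_cpta(ad, current_cpta)  # call to the hosted model service
        if predicted > cpta_threshold:          # threshold is set (and changeable) by Marketing
            pause_ad(ad["ad_id"])               # act on the ad via the Facebook API
            return "paused"
        return "running"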


Results and Conclusion

The automatic ad pause system has increased our efficiency through the Facebook channel and gave Marketing more time to do what they do best: getting people excited about fresh and unique products offered by Zulily. Stay tuned for our next post where we take a deeper dive into the ML models.

Serving dynamically resized images at scale using Serverless

At zulily, we take pride in helping our customers discover amazing products at tremendous value. We rely heavily on highly curated images and videos to tell the stories of these products. These highly curated images, in particular, form the bulk of the content on our site and apps.


Two of our popular events from the weekend of Oct 20th, 2018

Today, we will talk about how zulily leverages serverless technologies to serve optimized images on the fly to a variety of devices with varying resolutions and screen sizes.

In any given month, we serve over 23 billion image requests using our CDN partners. This results in over a Petabyte per month of data transferred to our users around the globe.

We use an AWS Simple Storage Service (S3) bucket as the origin for all our images. Our in-house studio and merchandising teams upload rich images using internal tools to S3 buckets. As you can imagine, these images are pretty huge in terms of file size. Downloading and displaying these images as-is would result in a sub-optimal experience for our customers and wasted bandwidth.
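To make that concrete, the core of on-the-fly resizing is small. The sketch below uses Pillow, with the bucket, key, and quality settings as hypothetical placeholders rather than our production implementation.

    import io
    import boto3
    from PIL import Image

    s3 = boto3.client("s3")

    def resized_image(bucket: str, key: str, width: int) -> bytes:
        """Fetch the original image from S3 and return a JPEG scaled to `width` pixels wide."""
        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        image = Image.open(io.BytesIO(original))
        ratio = width / image.width
        image = image.resize((width, int(image.height * ratio)))  # preserve aspect ratio
        buffer = io.BytesIO()
        image.convert("RGB").save(buffer, format="JPEG", quality=85)
        return buffer.getvalue()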


Dynamic image resizing on the fly



Making Facebook Ad Publishing More Efficient

Remember Bart Simpson’s punishment for being bad? He had to write the same thing on the chalkboard over and over again, and he absolutely hated it! We as humans hate repetitive actions, and that’s why we invented computers – to help us optimize our time to do more interesting work.

At zulily, our Marketing Specialists previously published ads to Facebook individually. However, they quickly realized that creating ads manually limited the scale they could reach in their work: acquiring new customers and retaining existing shoppers. So in partnership with the marketing team, we worked together to build a solution that would help the team use resources efficiently.

At first, we focused on automating individual tasks. For instance, we wrote a tool that Marketing used to stitch images into a video ad. That was cool and saved some time but still didn’t necessarily allow us to operate at scale.

Now, we are finally at the point where the entire process runs end-to-end efficiently, and we are able to publish hundreds of ads per day, up from a handful.

Here’s how we engineered it.

The Architecture

Automated Ad Publishing Architecture

Sales Events

Sales Events is an internal system at zulily that stores the data about all sales events we run; typically, we launch 100+ sales each day, which can include 9,000 products, and each sale typically lasts three days. Each event includes links to appropriate products and product images. The system exposes the data through a REST API.

Evaluate an Event

This component holds the business logic that allows us to pick events that we want to advertise, using a rules-based system uniquely built for our high-velocity business. We implemented the component as an Airflow DAG that hits the Sales Events system multiple times a day for new events to evaluate. When a decision to advertise is made, the component triggers the next step.
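A stripped-down sketch of such a DAG is below; the schedule, endpoint URL, and advertising rule are illustrative placeholders rather than our production logic.

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def should_advertise(event):
        # Stand-in for the rules-based system (e.g. category, product count, margin).
        return event.get("product_count", 0) >= 50

    def trigger_make_creatives(event):
        # In production this would kick off the Make Creatives step.
        print(f"advertising event {event['event_id']}")

    def evaluate_events():
        """Pull today's sales events from the REST API and flag the ones to advertise."""
        events = requests.get("https://sales-events.internal/api/events/today").json()  # hypothetical URL
        for event in events:
            if should_advertise(event):
                trigger_make_creatives(event)

    with DAG(
        dag_id="evaluate_sales_events",
        start_date=datetime(2018, 1, 1),
        schedule_interval="0 */6 * * *",  # hits the Sales Events system several times a day
        catchup=False,
    ) as dag:
        PythonOperator(task_id="evaluate", python_callable=evaluate_events)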

Make Creatives

In this crucial next step, our zulily-built tool creates a video advertisement, which is uploaded to AWS S3 as an MP4 file. These creatives also include metadata used to match Creatives with Placements downstream.

Product Sort

A sales event at zulily could easily have dozens if not hundreds of products. We have a Machine Learning model that uses a proprietary algorithm to rank products for a given event. The Product Sort is available through a REST API, and we use it to optimize creative assets.

Match Creatives to Placements

A creative is a visual item that needs to be published so that a potential shopper on Facebook can see it. That end result advertisement that is seen by the potential shopper is described by a Placement. A Placement defines where on Facebook the ad will go and who the audience should be for the ad. We match creatives with placements using Match Filters defined by Marketing Specialists.

Define Match Filters

Match Filters allow Marketing Specialists to define rules that will pick a Placement for a new Creative.


These rules are based on the metadata of Creatives: “If a Creative has a tag X with the value Y, match it to the Placement Z.”
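In code, applying a match filter can be as simple as a tag comparison; the sketch below is illustrative, with field names invented for the example.

    def matching_placements(creative, match_filters):
        """Return the placement IDs whose filter matches the creative's metadata tags."""
        tags = creative.get("tags", {})  # hypothetical shape, e.g. {"category": "toys"}
        matched = []
        for rule in match_filters:
            # "If a Creative has a tag X with the value Y, match it to the Placement Z."
            if tags.get(rule["tag"]) == rule["value"]:
                matched.append(rule["placement_id"])
        return matched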


Once we match a Creative with one or more Placements, we persist the result in MongoDB. We use the schemaless database technology rather than a SQL database because we want to be able to extend the schema of Creatives and Placements without having to update table definitions. MongoDB (version 3.6 and above) also gives us a change stream, which is essentially a log of changes happening to a collection. We rely on this feature to automatically kick off the next step.

Publish Ads to Facebook

Once the ad definition is ready, and the new object is pushed to the MongoDB collection, we publish the ad to Facebook through a REST API. Along the way, the process automatically picks up videos from S3 and uploads them to Facebook. Upon a successful publish, the process marks the Ad as synced in the MongoDB collection.

Additional Technical Details

While this post is fairly high level, we want to share a few important technical details about the architecture that can be instructive for engineers interested in building something similar.

    1. Self-healing. We run our services on Kubernetes, which means that the service auto-recovers. This is key in an environment where we only have a limited time (in our case, typically three days) to advertise an event.
    2. Retry logic. Whenever you work with an external API, you want to have some retry logic to minimize downtime due to external issues. We use exponential retry (see the sketch after this list), but every use case is different. If the number of retries is exhausted, we write the event to a Dead Letter Queue so it can be processed later.
    3. Event-driven architecture. In addition to MongoDB change streams, we also rely on message services such as AWS Kinesis and SQS (alternatives such as Kafka and RabbitMQ are readily available if you are not in AWS). This allows us to de-couple individual components of the system to achieve a stable and reliable design.
    4. Data for the users. While it’s not shown directly on the diagram, the system publishes business data it generates (Creatives, Placements, and Ads) to zulily’s analytics solution where it can be easily accessed by Marketing Specialists. If your users can access the data easily, it’ll make validations quicker, help build trust in the system and ultimately allow for more time to do more interesting work – not just troubleshooting.
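As an illustration of item 2, a minimal exponential-retry helper with a dead-letter fallback might look like the sketch below; the queue URL and the publish call are hypothetical.

    import json
    import time
    import boto3

    sqs = boto3.client("sqs")
    DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ad-publish-dlq"  # hypothetical queue

    def publish_with_retry(publish_fn, payload, max_attempts=5, base_delay=1.0):
        """Call an external API with exponential backoff; dead-letter the payload on failure."""
        for attempt in range(max_attempts):
            try:
                return publish_fn(payload)
            except Exception:
                time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s, ...
        # Retries exhausted: park the event so it can be reprocessed later.
        sqs.send_message(QueueUrl=DLQ_URL, MessageBody=json.dumps(payload))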

In short, employing automated processes can help Marketing Tech teams scale and optimize work.

Realtime Archival of Mongo Collections to BigQuery

Here at zulily, we maintain a wide variety of database solutions.  In this post, we’ll focus on two that we use for two differing needs in Ad Tech: MongoDB for our real-time needs and Google’s BigQuery for our long-term archival & analytics needs. In many cases we end up with data that needs to be represented and maintained in both databases, which causes issues due to their conflicting needs and abilities. While previously we had solved this issue by creating manual batch jobs for whatever collections we wanted represented in BigQuery, it resulted in delays of data appearing in BigQuery and issues with maintaining multiple mostly-similar-but-slightly-different jobs. As such, we’ve developed a new service, called mongo_bigquery_archiver, that automates the process of archiving data to BigQuery with only minimal configuration.

For the configuration of the archiver, we have a collection, ‘bigquery_archiver_settings,’ established on the Mongo server whose contents we want to back up, which is under a custom database ‘config’ on that server. This collection maintains one document for each collection that we want to back up to BigQuery, which serves both as configuration and an up-to-date reference for the status of that backup process. Starting out, the configuration is simple:

    "mongo_database" : The Mongo database that houses the collection we want to archive.
    "mongo_collection" : The Mongo collection we want to archive.
    "bigquery_table" : The BigQuery table we want to create to store this data. Note that the table doesn't need to already exist! The archiver can create them on the fly.
    "backfill_full_collection" : Whether to upload existing documents in the Mongo collection, or only upload future ones. This field is useful for collections that may have some junk test data in them starting out, or for quickly testing it for onboarding purposes.
    "dryrun": Whether to actually begin uploading to BigQuery or just perform initial checks. This is mostly useful for the schema creation feature discussed below.

And that’s it! When that configuration is added to the collection, if the archiver process is active, it’ll automatically pick up the new configuration and create a subprocess to begin archiving that Mongo collection.

The first step is determining the schema to use when creating the BigQuery table. If one already exists or is (optionally) specified in the configuration document manually, it’ll use that. Otherwise, it’ll begin the process of determining the maximum viable schema. By iterating through the entire collection, it analyzes each document to determine what fields exist across the superset of all present documents, coming up with all fields that maintain consistent types across every document and/or exist in some documents and not others, treating ones with the missing fields as null values. In addition, it analyzes subdocuments in the same way, creating RECORD field definitions that similarly load all the valid fields, recurring as necessary based on the depth of the collection’s documents.

When complete, it stores the generated maximum viable schema in the configuration document for the user to review and modify as needed, in case there are extraneous fields that, while possible to upload to BigQuery, would just result in useless overhead. It creates a BigQuery table based on this generated schema, and moves on to the next step.
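A condensed, illustrative sketch of that schema inference is below; the type mapping is trimmed to a few primitives and the output shape is simplified.

    PY_TO_BQ = {str: "STRING", int: "INTEGER", float: "FLOAT", bool: "BOOLEAN"}

    def infer_columns(documents):
        """Maximum viable schema: keep fields whose type is consistent across every
        document that contains them; documents missing a field are treated as nulls."""
        columns, dropped = {}, set()
        for doc in documents:
            for field, value in doc.items():
                if field in dropped:
                    continue
                if isinstance(value, dict):
                    entry = columns.setdefault(field, {"name": field, "type": "RECORD", "docs": []})
                    if entry["type"] != "RECORD":
                        dropped.add(field)       # sometimes scalar, sometimes subdocument: drop it
                        del columns[field]
                        continue
                    entry["docs"].append(value)  # RECORD fields are inferred recursively below
                elif type(value) in PY_TO_BQ:
                    bq_type = PY_TO_BQ[type(value)]
                    entry = columns.setdefault(field, {"name": field, "type": bq_type})
                    if entry["type"] != bq_type:
                        dropped.add(field)       # inconsistent type: not part of the viable schema
                        del columns[field]
        for entry in columns.values():
            if entry["type"] == "RECORD":
                entry["fields"] = infer_columns(entry.pop("docs"))
        return list(columns.values())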

Here’s an example of the generated schema:

    "columns" : [
            "name" : "_id",
            "type" : "STRING",
            "name" : "tracking_code",
            "type" : "INTEGER",
            "mongo_field" : "tracking_code"
            "name" : "midnight_to_timestamp",
            "type" : "RECORD",
            "mongo_field" : "midnight_to_timestamp",
            "fields" : [
                    "name" : "cpta",
                    "type" : "FLOAT"
                    "name" : "activations",
                    "type" : "INTEGER"
                    "name" : "spend",
                    "type" : "FLOAT"
                    "name" : "date_start",
                    "type" : "STRING"
                    "name" : "ad_spend_id",
                    "type" : "STRING"
            "name" : "created_on_timestamp",
            "type" : "FLOAT",
            "mongo_field" : "created_on_timestamp"
            "name" : "lifetime_to_timestamp",
            "type" : "RECORD",
            "mongo_field" : "lifetime_to_timestamp",
            "fields" : [
                    "name" : "cpta",
                    "type" : "FLOAT"
                    "name" : "activations",
                    "type" : "INTEGER"
                    "name" : "spend",
                    "type" : "FLOAT"

Next, it uses a new feature added in Mongo 3.6, change streams. A change stream is like a subscription to the OpLog on the Mongo server for the collection in question – whenever an operation comes in that modifies the collection, the subscriber is notified about it. From here, we maintain a subscription for the watched collections, and whenever an update comes in, we process that update to get the current state of the document in Mongo, without querying again since we can configure the change stream to also give us the latest version of the document. By filtering down using the generated schema from before, we can upload the change to the BigQuery table via the BigQuery streaming API, along with the kind of operation it is – create, replace, update, or delete. In case of a delete, we blank all fields but the _id field and log that as a DELETE operation in the table. This table now represents the entire history of operations on the collection in question and can be used to track the state of the collection across time.
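A trimmed-down sketch of that watch-and-stream loop, using the pymongo and google-cloud-bigquery clients, is shown below; the host, database, collection names and the schema projection are simplified placeholders.

    import time

    from google.cloud import bigquery
    from pymongo import MongoClient

    bq = bigquery.Client()
    TABLE_ID = "project.dataset_id.table_id"

    def filter_to_schema(doc):
        # Simplified stand-in: the real archiver projects the document down to the
        # generated maximum viable schema; here we just stringify every field.
        return {key: str(value) for key, value in doc.items()}

    collection = MongoClient("mongodb://mongo-host:27017")["ads"]["ad_stats"]  # hypothetical names
    # full_document="updateLookup" hands us the current state of the document with
    # each change, so we never have to query Mongo a second time.
    with collection.watch(full_document="updateLookup") as stream:
        for change in stream:
            doc = change.get("fullDocument") or {"_id": change["documentKey"]["_id"]}  # deletes keep only _id
            row = filter_to_schema(doc)
            row["operationType"] = change["operationType"]  # insert / replace / update / delete
            row["time_archived"] = time.time()
            errors = bq.insert_rows_json(TABLE_ID, [row])   # BigQuery streaming insert
            if errors:
                raise RuntimeError(errors)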

Since a full-history table makes it somewhat cumbersome to get the most current version of the data, the archiver automatically creates views on top of all generated tables, using the following query:

    SELECT * FROM (
        SELECT *, RANK() OVER (PARTITION BY _id ORDER BY time_archived DESC) rank
        FROM project.dataset_id.table_id
    )
    WHERE rank = 1 AND operationType != 'delete'

What this query does is determine the latest operation performed on each unique document ID, which is guaranteed to be unique in the Mongo collection. If the latest operation is a DELETE, it doesn’t show any entries for that particular ID in the final query results, otherwise it shows the status of the document as of the latest modification. This results in a view of each unique document exactly as of its latest update. With the minimal latency of the BigQuery streaming API, changes are reflected in BigQuery within seconds of their creation in Mongo, allowing for sophisticated real-time analytics via BigQuery. While BigQuery does not have official SLAs about the performance of the streaming API, we consistently see results uploaded within 3-4 seconds at most via query results. The preview mechanism via the BigQuery UI does not accurately reflect it, but querying the table via a SELECT statement properly shows the results.


An example of the architecture. Note that normal user access always hits Mongo and never goes to BigQuery, making the process more efficient.

Thanks to the Archiver, we’ve been able to leverage the strengths of both Mongo and BigQuery in our rapidly-modifying datasets while not having to actively maintain two disparate data loading processes for each system.

We’ve open-sourced the Archiver on Github under the Apache V2 license. If you choose to use it for your own needs, check out Google’s documentation for streaming inserts into BigQuery.