Evolution of Zulily’s Airflow Infrastructure

At Zulily, our need to perform various dependent actions on different schedules has continued to grow. Sometimes we need to communicate inventory changes to advertising platforms; at other times we need to aggregate data and produce reports on the effectiveness of spend, ad conversions and other metrics. Early on, we knew that we needed a reliable workflow management system. We started with Apache Airflow version 1.8 two years ago and have continued to add hardware resources as the demands of our workloads increased.  

Apache Airflow is a workflow management system that allows developers to describe workflow tasks and their dependency graph as code in Python. This allows us to keep a history of the changes and build solid pipelines step-by-step with proper monitoring. 
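To make the "workflow as code" idea concrete: a workflow boils down to a directed acyclic graph of tasks. The sketch below is not Airflow's API; it uses Python's standard-library graphlib, with made-up task names, purely to illustrate how a dependency graph expressed in code yields an execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical reporting workflow: each task maps to the set of tasks
# it depends on -- exactly the relationship an Airflow DAG encodes.
dag = {
    "extract_inventory": set(),
    "extract_ad_spend": set(),
    "aggregate": {"extract_inventory", "extract_ad_spend"},
    "publish_report": {"aggregate"},
}

# A topological sort gives a valid execution order: every task runs
# only after all of its dependencies have completed.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Airflow's scheduler does essentially this, while also tracking state, retries and schedules for each task.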

Our Airflow deployment runs the large majority of our advertising management and reporting workflows. As our usage of Airflow increased, we made our deployment infrastructure more resilient to failures by leveraging the new KubernetesPodOperator. This post describes our journey with Airflow from the Celery executor to the KubernetesPodOperator. 

Our First Airflow 1.8 Deployment using Celery Executor 

With this version of Airflow we needed to maintain many separate services: the scheduler, RabbitMQ, and workers.

Components and Concepts 

  • A DAG is the complete definition of one workflow, written as code and composed of tasks and their dependencies on other tasks. The AWS Elastic File System (EFS) share contains the code for the DAGs.  
  • Git Syncer polls Zulily’s Gitlab for DAG code every 5 minutes and puts the code on the AWS EFS.  
  • AWS EFS is the file share that has all the DAG code. It is mounted on the Webserver pod and Scheduler pod. 
  • Webserver pod hosts the Airflow UI that shows running tasks, task history and allows users to start and stop tasks and view logs of tasks that already completed.  
  • Scheduler pod reads the DAG code from AWS EFS and reads the scheduling data from the Airflow Metadata DB and schedules tasks on the Worker pods by pushing them on the RabbitMQ. 
  • Airflow Metadata DB contains the scheduling information and history of DAG runs. 
  • Workers dequeue the tasks from RabbitMQ, execute them and copy the logs to S3 when done. 


Pros 

  • Having the ability to add more worker nodes as loads increased was a plus. 
  • By using the Git Syncer we were able to sync code every 5 minutes. From a developer point of view, after merging code in master branch it would automatically get pulled to production machines within 5 minutes.  


Cons 

  • Multiple single points of failure: RabbitMQ, GitSyncer. 
  • All DAGs and the Airflow scheduler formed one application that shared packages across the board. This meant there was one gigantic Pipfile, and every package in it had to be compatible with all the others.  
  • All the DAGs had to be written in Python which restricted the ability to re-use existing components written in Java and other languages.  

Our Current Airflow 1.10.4 Deployment using KubernetesPodOperator 

In Airflow version 1.10.2 a new kind of operator called the KubernetesPodOperator was introduced. This allowed us to reduce setup steps and make the overall setup more robust and resilient by leveraging our existing Kubernetes cluster. 

Differences and New Components 

  • DAG continues to be a Python definition of dependencies.  We used the KubernetesPodOperator to define all our DAGs. As a result, our DAG becomes a tree of task containers. We used the LocalExecutor to run our DAGs from the scheduler. 
  • Temporary Task Pods run Task Containers, which operate like any other container and contain the business logic needed for that task. The key benefit is that there is no need to bundle any Airflow-specific packages into the task container. This is a game changer: pretty much any code, written in any language, that can be made into a container can be used as a task inside an Airflow DAG. For our Python DAGs, it also breaks up the giant Pipfile into smaller Pipfiles, one per Python task container, making the package dependencies much more manageable.  
  • GitSyncer goes away. Git Syncer polled Zulily’s Gitlab every 5 minutes; we avoid that by running the kubectl cp command during CI/CD. After a developer merges code to master, a CI/CD step copies the updated DAG definition to the AWS EFS. This push-based approach eliminates the need to poll every 5 minutes. 
  • Webserver and Scheduler containers both run on the same pod. Starting from the Airflow Kubernetes deploy yaml, we removed the portions for setting up the git sync and created one pod with both Webserver and Scheduler containers. This simplified deployment. We used a minimal version of the Airflow Dockerfile for our Webserver and Scheduler containers.  
  • Scheduling: When the scheduler needs to schedule a new task, using the Kubernetes API, it creates a temporary worker pod with the container image specified and starts it. After the task has completed, the logs are copied over to S3. Then the worker pod ends. By following this approach, the task worker containers are automatically distributed over the whole Kubernetes cluster. In order to increase processing capacity of Airflow, we simply need to scale up Kubernetes which we do using Kops.   

High Level Sequence of Events 

  1. Developer pushes or merges DAG definition and Task container code to master. 
  2. CI/CD is set up to:
    1. Build and push the task container images to AWS ECR. 
    2. Push the new DAG definition to the AWS EFS. 
  3. When the Scheduler needs to schedule a task: 
    1. It reads the DAG definition from the AWS EFS. 
    2. It creates the short-lived Task pod with the Task container image specified. 
    3. Task executes and finishes.
    4. Logs are copied to S3. 
    5. Task pod exits. 
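To illustrate step 3b: a short-lived task pod is simply a pod with restartPolicy: Never pointing at the task’s container image. The sketch below builds such a manifest as a plain Python dict (the names and image are made up); the KubernetesPodOperator assembles the equivalent spec internally and submits it through the Kubernetes API:

```python
def task_pod_manifest(task_id: str, image: str) -> dict:
    """Build a minimal manifest for a temporary Airflow task pod.

    Illustrative only -- field values here are hypothetical.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"airflow-task-{task_id}"},
        "spec": {
            # The pod runs the task once and exits; it is never restarted.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "base",
                    "image": image,  # e.g. a task image pushed to AWS ECR
                }
            ],
        },
    }

manifest = task_pod_manifest("daily-report", "example-registry/report-task:latest")
```

Because each task gets its own pod, Kubernetes spreads the work across the cluster, and capacity grows by adding nodes.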

Resiliency & Troubleshooting 

  • Using a single pod for Airflow webserver and scheduler containers simplifies things; Kubernetes will bring this pod up if it goes down. We do need to handle orphaned task pods.
  • Worker pods are temporary. If they error out, we investigate, fix and re-run. The KubernetesPodOperator provides the option to keep older task containers around for troubleshooting purposes. 
  • As before, Airflow Metadata DB is a managed AWS RDS instance for us.  
  • The DAGs volume is also on AWS EFS. If we were to lose it in the unlikely event of a catastrophe, we can always restore from our source repository. 
  • Our Kubernetes cluster has been working for us without any major resiliency issues. We have it deployed on AWS EC2 instances and use Kops for cluster management. We are currently exploring AWS EKS. 


Airflow continues to be a great tool helping us achieve our business goals. Using the KubernetesPodOperator and the LocalExecutor with Airflow version 1.10.4, we have streamlined our infrastructure and made it more resilient in the face of machine failures. Our package dependencies have become more manageable and our tasks have become more flexible. We are piggy-backing on our existing Kubernetes infrastructure instead of maintaining another queuing and worker mechanism. This has enabled our developers to devote more time to improve customer experience and to worry less about infrastructure. We are excited about future developments in Airflow and how they enable us to drive Zulily business! 

The “Power” of A/B Testing


Customers around the world flock to Zulily on a daily basis to discover and shop products that are curated specially for them. To serve our customers better, we are constantly innovating to improve the customer experience. To help us decide between different customer experiences, we primarily use split testing or A/B testing, the industry standard for making scientific decisions about products. In a previous post, we talked about the fundamentals of A/B testing and described our home-grown solution. In this article, we are going to explore one choice behind designing sound A/B tests: we will calculate the number of people that need to be included in your test.

Figure 1. For example, in an A/B test we might compare two variants of the navigation bar on the website. Here, the New Today tab, displaying new events, appears either in position one (original version) or position two (test version).

To briefly recap the idea of A/B testing, in an A/B test we compare the performance of two variants of the same experience. For example, in a test we might compare two ways of ordering the navigation tabs on the website (see Figure 1). In the original version (variant A, or control), the tabs are ordered such that events launched today (New Today) appear before events that are ending soon (Ends Soon). In the new version of the website (variant B, or treatment), Ends Soon appears before New Today. The test is set up such that, for a pre-defined period of time, customers visiting the website are shown either variant A or variant B. Then, using statistical methods, we measure the incremental improvement in the experience of customers who were shown variant B over those who were shown variant A. Finally, if there is a statistically significant improvement, we might decide to change the order of the tabs on the website.

Since Zulily relies heavily on A/B testing to make business decisions, we are careful about avoiding common pitfalls of the method. The length of an A/B test ties in strongly with the success of the business for two reasons:

  • If A/B tests run for more days than necessary, the pace of innovation at the company will slow down.
  • If A/B tests run for fewer days than required to achieve statistically sound results, the results will be misguided.

To account for these issues, before making decisions based on the A/B test results, we run what’s called a ‘power analysis.’ A power analysis ensures that enough people have been captured in the test to confirm or deny whether variant B was an improvement over variant A; that calculation is the focus of this article. We also make sure that the test runs long enough that short-term business cycles are accounted for. The number of people needed in a test is a function of three things: effect size, significance level (\alpha), and power (1-\beta).

To consult the statistician after an experiment is finished is often merely to ask [them] to conduct a post mortem examination. [They] can perhaps say what the experiment died of.

– Ronald Fisher, Statistician

Common Terms

Before we get into the mechanics of that calculation, let us familiarize ourselves with some common statistical terms. In an A/B test, we are trying to estimate how all the customers behave (population) by measuring the behavior of a subset of customers (sample). For this, we ensure that our sample is representative of our entire customer base. During the test, we measure the behavior of the customers in our sample. Measurements might include the number of items purchased, the time spent on the website, or the money spent on the website by each customer.

For example, to test whether variant B outperformed variant A, we might want to know if the customers exposed to variant B spent more money than the customers exposed to variant A. In this test, our default position is that variant B made no difference on the behavior of the customers when compared to variant A (null hypothesis). As more customers are exposed to these variants and start purchasing products, we collect measurements on more customers, which allows us to either accept or reject this null hypothesis. The difference between the behavior of customers exposed to variant A and variant B is known as the effect size.

Figure 2. In A/B testing, different types of errors can occur, depending on where the results lie on this graph. Therefore, the parameters that we set under the hood, namely significance level and power, need to be set carefully, keeping our appetite for error in mind.

Further, there are a number of parameters set under the hood. Before starting the test, we assign a significance level (\alpha) of 0.05, which means that we might reject the null hypothesis when it is actually true in 5% of cases (Type I error rate). We also assign a power (1-\beta) of 0.80, which means that when the null hypothesis does not hold, i.e. variant B does change the behavior of customers, the test will allow us to reject the null hypothesis 80% of the time. Importantly, these parameters need to be set at the beginning of the test and upheld for its duration to avoid p-hacking, which leads to misguided results.
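These error rates can be checked empirically. The short simulation below (assumed numbers throughout: a 30% baseline revisit rate and 500 customers per variant) runs many A/A tests, where both groups see the same experience, and shows that a two-sided z-test at \alpha = 0.05 falsely declares a difference in roughly 5% of runs:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(42)  # fixed seed so the simulation is reproducible

P = 0.30          # assumed true revisit rate for BOTH variants (A/A test)
N = 500           # customers per variant
TRIALS = 1000     # number of simulated experiments
Z_CRIT = NormalDist().inv_cdf(1 - 0.05 / 2)  # ~1.96 for alpha = 0.05

false_positives = 0
for _ in range(TRIALS):
    # Simulate how many customers in each group revisit the site.
    a = sum(random.random() < P for _ in range(N))
    b = sum(random.random() < P for _ in range(N))
    p_a, p_b = a / N, b / N
    # Pooled two-proportion z-test.
    pooled = (a + b) / (2 * N)
    se = sqrt(pooled * (1 - pooled) * (2 / N))
    z = (p_a - p_b) / se
    if abs(z) > Z_CRIT:
        false_positives += 1  # Type I error: no real difference exists

print(false_positives / TRIALS)  # hovers around 0.05 by construction
```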


Estimating the Number of Customers for the A/B Test

For this exercise, let us revisit the previous example where we show customers two versions of the Zulily navigation bar. Let us say we want to see whether this change makes customers more or less engaged on Zulily’s website. One metric that can capture this is the proportion of customers who revisit the website when shown variant B versus variant A. For this exercise, let us say that we are interested in a boost in this metric of at least 1% (effect size). If we see at least this effect size, we might implement variant B. The question is: how many customers should be exposed to each variant to allow us to confirm that this change of 1% exists?

Starting off, we define some parameters. First, we set the significance level at 0.05. Second, from the central limit theorem, we assume that the metric averaged over a group of customers is normally distributed. Third, we direct 50% of the customers visiting the site to variant A and 50% to variant B. These last two points greatly simplify the math behind the calculation. Now, we can estimate the number of people that need to be exposed to each variant.

n = \frac{2\sigma^2 (z_{1-\frac{\alpha}{2}} + z_{1-\beta})^2}{\delta^2}

where \sigma is the standard deviation of the population, \delta is the change we expect to see, and z_{1-\frac{\alpha}{2}}, z_{1-\beta} are quantile values calculated from a normal distribution. For the parameters defined above, a significance level of 0.05 and a power of 0.80, and a desired detectable change of 1% in the proportion of people revisiting the website, the formula reduces to a single required sample size.

This formula gives us the number of people that need to be exposed to one variant. Since the customers are split evenly between variant A and variant B, the entire test needs twice that number of people. The estimate can change significantly if any of the parameters change. For example, to detect a larger difference at this significance level, we would need much smaller samples. Further, if the observations are not normally distributed, we would need a more complicated approach.
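The calculation above can be scripted with nothing but the standard library. In the sketch below the 20% baseline revisit proportion is an assumed number, since the post does not state it, and \sigma^2 is taken as p(1-p), the variance of a proportion metric:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p_baseline: float, delta: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Customers needed per variant to detect an absolute change `delta`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    sigma_sq = p_baseline * (1 - p_baseline)       # variance of a proportion
    return ceil(2 * sigma_sq * (z_alpha + z_power) ** 2 / delta ** 2)

# Assumed 20% baseline revisit rate, detecting a 1% absolute boost:
n = sample_size_per_variant(p_baseline=0.20, delta=0.01)
print(n)  # per variant; the whole test needs 2 * n customers
```

Note how sensitive the answer is: doubling the detectable change to 2% cuts the required sample roughly fourfold, which is why the expected effect size dominates how long a test must run.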

Benefits of this calculation

In short, getting an estimate of the number of customers needed allows us to design our experiments better. We suggest conducting a power analysis both before and after starting a test, for several reasons:

  • Before starting the test – This gives us an estimate of how long our test should be run to detect the effect that we are anticipating. Ideally, this is done once to design the experiment and the results are tallied when the requisite number of people are exposed to both variants. However, the mean and standard deviation used in the calculation before starting the test are approximations to the actual values that we might see during the test. Thus, these a priori estimates might be off.
  • After starting the test – As the test progresses, the mean and standard deviation converge to values representative of the current sample which allows us to get more accurate estimates of the sample size. This is especially useful in cases where the new experience introduces unexpected changes in the behavior of the customers leading to significantly different mean and standard deviation values than those estimated earlier.


At Zulily, we strive to make well-informed choices for our customers by listening to their voice through our A/B testing platform, among other channels, and ensuring that we are constantly serving their needs. While obtaining an accurate estimate of the number of people for a test is challenging, we hold it central to the process. The benefits of a well-designed, statistically sound A/B testing system far outweigh those of obtaining quick but misdirected numbers. Therefore, we aim for a high level of scientific rigor in our tests.

I would like to thank my colleagues in the data science team, Demitri Plessas and Pamela Moriarty, and my manager, Paul Sheets, for taking time to review this article. This article is possible due to the excellent work by the entire data science team of maintaining the A/B testing platform, and ensuring that experiments at Zulily are well-designed.


Learn how Zulily and Sounders FC get the most out of their metrics!

On Tuesday, September 10th, Zulily was proud to partner with Seattle Sounders FC for a tech talk on data science, machine learning and AI. This exclusive talk was led by Olly Downs, VP of Data & Machine Learning at Zulily, and Ravi Ramineni, Director of Soccer Analytics at Sounders FC.

Zulily and Sounders FC both use deep analysis of data to improve the performance of their enterprises. At Zulily, applying advanced analytics and machine learning to the shopping experience enables us to better engage customers and drive daily sales. For Sounders FC, the metrics reflect how each player contributes to the outcome of each game; understanding the relationship between player statistics, training focus and performance on the field helps bring home the win. For both organizations, being intentional about the metrics we select and optimize for is critical to success.

We would like to thank everyone who attended the event for a great night of discussion and for developing new ties within the Seattle developer community. For any developers who missed this engaging discussion, we invite you to view the full presentation and audience discussion:


Thanks to Olly Downs and Ravi Ramineni for presenting their talks, Sounders FC for hosting, and Luke Friang for providing a warm welcome. This would not have been possible without the many volunteers from Zulily, Bellevue School of AI for co-listing the event, as well as all the attendees for making the tech talk a success!

For more information:

If you’d like to chat further with one of our recruiters regarding a position within data science, machine learning or any of our other existing roles, feel free to reach out directly to techjobs@zulily.com to get the conversation started. Also be sure to follow us on LinkedIn and Twitter.


Seattle Female Leaders Discuss Their Paths to Success

On September 5th, Zulily was proud to partner with Reign FC for a thought leadership event, celebrating and highlighting leadership skills from Seattle leaders, including women in STEM, sports and business. This discussion was led by Kelly Wolf, with Celia Jiménez Delgado, Kat Khosrowyar, Jana Krinsky and Angela Dunleavy-Stowell. Our panel addressed a variety of topics, including mentorship, leadership, impostor syndrome, advocacy and unconventional advice. Following the discussion and audience Q&A, attendees had the opportunity to meet Celia Jiménez Delgado, Bev Yanez, Morgan Andrews, Bethany Balcer, Lauren Barnes, Michelle Betos, Darian Jenkins, Megan Oyster, Taylor Smith and Morgan Proffitt of Reign FC! Attendees were also able to take a professional headshot, courtesy of Zulily’s hardworking studio team. We’d like to thank all who were able to attend, as well as the Zulily staff whose efforts made this event a success.

Panel Highlights:

“As you grow in your career, you are being sought for your leadership and critical thinking skills, and for your ability to diagnose and solve problems, not regurgitate facts.” Kelly Wolf, VP of People at Zulily

“I wouldn’t be where I am if it wasn’t for my mentors. We need to push more, take more risk to support each other and come together as a community. It doesn’t matter if you’re a man or a woman, we all need to work together.” Kat Khosrowyar, Head Coach at Reign Academy, former Head Coach of Iran’s national soccer team, Chemical Engineer

“I am not a developer, but currently mentor a female developer. She drives the topic, and I act as a sounding board. Working on a predominately male team, she needed a different confidante to work through issues, approach, development ideas and career path goals.” Jana Krinsky, Director of Studio at Zulily

“During meetings, I sometimes tell myself, ‘should I be here? I’m in over my head.’ And I sort of have to call bull**** on myself. I think we all need to do that.” Angela Dunleavy-Stowell, CEO at FareStart, Co-Founder at Ethan Stowell Restaurants

“When you have confidence in yourself, when you think ‘I’m going to own it, this is going to happen because I’m going to make it happen,’ it matters. As women, we can’t use apologetic language like ‘Sorry, whenever you have a second, I would like to speak to you’ — we don’t need to be sorry for doing our jobs. Women need to start changing those sentences to, ‘when would be a good time to talk about this project?’ and treating people as your equal, not as someone who’s above you.” Celia Jiménez Delgado, right wing-back for Reign FC + Spain’s national soccer team, Aerospace Engineer

“We all have to find our courage. Because if you want to grow and be in a leadership role, that’s going to be a requirement. I think identifying that early in your career is a great way to avoid some pitfalls down the road.” Angela Dunleavy-Stowell, CEO at FareStart, Co-Founder at Ethan Stowell Restaurants


Thanks to Celia Jiménez Delgado, Kat Khosrowyar, Jana Krinsky, Angela Dunleavy-Stowell and Kelly Wolf for this engaging panel! We’d also like to give a big thanks to FareStart, Reign FC  and Reign Academy for supporting this event. This would not have been possible without the many volunteers from Zulily as well as all the attendees for making the night a success!

For more information:

If you’d like to chat further with one of our recruiters regarding a position within data science, machine learning or any of our other existing roles, feel free to reach out directly to techjobs@zulily.com to get the conversation started. Also be sure to follow us on LinkedIn and Twitter.

Interning at Zulily

Hello, I’m Han. I am from Turkey but moved to the Greater Seattle area 5 years ago. I finished high school in Bellevue, went to Bellevue College, transferred to the University of Washington Computer Science department, and am now going into my senior year. This summer, for 3 months, I worked as an intern on Zulily’s Member Engagement Platform (MEP) team. This post is about my internship journey.

This was my first internship, so I came to my first day of work with one goal: learning. But I didn’t know it was going to become something bigger. 

In my first month I worked on a Java project in which I downloaded data from an outside source and used our customer data to map customers to their time zones. During that time, I learned about AWS services such as ECS, Lambda, Route53, Step Functions, etc. I learned about containerized deployments and created CI/CD and CloudFormation files.  

In my second month, I got into working with the UI. Before my internship, I had never done any front-end work, but during that month I learned how to work on a React app and use JavaScript. I worked with engineering and Marketing to implement features in the MEP UI.  

At the beginning of my third month I was working on our Facebook Messenger bot, implementing features that users could use in the Messenger app. By then I was working on projects on both the front end and the back end, getting my hands dirty in every part of the stack. I was learning, deploying and helping.  

In the beginning, either my manager or other engineers would assign me tasks. But after 2 months, I was picking up my own: choosing tasks, working with engineers and Marketing, going to design meetings, and helping other engineers.  

Before the start of the internship, my friends and my adviser told me that I should expect to be given one big project that I would work on somewhere in the corner by myself. They also warned me that my project would probably never be deployed or used. But here at Zulily that wasn’t the case at all. I was working with the team, as a part of the team, not as an outsider intern. I was deploying new features every week. Features I can look back at, show others and be proud of. I was coming to work every day with a goal to finish tasks and leaving with the feeling of accomplishment. I felt like I was part of a bigger family. In my team, everyone was helping each other, they were working together in order to succeed together, just like a team, just like a family.  

Now that I’m at the end of my internship, I’m leaving with a lot of accomplishments, a lot more knowledge, and going back to school to finish my last year. In conclusion, I believe interning at Zulily was the perfect internship. I, as an intern, learned about the company, the team, the workflow, the projects. I learned new engineering skills, learned how to work in a different environment with a team, accomplished a lot of tasks and finally contributed to the team and the company in general.  

My First Six months at Zulily After a Bootcamp

Hi, I’m Mark — a mathematician turned coffeeshop owner turned actuary turned software engineer. When I discovered my passion for software development, I quit my day job, went through an immersive development bootcamp and was recruited by Zulily’s Member Engagement Platform team. This post is about my first six months here. 

Things I had to learn on the go: 

  • Becoming an IntelliJ power user 
  • Navigating the code base with many unfamiliar projects 
  • Learning about containerized deployments and CI/CD 
  • Getting my hands dirty with Java, Scala, Python, JavaScript, shell scripting and beyond 
  • Reaching high into the AWS cloud: ECS, SQS, Kinesis, S3, Lambda, Step Functions, EMR, CloudWatch… 

Things that caught me by surprise: 

  • I shipped my code to production two weeks after my start date. 
  • Engineers had autonomy and trust. The managers removed obstacles. 
  • The amount of collective knowledge and experience around me was stunning. 
  • I was expected to work directly with the Marketing team to clarify requirements. No middlemen. 
  • Iteration was king! 

Things that I am still learning with the help from my team: 

  • Form validation in React; 
  • Designing and building high-scale distributed systems; 
  • Tuning AWS services to match our use cases; 
  • On-call rotations (the team owns several tier-1 services); 
  • And many more… 

There is one recurring theme in all these experiences. No matter how new, huge, or unfamiliar a task may seem at first, it’s just a matter of breaking it down methodically into manageable chunks that can then be understood, built, and tested. This process can be time consuming, and I feel lucky that my team understands that. It’s clear that engineers at Zulily are encouraged to take the time and build products that last.    

One of the things that I find truly amazing here is the breadth of work that is considered ‘full stack’. DevOps, Big Data, Micro Services, and React client apps are just some of the areas in which I have been able to expand my knowledge in the last several months. It may have been overwhelming at first, but when you are among teammates that have vast expertise, acquiring these new skills became an exciting part of my daily routine at Zulily. 

It’s hard to compare what I know now to what I knew six months ago — my understanding has expanded in breadth and depth — and I’m excited to see what the next six months will bring. 

Leveraging Serverless Tech without falling into the “Ownerless” trap

The appeal of serverless architecture sometimes feels like a silver bullet: offload your application code to your favorite cloud provider’s serverless offering and enjoy the benefits of extreme scalability, rapid development, and pay-for-use pricing. At this point most cloud providers’ solutions and tooling are well defined, and you can get your code executing in a serverless fashion with little effort.

The waters start to get a little murkier when considering making an entire application serverless while maintaining the coding standards and maintainability of traditional server-based architecture. Serverless has a tendency to become ownerless (aka hard to debug and painful to work on) over time without the proper structure, tooling and visibility. Eventually these applications suffer due to code fragility and slow development time, despite how easy and fast they were to prototype.

On the pricing team at Zulily, our core services are written in Java using the Dropwizard framework. We host on AWS using Docker and Kubernetes and deploy using Gitlab’s CI/CD framework. For most of our use cases this setup works well, but it’s not perfect for everything. We are confident, though, that it provides the structure and tools that help us avoid the ownerless trap and achieve our goals of well-engineered service development. While this list of goals will differ between teams and projects, for our purposes it can be summarized below:


  • Testability (Tests & Local debugging & Multiple environments)
  • Observability (Logging & Monitoring)
  • Developer Experience (DRY & Code Reviews & Speed & Maintainability)
  • Deployments
  • Performance (Latency & Scalability)

We are currently redesigning our competitive shopping pipeline and thought this would be a good use case to experiment with a serverless design. For our use case we were most interested in its ability to handle unknown and intermittent load and were also curious to see if it improved our speed of development and overall maintainability. We think we have been able to leverage the benefits of a serverless architecture while avoiding some of the pitfalls. In this post, I’ll provide an overview of the choices we made and lessons we learned.


There are many serverless frameworks: some aim to be platform agnostic, others focus on providing better UIs, some make deployment a breeze, and so on. We found most of them to be abstraction layers on top of AWS/GCP/Azure, and since we are primarily an AWS-leaning shop at Zulily, we decided to stick with AWS’s native tools where we could, knowing we could swap out components later if necessary. Some of these tools were already familiar to the team, and we wanted to be able to take advantage of new AWS releases without waiting for an abstraction layer to implement those features.

Basic architecture of serverless stack:

Below is a breakdown of the major components/tools we are using in our serverless stack:

API Gateway:

What it is: Managed HTTP service that acts as a layer (gateway) in front of your actual application code.

How we use it: API gateway handles all the incoming traffic and outgoing responses to our service. We use it for parameter validation, request/response transformation, throttling, load balancing, security and documentation. This allows us to keep our actual application code simple while we offload most of the boilerplate API logic and request handling to API Gateway. API gateway also acts as a façade layer where we can provide clients with a single service and proxy requests to multiple backends. This is useful for supporting legacy Java endpoints while the system is migrated and updated.


Lambda:

What it is: Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources.

How we use it: We use Lambdas to run our application code via the API Gateway Lambda proxy integration. Our Lambdas are written in Node.js to reduce cold start time and memory footprint (compared to Java). We keep a 1:1 relationship between an API endpoint and a Lambda so we can fine-tune the memory/runtime per Lambda and use API Gateway to its fullest extent, for example, bouncing malformed requests at the gateway level to prevent Lambda execution. We also use Lambda layers, a feature AWS introduced in 2018, to share common code between our endpoints while keeping the actual application logic isolated. This keeps the deployment size of each Lambda smaller, and we don't have to replicate shared code.

SAM (CloudFormation):

What it is: AWS SAM is an open-source framework that you can use to build serverless applications on AWS. It is an extension of CloudFormation that lets you bundle and define serverless AWS resources in template form, create stacks in AWS, and enable permissions. It also provides a set of command-line tools for running serverless applications locally.

How we use it: This is the glue for our whole system. We use SAM to describe and define our entire stack in config files, run the application locally, and deploy to integration and production environments. This really ties everything together; without this tool we would be managing all the different components separately. Our API Gateway, Lambdas (+ layers) and permissions are described and provisioned in YAML files.

Example of defining a lambda proxy in template.yaml

Example of shared layers in template.yaml

Example of defining corresponding API gateway endpoint in template.yaml
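For readers unfamiliar with SAM, a stripped-down sketch of such a template might look like the following. The resource names, paths and runtime version are illustrative, not our actual configuration:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  # Hypothetical endpoint: one Lambda per endpoint, proxied through the API.
  GetItemFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/get-item/
      Handler: index.handler
      Runtime: nodejs10.x
      MemorySize: 256          # tuned per endpoint
      Layers:
        - !Ref SharedLayer
      Events:
        GetItem:
          Type: Api
          Properties:
            Path: /items/{id}
            Method: get

  # Shared code (response/error/database helpers) published as a layer.
  SharedLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      ContentUri: layers/shared/
      CompatibleRuntimes:
        - nodejs10.x
```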


DynamoDB:

What it is: A managed key-value and document data store.

How we use it: We use DynamoDB for our ingestion pipeline, getting data into our system. DynamoDB was the initial reason we chose to experiment with serverless for this project: a serverless application is limited by its database's ability to scale fluidly. In other words, it doesn't matter if your application can handle large spikes in traffic if your database layer returns capacity errors in response to the increased load. Our ingestion pipeline spikes daily from ~5 to ~700 read or write capacity units per second and is projected to grow, so we wanted to make sure that throwing additional batches of reads or writes at our datastore from our serverless API wouldn't cause a bottleneck during a spike. Besides scaling fluidly, DynamoDB has also simplified our Lambda–datastore interaction: because DynamoDB's low-level API uses HTTP(S), we don't have the complexity overhead of trying to open and share database connections between Lambdas. This is not to say it's impossible to use a different database (there are lots of examples out there), but it's arguably simpler to use something like DynamoDB.
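To make the capacity numbers above concrete, here is a back-of-the-envelope sketch using DynamoDB's published capacity unit sizes (1 WCU = one write of up to 1 KB per second; 1 RCU = one strongly consistent read of up to 4 KB per second). The item sizes are hypothetical:

```javascript
// Back-of-the-envelope DynamoDB capacity math. Each write consumes
// ceil(itemSize / 1 KB) WCUs; each strongly consistent read consumes
// ceil(itemSize / 4 KB) RCUs.
const wcuForWrites = (writesPerSecond, itemSizeBytes) =>
  writesPerSecond * Math.ceil(itemSizeBytes / 1024);

const rcuForReads = (readsPerSecond, itemSizeBytes) =>
  readsPerSecond * Math.ceil(itemSizeBytes / 4096);

// E.g. an ingestion burst of 350 writes/sec of ~2 KB items needs
// 700 WCUs -- the upper end of the daily spike described above.
console.log(wcuForWrites(350, 2048)); // 700
```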

Normal spiky ingestion traffic for our service with DynamoDB scaling

Other tools: GitLab CI/CD, CloudWatch, ELK stack, Grafana:

These are fairly common tools, so we won’t go into detail about what they are.

How we use them: We use GitLab CI/CD to deploy all of our other services, so we wanted to keep that the same for our serverless application. Our GitLab runner uses a Docker image with AWS SAM for building, testing, deploying (in stages) and rolling back our serverless application. Our team already uses the ELK stack and Grafana for logging, visualization and alerting. For this service, all of our logs from API Gateway, Lambda and DynamoDB are picked up in CloudWatch. We use CloudWatch as a data source in Grafana and have a utility that ships our CloudWatch logs to Logstash. That way we keep our normal monitoring and logging systems without having to go to separate tools just for this project.


So now that we have laid out our basic architecture and tooling: how well does this system address the goals mentioned at the start of this post? Below is a breakdown of how each has been addressed, with a (yes, very subjective) grade for each.

Testability (Tests & Local debugging & Multiple environments)

Overall Grade: A-

Positives: The tooling for serverless has come a long way and impressed us in this category. One of the most useful features is the ability to start your serverless app locally by running

sam local start-api <OPTIONS>

This starts a local HTTP server based on the AWS::Serverless::Api specification in your template.yaml. When you make a request to the local server, it reads the corresponding CodeUri property of your Lambda (AWS::Serverless::Function) and starts a Docker container (the same image AWS uses for deployed Lambdas) to run your Lambda locally for that request. We were able to write unit tests for all the Lambda and Lambda layer code, as well as deploy to separate integ/prod environments (discussed more below). There are additional console and CLI tools for triggering API Gateway endpoints and Lambdas directly.

Negatives: Most of the negatives in this category are nitpicks. Unit testing with layers took some tweaking and feels a little clumsy, and sam local start-api doesn't exactly mimic the deployed instance in how it handles errors and parameter validation. In addition, requests to the local instance are slow because a Docker container is started locally every time an endpoint is requested.

Observability (Logging & Monitoring)

Overall Grade: B

Positives: At the end of the day this works pretty well and mostly mimics our other services where we have our logs searchable in Kibana and our data visible in Grafana. The integration with Cloudwatch is pretty seamless and really easy to get started with.

Negatives: Observability in the serverless world is still tough. Telemetry tends to be distributed across lots of components, and it can be frustrating to track down where things unraveled. Given the rapid development of tooling in this category, we don't see this being a long-term problem. One tool that we have not tried, but that could bump this grade up, is AWS X-Ray, a distributed tracing system for AWS components.

Developer Experience (DRY & Code Reviews & Speed & Maintainability)

Overall Grade: A-

Positives: Developer experience was good, bordering on great. Each endpoint is encapsulated in its own small codebase which makes adding new code or working on existing code really easy.  Lambda layers have solved a lot of the DRY issues. We share response, error and database libraries between the lambdas as well as NPM modules. All of our lambda code gets reviewed and deployed like our normal services.

Negatives: In our view, the two biggest downsides have been unwieldy configuration and immature documentation. While configuration as code has its benefits, and it's great to be able to see the entire infrastructure of your application in code, it can be tough to jump into, and SAM/CloudFormation has a learning curve. Documentation is solid but could be better; part of the issue is the rapid pace of feature releases and some confusion about best practices.


Deployment

Overall Grade: A

Positives: Deployments are awesome with SAM/CloudFormation. From our GitLab runner we execute:

aws cloudformation package --template-file template.yaml --output-template-file packaged_{env}.yaml --s3-bucket {s3BucketName}

which uploads template resources to s3 and replaces paths in our template. We can then run:

aws cloudformation deploy --template-file packaged_{env}.yaml --stack-name {stackName} --capabilities CAPABILITY_IAM --region us-east-1 --parameter-overrides DeployEnv={env}

which creates a change set (showing the effects of the proposed deployment) and then creates the stack in AWS. This creates/updates all the resources, IAM permissions and associations defined in template.yaml. It is really fast: a normal build and deployment of all of our Lambdas, API Gateway endpoints, permissions, etc. takes ~90 seconds (not including running our tests).

Negatives: Unclear best practices for deploying to multiple environments. One approach is to use different API stages in API Gateway and version your other resources for those stages. The other (the way we chose) is to have completely separate stacks for different environments and pass a DeployEnv variable into our CloudFormation scripts.

Performance (Latency & Scalability)

Overall Grade: A-

Positives: To test our performance and latency before going live, we proxied all of our requests in parallel to both our existing (Dropwizard) API and the new serverless API. It is important to consider the differences between the systems (data stores, caches, etc.), but after some tweaking we were able to achieve equivalent P99, P95 and P50 response times between the two. In addition to the ability to massively scale out instances in response to traffic, the beauty of the serverless API is that you can fine-tune performance on a per-function (per-endpoint) basis. CPU share is allocated in proportion to the memory of each Lambda, so increasing a Lambda's memory can directly decrease its latency. When we first started routing traffic in parallel to our serverless API, we noticed that some of the bulk API endpoints were not performing as well as we had hoped. Instead of increasing the CPU/memory of the entire deployment, we only had to do this for the slow Lambdas.

Negatives: A well-known and often discussed issue with serverless is dealing with cold start times: the latency added when your cloud provider brings up a fresh instance to process a request to your serverless app. In practice this sometimes added several hundred milliseconds of latency on a cold start for our endpoints. Once spun up, instances are not torn down immediately, and subsequent requests (if routed to the same instance) won't suffer from that initial cold start. There are several strategies for avoiding cold starts, like "pre-warming" your API with dummy requests when you know traffic is about to spike. You can also configure your application to run with more memory (which also increases CPU), and such instances persist for longer between requests. Both strategies cost money, so the goal is to find the right balance for your service.

Example of response times when scaling an arbitrary computation-heavy Lambda from 128 MB to 512 MB (at ~14:40)

Overall latency with spikes over a ~3 hour period (most Lambdas configured at 128 MB or 256 MB). Spikes are either cold start times, calls that involve several hundred queries (bulk APIs), or a combination. Note: this performance is tuned to be equivalent to the P99, P90 and P50 times of the old service.


At the end of the day this setup has been a fun experiment and has, for the most part, satisfied our requirements. Now that we have set this up once, we will definitely use it again for suitable projects in the future. A few things we didn't touch on in this post are pricing and security, two very important considerations for any project. Pricing is tricky to compare accurately because it depends on traffic, function size, function execution time and many other configurable options along the entire stack. It is worth mentioning, though, that if you have large and consistent traffic to your APIs, it is highly unlikely that serverless will be as cost effective as rolling your own (large) instance. That said, if you want a low-maintenance service that can scale on demand to handle spiky, intermittent traffic, it is definitely worth considering. Security considerations are also highly dependent on your use case. For us, this is an internal-facing application, so strategies like an internal firewall, narrowly scoped IAM permissions and basic gateway token validation sufficed. For public-facing applications there are a variety of security strategies to consider that aren't mentioned in this post. Overall, we would argue that the tools are at a point where choosing a serverless setup does not feel like sacrificing development standards. Still not a silver bullet, but a great tool to have in the toolbox.


Overview of serverless computing: https://martinfowler.com/articles/serverless.html

API Gateway: https://docs.aws.amazon.com/apigateway/index.html#lang/en_us

Lambda: https://docs.aws.amazon.com/lambda/index.html#lang/en_us

SAM: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html

CloudFormation: https://docs.aws.amazon.com/cloudformation/index.html#lang/en_us

DynamoDB: https://docs.aws.amazon.com/dynamodb/index.html#lang/en_us

CloudWatch: https://docs.aws.amazon.com/cloudwatch/index.html#lang/en_us

X-Ray: https://docs.aws.amazon.com/xray/index.html#lang/en_us

Grafana: http://docs.grafana.org/

ELK stack: https://www.elastic.co/guide/index.html

GitLab CI/CD: https://docs.gitlab.com/ee/ci/

Improving Marketing Efficiency with Machine Learning


Here at Zulily, we offer thousands of new products to our customers at a great value every day. These products are available for about 72 hours; to inform existing and potential customers about our ever-changing offerings, the Marketing team launches new ads daily for these offerings on Facebook.

To get the biggest impact, we only run the best-performing ads. When done manually, choosing the best ads is time-consuming and doesn’t scale. Moreover, the optimization lags behind the continuously changing spend and customer activation data, which means wasted marketing dollars. Our solution to this problem is an automated, real-time ad pause mechanism powered by Machine Learning.

Predicting CpTA

Marketing uses various metrics to measure ad efficiency. One of them is CpTA, or Cost per Total Activation (see this blog post for a deeper dive on how we calculate this metric). A lower CpTA means spending less money to get new customers, so lower is better.

To pause ads with high CpTA, we trained a Machine Learning model to predict the next-hour CpTA using the historical performance data we have for ads running on Facebook. If the model predicts that the next-hour CpTA of an ad will exceed a certain threshold, that ad will be paused automatically. The marketing team is empowered to change the threshold at any time.

Ad Pause Service

We host the next-hour CpTA model as a service and have other wrapper microservices deployed to gather and pass along the real-time predictor data to the model. These predictors include both relatively static attributes about the ad and dynamic data such as the ad’s performance for the last hour. This microservice architecture allows us to iterate quickly when doing model improvements and allows for tight monitoring of the entire pipeline.

The end-to-end flow works as follows. We receive spend data from Facebook for every ad hourly. We combine that with activation and revenue data from the Zulily website and mobile apps to calculate the current CpTA. Then we use the CpTA threshold values set by Marketing and our next-hour CpTA prediction to evaluate and act on the ad. This automatic flow helps manage the large number of continuously changing ads.
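The evaluation step above boils down to a simple comparison. Here is a hedged sketch (the function names and numbers are illustrative, and the real next-hour prediction comes from the ML model service):

```javascript
// Current CpTA: spend divided by total activations over the window.
// Guard against zero activations (no signal yet -> treat as infinite cost).
const currentCpTA = (spend, totalActivations) =>
  totalActivations > 0 ? spend / totalActivations : Infinity;

// Pause the ad if the predicted next-hour CpTA exceeds the
// Marketing-configured threshold.
const shouldPauseAd = (predictedNextHourCpTA, threshold) =>
  predictedNextHourCpTA > threshold;

console.log(currentCpTA(500, 25));    // 20
console.log(shouldPauseAd(32.5, 30)); // true
```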


Results and Conclusion

The automatic ad pause system has increased our efficiency through the Facebook channel and given Marketing more time to do what they do best: getting people excited about the fresh and unique products offered by Zulily. Stay tuned for our next post, where we take a deeper dive into the ML models.

Serving dynamically resized images at scale using Serverless

At zulily, we take pride in helping our customers discover amazing products at tremendous value, and we rely heavily on highly curated images and videos to tell these product stories. These images, in particular, form the bulk of the content on our site and apps.


Two of our popular events from the weekend of Oct 20th, 2018

Today, we will talk about how zulily leverages serverless technologies to serve optimized images on the fly to a variety of devices with varying resolutions and screen sizes.

In any given month, we serve over 23 billion image requests using our CDN partners. This results in over a Petabyte per month of data transferred to our users around the globe.

We use an AWS Simple Storage Service (S3) bucket as the origin for all our images. Our in-house studio and merchandising teams upload rich images to S3 buckets using internal tools. As you can imagine, these images are quite large in file size; downloading and displaying them as-is would result in a sub-optimal experience for our customers and wasted bandwidth.
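One common way to resize on the fly without destroying CDN cache hit rates (an illustrative technique, not necessarily zulily's exact implementation) is to snap requested widths to a fixed set of breakpoints:

```javascript
// Snap an arbitrary requested width to the nearest supported
// breakpoint (hypothetical values). Serving only a fixed set of
// sizes keeps the CDN cache hit rate high while still matching
// each device's resolution closely.
const BREAKPOINTS = [160, 320, 480, 640, 960, 1280];

const snapWidth = (requested) =>
  BREAKPOINTS.find((w) => w >= requested) ||
  BREAKPOINTS[BREAKPOINTS.length - 1];

console.log(snapWidth(300));  // 320
console.log(snapWidth(2000)); // 1280
```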


Dynamic image resizing on the fly

