zulily Engineering Blog

Evolution of Zulily’s Airflow Infrastructure

At Zulily, our need to run dependent actions on different schedules has continued to grow. Sometimes we need to communicate inventory changes to advertising platforms. At other times we need to aggregate data and produce reports on the effectiveness of spend and the conversions of ads, or perform other tasks. Early on, we knew that we needed a reliable workflow management system. We started with Apache Airflow version 1.8 two years ago and have continued to add hardware resources to it as the demands of our workloads increased.

Apache Airflow is a workflow management system that allows developers to describe workflow tasks and their dependency graph as code in Python. Because the workflows are code, we can keep a history of the changes and build solid pipelines step-by-step with proper monitoring.
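
As a rough illustration of "workflows as code", here is a minimal sketch of a two-task DAG written against the Airflow 1.10 import paths. The DAG id, task names, owner, shell commands, and schedule are all hypothetical and are not taken from our production pipelines. The Airflow imports are deferred into `build_dag()` only so that the plain task and dependency data can be inspected without Airflow installed.

```python
from datetime import datetime, timedelta

# Hypothetical task ids and shell commands, for illustration only.
TASKS = {
    "aggregate_ad_data": "echo 'aggregating ad data'",
    "build_spend_report": "echo 'building spend report'",
}

# (upstream, downstream) pairs: the report runs only after aggregation succeeds.
DEPENDENCIES = [("aggregate_ad_data", "build_spend_report")]

DEFAULT_ARGS = {
    "owner": "ads-team",                  # hypothetical owner
    "retries": 1,                         # retry a failed task once
    "retry_delay": timedelta(minutes=5),  # wait five minutes before the retry
}


def build_dag():
    """Create the DAG and wire up the dependency graph declared above."""
    # Imported here so the data above can be used without Airflow installed.
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="ad_reporting_example",
        default_args=DEFAULT_ARGS,
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )
    operators = {
        task_id: BashOperator(task_id=task_id, bash_command=command, dag=dag)
        for task_id, command in TASKS.items()
    }
    for upstream, downstream in DEPENDENCIES:
        # ">>" declares that the left task must finish before the right one starts.
        operators[upstream] >> operators[downstream]
    return dag
```

In an actual DAG file, the module would end with `dag = build_dag()` at the top level, since the scheduler discovers DAG objects among the module's global variables.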

Our Airflow deployment runs the large majority of our advertising management and reporting workflows. As our usage of Airflow increased, we made our deployment infrastructure more resilient to failures by adopting the new KubernetesPodOperator. This post describes our journey with Airflow from the Celery executor to the KubernetesPodOperator.

Our First Airflow 1.8 Deployment using Celery Executor 

With this version of Airflow we needed to maintain many separate services: the scheduler, RabbitMQ, and workers.

Components and Concepts 



Our Current Airflow 1.10.4 Deployment using KubernetesPodOperator 

In Airflow version 1.10.2 a new kind of operator called the KubernetesPodOperator was introduced. This allowed us to reduce setup steps and make the overall setup more robust and resilient by leveraging our existing Kubernetes cluster. 

Differences and New Components 

High Level Sequence of Events 

  1. Developer pushes or merges DAG definition and Task container code to master. 
  2. The CI/CD pipeline is set up to:
    1. Build and push the task container images to AWS ECR. 
    2. Push the new DAG definition to the AWS EFS. 
  3. When the Scheduler needs to schedule a task: 
    1. It reads the DAG definition from the AWS EFS. 
    2. It creates the short-lived Task pod with the Task container image specified. 
    3. Task executes and finishes.
    4. Logs are copied to S3. 
    5. Task pod exits. 
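
The task-pod part of this sequence can be sketched as a DAG file that declares one KubernetesPodOperator task. This is a minimal sketch, not our actual configuration: the ECR repository URI, namespace, DAG id, task id, image tag, and container arguments are all hypothetical placeholders, and it assumes the scheduler itself runs inside the Kubernetes cluster. The import path is the one used by the Airflow 1.10 series.

```python
from datetime import datetime

# Hypothetical ECR repository; the account id and region are placeholders.
ECR_REPOSITORY = "123456789012.dkr.ecr.us-west-2.amazonaws.com/ads-tasks"


def pod_task_kwargs(task_id, image_tag, arguments):
    """Return keyword arguments for one KubernetesPodOperator task."""
    return {
        "task_id": task_id,
        # Kubernetes object names may not contain underscores.
        "name": task_id.replace("_", "-"),
        "namespace": "airflow",  # hypothetical namespace
        "image": "{}:{}".format(ECR_REPOSITORY, image_tag),
        "arguments": list(arguments),
        "in_cluster": True,              # assumes the scheduler runs in the cluster
        "get_logs": True,                # stream container stdout into the task log
        "is_delete_operator_pod": True,  # delete the pod once the task has finished
    }


def build_dag():
    """Create a DAG whose single task runs in a short-lived pod."""
    # Imported here so pod_task_kwargs can be used without Airflow installed.
    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    dag = DAG(
        dag_id="inventory_sync_example",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@hourly",
    )
    KubernetesPodOperator(
        dag=dag,
        **pod_task_kwargs("sync_inventory", "v1", ["--platform", "example"])
    )
    return dag
```

For example, `pod_task_kwargs("sync_inventory", "v1", [])["image"]` evaluates to the repository URI followed by `:v1`. Because each task names its own image, two tasks in the same DAG can ship different package dependencies. Copying finished logs to S3 is handled by Airflow's logging configuration rather than by the operator, so it does not appear in this sketch.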

Resiliency & Troubleshooting 


Airflow continues to be a great tool that helps us achieve our business goals. Using the KubernetesPodOperator and the LocalExecutor with Airflow version 1.10.4, we have streamlined our infrastructure and made it more resilient in the face of machine failures. Our package dependencies have become more manageable and our tasks have become more flexible. We are piggy-backing on our existing Kubernetes infrastructure instead of maintaining another queuing and worker mechanism. This has enabled our developers to devote more time to improving the customer experience and to worry less about infrastructure. We are excited about future developments in Airflow and how they will enable us to drive Zulily's business!
