In 2014, we made the decision to build our core data platform on Google Cloud Platform, and one of the products critical to that decision was Google BigQuery. We knew the scale at which it enabled us to perform analysis would be critical for our business in the long run. Today we have more than 200 unique users performing analysis on a monthly basis.
Once we started using Google BigQuery at scale, we soon realized our analysts needed better tooling around it. The key requests we started getting were:
- Ability to schedule jobs: Analysts needed the ability to run queries at regular intervals to generate data and metrics.
- Define workflows of queries: Analysts wanted to run multiple queries in sequence and share data across them through temp tables.
- Simplified data sharing: Finally, it became clear teams needed to share the generated data with other systems. For example, downloading it to use in R programs, or sending it to another system through Kafka for processing.
zuFlow is zulily’s query workflow and scheduling solution for Google BigQuery. There are a few key concepts:
- Job: An executable entity that encompasses multiple queries along with a schedule.
- Query: A SQL statement that can be executed on Google BigQuery.
- Keyword: A variable defined for use in queries.
- Looper: The ability to define loops, like foreach statements.
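To make these concepts concrete, here is a minimal sketch of how they might relate to each other as a data model. The class and field names are our illustration, not the actual zuFlow schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Keyword:
    """A named variable that can be referenced inside query text."""
    name: str
    value: str

@dataclass
class Looper:
    """A foreach-style loop: the variable takes each value in turn."""
    variable: str
    values: List[str]

@dataclass
class Query:
    """A single BigQuery SQL statement within a job."""
    sql: str
    output: str  # e.g. a destination table name

@dataclass
class Job:
    """An executable entity: an ordered list of queries plus a schedule."""
    name: str
    schedule: str  # e.g. a cron expression
    owner_email: str
    keywords: List[Keyword] = field(default_factory=list)
    queries: List[Query] = field(default_factory=list)

# A hypothetical daily job with one query writing to a temp table
job = Job(name="daily_metrics", schedule="0 6 * * *",
          owner_email="analyst@example.com")
job.queries.append(Query(sql="SELECT 1", output="tmp_metrics"))
```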
High Level Design
zuFlow is a web application that enables users to setup jobs and run them either on demand or based on a schedule.
- We use Django with NGINX to handle our web traffic.
- We leverage Google Cloud SQL to store the config DB and to keep track of runtime state.
- We have integrated the system with an off-the-shelf open-source scheduler called SOS. We plan to migrate this to Airflow in the future.
- Flowrunner is the brain of the system, written in Python. It reads job definitions from the config DB, executes the queries, and stores the runtime details back in the DB. A few key capabilities it provides are:
- Concurrency: We manage our concurrency to make sure we are not overloading the system.
- Retry: In a few scenarios, based on error codes, we retry queries.
- Cleanup: It is responsible for cleaning up after jobs are run, including historical data cleanup.
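The retry capability can be sketched roughly as follows. This is an illustrative pattern, not the Flowrunner code; which BigQuery error reasons count as retryable, and the backoff parameters, are assumptions for the example:

```python
import time

class QueryError(Exception):
    """Stand-in for a query failure carrying a BigQuery error reason."""
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason

# Error reasons treated as transient (assumed set, for illustration)
RETRYABLE_REASONS = {"backendError", "rateLimitExceeded", "internalError"}

def run_with_retry(execute, max_attempts=3, base_delay=2.0):
    """Call `execute()` (a callable that raises QueryError on failure),
    retrying transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return execute()
        except QueryError as err:
            # Give up on non-transient errors or after the last attempt
            if err.reason not in RETRYABLE_REASONS or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```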
Job Viewer: Once logged in, you can see your jobs or view all jobs in the system.
Creating a Job: You provide a name, a schedule to run on, and the email address of the owner.
Keywords/variables: You can create keywords to reuse in your queries. This enables analysts to define a parameter and use it in their queries instead of hardcoding values. We also have predefined system keywords for date/time handling and for making it easier for users to shard tables. Examples:
- DateTime: CURRENT_TIMESTAMP_PT, CURRENT_DATE_PT, CURRENT_MONTH_PT, CURRENT_TIME_PT, LAST_RUN_TIMESTAMP_PT, LAST_RUN_TIMESTAMP, LAST_RUN_DATE_PT
  The last-run keywords provide the BQ-format Pacific date of the last run of this job (this will be CURRENT_DATE_PT on the first run).
- Sharding: *_BQ
  Provides formatted versions of date strings for table shard references (without dashes: YYYYmmdd).
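Keyword substitution amounts to rendering a query template before submitting it to BigQuery. The sketch below illustrates the idea with a `{NAME}` placeholder syntax; the actual zuFlow placeholder syntax and the exact keyword set are assumptions here:

```python
from datetime import datetime

def render_query(sql, run_time, extra_keywords=None):
    """Substitute {KEYWORD} placeholders in a query template.

    System keywords are derived from the run time; user-defined
    keywords are passed in as a dict. Placeholder syntax is assumed.
    """
    keywords = {
        "CURRENT_DATE_PT": run_time.strftime("%Y-%m-%d"),
        "CURRENT_DATE_BQ": run_time.strftime("%Y%m%d"),  # shard-friendly, no dashes
    }
    keywords.update(extra_keywords or {})
    for name, value in keywords.items():
        sql = sql.replace("{" + name + "}", value)
    return sql

sql = "SELECT * FROM events_{CURRENT_DATE_BQ} WHERE region = '{REGION}'"
print(render_query(sql, datetime(2015, 3, 14), {"REGION": "US"}))
# SELECT * FROM events_20150314 WHERE region = 'US'
```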
Looping: Very soon after our first release, we got requests to add loops. This enables users to define a variable and loop through its values.
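Conceptually, a looper expands one query template into a list of concrete queries, one per value. A minimal sketch (again using an assumed `{VARIABLE}` placeholder syntax):

```python
def expand_loop(sql_template, variable, values):
    """Expand a foreach-style loop: return one concrete query per value."""
    return [sql_template.replace("{" + variable + "}", v) for v in values]

queries = expand_loop(
    "SELECT COUNT(*) FROM orders WHERE country = '{COUNTRY}'",
    "COUNTRY", ["US", "CA", "UK"])
# Three queries, one per country value
```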
Query Definition: Now you are ready to write a Google BigQuery query and define where the output will be stored. There are four options:
- BQ Table: You provide a BQ table and decide whether to replace it or append to it. You can also mark the output table as a temp table, and the system will clean it up after the job finishes executing.
- CSV: If you pick CSV, you need to provide a GCS location for the output.
- Cloud SQL (MySQL): You can also export the results to Cloud SQL.
- Kafka: You provide a Kafka topic name, and the results are published as messages.
You can define multiple queries and share data across them through temp tables in BQ.
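The sequencing of queries and the cleanup of temp tables could look roughly like this. `execute` stands in for submitting SQL to BigQuery, and the query representation is our assumption, not the actual Flowrunner internals:

```python
def run_job(queries, execute):
    """Run a job's queries in order, then drop its temp tables.

    `queries` is a list of (sql, output_table, is_temp) tuples;
    `execute(sql)` is a stand-in for submitting SQL to BigQuery.
    Later queries can read tables written by earlier ones.
    """
    temp_tables = []
    try:
        for sql, output_table, is_temp in queries:
            execute(f"CREATE OR REPLACE TABLE {output_table} AS {sql}")
            if is_temp:
                temp_tables.append(output_table)
    finally:
        # Cleanup runs even if a query fails partway through
        for table in temp_tables:
            execute(f"DROP TABLE {table}")
```

The second query reading from the first query's temp table is what lets analysts build multi-step workflows without managing intermediate tables themselves.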
Job Overview: This shows the full definition of the job.
We have thought about open sourcing the solution. Please let us know if you are interested in this system.