ZATA: How we used Kubernetes and Google Cloud to expose our Big Data platform as a set of RESTful web services

Authors: Shailu Mishra, Sudhir Hasbe

In our initial blog post about the zulily big data platform, we briefly talked about ZATA (the zulily data access service). Today we want to dive deeper into ZATA and explain our thought process and how we built it.

Goals

As a data platform team we had three goals:

  1. Rich data generated by our team shouldn’t be limited to analysts. It should be available to systems and applications via a simple and consistent API.
  2. Have the flexibility to change our backend data storage solutions over time without impacting our clients.
  3. Zero development time for incremental data-driven APIs.

ZATA was our solution for achieving these goals. We abstracted our data platform behind a REST-based service layer that our clients use to fetch datasets, and we have been able to swap out storage layers without any changes to our client systems.

Selecting Data Storage solution

There are three different attributes you have to figure out before you pick a storage technology:

  1. Size of data: Is it big data or relatively small data? In short, do you need something that will fit in MySQL, or do you need to look at solutions like Google BigQuery or AWS Aurora?
  2. Query latency: How fast do you need to respond to queries? Do you need milliseconds, or are a few seconds OK, especially for large datasets?
  3. Data type: Is it relational data, key-value pairs, complex JSON documents, or a search workload?

As an enterprise, we need all combinations of these. The following are choices our team has made over time for different attributes:

  1. Google BigQuery: Great for large datasets (terabytes), but latency is in seconds; storage is columnar.
  2. AWS Aurora: Great for large datasets (100s of gigabytes) with very low latency for queries.
  3. Postgres-XL: Great for large datasets (100s of gigabytes to terabytes) with excellent performance for aggregation queries, but it is very difficult to manage and still early in its maturity cycle. We eventually moved these datasets to AWS Aurora.
  4. Google Cloud SQL, MySQL or SQL Server: For small datasets (gigabytes) with very low latency (milliseconds).
  5. MongoDB or Google Bigtable: Good for large-scale datasets that need low-latency document lookups.
  6. Elasticsearch: We use Elasticsearch for search scenarios, both fuzzy and exact match.

ZATA Architecture

[Figure: ZATA architecture diagram]

The key runtime components of ZATA are:

Mapping Layer

This layer looks at incoming URLs and maps them to backend systems. For example, the request http://xxxxx.zulily.com/dataset/product-offering?eventStartDate=[2013-11-15,2013-12-01]&outputFields=eventId,vendorId,productId,grossUnits maps as follows:

  1. The backend is Google BigQuery (based on the config DB mapping for product-offering).
  2. The dataset used is product-offering, which is simply a view in BigQuery.
  3. The filter eventStartDate=[2013-11-15,2013-12-01] is transformed to WHERE eventStartDate BETWEEN '2013-11-15' AND '2013-12-01'.
  4. The requested output fields are eventId, vendorId, productId and grossUnits.
  5. The resulting BigQuery query is:

SELECT eventId, vendorId, productId, grossUnits FROM product-offering WHERE eventStartDate BETWEEN '2013-11-15' AND '2013-12-01'

The mapping layer decides which mappings to use and how to transform the HTTP request into something the backend will understand. This transformation looks very different for MongoDB or Google Bigtable.
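
To make the mapping concrete, here is a minimal sketch in Go of the kind of translation described above. The DatasetMapping type and the handling of the [start,end] range syntax are simplified illustrations, not the actual ZATA code.

// A minimal sketch (not the actual ZATA code) of the request-to-query translation.
// DatasetMapping is a hypothetical record loaded from the config DB; the function
// only handles the [start,end] range syntax and outputFields parameter from the example.
package main

import (
  "fmt"
  "net/url"
  "strings"
)

type DatasetMapping struct {
  Backend string // e.g. "bigquery", used to pick the execution layer
  View    string // e.g. "product-offering", the view exposed to ZATA
}

func buildSQL(m DatasetMapping, params url.Values) string {
  fields := strings.Split(params.Get("outputFields"), ",")

  var where []string
  for key, vals := range params {
    if key == "outputFields" {
      continue
    }
    v := vals[0]
    // "[2013-11-15,2013-12-01]" becomes a BETWEEN clause.
    if strings.HasPrefix(v, "[") && strings.HasSuffix(v, "]") {
      if bounds := strings.SplitN(strings.Trim(v, "[]"), ",", 2); len(bounds) == 2 {
        where = append(where, fmt.Sprintf("%s BETWEEN '%s' AND '%s'", key, bounds[0], bounds[1]))
        continue
      }
    }
    where = append(where, fmt.Sprintf("%s = '%s'", key, v))
  }

  sql := fmt.Sprintf("SELECT %s FROM %s", strings.Join(fields, ","), m.View)
  if len(where) > 0 {
    sql += " WHERE " + strings.Join(where, " AND ")
  }
  return sql
}

func main() {
  params, _ := url.ParseQuery("eventStartDate=[2013-11-15,2013-12-01]&outputFields=eventId,vendorId,productId,grossUnits")
  fmt.Println(buildSQL(DatasetMapping{Backend: "bigquery", View: "product-offering"}, params))
}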

Execution Layer

The execution layer is responsible for generating queries in the protocol that the storage engine understands. It also executes the queries against the backend and fetches result sets efficiently. Our current implementation supports several protocols: MongoDB, standard JDBC, and HTTP requests for Google BigQuery, Bigtable and Elasticsearch.
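
A rough sketch of how a per-backend execution layer can be abstracted; the Executor interface and Row type below are assumptions for illustration rather than the actual ZATA interfaces.

// Each storage engine gets its own Executor; the service picks one based on
// the dataset's config DB mapping (backend names here are illustrative).
package zata

import "context"

// Row is a normalized record handed to the transform layer.
type Row map[string]interface{}

// Executor runs a dataset query against a specific backend protocol.
type Executor interface {
  Execute(ctx context.Context, query string) ([]Row, error)
}

// executors maps the backend name from the config DB to an implementation,
// e.g. "bigquery", "jdbc", "mongodb", "elasticsearch".
var executors = map[string]Executor{}

func Register(backend string, e Executor) { executors[backend] = e }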

Transform Layer

This layer is responsible for transforming the data coming from any of the backend sources and normalizing it, which allows our clients to be agnostic of the storage mechanism in our backend systems. We went with JSON as the format given how prevalent it is among services and application developers.

For the previous example from the mapping layer, the response will be the following:

[
  {"eventId": "12345", "vendorId": "123", "productId": "3456", "grossUnits": "10"},
  {"eventId": "23456", "vendorId": "123", "productId": "2343", "grossUnits": "234"},
  {"eventId": "33445", "vendorId": "456", "productId": "8990", "grossUnits": "23"},
  {"eventId": "45566", "vendorId": "456", "productId": "2343", "grossUnits": "88"}
]
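
Because every backend's result set is normalized into the same flat structure, the transform step itself stays small. Here is an illustrative sketch; the Row type matches the assumed execution-layer sketch above, not the real implementation.

// Whatever shape a backend returns, the transform layer emits the same flat
// JSON structure for clients.
package main

import (
  "encoding/json"
  "os"
)

type Row map[string]interface{}

func main() {
  rows := []Row{
    {"eventId": "12345", "vendorId": "123", "productId": "3456", "grossUnits": "10"},
  }
  // Every backend's result set is rendered the same way for clients.
  _ = json.NewEncoder(os.Stdout).Encode(rows)
}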

API auto discovery

Our third goal was zero development time for incremental data-driven APIs. We achieved this by creating an auto-discovery service. The job of this service is to regularly poll the backend storage services for changes and automatically add service definitions to the config DB. For example, in Google BigQuery or MySQL, once you add a view to a schema called “zata”, we automatically add the corresponding API to the ZATA service. This way a data engineer can keep adding services for the datasets they create without anyone writing new code.
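
A simplified sketch of what such a discovery poller could look like against MySQL; the information_schema query, polling interval, connection string and register callback are illustrative assumptions, not the production code.

// Polls MySQL's information_schema for views in the "zata" schema and registers
// each one as a dataset API (register would write to the config DB in practice).
package main

import (
  "database/sql"
  "log"
  "time"

  _ "github.com/go-sql-driver/mysql" // hypothetical choice of driver
)

func discover(db *sql.DB, register func(dataset string)) error {
  rows, err := db.Query(
    `SELECT table_name FROM information_schema.views WHERE table_schema = 'zata'`)
  if err != nil {
    return err
  }
  defer rows.Close()

  for rows.Next() {
    var view string
    if err := rows.Scan(&view); err != nil {
      return err
    }
    register(view) // add/update the service definition in the config DB
  }
  return rows.Err()
}

func main() {
  db, err := sql.Open("mysql", "user:pass@tcp(mysql:3306)/")
  if err != nil {
    log.Fatal(err)
  }
  for {
    if err := discover(db, func(ds string) { log.Printf("discovered dataset %q", ds) }); err != nil {
      log.Printf("discovery failed: %v", err)
    }
    time.Sleep(5 * time.Minute)
  }
}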

API Schema Definition

The schema service enables users to discover all the APIs supported by ZATA and to view each dataset's schema to understand what requests they can send. Clients can get the list of available datasets:

Dataset Request: http://xxxxx.zulily.com/dataset

[
  { "datasetName": "product-offering-daily", ... },
  { "datasetName": "sales-hourly", ... },
  { "datasetName": "product-offering", ... }
]

Schema Request: Clients can then drill down into the schema of a selected dataset: http://xxxxx.zulily.com/dataset/product-offering/schema/

[
  { "fieldName": "eventId", "fieldType": "INTEGER" },
  { "fieldName": "eventStartDate", "fieldType": "DATETIME" },
  { "fieldName": "eventEndDate", "fieldType": "DATETIME" },
  { "fieldName": "vendorId", "fieldType": "INTEGER" },
  { "fieldName": "productStyle", "fieldType": "VARCHAR" },
  { "fieldName": "grossUnits", "fieldType": "INTEGER" },
  { "fieldName": "netUnits", "fieldType": "INTEGER" },
  { "fieldName": "grossSales", "fieldType": "NUMERIC" },
  { "fieldName": "netSales", "fieldType": "NUMERIC" }
]

Throughout all of this, the client has no knowledge of where the data lives or what storage system backs it, which makes the whole data story much more agile. If a dataset is moved from one backend to another, or its schema is altered, downstream systems keep working because the access points and the contracts are managed by ZATA.
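
As an illustration, a client could discover a dataset's schema with nothing more than an HTTP GET and standard JSON decoding. The host and response shape below are taken from the examples above; the code itself is just a sketch.

// Fetches and prints the schema for the product-offering dataset.
package main

import (
  "encoding/json"
  "fmt"
  "log"
  "net/http"
)

type Field struct {
  FieldName string `json:"fieldName"`
  FieldType string `json:"fieldType"`
}

func main() {
  resp, err := http.Get("http://xxxxx.zulily.com/dataset/product-offering/schema/")
  if err != nil {
    log.Fatal(err)
  }
  defer resp.Body.Close()

  var schema []Field
  if err := json.NewDecoder(resp.Body).Decode(&schema); err != nil {
    log.Fatal(err)
  }
  for _, f := range schema {
    fmt.Printf("%s (%s)\n", f.FieldName, f.FieldType)
  }
}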

Storage Service Isolation

As we rolled out ZATA over time, we realized the need for storage service isolation. Having a single service support multiple backend storage solutions with different latency requirements didn’t work very well. The slowest backend tends to slow things down for everyone else.

This forced us to rethink our ZATA deployment strategy. Around the same time, we were experimenting with Docker and using Kubernetes as an orchestration mechanism.

We ended up creating a separate Docker container and Kubernetes service for each of the backend storage solutions. We now have a zata-bigquery service which handles all BigQuery-specific calls; similarly, we have zata-mongo, zata-jdbc and zata-es services. Each of these Kubernetes services can be scaled individually based on anticipated load.

In addition to the individual Kubernetes services, we also created a zata-router service, which is essentially nginx hosted in Docker. The zata-router service accepts all incoming HTTP requests for ZATA and, based on the nginx config, routes traffic to the various Kubernetes services available in the cluster. The nginx config in the zata-router service is dynamically refreshed by the polling service so that new APIs become discoverable.
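
A rough sketch of how that refresh could work; the location template, config path and reload command below are illustrative assumptions rather than our production setup.

// Regenerates the nginx routing config from the discovered dataset-to-service
// mappings and asks nginx to reload it.
package main

import (
  "log"
  "os"
  "os/exec"
  "text/template"
)

var nginxTmpl = template.Must(template.New("nginx").Parse(`
{{range $dataset, $service := .}}
location /dataset/{{$dataset}} {
    proxy_pass http://{{$service}};
}
{{end}}
`))

func refresh(routes map[string]string) error {
  f, err := os.Create("/etc/nginx/conf.d/zata.conf")
  if err != nil {
    return err
  }
  if err := nginxTmpl.Execute(f, routes); err != nil {
    f.Close()
    return err
  }
  if err := f.Close(); err != nil {
    return err
  }
  // Reload nginx configuration without dropping in-flight connections.
  return exec.Command("nginx", "-s", "reload").Run()
}

func main() {
  // Dataset -> Kubernetes service name, as produced by the discovery poller.
  routes := map[string]string{
    "product-offering": "zata-bigquery",
    "sales-hourly":     "zata-jdbc",
  }
  if err := refresh(routes); err != nil {
    log.Fatal(err)
  }
}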

[Figure: ZATA deployment with per-backend Kubernetes services behind the zata-router]

ZATA has made our data more accessible across the organization while allowing us to move fast and change the storage layer as we scaled up.

The way we Go(lang)

Here at zulily, Go is increasingly becoming the language of choice for many new projects, from tiny command-line apps to high-volume, distributed services. We love the language and the tooling, and some of us are more than happy to talk your ear off about it. Setting aside the merits and faults of the language design for a moment (over which much digital ink has already been spilled), it’s undeniable that Go provides several capabilities that make a developer’s life much easier when it comes to building and deploying software: static binaries and (extremely) fast compilation.

What makes a good build?

In general, the ideal software build should be:

  • fast
  • predictable
  • repeatable

Being fast allows developers to quickly iterate through the develop/build/test cycle, and predictable/repeatable builds allow for confidence when shipping new code to production, rolling back to a prior version or attempting to reproduce bugs.

Fast builds are provided by the Go compiler, which was designed such that:

It is possible to compile a large Go program in a few seconds on a single computer.

(There’s much more to be said on that topic in this interesting talk.)

We accomplish predictable and repeatable builds using a somewhat unconventional build tool: a Docker container.

Docker container as “build server”

Many developers use a remote build server or CI server in order to achieve predictable, repeatable builds. This makes intuitive sense, as the configuration and software on a build server can be carefully managed and controlled. Developer workstation setups become irrelevant since all builds happen on a remote machine. However, if you’ve spent any time around Docker containers, you know that a container can easily provide the same thing: a hermetically sealed, controlled environment in which to build your software, regardless of the software and configuration that exist outside the container.

By building our Go binaries using a Docker container, we reap the same benefits of a remote build server, and retain the speed and short dev/build/test cycle that makes working with Go so productive.

Our build container:

  • uses a known, pinned version of Go (v1.4.2 at the time of writing)
  • compiles binaries as true static binaries, with no cgo or dynamically-linked networking packages
  • uses vendored dependencies provided by godep
  • versions the binary with the latest git SHA in the source repo

This means that our builds stay consistent regardless of which version of Go is installed on a developer’s workstation or which Go packages happen to be on their $GOPATH! It doesn’t matter if the developer has godep or golint installed, whether they’re running an old version of Go, the latest stable version of Go or even a bleeding-edge build from source!

Git SHA as version number

godep is becoming a de facto standard for managing dependencies in Go projects, and vendoring (aka copying code into your project’s source tree) is the suggested way to produce repeatable Go builds. Godep vendors dependent code and keeps track of the git SHA for each dependency. We liked this approach, and decided to use git SHAs as versions for our binaries.

We accomplish this by “stamping” each of our binaries with the latest git SHA during the build process, using the ldflags option of the Go linker. For example:

go build -ldflags "-X main.BuildSHA ${GIT_SHA}"

This little gem sets the value of the BuildSHA variable in the main package to be the value of the GIT_SHA environment variable (which we set to the latest git SHA in the current repo). This means that the following Go code, when built using the above technique, will print the latest git SHA in its source repo:

package main

import "fmt"

var BuildSHA string // set by the compiler at build time!

func main() {
  fmt.Printf("I'm running version: %s\n", BuildSHA)
}
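
One common way to surface the stamped value, shown here as an illustrative sketch rather than something boilerplate prescribes, is a -version flag that prints BuildSHA and exits:

// Prints the build SHA and exits when invoked with -version.
package main

import (
  "flag"
  "fmt"
  "os"
)

var BuildSHA string // set by the linker via -ldflags at build time

func main() {
  version := flag.Bool("version", false, "print the build SHA and exit")
  flag.Parse()
  if *version {
    fmt.Println(BuildSHA)
    os.Exit(0)
  }
  // ... normal service startup ...
}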

Enter: boilerplate

Today, we’re open sourcing a simple project that we use for “bootstrapping” a new Go project that accomplishes all of the above. Enter: boilerplate

Boilerplate can be used to quickly set up a new Go project that includes:

  • a Docker container for performing Go builds as described above
  • a Makefile for building/testing/linting/etc. (because make is all you need)
  • a simple Dockerfile that uses the compiled binary as the container’s entrypoint
  • basic .gitignore and .dockerignore files

It even stubs out a Go source file for your binary’s main package.

You can find boilerplate on github. The project’s README includes some quick examples, as well as more details about the generated project.

Now, go forth and build! (pun intended)