zulily’s Kubernetes launch presentation at OSCON

In July, Steve Reed from zulily presented at O’Reilly’s Open Source Convention (OSCON), speaking about zulily’s pre-launch experience with Kubernetes. It was an honor for zulily to be asked to present as part of the Kubernetes customer showcase, given the success we have had with Kubernetes.

Kubernetes launch announcement: http://googlecloudplatform.blogspot.com/2015/07/Kubernetes-V1-Released.html

Leveraging Google Cloud Dataflow for clickstream processing

Clickstream is the largest dataset at zulily. As mentioned in zulily’s big data platform post, we store all our data in Google Cloud Storage (GCS) and use a Hadoop cluster on Google Compute Engine to process it. Our data has been growing at a significant pace, which required us to reevaluate our design, and the fastest-growing dataset was clickstream.

We had to decide whether to drastically increase the size of the cluster or try something new. We came across Cloud Dataflow and started evaluating it as an alternative.

Why did we move clickstream processing to Cloud Dataflow?

  1. Instead of increasing the size of the cluster to manage peak clickstream loads, Cloud Dataflow gave us an on-demand cluster that can dynamically grow or shrink, reducing costs.
  2. Clickstream data was growing very fast, and isolating it from other operational data processing would help the whole system in the long run.
  3. Clickstream processing did not require the complex correlations and joins that some of our other datasets demand, which made the transition easy.
  4. The programming model was extremely simple, which made the transition easy from a development perspective.
  5. Most importantly, we have always thought of our data processing clusters as transient entities, so dropping in another one (in this case Cloud Dataflow rather than Hadoop) was not a big deal. Our data would still reside in GCS, and we could still use clickstream data from the Hadoop cluster where required.

High Level Flow

[Image: high-level flow of clickstream processing with Cloud Dataflow]

One of the core principles of our clickstream system is self-service: teams can start recording new types of events at any time simply by registering a schema with our service. Clickstream events are pushed to our data platform through our real-time system (more on that later) and through log shipping to GCS.

In the new process using Cloud Dataflow, we take the following steps (sketched in code after the list):

  1. Create a pipeline object and read data from GCS into a PCollectionTuple object.
  2. Create multiple outputs from the log data based on event type and schema, using PCollectionTuple.
  3. Apply ParDo transformations to optimize the data for querying by business users (e.g., converting all dates to PST).
  4. Store the data back in GCS, our central data store, where it is also available to other Hadoop processes and for loading into BigQuery.
  5. Load into BigQuery, not directly from Dataflow but via the bq load command, which we found to be much more performant.
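
Our production pipeline is written against the Cloud Dataflow Java SDK (PCollectionTuple, ParDo and friends). Purely to illustrate the shape of steps 1-4, here is a minimal, hypothetical sketch written against the Apache Beam Go SDK (the open-source successor to the Dataflow SDKs); the bucket paths and the trivial transform are placeholders, not our actual pipeline.

package main

import (
  "context"
  "flag"
  "log"
  "strings"

  "github.com/apache/beam/sdks/v2/go/pkg/beam"
  "github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
  "github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

func main() {
  flag.Parse()
  beam.Init()

  p := beam.NewPipeline()
  s := p.Root()

  // 1. Read raw clickstream log lines from GCS (placeholder bucket path).
  lines := textio.Read(s, "gs://example-bucket/clickstream/raw/*.log")

  // 2-3. Transform each event. A real pipeline splits events by type/schema
  // and normalizes fields (e.g. converting dates to PST); this sketch just
  // trims whitespace as a stand-in ParDo.
  cleaned := beam.ParDo(s, strings.TrimSpace, lines)

  // 4. Write the processed events back to GCS, where Hadoop jobs can read
  // them and bq load can pick them up for BigQuery (step 5).
  textio.Write(s, "gs://example-bucket/clickstream/processed/events.txt", cleaned)

  if err := beamx.Run(context.Background(), p); err != nil {
    log.Fatalf("pipeline failed: %v", err)
  }
}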

The next step for us is to identify other scenarios that share clickstream’s characteristics (high volume, unstructured data that does not require heavy correlation with other datasets) and are therefore a good fit for Cloud Dataflow.

Building zulily’s big data platform

In July 2014 we started our journey toward building a new data platform that would allow us to use big data to drive business decisions. I would like to start with a quote from our Q2 2015 earnings call that was highlighted in various press outlets, and then share how we built a data platform that allows zulily to make decisions that were nearly impossible before.

“We compared two sets of customers from 2012 that both came in through the same channel, a display ad on a major homepage, but through two different types of ads,” [Darrell] Cavens said. “The first ad featured a globally distributed well-known shoe brand, and the second was a set of uniquely styled but unbranded women’s dresses. When we analyze customers coming in through the two different ad types, the shoe ad had more than twice the rate of customer activations on day one. But 2.5 years later, the spend from customers that came in through the women dresses ad was significantly higher than the shoe ad with the difference increasing over time.” – www.bizjournals.com

Our vision is for every decision, at every level of the organization, to be driven by data. In early 2014 we realized that our existing data platform, a combination of a SQL Server database for warehousing primarily structured operational data and a Hadoop cluster for unstructured data, was too limiting. We started by defining core principles for our new data platform (v3).

Principles for new platform

  1. Break silos between unstructured and structured data: One of the key issues with our old infrastructure was that unstructured data lived in a Hadoop cluster and structured data in SQL Server, which was extremely limiting. Almost every major decision required analytics that correlated on-site usage patterns (unstructured) with demand, shipments, or other transactions (structured). We needed a solution that allowed us to correlate these different types of datasets.
  2. High scale at the lowest possible cost: We already had a Hadoop cluster, but our data was growing exponentially, and as storage needs grew we had to keep adding nodes to the cluster, which increased cost even though compute was not growing at the same pace. In short, we were paying at the rate of compute (which is far more expensive) while growing at the rate of storage. We had to break this coupling. We also wanted a highly scalable system that would support our growth.
  3. High availability, or extremely low recovery time, with an eye on cost: Simply put, we wanted high availability without paying a premium for it.
  4. Enable real-time analytics: The existing data platform had no concept of real-time data processing, so whenever users needed to analyze things in real time they had to be given access to production operational systems, which increased risk. This had to change.
  5. Self-service: Our tools, systems, and services need to be easy for our users to use without involvement from the data team.
  6. Protect the platform from users: Another challenge with the existing system was the lack of isolation between core data platform components and user-facing services. In the new platform, we wanted to ensure that individual users could not bring down the entire system.
  7. Service (data) level agreements (SLAs): We wanted to be able to give our users SLAs for our datasets, covering data accuracy and freshness. In the past this was a huge issue due to the design of the system and the lack of tools to monitor and report SLAs.

zulily data platform (v3)

The zulily data platform includes multiple tools, platforms, and products; the combination is what makes it possible to solve complex problems at scale.

[Image: zulily data platform (v3) architecture]

There are six broad components in the stack:

  1. Centralized big data storage built on Google Cloud Storage (GCS). This allows us to keep all our data in one place (principle 1) and share it across all systems and layers of the stack.
  2. The big data platform is primarily for batch processing of data at scale. We use Hadoop on Google Compute Engine for our big data processing.
    • All our data is stored in GCS, not HDFS, which enables us to decouple storage from compute and manage cost (principle 2).
    • Additionally, the Hadoop cluster is transient for us since it holds no data: if the cluster completely goes away, we can build a new one on the fly. We think of our cluster as a transient entity (principle 3).
    • We have started looking at Google Cloud Dataflow for specific scenarios, but more on that soon. Our goal is to make all data available for analytics anywhere from 30 minutes to a few hours, depending on the SLA.
  3. The real-time data platform is a zulily-built stack for high-scale data collection and processing (principle 4). We now collect more than 10,000 events per second, and that number is growing quickly as we see value through our decision portal and other scenarios.
  4. The business data warehouse, built on Google BigQuery, enables us to provide a highly scalable analytics service to hundreds of users in the company.
    • We push all our data, structured and unstructured, real-time and batch, into Google BigQuery, breaking down all data silos (principle 1).
    • It also enables us to keep the big data platform and the data warehouse completely isolated (principle 6), ensuring that issues in one environment don’t impact the other.
    • Keeping the interface (SQL) the same for analysts between the old and new platforms lowered the barrier to adoption (principle 5).
    • We also have a high-speed, low-latency data store that hosts part of the business data warehouse for access through APIs.
  5. The data access and visualization component enables business users to make decisions based on the information available.
    1. The real-time decision portal, zuPulse, gives the business real-time insights.
    2. Tableau is our reporting and analytics visualization tool; business users use it to create reports for their daily execution. This lets our engineering team focus on core data and high-scale predictive scenarios, and makes reporting self-service (principle 5).
    3. ZATA, our data access service, enables us to expose all our data through an API. Any data in the data warehouse or the low-latency data store can be automatically exposed through ZATA. This enables various internal applications at zulily to show critical information to users, including vendors, who can see their historical sales on the vendor portal.
  6. Core data services are the backbone of everything we do, from moving data to the cloud to monitoring and making sure we abide by our SLAs. We continuously add new services and enhance existing ones; the goal of these services is to make developers and analysts across other areas more efficient.
    • zuSync is our data synchronization service, which lets us get data from any source system and move it to any destination. A source or destination can be an operational database (SQL Server, MySQL, etc.), a queue (RabbitMQ), Google Cloud Storage (GCS), Google BigQuery, a service (HTTP/TCP/UDP), an FTP location…
    • zuflow is our workflow service built on top of Google BigQuery; it allows analysts to define and schedule query chains for batch analytics. It can also send notifications and deliver data in different formats.
    • zuDataMon is our data monitoring service, which lets us make sure data is consistent across all systems (principle 7). This helps us catch issues with our services or with the data early. Today more than 70% of issues are identified by our monitoring tool rather than reported by users; our goal is to push that number above 90%.

This is a very high level view of what our data@zulily team does. Our goal is to share more often, with a deeper dive on our various pieces of technology, in the coming weeks. Stay tuned…

Partnering with zulily – A Facebook Perspective

During F8, Facebook’s global developer conference (March 25-26, 2015), Facebook announced a new program with zulily as one of the launch partners: Messenger for Business. Just a month later, the service was live and in production! We on the zulily engineering team truly enjoyed the partnership with Facebook on this exciting product launch. It’s been a few months since the launch, and we reached out to Rob Daniel, Senior Product Manager at Messenger, to share his team’s experience collaborating with zulily.

Here’s what Rob had to say:

How Messenger for Business Started

Messenger serves over 700 million active people with the mission of reinventing and improving everyday communication. We invest in performance, reliability, and the core aspects of a trusted and fast communication tool. Furthermore, we build features that allow people to better express themselves to their friends and contacts, including through stickers, free video and voice calling, GIFs, and other rich media content. We build the product to be as inclusive as possible – even supporting people that prefer to register for Messenger without a Facebook account – because we know that the most important feature of a messaging app is that all your friends and anyone you’d want to speak with is reachable, instantly.

In that vein, we found that some of people’s most important interactions are not just with friends and family, but with businesses they care about and interact with almost every day. However, until recently Messenger didn’t fully support a robust channel for communicating with these businesses. And when we looked at how people interact with businesses today – email, phone (typically…IVR phone trees), text – there were obvious drawbacks. Either the channels were noisy (email), lacked richness (text), or were just really heavyweight and inefficient (voice).

Reinventing How People Communicate with Businesses

When we started talking with zulily about a partnership to reinvent how people communicate with businesses, under the mission of creating a delightful, high-utility experience, we found an incredible match.

Why? Simply put, zulily thinks about Mom first. That’s the same people-centric approach to building products that’s at the heart of Messenger. And along the way, as we scoped out the interactions for launch, it was evident that we had the same mindset: build a product with incredible utility and value to people, establish a relationship and build trust, and over time build interactions that pay dividends via higher customer loyalty and lifetime value. And despite time pressures and the many opportunities to put business values over people values, we collectively held the line that people were first, business was second.

We also found that “zulily Speed” was no joke, and matched the Facebook mantra of “Move Fast.” Up to and through our announcement at F8 (Facebook’s two-day developer conference) and the launch the following month, our two teams moved in sync at incredible speed. As an example, late one evening an issue was spotted by a Messenger engineer, and by 10:00 am the next morning the problem had been identified, fixed, and pushed to production. That kind of turnaround time and coordination is just unheard of between companies at zulily’s and Facebook’s scale.

Speaking to the strength of the individual engineers on both sides, team sizes were small. This led to quick decision making and efficient, frequent communications between teams, forming a unique bond of trust and transparency. Despite the distance between Seattle and Menlo Park, we rarely had timing miscues and latency was incredibly low.

This Journey is 1% Finished

At Facebook we say “this journey is 1% finished,” highlighting that regardless of our past accomplishments we have a long journey ahead, and that we owe people that use our services so much more. And in that same spirit, we respect that Messenger is only beginning to provide zulily and other businesses the communication capabilities they need to form lasting, trusted, and valuable relationships with their customers. But we’re thrilled to be teamed up with a company, like zulily, that has the commitment and vision alignment to help shape the experience along the way and see the opportunity along the path ahead.

Sampling keys in a Redis cluster

We love Redis here at zulily. We store hundreds of millions of keys across many Redis instances, and we built our own internal distributed cache on top of Redis which powers the shopping experience for zulily customers.

One challenge when running a large, distributed cache using Redis (or many other key/value stores for that matter) is the opaque nature of the key spaces. It can be difficult to determine the overall composition of your Redis dataset, since most Redis commands operate on a single key. This is especially true when multiple codebases or teams use the same Redis instance(s), or when sharding your dataset over a large number of Redis instances.

Today, we’re open sourcing a Go package that we wrote to help with that task: reckon.

reckon enables us to periodically sample random keys from Redis instances across our fleet, aggregate statistics about the data contained in them — and then produce basic reports and metrics.
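
reckon’s actual sampling API is documented in the repo; purely as a hedged illustration of the underlying idea (Redis’s RANDOMKEY and TYPE commands), here is a minimal standalone sketch using the redigo client. The client library and address are assumptions for illustration, not necessarily what reckon uses internally.

package main

import (
  "fmt"
  "log"

  "github.com/gomodule/redigo/redis"
)

func main() {
  // Address is a placeholder; point it at any Redis instance.
  conn, err := redis.Dial("tcp", "localhost:6379")
  if err != nil {
    log.Fatal(err)
  }
  defer conn.Close()

  // RANDOMKEY returns a random key from the current database.
  key, err := redis.String(conn.Do("RANDOMKEY"))
  if err != nil {
    log.Fatal(err)
  }

  // TYPE reports which Redis data type the key holds (string, hash, set, ...).
  kind, err := redis.String(conn.Do("TYPE", key))
  if err != nil {
    log.Fatal(err)
  }

  fmt.Printf("sampled key %q of type %s\n", key, kind)
}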

While there are some existing solutions for sampling a Redis key space, the reckon package has a few advantages:

Programmatic access to sampling results

Results from reckon are returned in data structures, not just printed to stdout or a file. This is what allows a user of reckon to sample data across a cluster of redis instances and merge the results to get an overall picture of the keyspaces. We include some example code to do just that.

Arbitrary aggregation based on key and redis data type

reckon also allows you to define arbitrary buckets based on the name of the sampled key and/or the Redis data type (hash, set, list, etc.). During sampling, reckon compiles statistics about the various redis data types, and aggregates those statistics according to the buckets you defined.

Any type that implements the Aggregator interface can instruct reckon about how to group the Redis keys that it samples. This is best illustrated with some simple examples:

To aggregate only Redis sets whose keys start with the letter a:


func setsThatStartWithA(key string, valueType reckon.ValueType) []string {
  if strings.HasPrefix(key, "a") && valueType == reckon.TypeSet {
    return []string{"setsThatStartWithA"}
  }
  return []string{}
}

To aggregate sampled keys of any Redis data type that are longer than 80 characters:


func longKeys(key string, valueType reckon.ValueType) []string {
  if len(key) > 80 {
    return []string{"long-keys"}
  }
  return []string{}
}
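
As one more hypothetical aggregator following the same signature, you could bucket keys by a conventional "namespace:rest" prefix, which is handy when multiple teams share the same instance:

func keysByNamespace(key string, valueType reckon.ValueType) []string {
  // Group any key of the form "namespace:rest" (e.g. "cart:1234") under "cart".
  if i := strings.Index(key, ":"); i > 0 {
    return []string{key[:i]}
  }
  return []string{"no-namespace"}
}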

HTML and plain-text reports

When you’re done sampling, aggregating, and/or combining the results produced by reckon, you can easily produce a report of the findings in either plain text or static HTML. An example HTML report is shown below:

[Image: a sample reckon report showing key/value size distributions]

The report shows the number of keys sampled, along with some example keys and elements of those keys (the number of example keys/elements is configurable). Additionally, a distribution of the sizes of both the keys and elements is shown, in both standard and “power-of-two” form. The power-of-two form shows a more concise view of the distribution, using a concept borrowed from the original Redis sampler: each row shows a number p, along with the number of keys/elements whose size is <= p and > p/2.
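
In other words (a hypothetical helper, not part of reckon’s API), a key or element of size n is counted in the row for the smallest power of two p with p/2 < n <= p:

// powerOfTwoRow returns the smallest power of two p such that p/2 < n <= p,
// i.e. the report row in which a key or element of size n is counted (n >= 1).
func powerOfTwoRow(n int) int {
  p := 1
  for p < n {
    p *= 2
  }
  return p
}

A set with 20 elements, say, is counted in the 32 row (more than 16, at most 32).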

For instance, using the example report shown above, you can see that:

  • 68% of the keys sampled had key lengths between 8 and 16 characters
  • 89.69% of the sets sampled had between 16 and 32 elements
  • the mean number of elements in the sampled sets is 19.7

We have more features and refinements in the works for reckon, but in the meantime, check out the repo on github and let us know what you think. The codebase includes several example binaries to get you started that demonstrate the various usages of the package.

Pull requests are always welcome — and remember: Always be samplin’.

The way we Go(lang)

Here at zulily, Go is increasingly becoming the language of choice for many new projects, from tiny command-line apps to high-volume, distributed services. We love the language and the tooling, and some of us are more than happy to talk your ear off about it. Setting aside the merits and faults of the language design for a moment (over which much digital ink has already been spilled), it’s undeniable that Go provides several capabilities that make a developer’s life much easier when it comes to building and deploying software: static binaries and (extremely) fast compilation.

What makes a good build?

In general, the ideal software build should be:

  • fast
  • predictable
  • repeatable

Being fast allows developers to quickly iterate through the develop/build/test cycle, and predictable/repeatable builds allow for confidence when shipping new code to production, rolling back to a prior version or attempting to reproduce bugs.

Fast builds are provided by the Go compiler, which was designed such that:

It is possible to compile a large Go program in a few seconds on a single computer.

(There’s much more to be said on that topic in this interesting talk.)

We accomplish predictable and repeatable builds using a somewhat unconventional build tool: a Docker container.

Docker container as “build server”

Many developers use a remote build server or CI server in order to achieve predictable, repeatable builds. This makes intuitive sense, as the configuration and software on a build server can be carefully managed and controlled. Developer workstation setups become irrelevant since all builds happen on a remote machine. However, if you’ve spent any time around Docker containers, you know that a container can easily provide the same thing: a hermetically sealed, controlled environment in which to build your software, regardless of the software and configuration that exist outside the container.

By building our Go binaries using a Docker container, we reap the same benefits of a remote build server, and retain the speed and short dev/build/test cycle that makes working with Go so productive.

Our build container:

  • uses a known, pinned version of Go (v1.4.2 at the time of writing)
  • compiles binaries as true static binaries, with no cgo or dynamically-linked networking packages
  • uses vendored dependencies provided by godep
  • versions the binary with the latest git SHA in the source repo

This means that our builds stay consistent regardless of which version of Go is installed on a developer’s workstation or which Go packages happen to be on their $GOPATH! It doesn’t matter if the developer has godep or golint installed, whether they’re running an old version of Go, the latest stable version of Go or even a bleeding-edge build from source!

Git SHA as version number

godep is becoming a de facto standard for managing dependencies in Go projects, and vendoring (aka copying code into your project’s source tree) is the suggested way to produce repeatable Go builds. Godep vendors dependent code and keeps track of the git SHA for each dependency. We liked this approach, and decided to use git SHAs as versions for our binaries.

We accomplish this by “stamping” each of our binaries with the latest git SHA during the build process, using the ldflags option of the Go linker. For example:

ldflags "-X main.BuildSHA ${GIT_SHA}"

This little gem sets the value of the BuildSHA variable in the main package to be the value of the GIT_SHA environment variable (which we set to the latest git SHA in the current repo). This means that the following Go code, when built using the above technique, will print the latest git SHA in its source repo:

package main

import "fmt"

var BuildSHA string // set by the linker at build time (via -ldflags)!

func main() {
  fmt.Printf("I'm running version: %s\n", BuildSHA)
}

Enter: boilerplate

Today, we’re open sourcing a simple project that we use for “bootstrapping” a new Go project that accomplishes all of the above. Enter: boilerplate

Boilerplate can be used to quickly set up a new Go project that includes:

  • a Docker container for performing Go builds as described above
  • a Makefile for building/testing/linting/etc. (because make is all you need)
  • a simple Dockerfile that uses the compiled binary as the container’s entrypoint
  • basic .gitignore and .dockerignore files

It even stubs out a Go source file for your binary’s main package.

You can find boilerplate on github. The project’s README includes some quick examples, as well as more details about the generated project.

Now, go forth and build! (pun intended)

zulily hosted Women Who Code Event

[Photo: a sample of zulily’s “women in tech”]

zulily knows that women in tech rock! On January 28, 2015, zulily hosted the Women Who Code Seattle chapter’s kick-off meeting, bringing women in technology together to discuss their passions. zulily has built its business on providing great value to moms, and our female engineers are a big driving force in delivering on that promise, so we were very excited to partner with Women Who Code.

Women Who Code is a global nonprofit organization that supports women in technology, helping them advance their careers. They provide supportive environments where women can learn from each other while offering mentorship and networking. 

The night started with mingling and a variety of technology discussions, ranging from software and robotics to data analysis and bioengineering! College students and seasoned professionals gathered to share experiences, answer questions, and enjoy food and wine. The atmosphere was inspiring and empowering.

Kristin Smith, CEO of Code Fellows, shared with us her story and career path from Amazon to zulily to Code Fellows. Kristin talked about how women should be persistent, constantly learning new things and trying new technologies to build their skills and overcome self-doubt. “Embrace the ambiguity. Embrace the fear. That’s the time when things are going to explode!”

[Photo: Kristin Smith, CEO of Code Fellows]

We continued the night with interesting tech talks from women who work in zulily tech. Jaime Dughi is Sr. Product Manager of zulily’s Mobile Development team. She walked us through zulily’s unique business model and core development principles: fearless innovation and moving fast. She described the future of the mobile app platform in zulily’s business and how her team is advancing mobile development to tackle the demands of zulily customers while embracing the ever-changing mobile app landscape. (slides)

[Photo: Jaime Dughi, Sr. Product Manager, zulily Mobile Development Team]

Our next speaker was Sara Adineh, Software Engineer on the Personalization Team. Sara spoke about her passion for bringing more women into tech and shared her past volunteer activities, including helping run a club for female students at her school to support women in the tech and science fields. She then dove deep into what the Personalization Team does at zulily. “Personalization is the heart of zulily’s business model, presenting something special every day for each member,” said Sara. She described the mathematics and machine learning techniques that zulily uses to understand what its customers need, so we can offer them something fresh every day. Sara also talked about one of the engineering principles of zulily’s tech culture: zulily has open-sourced some of its projects and contributed to open-source technologies used by the company (such as Go, Kubernetes, and NSQ). (Slides)

[Photo: Sara Adineh, Software Engineer, zulily Personalization Team]

Our last speaker was Echo Li, Data Engineer on the Business Intelligence and Data Services team at zulily. She started with “I’m a woman, I’m a mom, I’m an engineer and I’m proud!” and the crowd went wild! She went on to explain her work as a data engineer at zulily and spoke about how to understand data and extract meaningful information from it. She talked about the different layers of the data-processing and analytics pipeline here at zulily, and shared the growth of zulily’s big data platform from SQL Server to Hadoop, Google Cloud, and BigQuery. (Slides)

[Photo: Echo Li, Sr. Data Warehouse Engineer, BI and Data Services]

It was a night of thoughtful questions, engaged attendees, and new friendships. We are looking forward to hosting the next Women in Tech event! Follow us if you want to hear about future events!

[Additional photos from the event]