In July Steve Reed from zulily presented at O’Reilly’s Open Source Convention (OSCON). He spoke to zulily’s pre-launch experience with Kubernetes. It was an honor for zulily to be asked to speak as part of the Kubernetes customer showcase, given the success we have had with Kubernetes.
Kubernetes launch announcement: http://googlecloudplatform.blogspot.com/2015/07/Kubernetes-V1-Released.html
Clickstream is the largest data set at zulily. As mentioned in zulily’s big data platform we store all our data in Google cloud storage and use Hadoop cluster in Google Compute Engine to process it. Our data size has been growing at a significant pace, this required us to reevaluate our design. The dataset that was growing fastest was our clickstream dataset.
We had to make a decision to either increase the size of cluster drastically or try something new. We came across Cloud Dataflow and started evaluating it as an alternative.
Why did we move clickstream processing to Cloud Dataflow?
- Instead of increasing the size of the cluster to manage peak loads of clickstream Cloud Dataflow gave us ability to have an on-demand cluster that could dynamically grow or shrink reducing costs
- Clickstream data was growing very fast and the ability to isolate it from other operational data processing would help the whole system in the long run
- Clickstream data processing did not require complex correlations and joins which are a must in some of the other datasets – this made the transition easy.
- Programming model was extremely simple and transition from development perspective was easy
- Most importantly, we have always thought about our data processing clusters as transient entities, which meant dropping another entity in this case Cloud Dataflow which was different from Hadoop was not a big deal. Our data would still reside in GCS and we could still use clickstream data from Hadoop cluster where required.
High Level Flow
One of the core principles of our clickstream system is to be self serviceable. In short, teams can start recording new types of events anytime by just registering a schema to our service. Clickstream events are pushed to our data platform through our real-time system(more later) and through log shipping to GCS.
In the new process using Cloud Dataflow we
- Create pipeline object and read data from GCS into PCollectionTuple object
- Create multiple outputs from data in the logs based on event types and schema using PCollectionTuple
- Apply ParDo transformations on the data to make it optimized for querying by business users (e.g. transform all dates to PST)
- Store the data back into GCS, which is our central data store but also for use from other Hadoop processes and loading to Big query.
- Loading to BigQuery is not done directly from Dataflow but we use bq load command which we found to be much more performant
The next step for us is to identify other scenarios which are a good fit to transition to Cloud Dataflow, which have similar characteristics to clickstream – high volume, unstructured data not requiring heavy correlation with other data sets.
In July 2014 we started our journey to building a new data platform that would allow us to use big data to drive business decisions. I would like to start with a quote from our 2015 Q2 earnings that was highlighted in various press outlets and share how we built a data platform that allows zulily to make decisions which were near impossible to do before.
“We compared two sets of customers from 2012 that both came in through the same channel, a display ad on a major homepage, but through two different types of ads,” [Darrell] Cavens said. “The first ad featured a globally distributed well-known shoe brand, and the second was a set of uniquely styled but unbranded women’s dresses. When we analyze customers coming in through the two different ad types, the shoe ad had more than twice the rate of customer activations on day one. But 2.5 years later, the spend from customers that came in through the women dresses ad was significantly higher than the shoe ad with the difference increasing over time.” – www.bizjournals.com
Our vision is for every decision, at every level in the organization, to be driven by data. In early 2014 we realized the data platform we had which was combination of SQL server database for data warehousing primarily for structured operational data + Hadoop cluster for unstructured data was too limiting. We started with defining core principles for our new data platform (v3).
Principles for new platform
- Break silos of unstructured and structured data: One of the key issues with our old infrastructure was we had unstructured data in a Hadoop cluster and structured in SQL server, this was extremely limiting. Almost every major decision required us to perform analytics where we had to correlate usage patterns on site(unstructured) with demand or shipment or other transactions(structured). We had to come up with a solution that allowed us to correlate these different types of datasets.
- High scale and lowest possible cost: We already had a hadoop cluster but one of the challenges was our data was growing exponentially and as the storage need grew we had to start adding nodes to the cluster which increased the cost even if the compute was not growing at same pace. In short, we were paying at rate of compute(which is way higher) at pace of growth rate of storage. We had to break this. Additionally we wanted to build a highly scalable system which would support our growth.
- High availability or extremely low recovery time with eye on cost: Simply put we wanted high availability but not pay for it.
- Enable real-time analytics: Existing data platform has no concept of real-time data processing, hence anytime users needed to analyze things in real-time they had to be given access to production operational system which in turn increases risk, this had to change.
- Self service: Our tools and systems/services need to be easy for our users to use with no involvement from data team.
- Protect the platform from users: Another challenge in existing system was lack of isolation between core data platform components and user facing services. We wanted to make in new platform individual users of the system could not bring down the entire system.
- Service(Data) level agreement(SLA): We wanted to be able to give SLA’s for our datasets to our users. This includes data accuracy and freshness. In past this was a huge issue due to design of the system and lack of tools to monitor and report SLAs.
zulily data platform (v3)
zulily data platform includes multiple tools, platforms and products. The combination is what makes it possible to solve complex problem at scale.
There are 6 key broad components in the stack:
- Centralized big data storage built on Google cloud storage (GCS). This allows us to have all our data in one place (principle 1) and share it across all systems and layers of the stack.
- Big Data Platform is primarily for batch processing of data at scale. We use Hadoop on Google compute engine for our big data processing.
- All our data is stored in GCS not HDFS, this enables us to decouple storage from compute and manage cost (principle 2).
- Additionally the hadoop cluster for us is transient as it has no data so if our cluster completely goes away we can built new one on the fly. We think about our cluster as a transient entity (principle 3).
- We have started looking at Google Dataflow for specific scenarios but more on that soon. Our goal is to make all data available for analytics anywhere from 30 minutes to few hours based on the SLA.
- Real-time data platform is a zulily built stack for high scale data collection and processing (principle 4). We now collect more than 10k events per second and it is growing really fast as we see value through our decision portal and other scenarios.
- Business data warehouse built on Google BigQuery enables us to provide highly scalable analytics service to 100’s of users in the company.
- We push all our data, structured and unstructured, real-time and batch into Google BigQuery hence breaking all silos for data (principle 1).
- It also enables us to keep our big data platform and data warehouse completely isolated (principle 6) making sure issues in one environment don’t impact the other system.
- Keeping the interface(SQL) same for analysts between old and new platforms enabled us to lower barrier for adoption (principle 5)
- We also have a high speed low latency data store that hosts part of our business data warehouse data for access through APIs.
- Data Access and Visualization component enables business users to make decisions based on information available.
- Real-time decision portal or zuPulse enables business to get insights into the business in real-time.
- Tableau is our reporting and analytics visualization tool. Business users use the tool to create reports for their daily execution. This enables our engineering team to focus on core data and high scale predictive scenarios and makes reporting self service (principle 5).
- ZATA our data access service enables us to expose all our data through an API. Any data in the data warehouse or low latency data store can be automatically exposed through ZATA. This enables various internal applications at zulily to show critical information to users including our vendors who can see their historical sales on vendor portal.
- Core data services are backbone to everything we do – from moving data to cloud, monitoring to and making sure we abide by our SLA’s etc. We continuously add new services and enhance existing services. The goal of these services is to make developers and analysts across other areas more efficient.
- zuSync is our data synchronization service which enables us to get data from any source system and move to any destination. Source or Destination can be operational databases (SQL Server, MySQL etc), a queue (RabbitMQ), Google cloud storage (GCS), Google BigQuery, services (http/tcp/udp), ftp location…
- zuflow is our workflow service built on top of Google BigQuery and allows analysts to define & schedule query chains for batch analytics. It also has ability to notify and deliver data in different formats.
- zuDataMon is our data monitoring service which allows us to make sure data is consistent across all systems (principle 7). This helps us catch issues with our services or with the data early. Today we have more than 70% of issues identified by our monitoring tool instead of users reporting them. our goal is to get this number to be in 90%+.
This is a very high level view of what our data@zulily team does. Our goal is to share more often, with a deeper dive on our various pieces of technology, in the coming weeks. Stay tuned…