In July 2014 we started our journey to building a new data platform that would allow us to use big data to drive business decisions. I would like to start with a quote from our 2015 Q2 earnings that was highlighted in various press outlets and share how we built a data platform that allows zulily to make decisions which were near impossible to do before.
“We compared two sets of customers from 2012 that both came in through the same channel, a display ad on a major homepage, but through two different types of ads,” [Darrell] Cavens said. “The first ad featured a globally distributed well-known shoe brand, and the second was a set of uniquely styled but unbranded women’s dresses. When we analyze customers coming in through the two different ad types, the shoe ad had more than twice the rate of customer activations on day one. But 2.5 years later, the spend from customers that came in through the women dresses ad was significantly higher than the shoe ad with the difference increasing over time.” – www.bizjournals.com
Our vision is for every decision, at every level in the organization, to be driven by data. In early 2014 we realized the data platform we had which was combination of SQL server database for data warehousing primarily for structured operational data + Hadoop cluster for unstructured data was too limiting. We started with defining core principles for our new data platform (v3).
Principles for new platform
- Break silos of unstructured and structured data: One of the key issues with our old infrastructure was we had unstructured data in a Hadoop cluster and structured in SQL server, this was extremely limiting. Almost every major decision required us to perform analytics where we had to correlate usage patterns on site(unstructured) with demand or shipment or other transactions(structured). We had to come up with a solution that allowed us to correlate these different types of datasets.
- High scale and lowest possible cost: We already had a hadoop cluster but one of the challenges was our data was growing exponentially and as the storage need grew we had to start adding nodes to the cluster which increased the cost even if the compute was not growing at same pace. In short, we were paying at rate of compute(which is way higher) at pace of growth rate of storage. We had to break this. Additionally we wanted to build a highly scalable system which would support our growth.
- High availability or extremely low recovery time with eye on cost: Simply put we wanted high availability but not pay for it.
- Enable real-time analytics: Existing data platform has no concept of real-time data processing, hence anytime users needed to analyze things in real-time they had to be given access to production operational system which in turn increases risk, this had to change.
- Self service: Our tools and systems/services need to be easy for our users to use with no involvement from data team.
- Protect the platform from users: Another challenge in existing system was lack of isolation between core data platform components and user facing services. We wanted to make in new platform individual users of the system could not bring down the entire system.
- Service(Data) level agreement(SLA): We wanted to be able to give SLA’s for our datasets to our users. This includes data accuracy and freshness. In past this was a huge issue due to design of the system and lack of tools to monitor and report SLAs.
zulily data platform (v3)
There are 6 key broad components in the stack:
- Centralized big data storage built on Google cloud storage (GCS). This allows us to have all our data in one place (principle 1) and share it across all systems and layers of the stack.
- Big Data Platform is primarily for batch processing of data at scale. We use Hadoop on Google compute engine for our big data processing.
- All our data is stored in GCS not HDFS, this enables us to decouple storage from compute and manage cost (principle 2).
- Additionally the hadoop cluster for us is transient as it has no data so if our cluster completely goes away we can built new one on the fly. We think about our cluster as a transient entity (principle 3).
- We have started looking at Google Dataflow for specific scenarios but more on that soon. Our goal is to make all data available for analytics anywhere from 30 minutes to few hours based on the SLA.
- Real-time data platform is a zulily built stack for high scale data collection and processing (principle 4). We now collect more than 10k events per second and it is growing really fast as we see value through our decision portal and other scenarios.
- Business data warehouse built on Google BigQuery enables us to provide highly scalable analytics service to 100’s of users in the company.
- We push all our data, structured and unstructured, real-time and batch into Google BigQuery hence breaking all silos for data (principle 1).
- It also enables us to keep our big data platform and data warehouse completely isolated (principle 6) making sure issues in one environment don’t impact the other system.
- Keeping the interface(SQL) same for analysts between old and new platforms enabled us to lower barrier for adoption (principle 5)
- We also have a high speed low latency data store that hosts part of our business data warehouse data for access through APIs.
- Data Access and Visualization component enables business users to make decisions based on information available.
- Real-time decision portal or zuPulse enables business to get insights into the business in real-time.
- Tableau is our reporting and analytics visualization tool. Business users use the tool to create reports for their daily execution. This enables our engineering team to focus on core data and high scale predictive scenarios and makes reporting self service (principle 5).
- ZATA our data access service enables us to expose all our data through an API. Any data in the data warehouse or low latency data store can be automatically exposed through ZATA. This enables various internal applications at zulily to show critical information to users including our vendors who can see their historical sales on vendor portal.
- Core data services are backbone to everything we do – from moving data to cloud, monitoring to and making sure we abide by our SLA’s etc. We continuously add new services and enhance existing services. The goal of these services is to make developers and analysts across other areas more efficient.
- zuSync is our data synchronization service which enables us to get data from any source system and move to any destination. Source or Destination can be operational databases (SQL Server, MySQL etc), a queue (RabbitMQ), Google cloud storage (GCS), Google BigQuery, services (http/tcp/udp), ftp location…
- zuflow is our workflow service built on top of Google BigQuery and allows analysts to define & schedule query chains for batch analytics. It also has ability to notify and deliver data in different formats.
- zuDataMon is our data monitoring service which allows us to make sure data is consistent across all systems (principle 7). This helps us catch issues with our services or with the data early. Today we have more than 70% of issues identified by our monitoring tool instead of users reporting them. our goal is to get this number to be in 90%+.
This is a very high level view of what our data@zulily team does. Our goal is to share more often, with a deeper dive on our various pieces of technology, in the coming weeks. Stay tuned…