Calculating Ad Performance Metrics in Real Time

Authors: Sergey Podlazov, Rahul Srivastava

zulily is a flash sales company.  We post a product on the site, and puff… it’s gone in 72 hours.  Online ads for those products come and go just as fast, which doesn’t leave us much time to manually evaluate the performance of the ads and take corrective actions if needed.  To optimize our ad spend, we need to know in real-time how each ad is doing, and this is exactly what we engineered.

While we track multiple metrics to measure impact of an ad, I am going to focus on one that provides a good representation of the system architecture.  This is an engineering blog after all!

The metric in question is Cost per Total Activation, or CpTA in short.  The formula for the metric is this:  divide the total cost of the ad by the number of customer activations.  We call the numerator in this formula “spend” and refer to the denominator as an “activation”.  For example, if an ad costs zulily $100 between midnight and 15:45 PST on January 31 and results in 20 activations, the CpTA for this ad as of 15:45 PST is $100/20 = $5.

Here’s how zulily collects this metric in real-time.  For the sake of simplicity, I will skip archiving processes that are sprinkled on top the architecture below.

Screen Shot 2018-01-30 at 6.22.22 PM

The source of the spend for the metric is an advertiser API, e.g. Facebook.  We’ve implemented a Spend Producer (in reference to the Producer-Consumer model) that queries the API every 15 minutes for live ads and pushes the spend into a MongoDB.  Each spend record has a tracking code that uniquely identifies the ad.

The source for the activations is a Kafka stream of purchase orders that customers place with zulily.  We consume these orders and throw them into an AWS Kinesis stream.  This gives us the ability to process and archive the orders without causing an extra strain on Kafka.  It’s important to note that relevant orders also have the ad’s tracking code, just like the spend.  That’s the link that glues spend and activations together.

The Activation Evaluator application examines each purchase and determines if the purchase is an activation.  To do that, it looks up the previous purchase in a MongoDB collection for the customer Id on the purchase order.  If the most recent transaction is non-existent or older than X days, the purchase is an activation.  The Activation Evaluator updates the customer record with the date of the new purchase.  To make sure that we don’t drop any data if the Activation Evaluator runs into issues, we don’t move the checkpoint in the Kinesis stream until the write to Mongo is confirmed.

The Activation Evaluator sends evaluated purchases into another Kinesis stream.  Chaining up Kinesis stream is a pretty common pattern for AWS applications, as it allows for the separation of concern and makes the whole system more resilient to failure of individual components.

The Activation Calculator reads the evaluated purchases from the second Kinesis stream and captures them in Mongo.  We index the data by tracking code and timestamp, and voila, a simple count() will return the number of activations for a specified period.

The last step in the process is to take the Spend and divide it by the activations.  Done.

With this architecture, zulily measures a key advertising performance metric every 15 minutes and uses it to pause poorly-performing ads.  The metric also serves as an input for various Machine Learning models, but more on those in a future blog post… Stay tuned!!




From Cart Pick to Put Walls

A critical part of any e-commerce company is getting product to its customers. While many of the customer experience discussions that you hear about companies focus on their website and apps or customer service and support, we often forget to think about those companies delivering their customers’ products when the company said they would. This part of the promise made (or implied) for customers is critical for building trust and providing a great end-to-end customer experience. Most large e-commerce companies operate — or pay someone else to operate — one or more “fulfillment centers”, which is where products are stored and combined with other items that need to be sent to the customer. zulily’s unique business model means we work with both big brands and boutique, smaller vendors with a variety of different capabilities, and so our products are inspected for quality, frequently bagged to keep clothing from getting dirty and often need barcoding (as many smaller vendors may not have them). The quality of zulily’s fulfillment processes drives our ability to deliver on our promises to customers and zulily’s software drives those fulfillment processes.

All fulfillment center systems start with a few basic needs: be able to receive products in from vendors, store products in a way that they can be later retrieved, and ship the product to customers. “Shipping product out,” also known as “outbound” is the most expensive operation inside the fulfillment center, so we have invested heavily in making it efficient. The problem seems simple at first glance. You gather product for customer shipment, put products in boxes, put labels on the boxes, and hand the box to UPS or USPS, etc.. The trick is making this process as efficient as possible. When zulily first started, each associate would walk the length of the warehouse picking each item and sorting it into 1 of 20 shoebox sized bins they had in their cart with each bin representing a customer shipment. Once all of the shipments had been picked, the picker delivers the completed cart to a packing station. The job of collecting products to be shipped out is known as “picking” and when our warehouse was fairly small, this strategy of one person picking the whole order worked fine. As the company has grown, our warehouses did too – some of our buildings have a million square feet of storage spread over multiple floors. Now these pickers were walking quite a long way in order for just 20 shipments. We could have just increased the size or quantity of the carts, but this is a solution that costs more as the company grows.  In addition, concerns about safety related to pulling more or larger carts and the complexities of taking one cart to multiple floors of a building make this idea impractical, to say the least.


A pick cart. Each of the 20 slots on the cart represents a single customer shipment. The picker, guided by an app on a mobile device, walks the storage area until they’ve picked all of the items for the 20 shipments. We call this process “pick to shipment” because no further sorting is necessary to make sure each shipment is fully assembled.

We needed a solution that would allow pickers to spend less time walking between bins and more time picking items from those bins. We have developed a solution such that the picking software tries to keep a given picker within a zone of 10-20 storage aisles and invested in a conveyor system to carry the picked items out of the picking locations. The picker focuses on picking everything that can be picked within their zone and there’s no need for a picker to leave a zone unless they are needed in another zone. The biggest difference from the old model is that the picker is no longer assembling complete shipments. If you ordered a pair of shoes and a t-shirt from zulily, it’s unlikely that those two items would be found in the same zone due to storage considerations. Instead of an individual picker picking for 20 orders, we now have one picker picking for many orders at the same time, but staying within a certain physical area of the building. This is considerably more efficient for the pickers, but it means that we now needed a solution to assemble these zone picks into customer shipments.


The picker picks for multiple shipments into a single container. Because the sorting into customer shipments happens later, this solution is called “pick to sort”.

In order to take the efficiently picked items and sort them into the right order to be sent to our customers, we have implemented a sorting solution that uses a physical solution we call a “put wall”. A put wall looks like a large shelf with no back divided into sections (called “slots”), each measuring about one foot cubed. Working at these put walls is an employee (called a “putter”) whose job is to take products from the pick totes and sort them into a slot in that put wall. Each slot in the wall is assigned to a shipment. Once all the products needed for a given shipment have been put into the slot, an indicator light on the other side of the wall lets a packer know that the shipment is ready to be placed into a box and shipped out to our customer. In larger warehouses, having just one put wall is not practical because putters would end up having to move too much distance and all the efficiency gained in packing would be lost on the putting side, so defining an appropriate size for each put wall is critical. This creates an interesting technical challenge as we have to make sure that the right products all end up in the put wall at the right time. Our picking system has to make sure that once we start picking a shipment to a wall that all the other products for that shipment also go to that wall as quickly as possible. This challenge is made more difficult by the physical capacity of the put walls. We need to limit how much is going to the wall to avoid a situation where there is no slot for a new shipment to go. We also have to make sure that each of the walls have enough work so we don’t have idle associates. When selecting shipments to be picked, we must include shipments that are due out today, but also include future work to make the operation efficient.  To do this, we have pickers rotate picking against different put walls to make sure that they get an even spread of work. A simple round-robin rotation would be naive, since throughput of the put walls is determined by humans with a wide range of different work rates. In order to solve this problem, we turn to control theory to help us select a put wall for a picker based on many of the above requirements. We also need to make sure that when the first product shows up for a shipment there is room in the wall for it.


As totes full of picked items are conveyed to the put wall, a putter scans each item and puts them into a slot representing a customer shipment. He is guided by both his mobile device and flashing lights on the put wall which guide him to the correct slot.

As we scaled up our operation, we initially saw that adding more pickers and put walls was not providing as much gain in throughput as we expected. In analyzing the data from the system, we determined that one of the problems was how we were selecting our put walls. Our initial implementation would select a wall for a shipment based on that wall having enough capacity and need. The problem with this approach is that we didn’t consider the makeup of each of the shipments. If you imagine a shipment that is composed of multiple products spread throughout the warehouse, you have situations where a picker has to walk through their zone N times, where N is the number of put walls we are using at any given time. As we turn on more and more put walls, that picker will have to walk through the zone that many more times. We realized that if we can create some affinity between zones and walls, we can limit the amount of put walls that a picker needs to pick and make them more efficient. We did this by assigning put walls a set of zones and try to make the vast majority of shipments for that put wall come from those zones. While we need to sometimes have larger sets that normal to cover a given shipment, we can overall significantly improve pick performance and increase the overall throughput for putters and packers.

And that’s really just the beginning of the story for a small part of our fulfillment center software suite. As the business grows, we continue to find new ways to further optimize these processes to make better use of our employees’ time and save literally millions of dollars while also increasing our total capacity using the same buildings and people! This is true of most of the software in the fulfillment space – improved algorithms are not just a fun and challenging part of the job, but also critical to the long-term success of our business.
Continue reading

Simulating Decisions to Improve Them

One of the jobs of the Data Science team is to help zulily make better decisions through data. One way that manifests itself is via experimentation. Like most ecommerce sites, zulily continuously runs experiments to improve the customer experience. Our team’s contribution is to think about the planning and analysis of those tests to make sure that when the results are read they are trustworthy and that ultimately the right decision is made.

Coming in hot

As a running example throughout this post, consider a landing page experiment.  At zulily, we have several landing pages that are often the first thing a visitor sees after they click an advertisement on a third-party site. For example, if a person was searching for pet-related products, and they clicked on one of zulily’s ads, they might land here.  Note: while that landing page is real, all the underlying data in this post is randomly generated.

The experiment is to modify the landing page in some way to see if conversion rate is improved.  Hopefully data has been gathered to motivate the experiment but, please, just take this at face-value.

Any single landing page is not hugely critical, but in aggregate they’re important for zulily, and small improvements in conversion rates or other metrics can have a large impact on the bottom line. In this example, the outcome metric (what is trying to be improved) is conversion rate, which is simply the number of conversion over the total number of visitors.

Thinking with the End in Mind

Planning an experiment consists of many things, but often the most opaque part is the implications associated with power. The simple definition of power: given some effect of the treatment, how likely will that effect be detectable. The implication here is that the more confident one wants to be in their ability to detect the change, the longer the test needs to run. How long the test needs to run has a direct bearing on the number of tests a company can run as a whole and which tests should be given priority… unless you don’t care about conflating treatments, but then you have bigger problems :).

Power analysis, however, is a challenge. For anything beyond a simple AB test, a lot needs to be thought through to determine the appropriate test. Therefore, it is often easier to think about the data, then work backwards through the analysis, and then the power.

Ultimately power analysis boils down to the simple question: for how long does a test need to run?

To illustrate this, consider the example experiment, where the underlying conversion rate for landing page A (the control) is 10%, and the expected conversion rate of the treatment is 10.5%. While these are the underlying conversion rates, due to randomness the realized conversion rate will likely be different, but hopefully close.

Imagine that each page is a coin, and each time a customer lands on the page the coin is flipped. Even though it’s known a priori that the underlying conversion rate for page A is 10%, if the coin is flipped 1000 times, it’s unlikely that it will be “heads” 100 times.  If you were to run two tests, you would get two different results, even though all variables are the same.

The rest of the post walks through the analysis of a single experiment, then describes how to expand that single experiment analysis into an analysis of the decision making process, and finally discusses a couple examples of complications that often arise in testing and how they can be incorporated into the analysis of the decision making.

A Single Experiment

For instance: flipping the “landing page” coin, so to speak, 1000 times for each page, A and B. In this one experiment, the realized conversion rates (for fake data) are in the bar chart below.


That plot sure looks convincing, but just looking at the plot is not a sufficient way to analyze a test. Think back to the coin flipping example; since the difference in the underlying “heads” rate was only 0.5%, or 5 heads per 1000 flips, it wouldn’t be too surprising if A happened to have more heads than B in any given 1000 flips.

The good news is that statistical tools exist to help make it possible to understand how likely the difference observed was truly due to an underlying difference in rates, due to randomness.

The data collected would look something like:


where Treatment is the landing page treatment, and Converted is 0 for a visit without a conversion and 1 for a visit with a conversion.  For an experiment like this, that has a binary outcome, the statistical tool to choose is logistic regression.

Now we get to see some code!

Assuming the table from above is represented by the “visits” dataframe, the model is very simple to fit in statsmodels.

The output:


This is a lot information, but the decision will likely be based on only one number: in the second table, the interception between “C(Treatment)[T.B]” and “P>|z|”. This is one minus the probability that the difference between the two conversion rates is actually different, or the significance level. The convention is that if that value is less than .05, the difference is significant. Another number worth mentioning is the coefficient of the treatment. This is how much change is estimated. It’s important, because even if the outcome was significant it’s possible to have been a significant negative coefficient, and then the decision is worse than just not accepting a better page, since we’ll accept a worse landing page.

In this case the significance level is greater than .05, so the decision would be that the difference observed is not indicative of an actual difference in conversion rates. This is clearly the wrong decision. The conversion rates were specified as being different, but that difference cannot be statistically detected.

This is ultimately the challenge with power and sample sizes. Had the experiment been run again, with a larger sample, it is possible that we would have detected the change and made the correct decision.  Unfortunately, the planning was done incorrectly and only 1000 samples were taken.

Always Be Sampling

Although the wrong decision was made in the last experiment, we want to improve our decision making.  It is possible to analyze our analysis through simulation. It is a matter of replicating — many times over — the analysis and decision process from earlier. Then it is possible to find out how often the correct decision would be made given the actual difference in conversion rate.

Put another way, the task is to:

  1. Generate a random dataset set for the treatment and control group based on the expected conversion rates.
  2. Fit the model that would have been from the example above.
  3. Measure the outcome based on the decision criteria; here it’ll just be a significant p-value < 0.05.

And now be prepared for the most challenging part: the simulation. These three steps are going to be wrapped in a for loop, and the outcome is collected in an array. Here’s a simple example in python of how that could be carried out.

Running that experiment 500 times, with a sample size of 1000, would yield a correct decision roughly 4.2% of the time.  Ugh.

This is roughly the power the of the experiment at a sample size of 1000.  If we did this experiment 500 times, we would rarely make the correct decision.  To correct this, we need to change the experiment plan to generate more trials.

To get a sense for the power at different sample sizes we choose several possible sample sizes, then run the above simulation for those samples. Now there are two for loops: one for to iterate through the sample size, and one to carry out the analysis above.

Here is the outcome of the decisions at various sample sizes.  The “Power” column is the proportion of time a correct decision was made at the specified sample size.


Not until 10^4.5 samples — roughly 31,000 — does the probability of making the correct decision become greater than 50%.  It is now a matter of making the business decision about how necessary it is to detect the effect.  Typically it is around 80%, in the same way that the significance level is normally around 5%… convention.  It would be easy to repeat this test for several intermediate sample sizes, between 10^4.5 and 10^5, to determine a sample size that has the power the business is comfortable with.

Uncertainty in Initial Conversion Rate

The outcome of the experiment was given a lot of criticism, but the underlying conversion rate was (more or less) taken for granted. The problem is there’s probably a lot of error in the estimated effect of the treatment before the experiment, and some error in the estimated effect of the control, since the control is based on past performance and the treatment is based on a combination of analysis and conjecture.

For example, say we had historical data that indicated that 1,000 out of 10,000 people had converted for the control thus far, and we ran a similar test to the control recently, so we have some confidence that 105 out of 1,000 people would convert.

If that was the prior information for each page the distribution for conversion rate for each landing page over 1,000 experiments could look like:


Even though it appears that landing page B does have a higher conversion rate on average, its distribution around that average is much wider.  To factor in that uncertainty, we can rerun that model with but instead of assuming a fixed conversion rate, we can sample from the distribution of the conversion rate before each simulation. Here’s the outcome, similar to above, of the proportion of times we’d make the correct decision.

Sadly our power was destroyed by the randomness associated with the uncertainty between landing pages.  Here’s the same power by sample size table as above.  For example, at 10^5.0 it is likely that conversion rate for landing page B was less than landing page A.


An alternative route in a situation like this is the use of a beta-binomial model to continue to incorporate additional data to the initial conversion rates.

More Complicated Experiments

The initial example was a very simple test, but more complex tests are often useful.  With more complex experiments, the framework for planning needs to expand to facilitate better decision making.

Consider a similar example to the original one with an additional complication. Since the page is a landing page, the user had to come from somewhere.  These sources of traffic are also sources of variation. Just like how any realized experiment could vary from the expectation, any given source’s underlying conversion rate could also could also vary from the expectation. In the face of that uncertainty, it would be a good idea to run the test on multiple ads.

To simplify our assumptions, consider that the expected change in conversion rate is still 0.5%, but across three ads the conversion rate varies individually by -0.01%, 0.00% and +0.01% due to the individual ad-level characteristics.

For example, this could be outcome of one possible experiment with two landing pages and three ads.


Thankfully statsmodels has a consistent API so just a few things need to change to fit this model:

  • Use gee instead of logit. This is a general estimating equation.  It enables a correlation within groups for a GLM to be fit, or, for these purposes, a logit regression with the group level variances taken into consideration.
  • Pass the groups via the “groups” argument.
  • Specify the family of the GLM; here it’s binomial with a logit link function (the default argument).

Those changes would like this:


The decision criteria here is similar to the first case, so we cannot say anything about the effect of the landing page, and likely this test would not roll out.  Now that the basic model is constructed, we follow the same process to estimate how much power the experiment would have at various levels of sample size.


Experiments are challenging to execute well, even with these additional tools.  The groups that have sufficient size to necessitate testing are normally sufficiently large and complex that wrong decisions can be made.  Through simulation and thinking about the decision-making process, it is possible to quantify how often a wrong decision could occur, its impact, and how to best mitigate the problem.

(By the way, zulily is actively looking for someone to make experimentation better, so if you feel that you qualify, please apply!)

Optimizing memory consumption of Radix Trees in Java

On the Relevancy team at zulily, we are often required to load a large number of large strings into memory. This often causes memory issues. After looking at multiple ways to reduce memory pressure, we settled on Radix Trees to store these strings. Radix Trees provide very fast prefix searching and are great for auto-complete services and similar uses. This post focuses entirely on memory consumption.

What Is A Radix Tree?

Radix Trees take sequences of data and organize them in a tree structure. Strings with common prefixes end up sharing nodes toward the top of this structure, which is how memory savings is realized. Consider the following example, where we store “antidisestablishmentarian” and “antidisestablishmentarianism” in a Radix Tree:

+- antidisestablishmentarian (node 1)
                           +- ism (node 2)

Two strings, totaling 53 characters, can be stored as two nodes in a tree. The first node stores the common prefix (25 characters) between it and its children. The second stores the rest (3 characters). In terms of character data stored, the Radix Tree stores the same information in approximately 53% of the space (not counting the additional overhead introduced by the tree structure itself).

If you add the string “antibacterial” to the tree, you need to break apart node 1 and shuffle things around. You end with:

+- anti                             (node 3)
      |- disestablishmentarian      (node 4)
      |                      +- ism (node 2)
      +- bacterial                  (node 5)

Real-World Performance

We run a lot of software in the JVM, where memory performance can be tricky to measure. In order to validate our Radix Tree implementation and measure the impact, I pumped a bunch of pseudo-realistic data into various collections and captured memory snapshots with YourKit Java Profiler.

Input Data

It didn’t take long to hack together some real-looking data in Ruby with Faker. I created four input files of approximately 1,000,000 strings that included a random selection of 12-digit numbers, bitcoin addresses, email addresses and ISBNs.

sreed:src/ $ head zulily-oss/radix-tree/12-digit-numbers.txt

sreed:src/ $ head zulily-oss/radix-tree/bitcoins.txt

sreed:src/ $ head zulily-oss/radix-tree/emails.txt

sreed:src/ $ head zulily-oss/radix-tree/isbns.txt

Measuring Memory with YourKit

YourKit provides a measurement of “retained size” in its memory snapshots which is helpful when trying to understand how your code is impacting the heap. What isn’t necessarily intuitive about it, though, is what objects it excludes from this “retained size” measurement. Their documentation is very helpful here: only object references that are exclusively held by the object you’re measuring will be included. Instead of telling you “this is how much memory usage your object imposes on the VM,” retained size instead tells you “this is how much memory the VM would be able to garbage-collect if it were gone.” This is a subtle, but very real, difference if you wish to optimize memory consumption.

Thus, my memory testing needed to ensure that each collection held complete copies of the objects I wished to measure. In this case, each string key needed to be duplicated (I decided to intern and share every value I stored in order to measure only the memory gains from different key storage techniques).

// Results in shared reference, and inaccurate measurement
map1.put(key, value);
map2.put(key, value);

// Results in shared char[] reference, and better but
// still inaccurate measurement
map1.put(new String(key), value);
map2.put(new String(key), value);

// Results in complete copy of keys, and accurate measurement
map1.put(new String(key.toCharArray()), value);
map2.put(new String(key.toCharArray()), value);

Collections Tested

I tested our own Radix Tree implementation, ConcurrentRadixTree from, a string array, Guava‘s ImmutableMap and Java’s HashMap, TreeMap, Hashtable and LinkedHashMap. Each collection stored the same values for each key.

Both zulily’s Radix Tree and the ConcurrentRadixTree from concurrent-trees were configured to store string data as UTF-8-encoded byte arrays.

ConcurrentRadixTree was included simply to ensure that our own version (to be open-sourced soon) was worth the effort. The others were measured simply to highlight the benefits of Radix Tree storage for different input types. Each collection has its own merits and in most ways they are all superior to the Radix Tree for storage (put/get performance, concurrency and other features).



First of all, Guava’s ImmutableMap is pretty good. It stored the same key and value data as java.util.HashMap in 92-95% of the space. The Radix Tree breaks keys into byte array sequences and stores them in a tree structure based on common prefixes. This resulted in a best case of 62% the size of the ImmutableMap for bitcoin addresses (strings which have many common prefixes) and a worst case 88% for random 12-digit numbers. We see that the memory used by this data structure is largely dependent on the type of data put into it. Large strings with many large common prefixes are stored very efficiently in a narrow tree structure. Unique strings create a lot of branches in the underlying tree, making it very wide and adding a lot of overhead.

Converting Java Strings to byte arrays accounts for most of the memory savings, but not all. Byte array storage was anywhere from 90% (bitcoin addresses) to 99% (ISBNs) in the tests I ran.

For us, storing byte-encoded representations of string data in a radix tree allowed us to reclaim valuable memory in our services. However it wasn’t until validating the implementation in an accurate manner with realistic data and trustworthy tools that we rested easy knowing we had set out what we wished to accomplish.

Experience optimization at zulily

godinbanditsExperimentation is the name of the game for most top tech companies, and it’s no different here at zulily. Because new zulily events launch every day, traditional experiments can be cumbersome for some applications. We need to be able to move quickly, so we’ve built a contextual multi-armed bandit system that learns in real time to help us deliver the best experience to each zulily member.

As zulily has grown over the past four and a half years, the number of new events and products launching each day has increased at a tremendous pace. This is great for our members, but it brings with it the challenge of ensuring that each member’s experience is as personalized as possible. My team, the Data Science team, and Relevancy, with whom we work closely, are tasked with seamlessly optimizing and customizing that experience. In order to do so, we run experiments — a lot of them. Even the most minor changes to the site usually have to prove their mettle by beating a control group in a well-designed, sufficiently-powered experiment.

Continue reading