# The “Power” of A/B Testing

## Introduction

Customers around the world flock to Zulily on a daily basis to discover and shop products that are curated specially for them. To serve our customers better, we are constantly innovating to improve the customer experience. To help us decide between different customer experiences, we primarily use split testing or A/B testing, the industry standard for making scientific decisions about products. In a previous post, we talked about the fundamentals of A/B testing and described our home-grown solution. In this article, we are going to explore one choice behind designing sound A/B tests: we will calculate the number of people that need to be included in your test.

To briefly recap the idea of A/B testing, in an A/B test we compare the performance of two variants of the same experience. For example, in a test we might compare two ways of ordering the navigation tabs on the website (see Figure 1). In the original version (variant A or control), the tabs are ordered such that events launched today (New Today) appear before events that are ending soon (Ends Soon). In the new version of the website (variant B or treatment), Ends Soon appears before New Today. The test is set up such that, for a pre-defined period of time, customers visiting the website are shown either variant A or variant B. Then, using statistical methods, we measure the incremental improvement in the experience of customers who were shown variant B over those who were shown variant A. Finally, if there is a statistically significant improvement, we might decide to change the order of the tabs on the website.

Since Zulily relies heavily on A/B testing to make business decisions, we are careful about avoiding common pitfalls of the method. The length of an A/B test ties in strongly with the success of the business for two reasons:

• If A/B tests run for more days than necessary, the pace of innovation at the company will slow down.
• If A/B tests run for fewer days than required to achieve statistically sound results, the results will be misguided.

To account for these issues, before making decisions based on the A/B test results, we run what’s called a ‘power analysis.’ A power analysis, the focus of this article, checks that enough people have been captured in the test to confirm or deny whether variant B was an improvement over variant A. We also make sure that the test is run long enough so that short-term business cycles are accounted for. The calculation for the number of people needed in a test is a function of three things: effect size, significance level ($\alpha$), and power ($1-\beta$).

To consult the statistician after an experiment is finished is often merely to ask [them] to conduct a post mortem examination. [They] can perhaps say what the experiment died of.

– Ronald Fisher, Statistician

## Common Terms

Before we get into the mechanics of that calculation, let us familiarize ourselves with some common statistical terms. In an A/B test, we are trying to estimate how all the customers behave (population) by measuring the behavior of a subset of customers (sample). For this, we ensure that our sample is representative of our entire customer base. During the test, we measure the behavior of the customers in our sample. Measurements might include the number of items purchased, the time spent on the website, or the money spent on the website by each customer.

For example, to test whether variant B outperformed variant A, we might want to know if the customers exposed to variant B spent more money than the customers exposed to variant A. In this test, our default position is that variant B made no difference to the behavior of the customers when compared to variant A (null hypothesis). As more customers are exposed to these variants and start purchasing products, we collect measurements on more customers, which allows us to either reject this null hypothesis or fail to reject it. The difference between the behavior of customers exposed to variant A and variant B is known as the effect size.

Further, there are a number of parameters set under the hood. Before starting the test, we assign a significance level ($\alpha$) of 0.05, which means that we might reject the null hypothesis when it is actually true in 5% of the cases (Type I error rate). Further, we assign a power ($1-\beta$) of 0.80, which means that when the null hypothesis does not hold, that is, when variant B changes the behavior of customers, the test will allow us to reject the null hypothesis 80% of the time. Importantly, these parameters need to be set at the beginning of the test and upheld for the duration of the test to avoid p-hacking, which leads to misguided results.

## Estimating the Number of Customers for the A/B Test

For this exercise, let us revisit the previous example where we show customers two versions of the Zulily navigation bar. Let us say we want to see if this change makes customers less or more engaged on Zulily’s website. One metric that can capture this is the proportion of customers who revisit the website when shown variant B versus variant A. For this exercise, let us say that we are interested in a boost in this metric of at least 1% (effect size). If we see at least this effect size, we might implement variant B. The question is, how many customers should be exposed to each variant to allow us to confirm that this change of 1% exists?

Starting off, we define some parameters. First, we define the significance level at 0.05. Second, from the central limit theorem, we assume that the average money that a group of customers spend on the website is normally distributed. Third, we direct 50% of the customers visiting the site to variant A and 50% to variant B. These last two points greatly simplify the math behind the calculation. Now, we can estimate the number of people that need to be exposed to each variant. The standard two-sample estimate that matches these assumptions is:

$$n = \frac{2\sigma^2\left(z_{1-\frac{\alpha}{2}} + z_{1-\beta}\right)^2}{\delta^2}$$

where $\sigma$ is the standard deviation of the population, $\delta$ is the change we expect to see, and $z_{1-\frac{\alpha}{2}}$ and $z_{1-\beta}$ are quantile values calculated from a normal distribution. For the case of the parameters defined above, significance level of 0.05 and power of 0.80, and if we wanted to detect a 1% change in the proportion of people revisiting the website, our formula would simplify to:

$$n \approx \frac{2\sigma^2\,(1.96 + 0.84)^2}{(0.01)^2} \approx \frac{15.7\,\sigma^2}{(0.01)^2}$$

This formula gives us the number of people that need to be exposed to one variant. Finally, since the customers were split evenly between variant A and variant B, we would need twice that number of people in the entire test. This estimate can change significantly if any of the parameters change. For example, to detect a larger difference at this significance level, we would need much smaller samples. Further, if the observations are not normally distributed, then we would need a more complicated approach.
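To make the calculation concrete, here is a minimal Python sketch. It assumes the standard two-sample formula $n = 2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \delta^2$ and, purely for illustration, a hypothetical baseline revisit rate of 30%, which is not a real Zulily figure. The function name `sample_size_per_variant` is ours, not part of any library.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Customers needed in EACH variant for a two-sided test with a 50/50 split."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # about 0.84 for power = 0.80
    return ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)


# Hypothetical baseline: 30% of customers revisit, so sigma^2 = p * (1 - p).
baseline_rate = 0.30
sigma = (baseline_rate * (1 - baseline_rate)) ** 0.5

n_small = sample_size_per_variant(sigma, delta=0.01)  # detect a 1% change
n_large = sample_size_per_variant(sigma, delta=0.02)  # detect a 2% change
print(n_small, 2 * n_small)  # per variant, and for the whole test
print(n_large)
```

Because $\delta$ is squared in the denominator, doubling the effect we want to detect cuts the required sample to roughly a quarter, which is why larger effects need much smaller tests.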

## Benefits of this calculation

In short, getting an estimate of the number of customers needed allows us to design our experiments better. We suggest conducting a power analysis both before and after starting a test for several reasons:

• Before starting the test – This gives us an estimate of how long our test should be run to detect the effect that we are anticipating. Ideally, this is done once to design the experiment and the results are tallied when the requisite number of people are exposed to both variants. However, the mean and standard deviation used in the calculation before starting the test are approximations to the actual values that we might see during the test. Thus, these a priori estimates might be off.
• After starting the test – As the test progresses, the mean and standard deviation converge to values representative of the current sample which allows us to get more accurate estimates of the sample size. This is especially useful in cases where the new experience introduces unexpected changes in the behavior of the customers leading to significantly different mean and standard deviation values than those estimated earlier.

## Conclusion

At Zulily, we strive to make well-informed choices for our customers by listening to their voice through our A/B testing platform, among other channels, and ensuring that we are constantly serving the needs of the customers. While obtaining an accurate estimate of the number of people for the test is challenging, we hold it central to the process. Most people agree that the benefits of a well-designed, statistically sound A/B testing system far outweigh the benefits of obtaining quick but misdirected numbers. Therefore, we aim for a high level of scientific rigor in our tests.

I would like to thank my colleagues in the data science team, Demitri Plessas and Pamela Moriarty, and my manager, Paul Sheets, for taking time to review this article. This article is possible due to the excellent work by the entire data science team of maintaining the A/B testing platform, and ensuring that experiments at Zulily are well-designed.


# Sampling keys in a Redis cluster

We love Redis here at zulily. We store hundreds of millions of keys across many Redis instances, and we built our own internal distributed cache on top of Redis which powers the shopping experience for zulily customers.

One challenge when running a large, distributed cache using Redis (or many other key/value stores for that matter) is the opaque nature of the key spaces. It can be difficult to determine the overall composition of your Redis dataset, since most Redis commands operate on a single key. This is especially true when multiple codebases or teams use the same Redis instance(s), or when sharding your dataset over a large number of Redis instances.

Today, we’re open sourcing a Go package that we wrote to help with that task: reckon.

reckon enables us to periodically sample random keys from Redis instances across our fleet, aggregate statistics about the data contained in them — and then produce basic reports and metrics.

While there are some existing solutions for sampling a Redis key space, the reckon package has a few advantages:

Results from reckon are returned in data structures, not just printed to stdout or a file. This is what allows a user of reckon to sample data across a cluster of Redis instances and merge the results to get an overall picture of the keyspaces. We include some example code to do just that.

## Arbitrary aggregation based on key and Redis data type

reckon also allows you to define arbitrary buckets based on the name of the sampled key and/or the Redis data type (hash, set, list, etc.). During sampling, reckon compiles statistics about the various Redis data types, and aggregates those statistics according to the buckets you defined.

Any type that implements the Aggregator interface can instruct reckon about how to group the Redis keys that it samples. This is best illustrated with some simple examples:

To aggregate only Redis sets whose keys start with the letter a:


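```go
// setsThatStartWithA puts a sampled key into one bucket only if it is a
// Redis set whose key starts with "a"; all other keys get no bucket.
func setsThatStartWithA(key string, valueType reckon.ValueType) []string {
	if strings.HasPrefix(key, "a") && valueType == reckon.TypeSet {
		return []string{"setsThatStartWithA"}
	}
	return []string{}
}
```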


To aggregate sampled keys of any Redis data type that are longer than 80 characters:


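```go
// longKeys buckets any sampled key longer than 80 characters,
// regardless of its Redis data type.
func longKeys(key string, valueType reckon.ValueType) []string {
	if len(key) > 80 {
		return []string{"long-keys"}
	}
	return []string{}
}
```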


## HTML and plain-text reports

When you’re done sampling, aggregating and/or combining the results produced by reckon you can easily produce a report of the findings in either plain-text or static HTML. An example HTML report is shown below:

a sample report showing key/value size distributions

The report shows the number of keys sampled, along with some example keys and elements of those keys (the number of example keys/elements is configurable). Additionally, a distribution of the sizes of both the keys and elements is shown, in both standard and “power-of-two” form. The power-of-two form shows a more concise view of the distribution, using a concept borrowed from the original Redis sampler: each row shows a number p, along with the number of keys/elements whose size is <= p and > p/2.

For instance, using the example report shown above, you can see that:

• 68% of the keys sampled had key lengths between 8 and 16 characters
• 89.69% of the sets sampled had between 16 and 32 elements
• the mean number of elements in the sampled sets is 19.7
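The power-of-two grouping is easy to reproduce outside of reckon. The sketch below is in Python rather than Go and is not reckon’s own code; the function names and the sample key lengths are made up for illustration. It assigns each size to the smallest power of two p such that p/2 < size <= p.

```python
from collections import Counter


def power_of_two_bucket(size):
    """Return the smallest power of two p with p/2 < size <= p (for size >= 1)."""
    if size < 1:
        return 0  # sketch choice: empty values get their own bucket
    return 1 << (size - 1).bit_length()


def power_of_two_histogram(sizes):
    """Count how many sizes fall into each power-of-two bucket."""
    counts = Counter(power_of_two_bucket(s) for s in sizes)
    return dict(sorted(counts.items()))


key_lengths = [5, 8, 9, 12, 16, 17, 30, 33]  # made-up key lengths
print(power_of_two_histogram(key_lengths))
# → {8: 2, 16: 3, 32: 2, 64: 1}
```

A row labelled 16 therefore covers sizes 9 through 16, which is how the report can summarize a wide range of sizes in a handful of rows.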

We have more features and refinements in the works for reckon, but in the meantime, check out the repo on github and let us know what you think. The codebase includes several example binaries to get you started that demonstrate the various usages of the package.

Pull requests are always welcome — and remember: Always be samplin’.

# Simulating Decisions to Improve Them

One of the jobs of the Data Science team is to help zulily make better decisions through data. One way that manifests itself is via experimentation. Like most ecommerce sites, zulily continuously runs experiments to improve the customer experience. Our team’s contribution is to think about the planning and analysis of those tests to make sure that when the results are read they are trustworthy and that ultimately the right decision is made.

# Coming in hot

As a running example throughout this post, consider a landing page experiment.  At zulily, we have several landing pages that are often the first thing a visitor sees after they click an advertisement on a third-party site. For example, if a person was searching for pet-related products, and they clicked on one of zulily’s ads, they might land here.  Note: while that landing page is real, all the underlying data in this post is randomly generated.

The experiment is to modify the landing page in some way to see if conversion rate is improved.  Hopefully data has been gathered to motivate the experiment but, please, just take this at face-value.

Any single landing page is not hugely critical, but in aggregate they’re important for zulily, and small improvements in conversion rates or other metrics can have a large impact on the bottom line. In this example, the outcome metric (what we are trying to improve) is conversion rate, which is simply the number of conversions over the total number of visitors.

# Thinking with the End in Mind

Planning an experiment consists of many things, but often the most opaque part is the implications associated with power. The simple definition of power: given some effect of the treatment, how likely is it that the effect will be detected? The implication here is that the more confident one wants to be in their ability to detect the change, the longer the test needs to run. How long the test needs to run has a direct bearing on the number of tests a company can run as a whole and which tests should be given priority… unless you don’t care about conflating treatments, but then you have bigger problems :).

Power analysis, however, is a challenge. For anything beyond a simple A/B test, a lot needs to be thought through to determine the appropriate test. Therefore, it is often easier to think about the data, then work backwards through the analysis, and then the power.

Ultimately power analysis boils down to the simple question: for how long does a test need to run?

To illustrate this, consider the example experiment, where the underlying conversion rate for landing page A (the control) is 10%, and the expected conversion rate of the treatment is 10.5%. While these are the underlying conversion rates, due to randomness the realized conversion rate will likely be different, but hopefully close.

Imagine that each page is a coin, and each time a customer lands on the page the coin is flipped. Even though it’s known a priori that the underlying conversion rate for page A is 10%, if the coin is flipped 1000 times, it’s unlikely that it will be “heads” 100 times.  If you were to run two tests, you would get two different results, even though all variables are the same.
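The coin analogy takes only a few lines to try out. This is a small sketch using Python’s standard `random` module; the helper name `simulate_conversions` and the seed are arbitrary choices for illustration.

```python
import random


def simulate_conversions(rate, visitors, rng):
    """Flip the 'landing page' coin once per visitor and count the heads."""
    return sum(rng.random() < rate for _ in range(visitors))


rng = random.Random(42)  # fixed seed so the run is repeatable
first_run = simulate_conversions(0.10, 1000, rng)
second_run = simulate_conversions(0.10, 1000, rng)
print(first_run, second_run)  # both near 100, but rarely exactly 100
```

Both runs use the same underlying 10% rate, yet the realized counts will usually differ from each other and from 100.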

The rest of the post walks through the analysis of a single experiment, then describes how to expand that single experiment analysis into an analysis of the decision making process, and finally discusses a couple examples of complications that often arise in testing and how they can be incorporated into the analysis of the decision making.

# A Single Experiment

For instance: flipping the “landing page” coin, so to speak, 1000 times for each page, A and B. In this one experiment, the realized conversion rates (for fake data) are in the bar chart below.

That plot sure looks convincing, but just looking at the plot is not a sufficient way to analyze a test. Think back to the coin flipping example; since the difference in the underlying “heads” rate was only 0.5%, or 5 heads per 1000 flips, it wouldn’t be too surprising if A happened to have more heads than B in any given 1000 flips.

The good news is that statistical tools exist to help us understand how likely it is that the observed difference was due to an underlying difference in rates rather than to randomness.

The data collected would look something like:

where Treatment is the landing page treatment, and Converted is 0 for a visit without a conversion and 1 for a visit with a conversion.  For an experiment like this, that has a binary outcome, the statistical tool to choose is logistic regression.

Now we get to see some code!

Assuming the table from above is represented by the “visits” dataframe, the model is very simple to fit in statsmodels.

The output:

This is a lot of information, but the decision will likely be based on only one number: in the second table, the intersection of the “C(Treatment)[T.B]” row and the “P>|z|” column. This is the p-value: the probability of seeing a difference at least as large as the one observed if the two conversion rates were actually the same. The convention is that if that value is less than .05, the difference is significant. Another number worth mentioning is the coefficient of the treatment. This is the estimated size of the change. It’s important because a significant result can come with a negative coefficient, and acting on it without checking the sign is worse than just failing to accept a better page, since we’d accept a worse landing page.

In this case the p-value is greater than .05, so the decision would be that the difference observed is not indicative of an actual difference in conversion rates. This is clearly the wrong decision. The conversion rates were specified as being different, but that difference cannot be statistically detected at this sample size.

This is ultimately the challenge with power and sample sizes. Had the experiment been run again, with a larger sample, it is possible that we would have detected the change and made the correct decision.  Unfortunately, the planning was done incorrectly and only 1000 samples were taken.

# Always Be Sampling

Although the wrong decision was made in the last experiment, we want to improve our decision making.  It is possible to analyze our analysis through simulation. It is a matter of replicating — many times over — the analysis and decision process from earlier. Then it is possible to find out how often the correct decision would be made given the actual difference in conversion rate.

Put another way, the task is to:

1. Generate a random dataset for the treatment and control group based on the expected conversion rates.
2. Fit the same model that was fit in the example above.
3. Measure the outcome based on the decision criteria; here it’ll just be a significant p-value < 0.05.

And now be prepared for the most challenging part: the simulation. These three steps are going to be wrapped in a for loop, and the outcome is collected in an array. Here’s a simple example in Python of how that could be carried out.

Running that experiment 500 times, with a sample size of 1000, would yield a correct decision roughly 4.2% of the time.  Ugh.

This is roughly the power of the experiment at a sample size of 1000.  If we did this experiment 500 times, we would rarely make the correct decision.  To correct this, we need to change the experiment plan to generate more trials.
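The three steps listed above can also be sketched with the Python standard library alone. Note the swap: instead of fitting a logistic regression, this sketch uses a two-proportion z-test, which answers the same question for a single binary treatment. The function names, the seed, and the choice of 1000 visitors per page are assumptions for illustration, so the estimate will not match the figure quoted above exactly.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing can be detected
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(rate_a, rate_b, n_per_page, runs, seed=0):
    """Share of simulated experiments that end with p < 0.05."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(runs):
        # Step 1: generate a random dataset for each landing page.
        conv_a = sum(rng.random() < rate_a for _ in range(n_per_page))
        conv_b = sum(rng.random() < rate_b for _ in range(n_per_page))
        # Step 2: run the test (stand-in for fitting the logistic regression).
        p_value = two_proportion_p_value(conv_a, n_per_page, conv_b, n_per_page)
        # Step 3: apply the decision criterion.
        if p_value < 0.05:
            significant += 1
    return significant / runs


power = simulated_power(0.10, 0.105, n_per_page=1000, runs=500)
print(f"estimated power: {power:.1%}")
```

With only 500 runs the estimate itself is noisy, so a different seed can move it by a percentage point or two.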

To get a sense for the power at different sample sizes we choose several possible sample sizes, then run the above simulation for those samples. Now there are two for loops: one to iterate through the sample sizes, and one to carry out the analysis above.

Here is the outcome of the decisions at various sample sizes.  The “Power” column is the proportion of time a correct decision was made at the specified sample size.

Not until 10^4.5 samples — roughly 31,000 — does the probability of making the correct decision become greater than 50%.  It is now a matter of making the business decision about how necessary it is to detect the effect.  Typically it is around 80%, in the same way that the significance level is normally around 5%… convention.  It would be easy to repeat this test for several intermediate sample sizes, between 10^4.5 and 10^5, to determine a sample size that has the power the business is comfortable with.
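For a quick sanity check of numbers like these without running the full simulation, the power of this simple two-page test can be approximated in closed form. The sketch below uses the normal approximation for a difference of two proportions rather than the simulation described above, and it assumes that the sample size means visitors per landing page; both are our assumptions, as is the function name.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(rate_a, rate_b, n_per_page, alpha=0.05):
    """Normal-approximation power of a two-sided test of two proportions."""
    z = NormalDist()
    se = sqrt(rate_a * (1 - rate_a) / n_per_page
              + rate_b * (1 - rate_b) / n_per_page)
    shift = abs(rate_b - rate_a) / se  # true difference in standard errors
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


for exponent in (3.0, 3.5, 4.0, 4.5, 5.0):
    n = round(10 ** exponent)
    power = approximate_power(0.10, 0.105, n)
    print(f"10^{exponent}: n = {n:>6}, power = {power:.2f}")
```

Under these assumptions the approximation crosses 50% between 10^4 and 10^4.5 visitors per page, in line with the result described above. For more complicated designs there is no such shortcut, which is where the simulation earns its keep.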

## Uncertainty in Initial Conversion Rate

The outcome of the experiment was given a lot of criticism, but the underlying conversion rate was (more or less) taken for granted. The problem is there’s probably a lot of error in the estimated effect of the treatment before the experiment, and some error in the estimated effect of the control, since the control is based on past performance and the treatment is based on a combination of analysis and conjecture.

For example, say we had historical data that indicated that 1,000 out of 10,000 people had converted for the control thus far, and we recently ran a test similar to the treatment, so we have some confidence that 105 out of 1,000 people would convert.

If that was the prior information for each page the distribution for conversion rate for each landing page over 1,000 experiments could look like:

Even though it appears that landing page B does have a higher conversion rate on average, its distribution around that average is much wider.  To factor in that uncertainty, we can rerun that model, but instead of assuming a fixed conversion rate, we sample from the distribution of the conversion rate before each simulation. Here’s the outcome, similar to above, of the proportion of times we’d make the correct decision.

Sadly our power was destroyed by the randomness associated with the uncertainty between landing pages.  Here’s the same power by sample size table as above.  For example, at 10^5.0 it is likely that conversion rate for landing page B was less than landing page A.

An alternative route in a situation like this is the use of a beta-binomial model to continue to incorporate additional data to the initial conversion rates.
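To see why the uncertainty hurts so much, it helps to look at the draws themselves. The sketch below samples an underlying conversion rate for each page from a beta distribution built from the counts mentioned earlier (1,000 of 10,000 for the control, 105 of 1,000 for the treatment). Using the raw counts directly as beta parameters is a simplifying assumption on our part, and the names and seed are illustrative.

```python
import random


def draw_rates(rng):
    """Draw one plausible pair of underlying conversion rates from the priors."""
    rate_a = rng.betavariate(1000, 9000)  # control: 1,000 of 10,000 converted
    rate_b = rng.betavariate(105, 895)    # treatment: 105 of 1,000 converted
    return rate_a, rate_b


rng = random.Random(7)  # fixed seed so the run is repeatable
draws = [draw_rates(rng) for _ in range(10_000)]
share_b_better = sum(b > a for a, b in draws) / len(draws)
print(f"B truly better than A in {share_b_better:.0%} of draws")
```

In a substantial share of draws, page B is not actually better than page A, so no sample size can make a test "correctly" detect an improvement in those simulations. Each drawn pair would then be fed into the same generate, fit, and decide loop as before.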

## More Complicated Experiments

The initial example was a very simple test, but more complex tests are often useful.  With more complex experiments, the framework for planning needs to expand to facilitate better decision making.

Consider a similar example to the original one with an additional complication. Since the page is a landing page, the user had to come from somewhere.  These sources of traffic are also sources of variation. Just like how any realized experiment could vary from the expectation, any given source’s underlying conversion rate could also vary from the expectation. In the face of that uncertainty, it would be a good idea to run the test on multiple ads.

To simplify our assumptions, consider that the expected change in conversion rate is still 0.5%, but across three ads the conversion rate varies individually by -0.01%, 0.00% and +0.01% due to the individual ad-level characteristics.

For example, this could be the outcome of one possible experiment with two landing pages and three ads.

Thankfully statsmodels has a consistent API so just a few things need to change to fit this model:

• Use gee instead of logit. GEE stands for generalized estimating equations.  It allows a GLM to be fit with correlation within groups, or, for these purposes, a logit regression with the group-level variances taken into consideration.
• Pass the groups via the “groups” argument.
• Specify the family of the GLM; here it’s binomial with a logit link function (the default argument).

Those changes would look like this:

The decision criterion here is similar to the first case, so we cannot say anything about the effect of the landing page, and likely this test would not roll out.  Now that the basic model is constructed, we follow the same process to estimate how much power the experiment would have at various levels of sample size.

# Conclusion

Experiments are challenging to execute well, even with these additional tools.  The groups that have sufficient size to necessitate testing are normally sufficiently large and complex that wrong decisions can be made.  Through simulation and thinking about the decision-making process, it is possible to quantify how often a wrong decision could occur, its impact, and how to best mitigate the problem.

(By the way, zulily is actively looking for someone to make experimentation better, so if you feel that you qualify, please apply!)

Here at zulily, we use Google Compute Engine (GCE) for running our Hadoop clusters. Google has a utility called bdutil for setting up and tearing down Hadoop clusters on GCE. We ran into a number of issues when using the utility and were using an internally patched version of it to create our Hadoop clusters. If you look at the source, bdutil is essentially a collection of bash scripts that automate the various steps of creating a GCE instance and provisioning it with all the software needed to run Hadoop. One major issue we found with bdutil was that there is no way to provision a Hadoop cluster where the datanodes do not have external IP addresses. For clusters with many datanodes (the kind we typically run), this means we end up running up against our quota of external IP addresses. Additionally, there is no reason for the datanodes to have external IP addresses, as they should not be accessible to the public.

We decided to stop patching bdutil and write our own utility to provision a Hadoop cluster. The utility is called zdutil and you can find it on our GitHub page. Here’s how it works:

• First, GCE instances are created for the namenode and all datanodes in your Hadoop cluster.
• Then, any persistent disks that you requested are created and attached to the instances.
• If you have any tags that you would like to be applied to the namenode or datanodes, the tags are added to the instances. This saves you from having to manually tag every single instance in your cluster or write your own script to do so.
• Next, all of the required setup scripts to provision the namenode and datanodes are copied to a GCS bucket of your choosing. The namenode then provisions itself.
• Once it completes, it copies (via scp) all scripts needed for datanode provisioning to each datanode and then each datanode will provision itself.
• Once all datanodes have been provisioned, the namenode will start the Hadoop cluster.

If you deploy the datanodes with either external or ephemeral IP addresses, they will have internet access as determined by the rules of your GCE network. If you deploy the datanodes with “none” for the IP address, they will proxy through the namenode using Squid. You don’t have to configure any of this yourself; zdutil will take care of the details for you, including installing and provisioning Squid on your namenode. It is also important to be aware that Google’s version of the Google Cloud Storage Connector currently does not support proxying. If you use zdutil, it will install our fork of the GCS Connector which does support proxying by adding the following properties to your Hadoop core-site.xml configuration file: fs.gs.proxy.host and fs.gs.proxy.port.

If you have any need for zdutil, please use it and give us your feedback. At the moment we only support Debian-based images and we only support Hadoop version 1. If you would like to see another OS supported or YARN support, please add an issue to the GitHub page.

# Optimizing memory consumption of Radix Trees in Java

On the Relevancy team at zulily, we are often required to load a large number of large strings into memory. This often causes memory issues. After looking at multiple ways to reduce memory pressure, we settled on Radix Trees to store these strings. Radix Trees provide very fast prefix searching and are great for auto-complete services and similar uses. This post focuses entirely on memory consumption.

# What Is A Radix Tree?

Radix Trees take sequences of data and organize them in a tree structure. Strings with common prefixes end up sharing nodes toward the top of this structure, which is how the memory savings are realized. Consider the following example, where we store “antidisestablishmentarian” and “antidisestablishmentarianism” in a Radix Tree:

```
+- antidisestablishmentarian   (node 1)
   +- ism                      (node 2)
```

Two strings, totaling 53 characters, can be stored as two nodes in a tree. The first node stores the common prefix (25 characters) between it and its children. The second stores the rest (3 characters). In terms of character data stored, the Radix Tree stores the same information in approximately 53% of the space (not counting the additional overhead introduced by the tree structure itself).

If you add the string “antibacterial” to the tree, you need to break apart node 1 and shuffle things around. You end with:

```
+- anti                        (node 3)
   |- disestablishmentarian    (node 4)
   |  +- ism                   (node 2)
   +- bacterial                (node 5)
```
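The splitting step is the heart of the structure, so here is a minimal sketch of it in Python. This is not zulily’s Java implementation: it stores plain strings rather than UTF-8 byte arrays, supports insertion only, and the names `Node`, `insert` and `stored_chars` are made up for illustration.

```python
class Node:
    def __init__(self, text="", terminal=False):
        self.text = text          # fragment of characters stored on this node
        self.terminal = terminal  # True if a stored string ends at this node
        self.children = {}        # first character of a child's text -> child


def common_prefix_len(a, b):
    """Length of the longest prefix shared by strings a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def insert(root, word):
    node = root
    while word:
        child = node.children.get(word[0])
        if child is None:
            # Nothing shares a prefix here: store the remainder as a new leaf.
            node.children[word[0]] = Node(word, terminal=True)
            return
        k = common_prefix_len(child.text, word)
        if k < len(child.text):
            # Split: the child keeps the shared prefix, the rest moves down.
            rest = Node(child.text[k:], child.terminal)
            rest.children = child.children
            child.text = child.text[:k]
            child.terminal = False
            child.children = {rest.text[0]: rest}
        node, word = child, word[k:]
    node.terminal = True  # the word ends exactly on an existing node


def stored_chars(node):
    """Total number of characters held across all nodes of the tree."""
    return len(node.text) + sum(stored_chars(c) for c in node.children.values())


root = Node()
insert(root, "antidisestablishmentarian")
insert(root, "antidisestablishmentarianism")
print(stored_chars(root))  # → 28, versus 53 characters for the raw strings
insert(root, "antibacterial")
print(stored_chars(root))  # → 37, versus 66 characters for the raw strings
```

After the third insert, the original first node has been split into “anti” and “disestablishmentarian”, with “ism” still hanging off the latter, matching the diagram above.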



# Real-World Performance

We run a lot of software in the JVM, where memory performance can be tricky to measure. In order to validate our Radix Tree implementation and measure the impact, I pumped a bunch of pseudo-realistic data into various collections and captured memory snapshots with YourKit Java Profiler.

### Input Data

It didn’t take long to hack together some real-looking data in Ruby with Faker. I created four input files of approximately 1,000,000 strings that included a random selection of 12-digit numbers, bitcoin addresses, email addresses and ISBNs.

```
sreed:src/ $ head zulily-oss/radix-tree/12-digit-numbers.txt
141273396879
414492487489
353513537462
511391464467
633249176834
347155664352
632411507158
752672544343
483117282483
211673267195

sreed:src/ $ head zulily-oss/radix-tree/bitcoins.txt
1Mp85mezCtBXZDVHGSTn3NYZuriwRMmW6D
1N8ziuitNLmSnaXy2psYpLcXvugHw1Yc5s
18DnruBzLHmnVHQhDghoa6eDt6sDkfuWKr
1A3sRfAnP89HE4RgNQARa3kCq4xFEF9eev
12WR4DrsR4mM8gDHZCuqXe2h37VUSUPSNu
1PRmYuevwZXZamBEgANzLXe2SjFneGDsXp
1EpjPwt8Ap47XA6HwJhCTxUZRDH11GKWuQ
1P8MAgobhLw4FYcFHbw7a8t2FvQZg8K597
15xhiiLdkin8zi6S5KL9DkDDQyvLb1pjjT
1NPEZeEjgGu5TYdz5d3kxjVfLwxAZ2fK6f

sreed:src/ $ head zulily-oss/radix-tree/emails.txt
jakayla.hoppe@krajcikpollich.info
abbey.goodwin@tromp.org
laney.dach@walkerlubowitz.biz
rosanna_towne@marks.name
sherwood@oberbrunnerauer.name
mohamed_rice@champlin.com
margaret_kirlin@greenfeldercasper.net
vince@funk.net
leora_ohara@hackett.biz
audra.hermann@bauch.org

sreed:src/ $ head zulily-oss/radix-tree/isbns.txt
216962073-7
640524955-7
955360834-5
429656067-0
605437693-4
204030847-4
037410069-1
239193083-6
182539755-4
034988227-4
```


### Measuring Memory with YourKit

YourKit provides a measurement of “retained size” in its memory snapshots which is helpful when trying to understand how your code is impacting the heap. What isn’t necessarily intuitive about it, though, is what objects it excludes from this “retained size” measurement. Their documentation is very helpful here: only object references that are exclusively held by the object you’re measuring will be included. Instead of telling you “this is how much memory usage your object imposes on the VM,” retained size instead tells you “this is how much memory the VM would be able to garbage-collect if it were gone.” This is a subtle, but very real, difference if you wish to optimize memory consumption.

Thus, my memory testing needed to ensure that each collection held complete copies of the objects I wished to measure. In this case, each string key needed to be duplicated (I decided to intern and share every value I stored in order to measure only the memory gains from different key storage techniques).

// Results in shared reference, and inaccurate measurement
map1.put(key, value);
map2.put(key, value);

// Results in shared char[] reference, and better but
// still inaccurate measurement
map1.put(new String(key), value);
map2.put(new String(key), value);

// Results in complete copy of keys, and accurate measurement
map1.put(new String(key.toCharArray()), value);
map2.put(new String(key.toCharArray()), value);


### Collections Tested

I tested our own Radix Tree implementation, ConcurrentRadixTree from https://code.google.com/p/concurrent-trees/, a string array, Guava’s ImmutableMap and Java’s HashMap, TreeMap, Hashtable and LinkedHashMap. Each collection stored the same values for each key.

Both zulily’s Radix Tree and the ConcurrentRadixTree from concurrent-trees were configured to store string data as UTF-8-encoded byte arrays.
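To make the effect of that encoding choice concrete, here is a minimal sketch comparing the raw payload of one sample key as UTF-16 chars and as UTF-8 bytes. The class and method names are hypothetical, and the sketch assumes a JVM where String is backed by a char[] (Java 9 and later can already store Latin-1 text at one byte per character). It counts payload bytes only and ignores object headers and array overhead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of zulily's Radix Tree or concurrent-trees.
final class KeyEncodingSketch {

    // Payload bytes when the key is held as UTF-16 chars (2 bytes per char).
    static int charArrayPayloadBytes(String key) {
        return key.length() * Character.BYTES;
    }

    // Payload bytes when the same key is encoded as a UTF-8 byte array.
    static int utf8PayloadBytes(String key) {
        return key.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String key = "141273396879"; // first entry of 12-digit-numbers.txt
        System.out.println(charArrayPayloadBytes(key)); // prints 24
        System.out.println(utf8PayloadBytes(key));      // prints 12
    }
}
```

For ASCII-only keys like the test data, the UTF-8 form needs half the payload bytes of the char form.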

ConcurrentRadixTree was included simply to ensure that our own version (to be open-sourced soon) was worth the effort. The others were measured simply to highlight the benefits of Radix Tree storage for different input types. Each collection has its own merits and in most ways they are all superior to the Radix Tree for storage (put/get performance, concurrency and other features).

### Results

First of all, Guava’s ImmutableMap is pretty good. It stored the same key and value data as java.util.HashMap in 92-95% of the space. The Radix Tree breaks keys into byte array sequences and stores them in a tree structure based on common prefixes. This resulted in a best case of 62% of the size of the ImmutableMap for bitcoin addresses (strings which have many common prefixes) and a worst case of 88% for random 12-digit numbers. We see that the memory used by this data structure depends largely on the type of data put into it. Long strings with many long common prefixes are stored very efficiently in a narrow tree structure. Strings with few shared prefixes create a lot of branches in the underlying tree, making it very wide and adding a lot of overhead.
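The prefix-sharing idea can be illustrated with a small byte-level radix tree. This is only a sketch under simplifying assumptions, not zulily's implementation or the concurrent-trees one: it stores keys without values, scans children linearly, and is not thread-safe. The class name, method names and the sample keys are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a radix tree over UTF-8 bytes (keys only, no values).
final class RadixTreeSketch {

    private static final class Node {
        byte[] label;                  // bytes on the edge leading to this node
        boolean terminal;              // true if a stored key ends at this node
        final List<Node> children = new ArrayList<>();

        Node(byte[] label, boolean terminal) {
            this.label = label;
            this.terminal = terminal;
        }
    }

    private final Node root = new Node(new byte[0], false);

    void put(String key) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        Node node = root;
        int pos = 0;
        while (pos < k.length) {
            Node next = childStartingWith(node, k[pos]);
            if (next == null) {
                // No shared prefix below this node: store the rest as one edge.
                node.children.add(new Node(Arrays.copyOfRange(k, pos, k.length), true));
                return;
            }
            int shared = commonPrefix(next.label, k, pos);
            if (shared < next.label.length) {
                // Key diverges inside the edge: split it at the divergence point.
                Node tail = new Node(
                        Arrays.copyOfRange(next.label, shared, next.label.length),
                        next.terminal);
                tail.children.addAll(next.children);
                next.children.clear();
                next.children.add(tail);
                next.label = Arrays.copyOfRange(next.label, 0, shared);
                next.terminal = false;
            }
            pos += shared;
            node = next;
        }
        node.terminal = true;
    }

    boolean contains(String key) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        Node node = root;
        int pos = 0;
        while (pos < k.length) {
            Node next = childStartingWith(node, k[pos]);
            if (next == null || commonPrefix(next.label, k, pos) < next.label.length) {
                return false;
            }
            pos += next.label.length;
            node = next;
        }
        return node.terminal;
    }

    // Total key bytes actually held in the tree (shared prefixes counted once).
    int storedKeyBytes() {
        return storedKeyBytes(root);
    }

    private static int storedKeyBytes(Node node) {
        int total = node.label.length;
        for (Node child : node.children) {
            total += storedKeyBytes(child);
        }
        return total;
    }

    private static Node childStartingWith(Node node, byte first) {
        for (Node child : node.children) {
            if (child.label[0] == first) {
                return child;
            }
        }
        return null;
    }

    private static int commonPrefix(byte[] label, byte[] key, int offset) {
        int limit = Math.min(label.length, key.length - offset);
        int i = 0;
        while (i < limit && label[i] == key[offset + i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        RadixTreeSketch tree = new RadixTreeSketch();
        tree.put("romane");
        tree.put("romanus");
        tree.put("romulus");
        // The three keys total 20 bytes; shared prefixes are stored once.
        System.out.println(tree.storedKeyBytes()); // prints 12
    }
}
```

Every split adds a node, which in a real implementation carries per-node overhead, so keys that share few prefixes save little and can cost more than they save.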

Converting Java Strings to byte arrays accounts for most of the memory savings, but not all. Byte array storage was anywhere from 90% (bitcoin addresses) to 99% (ISBNs) in the tests I ran.

For us, storing byte-encoded representations of string data in a radix tree allowed us to reclaim valuable memory in our services. However, it wasn’t until we had validated the implementation accurately, with realistic data and trustworthy tools, that we could rest easy knowing we had accomplished what we set out to do.

# Meet a zulily Developer: Trevor

Each month zulily will talk with a developer and learn about a day in the life of a zulily engineer.

Who are you, and what do you do at zulily?

I’m Trevor, a developer on the Relevancy team. Prior to that, I worked on our fulfillment and warehouse management systems.

When did you join zulily?

I started in August of 2010, so it’s been 4 years now.

What was it like in the early days? Tell us a crazy story.

Oh man, where to start….

• My first desk was the classic startup cliché: a door blank on top of two filing cabinets. (We have proper desks now.)
• My second day on the job, the director in charge of the Supply Chain team stopped by my desk and introduced herself like so: “Hi, I’m Lys. I hear you’re traveling with me to our vendor site next week?” At that point my manager leaned over and said, “Oh, uh, heh, I meant to ask you: can you go to our vendor next week?”
• The following week consisted of Lys and me in a conference room with 10 folks in suits from the supply chain logistics company with whom we were gearing up to integrate. I had never worked on anything remotely related to supply chain logistics before, and couldn’t have told you what “GOH” stood for if my life depended on it (“garment on hanger”, if you’re curious). I spent most of that week furiously scribbling notes and wondering what in the world I’d gotten myself into.
• About a year later, we needed to build our own fulfillment center in Reno. My understanding at the time was that a typical FC startup project took 6 to 9 months. We had 10 weeks to go from an empty building to shipping packages — and we got it done. To me that was a testament to what a small, tightly focused, extremely motivated group of people can do. It was a lot of work, with not a lot of sleep, but in the end it was worth it.

How is that different from now?

Things are much, much less hectic nowadays. We still move fast and set aggressive goals, but we don’t have to burn ourselves out to achieve them. The team is also bigger now, so there are a lot more hands to help carry the load.

What’s a typical day like for you?

I usually get into the office at around 10 a.m. First off, I grab a cup of coffee and check email. Then I give our API monitoring charts a look to make sure everything’s healthy.

99% of the code I work on nowadays is in Java, so once I’ve confirmed that everything’s humming along I’ll fire up IntelliJ and get to coding. Somewhere between 11 a.m. and 1 p.m. I’ll take a break for lunch, then it’s back to coding for a few more hours before our daily afternoon standup meeting. After that, more coding, till around 7 p.m. when I head home.

We’re definitely fans of the “ship early, ship often” mantra. It’s not at all unusual for me to push 3 or 4 different builds to production over the course of a day. Of course, there are also plenty of days where I’m heads-down working on larger changes, but we try to keep our changes small enough, and the barrier to releasing new code low enough, that we don’t go dark for long stretches of time.

What gets you excited about working at zulily?

There are so many things:

• The team is absolutely top-notch. I’m surrounded by smart, talented people, both on my immediate team and across the entire organization. I learn something new from my coworkers every day.
• We move fast and try new things. Sometimes they work, sometimes they don’t, but every time we learn something new.
• The Relevancy team’s mandate is to figure out how to quickly and accurately surface the most engaging content for our members. We’re continually searching for ways to improve our systems, either by trying new and novel recommendation algorithms, or by increasing our capacity and reducing the time it takes our recommendations to update in response to user behavior. It’s a fascinating space that combines machine learning with hard-core engineering for scale. I love it.
• I’ve worked at places building packaged software with 9-to-12-month release cycles. It’s disheartening to put that much effort into a project, just to see it languish on a shelf somewhere because the customer can’t (or won’t) deploy it. Our team is the polar opposite of that: we push new code to production several times a day. This creates an incredible virtuous cycle. The barrier to pushing code live is low, which means you do it more often, which means each change is small, which means it’s both easy to verify and easy to roll back if something goes sideways. With such low friction, we’re constantly pushing forward, constantly improving our service, creating a much richer, more engaging experience for our members.

# Experience optimization at zulily

Experimentation is the name of the game for most top tech companies, and it’s no different here at zulily. Because new zulily events launch every day, traditional experiments can be cumbersome for some applications. We need to be able to move quickly, so we’ve built a contextual multi-armed bandit system that learns in real time to help us deliver the best experience to each zulily member.
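For readers new to the term, the sketch below shows the general shape of a contextual bandit: pick an experience (an "arm") for a given context, observe a reward, and update the estimates as you go. The post does not describe the algorithm behind zulily's system at this point, so epsilon-greedy with separate statistics per context is swapped in here purely as the simplest illustration. The class name, the string contexts and the reward values are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: epsilon-greedy bandit with per-context statistics.
// Not zulily's production system.
final class ContextualBanditSketch {
    private final int arms;
    private final double epsilon;   // probability of exploring a random arm
    private final Random random;
    private final Map<String, double[]> rewardSums = new HashMap<>();
    private final Map<String, int[]> pulls = new HashMap<>();

    ContextualBanditSketch(int arms, double epsilon, long seed) {
        this.arms = arms;
        this.epsilon = epsilon;
        this.random = new Random(seed);
    }

    // Explore with probability epsilon, otherwise exploit the best-known arm.
    int choose(String context) {
        if (random.nextDouble() < epsilon) {
            return random.nextInt(arms);
        }
        return bestArm(context);
    }

    // Arm with the highest observed mean reward in this context.
    // Untried arms count as mean 0; ties go to the lowest index.
    int bestArm(String context) {
        double[] sums = rewardSums.getOrDefault(context, new double[arms]);
        int[] counts = pulls.getOrDefault(context, new int[arms]);
        int best = 0;
        double bestMean = Double.NEGATIVE_INFINITY;
        for (int arm = 0; arm < arms; arm++) {
            double mean = counts[arm] == 0 ? 0.0 : sums[arm] / counts[arm];
            if (mean > bestMean) {
                bestMean = mean;
                best = arm;
            }
        }
        return best;
    }

    // Record the reward observed after showing `arm` in `context`.
    void update(String context, int arm, double reward) {
        rewardSums.computeIfAbsent(context, c -> new double[arms])[arm] += reward;
        pulls.computeIfAbsent(context, c -> new int[arms])[arm]++;
    }
}
```

Because estimates are updated after every observation, a system like this can shift traffic toward the better experience while data is still arriving, rather than waiting for a fixed-length experiment to finish.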

As zulily has grown over the past four and a half years, the number of new events and products launching each day has increased at a tremendous pace. This is great for our members, but it brings with it the challenge of ensuring that each member’s experience is as personalized as possible. My team, the Data Science team, and Relevancy, with whom we work closely, are tasked with seamlessly optimizing and customizing that experience. In order to do so, we run experiments — a lot of them. Even the most minor changes to the site usually have to prove their mettle by beating a control group in a well-designed, sufficiently-powered experiment.