Experimentation is the name of the game for most top tech companies, and it’s no different here at zulily. Because new zulily events launch every day, traditional experiments can be cumbersome for some applications. We need to be able to move quickly, so we’ve built a contextual multi-armed bandit system that learns in real time to help us deliver the best experience to each zulily member.
As zulily has grown over the past four and a half years, the number of new events and products launching each day has increased at a tremendous pace. This is great for our members, but it brings with it the challenge of ensuring that each member’s experience is as personalized as possible. My team, the Data Science team, works closely with the Relevancy team, and together we are tasked with seamlessly optimizing and customizing that experience. In order to do so, we run experiments: a lot of them. Even the most minor changes to the site usually have to prove their mettle by beating a control group in a well-designed, sufficiently powered experiment.
This is a great approach for testing out changes before we push them to all of our members. However, it also comes with some obvious and not-so-obvious drawbacks.
1) Collecting data in well-designed experiments (i.e., experiments with high statistical power, large sample sizes, and well-specified stopping rules) can take a long time, especially if the groups receiving the treatment are very small or the treatment’s expected effect on behavior is small.
2) Most of zulily’s events are only available for three days and are only on the front page for one. That window leaves little time to gather data, and an event may be over before the experiment concludes.
3) Experiments often have the goal of finding which experience is the “best” experience — many people have read stories of companies testing many shades of colors for their buttons to find the best one. While this is appropriate for more permanent aspects of the site, it runs contrary to the idea of a personalized experience. We want to find the best experience *for you, right now* — and what’s best for you might change.
We decided to implement a system that learns in real time and guides which experiences are being delivered to members. This system is called a multi-armed bandit (MAB). To better understand what a MAB is, imagine walking into a casino and seeing, instead of rows of slot machines, a single slot machine with multiple levers. You know that one of those levers pays out rewards at a higher rate than the others, but you don’t know which one. There’s a catch, too: all of the levers pay out some of the time, which makes finding the winning one pretty difficult.
You are tasked with figuring out which lever is best, and the only way to do so is to run lots of experiments. You pull one of the levers, note whether it pays out, and then either pull it again or pull another one. As you go, you start pulling the levers that seem to be rewarding at a higher rate more often, while still returning to the others now and then in case a high observed rate was just inflated by randomness.
Fundamentally, you have to balance searching for the lever you think is the winner against pulling that winning lever as much as possible. Search for too long, and you miss out on the gains you could have had from the winning lever; search too little, and you might settle on an also-ran. Even worse, what if the levers change their reward rates after a while? You’d miss that, too.
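That balancing act can be sketched in a few lines of Python with epsilon-greedy, one of the simplest bandit strategies: explore a random lever a small fraction of the time, and otherwise pull the lever with the best observed payout rate. (The names and numbers here are illustrative, not our production code.)

```python
import random

class EpsilonGreedyBandit:
    """Minimal multi-armed bandit: explore with probability epsilon,
    otherwise exploit the arm with the best observed reward rate."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore
        # exploit the arm with the highest observed mean reward
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental mean keeps learning online, one pull at a time
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Because the running means keep updating on every pull, an arm whose early wins were just luck gets corrected, and an arm whose reward rate drifts over time is eventually noticed.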
This might seem like a contrived example, but it ends up being very powerful as a way to deliver many different experiences to our members and get real-time feedback. We need to stay flexible and optimize the experience for different kinds of members who may visit at different points in the day. The MAB occasionally tries out different experiences in new situations to see if its operating assumptions about its available choices are still holding up. Luckily, we get thousands of visits an hour, and can collect and process a lot of data to help us.
Now, you may have noticed that this approach has the same shortcoming as #3 above. That’s where the contextual part of “contextual multi-armed bandit” fits in. We’ve implemented a slightly more complicated algorithm that not only looks to deliver the best experience out of a set of choices — it tries to deliver the best one for you and members like you. We recognize that our members are a diverse bunch and what’s an optimal experience for some members may not be the optimal experience for others.
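The post doesn’t spell out which algorithm we chose, but one well-known contextual bandit algorithm, LinUCB, captures the idea: each experience (arm) keeps a linear model of expected reward given a member’s context vector, plus an optimism bonus for contexts it hasn’t seen much. A rough sketch, with illustrative names only:

```python
import numpy as np

class LinUCBArm:
    """One arm of a LinUCB contextual bandit: ridge regression from a
    member's context vector to expected reward, plus a confidence bonus
    that encourages exploring contexts this arm has rarely seen."""

    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha             # width of the confidence bonus
        self.A = np.eye(n_features)    # regularized Gram matrix of seen contexts
        self.b = np.zeros(n_features)  # reward-weighted sum of seen contexts

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b         # ridge-regression coefficients
        # predicted reward plus optimism for uncertain contexts
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def choose(arms, x):
    """Serve the experience whose arm scores highest for this member."""
    return max(range(len(arms)), key=lambda i: arms[i].score(x))
```

Because each arm scores a specific context vector, two members visiting at the same moment can be served different experiences, which is exactly the “best for you, right now” behavior a plain bandit can’t deliver.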
THE GORY DETAILS
Since this is an engineering blog, I’d be remiss if I didn’t delve into some of the more technical details about our contextual bandit. If you’re interested in learning more about multi-armed bandits, I highly recommend John Myles White’s book on the topic.
We prototyped the system using Python, the language of choice for the Data Science team. Because contextual MABs learn through reinforcement and rely on contextual information about the user and the experience, testing them offline has some challenges. You never know exactly which members will visit the site on a given day, in what order, and how they will react when faced with a given experience. To solve this problem, we used what are known as Monte Carlo simulations.
We simulated the process of members with given browsing histories visiting the site and interacting with the chosen experience with some pre-defined probability. At each step we recorded the MAB’s internal estimates and tracked how often and how quickly it learned from these interactions. Once we were comfortable that the MAB was operating as expected under controlled conditions, the Relevancy engineers took over and implemented it in Java so that it could handle thousands of requests an hour without degrading the site experience.
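As an illustration, a Monte Carlo check of this kind can be sketched as follows, assuming a simple epsilon-greedy bandit and made-up click probabilities (none of these numbers come from our actual simulations):

```python
import random

def simulate(visit_count, click_probs, epsilon=0.1, seed=42):
    """Monte Carlo check of a bandit: each simulated member clicks on
    experience i with probability click_probs[i]; we track how often
    the bandit ends up serving the best experience."""
    rng = random.Random(seed)
    n = len(click_probs)
    counts, values = [0] * n, [0.0] * n
    best = max(range(n), key=click_probs.__getitem__)
    served_best = 0

    for _ in range(visit_count):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                       # explore
        else:
            arm = max(range(n), key=values.__getitem__)  # exploit
        served_best += (arm == best)
        # simulated member reacts with the pre-defined probability
        reward = 1 if rng.random() < click_probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    return served_best / visit_count  # fraction of visits on the winner
```

Running this over many simulated visits shows whether, and how quickly, the bandit converges on the best experience before it ever touches real traffic.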
Finally, we released it into the wild — but not before it proved its mettle by beating a control group in a well-designed, sufficiently powered experiment.