Managing Your First Data Science or Machine Learning Project

Introduction

When I started managing software development projects, the standard methodology was the Waterfall, in which you attempted to understand and document the entire project lifecycle before anyone wrote a single line of code. While this may have offered some benefit for coordinating large multi-team projects, more often than not it resulted in missed deadlines, inflexibility, and systems that didn’t fully achieve their business goals. Agile has since emerged as the dominant project methodology, addressing many of the Waterfall’s shortcomings. Agile recognizes that it’s unlikely we’ll know everything up front, and as such is built on an iterative foundation. This allows us to learn as we go, regularly incorporate stakeholder feedback, and reduce the risk of failure.

For Data Science and Machine Learning (DS/ML) projects, I’d argue that an iterative approach is necessary, but not sufficient, for a successful project outcome. DS/ML projects are different. And these differences can fly below the traditional project manager’s radar until they pop up late in the schedule and deliver a nasty bite. In this blog post I’ll point out some of the key differences I’ve seen between a traditional software development project and a DS/ML project, and how you can protect your team and your stakeholders from these hidden dangers.

Project Goal

At a high level DS/ML projects typically seek to do one of three things: 1) Explain; 2) Predict; or 3) Find Hidden Structure. In the first two we are predicting something, but with different aims. When we are tasked with explanation we use ‘transparent’ models, meaning they show us directly how they arrive at a prediction. If our model is sufficiently accurate, we can make inferences about what drives the outcome we’re trying to predict. If our goal is simply getting the best possible prediction we can also use ‘black box’ models. These models are generally more complex and don’t provide an easy way to see how they make their predictions, but they can be more accurate than transparent models. When the goal is to find hidden structure, we are interested in creating groups of like entities such as customers, stores, or products, and then working with those groups rather than the individual entities. Regardless of the immediate goal, in all three cases we’re using DS/ML to help allocate scarce organizational resources more effectively.

When we want to explain or predict we need to explicitly define an outcome or behavior of interest. Most retail organizations, for example, are interested in retaining good customers. A common question is “How long will a given customer remain active?” One way to answer this is to build a churn model that attempts to estimate how likely a customer is to stop doing business with us. We now have a defined behavior of interest: customer churn, so we’re ready to start building a model, right? Wrong. A Data Scientist will need a much more precise definition of churn or risk building a model that won’t get used. Which group of customers are we targeting here? Those who just made their first purchase? Those who have been loyal customers for years? Those who haven’t bought anything from us in the last 6 months? High value customers whose activity seems to be tapering off recently? Each of these definitions of ‘at risk’ creates a different population of customers to model, leaving varying amounts of observed behavior at our disposal.

By definition we know less about new customers than long-time loyal customers, so a model built for the former use case will probably not generalize well to the latter group. Once we’ve defined the population of customers under consideration, it’s really important to compute the ‘size of the prize’.

Size of the Prize

Say we build a churn model that makes predictions that are 100% accurate (if this were to happen, we’ve likely made a modeling mistake – more on that later), and we apply that model to the customer audience we defined to be at risk of churn. What’s the maximum potential ROI? Since our model can correctly distinguish those customers who will churn from those that won’t, we’d only consider offering retention incentives to those truly at risk of leaving. Maybe we decide to give them a discount on their next purchase. How effective have such interventions been in the past at retaining similar customers?

If a similar discount has historically resulted in 2% of the target audience making an incremental purchase, how many of today’s at-risk customers would we retain? If you assume that each retained customer’s new order will be comparable to their average historical order amount, and then back out the cost of the discount, how much is left? In some cases, even under ideal conditions you may find that you’re targeting a fairly narrow slice of customers to begin with, and the maximum ROI isn’t enough to move forward with model-based targeting for a retention campaign. It’s much better to know this before you’ve sunk time and effort into building a model that won’t get used. Instead, maybe you could build a model to explain why customers leave in the first place and try to address the root causes.
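
As a back-of-the-envelope illustration, the size-of-the-prize arithmetic might look like the sketch below. All of the numbers are hypothetical, and it assumes the idealized 100% accurate model described above.

# Hypothetical numbers for an idealized (100% accurate) churn model.
at_risk_customers = 50_000      # size of the targeted 'at risk' audience
redemption_rate = 0.02          # historical response to a similar discount
avg_order_value = 60.00         # average historical order amount
discount = 0.15                 # discount offered on the incremental order

retained = at_risk_customers * redemption_rate
gross_revenue = retained * avg_order_value
max_return = gross_revenue * (1 - discount)     # back out the cost of the discount

print(f"customers retained: {retained:,.0f}")
print(f"maximum potential return: ${max_return:,.2f}")

If that maximum return doesn’t clear the cost of building and operating the model, it’s better to find out now.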

Definition of Success

Many times there is a difference in the way you’ll assess the accuracy of a model and how stakeholders will measure the success of the project. DS/ML practitioners use a variety of model performance metrics, many of which are quite technical in nature. These can be very different from the organizational KPIs that a stakeholder will use to judge a project’s success. You need to make sure that success from a model accuracy standpoint will also move the KPI needle in the right direction.

Models are built to help us make better decisions in a specific organizational context. If we’re tasked with improving the decision making in an existing process, we need to understand all the things that are taken into account when making that decision today. For example, if we build a model that makes recommendations for products based on previous purchases, but fail to consider current inventory, we may recommend something we can’t deliver. If our business is seasonal in nature, we may be spot on with our product recommendation and have plenty on hand, but suggest something that’s seasonally inappropriate.

Then there is the technical context to consider. If the goal will be making a recommendation in real time in an e-commerce environment, such as when an item is added to a shopping cart, you’ve got to deliver that recommendation quickly and without adding any friction to the checkout process. That means you’ll need to be ready to quickly supply this recommendation for any customer at any time. Models are usually built or ‘trained’ offline, and that process can take some time. A trained model is then fed new data and will output a prediction. This last step is commonly called ‘scoring’. Scoring can be orders of magnitude faster than training. But keep in mind that some algorithms require much more training time than others. Even if your scoring code can keep up with your busiest bursts of customer traffic, if you want to train frequently – perhaps daily so your recommendations take recent customer activity into account – the data acquisition, preparation, and training cycle may not be able to keep up.
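
To make the train-offline / score-online split concrete, here is a minimal sketch using scikit-learn and joblib; the data, model choice, and file name are all illustrative.

import numpy as np
from joblib import dump, load
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in training data; in practice this comes from the offline data pipeline.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 20))
y_train = rng.integers(0, 2, size=10_000)

# Offline: training can take minutes to hours on real data.
model = GradientBoostingClassifier().fit(X_train, y_train)
dump(model, "recommender.joblib")

# Online: scoring one customer should take milliseconds.
model = load("recommender.joblib")          # typically loaded once at service startup
x_new = rng.normal(size=(1, 20))            # features for a single customer
print(f"score: {model.predict_proba(x_new)[0, 1]:.3f}")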

Some algorithms need more than data: they also require values for ‘tuning parameters’. Think of these as ‘knobs’ that have to be dialed in to get the best performance. The optimal settings for these knobs will differ from project to project, and can vary over time for the same model. A behavioral shift in your customer base, seasonality, or a change in your product portfolio can all require that a model be retrained and retuned. These are all factors that can affect the quality of your model’s recommendations.
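
A sketch of what dialing in those knobs can look like, using scikit-learn’s grid search on synthetic data; the parameter grid and model choice are illustrative, not a recommendation.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# The 'knobs': candidate settings for two tuning parameters.
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

# The best settings for *this* data; they may change after the next retrain.
print(search.best_params_)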

Once you have clearly defined the desired outcome, the target audience, the definition of success from both the KPI and model accuracy perspectives, and how the model will be deployed, you’ve eliminated some of the major reasons models get built but don’t get used.

Model Inputs and Experimentation

In traditional software development, the inputs and outputs are usually very well defined. Whether the output is a report, an on-line shopping experience, an automatic inventory replenishment system, or an automobile’s cruise control, we usually enter into the project with a solid understanding of the inputs and outputs. The desired software just needs to consume the inputs and produce the outputs. That is certainly not to trivialize these types of projects; they can be far more difficult and complex than building a predictive model.

But DS/ML projects often differ in one key respect: while we’ll usually know (or will quickly define) what the desired output should be, many times the required inputs are unknown at the beginning of the project. It’s also possible that the data needed to make the desired predictions can’t be acquired quickly enough or is of such poor quality that the project is infeasible. Unfortunately, these outcomes are often not apparent until the Data Scientist has had a chance to explore the data and try some preliminary things out.

Our stakeholders can be of immense help when it comes to identifying candidate model inputs. Data that’s used to inform current decision making processes and execute existing business rules can be a rich source of predictive ‘signal’. Sometimes our current business processes rely on information that would be difficult to obtain (residing in multiple spreadsheets maintained by different groups) or almost impossible to access (tribal knowledge or intuition). Many algorithms can give us some notion of which data items they find useful to make predictions (strong signal), which play more of a supporting role (weak signal) or are otherwise uninformative (noise). Inputs containing strong signal are easy to identify, but the distinction between weak signal and noise is not always obvious.
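
As an illustration of the strong-signal / weak-signal / noise distinction, many tree-based models expose a feature importance measure. The sketch below uses synthetic data where only a few of the candidate inputs are informative.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 15 candidate inputs, only a handful carry real signal.
X, y = make_classification(n_samples=5_000, n_features=15,
                           n_informative=3, n_redundant=2, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(15)])

model = RandomForestClassifier(random_state=0).fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
# Strong signal stands out at the top; separating weak signal from noise
# near the bottom usually takes more deliberate experimentation.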

Ghost Patterns

If we’re lucky we’ll have some good leads on possible model inputs. If we’re not so lucky we’ll have to start from scratch. There are real costs to including uninformative or redundant inputs. Besides the operational costs of acquiring, preparing, and managing inputs, not being selective about what goes into an algorithm can cause some models to learn ‘patterns’ in the training data that turn out to be spurious.

Say I asked you to predict a student’s math grade. Here’s your training data: Amy gets an ‘A’, Bobby gets a ‘B’, and Cindy gets a ‘C’. Now make a prediction: What grade does David get? If that’s all the information you had, you might be inclined to guess ‘D’, despite how shaky that seems. You’d probably be even less inclined to hazard a guess if I asked about Mary’s grade. The more data we put into a model the greater the chance that some data items completely unrelated to the outcome just happen to have a pattern in the training data that looks useful. When you try to make predictions with a data set that your model didn’t see during training, that spurious pattern won’t be there and model performance will suffer.

To figure out which candidate model inputs are useful to keep and which should be dropped from further consideration, the DS/ML practitioner must form hypotheses about and conduct experiments on the data. And there’s no guarantee that there will be reliable enough signal in the data to build a model that will meet your stakeholder’s definition of success.

Time Travel

We hope to form long, mutually beneficial relationships with our customers. If we earn our customers’ repeat business, we accumulate data and learn things about them over time. It’s common to train a model on historical data from one period of time and then test its accuracy on data from a subsequent period. This reasonably simulates what would happen if I built a model on today’s data and used it to predict what will happen tomorrow. Looking into the past like this is not without risk, though. When we reach back into a historical data set, we need to be careful to avoid considering data that arrived after the point in the business process at which we want to make a prediction.

I once built a model to predict how likely it was that a customer would make their first purchase between two points in their tenure. To test how accurate my model was I needed to compare its predictions to real outcomes, so that meant using historical data. When I went to gather the historical data I accidentally included a data item that captured information about the customer’s first purchase – something I wouldn’t know at the point in time at which I wanted to make the prediction. If a customer had already made their first purchase at the time I wanted to make the prediction, they wouldn’t be part of the target population to begin with.

The first indication that I had accidentally let my model cheat by peeking into the future was that it was 100% accurate when tested on a new data set. That generally doesn’t happen, at least to me, and least of all on the first version of a model I build. When I examined the model to see which inputs had a strong signal, the data item with information from the future stood out like a sore thumb. In this case my mistake was obvious, so I simply removed the data item that was ‘leaking’ information from the future and kept going. This particular information leak was easy to detect and correct, but that’s not always the case. And this issue is something that can bite even stakeholders during the conceptualization phases of projects, especially when trying to use DS/ML to improve decision making in longer running business processes.

Business processes that run over days, weeks, or longer typically fill in more and more data as they progress. When we’re reporting or doing other analysis on historical data, it can be easy to lose sight of the fact that not all of that data showed up at the same time. If you want to use a DS/ML capability to improve a long running business process, you need to be mindful of when the data actually becomes available. If not, there’s a real risk of proposing something that sounds awesome but is just not feasible.
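
One way to stay mindful of this is to stamp every data item with the time it became available and build training sets ‘as of’ the prediction time. The sketch below is illustrative only; the column names and tiny dataset are made up.

import pandas as pd

# Hypothetical event-level data, stamped with when each item became available
# in our systems (not when the business event occurred).
events = pd.DataFrame({
    "customer_id":  [1, 1, 2, 2],
    "feature":      ["orders_90d", "first_purchase_amt", "orders_90d", "first_purchase_amt"],
    "value":        [3, 42.0, 1, 19.0],
    "available_at": pd.to_datetime(["2020-01-05", "2020-02-20", "2020-01-10", "2020-01-12"]),
})
labels = pd.DataFrame({
    "customer_id":     [1, 2],
    "prediction_time": pd.to_datetime(["2020-02-01", "2020-02-01"]),
    "outcome":         [0, 1],
})

# Keep only data that had already arrived at each customer's prediction time;
# anything with available_at > prediction_time would leak information from the future.
merged = labels.merge(events, on="customer_id")
point_in_time = merged[merged["available_at"] <= merged["prediction_time"]]

training_data = (point_in_time
                 .pivot_table(index=["customer_id", "prediction_time", "outcome"],
                              columns="feature", values="value", aggfunc="last")
                 .reset_index())
print(training_data)   # customer 1's first_purchase_amt is (correctly) missing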

Data availability and timing issues can also crop up in business processes that need to take quick action on new information. Even if data appears in an operational system in a timely fashion, it still has to be readied for model scoring and fed to the scoring code. This pre-scoring data preparation can be computationally intensive in some cases and may have its own input data requirements. Once the data is prepared and delivered to the scoring process, the scoring step itself is typically quick and efficient.

Unified Modeling Language (UML) Sequence and Timing Diagrams are useful tools for figuring out how long the end-to-end process might take. It’s wise to get a ballpark estimate of this before jumping into model building.

Deploying Models as a Service

Paraphrasing Josh Wills, a Data Scientist is a better programmer than most statisticians, and a better statistician than most programmers. That said, you probably still want software engineers building applications and machine learning engineers building modeling and scoring pipelines. There are two main strategies an application can use to obtain predictions from a model: the model can be directly integrated into the application’s code base, or the application can call a service (API) to get model predictions. This choice can have a huge impact on the architecture and success of a DS/ML project.

Integrating a predictive model directly into an application may seem tempting – no need to stand up and maintain a separate service, and the end-to-end implementation of making and acting on a prediction lives in the same code base, which can simplify troubleshooting and debugging. But there are downsides. Application and model development are typically done on different cadences and by different teams. An integrated model means a more complicated deployment and testing process and can put application support engineers in the awkward position of having to troubleshoot code developed by another team with a different set of skills. Integrated models can’t easily be exposed to other applications or monitoring processes and can cause application feature bloat if there’s a desire to include the capability to A/B test competing models.

Using a service to host the scoring code gets around these issues, but also impacts the overall system architecture. Model inputs need to be made available behind the API. At first blush, this may seem like a disadvantage – more work and more moving parts. But the process that collects and prepares data for scoring will often need to operate in the background anyway, independent of normal application flow.

Exposing model predictions as a service has a number of advantages. Most importantly, it allows teams to work more independently and focus on improving the parts of the system that best align with their skill sets. A/B testing of two or more models can be implemented behind the API without touching the application. Having the scoring code run in its own environment also makes it easier to scale. You’ll want to log some or all of the predictions, along with their inputs, for offline analysis and long-term prediction performance monitoring. Being able to identify the cases where the model’s predictions are the least accurate can be incredibly valuable when looking to improve model performance.

If a later revision of the model needs to incorporate additional data or prepare existing data in a different way, that work doesn’t need to be prioritized with the application development team. Imagine that you’ve integrated a model directly into an application, but the scoring code needs to be sped up to maintain a good user experience. Your options are much more limited than if you’d deployed the scoring code as a service. How about adding in that A/B testing capability? Or logging predictions for offline analysis? Even just deploying a retuned version of an existing model will require cross-team coordination.

Modeling within a Service Based Architecture

The model scoring API is the contract between the DS/ML and application development teams. Changing what’s passed into this service or what’s returned back to the application (the API’s ‘signature’) is a dead giveaway that the division of responsibilities on either side of the API was not completely thought through, and that is a serious risk to project success. For teams to work independently on subsystems, that contract cannot change. A change to an API signature requires both service producers and service consumers to make changes, and often results in a tighter coupling between the systems – one of the problems we’re trying to avoid in the first place with a service-based approach. Always keep the number of things going into and coming out of that API to a bare minimum. The less the API client and server know about each other, the better.

Application development teams may be uneasy about relying on such an opaque service: the more opaque a service is, the less insight the application team has into a core component of their system. It may be tempting to include a lot of diagnostic information in the API’s response payload. Don’t do it. Instead, focus your efforts on persisting this diagnostic information behind the API. It’s fine to return a token or tracing id that can be used to retrieve this information through another channel at a later point in time; just keep your API signature as clean as possible.

As previously discussed, DS/ML projects are inherently iterative in nature and often require substantial experimentation. At the outset we don’t know exactly which data items will be useful as model inputs. This presents a problem for a service-based architecture. You want to encourage the Data Scientist to build the best model they can, so they’ll need to run a lot of little experiments, each of which could change what the model needs as inputs. So Machine Learning engineers will need to wait until model input requirements settle down to the point where they can start to build data acquisition and processing pipelines. But there’s a catch: waiting too long before building out the API unnecessarily extends timelines.

So how do we solve this? One idea is to work at the data source level rather than the data item level. The Data Scientist should quickly be able to narrow down the possible source systems from which data will need to be acquired, and not long after that know which tables or data structures in those sources are of interest. One useful idiom from the Data Warehousing world is “Touch It, Take It”. This means that if today you know you’ll need a handful of data items from a given table, it’s better to grab everything in that table the first time rather than cherry-pick your first set and then have to open up the integration code each time you identify the need for an additional column. Sure, you’ll be acquiring some data prospectively, but you’ll also be maintaining a complete set of valuable candidate predictors. You’ll thank yourself when building the next version of the model, or a different model in the same domain, because the input data will already be available.

Once the Data Scientist has identified the desired tables or data structures, you’ll have a good idea of the universe of data that could potentially be made available to the scoring code behind the API. This is the time to nail down the leanest API signature you can. A customer id goes in; a yes/no decision and a tracing token come out. That’s it. Once you’ve got a minimal signature defined, freeze it – at least until the first version of the model is in production.
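
A minimal sketch of that lean signature, written here with Flask; the endpoint name, threshold, and stubbed feature/scoring functions are placeholders, not a real implementation.

import uuid
from flask import Flask, jsonify

app = Flask(__name__)

def load_features(customer_id):
    # Stub: in practice, features are prepared behind the API by a background pipeline.
    return {"orders_90d": 2, "days_since_last_order": 45}

def score(features):
    # Stub standing in for the trained model's scoring call.
    return 0.8 if features["days_since_last_order"] > 30 else 0.1

@app.route("/churn-risk/<int:customer_id>")
def churn_risk(customer_id):
    risk = score(load_features(customer_id))
    trace_id = str(uuid.uuid4())
    # Diagnostics and the full prediction record are persisted behind the API,
    # keyed by trace_id; only the decision and the token cross the boundary.
    return jsonify({"at_risk": risk > 0.5, "trace_id": trace_id})

if __name__ == "__main__":
    app.run()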

Business Rules First

Predictive models are often used to improve on an existing business process built on business rules. If you’re improving a business-rule-based process, consider the business rules as version zero of the model. They establish the baseline level of performance against which the model will be judged. Consider powering the first version of the API with these business rules. The earlier the end-to-end path can be exercised the better, and having a business-rule-based version of the model available as a fallback provides a quick way to roll back the first model without rolling back the architecture.
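
A small sketch of that idea, continuing the hypothetical churn-risk endpoint above: the business rule is just another scorer with the same interface, so it can ship first and remain available as a fallback.

def rules_v0(features):
    # Version zero: the existing business rule, expressed as a scorer.
    return 1.0 if features["days_since_last_order"] > 180 else 0.0

def model_v1(features):
    # Placeholder for the trained model's scoring call.
    return 0.8 if features["days_since_last_order"] > 30 else 0.1

# The API calls whichever scorer is active; swapping back to rules_v0 rolls
# back the model without rolling back the architecture.
ACTIVE_SCORER = rules_v0

print(ACTIVE_SCORER({"days_since_last_order": 200}))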

In Closing

In this post I’ve tried to highlight some of the more important differences I’ve experienced between a traditional software development project and a DS/ML project. I was fortunate enough to be on the lookout for a few of these, but most only came to light in hindsight. DS/ML projects have enough inherent uncertainty; hopefully you’ll be able to use some of the information in this post to avoid some of these pitfalls in your next DS/ML project.

Practical A/B Testing

Introduction

A/B testing is essential to how we operate a data-driven business at zulily. We use it to assess the impact of new features and programs before we roll them out. This blog post focuses on some of the more practical aspects of A/B testing. It is divided into four parts. It begins with an introduction to A/B testing and how we measure long-term impact. Then, it moves into the A/B splitting mechanism. Next, it turns to Decima, our in-house A/B test analysis platform. Finally, it goes behind the scenes and describes the architecture of Decima.

A/B Testing

A/B Testing Basics

In A/B testing, the classic example is changing the color of a button. Say a button is blue, but a PM comes along with a great idea: What would happen if we make it green instead? The blue button is version A, the current version, the control. The green button is version B, the new version, the test. We want to know: Is the green button as awesome as we think? Is it a better experience for our users? Does it lead to better outcomes for our business? To find out, we run an A/B test. We randomly assign some users to see version A and some to see version B. Then we measure a few key outcome metrics for the users in each group. Finally, we use statistical analysis to compare those metrics between the two groups and determine whether the results are significant.

Statistical significance is a formal way of measuring whether a result is interesting. We know that there is natural variability in our users. Not everyone behaves exactly the same way. So, we want to check if the difference between A and B could just be due to chance. Pretend we ran an A/A test instead. We randomly split the users into two groups, but everyone gets the blue button. There is a range of differences (close to zero) that we could reasonably expect to see. When the results of the A/B test are statistically significant, it means they would be highly unusual to see under an A/A test. In that case, we would conclude that the green button did make a difference.

Figure 1. A/B testing – Split users and assign to version A or B. Measure behavior of each group. Use statistical analysis to compare.

Cumulative Outcome Metrics

To shop on zulily, users have to create an account. Requiring our users to be signed in is great for A/B testing, and for analytics in general. It means we can tie together all of a user’s actions via their account id, even if they switch browsers or devices. This makes it easy to measure long-term behaviors, well beyond a single session. And, since we can measure them, we can A/B test for them.

One of the common outcomes we measure at zulily is purchasing. A short-term outcome would be: How much did this user spend in the session when they saw the blue or green button? A long-term outcome would be: How much did the user spend during the A/B test? Whenever a user sees the control or test experience, we say they were exposed. A user can be exposed repeatedly over the course of a test. We accumulate outcome metrics from the first exposure through the end of the test. By measuring cumulative outcomes, we can better understand long-term impact and not be distracted by novelty effects.

Figure 2. Cumulative outcome metrics – Measure each user’s behaviors from their first exposure forward. Users can be exposed multiple times – focus on the first time. Do not count the user’s behaviors before their first exposure.
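
A small pandas sketch of this bookkeeping, with made-up exposure and order logs; note that the order placed before the user’s first exposure is excluded.

import pandas as pd

exposures = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "exposed_at": pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-02"]),
    "treatment":  ["B", "B", "A"],
})
orders = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "ordered_at": pd.to_datetime(["2020-02-28", "2020-03-06", "2020-03-04"]),
    "spend":      [25.0, 40.0, 30.0],
})

# Each user's first exposure defines where their outcome window starts.
first_exposure = (exposures.groupby("user_id", as_index=False)
                           .agg(first_exposed_at=("exposed_at", "min"),
                                treatment=("treatment", "first")))

# Count only spend at or after the first exposure.
joined = orders.merge(first_exposure, on="user_id")
joined = joined[joined["ordered_at"] >= joined["first_exposed_at"]]

spend_per_exposed = (joined.groupby("treatment")["spend"].sum()
                     / first_exposure.groupby("treatment")["user_id"].count())
print(spend_per_exposed)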

Lift

Usually, A/B test analysis measures the difference between version B and version A. For an outcome metric x, the difference between test and control is x_B – x_A. This difference, especially for cumulative outcomes, can increase over time. Consider the example of spend per exposed user. As the A/B test goes on, both groups keep purchasing and accumulating more spend. Version B is a success if the test group’s spend increases faster than the control’s.

Instead of the difference, we measure the lift of B over A. Lift scales the difference by the baseline value. For an outcome metric x, the lift of test over control is (x_B – x_A) / x_A × 100%. We have found that lift for cumulative metrics tends to be stable over time.

Figure 3. Lift over time – Cumulative behaviors increase over time for both A and B, so the difference between them grows too. The lift tends to stay constant, making it a better summary of the results.
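
For concreteness, lift is just the relative difference; the dollar amounts below are made up.

def lift(x_test: float, x_control: float) -> float:
    """Percent lift of the test group (B) over the control (A)."""
    return (x_test - x_control) / x_control * 100.0

# e.g. cumulative spend per exposed user of $12.60 (B) vs $12.00 (A)
print(f"{lift(12.60, 12.00):.1f}% lift")   # 5.0% lift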

Power Analysis

Before starting an A/B test, it is good to ask two questions: What percent of users should get test versus control? And how long will the test need to run? The formal statistical way of answering these questions is a power analysis. First, we need to know the smallest difference (or lift) that would be meaningful to the business; this is called the effect size. Second, we need to know how much the outcome metric typically fluctuates. The power analysis then calculates the sample size: the number of users needed to detect an effect of that size with statistical significance.

There are two components to using the sample size. The split is the fraction of users in test versus control, and it affects the sample size needed. The duration of the test is however long it takes to expose that many users. Since users can come back and be exposed again, the cumulative number of exposed users grows more slowly as time goes on. Purely mathematically, the more unbalanced the split (the further from 50-50 in either direction), the longer the test. Likewise, the smaller the effect size, the longer the test.
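
As a rough illustration, a two-proportion power calculation can be done with statsmodels; the baseline rate, lift, and targets below are hypothetical.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: 10% baseline conversion, 0.5 point absolute lift,
# 5% significance, 80% power, 50-50 split (ratio = 1).
effect_size = proportion_effectsize(0.105, 0.100)
n_per_group = NormalIndPower().solve_power(effect_size=effect_size,
                                           alpha=0.05, power=0.8, ratio=1.0)
print(f"users needed per group: {n_per_group:,.0f}")

The number of exposed users needed per group, combined with the split and the rate at which new users are exposed, gives a rough test duration.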

Size + Time – Practical Considerations

Often the power analysis doesn’t tell the whole story. For example, at zulily we have a strong weekly cycle – people shop differently on weekends from weekdays. We always recommend running A/B tests for at least one week, and ideally in multiples of seven days. Of course, if the results look dramatically negative after the first day or two, it is fine to turn off the test early.

The balance of the split affects the length of the test run, but we also consider the level of risk. If we have a big program with lots of moving parts, we might start with 90% control, 10% test. On the flip side, if we want to make sure an important feature keeps providing lift, we might maintain a holdout with 5% control, 95% test. But, if we have a low risk test, such as a small UI change, a split at 50% control, 50% test will mean shorter testing time.

A/B Split

Goals for the A/B Split

There are three key properties that any splitting strategy should have. First, the users should be randomly assigned to treatments. That way, all other characteristics of the users will be approximately the same for each of the treatment groups. The only difference going in is the treatment, so we can conclude that any differences coming out were caused by the treatment. Second, the treatments for each A/B test should be assigned independently from all other A/B tests. That way, we can run many A/B tests simultaneously and not worry about them interfering with each other’s results. Of course, it wouldn’t make sense to apply two conflicting tests to the same feature at the same time. Third, the split should be reproducible. The same user should always be assigned to the same treatment of a test. The treatment shouldn’t vary randomly from page to page or from visit to visit.

Our Strategy

At zulily, our splitting strategy is to combine the user id with the test name and apply a hash function. The result is a bucket number for that user in that test. We often set up more buckets than treatments. This provides the flexibility to start with a small test group and later increase it by moving some of the buckets from control to test.

Our splitting strategy has all three key properties. First, the hash produces pseudo-random bucketing. Second, by including the test name, the user will get independent buckets for different tests. Third, the bucket is reproducible because the hash function is deterministic.

The hash is very fast to compute, so developers don’t have to worry about the A/B split slowing down their code. To implement a test, at the decision point in the code the developer places a call to our standard test lookup function with the test name and user id. It returns the bucket number and treatment name, so the user can be directed to version A or version B. Behind the scenes, the test lookup function generates a clickstream log with the test name, user id, timestamp, and bucket. We on the Data Science team use the clickstream records to know exactly who was exposed to which test when and which treatment they were assigned.
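
A minimal sketch of this style of split (not zulily’s actual lookup function; the hash choice and bucket counts are illustrative):

import hashlib

NUM_BUCKETS = 100   # more buckets than treatments, so test groups can be grown later

def bucket_for(user_id: int, test_name: str) -> int:
    # Deterministic hash of test name + user id: pseudo-random, reproducible,
    # and independent across tests because the test name is part of the key.
    key = f"{test_name}:{user_id}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_BUCKETS

def treatment_for(user_id: int, test_name: str, test_buckets: int = 50) -> str:
    # e.g. with test_buckets=50: buckets 0-49 -> test (B), the rest -> control (A)
    return "B" if bucket_for(user_id, test_name) < test_buckets else "A"

print(treatment_for(12345, "green_button_test"))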

Audience v. Exposure

There are two main ways to assign users to an A/B test: using an audience or exposure. In an audience-based test, before the test launches we create an audience – a group of users who should be in the test – and randomly split them into control and test. Then we measure all of those users’ behavior for the entire test period. This is straightforward but imprecise. Not everyone in the audience will actually be touched by the A/B test. The results are statistically valid, but it will be more difficult to detect an effect due to the extra noise.

Instead, we prefer exposure-based testing. The user is only assigned to a treatment when they reach the feature being tested. The number of exposed users increases as the test runs. The only users in the analysis are those who could have been impacted by the A/B test, so it is easier to detect a lift. In addition, we only measure the cumulative outcomes starting from each user’s first exposure. This further refines the results by excluding anything a user might have done before they had a chance to be influenced by the test.

Figure 4. Audience v exposure – While both statistically valid, exposure-based tests avoid sources of noise and can detect smaller effects.

Decima UI

A Bit of Roman Mythology

The ancient Romans had a concept of the Three Fates. These were three women who control each mortal’s thread of life. First, Nona spins the thread, then Decima measures it, and finally Morta cuts it when the life is over. We named our A/B test analysis system Decima because it measures all of the live tests at zulily.

Figure 5. Three Fates – In ancient Roman mythology the Three Fates control the thread of life. Decima’s role is to measure it.

Decima UI

The Decima UI is the face of the system to internal users. These include PMs, analysts, developers, and anyone else interested in the results of an A/B test. It has two main sections: the navigation and information panel and the results panel. Figure 6 shows a screenshot of Decima displaying a demo A/B test.

Figure 6. Decima UI – At zulily, Decima displays the results of A/B tests. The left panel is for navigation and information. The main panel shows the results for each outcome metric.

Navigation + Information

The navigation and information panel is on the left. A/B tests are organized by Namespace or area of the business. Within a namespace, the Experiment drop-down lists the names of all live tests. The Platform drills down to just exposures and outcomes that occurred on that platform or group of platforms (all-apps, mobile-web, etc). The Segmentation drills down to users in a particular segment (new vs existing, US vs international, etc).

The date information shows the analysis_start_date and analysis_end_date. The results are for exposures and outcomes that occurred in this date range, inclusive. The n_days shows the length of the date range. The analysis_run_date shows the timestamp when the results were computed. For live tests, the end date is always yesterday and the run date is always this morning.

Results

The main panel displays the results for each outcome metric. We analyze whether the lift is zero or statistically significantly different from zero. If a lift is significant and positive, it is colored green. If it is significant and negative, it is colored orange. If it is flat, it is left gray. The plot shows the estimated lift and its 95% confidence interval. It is easy to see whether or not the confidence interval contains zero.

The table shows the average (or proportion for a binary outcome), standard deviation, and sample size for each treatment group. Based on the statistical analysis, it shows the estimated lift, confidence interval bounds, and p-value for comparing each test group to the control.

Figure 7. Single metric results – Zoom in on one metric in the Decima UI. The plot shows the 95% confidence interval for lift. The table shows summary numbers and statistical results.

Common Metrics

We use a variety of outcome metrics depending on the goal of the new feature being tested. Our core metrics include purchasing and visiting behaviors. Specifically, spend per exposed approximates the impact of the test on our top line. For each exposed user, we measure the cumulative spend (possibly zero) between their first exposure date and the analysis end date. Then we average this across all users in each treatment group. Spend per exposed can be broken down into two components: chance of purchase and spend per purchaser. Sometimes a test might cause more users to purchase but spend lower amounts, or vice versa. Spend per exposed combines the two to capture the overall impact. Revisit rate measures the impact of the test on repeat engagement. For each exposed user, we count the number of days they came back after their first exposure date. We have found that visit frequency is a strong predictor of future behaviors, months down the road.

Figure 8. Common outcome metrics. Spend per exposed can be broken into chance of purchase and demand per purchaser. Revisit rate is a proxy for long-term behavior.

Decima Architecture

Three Modules of Decima

Decima is composed of three main modules, each named after a famous contributor to the field that corresponds to its role. Codd invented the relational database model, so the codd module assembles the user-level dataset from our data warehouse. Gauss was an influential statistician (the Gaussian, or Normal, distribution is named after him), so the gauss module performs the statistical analysis. Tufte is considered a pioneer in data visualization, so the tufte module displays the results in the Decima UI. Decima runs in Google Compute Engine (GCE), with a separate Docker container for each module.

Codd

The codd module is in charge of assembling the dataset. It is written in Python. It uses recursive formatting to compose the query out of parameterized query components, filling values for the dates, test name, etc. Then it submits the query to the data warehouse in Google BigQuery and exports the resulting dataset to Google Cloud Storage (GCS).

Figure 9. Codd – The codd module of Decima does data assembly.
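
A simplified sketch of that composition step; the table names, fields, and parameters below are invented for illustration and are not the actual decima-meta pieces.

# Reusable, parameterized query pieces (schemas here are illustrative).
EXPOSURES = """
    SELECT user_id, treatment, MIN(event_time) AS first_exposed_at
    FROM `{project}.clickstream.ab_exposures`
    WHERE test_name = '{test_name}'
      AND DATE(event_time) BETWEEN '{start_date}' AND '{end_date}'
    GROUP BY user_id, treatment
"""
OUTCOMES = """
    SELECT user_id, SUM(order_total) AS spend
    FROM `{project}.orders.purchases`
    WHERE DATE(order_time) BETWEEN '{start_date}' AND '{end_date}'
    GROUP BY user_id
"""
DATASET = """
    SELECT e.user_id, e.treatment, IFNULL(o.spend, 0) AS spend
    FROM ({exposures}) AS e
    LEFT JOIN ({outcomes}) AS o USING (user_id)
"""

params = {"project": "my-project", "test_name": "green_button_test",
          "start_date": "2020-03-01", "end_date": "2020-03-14"}

# Fill the pieces, then fill the outer query with the completed pieces.
sql = DATASET.format(exposures=EXPOSURES.format(**params),
                     outcomes=OUTCOMES.format(**params))
print(sql)

# codd would then submit the query and export the result, e.g.:
#   from google.cloud import bigquery
#   df = bigquery.Client().query(sql).to_dataframe()   # then written to GCS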

Gauss

The gauss module takes care of the statistical analysis. It is written in R. It imports the dataset produced by codd from GCS into a data.table. It loops through the outcome metrics and performs the statistical test for lift for each one using speedglm. It also loops through platforms and segmentations to generate results for the drill downs. Finally, it gathers all the results and writes them out to a file in GCS.

Figure 10. Gauss – The gauss module of Decima does statistical analysis.

Tufte

The tufte module serves the result visualizations. It is also written in R. It imports the results file produced by gauss from GCS. It creates the tables and plots for each metric in the test using ggplot2. It displays them in an interactive UI using shiny. The UI is hosted in GCE and can be accessed by anyone at zulily.

Figure 11. Tufte – The tufte module of Decima does data visualization.

Decima Meta

The fourth module of Decima is decima-meta. It doesn’t contain any software, just queries and configuration files. The queries are broken down into reusable pieces. For example, the exposure query and outcome metrics query can be mixed and matched. Each query piece has parameters for frequently changed values, such as dates or test ids. The configuration files are written in JSON and there is one per A/B test. They specify all the query pieces and parameters for codd, as well as the outcome metrics for gauss. The idea is: running an A/B test analysis should be as easy as adding a configuration file for it.

About the Author

Julie Michelman is a Data Scientist at zulily. She designs and analyzes A/B tests, utilizing Decima, the in-house A/B test analysis tool she helped build. She also builds machine learning models that are used across the business, including marketing, merchandising, and the recommender system. Julie holds a Master’s in Statistics from the University of Washington.

Image Sources

Figure 1. https://www.freepik.com/free-icon/multiple-users-silhouette_736514.htm, https://en.wikipedia.org/wiki/Normal_distribution
Figure 5. http://bytesdaily.blogspot.com/2015/12/some-december-trivia.html
Figure 9. https://en.wikipedia.org/wiki/Edgar_F._Codd
Figure 10. https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
Figure 11. https://en.wikipedia.org/wiki/Edward_Tufte

Calculating Ad Performance Metrics in Real Time

Authors: Sergey Podlazov, Rahul Srivastava

zulily is a flash sales company. We post a product on the site, and poof… it’s gone in 72 hours. Online ads for those products come and go just as fast, which doesn’t leave us much time to manually evaluate the performance of the ads and take corrective action if needed. To optimize our ad spend, we need to know in real time how each ad is doing, and this is exactly what we engineered.

While we track multiple metrics to measure the impact of an ad, I am going to focus on one that provides a good representation of the system architecture. This is an engineering blog after all!

The metric in question is Cost per Total Activation, or CpTA in short.  The formula for the metric is this:  divide the total cost of the ad by the number of customer activations.  We call the numerator in this formula “spend” and refer to the denominator as an “activation”.  For example, if an ad costs zulily $100 between midnight and 15:45 PST on January 31 and results in 20 activations, the CpTA for this ad as of 15:45 PST is $100/20 = $5.

Here’s how zulily collects this metric in real time. For the sake of simplicity, I will skip the archiving processes that are sprinkled on top of the architecture below.

[Figure: real-time ad performance metrics architecture]

The source of the spend for the metric is an advertiser API, e.g. Facebook.  We’ve implemented a Spend Producer (in reference to the Producer-Consumer model) that queries the API every 15 minutes for live ads and pushes the spend into a MongoDB.  Each spend record has a tracking code that uniquely identifies the ad.

The source for the activations is a Kafka stream of purchase orders that customers place with zulily. We consume these orders and throw them into an AWS Kinesis stream. This gives us the ability to process and archive the orders without putting extra strain on Kafka. It’s important to note that relevant orders also carry the ad’s tracking code, just like the spend. That’s the link that glues spend and activations together.

The Activation Evaluator application examines each purchase and determines if the purchase is an activation.  To do that, it looks up the previous purchase in a MongoDB collection for the customer Id on the purchase order.  If the most recent transaction is non-existent or older than X days, the purchase is an activation.  The Activation Evaluator updates the customer record with the date of the new purchase.  To make sure that we don’t drop any data if the Activation Evaluator runs into issues, we don’t move the checkpoint in the Kinesis stream until the write to Mongo is confirmed.
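
A rough sketch of that evaluation logic with pymongo; the collection names, field names, and activation window are illustrative stand-ins, not the actual implementation.

from datetime import datetime, timedelta
from pymongo import MongoClient

ACTIVATION_WINDOW_DAYS = 90   # the "X days" from the text; the real value is a business choice

customers = MongoClient().ads_db.customer_last_purchase   # hypothetical collection

def is_activation(order):
    """Decide whether a purchase order counts as an activation."""
    previous = customers.find_one({"customer_id": order["customer_id"]})
    cutoff = order["purchased_at"] - timedelta(days=ACTIVATION_WINDOW_DAYS)
    activation = previous is None or previous["last_purchase_at"] < cutoff

    # Record the new purchase date for the next evaluation.
    customers.update_one(
        {"customer_id": order["customer_id"]},
        {"$set": {"last_purchase_at": order["purchased_at"]}},
        upsert=True,
    )
    # Only after this write succeeds would the Kinesis checkpoint be advanced.
    return activation

print(is_activation({"customer_id": 42,
                     "tracking_code": "fb-123",
                     "purchased_at": datetime(2018, 1, 31, 15, 45)}))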

The Activation Evaluator sends evaluated purchases into another Kinesis stream. Chaining Kinesis streams is a pretty common pattern for AWS applications, as it allows for separation of concerns and makes the whole system more resilient to failures of individual components.

The Activation Calculator reads the evaluated purchases from the second Kinesis stream and captures them in Mongo.  We index the data by tracking code and timestamp, and voila, a simple count() will return the number of activations for a specified period.

The last step in the process is to take the Spend and divide it by the activations.  Done.
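
Putting those last two steps together, a sketch of the CpTA lookup might look like this (again, the database, collection, and field names are hypothetical):

from datetime import datetime
from pymongo import MongoClient

db = MongoClient().ads_db      # hypothetical database holding spend and activations

def cpta(tracking_code, start, end):
    """Cost per Total Activation for one ad over a time window."""
    spend = sum(doc["amount"] for doc in db.spend.find(
        {"tracking_code": tracking_code,
         "captured_at": {"$gte": start, "$lte": end}}))
    activations = db.activations.count_documents(
        {"tracking_code": tracking_code,
         "purchased_at": {"$gte": start, "$lte": end}})
    return spend / activations if activations else None

# Usage, mirroring the $100 / 20 activations = $5 example above:
print(cpta("fb-123", datetime(2018, 1, 31), datetime(2018, 1, 31, 15, 45)))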

With this architecture, zulily measures a key advertising performance metric every 15 minutes and uses it to pause poorly-performing ads. The metric also serves as an input for various Machine Learning models, but more on those in a future blog post… Stay tuned!!

Sampling keys in a Redis cluster

We love Redis here at zulily. We store hundreds of millions of keys across many Redis instances, and we built our own internal distributed cache on top of Redis which powers the shopping experience for zulily customers.

One challenge when running a large, distributed cache using Redis (or many other key/value stores for that matter) is the opaque nature of the key spaces. It can be difficult to determine the overall composition of your Redis dataset, since most Redis commands operate on a single key. This is especially true when multiple codebases or teams use the same Redis instance(s), or when sharding your dataset over a large number of Redis instances.

Today, we’re open sourcing a Go package that we wrote to help with that task: reckon.

reckon enables us to periodically sample random keys from Redis instances across our fleet, aggregate statistics about the data contained in them — and then produce basic reports and metrics.

While there are some existing solutions for sampling a Redis key space, the reckon package has a few advantages:

Programmatic access to sampling results

Results from reckon are returned in data structures, not just printed to stdout or a file. This is what allows a user of reckon to sample data across a cluster of redis instances and merge the results to get an overall picture of the keyspaces. We include some example code to do just that.

Arbitrary aggregation based on key and redis data type

reckon also allows you to define arbitrary buckets based on the name of the sampled key and/or the Redis data type (hash, set, list, etc.). During sampling, reckon compiles statistics about the various redis data types, and aggregates those statistics according to the buckets you defined.

Any type that implements the Aggregator interface can instruct reckon about how to group the Redis keys that it samples. This is best illustrated with some simple examples:

To aggregate only Redis sets whose keys start with the letter a:


// setsThatStartWithA buckets only Redis sets whose keys begin with "a".
func setsThatStartWithA(key string, valueType reckon.ValueType) []string {
  if strings.HasPrefix(key, "a") && valueType == reckon.TypeSet {
    return []string{"setsThatStartWithA"}
  }
  return []string{}
}

To aggregate sampled keys of any Redis data type that are longer than 80 characters:


// longKeys buckets sampled keys of any Redis data type that are longer than 80 characters.
func longKeys(key string, valueType reckon.ValueType) []string {
  if len(key) > 80 {
    return []string{"long-keys"}
  }
  return []string{}
}

HTML and plain-text reports

When you’re done sampling, aggregating and/or combining the results produced by reckon you can easily produce a report of the findings in either plain-text or static HTML. An example HTML report is shown below:

a sample report showing key/value size distributions

The report shows the number of keys sampled, along with some example keys and elements of those keys (the number of example keys/elements is configurable). Additionally, a distribution of the sizes of both the keys and elements is shown — in both standard and “power-of-two” form. The power-of-two form shows a more concise view of the distribution, using a concept borrowed from the original Redis sampler: each row shows a number p, along with the number of keys/elements that are <= p and > p/2.

For instance, using the example report shown above, you can see that:

  • 68% of the keys sampled had key lengths between 8 and 16 characters
  • 89.69% of the sets sampled had between 16 and 32 elements
  • the mean number of elements in the sampled sets is 19.7

We have more features and refinements in the works for reckon, but in the meantime, check out the repo on github and let us know what you think. The codebase includes several example binaries to get you started that demonstrate the various usages of the package.

Pull requests are always welcome — and remember: Always be samplin’.

Simulating Decisions to Improve Them

One of the jobs of the Data Science team is to help zulily make better decisions through data. One way that manifests itself is via experimentation. Like most ecommerce sites, zulily continuously runs experiments to improve the customer experience. Our team’s contribution is to think about the planning and analysis of those tests to make sure that when the results are read they are trustworthy and that ultimately the right decision is made.

Coming in hot

As a running example throughout this post, consider a landing page experiment.  At zulily, we have several landing pages that are often the first thing a visitor sees after they click an advertisement on a third-party site. For example, if a person was searching for pet-related products, and they clicked on one of zulily’s ads, they might land here.  Note: while that landing page is real, all the underlying data in this post is randomly generated.

The experiment is to modify the landing page in some way to see if conversion rate is improved.  Hopefully data has been gathered to motivate the experiment but, please, just take this at face-value.

Any single landing page is not hugely critical, but in aggregate they’re important for zulily, and small improvements in conversion rates or other metrics can have a large impact on the bottom line. In this example, the outcome metric (what we are trying to improve) is conversion rate, which is simply the number of conversions divided by the total number of visitors.

Thinking with the End in Mind

Planning an experiment consists of many things, but often the most opaque part is the set of implications associated with power. The simple definition of power: given some effect of the treatment, how likely is it that the effect will be detected? The implication here is that the more confident one wants to be in the ability to detect the change, the longer the test needs to run. How long the test needs to run has a direct bearing on the number of tests a company can run as a whole and which tests should be given priority… unless you don’t care about conflating treatments, but then you have bigger problems :).

Power analysis, however, is a challenge. For anything beyond a simple A/B test, a lot needs to be thought through to determine the appropriate statistical test. Therefore, it is often easier to think about the data first, then work backwards through the analysis, and then the power.

Ultimately power analysis boils down to the simple question: for how long does a test need to run?

To illustrate this, consider the example experiment, where the underlying conversion rate for landing page A (the control) is 10%, and the expected conversion rate of the treatment is 10.5%. While these are the underlying conversion rates, due to randomness the realized conversion rate will likely be different, but hopefully close.

Imagine that each page is a coin, and each time a customer lands on the page the coin is flipped. Even though it’s known a priori that the underlying conversion rate for page A is 10%, if the coin is flipped 1000 times, it’s unlikely that it will come up “heads” exactly 100 times. If you were to run two tests, you would get two different results, even though all the variables are the same.

The rest of the post walks through the analysis of a single experiment, then describes how to expand that single experiment analysis into an analysis of the decision making process, and finally discusses a couple examples of complications that often arise in testing and how they can be incorporated into the analysis of the decision making.

A Single Experiment

For instance: flip the “landing page” coin, so to speak, 1000 times for each page, A and B. In this one experiment, the realized conversion rates (for simulated data) are shown in the bar chart below.

[Figure: realized conversion rates for landing pages A and B in one simulated experiment]

That plot sure looks convincing, but just looking at the plot is not a sufficient way to analyze a test. Think back to the coin flipping example; since the difference in the underlying “heads” rate was only 0.5%, or 5 heads per 1000 flips, it wouldn’t be too surprising if A happened to have more heads than B in any given 1000 flips.

The good news is that statistical tools exist to help us understand how likely it is that the observed difference reflects a true underlying difference in rates rather than randomness.

The data collected would look something like:

[Table: one row per visit, with columns Treatment (A or B) and Converted (0 or 1)]

where Treatment is the landing page treatment, and Converted is 0 for a visit without a conversion and 1 for a visit with a conversion.  For an experiment like this, that has a binary outcome, the statistical tool to choose is logistic regression.

Now we get to see some code!

Assuming the table from above is represented by the “visits” dataframe, the model is very simple to fit in statsmodels.
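
A minimal sketch of that fit, building a stand-in visits dataframe at the underlying rates from the example:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in 'visits' dataframe: 1,000 visits per page at the underlying
# rates from the example (10% for A, 10.5% for B).
rng = np.random.default_rng(0)
visits = pd.DataFrame({
    "Treatment": ["A"] * 1000 + ["B"] * 1000,
    "Converted": np.concatenate([rng.binomial(1, 0.100, 1000),
                                 rng.binomial(1, 0.105, 1000)]),
})

result = smf.logit("Converted ~ C(Treatment)", data=visits).fit()
print(result.summary())   # look at the C(Treatment)[T.B] row: coef and P>|z|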

The output:

[Output: statsmodels summary table for the logistic regression fit]

This is a lot of information, but the decision will likely be based on only one number: in the second table, the intersection of the “C(Treatment)[T.B]” row and the “P>|z|” column. This is the p-value: roughly, the probability of seeing a difference at least this large if the two conversion rates were actually the same. The convention is that if that value is less than .05, the difference is significant. Another number worth mentioning is the coefficient of the treatment, which estimates how large the change is. It matters because even a significant result could come with a negative coefficient, and acting on that would be worse than simply failing to adopt a better page, since we’d be rolling out a worse landing page.

In this case the p-value is greater than .05, so the decision would be that the observed difference is not indicative of an actual difference in conversion rates. This is clearly the wrong decision: the conversion rates were specified as being different, but that difference cannot be statistically detected.

This is ultimately the challenge with power and sample sizes. Had the experiment been run again, with a larger sample, it is possible that we would have detected the change and made the correct decision.  Unfortunately, the planning was done incorrectly and only 1000 samples were taken.

Always Be Sampling

Although the wrong decision was made in the last experiment, we want to improve our decision making.  It is possible to analyze our analysis through simulation. It is a matter of replicating — many times over — the analysis and decision process from earlier. Then it is possible to find out how often the correct decision would be made given the actual difference in conversion rate.

Put another way, the task is to:

  1. Generate a random dataset for the treatment and control groups based on the expected conversion rates.
  2. Fit the same model used in the example above.
  3. Measure the outcome based on the decision criteria; here it will simply be a significant p-value (< 0.05).

And now be prepared for the most challenging part: the simulation. These three steps are going to be wrapped in a for loop, and the outcome is collected in an array. Here’s a simple example in python of how that could be carried out.
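
A sketch of that loop might look like the following; the decision check is the significant p-value from step 3, plus a check that the estimated effect is positive:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def run_once(n_per_page, p_a=0.100, p_b=0.105, alpha=0.05):
    # 1. Generate a random dataset for each page at its expected conversion rate.
    visits = pd.DataFrame({
        "Treatment": ["A"] * n_per_page + ["B"] * n_per_page,
        "Converted": np.concatenate([np.random.binomial(1, p_a, n_per_page),
                                     np.random.binomial(1, p_b, n_per_page)]),
    })
    # 2. Fit the same logistic regression used in the analysis above.
    result = smf.logit("Converted ~ C(Treatment)", data=visits).fit(disp=0)
    # 3. Correct decision: a significant p-value *and* a positive coefficient for B.
    return (result.pvalues["C(Treatment)[T.B]"] < alpha
            and result.params["C(Treatment)[T.B]"] > 0)

decisions = [run_once(1000) for _ in range(500)]
print("proportion of correct decisions:", np.mean(decisions))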

Running that experiment 500 times, with a sample size of 1000, would yield a correct decision roughly 4.2% of the time.  Ugh.

This is roughly the power of the experiment at a sample size of 1000. If we ran this experiment 500 times, we would rarely make the correct decision. To correct this, we need to change the experiment plan to gather more samples.

To get a sense for the power at different sample sizes, we choose several possible sample sizes, then run the above simulation for each of them. Now there are two for loops: one to iterate through the sample sizes, and one to carry out the analysis above.

Here is the outcome of the decisions at various sample sizes.  The “Power” column is the proportion of time a correct decision was made at the specified sample size.

[Table: proportion of correct decisions (power) at each sample size]

Not until 10^4.5 samples — roughly 31,000 — does the probability of making the correct decision become greater than 50%. It is now a matter of making the business decision about how important it is to detect the effect. Typically the target power is around 80%, in the same way that the significance level is normally around 5%… convention. It would be easy to repeat this test for several intermediate sample sizes, between 10^4.5 and 10^5, to determine a sample size that has the power the business is comfortable with.

Uncertainty in Initial Conversion Rate

The outcome of the experiment was given a lot of criticism, but the underlying conversion rate was (more or less) taken for granted. The problem is there’s probably a lot of error in the estimated effect of the treatment before the experiment, and some error in the estimated effect of the control, since the control is based on past performance and the treatment is based on a combination of analysis and conjecture.

For example, say we had historical data that indicated that 1,000 out of 10,000 people had converted for the control thus far, and we ran a similar test to the control recently, so we have some confidence that 105 out of 1,000 people would convert.

If that were the prior information for each page, the distribution of conversion rates for each landing page over 1,000 experiments could look like this:

[Figure: simulated conversion rate distributions for landing pages A and B]

Even though it appears that landing page B does have a higher conversion rate on average, its distribution around that average is much wider. To factor in that uncertainty, we can rerun the simulation, but instead of assuming fixed conversion rates, we sample each page’s conversion rate from its distribution before generating each simulated dataset.
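
One way to implement that sampling is with Beta distributions fit to the historical counts above (a sketch; run_once is the simulation step from earlier):

import numpy as np

# Beta priors from the historical counts described above (hypothetical numbers):
#   control:   1,000 conversions out of 10,000 visitors
#   treatment:   105 conversions out of  1,000 visitors
def draw_conversion_rates():
    p_a = np.random.beta(1 + 1_000, 1 + 9_000)   # landing page A
    p_b = np.random.beta(1 + 105, 1 + 895)       # landing page B
    return p_a, p_b

# Inside the simulation loop, draw fresh rates before generating each dataset:
#   p_a, p_b = draw_conversion_rates()
#   correct = run_once(n_per_page, p_a=p_a, p_b=p_b)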

Sadly, our power is destroyed by the randomness associated with the uncertainty in the two landing pages’ conversion rates. Here’s the same power-by-sample-size table as above, showing the proportion of times we’d make the correct decision. For example, even at 10^5.0 samples it is likely that, in many simulated experiments, the drawn conversion rate for landing page B was actually lower than landing page A’s.

[Table: power by sample size when the underlying conversion rates are sampled from their prior distributions]

An alternative route in a situation like this is to use a beta-binomial model to continue incorporating additional data into the estimates of the initial conversion rates.

More Complicated Experiments

The initial example was a very simple test, but more complex tests are often useful.  With more complex experiments, the framework for planning needs to expand to facilitate better decision making.

Consider a similar example to the original one with an additional complication. Since the page is a landing page, the user had to come from somewhere. These sources of traffic are also sources of variation. Just as any realized experiment can vary from the expectation, any given source’s underlying conversion rate can also vary from the expectation. In the face of that uncertainty, it would be a good idea to run the test on multiple ads.

To simplify our assumptions, consider that the expected change in conversion rate is still 0.5%, but across three ads the conversion rate varies individually by -0.01%, 0.00% and +0.01% due to the individual ad-level characteristics.

For example, this could be the outcome of one possible experiment with two landing pages and three ads.

[Table: example visit-level data with columns for Treatment, Ad, and Converted]

Thankfully statsmodels has a consistent API so just a few things need to change to fit this model:

  • Use gee instead of logit. GEE stands for generalized estimating equations. It allows a GLM to be fit with correlation within groups, or, for these purposes, a logit regression with the group-level variation taken into consideration.
  • Pass the groups via the “groups” argument.
  • Specify the family of the GLM; here it’s binomial with a logit link function (the default argument).

Those changes would look like this:

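(A minimal sketch with stand-in data; the column names Converted, Treatment, and Ad are assumptions based on the description above.)

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Stand-in data: three ads, each shifting the underlying rates slightly.
rng = np.random.default_rng(0)
rows = []
for ad, shift in [("ad_1", -0.0001), ("ad_2", 0.0), ("ad_3", 0.0001)]:
    for treatment, base in [("A", 0.100), ("B", 0.105)]:
        converted = rng.binomial(1, base + shift, 1000)
        rows.append(pd.DataFrame({"Ad": ad, "Treatment": treatment, "Converted": converted}))
visits = pd.concat(rows, ignore_index=True)

# GEE: a logit regression that accounts for correlation within each ad.
model = smf.gee("Converted ~ C(Treatment)", groups="Ad", data=visits,
                family=sm.families.Binomial())
print(model.fit().summary())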

The decision criterion here is similar to the first case: the effect is not significant, so we cannot say anything about the effect of the landing page, and this test would likely not roll out. Now that the basic model is constructed, we follow the same process as before to estimate how much power the experiment would have at various sample sizes.

Conclusion

Experiments are challenging to execute well, even with these additional tools. The groups that are large enough to necessitate testing are normally also large and complex enough that wrong decisions can be made. Through simulation and thinking carefully about the decision-making process, it is possible to quantify how often a wrong decision could occur, what its impact would be, and how best to mitigate the problem.

(By the way, zulily is actively looking for someone to make experimentation better, so if you feel that you qualify, please apply!)