Authors: Arun Selvadhurai, Julian Kim, Sergey Podlazov, Tin Liu, Vinay Yerramilli
In Marketing Tech, one of our jobs is to tell customers about zulily offers. These days, everything and everyone is going mobile, and Mobile Push notifications are a great way to reach customers.
Our team faced a double-sided challenge. Imagine that you have to ferry passengers across a river. There’ll be times when only one or two passengers show up every second or so, but they need to make it across as soon as possible. Under other circumstances, two million passengers will show up at once, all demanding an immediate transfer to the opposite bank.
One way to solve this is to build one big boat and a bunch of small boats and use them as appropriate. While this works, the big boat will sit idle most of the time. If we build only the big boat, we can easily handle the crowds, but it will cost a fortune to ferry individual passengers. Two million small boats alone won't work either, because they would probably clog the entire length of the river.
Fortunately, in the world of software we can solve this challenge by building a boat that scales. Unlike the Lambda architecture with two different code paths for real-time and batch processing, an auto-scaling system offers a single code path that can handle one or one million messages with equal ease.
Let’s take a look at the system architecture diagram.
Campaigns and one-offs are passengers. In the case of a campaign, we have to send potentially millions of notifications in a matter of minutes. One-offs arrive randomly, one at a time.
An AWS Kinesis Stream paired with a Lambda function makes a boat that scales. While we do need to provision both with enough capacity to process peak loads, we only pay for what we use with Lambda, and Kinesis is dirt-cheap.
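To make this concrete, here is a minimal sketch of what a Lambda function consuming a Kinesis batch might look like. The field names inside the payload (`device_token`, `message`) and the `send_push` call are illustrative assumptions, not our actual implementation; the `Records[].kinesis.data` base64 envelope is the standard shape of a Kinesis event delivered to Lambda.

```python
import base64
import json

def handler(event, context):
    """Hypothetical sketch of a Lambda handler draining a Kinesis batch.
    Payload field names are illustrative; the real send logic is omitted."""
    sent = 0
    for record in event["Records"]:
        # Kinesis delivers each record's data base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # send_push(payload["device_token"], payload["message"])  # actual send would go here
        sent += 1
    return {"sent": sent}

# Example invocation with a minimal fake Kinesis event:
fake_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(
            json.dumps({"device_token": "abc", "message": "Sale!"}).encode()
        ).decode()}}
    ]
}
print(handler(fake_event, None))  # → {'sent': 1}
```

Because Lambda scales out with the number of shards and the batch size, this single code path handles one record and one million records the same way.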
We also ensure that the boat doesn't ferry the same passenger multiple times, which would result in an awful customer experience (just imagine your phone beeping every few minutes). To solve this problem, we built a Frequency Cap service on top of Redis, which gives us a response time under 50ms per message. Before the code attempts to send a notification, it asks the Frequency Cap service whether a send has already been attempted. If it has, the message is skipped; otherwise, the message is marked as "Send Attempted". It's important to note that the call to the Frequency Cap API is made before the actual send is attempted. This sequence prevents the scenario where we send the message but fail to mark it accordingly due to a system failure.
Another interesting challenge worth explaining is how we line up millions of passengers to board the boat efficiently. Imagine that they all arrive without a ticket, and ticketing times vary. Yet the boat departs at an exact time that cannot be changed. We solve this by ticketing in advance (the Payload Builder EMR service) and gathering passengers in a waiting area (files in S3). At the appointed time, we open multiple doors from the waiting area (multithreading in the Kinesis Loader Java service), and the passengers make their way onto the boat (the Kinesis Stream). The AWS Step Functions service connects the Payload Builder and the Kinesis Loader into a single workflow.
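The "open multiple doors" step can be sketched as worker threads draining a shared queue of pre-built payload batches. This is a simplified Python illustration of the pattern (the real loader is a Java service); the batches stand in for S3 files, and `put_record` is a stub for the Kinesis PutRecords API call.

```python
import queue
import threading

def load_to_stream(batches, num_threads=4):
    """Sketch of the loader's fan-out: worker threads drain a shared queue
    of pre-built payload batches (stand-in for S3 files) and push each
    record to the stream. put_record is a stub for the real Kinesis call."""
    work = queue.Queue()
    for batch in batches:
        work.put(batch)

    loaded = []
    lock = threading.Lock()

    def put_record(record):
        # Stand-in for kinesis.put_records(...); collect locally instead.
        with lock:
            loaded.append(record)

    def worker():
        # Each thread pulls batches until the queue is empty.
        while True:
            try:
                batch = work.get_nowait()
            except queue.Empty:
                return
            for record in batch:
                put_record(record)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loaded

records = load_to_stream([["a", "b"], ["c"], ["d", "e", "f"]])
print(sorted(records))  # → ['a', 'b', 'c', 'd', 'e', 'f']
```

Because the payloads are fully built ahead of time, the only work left at departure time is I/O, which parallelizes cleanly across threads.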
In summary, we built a system that can handle one or one million Mobile Push notifications with equal ease. We achieved this by combining batch and streaming architecture patterns and adding a service to prevent duplicate sends. We also did some cool stuff in the Payload Builder service to personalize each notification, so check back in a few weeks for a new post on that.