Leveraging Serverless Tech without falling into the “Ownerless” trap

The appeal of serverless architecture sometimes feels like a silver bullet: offload your application code to your favorite cloud provider’s serverless offering and enjoy the benefits of extreme scalability, rapid development, and pay-for-use pricing. At this point, most cloud providers’ solutions and tooling are well defined, and you can get your code executing in a serverless fashion with little effort.

The waters start to get a little murkier when considering making an entire application serverless while maintaining the coding standards and maintainability of traditional server-based architecture. Serverless has a tendency to become ownerless (aka hard to debug and painful to work on) over time without the proper structure, tooling and visibility. Eventually these applications suffer due to code fragility and slow development time, despite how easy and fast they were to prototype.

On the pricing team at Zulily, our core services are written in Java using the Dropwizard framework. We host on AWS using Docker and Kubernetes and deploy using Gitlab’s CICD framework. For most of our use cases this setup works well, but it’s not perfect for everything. We do feel confident, though, that it provides the structure and tools that help us avoid the ownerless trap and achieve our goals of well-engineered service development. While this list of goals will differ between teams and projects, ours can be summarized as follows:

Goals:

  • Testability (Tests & Local debugging & Multiple environments)
  • Observability (Logging & Monitoring)
  • Developer Experience (DRY & Code Reviews & Speed & Maintainability)
  • Deployments
  • Performance (Latency & Scalability)

We are currently redesigning our competitive shopping pipeline and thought this would be a good use case to experiment with a serverless design. For our use case we were most interested in its ability to handle unknown and intermittent load and were also curious to see if it improved our speed of development and overall maintainability. We think we have been able to leverage the benefits of a serverless architecture while avoiding some of the pitfalls. In this post, I’ll provide an overview of the choices we made and lessons we learned.

OUR STACK IN A NUTSHELL AND HOW IT WORKS

There are many serverless frameworks: some are geared towards being platform agnostic, others focus on providing better UIs, and some make deployment a breeze. We found most of these frameworks were abstraction layers on top of AWS/GCP/Azure, and since we are primarily an AWS-leaning shop at Zulily, we decided to stick with AWS’s native tools where we could, knowing that we could swap out components later if necessary. Some of these tools were already familiar to the team, and we wanted to be able to take advantage of new AWS releases without waiting for an abstraction layer to implement them.

Basic architecture of serverless stack:

Below is a breakdown of the major components/tools we are using in our serverless stack:

API Gateway:

What it is: Managed HTTP service that acts as a layer (gateway) in front of your actual application code.

How we use it: API Gateway handles all the incoming traffic and outgoing responses to our service. We use it for parameter validation, request/response transformation, throttling, load balancing, security and documentation. This allows us to keep our actual application code simple while we offload most of the boilerplate API logic and request handling to API Gateway. API Gateway also acts as a façade layer where we can provide clients with a single service and proxy requests to multiple backends. This is useful for supporting legacy Java endpoints while the system is migrated and updated.

Lambda:

What it is: Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources.

How we use it: We use Lambdas to run our application code using the API Gateway Lambda proxy integration. Our Lambdas are written in Node.js to reduce cold start time and memory footprint (compared to Java). We have a 1:1 relationship between an API endpoint and a Lambda so that we can fine-tune the memory and runtime per Lambda and use API Gateway to its fullest extent, for example, to bounce malformed requests at the gateway level and prevent Lambda execution. We also use Lambda layers, a feature introduced by AWS in 2018, to share common code between our endpoints while keeping the actual application logic isolated. This keeps the deployment size of each Lambda smaller, and we don’t have to replicate shared code.
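As a rough illustration of the proxy integration described above, here is a minimal sketch of a Node.js handler. It is not our production code: the `sku` path parameter and the `lookupPrice` function are hypothetical names invented for the example. The only parts taken from the integration itself are the shape of the incoming `event` and the `statusCode`/`headers`/`body` response object.

```javascript
// Minimal sketch of a Node.js handler behind an API Gateway Lambda proxy
// integration. The `sku` parameter and `lookupPrice` are hypothetical.

// Stand-in for real business logic (for example, a datastore read).
async function lookupPrice(sku) {
  return { sku, price: 19.99 };
}

// With proxy integration, API Gateway passes the whole HTTP request as
// `event` and expects an object with `statusCode`, `headers`, and a
// string `body` in return.
async function handler(event) {
  const sku = event.pathParameters && event.pathParameters.sku;
  if (!sku) {
    return {
      statusCode: 400,
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ error: 'sku is required' }),
    };
  }
  const result = await lookupPrice(sku);
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(result),
  };
}

exports.handler = handler;
```

With request validation configured in API Gateway, the 400 branch would rarely run, because malformed requests are rejected before the Lambda is invoked. Keeping the check is a cheap safety net for local runs.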

SAM (Cloudformation):

What it is: AWS SAM is an open-source framework that you can use to build serverless applications on AWS. It is an extension of Cloudformation which lets you bundle and define serverless AWS resources in template form, create stacks in AWS, and enable permissions. It also provides a set of command line tools for running serverless applications locally.

How we use it: This is the glue for our whole system. We use SAM to describe and define our entire stack in config files, run the application locally, and deploy to integration and production environments. This really ties everything together and without this tool we would be managing all the different components separately. Our API gateway, Lambdas (+ layers) and permissions are described and provisioned in yaml files.

Example of defining a lambda proxy in template.yaml

Example of shared layers in template.yaml

Example of defining corresponding API gateway endpoint in template.yaml

DynamoDB:

What it is: A managed key value and document data store.

How we use it: We use DynamoDB for our ingestion pipeline, which gets data into our system. Dynamo was the initial reason we chose to experiment with serverless for this project. Your serverless application is limited by your database’s ability to scale fluidly. In other words, it doesn’t matter if your application can handle large spikes in traffic if your database layer is going to return capacity errors in response to the increased load. Our ingestion pipeline spikes daily from ~5 to ~700 read or write capacity units per second and is scheduled to increase, so we wanted to make sure that throwing additional batches of reads or writes at our datastore from our serverless API wasn’t going to cause a bottleneck during a spike. In addition to scaling fluidly, Dynamo has also simplified our Lambda–datastore interaction: we don’t have the complexity overhead of trying to open and share database connections between Lambdas, because Dynamo’s low-level API uses HTTP(S). This is not to say it’s impossible to use a different database (there are lots of examples of this out there), but it’s arguably simpler to use something like Dynamo.
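To make the "batches of writes" concrete, the sketch below splits a list of records into request objects in the shape that the AWS SDK's document-client batch write expects. DynamoDB's BatchWriteItem accepts at most 25 put or delete requests per call, so larger batches have to be chunked. The table name and item fields are hypothetical, and the sketch only builds the request parameters; it makes no network calls.

```javascript
// Sketch: split records into DynamoDB BatchWriteItem requests.
// BatchWriteItem accepts at most 25 put/delete requests per call.
const MAX_BATCH_SIZE = 25;

// Returns an array of request parameter objects, one per batch.
// `tableName` and the item shape are hypothetical.
function buildBatchWriteRequests(tableName, items) {
  const requests = [];
  for (let i = 0; i < items.length; i += MAX_BATCH_SIZE) {
    const chunk = items.slice(i, i + MAX_BATCH_SIZE);
    requests.push({
      RequestItems: {
        [tableName]: chunk.map((item) => ({ PutRequest: { Item: item } })),
      },
    });
  }
  return requests;
}

// 60 illustrative records become three requests of 25, 25, and 10 items.
const items = Array.from({ length: 60 }, (_, n) => ({ sku: `SKU-${n}`, price: n }));
const requests = buildBatchWriteRequests('competitive-prices', items);
console.log(requests.map((r) => r.RequestItems['competitive-prices'].length));
```

A real caller would send each request through the SDK and retry anything returned in `UnprocessedItems`, which is how DynamoDB reports writes it could not complete in that call.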

Normal spiky ingestion traffic for our service with Dynamo scaling

Other tools: Gitlab CICD, Cloudwatch, Elk stack, Grafana:

These are fairly common tools, so we won’t go into detail about what they are.

How we use it: We use Gitlab CICD to deploy all of our other services, so we wanted to keep that the same for our serverless application. Our Gitlab runner uses a Docker image that has AWS SAM for building, testing, deploying (in stages) and rolling back our serverless application. Our team already uses the Elk stack and Grafana for logging, visualization and alerting. For this service all of our logs from API Gateway, Lambda and Dynamo get picked up in Cloudwatch. We use Cloudwatch as a data source in Grafana and have a utility that migrates our Cloudwatch logs to Logstash. That way we can keep our normal monitoring and logging systems without having to go to separate tools just for this project.
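This post does not describe how our log-migration utility works internally, so the following is only a sketch of one common approach: a Lambda subscribed to a Cloudwatch log group. In that setup, Cloudwatch delivers log data in `event.awslogs.data` as a base64-encoded, gzip-compressed JSON document. The log group and stream names below are hypothetical, and the forwarding step to Logstash is left out.

```javascript
const zlib = require('zlib');

// CloudWatch Logs delivers subscription data to Lambda as
// event.awslogs.data: base64-encoded, gzip-compressed JSON.
function decodeLogEvent(event) {
  const compressed = Buffer.from(event.awslogs.data, 'base64');
  const payload = JSON.parse(zlib.gunzipSync(compressed).toString('utf8'));
  // Flatten into one record per log line, keeping the source group/stream.
  return payload.logEvents.map((logEvent) => ({
    logGroup: payload.logGroup,
    logStream: payload.logStream,
    timestamp: new Date(logEvent.timestamp).toISOString(),
    message: logEvent.message,
  }));
}

// Build a sample payload the same way it would arrive (names hypothetical).
const sample = {
  messageType: 'DATA_MESSAGE',
  logGroup: '/aws/lambda/example-function',
  logStream: 'example-stream',
  logEvents: [{ id: '1', timestamp: 0, message: 'hello' }],
};
const sampleEvent = {
  awslogs: { data: zlib.gzipSync(JSON.stringify(sample)).toString('base64') },
};
console.log(decodeLogEvent(sampleEvent));
```

Each decoded record could then be shipped to a log pipeline in whatever format it expects.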

DID WE ADDRESS OUR GOALS?

So now that we have laid out our basic architecture and tooling: How well does this system address our goals mentioned at the start of this post? Below is a breakdown for how these have been addressed and a (yes, very subjective) grade for each.

Testability (Tests & Local debugging & Multiple environments)

Overall Grade: A-

Positives: The tooling for serverless has come a long way and impressed us in this category. One of the most useful features is the ability to start your serverless app locally by running

sam local start-api <OPTIONS>

This starts a local HTTP server based on the AWS::Serverless::Api specification in your template.yaml. When you make a request to your local server, it reads the corresponding CodeUri property of your Lambda (AWS::Serverless::Function) and starts a Docker container (the same image AWS uses to run deployed Lambdas) to run your Lambda locally for that request. We were able to write unit tests for all the Lambdas and Lambda layer code, as well as deploy to specific integ/prod environments (discussed more below). There are additional console and CLI tools for triggering API Gateway endpoints and Lambdas directly.

Negatives: Most of the negatives for this category are nitpicks. Unit testing with layers took some tweaking and feels a little clumsy. Also, sam local start-api doesn’t exactly mimic the deployed instance in how it handles errors and parameter validation. In addition, requests to the local instance were slow because it starts a Docker container locally every time an endpoint is requested.
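Unit tests for a Lambda can avoid the container startup cost entirely by calling the handler function directly with a hand-built event. The sketch below assumes a hypothetical handler factory that receives its datastore as an argument, so a test can pass an in-memory stub. None of these names come from our codebase.

```javascript
const check = require('assert');

// Hypothetical handler factory: the datastore is injected so tests can stub it.
function createHandler(store) {
  return async function handler(event) {
    const id = event.pathParameters && event.pathParameters.id;
    const item = await store.get(id);
    if (!item) {
      return { statusCode: 404, body: JSON.stringify({ error: 'not found' }) };
    }
    return { statusCode: 200, body: JSON.stringify(item) };
  };
}

// In-memory stub standing in for the real datastore client.
const stubStore = {
  async get(id) {
    return id === '42' ? { id: '42', name: 'widget' } : undefined;
  },
};

async function runTests() {
  const handler = createHandler(stubStore);
  // Only the fields the handler reads are included in the fake event.
  const found = await handler({ pathParameters: { id: '42' } });
  check.strictEqual(found.statusCode, 200);
  check.deepStrictEqual(JSON.parse(found.body), { id: '42', name: 'widget' });

  const missing = await handler({ pathParameters: { id: '7' } });
  check.strictEqual(missing.statusCode, 404);
  return 'ok';
}

const testRun = runTests();
testRun.then((result) => console.log(result));
```

Tests like this run in milliseconds, which leaves sam local start-api for the slower end-to-end checks where the gateway configuration itself is under test.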

Observability (Logging & Monitoring)

Overall Grade: B

Positives: At the end of the day this works pretty well and mostly mimics our other services where we have our logs searchable in Kibana and our data visible in Grafana. The integration with Cloudwatch is pretty seamless and really easy to get started with.

Negatives: Observability in the serverless world is still tough. It tends to be distributed across lots of components, and it can be frustrating to track down where things unraveled. With the rapid development of tooling in this category, we don’t see this being a long-term problem. One tool that we have not tried, but that could bump this grade up, is AWS X-Ray, a distributed tracing system for AWS components.

Developer Experience (DRY & Code Reviews & Speed & Maintainability)

Overall Grade: A-

Positives: Developer experience was good, bordering on great. Each endpoint is encapsulated in its own small codebase, which makes adding new code or working on existing code really easy. Lambda layers have solved a lot of the DRY issues. We share response, error and database libraries between the Lambdas, as well as npm modules. All of our Lambda code gets reviewed and deployed like our normal services.
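To give a feel for the kind of response and error code that is worth sharing, here is a small sketch of helpers that could live in a layer. Every name in it (`HttpError`, `jsonResponse`, `withErrorHandling`) is hypothetical and chosen for the example. It is shown as a single file so that it runs on its own.

```javascript
// Sketch of response/error helpers of the kind that could live in a
// Lambda layer. All names here are hypothetical.
class HttpError extends Error {
  constructor(statusCode, message) {
    super(message);
    this.statusCode = statusCode;
  }
}

function jsonResponse(statusCode, payload) {
  return {
    statusCode,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  };
}

// Wraps endpoint logic so every Lambda maps errors to responses the same way.
function withErrorHandling(logic) {
  return async (event) => {
    try {
      return jsonResponse(200, await logic(event));
    } catch (err) {
      const status = err instanceof HttpError ? err.statusCode : 500;
      // Do not leak internal error details to clients on unexpected failures.
      const message = status === 500 ? 'internal error' : err.message;
      return jsonResponse(status, { error: message });
    }
  };
}

// Example endpoint that uses the shared helpers.
const getItem = withErrorHandling(async (event) => {
  if (!event.queryStringParameters) throw new HttpError(400, 'missing query');
  return { ok: true };
});
```

For Node.js, code packaged in a layer under `nodejs/node_modules` can be loaded from a function with a plain `require`, so each endpoint imports the helpers as it would any other module.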

Negatives: In our view the two biggest downsides have been unwieldy configuration and immature documentation. While configuration has its benefits, and it’s great to be able to see the entire infrastructure of your application in code, it can be tough to jump into, and SAM/Cloudformation has a learning curve. Documentation is solid but could be better. Part of the issue is the rapid pace of feature releases and some confusion about best practices.

Deployments

Overall Grade: A

Positives: Deployments are awesome with SAM/Cloudformation. From our Gitlab runner we execute:

aws cloudformation package --template-file template.yaml --output-template-file packaged_{env}.yaml --s3-bucket {s3BucketName}

which uploads template resources to s3 and replaces paths in our template. We can then run:

aws cloudformation deploy --template-file packaged_{env}.yaml --stack-name {stackName} --capabilities CAPABILITY_IAM --region us-east-1 --parameter-overrides DeployEnv={env}

which creates a change set (showing the effects of the proposed deployment) and then creates the stack in AWS. This will create or update all the resources, IAM permissions and associations defined in template.yaml. This is really fast: a normal build and deployment of all of our Lambdas, API Gateway endpoints, permissions and so on takes ~90 seconds (not including running our tests).

Negatives: Best practices for deploying to multiple environments are unclear. One approach is to use different API stages in API Gateway and have your other resources versioned for those stages. The other approach, which we chose, is to have completely different stacks for different environments and pass a DeployEnv variable into our Cloudformation scripts.
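One way to keep the per-environment stacks consistent is to derive the deploy arguments from the environment name in a single place. The sketch below only assembles the argument list for the `aws cloudformation deploy` command shown earlier and does not run anything. The `service-env` stack naming scheme and the `pricing-api` service name are hypothetical conventions, and the set of allowed environments is an assumption based on the integ/prod environments mentioned in this post.

```javascript
// Sketch: build per-environment arguments for `aws cloudformation deploy`.
// The stack naming scheme and allowed environment list are assumptions.
const ALLOWED_ENVS = ['integ', 'prod'];

function deployArgs(env, serviceName) {
  if (!ALLOWED_ENVS.includes(env)) {
    throw new Error(`unknown environment: ${env}`);
  }
  return [
    'cloudformation', 'deploy',
    '--template-file', `packaged_${env}.yaml`,
    '--stack-name', `${serviceName}-${env}`,
    '--capabilities', 'CAPABILITY_IAM',
    '--region', 'us-east-1',
    '--parameter-overrides', `DeployEnv=${env}`,
  ];
}

// Print the full command for the integration environment.
console.log(['aws', ...deployArgs('integ', 'pricing-api')].join(' '));
```

Because every environment gets its own stack name and its own packaged template, a deployment to integ cannot touch production resources by accident.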

Performance (Latency & Scalability)

Overall Grade: A-

Positives: To test our performance and latency before going live, we proxied all of our requests in parallel to both our existing (Dropwizard) API and our new serverless API. It is important to consider the differences between the systems (data stores, caches, etc.), but after some tweaking we were able to achieve equivalent P99, P95 and P50 response times between the two systems. In addition to the ability to massively scale out instances in response to traffic, the beauty of the serverless API is that you can fine-tune performance on a per-function (per-endpoint) basis. CPU share is allocated in proportion to the memory configured for each Lambda, so increasing a Lambda’s memory can directly decrease its latency. When we first started routing traffic in parallel to our serverless API, we noticed some of the bulk API endpoints were not performing as well as we had hoped. Instead of increasing the overall CPU/memory of the entire deployment, we just had to do this for the slow Lambdas.
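The memory setting is also a cost lever, because Lambda compute is billed by memory multiplied by duration (GB-seconds). The arithmetic below is a sketch with hypothetical durations: it assumes a CPU-bound function that runs four times faster with four times the memory, which is the best case and will not hold for I/O-bound work.

```javascript
// Lambda compute is billed by memory multiplied by duration (GB-seconds).
function gbSeconds(memoryMb, durationMs) {
  return (memoryMb / 1024) * (durationMs / 1000);
}

// Hypothetical durations for one CPU-bound invocation at two memory sizes.
const small = gbSeconds(128, 800); // 128 MB running for 800 ms
const large = gbSeconds(512, 200); // 512 MB running for 200 ms

// Both configurations come to about 0.1 GB-seconds per invocation.
console.log(small, large);
```

Under that assumption the larger configuration costs about the same per invocation while responding four times faster. When the speedup is smaller than the memory increase, the extra memory is paid for in full.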

Negatives: A well-known and often discussed issue with serverless is dealing with cold start times. This refers to the latency increase when a new instance is brought up by your cloud provider to process a request to your serverless app. In practice, a cold start sometimes adds several hundred ms of latency for our endpoints. Once spun up, instances won’t get torn down immediately, and subsequent requests (if routed to the same instance) won’t suffer from that initial cold start time. There are several strategies to avoid these cold starts, like “pre-warming” your API with dummy requests when you know traffic is about to spike. You can also configure your functions to run with more memory, which in our experience lets instances persist for longer between requests (and increases CPU). Both of these strategies cost money, so the goal is to find the right balance for your service.
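A pre-warming request should do as little work as possible, since its only purpose is to get an instance initialized. The sketch below shows one way a handler could recognize such a request. The `warmup` flag is a hypothetical convention chosen for this example, not a field defined by AWS, and the `wasCold` value is included only to make the behavior visible.

```javascript
// Sketch of a handler that short-circuits "pre-warming" invocations.
// The `warmup` flag is a hypothetical convention, not an AWS-defined field.

// Module scope survives between invocations served by the same instance.
let coldStart = true;

async function handler(event) {
  const wasCold = coldStart;
  coldStart = false;
  if (event && event.warmup === true) {
    // Skip real work: the goal is only to keep an instance initialized.
    return { warmed: true, wasCold };
  }
  return { statusCode: 200, body: JSON.stringify({ wasCold }) };
}
```

The first invocation on a fresh instance reports `wasCold: true`, and later invocations on the same instance report `false`, which is also a handy value to log when investigating latency spikes.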

Example of scaling an arbitrary computation-heavy Lambda from 128 MB to 512 MB (at ~14:40) and the effect on response times

Overall latency with spikes over a ~3 hour period (most Lambdas configured at 128 MB or 256 MB). Spikes are either cold start times, calls that involve several hundred queries (bulk APIs), or a combination of the two. Note: this performance is tweaked to be equivalent to the p99, p90 and p50 times of the old service.

WRAPPING UP

At the end of the day this setup has been a fun experiment and for the most part satisfied our requirements. Now that we have set this up once, we will definitely use it again for suitable projects in the future. A few things we didn’t touch on in this post are pricing and security, two very important goals for any project. An accurate pricing comparison is tricky because it depends on traffic, function size, function execution time and many other configurable options along the entire stack. It is worth mentioning, though, that if you have large and consistent traffic to your APIs, it is highly unlikely that serverless will be as cost effective as rolling your own (large) instance. That being said, if you want a low-maintenance service that can scale on demand to handle spiky, intermittent traffic, it is definitely worth considering. Security considerations are also highly dependent on your use case. For us, this is an internal-facing application, so strategies like associating an internal firewall, limited-resource IAM permissions and basic gateway token validation sufficed. For public-facing applications there are a variety of security strategies to consider that aren’t mentioned in this post. Overall, we would argue that the tools are at a point where it does not feel like you are sacrificing development standards by choosing a serverless setup. Still not a silver bullet, but a great tool to have in the toolbox.

RESOURCES AND FURTHER READING

Overview of serverless computing: https://martinfowler.com/articles/serverless.html

API gateway: https://docs.aws.amazon.com/apigateway/index.html#lang/en_us

Lambda: https://docs.aws.amazon.com/lambda/index.html#lang/en_us

SAM: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html

Cloudformation: https://docs.aws.amazon.com/cloudformation/index.html#lang/en_us

Dynamo: https://docs.aws.amazon.com/dynamodb/index.html#lang/en_us

Cloudwatch: https://docs.aws.amazon.com/cloudwatch/index.html#lang/en_us

Xray: https://docs.aws.amazon.com/xray/index.html#lang/en_us

Grafana: http://docs.grafana.org/

Elk stack: https://www.elastic.co/guide/index.html

Gitlab CICD: https://docs.gitlab.com/ee/ci/

Serving dynamically resized images at scale using Serverless

At zulily, we take pride in helping our customers discover amazing products at tremendous value. We rely heavily on highly curated images and videos to narrate these stories. These highly curated images, in particular, form the bulk of the content on our site and apps.

Two of our popular events from weekend of Oct 20th 2018

Today, we will talk about how zulily leverages serverless technologies to serve optimized images on the fly to a variety of devices with varying resolutions and screen sizes.

In any given month, we serve over 23 billion image requests using our CDN partners. This results in over a Petabyte per month of data transferred to our users around the globe.

We use an AWS Simple Storage Service (S3) bucket as the origin for all our images. Our in-house studio and merchandising teams upload rich images to S3 buckets using internal tools. As you can imagine, these images are pretty huge in terms of file size. Downloading and displaying these images as-is would result in a sub-optimal experience for our customers and wasted bandwidth.

Architecture

Dynamic image resizing on the fly
