Hi, I’m Mark — a mathematician turned coffee shop owner turned actuary turned software engineer. When I discovered my passion for software development, I quit my day job, went through an immersive development bootcamp and got recruited by Zulily’s Member Engagement Platform team. This post is about my first six months here.
Things I had to learn on the go:
Becoming an IntelliJ power user
Navigating the code base with many unfamiliar projects
Learning about containerized deployments and CI/CD
Getting my hands dirty with Java, Scala, Python, JavaScript, shell scripting and beyond
Reaching high into the AWS cloud: ECS, SQS, Kinesis, S3, Lambda, Step Functions, EMR, CloudWatch…
Things that caught me by surprise:
I shipped my code to production two weeks after my start date.
Engineers had autonomy and trust. The managers removed obstacles.
The amount of collective knowledge and experience around me was stunning.
I was expected to work directly with the Marketing team to clarify requirements. No middlemen.
Iteration was king!
Things that I am still learning with the help of my team:
Form validation in React;
Designing and building high-scale distributed systems;
Tuning AWS services to match our use cases;
On-call rotations (the team owns several tier-1 services);
And many more…
There is one recurring theme in all these experiences. No matter how new, huge, or unfamiliar a task may seem at first, it’s just a matter of breaking it down methodically into manageable chunks that can then be understood, built, and tested. This process can be time consuming, and I feel lucky that my team understands that. It’s clear that engineers at Zulily are encouraged to take the time and build products that last.
One of the things that I find truly amazing here is the breadth of work that is considered ‘full stack’. DevOps, Big Data, microservices, and React client apps are just some of the areas in which I have been able to expand my knowledge in the last several months. It was overwhelming at first, but with teammates who have vast expertise, acquiring these new skills became an exciting part of my daily routine at Zulily.
It’s hard to compare what I know now to what I knew six months ago; my understanding has expanded in breadth and depth, and I’m excited to see what the next six months will bring.
The appeal of serverless architecture sometimes feels like a silver bullet: offload your application code to your favorite cloud provider’s serverless offering and
enjoy the benefits of extreme scalability, rapid development, and pay-for-use
pricing. At this point in time most cloud providers’ solutions and tooling are
well defined, and you can get your code executing in a serverless fashion with
little effort.
The waters start to get a little murkier when considering making an entire
application serverless while maintaining the coding standards and
maintainability of traditional server-based architecture. Serverless has a
tendency to become ownerless (aka hard to debug and painful to work on) over
time without the proper structure, tooling and visibility. Eventually these
applications suffer due to code fragility and slow development time, despite
how easy and fast they were to prototype.
On the pricing team at Zulily, our core services are written in Java using the Dropwizard framework. We host on AWS using Docker and Kubernetes and deploy using Gitlab’s CICD framework. For most of our use cases this setup works well, but it’s not perfect for everything. We do feel confident, though, that this setup provides the structure and tools that help us avoid the ownerless trap and achieve our goals of well-engineered service development. While the list of goals will differ between teams and projects, for our purposes it can be summarized below:
Goals:
Testability (Tests & Local debugging & Multiple environments)
Observability (Logging & Monitoring)
Maintainability (Developer experience)
Deployments
Performance (Latency & Scalability)
We are currently redesigning our competitive shopping pipeline and thought this
would be a good use case to experiment with a serverless design. For our use
case we were most interested in its ability to handle unknown and intermittent
load and were also curious to see if it improved our speed of development and
overall maintainability. We think we have been able to leverage the benefits of
a serverless architecture while avoiding some of the pitfalls. In this post,
I’ll provide an overview of the choices we made and lessons we learned.
OUR STACK IN A NUTSHELL AND HOW IT WORKS
There are many serverless frameworks: some are geared towards being platform agnostic, others focus on providing better UIs, and some make deployment a breeze. We
found most of these frameworks were abstraction layers on top of AWS/GCP/Azure
and since we are primarily an AWS-leaning shop at Zulily, we decided to stick
with their native tools where we could, knowing that later we could swap out
components if necessary. Some of these tools were already familiar to the team
and we wanted to be able to take advantage of any new releases by AWS without
waiting for an abstraction layer to implement said features.
Basic architecture of serverless stack:
Below is a breakdown of the major components/tools we are using in our serverless
stack:
API Gateway:
What it is: Managed
HTTP service that acts as a layer (gateway) in front of your actual application
code.
How we use it: API gateway
handles all the incoming traffic and outgoing responses to our service. We use
it for parameter validation, request/response transformation, throttling, load
balancing, security and documentation. This allows us to keep our actual
application code simple while we offload most of the boilerplate API logic and
request handling to API Gateway. API gateway also acts as a façade layer where
we can provide clients with a single service and proxy requests to multiple
backends. This is useful for supporting legacy Java endpoints while the system
is migrated and updated.
Lambda:
What it is: Lambda is a serverless compute service that runs your code in
response to events and automatically manages the underlying compute resources.
How we use it: We use Lambdas to run our application code using the API Gateway Lambda proxy integration. Our Lambdas are written in Node.js to reduce cold start time and memory footprint (compared to Java). We have a 1:1 relationship between an API endpoint and its Lambda
to ensure we can fine tune the memory/runtime per Lambda and utilize API Gateway
to its fullest extent, for example, to bounce malformed requests at the gateway
level and prevent Lambda execution. We also use Lambda layers, a feature
introduced by AWS in 2018, to share common code between our endpoints while
keeping the actual application logic isolated. This keeps our deployment size
smaller per Lambda, and we don’t have to replicate shared code.
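As a rough illustration (not our production code; the names and response shape are just examples), a proxy-integration handler can be as small as:

// handler.js – a minimal sketch of an API Gateway Lambda proxy handler (illustrative)
exports.handler = async (event) => {
  // With the proxy integration, API Gateway passes the whole HTTP request in `event`;
  // malformed requests are rejected at the gateway before this code ever runs.
  const id = (event.pathParameters || {}).id;

  // Real application logic would live here (shared helpers would come from a Lambda layer);
  // this sketch just echoes the path parameter back.
  const result = { id, message: 'ok' };

  // The proxy integration expects exactly this response shape.
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(result),
  };
};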
SAM (Cloudformation):
What it is: AWS SAM is
an open-source framework that you can use to build serverless applications on
AWS. It is an extension of Cloudformation which lets you bundle and define
serverless AWS resources in template form, create stacks in AWS, and enable
permissions. It also provides a set of command line tools for running
serverless applications locally.
How we use it: This is the
glue for our whole system. We use SAM to describe and define our entire stack
in config files, run the application locally, and deploy to integration and
production environments. This really ties everything together and without this
tool we would be managing all the different components separately. Our API
gateway, Lambdas (+ layers) and permissions are described and provisioned in
yaml files.
Example of defining a lambda proxy in template.yaml
Example of shared layers in template.yaml
Example of defining the corresponding API Gateway endpoint in template.yaml
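To give a flavor of what those definitions look like, here is a minimal, hypothetical fragment of a template.yaml (the resource names, paths, runtime and memory sizes are illustrative, not our actual configuration):

# Illustrative SAM template fragment (names and values are examples only)
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  SharedLibsLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      ContentUri: layers/shared-libs/        # shared response/error/db helpers + NPM modules
      CompatibleRuntimes:
        - nodejs8.10

  GetPriceFunction:
    Type: AWS::Serverless::Function          # one Lambda per API endpoint
    Properties:
      CodeUri: functions/get-price/
      Handler: handler.handler
      Runtime: nodejs8.10
      MemorySize: 256                        # tuned per endpoint
      Timeout: 10
      Layers:
        - !Ref SharedLibsLayer
      Events:
        GetPrice:
          Type: Api                          # wires the function to API Gateway
          Properties:
            RestApiId: !Ref PricingApi
            Path: /prices/{id}
            Method: get

  PricingApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: prod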
DynamoDB:
What it is: A managed
key value and document data store.
How we use it: We use
DynamoDB for our ingestion pipeline getting data into our system. Dynamo was
the initial reason we chose to experiment with serverless for this project.
Your serverless application is limited by your database’s ability to scale
fluidly. In other words, it doesn’t matter if your application can handle large
spikes in traffic if your database layer is going to return capacity errors in response to that increase in load. Our ingestion pipeline spikes daily from
~5 to ~700 read or write capacity units per second and is scheduled to
increase, so we wanted to make sure throwing additional batches of reads or
writes to our datastore from our serverless API wasn’t going to cause a
bottleneck during a spike. In addition to being able to scale fluidly, Dynamo
has also simplified our Lambda–datastore interaction as we don’t have the
complexity overhead of trying to open and share database connections between
lambdas because Dynamo’s low-level API uses HTTP(S). This is not to say it’s
impossible to use a different database (lots of examples out there of this),
but it’s arguably simpler to use something like Dynamo.
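For example, a write helper shared through a Lambda layer needs nothing more than the SDK’s DocumentClient (the table name and attributes below are made up):

// db.js – illustrative DynamoDB write helper; there is no connection pool to manage,
// since the DocumentClient talks to Dynamo over HTTPS.
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

async function savePrice(record) {
  await ddb.put({
    TableName: 'competitive-prices',                        // hypothetical table name
    Item: { sku: record.sku, price: record.price, updatedAt: Date.now() },
  }).promise();
}

module.exports = { savePrice };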
Normal spikey ingestion traffic for our service with dynamo
scaling
Other tools: Gitlab CICD, Cloudwatch, Elk stack, Grafana:
These are fairly common tools, so we won’t go into detail about what they are.
How we use them: We use
Gitlab CICD to deploy all of our other services, so we wanted to keep that the
same for our serverless application. Our Gitlab runner uses a Docker image that
has AWS SAM for building, testing, deploying (in stages) and rolling back our
serverless application. Our team already uses the Elk stack and Grafana for
logging, visualization and alerting. For this service all of our logs from API
Gateway, Lambda and Dynamo get picked up in Cloudwatch. We use Cloudwatch as a
data source in Grafana and have a utility that migrates our Cloudwatch logs to
Logstash. That way we can keep our normal monitoring and logging systems
without having to go to separate tools just for this project.
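On the CI side, a stripped-down sketch of what such a GitLab pipeline can look like (the image name and job layout are illustrative; the actual SAM deploy commands appear in the Deployments section below):

# Illustrative .gitlab-ci.yml for the serverless stack (names are placeholders)
image: <docker-image-with-aws-sam-cli>     # a runner image with the AWS SAM CLI installed

stages:
  - build
  - test
  - deploy

build:
  stage: build
  script:
    - sam build                            # builds each function and layer defined in template.yaml
  artifacts:
    paths:
      - .aws-sam/                          # hand the built artifacts to later stages

test:
  stage: test
  script:
    - npm ci && npm test                   # hypothetical unit tests for the Lambda and layer code

deploy_integ:
  stage: deploy
  script:
    - ./deploy.sh integ                    # hypothetical wrapper around the SAM deploy commands shown below
  only:
    - master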
DID WE ADDRESS OUR GOALS?
So now that we have laid out our basic architecture and tooling, how well does this system address the goals mentioned at the start of this post? Below is a breakdown of how each has been addressed and a (yes, very subjective) grade
for each.
Testability (Tests & Local debugging & Multiple environments)
Overall Grade: A-
Positives: The tooling
for serverless has come a long way and impressed us in this category. One of
the most useful features is the ability to start your serverless app locally by
running
sam local start-api <OPTIONS>
This starts a local HTTP server based on your AWS::Serverless::Api specification in
your template.yaml. When you make a request to your local server, it reads the
corresponding CodeUri property of your lambda (AWS::Serverless::Function) and
starts a Docker container (the same image AWS uses to run deployed Lambdas) to run your
lambda locally in conjunction with the request. We were able to write unit
tests for all the lambdas and lambda layer code as well as deploy to specific
integ/prod environments (discussed more below). There are additional console
and CLI tools for triggering API gateway endpoints and lambdas directly.
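For instance, an individual function can be exercised locally against a canned event (the function and file names here are illustrative):

sam local invoke GetPriceFunction --event events/get-price.json

where events/get-price.json holds a sample API Gateway proxy event, which sam local generate-event can produce.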
Negatives: Most of the
negatives for this category are nitpicks. Unit testing with layers took some
tweaking and feels a little clumsy, and sam local start-api doesn’t exactly mimic the deployed instance in how it
handles errors and parameter validation. In addition, requests to the local
instance were slow because it starts a docker container locally every time an
endpoint is requested.
Observability (Logging & Monitoring)
Overall Grade: B
Positives: At the end of
the day this works pretty well and mostly mimics our other services where we
have our logs searchable in Kibana and our data visible in Grafana. The
integration with Cloudwatch is pretty seamless and really easy to get started
with.
Negatives: Observability
in the serverless world is still tough. Execution is distributed across lots of components, and it can be frustrating to track down where things went wrong. With
the rapid development of tooling in this category, we don’t see this being a
long-term problem. One tool that we have not tried, but could bump this grade
up, is AWS X-Ray, which is a distributed tracing system for AWS components.
Maintainability (Developer experience)
Positives: Developer
experience was good, bordering on great. Each endpoint is encapsulated in its
own small codebase which makes adding new code or working on existing code
really easy. Lambda layers have solved a
lot of the DRY issues. We share
response, error and database libraries between the lambdas as well as NPM
modules. All of our lambda code gets reviewed and deployed like our normal
services.
Negatives: In our view the two biggest downsides have been unwieldy configuration and immature documentation. While configuration as code has its benefits, and it’s great to be able to see the entire infrastructure of your application in code, it can be tough to jump into, and SAM/Cloudformation has a learning curve. Documentation is solid
but could be better. Part of the issue is the rapid pace of feature releases
and some confusion on best practices.
Deployments
Overall Grade: A
Positives: Deployments
are awesome with SAM/Cloudformation. From our Gitlab runner we execute a single SAM/Cloudformation deploy command (sketched below), which creates a change set (showing the effects of the proposed deployment) and then creates the stack in AWS. This will create/update
all the resources, IAM permissions and associations defined in template.yaml.
This is really fast: a normal build and deployment of all of our Lambdas, API
Gateway endpoints, permissions etc… takes ~90 seconds (not including running
our tests).
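A minimal sketch of that deploy step, with placeholder bucket, stack and parameter names rather than our real ones (DEPLOY_ENV stands in for a CI variable):

sam package --template-file template.yaml --s3-bucket <artifact-bucket> --output-template-file packaged.yaml
sam deploy --template-file packaged.yaml --stack-name <stack-name>-${DEPLOY_ENV} --capabilities CAPABILITY_IAM --parameter-overrides DeployEnv=${DEPLOY_ENV}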
Negatives: Unclear best
practices for deploying to multiple environments. One approach is to use
different API stages in API gateway and have your other resources versioned for
those stages. The other approach (the way we chose) is to have completely different stacks for different environments and pass a DeployEnv variable into our Cloudformation scripts; a sketch of how such a parameter can be threaded through the template follows below.
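As a hedged illustration (again with made-up resource names), the DeployEnv parameter can be declared in template.yaml and folded into resource and stage names:

Parameters:
  DeployEnv:
    Type: String
    AllowedValues: [integ, prod]
    Default: integ

Resources:
  GetPriceFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub "get-price-${DeployEnv}"        # e.g. get-price-integ vs get-price-prod
      # ...rest of the function definition as before...

  PricingApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: !Ref DeployEnv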
Performance (Latency & Scalability)
Overall Grade: A-
Positives: To test our performance and
latency before going live we proxied all of our requests in parallel to both our
existing (Dropwizard) API and new serverless API. It is important to consider
the differences between the systems (data stores, caches etc..), but after some
tweaking we were able to achieve equivalent P99, P95 and P50 response times
between the two systems. In addition to the ability to massively scale out
instances in response to traffic, the beauty of the serverless API is that you
can fine-tune performance on a function (endpoint) basis. CPU share is allocated proportionally to the memory configured for each Lambda, so increasing a Lambda’s memory can directly decrease
latency. When we first started routing traffic in parallel to our serverless
API we noticed some of the bulk API endpoints were not performing as well as we
had hoped. Instead of increasing the overall CPU/memory of the entire
deployment, we just had to do this for the slow Lambdas.
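In template.yaml terms that tuning is just a per-function property change (the function name and size below are illustrative):

BulkPricesFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: functions/bulk-prices/
    Handler: handler.handler
    Runtime: nodejs8.10
    MemorySize: 1024      # raised only for this slower bulk endpoint; other functions stay at 128–256 MB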
Negatives: A well-known and often discussed
issue with serverless is dealing with cold start times. This refers to the
latency increase when an initial instance is brought up by your cloud provider
to process a request to your serverless app. In practice, this sometimes added several hundred milliseconds of latency to a cold start for our endpoints.
Once spun up, instances won’t get torn down immediately and subsequent requests
(if routed to this same instance) won’t suffer from that initial cold start
time. There are several strategies to avoid these cold starts like
“pre-warming” your API with dummy requests when you know traffic is about to
spike. You can also configure your application to run with more memory, which persists for longer between requests (and also increases CPU). Both of these strategies
cost money, so the goal is to try to find the right balance for your service.
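As one sketch of the pre-warming idea, a scheduled event can be attached to a function in template.yaml so it is pinged ahead of an expected spike (the rate, input and function name are illustrative):

GetPriceFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ...function definition as before...
    Events:
      KeepWarm:
        Type: Schedule                     # CloudWatch Events rule that fires a dummy invocation
        Properties:
          Schedule: rate(5 minutes)
          Input: '{"warmup": true}'        # the handler can short-circuit when it sees this flag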
Example of the effect on response times of scaling (at ~14:40) an arbitrary computation-heavy Lambda from 128 MB to 512 MB
Overall latency with spikes over a ~3-hour period (most Lambdas configured at 128 MB or 256 MB). Spikes are either cold start times, calls that involve several hundred queries (bulk APIs), or a combination. Note: this performance is tuned to be equivalent to the p99, p90 and p50 times of the old service.
WRAPPING UP
At the end of the day this setup has been a fun experiment and for the most part
satisfied our requirements. Now that we have set this up once, we will
definitely use it again for suitable projects in the future. A few things we
didn’t touch on in this post are pricing and security, two very important goals
for any project. Pricing is tricky to get an accurate comparison of because it
depends on traffic, function size, function execution time and many other
configurable options along the entire stack. It is worth mentioning, though,
that if you have large and consistent traffic to your APIs, it is highly
unlikely that serverless will be as cost-effective as rolling your own (large)
instance. That being said, if you want a low maintenance service that can scale
on demand to handle spikey, intermittent traffic, it is definitely worth
considering. Security considerations are also highly dependent on your use
case. For us, this is an internal-facing application, so strategies like an internal firewall, narrowly scoped IAM permissions and basic gateway token validation sufficed. For public-facing applications there are a
variety of security strategies to consider that aren’t mentioned in this post. Overall, we would argue that the
tools are at a point where it does not feel like you are sacrificing in terms
of development standards by choosing a serverless setup. Still not a silver
bullet, but a great tool to have in the toolbox.