Microservices on AWS

October 10, 2019
Monitoring

With the reliability and scalability encouraged by the 12-factor design methodology, microservices have become a fundamental architectural pattern for modern applications. Thanks to the wide array of mature Open Source middleware and infrastructure software available today, designing and building microservices-based applications is easier than ever. Moving this complexity out of the application itself and into standard supporting services has also allowed a whole industry of cloud providers to spring up and offer management of those services. A notable example is Amazon Web Services (AWS), which offers many services to support microservices-based applications at every layer. I like to define these layers as ‘compute,’ ‘persistence,’ and ‘visibility’: the minimal building blocks for any application running in a production environment.

Compute refers to the CPU and RAM resources that you run your software on. Persistence is the means used to store data and communicate state among the various components of the application. Visibility is how you get insights into what the service is doing, letting you make improvements as needed. AWS offers a variety of services in each of these categories, allowing an organization to find the right fit and get a quick start on implementation without having to build extensive in-house talent.

This article will discuss the pros and cons of offerings from AWS in all three categories to help you find the best AWS product for your application’s needs. Here, you will find information about:

  1. Compute Options: Lambda, Fargate and ECS, and EKS
  2. Application State Persistence: Amazon Aurora, DynamoDB, and Redshift
  3. Visibility: Amazon CloudWatch

In addition to the AWS product overviews, you will find recommendations for third-party and open source products to help fill the gaps in the AWS offerings.

Compute Options

If you’re using microservices, you almost certainly will be using containers or serverless for at least part of your application stack because they bring significant advantages to the table.

Lambda

Perhaps the most unusual of the modern means to access compute is Function as a Service (FaaS). Also known as “serverless”, this technology avoids long-lived infrastructure of any kind by creating a “function” only when a request comes in and destroying it as soon as the request is fulfilled. This allows the scalability curve of your application to track your demand curve more closely than any other approach, and it forces programming practices that ensure seamless horizontal scalability.

Lambda, Amazon’s FaaS service, supports a wide variety of languages, and most executables can be used via a callout from a supported language. Lambda responds to events from a variety of sources within AWS, but its most common trigger in a user-facing application is the AWS API Gateway service, which provides a consistent API surface for end users and routes requests inward to one or more Lambdas for processing.
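
To make this concrete, here is a minimal sketch of a Python Lambda handler behind an API Gateway proxy integration. The event and response shapes are API Gateway’s standard proxy format; the greeting logic itself is purely hypothetical:

    import json

    def handler(event, context):
        # API Gateway's proxy integration delivers the HTTP method, path,
        # and query string inside the event dictionary. The parameters
        # dict is absent (None) when no query string was sent.
        name = (event.get("queryStringParameters") or {}).get("name", "world")

        # The return value must follow the proxy response format:
        # statusCode, headers, and a string body.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"message": f"hello, {name}"}),
        }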

With a strong message-passing layer like Kafka or ActiveMQ, an entirely Lambda-based architecture is possible--and sometimes even desirable. This is the case, for example, when large usage spikes are unpredictable. Because the scaling curve matches usage so closely, there is little waste in infrastructure overhead, and the additional cost per compute cycle that Lambda carries is offset by the very low cost incurred during low-utilization periods.

More commonly, though, Lambdas are used to augment a more traditional infrastructure, handling rarer and potentially compute-intensive requests like processing uploads or interacting with a third-party system. Triggered by S3 events or requests arriving at an API Gateway, Lambdas really shine in this context: they allow the main application deployment to ignore the unpredictable additional load these tasks create, and any small delays at Lambda initialization can usually be ignored.
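
A sketch of that pattern, assuming a hypothetical bucket wired up as an S3 trigger: the Records structure below is what S3 delivers to the handler, while the “processing” step is just a placeholder:

    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # An S3 trigger delivers one or more Records describing the
        # objects that changed.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (spaces become '+').
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Placeholder processing: fetch the upload and do the
            # compute-intensive work this Lambda exists for.
            obj = s3.get_object(Bucket=bucket, Key=key)
            size = len(obj["Body"].read())
            print(f"processed s3://{bucket}/{key} ({size} bytes)")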

While very helpful, FaaS is not a panacea. If your application sees consistent demand, the cost overhead may make FaaS more expensive than more traditional “always on” options like containers or VMs. Startup times for functions can also be problematic: while AWS aims to have Lambdas instantiate in only a few milliseconds, the actual time required varies with the size and complexity of the Lambda itself. For this reason, it is often better to have a handful of smaller Lambdas rather than one large one. It is especially important that your Lambdas not require large memory contexts to be loaded when they initialize, because this slows their startup significantly. In that case, the application would likely be better designed as a long-running service so that the memory context can be initialized once and then reused to serve many requests. Nonetheless, FaaS is an important tool that can keep costs low and scalability high when applied properly.

Fargate and ECS

The next step towards more traditional notions of compute infrastructure is the Fargate service, which runs specially configured Docker containers on abstracted, AWS-managed infrastructure.

Fargate is closely related to the Elastic Container Service (ECS); in fact, it is managed under the umbrella of ECS. Both use the same configuration primitives of Tasks and Services and have generally similar management overhead. However--unlike with Fargate--in traditional ECS the end user must manage the underlying infrastructure. If your group has, or wants to acquire, expertise in that area, using traditional ECS instead of Fargate can deliver real savings on the compute cost of a workload: ECS itself is free; you are charged only for the Amazon Elastic Compute Cloud (EC2) resources used in your ECS cluster.

An infrastructure deployed on ECS can support anything that runs in a container, so most legacy monolithic applications can live there happily. However, if you have periodic bursty workloads (like builds or data-analysis jobs), you can avoid designing your ECS cluster for peak demand by running those workloads in Fargate containers instead. Like Lambda, Fargate is a way to make the cost of your infrastructure track your demand curve more closely. For periodic workloads, the net savings will often outweigh the premium charged for Fargate’s compute resources.
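
As an illustration, a bursty job can be pushed onto Fargate with a one-off run_task call from boto3. The cluster name, task definition, and subnet ID below are placeholders:

    import boto3

    ecs = boto3.client("ecs")

    # Launch a one-off batch task on Fargate instead of sizing the
    # ECS cluster's EC2 fleet for this peak.
    response = ecs.run_task(
        cluster="main-cluster",             # hypothetical cluster name
        taskDefinition="nightly-report:3",  # hypothetical task definition
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0abc123"],  # placeholder subnet ID
                "assignPublicIp": "DISABLED",
            }
        },
    )
    print(response["tasks"][0]["taskArn"])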

ECS is AWS’s oldest offering in the container orchestration space and has seen only minor improvements since it was first introduced in 2014. It works quite well if you use it exclusively to deploy microservice-based applications that communicate over REST or other HTTP-based protocols. If you introduce services that need to communicate over raw TCP or on multiple ports, things get more complicated. If you have specific scheduling requirements about which services can be co-located on a physical host, ECS also falls short. Fargate addresses some of these shortcomings, but only indirectly, by dint of the abstraction it provides. If your situation dictates that you need more control, not less, you will need a more sophisticated container orchestration system like Kubernetes, which brings us to the Elastic Kubernetes Service (EKS).

Elastic Kubernetes Service

Kubernetes is flexible, efficient, scalable, and easy to work with as a consumer. However, it is also complicated to manage, so many teams have shied away from it. EKS is the managed Kubernetes (k8s) offering from AWS. It allows teams to leverage the vast array of tools and patterns built up around k8s without having to manage k8s itself.

While EKS comes at a premium, the benefit it brings in the form of lower management overhead can be significant, especially for smaller teams. It is also important to note that EKS is certified Kubernetes conformant: any tooling that works with standard k8s will work with EKS, and EKS can integrate with k8s clusters running in a local datacenter or another cloud provider. This avoids vendor lock-in, provides significant flexibility, and is a useful onramp to hybrid-cloud infrastructure. One of the advancements that helped cement k8s’s place in the container orchestration ecosystem is the variety of options it provides for persistent container storage. This is mostly useful for running legacy workloads, because most microservices deployments won’t use container-attached storage. You do need a way to store and communicate state in your application, though, and AWS offers several options for achieving state and data persistence.

Application State Persistence

The persistence layer can come in many forms depending on the needs of the application and the preferences of the designers and builders. Persistence here isn’t just long-term data storage, but also maintaining the state of the application while it is in use. Moving this information out of the memory of the services running the app makes them “stateless”, meaning they can be safely killed and created at any time: one of the tenets of 12-factor design. As a result, it is very common in a microservices design to have some sort of “message bus”, a system where information about the running state of the application is stored. Generally, this is ephemeral information that only has value while the application is running and is not usually backed up, so the performance of the system is very important.

Three widely used pieces of software in this category are Redis, ActiveMQ, and Kafka, and AWS has a managed offering for each: ElastiCache (Redis), Amazon MQ (ActiveMQ), and MSK (Kafka). Most are priced slightly higher than a bare EC2 compute node of similar size. You sacrifice some control in order to avoid deploying and managing these systems yourself, but in many cases this is a good tradeoff. If you need a deployment at least as large as their minimum sizes, the small premium is almost certainly worth it. If you do not, the decision becomes more complicated, because you have to account for the cost of managing what can be a fairly complicated piece of software yourself. Redis is fairly accessible, so it is a good candidate for self-management if you’re on the fence; ActiveMQ and Kafka are more complex.
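
To illustrate how a service stays stateless, here is a minimal sketch using the redis-py client against an ElastiCache endpoint. The hostname, key, and session payload are placeholders:

    import redis

    # Connect to the ElastiCache Redis endpoint (placeholder hostname).
    r = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)

    # Store session state outside the service process, with a TTL so
    # abandoned sessions expire on their own.
    r.setex("session:42", 3600, "cart=widget,qty=2")

    # Any replica of the service can now pick the request back up.
    state = r.get("session:42")
    print(state)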

Because each service requires an additional investment in time and training, my general recommendation is to start with the managed service if it is within your budget. Then, create space to learn the system so you can move to self-management as your needs dictate. This is even more applicable to long-term persistence services.

RDS

RDS is a high-level offering that provides several different database engines to meet your particular needs, including Postgres, MySQL, MariaDB, Oracle, Microsoft SQL Server, and Amazon Aurora. Aurora is particularly well suited to microservice-based architectures. It can be Postgres- or MySQL-compatible, which makes migrating to it easy in the vast majority of cases, and Amazon claims that it is significantly faster than the systems it replaces. Most importantly, though, it is designed to scale seamlessly and be highly fault tolerant, addressing the biggest shortcoming of even very well-managed traditional RDBMS deployments: a single point of failure that is very difficult to eliminate. Aurora claims to have eliminated it.
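
Because Aurora presents a standard Postgres (or MySQL) interface, existing client libraries work unchanged. A minimal sketch with psycopg2, where the endpoint, database, and credentials are placeholders:

    import psycopg2

    # The Aurora cluster endpoint behaves like any Postgres server;
    # hostname and credentials here are placeholders.
    conn = psycopg2.connect(
        host="my-aurora.cluster-abc123.us-east-1.rds.amazonaws.com",
        port=5432,
        dbname="orders",
        user="app",
        password="secret",
    )

    with conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM orders WHERE status = %s", ("open",))
        print(cur.fetchone()[0])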

DynamoDB

DynamoDB is the primary NoSQL offering from AWS. NoSQL services are another approach to eliminating that single point of failure for highly distributed systems like microservices. Other popular examples include CouchDB, Cassandra, Riak, and MongoDB. Each is tuned for particular use cases, but they are largely interchangeable conceptually. They are meant to replace or augment a traditional RDBMS and provide data storage in a “cloud native” manner. In practice, this means they can scale horizontally, by adding more servers to a cluster, far more easily than a traditional RDBMS can. Spreading the work around benefits both performance and reliability, and it also lets you manage data volumes that far exceed what would be possible with a monolithic RDBMS.
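
Here is a brief sketch of DynamoDB’s programming model using boto3’s resource API. The table and attribute names are hypothetical, and the table is assumed to use order_id as its partition key:

    import boto3

    # DynamoDB stores schemaless items addressed by a partition key
    # (and optionally a sort key). Table name is a placeholder.
    table = boto3.resource("dynamodb").Table("orders")

    # Write an item; attributes beyond the key need no declared schema.
    table.put_item(Item={"order_id": "42", "status": "open", "total": 1999})

    # Read it back by key.
    item = table.get_item(Key={"order_id": "42"}).get("Item")
    print(item)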

Redshift

Data warehousing is used primarily to provide business intelligence and analytics over very large data sets, often in the form of data that has been “cubed”. Simply put, this is not only the data, but also how the data has changed over time. Adding this element of time is one factor that has forced people to re-think data storage for real-time analytics, because traditional RDBMSs simply weren’t designed for it. Redshift addresses this through column-oriented data storage and massively parallel compute resources, and it is accessed through a (mostly) Postgres-compatible SQL syntax.
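
Since the SQL surface is mostly Postgres-compatible, the same psycopg2 client sketched above for Aurora works here too, just pointed at the cluster endpoint on Redshift’s default port 5439. The endpoint, table, and columns are placeholders:

    import psycopg2

    # Redshift speaks the Postgres wire protocol; cluster endpoint and
    # credentials are placeholders.
    conn = psycopg2.connect(
        host="my-warehouse.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="analyst",
        password="secret",
    )

    with conn, conn.cursor() as cur:
        # Column-oriented storage makes scans over a few columns of a
        # very large fact table cheap.
        cur.execute("""
            SELECT date_trunc('month', sold_at), sum(total)
            FROM sales
            GROUP BY 1
            ORDER BY 1
        """)
        for month, total in cur.fetchall():
            print(month, total)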

In practice, a mature microservices-based application will likely need multiple kinds of storage to meet the requirements of its users. AWS has managed offerings in all of the major categories of data storage systems, and regularly adds new ones. Generally these services offer a level of performance and reliability that would be difficult to match using a self-managed deployment.

Visibility

The final layer of a microservices-based system is the visibility layer. Some people consider it a separate entity, but I disagree. One of the major criticisms levelled against microservice architectures is that they are too complex and therefore too difficult to troubleshoot. The best way to combat this problem is to design the visibility layer in with the rest of the application and infrastructure, and to treat it as a distributed system as well. This is done most simply via metrics collection and log aggregation, which AWS provides primarily through its CloudWatch service. It is simple to deploy and quite inexpensive for what it offers.
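
For example, publishing a custom application metric to CloudWatch is a single API call. A minimal sketch with boto3, where the namespace, metric name, and dimension are hypothetical:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a custom metric; namespace, name, and dimension values are
    # placeholders for whatever your service measures.
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[
            {
                "MetricName": "OrdersProcessed",
                "Dimensions": [{"Name": "Service", "Value": "checkout"}],
                "Value": 1.0,
                "Unit": "Count",
            }
        ],
    )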

However, visibility at large is one area where AWS is weak. Many of its services don’t provide all the metrics an operations group would like to see, and when they do, the ability to visualize and dashboard them in CloudWatch is limited. Log aggregation also leaves something to be desired: each AWS service can potentially produce logs in a different way. Many, but not all, can send logs directly to CloudWatch. Some store their logs as streams of files in S3. Others store them internally to the service itself, where they must be extracted via the management console or API and then fed into another system for aggregation.

To augment what AWS provides, I recommend adding a more fine-grained metrics collection system, a time-series database (TSDB) for long-term storage of metrics, and a visualization and dashboarding system. Netdata (metrics), Graphite (TSDB), and Grafana (dashboarding) are all prominent examples of Open Source tools that fill the gaps in the AWS offerings. They work well together and are manageable even for small teams.
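
For a sense of how approachable Graphite is, its plaintext protocol accepts a “metric.path value timestamp” line over TCP port 2003. The hostname and metric path below are placeholders:

    import socket
    import time

    # Graphite's plaintext listener accepts "metric.path value timestamp"
    # lines on TCP port 2003. Host is a placeholder.
    line = f"myapp.checkout.orders_processed 1 {int(time.time())}\n"

    with socket.create_connection(("graphite.example.com", 2003)) as sock:
        sock.sendall(line.encode("ascii"))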

Once your needs grow further, you are also likely to want a more sophisticated log aggregation tool. CloudWatch Logs is pretty limited in this space, and AWS has only recently started adding features to make it more robust. A promising Open Source alternative is Loki, a Prometheus-like log aggregator that is fairly new but gaining traction. The “standard” self-managed tool for this is the Elastic Stack suite built around Elasticsearch. It generally works very well, but can be quite complicated to manage and use. AWS offers a managed Elasticsearch service, but it is very prescriptive in how it is configured and used, so there is a good chance you may not be able to leverage it.

The final piece of visibility tooling that is especially valuable for microservices is an application performance monitoring (APM) system. APM tools are generally a combination of a “probe” built into the application to be monitored and a data aggregation and analysis service used to view the data coming from the probes. AWS does not currently have an offering in this space, and there aren’t any Open Source options that I’m aware of, but there are many commercial products available, including New Relic APM, AppDynamics, and Dynatrace. They tend to be expensive, but if you operate microservices at sufficiently large scale, they can provide insights that would be very difficult to obtain otherwise.

Conclusion

The array of tools that AWS offers to support microservice-based applications is impressive. They provide a sufficiently gentle on-ramp to help newcomers get started, but enough power and flexibility to meet the needs of even the most demanding applications. If AWS put more effort into improving the visibility elements of its services, it could truly be a one-stop shop for the microservice application architect. Until then, there are a variety of ways to fill those gaps with free or Open Source tools or SaaS offerings from third parties.



