Better Than Overprovisioning: Underprovision Your Cloud Services

Back when I was at Azure, we had a running joke that paying for cloud services is just “paying for someone else’s electricity at a markup.” If that’s the case, then it’s time to talk about the utility bill.

As a matter of routine business, we have gotten comfortable with overprovisioning our cloud services. We might be tempted to think that the best solution is to come up with a better way of right-sizing our provisioning. But the best solution might just be underprovisioning. That is, we might be able to cut a sizeable chunk off of our cloud expenses (without taking any risks) by allocating fewer resources to running our apps than our apps actually claim to need.

One critical focus of cloud spending is compute. As an industry, we have gotten comfortable with overspending on cloud compute services. In fact, we’ve gotten so comfortable with overspending that it is difficult to even think about doing something else. The accepted provisioning practices for Kubernetes applications provide a good example.

Kubernetes runs applications inside of containers. To run a typical cloud application like a microservice, a container starts its own server instance and then listens full-time. That is, a container must run continuously for the application in the container to do its job. But there’s more. In Kubernetes we don’t just run one instance of a container. The accepted practice is to run between three and five replicas of every deployment (which consists of one or more containers). That means for every microservice or web app that I am running, I should run at least three identical copies at all times.

Later we will come back to how this is a wasteful use of resources. But first, let’s talk about what we assume (perhaps wrongly) to be the ideal case.

Right-sized Provisioning

When we talk about provisioning applications, we are talking about allocating resources (CPU, memory, network, storage, and so on) to an application. In the Kubernetes example above, provisioning meant running three replicas of the container that holds my application.

It should be possible, at least in theory, to provision resources for an application such that:

  • There are enough instances of my application to handle peak load
  • My application instances are not consuming resources they don’t need
  • If part of the infrastructure fails, my app is still available

In practical terms, say I have a CMS-powered website. My website gets hardly any traffic over the weekends. It gets moderate traffic during most workdays. And every once in a while, maybe once every few months, a piece of content on the website gets more popular and I see a major spike in traffic.

Right-sized provisioning would have it such that during any of those three usage patterns, I am (1) not paying for resources I am not using, and (2) always able to handle the traffic with the existing resources. And (3) if part of my infrastructure goes down, my application is still available and resourced.

For convenience, let’s refer to these as:

  1. The efficiency criterion: don’t pay for resources that are not needed or currently used
  2. The peak load criterion: even during peak load, an app should have sufficient resources
  3. The resiliency criterion: When infrastructure fails, the app should (within reason) still have the resources to handle demand

Right-sized provisioning is hard enough that most people do not strive for it. Instead, we settle for overprovisioning.

Overprovisioning is a Problem

How do I provision an application such that my application can handle high loads (at unpredictable times), keep running if part of the infrastructure goes down, but not waste resources the rest of the time?

One way to answer is to concede that we can’t have all three, and that one of them can be given up if we are okay paying for it. We can keep the ability to handle peak load, and keep the ability to resist failure. And we can give up the requirement that we not waste resources.

There are some good reasons to make this choice. Cloud technologies like virtual machines and containers tend not to start fast enough for us to run at lower levels of resource consumption most of the time, and then rapidly scale up when necessary. In Kubernetes if a pod or node goes down, it is not an option to merely hold off users while the application is restarted. Autoscalers have, to a small degree, helped. But in most cases, they are not practical for ordinary usage. In my experience, only very large installations operate autoscalers, and even then only with a great deal of care and diligence.

I would contend that for the most part we have merely accepted the fact that we can’t reasonably have all three, so it’s easiest to drop the efficiency criterion.

And this trade-off leads us to a pattern of overprovisioning by default. That is, by default, we provision such that our applications consume resources that they do not, at that moment, need. So ingrained is this mentality that in Kubernetes, the best practice is setting our default replication count at a minimum of three. That is, in Kubernetes, we always run at least three instances of every application even when an application is not under active use.

Kubernetes is certainly not the only platform that favors overprovisioning. But for a moment, let’s constrain the scope of the discussion to just Kubernetes, and let’s see if we can understand something about the cost of this pattern.

The Cost of Overprovisioning

The suggested minimum size of a Kubernetes cluster is three nodes. When I provision three instances of my app (usually by creating a Deployment and setting replicas: 3), those instances are spread across my three nodes.

We could dive into the details of memory usage or CPU consumption, but instead, let’s go with the recommendations from the Kubernetes project. Even on the biggest, beefiest machine, you should not have more than 110 pods running per node. Each replica consumes one Kubernetes pod. So on my 3-node cluster, assuming all my apps run at a replica count of 3, the maximum number of applications I can run on that cluster is 110.
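The arithmetic behind that ceiling can be sketched in a few lines of Python. The function name is made up for illustration, and the sketch assumes every pod slot is available to applications, which ignores the system pods a real cluster also runs:

```python
# Figures taken from the text: 3 nodes, 110 pods per node, 3 replicas per app.
NODES = 3
MAX_PODS_PER_NODE = 110
REPLICAS_PER_APP = 3


def max_apps(nodes: int, pods_per_node: int, replicas_per_app: int) -> int:
    """Upper bound on distinct apps when each replica occupies one pod."""
    total_pods = nodes * pods_per_node
    return total_pods // replicas_per_app


print(max_apps(NODES, MAX_PODS_PER_NODE, REPLICAS_PER_APP))  # 110
```

The replica count cancels out the node count: three nodes give 330 pod slots, and three replicas per app bring that back down to 110 applications.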

AWS’s configuration file for their Kubernetes service is public. And it turns out that 110 is more of a guideline. The closest to 110 in their matrix requires a VM sized m3.2xlarge and allows 118 pods per node. For a t2.small instance, AWS allows you to run only 11 pods per VM. For a t2.xlarge that goes to 44. And at the high end, Amazon seems to have pushed the limit well beyond 110 to 737 for classes like 16xlarge and 24xlarge.

While a small-sized VM comes in around $10/mo, the closest I could find to running more than 100 pods per node on the AWS price calculator was an m6g.4xlarge at just under $300/mo. That puts the 3-node cluster at around $900/mo.
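To put that bill in per-application terms, here is a rough sketch. The $300 figure is the rounded-up node price from above, the 110-app ceiling comes from the pod limit discussed earlier, and the function names are illustrative only:

```python
# Rounded figures from the text; real prices vary by region and over time.
NODE_COST_PER_MONTH = 300.0  # "just under $300/mo", rounded up
NODES = 3
MAX_APPS = 110  # ceiling for a 3-node cluster at 3 replicas per app


def cluster_cost(nodes: int, node_cost: float) -> float:
    """Monthly cost of the worker nodes alone."""
    return nodes * node_cost


def cost_per_app(total_cost: float, apps: int) -> float:
    """Monthly cost attributed to each app if the cluster is shared evenly."""
    return total_cost / apps


total = cluster_cost(NODES, NODE_COST_PER_MONTH)
print(f"cluster: ${total:.2f}/mo")
print(f"per app at full capacity: ${cost_per_app(total, MAX_APPS):.2f}/mo")
print(f"per app with only 10 apps deployed: ${cost_per_app(total, 10):.2f}/mo")
```

The per-app figure at full capacity is the best case. The cluster costs the same whether it hosts 110 apps or 10, so a lightly used cluster pays far more per application.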

Why does it cost so much to run Kubernetes? In our view, the main issue is overprovisioning. We need lots of resources to keep many duplicate copies of containers running all of the time just in case an application hits a spike in traffic or one of the Kubernetes nodes fails.

We made the choice to give up on the efficiency criterion. And that choice was more expensive than we thought.

We Can Meet All Three Criteria

To return to our three criteria, what would a solution look like if we did not give up on any of the three? From the discussion of Kubernetes above, we have solid answers for the peak load criterion and the resiliency criterion: We need an environment that can allocate enough resources for peak load, and can spread application instances around enough to be resilient against infrastructure outages.

But when we mix in the efficiency criterion, we have to rethink how we accomplish the other two.

According to the efficiency criterion, we should not overprovision. And it seems that the only way we can meet the objectives of the other two criteria in addition to this is to rapidly scale up and down depending on demand.

Elsewhere I have written about how WebAssembly, as a cloud compute runtime, can scale nearly instantly. If this is the case, then we have a plausible way of satisfying all three criteria:

  1. To start, provision the minimal resources
  2. When load occurs, scale to meet load—and then scale down when the load disappears
  3. Spread instances of the application across the cluster such that the application instance is available during infrastructure outages.
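The three steps above can be sketched as a pair of small functions. This is a minimal illustration under stated assumptions, not how Spin or Nomad actually schedule work: the function names, the per-instance capacity, and the round-robin placement are all hypothetical.

```python
import math


def desired_instances(in_flight: int, per_instance_capacity: int) -> int:
    """Steps 1 and 2: run nothing when idle, otherwise just enough for the load."""
    if in_flight <= 0:
        return 0  # scale to zero when there is no load
    return math.ceil(in_flight / per_instance_capacity)


def spread(instances: int, nodes: list[str]) -> dict[str, int]:
    """Step 3: round-robin placement so instances land on different nodes."""
    placement = {node: 0 for node in nodes}
    for i in range(instances):
        placement[nodes[i % len(nodes)]] += 1
    return placement


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
for load in (0, 25, 400):
    count = desired_instances(load, per_instance_capacity=10)
    print(load, count, spread(count, nodes))
```

The sketch only works as a strategy if starting an instance is cheap enough to do at request time, which is exactly the property the near-instant startup claim is about.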

When we introduced the Fermyon platform in June 2022, we illustrated the kernel of the idea above. And that’s why we can run our production-grade clusters using three t2.small virtual machines for our worker pool. Spin is simply that much more efficient (and HashiCorp’s Nomad scheduler is pretty great, too).

One of our key learnings these last few months, though, is that perhaps right-sizing our provisioning is not good enough. Maybe we can do better.

Underprovisioning Is Better

As we begin to scale out, it may be possible to exceed the three criteria. We know already that we can scale an individual application to zero when it is under no load. Then we can scale it up to tens, hundreds, or even tens of thousands of instances if load requires. Because we scale to zero, though, we might have an opportunity.

When we have more than one application running — maybe hundreds or thousands — then an interesting optimization pattern emerges: We can underprovision.

Underprovisioning occurs when you allocate fewer resources to your applications (as an aggregate) than they would need if they were all running at the same time. In the Spin application model, if an app is not actively taking requests, it is not executing. When not executing, it is taking no memory or CPU, and is not “running in the background” in the traditional sense.

In other words, an app that has scaled to zero is not consuming the resources provisioned to it.

So we could, in theory, allocate those resources to another application.
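One way to put a number on this is an oversubscription ratio: the sum of everything the apps are allowed to use, divided by what the cluster actually has. The sketch below is illustrative only. The function name is made up, and the 64 MB per-app allocation and the 32 GB machine are assumed values chosen to make the arithmetic concrete.

```python
def oversubscription_ratio(allocations_mb: list[int], capacity_mb: int) -> float:
    """Total allocated memory divided by physical capacity.

    A ratio above 1.0 means the cluster is underprovisioned: the apps could
    not all use their full allocation at the same moment.
    """
    return sum(allocations_mb) / capacity_mb


# Assumed example: 1,000 apps allowed 64 MB each on one 32 GB machine.
allocations = [64] * 1000
capacity = 32 * 1024  # 32 GB expressed in MB

ratio = oversubscription_ratio(allocations, capacity)
print(f"allocated {sum(allocations)} MB on {capacity} MB: ratio {ratio:.2f}")
```

A ratio near 2 would be reckless if every app ran continuously. It becomes reasonable only because apps that have scaled to zero hold none of their allocation.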

To understand this, we need to think a little bit about how applications consume resources in aggregate and individually.

  • What percentage are running at once?
  • Are there some that do not run frequently?
  • Are there some that consume disproportionately high resources?

These are questions we can answer by merely inspecting the workloads we are running. Yes, there is a little bit of error margin, but for the most part we can make sensible observations based on the information that is at hand.
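As a rough sketch of what that inspection might look like, suppose we sample each app periodically and record whether it was handling a request. The record type, field names, thresholds, and sample data below are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class AppStats:
    name: str
    active_samples: int  # samples in which the app was handling a request
    total_samples: int   # total samples taken
    peak_memory_mb: int  # highest memory use observed


def duty_cycle(app: AppStats) -> float:
    """Fraction of sampled time the app was actually running."""
    return app.active_samples / app.total_samples


def expected_fraction_running(apps: list[AppStats]) -> float:
    """Expected share of apps running at a random moment (mean duty cycle)."""
    return sum(duty_cycle(a) for a in apps) / len(apps)


def rarely_running(apps: list[AppStats], threshold: float = 0.01) -> list[str]:
    """Apps that run less than `threshold` of the time."""
    return [a.name for a in apps if duty_cycle(a) < threshold]


def heavy_consumers(apps: list[AppStats], share: float = 0.5) -> list[str]:
    """Apps whose peak memory is at least `share` of the combined peak."""
    total = sum(a.peak_memory_mb for a in apps)
    return [a.name for a in apps if a.peak_memory_mb / total >= share]


# Made-up observations for three apps.
apps = [
    AppStats("prod", 900, 1000, 2048),
    AppStats("staging", 50, 1000, 512),
    AppStats("dev", 2, 1000, 64),
]
print(f"{expected_fraction_running(apps):.1%} of apps running on average")
print("rarely running:", rarely_running(apps))
print("heavy consumers:", heavy_consumers(apps))
```

Each function maps to one of the three questions. The mean duty cycle is an average, so it says nothing about correlated spikes, which is part of the error margin mentioned above.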

Now, as an example, we can imagine a cluster with many applications running on it. Some are high-performance production apps. Others are staging copies that receive very little traffic. Still others are systems apps that run on a schedule. And then there are the developers’ apps that get almost no traffic and are used only for testing during the software development cycle. Without even going into the details of the questions above, we can intuitively see how underprovisioning can work. It’s easy to say that some of those apps do not need their full allocation of resources all the time. And it’s equally easy to say that many of these apps won’t be used much.

We can safely assume (again, only with this sparse info) that a development version will not consume its full resource allocation all the time. In fact, it will only use any resources occasionally, maybe even only a couple of times a day. Likewise, as we look through the other workloads, we can see patterns and understand that few, if any, of these apps are going to consume their full resources all the time.

We could conjure the specter of anomalies, though. What if some process that normally behaves well suddenly (due to a bug) consumes way more resources? That’s why we need the ability to place constraints on things. If we can say, “this application can only use 64MB of memory while that one can use up to 2GB”, then we have the safety net we need. This prevents a runaway application from suddenly bursting out of its normal pattern and consuming too many resources.
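Per-app caps also make the worst case computable. The sketch below assumes we know each app's memory cap and have an estimate of how many apps run at the same time. Both the function name and the example numbers are hypothetical.

```python
def worst_case_memory_mb(caps_mb: list[int], max_concurrent: int) -> int:
    """Upper bound on memory use if the hungriest concurrent apps all hit their caps.

    Assumes no app can exceed its cap and at most `max_concurrent` apps
    are running at any one time.
    """
    largest_first = sorted(caps_mb, reverse=True)
    return sum(largest_first[:max_concurrent])


# One app capped at 2 GB and three capped at 64 MB, at most two running at once.
caps = [2048, 64, 64, 64]
print(worst_case_memory_mb(caps, max_concurrent=2))  # 2112
```

Without caps there is no such bound, because a single misbehaving app could grow until it starves everything else on the node.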

This is where underprovisioning shines as a strategy. We can provision (and pay for) fewer resources than we would actually need if everything was running all the time. And we can do this safely because we can acquire knowledge about the behavioral patterns of applications, and apply these insights to provisioning.

Spin applications are designed to start nearly instantly and then immediately free up resources as soon as they are no longer in use. In a typical HTTP application, that means a Spin app is started when the request comes in, and shut down when the request has been handled. And a typical request is handled in a fraction of a second. Thus, that developer instance that might take up three pods and run 100% of the time on a Kubernetes cluster may, when all of the requests are combined, run for only a few total seconds the entire time it is deployed in a Fermyon cluster.
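A back-of-the-envelope comparison makes the gap concrete. The request count and per-request duration below are assumptions picked for illustration, not measurements, and the function names are made up:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def always_on_seconds(replicas: int, days: int = 1) -> int:
    """Replica-seconds consumed when every replica runs around the clock."""
    return replicas * SECONDS_PER_DAY * days


def on_demand_seconds(requests: int, seconds_per_request: float) -> float:
    """Execution time when an instance exists only while handling a request."""
    return requests * seconds_per_request


# Assumed developer app: 20 requests a day, 50 ms each, versus 3 replicas.
always_on = always_on_seconds(replicas=3)
on_demand = on_demand_seconds(requests=20, seconds_per_request=0.05)
print(f"always on: {always_on} replica-seconds per day")
print(f"on demand: about {on_demand:.1f} seconds per day")
```

Under those assumptions the always-on deployment occupies 259,200 replica-seconds a day to serve roughly one second of actual work.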

A Small Experiment

We recently did an informal benchmark of our underprovisioning strategy on Spin running on a Ryzen 5900X with 32 GB of RAM (it was a System76 desktop, in case you were wondering).

We provisioned 1,000 unique instances of the Bartholomew CMS onto a single Spin instance and then bombarded one of the instances with requests. At a concurrency of 100 simultaneous requests, Spin achieved a throughput of 4,970 requests per second and handled the average request in 15.02 milliseconds. That means that it started a Bartholomew instance, completed the request, and exited in fifteen milliseconds. And that’s with 1,000 WebAssembly components provisioned on the same Spin instance.

This is at least a first glimpse at what an underprovisioning scenario looks like. At any given moment we have a thousand different services available, and since most are not under load, they are consuming no notable resources. The ones that are running are not impeded by the presence of the others, and can still achieve high performance even when packed well beyond the normal provisioning guidelines.

Merely getting from overprovisioning to right-sized provisioning is a laudable goal. But what we have seen here is that we can go further. We can reduce application consumption to the minimum required for a particular combination of applications, not all of which will be under load at the same time. Underprovisioning is cost effective, but also safe and reliable. And that’s why we think the future of cloud compute will entail underprovisioning.

Try out Spin to experience a different way of provisioning apps
