Spot Is the Reward for Good Architecture
I've never been woken up at 2am because of Spot.
That needs context, because without it, it sounds like a brag. I've been paged plenty. Or called. Pagers are a bit dated. You get the point. A config edit on an HA pair that triggered an immediate failover. A failed rehash of a storage archive that ended with breakfast at IHOP at 6 AM. I could go on. I'm sure many of us can. Point being, I know the particular flavor of adrenaline that hits when your phone buzzes and the screen says something you don't want to read.
Spot has never been the thing I had to read. Not once.
Failure as a Service
Spot instances are spare capacity that is otherwise just consuming electricity by spinning fans and hard drives. AWS pioneered the idea of selling it at a discount with the condition that they could reclaim it when needed. GCP and Azure followed.
Werner Vogels is credited with the expression "Everything fails, all the time." Another term for failure in the industry is Tuesday. Or Wednesday. You get the point. The thing that separates good architecture from bad architecture is not whether failure happens, but what the system does when it does.
On-Demand hardware failure is often random and silent. The instance is there, and then it isn't, and the first evidence can be an alert starting the chain of dominoes that leads to system failure. Imagine if you had a two-minute head start on that chain.
You can. When a Spot instance is going to be interrupted, AWS delivers a notice via both instance metadata and EventBridge, two minutes before the instance is reclaimed. For most apps, two minutes is enough to drain, checkpoint, and hand off cleanly, if the application is listening.
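The listening part can be small. Here's a minimal Python sketch of the decision half of an interruption handler; the function name is my own, and the timestamp format matches the documented `spot/instance-action` notice. The actual polling is sketched in a comment.

```python
import json
from datetime import datetime, timezone

def seconds_until_reclaim(notice_json: str, now: datetime) -> float:
    """Parse a spot/instance-action notice and return drain time remaining."""
    notice = json.loads(notice_json)  # e.g. {"action": "terminate", "time": "..."}
    reclaim_at = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (reclaim_at - now).total_seconds()

# In production you'd poll http://169.254.169.254/latest/meta-data/spot/instance-action
# every few seconds (it returns 404 until a notice exists), or subscribe to the
# "EC2 Spot Instance Interruption Warning" event on EventBridge.
```

Whatever the result, the response is the same: stop taking new work, flush what you have, and let the scheduler do the rest.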
Idempotent operations and externalized state are not nice-to-haves; they're load-bearing walls. The interruption handler is your first response, not your only one. Take queue workers, for example: they take a job, they finish it, they move on. If they never report back, the job stays in the queue to be tried again.
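That loop is easy to sketch. Below is a toy Python version of the pattern, with an in-memory queue standing in for SQS and names of my own invention: a job is only removed from the queue after the handler succeeds, which is exactly why the handler must be idempotent.

```python
import queue

def run_worker(jobs: "queue.Queue", handler, results: set) -> None:
    """Toy queue worker: delete the message only after the work succeeds."""
    while not jobs.empty():
        job = jobs.queue[0]      # peek without removing: mimics SQS visibility
        try:
            handler(job)
            results.add(job)     # idempotent side effect: repeating is harmless
            jobs.get()           # "delete message" happens only on success
        except Exception:
            continue             # worker "died" mid-job; the job stays queued
```

A handler that crashes mid-job (Spot reclaim, hardware failure, whatever) leaves the job in place, and the retry produces the same result as the first attempt would have.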
A system that was already built to survive random, unannounced hardware failure (the On-Demand scenario) is overqualified for a failure mode that announces itself in advance. If your architecture handles the harder problem, it handles Spot for free. The architecture may not change, but the pricing sure does.
Sink or Swim
A quick alignment on a term critical to Spot: pools. Not the aqua-blue structures we look for at resorts, but collections of cloud compute instances grouped by type, size, and location. Each has its own capacity, which drives its price and interruption rate. Just like a resort, the more pools the better.
Running across multiple pools does not change the interruption rate of any individual pool. What changes is the blast radius to your workload should one pool get exhausted.
With one pool, an event has the potential to be an outage. Given enough reclamations, the fleet doesn't shrink, it disappears. With five independent pools, a reclamation event in one pool is just a capacity adjustment. The fleet gets smaller. The scheduler replaces the lost capacity from the remaining pools. Easy.
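The arithmetic is worth seeing. A sketch, with my own simplification that surviving pools have headroom to absorb the loss:

```python
def fleet_after_pool_loss(pool_sizes: list, lost_pool: int) -> list:
    """Rebalance a lost pool's capacity evenly across the survivors.
    Returns the new per-pool instance counts."""
    lost = pool_sizes[lost_pool]
    survivors = [s for i, s in enumerate(pool_sizes) if i != lost_pool]
    if not survivors:
        return []                        # single pool: the fleet is gone
    per_pool = lost / len(survivors)
    return [s + per_pool for s in survivors]
```

With one pool of 100, losing it means zero instances. With five pools of 20, losing one means four pools of 25 and a fleet that never dipped below 80.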
While diversification includes size, different instance families across different availability zones pay off faster than slicing the same family into ever-smaller sizes. Half a dozen to a dozen distinct pools should be your floor. The more capacity you want, the more pools you want.
Not Spot
When engineers say Spot is unreliable, the story is almost always the same. Single instance type. Single availability zone. On-Demand gets constrained, Spot gets reclaimed, and there's nowhere to go. The workload goes down, and the operations team files Spot under "unreliable" and goes back to debugging Kubernetes YAML.
So the same architecture is moved to On-Demand. A hardware failure hits some days later, Auto Scaling doesn't register the problem because an instance can't fail health checks it can't conduct, and the result is identical. Same constrained workload, same single point of failure. Same outage. Same postmortem. The pricing model didn't cause the failure. The architecture did.
If your services are stateless (or close enough), if your fleet spans multiple AZs, and if your scheduler can replace instances without human intervention (e.g. ASG, Karpenter), you've already done the necessary work. Spot instance reclamation is but a single scenario of the general failure your architecture was built to handle. You're not starting from scratch. You're collecting on an investment you already made.
Whatever
Attribute-based instance selection is underrated. Highly. Instead of naming specific instance types, you describe a compute shape: typically a vCPU range and a memory range. There are other options, like processor architecture and local storage. This can increase your options by an order of magnitude. I've seen it solve many capacity problems, and it can do the same for price. Speaking of which: price protections will prevent you from receiving a massive instance packed full of GPUs. Unless, of course, you wanted that.
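The matching itself is simple to picture. A toy Python filter over a made-up catalog (the real matching is done for you by EC2 Fleet, Auto Scaling's `InstanceRequirements`, or Karpenter; every number below is illustrative):

```python
# Hypothetical slice of an instance catalog: name -> shape.
CATALOG = {
    "m6i.xlarge":   {"vcpu": 4,  "mem_gib": 16},
    "m7g.xlarge":   {"vcpu": 4,  "mem_gib": 16},
    "c6a.2xlarge":  {"vcpu": 8,  "mem_gib": 16},
    "r6g.large":    {"vcpu": 2,  "mem_gib": 16},
    "p4d.24xlarge": {"vcpu": 96, "mem_gib": 1152},  # the GPU monster
}

def matching_types(vcpu_min, vcpu_max, mem_min, mem_max, catalog=CATALOG):
    """Describe a shape, get back every type that fits it."""
    return sorted(
        name for name, spec in catalog.items()
        if vcpu_min <= spec["vcpu"] <= vcpu_max
        and mem_min <= spec["mem_gib"] <= mem_max
    )
```

Ask for 4-8 vCPUs and 8-32 GiB and you get three pools across two architectures instead of one hand-picked type; the 96-vCPU monster filters itself out.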
If you're using Kubernetes, Karpenter does this natively. It evaluates dozens of parameters to match instance types, and provisions nodes quickly and asynchronously. It does this without node groups, reducing complexity and increasing efficiency. Same concept, same benefit.
The allocation strategy, or how instances are selected when many match, matters more than most teams realize. The lowest-price strategy selects the cheapest available pool. These Spot instances are often short-lived, as interruptions shift capacity from pool to pool. Totally acceptable for test workloads. Or well-architected ones.
AWS stopped recommending lowest-price for Spot at re:Invent 2022 in favor of price-capacity-optimized, which balances cost against available capacity. You pay marginally more than the absolute floor and interrupt considerably less. That tradeoff is not close.
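To make the tradeoff concrete, here's a toy chooser. This is not AWS's actual algorithm, and `capacity_score` is an invented depth metric; the point is only that lowest-price looks at the sticker while price-capacity-optimized lets a deeper pool beat a marginally cheaper one.

```python
def pick_pool(pools, strategy="price-capacity-optimized"):
    """Toy pool chooser; capacity_score is a made-up depth metric."""
    if strategy == "lowest-price":
        return min(pools, key=lambda p: p["price"])
    # Penalize shallow pools: cheap-but-scarce loses to slightly pricier-but-deep.
    return min(pools, key=lambda p: p["price"] / p["capacity_score"])

pools = [
    {"name": "shallow", "price": 0.10, "capacity_score": 1},
    {"name": "deep",    "price": 0.12, "capacity_score": 5},
]
```

Two cents an hour buys a pool five times deeper. That's the whole argument.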
Curious to know which tool best predicts Spot interruptions? Many try. All fail. Spot is spare capacity, and it's really only possible to trend usage, assuming it will behave tomorrow as it did yesterday. You can look for patterns, but you can do that in the stock market, too. Which do you think will fare better?
Yes, there are tools that track price and interruption rates, and if you ask me to recommend one, it's the Spot Placement Score. I wrote a blog about it and believe in it as strongly today as I did then.
Or perhaps you've bookmarked the Spot Instance Advisor. Delete that.
There is no tmux for Spot
Databases and stuff like that on Spot? Just don't.
State is effectively memory. Perhaps it's session information in a web UI. Lose it and someone has to log in again. Or maybe it's a document that took hours to write. Lose it and, well, you probably lost a customer with it.
Spot instances should never store anything for more than, let's see if you've been paying attention... two minutes. Whether it lives in RAM or on disk, and whether you're running on EC2 or Fargate, anything worth keeping belongs somewhere durable: EFS, RDS, or maybe S3. With adequate notification, your outputs should be capable of being saved elsewhere without trouble.
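In code, "store nothing longer than two minutes" looks like checkpointing as you go. A sketch of my own, with a dict standing in for the durable store (S3, DynamoDB, whatever) and `stop_after` simulating a reclaim mid-run:

```python
def process(items, store, out, stop_after=None):
    """Process items, externalizing progress after each one, so an
    interruption loses at most the in-flight item and a relaunch resumes."""
    start = store.get("progress", 0)
    for n, i in enumerate(range(start, len(items))):
        if stop_after is not None and n >= stop_after:
            return                     # simulated Spot reclaim mid-run
        out.append(items[i] * 2)       # the "work" (placeholder)
        store["progress"] = i + 1      # durable checkpoint after each item
```

Kill it mid-run, launch it again, and the replacement picks up where the casualty left off. No heroics required.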
Any well-architected, fault-tolerant, loosely-coupled workload should be capable of this. Especially on Spot, where the only promise you get is a two minute warning.
Arm Yourself
Graviton instances run roughly 10 to 15% cheaper per hour than equivalent Intel retail pricing. Run them on Spot, and the up-to-90% discount applies as well. These are multiplicative, so let's say 10% discount on Graviton and 90% discount on Spot. Do the math. It isn't free. Damn close to it, though.
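Doing the math with those round numbers:

```python
# Discounts multiply; they don't add. Normalize x86 On-Demand to $1.00/hr.
x86_on_demand = 1.00
graviton = x86_on_demand * (1 - 0.10)   # ~10% cheaper per hour
graviton_spot = graviton * (1 - 0.90)   # up-to-90% Spot discount on top
# Result: roughly $0.09 on the dollar.
```

Nine cents on the dollar. Told you it was close to free.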
Some have observed difficulty obtaining Graviton Spot instances, but as AWS continues to rack them, that pressure will surely relent. Watch for changes in capacity with the Spot Placement Score tracker (genius blog, btw) to see when Graviton starts to free up.
If you can manage it, run your workload on both x86 and arm. Doing so gives you the flexibility to launch either. Your platform will be more flexible and tolerant, enabling regions others might avoid. This is also the secret to cost and capacity in Karpenter.
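In Karpenter terms, that flexibility is a couple of requirement lines. A sketch of a NodePool open to both architectures and both capacity types; the name and referenced EC2NodeClass are illustrative, so check the schema against the Karpenter version you run.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: flexible            # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]        # x86 and Graviton both eligible
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]     # Spot first, On-Demand as fallback
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default       # assumed to exist in your cluster
```

Every value you add to a requirement is another set of pools Karpenter can draw from.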
You Earned It
Spot is not for every workload. The limitations and qualifications are real. There are systems that need guarantees that Spot is structurally unable to make. And you remember the only promise Spot makes, right? Right.
AWS says it best: "Spot is a fit for stateless, fault-tolerant, loosely-coupled workloads." They recently added "that are time and location flexible". Not wrong, but you can get away with a subset of that if you build well.
The list of workloads where Spot does well is long: batch processing, containerized services, stateless APIs, anything that scales horizontally and recovers gracefully.
The difference between teams that run Spot without incident and teams that get paged is not tooling, and it's not a checklist they found in a blog post. It's architecture. The same architecture you should build if Spot didn't exist, because stateless design and resilience are worth having on their own.
Spot is what happens when you build well and then notice there's a pricing model that rewards you for it.
Oh, and what I didn't mention when I defined Spot is that it's contract-free. Yeah. There's no 1-year or 3-year commit; the contract is in the ephemeral nature of the model. Which means you can use 600,000 vCPUs on Spot for a week and turn them off when you're finished. And just like that... they're gone.
You can choose Spot for all its killer benefits, and to flex on your buddies when Finance doesn't call you upstairs next month. But to really use it, you have to earn it.