Shift Left. Build Right. Knock knock, Mr. Smith
I built a system at Adobe called Spot Stack. It was tiered infrastructure provisioned by Auto Scaling Groups (ASGs) that delivered EC2 resources to a pre-Kubernetes container platform. It managed capacity well, and did so at the lowest cost possible.
Inevitably, Auto Scaling groups improved, rendering the system I built pointless. I would argue that Spot Stack was better. A few times. But the fact was AWS knew what I built and built it better. Specifically, Mixed Instance Groups (MIGs) made it possible to ask for multiple instance types at once. It was on my six before I ever saw it; the Spot Stack retirement was inevitable.
There were other improvements, like better scaling and faster launches, but Mixed Instance Groups was The One. Auto Scaling's Neo to Spot Stack's Mr. Smith.
But wait. There are some that suggest Mr. Smith was actually The One…
Can a Customer Outbuild AWS?
A customer I can't name (some of you know) took every one of the four concepts in this series and put them to work in a single architecture. What they built answers that question.
You've recognized by now this was a series, right? No? Welcome to the finale. You have some catching up to do.
Getting Shifty
Finance reacts to numbers; engineering responds to signals. This team made cost a first-class engineering metric before touching instance selection.
Shifting left, the engineering team put their focus on unit costs. Very proactive in a typically reactive space.
The only CFO callout they’ll get is in the earnings report.
They benchmarked every compatible instance type under their actual workload and assigned a throughput value to each. Cost per hour matters, but cost per transaction matters more. One is resource pricing, the other measures value. Like MPG.
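To make the MPG comparison concrete, here is a minimal sketch of turning a benchmark result into a unit cost. The instance labels, prices, and throughput figures are made-up placeholders, not the customer's numbers, and the "per million transactions" unit is just one convenient way to express it.

```python
# Hypothetical benchmark results: hourly price (USD) and sustained
# transactions per second under the real workload. Placeholder values.
BENCHMARKS = {
    "cheaper-per-hour": {"price_per_hour": 0.34, "tps": 120.0},
    "pricier-per-hour": {"price_per_hour": 0.40, "tps": 160.0},
}

SECONDS_PER_HOUR = 3600


def cost_per_million_transactions(price_per_hour: float, tps: float) -> float:
    """Dollars spent to process one million transactions at full throughput."""
    transactions_per_hour = tps * SECONDS_PER_HOUR
    return price_per_hour / transactions_per_hour * 1_000_000


unit_costs = {
    name: cost_per_million_transactions(b["price_per_hour"], b["tps"])
    for name, b in BENCHMARKS.items()
}

# Cheapest unit cost first; this is the "MPG" view of the fleet.
for name, cost in sorted(unit_costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost:.3f} per million transactions")
```

With these placeholder numbers, the instance with the higher hourly price comes out cheaper per transaction, which is exactly the kind of result a sticker-price comparison hides.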
Price Performance is a Service
Remember the days when every new generation of AWS instances cost less than the generation before? When using a c4 vs a c5 was stupid because not only was the c4 slower but it cost more? Those days are gone.
AWS Marketing wants you on the latest and greatest instance type as quickly as it's released, but is that really the right choice anymore? Maybe. Do you know? Can you prove it or disprove it? I doubt it. They could.
What they learned is that the ratio between an instance type's cost per hour and the work it produces is not predictable. Within the same type, just changing size? Sure. But comparing Intel to AMD and Arm, or even different generations of the same architecture? No.
This pays for itself quickly. It delivers on the shift-left commitment and it's a service to the product. Those savings can be reinvested into more engineers. Or AI.
But Weight, There's More
Those throughput values didn't stay in a spreadsheet. They became ASG instance weights.
Weights are amazing. And hard. Many seasoned AWS architects don't understand them. Make sure you do before you use them. Hint: it's not weight 1 = first choice, weight 2 = second, etc. A weight is how many units of capacity one instance of that type contributes.
Instead of "launch N instances," weights allow you to define capacity in your own terms. Units. In their case, transactions per second. And yes, TPS reports.
Configured this way, the ASG doesn't care about instance size or generation or architecture; it cares about how much work gets done. A larger or faster instance contributes more weight because it produces more throughput.
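A rough sketch of what "capacity in units" means in practice. The type names and weights are hypothetical, and the rounding-up behavior is an assumption about how a weighted group fills a target with whole instances: it meets or slightly overshoots the target rather than undershooting it.

```python
import math

# Hypothetical weights: benchmarked TPS per instance, kept as whole numbers.
WEIGHTS = {"small-type": 10, "large-type": 20, "fast-type": 35}


def instances_needed(target_units: int, weight: int) -> int:
    """Whole instances of one type required to meet or exceed the target."""
    return math.ceil(target_units / weight)


def provisioned_units(target_units: int, weight: int) -> int:
    """Capacity actually delivered; it can overshoot the target."""
    return instances_needed(target_units, weight) * weight


target_tps = 100  # desired capacity expressed in TPS, not instance count
for name, weight in WEIGHTS.items():
    count = instances_needed(target_tps, weight)
    delivered = provisioned_units(target_tps, weight)
    print(f"{name}: {count} instances -> {delivered} TPS")
```

The same 100 TPS target is ten of one type, five of another, or three of a third (with a little headroom). The group is sized by work, not by machine count.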
Then comes the interesting part: value. A machine doing 10 TPS for $1/hr costs $0.10 per hour for each unit of throughput. Another doing 20 TPS for $1.50/hr costs $0.075. That difference is everything.
They ranked every instance type by cost per transaction and fed that list directly into the ASG as the order for its prioritized allocation strategy. When capacity is needed, it starts with the best value and works its way through the list until the target is met.
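The ranking step can be sketched like this, using the two machines from the example above. The instance type names are hypothetical, and the `InstanceType` / `WeightedCapacity` dictionary shape is an assumption modeled on the override entries of an Auto Scaling mixed instances policy, where the weight is passed as a string. This only builds the list; it calls no AWS API.

```python
# The two example machines, with made-up instance type names.
CANDIDATES = [
    {"instance_type": "example.slow", "price_per_hour": 1.00, "tps": 10},
    {"instance_type": "example.fast", "price_per_hour": 1.50, "tps": 20},
]


def cost_per_unit_hour(candidate: dict) -> float:
    """Hourly price divided by benchmarked throughput (USD per TPS-hour)."""
    return candidate["price_per_hour"] / candidate["tps"]


def build_overrides(candidates: list) -> list:
    """Best value first; a prioritized strategy follows the list order."""
    ranked = sorted(candidates, key=cost_per_unit_hour)
    return [
        {"InstanceType": c["instance_type"], "WeightedCapacity": str(c["tps"])}
        for c in ranked
    ]


overrides = build_overrides(CANDIDATES)
for entry in overrides:
    print(entry)
```

The pricier machine lands first because it is the better deal per unit of work. Real weights would need to be whole numbers, so a benchmark that yields 19.6 TPS has to be rounded before it goes in.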
Spot Stack v2?
Using weights, they built three Auto Scaling groups. The interplay between those groups is the cleanest Spot implementation I have ever seen. Not gonna lie, though, it seems a bit familiar.
The first ASG covers the static footprint with Compute Savings Plans. Committed spend at the best non-Spot rate, covering the footprint they'll never drop below. Within that commitment, everything stays flexible. New instances, new architectures, room for both evolution and revolution.
The second ASG is Spot-only. Same as before: instance weights set to benchmarked throughput values. The difference is the allocation strategy. Price-capacity-optimized trades a bit of savings for stability across deeper pools. Interruptions have a price, too.
One sec. Bailey's latte interruption: Q rico! $19,500 COP. Sorry (not sorry).
With weights in play, what's actually being minimized is cost per transaction across the full diversification set. The more pools in the mix, the better the odds of finding a cheaper transaction without sacrificing stability.
The third ASG is On-Demand, triggered by the same CloudWatch metric at a slightly higher threshold than the Spot group. While Spot has capacity, it stays quiet. When Spot pools thin out and the metric crosses that threshold, on-demand picks up the slack. When Spot recovers, it steps back down. The failover works both ways, and scaling in removes OD from the third group before Spot from the second.
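A toy model of the two-threshold idea, not their scaling policy. The threshold values and the notion of a single numeric load metric are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values for one shared load metric. The On-Demand
    # threshold sits slightly above the Spot threshold.
    spot_scale_out: float = 60.0
    on_demand_scale_out: float = 70.0


def groups_scaling_out(metric: float, t: Thresholds = Thresholds()) -> list:
    """Which groups ask for more capacity at this metric value."""
    groups = []
    if metric > t.spot_scale_out:
        groups.append("spot")
    if metric > t.on_demand_scale_out:
        groups.append("on-demand")
    return groups


print(groups_scaling_out(50.0))  # below both thresholds
print(groups_scaling_out(65.0))  # Spot reacts first
print(groups_scaling_out(80.0))  # Spot could not keep up, or a burst
```

When Spot launches succeed, the metric falls back before it ever reaches the higher threshold, so the On-Demand group stays idle. It only wakes up when Spot cannot absorb the load.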
Same metric. Different thresholds. Genius. This even doubles scale-out during extreme bursts. But wait, it gets better.
The Spot ASG has a MaxPrice cap per unit hour. With weights in play, that means the hourly price of each instance type was evaluated against the amount of work it represented, not just the sticker price of the instance itself.
The interesting part is how they derived those numbers. For each instance type in the mix, they calculated the hourly rate at which the effective cost per transaction matched their on-demand baseline.
When no Spot pools can limbo beneath the bar, Spot can't launch and the third pool takes over.
It's clever as hell; the same benchmark that generated the ASG weights also generated the MaxPrice.
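Here is a hedged sketch of that derivation. The baseline figure, type names, and weights are placeholders, and "baseline" is assumed to mean the On-Demand cost per unit of throughput per hour.

```python
# Hypothetical On-Demand baseline in USD per TPS-hour. Placeholder value.
ON_DEMAND_COST_PER_UNIT_HOUR = 0.075

# Hypothetical weights: benchmarked TPS per instance.
WEIGHTS = {"example.slow": 10, "example.fast": 20}


def break_even_hourly_price(weight: int, baseline: float) -> float:
    """Spot hourly price at which this type's unit cost equals the baseline."""
    return weight * baseline


def spot_is_good_deal(spot_price_per_hour: float, weight: int, baseline: float) -> bool:
    """True while Spot still matches or beats the On-Demand unit cost."""
    return spot_price_per_hour / weight <= baseline


caps = {
    name: break_even_hourly_price(weight, ON_DEMAND_COST_PER_UNIT_HOUR)
    for name, weight in WEIGHTS.items()
}
for name, cap in caps.items():
    print(f"{name}: break-even Spot price ${cap:.2f}/hr")
```

Because the cap is expressed per unit hour, every instance type is held to the same bar: divide its Spot price by its weight, and if the result is above the baseline, it is not a deal.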
But why that cap? Good question.
Because it defines exactly when the economics stop working in Spot's favor. Not earlier, burning money on unnecessary on-demand. Not later, paying Spot prices that stopped saving anything.
Without the throughput data from Part 2, that cap would be a guess. With it, engineering owns the ceiling. And the ceiling only means something because they measured what it should be.
Insufficient Spot capacity? Launch On-Demand.
Spot unit price higher than On-Demand? Launch On-Demand.
Parts 1 through 4 of this series are all present here. They've earned every bit of cost savings they achieve.
Buy vs Build
For anyone running Karpenter with EKS: a version of this happens already, without the three-ASG scaffolding. Karpenter evaluates available node types against workload requirements and finds the lowest-cost match at provision time, including Spot pools and cross-family instance selection.
When there's no Spot capacity, or instances are being interrupted, Karpenter launches OD instead. When Spot capacity returns, Karpenter shifts the cluster back to Spot.
Back and forth. Like windshield wipers.
Karpenter doesn't announce that it can help with Spot depletion, but it can. Windshield wipers don't announce they're cleaning off bugs and bird stuff either, but they do, and they do it well. Usually.
The three-ASG model makes that behavior visible. It shows exactly what Karpenter is optimizing.
If you want this level of intelligence without building it from scratch, Dr XOSphere has already implemented parts of it (and they can hire me to build the rest).
Nothing Is Free
None of this works for workloads that haven't done the earlier work first, and it demands continued due diligence afterward.
The three-ASG pattern requires stateless, horizontally scalable services. If your application carries session state on the instance or treats the disk as persistent storage, the Spot tier may require significant design work before it fits. The weights mechanism needs a clean, consistent throughput metric. If your TPS equivalent isn't well-defined, the benchmark step is noise. Everything downstream is noise too.
The operational overhead is real, but reasonable. Three ASGs means three capacity surfaces to monitor and maintain. MaxPrice caps need revisiting when on-demand pricing shifts (e.g. discounts) or the workload changes. Weights need recalibration when instance families or traffic patterns evolve.
Teams that reach for this pattern before building the benchmarking foundation build something that looks right on a diagram and behaves wrong in production. If the architecture already earns Spot, it earns this too. If it doesn't, start there.
(REDACTED), if you're out there reading this, my hat's off to you!
I built Spot Stack to chase the cheapest available pool. This customer built a system that refuses bad deals, even when Spot offers them. Same toolbox. Different starting point. Spot Stack v2 wins, but someone else built it.
Spot is the reward for good architecture.
You have to earn it.
They did.