You Moved to Kubernetes. Your CPU Utilization Is Still 10%.

I'm going hard today. I collated over 100 recent articles in the FinOps space and boiled them down to a handful of select topics, and this one stood out like a sore thumb, screaming for attention. This is going to be a bit longer than normal.

Breaking News: Nothing Has Changed

The CAST AI report popped onto my screen at 40k feet. Average Kubernetes CPU utilization: 10%. I recognized that number. McKinsey published it in 2008. For bare metal servers. Years before Kubernetes existed.

As server monkeys, my buddy and I used to laugh and joke at how much waste there was in our data centers. We didn't mind the work, of course. Stuffing hardware onto racks and finding clever new ways to route cables and cooling was always a fun business trip. Never mind the cost of the travel and hotels, the waste in hardware and energy was obvious to all of us in meetings. Fans spinning. Disks spinning. Budgets spinning... out of control.

Of course, that was fifteen years ago. I spent four years building data centers, the next four migrating to the cloud, the four after that optimizing our company's architectures to reduce waste, and then I joined AWS to help the world's largest businesses do the same.

Leaning forward in my seat (exit rows have the best legroom), I read that CAST AI, a Kubernetes cost optimization vendor, benchmarked more than 2,100 organizations across AWS, GCP, and Azure throughout 2024. The 10% is down from 13% the prior year. And 99.94% of clusters in the dataset are over-provisioned.

What the actual fuck?

The cloud efficiency pitch made utilization a central selling point. Migrating workloads would let organizations achieve the utilization rates only hyperscalers could demonstrate. Right? Pay by the hour, be elastic, no more static hardware sitting around doing nothing!

And now the number is moving in the wrong direction YOY. In fact, it hasn't moved at all since my kid was born. Like so much SaaS, we're subscribing to something we're not using, at a time when cloud compute, and especially AI, consumes enough electricity to show up in national energy projections.

The 65% Nobody Measured

The paper trail goes back to 2008. McKinsey tracked enterprise server utilization at 6-12%. By 2012, the New York Times investigation "Power, Pollution and the Internet" confirmed it. Gartner confirmed it independently. Different methodology, same answer.

The facepalm was real. But server efficiency was a new concept. The model back then was to provision for max load. Scaling up was the norm. Horizontal scaling was in its infancy.

In 2014, the NRDC published its Data Center Efficiency Assessment and introduced the number that launched a thousand migrations: cloud providers ran at 65% utilization while enterprises sat at 12-18%. AWS ran with it. Of course they did; they had pioneered the concepts of tiering and elasticity, leveraging virtualization to maximize server utilization. Their blog post cited the McKinsey data, promised 77% fewer servers and 84% less power, and concluded that the cloud was the answer.

The 65% number was always soft. The NRDC's source was WSP consultants doing stakeholder interviews and scenario modeling, not auditing systems. The actual range they published was 40-70%; the marketing department picked the ceiling. Why wouldn't they? Lawrence Berkeley National Laboratory said in 2024 that virtually no provider reports utilization in the context of actual compute capacity, and no independent auditor has ever confirmed the number. System architectures are audited for attestations. Accessing production systems would violate customer privacy, so who can argue otherwise?

Google's Borg papers show cluster utilization above 50%, which sounds like validation until you look at how. Google's reclamation system fills headroom from over-provisioned production workloads with internal batch jobs. Roughly 20% of cluster workload runs on scavenged capacity. The cluster looks busy because Google engineered it to BE busy. I approve. That's excellent engineering.

Back in the Xen days, when we built our own private cloud in a Los Angeles colo (yes, the term is fair: multiple teams sharing resources shaped for aggregate maximums), we did exactly this. We had a percentage of unused resources grinding distributed compute projects as a proof-of-concept for low-priority batch work.

But I digress. Hyperscalers genuinely ran more efficiently than enterprise data centers, though that efficiency accrued to the vendors. Cloud improved deployment velocity and reduced operational overhead. What the pitch didn't account for was that enterprises would bring their provisioning habits with them.

The Goal Isn't 100%. At 10%, You're Not Even Trying.

Google's reported 50% cluster utilization sounds like they're running at half capacity. They're not. x86 processors don't scale linearly under load. As utilization climbs past 60-70%, latency degrades non-linearly. It's the curve in the hockey stick you don't want your workloads to encounter.

GitHub's performance engineering team found that keeping utilization at or below 61% was necessary to limit CPU time degradation to under 40% on specific instance types. A 2024 paper testing Kubernetes autoscaling thresholds found that raising the scaling trigger from 60% to 70% caused measurable increase in pressure and reduction in hitting service-level objectives. So no, 100% is not the goal.
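
For reference, this is roughly where that threshold lives in a manifest. A minimal HorizontalPodAutoscaler sketch, with a hypothetical deployment name, that keeps the CPU target at 60% instead of pushing it toward the cliff:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api            # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out before the latency hockey stick
```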

That reframes everything. Google at 50% is running at 70-80% of usable capacity, with just enough headroom for autoscalers to respond before customers start looking at competitors.

Arm processors, the same low-power architecture powering your cell phone and television, change the math. See Apple's M1, M2, M3, M4, M5... when's the next WWDC? Ampere Altra and AWS Graviton maintain consistent performance under load, hitting the latency cliff at higher utilization than x86. Honeycomb migrated production workloads to Graviton3 and set their autoscaling CPU targets 10% higher than on x86, provisioning 30% fewer instances for the same latency targets. Arm is not only lower in price than comparable x86, but with that extra utilization before latency rains on your parade, you can usually run fewer of them.
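
Steering workloads onto those chips is mostly boring scheduling. A sketch with hypothetical names and a hypothetical multi-arch image, using the standard kubernetes.io/arch label to land on arm64 (Graviton) nodes; the 10%-higher CPU target would then go in that workload's autoscaler:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api-arm              # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64     # schedule onto Graviton/Ampere nodes only
      containers:
        - name: app
          image: example.com/checkout-api:2.3   # hypothetical multi-arch build
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```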

The YAML File Is Different. The Behavior Isn't.

Before containers and their orchestrators, there were virtual machines. And VMs were genuinely transformative. Bare metal meant one operating system with one or more workloads competing for resources. A server running a single application at 10% utilization was 90% an expensive paperweight. Virtualization changed that by letting multiple workloads share a physical host while keeping resources like CPU, RAM, network, and storage partitioned between them. Mix a workload that peaked during business hours with one that peaked overnight, and suddenly your host was doing real work across the full cycle.

Memory mattered here, and it was a decision. You were consciously allocating RAM across guests on a shared host. Placements were planned and intentional. CPU could be rigid (pinned) or float across guests flexibly (shared), but memory was strictly enforced.

Done well, a properly capacity-planned hypervisor could push a physical host to 60-70% utilization. That was the promise delivered. At the host level, at least. Or you could have multiple systems performing well with allocation slack and still be stuck at 10%.

Containers compressed things further, shedding the guest OS overhead entirely. But containers didn't change how engineers thought about resource requests. They changed the packaging, not the behavior. Because of their smaller size (code only), the resource needs were smaller, and careful planning was replaced with, effectively, Tetris.

Enter orchestration. Kubernetes, and others before and alongside it, automated the scheduling and placement of those containers. CNCF's 2024 survey found Kubernetes in production at 80% of organizations, up from 66% the year before. But orchestrators schedule based on what you request, not what you use. Provision for peak. Add a buffer. Ship it. The YAML file looks different than the Puppet manifest did. The behavior is identical. As are the results.
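
Here's what "provision for peak, add a buffer" looks like in practice. A hypothetical pod spec, with made-up numbers that mirror the pattern in the data:

```yaml
# What ends up in the cluster: the peak estimate plus a safety buffer.
apiVersion: v1
kind: Pod
metadata:
  name: report-worker                        # hypothetical service
spec:
  containers:
    - name: worker
      image: example.com/report-worker:1.4   # hypothetical image
      resources:
        requests:
          cpu: "4"        # "it spiked near 4 cores once, during Black Friday"
          memory: 8Gi     # peak plus buffer
        limits:
          cpu: "4"
          memory: 8Gi
# Typical steady-state usage observed: ~400m CPU, ~1.5Gi memory.
# The scheduler reserves the full request; the difference is paid-for idle.
```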

Datadog's 2025 State of Containers report, drawn from tens of thousands of production environments, found that 83% of container costs come from idle resources. Over 65% of containers use less than half their requested CPU. This isn't a tail of misconfigured services skewing the average. This is what production Kubernetes can look like at scale.

Organizations running at 10% on hardware they'd already paid for moved those workloads, and those provisioning habits, to infrastructure they rent by the hour. On owned hardware, 10% meant underutilizing a capital investment on the books. One could argue "sunk costs" in a meeting and move on down the agenda. On rented compute, it means paying for nine units of idle capacity for every unit of work.

Kubernetes also adds control plane fees, egress charges, and orchestration overhead that scales with cluster complexity, not application load. The CNCF FinOps microsurvey found that 49% of organizations saw cloud costs increase after adopting Kubernetes, with over-provisioning cited as the primary driver by 70% of respondents.

I've said it before: Kubernetes is like building a ship in a bottle wearing welding gloves. This is why. The complexity leads to uninformed decisions due to steep learning curves. In turn, inevitable shortcuts lead to... 10%.

Finance Gets the Invoice. Engineering Gets the Page.

The structural reason over-provisioning persists in Kubernetes is the same reason it persisted in enterprise data centers in 2008: the person allocating resources is not the person paying the bill.

Engineers aren't trying to waste money. They're building features, fixing bugs, patching vulnerabilities, and hitting launch dates. Product teams are chasing GTM. Across a service, cost is job five on a good day.

The problem isn't that people don't care. The problem is that few businesses build the feedback loop. Especially with shared services platforms, where many services run on the same orchestrator. The CNCF's FinOps microsurvey points out that only 2% of organizations run active chargeback and 19% showback, where teams see a real bill for what they consume. Thirty-eight percent have zero Kubernetes cost monitoring. Not inadequate monitoring. Nada. Zip. Zilch. Zero.

Showback and chargeback can close that gap, but building accurate cost allocation for a shared Kubernetes cluster is a real engineering project. Namespace-level attribution, shared system overhead, observability costs, storage that doesn't map cleanly to any single workload. It consistently loses priority to product work that has a stakeholder with a deadline attached to it. Be honest: how often do you organize that "miscellaneous" drawer in your kitchen? Same problem. Without motivation there's no point.
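
The mechanical starting point is unglamorous: label the namespaces so spend can be attributed at all. A minimal sketch with hypothetical team and cost-center values; allocation tools such as OpenCost aggregate by exactly this kind of metadata:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments            # hypothetical team namespace
  labels:
    team: payments
    cost-center: "4217"     # hypothetical cost center code
    env: production
```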

The tooling to do this exists. The organizational commitment often doesn't. And until teams can see what their services actually cost, or how much of their allocation they aren't using, expecting them to optimize for it is like asking someone to lose weight without a scale.

Half the Cluster Is Running the Cluster

Kubernetes orchestration imposes overhead that compounds the waste. Dynatrace's 2025 Kubernetes in the Wild report found that application workloads account for just 46% of cluster pod hours. System services consume 37%, monitoring 14%, backing services 3%. More than half of what the cluster is running is infrastructure for the infrastructure. Kubeception.

Service meshes can make it even worse. Istio's default sidecar proxy requests 100m CPU and 128 MB of memory per pod. At 1,000 pods, that's 100 vCPU and 128 GB reserved for the mesh alone. That's like a c7i.24xlarge at over $3k/mo. Ouch. Budget approvals for services often overlook the added heft the platform will have to endure.
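
If the mesh is staying, the per-pod proxy is at least tunable. A sketch using Istio's sidecar resource annotations, with hypothetical values and a hypothetical service, shrinking the reservation for a low-traffic workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: catalog-api                         # hypothetical low-traffic service
  annotations:
    sidecar.istio.io/proxyCPU: 50m          # down from the 100m default request
    sidecar.istio.io/proxyMemory: 64Mi      # down from the 128Mi default
    sidecar.istio.io/proxyCPULimit: 200m
    sidecar.istio.io/proxyMemoryLimit: 128Mi
spec:
  containers:
    - name: app
      image: example.com/catalog-api:1.9    # hypothetical image
```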

Containers are lighter than VMs at the workload level, no argument there. But given those numbers, per-node overhead in production Kubernetes runs 15-25% of node resources before a single application pod is scheduled. Organizations running at 10% utilization are paying orchestration overhead on top of application waste, not in place of it. Dare we do the math?

Karpenter Can't Hammer the Nail Alone

The technology to fix cluster-level waste exists. Karpenter continuously evaluates workload profiles, selects the cheapest instance type that fits, consolidates nodes, and integrates Spot when workloads tolerate interruption. This constant binpacking and infrastructure management is truly state of the art. Tinybird ran Karpenter on EKS with Spot and cut CI/CD costs by 90%. Production deployments consistently report 30-70% compute cost reductions. The observant will say Spot is capable of 90% reductions, but they're rare. The TRULY observant will see the reduction in footprint.

But Karpenter provisions based on what pods request, same as the Cluster Autoscaler (CAS) before it. If a pod requests 4 vCPU and uses 0.4, Karpenter dutifully provisions a node large enough for 4 vCPU, optimizing an inflated request. The waste lives in the pod spec, and Karpenter works downstream of it.
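
The fix has to happen upstream, at the request itself. One low-risk way to see the gap is the Vertical Pod Autoscaler in recommendation-only mode; a sketch pointed at the hypothetical workload from earlier:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: report-worker-vpa        # hypothetical
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: report-worker
  updatePolicy:
    updateMode: "Off"            # recommend only; surfaces actual usage vs. requests
```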

Fifty-six percent of organizations are still managing resources manually. No autoscalers. No optimization tooling. Just YAML and vibes. Not AI coding vibes, walking around with a blindfold reaching out randomly to find your bearings vibes.

Now apply all of this to GPUs. Yeah, I did it, I brought up AI. Check your bingo card. Two-thirds of organizations running generative AI use Kubernetes for inference. GPU utilization on those clusters averages 15-25%, on hardware that can cost an order of magnitude more per unit than general-purpose compute. A 2 GB inference workload assigned to an 80 GB A100 leaves 97.5% of the GPU's memory idle. The tools for fractional GPU allocation exist. Adoption is limited. Same incentive failure, substantially higher stakes. Compound this with all the terrible estimations of token requirements in AI services and you start to understand layoffs.
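
For what it's worth, the fractional allocation isn't exotic. With NVIDIA's device plugin configured for MIG (mixed strategy), a small inference pod can request a slice instead of the whole card; a sketch with a hypothetical image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-inference                       # hypothetical
spec:
  containers:
    - name: model-server
      image: example.com/inference:0.7        # hypothetical image
      resources:
        limits:
          # A 1g.10gb MIG slice of an A100 80GB instead of the whole card.
          nvidia.com/mig-1g.10gb: 1
```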

Even Google's Tasks Are at 10%

Google's 2020 Borg paper analyzed their own cluster trace data. Borg was the inspiration for Kubernetes. Individual task CPU usage was 10-20% of requested resources. This is Google, running a system built specifically to maximize cluster efficiency, with a team whose entire job was squeezing out percentage points that, like all massive enterprises and hyperscalers, are worth millions of dollars each.

Over eight years, Google moved cluster-level utilization from 30% to above 50%. Cluster utilization includes everything running on that cluster: workloads, system services, and monitoring. Remember? The real difference was Google's reclamation system. It backfills the headroom between what tasks request and what they actually use with low-priority batch jobs. The cluster looks busy. The individual tasks are still requesting five to ten times what they need.
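
The closest Kubernetes analog to that reclamation trick is plain old priority and preemption: give backfill work a low PriorityClass so it soaks up headroom and is the first thing pushed aside. A sketch, names hypothetical:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: scavenger-batch            # hypothetical class name
value: -10                         # lowest rung: first to go when real work needs room
preemptionPolicy: Never            # this work never evicts anything else
globalDefault: false
description: "Backfill batch work that runs in idle headroom only."
---
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-reencode           # hypothetical batch job
spec:
  template:
    spec:
      priorityClassName: scavenger-batch
      restartPolicy: Never
      containers:
        - name: worker
          image: example.com/reencode:latest   # hypothetical image
```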

Google engineered around the waste. They didn't eliminate it. If Google can't fix it, you're not going to kubectl apply your way there either.

Over-provisioning is rational behavior. Every engineer who's been paged at 2am for a resource exhaustion incident knows this. The aggregate result: 99.94% of clusters are over-provisioned, 83% of container costs are idle resources. Average CPU utilization: 10%.

That's not an industry that hasn't tried. It's an equilibrium. McKinsey documented it in 2008. We rebuilt the entire infrastructure stack and arrived at the same number, on rented compute that costs considerably more than owned hardware did.

The Meter Is Always Running

A server running at 10% utilization draws 40-50% of its peak power. Ten servers at 10% consume roughly four times the electricity of one server at 100% doing the same work: ten machines each pulling ~45% of peak adds up to ~4.5x the draw of one machine running flat out. Most of those spinny things are still spinning and ICs are still hungry at idle.

The IEA measured global data center electricity at 415 TWh in 2024 and projects 945 TWh by 2030. That projection appears in a lot of keynotes and almost no capacity planning documents. If the industry simply doubled average utilization from 10% to 20%, the same workloads could run on half the infrastructure. That's roughly 100 TWh in annual savings, about $10 billion in electricity alone, and enough to power the Netherlands.

The US data center count roughly doubled between 2021 and 2024. Some of that growth reflects genuine demand. Some of it is the compounded consequence of building new infrastructure for workloads that already sit idle on the infrastructure they have. Or AI hype.

The problem is behavioral, not technological. Each step down the stack, from bare metal to VMs to containers, and every new tool along the way, solved what it could. McKinsey found 10% in 2008. CAST AI found 10% in 2024. Hardware and platform innovations keep promising better utilization, and none of them have delivered it.

Cloud compute is often likened to a utility. You can't buy more electricity or water than you actually use, but you can leave lights on or water running unnecessarily. Upgrading from incandescent to LED helps, as do low-flow toilets and shower heads, but nothing you install can overcome idle usage. That is, until your spouse runs in and yells at you. Hmm.
