Your Biggest Cloud Bill Has Your Weakest SLA
Ali Kadhim
Co-Founder & CTO

The most expensive AI failure most teams will hit this year won't register as an outage. That is exactly why it is dangerous.
Picture a setup that is becoming standard: an agentic workflow running on a managed inference endpoint, the kind every enterprise is standing up right now. For weeks it is flawless. Then one Tuesday afternoon it starts falling apart in a way the dashboards were never built to catch. The endpoint is not down. It is returning answers, just three times slower than usual, and every few minutes a request is throttled with an HTTP 429. Downstream, the agent chain times out, retries, and starts hallucinating around the gaps. Customers see a product that has quietly gone stupid.
Pull up the provider's status page and it is green. All systems operational. Availability for the month lands above 99.9%, exactly as promised. By the contract, nothing happened. By every measure that matters to the people paying the bill, the product was broken, and the meter never stopped running. Full price for a service that was, functionally, offline.
That is the failure mode I keep coming back to as we build Next Signal, because it is the one almost nobody is pricing in. The infrastructure the entire industry is now betting the company on has the thinnest accountability layer in the whole stack.
State of FinOps 2026, by the numbers
- GPU and AI-model spend is now the #1 cost concern for FinOps teams, passing general cloud spend for the first time
- 98% of organizations now manage AI spend, up from 63% in 2025 and 31% in 2024
- 72% of companies exceeded their cloud budget last fiscal year
- Teams increasingly report being asked to self-fund AI by squeezing optimization savings out of everything else
The Line Item That Ate the Cloud Bill
For a decade, the FinOps conversation was about compute, storage, and egress: the boring, well-understood middle of the bill. That era is over. In the FinOps Foundation's State of FinOps 2026 survey, GPU and AI-model spend overtook general cloud cost as the single biggest concern practitioners have, and AI cost management became the most sought-after skill in the field. Ninety-eight percent of organizations now say they actively manage AI spend, up from less than a third two years ago.
The reason is the raw size of the numbers. A single p5.48xlarge GPU instance lists near $39,600 a month, and once you add the data transfer, storage, and networking that a real workload drags behind it, the true figure lands closer to $43,000 to $44,000. Multiply that across a fleet, add per-token inference charges, vector database queries, and the API gateways stitching your agents together, and AI stops being a line item and becomes the bill.
Zoom out and the scale is almost hard to hold in your head. Amazon, Microsoft, Google, and Meta are on track to spend somewhere between $630 billion and $725 billion on AI infrastructure in 2026 alone. The four of them raised data center capex roughly 78% year over year. Gartner puts total global AI spend for the year near $2.5 trillion, a 44% jump. This is the largest infrastructure build-out in the history of computing, and enterprises are being asked to fund their piece of it by finding efficiency everywhere else. The pressure to optimize has never been higher.
But optimization only answers half the question. The other half is the one nobody's pricing in: when you pour a third of your AI budget into workloads running on this infrastructure, what actually protects you when it fails?
The 99.9% Problem
Here's the uncomfortable answer. The managed AI services now carrying your most expensive, most strategic workloads ship with the same thin guarantee as everything else, and it's built to measure the wrong thing.
Amazon Bedrock commits to 99.9% monthly uptime. Breach it and you're entitled to a service credit: roughly 10% back on that service if availability lands between 99% and 99.9%, scaling up as it gets worse. Azure OpenAI publishes the same 99.9% availability figure. Vertex AI is structured the same way. On paper, you're covered.
Except "covered" is doing an enormous amount of work in that sentence. These SLAs measure availability: is the endpoint up and answering. They say almost nothing about the failure modes that actually break AI in production. When Azure OpenAI throttles your pay-as-you-go endpoint during a demand spike, that isn't a breach. You paid full price for degraded service with no recourse. When latency triples and your agent chain times out, the uptime metric stays green. When a model quietly regresses in quality after an update, there is no clause in any contract that even acknowledges it happened. The provider's status page and your customer's experience have quietly stopped describing the same reality, and that gap is exactly where the money leaks.
This is the same structural gap we've written about with traditional cloud SLAs, where the credit compensates you with a percentage of one month's spend while the outage costs you the business. But AI makes it worse in two specific ways. First, the spend at risk is far larger and growing far faster. Second, the failure modes that hurt you most (throttling, latency, degradation) sit entirely outside the metric the contract is willing to measure. We flagged in July of last year that the line between AI-as-software and AI-as-infrastructure was dissolving and the SLAs hadn't caught up. A year later, the spend has exploded and the contracts haven't moved an inch.
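To put numbers on the 99.9% figure, here is a minimal sketch of the arithmetic. It assumes a 30-day month and a simple tier table: only the 10% tier (availability between 99% and 99.9%) comes from the figures above, and the lower tiers are placeholders I added for illustration. The function names are hypothetical, and your provider's current SLA is the only authoritative source for the real tiers.

```python
# Sketch: what a 99.9% monthly availability SLA allows, and what a breach pays.
# Only the 10% tier (99.0% to 99.9%) is taken from the article. The lower
# tiers are assumed placeholders; check the provider's current SLA.

MINUTES_IN_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

# (availability floor in percent, credit in percent), highest floor first.
ASSUMED_CREDIT_TIERS = [
    (99.9, 0.0),    # SLA met: no credit
    (99.0, 10.0),   # from the article
    (95.0, 25.0),   # assumption for illustration
    (0.0, 100.0),   # assumption for illustration
]


def allowed_downtime_minutes(sla_percent, minutes_in_month=MINUTES_IN_30_DAY_MONTH):
    """Minutes of unavailability per month that still satisfy the SLA."""
    return minutes_in_month * (100.0 - sla_percent) / 100.0


def credit_percent(availability_percent, tiers=ASSUMED_CREDIT_TIERS):
    """Credit owed (as a percent of that service's monthly spend)."""
    for floor, credit in tiers:
        if availability_percent >= floor:
            return credit
    return tiers[-1][1]


def credit_usd(monthly_service_spend_usd, availability_percent):
    """Dollar value of the credit for one service in one month."""
    return monthly_service_spend_usd * credit_percent(availability_percent) / 100.0
```

Under these assumptions a 99.9% SLA tolerates about 43.2 minutes of downtime in a 30-day month, and a month at 99.5% on a service costing $500,000 would be worth a $50,000 credit. A month in which the endpoint answered every request at three times its normal latency still counts as fully available, so `credit_usd` returns zero for it.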
When the Floor Drops Out
The degradation problem is the quiet, everyday version. The loud version is a full outage, and AI has made the blast radius worse, because so much of the AI stack rests on the same shared foundations.
On February 2nd, a policy change inside Azure blocked read access to a set of Microsoft-managed storage accounts, and a related spike overwhelmed the Managed Identities service in East and West US. It lasted more than ten hours. Look at what got dragged down with it: Azure Kubernetes Service, Databricks, Copilot Studio, Azure AI Video Indexer, Container Apps. In other words, when the identity layer under the cloud buckled, the AI services sitting on top of it went with it. If your agentic pipeline authenticated through Managed Identities or depended on any of those services, it didn't matter that "the model" was fine. The floor moved.
The next one is already scheduled
Forrester expects at least two major cloud outages in 2026, driven by exactly the kind of escalating complexity and deep service interdependence that took Azure down in February. The AI build-out is adding new layers of that dependency faster than anyone is hardening them.
This is the concentration story catching up with the AI story. We've argued before that outages aren't a risk to be engineered away. They're a certainty to be planned around. AI doesn't change that math; it raises the stakes. You are running more critical, more expensive, more deeply interconnected workloads across the same handful of providers. When one of them has a bad Tuesday, the number of things that break and the dollar value of what breaks are both larger than they have ever been. And the credit you're owed afterward still tops out at a slice of one month's spend, if anyone on your team remembers to file for it inside the window.
What Changes When AI Is the Bill
I'm not writing this to argue against the AI build-out. We're building on the same infrastructure as everyone else, and the economics of repatriating it don't pencil out any better for AI than they did for the last generation of workloads. The answer isn't to retreat. It's to build the accountability layer the providers left out. Three things have to change once AI becomes the largest number on your bill.
Measure degradation, not just downtime. Uptime is the provider's metric, and it's designed to make their number look good. Yours should track what your users actually feel: response latency, throttling rates, error rates, and output quality against a baseline. Independent monitoring that watches the service the way your customer does, not the way the status page does, is the only way to know you've been degraded before your customers tell you. During February's Azure incident, the teams watching independent signals were ahead of the status page by hours.
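As a rough illustration of what tracking degradation can look like, the sketch below summarizes a window of request samples into p95 latency, throttle rate, and error rate, then compares it with a baseline window. Everything in it is an assumption for illustration: the `Sample` and `WindowStats` types, the function names, and the thresholds (latency at twice baseline, 1% throttling, 1% errors) are hypothetical and should be tuned to your workload. Output quality is left out because it needs a task-specific evaluation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    latency_ms: float
    status: int  # HTTP status code returned to the caller


@dataclass(frozen=True)
class WindowStats:
    p95_latency_ms: float
    throttle_rate: float  # share of requests answered with HTTP 429
    error_rate: float     # share of requests answered with a 5xx status


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Collapse one monitoring window into the metrics a user would feel."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    return WindowStats(
        p95_latency_ms=percentile([s.latency_ms for s in samples], 95),
        throttle_rate=sum(s.status == 429 for s in samples) / n,
        error_rate=sum(s.status >= 500 for s in samples) / n,
    )


def degradation_reasons(current, baseline, latency_factor=2.0,
                        max_throttle_rate=0.01, max_error_rate=0.01):
    """Return the list of ways the current window is worse than baseline."""
    reasons = []
    if current.p95_latency_ms > latency_factor * baseline.p95_latency_ms:
        reasons.append("latency")
    if current.throttle_rate > max_throttle_rate:
        reasons.append("throttling")
    if current.error_rate > max_error_rate:
        reasons.append("errors")
    return reasons


# Hypothetical data: a healthy week versus the bad Tuesday afternoon.
baseline = summarize([Sample(800.0, 200)] * 100)
tuesday = summarize([Sample(2400.0, 200)] * 97 + [Sample(2400.0, 429)] * 3)
print(degradation_reasons(tuesday, baseline))
```

In this made-up window the error rate is zero, so an uptime-style measure sees nothing wrong, while the latency and throttling checks both fire.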
Treat AI-service SLA credits as recoverable revenue. Every one of those 99.9% guarantees is enforceable, and providers do not proactively pay out when they miss. The burden is entirely on you to detect the breach, document the impact, and file inside a 30-to-60-day window that's designed to run out before a busy engineering team gets to it. On a fleet spending six or seven figures a month on GPU and inference, the credits left unclaimed at the end of every quarter are real money: the same money finance is pressuring you to find through optimization, sitting uncollected in the contract you already signed.
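A claims backlog can be tracked with very little code. The sketch below assumes the filing window is counted in calendar days from the day the incident ended, which is a simplification: providers define the window in their own terms, often relative to billing cycles, so treat `window_days` as a conservative stand-in. The `SlaIncident` type, the service name, and the spend figure are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SlaIncident:
    service: str
    ended_on: date
    monthly_service_spend_usd: float
    credit_percent: float  # tier from the provider's SLA for the measured availability

    def estimated_credit_usd(self):
        return self.monthly_service_spend_usd * self.credit_percent / 100.0


def claim_deadline(incident, window_days):
    """Last day to file, assuming the window starts when the incident ends."""
    return incident.ended_on + timedelta(days=window_days)


def days_remaining(incident, window_days, today):
    """Negative once the window has closed."""
    return (claim_deadline(incident, window_days) - today).days


def open_claims(incidents, window_days, today):
    """Incidents still inside the window, most urgent first."""
    live = [i for i in incidents if days_remaining(i, window_days, today) >= 0]
    return sorted(live, key=lambda i: claim_deadline(i, window_days))


# Hypothetical example: an incident that ended on 2 February 2026.
incident = SlaIncident(
    service="example-inference-endpoint",  # made-up name
    ended_on=date(2026, 2, 2),
    monthly_service_spend_usd=250_000.0,   # made-up spend
    credit_percent=10.0,
)
print(claim_deadline(incident, window_days=30), incident.estimated_credit_usd())
```

Using the shorter end of the 30-to-60-day range as the default is a deliberate choice: it is cheaper to file early than to discover the real window was tighter than you assumed.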
Attribute AI spend so you can act on it. GPU compute, model inference, vector queries, gateway calls, and embedding storage all land in different line items, often charged to different cost centers than the initiative that created them. You can't govern, forecast, or recover against a cost you can't see whole. Visibility is the precondition for every other move.
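One way to see the cost whole is to roll billing line items up by the initiative that caused them rather than by the cost center that was charged. The sketch below assumes each line item carries an optional initiative tag. The `LineItem` fields, the category names, and the `UNATTRIBUTED` bucket are hypothetical and do not correspond to any provider's billing export format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

UNATTRIBUTED = "UNATTRIBUTED"


@dataclass(frozen=True)
class LineItem:
    category: str              # e.g. "gpu_compute", "inference", "vector_queries"
    cost_center: str           # where finance charged it
    initiative: Optional[str]  # tag naming the project that created the cost
    amount_usd: float


def spend_by_initiative(items):
    """Total spend per initiative, broken down by category."""
    totals = defaultdict(lambda: defaultdict(float))
    for item in items:
        key = item.initiative or UNATTRIBUTED
        totals[key][item.category] += item.amount_usd
    return {name: dict(by_category) for name, by_category in totals.items()}


def initiative_total(items, initiative):
    """Whole cost of one initiative across every category and cost center."""
    return sum(spend_by_initiative(items).get(initiative, {}).values())


# Hypothetical line items charged to three different cost centers.
items = [
    LineItem("gpu_compute", "platform", "support-agent", 40_000),
    LineItem("inference", "data-science", "support-agent", 12_000),
    LineItem("vector_queries", "search", "support-agent", 3_000),
    LineItem("gateway_calls", "platform", None, 1_500),
]
print(initiative_total(items, "support-agent"))
```

The size of the unattributed bucket is worth watching on its own: it is the share of AI spend that cannot yet be governed, forecast, or recovered against.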
At Next Signal, this is the gap we built the platform to close. We monitor AWS, Azure, and GCP continuously across hundreds of services, including the managed AI and GPU services now dominating enterprise bills, detect SLA violations using independent signals that often surface before the provider's own acknowledgment, and automate the evidence collection and claims process so your team never has to reconstruct a timeline under deadline. The same discipline leading FinOps teams brought to reserved instances, applied to the fastest-growing and least-protected part of your spend.
The AI build-out is the biggest bet enterprise technology has ever made. It's being placed on infrastructure whose contracts were written for a smaller, simpler, more forgiving world. The spend has already moved. The accountability hasn't. The only variable you control is whether you close that gap yourself, or keep paying full price for the Tuesdays when your provider's status page and your customers stop agreeing on reality.
See what your AI and cloud spend may be leaving on the table at nextsignal.io. The ROI calculator gives you a rough estimate in under a minute.
Sources
Industry data and reporting cited in this article:
- FinOps Foundation: State of FinOps 2026
- Tom's Hardware: Big Tech's AI Spending Plans Reach $725 Billion
- Futurum Group: AI Capex 2026, The $690B Infrastructure Sprint
- Finout: Predicting AI Spending in 2026
- Amazon Bedrock Service Level Agreement
- Usage.ai: AWS Bedrock vs Vertex AI vs Azure OpenAI Cost & SLA Breakdown
- Network World: Azure Outage Disrupts VMs and Identity Services for Over 10 Hours
- The Register: Azure Outages Ripple Across Multiple Dependent Services
- OpenMetal: Forrester Predicts Two Major Cloud Outages in 2026
- Next Signal: The October 2025 AWS Outage Cost $581 Million