The October 2025 AWS Outage Cost $581 Million

Tyler

Co-Founder & CEO

A few years ago I was brought in to support a large online travel company. What I did not know when I walked in was that six months earlier they had been completely leveled by an outage. It happened on Black Friday, right as a surge of holiday traffic hit. The site went dark.

I sat in the war room while executives described watching their competitors absorb every booking that should have been theirs, in real time. One outage. Thirty-five million dollars in lost sales. We spent the next six months rebuilding their infrastructure from the ground up.

Reading the post-mortems from the October 2025 AWS outage hit the same nerve.

On October 20th, a race condition inside AWS's internal DNS management system accidentally deleted every IP address tied to the DynamoDB regional endpoint in US-East-1. What sounds like a narrow, recoverable software bug cascaded into 15 hours of disruption across 3,500 companies in more than 60 countries. Zoom, Capital One, Coinbase, DoorDash, Reddit. All down. Industry analysts at CRN put the total business loss across that window at $581 million.

And here is what most CFOs have not fully absorbed: the cloud contracts that almost none of those companies had recently read entitled them to service credits, not to compensation for those losses.

October 20, 2025 — by the numbers

  • 15 hours of disruption from a single DNS race condition in US-East-1
  • 3,500 companies affected across 60+ countries
  • $581 million in total business loss (CRN estimate)

Uptime timeline of the October 2025 AWS outage — service runs healthy, crashes into a 15-hour flatline during the US-EAST-1 DynamoDB DNS failure, then recovers, with $581M in losses across 60+ countries

The SLA Math That Breaks at Enterprise Scale

FinOps leaders at large organizations tend to operate under a comfortable assumption. They signed SLAs, they have uptime guarantees, they are protected. The actual math tells a very different story.

AWS, Azure, and GCP typically commit to 99.95% to 99.99% availability on their core services. When they breach those commitments, the contractual remedy is a service credit, not a reimbursement for what the outage actually cost your business. That credit is capped as a percentage of what you spent on the affected service in the month of the incident. First-tier breaches typically pay out a 10% credit on that service's monthly cost, and the percentage rises with the severity of the breach.
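To make those percentages concrete, here is a minimal sketch of the arithmetic in Python. It assumes a 30-day month and a made-up credit schedule: only the 10% first tier comes from the paragraph above, while the lower tiers, the function names, and the $200,000 spend figure are placeholders for illustration. Your own contract defines the real values.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

# Hypothetical schedule of (availability floor %, credit rate).
# Only the 10% first tier is taken from the article; the rest are assumptions.
CREDIT_TIERS = [
    (99.0, 0.10),  # below the commitment but at or above 99.0%
    (95.0, 0.25),  # assumed second tier
    (0.0, 1.00),   # assumed worst tier
]


def allowed_downtime_minutes(committed_pct: float) -> float:
    """Downtime a provider can accrue in a 30-day month before breaching."""
    return MINUTES_PER_30_DAY_MONTH * (100.0 - committed_pct) / 100.0


def measured_availability_pct(downtime_minutes: float) -> float:
    """Availability over a 30-day month given total downtime."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_30_DAY_MONTH)


def service_credit(monthly_service_spend: float,
                   downtime_minutes: float,
                   committed_pct: float) -> float:
    """Credit owed on one service, capped by that service's monthly spend."""
    measured = measured_availability_pct(downtime_minutes)
    if measured >= committed_pct:
        return 0.0  # commitment met, no credit
    for floor_pct, rate in CREDIT_TIERS:
        if measured >= floor_pct:
            return monthly_service_spend * rate
    return monthly_service_spend


if __name__ == "__main__":
    print(allowed_downtime_minutes(99.95))  # about 21.6 minutes
    print(allowed_downtime_minutes(99.99))  # about 4.3 minutes
    # A 15-hour outage on a service costing a hypothetical $200,000 a month:
    print(service_credit(200_000, 15 * 60, 99.95))
```

Under these assumptions a 99.95% commitment allows roughly 21.6 minutes of downtime in a month, and whatever the credit tier, the payout can never exceed what you spent on that one service.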

By comparison, the Uptime Institute estimates that a 24-hour outage could cost a large organization north of $75 million in lost revenue, lost productivity, and the downstream SLA penalties it then owes its own enterprise customers.

What your cloud SLA actually covers — a service credit covers a percentage of one month's affected-service spend, while lost revenue, reputational damage, and downstream penalties are not covered

Your SLA does not cover lost revenue. It does not cover reputational damage. It does not cover the downstream penalties you owe your own customers when you fail to meet your own commitments. What it does is compensate you with a percentage of one month's spend on each affected service. That is the contract you signed. Gartner's post-incident analysis of the outage was blunt about it: resilience, not the fine print of your service agreement, is your actual protection.
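The size of that gap is easy to put a number on. The sketch below is illustrative only: the $75 million loss is the Uptime Institute figure cited above, the 10% rate is the typical first-tier credit, and the $1 million monthly spend on the affected service is a hypothetical input, as is the function name.

```python
def credit_coverage(business_loss: float,
                    monthly_service_spend: float,
                    credit_rate: float) -> dict:
    """Compare an SLA service credit with the business loss from an outage."""
    credit = monthly_service_spend * credit_rate
    return {
        "credit": credit,
        "uncovered_loss": business_loss - credit,
        "coverage_pct": 100.0 * credit / business_loss,
    }


if __name__ == "__main__":
    # Hypothetical: $1M/month on the affected service, 10% first-tier credit.
    result = credit_coverage(75_000_000, 1_000_000, 0.10)
    print(result)
```

With those inputs the credit is $100,000 against a $75 million loss, which is well under 1% of the damage.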

Billions in Eligible Credits That Never Get Claimed

Here is what makes the picture worse. Even the recovery that SLAs do allow mostly goes unclaimed.

Next Signal's incident tracking data shows that AWS reported 25 events in 2024 with downtime exceeding 24 minutes. Every single one triggered SLA credit eligibility for affected customers. For an organization running $5 million a month in AWS spend, that is a real and recurring recovery opportunity sitting uncollected at the end of every month.

Annual SLA credit recovery opportunity by monthly cloud spend level, rising from $15K at $500K/mo to $1.5M at $50M/mo
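Both endpoints of that chart work out to the same ratio: $15K on $500K a month and $1.5M on $50M a month are each 3% of one month's spend, recovered per year. Assuming the relationship is linear in between, which is an inference from those two endpoints and not a published formula, a rough estimator looks like this (the function and constant names are made up for the example):

```python
# Ratio implied by the chart's endpoints: annual recovery / monthly spend.
# Assumption: the ratio is constant across spend levels.
RECOVERY_PCT_OF_MONTHLY_SPEND = 3


def estimated_annual_recovery(monthly_cloud_spend: float) -> float:
    """Rough annual SLA credit recovery for a given monthly cloud spend."""
    return monthly_cloud_spend * RECOVERY_PCT_OF_MONTHLY_SPEND / 100


if __name__ == "__main__":
    for spend in (500_000, 5_000_000, 50_000_000):
        print(spend, estimated_annual_recovery(spend))
```

On that assumption, the $5 million a month organization mentioned above would be looking at roughly $150,000 a year.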

Most teams leave that money uncollected, and the reasons are consistent across the industry. When an outage hits, engineers shift immediately to firefighting. Customer-facing teams manage the surge of inbound escalations. By the time the incident resolves, the 30- to 60-day claims window has started running and the motivation to build a documentation case has fully evaporated. Finance has no visibility into the incident data. Engineering has no bandwidth to reconstruct timelines and calculate impact. And the claims process itself requires submitting specific evidence through the provider's support system in their preferred format.
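The deadline itself is trivial to track once someone writes it down. Here is a minimal sketch, assuming the window is counted in calendar days from the incident date; the exact rule varies by provider and contract, and the function names are illustrative.

```python
from datetime import date, timedelta


def claim_deadline(incident_date: date, window_days: int) -> date:
    """Last day to file, assuming the window runs from the incident date."""
    return incident_date + timedelta(days=window_days)


def days_remaining(incident_date: date, window_days: int, today: date) -> int:
    """Days left to file; negative means the window has closed."""
    return (claim_deadline(incident_date, window_days) - today).days


if __name__ == "__main__":
    outage = date(2025, 10, 20)
    print(claim_deadline(outage, 30))  # shortest window mentioned above
    print(claim_deadline(outage, 60))  # longest window mentioned above
    print(days_remaining(outage, 30, today=date(2025, 11, 1)))
```

For the October 20 outage, a 30-day window closes on November 19 and a 60-day window on December 19.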

The cost of silence

Splunk research with Oxford Economics found that Global 2000 companies lose an average of $200 million each per year to downtime. If those organizations are not filing claims, they are subsidizing their cloud providers' SLA violations with their own silence.

The friction is not accidental. Cloud providers do not proactively issue credits when they miss their own uptime targets. The burden is entirely on the customer to identify the breach, document the impact, assemble the evidence, and navigate the claim process within the window. Organizations that automate that process recover what they are owed. Everyone else effectively waives it.

The Concentration Problem Is Not Going Away

This exposure is structural, not a one-time event.

AWS, Azure, and GCP now control between 60% and 70% of all global cloud workloads, according to CRN's Q4 2024 market share analysis. The 2025 OECD Competition Report identified that bundled services, deep integrations, and switching costs are actively deepening that concentration over time. For enterprise technology leaders, this has crossed the line from a competitive market concern into a fundamental operational resilience issue.

68%

Share of global cloud workloads controlled by AWS, Azure, and GCP combined. When one of three providers goes down, the blast radius is wider than it has ever been.

The Uptime Institute's 7th Annual Outage Analysis found that while the frequency of minor outages has slightly declined thanks to improved hardware, the financial severity of major outages is increasing exponentially for large organizations. You are running more critical workloads on more deeply integrated services across fewer providers. When one goes down, the blast radius is wider than it has ever been.

Gartner's recommendation is not to panic or start repatriation projects. The economics of on-premise infrastructure do not pencil out for most enterprises, and the capability gap is real. The right response is building the internal tooling and processes to hold your existing providers accountable, particularly around the SLA credits you are already contractually entitled to.

What CFOs and CTOs Need in Place Before the Next Outage

I have seen both sides of this problem, in client war rooms and now building the tooling to prevent it. There are two things every enterprise needs in place before the next outage happens, because the next one is already scheduled by probability.

The first is incident detection that runs independent of your provider's status page. During the October 2025 outage, AWS's public status page lagged the actual severity of the disruption by hours. Organizations with independent monitoring detected degradation before their provider acknowledged it, which meant they could get ahead of customer communication instead of reacting to it. That early signal changes everything about how an incident is managed.
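As a rough illustration of what independent detection means in practice, the sketch below turns your own probe results into outage windows without consulting the provider's status page. It is a simplified model, not a description of any particular product: the probe results are passed in as data, where a real system would collect them from health checks run outside the provider. The three-consecutive-failure threshold and all names are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, bool]  # (probe time, did the probe succeed?)


def find_outages(samples: List[Sample],
                 min_failures: int = 3) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) windows with at least min_failures failed probes in a row.

    Samples must be in chronological order. A window ends at the first
    successful probe after the failures, or at the last failed probe if the
    data ends while the service is still down.
    """
    outages = []
    run_start = None
    last_failure = None
    run_length = 0
    for timestamp, ok in samples:
        if not ok:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            last_failure = timestamp
        else:
            if run_length >= min_failures:
                outages.append((run_start, timestamp))
            run_length = 0
    if run_length >= min_failures:
        outages.append((run_start, last_failure))
    return outages


def total_downtime_minutes(outages: List[Tuple[datetime, datetime]]) -> float:
    """Sum the length of all outage windows in minutes."""
    return sum((end - start).total_seconds() for start, end in outages) / 60.0


if __name__ == "__main__":
    start = datetime(2025, 10, 20, 7, 0)
    # One probe per minute: two successes, five failures, recovery, one blip.
    results = [True, True, False, False, False, False, False, True, False, True]
    samples = [(start + timedelta(minutes=i), ok) for i, ok in enumerate(results)]
    windows = find_outages(samples)
    print(windows, total_downtime_minutes(windows))
```

Requiring several consecutive failures filters out single-probe blips, at the cost of a few minutes of detection delay. The timestamps this produces are also the raw material for a credit claim later.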

The second is automated SLA credit recovery. The claims process is designed to create enough friction that most teams walk away from eligible credits. Doing this manually burns engineering hours on work that generates zero product value. The solution is to treat SLA recovery the same way leading FinOps teams treat reserved instance optimization: automated, continuous, and decoupled from the manual overhead of the incident response cycle.

At Next Signal, we built the platform to close both gaps. We monitor AWS, Azure, and GCP continuously across hundreds of services, detect SLA violations using independent signals that often surface before the provider's own acknowledgment, and automate the evidence collection and claims process so your team never has to think about it. The platform manages cloud spend across enterprises in financial services, insurance, healthcare, and technology, consistently recovering credits that would otherwise sit on the table unclaimed.

The October 2025 outage is not an outlier. It is a clear guide to what enterprise cloud contracts actually protect you against and what they do not, and finance and technology leaders should set their expectations accordingly.

Cloud outages are inevitable. The math gap between SLA credits and actual business loss is a known, structural feature of every provider contract in the market. The only variable you control is whether you have the systems in place to recover what you are contractually owed when it happens.

See what your organization may be leaving on the table at nextsignal.io. The ROI calculator gives you a rough estimate in under a minute.

Sources

Industry data and reporting cited in this article: