Cloud outages are inevitable. Accountability isn't.

Nothing stresses your IT team more than responding to unexpected downtime. It sends your customers, customer-facing teams, and execs into a panic, and it redirects large swaths of your organization from development and growth into firefighting. Downtime could cost your organization $9,000 per minute now that the Internet has become the de facto marketplace for most of our business interactions. For reasons that are well understood, IT teams have ceded control of their network infrastructure to the big three cloud providers: Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. According to CRN, these three worldwide platforms account for 68% of the global cloud market, and they are still growing fast. That is a ton of responsibility, and yet it doesn’t seem to carry commensurate accountability. Why is that?

The illusion of perfection

Your primary accountability tool is your Service Level Agreement (SLA). Most of the companies we talk to haven’t looked at their SLAs in a while. In case you don’t remember, don’t have it handy, or don’t want to wade through it, the typical uptime commitment from your cloud provider is 99.95% or 99.99% availability in any given month. These commitments are made for the most commonly purchased or most critical services, and they usually require some deployment redundancy across availability zones (at additional cost).

99.99% feels like a strong commitment because it is a strong commitment. You can expect under 4.4 minutes of downtime in a month if your provider is meeting this goal. But it also does a little psychological work on their behalf: if they are willing to take such a strong stance on their reliability, it must be because their infrastructure is that reliable… right?
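To make that number concrete, here is a minimal sketch of the arithmetic behind a monthly uptime commitment. The function name and the 30-day default are our own assumptions for illustration; your SLA defines how the measurement period is actually counted.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: float = 30.0) -> float:
    """Minutes of downtime a provider can accrue in one month and still meet the SLA.

    Assumption: the month is measured in wall-clock minutes. Real SLAs may
    define the measurement period differently.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Compare the two commitments mentioned above.
    for sla in (99.99, 99.95):
        budget = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime allows about {budget:.1f} minutes of downtime in a 30-day month")
```

For a 30-day month this works out to roughly 4.3 minutes at 99.99% and 21.6 minutes at 99.95%. A 31-day month allows slightly more.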

They don’t expect to be held accountable

Before we return to that reliability question, let’s talk about why it often doesn’t matter how reliable they are. When it comes to the accountability of your cloud provider, there are a number of other forces working against you and your team. Some of those are insidious and some of them are engineered. Here are five reasons they don’t expect to be held accountable.

1) You have other things to worry about.

When unexpected downtime occurs, it sets off a scramble in which everyone’s primary concern becomes their part in resolving the issue or just making the pain go away. Your IT team is diagnosing, troubleshooting, and (ideally) implementing workarounds, all while being on the hook for communicating status to your organization. Your customer-facing teams are responding to surge demand as they communicate status and calm your customers. When you get to the other side of an incident, there’s a letdown as the adrenaline wears off. There’s also a natural inclination to get back to building stuff instead of fixing it. Meanwhile, the finance side of the house is unaware that there are dollars at risk because it has no visibility into service provider downtime.

2) You have a limited window to request credit.

Depending on the SLA, you may have 30 to 60 days to make a request. Depending on the scope of the issue, it may take you 30 days to fully address the fallout. The more time passes, the less communication there is between the IT teams responsible for uptime and the finance teams responsible for paying the bills. By the time you think about it again, your window may have closed.

3) The credit can’t possibly amount to much, right?

It is nearly impossible to convince an upstream provider to compensate you for the entirety of your loss, even when the responsibility is entirely theirs. Your SLA limits the provider’s responsibility to what you, the customer, have paid for service, and the remedy it offers is credit toward future service. Credits are divided into tiers, and in most cases the first tier awards you a credit of 10% of what you spent in the month of the incident. Is it even worth the effort?

4) You don’t want to damage the relationship

You’ve spent a ton of time and money getting services running and developing relationships with their customer team. Maybe that involved some patience, incentives, or “favors” on your provider’s part when you lifted and shifted from somewhere else. Many people don’t enjoy being the “squeaky wheel,” even when they have a contract in place. You are fully invested in the ecosystem and it could be painful if they become unresponsive when you need something.

5) Getting credit is on you

Every cloud provider tracks their downtime, the scope of services it affects, and the customers using those services. But they won’t volunteer to give you money back. You need to initiate the request, you need to submit it in a format that they approve, you need to prove that you were impacted by their incident, and you need to do it in their timeframe. There is often a disconnect between the finance arm of your organization, which is incentivized to conserve dollars, and the IT arm of your organization, which is incentivized to build and repair. Per Cloudzero’s 101+ Cloud Computing Statistics That Will Blow Your Mind, the AWS “Cost & Usage Report (CUR) is too large to load into Excel all at once. Amazon splits its monthly CUR into many separate files. Good luck understanding them.”
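Because the clock on a claim starts running whether or not anyone is watching it, even a tiny deadline tracker can help. The sketch below is illustrative only: the function names are hypothetical, and it assumes the window is counted in calendar days from the incident date. Your SLA may instead count from the end of the billing month, so check the contract.

```python
from datetime import date, timedelta


def claim_deadline(incident_date: date, window_days: int = 30) -> date:
    """Last day to file a credit request.

    Assumption: the window is counted in calendar days from the incident date.
    """
    return incident_date + timedelta(days=window_days)


def days_left_to_claim(incident_date: date, today: date, window_days: int = 30) -> int:
    """Days remaining before the claim window closes (negative if already closed)."""
    return (claim_deadline(incident_date, window_days) - today).days


if __name__ == "__main__":
    incident = date(2024, 3, 1)  # hypothetical incident date
    remaining = days_left_to_claim(incident, today=date(2024, 3, 20))
    print(f"Claim window closes on {claim_deadline(incident)}; {remaining} days left")
```

A finance team could run a check like this against each logged incident so the window does not lapse unnoticed.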

You deserve some credit

Next Signal tracks incidents reported publicly by AWS, GCP, and Azure. According to our tracking data, in 2024, AWS reported 25 incidents with more than 24 minutes of downtime. Each of these would trigger a 10% service credit under both a 99.99% and a 99.95% SLA commitment. AWS is the most reliable provider in the space. Historically, you can expect worse performance from both GCP and Azure. You only need to look at their public admissions of outages to discover that none of the big three cloud providers is meeting the uptime standard they set for themselves.

SLAs are intentionally punitive. A 0.01% failure in uptime can translate to a 10% service credit. It is hard not to view this as an admission that even a few minutes of downtime equates to a much larger impact. Even though this is a fraction of what an incident costs you, it isn’t a direct correlation to how much service you consumed in the month. Depending on your monthly cloud spend, even minimal outages can add up to real credit. And within the boundaries of a month, the clock is running and events are cumulative. If your cloud provider has two or more incidents that collectively exceed 4.4 minutes of downtime under a 99.99% commitment, you are eligible to make a claim.
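Here is a rough sketch of how cumulative downtime within a month maps to a first-tier credit. It models only the 10% first tier described in this post; the function names, the 30-day month, and the example spend are assumptions for illustration, and higher tiers and per-service rules in a real SLA are left out.

```python
def monthly_uptime_percent(incident_minutes: list[float], days_in_month: int = 30) -> float:
    """Uptime for the month after summing the downtime of every incident."""
    total_minutes = days_in_month * 24 * 60
    downtime = sum(incident_minutes)
    return 100 * (total_minutes - downtime) / total_minutes


def first_tier_credit(
    incident_minutes: list[float],
    monthly_spend: float,
    sla_percent: float = 99.99,
    first_tier_rate: float = 0.10,  # the 10% first tier; deeper tiers are not modeled
    days_in_month: int = 30,
) -> float:
    """Credit owed if cumulative downtime pushed uptime below the commitment."""
    uptime = monthly_uptime_percent(incident_minutes, days_in_month)
    if uptime >= sla_percent:
        return 0.0  # commitment met, no claim
    return monthly_spend * first_tier_rate


if __name__ == "__main__":
    # Two short incidents, 3 and 2 minutes, against a hypothetical $50,000 monthly bill.
    credit = first_tier_credit([3.0, 2.0], monthly_spend=50_000)
    print(f"Estimated first-tier credit: ${credit:,.2f}")
```

In this example neither incident breaches a 99.99% commitment alone, but together they do, which is why tracking every incident in the month matters.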

Splunk cites research by Oxford Economics suggesting the Global 2000 could be spending an average of $200M per year on downtime. If you don’t make SLA claims, you are giving up one of the few levers you have to hold your provider accountable. You have a contract and this is just business. It isn’t good business to pay for service that is inaccessible. It isn’t good business to neglect the strongest bit of feedback you can send. It shouldn’t be good business to fail to meet service level agreement standards.

Next Signal can help

Next Signal has spent the last couple of years collecting public signals of downtime by the big three. Next Signal also keeps track of the SLAs in place for hundreds of services. At the end of each month, Next Signal helps customers understand whether their SLAs have been met and, in cases when they haven’t, helps customers make credit claims. If your finance organization wants more visibility into outage events and their impact, and wants to be empowered to make claims, reach out to Next Signal and schedule a conversation.