SLOs, SLAs, and SLIs: putting numbers on "it kinda works"
Posted on February 20, 2026 • 8 minutes • 1597 words
Table of contents
- Putting a face on the acronyms (no drama)
- MiniFlix: your “more or less stable” streaming platform
- From “it’s slow” to “the p95 has gone through the roof”
- The 100% pipe dream (and why it’s a trap)
- When the acronyms help you talk to actual people
- The summary that doesn’t fit on a sales slide
- Quick glossary
- Sources and references
In almost every company, there’s a magical phrase used to describe a system’s health: “it more or less works.” Translated into plain English: nobody knows how often it goes down, how many requests fail, or how much money is lost when it decides not to work. But hey, “more or less.”
SLI, SLO, and SLA are the grown-up version of that phrase. They’re the way to go from “I think it’s fine” to “this is what it handles, this is what we promise, and this is what’s at stake,” without having to fall back on the classic “trust me, I’m an engineer.”
Putting a face on the acronyms (no drama)
Let’s strip the mystical cloak off of these three letters.
An SLI (Service Level Indicator) is, basically, what you look at. A concrete number that describes how your service is behaving. For example: what percentage of requests succeed, how long the checkout API takes to respond for most users, how many errors per minute you’re throwing on login.
An SLO (Service Level Objective) is the level you aim for on that number. It’s your way of saying “below this, we consider ourselves to be doing a bad job.” It might sound like: “at least 99.9% of requests to the payments API must complete successfully over any 30-day window,” or “99% of searches must respond in under 400 ms.”
And an SLA (Service Level Agreement) is when you tie the knot… by contract. It’s the document you send to your customers saying: “I promise X level of service, and if I don’t deliver, Y happens” (credits, discounts, penalties). A typical example: “99.9% monthly uptime or we refund 10% of the fee.”
Google and friends explain it with great patience: the SLI is the metric, the SLO is the internal target, and the SLA is what can cost you money if you fail.
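To make the split between "the metric" and "the target" concrete, here is a minimal Python sketch. The request counts, the `availability_sli` helper, and the 99.9% target are made-up illustrations, not figures from a real system.

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: percentage of requests that completed successfully."""
    if total == 0:
        # Convention assumed for this sketch: no traffic means nothing failed.
        return 100.0
    return 100.0 * successful / total


SLO_TARGET = 99.9  # internal objective, in percent (hypothetical)

sli = availability_sli(successful=999_412, total=1_000_000)
meets_slo = sli >= SLO_TARGET

print(f"SLI = {sli:.3f}%, SLO = {SLO_TARGET}%, met: {meets_slo}")
```

The SLI is whatever the function returns, and the SLO is the constant you compare it against. The SLA has no place in the code: it lives in the contract.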
MiniFlix: your “more or less stable” streaming platform
Imagine you build MiniFlix, your own streaming platform. You have a /play endpoint and users who, for some reason, expect that when they hit the play button, the video actually plays. Weird people.
In this context, a good SLI might be:
- “Percentage of minutes during which the playback service responds correctly (200 OK) to video requests.”
Then you decide that, in order to sleep reasonably well, your internal SLO will be:
- “We want 99.95% monthly availability on video playback.” That means you accept, at most, roughly 22 minutes of total downtime per month (0.05% of a 30-day month is 21.6 minutes).
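The downtime figure is plain arithmetic, and it helps to see how quickly the allowance shrinks with each extra nine. A small sketch, assuming a 30-day month; the `allowed_downtime_minutes` helper is a name invented for this example.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per 30 days")
```

For 99.95% this gives about 21.6 minutes, which is where the "roughly 22 minutes" comes from.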
But for your enterprise clients, you sign an SLA that’s a touch more relaxed — 99.9% — with credits if you drop below that. Not because you’re a bad person, but because you want a small cushion: if you miss the SLO but stay above the SLA, you know you’re burning through your internal error budget and it’s time to get serious about reliability before the credits start showing up on the invoice.
Suddenly, instead of “MiniFlix is being a bit flaky today,” you can say: “we’ve burned 80% of our error budget in 10 days — either we ease up on risky changes or we’re going to miss our own standard this month.” That sounds less like bar talk and more like a conscious decision.
Maybe someone sees it as a shell game — and in the wrong hands it could be — but when the numbers are real, at least you know which shell to look under.
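The cushion between the internal SLO and the contractual SLA can be sketched as a three-state check. The thresholds are the MiniFlix numbers from above; the `Status` enum and the `classify` function are hypothetical names used only for illustration.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "within SLO"
    WARNING = "below SLO, still above SLA"
    BREACH = "below SLA, credits are due"


SLO = 99.95  # internal target (percent)
SLA = 99.9   # contractual commitment (percent)


def classify(availability_percent: float, slo: float = SLO, sla: float = SLA) -> Status:
    """Place a measured monthly availability relative to the SLO and the SLA."""
    if availability_percent >= slo:
        return Status.HEALTHY
    if availability_percent >= sla:
        return Status.WARNING
    return Status.BREACH


for measured in (99.97, 99.92, 99.80):
    print(f"{measured}% -> {classify(measured).value}")
```

The middle state is the whole point of the cushion: you find out you have a problem before your customers' invoices do.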
From “it’s slow” to “the p95 has gone through the roof”
One of the big advantages of talking about SLOs isn’t the acronym itself — it’s the change in conversation.
Before:
- “The site is slow.”
- “Are you sure? It loads fine for me.”
- “Well, it doesn’t for me.”
After:
- “The p95 latency on checkout has gone from 800 ms to 2.5 seconds this week. Our SLO is p95 < 1.5 seconds. And, coincidence or not, conversions have dropped 7%.”
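For anyone wondering where a p95 comes from, here is a minimal sketch using the nearest-rank method. Real monitoring systems usually estimate percentiles from histograms instead, and the latency samples below are invented.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical checkout latencies in milliseconds.
latencies_ms = [420, 510, 640, 700, 780, 810, 950, 1200, 1900, 2600]

SLO_P95_MS = 1500  # "p95 < 1.5 seconds"
p95 = percentile(latencies_ms, 95)

print(f"p95 = {p95} ms, within SLO: {p95 < SLO_P95_MS}")
```

Note that the median of these samples is 780 ms, which looks perfectly healthy. The tail is where the unhappy users live.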
Before:
- “The system goes down sometimes.”
- “Well, we have a lot of users — that’s normal.”
After:
- “In 10 days we’ve burned through 60% of the availability error budget. If we keep going like this, we won’t hit 99.9% this month. Do we keep shipping risky features or ease off the gas?”
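The "if we keep going like this" part can be put in numbers too. A rough sketch, assuming a 30-day window and that the current pace simply continues; `burn_rate` and `days_until_exhausted` are names made up for this example.

```python
def burn_rate(budget_consumed: float, elapsed_days: float, window_days: float = 30) -> float:
    """Spend relative to an even pace: 1.0 means the budget lasts exactly the window."""
    return budget_consumed / (elapsed_days / window_days)


def days_until_exhausted(budget_consumed: float, elapsed_days: float) -> float:
    """Days remaining before the budget runs out if the current pace continues."""
    if budget_consumed <= 0:
        return float("inf")
    return elapsed_days / budget_consumed - elapsed_days


# The scenario from the conversation: 60% of the budget gone after 10 days.
rate = burn_rate(0.60, elapsed_days=10)
remaining = days_until_exhausted(0.60, elapsed_days=10)

print(f"burn rate: {rate:.1f}x, budget gone in about {remaining:.1f} more days")
```

A burn rate of about 1.8 means the budget runs out around day 17 of 30, which is a much easier thing to argue about than "the system goes down sometimes."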
In Google’s SRE materials, two ideas come up with considerable insistence:
- SLOs need to be anchored in the user’s experience, not in metrics that only four people on the team understand.
- If your SLI goes up or down and users don’t care, you’re probably looking in the wrong place.
The 100% pipe dream (and why it’s a trap)
There’s a dangerous phase when you first discover SLOs. You typically go through something like:
- “We want to provide excellent service.”
- “Let’s set 100% availability then.”
- “Sure thing, champ.”
The serious literature on the subject is fairly unanimous: 100% is an expensive fantasy. Even giants like Google or AWS talk about 99.9%, 99.95%, 99.99%… and even then, they still have rough days.
Another trap is making up targets without looking at your track record. If historically you’ve been hovering around 99.2%, setting 99.99% “because it sounds professional” is basically promising yourself you’ll live in permanent violation. Better to look at a few months of data, see what level you’re actually delivering, and from there decide: do we want to maintain this? Do we want to push it up a bit? Do we have the technical and financial headroom to do it?
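One cheap sanity check before committing to a target is to replay it against your own history. A sketch with invented monthly availability figures; the `months_in_violation` helper is hypothetical.

```python
# Hypothetical monthly availability for the last six months, in percent.
history = [99.31, 99.18, 99.42, 99.05, 99.27, 99.20]


def months_in_violation(monthly_availability: list[float], target: float) -> int:
    """Count the months that would have missed the candidate target."""
    return sum(1 for month in monthly_availability if month < target)


for candidate in (99.0, 99.2, 99.99):
    missed = months_in_violation(history, candidate)
    print(f"target {candidate}%: missed in {missed} of {len(history)} months")
```

With this data, 99.99% would have been missed every single month, which is exactly the "permanent violation" scenario.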
And then there’s the wrong scale. “We want p99 < 500 ms on /v1/search-details” sounds very specific, but it means nothing to anyone outside the team. “We want 99% of searches to finish in under 400 ms because above that we see people bailing out” is a different story: suddenly, business understands why that number matters.
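An SLO phrased as "99% of searches under 400 ms" maps directly onto a good-events-over-total-events SLI. A minimal sketch, with made-up latency samples and a hypothetical `fraction_within` helper:

```python
def fraction_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Share of requests that finished strictly under the threshold."""
    if not latencies_ms:
        # Convention assumed for this sketch: no traffic counts as all good.
        return 1.0
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


# Hypothetical search latencies in milliseconds.
searches_ms = [120, 180, 210, 250, 390, 400, 450, 95, 310, 1300]

SLO_FRACTION = 0.99  # "99% of searches..."
THRESHOLD_MS = 400   # "...in under 400 ms"

sli = fraction_within(searches_ms, THRESHOLD_MS)
print(f"{sli:.0%} of searches under {THRESHOLD_MS} ms, SLO met: {sli >= SLO_FRACTION}")
```

Counting good events this way tends to be easier to explain outside the team than a raw percentile, because the number reads as "this share of people got a fast search."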
When the acronyms help you talk to actual people
Where SLIs and SLOs truly shine is at the boundary with business. Without them, the conversation tends to be a clash of religions.
Scene without SLOs:
- Business: “We need ten new features this quarter.”
- Engineering: “There’s not enough time.”
- Business: “You always say that.”
- Engineering: “There’s a lot of tech debt.”
- Business: “That sounds like an excuse.”
Scene with SLOs and error budget on the table:
- Engineering: “Our availability SLO is 99.9%. We’re 12 days in and we’ve already burned 70% of the error budget.”
- Business: “What happens if we miss it?”
- Engineering: “More incidents, worse user experience, and the possibility of breaching the SLA, which means credits to clients and a hit to our reputation.”
- Business: “What are the options?”
- Engineering: “We can pause risky features for a sprint and focus on stability, or keep going as-is and accept the risk explicitly.”
It’s not like everyone suddenly holds hands, but at least you’re discussing risks and decisions with numbers — not “tech debt” as some abstract boogeyman. Google describes exactly this: using SLOs and error budgets as currency to negotiate how much reliability you’re willing to sacrifice to move faster… and vice versa.
The summary that doesn’t fit on a sales slide
In the end, these three letters aren’t a consultant’s trick — they’re a kind of contract with yourself:
- The SLI is “what reality am I looking at.”
- The SLO is “what level of reality do I accept as good enough.”
- The SLA is “what part of that reality do I promise to others, and what happens if I don’t deliver.”
From there, you either keep living in the same old “it kinda works” mode, or you accept that putting uncomfortable numbers on the table, even if they sting a bit, is the only grown-up way to decide when to push, when to brake, and how much risk you can afford.
It sounds less epic than “we’re going to revolutionize the industry,” but it has one advantage: it works even when nobody’s around to present the dashboard.
Quick glossary
The minimum viable knowledge so nobody catches you with a “wait, what’s that?” face.
- SLI (Service Level Indicator): a metric that reflects how the user experiences your service (availability, latency, error rate…).
- SLO (Service Level Objective): an internal target that sets the minimum acceptable threshold for an SLI (e.g., 99.95% monthly availability).
- SLA (Service Level Agreement): a contract with the client that establishes service level commitments and the consequences (credits, penalties) if they’re not met.
- Error budget: the margin of failure allowed before breaching the SLO. Calculated as 100% - SLO. For example, a 99.9% SLO leaves a 0.1% error budget.
- p95 / p99 (percentiles): the values below which 95% or 99% of measurements fall. Used to measure latency without letting an average hide the slow tail that real users actually experience.
- SRE (Site Reliability Engineering): an engineering discipline, popularized by Google, that applies software principles to the operation of production systems.
Sources and references
Because saying “I more or less made it up” doesn’t meet any reasonable SLA.
- SRE fundamentals: SLIs, SLAs and SLOs - Google Cloud Blog
- Service Level Objectives (SRE Book, Ch. 4) - Google
- The Key Differences Between SLI, SLO and SLA in SRE - DZone
- Implementing SLOs (SRE Workbook, Ch. 2) - Google
- Embracing Risk (SRE Book, Ch. 3) - Google
- What is a Service-Level Objective (SLO)? - Atlassian
- SLAs: The What, the Why, the How - Atlassian
- Example SLO Document (SRE Workbook) - Google
- SLO Engineering Case Studies (SRE Workbook, Ch. 3) - Google
- Example Error Budget Policy (SRE Workbook) - Google
- Alerting on SLOs (SRE Workbook, Ch. 5) - Google
- Web Vitals - Google / web.dev
- The Calculus of Service Availability - ACM Queue
- Error Budget - Atlassian
