From LinkedIn · 1 min
What does 'something you can rely on' actually mean?
'We have something we can rely on' sounds clean. It is also vague. Here are the five things that have to be true before a system earns the word 'reliable'.
There is a phrase I keep using when I talk with founders about backend systems: “we have something we can rely on.”
It sounds clean. It is also vague.
Here is what I actually mean by it.
A reliable production system is not the one that has never broken. It is the one that fails in ways you have already seen.
Five things have to be true:
- Visibility. You can see what is happening right now, without asking three engineers.
- Ownership. When something breaks, there is a name attached to it. Not just a team. A clear owner.
- Known failure modes. You know how it breaks. The list is short. Nothing on it surprises you anymore.
- Recovery rehearsed. Backups are not theoretical. Restore has been tested, and tested in the last 90 days.
- Controlled change. Deploys do not trigger the “what just shipped?” panic-thread on Slack.
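As a rough illustration, the five conditions can be written down as a checklist you run against a system. This is only a sketch: the field names, the `SystemStatus` type, and the owner name are all hypothetical, and the 90-day window is the one real number taken from the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# "Tested in the last 90 days" from the list above.
RESTORE_MAX_AGE = timedelta(days=90)


@dataclass
class SystemStatus:
    # All field names are hypothetical, chosen for this sketch.
    has_live_dashboard: bool        # Visibility
    owner: Optional[str]            # Ownership: a named person, not a team
    known_failure_modes: List[str]  # Known failure modes
    last_restore_test: Optional[date]  # Recovery rehearsed
    deploys_are_logged: bool        # Controlled change


def unmet_conditions(s: SystemStatus, today: date) -> List[str]:
    """Return the conditions the system does not yet meet, in list order."""
    unmet = []
    if not s.has_live_dashboard:
        unmet.append("visibility")
    if not s.owner:
        unmet.append("ownership")
    if not s.known_failure_modes:
        unmet.append("known failure modes")
    # A restore that was never tested, or tested too long ago, does not count.
    if s.last_restore_test is None or today - s.last_restore_test > RESTORE_MAX_AGE:
        unmet.append("recovery rehearsed")
    if not s.deploys_are_logged:
        unmet.append("controlled change")
    return unmet


# Hypothetical system: everything in place except a recent restore test.
status = SystemStatus(True, "dana", ["queue backlog"], date(2024, 1, 10), True)
print(unmet_conditions(status, today=date(2024, 6, 1)))  # ['recovery rehearsed']
```

An empty result is the bar for calling the system reliable. Anything left on the list is where the work is.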
Most systems fail one of the first three. The conversation never even gets to deploys.
That is the gap between “it runs” and “you can rely on it.”
Uptime is a lagging indicator. Reliability is not.