What I look for first when I open a production system
When I open a production system for the first time, I have five days. Not five days to fix it. Five days to understand it well enough to say, with evidence, where it will fail and what that failure will cost.
That constraint decides everything that follows. I cannot read every line. I cannot trace every path. So I do not start where the code is most interesting. I start where being wrong is most expensive.
The order I look in is not a checklist. It is a ranking by blast radius.
The biggest risks come first, while I still have the freshest eyes and the fewest assumptions. By day four I will know too much to see the system the way an attacker or an outage will. So the first day is the most valuable, and I spend it on the things that take the whole system down, not the things that annoy a developer.
Here is the order, and why each step sits where it does.
First, what the internet can touch
Not the code. The exposure.
The first question is what is reachable from outside, because that is the only surface where a stranger gets a vote. Security groups left open from a debugging session. A storage bucket that is public because making it public was the fastest way to ship. An admin panel on a path someone assumed nobody would guess. A database listening on a public address because it was easier than a bastion.
The blast radius here is the entire system, so it goes first. Everything else assumes the attacker is already inside or is never coming. This step asks the cheaper question: how hard is it to get in at all.
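A minimal sketch of that first pass, assuming the inbound rules have already been exported into plain dictionaries. The rule shape, the resource names, and the shortlist of sensitive ports are all hypothetical; a real inventory will look different.

```python
import ipaddress

# Hypothetical shortlist of ports that should rarely face the internet.
SENSITIVE_PORTS = {22: "ssh", 3306: "mysql", 5432: "postgres", 6379: "redis"}


def is_world_open(cidr):
    """True when the CIDR matches every address (0.0.0.0/0 or ::/0)."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def exposed_rules(rules):
    """Return (resource, port, service) for world-open rules on sensitive ports."""
    findings = []
    for rule in rules:
        if not is_world_open(rule["cidr"]):
            continue
        for port, service in sorted(SENSITIVE_PORTS.items()):
            if rule["from_port"] <= port <= rule["to_port"]:
                findings.append((rule["resource"], port, service))
    return findings


# Hypothetical inventory: the rule shape and resource names are made up.
rules = [
    {"resource": "web-lb", "cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    {"resource": "orders-db", "cidr": "0.0.0.0/0", "from_port": 5432, "to_port": 5432},
    {"resource": "bastion", "cidr": "10.0.0.0/8", "from_port": 22, "to_port": 22},
]

for resource, port, service in exposed_rules(rules):
    print(f"{resource}: {service} ({port}) is open to the whole internet")
```

The public load balancer on 443 is expected and passes. The database is the finding. The point of the sketch is the question it asks, not the port list, which any real review would extend.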
Then, who can act, and who decided that
Inside the system, the question changes from “can you get in” to “what can you do once you are in.”
Identity rarely gets designed. It accumulates. A key minted for a one-off script three years ago, still holding full access. A role widened once during an incident at 2am and never narrowed. A shared credential pasted into a runbook so the on-call could use it. None of it was a mistake at the time. Every grant was reasonable in the moment it was made. Together they are a map of standing permission that nobody has read end to end.
I read it. Who can reach production, who can read the data, who can change the rules. The dangerous answers are almost never malicious. They are sediment.
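Reading it end to end can start as something this small. The sketch assumes the grants can be listed with a scope, a creation date, and a last-used date; the `Grant` fields, the scope names, and the 90-day threshold are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Grant:
    principal: str
    scope: str                  # hypothetical labels, e.g. "admin", "read-only"
    created: date
    last_used: Optional[date]   # None means never used


# Assumed set of scopes that count as broad standing permission.
BROAD_SCOPES = {"admin", "full-access"}


def standing_sediment(grants, today, stale_after_days=90):
    """Return broad grants that are unused or stale, oldest first."""
    cutoff = today - timedelta(days=stale_after_days)
    flagged = [
        g for g in grants
        if g.scope in BROAD_SCOPES and (g.last_used is None or g.last_used < cutoff)
    ]
    return sorted(flagged, key=lambda g: g.created)


grants = [
    Grant("ci-deployer", "admin", date(2021, 3, 2), date(2024, 5, 30)),
    Grant("one-off-migration-key", "admin", date(2021, 5, 14), None),
    Grant("incident-role", "full-access", date(2023, 1, 9), date(2023, 1, 9)),
    Grant("reporting", "read-only", date(2020, 1, 1), None),
]

for g in standing_sediment(grants, today=date(2024, 6, 1)):
    print(f"{g.principal}: {g.scope}, created {g.created}, last used {g.last_used}")
```

This only catches the grants nobody is using. A broad key that is still in daily use needs a different question, which is whether it needs to be that broad.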
Then, whether the data survives being needed
Everyone has backups. Almost nobody has a restore they have run on purpose.
The backup job is green and has been green for two years, which tells you the backup runs, and nothing about whether it works. A backup you have never restored is a hypothesis. The first time you find out whether it holds should not be the night you need it.
So I look for the restore, not the backup. Has it been run, into a clean environment, against a clock, recently enough to mean anything? This is the step that most often turns calm into quiet, because the honest answer is usually no.
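Those four conditions are easy to write down as a check. This is a sketch under assumptions: the `RestoreDrill` record, the `time_budget` parameter, and the 90-day freshness window are placeholders I made up for illustration, and the real thresholds belong to whoever owns the system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RestoreDrill:
    ran_on: date
    clean_environment: bool   # restored somewhere that did not already hold the data
    duration: timedelta       # wall-clock time from start to usable data


def restore_gaps(drill, today, time_budget, max_age_days=90):
    """Return the reasons the last restore does not count as evidence."""
    if drill is None:
        return ["no restore has ever been run on purpose"]
    gaps = []
    if not drill.clean_environment:
        gaps.append("restore was not into a clean environment")
    if drill.duration > time_budget:
        gaps.append("restore took longer than the time budget")
    if today - drill.ran_on > timedelta(days=max_age_days):
        gaps.append("last restore is too old to mean anything")
    return gaps


last_drill = RestoreDrill(date(2023, 1, 10), True, timedelta(hours=6))
for gap in restore_gaps(last_drill, date(2024, 6, 1), time_budget=timedelta(hours=4)):
    print(gap)
```

An empty list is the only passing answer. A green backup job never appears in this check at all, which is the point.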
Then, how code becomes production
The deploy path is where reliability is actually decided, because it is the one thing that touches the system on a schedule.
Can you roll back in one step, or is rollback a manual scramble nobody has rehearsed? Is there a path that skips review? Does the artifact that was tested get shipped, or does it get rebuilt on the way out and become something slightly different? The pipeline is a behavioral document: it tells you how the team acts under pressure, because it is what runs when there is pressure.
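The artifact question has a cheap test: hash what was tested, hash what was shipped, and compare. A minimal sketch using SHA-256 from the standard library; the byte strings stand in for real build outputs and are invented for the example.

```python
import hashlib


def digest(artifact: bytes) -> str:
    """SHA-256 hex digest of an artifact's bytes."""
    return hashlib.sha256(artifact).hexdigest()


def same_artifact(tested: bytes, shipped: bytes) -> bool:
    """True only when the shipped bytes are exactly the tested bytes."""
    return digest(tested) == digest(shipped)


# Stand-ins for build outputs. The rebuild differs only by an embedded timestamp.
tested = b"app-v1 built=2024-05-01T10:00"
promoted = tested
rebuilt = b"app-v1 built=2024-05-01T10:07"

print("promoted matches tested:", same_artifact(tested, promoted))
print("rebuilt matches tested:", same_artifact(tested, rebuilt))
```

A rebuild that differs by one timestamp is probably harmless. The check is there because "probably" is the part nobody verified.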
Last, what works only because nothing has tested it
Only now do I read the code itself, and through one lens: what here works because nothing has stressed it yet.
I am not counting coverage. I am looking for the places the system has never been asked a hard question. The path that only runs at month end. The queue that has never been full. The retry that has never actually retried. This is where untested code lives, and untested code that happens to work is a demo that has not failed yet.
It goes last on purpose. It has the smallest blast radius of the five and the largest surface, so it would eat all five days if I let it. After the first four steps I know which untested paths actually matter, and I read only those.
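The retry is the easiest of these to ask a hard question of, because the failure can be faked. A sketch, not anyone's production code: `call_with_retry` and `make_flaky` are hypothetical helpers written for this example.

```python
def call_with_retry(operation, attempts=3):
    """Call operation, retrying on ConnectionError up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure


def make_flaky(failures):
    """Build an operation that fails `failures` times, then succeeds."""
    calls = {"count": 0}

    def operation():
        calls["count"] += 1
        if calls["count"] <= failures:
            raise ConnectionError("simulated outage")
        return "ok"

    return operation, calls


# Force the retry path to run instead of assuming it works.
operation, calls = make_flaky(failures=2)
result = call_with_retry(operation)
print(result, "after", calls["count"], "calls")
```

Until something like `make_flaky` exists in the test suite, the retry is only known to work when there is nothing to retry.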
What five days actually produces
Not a list of bugs. A map.
The output is where the system fails first, ranked by what being wrong about it would cost, with the recovery paths that do not currently exist marked clearly. It is the document that lets someone who owns the system decide what to fix now, what to fund later, and what to simply watch. The findings are the easy part. The order is the work.
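The ordering itself can be stated in a few lines. This sketch assumes each finding carries a blast-radius score and a flag for whether a recovery path exists; the 1-to-5 scale and the example findings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    blast_radius: int        # hypothetical scale: 1 = one feature, 5 = whole system
    has_recovery_path: bool


def build_map(findings):
    """Order by blast radius, largest first; on ties, missing recovery paths lead."""
    return sorted(findings, key=lambda f: (-f.blast_radius, f.has_recovery_path))


findings = [
    Finding("month-end job never load-tested", 2, True),
    Finding("database on a public address", 5, True),
    Finding("backup never restored", 5, False),
    Finding("rollback is manual", 4, False),
]

for f in build_map(findings):
    marker = "" if f.has_recovery_path else "  [no recovery path]"
    print(f"{f.blast_radius}  {f.name}{marker}")
```

The sort is trivial. The scores are not, and assigning them is what the five days are for.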
Most teams already know at least one of these is true. The audit is not the discovery. It is the moment it stops being optional to know.