Article · May 22, 2026 · 4 min

What I look for first when I open a production system

I have five days, not to fix a system, but to find where it will fail and what that failure will cost. The order I look in is not a checklist. It is a ranking by blast radius.

By Olha Shevchenko. Audits production systems on AWS and Node.js.

A field method

I have five days to map a system and find what actually matters.

The order I look in is not random. It is ranked by blast radius.
Here is what I check first, in order, and why.

01 · Network perimeter

What's exposed to the internet.
- Security groups: ports open to 0.0.0.0/0?
- Public IPs: which hosts have them, and why?
- DNS records: A-records pointing where? Orphans?
- SSH: still on port 22, still open to the world?
Why first: one bad security-group rule can expose the whole system.
02 · Identity & access

Who can do what.
- IAM users: who's active, who's a zombie, who has admin?
- SSH keys on production hosts: any orphans?
- DB users: shared credentials? overly broad privileges?
- The same login key in three different .env files?
Why second: today's exposure is yesterday's leaked credential.
03 · Source of truth for config

Where credentials live, vs where they should.
- .env files: tracked in git? written in three places that disagree?
- Secrets Manager: present, and actually read by the app?
- Hardcoded credentials in source?
- Drift: do .env values match the secrets store?
This one always surprises founders. The .env in the repo is sometimes the .env in production.
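The drift question can be sketched in a few lines. This is a minimal illustration, not a tool I am claiming to ship: the source names ("repo", "host", "secrets_store"), the sample values, and the helpers `parse_env`, `fingerprint`, and `drift` are all hypothetical, and the secrets-store values are assumed to have been fetched already into a plain dict. It compares short hashes rather than raw values so the report itself does not leak a credential.

```python
import hashlib


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    Deliberately minimal: no multiline values, no variable expansion.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def fingerprint(value):
    """Short hash of a value, so reports never contain the secret itself."""
    if value is None:
        return None  # the key is missing from that source
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


def drift(sources):
    """Return {key: {source_name: fingerprint}} for keys the sources disagree on."""
    all_keys = set().union(*sources.values())
    report = {}
    for key in sorted(all_keys):
        prints = {name: fingerprint(vals.get(key)) for name, vals in sources.items()}
        if len(set(prints.values())) > 1:
            report[key] = prints
    return report


# Hypothetical sample data: two .env files and a secrets-store export.
repo_env = "DB_URL=postgres://app@db/prod\nAPI_KEY=abc123\n"
host_env = "DB_URL=postgres://app@db/prod\nAPI_KEY=abc999\nJWT_SECRET=s3cret\n"
store = {"DB_URL": "postgres://app@db/prod", "API_KEY": "abc999"}

report = drift({
    "repo": parse_env(repo_env),
    "host": parse_env(host_env),
    "secrets_store": store,
})
for key, prints in report.items():
    print(key, prints)
```

A key that shows `None` for one source is as interesting as a mismatch: it usually means the app reads that value from somewhere nobody listed.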
04 · Deploy pipeline

What happens between commit and production.
- Manual deploy, or automated?
- The pipeline runs as which user, with which permissions?
- What runs on every deploy? Key regen, cache clear, things you forgot?
- Is there a rollback path?
Most regressions I have seen came from the pipeline doing something nobody remembered it did.
05 · Backups & recovery

What you'd do if production disappeared right now.
- Backup automation: cron, snapshot, native?
- Encryption at rest?
- When was the last successful restore test?
- Where do backups physically live?
"Backups exist" is not the same as "backups work." Restore is the only test that counts.
Why the order

Network → Identity → Config → Pipeline → Backups

By day four you can find anything. On day one you can only find what is loud. So day one is for what is loudest when broken.
I find the high-blast-radius items first, while the system is still in my head.

What it is

This is the framework. It surfaces problems.

The output of five days is a prioritized roadmap: severity, evidence, recommended actions.
Fixing the findings is separate work, usually one to three weeks depending on what surfaces.

Your turn

What do you check first when you open a system you've never seen?

The order I use is just one operator's order.
Curious what yours looks like.

Olha Shevchenko

Engineer · backend + cloud systems

When I open a production system for the first time, I have five days. Not five days to fix it. Five days to understand it well enough to say, with evidence, where it will fail and what that failure will cost.

That constraint decides everything that follows. I cannot read every line. I cannot trace every path. So I do not start where the code is most interesting. I start where being wrong is most expensive.

The order I look in is not a checklist. It is a ranking by blast radius.

The biggest risks come first, while I still have fresh eyes and the fewest assumptions. By day four I will know too much to see the system the way an attacker or an outage will. So the first day is the most valuable, and I spend it on the things that take the whole system down, not the things that annoy a developer.

Here is the order, and why each step sits where it does.

First, what the internet can touch

Not the code. The exposure.

The first question is what is reachable from outside, because that is the only surface where a stranger gets a vote. Security groups left open from a debugging session. A storage bucket that is public because making it public was the fastest way to ship. An admin panel on a path someone assumed nobody would guess. A database listening on a public address because it was easier than a bastion.

The blast radius here is the entire system, so it goes first. Everything else assumes the attacker is already inside or is never coming. This step asks the cheaper question: how hard is it to get in at all?
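For the security-group part of that question, a rough sketch of the check looks like this. It assumes you already have the list found under "SecurityGroups" in an EC2 DescribeSecurityGroups response (for example from the AWS CLI or boto3) and works on plain dicts, so it makes no API calls itself. The function name `world_open_rules` and the sample groups are hypothetical.

```python
# CIDRs that mean "anyone on the internet".
WORLD_CIDRS = {"0.0.0.0/0", "::/0"}


def world_open_rules(security_groups):
    """Yield (group_id, protocol, from_port, to_port) for ingress rules open to the world.

    `security_groups` is the list under "SecurityGroups" in a
    DescribeSecurityGroups response.
    """
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            cidrs = {r.get("CidrIp") for r in perm.get("IpRanges", [])}
            cidrs |= {r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])}
            if cidrs & WORLD_CIDRS:
                # "All traffic" rules (IpProtocol "-1") typically have no
                # port range, so the ports may come back as None.
                yield (
                    group["GroupId"],
                    perm.get("IpProtocol"),
                    perm.get("FromPort"),
                    perm.get("ToPort"),
                )


# Hypothetical sample data, shaped like the API response.
sample = [
    {"GroupId": "sg-web", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ]},
    {"GroupId": "sg-db", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "10.0.0.0/8"}]},
    ]},
]

findings = list(world_open_rules(sample))
for finding in findings:
    print(finding)
```

A world-open 443 on a web tier is usually intended; a world-open 5432 is the finding. The script only lists candidates. Deciding which ones are deliberate is still a conversation with the team.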

Then, who can act, and who decided that

Inside the system, the question changes from "can you get in" to "what can you do once you are in."

Identity rarely gets designed. It accumulates. A key minted for a one-off script three years ago, still holding full access. A role widened once during an incident at 2am and never narrowed. A shared credential pasted into a runbook so the on-call could use it. None of it was a mistake at the time. Every grant was reasonable in the moment it was made. Together they are a map of standing permission that nobody has read end to end.

I read it. Who can reach production, who can read the data, who can change the rules. The dangerous answers are almost never malicious. They are sediment.
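One way to make the sediment visible is to sort standing credentials by how long they have gone unused. The sketch below is an illustration under assumptions: the `Credential` record, its fields, the sample entries, and the 90-day idle threshold are all hypothetical, and the inventory is assumed to be exported from wherever the system keeps it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Credential:
    # Hypothetical record; fill it from whatever inventory you can export.
    owner: str
    scope: str                     # e.g. "admin", "read-only"
    created: datetime
    last_used: Optional[datetime]  # None means no recorded use


def sediment(creds, now, idle_days=90):
    """Return credentials with no recorded use in the last `idle_days`, oldest first."""
    cutoff = now - timedelta(days=idle_days)
    idle = [c for c in creds if c.last_used is None or c.last_used < cutoff]
    return sorted(idle, key=lambda c: c.created)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical sample inventory.
inventory = [
    Credential("one-off-script", "admin", utc(2023, 4, 1), utc(2023, 4, 2)),
    Credential("deploy-bot", "deploy", utc(2025, 1, 10), utc(2026, 5, 20)),
    Credential("oncall-shared", "admin", utc(2024, 8, 15), None),
]

stale = sediment(inventory, now=utc(2026, 5, 22))
for cred in stale:
    print(cred.owner, cred.scope, cred.created.date())
```

The output is a reading list, not a revocation list. Each entry still needs the question from the heading: who decided this, and does that decision still hold.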

Then, whether the data survives being needed

Everyone has backups. Almost nobody has a restore they have run on purpose.

The backup job is green and has been green for two years, which tells you the backup runs, and nothing about whether it works. A backup you have never restored is a hypothesis. The first time you find out whether it holds should not be the night you need it.

So I look for the restore, not the backup. Has it been run, into a clean environment, against a clock, recently enough to mean anything? This is the step that most often turns calm into quiet, because the honest answer is usually no.
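Those four conditions can be written down as a small check. This is a sketch only: the `RestoreDrill` record, its field names, and the default thresholds (90 days, a four-hour target) are assumptions for illustration, and the right numbers depend on what the business can tolerate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RestoreDrill:
    # Hypothetical record of one restore rehearsal.
    finished_at: datetime
    duration: timedelta      # wall-clock time from start to usable data
    clean_environment: bool  # restored into a fresh target, not over the source
    data_verified: bool      # someone checked the data, not just the exit code


def restore_gaps(drill, now, max_age=timedelta(days=90), target=timedelta(hours=4)):
    """Return the reasons a drill does not count as evidence. Empty list means it counts."""
    if drill is None:
        return ["no restore has ever been run on purpose"]
    gaps = []
    if now - drill.finished_at > max_age:
        gaps.append("last restore is too old to mean anything")
    if not drill.clean_environment:
        gaps.append("restore did not go into a clean environment")
    if not drill.data_verified:
        gaps.append("restored data was never verified")
    if drill.duration > target:
        gaps.append("restore took longer than the recovery target")
    return gaps


now = datetime(2026, 5, 22, tzinfo=timezone.utc)

# Hypothetical drills.
old_drill = RestoreDrill(
    finished_at=datetime(2025, 11, 3, tzinfo=timezone.utc),
    duration=timedelta(hours=6),
    clean_environment=True,
    data_verified=False,
)
good_drill = RestoreDrill(
    finished_at=datetime(2026, 5, 1, tzinfo=timezone.utc),
    duration=timedelta(hours=2),
    clean_environment=True,
    data_verified=True,
)

for reason in restore_gaps(old_drill, now):
    print(reason)
```

Passing `None` for the drill covers the most common case: the backup job is green and no restore record exists at all.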

Then, how code becomes production

The deploy path is where reliability is actually decided, because it is the one thing that touches the system on a schedule.

Can you roll back in one step, or is rollback a manual scramble nobody has rehearsed? Is there a path that skips review? Does the artifact that was tested get shipped, or does it get rebuilt on the way out and become something slightly different? The pipeline is a behavioral document: it tells you how the team acts under pressure, because it is what runs when there is pressure.
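The tested-versus-shipped question has a cheap mechanical answer: compare digests. The sketch below assumes both artifacts are available as files; the helper names and the temporary files standing in for build outputs are hypothetical. For container images the equivalent check would compare image digests rather than file hashes.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large artifacts do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_artifact(tested, shipped):
    """True only if the two files are byte-for-byte identical."""
    return sha256_of(tested) == sha256_of(shipped)


# Hypothetical stand-ins for a tested build and a rebuild made at deploy time.
with tempfile.TemporaryDirectory() as tmp:
    tested = Path(tmp) / "app-tested.tar"
    rebuilt = Path(tmp) / "app-rebuilt.tar"
    tested.write_bytes(b"build output v1")
    rebuilt.write_bytes(b"build output v1 ")  # one stray byte from the rebuild

    identical = same_artifact(tested, tested)
    drifted = same_artifact(tested, rebuilt)

print("tested vs tested:", identical)
print("tested vs rebuilt:", drifted)
```

A mismatch does not prove the rebuild is broken. It proves that what shipped is not what was tested, which is the finding.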

Last, what works only because nothing has tested it

Only now do I read the code itself, and through one lens: what here works because nothing has stressed it yet.

I am not counting coverage. I am looking for the places the system has never been asked a hard question. The path that only runs at month end. The queue that has never been full. The retry that has never actually retried. This is where untested code lives, and untested code that happens to work is a demo that has not failed yet.
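"The retry that has never actually retried" is the easiest of these to turn into a test: force the failure and see whether the retry path runs. The sketch below is a generic illustration, not code from any audited system. `with_retry` and `FlakyThenOk` are hypothetical names, and `ConnectionError` stands in for whatever the real dependency raises.

```python
import time


def with_retry(operation, attempts=3, delay=0.0, retry_on=(ConnectionError,)):
    """Call `operation` until it succeeds or `attempts` calls have been made."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            time.sleep(delay)


class FlakyThenOk:
    """Test double that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return "ok"


# The retry path actually runs: two failures, then success on the third call.
flaky = FlakyThenOk(failures=2)
result = with_retry(flaky)

# The exhaustion path also runs: it gives up and raises after three calls.
hopeless = FlakyThenOk(failures=5)
try:
    with_retry(hopeless)
    gave_up = False
except ConnectionError:
    gave_up = True

print(result, flaky.calls, gave_up, hopeless.calls)
```

The same pattern applies to the other two examples: fill the queue on purpose, or run the month-end path against a copy of the data mid-month.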

It goes last on purpose. It has the smallest blast radius of the five and the largest surface, so it would eat all five days if I let it. After the first four steps I know which untested paths actually matter, and I read only those.

What five days actually produces

Not a list of bugs. A map.

The output is where the system fails first, ranked by what being wrong about it would cost, with the recovery paths that do not currently exist marked clearly. It is the document that lets someone who owns the system decide what to fix now, what to fund later, and what to simply watch. The findings are the easy part. The order is the work.

Most teams already know at least one of these is true. The audit is not the discovery. It is the moment it stops being optional to know.