Something will go wrong. A deploy will break production. A migration will corrupt data. A config change will take down a service. This is not a possibility. It is a certainty. The only variable is what happens next.

In some teams, what happens next is a search for who to blame. The incident becomes a trial. People learn to hide their involvement. Near-misses go unreported. The system gets more fragile because the people operating it are afraid to admit what they see.

In other teams, what happens next is a search for what to learn. The incident becomes a case study. People volunteer what they know, including what they got wrong. Near-misses are treated as free lessons. The system gets more resilient because the people operating it are honest about how it actually works.

The difference is not process. It is culture.

Incidents happen

Complex systems fail in complex ways. A single engineer pushing a bad config is never the root cause — it is the trigger. The root cause is the system that allowed a bad config to reach production without validation, without a canary, without a rollback mechanism.

Blaming the engineer who pushed the button accomplishes nothing. They already know they pushed the button. What you need to understand is why the button was so easy to push, why nothing caught the error, and why the blast radius was so large.
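To make "why was the button so easy to push" concrete, here is a minimal sketch of the kind of systemic guard the paragraph describes: a validation gate that rejects a bad config at deploy time instead of letting it reach production. The field names and limits are hypothetical, for illustration only.

```python
import json

def validate_config(raw: str) -> dict:
    """Reject a bad config before it can reach production.

    A hypothetical validation gate: the required fields and ranges
    here are illustrative, not from any real pipeline.
    """
    config = json.loads(raw)  # malformed JSON fails here, loudly
    errors = []
    if "service" not in config:
        errors.append("missing required field: service")
    timeout = config.get("timeout_ms", 0)
    if not (0 < timeout <= 60_000):
        errors.append(f"timeout_ms out of range: {timeout}")
    if errors:
        raise ValueError("; ".join(errors))
    return config

# The bad config is caught at deploy time, not discovered in production:
try:
    validate_config('{"timeout_ms": 0}')
except ValueError as e:
    print(e)  # missing required field: service; timeout_ms out of range: 0
```

The point is not this particular check. It is that any check at all moves the failure from "incident at 3 a.m." to "error message at deploy time", which is exactly the kind of systemic fix a blameless investigation surfaces.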

This is the core insight of blameless postmortems: human error is not the explanation. It is the starting point of the investigation. Why did the person make that choice? What information were they missing? What tools failed them? What pressure were they under?

The answers to those questions lead to systemic improvements. The blame leads nowhere.

Postmortems that change things

A postmortem that does not produce action items is a group therapy session. It might feel cathartic, but it does not make the system better.

Good postmortems have structure:

  • Timeline — what happened, when, in factual sequence. No opinions, no blame, just events.
  • Impact — who was affected, for how long, how severely. Quantify it.
  • Root cause analysis — not “Alice pushed a bad config” but “the config deployment pipeline lacks validation, and the monitoring alert had a 15-minute delay.” Keep asking why until you reach something systemic.
  • Action items — specific, assigned, and deadlined. “Improve monitoring” is not an action item. “Add a latency alert to the checkout service with a 5-second threshold, owned by the platform team, by March 15” is an action item.
  • Follow-through — track the action items. Review them in the next team meeting. Close them when done. If they are not done by the deadline, escalate or re-scope, but do not silently drop them.
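The follow-through step is the one teams most often skip, so it helps to make it mechanical. Here is a minimal sketch of an action-item record and an overdue check; the field names and the example items are hypothetical, not from any real tracker.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    """One postmortem action item: specific, assigned, and deadlined.

    Field names are illustrative, not from any real tooling.
    """
    description: str
    owner: str
    due: date
    done: bool = False

    def is_overdue(self, today: date) -> bool:
        return not self.done and today > self.due

# Review open items in the next team meeting; overdue ones get
# escalated or re-scoped, never silently dropped.
items = [
    ActionItem("Add 5s latency alert to checkout service", "platform", date(2025, 3, 15)),
    ActionItem("Add config validation to deploy pipeline", "infra", date(2025, 3, 1), done=True),
]
today = date(2025, 4, 1)
overdue = [i for i in items if i.is_overdue(today)]
for item in overdue:
    print(f"ESCALATE: {item.description} (owner: {item.owner}, due {item.due})")
```

Whether this lives in a script, a spreadsheet, or a ticket tracker matters less than that every item has an owner, a deadline, and a scheduled review where "overdue" is visible.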

The postmortem document is a contract with your future self. It says: we learned this, and here is what we will do about it. If you do not follow through, you will have the same incident again, and the next postmortem will read exactly like this one.

Blame vs. accountability

Blameless does not mean unaccountable. This is the most common misunderstanding.

Blame looks backward. It assigns fault. It is about punishment. It makes people defensive. It optimizes for not getting caught.

Accountability looks forward. It assigns ownership. It is about improvement. It makes people proactive. It optimizes for not having the same failure twice.

You can hold someone accountable for the action items that come out of a postmortem without blaming them for the incident. You can say “you own the fix for this” without saying “this was your fault.” The first is productive. The second is corrosive.

The distinction matters because blame and accountability feel similar in the moment but produce opposite incentives over time. Blame cultures have fewer reported incidents — not because fewer incidents happen, but because people stop reporting them. Accountable cultures have more reported incidents and near-misses, which means more learning, which means fewer severe incidents over time.

Leaders go first

Culture is not what you say. It is what you do, especially when it is uncomfortable.

If you want engineers to admit their mistakes, start by admitting yours. Publicly. In the postmortem. “I approved that design without considering the failure mode” is a statement that gives everyone else permission to be equally honest.

If the most senior person in the room has never said “I got this wrong,” no one else will either. The psychological safety that enables blameless culture does not come from a policy document. It comes from watching leaders be vulnerable first.

This is hard. It requires genuine confidence — the kind that comes from knowing your value is not diminished by admitting a mistake. But it is the single most powerful thing a leader can do to build a team that learns from failure instead of hiding it.

Own the mistake. Fix the system. Move forward.