Distributed Knowledge

A team where one person holds the only copy of how something works has a bug. The team has not noticed yet because nothing has broken. But the bug is there: it surfaces the next time someone asks, the day the expert is out sick, the week they leave for another team, the month the system breaks while the only one who knows why is on vacation.

Distributed systems engineering taught us that a single point of failure is a problem to be designed away, not lived with. The same principle applies to the team itself. Knowledge that lives in exactly one place is a single point of failure. The fix is the same in both cases: distribute it.

The cost of silos

Knowledge silos look efficient up close. The expert handles the questions. The originator owns the system. The senior engineer is the one to ask. Each interaction is fast because the right person already knows.

The cost is invisible until it isn’t. The expert becomes a bottleneck. The originator gets pulled into every decision about their system, forever. The senior engineer cannot focus because they spend half their time answering questions that should have been written down years ago. Then one of them leaves, and the team discovers how much of its operation depended on a single person nobody had pressed to externalize what they knew.

Silos also produce fragility in decision-making. When only one person understands the system, the team defers to them. When that person is wrong, no one notices, because there is no one else qualified to push back. Silos do not just trap knowledge — they trap the ability to challenge it.

Bus factor is a real number

The bus factor of a system is the number of people who would have to leave before nobody on the team understands it. A bus factor of one is not a virtue. It is a debt accruing interest every day until you pay it down.

Look at the critical systems in your team. For each one, can you name two people who could debug a production incident in it at 3 a.m.? If not, you have found something to fix.
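The check above can be sketched in a few lines of Python. This is a minimal illustration rather than a tool: the `coverage` mapping, the system names, and the people in it are all hypothetical, and the data is assumed to come from honest answers to the 3 a.m. question, not from commit history.

```python
from typing import Dict, List, Set


def at_risk(coverage: Dict[str, Set[str]], minimum: int = 2) -> List[str]:
    """Return the systems that fewer than `minimum` people could debug, sorted by name."""
    return sorted(
        system for system, people in coverage.items() if len(people) < minimum
    )


# Hypothetical data: system -> people who could debug a production incident in it.
coverage = {
    "billing": {"ana"},
    "auth": {"ana", "raj"},
    "search": set(),
}

for system in at_risk(coverage):
    print(f"{system}: bus factor {len(coverage[system])}")
```

Anything the function returns is a candidate for pairing, a doc, or a deliberate handoff.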

The expert as bottleneck

Being the only person who knows something feels good for about a year. After that, every meeting routes through you, every question pings you, every incident requires you. You stop being able to do new work because you are too busy being the person who knows.

If you are the expert, your job is to stop being the only expert. Pair on the work. Write the doc. Train the next person. Free yourself.

The originator as veto

When one person originated a system, every change to it tends to require their approval. Sometimes that is correct — they have context nobody else does. But often it is just a habit, sustained because nobody else has been brought close enough to the code to feel qualified.

Bring others close. Walk them through the design. Hand off ownership deliberately. A system whose author has left and that nobody else can change is a system on a timer.

Writing distributes

The single most leveraged thing an engineer can do is write down what they learn. Not a polished essay. A short doc, an ADR, a comment in a runbook, a Slack message in a shared channel that gets pinned. Writing externalizes the knowledge from one head into a place where anyone can find it.

The next time someone asks the same question, you point them at the doc. The next time you ask yourself the same question — six months later, having forgotten — you find the doc. The next person to onboard reads the doc before they ask. The team that writes things down compounds its understanding. The team that explains everything verbally repeats itself forever.

Write the short doc, not the long one

Perfect is the enemy of present. A two-paragraph doc that exists beats a comprehensive doc that does not. Write the short version now. Expand it later if it gets used.

The friction to writing a doc is what kills most documentation. Lower the friction. Markdown in the repo, a wiki page, a long Slack message in a pinned thread — any of these is better than nothing.

Write ADRs for important decisions

When you make a decision that future people will want to understand, write an Architecture Decision Record. It does not have to be long. Three sections are enough:

Context: what situation made this decision necessary
Decision: what was chosen
Consequences: what trade-offs this implies

Future you will thank present you. Future teammates will be able to challenge the decision intelligently instead of relitigating it from scratch.
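To show how little an ADR needs, here is a rough sketch that renders the three sections as plain text. The function name `render_adr`, the numbering scheme, the date line, and the example content are all assumptions made for illustration. Any layout that keeps the three sections findable would serve equally well.

```python
from datetime import date


def render_adr(number: int, title: str, context: str,
               decision: str, consequences: str, decided_on: date) -> str:
    """Render a short ADR as plain text with Context, Decision, and Consequences."""
    return "\n".join([
        f"ADR {number:04d}: {title}",
        f"Date: {decided_on.isoformat()}",
        "",
        "Context",
        context,
        "",
        "Decision",
        decision,
        "",
        "Consequences",
        consequences,
        "",
    ])


# Hypothetical example content, including the number and the date.
adr = render_adr(
    number=7,
    title="Move auth pods to their own namespace",
    context="Compliance review requires isolating auth workloads.",
    decision="Run auth services in a dedicated auth namespace.",
    consequences="Tooling and dashboards must pass the namespace explicitly.",
    decided_on=date(2024, 1, 15),
)
print(adr)
```

Writing the result to a file in the repo, next to the code it describes, keeps the decision close to where people will look for it.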

Runbooks for repeated operations

Anything you do more than twice should have a runbook. Not because it is hard. Because the next person will not remember how, and you will not remember the next time either.

A runbook is a list of commands and decisions. It is not a tutorial. Keep it short, accurate, and current. Update it the next time you run it.

Searchable beats memorable

Documentation that is not findable does not exist. If the doc lives in a personal Notion page, a Google Drive folder nobody can find, or an email thread from 2023, it does not count.

Put docs where people search. Use words people actually search for, not internal jargon. Title the doc with the question it answers, not the system it describes.

Update or delete

A wrong doc is worse than no doc. People trust it, follow it, and waste hours debugging the consequences. If you find an out-of-date doc, fix it or delete it. Do not leave it as a trap.
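One hedged way to find candidates for review is to list docs that have not been touched in a long time. The sketch below assumes docs are Markdown files under a directory, and it uses file modification time as a crude proxy for staleness. The `docs` path and the 180-day threshold are hypothetical. An old doc is not necessarily wrong, and a recently edited one is not necessarily right, so the output is a reading list, not a verdict.

```python
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86_400


def stale_docs(root: Path, max_age_days: int = 180,
               now: Optional[float] = None) -> List[Path]:
    """Return Markdown files under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        path for path in root.rglob("*.md")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


docs_root = Path("docs")  # hypothetical location of the team's docs
if docs_root.is_dir():
    for path in stale_docs(docs_root):
        print(path)
```

Note that a fresh Git checkout sets modification times to the checkout time, so in a repo the last commit date of each file would be a better signal than the filesystem timestamp.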

Talk in public

Writing distributes the artifact; talking in public distributes the moment. Most knowledge transfer does not happen through documents. It happens through conversations that anyone could have been part of — if they had been in the room.

Use shared channels by default

When someone asks you a question in a DM that is not personal, suggest moving it to the channel. The answer benefits everyone watching, including future people searching the archive.

The cost is small. The benefit compounds: the next person with the same question finds the answer instead of asking again.

Answer with context, not just the answer

“Run kubectl get pods -n auth” is an answer. “The pods are in the auth namespace because we split that out last year for compliance reasons; you can find them with kubectl get pods -n auth” is a teaching moment.

The first solves today’s problem. The second propagates understanding.

Pin and cross-link

When a discussion in chat produces useful knowledge, pin it or cross-link it from a more permanent place. Chat history degrades fast — the message you can find easily this week is lost in noise next month.

Decisions in public

If a decision is being made about something that affects other people, do not make it in a DM. Make it in a place those people can see, even if they are not actively participating. Letting people watch is the cheapest way to keep them informed.

Teach as you work

Writing distributes the artifact. Teaching distributes the capability. The difference matters: a doc tells you what to do; teaching builds the judgment to figure it out yourself when the doc does not cover your case.

Pair on the work

Pair programming is one of the highest-bandwidth forms of knowledge transfer. An hour of pairing can transfer more skill than a day of reading docs. Use it deliberately: on the systems where bus factor is low, and on the parts of the codebase your team needs more people in.

You do not have to pair on everything. Pair on the things you would otherwise be the only person doing.

Code review as teaching

A code review can be a gate or a conversation. Make it a conversation. Explain the why behind your comments — not just “use this pattern”, but “use this pattern because we got bitten by X last quarter when we did it the other way”.

Reviews that teach create reviewers. Reviews that only gate create dependence.

Pull in the junior engineer

When a tricky problem comes up, do not silently solve it alone. Pull in someone who has not seen this kind of problem before. Let them watch. Let them try. The cost to you is small. The gain to them — and to the team’s overall capability — is large.

Demo and walk-through

When you build something new, demo it. Not a polished presentation. A ten-minute walk-through for the rest of the team, showing what it does and how it works. Record it, and the recording becomes a doc. The session creates two or three more people who could touch the system if needed.

Operational knowledge

Some of the most fragile knowledge in any team is operational — the tribal stuff that lives in oncall rotations, incident scars, and “oh, you have to also do X or it doesn’t work”.

Document tribal knowledge

If the answer to a question is “you just have to know”, that answer is wrong. Write it down. The next person should be able to learn it from a doc, not from a war story.

Record incident learnings

After every meaningful incident, write the postmortem. Not a blame document. A factual record of what happened, what we learned, and what we changed. This is how organizational memory accumulates — and how the same incident stops happening every six months.

Don’t store decisions in chat

Chat is a place to have conversations, not to store outcomes. Important decisions, agreements, and learnings should land somewhere permanent — a doc, a ticket, a runbook. Otherwise they will be unrecoverable when you need them.

The discipline

The work of distributing knowledge is rarely urgent. There is always a feature to ship, a bug to fix, a meeting to attend. Writing the doc, recording the architecture decision, walking a teammate through the system — these things slip when the calendar fills.

Resist. The cost of not distributing is paid by the team, not by you. The team that loses its expert and has nothing written down pays for the silos it tolerated. The team that loses its expert but has the docs, the recordings, and the trail of decisions keeps moving.

A bus factor of one is not a virtue. It is a debt you are accruing. Pay it down deliberately.