Evidence-Based Decisions

There is a particular kind of confidence that comes from looking at a dashboard and knowing — not guessing, not hoping — that the system is healthy. That deployments are faster this quarter. That the error rate dropped after last week’s fix. That the change you shipped actually moved the number you cared about.

This is what evidence gives you. Not certainty, but ground to stand on. And the only way to keep decisions grounded is to measure the things they depend on.

Folklore costs you slowly

Without evidence, engineering decisions become folklore. “We think the API is slow” is not the same as “p99 latency crossed 800ms after the last deploy.” The first is a feeling. The second is something you can act on.

Teams that decide without evidence tend to optimize for the wrong things. They chase architectural purity while users churn. They rewrite systems that were fast enough and ignore systems that are actually on fire. Evidence gives you triage: it tells you where to look and when to stop.

Feelings are not data

“It feels slow” is a useful signal that something might be worth measuring. It is not a conclusion. When you find yourself making a decision based on a feeling, stop and ask: what would the measurement look like? Then go take it.

If the measurement contradicts the feeling, trust the measurement. The feeling is a hypothesis. The measurement is the test.

Anecdotes scale poorly

“A customer complained about this last week” is the start of an investigation, not the answer. One complaint might mean one annoyed user or it might mean a systemic problem. Without data, you cannot tell — and you will either over-react (rewriting a feature for one person) or under-react (dismissing a real issue as noise).

Treat anecdotes as prompts to look at the data, not as substitutes for it.

“We’ve always done it this way” is not evidence

The most expensive folklore is the unexamined convention. The database is partitioned this way because someone decided it was a good idea five years ago. The deploy process has this step because once, in 2019, someone got bitten by skipping it. The service uses this library because the original author liked it.

Each of these might still be correct. Or it might be cargo cult: a ritual kept long after its reason disappeared. The way to tell is to measure.

Measure what matters

Not everything. The trap is building dashboards for every metric your infrastructure can emit, then never looking at any of them. A wall of unread graphs is worse than three graphs you actually watch.

Pick a small set of signals that tell you whether you are delivering value.

Start with DORA

A good starting point is the four DORA metrics:

Deployment frequency — how often you ship to production
Lead time for changes — how long from commit to deploy
Change failure rate — what percentage of deploys cause incidents
Time to restore — how quickly you recover when things break

These four numbers, tracked honestly over time, tell you more about your engineering health than any architecture diagram. They measure outcomes, not activity.
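As a rough sketch of how the four numbers could be derived, assume each deploy is logged with its commit time, its deploy time, whether it caused an incident, and when service was restored. The `Deploy` record and `dora_summary` function are hypothetical names for illustration, not part of any DORA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    # Hypothetical record shape; adapt the fields to whatever your pipeline logs.
    committed_at: datetime
    deployed_at: datetime
    caused_incident: bool = False
    restored_at: Optional[datetime] = None  # set only when an incident occurred


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Summarize the four DORA metrics for the deploys in one time window."""
    if not deploys:
        raise ValueError("need at least one deploy")
    failures = [d for d in deploys if d.caused_incident]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median([d.deployed_at - d.committed_at for d in deploys]),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here rather than means so that one unusually slow deploy does not dominate the summary; that choice is an assumption of this sketch.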

User-facing metrics

Beyond DORA, measure what matters to your users:

Page load time at the 75th and 95th percentiles
Conversion rate through key flows
Time to first meaningful interaction
Error rate from the user’s perspective, not the server’s

The specific metrics depend on your product, but the principle is the same: measure the thing you actually care about, from the perspective of the person who experiences it.

System health metrics

For the systems you operate, measure the things that page you:

Latency at meaningful percentiles (p50, p95, p99)
Error rate by endpoint and by error class
Saturation of the constrained resource (CPU, memory, connections, queue depth)
Availability over a rolling window
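Percentiles matter because averages hide the slow tail. A minimal sketch using the nearest-rank method on raw samples; the latency values are made up, and a real metrics system would more likely compute percentiles from histograms.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, two very slow.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
mean_ms = sum(latencies_ms) / len(latencies_ms)
```

With these samples the median is 14 ms while the mean is 126.1 ms, which describes no request anyone actually experienced.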

Resist the urge to measure everything just because you can. Each metric you graph is a metric someone will eventually have to interpret.

A small dashboard wins

Three to five metrics on a dashboard your team actually looks at is worth more than fifty metrics on a dashboard nobody opens.

If a metric does not change anyone’s behavior, it does not need to exist.

Vanity metrics

Some metrics feel productive but measure nothing useful. Watch for these.

Lines of code

More code is not better code. Often it is the opposite: a feature that took 200 lines to add is usually harder to review and maintain than the same feature in 50.

Lines of code measures the size of the change, not its value. Stop reporting it.

Story points completed

Story points measure estimated effort, not delivered value. A team that completes 100 story points of low-value work has done less than a team that completes 30 story points of high-value work.

Use story points internally for planning if you must. Do not report them as a productivity metric.

Hours worked

Effort is not output. The person who works 60 hours and produces a polished but unneeded feature contributes less than the person who works 35 hours and ships the right thing.

Hours worked is a vanity metric that quietly rewards inefficiency.

PRs merged

Activity is not progress. A merged PR could be a typo fix, a critical feature, a refactor that broke nothing, or a refactor that broke everything. The count tells you none of that.

Measure what the PRs achieved, not how many there were.

The general test

A good metric is one where improving the number genuinely means improving the thing you care about. If you can game the metric without improving the outcome, it is the wrong metric.

When in doubt, ask: if this number doubled tomorrow, would the product actually be twice as good? If the answer is no, the metric is vanity.

Read trends, not snapshots

A single data point means little on its own. The latency was 200ms today; that tells you almost nothing. The latency has been climbing 5% week over week for two months; that tells you something is wrong.
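Small weekly changes compound. A minimal sketch of the arithmetic behind that example, assuming a hypothetical series of weekly p95 latencies; the function names and the 3% cutoff are illustrative choices.

```python
def week_over_week_changes(weekly_values: list[float]) -> list[float]:
    """Fractional change from each week to the next."""
    return [(b - a) / a for a, b in zip(weekly_values, weekly_values[1:])]


def is_steady_climb(weekly_values: list[float], min_growth: float = 0.03) -> bool:
    """True when every week grew by at least min_growth over the previous week."""
    changes = week_over_week_changes(weekly_values)
    return bool(changes) and all(c >= min_growth for c in changes)


# Hypothetical p95 latency: 200 ms growing 5% per week for eight weeks.
weekly_p95_ms = [200 * 1.05 ** week for week in range(9)]
total_growth = weekly_p95_ms[-1] / weekly_p95_ms[0] - 1  # about 0.48
```

Eight weeks of 5% growth leaves the metric roughly 48% higher than where it started, even though no single week looked alarming.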

A point is noise

Variation is normal. Any single measurement could be high, low, or perfectly typical. Do not act on a single reading unless it is alarmingly far from baseline.
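“Alarmingly far from baseline” can be made concrete by expressing a reading in standard deviations from recent history. A minimal sketch; the function name and sample values are assumptions, and any cutoff (three deviations is a common rule of thumb) is yours to choose.

```python
from statistics import mean, stdev


def deviations_from_baseline(reading: float, baseline: list[float]) -> float:
    """How many sample standard deviations a reading sits from the baseline mean."""
    spread = stdev(baseline)  # needs at least two baseline values
    if spread == 0:
        raise ValueError("baseline has no variation")
    return (reading - mean(baseline)) / spread


# Hypothetical daily p95 latencies (ms) from a quiet week.
baseline_ms = [198, 202, 200, 199, 201]
typical = deviations_from_baseline(203, baseline_ms)   # within normal variation
alarming = deviations_from_baseline(260, baseline_ms)  # far outside it
```

This assumes the baseline is roughly stable; if the metric is already trending, compare against the trend rather than a flat mean.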

Trends tell stories

Look at the same metric over time. Is it stable? Drifting? Spiking periodically? Drifting tells you something is changing slowly — usually a leak or a growing data set. Spikes tell you something is triggered by an event — load, a job, a deploy.

The graph over time is the story. The single number is a frame from the story.

Correlate with deploys

Annotate your dashboards with deploys, config changes, and feature flag rollouts. When a metric moves, you want to know what changed in the system at that moment.

Half of operational debugging is finding the deploy that caused the issue. The other half is fixing it. Make the first half easy.
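If change events are recorded with timestamps, finding the suspects for a metric shift becomes a lookup. A minimal sketch, assuming a hypothetical, time-sorted list of `(timestamp, description)` events and a one-hour lookback window:

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

Event = tuple[datetime, str]  # (when it happened, what changed)


def changes_before(shift_at: datetime, events: list[Event],
                   lookback: timedelta = timedelta(hours=1)) -> list[Event]:
    """Return the change events in the lookback window before a metric shift, newest first.

    events must be sorted by timestamp.
    """
    timestamps = [ts for ts, _ in events]
    start = bisect_left(timestamps, shift_at - lookback)
    end = bisect_right(timestamps, shift_at)
    return events[start:end][::-1]
```

Newest-first ordering reflects the usual debugging heuristic that the most recent change is the first suspect; it is a guess to verify, not proof of cause.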

Alert on actionable things

An alert is a request for human attention. Every alert should be worth waking someone up for. If it isn’t, the alert is doing damage.

Alert fatigue is real

A pager that fires twenty times a day for non-issues will eventually fire on a real incident and be ignored. The cost of an alert is not the page itself — it is the erosion of trust in the alerting system.

Tune alerts ruthlessly. If an alert fires and the response is “oh, that’s normal, ignore it”, the alert should be deleted or fixed, not tolerated.
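One way to find the alerts to delete or fix is to record, for each page, whether the responder had to do anything. A minimal sketch, assuming a hypothetical log of `(alert_name, was_actionable)` pairs and an arbitrary 50% cutoff:

```python
from collections import Counter


def noisy_alerts(pages: list[tuple[str, bool]], max_noise_ratio: float = 0.5) -> list[str]:
    """Names of alerts whose share of non-actionable pages exceeds max_noise_ratio."""
    fired = Counter(name for name, _ in pages)
    ignored = Counter(name for name, actionable in pages if not actionable)
    return sorted(name for name in fired if ignored[name] / fired[name] > max_noise_ratio)


# Hypothetical page log for one week.
pages = [
    ("disk-80-percent", False), ("disk-80-percent", False),
    ("disk-80-percent", False), ("disk-80-percent", True),
    ("checkout-error-rate", True), ("checkout-error-rate", True),
    ("checkout-error-rate", False),
]
candidates = noisy_alerts(pages)  # alerts to delete or retune
```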

Distinguish broken from interesting

Some metrics are worth alerting on: error rates, saturation, request failures, downtime. Some metrics are worth watching but not alerting on: deploy frequency, queue depth at normal levels, traffic spikes during expected windows.

Move the watching-but-not-alerting metrics to a dashboard. Keep the alerts for what is genuinely broken.

Symptom over cause

Alert on symptoms — what the user experiences — not on causes. Causes change as the system evolves; symptoms remain stable.

“Error rate over 1%” is a good alert. “CPU over 80% on the auth service” is a worse one — high CPU might or might not affect users, and the threshold drifts as the service changes.
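A symptom alert of this kind reduces to a small check over a time window. A minimal sketch: the 1% threshold comes from the example above, while the function name and the minimum-traffic guard are assumptions added for illustration.

```python
def should_page(requests: int, errors: int,
                threshold: float = 0.01, min_requests: int = 100) -> bool:
    """Page when the user-visible error rate over the window exceeds the threshold."""
    if requests < min_requests:
        # Assumption of this sketch: with very little traffic the rate is too
        # noisy to judge, so stay quiet. A total outage needs a separate check.
        return False
    return errors / requests > threshold
```

For example, `should_page(10_000, 150)` is true (1.5% of requests failed), while `should_page(10_000, 50)` is false.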

Decide explicitly with evidence

Evidence does not make decisions for you. It informs them. The decision is still yours — but it should be a decision made with data, not despite it.

State the hypothesis upfront

When you ship a change intended to move a metric, write down what you expect to happen and by how much.

Bad:

“Refactor the login flow.”

Better:

“Refactor the login flow. We expect signup completion rate to increase by at least 5%. If it does not, we will revert or redesign.”

The hypothesis makes the experiment falsifiable. Without it, every outcome can be rationalized as a success.
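Writing the hypothesis as data makes it hard to rationalize afterwards. A minimal sketch, assuming a hypothetical `Hypothesis` record and reading “at least 5%” as a relative lift over the baseline (an interpretation, since the example does not say whether it means relative change or percentage points):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the bar cannot be moved after shipping
class Hypothesis:
    change: str
    metric: str
    baseline: float
    min_relative_lift: float  # 0.05 means "at least 5% better than baseline"

    def verdict(self, observed: float) -> str:
        """Compare the observed value against the bar set before shipping."""
        lift = (observed - self.baseline) / self.baseline
        return "keep" if lift >= self.min_relative_lift else "revert or redesign"


# Hypothetical numbers for the login-flow example.
h = Hypothesis(
    change="Refactor the login flow",
    metric="signup completion rate",
    baseline=0.80,
    min_relative_lift=0.05,
)
```

Here `h.verdict(0.85)` returns "keep" (a 6.25% relative lift) and `h.verdict(0.82)` returns "revert or redesign" (2.5%).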

Define success before you start

What does “the feature worked” mean? What number has to move, by how much, for how long, before you call it a win?

Decide this before you ship. Otherwise, after shipping, you will be tempted to redefine success around whatever the metric happened to do.

Let data change your mind

The hardest part of evidence-based decisions is acting on evidence that contradicts what you believed. The feature you thought would work didn’t. The optimization that should have helped didn’t. The user behavior you predicted didn’t materialize.

When the data says you were wrong, you were wrong. The point of measuring was to find out — not to find confirmation. Update.

Beware confirmation bias

When you already believe a system is slow, you will find evidence it is slow. When you already believe a feature is succeeding, you will find evidence it is succeeding. The trap is well-documented and affects everyone.

Pre-register your metric before you look at the data. Decide what would convince you the change failed. Then check honestly.

Revisit the metrics

A metric you set up six months ago might be irrelevant today. The product changed. The team changed. The bottleneck moved.

Quarterly review

Once a quarter, look at every metric you measure and ask: does this still matter? Is anyone acting on it? Would removing it lose us anything?

Kill the ones that have stopped earning their place.

New metrics for new problems

As your product evolves, the metrics that matter evolve too. The dashboard that served you at 1,000 users is not the dashboard you need at 100,000 users.

Add new metrics deliberately when you find a gap. Do not just accumulate them.

One number that matters

For most teams, at any given quarter, there is one number that matters more than the rest. Conversion rate. Activation rate. Reliability. Latency. Cost per request.

Identify it. Put it at the top of the dashboard. Make sure everyone knows what it is and which direction is “good”.

The goal

The goal is not a perfect measurement system. The goal is to make decisions with evidence instead of intuition.

Start small. Measure the few things that matter. Be honest about what the numbers say. Let the data change your mind — even when the change is uncomfortable.

Good engineering is full of decisions that turned out to be right because they were grounded in evidence. The wins compound. The wrong calls get caught early. The team learns to trust the data, and the data starts to deserve the trust.