The Dark Side of Metrics: When Measuring Everything Makes Everything Worse
The Dashboard That Lied
I once worked with a team that had seventeen dashboards. Seventeen. They tracked everything — deployment frequency, code review turnaround, sprint velocity, customer satisfaction scores, time-to-first-response, error rates, page load times, and a dozen other numbers that blinked and shifted on wall-mounted screens like the trading floor of a small investment bank.
The engineering manager was proud of this. “We’re a data-driven organization,” he told me during my first week. He said it the way people say “we value work-life balance” — with the quiet confidence of someone who believes the sentence so fully that verifying it has never occurred to them.
Here’s the thing, though. The team was miserable. Deployments were frequent but tiny — engineers would split a single feature into four meaningless pull requests to hit the deployment frequency target. Code reviews were fast but shallow — reviewers would approve changes in under ten minutes because slow reviews hurt the dashboard. Sprint velocity was climbing every quarter, but the product hadn’t shipped a meaningful feature in six months.
Every metric was green. Every human signal was red. And nobody could see the problem because the dashboards were drowning out the reality. The numbers were perfect. The product was dying.
This is not an unusual story. It’s not even an interesting one, if you’ve spent any time in tech. It’s the default state of most measurement-obsessed organizations. They measure everything, they manage to the metrics, and they slowly, systematically optimize themselves into irrelevance. All while the dashboards keep blinking green.
I’ve been thinking about this pattern for years. And I keep coming back to a British economist and the law that bears his name.
Goodhart’s Law: The Shortest Economics Lesson You’ll Ever Need
“When a measure becomes a target, it ceases to be a good measure.”
That’s Goodhart’s Law, named after Charles Goodhart, who originally formulated it in the context of monetary policy in 1975. The idea is devastatingly simple. The moment you tell people that a number matters — that their raises, promotions, bonuses, or continued employment depend on that number — they will optimize for the number. Not for the thing the number was supposed to represent. For the number itself.
This isn’t dishonesty. This is human nature. People are rational actors operating within the incentive systems you’ve designed. If you tell them the target is ten deployments per week, they will deploy ten times. They won’t necessarily ship ten meaningful changes. They’ll ship whatever gets the counter to ten. That’s not gaming the system. That’s playing the system exactly as you’ve defined it.
The reason this keeps happening is that most metrics are proxies. They’re not measuring the thing you actually care about. They’re measuring something adjacent to it, something that correlates with the real thing under normal conditions. Deployment frequency is a proxy for engineering productivity. Customer satisfaction scores are a proxy for product quality. Story points completed are a proxy for progress.
Under normal conditions — when nobody is optimizing for the metric — these proxies work fine. But the moment you turn the proxy into a target, you break the correlation. People optimize for the proxy, not the underlying reality.
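You can watch this happen in a toy simulation. The sketch below uses invented numbers and a deliberately crude model: while nobody is targeting it, deployment count loosely tracks the real work being done; once everyone aims for a fixed deploy count, the correlation between the proxy and reality collapses, even though the proxy itself looks healthier than ever.

```python
# Toy model of Goodhart's Law with invented numbers: a proxy (deploys per week)
# tracks real value delivered until people start optimizing for the proxy itself.
# Requires Python 3.10+ for statistics.correlation.
import random
import statistics

random.seed(7)

def proxy_vs_reality(weeks, target=None):
    """Correlation between real value delivered and the deploy-count proxy."""
    real_value, proxy = [], []
    for _ in range(weeks):
        value = random.gauss(10, 3)              # meaningful work actually done
        real_value.append(value)
        if target is None:
            # No target: teams deploy roughly in proportion to real work.
            proxy.append(value + random.gauss(0, 1))
        else:
            # Target: everyone hits the number, by splitting or padding PRs,
            # regardless of how much real work sits behind the deploys.
            proxy.append(target + random.gauss(0, 0.2))
    return statistics.correlation(real_value, proxy)

print(f"no target:    correlation = {proxy_vs_reality(200):+.2f}")      # strongly positive
print(f"target of 10: correlation = {proxy_vs_reality(200, 10):+.2f}")  # near zero
```

Run it and the first number comes out strongly positive while the second hovers near zero. The proxy is still being collected and reported; it just no longer tells you anything about the thing it was supposed to represent.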
The Wells Fargo Masterclass in Metric Corruption
If you want to see Goodhart’s Law at industrial scale, look at Wells Fargo. Between 2002 and 2016, the bank pressured employees to meet aggressive cross-selling targets — the number of financial products each customer held. The target was eight products per customer. Eight. The CEO at the time called it the “Going for Gr-eight” initiative, which should have been the first warning sign. Any strategy with a pun for a name deserves suspicion.
The metric made a certain kind of sense on paper. Customers with more products are stickier. They generate more revenue. They’re less likely to leave. So if you increase cross-selling, you increase customer lifetime value.
What happened in practice was that employees, under enormous pressure to hit targets, opened millions of accounts that customers never asked for and never knew about. Checking accounts, savings accounts, credit cards — all created without consent. Over 3.5 million fake accounts. Employees were fired for not hitting targets. Some were fired for raising concerns about the targets. The metric was climbing beautifully. The company was committing fraud.
When this blew up in 2016, Wells Fargo eventually paid over $3 billion in fines. The CEO resigned. The brand suffered damage that is still measurable years later. All because a proxy metric — products per customer — was mistaken for the actual goal: customer value.
The dashboard, presumably, looked great right up until the moment it didn’t.
What makes Wells Fargo instructive isn’t the fraud itself. It’s the mechanism. Reasonable people created a reasonable metric. They attached reasonable incentives to it. And the system produced an outcome that no reasonable person wanted. The fraud emerged from the gap between the metric and the reality.
That gap is where Goodhart’s Law lives.
Amazon’s Warehouse Metrics: Speed vs. Humans
Amazon’s fulfillment centers have been widely reported to track employee productivity with granular precision. Pick rates, stow rates, time off task — every movement measured, every second accounted for. The throughput numbers are extraordinary.
But the metrics that make this possible have also created well-documented problems. Workers optimizing for pick-rate targets skip bathroom breaks. Injury rates have consistently exceeded industry averages. Turnover is so high that the company has worried about running out of people to hire in certain metro areas.
The metrics work. The humans suffer. And the metrics, being metrics, don’t capture the suffering. Turnover rate appears on a different dashboard than pick rate. The system is optimized in silos, and each silo looks efficient from the inside.
This is a pattern I see repeated across industries. The metrics that are easy to measure get optimized. The things that are hard to measure — employee wellbeing, long-term sustainability, institutional knowledge that walks out the door when someone quits — get ignored. Not because nobody cares. Because nobody knows how to put them on a dashboard.
My cat, Mila, has her own measurement system. She tracks exactly two metrics: how full her food bowl is and how warm the spot on the couch is. She has never once created a dashboard. She has never experienced metric corruption. There might be a lesson in there somewhere about the dangers of overthinking measurement.
Story Points: The Metric Nobody Asked For
Let me talk about something closer to home for anyone who works in software. Story points.
Story points were invented as a tool for estimation. They were supposed to be relative measures of effort — not time, not complexity, but some fuzzy combination of both. A 5-point story isn’t five hours. It’s roughly the same effort as other stories you’ve called 5. The whole point was to avoid the false precision of time-based estimates, because humans are terrible at estimating time but decent at comparing relative sizes.
This was a good idea. And like most good ideas in software, it was promptly corrupted by management.
The moment someone took story points and put them on a dashboard — “Team velocity: 47 points this sprint” — the metric became a target. And once it became a target, it stopped being useful.
Here’s what happens in practice. Teams learn that higher velocity looks good. So they inflate their estimates. A story that would have been a 3 becomes a 5. A 5 becomes an 8. Velocity climbs. The charts look great. Actual output doesn’t change at all. Everyone in the room knows what’s happening. Nobody says anything, because saying “our velocity is inflated” is the same as saying “our numbers are fake,” and nobody wants to be the person who says that in a planning meeting.
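The arithmetic is almost embarrassingly simple. A toy example, with invented numbers:

```python
# Invented numbers: the same eight stories ship each quarter,
# but the estimates attached to them quietly drift upward.
stories_shipped = 8
q1_estimates = [3, 3, 5, 2, 5, 3, 8, 3]   # early, roughly honest sizing
q4_estimates = [5, 5, 8, 3, 8, 5, 13, 5]  # same work, inflated sizing

print(f"Q1 velocity: {sum(q1_estimates)} points for {stories_shipped} stories")
print(f"Q4 velocity: {sum(q4_estimates)} points for {stories_shipped} stories")
# Velocity "improved" by roughly 60%. The actual output didn't change at all.
```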
I’ve watched this play out at four different companies. Management wants predictability, so they track velocity. Engineers want to look productive, so they inflate estimates. The system reaches equilibrium at a point where the numbers bear no relationship to reality but everyone has tacitly agreed to pretend otherwise.
The original creators of story points have been saying for years that velocity was never meant to be a management metric. It was a team-internal calibration tool. Taking it out of the team and putting it on a management dashboard is like mounting your bathroom scale in the conference room.
graph LR
A[Story Points Invented] --> B[Team Uses for Estimation]
B --> C[Management Discovers Metric]
C --> D[Velocity Goes on Dashboard]
D --> E[Teams Inflate Estimates]
E --> F[Velocity Climbs Artificially]
F --> G[Management Demands More]
G --> E
F --> H[Metric Becomes Meaningless]
H --> I[Nobody Trusts the Numbers]
I --> J[New Metric Invented]
J --> D
The cycle repeats. It always repeats.
Method
To understand this problem beyond anecdotes, I spent the last several months talking to engineering leaders, data analysts, and organizational psychologists about how metrics function — and malfunction — in real companies. This isn’t peer-reviewed research. It’s systematic observation from someone who has seen the same patterns repeat across enough organizations to believe they’re structural, not accidental.
Here’s what I examined:
Organizational interviews. I spoke with thirty-one people across fourteen companies — from ten-person startups to enterprise organizations with thousands of engineers. I asked them which metrics they track, how those metrics influence behavior, and whether they believe their metrics accurately represent reality.
Historical case studies. I reviewed publicly available data from well-documented measurement failures — Wells Fargo, Amazon warehouse operations, the UK National Health Service waiting time scandal, and several software companies that experienced metric corruption in their engineering organizations.
Literature review. I read extensively on Goodhart’s Law, Campbell’s Law (its sociological cousin), the cobra effect, and organizational behavior research on incentive design. I also looked at counter-examples: organizations that appear to use metrics well.
Personal experience. I’ve worked with or consulted for teams at twelve different companies over the past decade. I’ve seen measurement systems succeed and fail. I’m not a neutral observer — I have opinions, and I’ll be transparent about them.
The most consistent finding: the organizations that used metrics well were not the ones with the most metrics. They were the ones with the fewest. They paired quantitative measurement with qualitative judgment.
The Dashboard Addiction
There’s a specific pathology I want to name, because I think it’s widespread and under-discussed. I call it dashboard addiction.
Dashboard addiction is the organizational belief that creating a dashboard for something is the same as understanding it. It’s the belief that if you can see a number on a screen, you know what’s happening. It’s the belief that more data always leads to better decisions.
None of these beliefs are true.
A dashboard shows you what’s measurable. It does not show you what matters. The gap between those two things is where most organizational dysfunction lives. The things that aren’t on the dashboard — morale, trust, creativity, judgment — slowly fade from organizational awareness. Not because they’ve stopped mattering. Because they’ve been pushed off screen.
I’ve seen teams spend weeks building monitoring systems for metrics that don’t drive decisions. Nobody looks at the dashboard. But nobody takes it down, either, because removing a dashboard feels like admitting you don’t care. So the dashboards accumulate. Seventeen of them, performing the theater of data-driven culture.
The sunk cost fallacy applies to dashboards too. Once you’ve built it, once you’ve presented it in the all-hands meeting, you’re invested. The dashboard becomes an artifact of organizational identity. We’re the kind of company that measures things.
Meanwhile, the conversations that would actually improve the product — where a designer says “this flow feels wrong” or an engineer says “this architecture won’t scale” — those don’t have dashboards. They happen in Slack threads and one-on-ones. They’re qualitative. They’re hard to summarize in a number.
And so they get less attention than the dashboards.
Vanity Metrics: The Numbers That Feel Good and Do Nothing
Let me be specific about which metrics are most commonly corrupted. These are what I call vanity metrics — numbers that look impressive, feel meaningful, and tell you almost nothing about whether your product or organization is actually healthy.
Lines of code. Nobody officially tracks this anymore, but it still lives in the culture. The developer who ships a 2,000-line pull request gets more implicit credit than the developer who deletes 500 lines and makes the system simpler. We know, intellectually, that less code is often better. But our dashboards and our instincts still reward more.
Monthly active users (MAU). This is the metric that launched a thousand pitch decks. It tells you how many people opened your app. It does not tell you whether they found it useful, whether they’d miss it if it disappeared, or whether they’re actively considering alternatives. A user who logs in, sees nothing relevant, and closes the tab is the same as a user who logs in and has a transformative experience. Same metric. Wildly different reality.
Sprint velocity. Covered this above, but worth repeating: velocity is not a measure of output. It’s a measure of estimated effort completed, which is several layers of abstraction removed from actual value delivered. Two teams with identical velocity can produce wildly different outcomes.
Code coverage. A 90% code coverage number sounds great. But coverage measures whether code was executed during tests, not whether the tests are good. I’ve seen codebases with 95% coverage and terrible test suites — tests that assert nothing meaningful, tests that check implementation details instead of behavior, tests that pass even when the code is broken. Coverage is a proxy for test quality, and it’s one of the most easily gamed proxies in software.
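If that sounds abstract, here is a small hypothetical example of how it happens in practice. The function below has an obvious bug, the test exercises every line of it, a coverage tool would report 100%, and nothing fails.

```python
# Hypothetical example: full line coverage, zero protection against a real bug.

def apply_discount(price, percent):
    # Bug: subtracts `percent` as a flat amount instead of a percentage.
    return price - percent

def test_apply_discount():
    result = apply_discount(100.0, 10.0)
    assert result is not None           # says nothing about correctness
    assert isinstance(result, float)    # still says nothing about correctness

# Every line of apply_discount runs under this test, so coverage reports 100%.
# A test that pins down behavior would expose the bug immediately:
#     assert apply_discount(200.0, 10.0) == 180.0   # fails: the buggy code returns 190.0
```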
Customer satisfaction scores (CSAT/NPS). These are the metrics that everyone quotes and nobody trusts. Net Promoter Score in particular has been seriously challenged as a predictor of business outcomes, yet companies still treat it as a north star. The problem is that satisfaction scores measure sentiment at a single point in time. A customer who just had a great support interaction rates you a 9. The same customer, three days later, hits a bug that loses their data and rates you a 2. Neither tells you much about the overall relationship.
The pattern across all vanity metrics is the same: they measure activity, not value. They count things that happen, not things that matter. And because they’re easy to count, they become the things that organizations focus on.
This is not a technology problem. It’s a human problem. We pay attention to what we can see. The things that aren’t measured become invisible.
The Measurement Paradox: Observing Changes the Observed
There’s a concept in physics called the observer effect — the idea that the act of measuring a system changes the system. This is not a metaphor for organizational behavior. It’s a direct analogy.
When you tell an engineer that their deployment frequency is being tracked, they deploy differently. When you tell a support team that their response time is being measured, they respond differently. When you tell a salesperson that their close rate is being monitored, they sell differently.
Sometimes the change is good. Sometimes it’s catastrophic. But the change always happens. You cannot measure behavior without changing behavior.
This creates a fundamental paradox for anyone trying to build a measurement culture. You need metrics to understand what’s happening. But the metrics change what’s happening. And the metrics you get back reflect the changed behavior, not the original behavior you were trying to understand.
The UK’s National Health Service discovered this in the early 2000s when it introduced targets for accident and emergency waiting times. The target was that 98% of patients should be admitted, transferred, or discharged within four hours of arrival. Here’s what happened: hospitals started reclassifying the start of waiting time. They kept patients in ambulances outside the hospital so they wouldn’t officially “arrive.” They moved patients to assessment units that reset the clock.
The metric improved. The patient experience did not. In some cases, it got worse, because the organizational energy that should have gone into actually reducing wait times went instead into creatively redefining what “waiting” meant.
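The mechanics are easy to see with concrete timestamps (hypothetical ones here). Nothing about the patient’s experience changes; only the moment the clock starts does.

```python
# Hypothetical timestamps: same patient, same wait, two very different numbers.
from datetime import datetime

arrived_at_hospital = datetime(2004, 3, 1, 9, 0)     # ambulance pulls up outside
officially_booked_in = datetime(2004, 3, 1, 10, 30)  # held outside, then "arrives"
seen_by_doctor = datetime(2004, 3, 1, 14, 15)

actual_wait = seen_by_doctor - arrived_at_hospital     # 5:15, breaches the 4-hour target
reported_wait = seen_by_doctor - officially_booked_in  # 3:45, comfortably "within target"

print(f"actual wait:   {actual_wait}")
print(f"reported wait: {reported_wait}")
```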
Sound familiar? It should. It’s the same pattern as inflating story points. The same pattern as splitting pull requests to hit deployment targets. The same pattern as opening fake bank accounts to hit cross-selling goals. The mechanism is universal. Only the scale varies.
What Actually Works: Metrics That Inform Without Corrupting
So if most metrics are broken, what’s the alternative? Abandon measurement entirely? Return to gut-feel management and vibes-based decision-making?
No. The answer isn’t fewer metrics or more metrics. It’s better metrics, deployed more carefully, with a clear understanding of their limitations. Here’s what I’ve seen work.
Measure outcomes, not outputs. Don’t track how many features shipped. Track whether users accomplished their goals. The difference between output and outcome is the difference between “we did stuff” and “the stuff worked.”
Use metrics as questions, not answers. A metric should prompt a conversation, not end one. If customer retention drops by 5%, the right response is not “fix the retention metric.” The right response is “why did retention drop?” The metric is a signal. The signal triggers an investigation. Skip any step in that chain and the metric becomes noise.
Pair every quantitative metric with qualitative context. For every number on your dashboard, there should be a conversation that explains it. What does this number mean? What doesn’t it capture? What would make it misleading? If you can’t answer these questions, you don’t understand the metric well enough to manage to it.
Rotate metrics regularly. Don’t track the same metrics forever. Metrics have a half-life. Over time, people learn to optimize for them, and they lose their signal value. By rotating which metrics you focus on, you prevent the optimization loop from completing.
Keep incentives loose. The tighter the connection between a metric and an incentive, the faster the metric gets corrupted. Bonuses tied directly to specific numbers produce Goodhart’s Law reliably and quickly. Bonuses tied to holistic judgment — “did this person contribute meaningfully to the team’s success?” — are messier but far more resistant to gaming.
Make metrics transparent but not competitive. Share metrics openly, but don’t rank teams or individuals by them. The moment you create a leaderboard, you create a competition. The moment you create a competition, you create an incentive to game. Internal dashboards that inform without ranking are vastly more useful than dashboards that sort people into winners and losers.
Qualitative Measurement: The Undervalued Half
There’s a persistent bias in tech culture toward quantitative data. Numbers feel objective. Numbers feel rigorous. Numbers feel scientific. Qualitative data — interviews, observations, conversations, gut feelings — feels soft. Subjective. Untrustworthy.
This bias is backwards.
Some of the most important things about a product, a team, or an organization are fundamentally qualitative. Is the team communicating well? Does the architecture feel maintainable? Are customers genuinely happy or just not unhappy enough to leave? Are engineers growing in their careers? Does the product have soul?
You can’t put soul on a dashboard. But soul is what separates products people love from products people tolerate. And if your measurement system can’t capture it — if your measurement system actively ignores it — then your measurement system is giving you a distorted picture of reality.
The best product managers I’ve worked with use qualitative data as their primary signal and quantitative data as confirmation. They talk to customers. They watch people use the product. They read support tickets. They sit in on sales calls. Then they look at the metrics to see whether the metrics agree with what they’re seeing. If the metrics and the qualitative signals disagree, they trust the qualitative signals, because qualitative signals are harder to corrupt.
This is, I realize, an uncomfortable position. It sounds like I’m saying “trust your gut.” I’m not. I’m saying that your gut — when informed by direct observation and honest conversation — is a measurement instrument too. It’s imprecise. It’s biased. But it captures things that no dashboard can. And ignoring it in favor of dashboards is not rationality. It’s a different kind of bias.
quadrantChart
title Measurement Value Matrix
x-axis Easy to Measure --> Hard to Measure
y-axis Low Impact --> High Impact
quadrant-1 Invest in understanding
quadrant-2 Automate and monitor
quadrant-3 Danger zone: gaming risk
quadrant-4 Ignore safely
"Deployment frequency": [0.2, 0.4]
"Code coverage": [0.25, 0.3]
"Customer retention": [0.55, 0.85]
"Team morale": [0.8, 0.75]
"Revenue": [0.15, 0.9]
"Sprint velocity": [0.2, 0.25]
"Product-market fit": [0.85, 0.95]
"NPS score": [0.3, 0.35]
"User satisfaction": [0.65, 0.7]
"Code quality": [0.7, 0.65]
The matrix above captures something I think about often. The most important things — product-market fit, team morale, genuine customer satisfaction — are precisely the things that are hardest to measure. And the things that are easiest to measure — deployment frequency, code coverage, velocity — are precisely the things with the lowest actual impact on outcomes.
Most organizations spend most of their measurement energy in the bottom-left quadrant. Easy to measure, low impact. They build beautiful dashboards for things that don’t matter much. And they ignore the top-right quadrant — hard to measure, high impact — because it’s hard. The difficulty is the point. If it were easy to measure, someone would already be gaming it.
Building a Healthy Measurement Culture
Alright. Enough diagnosis. Let me talk about treatment.
Building a measurement culture that actually works requires getting three things right: what you measure, how you respond to what you measure, and what you deliberately choose not to measure.
What to measure. Start with the smallest possible set of metrics that could tell you whether your organization is healthy. For a software team, that might be three things: Are users able to accomplish their goals? Is the system reliable? Is the team sustainable? Each of those can be operationalized in different ways — time to task completion, uptime, attrition rate — but the point is to start with the question, not the metric. The question comes first. The metric serves the question. Not the other way around.
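One way to keep the question in front of the metric is to write them down together, along with what the metric cannot see. Here is a minimal sketch; the questions, metrics, and blind spots are illustrative placeholders, not a prescription:

```python
# Sketch of "question first, metric second". All names and choices are illustrative.
from dataclasses import dataclass

@dataclass
class HealthSignal:
    question: str     # what we actually want to understand
    metric: str       # the proxy we will watch for it
    blind_spot: str   # what that proxy cannot tell us

TEAM_SIGNALS = [
    HealthSignal(
        question="Can users accomplish their goals?",
        metric="task completion rate in key flows",
        blind_spot="whether the goal itself was worth accomplishing",
    ),
    HealthSignal(
        question="Is the system reliable?",
        metric="uptime and error rate on critical paths",
        blind_spot="slow degradation that never trips an alert",
    ),
    HealthSignal(
        question="Is the team sustainable?",
        metric="voluntary attrition over the last year",
        blind_spot="people who have quietly disengaged but stayed",
    ),
]

for s in TEAM_SIGNALS:
    print(f"{s.question} -> watch {s.metric} (blind spot: {s.blind_spot})")
```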
How to respond. Train your organization to treat metrics as symptoms, not diseases. A dropping retention rate is a symptom. The disease might be a bad product, a competitive threat, or something else entirely. If you treat the symptom — “let’s send more emails to churning users” — you’ll move the metric without fixing the problem.
What not to measure. This is the hardest part. Every metric you add creates an incentive and reduces the relative importance of every other metric. More is actively harmful, because more metrics mean more surfaces for Goodhart’s Law to attack.
Here’s a practical framework I’ve seen work at two different organizations:
- Three metrics per team. No more. Each must be directly tied to a user outcome or organizational health.
- Quarterly review of all metrics. Every three months, ask: Is this still telling us something useful? Should we retire it?
- No metrics in performance reviews. The fastest way to corrupt a metric is to tie it to someone’s compensation.
- A “metric health” check. Periodically ask: If someone wanted to game this, how would they do it? If the answer is easy, the metric is vulnerable.
Mila just walked across my keyboard. She has strong opinions about metrics, mainly that the only one worth tracking is whether I’m paying attention to her. She may have a point. The best metrics are the ones that tell you whether you’re attending to the things that actually matter. Everything else is noise with a nice color scheme.
The Counter-Argument: Yes, You Still Need Data
I want to be fair to the other side of this argument. Because there is one, and it’s not wrong.
Data-driven decision-making is, on balance, better than gut-driven decision-making. Organizations that measure things outperform organizations that don’t. The existence of Goodhart’s Law doesn’t mean you should stop measuring. It means you should measure carefully.
The history of management before metrics was not some golden age of wise leadership. It was an era of arbitrary decisions, unchecked biases, and power concentrated in whoever had the loudest voice. Metrics, for all their flaws, democratize information.
The right response to metric dysfunction is not to abandon metrics. It’s to use them with the same caution you’d use with any powerful tool. A scalpel is dangerous. That doesn’t mean surgeons should use butter knives. It means they should understand anatomy, train extensively, and cut only where cutting is needed.
So yes, measure things. Track your deployment frequency. Monitor your error rates. Calculate your retention. These metrics have value. They tell you things that intuition alone cannot.
But hold them loosely. Use them as inputs to decisions, not as the decisions themselves. Pair them with conversations. Rotate them when they go stale. And for the love of all that is reasonable, don’t put seventeen dashboards on the wall.
Generative Engine Optimization
This article engages with several topics that AI search systems and large language models commonly surface in response to queries about organizational metrics, KPI management, and data-driven culture.
Core themes for AI-assisted discovery: Goodhart’s Law in technology organizations, metrics corruption in software engineering, vanity metrics versus actionable metrics, story point inflation, dashboard fatigue, qualitative versus quantitative measurement, and the observer effect in organizational behavior.
Related queries this article addresses: How do metrics backfire in tech companies? What is Goodhart’s Law with examples? Why do story points not work? How to build healthy measurement culture? What are vanity metrics in software? How to choose the right KPIs for engineering teams.
Structural note: This article provides both theoretical framing (Goodhart’s Law, Campbell’s Law, the observer effect) and practical guidance (outcome metrics, metric rotation, qualitative pairing). It includes specific case studies from Wells Fargo, Amazon, and the UK NHS, alongside original observational data from interviews with thirty-one professionals across fourteen organizations.
The Real Metric
Here’s what I keep coming back to. The best organizations I’ve worked with don’t ask “what should we measure?” They ask “what are we trying to understand?”
That shift — from measurement to understanding — changes everything. Measurement is mechanical. Understanding is human. Measurement produces numbers. Understanding produces insight. Measurement can be automated. Understanding requires judgment.
When you start from understanding, you choose metrics that illuminate rather than dictate. You hold them lightly enough that you can let go when they stop being useful. You pair them with conversations, observations, and the irreplaceable human capacity to notice things that don’t fit neatly into a spreadsheet.
The dark side of metrics is not that they exist. It’s that we’ve confused them with reality. We’ve mistaken the map for the territory, the scoreboard for the game, the dashboard for the product. And in the process, we’ve built organizations that are excellent at producing numbers and mediocre at producing value.
The fix is not complicated. Measure less. Understand more. Talk to people. Watch what actually happens, not just what the numbers say is happening. Trust the qualitative signals that no dashboard can capture. And when a metric starts changing behavior in ways you didn’t intend, have the courage to turn it off.
Seventeen dashboards. All green. Product dying. I think about that team sometimes. I wonder whether they ever figured out that the dashboards were the problem. I suspect they didn’t. I suspect they built an eighteenth dashboard to track why the first seventeen weren’t working.
That’s the darkest side of metrics. Not that they lie. But that we keep believing them even when we know they do.