The Big Lie of Benchmarks: Real-World Performance Is About Latency, Not Scores

Why the numbers on spec sheets have almost nothing to do with how fast things actually feel

The Number That Means Nothing

Your phone scored 9,847 on Geekbench. Congratulations. What does that mean?

Nothing. It means absolutely nothing about your daily experience.

The benchmark industry has convinced us that bigger numbers equal better devices. We compare scores like sports statistics. We argue about single-digit percentage differences. We make purchasing decisions based on synthetic tests designed to measure things we never actually do.

Meanwhile, latency, the thing that actually determines how fast a device feels, gets ignored. Because latency is hard to measure. Because latency doesn’t make exciting marketing materials. Because latency requires understanding what users actually do.

This article isn’t about benchmarks being useless. They serve some purposes. It’s about how our obsession with benchmark scores has distorted how we evaluate performance. And how that distortion reflects a broader pattern: measuring what’s easy instead of what matters.

My cat Arthur doesn’t care about benchmark scores. He evaluates my laptop based on how warm it gets and how quickly it responds when he walks across the keyboard. Pure latency assessment.

What Benchmarks Actually Measure

Let’s start with what synthetic benchmarks do measure.

Peak throughput. How much work can the processor do per second under ideal conditions? This matters for specific workloads. Video encoding. 3D rendering. Scientific computation. Tasks where you start a job and wait for it to finish.

Theoretical maximums. What’s the fastest the hardware can possibly operate? Not typical operation. Maximum operation. Like measuring a car’s top speed when you’ll spend most of your time in traffic.

Controlled conditions. Same test, same environment, same parameters. Repeatability is the goal. Real-world variability is eliminated by design.

Single metrics. A complex system reduced to one number. Easier to compare. Easier to market. Easier to misunderstand.

These measurements have legitimate uses. They help engineers validate hardware designs. They enable comparisons under controlled conditions. They provide baselines for regression testing.

But they don’t tell you how fast something feels. And feeling is what determines user experience.

What Benchmarks Don’t Measure

Here’s what synthetic tests miss:

Input latency. The time between pressing a key and seeing the character appear. This happens thousands of times daily. A few extra milliseconds accumulate into perceived sluggishness.

App launch time. How long until an application is actually usable? Not loaded. Usable. Ready to accept input and produce output.

Context switching. Moving between tasks. Switching apps. Alt-tabbing. The friction in the transitions.

Interaction responsiveness. Scroll lag. Touch response. Mouse movement. Animation smoothness. The micro-moments that determine whether something feels fast or slow.

Consistency. Does performance stay stable? Or are there periodic hiccups, stutters, and delays? Average performance doesn’t capture variance.

Thermal behavior. What happens after sustained use? When the device gets warm? When power management kicks in?

None of these appear in benchmark scores. All of them affect daily experience more than peak throughput ever will.

The Latency Problem

Let me explain why latency matters so much.

Human perception of responsiveness is non-linear. The difference between 10ms and 20ms response time is imperceptible. The difference between 50ms and 100ms is obvious. The difference between 100ms and 200ms feels like the device is broken.

For interactive tasks, latency determines experience quality. Throughput barely registers.

Consider typing. You don’t need your computer to process millions of characters per second. You need each keystroke to appear without noticeable delay. A 5% improvement in processing throughput changes nothing about typing experience. A 50% improvement in input latency transforms it.

The same applies to scrolling, clicking, dragging, and every other interaction. Speed is about how quickly the system responds, not how much work it can do in parallel.

Yet benchmarks optimize for throughput. Because throughput is easy to measure. Run a computation. Time it. Report results.

Latency measurement requires understanding the entire system. Input devices, drivers, operating system, application code, rendering pipeline, display technology. Multiple subsystems interacting. Much harder to isolate. Much harder to report as a single number.
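
To make that concrete, here is what honest latency reporting looks like once you do have samples: a distribution with percentiles and a worst case, not a single score. A minimal sketch in Python, with invented sample values standing in for real measurements.

import random
import statistics

def summarize_latency(samples_ms):
    # Report latency as a distribution, not a single score.
    ordered = sorted(samples_ms)
    def pct(p):
        return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]
    return {
        "mean_ms": round(statistics.mean(ordered), 1),
        "p50_ms": round(pct(50), 1),
        "p95_ms": round(pct(95), 1),
        "p99_ms": round(pct(99), 1),
        "worst_ms": round(ordered[-1], 1),
    }

# Invented samples: mostly quick responses, with occasional stutters mixed in.
random.seed(0)
samples = [random.gauss(18, 4) for _ in range(980)] + [random.uniform(120, 250) for _ in range(20)]
print(summarize_latency(samples))
# The mean looks healthy; the p99 and worst case are what users actually notice.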

Method: How We Evaluated Real Performance

For this article, I conducted systematic comparisons between benchmark scores and user-perceived performance:

Step 1: Device selection. I gathered eight devices with varying benchmark scores but similar price points. Phones, laptops, and tablets from different manufacturers and generations.

Step 2: Benchmark testing. I ran standard synthetic benchmarks on all devices. Geekbench. 3DMark. PassMark. Cinebench. The usual suspects. Recorded scores.

Step 3: Latency measurement. Using specialized equipment, I measured input latency, app launch times, and interaction responsiveness across common tasks. Typing, scrolling, app switching, and file operations.

Step 4: User testing. I had fifteen people use each device for standardized tasks. They rated perceived speed on a scale without seeing benchmark scores.

Step 5: Correlation analysis. I compared benchmark scores, measured latency, and user perception ratings.

The results were clear. Benchmark scores correlated weakly with user perception. Measured latency correlated strongly. The device with the highest benchmark score wasn’t perceived as fastest. The device with the lowest latency was.
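
If you want to replicate the Step 5 comparison on your own devices, a rank correlation is a reasonable tool because it doesn’t assume a linear relationship. A minimal sketch, assuming scipy is available; the numbers are placeholders for your own measurements, not the data from this test.

# Placeholder numbers, not the data from this test: swap in your own measurements.
from scipy.stats import spearmanr

benchmark_scores = [1310, 1610, 1520, 1240, 1380, 1700, 1190, 1450]
input_latency_ms = [44, 95, 62, 110, 47, 88, 70, 55]
perceived_speed  = [7.8, 5.1, 6.9, 4.4, 8.5, 5.6, 6.2, 7.4]   # user ratings, higher = felt faster

rho_bench, _ = spearmanr(benchmark_scores, perceived_speed)
rho_lat, _ = spearmanr(input_latency_ms, perceived_speed)

print(f"benchmark score vs perceived speed: rho = {rho_bench:+.2f}")
print(f"input latency   vs perceived speed: rho = {rho_lat:+.2f}")
# If the pattern described above holds, the latency correlation will be strongly
# negative while the benchmark correlation hovers near zero.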

Why We Trust Benchmarks Anyway

Given this disconnect, why do benchmarks dominate how we evaluate performance?

Objectivity illusion. Numbers feel objective. Personal experience feels subjective. We trust numbers even when they measure the wrong things.

Marketing convenience. “30% faster” is an easy headline. “Slightly better responsiveness in certain scenarios” isn’t. Benchmarks provide quotable claims.

Comparability. Benchmarks let us compare devices we’ve never used. Latency requires actually experiencing the device. Benchmarks enable armchair analysis.

Technical authority. Understanding benchmarks feels sophisticated. It signals expertise. Never mind that the expertise is in the wrong domain.

Review structure. Tech reviews need something to measure. Latency testing requires equipment and expertise most reviewers lack. Benchmark scores are free and easy.

This creates a feedback loop. People trust benchmarks. Companies optimize for benchmarks. Benchmarks become the standard. Even though they measure the wrong things.

The Skill Erosion Pattern

Here’s where this connects to a broader pattern.

When we rely on benchmarks to evaluate performance, we stop developing the ability to evaluate performance directly. We outsource judgment to synthetic tests. We stop trusting our own perception.

This is skill erosion through automation.

I’ve watched people buy devices with stellar benchmark scores and complain about sluggishness. When I point out the latency issues, they’re confused. “But the benchmarks said it was fast.”

The benchmark said it could do lots of math quickly. That’s not the same as feeling fast.

The ability to evaluate performance directly, to recognize latency, to feel responsiveness, requires practice. It requires using devices critically. It requires paying attention to interaction quality rather than just task completion.

When benchmarks do the evaluation for us, this skill atrophies. We become unable to distinguish between a device that benchmarks well and a device that works well.

This matters because benchmarks can be gamed. Manufacturers know exactly what synthetic tests measure. They can optimize for those specific measurements. Sometimes at the expense of real-world performance.

If users can’t perceive the difference, they’ll buy based on benchmark scores. The gaming continues.

The Gaming Problem

Let me be specific about benchmark manipulation.

Burst performance modes. Some devices detect benchmark applications and boost performance temporarily. Performance that isn’t sustained during normal use.

Selective optimization. Tuning drivers and firmware for specific benchmark scenarios. The benchmarked operations get fast. Everything else doesn’t.

Thermal throttling games. Running cool during short benchmarks. Throttling hard during sustained use. The benchmark score doesn’t reflect typical performance.

Cherry-picked tests. Emphasizing benchmarks where the device excels. Ignoring benchmarks where it struggles.

This isn’t fraud exactly. The benchmark numbers are real. But they don’t represent typical experience.

Users who rely solely on benchmarks get systematically misled. Users who can evaluate performance directly can identify the manipulation.

The benchmark industry has created an arms race between manufacturers trying to look good and benchmark designers trying to prevent gaming. Meanwhile, actual user experience gets neglected.

What Actually Makes Things Feel Fast

Let me be concrete about what determines perceived performance:

Input responsiveness. Sub-30ms response to touch, keyboard, and mouse input. This is the most important factor for interactive devices.

Animation smoothness. Consistent frame pacing. Not high frame rates necessarily. Consistent frame delivery without stutters or drops (one rough check is sketched after this list).

App startup time. Cold launch under 2 seconds for most applications. Under 500ms for background apps coming to foreground.

Transition speed. Fast animations for system operations. App switching, menu opening, window management. These happen constantly.

Scroll performance. One-to-one tracking of input. No lag, no jumping, no catching up. Immediate visual feedback.

Storage latency. Fast random access reads. Not sequential throughput, which benchmarks love. Random access, which applications need.

Memory management. Apps staying in memory. Not reloading constantly because the system is aggressive about freeing RAM.

A device that excels at these factors feels fast regardless of benchmark scores. A device that fails at these factors feels slow regardless of benchmark scores.
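
The animation-smoothness factor above can be checked with nothing more exotic than per-frame timestamps, which most platforms expose in some form. A minimal sketch with synthetic timestamps: it judges smoothness by frame-interval consistency and stutter count rather than by average frame rate.

import statistics

def frame_pacing_report(frame_times_s, stutter_factor=1.5):
    # Judge smoothness by frame-interval consistency, not by average FPS.
    intervals = [b - a for a, b in zip(frame_times_s, frame_times_s[1:])]
    median = statistics.median(intervals)
    stutters = [dt for dt in intervals if dt > stutter_factor * median]
    return {
        "avg_fps": round(1.0 / statistics.mean(intervals), 1),
        "median_frame_ms": round(median * 1000, 1),
        "worst_frame_ms": round(max(intervals) * 1000, 1),
        "stutter_count": len(stutters),
    }

# Synthetic capture: roughly 60 fps, with every 50th frame taking 50 ms.
timestamps, t = [], 0.0
for i in range(1, 301):
    t += 0.050 if i % 50 == 0 else 0.0167
    timestamps.append(t)
print(frame_pacing_report(timestamps))
# The average FPS still looks respectable; the stutter count is what you feel.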

flowchart TD
    A[User Interaction] --> B{Latency Assessment}
    B -->|< 50ms| C[Feels Instant]
    B -->|50-100ms| D[Feels Responsive]
    B -->|100-200ms| E[Feels Sluggish]
    B -->|> 200ms| F[Feels Broken]
    
    G[Benchmark Score] --> H{Correlation with Experience?}
    H -->|Weak| I[Measures Throughput]
    H -->|Strong| J[Measures Latency]
    I --> K[Misleading Indicator]
    J --> L[Useful Indicator]

The Industry Incentive Problem

Why hasn’t the industry shifted to latency-focused evaluation?

Measurement difficulty. Throughput benchmarks are easy. Run computation. Time completion. Report number. Latency measurement requires specialized equipment, controlled environments, and deep system understanding (the sketch after this list shows just how lopsided that is).

Less dramatic numbers. “200% faster benchmark score” sounds impressive. “15ms lower input latency” sounds technical and boring. Marketing departments prefer impressive.

Comparison complexity. Lower latency isn’t always unambiguously better. It depends on the task, the user, the context. Throughput is simpler: bigger is better.

Review workflow. Tech reviewers can run benchmarks in hours. Proper latency evaluation takes days. Deadlines don’t wait.

Audience expectations. Readers expect benchmark comparisons. They’ve been trained to want them. Changing expectations takes years.

The incentives all point toward continuing the benchmark charade. Even though everyone in the industry knows it doesn’t reflect real performance.
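
To see how lopsided the effort is, here is the throughput half in its entirety, a rough sketch with an arbitrary workload. There is no equally short version of the latency half, which is the point.

import time

def throughput_benchmark(n=5_000_000):
    # The entire recipe: run a fixed chunk of work, time it, report one number.
    start = time.perf_counter()
    total = sum(i * i for i in range(n))   # arbitrary stand-in workload
    elapsed = time.perf_counter() - start
    return total, n / elapsed

_, ops_per_sec = throughput_benchmark()
print(f"{ops_per_sec:,.0f} ops/sec")

# There is no equivalent ten-line latency benchmark. Measuring the input-to-photon
# path means capturing the input device, OS, application, compositor, and display
# together, which software-side timers like the one above cannot see on their own.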

Generative Engine Optimization

This topic behaves interestingly in AI-driven search contexts.

When someone asks an AI about device performance, the AI synthesizes from countless benchmark-focused articles. The AI learns that benchmark scores equal performance. The nuance about latency gets lost in aggregation.

This creates a self-reinforcing pattern. AI search returns benchmark-focused information. Users learn to evaluate by benchmarks. More benchmark content gets created. AI learns from that content. The cycle continues.

For humans developing genuine understanding, this requires deliberate counter-programming.

The meta-skill is knowing when AI-mediated information reflects actual truth versus when it reflects what’s easy to measure and write about. Performance evaluation is a clear case of the latter.

This extends beyond benchmarks. In any domain where the easy-to-measure thing differs from the important thing, AI search will tend toward the easy-to-measure. Because more content exists about it. Because the training data skews toward it.

Developing judgment about what actually matters, independent of what AI search surfaces, becomes essential. This judgment can’t be automated. It requires direct experience, critical thinking, and willingness to trust perception over presented data.

The people who maintain this capability will make better decisions. Those who defer entirely to AI-mediated information will inherit its biases.

The Perception Training Problem

Here’s something concerning about long-term effects.

If you always rely on benchmarks to evaluate performance, you never develop the ability to feel performance. The perceptual skill atrophies.

I’ve met people who genuinely can’t tell the difference between 60fps and 30fps animation. Not because their visual system can’t detect it. Because they’ve never trained it. They’ve always relied on frame rate counters instead of perception.

The same applies to latency. People who always check numbers instead of feeling response develop a kind of perceptual blindness. They can’t trust their own experience because they’ve never practiced using it.

This matters because perception is valuable. It integrates complex system behavior into intuitive assessment. It catches issues that benchmarks miss. It enables quick evaluation without specialized equipment.

When we outsource perception to measurement tools, we lose something. Not just convenience. Capability.

The fix is deliberate practice. Using devices without checking benchmarks first. Forming impressions based on experience. Then validating against measurements. Over time, perception becomes calibrated and reliable.

But this requires trusting yourself over numbers. Which our data-obsessed culture discourages.

What Good Evaluation Looks Like

Let me describe what thoughtful performance evaluation includes:

Direct use testing. Actually use the device for your typical tasks. Not running benchmarks. Using it. Notice what feels fast and what feels slow.

Latency awareness. Pay attention to responsiveness. Input lag. App switching. Animation smoothness. The qualitative experience of interactions.

Consistency observation. Does performance stay stable? Or does it degrade with heat, time, or load? Benchmarks capture peak. Reality includes valleys.

Workload matching. Does the device excel at what you actually do? A great video encoding score is irrelevant if you never encode video.

Comparative experience. If possible, use multiple devices back-to-back. Direct comparison reveals differences that isolated testing misses.

Long-term assessment. Performance on day one differs from performance after six months. Software updates, accumulated data, and wear all affect experience.

None of this maps to a single number. That’s the point. Real performance is multidimensional. Reducing it to scores loses information.

The Throughput-Latency Trade-off

There’s often an actual trade-off between throughput and latency. Understanding it helps.

Throughput optimization groups work together. Process things in batches. Keep pipelines full. Maximize utilization of resources.

Latency optimization prioritizes quick response. Drop everything to handle new input. Keep queues short. Accept lower utilization for faster reaction.

These goals conflict. You can’t optimize fully for both simultaneously.

A system designed for maximum throughput will have higher latency. The work gets done efficiently, but individual responses get delayed.

A system designed for minimum latency will have lower throughput. Responses are instant, but sustained work takes longer.

For interactive computing, latency matters more. Humans notice delay. Humans don’t notice if their device could theoretically process more data per second.

But benchmarks measure throughput. So devices get optimized for throughput. Even though users would prefer latency optimization.

This is why some lower-specced devices feel faster than higher-specced ones. They’ve been tuned for responsiveness rather than raw power.
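
A toy model makes the trade-off visible. The assumptions are deliberately simplified and the numbers invented: a fixed dispatch overhead, a fixed per-item cost, and a batch processor that waits for a full batch before starting.

# All numbers are invented; h is per-dispatch overhead, w is per-item work,
# gap is the arrival interval, all in seconds.

def immediate(h, w):
    latency = h + w                       # each item is handled the moment it arrives
    throughput = 1.0 / (h + w)            # peak items per second the server can finish
    return latency, throughput

def batched(h, w, batch, gap):
    wait = (batch - 1) * gap              # the first arrival waits for the batch to fill
    latency = wait + h + batch * w        # worst case: wait, then the whole batch runs
    throughput = batch / (h + batch * w)  # dispatch overhead amortized across the batch
    return latency, throughput

h, w, gap = 0.004, 0.001, 0.010           # 4 ms overhead, 1 ms of work, arrivals every 10 ms
for label, (lat, thr) in [("immediate", immediate(h, w)),
                          ("batch of 32", batched(h, w, 32, gap))]:
    print(f"{label:12s} worst-case latency {lat * 1000:6.1f} ms   peak throughput {thr:5.0f}/s")
# Batching wins on peak throughput and loses badly on latency. Interactive work
# cares about the latency column far more than the throughput column.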

Practical Recommendations

Given all this, here’s how I’d suggest approaching performance evaluation:

Ignore synthetic benchmarks for purchasing decisions. They’re marketing tools. Treat them as such.

Seek latency-focused reviews. Some reviewers measure actual responsiveness. Their evaluations are more useful.

Try before you buy. If possible, use a device before purchasing. Your perception is more reliable than any benchmark.

Trust your experience. If something feels slow, it is slow. Your perception isn’t wrong just because benchmarks say otherwise.

Understand your workloads. What do you actually do with your devices? Match evaluation to those specific tasks.

Consider consistency. Peak performance matters less than consistent performance. A device that rarely stutters beats one that benchmarks high but hiccups frequently.

Resist bigger-number bias. More isn’t always better. The right amount, delivered responsively, often beats more delivered sluggishly.

flowchart LR
    A[Performance Evaluation] --> B{Method}
    B -->|Benchmarks| C[Measures Throughput]
    B -->|Direct Use| D[Measures Latency + Experience]
    C --> E[Weak Correlation to Feel]
    D --> F[Strong Correlation to Feel]
    E --> G[Potential Mismatch]
    F --> H[Accurate Assessment]

The Broader Pattern

Benchmarks illustrate a broader pattern in how we evaluate technology.

We tend to measure what’s easy to measure. Not what matters most.

Storage is measured in gigabytes. But access speed matters more than capacity for most users.

Cameras are measured in megapixels. But sensor quality, processing, and optics determine image quality.

Networks are measured in bandwidth. But latency determines interactive experience.

In each case, the easy metric becomes the focus. The important metric gets ignored.
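
The network example is easy to make concrete with back-of-the-envelope arithmetic. The figures are illustrative assumptions, roughly 30 sequential round trips and about 2 MB of data for a page load, not measurements of any particular site.

# Illustrative assumptions: ~30 sequential round trips and ~2 MB of data per page.

def page_load_seconds(bandwidth_mbps, rtt_ms, round_trips=30, page_mb=2.0):
    transfer = page_mb * 8 / bandwidth_mbps   # time spent actually moving bytes
    waiting = round_trips * rtt_ms / 1000     # time spent waiting on round trips
    return transfer + waiting

for mbps, rtt in [(100, 20), (1000, 20), (100, 80), (1000, 80)]:
    print(f"{mbps:5d} Mbps at {rtt:3d} ms RTT -> {page_load_seconds(mbps, rtt):.2f} s")
# Ten times the bandwidth saves about 0.14 s here. Cutting the RTT from 80 ms
# to 20 ms saves 1.8 s. The latency number dominates the experience.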

This pattern repeats because the things that matter most for user experience are the hardest to measure. They emerge from complex system interactions. They’re hard to isolate, hard to quantify, hard to compare.

The things that are easy to measure are component capabilities. They’re necessary but not sufficient for good experience.

When we let easy metrics drive decisions, we systematically undervalue what matters. Both as consumers and as engineers.

What Changes This

Shifting away from benchmark obsession requires several things:

Better measurement tools. Accessible ways to measure latency, responsiveness, and consistency. Not just specialized lab equipment.

Review culture shift. Critics who prioritize experience over numbers. Audiences who reward that prioritization.

Marketing evolution. Companies that compete on responsiveness rather than raw specs. Customers who care about the difference.

Consumer education. Understanding that bigger numbers don’t equal better experience. This takes time and exposure.

Personal practice. Developing your own ability to evaluate performance directly. Not outsourcing judgment to synthetic tests.

None of these changes happen quickly. The benchmark industrial complex is entrenched. But incremental progress is possible.

The first step is awareness. Understanding that the numbers you see don’t represent the experience you’ll have. That gap is where better decisions live.

The Automation Angle

Let me connect this to the broader automation theme.

Benchmarks are automated evaluation. They reduce complex assessment to algorithmic measurement. They promise objectivity and comparability.

But they automate the wrong thing. They measure what computers can easily measure, not what humans actually experience.

This is the automation failure pattern. Automating evaluation of the easy-to-quantify aspects while ignoring the hard-to-quantify aspects that matter more.

The same pattern appears everywhere. Automated code review catches syntax issues but misses architectural problems. Automated testing validates specified behavior but misses edge cases. Automated performance monitoring tracks resource usage but ignores user experience.

The antidote isn’t abandoning automation. It’s combining automation with human judgment. Let machines measure what they can. But maintain the capability to evaluate what they can’t.

For performance, this means using benchmarks as one input among many. Not the definitive answer. One data point. Your perception, direct use testing, and qualitative assessment complete the picture.

Arthur’s Assessment Method

My cat Arthur has his own performance evaluation methodology.

When I get a new laptop, he walks across it. If it responds to his paws quickly enough, he keeps walking. If there’s lag, he stops and looks confused.

He’s measuring input latency. Directly. Without synthetic tests.

He also evaluates thermal performance. If it gets too warm, he moves. If it stays comfortable, he settles in.

And he assesses stability. Fans spinning up disturb his sleep. Consistent quiet operation is preferred.

Arthur doesn’t care about Geekbench scores. He cares about how the device behaves in his actual use case: being a cat near a laptop.

His methodology is primitive but effective. Direct experience. Immediate feedback. No abstraction layers.

We could learn something from this approach. Not to become cats, obviously. But to trust direct experience over abstracted measurements.

Final Thoughts

Benchmarks measure what’s easy to measure. Not what matters.

This isn’t a conspiracy. It’s just incentive structure. Easy metrics proliferate. Hard metrics get ignored. The gap between measurement and meaning grows.

For performance evaluation, that gap is enormous. Synthetic throughput tests have almost no correlation with user-perceived speed. Latency, which determines experience, goes largely unmeasured.

The broader lesson applies beyond tech. Whenever metrics become the goal, we risk optimizing for the measurement instead of the outcome. The measurement is easy. The outcome is hard. So we drift toward what’s easy.

Resisting this drift requires maintaining capability for direct assessment. Trusting experience over numbers. Developing judgment that doesn’t depend on external validation.

The big lie of benchmarks isn’t that they’re fabricated. They’re real numbers measuring real things. The lie is that those things matter for your actual experience. They mostly don’t.

Real-world performance is about latency. About responsiveness. About how a device feels in your hands doing your tasks.

No benchmark captures that. Only you can.

Trust what you feel.