
Stop Benchmarking LLMs. Benchmark Your Prompts.

Q: Why are public LLM benchmarks misleading for real work?

Public benchmarks measure one well-formed prompt per task. Your real work has variance across prompts, contexts, formats, and follow-ups. A model that wins MMLU can still produce worse output on your actual workflow because your prompts look nothing like the benchmark prompts.

Q: How do I benchmark my own prompts?

Pick three real tasks from your work, run each prompt through the candidate models, and compare outputs against a reference you wrote by hand. The question that matters is not which model wins overall; it is which one wins on the kind of work you actually do.

Q: How many prompts do I need to test?

Three is the floor. One prompt catches only the obvious failures. Three show where a model is consistent. Five to ten is better if you have time. Beyond that, the marginal value drops off and you spend more time grading than working.

MMLU and HumanEval do not predict your output quality. Your prompt variance does. Here is the test I run on every new model before I trust it with real work.

Eliran Suisa

May 17, 2026

7 min read

TL;DR

Public benchmarks rank models on prompts that look nothing like yours. The score does not transfer.
Build a three-prompt suite from your own work. Run it against every new model. Re-run it when the model updates.
Score on the outcome you ship, not on a rubric. Subjective is fine. Consistency across runs is not optional.
The exercise will improve your prompts as much as it tells you which model to pay for.

The Problem With Public Benchmarks

A new frontier model lands. The launch page shows it beating the previous state of the art on MMLU, HumanEval, GPQA, SWE-bench Verified, and a chart with eight more acronyms. You upgrade. A week later your output is no better. Sometimes it is worse.

The reason is simple. Public benchmarks measure a model on clean, single-shot, well-formed prompts that researchers wrote and curated. Your real work does not look like that. Your prompts are partial. They reference earlier context. They use your team's vocabulary. They ask for output in a format you invented. They follow up. The model that wins the benchmark is not necessarily the one that wins your work.

There is a deeper issue: benchmarks reward narrow skill, and labs train against the benchmarks. A model can post a higher MMLU score and still produce worse prose, worse code review, worse strategy memos. That is the gap between the leaderboard and your inbox.

What To Benchmark Instead

Benchmark your prompts on your work. The procedure is three steps and takes an afternoon.

Step One: Pick Three Real Tasks

Pull three real prompts from the last two weeks of your work. Not curated. Not cleaned up. The exact prompts you sent, including the rough phrasing and the missing context. The three should span the kinds of work you do: one writing task, one analytical task, one task that requires the model to refuse or push back. If your work is mostly code, swap in three code tasks instead, but spread them across categories: a bug hunt, a refactor, a test design.

Step Two: Write A Reference Answer

Write what a correct, useful answer looks like for each prompt. You do not need a full draft. A bullet list of the points that must appear and the failure modes that must be avoided is enough. This is the single hardest step. Skipping it is how prompt benchmarking devolves into vibes.
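A reference answer is easier to grade against when it lives as data instead of a paragraph. The sketch below is one possible shape, not a prescribed format: the `Reference` class, its field names, and the sample postmortem content are all hypothetical and only there to illustrate "points that must appear" and "failure modes that must be avoided".

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """Hand-written grading notes for one prompt (hypothetical structure)."""
    prompt_id: str
    must_include: list[str] = field(default_factory=list)  # points that must appear
    must_avoid: list[str] = field(default_factory=list)    # failure modes to watch for


# Illustrative content only; write your own from a real task.
postmortem_ref = Reference(
    prompt_id="postmortem-draft",
    must_include=[
        "timeline of the incident",
        "root cause stated in one sentence",
        "owner for each follow-up action",
    ],
    must_avoid=[
        "blaming an individual",
        "inventing metrics that were not in the prompt",
    ],
)
```

If you cannot fill in `must_avoid` for a prompt, that is usually a sign you have not yet decided what a bad answer looks like.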

Step Three: Run And Score

Run each of the three prompts through every model you are considering. Run each one twice in separate sessions so you can see variance within a single model. Then score against the reference: does the answer cover the required points; does it hit any failure mode; how much editing would it need to ship.
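To make the bookkeeping concrete, here is a minimal sketch of the run plan and a record for the grades you enter by hand. It assumes two runs per prompt as described above; the `Grade` fields, the `run_plan` helper, and the model and prompt names are hypothetical placeholders, and the grading itself stays a human judgment.

```python
from dataclasses import dataclass
from itertools import product

RUNS_PER_PROMPT = 2  # two separate sessions per prompt, per model


@dataclass
class Grade:
    """One hand-entered grade for one run (hypothetical fields)."""
    model: str
    prompt_id: str
    run: int
    points_covered: int     # required points the answer included
    points_required: int    # required points listed in the reference
    failure_modes_hit: int  # failure modes from the reference that appeared
    edit_minutes: float     # your estimate of editing needed to ship

    @property
    def coverage(self) -> float:
        """Fraction of required points the answer covered."""
        return self.points_covered / self.points_required


def run_plan(models: list[str], prompt_ids: list[str]) -> list[tuple[str, str, int]]:
    """Every (model, prompt, run) combination that needs a fresh session."""
    return list(product(models, prompt_ids, range(1, RUNS_PER_PROMPT + 1)))


plan = run_plan(["model-a", "model-b"], ["writing", "analysis", "pushback"])
print(len(plan))  # 2 models x 3 prompts x 2 runs = 12 sessions
```

Twelve sessions for two candidates is what keeps this an afternoon rather than a project.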

What The Score Actually Tells You

Consistency matters more than peak. A model that ships a 7 every time beats one that swings between 9 and 4.
Failure modes matter more than upside. A model that hallucinates clauses, fabricates citations, or confidently invents APIs is a model you cannot trust on production work.
Format adherence is a real axis. A model that wins on substance but ignores your output format costs you time on every prompt.
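The consistency point is easy to check with a few lines of arithmetic. The sketch below assumes you graded each run on a 1 to 10 scale; the `summarize` helper and the sample scores are hypothetical, and ranking by worst case is one reading of "consistency over peak", not the only one.

```python
from statistics import mean


def summarize(scores: list[float]) -> dict[str, float]:
    """Average, floor, and swing for one model's graded runs."""
    return {
        "mean": mean(scores),
        "worst": min(scores),
        "spread": max(scores) - min(scores),
    }


# Made-up grades for illustration.
results = {
    "steady": summarize([7, 7, 7, 7]),
    "swingy": summarize([9, 9, 9, 4]),
}

# Prefer the highest floor; break ties with the smaller spread.
preferred = max(results, key=lambda name: (results[name]["worst"], -results[name]["spread"]))
print(preferred)  # steady
```

Here the swingy model has the higher average (7.75 against 7) and still loses, because the run that scored 4 is the one that lands on deadline day.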

Why This Beats Reading The Leaderboard

Three reasons.

The first is that the test maps to your work. If you write postmortems, the test scores postmortem quality. If you do code review, it scores code review. The leaderboard scores neither.

The second is that the act of writing reference answers forces you to be specific about what good looks like. That clarity carries back into your prompts. Half the time, the test improves your prompt before it tells you anything about the model.

The third is that the test catches drift. Frontier models update silently. The same prompt that produced a clean answer in March can produce a sloppy one in May after a safety tune or a router change on the provider side. Re-running the suite once a quarter catches the drift before it bites a real deadline.
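Catching drift only requires keeping the old scores around. A minimal sketch, assuming you store one score per prompt from each run of the suite: the `find_drift` function, the `tolerance` default of one point, and the March and May numbers are hypothetical.

```python
def find_drift(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 1.0,
) -> list[str]:
    """Return the prompt ids whose score dropped by more than `tolerance`."""
    return sorted(
        prompt_id
        for prompt_id, old_score in baseline.items()
        if prompt_id in current and old_score - current[prompt_id] > tolerance
    )


# Made-up scores for the same model and the same three prompts.
march = {"writing": 8, "analysis": 7, "pushback": 8}
may = {"writing": 8, "analysis": 4, "pushback": 7.5}

print(find_drift(march, may))  # ['analysis']
```

The tolerance exists because your own grading wobbles a little between sessions; a half-point dip is noise, a three-point drop is worth a second look.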

What I Stopped Doing

I stopped upgrading to whichever model topped the latest benchmark roundup. I stopped paying attention to chart deltas under five points on aggregated benchmarks. I stopped reading model cards top to bottom and started reading them only for the safety notes and the context window.

What I do instead is keep a three-prompt suite in a folder, re-run it whenever a new model lands, and keep whichever model holds up across all three. Most weeks the answer is no change. That is fine. Stability is a feature.

One Caveat

Benchmarks are not useless. They are evidence that a model can perform on a constrained task. Treat them as directional signal that a candidate is worth testing, then test against your own work. The mistake is using them as the decision, not as a filter.

FAQ

Why are public LLM benchmarks misleading for real work?

They measure clean, single-shot prompts that researchers wrote. Your prompts are partial, contextual, formatted to your needs, and often follow on from earlier turns. The score does not transfer.

How do I benchmark my own prompts?

Pick three real tasks from your work, write a reference answer for each, run each prompt through every candidate model twice, and score the output against the reference. The score that matters is the one on your work.

How many prompts do I need to test?

Three is the floor. Five to ten is better if you have time. Beyond that, the marginal value drops off and you spend more time grading than working.

Should I trust SWE-bench, MMLU, or HumanEval?

As directional signal, yes. As a basis for which model to pay for, no. Treat them as evidence that a model is worth testing on your work.