Building an agent with the Claude Agent SDK is deceptively easy. You wire up a system prompt, hand it a few tools, and watch it reason its way through a first task. The trouble starts around the tenth turn — when the transcript balloons, relevant facts get buried, and the agent starts confidently repeating work it already finished. The model did not get worse. Its context did.
Context is the set of tokens the model can see at this particular step: the system prompt, tool definitions, retrieved documents, prior messages, memory, and tool results. It is finite, it is expensive, and — unlike a single prompt — it changes at every turn. Treating it as a static thing you design once, instead of a live resource you curate, is the single most common reason agents fail after the demo.
Why context is the bottleneck
For a single-turn query, prompt engineering is almost all that matters: you write one good instruction, you attach one good example, the model answers. For an agent, the problem is fundamentally different. Every tool call produces new tokens. Every retrieved document competes with every other retrieved document for space. A sloppy search tool that returns ten thousand tokens of HTML when four hundred tokens of markdown would do is not just wasteful — it actively degrades the model's attention for everything that follows.
Put another way: prompt engineering asks "what should I say?" Context engineering asks "what should the model be allowed to see, right now, in order to take the next good action?"
Gather, curate, prune: a loop
The mental model I use — and the one the Agent SDK is quietly designed around — is that every turn of an agent runs through three phases. It gathers new material (tool calls, retrieval, file reads). It curates what actually belongs in the window (system prompt, selected docs, pinned memory, recent messages). And it prunes what no longer pays its own weight (stale tool output, superseded plans, duplicated summaries).
The diagram below is the one I sketch on a whiteboard every time someone asks me to review their agent. Getting this loop right is the first thing to do.
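The gather–curate–prune loop can be sketched in plain TypeScript, independent of any SDK. Everything here — the `ContextItem` shape, the `turn` function, the phase names — is illustrative, not Agent SDK API:

```typescript
// One conceptual turn of an agent, expressed as three phases.
type ContextItem = { text: string; tokens: number; stale: boolean };

function turn(
  context: ContextItem[],
  gather: () => ContextItem[], // tool calls, retrieval, file reads
  budget: number               // hard cap on tokens admitted this turn
): ContextItem[] {
  // 1. Gather: new material enters as candidates, not as context.
  const candidates = [...context, ...gather()];

  // 2. Prune: drop what no longer pays its own weight.
  const live = candidates.filter((item) => !item.stale);

  // 3. Curate: admit items until the token budget is spent.
  const curated: ContextItem[] = [];
  let spent = 0;
  for (const item of live) {
    if (spent + item.tokens > budget) break;
    curated.push(item);
    spent += item.tokens;
  }
  return curated;
}
```

The point of the sketch is the shape, not the policy: gathering never writes straight into the window, and curation is a budgeted admission step rather than an append.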
The four sources of context
Before we can curate, we have to know what's on the table. In practice, almost everything the Agent SDK puts into a context window falls into one of four buckets:
- System context — the durable instructions that define the agent's role, constraints, and available tools. Written once, rarely changed per-turn.
- Retrieved context — documents, code, tickets, or pages fetched from an external store because they are probably relevant to this task.
- Tool context — the return values of actions the agent has taken: a shell output, an API response, a file the agent just wrote.
- Memory context — durable notes the agent has written to itself across turns, sessions, or users. Useful, dangerous, and usually over-trusted.
The first mistake most teams make is blending these together into one undifferentiated stream of messages. The Agent SDK lets you keep them separate — as typed fields on a structured session — and you should take it up on the offer. Separation is what makes pruning possible.
A typed session, sketched
```typescript
// A minimal Claude Agent SDK session with explicit context slots.
import { Agent, tool } from "@anthropic-ai/claude-agent-sdk";

const agent = new Agent({
  model: "claude-sonnet-4-5",
  system: load("prompts/system.md"),
  tools: [searchDocs, readFile, runShell],
  context: {
    memory: "memory/hendrik.md",     // durable, small
    retrieved: { maxTokens: 4_000 }, // budgeted
    tools: { keepLast: 6 },          // window of results
  },
  onTurn: async (ctx) => {
    await ctx.curate();  // score + re-rank retrieved docs
    await ctx.compact(); // summarize stale tool output
  },
});
```
The important line isn't any one API call — it's that the context is a thing with named parts, each with its own budget. When a turn goes sideways, you can point at which bucket overspent.
The agent loop in practice
Zoom in on a single turn and the picture gets more interesting. The model plans, it calls a tool, it reads the result, and — crucially — it reflects. Most agents that fall over in production fall over because they skip that last step. They accumulate tool output, they never summarize, and by turn twelve the plan from turn three is so far up the transcript that the model can no longer see it.
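One mechanical way to make the reflect step concrete: after each tool call, fold older raw results into a short running note and keep only the most recent ones verbatim. A minimal sketch, with illustrative names throughout — in practice the summary would come from a model call, not string slicing:

```typescript
// Illustrative reflect step: replace old raw tool output with one note.
type Message = { role: "tool" | "note"; text: string };

function reflect(transcript: Message[], maxToolMessages: number): Message[] {
  const toolMsgs = transcript.filter((m) => m.role === "tool");
  if (toolMsgs.length <= maxToolMessages) return transcript;

  // Summarize the oldest tool results into a single compact note.
  const toSummarize = toolMsgs.slice(0, toolMsgs.length - maxToolMessages);
  const note: Message = {
    role: "note",
    text:
      `Summary of ${toSummarize.length} earlier tool results: ` +
      toSummarize.map((m) => m.text.slice(0, 40)).join(" | "),
  };

  // Keep only the most recent tool messages verbatim.
  const keep = new Set(toolMsgs.slice(-maxToolMessages));
  return [note, ...transcript.filter((m) => m.role !== "tool" || keep.has(m))];
}
```

Run this after every tool call and the plan from turn three stays within sight, because the transcript between it and the current turn never grows unboundedly.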
Retrieval, compression, and hand-off
Three mechanical moves buy you most of the improvement. None of them are glamorous.
Retrieval with a budget. Rank, don't dump. If you are pulling documents from a vector store, cap the total tokens you'll admit per turn and keep the top-k by score, not the top-k by count. A single three-thousand-token document can be worth six shorter ones — or none of them.
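The rank-don't-dump move fits in a dozen lines. A sketch with assumed names (`Doc`, `admitWithinBudget` are mine, not SDK API); note that overflowing documents are skipped rather than breaking the loop, so a smaller lower-ranked document can still fit:

```typescript
// Illustrative budgeted retrieval: admit top-scored docs until a token cap.
type Doc = { id: string; score: number; tokens: number };

function admitWithinBudget(docs: Doc[], maxTokens: number): Doc[] {
  const ranked = [...docs].sort((a, b) => b.score - a.score);
  const admitted: Doc[] = [];
  let spent = 0;
  for (const doc of ranked) {
    if (spent + doc.tokens > maxTokens) continue; // skip; a smaller doc may still fit
    admitted.push(doc);
    spent += doc.tokens;
  }
  return admitted;
}
```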
Compression of tool output. The Agent SDK exposes a per-tool post-processor precisely for this. If a tool returns HTML, strip it. If it returns a 400-line stack trace, keep the first frame and the exception. You are not lying to the model; you are making room for the work it actually has to do.
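Two post-processors of this kind, written as plain standalone functions — how you attach them to a tool is up to your harness, and both implementations are simplistic sketches (the regex tag-stripper, for instance, is not a real HTML parser):

```typescript
// Illustrative post-processors for two common sources of context bloat.

// Strip HTML tags and collapse whitespace.
function stripHtml(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Keep only the exception line and the first stack frame of a trace.
function trimTrace(trace: string): string {
  const lines = trace.split("\n");
  const firstFrame = lines.findIndex((l) => l.trimStart().startsWith("at "));
  return firstFrame === -1 ? lines[0] : [lines[0], lines[firstFrame]].join("\n");
}
```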
Hand-off via memory. When a sub-task is finished, the agent should write a short summary to its memory file and then drop the transcript of that sub-task from the active context. This is the move that lets long-running agents stay coherent across hundreds of turns without ballooning their own window.
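The hand-off is a single transactional move: summarize the finished sub-task into memory, drop its turns from the transcript. A sketch with illustrative types; in a real agent `summarize` would be a model call, not a pure function:

```typescript
// Illustrative hand-off: close a sub-task by writing a short memory note
// and dropping its transcript from the active context.
type SubTaskTurn = { taskId: string; text: string };

function handOff(
  transcript: SubTaskTurn[],
  memory: string[],
  finishedTask: string,
  summarize: (turns: SubTaskTurn[]) => string // in practice, a model call
): { transcript: SubTaskTurn[]; memory: string[] } {
  const done = transcript.filter((t) => t.taskId === finishedTask);
  return {
    transcript: transcript.filter((t) => t.taskId !== finishedTask),
    memory: [...memory, summarize(done)],
  };
}
```

Doing both halves together is what matters: summarizing without dropping just duplicates content, and dropping without summarizing loses the work.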
Autonomy vs. reliability
Now the uncomfortable part. Everything above is a set of constraints. Constraints cost you autonomy. The more aggressively you curate, compress, and gate, the more "behaved" the agent becomes — and the less room it has to discover the solution you didn't think of. The more you let it cook — long context, open-ended tool access, no pruning — the more you see the flashes of genuine capability, and the more you see the spectacular failures.
The matrix below is how I think about the trade-off when I'm picking a mode for a given task. Let it cook is where interesting agentic behavior lives. Curated is where shippable agentic behavior lives. The work is knowing which task sits where — and, increasingly, moving tasks leftward as you learn to trust them.
What to measure
Context work is invisible until you measure it. Three numbers I put on every dashboard:
- Context utilization — for each turn, what fraction of the window was actually referenced in the model's next action? Under 20% for more than a few turns is a sign you are over-retrieving.
- Redundancy ratio — how much of the current context duplicates content already seen earlier in the session? If it creeps above 30%, your compaction isn't firing.
- Time-to-first-action — how long between the user's message and the agent's first tool call? When this grows, it is almost always because the model is wading through irrelevant context before it can decide.
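The redundancy ratio is the easiest of the three to compute mechanically. A sketch using exact matching on normalized chunks — a crude proxy for the real thing, which would need fuzzy or semantic matching; the function name and chunk representation are mine:

```typescript
// Illustrative redundancy ratio: fraction of current-context chunks
// already seen earlier in the session. `seen` holds chunks that were
// normalized the same way when they were first recorded.
function redundancyRatio(current: string[], seen: Set<string>): number {
  if (current.length === 0) return 0;
  const norm = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();
  const dup = current.filter((c) => seen.has(norm(c))).length;
  return dup / current.length;
}
```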
None of these are in the model card. All of them will tell you more about how your agent is actually doing than the benchmarks will.
A closing note
The Claude Agent SDK gives you the primitives — typed context slots, per-tool post-processors, memory hand-off, structured sessions. What it does not give you is judgment about when to use them. That judgment comes from sitting with your own agent for a few days, watching its transcripts, and asking, every time it stumbles, the same question: what, in this window, is the model paying for that it didn't need? Most of the craft is in the answers.
— Hendrik Krack, krackedtools.dev