AI agent evaluation: how to measure whether your agent works

AI agentsZegaware engineering25 June 202610 min read

Last updated: 25 June 2026

Evaluate an AI agent by measuring task success against known-good outcomes, not by reading output that looks plausible. Build a representative set of real tasks with ground-truth results, score end-to-end success rather than individual steps, test safety and refusal behaviour, and re-run the whole set on every change. Treat a few good demonstrations as a starting point, never as evidence.

Why "it looked right in a few tries" is not evaluation

A handful of successful demonstrations tells you the agent can succeed, not that it does succeed reliably. Agents operate over multiple steps, and small ambiguities compound fast when they interact with live data and user constraints [1]. A demonstration that works on a tidy input says nothing about the messy inputs your users will actually send.

The deeper problem is that an agent can report success while the outcome is wrong. Arize, in a field analysis of production agent failures published on 29 January 2026, describes observability stacks that report "success" simply because an action returned an HTTP (Hypertext Transfer Protocol) 200 "OK" status, even when the real result failed [1]. If your evidence is "it looked right when I tried it", you are measuring the demonstration, not the agent.

Evaluation is the discipline of replacing that impression with a measurement: a repeatable score, over a fixed set of tasks, that tells you whether the agent produced the correct outcome and how often. In our audits of AI-built software, the absence of this measurement is the most common reason a team cannot say whether their agent is ready.

What to measure

Define "working" before you measure it. For most agents, working means the agent achieved the user's actual goal, did not take an unsafe or out-of-scope action, and did so within acceptable cost and latency. Each of those is measurable, and each fails differently.

Task success against known-good outcomes, not plausible-looking text

Score the outcome, not the prose. A large language model (LLM) is trained to produce text that reads well, which means fluent, confident output is the default even when the content is wrong. The Open Worldwide Application Security Project (OWASP) lists this directly: in its 2025 guidance, hallucination is named a major cause of misinformation, defined as content that seems accurate but is fabricated [2]. Plausibility is therefore not evidence of correctness; it is the thing that most often hides the error.

A useful test needs a known-good outcome to compare against. For a structured task (book the correct slot, return the right record, call the right tool with the right arguments) the check can be exact. For open-ended output, define the outcome in terms a person can verify: the required facts are present, any cited source actually exists, no prohibited action was taken.

End-to-end success versus per-step

Measure the whole task, not just the steps. An agent can pass every individual step and still fail the task, because errors compound across a multi-step run [1]. Per-step metrics are useful for diagnosis, telling you where a failure began, but they systematically overstate how well the agent does the job a user actually asked for.

Report end-to-end success as the headline number: of N representative tasks, how many reached the correct final outcome. Keep per-step traces underneath it for debugging. A common and misleading pattern is a high per-step pass rate sitting on top of a low end-to-end success rate, which means the agent is good at moving and bad at arriving.

Safety and refusal behaviour

Test what the agent refuses, not only what it does. An agent that completes tasks well but can be talked into an unsafe action has not passed evaluation. Prompt injection, OWASP's first-ranked LLM risk for 2025, occurs when input alters the model's behaviour in unintended ways, and it can drive the model to violate its guidelines or take an action it should not [3].

Your evaluation set must therefore include cases the agent should decline or escalate: out-of-scope requests, missing authorisation, instructions hidden inside retrieved content. Score refusal as a first-class outcome. A correct refusal is a pass; completing a task that should have been refused is a failure, however polished the result.

Regression on every change

Re-run the full evaluation set on every change. Models, prompts, tools and dependencies all shift behaviour, often invisibly, and a change that fixes one case routinely breaks another. The only way to catch this is to score the whole set before and after, then compare.

This is the same regression discipline mature teams apply to code, applied to behaviour instead. It is also why an evaluation set has to be cheap and automatic to run: if scoring the agent takes a day of manual effort, it will not happen on every change, and the regressions will reach production instead.

How to build an evaluation set

The evaluation set is the asset. The agent's framework and underlying model will change; a well-built set of tasks with known-good outcomes keeps its value across those changes. Build it deliberately.

Start from representative real tasks

Draw tasks from real usage, not from imagination. The set should mirror the distribution of what users actually ask, including the boring, common requests that make up most of the volume. An evaluation set built only from interesting cases reports a score that your production traffic does not share.

Attach a ground-truth outcome to each task

Every task needs a known-good answer recorded alongside it. This is the part teams most often skip, because it is manual work, and it is also the part that makes the set worth anything: without a ground-truth outcome there is nothing to score against, and you are back to judging plausibility. For structured tasks the ground truth is the correct record or action; for open-ended tasks it is a checklist of what a correct answer must contain.

Add edge cases and known failure modes

Include the inputs that break things. Empty fields, ambiguous requests, conflicting instructions, very long inputs, and the specific cases you have already seen fail belong in the set permanently. Every production incident should end by adding the triggering case to the evaluation set, so the same failure cannot return unnoticed.

Add adversarial inputs

Include inputs designed to make the agent misbehave. Prompt injection does not need to be human-visible to work, as long as the content is parsed by the model [3]. Put direct and indirect injection attempts in the set: instructions embedded in a document, a web page, or a tool result that try to redirect the agent. The broader catalogue to draw from is the OWASP Top 10 for LLM Applications 2025, a community-driven list of the security risks specific to these applications [6].

Offline versus online evaluation

Use both, for different questions. Offline evaluation runs the fixed set in a controlled environment before release: it is repeatable, safe to run on every change, and answers "did this change make the agent better or worse against known tasks". It is the gate in your pipeline.

Online evaluation measures the live system on real traffic after release: success and failure signals, refusals, cost, latency and human escalations. It answers "is the agent actually working for users now", which offline evaluation cannot, because real inputs always drift from any fixed set. The honest limitation of online evaluation is that real outcomes are often not labelled, so you have to instrument explicit signals (did the user retry, correct, or abandon) rather than trusting a 200 status [1].

Run both, because each covers the other's blind spot. Offline catches regressions before users meet them; online catches the cases your set did not anticipate, which then become new offline cases. This loop, from production failure to new test, is how the set stays representative over time.

The limits of using a large language model as a judge

Using a large language model to grade an agent's output (an "LLM judge") is practical at scale, but it is a measurement instrument with known error, and it must be treated as one. The model doing the judging is the same kind of system being judged, with the same failure modes.

An LLM judge can be wrong. It hallucinates like any other model, producing a confident verdict that seems accurate but is fabricated [2]. It also carries measurable biases: the foundational study of LLM-as-a-judge identifies position bias, verbosity bias and self-enhancement bias, as well as limited reasoning ability [4]. Verbosity bias (rewarding longer answers) and self-enhancement bias (a judge favouring answers that resemble its own output) directly corrupt a quality score if you do not control for them.

An LLM judge can also be injected. The text it reads is untrusted input, so a prompt injection in the agent's output can target the judge and instruct it to return a pass [3]. A judge that grades attacker-influenced content is grading material that may be written partly to fool it.

Use an LLM judge carefully, and keep humans for high-stakes decisions. An LLM judge is reasonable for high-volume, lower-stakes scoring, where an occasional wrong verdict is tolerable and you have validated the judge against human labels on a sample. For anything where a wrong pass is expensive (safety, money, irreversible actions) a human reviews the outcome. Validate the judge itself: measure how often it agrees with human graders, and treat that agreement rate as the judge's own accuracy.

How evaluation connects to guardrails and observability

Evaluation, guardrails and observability are three views of the same problem: knowing what the agent does. Evaluation measures behaviour before release against known tasks. Guardrails constrain behaviour at runtime, refusing or blocking the unsafe actions your evaluation defined. Observability records what actually happened in production so you can measure it. Each one feeds the others.

The United States National Institute of Standards and Technology (NIST) frames this in its AI Risk Management Framework, which organises the work into four functions: Govern, Map, Measure and Manage [5]. Evaluation is the Measure function made concrete; guardrails and incident response sit within Manage; and none of it works without the governance and mapping that decide what "correct" and "unsafe" mean for your system.

In practice the loop is continuous. Observability surfaces a real failure; that failure becomes a new case in the evaluation set; the evaluation set proves whether a guardrail or prompt change fixes it without regressing anything else; and the guardrail enforces the boundary at runtime. An agent without all three is one you cannot honestly say is working, only one that has not yet been seen to fail. For the build-side companion to this measurement discipline, see how to build a reliable AI agent; for the equivalent question about generated code, see is AI-generated code safe to ship?.

Frequently asked questions

How do you measure the accuracy of an AI agent?

Measure accuracy as end-to-end task success against known-good outcomes: run a fixed set of representative tasks, each with a recorded correct result, and count how many the agent gets right. Score the outcome rather than how plausible the text reads, because fluent output is the default even when the content is wrong [2].

What is the difference between offline and online agent evaluation?

Offline evaluation runs a fixed set of tasks in a controlled environment before release, so it is repeatable and safe to run on every change. Online evaluation measures the live agent on real traffic, capturing success signals, refusals, cost and escalations. Offline gates releases; online catches what the fixed set did not anticipate. Use both.

Can I use an LLM to evaluate my agent?

Yes, for high-volume, lower-stakes scoring, but treat the large language model judge as an instrument with known error. It can hallucinate a verdict, it can be swayed by answer length or style [4], and it can be targeted by a prompt injection in the content it reads [3]. Validate it against human grades and keep humans for high-stakes decisions.

How often should I run AI agent evaluations?

Run the full evaluation set on every change to the model, prompt, tools or dependencies, because any of these can shift behaviour and break a previously passing case. Treat it as regression testing for behaviour: automated, cheap to run, and blocking on failure. In production, evaluate continuously on live traffic and feed new failures back into the set.

Get your AI agent evaluated

We build and review AI agents to a senior standard, with evaluation, guardrails and observability treated as part of the engineering work rather than an afterthought. If you are running an agent and cannot yet state, as a measured number, whether it works, we can help you put that measurement in place. Read more about our work on AI Agents, or book a review to have senior engineers assess what you have already built.

Sources

Aryan Kargwal, "Why AI Agents Break: A Field Analysis of Production Failures", Arize AI, 29 January 2026. https://arize.com/blog/common-ai-agent-failures/
OWASP, "LLM09:2025 Misinformation", Top 10 for LLM Applications 2025. https://genai.owasp.org/llmrisk/llm092025-misinformation/
OWASP, "LLM01:2025 Prompt Injection", Top 10 for LLM Applications 2025. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Lianmin Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", arXiv:2306.05685, 2023. https://arxiv.org/abs/2306.05685
National Institute of Standards and Technology, "AI Risk Management Framework". https://www.nist.gov/itl/ai-risk-management-framework
OWASP, "Top 10 for LLM Applications 2025". https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/

Want it done properly, once? We install OpenClaw isolated, hardened and monitored, then keep it updated under a plain monthly retainer. Fixed setup fee, quoted in writing.

Get set up