Eval Loop — Agent Infrastructure Lexicon

An eval loop is a standing test harness for agent behavior: a set of cases, an automatic scorer, and a number that moves when the agent gets better or worse.
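The three parts named above (cases, scorer, number) fit in a few lines. This is a minimal sketch, not a reference implementation: `Case`, `exact_match`, `run_eval`, and `toy_agent` are hypothetical names, the agent is assumed to be a plain function from prompt to text, and exact match stands in for whatever scorer suits your task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One eval case: an input and the outcome we consider correct."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Automatic scorer: 1.0 on a case-insensitive, trimmed match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(agent: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the agent and return the mean score."""
    if not cases:
        raise ValueError("eval suite is empty")
    total = sum(exact_match(agent(c.prompt), c.expected) for c in cases)
    return total / len(cases)


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call (assumption: prompt in, text out)."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


CASES = [
    Case("2+2", "4"),
    Case("capital of France", "paris"),
    Case("3*3", "9"),  # the toy agent has no answer for this one
]

# The single number that moves when the agent gets better or worse.
SCORE = run_eval(toy_agent, CASES)
print(f"pass rate: {SCORE:.2f}")
```

Swapping `toy_agent` for a real model call and `exact_match` for a task-specific scorer keeps the same shape.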

Why it matters

Without evals, every change is a vibe. You tweak a prompt, it looks better on the one example you tried, you ship. Then it regresses somewhere you were not looking. Evals make agent quality a measurement instead of a feeling, which is the only way quality improves on purpose.

In practice

You collect real failures and known-good outcomes into cases. Every change runs against them. Scores are tracked over time, gated in CI, and tied to the specific behaviors you care about. New failures in production become new cases.
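One way the gating and case-collection steps could look, sketched under assumptions: scores are kept as a per-behavior dictionary, cases live in a JSON Lines file, and a drop of more than 0.02 fails the build. The names `find_regressions`, `gate`, `append_case`, and `TOLERANCE` are hypothetical, as is the threshold.

```python
import json
from pathlib import Path

TOLERANCE = 0.02  # assumed: allowed per-behavior drop before the gate fails


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = TOLERANCE,
) -> dict[str, tuple[float, float]]:
    """Return {behavior: (old, new)} for every behavior that dropped too far."""
    regressions = {}
    for behavior, old in baseline.items():
        new = current.get(behavior, 0.0)  # a missing behavior counts as 0.0
        if new < old - tolerance:
            regressions[behavior] = (old, new)
    return regressions


def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Print any regressions and return a process exit code (0 = pass)."""
    regressions = find_regressions(baseline, current)
    for behavior, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {behavior}: {old:.2f} -> {new:.2f}")
    return 1 if regressions else 0


def append_case(path: Path, prompt: str, expected: str, behavior: str) -> None:
    """Turn a production failure into a permanent case (one JSON object per line)."""
    record = {"prompt": prompt, "expected": expected, "behavior": behavior}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

In a CI job, calling `sys.exit(gate(baseline, current))` would fail the build on any regression. Tracking scores per behavior rather than as one average keeps a gain in one area from hiding a loss in another.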

Where it shows up in my work

Evals are the feedback loop that makes cheap models viable in swarms. Tight loops, fast scoring, immediate correction. The Field Manual builds the reliability red-team sprint directly on top of this.