Eval Loop
The repeatable harness that scores agent behavior against known-good outcomes. The thing that turns "seems fine" into a number you can watch.
An eval loop is a standing test harness for agent behavior: a set of cases, an automatic scorer, and a number that moves when the agent gets better or worse.
Why it matters
Without evals, every change is a vibe. You tweak a prompt, it looks better on the one example you tried, you ship. Then it regresses somewhere you were not looking. Evals make agent quality a measurement instead of a feeling, which is the only way it improves on purpose.
In practice
You collect real failures and known-good outcomes into cases. Every change runs against them. Scores are tracked over time, gated in CI, and tied to the specific behaviors you care about. New failures in production become new cases.
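As a rough illustration of that loop, here is a minimal sketch in Python. Every name in it (Case, exact_match, run_evals, gate, toy_agent) is hypothetical and invented for this example, and exact-match scoring is only the simplest stand-in for whatever scorer fits the behavior you care about.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One eval case: an input and its known-good outcome (hypothetical shape)."""
    name: str
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    # Simplest possible scorer; real scorers are often fuzzier or rubric-based.
    return output.strip().lower() == expected.strip().lower()


def run_evals(agent: Callable[[str], str], cases: List[Case]) -> float:
    """Run every case through the agent and return the pass rate (0.0 to 1.0)."""
    if not cases:
        raise ValueError("no eval cases to run")
    passed = sum(exact_match(agent(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def gate(score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """CI-style gate: allow the change only if the score has not regressed."""
    return score >= baseline - tolerance


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent call (assumption for illustration only).
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


cases = [
    Case("arithmetic", "2+2", "4"),
    Case("geography", "capital of France", "Paris"),
]
baseline = run_evals(toy_agent, cases)

# A new production failure becomes a new case, and the number moves.
cases.append(Case("prod-failure", "capital of Peru", "Lima"))
score = run_evals(toy_agent, cases)

print(f"baseline={baseline:.2f} score={score:.2f} ship={gate(score, baseline)}")
```

In a CI job, the gate result would typically decide the exit code, so a regression blocks the merge instead of surfacing later in production.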
Where it shows up in my work
Evals are the feedback loop that makes cheap models viable in swarms. Tight loops, fast scoring, immediate correction. The Field Manual builds the reliability red-team sprint directly on top of this.