Skip to content

Eval Files

Evaluation files define the test cases, graders, workspace lifecycle, and run controls for an evaluation run. The reserved tags.experiment key is the run/result grouping label, top-level target identifies the system under test, and fields such as evaluate_options.repeat, threshold, timeout_seconds, evaluate_options.budget_usd, and evaluate_options.max_concurrency control repeated attempts and gates. Workspace reuse belongs under workspace.isolation; repository provenance belongs under workspace.repos; Docker/container binding belongs under workspace.docker. Non-provisioning setup commands belong in top-level extensions; reset policy stays under workspace.hooks.after_each.reset; runner-specific setup belongs in the target object or targets.yaml. AgentV supports two eval data formats: YAML and JSONL.

YAML is the canonical portable model. TypeScript helpers, generated fixtures, and Python scripts should lower to the same YAML/JSONL shapes rather than inventing a separate eval contract. Eval files describe the task, target binding, and run controls. Use evaluate_options.max_concurrency for authored suite concurrency. Operators can still override concurrency with --workers or set defaults with execution.workers in agentv.config.* / .agentv/config.yaml; do not author legacy workers fields in eval YAML.

Eval YAML is AgentV’s composable and runnable authoring primitive. Use ordinary *.eval.yaml files for direct task suites and for wrapper evals that compose other suites. Raw case files are reusable data inputs, not a second runnable experiment format.

  • A task suite is eval YAML that owns task context: workspace, shared input, shared assertions, fixtures, graders, and test cases. It can run directly or be imported through imports.suites.
  • A raw case file is a YAML, JSON, JSONL, CSV, script-backed dataset, directory, or glob of cases. Import it with imports.tests, tests: ./cases.yaml, tests: file://cases.csv, or string shorthand; parent suite context applies because raw cases do not carry their own suite context.
  • A wrapper eval is eval YAML that imports one or more suites with imports.suites and binds run controls with top-level target, threshold, timeout_seconds, and evaluate_options. Wrapper evals can live anywhere in the repo. A wrapper that imports suites with imports.suites must not define parent workspace; imported suites own task environment. Machine-local existing workspace paths belong in CLI flags or config.local.yaml, not eval YAML.

For example, a reusable task suite can keep the task contract in one file:

evals/suites/refunds.eval.yaml
suite: refunds
workspace:
repos:
- path: ./support-app
repo: acme/support-app
commit: main
input: Answer using the refund policy in the workspace.
assertions:
- Applies the refund policy correctly
tests:
- id: missing-receipt
input: Can this customer get a refund without a receipt?

Raw cases are just case data:

evals/cases/refund-smoke.cases.yaml
- id: damaged-item
input: The item arrived damaged. What should support do?
expected_output: Offer a replacement or refund path.

A wrapper eval stays ordinary eval YAML while choosing a target and run controls:

experiments/refunds-codex.eval.yaml
name: refunds-codex
target: codex-gpt5
evaluate_options:
repeat:
count: 2
strategy: pass_any
imports:
suites:
- path: ../evals/suites/refunds.eval.yaml
tests:
- path: ../evals/cases/refund-smoke.cases.yaml
tests:
- id: local-edge-case
input: Can a final-sale item be refunded after damage in transit?
expected_output: Explain the final-sale exception for damaged transit.

The experiments/ directory in that example is optional and user-owned. AgentV does not infer behavior from the path; the wrapper runs because it is eval YAML with tests or imports. The wrapper owns target selection and run controls. Put workspace setup in imported child suites. Parent workspace-affecting fields, including top-level workspace, are for parent-owned raw cases, including cases imported with imports.tests. Runtime workspace path overrides belong in CLI flags or .agentv/config.local.yaml; repos, hooks, templates, Docker config, env checks, and isolation belong in top-level or case-level workspace.

The primary format. A single file contains metadata, inline runtime config, and tests:

description: Math problem solving evaluation
target: default
assertions:
- Correctly calculates the answer
- Explains the calculation briefly
tests:
- id: addition
input: What is 15 + 27?
expected_output: "42"
FieldDescription
descriptionHuman-readable description of the evaluation
suiteOptional suite identifier
categoryOptional slash-delimited analytics taxonomy path. Overrides the category derived from the eval file path.
targetNamed system under test from .agentv/targets.yaml or --targets
experimentOptional run/result grouping label
evaluate_options.repeatOptional repeat policy as a positive integer shorthand or object with count, strategy, early_exit, and cost_limit_usd
timeout_secondsOptional per-case timeout
evaluate_optionsOptional evaluation runtime options such as budget_usd and max_concurrency
thresholdOptional suite quality threshold
workspaceSuite-level task environment — inline object or string path to an external workspace file. Repo entries declare identity and checkout pins; acquisition is covered in Workspace Architecture.
extensionsPromptfoo-style lifecycle hooks: file://path/to/hooks.mjs:beforeAll, beforeEach, afterEach, afterAll, plus the built-in agentv:agent-rules. Hooks run after workspace.repos materializes.
importsOptional import groups. imports.suites imports full child eval suites with their task context. imports.tests imports raw test rows into this file’s context. Import entries may use scoped run: overrides for threshold, repeat, timeout_seconds, and budget_usd.
testsInline raw tests or a string path to an external raw-case file or directory. Legacy tests[].include entries still load with a migration warning; prefer imports.suites or imports.tests.
assertionsSuite-level graders appended to each test unless execution.skip_defaults: true is set on the test
inputSuite-level input messages prepended to each test’s input unless execution.skip_defaults: true is set on the test

workspace is what the agent can inspect or modify through tools, not prompt input. Put instructions in input; put repos, templates, Docker config, env checks, isolation, and repo provenance in workspace. Put lifecycle setup that does not acquire repos in extensions.

For historical or repo-state evals, put the checkout under workspace.repos[].commit or workspace.repos[].base_commit. A commit SHA in the prompt or metadata is useful context, but it does not materialize a repo for the agent to inspect.

extensions uses Promptfoo-compatible lifecycle names. File hooks are local JavaScript or TypeScript modules resolved relative to the eval file:

extensions:
- file://scripts/setup.mjs:beforeAll
- file://scripts/setup.mjs:beforeEach
- file://scripts/setup.mjs:afterEach
- file://scripts/setup.mjs:afterAll

Each exported function receives a context object with snake_case keys such as workspace_path, test_id, eval_run_id, case_input, and case_metadata. Setup hook failures (beforeAll, beforeEach) fail the affected run; teardown hook failures (afterEach, afterAll) are non-fatal.

agentv:agent-rules is the only built-in extension in this slice. It runs after workspace materialization and exposes staged rule paths to providers and result metadata as agent_rules_paths:

extensions:
- id: agentv:agent-rules
hook: beforeAll
skills: agent-rules/skills
hooks: agent-rules/hooks
agents: agent-rules/agents
rules: agent-rules/AGENTS.md

If agentv:agent-rules is authored as a string, it defaults to beforeAll and discovers conventional rule locations already present in the materialized workspace. It does not clone repositories or replace workspace.repos.

You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the name field is present:

FieldDescription
nameMachine-readable identifier (lowercase, hyphens, max 64 chars). Triggers metadata parsing.
descriptionHuman-readable description (max 1024 chars)
versionEval version string (e.g., "1.0")
authorAuthor or team identifier
tagsArray of string tags for categorization
licenseLicense identifier (e.g., "MIT", "Apache-2.0")
requiresDependency constraints (e.g., agentv: ">=0.30.0")
name: export-screening
description: Evaluates export control screening accuracy
version: "1.0"
author: acme-compliance
tags: [compliance, agents]
license: Apache-2.0
requires:
agentv: ">=0.30.0"
tests:
- id: denied-party
criteria: Identifies denied parties correctly
input: Screen "Acme Corp" against denied parties list

When category is omitted, AgentV derives it from the eval file path. Generic filenames do not add a leaf: security/eval.yaml becomes security, and security/network/dataset.eval.yaml becomes security/network. A meaningful named eval file contributes a leaf, so security/network.eval.yaml becomes security/network. Existing flat category strings remain valid one-node category paths.

The assertions field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test’s graders unless a test sets execution.skip_defaults: true. For semantic or agent-behavior checks, prefer plain assertion strings first; AgentV treats them as rubric criteria. Use deterministic assertions or code graders when the expected output is exact or requires programmatic inspection. If the assertion strings already state the grading contract, omit a duplicate criteria field on each test. Use explicit type: llm-grader entries only when you need a custom prompt, a custom grader target, or a deliberately separate grader panel.

description: API response validation
assertions:
- type: is-json
required: true
- type: contains
value: "status"
- Correctly answers the user's question
- Explains the reasoning clearly
tests:
- id: health-check
input: Check API health

assertions supports rubric shorthand strings, deterministic assertion types (contains, regex, is_json, equals), g-eval, LLM graders, and code graders. See Tests for per-test assertions usage.

Reusable assertion sets can be factored into template files and referenced from any assertions array:

assertions:
- include: safe-response
- include: ./shared/format.yaml

Resolution rules:

  • include: name resolves to .agentv/templates/{name}.yaml with the closest matching directory winning
  • Relative paths resolve from the eval file location, so include: ./shared/format.yaml works as expected
  • Nested includes are allowed up to depth 3 to keep cycles and runaway recursion bounded
  • Suite-level includes follow the same merge behavior as other suite-level assertions and still respect execution.skip_defaults: true

The input field defines messages that are prepended to every test’s input. This avoids repeating the same prompt or system context in each test case — following the same pattern as suite-level assertions.

description: Travel assistant evaluation
input: "Answer as a concise travel assistant."
tests: ./cases.yaml

Use a block scalar for multi-line shared instructions:

input: |
Read AGENTS.md before answering.
Explain the tradeoffs clearly.
tests: ./cases.yaml

Each test in cases.yaml only needs its own query:

- id: japan-spring
criteria: Recommends spring for cherry blossoms
input: When is the best time to visit Japan?

The effective input at runtime becomes [...suite input, ...test input].

Suite-level input accepts the same formats as test-level input:

  • String — wrapped as [{ role: "user", content: "..." }]
  • Object without a top-level role key — wrapped as structured user-message content
  • Single message object — a { role, content } object using a supported message role
  • Message array — used as-is, including system messages and file references

The top-level role key is reserved for message objects. If your structured payload needs a field named role, nest it under another key.

input:
- role: system
content: You are a careful reviewer.
- role: user
content:
- type: file
value: ./system-prompt.md

To opt out for a specific test, set execution.skip_defaults: true (same flag that skips suite-level assertions).

The input_files field provides a shorthand for attaching shared file references to every test. When a test has a string input, the suite-level files are prepended as type: file content blocks in a single user message — the same shape produced by per-test input_files.

description: Schema review evaluation
input_files:
- ./shared-context.md
- ./schema.json
tests:
- id: summarize
criteria: Summarizes the important constraints
input: Summarize the important constraints.
- id: validate
criteria: Identifies validation gaps
input: What validation is missing?

Each test’s effective input becomes a single user message with [file blocks..., text block].

Per-test input_files overrides the suite-level value (it does not merge). To opt out, set execution.skip_defaults: true on the test.

For directory-style evals, a test may omit input and keep the task prompt in Markdown instead. AgentV resolves the prompt in this order:

  1. If the effective input_files contains a file named exactly PROMPT.md, that file becomes the test prompt.
  2. Otherwise, if a PROMPT.md exists beside the EVAL.yaml, that file becomes the test prompt.
  3. Other input_files remain attachments. PROMPT.md is removed from the attachment list so the prompt is not duplicated.
agent-001-fix-bug/
EVAL.yaml
PROMPT.md
fixtures/
failing-test.log
tests:
- id: fix-bug
criteria: Fixes the regression described in the prompt
input_files:
- ./fixtures/failing-test.log

Use explicit input when the prompt is short or generated from YAML variables. Use PROMPT.md when the task text is long enough that duplicating it inside YAML would make the eval hard to review.

Instead of inlining tests in the same file, you can point tests to an external YAML or JSONL file of raw cases. This is the inverse of the sidecar pattern — the metadata file references the test data:

name: my-eval
description: My evaluation suite
target: default
tests: ./cases.yaml

The path is resolved relative to the eval file’s directory. The external raw case file can be a YAML or JSON array of test objects, a JSONL file with one test per line, a promptfoo-compatible CSV file, or an explicit JavaScript or Python dataset function such as file://generate-tests.mjs:createTests or file://generate_tests.py:create_tests. String entries inside a tests: list work the same way and may use direct paths, file:// paths, directories, or globs:

tests:
- ./cases/*.cases.yaml

CSV datasets support promptfoo-style magic columns. __expected and __expectedN create AgentV assertions using the supported expected-column mini-DSL (contains:*, icontains:*, contains-any:*, contains-all:*, icontains-any:*, icontains-all:*, starts-with:*, ends-with:*, regex:*, equals:*, is-json, latency(<ms>), cost(<usd>), grade:*, llm-rubric:*, javascript:*, fn:*, eval:*, python:*, and file://*.py; file paths inside CSV cells are resolved relative to the CSV file). Unsupported promptfoo assertion forms such as similar:* are rejected during validation instead of being skipped at runtime. __provider_output becomes first-class expected_output, __metric names the generated assertions, __threshold sets the test threshold, __metadata:<key> adds metadata, and __config:__expectedN:threshold sets an assertion min_score. Ordinary columns become vars, so CSV rows can rely on suite-level input that interpolates those variables.

String shorthand is raw-case-only. Import reusable task suites through imports.suites; use imports.tests when you want to drop suite context and import only raw cases into the parent context:

imports:
suites:
- path: ./suites/*.eval.yaml
tests:
- path: ./cases/regression.jsonl
tests:
- id: local-edge-case
input: ...

Legacy tests[].include entries still load with a migration warning for older eval files, but new evals should use imports.suites or imports.tests.

When tests points to a directory, AgentV auto-discovers test cases from subdirectories. Each subdirectory containing a case.yaml (or case.yml) becomes a test case:

my-eval/
EVAL.yaml
cases/
fix-null-check/
case.yaml
add-greeting/
case.yaml
workspace/ # optional per-case workspace template
setup-files...
EVAL.yaml
name: my-benchmark
tests: ./cases/

Each case.yaml is a single YAML object (not an array) with the same fields as an inline test:

cases/fix-null-check/case.yaml
criteria: Fixes the null reference bug in the parser module
input: Fix the null check bug in parser.ts

Behavior:

  • Directory name as id: If case.yaml doesn’t specify an id, the directory name is used (e.g., fix-null-check)
  • Alphabetical ordering: Subdirectories are sorted alphabetically for deterministic order
  • Per-case workspace: A workspace/ subdirectory inside the case directory automatically sets workspace.template to that path, unless the case already defines a workspace field
  • Skipped directories: Subdirectories without case.yaml are skipped with a warning
  • Suite-level config applies: Suite-level assertions, input, workspace, target, and top-level run controls still apply to directory-discovered cases

This pattern is useful for benchmarks with many cases, where each case benefits from its own directory for workspace templates, supporting files, or documentation. For guidance on keeping provenance metadata, patches, oracle files, and generated dataset rows out of oversized inline YAML, see Benchmark Provenance.

All string fields in eval files support {{ env.VAR }} syntax for environment variable interpolation. This enables portable eval configs that work across machines and CI environments without hardcoded paths.

workspace:
repos:
- path: ./RepoA
repo: "{{ env.REPO_A_URL }}"
commit: "{{ env.REPO_A_COMMIT }}"
tests:
- id: test-1
input: "Evaluate the code in {{ env.PROJECT_NAME }}"
criteria: "{{ env.EVAL_CRITERIA }}"
  • Syntax: {{ env.VARIABLE_NAME }} with optional whitespace around the name
  • Missing variables resolve to an empty string
  • Partial interpolation is supported: {{ env.HOME }}/repos/{{ env.PROJECT }} becomes /home/user/repos/myproject
  • Non-string values (numbers, booleans) are not affected
  • Interpolation is applied recursively to all nested objects and arrays
  • Works in YAML eval files, external YAML/JSONL case files, and external workspace config files
  • .env files in the directory hierarchy are loaded automatically before interpolation
# workspace.yaml — works on any machine
repos:
- path: ./my-repo
repo: "{{ env.MY_REPO_URL }}"
commit: "{{ env.MY_REPO_COMMIT }}"
.env
MY_REPO_URL=https://github.com/org/my-repo.git
MY_REPO_COMMIT=main

Eval YAML also supports per-test vars for data-driven prompt templates. Use {{ vars.name }} placeholders in test-facing text fields, and AgentV resolves them when the suite loads.

input: "Answer clearly: {{ vars.question }}"
tests:
- id: capital
vars:
question: What is the capital of France?
expected_answer: Paris
criteria: "Answers {{ vars.question }} correctly"
input:
- role: user
content: "Question: {{ vars.question }}"
expected_output: "{{ vars.expected_answer }}"
  • vars is defined per test as an object
  • {{ vars.name }} and dotted paths like {{ vars.user.name }} are supported
  • Substitution applies to suite-level input, test input, input_files, criteria, expected_output, assertion values/metrics, and conversation turn input / expected_output / assertions
  • When the whole string is a single placeholder, the original JSON value is preserved
  • Missing variables render as empty strings following Nunjucks semantics
  • vars interpolation is separate from environment interpolation: {{ vars.question }} uses test data, {{ env.PROJECT_NAME }} uses environment variables

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format. Each line is a single test:

{"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
{"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}

An optional YAML sidecar file provides metadata and execution config. Place it alongside the JSONL file with the same base name:

dataset.jsonl + dataset.eval.yaml:

description: Math evaluation dataset
suite: math-tests
target: azure-base
assertions:
- name: correctness
type: llm-grader
prompt: ./graders/correctness.md
  • Streaming-friendly — process line by line
  • Git-friendly — diffs show individual case changes
  • Programmatic generation — easy to create from scripts
  • Industry standard — compatible with DeepEval, LangWatch, Hugging Face datasets

Use the convert command to switch between YAML and JSONL:

Terminal window
agentv convert evals/dataset.eval.yaml --format jsonl
agentv convert evals/dataset.jsonl --format yaml