Experiments
AgentV eval files are the runnable authoring artifact. Use top-level
description for display metadata, tags.experiment as the run/result grouping
label, target for the system under test, and flat top-level run controls such
as timeout_seconds and threshold. Use evaluate_options for evaluation
runtime options such as repeat, budget_usd, and max_concurrency.
Use agentv eval --workers N or project config defaults such as
agentv.config.* / .agentv/config.yaml execution.workers for operator-side
overrides.
name: support-regressiondescription: Support regression suitetags: experiment: support-codextarget: extends: codex-gpt5 model: gpt-5.1 reasoning_effort: hightimeout_seconds: 720evaluate_options: repeat: count: 4 strategy: pass_any budget_usd: 2.00 max_concurrency: 3
workspace: hooks: before_all: command: ["bash", "-lc", "bun install && bun run build"]
tests: - id: refund-eligibility input: Can this customer get a refund? criteria: Applies the refund policy correctlyLayout Conventions
Section titled “Layout Conventions”Use directories for human organization, not schema behavior. A common layout is:
evals/ suites/ refunds.eval.yaml cases/ refund-smoke.cases.yamlexperiments/ refunds-codex.eval.yamlIn that layout, evals/suites/refunds.eval.yaml is a reusable task suite,
evals/cases/refund-smoke.cases.yaml is raw case data, and
experiments/refunds-codex.eval.yaml is a wrapper eval. The wrapper still runs
only because it is eval YAML:
name: refunds-codextarget: codex-gpt5
tests: - id: local-edge-case input: Check a damaged final-sale refund.
imports: suites: - path: ../evals/suites/refunds.eval.yaml tests: - path: ../evals/cases/refund-smoke.cases.yamlThe experiments/ folder is optional and user-owned. AgentV does not scan it
for special files or infer runtime behavior from the path; the same wrapper eval
could live under evals/wrappers/, benchmarks/, or beside the suite it runs.
Suite And Test Imports
Section titled “Suite And Test Imports”Use imports.suites for full child suites and imports.tests for raw test
rows. Inline tests remain raw cases owned by the current file.
imports: suites: - path: evals/support/*.eval.yaml select: test_ids: - refund-* - missing-order-date tags: regression metadata: priority: high run: threshold: 1.0 timeout_seconds: 300 tests: - path: cases/*.cases.yaml - path: cases/regression.jsonl
tests: - cases/smoke/*.cases.yamlimports.suites preserves the imported suite’s task contract: metadata,
workspace, shared input, shared assertions, and tests. The parent eval
still owns the single run bundle and run controls. Use parent target and
top-level run controls for the overall run, and import run: for scoped
threshold, timeout, or budget overrides.
A parent eval that imports any imports.suites entry must not define top-level
workspace. Imported suites own task environment. If the parent should provide
workspace context, import raw cases with imports.tests or shorthand paths
instead of importing an eval suite.
imports.tests imports only raw test entries. It intentionally drops shared
context from an imported eval suite, so parent suite fields apply to those raw
cases.
Import select.test_ids filters imported test IDs with glob patterns.
Import select.tags filters each imported case’s effective metadata.tags.
Effective case tags are suite-first and deduped:
suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags
still remain suite identity metadata for discovery and reporting; selection reads
the merged case metadata view. Import select.metadata filters case metadata by
key/value, where selector values may be scalars or lists. Globbed include paths
are resolved in deterministic path order, then test order.
String-valued tests and string entries inside tests[] are raw-case import
shorthand. They are equivalent to imports.tests and may point at
raw case files, directories, or globs. Importing another eval suite must use
imports.suites.
Suite imports are resolved as a deterministic include graph. Circular
imports.suites imports fail validation with the import chain; raw-case shorthand does
not recursively load suite runtime blocks.
Imported suite rows keep their source suite metadata in index.jsonl. Use each
row’s result_dir as the authoritative path to generated artifacts inside the
run directory; do not infer layout from suite names.
Scoped Run Overrides
Section titled “Scoped Run Overrides”Use scoped run: blocks for result interpretation and scheduling policies that
vary by include group or test case. Precedence is:
test.run > import run > parent top-level run controlstarget: agentthreshold: 0.8evaluate_options: repeat: count: 3 strategy: pass_any
imports: suites: - path: ./evals/flaky-agentic/**/*.eval.yaml select: tags: [agentic] run: timeout_seconds: 300
- path: ./evals/regression/**/*.eval.yaml select: tags: [must-pass] run: threshold: 1.0 timeout_seconds: 300
tests: - id: critical-case input: "..." criteria: Must pass exactly run: threshold: 1.0 budget_usd: 0.50Scoped run: supports threshold, repeat, timeout_seconds, and legacy
per-case budget_usd overrides. Parent suite budgets should use
evaluate_options.budget_usd for public eval authoring. Use
evaluate_options.max_concurrency for authored concurrency. Candidate-changing fields stay
parent-level. Executable workspace setup belongs in top-level lifecycle extensions, and
provider-specific setup belongs in target configuration.
Lifecycle Ownership
Section titled “Lifecycle Ownership”Run controls do not own commands that prepare files, dependencies, repos, or target-specific runner state.
| Need | Put it in |
|---|---|
| Install dependencies, build the repo, seed files | extensions: ["file://scripts/setup.mjs:beforeAll"] |
| Apply per-case state | extensions: ["file://scripts/setup.mjs:beforeEach"] |
| Reset file state after each case | workspace.hooks.after_each.reset |
| Configure an agent runner or provider variant | target object or targets.yaml |
| Choose the target | top-level target |
| Override the target’s default model | target.model |
| Configure repeat policy, budget, concurrency, timeout, threshold | evaluate_options.repeat, evaluate_options.budget_usd, evaluate_options.max_concurrency, timeout_seconds, threshold |
| Bind an existing local workspace directory | --workspace-path or .agentv/config.local.yaml |
extensions: - file://scripts/build.mjs:beforeAll
target: extends: codex-gpt5 hooks: before_each: command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]evaluate_options: repeat: count: 3 strategy: pass_anyExisting local workspace paths are machine-local bindings: pass
--workspace-path for a one-off run or put execution.workspace_path in
.agentv/config.local.yaml.
Put repos, templates, hooks, Docker config, env checks, and isolation under
top-level or case-level workspace.
Repeat Runs
Section titled “Repeat Runs”Use evaluate_options.repeat when you want AgentV to try each case more than once:
evaluate_options: repeat: 3Use object form when you need richer AgentV behavior:
evaluate_options: repeat: count: 3 strategy: pass_any early_exit: true cost_limit_usd: 1.00evaluate_options.repeat.strategy controls verdict aggregation. pass_any
treats the case as successful when any completed attempt passes; pass_all
requires every completed attempt to pass. mean and confidence_interval
aggregate scores where supported today. evaluate_options.repeat.early_exit is
only a scheduling and cost optimization: pass_any may stop at the first pass,
and pass_all may stop at the first fail. Leave it unset or false when you
want complete variance data. Per-case tests[].options.repeat overrides the
global repeat count or object for that case.
Result Layout
Section titled “Result Layout”Eval runs write to a direct run bundle:
.agentv/results/<run_id>/CLI --experiment sets the experiment label explicitly. Without that flag, AgentV
uses the reserved tags.experiment key (see below), then the suite name, then
the eval filename. The precedence is --experiment > tags.experiment > default.
There is no top-level experiment field — a run is labeled with tags.experiment.
The Dashboard uses “Experiment” for the comparison and result grouping concept;
folder names are only storage allocation and must not define result semantics.
Tags as run metadata (tags.experiment)
Section titled “Tags as run metadata (tags.experiment)”Suite-level tags accepts either the existing selection form (a string or list of
strings that drives select.tags / --tag name filtering) or a
promptfoo-shaped map:
tags: experiment: baseline-v2 team: complianceThe map form is run metadata, not selection. The reserved experiment key feeds
the experiment namespace, and the full map is emitted to
summary.json.metadata.tags and every index.jsonl row so the Dashboard can group
trend/compare views by tags.experiment.
Set or override map tags from the CLI with a repeatable --tag key=value flag
(--tag experiment=baseline-v2 --tag team=compliance); bare --tag name keeps its
existing file-selection meaning. Tags merge with precedence
CLI --tag key=value > project config tags > eval tags. --experiment
still wins over tags.experiment for the namespace, and an explicit
--tag experiment= clears the label back to the default.
Imported source suite metadata appears in index.jsonl rows and manifests.
Use index.jsonl fields such as eval_path, test_id, target, and
result_dir for identity and artifact discovery instead of reconstructing paths
from suite names or wrapper layout.
For the complete result file contract, including why row metadata is semantic truth and directories are storage allocation, see Result Artifact Contract.