Run evaluations
The arcade evals command discovers and executes evaluation suites with support for multiple providers, models, and output formats.
Backward compatibility: All new features (multi-provider support, multi-run evaluation, capture mode, output formats) work with existing evaluation suites. No code changes required.
Basic usage
Run all evaluations in the current directory:
```bash
arcade evals .
```
The command searches for files starting with eval_ and ending with .py.
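For instance, with a layout like the following (hypothetical file names), both suites would be discovered and run by the command above:

```
my-evals/
├── eval_weather.py     # discovered: starts with eval_ and ends with .py
├── eval_calendar.py    # discovered
└── helpers.py          # ignored: does not match the eval_*.py pattern
```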
Show detailed results with critic feedback:
```bash
arcade evals . --details
```
Filter to show only failures:
```bash
arcade evals . --only-failed
```
Multi-provider support
Single provider with default model
Use OpenAI with default model (gpt-4o):
Bash/Zsh (macOS/Linux):
```bash
export OPENAI_API_KEY=sk-...
arcade evals .
```
Use Anthropic with the default model (claude-sonnet-4-5-20250929):
Bash/Zsh (macOS/Linux):
```bash
export ANTHROPIC_API_KEY=sk-ant-...
arcade evals . --use-provider anthropic
```
Specific models
Specify one or more models for a provider:
```bash
arcade evals . --use-provider openai:gpt-4o,gpt-4o-mini
```
Multiple providers
Compare performance across providers (repeat --use-provider):
```bash
arcade evals . \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...
```
When you specify multiple models, results show side-by-side comparisons.
API keys
API keys are resolved in the following order:
| Priority | Format |
|---|---|
| 1. Explicit flag | --api-key provider:key (can repeat) |
| 2. Environment | OPENAI_API_KEY, ANTHROPIC_API_KEY |
| 3. .env file | OPENAI_API_KEY=..., ANTHROPIC_API_KEY=... |
Create a .env file in your directory to avoid setting keys in every terminal session.
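A minimal .env file might look like this (placeholder values shown; replace them with your own keys):

```bash
# .env in the directory where you run arcade evals; keep this file out of version control
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```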
Examples:
```bash
# Single provider
arcade evals . --api-key openai:sk-...

# Multiple providers
arcade evals . \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...
```
Capture mode
Record calls without scoring to bootstrap test expectations:
```bash
arcade evals . --capture --output captures/baseline.json
```
Include conversation in captured output:
```bash
arcade evals . --capture --include-context --output captures/detailed.json
```
Capture mode is useful for:
- Creating initial test expectations
- Debugging model behavior
- Understanding call patterns
See Capture mode for details.
Output formats
Save results to files
Specify output files with extensions - format is auto-detected:
```bash
# Single format
arcade evals . --output results.md

# Multiple formats
arcade evals . --output results.md --output results.html --output results.json

# All formats (no extension)
arcade evals . --output results
```
Available formats
| Extension | Format | Description |
|---|---|---|
| .txt | Plain text | Pytest-style output |
| .md | Markdown | Tables and collapsible sections |
| .html | HTML | Interactive report |
| .json | JSON | Structured data for programmatic use |
| (none) | All formats | Generates all four formats |
Multi-run evaluation
Run each case multiple times to measure stability:
```bash
# Run each case 5 times with random seeds
# Pass if the majority of runs pass
arcade evals . \
  --num-runs 5 \
  --seed random \
  --multi-run-pass-rule majority \
  --details \
  -o results.html
```
When you use --num-runs > 1, Arcade adds per-case run_stats to results. If your case has critics, Arcade also adds critic_stats.
--seed only affects OpenAI runs. Arcade ignores it for Anthropic.
Valid values:
- --seed: constant (default), random, or a non-negative integer
- --multi-run-pass-rule: last (default), mean, or majority
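For example, a fixed integer seed combined with the mean rule (both drawn from the values above) could be run like this:

```bash
# Three runs per case, reproducible OpenAI seeding, pass/warn decided by the mean score
arcade evals . --num-runs 3 --seed 42 --multi-run-pass-rule mean
```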
What multi-run returns
In JSON output, each case includes multi-run data:
```json
{
  "name": "Get weather for city",
  "status": "passed",
  "score": 93.0,
  "passed": true,
  "warning": false,
  "run_stats": {
    "num_runs": 3,
    "scores": [1.0, 0.8, 1.0],
    "mean_score": 0.93,
    "std_deviation": 0.09,
    "passed": [true, false, true],
    "warned": [false, true, false],
    "seed_policy": "random",
    "run_seeds": [123, 456, 789],
    "pass_rule": "majority",
    "runs": [
      {"score": 1.0, "passed": true, "warning": false, "failure_reason": null, "details": []},
      {"score": 0.8, "passed": false, "warning": true, "failure_reason": null, "details": []},
      {"score": 1.0, "passed": true, "warning": false, "failure_reason": null, "details": []}
    ]
  },
  "critic_stats": {
    "location": {
      "run_scores": [0.7, 0.56, 0.7],
      "weight": 0.7,
      "mean_score_normalized": 0.95,
      "std_deviation_normalized": 0.07,
      "mean_score": 0.67,
      "std_deviation": 0.05
    }
  }
}
```
Arcade uses --multi-run-pass-rule to set the overall status, passed, and warning fields. It sets run_stats.mean_score to the average raw score across runs, and the top-level score is that aggregate score in percent.
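If you also write JSON output (for example with -o results.json), these fields can be read programmatically. A minimal jq sketch, assuming the per-case objects appear somewhere in that file with the shape shown above (the exact top-level layout is not documented here, so adjust the filter to your file):

```bash
# Recursively collect every object that has run_stats and summarize score stability per case
jq '[.. | objects | select(has("run_stats"))
     | {name, mean: .run_stats.mean_score, std: .run_stats.std_deviation}]' results.json
```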
Command options
Quick reference
| Flag | Short | Purpose | Example |
|---|---|---|---|
| --use-provider | -p | Select provider/model | -p openai:gpt-4o |
| --api-key | -k | Provider API key | -k openai:sk-... |
| --capture | - | Record without scoring | --capture |
| --details | -d | Show critic feedback | --details |
| --only-failed | -f | Filter failures | --only-failed |
| --output | -o | Output file (repeatable) | -o results.md |
| --include-context | - | Add messages to output | --include-context |
| --max-concurrent | -c | Parallel limit | -c 10 |
| --num-runs | -n | Run each case multiple times | -n 5 |
| --seed | - | Seed policy for OpenAI runs | --seed random |
| --multi-run-pass-rule | - | Pass/warn rule for multi-run | --multi-run-pass-rule majority |
| --debug | - | Debug info | --debug |
--use-provider, -p
Specify a provider and optional model list. Repeat the flag for multiple providers:
```
--use-provider <provider>[:<model1>,<model2>]
```
Supported providers:
- openai (default: gpt-4o)
- anthropic (default: claude-sonnet-4-5-20250929)
Anthropic model names include date stamps. Check Anthropic’s model documentation for the latest model versions.
Examples:
```bash
# Default model for provider
arcade evals . -p anthropic

# Specific model
arcade evals . -p openai:gpt-4o-mini

# Multiple models from same provider
arcade evals . -p openai:gpt-4o,gpt-4o-mini

# Multiple providers (repeat `-p`)
arcade evals . -p openai:gpt-4o -p anthropic:claude-sonnet-4-5-20250929
```
--api-key, -k
Provide explicitly (repeatable):
```bash
arcade evals . -k openai:sk-... -k anthropic:sk-ant-...
```
--capture
Enable capture mode to record calls without scoring:
```bash
arcade evals . --capture
```
--include-context
Include system messages and conversation history in output:
```bash
arcade evals . --include-context --output results.md
```
--output, -o
Specify one or more output files. Format is auto-detected from extension:
```bash
# Single format
arcade evals . -o results.md

# Multiple formats (repeat flag)
arcade evals . -o results.md -o results.html

# All formats (no extension)
arcade evals . -o results
```
--details, -d
Show detailed results including critic feedback:
```bash
arcade evals . --details
```
--only-failed, -f
Show only failed test cases:
```bash
arcade evals . --only-failed
```
--max-concurrent, -c
Set maximum concurrent evaluations:
```bash
arcade evals . --max-concurrent 10
```
Default is 1 concurrent evaluation.
--debug
Show debug information for troubleshooting:
```bash
arcade evals . --debug
```
Displays detailed error traces and connection information.
Understanding results
Results are formatted based on evaluation type (regular, multi-model, or comparative) and selected flags.
Summary format
Results show overall performance:
```
Summary -- Total: 5 -- Passed: 4 -- Failed: 1
```
How flags affect output:
- --details: Adds per-critic breakdown for each case
- --only-failed: Filters to show only failed cases (summary shows original totals)
- --include-context: Includes system messages and conversation history
- --num-runs: Adds per-run statistics and aggregate scores
- Multiple models: Switches to comparison table format
- Comparative tracks: Shows side-by-side track comparison
Case results
Each case displays status and score:
```
PASSED Get weather for city -- Score: 100.00%
FAILED Weather with invalid city -- Score: 65.00%
```
Detailed feedback
Use --details to see critic-level analysis:
```
Details:
  location:
    Match: False, Score: 0.00/0.70
    Expected: Seattle
    Actual: Seatle
  units:
    Match: True, Score: 0.30/0.30
```
Multi-model results
When using multiple models, results show comparison tables:
```
Case: Get weather for city
  Model: gpt-4o -- Score: 100.00% -- PASSED
  Model: gpt-4o-mini -- Score: 95.00% -- WARNED
```
Advanced usage
High concurrency for fast execution
Increase concurrent evaluations:
```bash
arcade evals . --max-concurrent 20
```
High concurrency may hit API rate limits. Start with the default (1) and increase gradually.
Save comprehensive results
Generate all formats with full details:
```bash
arcade evals . --details --include-context --output results
```
This creates:
- results.txt
- results.md
- results.html
- results.json
Troubleshooting
Missing dependencies
If you see ImportError: MCP SDK is required, install the full package:
```bash
pip install 'arcade-mcp[evals]'
```
For Anthropic support:
```bash
pip install anthropic
```
Tool name mismatches
Tool names are normalized (dots become underscores); for example, a tool named Weather.GetForecast would appear as Weather_GetForecast. Check your tool definitions if you see unexpected names.
API rate limits
Reduce the --max-concurrent value:
```bash
arcade evals . --max-concurrent 2
```
No evaluation files found
Ensure your evaluation files:
- Start with eval_
- End with .py
- Contain functions decorated with @tool_eval()
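A quick sanity check for file discovery, run from the directory you pass to arcade evals:

```bash
# List candidate suite files; if nothing matches, the runner has nothing to execute
ls eval_*.py
```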
Next steps
- Explore capture mode for recording calls
- Learn about comparative evaluations for comparing sources