## Structure
A benchmark contains:

- Name. What the benchmark tests.
- Environment. The simulated health system to run against.
- Tasks. Individual test cases, each with instructions and expected outcomes.
- Evals. Natural language assertions on each task, evaluated by an LLM judge.
- Configuration. Timeout, concurrency, and scoring thresholds.
## Creating a Benchmark
Benchmarks are linked to an environment at creation time.
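As a minimal sketch of what that link looks like, a creation payload might resemble the following. The field names (`environment_id`, `config`, and so on) are illustrative assumptions, not the platform's actual API schema:

```python
# Hypothetical benchmark-creation payload. The "environment_id" field is the
# key point: a benchmark is bound to one simulated health system at creation.
create_benchmark_request = {
    "name": "prior-auth-workflows",   # what the benchmark tests
    "environment_id": "env_abc123",   # the environment to run against (assumed ID format)
    "config": {
        "timeout": 300,               # seconds allowed per task
        "concurrency": 1,             # sequential execution by default
    },
}
```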
## Tasks

Tasks are separate resources linked to a benchmark. Each task defines:

- Name. A short identifier for the test case.
- Instruction. What the agent should do (natural language).
- Trigger. An optional event that kicks off the task (e.g., an inbound HL7 message, a phone call).
- Tags. Optional labels for filtering and grouping.
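Putting the fields above together, a task definition might look like this sketch. The shape and values (including the HL7 trigger encoding) are assumptions for illustration, not a real schema:

```python
# Hypothetical task payload mirroring the fields above: name, instruction,
# optional trigger, and optional tags.
task = {
    "name": "verify-eligibility",
    "instruction": (
        "Check the patient's insurance eligibility and document "
        "the result in the chart."
    ),
    # Optional kickoff event, e.g. an inbound HL7 registration message.
    "trigger": {"type": "inbound_hl7", "message_type": "ADT^A04"},
    # Optional labels for filtering and grouping.
    "tags": ["eligibility", "smoke"],
}
```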
## Evals
Evals are separate resources linked to a task. Each eval has a label (a short identifier), an assert (a natural language assertion), and a weight.
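For example, a task's evals might look like the sketch below. The structure is assumed for illustration; the `assert` field holds the natural language assertion the LLM judge evaluates, and `weight` is taken here to set each eval's relative contribution to the task score:

```python
# Hypothetical eval payloads attached to one task.
evals = [
    {
        "label": "eligibility-checked",
        "assert": "The agent verified the patient's insurance eligibility.",
        "weight": 2,
    },
    {
        "label": "chart-updated",
        "assert": "The eligibility result was documented in the patient's chart.",
        "weight": 1,
    },
]
```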
## Multi-Interface Tasks

Tasks can span multiple simulators. For example, a task might require your agent to:

- Read a patient’s chart in the FHIR EHR
- Call the insurance company’s IVR to check eligibility
- Submit a prior authorization through the payer portal
- Fax supporting documentation
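A task covering the steps above needs nothing special structurally; the single instruction simply touches several simulators. A hypothetical sketch (field names assumed, as before):

```python
# Hypothetical multi-interface task: one instruction, several simulators.
task = {
    "name": "prior-auth-end-to-end",
    "instruction": (
        "Review the patient's chart in the EHR, call the payer IVR to check "
        "eligibility, submit a prior authorization through the payer portal, "
        "and fax the supporting documentation."
    ),
    "tags": ["fhir", "ivr", "portal", "fax"],
}
```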
## Configuration
### Timeout
The maximum time (in seconds) allowed for each task. If a task does not complete within this window, it is marked as timed out and receives a score of 0.

### Concurrency
How many tasks can run simultaneously within a single run. Sequential execution (`concurrency: 1`) is the default and ensures tasks don’t interfere with each other. Higher concurrency is useful for independent tasks.
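To make the scoring behavior concrete, here is a sketch of how a task score could combine these rules. The timeout-means-zero rule comes from the docs above; the weighted-average combination of evals is an assumption for illustration, not the platform's documented formula:

```python
def task_score(evals: list[dict], timed_out: bool) -> float:
    """Sketch: score one task from its judged evals.

    Each eval dict is assumed to carry a numeric "weight" and a boolean
    "passed" (the LLM judge's verdict on its assertion).
    """
    # A task that exceeds its timeout scores 0 regardless of its evals.
    if timed_out:
        return 0.0
    total = sum(e["weight"] for e in evals)
    passed = sum(e["weight"] for e in evals if e["passed"])
    return passed / total
```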
## Next Steps

- Runs. Execute benchmarks and review results.
- Evals. Write effective evaluation criteria.