For most simulation workflows, follow this progression:
  1. Setup: simulators to define interfaces, environments to compose them, datasets to prepare patient data
  2. Define: benchmarks to create test suites, tasks to add test cases, evals to add assertions
  3. Execute: benchmark_runs to start runs, poll with get until status is Completed
  4. Analyze: task-runs to see per-task results, eval-runs to see per-eval reasoning and scores
If the user already has an environment or benchmark ID, skip the setup or definition steps and go straight to execution.

Tool Chaining Patterns

Simulators exist independently from environments. Create them first, then attach them.
// 1. Create simulator
{ "action": "create", "type": "FHIR", "name": "EHR" }
// Response: { "id": "sim_01", ... }

// 2. Create environment
{ "action": "create", "name": "Clinic" }
// Response: { "id": "env_01", ... }

// 3. Link
{ "action": "addSimulator", "id": "env_01", "simulatorId": "sim_01" }
This pattern keeps simulators reusable across multiple environments. A single FHIR simulator definition can be linked to different environment configurations.
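To illustrate the reuse, here is a minimal sketch of linking one simulator into two environments. The `call_tool` function is a hypothetical client stub (not part of the documented tools); only the payload shapes come from the examples above.

```python
# Hypothetical client stub: records each tool call and returns a fake ID.
calls = []

def call_tool(tool, payload):
    calls.append((tool, dict(payload)))
    return {"id": f"{tool[:3]}_{len(calls):02d}"}

sim = call_tool("simulators", {"action": "create", "type": "FHIR", "name": "EHR"})

# The same simulator ID can be linked into any number of environments.
for name in ["Clinic", "Hospital"]:
    env = call_tool("environments", {"action": "create", "name": name})
    call_tool("environments", {
        "action": "addSimulator",
        "id": env["id"],
        "simulatorId": sim["id"],
    })
```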

Benchmark definition

Benchmarks, tasks, and evals form a hierarchy. Create them top-down.
// 1. Benchmark
{ "action": "create", "name": "PA Tests", "environmentId": "env_01", "timeout": 300 }
// Response: { "id": "bm_01", ... }

// 2. Task
{ "action": "create", "benchmarkId": "bm_01", "name": "Submit PA", "instruction": "..." }
// Response: { "id": "task_01", ... }

// 3. Eval
{ "action": "create", "taskId": "task_01", "name": "pa-submitted", "assert": "...", "weight": 1.0 }
Each level references the parent by ID. Evals are always attached to a specific task.
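The ID threading can be sketched as follows. `call_tool` is again a hypothetical client stub; the point is that each create call consumes the ID returned by its parent's create.

```python
# Hypothetical client stub; a real client would send each payload to the
# corresponding tool and return the created entity.
def call_tool(tool, payload):
    return {"id": f"{tool[:2]}_01", **payload}

# Each create threads the parent's returned ID into the child's payload.
bm = call_tool("benchmarks", {"action": "create", "name": "PA Tests",
                              "environmentId": "env_01", "timeout": 300})
task = call_tool("tasks", {"action": "create", "benchmarkId": bm["id"],
                           "name": "Submit PA", "instruction": "..."})
ev = call_tool("evals", {"action": "create", "taskId": task["id"],
                         "name": "pa-submitted", "assert": "...", "weight": 1.0})
```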

Run and poll

Start a run, poll for completion, then drill into results.
// 1. Start
{ "action": "create", "benchmarkId": "bm_01" }
// Response: { "id": "run_01", "status": "Running" }

// 2. Poll (repeat until status is "Completed" or "Failed")
{ "action": "get", "id": "run_01" }
// Response: { "status": "Completed", "score": 0.92, "verdict": "Pass" }

// 3. Drill into task results
{ "action": "list", "runId": "run_01" }

// 4. Drill into eval results
{ "action": "list", "taskRunId": "tr_01" }
The benchmark_runs get response includes the overall score and verdict. Use task-runs and eval-runs to understand which specific checks passed or failed.
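A minimal polling loop might look like this. `get_run` is a hypothetical callable wrapping the benchmark_runs get action; the terminal statuses ("Completed", "Failed") come from the docs above, while the interval and timeout values are illustrative.

```python
import time

TERMINAL = {"Completed", "Failed"}

def wait_for_run(get_run, run_id, interval=5.0, timeout=600.0):
    """Poll the benchmark_runs `get` action until a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = get_run({"action": "get", "id": run_id})
        if run["status"] in TERMINAL:
            return run
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} did not finish within {timeout}s")
```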

Writing Good Evals

Eval assert strings are evaluated by an LLM judge against the interaction evidence. Write them like test assertions: specific, observable, and unambiguous.
- "The agent did a good job": Bad. Subjective, no observable criteria.
- "A prior auth was submitted": Okay. Observable but vague about what counts as "submitted".
- "A prior authorization request was submitted through the payer portal with CPT code 72148": Good. Specific action, specific channel, specific data point.
- "The 271 eligibility response shows active coverage with plan type PPO": Good. Specific transaction type, specific fields to check.
- "The agent called the FHIR endpoint GET /Patient and received a 200 response": Good. Verifiable against interaction logs.
- "The agent handled the error gracefully": Bad. "Gracefully" is subjective.
Use weights to distinguish critical checks from nice-to-haves. A weight of 1.0 means the eval is essential to the task. A weight of 0.5 or lower signals a secondary validation that improves the score but does not determine the verdict on its own.
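The docs do not spell out the aggregation formula. Assuming the overall score is a weight-normalized mean of per-eval scores (an assumption to verify against your scoring documentation), the effect of weights can be sketched as:

```python
# Assumed aggregation (not confirmed by the docs): overall score as a
# weight-normalized mean of per-eval scores in [0, 1].
def weighted_score(evals):
    total_weight = sum(e["weight"] for e in evals)
    return sum(e["score"] * e["weight"] for e in evals) / total_weight

# Under this formula, a failing weight-0.5 eval drags the overall score
# down half as much as a failing weight-1.0 eval would.
```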

One assertion per eval

Split compound checks into separate evals rather than combining them. This gives you granular scoring and clearer failure messages.
// Instead of one combined eval:
{ "assert": "PA was submitted with CPT 72148 and diagnosis code M54.5", "weight": 1.0 }

// Use two separate evals:
{ "name": "correct-cpt", "assert": "The PA request includes CPT code 72148", "weight": 1.0 }
{ "name": "correct-dx", "assert": "The PA request includes diagnosis code M54.5", "weight": 0.5 }

Error Handling

All tools return errors in a consistent shape:
{
  "error": "not_found",
  "message": "Environment env_xyz not found"
}
- not_found: entity does not exist or belongs to a different organization. Verify the ID; use a list action to find valid IDs.
- validation_error: invalid parameters (missing required field, wrong type). Check the parameter types and required fields in the tool reference.
- conflict: duplicate or conflicting state (e.g., simulator already linked). Use get to check current state before retrying.
- timeout: run exceeded the benchmark timeout. Increase the benchmark timeout value or simplify the task.
When a not_found error occurs, do not retry with the same ID. Use the list action on the parent resource to discover valid IDs. For example, if a tasks get returns not_found, call tasks list with the benchmarkId to see available tasks.
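The recovery pattern can be sketched as below. `call_tool` and `get_task_or_discover` are hypothetical names for illustration; the error shape and the get/list actions come from the docs.

```python
def get_task_or_discover(call_tool, task_id, benchmark_id):
    """Fetch a task; on not_found, list the parent benchmark's tasks
    instead of retrying the same ID."""
    resp = call_tool("tasks", {"action": "get", "id": task_id})
    if resp.get("error") == "not_found":
        # Do not retry the same ID; discover what actually exists.
        return call_tool("tasks", {"action": "list", "benchmarkId": benchmark_id})
    return resp
```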

Next Steps

Tools Reference

Full parameter documentation for all eleven tools.

Workflow Examples

Step-by-step tool call sequences for common simulation tasks.