For most simulation workflows, follow this progression:
  1. Setup: simulators to define interfaces, environments to compose them, datasets to prepare patient data
  2. Define: benchmarks to create test suites, tasks to add test cases, evals to add assertions
  3. Execute: benchmark_runs to start runs, poll with get until status is Completed
  4. Analyze: task-runs to see per-task results, eval-runs to see per-eval reasoning and scores
If the user already has an environment or benchmark ID, skip the setup or definition steps and go straight to execution.

Tool Chaining Patterns

Simulators exist independently from environments. Create them first, then attach them.
// 1. Create simulator
{ "action": "create", "type": "FHIR", "name": "EHR" }
// Response: { "id": "sim_01", ... }

// 2. Create environment
{ "action": "create", "name": "Clinic" }
// Response: { "id": "env_01", ... }

// 3. Link
{ "action": "addSimulator", "id": "env_01", "simulatorId": "sim_01" }
This pattern keeps simulators reusable across multiple environments. A single FHIR simulator definition can be linked to different environment configurations.
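To illustrate the reuse, here is a minimal sketch of linking one simulator into two environments. The `call_tool` function is a hypothetical client stub (not part of the documented tools); only the payload shapes come from the examples above.

```python
# Hypothetical client stub: records each tool call and returns a fake ID.
calls = []

def call_tool(tool, payload):
    calls.append((tool, dict(payload)))
    return {"id": f"{tool[:3]}_{len(calls):02d}"}

sim = call_tool("simulators", {"action": "create", "type": "FHIR", "name": "EHR"})

# The same simulator ID can be linked into any number of environments.
for name in ["Clinic", "Hospital"]:
    env = call_tool("environments", {"action": "create", "name": name})
    call_tool("environments", {
        "action": "addSimulator",
        "id": env["id"],
        "simulatorId": sim["id"],
    })
```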

Benchmark definition

Benchmarks, tasks, and evals form a hierarchy. Create them top-down.
// 1. Benchmark
{ "action": "create", "name": "PA Tests", "environmentId": "env_01", "timeout": 300 }
// Response: { "id": "bm_01", ... }

// 2. Task
{ "action": "create", "benchmarkId": "bm_01", "name": "Submit PA", "instruction": "..." }
// Response: { "id": "task_01", ... }

// 3. Eval
{ "action": "create", "taskId": "task_01", "name": "pa-submitted", "assert": "...", "weight": 1.0 }
Each level references the parent by ID. Evals are always attached to a specific task.
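The ID threading can be sketched as follows. `call_tool` is again a hypothetical client stub; the point is that each create call consumes the ID returned by its parent's create.

```python
# Hypothetical client stub; a real client would send each payload to the
# corresponding tool and return the created entity.
def call_tool(tool, payload):
    return {"id": f"{tool[:2]}_01", **payload}

# Each create threads the parent's returned ID into the child's payload.
bm = call_tool("benchmarks", {"action": "create", "name": "PA Tests",
                              "environmentId": "env_01", "timeout": 300})
task = call_tool("tasks", {"action": "create", "benchmarkId": bm["id"],
                           "name": "Submit PA", "instruction": "..."})
ev = call_tool("evals", {"action": "create", "taskId": task["id"],
                         "name": "pa-submitted", "assert": "...", "weight": 1.0})
```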

Run and poll

Start a run, poll for completion, then drill into results.
// 1. Start
{ "action": "create", "benchmarkId": "bm_01" }
// Response: { "id": "run_01", "status": "Running" }

// 2. Poll (repeat until status is "Completed" or "Failed")
{ "action": "get", "id": "run_01" }
// Response: { "status": "Completed", "score": 0.92, "verdict": "Pass" }

// 3. Drill into task results
{ "action": "list", "runId": "run_01" }

// 4. Drill into eval results
{ "action": "list", "taskRunId": "tr_01" }
The benchmark_runs get response includes the overall score and verdict. Use task-runs and eval-runs to understand which specific checks passed or failed.
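A minimal polling loop might look like this. `get_run` is a hypothetical callable wrapping the benchmark_runs get action; the terminal statuses ("Completed", "Failed") come from the docs above, while the interval and timeout values are illustrative.

```python
import time

TERMINAL = {"Completed", "Failed"}

def wait_for_run(get_run, run_id, interval=5.0, timeout=600.0):
    """Poll the benchmark_runs `get` action until a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = get_run({"action": "get", "id": run_id})
        if run["status"] in TERMINAL:
            return run
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} did not finish within {timeout}s")
```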

Writing Good Evals

Eval assert strings are evaluated by an LLM judge against the interaction evidence. Write them like test assertions: specific, observable, and unambiguous.
- "The agent did a good job": Bad. Subjective, no observable criteria.
- "A prior auth was submitted": Okay. Observable but vague about what counts as "submitted".
- "A prior authorization request was submitted through the payer portal with CPT code 72148": Good. Specific action, specific channel, specific data point.
- "The 271 eligibility response shows active coverage with plan type PPO": Good. Specific transaction type, specific fields to check.
- "The agent called the FHIR endpoint GET /Patient and received a 200 response": Good. Verifiable against interaction logs.
- "The agent handled the error gracefully": Bad. "Gracefully" is subjective.
Use weights to distinguish critical checks from nice-to-haves. A weight of 1.0 means the eval is essential to the task. A weight of 0.5 or lower signals a secondary validation that improves the score but does not determine the verdict on its own.
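The docs do not spell out the aggregation formula. Assuming the overall score is a weight-normalized mean of per-eval scores (an assumption to verify against your scoring documentation), the effect of weights can be sketched as:

```python
# Assumed aggregation (not confirmed by the docs): overall score as a
# weight-normalized mean of per-eval scores in [0, 1].
def weighted_score(evals):
    total_weight = sum(e["weight"] for e in evals)
    return sum(e["score"] * e["weight"] for e in evals) / total_weight

# Under this formula, a failing weight-0.5 eval drags the overall score
# down half as much as a failing weight-1.0 eval would.
```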

One assertion per eval

Split compound checks into separate evals rather than combining them. This gives you granular scoring and clearer failure messages.
// Instead of one combined eval:
{ "assert": "PA was submitted with CPT 72148 and diagnosis code M54.5", "weight": 1.0 }

// Use two separate evals:
{ "name": "correct-cpt", "assert": "The PA request includes CPT code 72148", "weight": 1.0 }
{ "name": "correct-dx", "assert": "The PA request includes diagnosis code M54.5", "weight": 0.5 }

Error Handling

All tools return errors in a consistent shape:
{
  "error": "not_found",
  "message": "Environment env_xyz not found"
}
- not_found: entity does not exist or belongs to a different organization. Verify the ID; use a list action to find valid IDs.
- validation_error: invalid parameters (missing required field, wrong type). Check the parameter types and required fields in the tool reference.
- conflict: duplicate or conflicting state (e.g., simulator already linked). Use get to check current state before retrying.
- timeout: run exceeded the benchmark timeout. Increase the benchmark timeout value or simplify the task.
When a not_found error occurs, do not retry with the same ID. Use the list action on the parent resource to discover valid IDs. For example, if a tasks get returns not_found, call tasks list with the benchmarkId to see available tasks.
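The recovery pattern can be sketched as below. `call_tool` and `get_task_or_discover` are hypothetical names for illustration; the error shape and the get/list actions come from the docs.

```python
def get_task_or_discover(call_tool, task_id, benchmark_id):
    """Fetch a task; on not_found, list the parent benchmark's tasks
    instead of retrying the same ID."""
    resp = call_tool("tasks", {"action": "get", "id": task_id})
    if resp.get("error") == "not_found":
        # Do not retry the same ID; discover what actually exists.
        return call_tool("tasks", {"action": "list", "benchmarkId": benchmark_id})
    return resp
```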

Next Steps

Tools Reference

Full parameter documentation for all eleven tools.

Workflow Examples

Step-by-step tool call sequences for common simulation tasks.