On `POST /v1/task-runs/{id}/complete`, the engine walks the task's criteria, runs the matching check against the sandbox state, and writes a Criterion Run for each.
## The Flow

### Check Dispatch
The `assertion.assert` field is a discriminator. The engine maps each value to a check implementation:
| `assert` value | Data source |
|---|---|
| `fhir-resource-state` | FHIR store attached to the playground (Google Cloud Healthcare API) |
| `hl7-structural` | SandboxEvent rows of type `hl7_outbound` |
| `portal-state-match` | Sandbox state rows keyed by `correlate_by.resource` |
| `sftp-file-present` | GCS objects under the sandbox's SFTP prefix |
| `voice-transcript` | Recorded voice turns for the playground |
| `x12-response` | X12 response records for the playground |
If `assertion.assert` does not match a known check type, the criterion run is marked `failed` with details indicating an unsupported assertion.
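The dispatch can be sketched as a lookup table keyed by the discriminator. This is a minimal sketch, not the engine's actual internals; the function names, the criterion-run dict shape, and the signatures are assumptions:

```python
# Illustrative dispatch table: one check callable per assert value.
# Names and signatures are assumptions, not the engine's actual API.
def check_fhir_resource_state(assertion, sandbox):
    # A real implementation would query the playground's FHIR store.
    return {"score": 1.0, "passed": True}

CHECKS = {
    "fhir-resource-state": check_fhir_resource_state,
    # "hl7-structural": ..., one entry per supported assert value
}

def run_criterion(assertion, sandbox):
    check = CHECKS.get(assertion["assert"])
    if check is None:
        # Unknown discriminator: the criterion run is marked failed.
        return {"score": 0.0, "passed": False,
                "details": f"unsupported assertion: {assertion['assert']}"}
    return check(assertion, sandbox)
```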
### Scoring
Each criterion produces a `score` in `[0, 1]` and a `passed` boolean. The task score is the weighted mean of the criterion scores, and the task `verdict` follows from it:

- `pass` when `score >= 0.9`
- `partial` when `0 < score < 0.9`
- `fail` when `score == 0` or there are no criteria
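The scoring rule can be sketched in Python. This is a sketch, not the engine's code; the criterion dict shape and a `weight` field defaulting to 1 are assumptions:

```python
def score_task(criteria):
    """Weighted mean of criterion scores plus the verdict thresholds.

    Each criterion is assumed to carry a score in [0, 1] and an
    optional weight (defaulting to 1)."""
    total_weight = sum(c.get("weight", 1.0) for c in criteria)
    if not criteria or total_weight == 0:
        return 0.0, "fail"  # no criteria means fail
    score = sum(c["score"] * c.get("weight", 1.0) for c in criteria) / total_weight
    if score >= 0.9:
        verdict = "pass"
    elif score > 0:
        verdict = "partial"
    else:
        verdict = "fail"
    return score, verdict
```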
### Per-Axis Scores
Criteria can declare an `axis` (for example `correctness`, `safety`, `efficiency`). The engine groups criteria by axis and produces a per-axis score using the same weighted mean. Criteria with no axis are collapsed into a `__default__` bucket.
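The grouping can be sketched as follows, under the same assumed criterion shape (`axis` and optional `weight` fields; weight defaulting to 1 is an assumption):

```python
from collections import defaultdict

def axis_scores(criteria):
    """Group criteria by axis and apply the same weighted mean per group.
    Criteria without an axis land in the "__default__" bucket."""
    groups = defaultdict(list)
    for c in criteria:
        groups[c.get("axis") or "__default__"].append(c)
    scores = {}
    for axis, group in groups.items():
        total_weight = sum(c.get("weight", 1.0) for c in group)
        scores[axis] = sum(
            c["score"] * c.get("weight", 1.0) for c in group) / total_weight
    return scores
```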
The `POST /v1/task-runs/{id}/complete` response includes an `axes` object keyed by axis.
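An illustrative shape for the `axes` object (the axis names and the exact per-axis fields here are examples, not guaranteed by the API):

```json
{
  "axes": {
    "correctness": { "score": 0.9 },
    "safety": { "score": 1.0 },
    "__default__": { "score": 0.5 }
  }
}
```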
## Evidence Payloads
Every check produces structured evidence stored on the Criterion Run. The shape varies by check type:

| Check | Evidence fields |
|---|---|
| `fhir-resource-state` | `fieldResults: [{ path, expected, actual, passed }]` |
| `hl7-structural` | `field_results: [{ path, expected, actual, passed }]` |
| `portal-state-match` | `assertionResults: [{ path, expected, actual, passed }]` |
| `sftp-file-present` | `fieldResults` plus matched path(s) |
| `voice-transcript` | `phraseResults: [{ phrase, found, turn }]` |
| `x12-response` | `fieldResults: [{ path, expected, actual, passed }]` |
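For example, a `fhir-resource-state` evidence payload might look like this (the paths and values are illustrative, not taken from a real run):

```json
{
  "fieldResults": [
    { "path": "Patient.name[0].family", "expected": "Smith", "actual": "Smith", "passed": true },
    { "path": "Patient.birthDate", "expected": "1980-01-01", "actual": "1980-02-01", "passed": false }
  ]
}
```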
For criteria with `scored: true`, the API elides details and field-level evidence in the completion response (to avoid leaking the scoring rubric to the agent). You can still read the full evidence later via `GET /criterion-runs/{id}`.
## Benchmark Run Aggregation
Once every task run in a benchmark run is completed, the engine takes the mean of the task scores to produce the benchmark run's `score` and `verdict` (`pass` at `>= 0.9`, `partial` above zero, `fail` otherwise).
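The aggregation step can be sketched as a plain mean with the same thresholds (a sketch assuming equal weighting of task scores, as the mean described above implies):

```python
def score_benchmark(task_scores):
    """Mean of task scores with the same verdict thresholds as task scoring."""
    if not task_scores:
        return 0.0, "fail"
    score = sum(task_scores) / len(task_scores)
    verdict = "pass" if score >= 0.9 else "partial" if score > 0 else "fail"
    return score, verdict
```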
## Next Steps
- **Criteria**: Write typed assertions for each check type.
- **Criterion Runs API**: Read per-criterion results and evidence.