passed, score, details, and evidence.
How Criteria Work
Unlike the legacy `eval` approach (a single natural-language `assert` string judged by an LLM), a criterion has a structured `assertion` object. The verification engine dispatches to a dedicated check implementation keyed by `assertion.assert`.
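The dispatch described above could be sketched as a registry keyed by the `assert` value. This is a minimal Python illustration, not Verial's implementation: the `CHECKS` registry, the `register` decorator, the `verify` function, and the `path` and `uploaded` field names are all hypothetical. Only the `assert` discriminator and the check name come from this page.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: maps an assertion's "assert" value to a check function.
Check = Callable[[Dict[str, Any], Dict[str, Any]], bool]
CHECKS: Dict[str, Check] = {}


def register(name: str) -> Callable[[Check], Check]:
    """Register a check implementation under its assertion type."""
    def decorator(fn: Check) -> Check:
        CHECKS[name] = fn
        return fn
    return decorator


@register("sftp-file-present")
def check_sftp_file_present(assertion: Dict[str, Any], evidence: Dict[str, Any]) -> bool:
    # Passes when the expected path is among the uploaded paths.
    # "path" and "uploaded" are assumed field names.
    return assertion["path"] in evidence.get("uploaded", [])


def verify(assertion: Dict[str, Any], evidence: Dict[str, Any]) -> bool:
    """Dispatch to the check registered for assertion["assert"]."""
    kind = assertion["assert"]
    if kind not in CHECKS:
        raise ValueError(f"unsupported check: {kind}")
    return CHECKS[kind](assertion, evidence)
```

A registry keeps each check isolated, so adding a new assertion type does not touch the dispatcher.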
Anatomy of a Criterion
| Field | Description |
|---|---|
| `label` | Short human-readable description |
| `weight` | Relative contribution to the task score. The task score is a weighted mean of per-criterion scores |
| `axis` | Optional scoring axis. Criteria sharing an axis contribute to a per-axis score (for example correctness, safety, efficiency) |
| `input_entity_id` | Optional DatasetEntity the criterion is scoped to (e.g. “the referral the agent should have processed”) |
| `assertion` | Typed assertion spec. Discriminated on `assert` |
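To make the fields and the weighted mean concrete, here is a small Python sketch. The `Criterion` dataclass and `task_score` function are illustrative names, not part of any SDK, and the sketch assumes each per-criterion score lies between 0 and 1.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Criterion:
    """Illustrative mirror of the fields in the table above."""
    label: str
    weight: float
    assertion: Dict[str, Any]
    axis: Optional[str] = None
    input_entity_id: Optional[str] = None


def task_score(results: List[Tuple[Criterion, float]]) -> float:
    """Weighted mean of per-criterion scores (assumed to be in [0, 1])."""
    total_weight = sum(c.weight for c, _ in results)
    if total_weight == 0:
        return 0.0
    return sum(c.weight * s for c, s in results) / total_weight


booked = Criterion(
    label="Appointment booked",
    weight=2,
    axis="correctness",
    assertion={"assert": "fhir-resource-state"},
)
no_ssn = Criterion(
    label="Agent never asks for SSN",
    weight=1,
    axis="safety",
    assertion={"assert": "voice-transcript", "not_contains": ["social security number"]},
)

# (2 * 1.0 + 1 * 0.0) / (2 + 1) = 2/3
score = task_score([(booked, 1.0), (no_ssn, 0.0)])
```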
Supported Checks
Each check is documented in full on the Criteria API reference. A quick tour:
fhir-resource-state
Assert that a FHIR search returns a resource with the expected field values after the rollout.
hl7-structural
Assert field values on HL7v2 outbound messages (ADT, ORU, ORM, SIU).
portal-state-match
Assert that a row in simulated portal state has the expected values after submission.
sftp-file-present
Assert that a file was uploaded to the SFTP endpoint, optionally checking parsed JSON contents.
voice-transcript
Assert that required phrases appear (and forbidden phrases do not) in the call transcript. Phrase matching is LLM-assisted.
x12-response
Assert field values on an X12 EDI response (270/271/276/277/278).
Annotated Examples
FHIR: Appointment booked
Searches `Appointment?patient=Patient/john-smith&status=booked`, then asserts the first result has the expected participant display name.
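A sketch of what such a criterion and its check might look like in Python. The `search` and `expect_participant_display` field names are assumptions for illustration (see the Criteria API reference for the real spec), and `"John Smith"` is a placeholder value. The bundle layout (`entry[].resource`, `participant[].actor.display`) follows the standard FHIR searchset Bundle and Appointment shapes.

```python
from typing import Any, Dict

criterion = {
    "label": "Appointment booked",
    "weight": 2,
    "assertion": {
        "assert": "fhir-resource-state",
        # Hypothetical field names below.
        "search": "Appointment?patient=Patient/john-smith&status=booked",
        "expect_participant_display": "John Smith",
    },
}


def first_result_has_participant(bundle: Dict[str, Any], display: str) -> bool:
    """Check the first search result for a participant with the given display name."""
    entries = bundle.get("entry", [])
    if not entries:
        return False  # The search returned nothing.
    participants = entries[0].get("resource", {}).get("participant", [])
    return any(p.get("actor", {}).get("display") == display for p in participants)
```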
Voice: required disclosures
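A simplified Python sketch of a required/forbidden phrase check. The real check is LLM-assisted; this sketch swaps in plain case-insensitive substring matching, so it will miss paraphrases. The `contains` field name and the `check_transcript` function are assumptions; `not_contains` appears elsewhere on this page.

```python
from typing import Any, Dict, List


def check_transcript(assertion: Dict[str, Any], transcript: str) -> Dict[str, Any]:
    """Substring stand-in for the LLM-assisted phrase matcher."""
    text = transcript.lower()
    required: List[str] = assertion.get("contains", [])        # assumed field name
    forbidden: List[str] = assertion.get("not_contains", [])
    missing = [p for p in required if p.lower() not in text]
    violations = [p for p in forbidden if p.lower() in text]
    return {
        "passed": not missing and not violations,
        "missing": missing,
        "violations": violations,
    }


assertion = {
    "assert": "voice-transcript",
    "contains": ["this call may be recorded"],
    "not_contains": ["social security number"],
}
result = check_transcript(assertion, "Hello, this call may be recorded for quality.")
```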
SFTP: claim file uploaded
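One way to picture this check in Python, assuming the uploads are available as an in-memory mapping from path to file bytes. The `path` and `json_equals` field names and the `check_sftp_file` function are hypothetical.

```python
import json
from typing import Any, Dict


def check_sftp_file(assertion: Dict[str, Any], uploads: Dict[str, bytes]) -> bool:
    """Pass if the file exists and, when requested, its JSON has the expected values."""
    raw = uploads.get(assertion["path"])
    if raw is None:
        return False  # File was never uploaded.
    expected = assertion.get("json_equals")  # assumed optional field
    if expected is None:
        return True  # Presence-only check.
    try:
        parsed = json.loads(raw)
    except ValueError:  # Covers invalid JSON and invalid UTF-8.
        return False
    return isinstance(parsed, dict) and all(parsed.get(k) == v for k, v in expected.items())


uploads = {"/claims/claim-001.json": b'{"claim_id": "001", "status": "submitted"}'}
assertion = {
    "assert": "sftp-file-present",
    "path": "/claims/claim-001.json",
    "json_equals": {"status": "submitted"},
}
```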
Portal: prior auth submitted
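A rough Python sketch of matching a row in simulated portal state, assuming the state is a list of dictionaries. The `match` and `expect` field names and the row columns are made up for illustration.

```python
from typing import Any, Dict, List


def check_portal_row(assertion: Dict[str, Any], rows: List[Dict[str, Any]]) -> bool:
    """Find the first row matching `match`, then compare it against `expect`."""
    match = assertion["match"]    # assumed: columns that identify the row
    expect = assertion["expect"]  # assumed: columns that must hold these values
    for row in rows:
        if all(row.get(k) == v for k, v in match.items()):
            return all(row.get(k) == v for k, v in expect.items())
    return False  # No row matched at all.


rows = [{"auth_id": "PA-1", "patient": "john-smith", "status": "submitted"}]
assertion = {
    "assert": "portal-state-match",
    "match": {"patient": "john-smith"},
    "expect": {"status": "submitted"},
}
```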
HL7: ADT sent
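For intuition, a field assertion on an HL7v2 message can be reduced to splitting segments and fields. The Python sketch below assumes the default `|` field separator and a carriage-return segment terminator; `hl7_field` is a hypothetical helper, not the engine's parser. Note the HL7v2 quirk that `MSH-1` is the separator itself, so MSH fields are offset by one.

```python
from typing import Optional


def hl7_field(message: str, segment: str, index: int) -> Optional[str]:
    """Return field `index` of the first `segment` in the message, or None."""
    for line in message.replace("\n", "\r").split("\r"):
        fields = line.split("|")
        if fields[0] != segment:
            continue
        if segment == "MSH":
            if index == 1:
                return "|"  # MSH-1 is the field separator character.
            pos = index - 1  # MSH-2 is the first item after the separator.
        else:
            pos = index
        return fields[pos] if 1 <= pos < len(fields) else None
    return None


message = (
    "MSH|^~\\&|APP|FAC|RCV|RFAC|20240101||ADT^A04|123|P|2.5\r"
    "PID|1||12345^^^HOSP||SMITH^JOHN"
)
message_type = hl7_field(message, "MSH", 9)  # "ADT^A04"
patient_name = hl7_field(message, "PID", 5)  # "SMITH^JOHN"
```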
X12: 271 eligibility response
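As an illustration, a 271 check might look for an `EB` segment whose first element is `1` (active coverage). The Python sketch below assumes `*` as the element separator and `~` as the segment terminator; real interchanges declare their delimiters in the ISA segment. The helper names are hypothetical, and the sample is a fragment rather than a complete interchange.

```python
from typing import List


def x12_segments(message: str, segment_id: str) -> List[List[str]]:
    """Return the element lists of every segment with the given ID."""
    found = []
    for raw in message.split("~"):
        segment = raw.strip()
        if not segment:
            continue
        elements = segment.split("*")
        if elements[0] == segment_id:
            found.append(elements)
    return found


def has_active_coverage(message: str) -> bool:
    """True if any EB segment reports EB01 == "1" (active coverage)."""
    return any(len(eb) > 1 and eb[1] == "1" for eb in x12_segments(message, "EB"))


fragment = "ST*271*0001~BHT*0022*11*10001234*20240101*1319~EB*1**30~SE*4*0001~"
```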
Writing Good Criteria
- Prefer precise field assertions over free-form natural language.
- Group related criteria under an `axis` so you can see a per-axis score breakdown in the task score.
- Weight critical outcomes higher. Verial uses weighted means, so `weight: 2` doubles a criterion’s contribution relative to `weight: 1`.
- Test negative behaviors too. For example, a `voice-transcript` criterion with `not_contains: ["social security number"]`.
- Start narrow. One criterion per observable outcome is better than one compound criterion.
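The effect of axes and weights can be seen in a small Python sketch. It assumes the per-axis score is also a weighted mean, restricted to the criteria on that axis; this page does not spell out that formula, and `axis_scores` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Each result is (axis, weight, score); score is assumed to be in [0, 1].
Result = Tuple[Optional[str], float, float]


def axis_scores(results: List[Result]) -> Dict[str, float]:
    """Weighted mean per axis; criteria without an axis are skipped."""
    weighted_sum: Dict[str, float] = defaultdict(float)
    weight_sum: Dict[str, float] = defaultdict(float)
    for axis, weight, score in results:
        if axis is None:
            continue
        weighted_sum[axis] += weight * score
        weight_sum[axis] += weight
    return {a: weighted_sum[a] / w for a, w in weight_sum.items() if w > 0}


results = [
    ("correctness", 2, 1.0),  # critical outcome, weighted double
    ("correctness", 1, 0.0),
    ("safety", 1, 1.0),       # e.g. a not_contains check that passed
]
breakdown = axis_scores(results)
# correctness: (2 * 1.0 + 1 * 0.0) / 3 = 2/3; safety: 1.0
```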
Next Steps
Verification
How the verification engine scores criteria into a task score.
Criteria API
REST endpoints and full assertion spec reference.