How Evals Work
Each eval has a label (a short identifier) and an assert (a natural-language assertion). When a task completes, Verial passes the assertion, along with all recorded interactions (FHIR requests, HL7 messages, call transcripts, fax documents), to an LLM judge. The judge determines whether the evidence supports the assertion.

Writing Evals
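A minimal sketch of this flow, assuming hypothetical shapes (the field and function names below are illustrative, not Verial's actual API):

```python
from dataclasses import dataclass

@dataclass
class Eval:
    label: str    # short identifier, e.g. "prior-auth-submitted"
    assert_: str  # natural-language assertion the judge checks

@dataclass
class Interaction:
    source: str   # simulator that recorded it: "fhir", "hl7", "phone", "fax"
    content: str  # raw request body, message, or transcript

def judge_prompt(ev: Eval, interactions: list[Interaction]) -> str:
    """Bundle the assertion with all recorded evidence for the LLM judge."""
    evidence = "\n".join(f"[{i.source}] {i.content}" for i in interactions)
    return f"Assertion: {ev.assert_}\nEvidence:\n{evidence}"
```

The key point is that the judge sees only recorded evidence, which is why assertions should describe observable actions.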
Good evals are specific, observable, and grounded in evidence that the simulators capture.

Good Evals
Avoid
Weights
Each eval has a weight that determines its contribution to the task score. Weights are relative within a task: an eval with weight 1.0 contributes twice as much as an eval with weight 0.5.
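Because weights are relative, only their ratios matter. A minimal check, using hypothetical labels and weights:

```python
# Hypothetical eval weights within one task; only the ratios matter.
weights = {"prior-auth-submitted": 1.0, "cpt-code-correct": 0.5}
total = sum(weights.values())  # 1.5

# Each eval's share of a full-pass task score:
contributions = {label: w / total for label, w in weights.items()}
# The 1.0-weight eval contributes exactly twice the 0.5-weight eval.
```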
Scoring
Task Score
The task score is computed as the weighted sum of passed evals divided by the total weight. For a task with three evals weighted 1.0, 0.5, and 0.5 (total weight 2.0):

- All pass: (1.0 + 0.5 + 0.5) / 2.0 = 1.0
- Only prior auth submitted: 1.0 / 2.0 = 0.5
- Prior auth + CPT code: (1.0 + 0.5) / 2.0 = 0.75
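The computation above can be sketched as a small function (a sketch, not Verial's implementation; each eval is modeled as a (weight, passed) pair):

```python
def task_score(evals: list[tuple[float, bool]]) -> float:
    """Weighted sum of passed evals divided by the total weight of all evals."""
    total = sum(weight for weight, _ in evals)
    return sum(weight for weight, passed in evals if passed) / total

# Three evals weighted 1.0, 0.5, 0.5 (total 2.0), as in the examples above:
task_score([(1.0, True), (0.5, True), (0.5, True)])    # all pass -> 1.0
task_score([(1.0, True), (0.5, False), (0.5, False)])  # only prior auth -> 0.5
task_score([(1.0, True), (0.5, True), (0.5, False)])   # prior auth + CPT -> 0.75
```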
Run Score
The run score is the average of all task scores (equally weighted across tasks).

Verdict
The run verdict compares the run score against the benchmark’s threshold (default 0.7):
| Score | Threshold | Verdict |
|---|---|---|
| 0.85 | 0.7 | Pass |
| 0.65 | 0.7 | Fail |
| 1.0 | 0.9 | Pass |
| 0.89 | 0.9 | Fail |
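The run score and verdict can be sketched as follows (an assumption here: a score exactly equal to the threshold passes, which the table does not specify):

```python
def run_verdict(task_scores: list[float], threshold: float = 0.7) -> tuple[float, str]:
    """Average the task scores equally, then compare against the threshold."""
    score = sum(task_scores) / len(task_scores)
    return score, "Pass" if score >= threshold else "Fail"

run_verdict([0.85])       # (0.85, "Pass") at the default 0.7 threshold
run_verdict([0.89], 0.9)  # (0.89, "Fail") against a stricter threshold
```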
Eval Run Results
The LLM judge produces an eval run for each eval, including a result, score, and details. This is useful for understanding why an eval passed or failed.

Multi-Interface Evals
Evals can reference evidence across multiple simulators. The judge has access to all interaction logs from all active simulators in the run.

Tips
- Be specific about expected values. “CPT code 72148” is better than “correct CPT code.”
- Reference observable actions. Describe what should appear in logs, not what the agent should think.
- Use multiple evals per task. Break complex tasks into individual checkable outcomes.
- Weight critical outcomes higher. The primary action (submitting the prior auth) should have higher weight than secondary details (formatting).
- Test failure cases too. Include tasks where the correct behavior is to not take an action (e.g., “The agent should not submit a prior auth for an excluded service”).
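Taken together, the tips above might produce an eval set like this hypothetical one for a lumbar MRI prior-auth task (all labels, weights, and assertions are illustrative):

```python
# Hypothetical evals for a prior-auth task; every detail below is illustrative.
evals = [
    {
        "label": "prior-auth-submitted",
        "weight": 1.0,  # primary action weighted highest
        "assert": "A prior authorization request for the lumbar MRI was submitted.",
    },
    {
        "label": "cpt-code-correct",
        "weight": 0.5,  # specific expected value, not "correct CPT code"
        "assert": "The submitted request lists CPT code 72148.",
    },
    {
        "label": "no-duplicate-submission",
        "weight": 0.5,  # failure case: an action that should NOT appear in the logs
        "assert": "No second prior authorization request appears in the interaction logs.",
    },
]
```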
Next Steps
Benchmarks
Build benchmarks with tasks and evals.
Runs
Execute benchmarks and review scored results.