voice-eval-harness · live demo

Open-source eval harness for production voice agents — extracted from the CI gates we run on every PR at Trifetch (closed-source HIPAA agentic-clinic OS). Persona-driven adversarial simulators, LLM-as-judge with budget caps, KB-grounding checks, multi-turn memory + tool-sequence assertions, and read-only audit of real production calls. Click a demo button below to load a live run.

GitHub ↗

What you're looking at

Vapi GI suite

6 multi-turn, persona-driven cases run against a Bayview Endoscopy assistant, covering KB grounding, critical-phrasing, emergency triage, and a 2FA log-question flow. The cases are derived from real Trifetch GI test scenarios.

Retell prod audit

voxeval audit, run read-only against 18 real production calls from a Trifetch ENT scheduling agent (last 7 days). An LLM judge flags off-topic wander; clinic names are anonymized.

Click any case

A transcript drawer opens with the full multi-turn dialog, every assertion result (tool calls, latency, LLM-judge verdicts), and the failure narrative when the harness catches a real regression.
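To give a feel for what a tool-call or latency assertion result might contain, here is a minimal sketch in Python. It is not the harness's actual API: the `AssertionResult` shape, the function names, and the tool names (`verify_identity`, `lookup_patient`, `book_appointment`) are all hypothetical, chosen only for illustration. The tool-sequence check passes when the expected tools appear in the call log in order, with other calls allowed in between.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AssertionResult:
    """Hypothetical result record; the real report format may differ."""
    name: str
    passed: bool
    detail: str


def check_tool_sequence(called: Sequence[str], expected: Sequence[str]) -> AssertionResult:
    """Pass if `expected` occurs within `called` in order (gaps allowed)."""
    pos = 0  # index of the next expected tool we are still waiting for
    for tool in called:
        if pos < len(expected) and tool == expected[pos]:
            pos += 1
    passed = pos == len(expected)
    detail = "ok" if passed else f"never reached '{expected[pos]}' in order"
    return AssertionResult("tool_sequence", passed, detail)


def check_latency(turn_latencies_ms: Sequence[float], max_ms: float) -> AssertionResult:
    """Pass if the slowest agent turn stays within `max_ms`."""
    worst = max(turn_latencies_ms, default=0)
    detail = f"worst turn {worst} ms (limit {max_ms} ms)"
    return AssertionResult("latency", worst <= max_ms, detail)


if __name__ == "__main__":
    # Hypothetical call log from one simulated conversation.
    log = ["verify_identity", "lookup_patient", "book_appointment"]
    print(check_tool_sequence(log, ["verify_identity", "book_appointment"]))
    print(check_latency([820, 1430, 990], max_ms=1500))
```

Allowing gaps between expected tools is a design choice in this sketch: it catches an agent that books before verifying identity without failing a run merely because an extra lookup happened in between.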


