voice-eval-harness · live demo
Open-source eval harness for production voice agents — extracted from the CI gates we run on every PR at Trifetch (closed-source HIPAA agentic-clinic OS). Persona-driven adversarial simulators, LLM-as-judge with budget caps, KB-grounding checks, multi-turn memory + tool-sequence assertions, and read-only audit of real production calls. Click a demo button below to load a live run.
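To make "LLM-as-judge with budget caps" concrete, here is a minimal sketch of one way a spending cap could wrap a judge call. It is an illustration under stated assumptions, not the harness's actual API: BudgetedJudge, BudgetExceeded, judge_fn, and the cent-based accounting are all hypothetical names introduced for this example, and the judge is a stub rather than a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when another judge call would push spend past the cap."""


@dataclass
class BudgetedJudge:
    # judge_fn stands in for a real LLM call (hypothetical signature):
    # it takes a transcript and returns (verdict, cost_in_cents).
    judge_fn: Callable[[str], Tuple[str, int]]
    max_cents: int            # hard cap for the whole run
    est_cents_per_call: int   # conservative per-call estimate
    spent_cents: int = 0

    def judge(self, transcript: str) -> str:
        # Refuse *before* calling if the estimate would exceed the cap,
        # so a run fails fast instead of overspending.
        if self.spent_cents + self.est_cents_per_call > self.max_cents:
            raise BudgetExceeded(
                f"cap of {self.max_cents} cents reached "
                f"({self.spent_cents} already spent)"
            )
        verdict, cost = self.judge_fn(transcript)
        self.spent_cents += cost  # record the actual cost, not the estimate
        return verdict
```

Costs are tracked as integer cents to avoid floating-point drift when many small charges are summed. A CI gate could catch BudgetExceeded and mark the remaining cases as skipped rather than failed.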
What you're looking at
6 multi-turn persona-driven cases vs a Bayview Endoscopy assistant — KB grounding, critical-phrasing, emergency triage, and a 2FA log-question flow. Cases derived from real Trifetch GI test scenarios.
A read-only voxeval audit run against 18 real production calls from a Trifetch ENT scheduling agent (last 7 days). The LLM judge flags off-topic wandering; clinic names are anonymized.
The transcript drawer opens with the full multi-turn dialog, every assertion result (tool calls, latency, LLM-judge verdicts), and the failure narrative when the harness catches a real regression.
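As an illustration of what a tool-call assertion of this kind might check, the sketch below verifies that the expected tools were called in order while allowing unrelated calls in between. The function name assert_tool_sequence and the tool names are hypothetical, chosen for the example; the harness's real assertion interface may look quite different.

```python
from typing import Iterable, List


def assert_tool_sequence(actual: Iterable[str], expected: List[str]) -> None:
    """Pass if `expected` occurs within `actual` in the same order.

    Extra tool calls between the expected ones are tolerated; a missing
    or out-of-order call raises AssertionError.
    """
    remaining = iter(actual)
    for name in expected:
        # any() consumes the iterator up to and including the first match,
        # so each later expected call must appear after the earlier ones.
        if not any(call == name for call in remaining):
            raise AssertionError(
                f"expected tool call {name!r} not found in order"
            )


# Hypothetical tool names for a scheduling flow.
calls = ["verify_identity", "lookup_patient", "book_appointment"]
assert_tool_sequence(calls, ["verify_identity", "book_appointment"])
```

Matching an ordered subsequence rather than an exact list keeps the check robust to harmless extra lookups, while still catching the case that matters, such as booking before identity verification.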
Drop a report.json here, or load one of the demo runs:
Vapi GI suite: 6 multi-turn persona-driven cases vs a Bayview Endoscopy assistant — KB grounding, critical-phrasing, emergency triage, 2FA log-question flow.
Retell prod audit: a read-only voxeval audit run against 18 real production calls from a Trifetch ENT scheduling agent (last 7 days, clinic names anonymized).