Table of Contents
- Most voice agents ship broken. Here's why.
- What is voice agent evaluation?
- The 4-Layer Voice Agent Evaluation Framework
- Case Study: Media Brite Smile Dental
- Evaluation platform comparison
- ROI Math — 3 Scenarios
- Common objections — honest answers
- How SuperMIA handles evaluation
- Frequently asked questions
- The bottom line
Quick Answer
Voice agent evaluation measures four distinct layers — infrastructure, agent execution, user experience, and business outcomes — because failures cascade across all of them. WER under 5% on clean audio, TTFA under 800ms, and containment above 75% are the production thresholds. Pre-production testing catches 80% of issues that production monitoring misses. Both are required, not alternatives.
Most voice agents ship broken. Here's why.
Tuesday, 2 p.m. Your new voice agent has been live for six hours. The first complaint lands from the COO: a caller spent four minutes saying she wanted to cancel an appointment while the agent kept trying to sell her an upgrade. By Friday, you've pulled every metric on the dashboard. None of them flagged this call.
Most voice agents ship broken. Not because the models are weak — because the voice agent evaluation never tested what actually matters. This article walks through the four-layer pre-production testing framework that catches those failures before a single real caller hits your number.
"We shipped our voice agent after passing every internal test. First week in production, containment dropped to 40% and nobody could tell us why. We were monitoring WER and latency — neither moved. The issue was barge-in handling, which we hadn't tested at all."
TL;DR
- Voice agent evaluation measures four distinct layers — infrastructure, agent execution, user experience, and business outcomes — because failures cascade across all of them.
- Word Error Rate (WER) under 5% on clean audio and under 10% on phone audio is the production accuracy floor. Higher rates break downstream LLM reasoning.
- Time to First Audio (TTFA) under 800ms keeps conversations natural. Beyond 1,500ms, callers hang up.
- Pre-production testing catches 80% of issues that production monitoring misses. Both are required, not alternatives.
- Platforms like Hamming, Braintrust, and Future AGI cost $1K–10K/month. Building your own runs 6–12 months and $400K–$1M loaded. Most teams under 50 engineers should buy.
What is voice agent evaluation?
Voice agent evaluation is the systematic testing of an AI voice agent before and during production. It measures accuracy, latency, conversational quality, and business outcomes across the full pipeline — audio capture, speech-to-text, language model reasoning, text-to-speech, and telephony — using synthetic calls, real recordings, and automated scoring.
Key Takeaways
- WER under 5% clean and under 10% on phone audio is the production floor.
- Target TTFA under 800ms. Past 1,500ms, conversations break and callers abandon.
- Non-deterministic LLM outputs demand semantic scoring, not string-match testing.
- Production monitoring catches drift. Pre-production testing catches design flaws. You need both.
- The four-layer framework covers every failure surface a voice agent can hit.
The 4-Layer Voice Agent Evaluation Framework
A voice agent fails in one of four places. Test each layer in isolation before testing the full pipeline, because end-to-end failures cascade and mask which component actually broke.
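The fail-fast, layer-by-layer idea can be sketched as a simple gate runner. The check functions and threshold fields below are hypothetical placeholders for real per-layer test suites, using the thresholds this article recommends:

```python
# Minimal sketch of fail-fast, layer-isolated evaluation.
# The check functions are stand-ins for real per-layer test suites.

def check_infrastructure(report):  return report["mos"] >= 4.0
def check_execution(report):       return report["wer_phone"] <= 0.10
def check_user_experience(report): return report["ttfa_ms"] <= 800
def check_business(report):        return report["containment"] >= 0.75

LAYERS = [
    ("infrastructure", check_infrastructure),
    ("agent_execution", check_execution),
    ("user_experience", check_user_experience),
    ("business_outcomes", check_business),
]

def evaluate(report: dict) -> list:
    """Run layers in order; stop at the first failure so the root-cause
    component is not masked by downstream cascades."""
    passed = []
    for name, check in LAYERS:
        if not check(report):
            raise RuntimeError(f"layer failed: {name} — fix before testing deeper layers")
        passed.append(name)
    return passed
```

Stopping at the first failing layer is the point: an end-to-end run that fails on all four metrics tells you nothing about which component broke.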

Layer 1 — Infrastructure
Infrastructure is the audio path itself. Before any AI runs, your voice agent depends on packet delivery, codec quality, and jitter. These failures are invisible until a call sounds terrible — then everything else looks broken.
- Mean Opinion Score (MOS) ≥ 4.0 on TTS output
- Packet loss under 1% on telephony legs
- Jitter under 30ms end-to-end
- Audio sampling rate consistency — mismatches between PSTN (8kHz) and WebRTC (16kHz) audio degrade STT accuracy
Telephony choice matters more than most teams realize. Twilio uses 8kHz G.711 by default, which drops STT accuracy 15–20% compared to Telnyx G.722 wideband or LiveKit WebRTC at 16kHz. If you're building a phone-first agent, test on real phone recordings — not studio audio.
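Jitter is the one Layer 1 metric teams most often hand-wave. A minimal sketch of how it can be estimated from packet arrival timestamps, using the RFC 3550 running-average formula (the timestamps and 20ms packet interval are illustrative):

```python
# Sketch: estimate inter-arrival jitter (ms) from packet arrival times,
# using the RFC 3550 smoothed formula. Inputs here are illustrative.

def rfc3550_jitter(arrivals_ms, interval_ms=20.0):
    """Running-average jitter for packets expected every interval_ms."""
    jitter = 0.0
    for prev, cur in zip(arrivals_ms, arrivals_ms[1:]):
        deviation = abs((cur - prev) - interval_ms)
        jitter += (deviation - jitter) / 16  # RFC 3550 gain of 1/16
    return jitter
```

Gate releases on this staying under the 30ms threshold above; spot checks on a single call won't catch regional degradation.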
Layer 2 — Agent Execution
Execution is where the AI actually runs: STT transcription, LLM reasoning, tool calls, and TTS synthesis. Each has its own failure mode, and each needs component-level testing before end-to-end testing means anything.
STT accuracy
Target WER below 5% on clean audio and below 10% on phone audio. But test against your actual call recordings, not marketing benchmarks. A provider advertising 3% WER on studio audio regularly hits 12–15% on real contact center calls.
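WER itself is simple to compute against your own recordings: word-level edit distance (substitutions + insertions + deletions) divided by reference word count. A self-contained sketch, with illustrative example strings:

```python
# Sketch: word-level WER via a standard edit-distance DP.
# WER = (substitutions + insertions + deletions) / reference word count.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Run this over a few hundred of your own labeled phone recordings and compare against the vendor's advertised number before you commit.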
LLM reasoning quality
Does the model extract the right intent and call the right tools? Test this in isolation. Measure intent accuracy (target 95%+), tool call success (target 98%+), and response appropriateness. Hallucinations here cascade into bad tool calls that affect real outcomes.
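Scoring this in isolation only needs labeled eval cases pairing the model's predicted intent and tool call with gold labels. A minimal sketch — the field names are assumptions for illustration, not a standard schema:

```python
# Sketch: component-level LLM metrics from labeled eval cases.
# Field names ("predicted_intent", "gold_tool", ...) are illustrative.

def score_llm_layer(cases):
    intent_hits = sum(c["predicted_intent"] == c["gold_intent"] for c in cases)
    tool_cases = [c for c in cases if c.get("gold_tool")]
    tool_hits = sum(c.get("predicted_tool") == c["gold_tool"] for c in tool_cases)
    return {
        "intent_accuracy": intent_hits / len(cases),
        "tool_call_success": tool_hits / len(tool_cases) if tool_cases else None,
    }
```

Benchmark these per prompt version; a prompt tweak that lifts tone can silently drop intent accuracy below the 95% target.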
TTS quality and prompt compliance
MOS of 4.0–4.3 is current best-in-class. Run adversarial scenarios — jailbreak attempts, off-topic requests, emotional pressure. Single-turn red teaming succeeds 19.5% of the time on unhardened agents, per Cekura production data.
Layer 3 — User Experience
User experience is the perceived quality. This layer is where most vendor tooling falls short because it requires listening to audio, not just reading transcripts. Tone, timing, and interruption handling don't show up in string-match tests.
- Time to First Audio (TTFA): under 800ms. 1,000ms feels unnatural. 1,500ms causes frustration. 2,500ms leads to hangups.
- Barge-in latency: under 200ms. An agent that keeps talking over a frustrated caller destroys CSAT.
- Silence detection: long silences signal confusion. Track them.
- Repeat-query rate: callers rephrasing because the agent misunderstood them.
- Explicit frustration markers: phrases like "no, that's not what I said" or "let me talk to a person."
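The latency thresholds above should gate releases on a percentile, not an average — one slow region can hide behind a healthy mean. A minimal sketch of a p95 gate for TTFA and barge-in, with illustrative sample data:

```python
# Sketch: gate a release on p95 TTFA (<= 800ms) and p95 barge-in
# latency (<= 200ms), per the thresholds above. Nearest-rank p95.
import math

def p95(samples_ms):
    ordered = sorted(samples_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank percentile
    return ordered[idx]

def ux_gate(ttfa_samples_ms, barge_in_samples_ms):
    return p95(ttfa_samples_ms) <= 800 and p95(barge_in_samples_ms) <= 200
```

Averages of 600ms TTFA can coexist with a tail of 2,500ms hangup-inducing responses; the percentile gate catches the tail.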
Layer 4 — Business Outcomes
Technical performance is meaningless without business outcome metrics. An agent can hit perfect WER and perfect latency and still fail to accomplish anything useful.
- Containment rate: percentage handled end-to-end without human transfer. Industry avg 55–65%. Best-in-class 75–88%.
- Task completion rate: 75–80% general support; 85–95% for specialized deployments.
- CSAT: target 4.2+ on a 5-point scale.
- First Contact Resolution (FCR): AI agents hit 65–78% vs 70–75% for humans.
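All four of these fall out of a structured call log. A minimal sketch of computing them — the record field names ("transferred", "task_done", "csat") are assumptions for illustration:

```python
# Sketch: Layer 4 metrics from a call log. Each record is a dict;
# field names are illustrative, not a standard schema.

def business_metrics(calls):
    n = len(calls)
    contained = sum(not c["transferred"] for c in calls)
    completed = sum(c["task_done"] for c in calls)
    rated = [c["csat"] for c in calls if c.get("csat") is not None]
    return {
        "containment_rate": contained / n,
        "task_completion_rate": completed / n,
        "avg_csat": sum(rated) / len(rated) if rated else None,
    }
```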
See SuperMIA's voice agent platform →
80%+ containment within 30 days. Zero blind spots across the pipeline.
Case Study: Media Brite Smile Dental
Media Brite Smile Dental, a three-location dental practice in Philadelphia, deployed SuperMIA's voice agent in September 2025 after running the 4-layer evaluation framework across their use case. Before and after metrics:

| Metric | Before (manual) | After (AI voice + 4-layer eval) | Change |
|---|---|---|---|
| Avg patient call response time | 47 seconds | 27 seconds | 43% faster |
| Weekday appointment slot fill rate | 74% | 94% | +20 points |
| Missed-call recovery rate | 12% | 87% | +75 points |
| Monthly revenue (3 locations) | Baseline | 57% higher | +57% |
| Staff front-desk hours freed | — | 28 hrs/week | Repurposed to care |

The key: they didn't just monitor WER and latency. They ran the full 4-layer framework pre-production, caught three critical issues in Layer 3 (barge-in latency, silence detection, frustration marker handling) that would have tanked CSAT, and shipped with evidence instead of hope.
Evaluation platform comparison
If you're buying an eval harness instead of building one, here's how the main platforms stack up for voice agent evaluation specifically:

| Platform | Strength | Weakness | Price |
|---|---|---|---|
| Hamming | Voice-native synthetic calls; 4-layer coverage | Vendor-locked scoring | $2–10K/mo |
| Braintrust | Strong eval primitives; multi-modal | Less voice-specific tooling | $1.5–8K/mo |
| Future AGI | 5-layer eval stack; detailed docs | Newer, smaller ecosystem | $1–5K/mo |
| Cekura | Red-teaming + adversarial scenarios | Limited standard regression | $2–6K/mo |
| Maxim AI | Simulation-first; 1,000+ scenario gen | Thin on production monitoring | $1–4K/mo |
| DIY (in-house) | Full control; custom scoring | 6–12 months build; $400K–1M loaded | $0 license |
ROI Math — 3 Scenarios
Voice agent evaluation has a cost. Skipping it has a bigger one. Here's what the math looks like across three realistic deployment scenarios:
Scenario A — SMB Clinic (200 calls/day, 1 location)
- Pre-eval cost (platform + 2 weeks engineering): $8,000 one-time
- Annual eval platform: $18,000
- Prevented failures (5% containment drop caught pre-launch): $92,000/yr revenue preserved
- Net year-1 ROI: $66,000 (~2.5x return)
Scenario B — Mid-Market Contact Center (2,000 calls/day, 5 seats)
- Pre-eval cost (full 4-layer build-out): $45,000 one-time
- Annual eval platform: $60,000
- Prevented failures (1 month of low containment avoided): $380,000 revenue preserved
- Net year-1 ROI: $275,000 (~2.6x return)
Scenario C — Enterprise Deployment (20,000 calls/day, 40+ seats)
- Pre-eval cost (dedicated eval team + tooling): $240,000 one-time
- Annual eval platform + ML engineers: $680,000
- Prevented failures (drift + regression catch): $2.8M revenue preserved + $450K labor
- Net year-1 ROI: $2.33M (~2.5x return)
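The arithmetic behind all three scenarios is the same: net ROI is value preserved minus total year-1 evaluation cost (one-time plus annual). A sketch showing both the net and gross return multiples so you can run your own numbers:

```python
# Sketch: the year-1 ROI arithmetic behind the scenarios above.
# net = preserved value - (one-time + annual) eval cost;
# multiples are net-over-cost and gross (preserved-over-cost).

def year1_roi(one_time, annual, preserved):
    cost = one_time + annual
    net = preserved - cost
    return {
        "net": net,
        "net_multiple": round(net / cost, 1),
        "gross_multiple": round(preserved / cost, 1),
    }
```

For Scenario B: year1_roi(45_000, 60_000, 380_000) gives a net of $275,000 and a net multiple of roughly 2.6x.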
Common objections — honest answers
"We already have a QA team — why do we need voice agent evaluation?"
Your QA team is great at deterministic testing. Voice agents aren't deterministic. The same input can produce three different outputs, and all three can be correct. String-match testing breaks. You need semantic scoring, synthetic call generation, and audio-native evaluators that QA tooling doesn't ship with.
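To make the string-match failure concrete: here is a minimal sketch using token-coverage overlap as a crude stand-in for semantic scoring (a production harness would use an embedding model or an LLM judge instead; the function and threshold are illustrative):

```python
# Sketch: why exact-match assertions break for non-deterministic agents.
# covers_expected() is a crude token-coverage stand-in for real
# embedding-based or LLM-judge semantic scoring.

def covers_expected(expected: str, actual: str, threshold: float = 0.7) -> bool:
    """Does the response contain most of the expected semantic content?"""
    exp = set(expected.lower().split())
    act = set(actual.lower().split())
    return len(exp & act) / len(exp) >= threshold

expected = "your appointment is cancelled"
v1 = "okay, your appointment is cancelled"
v2 = "done - I have cancelled your appointment"

# Exact-match QA rejects both valid phrasings:
assert expected != v1 and expected != v2
# Semantic-style scoring accepts both:
assert covers_expected(expected, v1) and covers_expected(expected, v2)
```

Both responses are correct; only a scorer that looks at meaning rather than exact strings treats them that way.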
"Can't we just monitor production and fix problems as they appear?"
Production monitoring catches drift. It does not catch design flaws — those are baked in before day one. The COO-complaint scenario at the top of this article is a design-flaw failure: the agent was doing exactly what it was trained to do, just on the wrong intent. Monitoring didn't help because monitoring measures what the agent did, not what it should have done.
"Eval platforms cost more than our agent development budget."
At high volume, yes — a $60K/year eval platform feels heavy when your agent build was $40K. But the alternative is shipping with 10% lower containment than you thought, which translates to real revenue loss within the first quarter. The ROI math above is consistent across every scenario: evaluation pays for itself inside 12 months.
"Our vendor already tests their platform. Isn't that enough?"
Vendor testing covers the platform, not your deployment. Your prompts, your tools, your integration with CRM/EMR/billing systems — none of that is in their test suite. You need a layer-specific test of everything your team built on top of the platform.
One evaluation mistake that will undo all of this: Don't skip Layer 3 testing because your Layer 2 metrics look clean. An agent can hit perfect WER and perfect latency and still destroy CSAT through barge-in failures, unnatural silence handling, and frustration escalation loops that never appear in transcript-only testing. Listen to the audio.
How SuperMIA handles evaluation
SuperMIA runs every voice agent through the four-layer framework before production. The internal eval harness covers:
- Layer 1 — continuous MOS scoring on every deployment, with jitter and loss tracked per geographic region.
- Layer 2 — WER tested against 10,000+ labeled call recordings across verticals; LLM intent accuracy benchmarked per prompt version.
- Layer 3 — barge-in latency and TTFA measured on every release; production sentiment tracked with real-time alerts.
- Layer 4 — containment and task completion dashboards per customer deployment, with anomaly detection on baseline drift.
Typical SuperMIA deployments hit 80%+ containment within 30 days of production launch, with zero blind spots across the pipeline. See the full approach in our AI Voice Agents: 2026 Platform Guide.
See SuperMIA's voice agent platform →
Frequently asked questions
The bottom line
Most voice agents ship broken because evaluation treats voice like chat. It isn't. Non-deterministic outputs, acoustic variability, real-time latency budgets, and emotional conversations all demand a testing approach built for voice specifically. The four-layer framework covers every failure surface. Build it into CI/CD. Run it before every deployment. The teams that evaluate rigorously don't deploy with confidence — they deploy with evidence.
Skip 6 months of building an eval harness.
See SuperMIA's voice agent platform with built-in 4-layer evaluation. Live in 30 days, not 6 months.
Book a 15-minute demo →
Harikrishna Patel
Harikrishna Patel is the founder of MIA – My Intelligent Assistant, the AI automation platform built under Botfinity Inc. in Dallas, Texas. With 15+ years in software engineering, AI/ML, and enterprise solution design, he focuses on creating practical, scalable AI tools that help businesses automate support, workflows, and operations through voice and chat.
