19 April 2026

Voice AI

AI Voice Agents with VAPI and Twilio: The Build Playbook

How to build production AI voice agents for Australian businesses on VAPI and Twilio. Architecture, cost model, inbound and outbound patterns, governance, from an operator who ships them.

Analysis by Amjid Ali.

AI voice agents moved from “interesting demo” to “pays for itself in the first month” in 2025. By 2026 the question is no longer whether voice agents work; it is whether you build them well or badly. Built well, a voice agent recovers missed revenue, runs 24/7, and costs cents per call. Built badly, it frustrates callers, leaks data, and teaches your customers to distrust the phone number you just gave them.

This is the playbook I use when a client asks me to build a voice agent for their business. It covers what a voice agent actually is, why VAPI and Twilio together remove most of the engineering, the architecture patterns that hold up under real traffic, the cost model that decides the conversation with the CFO, and the evaluation checklist that keeps you from shipping a bad one.

If you want a productised engagement around this, I run it as a service: see AI Voice Agents, Inbound & Outbound Calls. This essay is the operator view: what to worry about, in what order, and why.

What a voice agent actually is

Four components running in real time over a phone line.

  1. Automatic speech recognition (ASR). The caller speaks, you turn the audio into text. Deepgram Nova-3 and AssemblyAI are the current pragmatic choices for production. Whisper works if you are self-hosting and can absorb the latency cost.
  2. A large language model. Reasons about what the caller said, decides the reply, chooses whether to call a tool (check calendar, write a CRM record, transfer to a human). Claude, GPT, Gemini, and a handful of strong open models all work. Latency matters more than leaderboard rank at this stage of the loop.
  3. Text-to-speech (TTS). Turns the reply back into natural voice. ElevenLabs and Cartesia are the two I reach for. Australian-accent voices exist and matter for brand fit.
  4. Telephony. Carries all of it over the actual phone network. Twilio is the default for Australian numbers, SIP trunking, and international reach. Plivo, Telnyx, and Zoom Phone are the serious alternatives, but Twilio’s AU coverage and ecosystem still win for most builds.

The whole loop, from the caller finishing a sentence to the agent starting a reply, needs to land inside about a second. Below 800ms feels human. Above 1.5 seconds feels awkward. Above 3 seconds the caller hangs up.
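To make the budget concrete, here is a minimal Python sketch that sums per-stage latencies and labels the total against the thresholds above. The stage names and millisecond figures are illustrative assumptions, not measurements, and the text does not name the band between 800ms and 1.5 seconds, so the sketch calls it "acceptable".

```python
def classify_latency(total_ms: float) -> str:
    """Label the gap between end of caller speech and start of agent reply."""
    if total_ms < 800:
        return "human"
    if total_ms <= 1500:
        return "acceptable"  # assumption: the text leaves this band unnamed
    if total_ms <= 3000:
        return "awkward"
    return "hangup-risk"


# Hypothetical per-stage budget in milliseconds (illustrative, not measured).
budget_ms = {
    "endpointing": 200,
    "asr_final": 150,
    "llm_first_token": 300,
    "tts_first_audio": 120,
    "telephony": 80,
}

total = sum(budget_ms.values())
print(total, classify_latency(total))
```

Writing the budget down per stage is mostly useful for spotting which component to swap when the total drifts past a second.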

Why VAPI, and why Twilio

Stitching those four components together yourself is possible. I have done it. It eats two to three weeks of engineering on turn-taking, interruption handling, endpointing, barge-in, noise suppression, and latency tuning before you even touch the business logic. Every one of those is a hard problem with a hundred edge cases.

VAPI does that part. It is an orchestrator that sits between the telephony layer and the ASR/LLM/TTS trio, handles the real-time mess, and gives you a clean interface to the parts you actually want to customise: the prompt, the tools, the voice, the handoff rules. Sub-second end-to-end is the default, not something you tune for.

Twilio is the telephony layer: Australian inbound and outbound numbers, SIP trunking for office PBX integration, DTMF for menu fallbacks, warm transfer to humans, call recording for compliance. For an Australian-facing business, Twilio AU numbers plus a Sydney region for your VAPI tenant is the combination that keeps latency low and data sovereignty clean.

Could you build the same thing on LiveKit Agents plus Pipecat and run it on your own infrastructure? Yes, and for some engagements (regulated industries, sovereign-region requirements, very high call volume) that is the right answer. For the other 90% of Australian businesses, VAPI plus Twilio is the faster path: it ships in three to four weeks, where a self-hosted build runs six to ten.

The ten decisions that shape your voice agent

Before anyone writes a prompt, the answers to these drive everything else. Nail them in week one.

  1. Inbound, outbound, or both? Drives the architecture, the prompt style, the escalation logic, and the cost model. Most builds start with one and add the other later.
  2. Which call types does the agent actually handle? Reception, booking, FAQ, qualification, reminders, payment follow-up, support triage? Pick the one with highest volume times highest value times cleanest fit. That is v1.
  3. Voice and accent. Australian accent, neutral, or specific gender and tone? Record the current front-desk person’s style as the reference so the agent sounds like the brand, not the platform.
  4. Tools the agent can call. Check calendar, write CRM record, send SMS, look up order status, create support ticket, transfer to human. Each tool is a typed function call. Start minimal, add as they prove out.
  5. Escalation rules. When does the agent hand to a human? Frustrated caller, explicit request, high-value lead, out-of-scope question, compliance boundary (medical advice, legal advice, payment disputes). These are not optional: the agent must know its limits and hand over cleanly.
  6. CRM and calendar. Which ones, and how are credentials handled? Service account with scoped permissions, not a personal token.
  7. Recording and consent. Recording is valuable for quality review but subject to the AU Privacy Act and state-level consent laws. Decide up front: opt-in announcement at the start of the call, or no recording of caller audio at all. Do not improvise this one.
  8. After-hours behaviour. Does the agent run 24/7, or only outside business hours, or only during an overflow spike? Affects cost and the prompt’s awareness of time.
  9. Failure handling. What happens when the LLM times out, the TTS API is down, or the caller is on a noisy mobile connection? Always have a graceful fallback (take a callback number, promise a human follow-up, or transfer to voicemail).
  10. Ownership after handover. Who tunes the prompt, reviews calls, updates intents? Your team with our support, fully managed by us, or a mix? Affects the service shape and the ongoing cost.

Most failed voice-agent engagements I see skipped decision 5 (escalation), 7 (consent), or 9 (failure). They shipped on the happy path and discovered the edges in production, on live callers.
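Decision 4 says each tool is a typed function call. As a rough illustration, the sketch below defines one tool in the JSON-Schema style that most LLM function-calling APIs accept, plus a tiny argument check. The tool name, fields, and validator are hypothetical, and this is not VAPI's exact configuration format.

```python
import json

# Hypothetical tool definition in the JSON-Schema style used by most
# LLM function-calling APIs. Field names here are illustrative only.
CHECK_CALENDAR_TOOL = {
    "name": "check_calendar",
    "description": "Return free appointment slots for a service on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "service": {"type": "string"},
            "date": {"type": "string", "description": "ISO date, e.g. 2026-04-19"},
        },
        "required": ["service", "date"],
    },
}


def validate_args(tool: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    schema = tool["parameters"]
    problems = [f"missing: {k}" for k in schema["required"] if k not in args]
    problems += [f"unexpected: {k}" for k in args if k not in schema["properties"]]
    return problems


print(json.dumps(CHECK_CALENDAR_TOOL, indent=2))
print(validate_args(CHECK_CALENDAR_TOOL, {"service": "checkup"}))
```

Checking arguments before the tool runs is one cheap way to keep a malformed model output from reaching the calendar or CRM.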

Inbound architecture: the reference build

For an inbound agent, the flow is:

  1. Caller dials the Twilio AU number. SIP routes to VAPI.
  2. VAPI picks up. Plays a short opening line (brand-matched, includes the consent line if recording).
  3. Caller speaks. Deepgram transcribes in real time.
  4. VAPI sends the transcript + system prompt + conversation history to the LLM.
  5. LLM reasons. It either:
    • replies directly (FAQ answer),
    • calls a tool (check calendar, look up order),
    • asks a clarifying question,
    • triggers an escalation (transfer to a human, take a message).
  6. Tool call resolves. Calendar returns slots, CRM returns the caller’s record, etc.
  7. LLM integrates the result and picks the reply.
  8. ElevenLabs speaks the reply. The caller hears a natural voice response.
  9. Loop, until the caller’s goal is met or escalation fires.

Around this core loop you build: the opening line, the consent handling, the handoff rules, the post-call write-back to CRM, the SMS or email confirmation, the audit trail.
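The core loop can be sketched in a few lines of Python. This is a simulation under stated assumptions: `decide` stands in for the LLM, the `utterances` list stands in for streaming ASR, and `speak` stands in for TTS. None of these names come from VAPI or Twilio, and the platform runs the real loop for you.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    kind: str            # "reply", "tool", "clarify", or "escalate"
    text: str = ""
    tool: str = ""
    args: dict = field(default_factory=dict)


def run_call(utterances, decide: Callable, tools: dict, speak: Callable) -> str:
    """Drive one call: transcript in, decision, optional tool call, speech out."""
    history = []
    for heard in utterances:                 # stands in for streaming ASR output
        history.append(("caller", heard))
        turn = decide(history)
        if turn.kind == "tool":              # resolve the tool, then decide again
            history.append(("tool", tools[turn.tool](**turn.args)))
            turn = decide(history)
        speak(turn.text)                     # stands in for TTS
        if turn.kind == "escalate":
            return "escalated"
        history.append(("agent", turn.text))
    return "completed"


def fake_decide(history):
    """Keyword stub in place of a real LLM call."""
    role, content = history[-1]
    if role == "tool":
        return Turn("reply", f"I have {content[0]} free. Shall I book it?")
    if "book" in content:
        return Turn("tool", tool="check_calendar", args={"date": "2026-04-20"})
    if "human" in content:
        return Turn("escalate", "Transferring you to the team now.")
    return Turn("reply", "Happy to help. What do you need?")


spoken = []
outcome = run_call(
    ["hi", "I want to book a visit", "let me talk to a human"],
    fake_decide,
    {"check_calendar": lambda date: ["10:30"]},
    spoken.append,
)
print(outcome, spoken)
```

Keeping the tool result in the history before the second decision is what lets the reply quote a real slot rather than an invented one.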

The bits that catch people out:

  • Interruption handling. The caller will talk over the agent. VAPI handles barge-in by default, but you have to tune how eagerly the agent yields. Too eager and the agent never finishes a sentence. Too slow and it feels rude.
  • Endpointing. Deciding when the caller has finished speaking. Silence detection is naive for Australian callers with natural pauses. Semantic endpointing (the LLM decides based on content whether the caller is done) is the pattern that actually works.
  • Ambiguous intents. “I want to book something” maps to at least three different tools in a clinic. Build the agent to clarify with one question, not guess and be wrong.
  • Identity verification. For anything beyond booking, you need to know who is calling. Name plus date of birth is fine for most use cases. For regulated contexts, integrate a proper identity provider.
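To show the shape of the endpointing trade-off, here is a small Python sketch that combines a silence timer with a completeness check. The `looks_complete` heuristic is a crude stand-in for the semantic judgement described above (a real build would ask a model), and the filler list and millisecond thresholds are assumptions to tune, not recommended values.

```python
TRAILING_FILLERS = ("um", "uh", "and", "but", "so", "because", "or")


def looks_complete(transcript: str) -> bool:
    """Crude stand-in for a semantic check; a real build would ask a model."""
    words = transcript.lower().rstrip(".?! ").split()
    return bool(words) and words[-1] not in TRAILING_FILLERS


def should_respond(transcript: str, silence_ms: int,
                   short_ms: int = 400, long_ms: int = 1500) -> bool:
    """End the caller's turn early if the utterance looks finished,
    otherwise wait for a longer silence before replying."""
    if silence_ms >= long_ms:
        return True
    return silence_ms >= short_ms and looks_complete(transcript)


print(should_respond("I'd like to book for Tuesday", 500))
print(should_respond("I'd like to book for, um", 500))
```

The point of the two thresholds is that a caller who trails off mid-thought gets more room than one who has clearly finished a sentence.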

Outbound architecture: the reference build

Outbound is inverted. The agent calls out to a list from your CRM, tries to connect, and runs a scripted-but-flexible conversation.

  1. Source list from CRM with call-time windows (never outside business hours, never on public holidays, AU-specific rules).
  2. VAPI initiates the call via Twilio AU outbound.
  3. Twilio dials. Three outcomes: connects to a human, goes to voicemail, fails (no answer, busy, disconnected).
  4. If human answers: identity confirmation, then the scripted conversation. Tools and escalation work the same as inbound.
  5. If voicemail: either hang up silently (compliant and the norm for high-value outbound) or leave a pre-recorded message (lower-value, higher-volume, check your consent posture first).
  6. If failed: disposition the record, schedule a retry per your policy, or mark dead.
  7. Every outcome writes back to the CRM with a full transcript, disposition, next action, and audit trail.
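Steps 1, 3, 5, and 6 reduce to two small decisions: may we dial now, and what happens after each outcome. The Python sketch below is one way to express them. The window hours, the weekend rule, the holiday entry, the retry limit, and the choice to hang up silently on voicemail are all illustrative assumptions; set them from your own policy and the rules that apply to your campaign.

```python
from datetime import datetime, date, time

# Hypothetical policy values; check the rules that apply to your campaign.
WINDOW_START, WINDOW_END = time(9, 0), time(17, 0)
PUBLIC_HOLIDAYS = {date(2026, 1, 26)}  # example entry only
MAX_ATTEMPTS = 3


def may_call(now: datetime) -> bool:
    """True only on weekdays, inside the window, and not on a listed holiday."""
    if now.weekday() >= 5 or now.date() in PUBLIC_HOLIDAYS:
        return False
    return WINDOW_START <= now.time() < WINDOW_END


def next_action(outcome: str, attempts: int) -> str:
    """Map a dial outcome to the next action written back to the CRM."""
    if outcome == "human":
        return "run_conversation"
    if outcome == "voicemail":
        return "hang_up_silently"  # assumption: the high-value posture
    if outcome in {"no_answer", "busy"}:
        return "retry_later" if attempts < MAX_ATTEMPTS else "mark_dead"
    return "mark_dead"  # disconnected or unknown


print(may_call(datetime(2026, 4, 20, 10, 0)), next_action("busy", 1))
```

`now` is expected in the callee's local time; handling time zones across states is left out of this sketch.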

Outbound is where the governance load goes up. The Australian Do Not Call Register applies to telemarketing calls placed by AI voice agents, and the Spam Act 2003 applies to the SMS and email follow-ups they send. If you are calling consumers, your list hygiene and consent posture have to be clean, or you are buying a fine, not a campaign.

Guardrails that keep the agent out of trouble

Voice agents are generous with mistakes. They will invent appointment slots that are not available, quote prices that are wrong, promise things you cannot deliver, and agree with frustrated callers in ways that create legal exposure. Guardrails are not optional.

Hard-coded tool boundaries. The agent can check availability, it cannot invent it. The agent can look up a price, it cannot discount it. Any commitment the agent makes should come from a tool call, not from the LLM’s imagination.
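A minimal Python sketch of that boundary, assuming a hypothetical calendar tool that returns a list of free slots: the agent confirms only what the tool returned, offers alternatives if there are any, and otherwise escalates.

```python
def confirm_booking(requested_slot: str, available_slots: list[str]) -> dict:
    """Only commit to a slot that the calendar tool actually returned."""
    if requested_slot in available_slots:
        return {"action": "confirm", "slot": requested_slot}
    if available_slots:
        return {"action": "offer_alternatives", "slots": available_slots[:3]}
    return {"action": "escalate", "reason": "no availability from calendar tool"}


print(confirm_booking("10:30", ["09:00", "10:30"]))
print(confirm_booking("11:00", ["09:00", "10:30"]))
print(confirm_booking("11:00", []))
```

The same pattern applies to prices: the reply is assembled from the tool result, so there is nothing for the model to make up.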

Confidence-gated responses. For any claim with business consequence (medical advice, legal advice, pricing, policy), the agent must either cite a tool result or refuse and escalate. “I think” is banned.

Profanity and escalation triggers. Caller frustration markers (raised voice, repeated requests, specific phrases) route to a human immediately. The agent does not try to de-escalate a real problem; that is what humans are for.
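Two of those markers, specific phrases and repeated requests, can be checked on the transcript alone. The Python sketch below shows one way to do it. The patterns and the repeat limit are hypothetical and should be tuned from real calls, and raised voice needs audio features, which this text-only sketch does not cover.

```python
import re

# Hypothetical trigger phrases; tune these from real call transcripts.
ESCALATION_PATTERNS = [
    r"\b(speak|talk) to (a |an )?(human|person|manager|someone)\b",
    r"\bthis is (ridiculous|useless)\b",
    r"\bcomplain(t)?\b",
]


def should_escalate(caller_turns: list[str], max_repeats: int = 2) -> bool:
    """Escalate on a trigger phrase or when the caller keeps repeating a request.

    Assumes caller_turns holds at least one utterance.
    """
    latest = caller_turns[-1].lower()
    if any(re.search(p, latest) for p in ESCALATION_PATTERNS):
        return True
    normalised = [t.lower().strip(" .!?") for t in caller_turns]
    return normalised.count(normalised[-1]) > max_repeats


print(should_escalate(["I want to talk to a person"]))
print(should_escalate(["where is my order", "Where is my order?", "where is my order!"]))
```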

Bounded scope. A reception agent answers reception questions. It does not offer medical advice. It does not negotiate contracts. If asked, it says “that is something our team handles directly” and offers a callback. Scope is a feature.

Safe-completion prompts. Every reply respects a few hard rules at the system-prompt level: never promise a timeline the business has not agreed to, never claim to be human if asked directly, never share another customer’s data.

The red-team week exists to find every corner where these break. Budget for it.

The cost model that closes the sale

Voice agent costs are dominated by per-minute charges. Your cost stack, per minute, roughly:

  • Twilio AU inbound: A$0.02-0.04 / minute (landline or mobile).
  • Twilio AU outbound: A$0.06-0.12 / minute (mobile higher, varies by carrier).
  • VAPI (ASR + LLM + TTS orchestration): US$0.05-0.10 / minute, depending on the LLM and TTS model.
  • LLM call tokens: usually included in VAPI’s price; verify per-provider.

Add that up: a typical 3-minute answered inbound call lands at A$0.20-0.40 all-in. A 2-minute outbound qualification call sits at A$0.25-0.60.
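The arithmetic is easy to rerun with your own rates. The Python sketch below adds the Twilio and VAPI per-minute ranges from the list above. Because the VAPI range is quoted in US dollars, the exchange rate is an explicit parameter; the 1.5 used in the demo is an illustrative assumption, not a quoted rate. The results will not match the quoted ranges exactly, since those are rounded estimates.

```python
# (low, high) per-minute rates copied from the list above.
TWILIO_AUD = {"inbound": (0.02, 0.04), "outbound": (0.06, 0.12)}
VAPI_USD = (0.05, 0.10)


def call_cost_aud(direction: str, minutes: float, aud_per_usd: float) -> tuple[float, float]:
    """Return the (low, high) all-in cost of one call in AUD."""
    tel_lo, tel_hi = TWILIO_AUD[direction]
    lo = (tel_lo + VAPI_USD[0] * aud_per_usd) * minutes
    hi = (tel_hi + VAPI_USD[1] * aud_per_usd) * minutes
    return lo, hi


for rate in (1.0, 1.5):  # 1.5 is an illustrative exchange rate, not a quote
    lo, hi = call_cost_aud("inbound", 3, rate)
    print(f"inbound, 3 min, {rate} AUD per USD: A${lo:.2f}-{hi:.2f}")
    lo, hi = call_cost_aud("outbound", 2, rate)
    print(f"outbound, 2 min, {rate} AUD per USD: A${lo:.2f}-{hi:.2f}")
```

Either way the per-call figure stays well under a dollar, which is the number that matters for the comparison that follows.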

Now the maths that decides the engagement:

  • A plumbing business missing five calls a week at A$800 average job: roughly A$200,000 a year lost. A 24/7 voice agent at A$400/month (A$4,800 a year) is covered by six recovered jobs a year.
  • A dental clinic missing fifteen calls a week at A$180 per appointment: A$140,000 a year lost. Same economics.
  • An outbound lead qualification campaign replacing two junior SDRs at A$75,000 loaded each: A$150,000 of labour against a few thousand a month of agent minutes. Voice agents are quietly one of the most brutal cost plays in the AI stack right now.

The maths is one-sided enough that the conversation with the CFO is usually not about whether to build, it is about which use case to build first. Discovery exists to answer that.
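For discovery, the same sums fit in a few lines of Python. This sketch assumes a 52-week year and treats every missed call as lost revenue, which is an upper bound; the function names are mine.

```python
WEEKS_PER_YEAR = 52


def annual_missed_revenue(missed_per_week: int, value_per_call: float) -> float:
    """Upper-bound revenue lost per year, assuming every missed call would convert."""
    return missed_per_week * value_per_call * WEEKS_PER_YEAR


def calls_to_cover(monthly_agent_cost: float, value_per_call: float) -> float:
    """Recovered calls per month needed to pay for the agent."""
    return monthly_agent_cost / value_per_call


print(annual_missed_revenue(5, 800))    # plumbing example: 208000
print(annual_missed_revenue(15, 180))   # dental example: 140400
print(calls_to_cover(400, 800))         # 0.5 recovered jobs per month
```

Swap in a realistic conversion rate for missed calls and the totals shrink, but the ratio to the agent's running cost stays lopsided.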

Four architectures we see in the wild

A. Single-purpose inbound agent

One number, one job (reception, booking, after-hours). Simplest to build, simplest to run, quickest to payback. Start here unless you have a specific reason not to.

B. Multi-intent inbound agent

One number handling many call types (reception + booking + support triage + FAQ). More prompt engineering, more tools, more testing. The right choice when your call volume justifies it and your categories are clean.

C. Inbound + outbound paired agent

Same brand, two flows. Inbound catches every call. Outbound runs qualification and reminders off the same CRM the inbound agent writes to. Shared infrastructure, shared voice, shared analytics.

D. Call centre augmentation

The agent handles tier-1 on all calls and transfers anything complex to human agents. Humans are freed up for high-value work. Volume handled per human goes up 3-5x. This is where mid-market call centres are landing in 2026.

Pick the architecture that matches your traffic shape and your team. Do not build D if your volume only justifies A.

Evaluation checklist: before you call it done

If an inbound voice agent is about to go live, walk this list. Missing any of these is a blocker, not a phase-two item.

  • Opening line announces brand, purpose, and consent (if recording) in under 6 seconds.
  • End-to-end latency under 1 second on a stable connection.
  • Interruption handling tested with deliberately talkative callers.
  • Noisy-background calls tested (traffic, office, construction, baby).
  • Accent coverage tested (Australian, British, American, Indian, at least).
  • Ambiguous requests tested (“I need an appointment” in a multi-service context).
  • Escalation triggers fire reliably (frustration, explicit request, out-of-scope, high-value).
  • Warm-transfer to a human works with a coherent one-sentence handoff.
  • Tool failures handled gracefully (calendar down, CRM timeout).
  • Post-call write-back to CRM is complete, typed, and contains the transcript.
  • Cost per call reported per intent, not just in aggregate.
  • Red-team pass: no PII leakage across calls, no hallucinated prices, no agreeing-with-frustration that creates exposure.
  • Runbook exists and an on-call engineer who is not you has read it.
  • Staging number exists and is separate from production; no test calls ever land on production data.
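The "cost per call reported per intent" item is simple to automate once the post-call write-back is typed. Here is a minimal Python sketch; the record fields and sample values are hypothetical stand-ins for whatever your CRM write-back contains.

```python
from collections import defaultdict

# Hypothetical call records; in practice these come from the post-call write-back.
calls = [
    {"intent": "booking", "minutes": 3.0, "cost_aud": 0.33},
    {"intent": "booking", "minutes": 2.0, "cost_aud": 0.22},
    {"intent": "faq", "minutes": 1.0, "cost_aud": 0.11},
]


def cost_per_intent(records: list[dict]) -> dict[str, dict]:
    """Aggregate call count, total cost, and average cost per call by intent."""
    totals = defaultdict(lambda: {"calls": 0, "cost_aud": 0.0})
    for r in records:
        bucket = totals[r["intent"]]
        bucket["calls"] += 1
        bucket["cost_aud"] += r["cost_aud"]
    return {
        intent: {**b, "avg_aud": round(b["cost_aud"] / b["calls"], 4)}
        for intent, b in totals.items()
    }


report = cost_per_intent(calls)
for intent, row in report.items():
    print(intent, row)
```

Splitting by intent is what shows you that, say, support triage costs three times what booking does, which an aggregate number hides.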

When to build, when to buy, when to wait

Three options for every voice-agent need.

Build custom (VAPI + Twilio, or LiveKit self-hosted). You own the prompt, the voice, the data, the cost model. Pick this when the agent needs to sound like your business, integrate deep with your systems, or run under your compliance posture.

Buy a SaaS receptionist (Sophiie, Hooroo, Advancer, Dialpad, and similar). Faster to turn on, less flexible, costs a monthly per-seat fee, plus minutes, with a markup baked in. Your data lives in their tenant. Pick this when your needs are generic, you have no engineering capacity, and you accept the markup.

Wait. The voice-agent stack is improving monthly. TTS models are getting better, latency is dropping, multilingual is improving. Pick this when your use case is genuinely non-urgent, which, for most Australian businesses with a missed-call problem, it isn’t.

I usually recommend build custom, because the gap between generic SaaS and brand-matched, integrated voice agents is already meaningful and widening. The build cost amortises fast against the minute cost, and the data stays yours.

What to do next

If you are starting from zero: pick one call type (the one that hurts most). Shadow a day of it. Count the volume and estimate the cost of missed or badly handled calls. If that cost is bigger than a few hundred dollars a month, a voice agent pays back inside the first month.

If you have a voice agent running: walk the evaluation checklist above. Most production problems are caught and fixed cheaply on the checklist, or expensively after a bad call lands as a customer complaint.

If you are choosing between build and buy: if your call type is boring and generic, buy. If your brand and integrations matter, build. The break-even is around 2,000 minutes a month: below that, buy; above it, build.


Scoping a voice agent for your business? I run AI Voice Agent engagements from a 2-week discovery through to production build and managed operations. Or read the companion piece How We Deployed 55 AI Agents in Production for how voice agents fit into the broader agent stack.

Disclosure: the link to VAPI is a referral link. If you sign up through it I may earn a commission at no extra cost to you. I only recommend tools I actually ship with.

Frequently asked.

How do you build a production AI voice agent with VAPI and Twilio?
Four steps: 1) Provision an Australian Twilio number and attach it to VAPI. 2) Build the assistant prompt in VAPI with tools (CRM lookup, booking API, knowledge base). 3) Wire webhooks for call events into your system of record. 4) Ship with human handoff, logging, and a measured goal (missed-call recovery rate, conversion, first-call resolution). A focused build is typically 4 weeks end-to-end.

What does it cost to run an AI voice agent on VAPI + Twilio in Australia?
Roughly A$0.10–0.20 per minute all-in (VAPI platform + Twilio minute + LLM + TTS/STT). A 5-minute average call costs A$0.50–1.00. For a line handling 1,000 calls per month at a 5-minute average, the monthly run cost is A$500–1,000. Compare that to A$4,000–6,000 for a human receptionist seat; payback is typically under a month.

Are AI voice agents legal and compliant to run in Australia?
Yes, with care. Key obligations: disclose at the start of the call that the caller is speaking with an AI, comply with the Privacy Act 1988 for any captured data, follow the Scams Prevention Framework once you are a designated sector entity (1 July 2026), and check the Do Not Call Register for outbound campaigns. Build consent and disclosure into the call flow; do not retrofit them.
