OpenClaw Voice Interaction: Competitive Analysis

Executive Summary

For Discord voice, use discord-voice (avatarneil) with Deepgram streaming STT and ElevenLabs turbo TTS (~$0.03-0.05/min, 1.5-3.5s latency). The critical blocker is Discord's mandatory DAVE end-to-end encryption rolled out March 1, 2026, which breaks most existing bots; confirm issue #26108 is resolved before deploying. For phone calls, start with ClawdTalk (free tier, no public URL needed, <200ms voice loop). Skip ElevenLabs Agents (most expensive, least integrated).

Recommendations

Discord Voice (vinny/wendy/david in voice channels)

Use discord-voice (avatarneil): most complete, most flexible, active development. Run three separate OpenClaw instances with different Discord bot tokens. Address DAVE encryption status before deploying; check if the latest codebase has resolved #26108.

Best provider combo for latency + cost: Deepgram streaming STT (nova-2) + ElevenLabs turbo TTS. ~$0.03-0.05/min. Best combo for zero cost: local-whisper + Kokoro. No API keys needed, quality lower, latency higher.

Phone Calls (call your OpenClaw from your cell)

ClawdTalk has the best UX (no public URL, outbound WebSocket only, async tool calls keep conversation flowing). Start free. If it grows limiting, DeepClaw gives more control with better latency fundamentals (Deepgram Flux semantic turn detection vs VAD).

Avoid ElevenLabs Agents unless you're already paying for their Conversational AI tier.

Building Custom

The DeepClaw architecture (Deepgram Voice Agent API as a single-WebSocket voice pipeline + local OpenClaw via chat completions) is the cleanest reference. The main blocker is OpenClaw's chatCompletions ~5s buffering issue. For Discord, the core audio loop from avatarneil's discord-voice is the solid reference.

Summary Table

Solution	Channel	STT	TTS	Barge-in	Latency	~Cost/min	Local Option	Multi-bot	Maturity
discord-voice (avatarneil)	Discord voice	Whisper / GPT-4o / Deepgram / local	OpenAI / ElevenLabs / Deepgram / Polly / Edge / Kokoro	Yes	1.5-4s	$0-$0.05	Yes (Whisper + Kokoro)	Possible (separate instances)	Active, good
discord-voice-deepgram	Discord voice	Deepgram nova-2 streaming	Deepgram Aura-2	Yes	~1-2.5s	~$0.02	No	Possible	New, minimal
discord-voice-bridge (Zofia)	Discord voice	Unknown	Unknown	Mentioned	Unknown	Unknown	Unknown	Unknown	3 installs, dead
Voice Call plugin (built-in)	Phone (PSTN)	Provider STT (Twilio/Telnyx/Plivo)	OpenAI / ElevenLabs	No	3-6s+	$0.10-0.15	No	No by design	Official, stable
DeepClaw (Deepgram)	Phone (PSTN)	Deepgram Flux (semantic)	Deepgram Aura-2	Yes (native)	5s+ (buffering issue)	$0.10-0.15	No	No by design	New, Python
ElevenLabs Agents	Phone (PSTN)	ElevenLabs (VAD)	ElevenLabs	Basic VAD	5s+ (buffering issue)	$0.10-0.20	No	No by design	Simple/community
ClawdTalk (Telnyx)	Phone (PSTN)	Telnyx NaturalHD	Telnyx NaturalHD	Yes	<200ms voice loop	$0.06-0.12	No	No by design	Beta, promising

Critical Issue for All Discord Voice Solutions

Discord DAVE Encryption (March 1, 2026 deadline): Discord rolled out mandatory end-to-end voice encryption (DAVE protocol) across all servers as of March 1, 2026. Every Discord voice bot using the discord.js library needs explicit DAVE support.

Issue #24883 (Feb 2026): Bot joins voice channel but cannot decrypt audio, DecryptionFailed(UnencryptedWhenPassthroughDisabled) errors. Issue #26108 (Mar 2026): Still reproducing on macOS arm64, Node 25.6.1, even after partial fixes in PR #25861 and PR #25909.

Status: Partially patched in core OpenClaw but still unstable on some platforms. The @snazzah/davey DAVE library is installed but integration was incomplete as of late February 2026. If you're building a new Discord voice solution, you must handle DAVE from day one or avoid discord.js entirely.

Solution 1: discord-voice (avatarneil)

Source: github.com/openclaw/skills/.../discord-voice Install: clawdhub install discord-voice or manual git clone

What it does

Full-featured Discord voice channel plugin for OpenClaw. Bot joins a voice channel, listens for speech, transcribes, sends to the agent, synthesizes a response, and plays it back. Voice Activity Detection (VAD) triggers recording automatically. Slash commands or CLI to control join/leave/status/configuration.

Notably: thinking sound (optional looping audio while the LLM processes), fallback provider chains for both STT and TTS, runtime provider switching via slash commands, and speaker diarization support.

Architecture

User speaks in Discord
  → VAD detection (local, in plugin)
  → Audio buffered until silence (configurable threshold, default 800ms)
  → STT provider (batch or streaming)
  → Transcript → OpenClaw agent (LLM)
  → Response text → TTS provider
  → Audio played back in voice channel

Everything runs inside the OpenClaw Gateway process. No separate server needed.

STT Options

Provider	Mode	Quality	Notes
`whisper`	Batch (OpenAI API)	High	`whisper-1` model, ~$0.006/min
`gpt4o-mini`	Batch (OpenAI API)	High, faster	GPT-4o mini transcription, cheaper
`gpt4o-transcribe`	Batch (OpenAI API)	Highest	Best quality
`gpt4o-transcribe-diarize`	Batch (OpenAI API)	Highest + speaker ID	Identifies who is speaking
`deepgram`	Streaming WebSocket (Deepgram)	Very high	Nova-2, ~1s latency reduction vs batch
`local-whisper`	Local (Xenova/Transformers)	Good	No API key, CPU-only, model download required
`wyoming-whisper`	Remote TCP	Good	Connects to a Wyoming Faster Whisper server (Docker)

Fallback chains supported: e.g., primary deepgram, fallbacks ["local-whisper", "whisper"].

TTS Options

Provider	Latency	Quality	Cost	Offline
`openai` (OpenAI TTS)	Moderate	Good	~$0.015/1K chars	No
`elevenlabs` (ElevenLabs)	Low (turbo model)	Excellent	~$0.050/1K chars	No
`deepgram` (Deepgram Aura-2)	Low (~80ms)	Good	~$0.030/1K chars	No
`polly` (AWS Polly)	Low	Good	~$0.004/1K chars	No (AWS)
`edge` (Microsoft Edge TTS)	Very low	Good	Free	No (online)
`kokoro` (Kokoro)	Moderate	Good	Free	Yes (local CPU)

Fallback chains: e.g., primary elevenlabs, fallbacks ["edge", "polly", "kokoro"].

Latency

Silence threshold 800ms before processing begins
Batch STT: adds ~0.5-1.5s
Deepgram streaming STT: ~0.3-0.5s (interim results during speech)
LLM: ~0.5-2s depending on model
TTS: 80-300ms TTFB depending on provider
Total realistic E2E: ~1.5-3.5s with Deepgram streaming, ~2.5-4.5s with batch Whisper

Cost Model

Best paid combo (Deepgram STT + ElevenLabs turbo TTS + GPT-4o mini LLM): ~$0.03-0.05/min
Cheapest paid (Deepgram STT + Edge TTS free + fast LLM): ~$0.01-0.02/min
Offline: $0 (local-whisper + Kokoro), latency higher, quality lower on CPU

Setup Complexity: Medium

Required: ffmpeg, build-essential (for @discordjs/opus and sodium-native), Discord bot with Connect/Speak/Use Voice Activity permissions, API keys for chosen providers. Install steps: ~10 minutes if dependencies are in place. Native module builds can be finicky on some platforms.

Limitations

One voice channel per guild at a time
Max recording 30s (configurable)
DAVE encryption active issues (see critical issue above)
Audio speed bug: TTS at 24kHz plays at ~0.5x speed (issue #32293, needs 48kHz resampling)
OpenClaw 2026.2.24+ still has audio receive failures on some platforms
Thinking sound is a local MP3 loop, not adaptive to actual LLM duration

Discord Voice vs Phone

Discord voice only. No phone call support.

Multi-bot Support (vinny/wendy/david)

Not explicitly designed for it. The plugin reads channels.discord.token from the OpenClaw config. Running three separate OpenClaw instances with different Discord bot accounts would work. The "one voice channel per guild at a time" limit applies per instance. You would not run all three bots from one OpenClaw instance simultaneously without significant custom work.

Barge-in

Yes. Default enabled. Bot stops speaking immediately when VAD detects user talking. Can disable with bargeIn: false.

Community / Maturity

By avatarneil in the official openclaw/skills repo. Fairly active development. This is the most feature-complete Discord voice solution in the ecosystem.

Solution 2: discord-voice-deepgram

Source: playbooks.com/skills/.../discord-voice-deepgram Install: ClawHub or manual npm install

What it does

Streamlined, Deepgram-exclusive Discord voice plugin. Conceptually identical pipeline to discord-voice but locked to Deepgram for both STT and TTS. Adds voice-controlled speaker management (who the bot listens to) and wake-word support.

Architecture

Discord voice audio → Deepgram streaming STT (WebSocket)
Transcript → OpenClaw agent
Response → Deepgram TTS /v1/speak (streamed HTTP, Ogg/Opus)
Audio → Discord voice channel

STT / TTS

STT: Deepgram nova-2, streaming WebSocket, language configurable
TTS: Deepgram Aura-2 (aura-2-thalia-en default), streamed HTTP in Ogg/Opus

Latency

Likely ~1-2.5s E2E. Streaming STT eliminates most of the wait. Deepgram TTS is streamed (not batch), so first audio arrives ~80ms after generation starts.

Cost

Deepgram-only. STT: ~$0.0059/min (nova-2 streaming). TTS: ~$0.030/1K chars. No OpenAI, no ElevenLabs, no fallbacks; simpler billing but no vendor flexibility.

Setup Complexity: Low

Requires only DISCORD_TOKEN and DEEPGRAM_API_KEY. Cleaner config than avatarneil's version.

Limitations

No fallback providers; Deepgram downtime = silence
No local/offline option
Wake word and speaker management are novel but untested at scale
Less community validation than the avatarneil skill

Multi-bot / Discord Voice vs Phone

Discord voice only. Same multi-bot constraints as avatarneil (separate instances).

Barge-in

Yes, enabled by default.

Community / Maturity

Newer, part of openclaw/skills repo but simpler than the avatarneil version. Good for Deepgram-first setups. Limited community feedback.

Solution 3: discord-voice-bridge (ai-agent-Zofia)

Source: LobeHub marketplace GitHub: ai-agent-Zofia/discord-voice-bridge-openclaw-skill

What it does

Listed on LobeHub marketplace as a Discord voice bridge for OpenClaw. Claims: real-time audio capture, push-to-talk or always-listening, automatic STT forwarding, selectable TTS voices and languages, per-channel config, profanity filtering.

The honest assessment

3 installs. Zero stars. Zero forks. The listing page is almost entirely LobeHub marketplace boilerplate. No GitHub activity data available. No issues, no discussions, no real-world reports.

Recommendation

Skip. The avatarneil discord-voice skill covers all the stated features and has real community validation.

Solution 4: OpenClaw Voice Call Plugin (built-in)

Source: docs.openclaw.ai/plugins/voice-call Install: openclaw plugins install @openclaw/voice-call

What it does

The official first-party OpenClaw plugin for phone calls (PSTN/VoIP). Enables the agent to make and receive real phone calls via Twilio, Telnyx, or Plivo. Two modes: notify (agent calls and speaks a message, no response expected) and conversation (multi-turn back-and-forth).

This is not Discord voice; this is your agent picking up the phone.

Architecture

OpenClaw Gateway (Voice Call plugin running inside it)
  ↓ outbound: initiate_call via provider API
  ↓ inbound: webhook receives from Twilio/Telnyx/Plivo
Requires: publicly reachable webhook URL (ngrok, Tailscale, stable domain)
Audio streaming: WebSocket (Twilio Media Streams, Telnyx WebSocket)
STT: Provider-native speech recognition
TTS: OpenClaw core messages.tts config (OpenAI or ElevenLabs; overridable per-call)

STT

Provider-native. Twilio's speech input uses built-in STT (powered by Google STT under the hood). Telnyx and Plivo similarly. You don't choose a separate STT engine; it's bundled with the telephony provider.

TTS

Uses OpenClaw's messages.tts config (OpenAI TTS or ElevenLabs). Can override at the plugin level. Edge TTS explicitly not supported for calls (PCM requirement).

Latency

Webhook round-trips add 200-500ms overhead vs direct streaming
OpenClaw's core /v1/chat/completions endpoint buffers ~5 seconds before streaming
Realistic E2E: 3-8 seconds depending on provider and network

Cost Model

Twilio: $0.085/min + phone number ~$1/month
Telnyx: $0.005-0.025/min + phone number ~$0.50-2/month
Plivo: comparable to Telnyx
Plus TTS costs
Effective per-minute: ~$0.10-0.15 with Twilio + OpenAI TTS

Setup Complexity: Medium-High

Requires: public webhook URL (ngrok/Tailscale/stable domain), provider account setup, phone number provisioning.

Limitations

Phone calls only, no Discord voice
STT quality depends on telephony provider (not configurable)
No Deepgram, no local/offline
No native barge-in (voice loop is synchronous)
The ~5s buffering on OpenClaw's chat completions adds noticeable lag

Multi-bot Support

Not a designed feature. One Gateway instance, one phone number. Three bots would mean three instances and three phone numbers.

Barge-in

Not documented. The streaming infrastructure exists (WebSocket media streams) but mid-call interruption support is not mentioned.

Community / Maturity

Official, well-documented, actively maintained. The right answer for agent-initiated phone notifications. Less suitable for low-latency conversational use.

Solution 5: DeepClaw (Deepgram)

Source: github.com/deepgram/deepclaw Language: Python 3.10+

What it does

Deepgram's own project bridging phone calls to OpenClaw using the Deepgram Voice Agent API: a single WebSocket that bundles STT, TTS, turn detection, and barge-in together. You call a phone number, DeepClaw bridges to Deepgram's Voice Agent which handles all the voice intelligence, which in turn calls your local OpenClaw as the LLM backend.

Architecture

Caller (phone) → Twilio/Telnyx (PSTN)
  → deepclaw Python server (WebSocket bridge)
    → Deepgram Voice Agent API (WebSocket)
      → Flux STT (semantic turn detection)
      → Aura-2 TTS (~90ms TTFB)
      → LLM proxy: your local OpenClaw /v1/chat/completions
  → Audio streamed back through Twilio/Telnyx → caller

STT

Deepgram Flux: semantic turn detection, not just VAD silence detection. Understands *when you're done talking*, not just when you stop making noise. Fewer false triggers and faster response times compared to silence-threshold approaches.

TTS

Deepgram Aura-2: 80+ voices in 7 languages, ~90ms TTFB (fastest in this comparison). Output streamed directly back through the phone provider.

Latency

Major caveat: The initial greeting is instant (Deepgram-generated). But every subsequent response waits on OpenClaw's /v1/chat/completions endpoint, which buffers the full response before streaming (~5 seconds). Not suitable for fluid multi-turn conversation until OpenClaw fixes the buffering.

Deepgram comparison (from their own README):

	ElevenLabs	Deepgram
TTS latency (TTFB)	~200ms	90ms
TTS price	$0.050/1K chars	$0.030/1K chars
Barge-in	Basic VAD	Native StartOfTurn
Turn detection	VAD-based	Semantic (Flux)

Cost Model

Item	Twilio	Telnyx
Setup complexity	Moderate	Easy
Phone number	~$1/month	~$0.50-2/month
Call pricing	$0.085/min	$0.005-0.025/min
Deepgram Voice Agent	~$0.030/1K chars TTS	same

Telnyx significantly cheaper for call minutes.

Setup Complexity: Medium

Python install, .env file with credentials, ngrok for webhooks, configure Twilio or Telnyx to point to your server URL. OpenClaw's chatCompletions endpoint must be enabled: openclaw config set gateway.http.endpoints.chatCompletions.enabled true.

Security Concerns

LLM proxy endpoint has no authentication
No Twilio signature validation
Credentials in plaintext .env
ngrok exposes your local machine

These are dev-setup issues, not blockers, but need addressing before real-world use.

Limitations

Phone calls only, no Discord voice
~5s LLM buffering makes multi-turn conversation feel laggy
Python service (separate process from OpenClaw)
No local/offline option

Barge-in

Yes, native. The Deepgram Voice Agent API handles StartOfTurn detection natively at the WebSocket level without VAD tuning.

Community / Maturity

Official Deepgram project, well-documented. MIT license. Very new repo.

Solution 6: ElevenLabs Agents Integration

Source: Blog post: Call your OpenClaw over the phone

What it does

Uses ElevenLabs' Conversational AI platform as the voice layer. ElevenLabs handles STT, TTS, turn-taking, and phone integration. OpenClaw is the brain, connected via its /v1/chat/completions endpoint. Phone number via Twilio connected to the ElevenLabs agent.

This is the "no code" approach: connect existing platforms with an API config.

Architecture

Caller → Twilio → ElevenLabs Conversational AI Agent
ElevenLabs Agent → (ngrok tunnel) → OpenClaw /v1/chat/completions
OpenClaw processes turn, returns response
ElevenLabs synthesizes speech, plays to caller

STT / TTS

ElevenLabs handles both. Their Conversational AI uses built-in STT (VAD-based turn detection) and their voice library (~200ms TTFB, large voice selection, very natural).

Latency

Same ~5 second bottleneck from OpenClaw's chatCompletions buffering. ElevenLabs TTS adds ~200ms on top (vs Deepgram's 90ms). Total: noticeably worse than DeepClaw for conversational feel.

Cost Model

ElevenLabs Conversational AI: priced per minute (Creator plan ~$22/mo includes minutes, overage rates apply)
Twilio: $0.085/min + phone number
Plus OpenClaw LLM costs
More expensive than Deepgram path for high usage

Setup Complexity: Low

No code required. Enable chatCompletions, start ngrok, create ElevenLabs Agent with your endpoint, connect Twilio number.

Limitations

Phone calls only, no Discord voice
VAD-based barge-in (not semantic; more false triggers)
Expensive compared to Deepgram
~5s LLM buffering
Not a proper integration; it's glue between two platforms
ElevenLabs agent is a separate system; OpenClaw has no visibility into call state

Community / Maturity

Two blog posts and a Reddit thread. Works for personal use; not production-hardened.

Solution 7: ClawdTalk (Telnyx)

Source: telnyx.com/resources/openclaw-phone-calls, clawdtalk.com

What it does

Managed SaaS product (beta) from Telnyx specifically for giving Clawdbot a phone number. The key architectural difference: your bot connects outbound to ClawdTalk via WebSocket; ClawdTalk never connects inward to you. No public URL needed. No ngrok. Works behind NAT, firewalls, Docker networks.

The voice loop is fully handled by Telnyx AI Assistants (real-time STT + TTS). Your bot receives transcribed text, returns text. Voice processing is completely abstracted. For complex tasks (tool calls, memory lookup), the voice agent keeps talking while your bot processes asynchronously.

Inbound + outbound calls. Also supports WhatsApp and SMS on the same number.

Architecture

Your Bot (OpenClaw) → outbound WebSocket → ClawdTalk
ClawdTalk → Telnyx carrier (PSTN)
Caller → Telnyx → ClawdTalk
  → real-time STT (Telnyx NaturalHD)
  → text sent to your bot via WebSocket
  → bot returns response text
  → TTS (Telnyx NaturalHD, <200ms)
  → audio back to caller

STT / TTS

Telnyx NaturalHD for both. Not configurable externally (managed service). Latency claim: <200ms for the voice loop.

Cost Model

Tier	Price	Minutes	SMS
Free	$0	10/month	100/day
Starter	$12/mo	100/month	-
Pro	$30/mo	500/month	-
Pro overage	-	$0.02/min	-

Effective cost: Starter = $0.12/min, Pro = $0.06/min + $0.02 overage. Free tier is genuinely free (no CC required).

Setup Complexity: Low

Sign up at clawdtalk.com
Install ClawdTalk skill on Clawdbot
Get a phone number
Bot connects automatically via outbound WebSocket

Limitations

Phone calls only, no Discord voice
Cloud-only, no self-hosted option
TTS/STT not configurable (Telnyx NaturalHD only)
Beta product; feature gaps and pricing likely to change
Pro outbound calls to arbitrary numbers (free tier: only your verified number)

Barge-in

Yes. The WebSocket payload includes is_interruption: true for mid-speech interrupts.

Community / Maturity

Beta. Telnyx is a real carrier (140+ countries, 1B+ calls/year), so infrastructure is production-grade. ClawdTalk itself is new. A demo audio clip on the site shows barge-in and natural conversation.

YouTube Videos

Title	Channel	Date	Link	Relevance
OpenClaw Full Course: Setup, Skills, Voice, Memory & More	TechWithTim	~4 weeks ago	Watch	Full course including voice setup
OpenClaw + Minimax: FREE AI Voice Agent is INSANE!	AI Profit Lab	Feb 2, 2026	Watch	Minimax as free STT/TTS alternative
Giving My OpenClaw Agent a Voice (FOR FREE) w/ Edge TTS	unknown	Feb 10, 2026	Watch	Edge TTS tutorial (free Microsoft TTS)
Demo for OpenClaw Web Phone (barge-in + streaming TTS/STT)	unknown	~3 weeks ago	Watch	Most relevant: demos barge-in, silence filler, streaming
I Built the Ultimate AI Voice Agent with Clawdbot in 10 Minutes	Edwin / LegacyAI	Jan 28, 2026	Watch	Quick setup walkthrough

Most notable: The OpenClaw Web Phone demo shows barge-in, silence filler, and streaming TTS/STT in a PR that was waiting to merge. This suggests a web-based phone interface coming to core OpenClaw.

Minimax as free STT/TTS: worth investigating as a cost-saving provider for the discord-voice skill.

Known Issues Summary (GitHub)

Issue	Severity	Status
#24883: DAVE encryption not handled, no audio receive	Critical	Partial fix merged, still reproducing
#26108: Connected but no live audio on 2026.2.24 (macOS arm64)	Critical	Open
#32293: TTS audio plays at 0.5x speed with 24kHz source	High	Open (needs 48kHz resampling)
#39145: Discord listener blocks on slow AI responses	High	Open

The DAVE encryption issue is the most significant: Discord's mandatory E2E encryption for voice has exposed a gap in discord.js-based bots. Both the discord-voice-deepgram skill and avatarneil's skill use discord.js under the hood and are affected.