OpenClaw Voice Interaction: Competitive Analysis
Executive Summary
For Discord voice, use discord-voice (avatarneil) with Deepgram streaming STT and ElevenLabs turbo TTS (~$0.03-0.05/min, 1.5-3.5s latency). The critical blocker is Discord's mandatory DAVE end-to-end encryption rolled out March 1, 2026, which breaks most existing bots; confirm issue #26108 is resolved before deploying. For phone calls, start with ClawdTalk (free tier, no public URL needed, <200ms voice loop). Skip ElevenLabs Agents (most expensive, least integrated).
Recommendations
Discord Voice (vinny/wendy/david in voice channels)
Use discord-voice (avatarneil): most complete, most flexible, active development. Run three separate OpenClaw instances with different Discord bot tokens. Address DAVE encryption status before deploying; check if the latest codebase has resolved #26108.
Best provider combo for latency + cost: Deepgram streaming STT (nova-2) + ElevenLabs turbo TTS. ~$0.03-0.05/min. Best combo for zero cost: local-whisper + Kokoro. No API keys needed, quality lower, latency higher.
Phone Calls (call your OpenClaw from your cell)
ClawdTalk has the best UX (no public URL, outbound WebSocket only, async tool calls keep conversation flowing). Start free. If it grows limiting, DeepClaw gives more control with better latency fundamentals (Deepgram Flux semantic turn detection vs VAD).
Avoid ElevenLabs Agents unless you're already paying for their Conversational AI tier.
Building Custom
The DeepClaw architecture (Deepgram Voice Agent API as a single-WebSocket voice pipeline + local OpenClaw via chat completions) is the cleanest reference. The main blocker is OpenClaw's chatCompletions ~5s buffering issue. For Discord, the core audio loop from avatarneil's discord-voice is the solid reference.
Summary Table
| Solution | Channel | STT | TTS | Barge-in | Latency | ~Cost/min | Local Option | Multi-bot | Maturity |
|---|---|---|---|---|---|---|---|---|---|
| discord-voice (avatarneil) | Discord voice | Whisper / GPT-4o / Deepgram / local | OpenAI / ElevenLabs / Deepgram / Polly / Edge / Kokoro | Yes | 1.5-4s | $0-$0.05 | Yes (Whisper + Kokoro) | Possible (separate instances) | Active, good |
| discord-voice-deepgram | Discord voice | Deepgram nova-2 streaming | Deepgram Aura-2 | Yes | ~1-2.5s | ~$0.02 | No | Possible | New, minimal |
| discord-voice-bridge (Zofia) | Discord voice | Unknown | Unknown | Mentioned | Unknown | Unknown | Unknown | Unknown | 3 installs, dead |
| Voice Call plugin (built-in) | Phone (PSTN) | Provider STT (Twilio/Telnyx/Plivo) | OpenAI / ElevenLabs | No | 3-6s+ | $0.10-0.15 | No | No by design | Official, stable |
| DeepClaw (Deepgram) | Phone (PSTN) | Deepgram Flux (semantic) | Deepgram Aura-2 | Yes (native) | 5s+ (buffering issue) | $0.10-0.15 | No | No by design | New, Python |
| ElevenLabs Agents | Phone (PSTN) | ElevenLabs (VAD) | ElevenLabs | Basic VAD | 5s+ (buffering issue) | $0.10-0.20 | No | No by design | Simple/community |
| ClawdTalk (Telnyx) | Phone (PSTN) | Telnyx NaturalHD | Telnyx NaturalHD | Yes | <200ms voice loop | $0.06-0.12 | No | No by design | Beta, promising |
Critical Issue for All Discord Voice Solutions
Discord DAVE Encryption (March 1, 2026 deadline): Discord rolled out mandatory end-to-end voice encryption (DAVE protocol) across all servers as of March 1, 2026. Every Discord voice bot using the discord.js library needs explicit DAVE support.
Issue #24883 (Feb 2026): Bot joins voice channel but cannot decrypt audio, DecryptionFailed(UnencryptedWhenPassthroughDisabled) errors. Issue #26108 (Mar 2026): Still reproducing on macOS arm64, Node 25.6.1, even after partial fixes in PR #25861 and PR #25909.
Status: Partially patched in core OpenClaw but still unstable on some platforms. The @snazzah/davey DAVE library is installed but integration was incomplete as of late February 2026. If you're building a new Discord voice solution, you must handle DAVE from day one or avoid discord.js entirely.
Solution 1: discord-voice (avatarneil)
Source: github.com/openclaw/skills/.../discord-voice Install: clawdhub install discord-voice or manual git clone
What it does
Full-featured Discord voice channel plugin for OpenClaw. Bot joins a voice channel, listens for speech, transcribes, sends to the agent, synthesizes a response, and plays it back. Voice Activity Detection (VAD) triggers recording automatically. Slash commands or CLI to control join/leave/status/configuration.
Notably: thinking sound (optional looping audio while the LLM processes), fallback provider chains for both STT and TTS, runtime provider switching via slash commands, and speaker diarization support.
Architecture
User speaks in Discord
→ VAD detection (local, in plugin)
→ Audio buffered until silence (configurable threshold, default 800ms)
→ STT provider (batch or streaming)
→ Transcript → OpenClaw agent (LLM)
→ Response text → TTS provider
→ Audio played back in voice channelEverything runs inside the OpenClaw Gateway process. No separate server needed.
STT Options
| Provider | Mode | Quality | Notes |
|---|---|---|---|
whisper | Batch (OpenAI API) | High | whisper-1 model, ~$0.006/min |
gpt4o-mini | Batch (OpenAI API) | High, faster | GPT-4o mini transcription, cheaper |
gpt4o-transcribe | Batch (OpenAI API) | Highest | Best quality |
gpt4o-transcribe-diarize | Batch (OpenAI API) | Highest + speaker ID | Identifies who is speaking |
deepgram | Streaming WebSocket (Deepgram) | Very high | Nova-2, ~1s latency reduction vs batch |
local-whisper | Local (Xenova/Transformers) | Good | No API key, CPU-only, model download required |
wyoming-whisper | Remote TCP | Good | Connects to a Wyoming Faster Whisper server (Docker) |
Fallback chains supported: e.g., primary deepgram, fallbacks ["local-whisper", "whisper"].
TTS Options
| Provider | Latency | Quality | Cost | Offline |
|---|---|---|---|---|
openai (OpenAI TTS) | Moderate | Good | ~$0.015/1K chars | No |
elevenlabs (ElevenLabs) | Low (turbo model) | Excellent | ~$0.050/1K chars | No |
deepgram (Deepgram Aura-2) | Low (~80ms) | Good | ~$0.030/1K chars | No |
polly (AWS Polly) | Low | Good | ~$0.004/1K chars | No (AWS) |
edge (Microsoft Edge TTS) | Very low | Good | Free | No (online) |
kokoro (Kokoro) | Moderate | Good | Free | Yes (local CPU) |
Fallback chains: e.g., primary elevenlabs, fallbacks ["edge", "polly", "kokoro"].
Latency
- Silence threshold 800ms before processing begins
- Batch STT: adds ~0.5-1.5s
- Deepgram streaming STT: ~0.3-0.5s (interim results during speech)
- LLM: ~0.5-2s depending on model
- TTS: 80-300ms TTFB depending on provider
- Total realistic E2E: ~1.5-3.5s with Deepgram streaming, ~2.5-4.5s with batch Whisper
Cost Model
- Best paid combo (Deepgram STT + ElevenLabs turbo TTS + GPT-4o mini LLM): ~$0.03-0.05/min
- Cheapest paid (Deepgram STT + Edge TTS free + fast LLM): ~$0.01-0.02/min
- Offline: $0 (local-whisper + Kokoro), latency higher, quality lower on CPU
Setup Complexity: Medium
Required: ffmpeg, build-essential (for @discordjs/opus and sodium-native), Discord bot with Connect/Speak/Use Voice Activity permissions, API keys for chosen providers. Install steps: ~10 minutes if dependencies are in place. Native module builds can be finicky on some platforms.
Limitations
- One voice channel per guild at a time
- Max recording 30s (configurable)
- DAVE encryption active issues (see critical issue above)
- Audio speed bug: TTS at 24kHz plays at ~0.5x speed (issue #32293, needs 48kHz resampling)
- OpenClaw 2026.2.24+ still has audio receive failures on some platforms
- Thinking sound is a local MP3 loop, not adaptive to actual LLM duration
Discord Voice vs Phone
Discord voice only. No phone call support.
Multi-bot Support (vinny/wendy/david)
Not explicitly designed for it. The plugin reads channels.discord.token from the OpenClaw config. Running three separate OpenClaw instances with different Discord bot accounts would work. The "one voice channel per guild at a time" limit applies per instance. You would not run all three bots from one OpenClaw instance simultaneously without significant custom work.
Barge-in
Yes. Default enabled. Bot stops speaking immediately when VAD detects user talking. Can disable with bargeIn: false.
Community / Maturity
By avatarneil in the official openclaw/skills repo. Fairly active development. This is the most feature-complete Discord voice solution in the ecosystem.
Solution 2: discord-voice-deepgram
Source: playbooks.com/skills/.../discord-voice-deepgram Install: ClawHub or manual npm install
What it does
Streamlined, Deepgram-exclusive Discord voice plugin. Conceptually identical pipeline to discord-voice but locked to Deepgram for both STT and TTS. Adds voice-controlled speaker management (who the bot listens to) and wake-word support.
Architecture
Discord voice audio → Deepgram streaming STT (WebSocket)
Transcript → OpenClaw agent
Response → Deepgram TTS /v1/speak (streamed HTTP, Ogg/Opus)
Audio → Discord voice channelSTT / TTS
- STT: Deepgram nova-2, streaming WebSocket, language configurable
- TTS: Deepgram Aura-2 (
aura-2-thalia-endefault), streamed HTTP in Ogg/Opus
Latency
Likely ~1-2.5s E2E. Streaming STT eliminates most of the wait. Deepgram TTS is streamed (not batch), so first audio arrives ~80ms after generation starts.
Cost
Deepgram-only. STT: ~$0.0059/min (nova-2 streaming). TTS: ~$0.030/1K chars. No OpenAI, no ElevenLabs, no fallbacks; simpler billing but no vendor flexibility.
Setup Complexity: Low
Requires only DISCORD_TOKEN and DEEPGRAM_API_KEY. Cleaner config than avatarneil's version.
Limitations
- No fallback providers; Deepgram downtime = silence
- No local/offline option
- Wake word and speaker management are novel but untested at scale
- Less community validation than the avatarneil skill
Multi-bot / Discord Voice vs Phone
Discord voice only. Same multi-bot constraints as avatarneil (separate instances).
Barge-in
Yes, enabled by default.
Community / Maturity
Newer, part of openclaw/skills repo but simpler than the avatarneil version. Good for Deepgram-first setups. Limited community feedback.
Solution 3: discord-voice-bridge (ai-agent-Zofia)
Source: LobeHub marketplace GitHub: ai-agent-Zofia/discord-voice-bridge-openclaw-skill
What it does
Listed on LobeHub marketplace as a Discord voice bridge for OpenClaw. Claims: real-time audio capture, push-to-talk or always-listening, automatic STT forwarding, selectable TTS voices and languages, per-channel config, profanity filtering.
The honest assessment
3 installs. Zero stars. Zero forks. The listing page is almost entirely LobeHub marketplace boilerplate. No GitHub activity data available. No issues, no discussions, no real-world reports.
Recommendation
Skip. The avatarneil discord-voice skill covers all the stated features and has real community validation.
Solution 4: OpenClaw Voice Call Plugin (built-in)
Source: docs.openclaw.ai/plugins/voice-call Install: openclaw plugins install @openclaw/voice-call
What it does
The official first-party OpenClaw plugin for phone calls (PSTN/VoIP). Enables the agent to make and receive real phone calls via Twilio, Telnyx, or Plivo. Two modes: notify (agent calls and speaks a message, no response expected) and conversation (multi-turn back-and-forth).
This is not Discord voice; this is your agent picking up the phone.
Architecture
OpenClaw Gateway (Voice Call plugin running inside it)
↓ outbound: initiate_call via provider API
↓ inbound: webhook receives from Twilio/Telnyx/Plivo
Requires: publicly reachable webhook URL (ngrok, Tailscale, stable domain)
Audio streaming: WebSocket (Twilio Media Streams, Telnyx WebSocket)
STT: Provider-native speech recognition
TTS: OpenClaw core messages.tts config (OpenAI or ElevenLabs; overridable per-call)STT
Provider-native. Twilio's speech input uses built-in STT (powered by Google STT under the hood). Telnyx and Plivo similarly. You don't choose a separate STT engine; it's bundled with the telephony provider.
TTS
Uses OpenClaw's messages.tts config (OpenAI TTS or ElevenLabs). Can override at the plugin level. Edge TTS explicitly not supported for calls (PCM requirement).
Latency
- Webhook round-trips add 200-500ms overhead vs direct streaming
- OpenClaw's core
/v1/chat/completionsendpoint buffers ~5 seconds before streaming - Realistic E2E: 3-8 seconds depending on provider and network
Cost Model
- Twilio: $0.085/min + phone number ~$1/month
- Telnyx: $0.005-0.025/min + phone number ~$0.50-2/month
- Plivo: comparable to Telnyx
- Plus TTS costs
- Effective per-minute: ~$0.10-0.15 with Twilio + OpenAI TTS
Setup Complexity: Medium-High
Requires: public webhook URL (ngrok/Tailscale/stable domain), provider account setup, phone number provisioning.
Limitations
- Phone calls only, no Discord voice
- STT quality depends on telephony provider (not configurable)
- No Deepgram, no local/offline
- No native barge-in (voice loop is synchronous)
- The ~5s buffering on OpenClaw's chat completions adds noticeable lag
Multi-bot Support
Not a designed feature. One Gateway instance, one phone number. Three bots would mean three instances and three phone numbers.
Barge-in
Not documented. The streaming infrastructure exists (WebSocket media streams) but mid-call interruption support is not mentioned.
Community / Maturity
Official, well-documented, actively maintained. The right answer for agent-initiated phone notifications. Less suitable for low-latency conversational use.
Solution 5: DeepClaw (Deepgram)
Source: github.com/deepgram/deepclaw Language: Python 3.10+
What it does
Deepgram's own project bridging phone calls to OpenClaw using the Deepgram Voice Agent API: a single WebSocket that bundles STT, TTS, turn detection, and barge-in together. You call a phone number, DeepClaw bridges to Deepgram's Voice Agent which handles all the voice intelligence, which in turn calls your local OpenClaw as the LLM backend.
Architecture
Caller (phone) → Twilio/Telnyx (PSTN)
→ deepclaw Python server (WebSocket bridge)
→ Deepgram Voice Agent API (WebSocket)
→ Flux STT (semantic turn detection)
→ Aura-2 TTS (~90ms TTFB)
→ LLM proxy: your local OpenClaw /v1/chat/completions
→ Audio streamed back through Twilio/Telnyx → callerSTT
Deepgram Flux: semantic turn detection, not just VAD silence detection. Understands *when you're done talking*, not just when you stop making noise. Fewer false triggers and faster response times compared to silence-threshold approaches.
TTS
Deepgram Aura-2: 80+ voices in 7 languages, ~90ms TTFB (fastest in this comparison). Output streamed directly back through the phone provider.
Latency
Major caveat: The initial greeting is instant (Deepgram-generated). But every subsequent response waits on OpenClaw's /v1/chat/completions endpoint, which buffers the full response before streaming (~5 seconds). Not suitable for fluid multi-turn conversation until OpenClaw fixes the buffering.
Deepgram comparison (from their own README):
| ElevenLabs | Deepgram | |
|---|---|---|
| TTS latency (TTFB) | ~200ms | 90ms |
| TTS price | $0.050/1K chars | $0.030/1K chars |
| Barge-in | Basic VAD | Native StartOfTurn |
| Turn detection | VAD-based | Semantic (Flux) |
Cost Model
| Item | Twilio | Telnyx |
|---|---|---|
| Setup complexity | Moderate | Easy |
| Phone number | ~$1/month | ~$0.50-2/month |
| Call pricing | $0.085/min | $0.005-0.025/min |
| Deepgram Voice Agent | ~$0.030/1K chars TTS | same |
Telnyx significantly cheaper for call minutes.
Setup Complexity: Medium
Python install, .env file with credentials, ngrok for webhooks, configure Twilio or Telnyx to point to your server URL. OpenClaw's chatCompletions endpoint must be enabled: openclaw config set gateway.http.endpoints.chatCompletions.enabled true.
Security Concerns
- LLM proxy endpoint has no authentication
- No Twilio signature validation
- Credentials in plaintext
.env - ngrok exposes your local machine
These are dev-setup issues, not blockers, but need addressing before real-world use.
Limitations
- Phone calls only, no Discord voice
- ~5s LLM buffering makes multi-turn conversation feel laggy
- Python service (separate process from OpenClaw)
- No local/offline option
Barge-in
Yes, native. The Deepgram Voice Agent API handles StartOfTurn detection natively at the WebSocket level without VAD tuning.
Community / Maturity
Official Deepgram project, well-documented. MIT license. Very new repo.
Solution 6: ElevenLabs Agents Integration
Source: Blog post: Call your OpenClaw over the phone
What it does
Uses ElevenLabs' Conversational AI platform as the voice layer. ElevenLabs handles STT, TTS, turn-taking, and phone integration. OpenClaw is the brain, connected via its /v1/chat/completions endpoint. Phone number via Twilio connected to the ElevenLabs agent.
This is the "no code" approach: connect existing platforms with an API config.
Architecture
Caller → Twilio → ElevenLabs Conversational AI Agent
ElevenLabs Agent → (ngrok tunnel) → OpenClaw /v1/chat/completions
OpenClaw processes turn, returns response
ElevenLabs synthesizes speech, plays to callerSTT / TTS
ElevenLabs handles both. Their Conversational AI uses built-in STT (VAD-based turn detection) and their voice library (~200ms TTFB, large voice selection, very natural).
Latency
Same ~5 second bottleneck from OpenClaw's chatCompletions buffering. ElevenLabs TTS adds ~200ms on top (vs Deepgram's 90ms). Total: noticeably worse than DeepClaw for conversational feel.
Cost Model
- ElevenLabs Conversational AI: priced per minute (Creator plan ~$22/mo includes minutes, overage rates apply)
- Twilio: $0.085/min + phone number
- Plus OpenClaw LLM costs
- More expensive than Deepgram path for high usage
Setup Complexity: Low
No code required. Enable chatCompletions, start ngrok, create ElevenLabs Agent with your endpoint, connect Twilio number.
Limitations
- Phone calls only, no Discord voice
- VAD-based barge-in (not semantic; more false triggers)
- Expensive compared to Deepgram
- ~5s LLM buffering
- Not a proper integration; it's glue between two platforms
- ElevenLabs agent is a separate system; OpenClaw has no visibility into call state
Community / Maturity
Two blog posts and a Reddit thread. Works for personal use; not production-hardened.
Solution 7: ClawdTalk (Telnyx)
Source: telnyx.com/resources/openclaw-phone-calls, clawdtalk.com
What it does
Managed SaaS product (beta) from Telnyx specifically for giving Clawdbot a phone number. The key architectural difference: your bot connects outbound to ClawdTalk via WebSocket; ClawdTalk never connects inward to you. No public URL needed. No ngrok. Works behind NAT, firewalls, Docker networks.
The voice loop is fully handled by Telnyx AI Assistants (real-time STT + TTS). Your bot receives transcribed text, returns text. Voice processing is completely abstracted. For complex tasks (tool calls, memory lookup), the voice agent keeps talking while your bot processes asynchronously.
Inbound + outbound calls. Also supports WhatsApp and SMS on the same number.
Architecture
Your Bot (OpenClaw) → outbound WebSocket → ClawdTalk
ClawdTalk → Telnyx carrier (PSTN)
Caller → Telnyx → ClawdTalk
→ real-time STT (Telnyx NaturalHD)
→ text sent to your bot via WebSocket
→ bot returns response text
→ TTS (Telnyx NaturalHD, <200ms)
→ audio back to callerSTT / TTS
Telnyx NaturalHD for both. Not configurable externally (managed service). Latency claim: <200ms for the voice loop.
Cost Model
| Tier | Price | Minutes | SMS |
|---|---|---|---|
| Free | $0 | 10/month | 100/day |
| Starter | $12/mo | 100/month | - |
| Pro | $30/mo | 500/month | - |
| Pro overage | - | $0.02/min | - |
Effective cost: Starter = $0.12/min, Pro = $0.06/min + $0.02 overage. Free tier is genuinely free (no CC required).
Setup Complexity: Low
- Sign up at clawdtalk.com
- Install ClawdTalk skill on Clawdbot
- Get a phone number
- Bot connects automatically via outbound WebSocket
Limitations
- Phone calls only, no Discord voice
- Cloud-only, no self-hosted option
- TTS/STT not configurable (Telnyx NaturalHD only)
- Beta product; feature gaps and pricing likely to change
- Pro outbound calls to arbitrary numbers (free tier: only your verified number)
Barge-in
Yes. The WebSocket payload includes is_interruption: true for mid-speech interrupts.
Community / Maturity
Beta. Telnyx is a real carrier (140+ countries, 1B+ calls/year), so infrastructure is production-grade. ClawdTalk itself is new. A demo audio clip on the site shows barge-in and natural conversation.
YouTube Videos
| Title | Channel | Date | Link | Relevance |
|---|---|---|---|---|
| OpenClaw Full Course: Setup, Skills, Voice, Memory & More | TechWithTim | ~4 weeks ago | Watch | Full course including voice setup |
| OpenClaw + Minimax: FREE AI Voice Agent is INSANE! | AI Profit Lab | Feb 2, 2026 | Watch | Minimax as free STT/TTS alternative |
| Giving My OpenClaw Agent a Voice (FOR FREE) w/ Edge TTS | unknown | Feb 10, 2026 | Watch | Edge TTS tutorial (free Microsoft TTS) |
| Demo for OpenClaw Web Phone (barge-in + streaming TTS/STT) | unknown | ~3 weeks ago | Watch | Most relevant: demos barge-in, silence filler, streaming |
| I Built the Ultimate AI Voice Agent with Clawdbot in 10 Minutes | Edwin / LegacyAI | Jan 28, 2026 | Watch | Quick setup walkthrough |
Most notable: The OpenClaw Web Phone demo shows barge-in, silence filler, and streaming TTS/STT in a PR that was waiting to merge. This suggests a web-based phone interface coming to core OpenClaw.
Minimax as free STT/TTS: worth investigating as a cost-saving provider for the discord-voice skill.
Known Issues Summary (GitHub)
| Issue | Severity | Status |
|---|---|---|
| #24883: DAVE encryption not handled, no audio receive | Critical | Partial fix merged, still reproducing |
| #26108: Connected but no live audio on 2026.2.24 (macOS arm64) | Critical | Open |
| #32293: TTS audio plays at 0.5x speed with 24kHz source | High | Open (needs 48kHz resampling) |
| #39145: Discord listener blocks on slow AI responses | High | Open |
The DAVE encryption issue is the most significant: Discord's mandatory E2E encryption for voice has exposed a gap in discord.js-based bots. Both the discord-voice-deepgram skill and avatarneil's skill use discord.js under the hood and are affected.