research brief

OpenClaw Voice Interaction: Competitive Analysis

2026-03-15

OpenClaw Voice Interaction: Competitive Analysis

Executive Summary

For Discord voice, use discord-voice (avatarneil) with Deepgram streaming STT and ElevenLabs turbo TTS (~$0.03-0.05/min, 1.5-3.5s latency). The critical blocker is Discord's mandatory DAVE end-to-end encryption rolled out March 1, 2026, which breaks most existing bots; confirm issue #26108 is resolved before deploying. For phone calls, start with ClawdTalk (free tier, no public URL needed, <200ms voice loop). Skip ElevenLabs Agents (most expensive, least integrated).


Recommendations

Discord Voice (vinny/wendy/david in voice channels)

Use discord-voice (avatarneil): most complete, most flexible, active development. Run three separate OpenClaw instances with different Discord bot tokens. Address DAVE encryption status before deploying; check if the latest codebase has resolved #26108.

Best provider combo for latency + cost: Deepgram streaming STT (nova-2) + ElevenLabs turbo TTS. ~$0.03-0.05/min. Best combo for zero cost: local-whisper + Kokoro. No API keys needed, quality lower, latency higher.

Phone Calls (call your OpenClaw from your cell)

ClawdTalk has the best UX (no public URL, outbound WebSocket only, async tool calls keep conversation flowing). Start free. If it grows limiting, DeepClaw gives more control with better latency fundamentals (Deepgram Flux semantic turn detection vs VAD).

Avoid ElevenLabs Agents unless you're already paying for their Conversational AI tier.

Building Custom

The DeepClaw architecture (Deepgram Voice Agent API as a single-WebSocket voice pipeline + local OpenClaw via chat completions) is the cleanest reference. The main blocker is OpenClaw's chatCompletions ~5s buffering issue. For Discord, the core audio loop from avatarneil's discord-voice is the solid reference.


Summary Table

SolutionChannelSTTTTSBarge-inLatency~Cost/minLocal OptionMulti-botMaturity
discord-voice (avatarneil)Discord voiceWhisper / GPT-4o / Deepgram / localOpenAI / ElevenLabs / Deepgram / Polly / Edge / KokoroYes1.5-4s$0-$0.05Yes (Whisper + Kokoro)Possible (separate instances)Active, good
discord-voice-deepgramDiscord voiceDeepgram nova-2 streamingDeepgram Aura-2Yes~1-2.5s~$0.02NoPossibleNew, minimal
discord-voice-bridge (Zofia)Discord voiceUnknownUnknownMentionedUnknownUnknownUnknownUnknown3 installs, dead
Voice Call plugin (built-in)Phone (PSTN)Provider STT (Twilio/Telnyx/Plivo)OpenAI / ElevenLabsNo3-6s+$0.10-0.15NoNo by designOfficial, stable
DeepClaw (Deepgram)Phone (PSTN)Deepgram Flux (semantic)Deepgram Aura-2Yes (native)5s+ (buffering issue)$0.10-0.15NoNo by designNew, Python
ElevenLabs AgentsPhone (PSTN)ElevenLabs (VAD)ElevenLabsBasic VAD5s+ (buffering issue)$0.10-0.20NoNo by designSimple/community
ClawdTalk (Telnyx)Phone (PSTN)Telnyx NaturalHDTelnyx NaturalHDYes<200ms voice loop$0.06-0.12NoNo by designBeta, promising

Critical Issue for All Discord Voice Solutions

Discord DAVE Encryption (March 1, 2026 deadline): Discord rolled out mandatory end-to-end voice encryption (DAVE protocol) across all servers as of March 1, 2026. Every Discord voice bot using the discord.js library needs explicit DAVE support.

Issue #24883 (Feb 2026): Bot joins voice channel but cannot decrypt audio, DecryptionFailed(UnencryptedWhenPassthroughDisabled) errors. Issue #26108 (Mar 2026): Still reproducing on macOS arm64, Node 25.6.1, even after partial fixes in PR #25861 and PR #25909.

Status: Partially patched in core OpenClaw but still unstable on some platforms. The @snazzah/davey DAVE library is installed but integration was incomplete as of late February 2026. If you're building a new Discord voice solution, you must handle DAVE from day one or avoid discord.js entirely.


Solution 1: discord-voice (avatarneil)

Source: github.com/openclaw/skills/.../discord-voice Install: clawdhub install discord-voice or manual git clone

What it does

Full-featured Discord voice channel plugin for OpenClaw. Bot joins a voice channel, listens for speech, transcribes, sends to the agent, synthesizes a response, and plays it back. Voice Activity Detection (VAD) triggers recording automatically. Slash commands or CLI to control join/leave/status/configuration.

Notably: thinking sound (optional looping audio while the LLM processes), fallback provider chains for both STT and TTS, runtime provider switching via slash commands, and speaker diarization support.

Architecture

User speaks in Discord
  → VAD detection (local, in plugin)
  → Audio buffered until silence (configurable threshold, default 800ms)
  → STT provider (batch or streaming)
  → Transcript → OpenClaw agent (LLM)
  → Response text → TTS provider
  → Audio played back in voice channel

Everything runs inside the OpenClaw Gateway process. No separate server needed.

STT Options

ProviderModeQualityNotes
whisperBatch (OpenAI API)Highwhisper-1 model, ~$0.006/min
gpt4o-miniBatch (OpenAI API)High, fasterGPT-4o mini transcription, cheaper
gpt4o-transcribeBatch (OpenAI API)HighestBest quality
gpt4o-transcribe-diarizeBatch (OpenAI API)Highest + speaker IDIdentifies who is speaking
deepgramStreaming WebSocket (Deepgram)Very highNova-2, ~1s latency reduction vs batch
local-whisperLocal (Xenova/Transformers)GoodNo API key, CPU-only, model download required
wyoming-whisperRemote TCPGoodConnects to a Wyoming Faster Whisper server (Docker)

Fallback chains supported: e.g., primary deepgram, fallbacks ["local-whisper", "whisper"].

TTS Options

ProviderLatencyQualityCostOffline
openai (OpenAI TTS)ModerateGood~$0.015/1K charsNo
elevenlabs (ElevenLabs)Low (turbo model)Excellent~$0.050/1K charsNo
deepgram (Deepgram Aura-2)Low (~80ms)Good~$0.030/1K charsNo
polly (AWS Polly)LowGood~$0.004/1K charsNo (AWS)
edge (Microsoft Edge TTS)Very lowGoodFreeNo (online)
kokoro (Kokoro)ModerateGoodFreeYes (local CPU)

Fallback chains: e.g., primary elevenlabs, fallbacks ["edge", "polly", "kokoro"].

Latency

Cost Model

Setup Complexity: Medium

Required: ffmpeg, build-essential (for @discordjs/opus and sodium-native), Discord bot with Connect/Speak/Use Voice Activity permissions, API keys for chosen providers. Install steps: ~10 minutes if dependencies are in place. Native module builds can be finicky on some platforms.

Limitations

Discord Voice vs Phone

Discord voice only. No phone call support.

Multi-bot Support (vinny/wendy/david)

Not explicitly designed for it. The plugin reads channels.discord.token from the OpenClaw config. Running three separate OpenClaw instances with different Discord bot accounts would work. The "one voice channel per guild at a time" limit applies per instance. You would not run all three bots from one OpenClaw instance simultaneously without significant custom work.

Barge-in

Yes. Default enabled. Bot stops speaking immediately when VAD detects user talking. Can disable with bargeIn: false.

Community / Maturity

By avatarneil in the official openclaw/skills repo. Fairly active development. This is the most feature-complete Discord voice solution in the ecosystem.


Solution 2: discord-voice-deepgram

Source: playbooks.com/skills/.../discord-voice-deepgram Install: ClawHub or manual npm install

What it does

Streamlined, Deepgram-exclusive Discord voice plugin. Conceptually identical pipeline to discord-voice but locked to Deepgram for both STT and TTS. Adds voice-controlled speaker management (who the bot listens to) and wake-word support.

Architecture

Discord voice audio → Deepgram streaming STT (WebSocket)
Transcript → OpenClaw agent
Response → Deepgram TTS /v1/speak (streamed HTTP, Ogg/Opus)
Audio → Discord voice channel

STT / TTS

Latency

Likely ~1-2.5s E2E. Streaming STT eliminates most of the wait. Deepgram TTS is streamed (not batch), so first audio arrives ~80ms after generation starts.

Cost

Deepgram-only. STT: ~$0.0059/min (nova-2 streaming). TTS: ~$0.030/1K chars. No OpenAI, no ElevenLabs, no fallbacks; simpler billing but no vendor flexibility.

Setup Complexity: Low

Requires only DISCORD_TOKEN and DEEPGRAM_API_KEY. Cleaner config than avatarneil's version.

Limitations

Multi-bot / Discord Voice vs Phone

Discord voice only. Same multi-bot constraints as avatarneil (separate instances).

Barge-in

Yes, enabled by default.

Community / Maturity

Newer, part of openclaw/skills repo but simpler than the avatarneil version. Good for Deepgram-first setups. Limited community feedback.


Solution 3: discord-voice-bridge (ai-agent-Zofia)

Source: LobeHub marketplace GitHub: ai-agent-Zofia/discord-voice-bridge-openclaw-skill

What it does

Listed on LobeHub marketplace as a Discord voice bridge for OpenClaw. Claims: real-time audio capture, push-to-talk or always-listening, automatic STT forwarding, selectable TTS voices and languages, per-channel config, profanity filtering.

The honest assessment

3 installs. Zero stars. Zero forks. The listing page is almost entirely LobeHub marketplace boilerplate. No GitHub activity data available. No issues, no discussions, no real-world reports.

Recommendation

Skip. The avatarneil discord-voice skill covers all the stated features and has real community validation.


Solution 4: OpenClaw Voice Call Plugin (built-in)

Source: docs.openclaw.ai/plugins/voice-call Install: openclaw plugins install @openclaw/voice-call

What it does

The official first-party OpenClaw plugin for phone calls (PSTN/VoIP). Enables the agent to make and receive real phone calls via Twilio, Telnyx, or Plivo. Two modes: notify (agent calls and speaks a message, no response expected) and conversation (multi-turn back-and-forth).

This is not Discord voice; this is your agent picking up the phone.

Architecture

OpenClaw Gateway (Voice Call plugin running inside it)
  ↓ outbound: initiate_call via provider API
  ↓ inbound: webhook receives from Twilio/Telnyx/Plivo
Requires: publicly reachable webhook URL (ngrok, Tailscale, stable domain)
Audio streaming: WebSocket (Twilio Media Streams, Telnyx WebSocket)
STT: Provider-native speech recognition
TTS: OpenClaw core messages.tts config (OpenAI or ElevenLabs; overridable per-call)

STT

Provider-native. Twilio's speech input uses built-in STT (powered by Google STT under the hood). Telnyx and Plivo similarly. You don't choose a separate STT engine; it's bundled with the telephony provider.

TTS

Uses OpenClaw's messages.tts config (OpenAI TTS or ElevenLabs). Can override at the plugin level. Edge TTS explicitly not supported for calls (PCM requirement).

Latency

Cost Model

Setup Complexity: Medium-High

Requires: public webhook URL (ngrok/Tailscale/stable domain), provider account setup, phone number provisioning.

Limitations

Multi-bot Support

Not a designed feature. One Gateway instance, one phone number. Three bots would mean three instances and three phone numbers.

Barge-in

Not documented. The streaming infrastructure exists (WebSocket media streams) but mid-call interruption support is not mentioned.

Community / Maturity

Official, well-documented, actively maintained. The right answer for agent-initiated phone notifications. Less suitable for low-latency conversational use.


Solution 5: DeepClaw (Deepgram)

Source: github.com/deepgram/deepclaw Language: Python 3.10+

What it does

Deepgram's own project bridging phone calls to OpenClaw using the Deepgram Voice Agent API: a single WebSocket that bundles STT, TTS, turn detection, and barge-in together. You call a phone number, DeepClaw bridges to Deepgram's Voice Agent which handles all the voice intelligence, which in turn calls your local OpenClaw as the LLM backend.

Architecture

Caller (phone) → Twilio/Telnyx (PSTN)
  → deepclaw Python server (WebSocket bridge)
    → Deepgram Voice Agent API (WebSocket)
      → Flux STT (semantic turn detection)
      → Aura-2 TTS (~90ms TTFB)
      → LLM proxy: your local OpenClaw /v1/chat/completions
  → Audio streamed back through Twilio/Telnyx → caller

STT

Deepgram Flux: semantic turn detection, not just VAD silence detection. Understands *when you're done talking*, not just when you stop making noise. Fewer false triggers and faster response times compared to silence-threshold approaches.

TTS

Deepgram Aura-2: 80+ voices in 7 languages, ~90ms TTFB (fastest in this comparison). Output streamed directly back through the phone provider.

Latency

Major caveat: The initial greeting is instant (Deepgram-generated). But every subsequent response waits on OpenClaw's /v1/chat/completions endpoint, which buffers the full response before streaming (~5 seconds). Not suitable for fluid multi-turn conversation until OpenClaw fixes the buffering.

Deepgram comparison (from their own README):

ElevenLabsDeepgram
TTS latency (TTFB)~200ms90ms
TTS price$0.050/1K chars$0.030/1K chars
Barge-inBasic VADNative StartOfTurn
Turn detectionVAD-basedSemantic (Flux)

Cost Model

ItemTwilioTelnyx
Setup complexityModerateEasy
Phone number~$1/month~$0.50-2/month
Call pricing$0.085/min$0.005-0.025/min
Deepgram Voice Agent~$0.030/1K chars TTSsame

Telnyx significantly cheaper for call minutes.

Setup Complexity: Medium

Python install, .env file with credentials, ngrok for webhooks, configure Twilio or Telnyx to point to your server URL. OpenClaw's chatCompletions endpoint must be enabled: openclaw config set gateway.http.endpoints.chatCompletions.enabled true.

Security Concerns

These are dev-setup issues, not blockers, but need addressing before real-world use.

Limitations

Barge-in

Yes, native. The Deepgram Voice Agent API handles StartOfTurn detection natively at the WebSocket level without VAD tuning.

Community / Maturity

Official Deepgram project, well-documented. MIT license. Very new repo.


Solution 6: ElevenLabs Agents Integration

Source: Blog post: Call your OpenClaw over the phone

What it does

Uses ElevenLabs' Conversational AI platform as the voice layer. ElevenLabs handles STT, TTS, turn-taking, and phone integration. OpenClaw is the brain, connected via its /v1/chat/completions endpoint. Phone number via Twilio connected to the ElevenLabs agent.

This is the "no code" approach: connect existing platforms with an API config.

Architecture

Caller → Twilio → ElevenLabs Conversational AI Agent
ElevenLabs Agent → (ngrok tunnel) → OpenClaw /v1/chat/completions
OpenClaw processes turn, returns response
ElevenLabs synthesizes speech, plays to caller

STT / TTS

ElevenLabs handles both. Their Conversational AI uses built-in STT (VAD-based turn detection) and their voice library (~200ms TTFB, large voice selection, very natural).

Latency

Same ~5 second bottleneck from OpenClaw's chatCompletions buffering. ElevenLabs TTS adds ~200ms on top (vs Deepgram's 90ms). Total: noticeably worse than DeepClaw for conversational feel.

Cost Model

Setup Complexity: Low

No code required. Enable chatCompletions, start ngrok, create ElevenLabs Agent with your endpoint, connect Twilio number.

Limitations

Community / Maturity

Two blog posts and a Reddit thread. Works for personal use; not production-hardened.


Solution 7: ClawdTalk (Telnyx)

Source: telnyx.com/resources/openclaw-phone-calls, clawdtalk.com

What it does

Managed SaaS product (beta) from Telnyx specifically for giving Clawdbot a phone number. The key architectural difference: your bot connects outbound to ClawdTalk via WebSocket; ClawdTalk never connects inward to you. No public URL needed. No ngrok. Works behind NAT, firewalls, Docker networks.

The voice loop is fully handled by Telnyx AI Assistants (real-time STT + TTS). Your bot receives transcribed text, returns text. Voice processing is completely abstracted. For complex tasks (tool calls, memory lookup), the voice agent keeps talking while your bot processes asynchronously.

Inbound + outbound calls. Also supports WhatsApp and SMS on the same number.

Architecture

Your Bot (OpenClaw) → outbound WebSocket → ClawdTalk
ClawdTalk → Telnyx carrier (PSTN)
Caller → Telnyx → ClawdTalk
  → real-time STT (Telnyx NaturalHD)
  → text sent to your bot via WebSocket
  → bot returns response text
  → TTS (Telnyx NaturalHD, <200ms)
  → audio back to caller

STT / TTS

Telnyx NaturalHD for both. Not configurable externally (managed service). Latency claim: <200ms for the voice loop.

Cost Model

TierPriceMinutesSMS
Free$010/month100/day
Starter$12/mo100/month-
Pro$30/mo500/month-
Pro overage-$0.02/min-

Effective cost: Starter = $0.12/min, Pro = $0.06/min + $0.02 overage. Free tier is genuinely free (no CC required).

Setup Complexity: Low

  1. Sign up at clawdtalk.com
  2. Install ClawdTalk skill on Clawdbot
  3. Get a phone number
  4. Bot connects automatically via outbound WebSocket

Limitations

Barge-in

Yes. The WebSocket payload includes is_interruption: true for mid-speech interrupts.

Community / Maturity

Beta. Telnyx is a real carrier (140+ countries, 1B+ calls/year), so infrastructure is production-grade. ClawdTalk itself is new. A demo audio clip on the site shows barge-in and natural conversation.


YouTube Videos

TitleChannelDateLinkRelevance
OpenClaw Full Course: Setup, Skills, Voice, Memory & MoreTechWithTim~4 weeks agoWatchFull course including voice setup
OpenClaw + Minimax: FREE AI Voice Agent is INSANE!AI Profit LabFeb 2, 2026WatchMinimax as free STT/TTS alternative
Giving My OpenClaw Agent a Voice (FOR FREE) w/ Edge TTSunknownFeb 10, 2026WatchEdge TTS tutorial (free Microsoft TTS)
Demo for OpenClaw Web Phone (barge-in + streaming TTS/STT)unknown~3 weeks agoWatchMost relevant: demos barge-in, silence filler, streaming
I Built the Ultimate AI Voice Agent with Clawdbot in 10 MinutesEdwin / LegacyAIJan 28, 2026WatchQuick setup walkthrough

Most notable: The OpenClaw Web Phone demo shows barge-in, silence filler, and streaming TTS/STT in a PR that was waiting to merge. This suggests a web-based phone interface coming to core OpenClaw.

Minimax as free STT/TTS: worth investigating as a cost-saving provider for the discord-voice skill.


Known Issues Summary (GitHub)

IssueSeverityStatus
#24883: DAVE encryption not handled, no audio receiveCriticalPartial fix merged, still reproducing
#26108: Connected but no live audio on 2026.2.24 (macOS arm64)CriticalOpen
#32293: TTS audio plays at 0.5x speed with 24kHz sourceHighOpen (needs 48kHz resampling)
#39145: Discord listener blocks on slow AI responsesHighOpen

The DAVE encryption issue is the most significant: Discord's mandatory E2E encryption for voice has exposed a gap in discord.js-based bots. Both the discord-voice-deepgram skill and avatarneil's skill use discord.js under the hood and are affected.