Use Voice Mode with Hermes
This guide is the practical companion to the Voice Mode feature reference.
If the feature page explains what voice mode can do, this guide shows how to actually use it well.
What voice mode is good for
Voice mode is especially useful when:
- you want a hands-free CLI workflow
- you want spoken responses in Telegram or Discord
- you want Hermes sitting in a Discord voice channel for live conversation
- you want quick idea capture, debugging, or back-and-forth while walking around instead of typing
Choose your voice mode setup
There are really three different voice experiences in Hermes.
| Mode | Best for | Platform |
|---|---|---|
| Interactive microphone loop | Personal hands-free use while coding or researching | CLI |
| Voice replies in chat | Spoken responses alongside normal messaging | Telegram, Discord |
| Live voice channel bot | Group or personal live conversation in a VC | Discord voice channels |
A good path is:
- get text working first
- enable voice replies second
- move to Discord voice channels last if you want the full experience
Step 1: make sure normal Hermes works first
Before touching voice mode, verify that:
- Hermes starts
- your provider is configured
- the agent can answer text prompts normally

    hermes

Ask something simple:

    What tools do you have available?

If that is not solid yet, fix text mode first.
Step 2: install the right extras
CLI microphone + playback

    pip install "hermes-agent[voice]"

Messaging platforms

    pip install "hermes-agent[messaging]"

Premium ElevenLabs TTS

    pip install "hermes-agent[tts-premium]"

Local NeuTTS (optional)

    python -m pip install -U neutts[all]

Everything

    pip install "hermes-agent[all]"

Step 3: install system dependencies
macOS (Homebrew)

    brew install portaudio ffmpeg opus
    brew install espeak-ng

Ubuntu / Debian

    sudo apt install portaudio19-dev ffmpeg libopus0
    sudo apt install espeak-ng

Why these matter:
- portaudio → microphone input / playback for CLI voice mode
- ffmpeg → audio conversion for TTS and messaging delivery
- opus → Discord voice codec support
- espeak-ng → phonemizer backend for NeuTTS
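Before moving on, it can help to confirm that the four dependencies above are actually visible on your machine. The following is a minimal sketch, not part of Hermes: the function name `check_voice_dependencies` is hypothetical, and it assumes that `ffmpeg` and `espeak-ng` are found on `PATH` while `portaudio` and `opus` are found as shared libraries under their conventional short names.

```python
import shutil
from ctypes.util import find_library

# Command-line tools from this guide, looked up on PATH.
REQUIRED_TOOLS = ["ffmpeg", "espeak-ng"]
# Shared libraries, looked up by conventional short name (an assumption).
REQUIRED_LIBS = ["portaudio", "opus"]


def check_voice_dependencies(which=shutil.which, find_lib=find_library):
    """Return a mapping of dependency name -> True if it was found."""
    found = {}
    for tool in REQUIRED_TOOLS:
        found[tool] = which(tool) is not None
    for lib in REQUIRED_LIBS:
        found[lib] = find_lib(lib) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_voice_dependencies().items():
        print(f"{name:10s} {'found' if ok else 'MISSING'}")
```

A library reported as missing here is only a hint: some platforms install shared libraries in locations that `find_library` does not search.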
Step 4: choose STT and TTS providers
Hermes supports both local and cloud speech stacks.
Easiest / cheapest setup
Use local STT and free Edge TTS:
- STT provider: local
- TTS provider: edge
This is usually the best place to start.
Environment file example
Add to ~/.hermes/.env:

    # Cloud STT options (local needs no key)
    GROQ_API_KEY=***
    VOICE_TOOLS_OPENAI_KEY=***

    # Premium TTS (optional)
    ELEVENLABS_API_KEY=***

Provider recommendations
Speech-to-text
- local → best default for privacy and zero-cost use
- groq → very fast cloud transcription
- openai → good paid fallback
Text-to-speech
- edge → free and good enough for most users
- neutts → free local/on-device TTS
- elevenlabs → best quality
- openai → good middle ground
- mistral → multilingual, native Opus
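Once you have picked providers, a quick way to catch a missing API key is to compare your choice against the environment file. This is a rough sketch under stated assumptions: the provider-to-key mapping is inferred from the environment file example in this guide (in particular, that `openai` uses `VOICE_TOOLS_OPENAI_KEY`), the helper names are hypothetical, and `mistral` is left out because this guide does not name its key.

```python
from pathlib import Path

# Assumed mapping from provider name to the key it needs in ~/.hermes/.env.
# local, edge, and neutts are described as free or local in this guide,
# so this sketch assumes they need no key.
PROVIDER_KEYS = {
    "groq": "GROQ_API_KEY",
    "openai": "VOICE_TOOLS_OPENAI_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_key(provider, env):
    """Return the env key a provider still needs, or None if nothing is missing."""
    key = PROVIDER_KEYS.get(provider)
    if key is None:
        return None  # no key required, or provider unknown to this sketch
    return None if env.get(key) else key


if __name__ == "__main__":
    env_path = Path.home() / ".hermes" / ".env"
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    for provider in ("local", "groq", "elevenlabs"):
        print(provider, "->", missing_key(provider, env) or "ok")
```

The parser handles only plain `KEY=VALUE` lines, which is enough for the example file shown above.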
If you use hermes setup
If you choose NeuTTS in the setup wizard, Hermes checks whether neutts is already installed. If it is missing, the wizard tells you that NeuTTS needs the Python package neutts and the system package espeak-ng, and offers to install them for you. It installs espeak-ng with your platform package manager and then runs:

    python -m pip install -U neutts[all]

If you skip that install or it fails, the wizard falls back to Edge TTS.
Step 5: recommended config

    voice:
      record_key: "ctrl+b"
      max_recording_seconds: 120
      auto_tts: false
      beep_enabled: true
      silence_threshold: 200
      silence_duration: 3.0

    stt:
      provider: "local"
      local:
        model: "base"

    tts:
      provider: "edge"
      edge:
        voice: "en-US-AriaNeural"

This is a good conservative default for most people.
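If you edit these values by hand, a small sanity check can catch typos before you start a voice session. The sketch below mirrors the recommended config as a plain Python dict and applies a few checks. The provider names come from this guide, but the function `validate_voice_config` and the rules it applies are assumptions for illustration, not Hermes's own validation.

```python
# Provider names come from this guide; the rules themselves are assumptions.
STT_PROVIDERS = {"local", "groq", "openai"}
TTS_PROVIDERS = {"edge", "neutts", "elevenlabs", "openai", "mistral"}


def validate_voice_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    voice = config.get("voice", {})
    threshold = voice.get("silence_threshold", 0)
    duration = voice.get("silence_duration", 0)
    max_seconds = voice.get("max_recording_seconds", 0)

    if threshold <= 0:
        problems.append("voice.silence_threshold should be positive")
    if duration <= 0:
        problems.append("voice.silence_duration should be positive")
    if max_seconds <= duration:
        problems.append("voice.max_recording_seconds should exceed silence_duration")

    stt = config.get("stt", {}).get("provider")
    if stt not in STT_PROVIDERS:
        problems.append(f"stt.provider {stt!r} is not a provider named in this guide")
    tts = config.get("tts", {}).get("provider")
    if tts not in TTS_PROVIDERS:
        problems.append(f"tts.provider {tts!r} is not a provider named in this guide")
    return problems


# The recommended config from this step, written as a dict.
RECOMMENDED = {
    "voice": {
        "record_key": "ctrl+b",
        "max_recording_seconds": 120,
        "auto_tts": False,
        "beep_enabled": True,
        "silence_threshold": 200,
        "silence_duration": 3.0,
    },
    "stt": {"provider": "local", "local": {"model": "base"}},
    "tts": {"provider": "edge", "edge": {"voice": "en-US-AriaNeural"}},
}

print(validate_voice_config(RECOMMENDED))
```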
If you want local TTS instead, switch the tts block to:

    tts:
      provider: "neutts"
      neutts:
        ref_audio: ''
        ref_text: ''
        model: neuphonic/neutts-air-q4-gguf
        device: cpu

Use case 1: CLI voice mode
Turn it on
Start Hermes:

    hermes

Inside the CLI:

    /voice on

Recording flow
Default key: Ctrl+B
Workflow:
- press Ctrl+B
- speak
- wait for silence detection to stop recording automatically
- Hermes transcribes and responds
- if TTS is on, it speaks the answer
- the loop can automatically restart for continuous use
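To make the silence-detection step in this workflow more concrete, here is an illustrative sketch of how a threshold and a duration can work together to end a recording. It is not Hermes's actual implementation: it assumes the threshold is compared against the RMS amplitude of 16-bit PCM frames, and the function names `rms` and `frames_until_stop` are hypothetical.

```python
import math
import struct


def rms(frame):
    """Root-mean-square amplitude of a frame of 16-bit little-endian PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def frames_until_stop(frames, frame_seconds, silence_threshold=200, silence_duration=3.0):
    """Return how many frames are consumed before recording stops.

    Recording stops once consecutive quiet frames add up to silence_duration
    seconds. If that never happens, every frame is consumed.
    """
    quiet_seconds = 0.0
    consumed = 0
    for frame in frames:
        consumed += 1
        if rms(frame) < silence_threshold:
            quiet_seconds += frame_seconds
            if quiet_seconds >= silence_duration:
                break
        else:
            quiet_seconds = 0.0  # any loud frame resets the silence timer
    return consumed


# Synthetic one-second frames: loud speech, a short pause, then real silence.
loud = struct.pack("<4h", 1000, -1000, 1000, -1000)
quiet = struct.pack("<4h", 10, -10, 10, -10)
frames = [loud, loud, quiet, quiet, loud] + [quiet] * 5
print(frames_until_stop(frames, frame_seconds=1.0))
```

In this model a two-second pause does not end the recording because it is shorter than the silence duration, which is the behavior you would tune if you tend to pause between sentences.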
Useful commands

    /voice
    /voice on
    /voice off
    /voice tts
    /voice status

Good CLI workflows
Walk-up debugging
Say:

    I keep getting a docker permission error. Help me debug it.

Then continue hands-free:
- “Read the last error again”
- “Explain the root cause in simpler terms”
- “Now give me the exact fix”
Research / brainstorming
Great for:
- walking around while thinking
- dictating half-formed ideas
- asking Hermes to structure your thoughts in real time
Accessibility / low-typing sessions
If typing is inconvenient, voice mode is one of the fastest ways to stay in the full Hermes loop.
Tuning CLI behavior
Silence threshold
If Hermes starts/stops too aggressively, tune:

    voice:
      silence_threshold: 250

Higher threshold = less sensitive.
Silence duration
If you pause a lot between sentences, increase:

    voice:
      silence_duration: 4.0

Record key
If Ctrl+B conflicts with your terminal or tmux habits:

    voice:
      record_key: "ctrl+space"

Use case 2: voice replies in Telegram or Discord
This mode is simpler than full voice channels.
Hermes stays a normal chat bot, but can speak replies.
Start the gateway

    hermes gateway

Turn on voice replies
Inside Telegram or Discord:

    /voice on

or

    /voice tts

| Mode | Meaning |
|---|---|
| off | text only |
| voice_only | speak only when the user sent voice |
| all | speak every reply |
When to use which mode
- /voice on if you want spoken replies only for voice-originating messages
- /voice tts if you want a full spoken assistant all the time
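The relationship between the commands and the three modes can be summarized as a small decision function. This is a sketch of the behavior described above, not the gateway's code: the mode names come from this guide, while the mapping of `/voice off` to `off` and the function name `should_speak` are assumptions.

```python
# Mode names come from this guide; mapping /voice off to "off" is assumed.
COMMAND_TO_MODE = {
    "/voice off": "off",
    "/voice on": "voice_only",
    "/voice tts": "all",
}


def should_speak(mode, user_sent_voice):
    """Decide whether a reply should be spoken for the given mode."""
    if mode == "all":
        return True
    if mode == "voice_only":
        return user_sent_voice
    return False  # "off" and anything unrecognized stay text only


mode = COMMAND_TO_MODE["/voice on"]
print(mode, should_speak(mode, user_sent_voice=False))
```

With `/voice on`, a typed message gets a text reply and a voice note gets a spoken one. With `/voice tts`, both are spoken.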
Good messaging workflows
Telegram assistant on your phone
Use when:
- you are away from your machine
- you want to send voice notes and get quick spoken replies
- you want Hermes to function like a portable research or ops assistant
Discord DMs with spoken output
Useful when you want private interaction without server-channel mention behavior.
Use case 3: Discord voice channels
This is the most advanced mode.
Hermes joins a Discord VC, listens to user speech, transcribes it, runs the normal agent pipeline, and speaks replies back into the channel.
Required Discord permissions
In addition to the normal text-bot setup, make sure the bot has:
- Connect
- Speak
- preferably Use Voice Activity
Also enable privileged intents in the Developer Portal:
- Presence Intent
- Server Members Intent
- Message Content Intent
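If you build the bot's invite link or audit its role by hand, the voice permissions listed above correspond to bits in Discord's permission integer. The sketch below uses the bit positions from Discord's documented permission flags for Connect, Speak, and Use Voice Activity. It covers only these voice permissions, not whatever the normal text-bot setup requires, and the helper names are hypothetical.

```python
# Bit positions from Discord's documented permission flags.
CONNECT = 1 << 20
SPEAK = 1 << 21
USE_VAD = 1 << 25  # shown in the Discord UI as "Use Voice Activity"

REQUIRED = {"Connect": CONNECT, "Speak": SPEAK}
RECOMMENDED = {"Use Voice Activity": USE_VAD}


def voice_permission_integer():
    """Combine the voice permissions from this guide into one integer."""
    return CONNECT | SPEAK | USE_VAD


def missing_permissions(permission_integer, wanted):
    """Return the names in `wanted` whose bit is not set in the integer."""
    return [name for name, bit in wanted.items() if not permission_integer & bit]


print(voice_permission_integer())
print(missing_permissions(CONNECT, REQUIRED))
```

Privileged intents are a separate setting in the Developer Portal and are not part of the permission integer.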
Join and leave
In a Discord text channel where the bot is present:

    /voice join
    /voice leave
    /voice status

What happens when joined
- users speak in the VC
- Hermes detects speech boundaries
- transcripts are posted in the associated text channel
- Hermes responds in text and audio
- the text channel is the one where /voice join was issued
Best practices for Discord VC use
- keep DISCORD_ALLOWED_USERS tight
- use a dedicated bot/testing channel at first
- verify STT and TTS work in ordinary text-chat voice mode before trying VC mode
Voice quality recommendations
Best quality setup
- STT: local large-v3 or Groq whisper-large-v3
- TTS: ElevenLabs
Best speed / convenience setup
- STT: local base or Groq
- TTS: Edge
Best zero-cost setup
- STT: local
- TTS: Edge
Common failure modes
“No audio device found”
Install portaudio.
“Bot joins but hears nothing”
Check:
- your Discord user ID is in DISCORD_ALLOWED_USERS
- you are not muted
- privileged intents are enabled
- the bot has Connect/Speak permissions
“It transcribes but does not speak”
Check:
- TTS provider config
- API key / quota for ElevenLabs or OpenAI
- ffmpeg install for Edge conversion paths
“Whisper outputs garbage”
Try:
- quieter environment
- higher silence_threshold
- different STT provider/model
- shorter, clearer utterances
“It works in DMs but not in server channels”
That is often caused by the mention policy.
By default, the bot needs an @mention in Discord server text channels unless configured otherwise.
Suggested first-week setup
If you want the shortest path to success:
- get text Hermes working
- install hermes-agent[voice]
- use CLI voice mode with local STT + Edge TTS
- then enable /voice on in Telegram or Discord
- only after that, try Discord VC mode
That progression keeps the debugging surface small.