
Use Voice Mode with Hermes

This guide is the practical companion to the Voice Mode feature reference.

If the feature page explains what voice mode can do, this guide shows how to actually use it well.

Voice mode is especially useful when:

  • you want a hands-free CLI workflow
  • you want spoken responses in Telegram or Discord
  • you want Hermes sitting in a Discord voice channel for live conversation
  • you want quick idea capture, debugging, or back-and-forth while walking around instead of typing

Hermes offers three distinct voice experiences.

Mode | Best for | Platform
Interactive microphone loop | Personal hands-free use while coding or researching | CLI
Voice replies in chat | Spoken responses alongside normal messaging | Telegram, Discord
Live voice channel bot | Group or personal live conversation in a VC | Discord voice channels

A good path is:

  1. get text working first
  2. enable voice replies second
  3. move to Discord voice channels last if you want the full experience

Step 1: make sure normal Hermes works first

Before touching voice mode, verify that:

  • Hermes starts
  • your provider is configured
  • the agent can answer text prompts normally

Start Hermes:

hermes

Ask something simple:

What tools do you have available?

If that is not solid yet, fix text mode first.

Install the extras you need.

CLI voice mode (microphone and playback):

pip install "hermes-agent[voice]"

Telegram and Discord messaging:

pip install "hermes-agent[messaging]"

Premium TTS:

pip install "hermes-agent[tts-premium]"

Local NeuTTS (optional):

python -m pip install -U "neutts[all]"

Everything at once:

pip install "hermes-agent[all]"

System dependencies on macOS (Homebrew):

brew install portaudio ffmpeg opus
brew install espeak-ng

System dependencies on Debian/Ubuntu:

sudo apt install portaudio19-dev ffmpeg libopus0
sudo apt install espeak-ng

Why these matter:

  • portaudio → microphone input / playback for CLI voice mode
  • ffmpeg → audio conversion for TTS and messaging delivery
  • opus → Discord voice codec support
  • espeak-ng → phonemizer backend for NeuTTS
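A quick way to confirm the command-line tools above are installed is to look for them on PATH before starting voice mode. The sketch below is not part of Hermes; the helper name `missing_tools` and the tool list are illustrative assumptions. PortAudio and Opus are shared libraries rather than commands, so only ffmpeg and espeak-ng are checked.

```python
import shutil

# Executables this guide installs. portaudio and opus are shared libraries,
# not commands, so they cannot be detected this way.
REQUIRED_TOOLS = ("ffmpeg", "espeak-ng")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All command-line audio tools found.")
```

If espeak-ng is reported missing, that only matters when you plan to use NeuTTS.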

Hermes supports both local and cloud speech stacks.

Use local STT and free Edge TTS:

  • STT provider: local
  • TTS provider: edge

This is usually the best place to start.

Add to ~/.hermes/.env:

# Cloud STT options (local needs no key)
GROQ_API_KEY=***
VOICE_TOOLS_OPENAI_KEY=***
# Premium TTS (optional)
ELEVENLABS_API_KEY=***

STT providers:

  • local → best default for privacy and zero-cost use
  • groq → very fast cloud transcription
  • openai → good paid fallback

TTS providers:

  • edge → free and good enough for most users
  • neutts → free local/on-device TTS
  • elevenlabs → best quality
  • openai → good middle ground
  • mistral → multilingual, native Opus
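Before switching to a cloud provider, it can help to confirm that the matching key is actually present in your env file. The following is a minimal sketch, not Hermes code: the key names come from the example above, while `parse_env`, `providers_with_keys`, and the key-to-provider mapping are illustrative assumptions.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Assumed mapping from provider name to the key it needs (illustrative).
PROVIDER_KEYS = {
    "groq": "GROQ_API_KEY",
    "openai": "VOICE_TOOLS_OPENAI_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
}


def providers_with_keys(env):
    """Return provider names whose key is present and non-empty."""
    return [name for name, key in PROVIDER_KEYS.items() if env.get(key)]
```

To check your own setup, read `~/.hermes/.env` (for example with `pathlib.Path.home() / ".hermes" / ".env"`) and pass its text to `parse_env`. The local STT and Edge TTS providers need no key, so they never appear in this list.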

If you choose NeuTTS in the setup wizard, Hermes checks whether neutts is already installed. If it is missing, the wizard tells you NeuTTS needs the Python package neutts and the system package espeak-ng, offers to install them for you, installs espeak-ng with your platform package manager, and then runs:

python -m pip install -U neutts[all]

If you skip that install or it fails, the wizard falls back to Edge TTS.

voice:
  record_key: "ctrl+b"
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0

stt:
  provider: "local"
  local:
    model: "base"

tts:
  provider: "edge"
  edge:
    voice: "en-US-AriaNeural"

This is a good conservative default for most people.

If you want local TTS instead, switch the tts block to:

tts:
  provider: "neutts"
  neutts:
    ref_audio: ''
    ref_text: ''
    model: neuphonic/neutts-air-q4-gguf
    device: cpu

Start Hermes:

hermes

Inside the CLI:

/voice on

Default key:

  • Ctrl+B

Workflow:

  1. press Ctrl+B
  2. speak
  3. wait for silence detection to stop recording automatically
  4. Hermes transcribes and responds
  5. if TTS is on, it speaks the answer
  6. the loop can automatically restart for continuous use

Voice commands:

/voice
/voice on
/voice off
/voice tts
/voice status

Say:

I keep getting a docker permission error. Help me debug it.

Then continue hands-free:

  • “Read the last error again”
  • “Explain the root cause in simpler terms”
  • “Now give me the exact fix”

Great for:

  • walking around while thinking
  • dictating half-formed ideas
  • asking Hermes to structure your thoughts in real time

If typing is inconvenient, voice mode is one of the fastest ways to stay in the full Hermes loop.

If Hermes starts/stops too aggressively, tune:

voice:
  silence_threshold: 250

Higher threshold = less sensitive.

If you pause a lot between sentences, increase:

voice:
  silence_duration: 4.0
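To see how silence_threshold and silence_duration interact, here is a minimal sketch of energy-based silence detection. It assumes the recorder compares each audio frame's RMS level against silence_threshold and stops once the level has stayed below it for silence_duration seconds. Hermes's actual detector may work differently, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a frame of 16-bit PCM sample values."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def stop_index(frame_levels, frame_seconds, silence_threshold=200,
               silence_duration=3.0):
    """Return the index of the frame at which recording would stop.

    Recording stops once speech has been heard and the level then stays
    below `silence_threshold` for `silence_duration` seconds in a row.
    Returns None if that never happens.
    """
    heard_speech = False
    quiet_seconds = 0.0
    for index, level in enumerate(frame_levels):
        if level >= silence_threshold:
            # Loud frame: counts as speech and resets the silence timer.
            heard_speech = True
            quiet_seconds = 0.0
        elif heard_speech:
            quiet_seconds += frame_seconds
            if quiet_seconds >= silence_duration:
                return index
    return None
```

Under this model, background noise at a level of 220 keeps a recording open at a threshold of 200 but counts as silence at 250, which is why raising the threshold makes the recorder less sensitive. Raising silence_duration simply requires a longer quiet run before stopping.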

If Ctrl+B conflicts with your terminal or tmux habits:

voice:
  record_key: "ctrl+space"

Use case 2: voice replies in Telegram or Discord

This mode is simpler than full voice channels.

Hermes stays a normal chat bot, but can speak replies.

Start the gateway:

hermes gateway

Inside Telegram or Discord:

/voice on

or

/voice tts

Mode | Meaning
off | text only
voice_only | speak only when the user sent voice
all | speak every reply

Use:

  • /voice on if you want spoken replies only for voice-originating messages
  • /voice tts if you want a full spoken assistant all the time
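The three reply modes boil down to one small decision per message. The sketch below restates that decision in code; it is an illustration rather than Hermes's implementation. The function name is hypothetical, and the correspondence of /voice on to voice_only and /voice tts to all is inferred from the descriptions above.

```python
VALID_MODES = ("off", "voice_only", "all")


def should_speak(mode, user_sent_voice):
    """Decide whether a reply should include audio for the given mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown voice mode: {mode!r}")
    if mode == "all":
        return True
    if mode == "voice_only":
        return user_sent_voice
    return False  # mode == "off"
```

For example, `should_speak("voice_only", user_sent_voice=False)` is False, so a typed message gets a text-only reply while a voice note gets a spoken one.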

Use when:

  • you are away from your machine
  • you want to send voice notes and get quick spoken replies
  • you want Hermes to function like a portable research or ops assistant

Direct messages are useful when you want private interaction without server-channel mention behavior.

Live conversation in a Discord voice channel (VC) is the most advanced mode.

Hermes joins a Discord VC, listens to user speech, transcribes it, runs the normal agent pipeline, and speaks replies back into the channel.

In addition to the normal text-bot setup, make sure the bot has:

  • Connect
  • Speak
  • preferably Use Voice Activity

Also enable privileged intents in the Developer Portal:

  • Presence Intent
  • Server Members Intent
  • Message Content Intent
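If you build the bot's invite link by hand, the three voice permissions above correspond to bit flags in Discord's permissions integer. The sketch below computes that value. The bit positions are Discord's documented permission flags; the helper name is illustrative, and you would still add whatever bits your text-bot setup already uses.

```python
# Discord permission bit flags for the voice permissions listed above.
CONNECT = 1 << 20
SPEAK = 1 << 21
USE_VAD = 1 << 25  # shown as "Use Voice Activity" in the Discord UI


def voice_permissions(include_voice_activity=True):
    """Return the permissions integer covering the voice permissions."""
    value = CONNECT | SPEAK
    if include_voice_activity:
        value |= USE_VAD
    return value


print(voice_permissions())  # 36700160
```

The privileged intents are separate from permissions: they are toggled in the Developer Portal and are not part of this integer.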

In a Discord text channel where the bot is present:

/voice join
/voice leave
/voice status

Once Hermes has joined:

  • users speak in the VC
  • Hermes detects speech boundaries
  • transcripts are posted in the associated text channel
  • Hermes responds in text and audio
  • the text channel is the one where /voice join was issued

Good practice:

  • keep DISCORD_ALLOWED_USERS tight
  • use a dedicated bot/testing channel at first
  • verify STT and TTS work in ordinary text-chat voice mode before trying VC mode

Suggested provider combinations:

  • best quality → STT: local large-v3 or Groq whisper-large-v3; TTS: ElevenLabs
  • fast and convenient → STT: local base or Groq; TTS: Edge
  • zero cost → STT: local; TTS: Edge

If CLI voice mode cannot find a microphone or audio device, install portaudio.

If the bot joins the voice channel but does not respond to you, check:

  • your Discord user ID is in DISCORD_ALLOWED_USERS
  • you are not muted
  • privileged intents are enabled
  • the bot has Connect/Speak permissions
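A common cause of the first item is a stray space or a mistyped ID in the allowlist. The sketch below shows one way to test a user ID against it, assuming DISCORD_ALLOWED_USERS holds a comma-separated list of numeric user IDs. That format and both helper names are assumptions, so compare against your actual configuration.

```python
def parse_allowed_users(raw):
    """Split a comma-separated ID list, ignoring blanks and whitespace."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def is_allowed(user_id, raw):
    """Return True if `user_id` appears in the allowlist string."""
    return str(user_id) in parse_allowed_users(raw)
```

For example, `is_allowed(123, "123, 456")` is True, while a display name such as "alice" would not match a list of numeric IDs.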

If Hermes replies in text but does not speak, check:

  • TTS provider config
  • API key / quota for ElevenLabs or OpenAI
  • ffmpeg install for Edge conversion paths

If transcription is inaccurate or cuts off, try:

  • quieter environment
  • higher silence_threshold
  • different STT provider/model
  • shorter, clearer utterances

“It works in DMs but not in server channels”

That is often caused by the mention policy.

By default, the bot needs an @mention in Discord server text channels unless configured otherwise.
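The default behavior described above can be pictured as a small rule. This is a sketch only: the function and the `require_mention` parameter are hypothetical names, not Hermes settings.

```python
def should_respond(is_dm, bot_mentioned, require_mention=True):
    """Sketch of a mention policy for incoming text messages."""
    if is_dm:
        return True  # direct messages never need a mention
    return bot_mentioned or not require_mention
```

With the default, `should_respond(is_dm=False, bot_mentioned=False)` is False, which matches the symptom: the same message works in a DM but is ignored in a server channel until the bot is mentioned.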

If you want the shortest path to success:

  1. get text Hermes working
  2. install hermes-agent[voice]
  3. use CLI voice mode with local STT + Edge TTS
  4. then enable /voice on in Telegram or Discord
  5. only after that, try Discord VC mode

That progression keeps the debugging surface small.