Run Hermes Locally with Ollama — Zero API Cost

Cloud LLM APIs charge per token. A heavy coding session can cost $5–20. For personal projects, learning, or privacy-sensitive work, that adds up — and you’re sending every conversation to a third party.

You’ll set up Hermes Agent running entirely on your own hardware, using Ollama as the model backend. No API keys, no subscriptions, no data leaving your machine. Once configured, Hermes works exactly like it does with OpenRouter or Anthropic — terminal commands, file editing, web browsing, delegation — but the model runs locally.

By the end, you’ll have:

  • Ollama serving one or more open-weight models
  • Hermes connected to Ollama as a custom endpoint
  • A working local agent that can edit files, run commands, and browse the web
  • Optional: a Telegram/Discord bot powered entirely by your own hardware

Hardware requirements:

| Component | Minimum | Recommended |
| --- | --- | --- |
| RAM | 8 GB (for 3B models) | 32+ GB (for 27B+ models) |
| Storage | 5 GB free | 30+ GB (for multiple models) |
| CPU | 4 cores | 8+ cores (AMD EPYC, Ryzen, Intel Xeon) |
| GPU | Not required | NVIDIA GPU with 8+ GB VRAM (speeds things up significantly) |
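
Not sure what your machine has? On a Linux server, these standard commands give a quick read (use Task Manager or Activity Monitor elsewhere):

Terminal window
free -h # total and available RAM
nproc # CPU core count
df -h . # free disk space on the current filesystem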

:::tip CPU-only works, but expect slower responses Ollama runs on CPU-only servers. A 9B model on a modern 8-core CPU gives ~10 tokens/sec. A 31B model on CPU is slower (~2–5 tokens/sec) — each response takes 30–120 seconds, but it works. A GPU dramatically improves this. For CPU-only setups, increase the API timeout in config:

agent:
  api_timeout: 1800 # 30 minutes — generous for slow local models

:::
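
To see why that headroom helps: at 2–5 tokens/sec, a single 1,000-token reply already takes 3–8 minutes, and one agentic turn often chains several generations (a tool call, then reasoning over its output), so a 30-minute ceiling leaves comfortable margin.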

Terminal window
curl -fsSL https://ollama.com/install.sh | sh

Verify it’s running:

Terminal window
ollama --version
curl http://localhost:11434/api/tags # Should return {"models":[]}

Choose based on your hardware:

| Model | Size on Disk | RAM Needed | Tool Calling | Best For |
| --- | --- | --- | --- | --- |
| gemma4:31b | ~20 GB | 24+ GB | Yes | Best quality — strong tool use and reasoning |
| gemma2:27b | ~16 GB | 20+ GB | No | Conversational tasks, no tool use |
| gemma2:9b | ~5 GB | 8+ GB | No | Fast chat, Q&A — cannot call tools |
| llama3.2:3b | ~2 GB | 4+ GB | No | Lightweight quick answers only |

:::warning Tool calling matters Hermes is an agentic assistant — it edits files, runs commands, and browses the web through tool calls. Models without tool-call support can only chat; they can’t take actions. For the full Hermes experience, use a model that supports tools (like gemma4:31b). :::
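
For context, this is roughly what a tool call looks like in the OpenAI-compatible response Ollama returns (abridged sketch; the run_terminal tool and its arguments are invented for illustration, and Hermes's real tool names may differ):

{
  "message": {
    "role": "assistant",
    "tool_calls": [
      {
        "type": "function",
        "function": {
          "name": "run_terminal",
          "arguments": "{\"command\": \"ls -la\"}"
        }
      }
    ]
  }
}

A model without tool support would instead answer with plain prose, which Hermes cannot turn into an action.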

Pull your chosen model:

Terminal window
ollama pull gemma4:31b

:::info Multiple models You can pull several models and switch between them inside Hermes with /model. Ollama loads the active model into memory on demand and unloads idle ones automatically. :::
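
For example, to keep a small chat model alongside the tool-capable one (model names from the table above):

Terminal window
ollama pull gemma2:9b
ollama list # lists every pulled model with its size on disk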

Verify the model works:

Terminal window
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:31b",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 50
  }'

You should see a JSON response with the model’s reply.
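
Abridged, the reply looks roughly like this (exact fields vary by Ollama version, and the content text is just an example):

{
  "object": "chat.completion",
  "model": "gemma4:31b",
  "choices": [
    {
      "message": { "role": "assistant", "content": "Hello! How can I help you today?" },
      "finish_reason": "stop"
    }
  ]
}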

Run the Hermes setup wizard:

Terminal window
hermes setup

When prompted for a provider, select Custom Endpoint and enter:

  • Base URL: http://localhost:11434/v1
  • API Key: Leave empty or type no-key (Ollama doesn’t need one)
  • Model: gemma4:31b (or whichever model you pulled)

Alternatively, edit ~/.hermes/config.yaml directly:

model:
  default: "gemma4:31b"
  provider: "custom"
  base_url: "http://localhost:11434/v1"

Then start Hermes:

Terminal window
hermes

That’s it. You’re now running a fully local agent. Try it out:

You: List all Python files in this directory and count the lines of code in each
You: Read the README.md and summarize what this project does
You: Create a Python script that fetches the weather for Ho Chi Minh City

Hermes will use the terminal tool, file operations, and your local model — no cloud calls.

Step 5: Pick the Right Model for Your Task

Not every task needs the biggest model. Here’s a practical guide:

| Task | Recommended Model | Why |
| --- | --- | --- |
| File edits, code, terminal commands | gemma4:31b | Only model with reliable tool calling |
| Quick Q&A (no tool use needed) | gemma2:9b | Fast responses for conversational tasks |
| Lightweight chat | llama3.2:3b | Fastest, but very limited capabilities |

Switch models on the fly inside a session:

/model gemma2:9b

Step 6: Increase the Context Window

By default, Ollama uses a 2048-token context. For agentic work (tool calls, long conversations), you need more:

Terminal window
# Create a Modelfile that extends context
cat > /tmp/Modelfile << 'EOF'
FROM gemma4:31b
PARAMETER num_ctx 16384
EOF
ollama create gemma4-16k -f /tmp/Modelfile

Then update your Hermes config to use gemma4-16k as the model name.
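
In ~/.hermes/config.yaml that's a one-line change (same layout as the earlier config):

model:
  default: "gemma4-16k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"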

By default, Ollama unloads models after 5 minutes of inactivity. For a persistent gateway bot, keep it loaded:

Terminal window
# Set keep-alive to 24 hours
curl http://localhost:11434/api/generate \
  -d '{"model": "gemma4:31b", "keep_alive": "24h"}'

Or set it globally in Ollama’s environment:

/etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
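
After saving the override, reload systemd and restart Ollama so the new environment takes effect:

Terminal window
sudo systemctl daemon-reload
sudo systemctl restart ollama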

If you have an NVIDIA GPU, Ollama automatically offloads layers to it. Check with:

Terminal window
ollama ps # Shows which model is loaded and how many GPU layers

For a 31B model on a 12 GB GPU, you’ll get partial offload (~40 layers on GPU, rest on CPU), which still gives a significant speedup.
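
To confirm the GPU is actually being used, nvidia-smi should show an ollama process holding VRAM while a response is generating:

Terminal window
watch -n 1 nvidia-smi # an ollama process should appear with VRAM allocated during generation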

Once Hermes works locally in the CLI, you can expose it as a Telegram or Discord bot — still running entirely on your hardware.

  1. Create a bot via @BotFather and get the token
  2. Add to your ~/.hermes/config.yaml:
model:
  default: "gemma4:31b"
  provider: "custom"
  base_url: "http://localhost:11434/v1"

platforms:
  telegram:
    enabled: true
    token: "YOUR_TELEGRAM_BOT_TOKEN"
  3. Start the gateway:
Terminal window
hermes gateway

Now message your bot on Telegram — it responds using your local model.

  1. Create a Discord application at discord.com/developers
  2. Add to config:
platforms:
  discord:
    enabled: true
    token: "YOUR_DISCORD_BOT_TOKEN"
  3. Start: hermes gateway

Local models can struggle with complex tasks. Set up a cloud fallback that only activates when the local model fails:

model:
  default: "gemma4:31b"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  fallback_providers:
    - provider: openrouter
      model: anthropic/claude-sonnet-4

This way, 90% of your usage is free (local), and only the hard tasks hit the paid API.

Ollama isn’t running. Start it:

Terminal window
sudo systemctl start ollama
# or
ollama serve
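
If you installed with the official Linux script, Ollama is registered as a systemd service, so you can also make sure it comes back after a reboot:

Terminal window
sudo systemctl enable --now ollama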

If responses are slow:

  • Check model size vs RAM: If your model needs more RAM than available, it swaps to disk. Use a smaller model or add RAM.
  • Check ollama ps: If no GPU layers are offloaded, responses are CPU-bound. This is normal for CPU-only servers.
  • Reduce context: Large conversations slow down inference. Use /compress regularly, or set a lower compression threshold in config.

Smaller models (3B, 7B) sometimes ignore tool-call instructions and produce plain text instead of structured function calls. Solutions:

  • Use a bigger model: gemma4:31b handles tool calls much better than 3B/7B models (per the table above, it's the only listed model with reliable tool calling).
  • Hermes has auto-repair — it detects malformed tool calls and attempts to fix them automatically.
  • Set up a fallback — if the local model fails 3 times, Hermes falls back to a cloud provider.

The default Ollama context (2048 tokens) is too small for agentic work. See Step 6 to increase it.

Here’s what running locally saves compared to cloud APIs, based on a typical coding session (~100K tokens input, ~20K tokens output):

| Provider | Cost per Session | Monthly (daily use) |
| --- | --- | --- |
| Anthropic Claude Sonnet | ~$0.80 | ~$24 |
| OpenRouter (GPT-4o) | ~$0.60 | ~$18 |
| Ollama (local) | $0.00 | $0.00 |
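
As a rough sanity check on the per-session figure: at Claude Sonnet list prices of about $3 per million input tokens and $15 per million output tokens, 100K input plus 20K output works out to 0.1 × $3 + 0.02 × $15 ≈ $0.60, so the ~$0.80 in the table leaves room for system prompts and retries.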

Your only cost is electricity — roughly $0.01–0.05 per session depending on hardware.

What works well locally:

  • File editing and code generation — models 9B+ handle this well
  • Terminal commands — Hermes wraps the command, runs it, reads output regardless of model
  • Web browsing — the browser tool does the fetching; the model just interprets results
  • Cron jobs and scheduled tasks — work identically to cloud setups
  • Multi-platform gateway — Telegram, Discord, Slack all work with local models

Where cloud models still win:

  • Very complex multi-step reasoning — 70B+ or cloud models like Claude Opus are noticeably better
  • Long context windows — cloud models offer 100K–1M tokens; local models are typically 8K–32K
  • Speed on large responses — cloud inference is faster than CPU-only local for long generations

The sweet spot: use local for everyday tasks, set up a cloud fallback for the hard stuff.