Want your Hermes Agent completely private? No API calls. No vendor lock-in. No monthly bills.
Run it locally with Ollama. Free inference, complete data privacy, fast responses.
Why Ollama + Hermes
- Cost: $0/month (after hardware investment)
- Privacy: Nothing leaves your machine
- Speed: No network round-trips; fast responses with a GPU
- Control: You own everything
Hermes learns, Ollama runs the model, your data stays yours.
Prerequisites
- Hermes Agent installed (see Article 2)
- Ollama installed from ollama.ai
- Hardware: 4GB RAM minimum (8GB recommended)
- Optional: GPU with 4GB+ VRAM (much faster)
Step 1: Install Ollama
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Windows: Download from ollama.ai
Step 2: Start Ollama Server
ollama serve
Keep this terminal open. Ollama runs on http://localhost:11434.
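Quick sanity check: the server's root endpoint answers with a plain status message.
curl http://localhost:11434
# Expected: "Ollama is running"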
Step 3: Download a Model
In another terminal:
# Hermes-optimized models (recommended)
ollama pull mistral # 7B, fast, good quality
ollama pull neural-chat # 7B, conversational
ollama pull orca-mini # 3B, minimal resources
# Or other popular models
ollama pull llama2 # 7B, general purpose
ollama pull dolphin-mixtral # Larger, more powerful
Download time: 5-30 minutes depending on model size and internet speed.
Model choice guide:
- First time? Pick mistral (balance of speed and quality)
- Want faster? Pick orca-mini (3B, 1.5GB VRAM)
- Want best quality? Pick dolphin-mixtral (larger, slower)
Step 4: Verify Ollama Model
curl http://localhost:11434/api/tags
Should return:
{
  "models": [
    {
      "name": "mistral:latest",
      "size": 3800789248
    }
  ]
}
Your model is downloaded and ready.
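If you have several models downloaded and jq installed, you can list just the names:
curl -s http://localhost:11434/api/tags | jq '.models[].name'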
Step 5: Configure Hermes for Ollama
hermes setup
When prompted:
Choose LLM provider: local
Ollama endpoint: http://localhost:11434
Model name: mistral (or your chosen model)
That’s it. Hermes is now connected to your local Ollama.
Step 6: Test It Works
hermes
You should see the CLI prompt. Type:
What is machine learning?
Hermes will:
- Send your question to Ollama
- Wait while Ollama runs inference locally
- Receive the answer back
- Display it in the CLI
Completely local. Completely private.
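Curious what that round-trip looks like on the wire? You can call Ollama's generate endpoint directly with curl; this is roughly the kind of request Hermes issues on your behalf (the exact payload Hermes sends may differ):
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "What is machine learning?",
  "stream": false
}'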
Performance Expectations
Mistral (7B):
- First token: 3-5 seconds
- Subsequent tokens: 0.5-1 second each
- Total response: 10-15 seconds
GPU-accelerated (4GB VRAM):
- First token: 0.5 seconds
- Subsequent tokens: 0.1 second each
- Total response: 3-5 seconds
CPU only:
- Slower (but usable)
- Typical: 20-30 seconds per response
GPU makes a huge difference. Consider adding a GPU if budget allows.
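To measure time-to-first-token on your own hardware, one rough sketch is to stream the generate endpoint and stop after the first chunk. This assumes GNU date for millisecond timestamps (on macOS, use gdate from coreutils), and note that the first request after a cold start also includes model load time:
start=$(date +%s%3N)
curl -sN http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Hi", "stream": true}' | head -n 1 > /dev/null
end=$(date +%s%3N)
echo "Time to first token: $((end - start)) ms"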
Resource Usage
Disk Space:
Mistral 7B: 3.8 GB
Neural-Chat 7B: 4.1 GB
Orca-Mini 3B: 1.8 GB
Memory Usage (RAM + VRAM):
CPU mode: 4-6 GB
GPU mode (VRAM): 3-4 GB (for 7B model)
GPU Options:
- NVIDIA: CUDA-enabled (recommended)
- AMD: ROCm-enabled
- Apple: Metal-accelerated
- Intel: Limited GPU support
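Recent Ollama releases include ollama ps, which shows whether a loaded model is actually running on the GPU:
ollama ps
# The PROCESSOR column reads "100% GPU", "100% CPU", or a split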
Configuring Ollama with Hermes
Your config file (~/.hermes/config.yml):
llm:
  provider: "ollama"
  endpoint: "http://localhost:11434"
  model: "mistral"

  # Performance tuning
  streaming: true
  context_window: 4096
  temperature: 0.7
  top_p: 0.9

  # Inference settings
  num_predict: 512
  num_threads: 8   # CPU threads to use
  num_gpu: 1       # GPU layers to offload
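Most of these knobs map directly onto Ollama's own request options, so you can sanity-check a value outside Hermes by passing it straight to the API (note that Ollama spells the thread option num_thread):
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain overfitting in one paragraph.",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "num_predict": 512,
    "num_thread": 8,
    "num_gpu": 1
  }
}'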
Monitoring Ollama
Check what’s loaded:
curl http://localhost:11434/api/tags
Check memory usage:
# On the machine running Ollama
top # macOS: Activity Monitor
Monitor response times:
# Simple benchmark
time ollama run mistral "Write a 100 word essay on AI"
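The first run after a restart includes model load time, so for a fairer number run the benchmark a few times and compare the later runs:
for i in 1 2 3; do
  time ollama run mistral "Write a 100 word essay on AI" > /dev/null
done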
Switching Models
# Download another model
ollama pull llama2
# Update Hermes config
nano ~/.hermes/config.yml
# Change: model: "llama2"
# Restart Hermes
hermes
Ollama handles model switching. No restart needed for Ollama itself.
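You can see the on-demand loading for yourself; the first call to a newly pulled model just works, with a one-time load delay:
ollama run llama2 "Say hello"    # llama2 loads on first use
ollama run mistral "Say hello"   # mistral reloads if it was evicted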
Multi-Model Setup
Run multiple models simultaneously:
# ~/.hermes/config.yml
models:
  fast:
    provider: "ollama"
    endpoint: "http://localhost:11434"
    model: "mistral"           # Use for quick tasks
  powerful:
    provider: "ollama"
    endpoint: "http://localhost:11434"
    model: "dolphin-mixtral"   # Use for complex tasks
Hermes can automatically pick the right model based on task complexity.
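If you want to see the routing idea without Hermes in the loop, here is a minimal hypothetical shell sketch of length-based routing (the wrapper itself and the 200-character threshold are illustrative, not part of Hermes):
#!/usr/bin/env bash
# Hypothetical router: pick a model by prompt length.
prompt="$1"
if [ "${#prompt}" -lt 200 ]; then
  model="mistral"          # fast model for quick tasks
else
  model="dolphin-mixtral"  # larger model for complex tasks
fi
ollama run "$model" "$prompt"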
Troubleshooting Ollama
"Connection refused"
# Check if Ollama is running
ps aux | grep ollama
# If not, start it
ollama serve
"Model not found"
# Check downloaded models
ollama list
# Download missing model
ollama pull mistral
"Out of memory”
Solutions:
- Use a smaller model: orca-mini instead of dolphin-mixtral
- Quantize the model: mistral:q4 instead of mistral:latest
- Free up VRAM: add a GPU, or close other GPU-heavy apps
"Very slow responses"
Check:
# Model loaded?
curl http://localhost:11434/api/tags
# GPU accelerated?
# For NVIDIA: Check nvidia-smi
nvidia-smi
# If GPU not used, add to config:
# num_gpu: 1
Real-World Scenario: Team Setup
Setup: 10-person team, all using Hermes + Ollama locally.
Architecture:
Each Team Member
├─ Hermes Agent (local)
├─ Ollama server (local, same machine)
├─ Model: Mistral (shared download)
└─ Memory: Local ~/.hermes/memory/
- Cost: $0/month after initial hardware
- Privacy: 100% (no data leaves the office)
- Speed: Comparable to cloud (with GPU)
Scaling: If team needs faster inference, upgrade one machine to server hardware, run central Ollama, all machines connect to it.
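A sketch of that central setup, using Ollama's OLLAMA_HOST variable to bind on all interfaces (192.168.1.50 below is a placeholder for your server's address):
# On the central server
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# In each team member's ~/.hermes/config.yml
# endpoint: "http://192.168.1.50:11434"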
Comparing Local vs. Cloud
| Factor | Local Ollama | Cloud (OpenAI) |
|---|---|---|
| Cost | $0/month | $10-100/month |
| Privacy | 100% local | Sent to provider |
| Speed | 5-15s/response | 3-10s/response |
| Setup | 15 min | Instant |
| Model choice | Limited (20+) | Many more |
| Quality | Good (7B models) | Excellent (GPT-4) |
For most team use cases, Ollama is worth it.
FAQ
Q: Which model should I use?
Start with mistral. It’s fast and good quality.
Q: Do I need a GPU? No, but it helps. 10x faster with GPU (4GB+ VRAM).
Q: Can I switch models mid-conversation? Yes, but Hermes will forget context. Each model is independent.
Q: How much internet bandwidth? None during inference (completely local). Model download is one-time (3-8GB).
Q: Can I share one Ollama server across multiple Hermes instances? Yes. Run Ollama on central server, point all Hermes instances to it.
Q: What about quantized models?
They’re smaller (faster, less VRAM) but slightly lower quality. q4 is usually best trade-off.
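Quantized variants are published as tags on the same model. Exact tag names vary by model, so check the model's page in the Ollama library (the tag below is only an example), and ollama show will tell you what you pulled:
ollama pull mistral:7b-instruct-q4_0   # example tag; verify on the library page
ollama show mistral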
What to Read Next
- Advanced Ollama Tuning — Optimize for speed and cost
- Hermes Architecture — Understand how Hermes uses the LLM
- Connecting to Platforms — Use your local Hermes on Discord/Slack
That’s it. Free, private AI inference. No API keys. No monthly bills. Just local power.
Your data, your model, your machine. Completely under your control.