- Ollama runs open LLMs locally: Llama 3.3, Mistral, Gemma, DeepSeek, Qwen, Phi, and vision models on your own hardware
- `ollama run llama3.3`: pull and start a model in one command
- REST API on `http://localhost:11434`: chat completions, embeddings, and multimodal input
- Python (`ollama` package) and JavaScript (`ollama`) libraries for integration
- GPU acceleration: NVIDIA (CUDA), Apple Silicon (Metal), AMD ROCm
- Modelfile for customizing system prompts, temperature, and context window
Quick reference tables
Installation
| Platform | Command |
|---|---|
| macOS | Download from ollama.com or brew install ollama |
| Linux | `curl -fsSL https://ollama.com/install.sh |
| Windows | Download from ollama.com |
| Docker | docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama |
| Verify | ollama --version |
Core commands
| Command | What it does |
|---|---|
| `ollama run llama3.3` | Pull and start an interactive chat session |
| `ollama run llama3.3 "explain quantum computing"` | Run a single prompt |
| `ollama pull llama3.3` | Download model without starting |
| `ollama list` | List all downloaded models |
| `ollama show llama3.3` | Display model info and config |
| `ollama rm llama3.3` | Remove a model from disk |
| `ollama ps` | Show running models and memory usage |
| `ollama stop llama3.3` | Stop a running model |
| `ollama create custom -f Modelfile` | Create a model from a Modelfile |
| `ollama serve` | Start the Ollama server (API daemon) |
| `ollama cp llama3.3 my-custom-llama` | Copy a model under a new name |
Popular models (May 2026)
| Model | Size | Best for | Memory needed |
|---|---|---|---|
| `llama3.3:70b` | 70B | Best quality, complex reasoning | ~48GB VRAM |
| `llama3.1:8b` | 8B | Fast, good quality | ~8GB VRAM |
| `mistral-nemo:12b` | 12B | Balanced quality/speed | ~16GB VRAM |
| `deepseek-r1:8b` | 8B | Reasoning, code | ~8GB VRAM |
| `deepseek-r1:70b` | 70B | Advanced reasoning | ~48GB VRAM |
| `qwen3:8b` | 8B | Fast, multilingual | ~8GB VRAM |
| `gemma3:4b` | 4B | Lightweight, fast | ~4GB VRAM |
| `phi4-mini:3.8b` | 3.8B | Small footprint, good quality | ~4GB VRAM |
| `llava:7b` | 7B | Vision + text | ~12GB VRAM |
| `qwen2.5vl:7b` | 7B | Better vision | ~12GB VRAM |
| `nomic-embed-text` | 137M | Fast text embeddings | ~500MB |
| `mxbai-embed-large` | 334M | High-quality embeddings | ~1GB |
REST API
Ollama serves a REST API on http://localhost:11434:
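The curl examples in this section set "stream": false to get a single JSON object back. With streaming left on, the server replies with newline-delimited JSON instead: one object per line, and the last one carries "done": true. Below is a minimal sketch of reassembling a streamed /api/chat reply using only the standard library. The helper names (`build_chat_request`, `collect_stream`, `chat_streaming`) are illustrative and not part of any Ollama library, and the chunk shape is assumed to be `{"message": {"content": ...}, "done": ...}`.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this page


def build_chat_request(model: str, user_text: str, stream: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_chunks(lines: Iterable[bytes]) -> Iterator[dict]:
    """Decode newline-delimited JSON, skipping blank lines."""
    for raw in lines:
        raw = raw.strip()
        if raw:
            yield json.loads(raw)


def collect_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message.content from each chunk until one reports done."""
    parts = []
    for chunk in iter_chunks(lines):
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def chat_streaming(model: str, user_text: str) -> str:
    """Send the request and read the reply line by line (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(model, user_text)) as resp:
        return collect_stream(resp)
```

Against a running server, `chat_streaming("qwen3:8b", "Hello")` should return the full reply text. `collect_stream` can also be exercised offline by passing it a list of byte strings.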
Chat completions
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is recursion?" }
  ],
  "stream": false
}'
Generate (raw prompt)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Write a Python function to reverse a string.",
  "stream": false
}'
Embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox jumps over the lazy dog"
}'
Model management
# Copy a model
curl http://localhost:11434/api/copy -d '{"source": "llama3.1:8b", "destination": "my-llama"}'
# Preload a model into memory (a generate request with no prompt only loads it)
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b"}'
# Show model info
curl http://localhost:11434/api/show -d '{"model": "llama3.1:8b"}'
Python integration
pip install ollama
import ollama
# Chat
response = ollama.chat(
    model='llama3.1:8b',
    messages=[
        {'role': 'system', 'content': 'You are a code reviewer.'},
        {'role': 'user', 'content': 'Review this function:\ndef add(a, b): return a + b'},
    ],
    options={'temperature': 0.3},
)
print(response['message']['content'])
# Generate
response = ollama.generate(
    model='deepseek-r1:8b',
    prompt='Explain Docker containers in one sentence.',
    stream=False,
)
print(response['response'])
# Embeddings
embedding = ollama.embeddings(
    model='nomic-embed-text',
    prompt='local vector search example',
)
print(f"Embedding size: {len(embedding['embedding'])}")
# List models (the tag is under the 'model' key)
models = ollama.list()
for m in models['models']:
    print(f"{m['model']}: {m['size'] / 1024**3:.1f} GB")
# Pull a model
ollama.pull('qwen3:8b')
# Streaming (stream=True returns an iterator of partial responses)
for chunk in ollama.chat(model='llama3.1:8b', messages=[{'role': 'user', 'content': 'Hello'}], stream=True):
    print(chunk['message']['content'], end='', flush=True)
JavaScript / TypeScript integration
npm install ollama
// The Ollama class is a named export; the default export is a ready-made client instance
import { Ollama } from 'ollama';
const client = new Ollama();
// Chat
const response = await client.chat({
  model: 'llama3.1:8b',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What are the best practices for REST APIs?' },
  ],
  options: { temperature: 0.5, num_predict: 512 },
});
console.log(response.message.content);
// Streaming
const stream = await client.chat({
  model: 'llama3.1:8b',
  stream: true,
  messages: [{ role: 'user', content: 'Write a bash script to backup a directory.' }],
});
for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
// Generate embeddings
const embedding = await client.embeddings({
  model: 'nomic-embed-text',
  prompt: 'text to embed for vector search',
});
console.log(`Dimension: ${embedding.embedding.length}`);
Modelfile (custom models)
Create a customized model by overriding defaults:
# Modelfile
FROM llama3.1:8b
# System prompt: defines the model's behavior
SYSTEM """
You are a senior backend engineer specializing in Python and PostgreSQL.
When answering questions:
- Provide working code with imports
- Include error handling
- Mention performance considerations
- Use type hints in Python examples
"""
# Parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
# num_gpu is the number of layers offloaded to the GPU; omit it to let Ollama decide
PARAMETER num_gpu 1
Build and run the custom model:
ollama create backend-expert -f Modelfile
ollama run backend-expert "How do I implement connection pooling in asyncpg?"
Common Modelfile parameters
| Parameter | Example | Meaning |
|---|---|---|
| `PARAMETER temperature` | `0.3` | Lower = more deterministic output |
| `PARAMETER num_ctx` | `16384` | Context window size (model dependent) |
| `PARAMETER num_gpu` | `1` | Layers offloaded to GPU |
| `PARAMETER num_thread` | `8` | CPU threads to use |
| `PARAMETER top_k` | `40` | Top-k sampling |
| `PARAMETER top_p` | `0.9` | Nucleus sampling threshold |
| `SYSTEM` | Custom instructions | System prompt override |
| `ADAPTER` | `./my-adapter.gguf` | LoRA adapter |
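When several variants of a model are needed, the Modelfile can be generated from a script instead of edited by hand. The sketch below renders the directives from the table above (FROM, SYSTEM, PARAMETER, ADAPTER) into Modelfile text. `render_modelfile` is a hypothetical helper, not something Ollama ships, and it assumes the system prompt contains no triple quotes.

```python
from typing import Mapping, Optional, Union

Number = Union[int, float]


def render_modelfile(
    base: str,
    system: Optional[str] = None,
    parameters: Optional[Mapping[str, Number]] = None,
    adapter: Optional[str] = None,
) -> str:
    """Return Modelfile text for the given base model and overrides."""
    lines = [f"FROM {base}"]
    if adapter:
        lines.append(f"ADAPTER {adapter}")
    if system:
        # The SYSTEM block is delimited by triple quotes, so reject prompts containing them.
        if '"""' in system:
            raise ValueError("system prompt must not contain triple quotes")
        lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


# Example values are arbitrary; any model tag and parameter set would do.
modelfile = render_modelfile(
    base="qwen3:8b",
    system="You are a concise code reviewer.",
    parameters={"temperature": 0.3, "num_ctx": 8192},
)
print(modelfile)
```

Writing the returned text to a file named Modelfile and passing it to `ollama create <name> -f Modelfile` would then build the variant.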
GPU setup
NVIDIA (CUDA)
Ollama auto-detects NVIDIA GPUs. Check with:
nvidia-smi
ollama ps # The PROCESSOR column shows whether a loaded model runs on GPU or CPU
Apple Silicon (Metal)
# Ollama uses Metal automatically on Apple Silicon
# Check GPU usage: Activity Monitor → Window → GPU History
# Enable debug logging (if you have issues)
OLLAMA_DEBUG=1 ollama serve
Multi-GPU
# Use a specific GPU
CUDA_VISIBLE_DEVICES=0 ollama serve
# Use GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 ollama serve
Memory management
# Check model memory usage
ollama ps
# Stop a model to free memory
ollama stop llama3.1:8b
# Pick a model size that fits limited VRAM
ollama run llama3.1:8b # 8B fits in ~8GB VRAM
ollama run gemma3:4b # 4B fits in ~4GB VRAM
ollama run phi4-mini:3.8b # 3.8B fits in ~4GB VRAM
Common workflows
Local RAG pipeline
import math
import ollama
def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed text chunks for vector search."""
    return [ollama.embeddings(model='nomic-embed-text', prompt=c)['embedding'] for c in chunks]
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
def retrieve(query: str, chunks: list[str], embeddings: list[list[float]], top_k: int = 5) -> list[str]:
    """Find top-k chunks most similar to query."""
    query_emb = ollama.embeddings(model='nomic-embed-text', prompt=query)['embedding']
    # Pair each chunk with its embedding and sort by similarity, best first
    scored = sorted(zip(chunks, embeddings),
                    key=lambda pair: cosine_similarity(query_emb, pair[1]), reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]
def rag_query(query: str, chunks: list[str], embeddings: list[list[float]]) -> str:
    """RAG query: retrieve context + generate answer."""
    context = '\n\n'.join(retrieve(query, chunks, embeddings))
    prompt = f"""Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context above:"""
    response = ollama.chat(model='llama3.1:8b', messages=[{'role': 'user', 'content': prompt}])
    return response['message']['content']
Streaming API with Node.js + Express
import express from 'express';
// The default export is a ready-made client instance
import ollama from 'ollama';
const app = express();
app.use(express.json());
app.post('/api/chat', async (req, res) => {
  const { prompt, model = 'llama3.1:8b' } = req.body;
  const stream = await ollama.chat({ model, stream: true, messages: [{ role: 'user', content: prompt }] });
  res.setHeader('Content-Type', 'text/event-stream');
  // Forward each chunk to the client as a server-sent event
  for await (const chunk of stream) {
    res.write(`data: ${JSON.stringify(chunk)}\n\n`);
  }
  res.end();
});
app.listen(3000); // port 3000 is an arbitrary choice
Summary
- `ollama run`: pull and interact; `ollama serve`: start REST API daemon
- REST API on `localhost:11434`: chat, generate, embeddings endpoints
- Python (`pip install ollama`) and JS libraries for integration
- Modelfile for custom system prompts, temperature, context size
- `ollama ps` shows GPU memory; stop models to free VRAM
- Vision models (`llava`, `qwen2.5vl`) handle images; embedding models for RAG
FAQ
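How much RAM/VRAM do I need? A 7B–8B model needs ~8GB VRAM at 8-bit quantization, and less at the 4-bit quantization that most Ollama tags default to. A 70B model needs roughly 40–48GB at 4-bit; fp16 needs about 140GB for the weights alone. CPU inference works but is 10–50x slower.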
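Figures like these can be sanity-checked with simple arithmetic: the weights take roughly parameters × bits per weight ÷ 8 bytes, and the KV cache and runtime buffers add more on top. The sketch below applies a flat 20% overhead, which is an assumed rule of thumb rather than anything Ollama reports; real usage grows with context length.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimated_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.20) -> float:
    """Weights plus an assumed fractional overhead for KV cache and buffers."""
    return weights_gb(params_billion, bits_per_weight) * (1 + overhead)


for label, params, bits in [
    ("8B at 4-bit", 8, 4),
    ("8B at 8-bit", 8, 8),
    ("70B at 4-bit", 70, 4),
    ("70B at fp16", 70, 16),
]:
    print(f"{label}: weights {weights_gb(params, bits):.0f} GB, "
          f"with overhead about {estimated_gb(params, bits):.0f} GB")
```

For a 70B model this gives 35 GB of weights at 4-bit and 140 GB at fp16, before any overhead.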
Can Ollama run on a server without a GPU? Yes — CPU inference works but is significantly slower. Use smaller models (3B–8B) for reasonable CPU performance.
How does Ollama compare to LM Studio or GPT4All? Ollama has the simplest API and best CLI experience. LM Studio has a better GUI for non-technical users. GPT4All has a broader model catalog but fewer management features. Ollama is the most developer-friendly.
Can I fine-tune models with Ollama? Ollama manages existing models; it doesn’t do training. For fine-tuning, use Hugging Face tooling (transformers, peft) and import the result into Ollama with a Modelfile: FROM for merged weights, or ADAPTER for a LoRA adapter.
What’s the difference between generate and chat? `generate` sends a raw prompt. `chat` uses a structured message format with roles (system, user, assistant). Use `chat` for conversational apps; use `generate` for single-shot tasks and scripting.
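The difference is easiest to see in the request and response shapes. Here is a sketch of both payloads, using the same fields as the curl and Python examples earlier on this page; the helper names are illustrative only.

```python
import json


def generate_payload(model: str, prompt: str) -> dict:
    """Body for POST /api/generate: one raw prompt string."""
    return {"model": model, "prompt": prompt, "stream": False}


def chat_payload(model: str, history: list[dict], user_text: str) -> dict:
    """Body for POST /api/chat: the message history plus the new user turn."""
    messages = history + [{"role": "user", "content": user_text}]
    return {"model": model, "messages": messages, "stream": False}


def reply_text(endpoint: str, body: dict) -> str:
    """Pull the generated text out of a non-streaming response body."""
    if endpoint == "generate":
        return body["response"]
    if endpoint == "chat":
        return body["message"]["content"]
    raise ValueError(f"unknown endpoint: {endpoint}")


# Made-up conversation, for illustration only.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is recursion?"},
    {"role": "assistant", "content": "A function calling itself."},
]
print(json.dumps(chat_payload("qwen3:8b", history, "Show an example."), indent=2))
print(json.dumps(generate_payload("qwen3:8b", "Write a haiku about recursion."), indent=2))
```

Because the server keeps no conversation state between requests, a chat client resends the whole history on every turn, which is what `chat_payload` does.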
What to read next
- Docker Cheat Sheet — containerize Ollama for consistent deployment
- Git Cheat Sheet — version control your Ollama model configs and Modelfiles
- PostgreSQL Cheat Sheet — store embeddings in pgvector alongside your data