
Ollama Cheat Sheet: Local LLMs, Models, API & Integration (2026)

TL;DR

Ollama runs open LLMs locally — Llama 3.3, Mistral, Gemma, DeepSeek, Qwen, Phi, and vision models on your own hardware
ollama run llama3.3 — pull and start a model in one command
REST API on http://localhost:11434: chat, generate, embeddings, and multimodal (image) input
Python (ollama package) and JavaScript (ollama) libraries for integration
GPU acceleration: NVIDIA (CUDA), Apple Silicon (Metal), AMD ROCm
Modelfile for customizing system prompts, temperature, and context window

Quick reference tables

Installation

Platform	Command
macOS	Download from ollama.com or `brew install ollama`
Linux	`curl -fsSL https://ollama.com/install.sh | sh`
Windows	Download from ollama.com
Docker	`docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama`
Verify	`ollama --version`

Core commands

Command	What it does
`ollama run llama3.3`	Pull and start an interactive chat session
`ollama run llama3.3 "explain quantum computing"`	Run a single prompt
`ollama pull llama3.3`	Download model without starting
`ollama list`	List all downloaded models
`ollama show llama3.3`	Display model info and config
`ollama rm llama3.3`	Remove a model from disk
`ollama ps`	Show running models and memory usage
`ollama stop llama3.3`	Stop a running model
`ollama create custom -f Modelfile`	Create a model from a Modelfile
`ollama serve`	Start the Ollama server (API daemon)
`ollama cp llama3.3 my-custom-llama`	Duplicate and rename a model

Popular models (May 2026)

Model	Size	Best for	Memory needed
`llama3.3:70b`	70B	Best quality, complex reasoning	~48GB VRAM (4-bit)
`llama3.3:8b`	8B	Fast, good quality	~8GB VRAM
`mistral-nemo:12b`	12B	Balanced quality/speed	~16GB VRAM
`deepseek-r1:8b`	8B	Reasoning, code	~8GB VRAM
`deepseek-r1:70b`	70B	Advanced reasoning	~48GB VRAM (4-bit)
`qwen3:8b`	8B	Fast, multilingual	~8GB VRAM
`gemma3:4b`	4B	Lightweight, fast	~4GB VRAM
`phi4:3.8b`	3.8B	Small footprint, good quality	~4GB VRAM
`llava:7b`	7B	Vision + text	~12GB VRAM
`qwen2.5vl:7b`	7B	Better vision	~12GB VRAM
`nomic-embed-text`	137M	Fast text embeddings	~500MB
`mxbai-embed-large`	334M	High-quality embeddings	~1GB

REST API

Ollama serves a REST API on http://localhost:11434:

Chat completions

bash

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:8b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is recursion?" }
  ],
  "stream": false
}'
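
The same request can be built from Python with only the standard library. This is a minimal sketch, assuming the server listens on the default http://localhost:11434; `build_chat_request` and `chat_once` are hypothetical helper names, not part of any Ollama package.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def build_chat_request(model: str, messages: list, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/chat (hypothetical helper)."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_once(model: str, messages: list) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, messages)) as resp:
        return json.load(resp)["message"]["content"]
```

Calling `chat_once("llama3.3:8b", [{"role": "user", "content": "What is recursion?"}])` needs a running server with that model already pulled.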

Generate (raw prompt)

bash

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:8b",
  "prompt": "Write a Python function to reverse a string.",
  "stream": false
}'

Embeddings

bash

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox jumps over the lazy dog"
}'

Model management

bash

# Copy model
curl http://localhost:11434/api/copy -d '{"source": "llama3.3:8b", "destination": "my-llama"}'

# Load a model into memory (no prompt)
curl http://localhost:11434/api/generate -d '{"model": "llama3.3:8b"}'

# Show model info
curl http://localhost:11434/api/show -d '{"name": "llama3.3:8b"}'

Python integration

bash

pip install ollama

python

import ollama

# Chat
response = ollama.chat(
    model='llama3.3:8b',
    messages=[
        {'role': 'system', 'content': 'You are a code reviewer.'},
        {'role': 'user', 'content': 'Review this function:\ndef add(a, b): return a + b'},
    ],
    options={'temperature': 0.3}
)
print(response['message']['content'])

# Generate
response = ollama.generate(
    model='deepseek-r1:8b',
    prompt='Explain Docker containers in one sentence.',
    stream=False
)
print(response['response'])

# Embeddings
embedding = ollama.embeddings(
    model='nomic-embed-text',
    prompt='local vector search example'
)
print(f"Embedding size: {len(embedding['embedding'])}")

# List models
models = ollama.list()
for m in models['models']:
    print(f"{m['model']}: {m['size'] // (1024**3)} GB")

# Pull a model
ollama.pull('qwen3:8b')

# Streaming
for chunk in ollama.chat(model='llama3.3:8b', messages=[{'role': 'user', 'content': 'Hello'}], stream=True):
    print(chunk['message']['content'], end='', flush=True)
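
When streaming is requested from the REST API directly, the reply arrives as one JSON object per line, and the last object has `"done": true`. Below is a minimal sketch of reassembling the text from those lines; `collect_stream` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the content fragments of a streamed /api/chat response."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    return "".join(parts)


# Made-up sample lines in the shape the API streams back
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))  # prints "Hello"
```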

JavaScript / TypeScript integration

bash

npm install ollama

typescript

import { Ollama } from 'ollama';

const client = new Ollama();

// Chat
const response = await client.chat({
  model: 'llama3.3:8b',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What are the best practices for REST APIs?' },
  ],
  options: { temperature: 0.5, num_predict: 512 },
});

console.log(response.message.content);

// Streaming
const stream = await client.chat({
  model: 'llama3.3:8b',
  stream: true,
  messages: [{ role: 'user', content: 'Write a bash script to backup a directory.' }],
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}

// Generate embeddings
const embedding = await client.embeddings({
  model: 'nomic-embed-text',
  prompt: 'text to embed for vector search',
});
console.log(`Dimension: ${embedding.embedding.length}`);

Modelfile (custom models)

Create a customized model by overriding defaults:

dockerfile

# Modelfile
FROM llama3.3:8b

# System prompt — the model's behavior
SYSTEM """
You are a senior backend engineer specializing in Python and PostgreSQL.
When answering questions:
- Provide working code with imports
- Include error handling
- Mention performance considerations
- Use type hints in Python examples
"""

# Parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER num_gpu 1

bash

ollama create backend-expert -f Modelfile
ollama run backend-expert "How do I implement connection pooling in asyncpg?"

Common Modelfile parameters

Parameter	Example	Meaning
`PARAMETER temperature`	`0.3`	Lower = more deterministic output
`PARAMETER num_ctx`	`16384`	Context window size (model dependent)
`PARAMETER num_gpu`	`1`	Number of model layers offloaded to the GPU (omit to let Ollama decide)
`PARAMETER num_thread`	`8`	CPU threads to use
`PARAMETER top_k`	`40`	Top-k sampling
`PARAMETER top_p`	`0.9`	Nucleus sampling threshold
`SYSTEM`	Custom instructions	System prompt override
`ADAPTER`	`./my-adapter.gguf`	LoRA adapter
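
If several variants of a model are needed, the Modelfile text can be generated rather than written by hand. This is a rough sketch, not an Ollama API: `render_modelfile` is a hypothetical helper that only emits `FROM`, `SYSTEM`, and `PARAMETER` lines, and it assumes the system prompt contains no triple quotes.

```python
from typing import Optional


def render_modelfile(base: str, system: str = "", params: Optional[dict] = None) -> str:
    """Render Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}"]
    if system:
        # Assumes the prompt itself contains no triple quotes
        lines.append(f'SYSTEM """{system}"""')
    for name, value in (params or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.3:8b",
    system="You are a code reviewer.",
    params={"temperature": 0.3, "num_ctx": 8192},
)
print(text)
```

Writing `text` to a file named `Modelfile` and running `ollama create` with `-f Modelfile` would then build the model as described above.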

GPU setup

NVIDIA (CUDA)

Ollama auto-detects NVIDIA GPUs. Check with:

bash

nvidia-smi
ollama ps  # PROCESSOR column shows whether a loaded model runs on GPU or CPU

Apple Silicon (Metal)

bash

# Ollama uses Metal automatically on Apple Silicon
# Check GPU memory usage in the GUI: Activity Monitor → GPU tab

# Enable debug logging (if you have issues)
OLLAMA_DEBUG=1 ollama serve

Multi-GPU

bash

# Use only GPU 0
CUDA_VISIBLE_DEVICES=0 ollama serve

# Use GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 ollama serve

Memory management

bash

# Check model memory usage
ollama ps

# Stop a model to free memory
ollama stop llama3.3:8b

# Run a specific size on limited VRAM
ollama run llama3.3:8b  # 8B fits in ~8GB VRAM
ollama run gemma3:4b     # 4B fits in ~4GB VRAM
ollama run phi4:3.8b     # 3.8B fits in ~4GB VRAM

Common workflows

Local RAG pipeline

python

import math

import ollama

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed text chunks for vector search."""
    return [ollama.embeddings(model='nomic-embed-text', prompt=c)['embedding'] for c in chunks]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query: str, chunks: list[str], embeddings: list[list[float]], top_k: int = 5) -> list[str]:
    """Find top-k chunks most similar to query."""
    query_emb = ollama.embeddings(model='nomic-embed-text', prompt=query)['embedding']
    scored = sorted(zip(chunks, embeddings),
                    key=lambda pair: cosine_similarity(query_emb, pair[1]), reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

def rag_query(query: str, chunks: list[str], embeddings: list[list[float]]):
    """RAG query: retrieve context + generate answer."""
    context = '\n\n'.join(retrieve(query, chunks, embeddings))
    prompt = f"""Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context above:"""
    response = ollama.chat(model='llama3.3:8b', messages=[{'role': 'user', 'content': prompt}])
    return response['message']['content']
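
The pipeline above takes `chunks` as given. One simple way to produce them is fixed-size character windows with a small overlap; the sketch below assumes that strategy, and `chunk_text` with its default sizes is a hypothetical helper chosen for illustration (splitting on sentences or tokens is also common).

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character chunks of at most `size`, overlapping by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The resulting list can be passed to `embed_chunks` and then to `rag_query` together with the embeddings.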

Streaming API with Node.js + Express

typescript

import express from 'express';
import ollama from 'ollama';

const app = express();
app.use(express.json());

app.post('/api/chat', async (req, res) => {
  const { prompt, model = 'llama3.3:8b' } = req.body;
  const stream = await ollama.chat({ model, stream: true, messages: [{ role: 'user', content: prompt }] });
  res.setHeader('Content-Type', 'text/event-stream');
  for await (const chunk of stream) {
    res.write(`data: ${JSON.stringify(chunk)}\n\n`);
  }
  res.end();
});

// Port 3000 is an arbitrary choice for this example
app.listen(3000);
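
On the consuming side, each event written by the handler is a `data:` line holding one JSON chunk, followed by a blank line. This is a minimal Python sketch of reading such a stream; `parse_sse` is a hypothetical helper and the frames are made-up examples in that shape.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line in a server-sent event stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


# Made-up frames in the shape the endpoint writes
frames = [
    'data: {"message": {"content": "Hi"}}',
    "",
    'data: {"message": {"content": " there"}}',
    "",
]
text = "".join(event["message"]["content"] for event in parse_sse(frames))
print(text)  # prints "Hi there"
```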

Summary

ollama run — pull and interact; ollama serve — start REST API daemon
REST API on localhost:11434 — chat, generate, embeddings endpoints
Python (pip install ollama) and JS libraries for integration
Modelfile for custom system prompts, temperature, context size
ollama ps shows GPU memory; stop models to free VRAM
Vision models (llava, qwen2.5vl) handle images; embeddings models for RAG

FAQ

How much RAM/VRAM do I need? A 7B model needs ~8GB VRAM at 8-bit. A 70B model needs roughly 40–48GB at 4-bit quantization and about 140GB at fp16. CPU inference works but is 10–50x slower.
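
As a rough check on figures like these, the weights alone take about parameters × bits per weight ÷ 8 bytes. The sketch below assumes that rule of thumb and ignores the KV cache and runtime overhead, so treat its output as a lower bound; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


print(weight_memory_gb(70, 16))  # 140.0
print(weight_memory_gb(70, 4))   # 35.0
print(weight_memory_gb(7, 8))    # 7.0
```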

Can Ollama run on a server without a GPU? Yes — CPU inference works but is significantly slower. Use smaller models (3B–8B) for reasonable CPU performance.

How does Ollama compare to LM Studio or GPT4All? Ollama has the simplest API and best CLI experience. LM Studio has a better GUI for non-technical users. GPT4All has a broader model catalog but fewer management features. Ollama is the most developer-friendly.

Can I fine-tune models with Ollama? Ollama manages existing models; it doesn’t do training. For fine-tuning, use Hugging Face tooling (transformers, peft) and import the resulting adapter into Ollama through a Modelfile that uses the ADAPTER instruction.

What’s the difference between generate and chat? generate sends a raw prompt. chat uses a structured message format with roles (system, user, assistant). Use chat for conversational apps; use generate for single-shot tasks and scripting.
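
The difference shows up in the request bodies. A small sketch of the two shapes for the same prompt, using hypothetical helper names `generate_payload` and `chat_payload`:

```python
def generate_payload(model: str, prompt: str) -> dict:
    """Request body for /api/generate: one raw prompt string."""
    return {"model": model, "prompt": prompt, "stream": False}


def chat_payload(model: str, prompt: str, system: str = "") -> dict:
    """Request body for /api/chat: the same prompt wrapped in role-tagged messages."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages, "stream": False}
```

For a multi-turn conversation, earlier user and assistant messages would be appended to `messages` before the new user turn.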
