Ollama Cheat Sheet: Local LLMs, Models, API & Integration (2026)

By Darsh Jariwala
| Updated: May 19, 2026
TL;DR
  • Ollama runs open LLMs locally — Llama 3.3, Mistral, Gemma, DeepSeek, Qwen, Phi, and vision models on your own hardware
  • ollama run llama3.3 — pull and start a model in one command
  • REST API on http://localhost:11434 for chat, text generation, embeddings, and image input for vision models
  • Python (ollama package) and JavaScript (ollama) libraries for integration
  • GPU acceleration: NVIDIA (CUDA), Apple Silicon (Metal), AMD ROCm
  • Modelfile for customizing system prompts, temperature, and context window

Quick reference tables

Installation

Platform  Command
macOS     Download from ollama.com or brew install ollama
Linux     curl -fsSL https://ollama.com/install.sh | sh
Windows   Download from ollama.com
Docker    docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
Verify    ollama --version

Core commands

Command                                          What it does
ollama run llama3.3                              Pull and start an interactive chat session
ollama run llama3.3 "explain quantum computing"  Run a single prompt
ollama pull llama3.3                             Download model without starting
ollama list                                      List all downloaded models
ollama show llama3.3                             Display model info and config
ollama rm llama3.3                               Remove a model from disk
ollama ps                                        Show running models and memory usage
ollama stop llama3.3                             Stop a running model
ollama create custom -f Modelfile                Create a model from a Modelfile
ollama serve                                     Start the Ollama server (API daemon)
ollama cp llama3.3 my-custom-llama               Duplicate and rename a model

Models

Model              Size  Best for                         Memory needed
llama3.3:70b       70B   Best quality, complex reasoning  ~48GB VRAM (4-bit)
llama3.1:8b        8B    Fast, good quality               ~8GB VRAM
mistral-nemo:12b   12B   Balanced quality/speed           ~16GB VRAM
deepseek-r1:8b     8B    Reasoning, code                  ~8GB VRAM
deepseek-r1:70b    70B   Advanced reasoning               ~48GB VRAM (4-bit)
qwen3:8b           8B    Fast, multilingual               ~8GB VRAM
gemma3:4b          4B    Lightweight, fast                ~4GB VRAM
phi4-mini:3.8b     3.8B  Small footprint, good quality    ~4GB VRAM
llava:7b           7B    Vision + text                    ~12GB VRAM
qwen2.5vl:7b       7B    Better vision                    ~12GB VRAM
nomic-embed-text   137M  Fast text embeddings             ~500MB
mxbai-embed-large  334M  High-quality embeddings          ~1GB

REST API

Ollama serves a REST API on http://localhost:11434:

Chat completions

bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is recursion?" }
  ],
  "stream": false
}'

Generate (raw prompt)

bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Write a Python function to reverse a string.",
  "stream": false
}'
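
Both requests above set "stream": false. When the flag is omitted, Ollama streams the reply as newline-delimited JSON, one object per line, with the last object marked "done": true. Below is a minimal sketch of reassembling such a stream in Python. The helper name join_stream is hypothetical, and the sample lines are hand-written to show the shape, not captured from a server.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate the text fragments from a streamed Ollama response.

    Handles both shapes: /api/generate puts text in "response",
    /api/chat puts it in "message" -> "content".
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        if "message" in obj:
            parts.append(obj["message"].get("content", ""))
        else:
            parts.append(obj.get("response", ""))
        if obj.get("done"):
            break  # final object reached
    return "".join(parts)


# Hand-written sample lines (assumed shape, not real server output)
sample = [
    '{"response": "Hel", "done": false}',
    '{"response": "lo", "done": false}',
    '{"response": "", "done": true}',
]
print(join_stream(sample))
```

In practice the lines would come from iterating over the HTTP response body rather than from a list.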

Embeddings

bash
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox jumps over the lazy dog"
}'

Model management

bash
# Copy a model under a new name
curl http://localhost:11434/api/copy -d '{"source": "llama3.1:8b", "destination": "my-llama"}'

# Preload a model into memory (a generate request with no prompt)
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b"}'

# Show model info
curl http://localhost:11434/api/show -d '{"model": "llama3.1:8b"}'

Python integration

bash
pip install ollama
python
import ollama

# Chat
response = ollama.chat(
    model='llama3.1:8b',
    messages=[
        {'role': 'system', 'content': 'You are a code reviewer.'},
        {'role': 'user', 'content': 'Review this function:\ndef add(a, b): return a + b'},
    ],
    options={'temperature': 0.3}
)
print(response['message']['content'])

# Generate
response = ollama.generate(
    model='deepseek-r1:8b',
    prompt='Explain Docker containers in one sentence.',
    stream=False
)
print(response['response'])

# Embeddings
embedding = ollama.embeddings(
    model='nomic-embed-text',
    prompt='local vector search example'
)
print(f"Embedding size: {len(embedding['embedding'])}")

# List models (each entry exposes its tag under 'model' and its size in bytes)
models = ollama.list()
for m in models['models']:
    print(f"{m['model']}: {m['size'] / 1024**3:.1f} GB")

# Pull a model
ollama.pull('qwen3:8b')

# Streaming (stream=True returns an iterator of partial responses)
for chunk in ollama.chat(model='llama3.1:8b', messages=[{'role': 'user', 'content': 'Hello'}], stream=True):
    print(chunk['message']['content'], end='', flush=True)
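
Each chat call is stateless: the server only sees the messages passed in that request, so a multi-turn conversation has to resend the history every time. Below is a minimal sketch of that bookkeeping, assuming a hypothetical Conversation helper. The chat function is injected so the class can be exercised without a running server; in real use you would pass ollama.chat.

```python
from typing import Callable


class Conversation:
    """Keeps the message history and resends it on every turn (hypothetical helper)."""

    def __init__(self, chat_fn: Callable[..., dict], model: str, system: str = ""):
        self.chat_fn = chat_fn  # e.g. ollama.chat
        self.model = model
        self.messages: list[dict] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        response = self.chat_fn(model=self.model, messages=self.messages)
        reply = response["message"]["content"]
        # Store the reply so the next turn includes it as context
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Stand-in for ollama.chat so the sketch runs offline
def fake_chat(model: str, messages: list[dict]) -> dict:
    return {"message": {"role": "assistant", "content": f"seen {len(messages)} messages"}}


convo = Conversation(fake_chat, model="llama3.3", system="Be brief.")
print(convo.ask("Hi"))
print(convo.ask("And again?"))
```

Long conversations eventually exceed the context window, so a real application would also trim or summarize older messages.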

JavaScript / TypeScript integration

bash
npm install ollama
typescript
// The class is a named export; the default export is a ready-made client instance
import { Ollama } from 'ollama';

const client = new Ollama();

// Chat
const response = await client.chat({
  model: 'llama3.1:8b',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What are the best practices for REST APIs?' },
  ],
  options: { temperature: 0.5, num_predict: 512 },
});

console.log(response.message.content);

// Streaming
const stream = await client.chat({
  model: 'llama3.1:8b',
  stream: true,
  messages: [{ role: 'user', content: 'Write a bash script to backup a directory.' }],
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}

// Generate embeddings
const embedding = await client.embeddings({
  model: 'nomic-embed-text',
  prompt: 'text to embed for vector search',
});
console.log(`Dimension: ${embedding.embedding.length}`);

Modelfile (custom models)

Create a customized model by overriding defaults:

dockerfile
# Modelfile
FROM llama3.1:8b

# System prompt — the model's behavior
SYSTEM """
You are a senior backend engineer specializing in Python and PostgreSQL.
When answering questions:
- Provide working code with imports
- Include error handling
- Mention performance considerations
- Use type hints in Python examples
"""

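# Parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
# Optional: num_gpu sets how many layers are offloaded to the GPU.
# Leave it unset to let Ollama choose automatically.
# PARAMETER num_gpu 1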
bash
ollama create backend-expert -f Modelfile
ollama run backend-expert "How do I implement connection pooling in asyncpg?"

Common Modelfile parameters

Parameter              Example              Meaning
PARAMETER temperature  0.3                  Lower = more deterministic output
PARAMETER num_ctx      16384                Context window size (model dependent)
PARAMETER num_gpu      1                    Layers offloaded to GPU
PARAMETER num_thread   8                    CPU threads to use
PARAMETER top_k        40                   Top-k sampling
PARAMETER top_p        0.9                  Nucleus sampling threshold
SYSTEM                 Custom instructions  System prompt override
ADAPTER                ./my-adapter.gguf    LoRA adapter
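
When several variants of a custom model are needed, the Modelfile can be generated instead of written by hand. Below is a minimal sketch, assuming a hypothetical render_modelfile helper that emits only the FROM, SYSTEM, and PARAMETER instructions listed above. It does no validation of parameter names or values.

```python
def render_modelfile(base: str, system: str = "", **params) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}"]
    if system:
        # Triple quotes let the system prompt span multiple lines
        lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    for name, value in params.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.3",
    system="You are a senior backend engineer.",
    temperature=0.3,
    num_ctx=8192,
)
print(text)
```

Write the returned text to a file named Modelfile and pass it to ollama create with the -f flag.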

GPU setup

NVIDIA (CUDA)

Ollama auto-detects NVIDIA GPUs. Check with:

bash
nvidia-smi   # Confirm the driver sees the GPU
ollama ps    # PROCESSOR column shows whether a loaded model runs on GPU or CPU

Apple Silicon (Metal)

bash
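# Ollama uses Metal automatically on Apple Silicon.
# Check GPU and memory usage in Activity Monitor.

# Enable debug logging to troubleshoot GPU issues
OLLAMA_DEBUG=1 ollama serve

# To force CPU-only inference, set num_gpu to 0
# (for example, PARAMETER num_gpu 0 in a Modelfile)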

Multi-GPU

bash
# Restrict Ollama to a single GPU
CUDA_VISIBLE_DEVICES=0 ollama serve

# Expose GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 ollama serve

Memory management

bash
# Check model memory usage
ollama ps

# Stop a model to free memory
ollama stop llama3.1:8b

# Pick a model size that fits your VRAM
ollama run llama3.1:8b      # 8B fits in ~8GB VRAM
ollama run gemma3:4b        # 4B fits in ~4GB VRAM
ollama run phi4-mini:3.8b   # 3.8B fits in ~4GB VRAM
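
VRAM figures like these come from a rule of thumb: weight memory is roughly the parameter count times the bits per weight divided by 8, plus headroom for the KV cache and runtime buffers. Below is a rough sketch of that arithmetic. The 20% overhead factor is an assumption for illustration; real usage depends on context length and quantization format.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.2) -> float:
    """Rough memory estimate in GB: weights plus a fractional overhead (assumed 20%)."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits is about 1 GB
    return round(weight_gb * (1 + overhead), 1)


for name, size, bits in [
    ("8B @ 4-bit", 8, 4),
    ("8B @ 8-bit", 8, 8),
    ("70B @ 4-bit", 70, 4),
    ("70B @ fp16", 70, 16),
]:
    print(f"{name}: ~{estimate_memory_gb(size, bits)} GB")
```

Treat the result as a lower bound when planning hardware, and confirm actual usage with ollama ps after loading the model.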

Common workflows

Local RAG pipeline

python
import math

import ollama

EMBED_MODEL = 'nomic-embed-text'

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed text chunks for vector search."""
    return [ollama.embeddings(model=EMBED_MODEL, prompt=c)['embedding'] for c in chunks]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query: str, chunks: list[str], embeddings: list[list[float]], top_k: int = 5) -> list[str]:
    """Find top-k chunks most similar to query."""
    query_emb = ollama.embeddings(model=EMBED_MODEL, prompt=query)['embedding']
    scored = sorted(zip(chunks, embeddings),
                    key=lambda pair: cosine_similarity(query_emb, pair[1]), reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

def rag_query(query: str, chunks: list[str], embeddings: list[list[float]]) -> str:
    """RAG query: retrieve context + generate answer."""
    context = '\n\n'.join(retrieve(query, chunks, embeddings))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context above:"
    response = ollama.chat(model='llama3.1:8b', messages=[{'role': 'user', 'content': prompt}])
    return response['message']['content']
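
The pipeline above expects the document to be split into chunks already. Below is a minimal sketch of one common approach, fixed-size word windows with overlap. The chunk_text name and the default sizes are assumptions for illustration, not something the ollama package provides.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already reaches the end of the text
    return chunks


demo = chunk_text("one two three four five six seven", size=3, overlap=1)
print(demo)
```

The resulting list can be passed to embed_chunks, and both lists then go to rag_query. Overlap keeps sentences that straddle a boundary retrievable from either side.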

Streaming API with Node.js + Express

typescript
import express from 'express';
import ollama from 'ollama'; // default export is a ready-made client instance

const app = express();
app.use(express.json());

app.post('/api/chat', async (req, res) => {
  const { prompt, model = 'llama3.1:8b' } = req.body;
  // Send the SSE headers before the first chunk is written
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  const stream = await ollama.chat({ model, stream: true, messages: [{ role: 'user', content: prompt }] });
  for await (const chunk of stream) {
    res.write(`data: ${JSON.stringify(chunk)}\n\n`);
  }
  res.end();
});

app.listen(3000); // port is an arbitrary choice

Summary

  • ollama run — pull and interact; ollama serve — start REST API daemon
  • REST API on localhost:11434 — chat, generate, embeddings endpoints
  • Python (pip install ollama) and JS libraries for integration
  • Modelfile for custom system prompts, temperature, context size
  • ollama ps shows GPU memory; stop models to free VRAM
  • Vision models (llava, qwen2.5vl) handle images; embedding models power RAG

FAQ

How much RAM/VRAM do I need? A 7B to 8B model needs about 8GB of VRAM at 8-bit quantization, and less at 4-bit. A 70B model needs roughly 40 to 48GB at 4-bit quantization and about 140GB at fp16. CPU inference works but is typically 10 to 50x slower.

Can Ollama run on a server without a GPU? Yes — CPU inference works but is significantly slower. Use smaller models (3B–8B) for reasonable CPU performance.

How does Ollama compare to LM Studio or GPT4All? Ollama has the simplest API and best CLI experience. LM Studio has a better GUI for non-technical users. GPT4All has a broader model catalog but fewer management features. Ollama is the most developer-friendly.

Can I fine-tune models with Ollama? No. Ollama runs and manages existing models; it does not train them. For fine-tuning, use Hugging Face tooling (transformers, peft), then import the result into Ollama with a Modelfile that uses the ADAPTER instruction.

What’s the difference between generate and chat? generate sends a raw prompt. chat uses a structured message format with roles (system, user, assistant). Use chat for conversational apps; use generate for single-shot tasks and scripting.
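
The difference shows up in the request bodies. Below is a minimal sketch that builds both payloads for the same question, using the field names from the curl examples earlier. The helper names are hypothetical.

```python
import json


def generate_payload(model: str, prompt: str) -> dict:
    """Body for POST /api/generate: one raw prompt string."""
    return {"model": model, "prompt": prompt, "stream": False}


def chat_payload(model: str, user: str, system: str = "") -> dict:
    """Body for POST /api/chat: a list of role-tagged messages."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    return {"model": model, "messages": messages, "stream": False}


print(json.dumps(generate_payload("llama3.3", "What is recursion?"), indent=2))
print(json.dumps(chat_payload("llama3.3", "What is recursion?", system="You are a helpful assistant."), indent=2))
```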