Ollama makes it simple to package, distribute, and run large language models (LLMs) locally, on standard desktop hardware or private cloud servers.
This reference sheet covers the essential Ollama CLI commands, custom Modelfile configuration parameters, GPU acceleration environment variables, and local API integrations.
1. Core CLI Commands
Manage the lifecycle of your local models directly from the command line.
| Command | Action | Description |
|---|---|---|
| `ollama serve` | Start Server | Start the background daemon service for the API |
| `ollama run <model>` | Run Model | Download (if missing) and start an interactive session |
| `ollama pull <model>` | Pull Model | Fetch updated model weights without running |
| `ollama rm <model>` | Delete Model | Delete a model and free up disk space |
| `ollama list` | List Models | Display all local models currently available |
| `ollama ps` | Check Active | Show which models are actively loaded into CPU/GPU |
| `ollama show <model>` | Show Info | Display architecture, parameters, and license metadata |
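Most of the commands in the table have a counterpart in the local REST API, which helps when scripting model management. The sketch below is one possible mapping in Python using only the standard library. `ENDPOINTS`, `build_request`, and `send` are illustrative names, the base URL assumes the default local port, and the `model` body key assumes a recent Ollama release (older releases used `name`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # assumption: default local address

# CLI subcommand -> (HTTP method, API path, extra JSON body fields or None)
# None means the endpoint takes no request body at all.
ENDPOINTS = {
    "list": ("GET", "/api/tags", None),
    "ps": ("GET", "/api/ps", None),
    "show": ("POST", "/api/show", {}),
    "pull": ("POST", "/api/pull", {"stream": False}),  # one JSON reply, not a stream
    "rm": ("DELETE", "/api/delete", {}),
}


def build_request(command, model=None, base_url=BASE_URL):
    """Build (but do not send) the HTTP request matching a CLI subcommand."""
    method, path, extra = ENDPOINTS[command]
    data = None
    headers = {}
    if extra is not None:
        if not model:
            raise ValueError(f"'{command}' requires a model name")
        data = json.dumps({"model": model, **extra}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send a request to a running server; return decoded JSON or None."""
    with urllib.request.urlopen(request) as response:
        body = response.read()
    return json.loads(body) if body else None
```

For example, `send(build_request("list"))` would return the locally available models from a running server, while `build_request("rm", "llama3")` prepares the equivalent of `ollama rm llama3` without sending it.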
2. Writing a Custom Modelfile
Similar to a Dockerfile, an Ollama Modelfile allows you to create custom, pre-configured models by defining baseline weights, system instructions, and execution boundaries.
Example: Creating a Custom Senior Developer Assistant
- Create a file named `Modelfile` in your directory:

      # 1. Base weights from official registry
      FROM llama3:8b

      # 2. Configure model parameters
      PARAMETER temperature 0.2
      PARAMETER num_ctx 8192
      # Llama 3 turn delimiters
      PARAMETER stop "<|start_header_id|>"
      PARAMETER stop "<|eot_id|>"

      # 3. Inject baseline system instructions
      SYSTEM """
      You are a senior systems engineer. You write DRY, optimized, and heavily commented code.
      You do not explain basic concepts; you focus strictly on advanced solutions.
      Never use generic marketing phrases or conversational filler words.
      """

- Build the model locally:

      ollama create senior-dev -f ./Modelfile

- Run your custom model:

      ollama run senior-dev
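When several custom models share the same structure, the Modelfile can be generated instead of hand-edited. The following is a minimal sketch in Python, not part of Ollama itself: `render_modelfile` and `create_model` are hypothetical helpers, and `create_model` assumes the `ollama` CLI is on the `PATH`.

```python
import subprocess
from pathlib import Path


def render_modelfile(base, parameters, system):
    """Return Modelfile text with FROM, PARAMETER, and SYSTEM instructions.

    `parameters` is a list of (name, value) pairs so that repeatable
    parameters such as `stop` can appear more than once.
    """
    lines = [f"FROM {base}"]
    for name, value in parameters:
        # Quote string values (e.g. stop sequences); leave numbers bare.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"PARAMETER {name} {rendered}")
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


def create_model(name, modelfile_text, path="Modelfile"):
    """Write the Modelfile and run `ollama create` (requires the ollama CLI)."""
    Path(path).write_text(modelfile_text, encoding="utf-8")
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)


text = render_modelfile(
    "llama3:8b",
    [("temperature", 0.2), ("num_ctx", 8192)],
    "You are a senior systems engineer.",
)
print(text)
```

Calling `create_model("senior-dev", text)` would then write the file and build the model, mirroring the manual steps above.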
3. Server Configuration & GPU Tuning
Ollama detects supported GPUs automatically, but multi-user or high-throughput deployments usually need extra tuning through environment variables.
Linux Service Configuration (systemd)
If running on a Linux host with systemd, create a drop-in override for the service to set custom environment variables:
sudo systemctl edit ollama.service
Add the environment blocks inside the editor:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_MODELS=/mnt/fast-nvme/ollama-models"
Save and restart the daemon:
sudo systemctl daemon-reload
sudo systemctl restart ollama
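If the same settings need to be applied on several hosts, the `[Service]` block can be generated rather than typed by hand. A minimal sketch in Python, assuming the variables shown above; `render_dropin` is a hypothetical helper, and its output is meant to be pasted into the `systemctl edit` editor.

```python
def render_dropin(env):
    """Render a systemd [Service] section from a dict of environment variables."""
    lines = ["[Service]"]
    for key, value in env.items():
        # Keep the sketch simple: refuse values that would need escaping.
        if '"' in str(value):
            raise ValueError(f"unexpected double quote in value for {key}")
        lines.append(f'Environment="{key}={value}"')
    return "\n".join(lines) + "\n"


settings = {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_NUM_PARALLEL": 4,
    "OLLAMA_MAX_LOADED_MODELS": 2,
    "OLLAMA_MODELS": "/mnt/fast-nvme/ollama-models",
}
print(render_dropin(settings), end="")
```

Keeping the settings in a dictionary makes it easy to version them alongside other deployment scripts.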
Essential Environment Tuning Flags
- `OLLAMA_HOST`: Sets the address and port the server binds to; `0.0.0.0:11434` listens on all network interfaces and allows remote connections.
- `OLLAMA_NUM_PARALLEL`: Allows parallel request processing (increases memory consumption but keeps requests from blocking one another).
- `OLLAMA_MAX_LOADED_MODELS`: Sets how many models can stay loaded in memory at the same time.
- `OLLAMA_MODELS`: Changes the storage path for model weights (useful for pointing at a fast NVMe mount).
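After restarting the service, a quick way to confirm the server is reachable on the configured address is to query its version endpoint. The sketch below is an illustration in Python: `client_base_url` and `server_version` are hypothetical helper names, the value is assumed to include a port, and rewriting `0.0.0.0` to the loopback address is a choice made here for local checks, not something Ollama prescribes.

```python
import json
import os
import urllib.request


def client_base_url(ollama_host=None):
    """Turn an OLLAMA_HOST-style value into a URL a local client can call."""
    value = ollama_host or os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    if "://" not in value:
        value = "http://" + value
    # 0.0.0.0 is a bind address, not a destination; use loopback locally.
    return value.replace("://0.0.0.0", "://127.0.0.1").rstrip("/")


def server_version(base_url):
    """Ask a running server for its version via GET /api/version."""
    with urllib.request.urlopen(base_url + "/api/version", timeout=5) as response:
        return json.loads(response.read())["version"]


# Example (requires a running server):
# print(server_version(client_base_url("0.0.0.0:11434")))
```

From another machine, the same check would use the host's real address, for example `client_base_url("gpu-box:11434")`, where `gpu-box` is a placeholder hostname.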
4. Local REST API Integration
Ollama runs a local HTTP server by default. You can easily communicate with it from standard tools or custom scripts.
Generating a Chat Response (curl)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {
      "role": "user",
      "content": "Explain symmetric encryption in one sentence."
    }
  ],
  "stream": false
}'
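The same request can be sent from a script. Below is a minimal sketch in Python using only the standard library; `build_chat_request` and `chat` are illustrative names, and the defaults assume the `llama3` model and the default local port used in the curl example.

```python
import json
import urllib.request


def build_chat_request(prompt, model="llama3", base_url="http://localhost:11434"):
    """Build a non-streaming POST request for the /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON object instead of chunks
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["message"]["content"]


# Example (requires a running server with the model pulled):
# print(chat("Explain symmetric encryption in one sentence."))
```

Separating request construction from sending keeps the payload easy to inspect or log before anything goes over the wire.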
Response Object (JSON)
{
  "model": "llama3",
  "created_at": "2026-06-01T15:20:00.123456Z",
  "message": {
    "role": "assistant",
    "content": "Symmetric encryption uses a single shared key to both encrypt and decrypt data."
  },
  "done": true,
  "total_duration": 450123456,
  "load_duration": 120456,
  "prompt_eval_count": 22,
  "eval_count": 14
}
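The duration fields in the response are reported in nanoseconds, so they usually need converting before display. A small sketch in Python that reads the sample object above; `summarize` is a hypothetical helper, and it only touches the fields shown in this example.

```python
import json

NANOSECONDS_PER_SECOND = 1_000_000_000


def summarize(response):
    """Extract the reply text and timing fields from a non-streaming reply."""
    return {
        "text": response["message"]["content"],
        "total_seconds": response["total_duration"] / NANOSECONDS_PER_SECOND,
        "load_seconds": response["load_duration"] / NANOSECONDS_PER_SECOND,
        "prompt_tokens": response["prompt_eval_count"],
        "output_tokens": response["eval_count"],
    }


raw = """{
  "model": "llama3",
  "message": {"role": "assistant",
              "content": "Symmetric encryption uses a single shared key to both encrypt and decrypt data."},
  "done": true,
  "total_duration": 450123456,
  "load_duration": 120456,
  "prompt_eval_count": 22,
  "eval_count": 14
}"""

summary = summarize(json.loads(raw))
print(f"{summary['output_tokens']} tokens in {summary['total_seconds']:.2f}s")
# prints: 14 tokens in 0.45s
```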
5. Recommended Coding & Agent Models
For developer workflows, consider pulling these code-focused models:
- `qwen2.5-coder:7b`: A strong lightweight coding assistant.
- `deepseek-coder:6.7b`: Well suited to code editing, refactoring, and logic-heavy tasks.
- `codegemma:7b`: Google's lightweight open-weights model tailored for IDE completions.
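Pulling several models in one go is easy to script. The sketch below is one way to do it in Python; `pull_commands` and `pull_all` are hypothetical helpers, the real run assumes the `ollama` CLI is on the `PATH`, and the default is a dry run that only prints the commands.

```python
import subprocess

CODING_MODELS = ["qwen2.5-coder:7b", "deepseek-coder:6.7b", "codegemma:7b"]


def pull_commands(models):
    """Return one `ollama pull` argument list per model tag."""
    return [["ollama", "pull", model] for model in models]


def pull_all(models, dry_run=True):
    """Print the pull commands, or run them when dry_run is False."""
    for command in pull_commands(models):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failed download.
            subprocess.run(command, check=True)


pull_all(CODING_MODELS)  # dry run: only prints the commands
```

Switching to `pull_all(CODING_MODELS, dry_run=False)` performs the downloads sequentially.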