Ollama & Local LLM Management Cheatsheet: Self-Hosted AI Guide

Ollama makes it simple to package, distribute, and run large language models (LLMs) locally, on standard desktop hardware or private cloud servers.

This reference sheet covers the essential Ollama CLI commands, custom Modelfile configuration parameters, GPU acceleration environment variables, and local API integrations.

- **`ollama run <model>`**: Start a model and enter an interactive session in your terminal.
- **`ollama pull <model>`**: Download the latest weights for a specific model from the library.
- **Modelfile**: A configuration file that defines the system prompt, temperature, and other parameters for a custom local model.
- **OLLAMA_NUM_PARALLEL**: Set this environment variable to let the server handle multiple concurrent requests.
- **Local Endpoint**: By default, Ollama hosts a local REST API endpoint at `http://localhost:11434`.


1. Core CLI Commands

Manage the lifecycle of your local models directly from the command line.

| Command | Action | Description |
| --- | --- | --- |
| `ollama serve` | Start Server | Start the Ollama server that exposes the API |
| `ollama run <model>` | Run Model | Download (if missing) and start an interactive session |
| `ollama pull <model>` | Pull Model | Download or update model weights without running the model |
| `ollama rm <model>` | Delete Model | Delete a model and free up disk space |
| `ollama list` | List Models | Display all models available locally |
| `ollama ps` | Check Active | Show which models are currently loaded and whether they run on CPU or GPU |
| `ollama show <model>` | Show Info | Display architecture, parameters, and license metadata |
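
The commands above are easy to script. The following is a minimal Python sketch, not an official client: it assumes the `ollama` binary is on your `PATH` and that `ollama list` prints a header row followed by one model per line with the name in the first column. `parse_model_names` and `ensure_models` are hypothetical helper names.

```python
import shutil
import subprocess


def parse_model_names(list_output: str) -> list[str]:
    """Extract model names from `ollama list` output.

    Assumption: the first non-empty line is a header and the first
    whitespace-separated column of every other line is the model name.
    """
    lines = [line for line in list_output.splitlines() if line.strip()]
    return [line.split()[0] for line in lines[1:]]


def ensure_models(wanted: list[str]) -> list[str]:
    """Pull every model in `wanted` that is missing locally; return what was pulled.

    Names are compared exactly, so include the tag (for example "llama3:8b").
    """
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama binary not found on PATH")
    listed = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    ).stdout
    present = set(parse_model_names(listed))
    missing = [name for name in wanted if name not in present]
    for name in missing:
        # check=True raises CalledProcessError if the pull fails.
        subprocess.run(["ollama", "pull", name], check=True)
    return missing
```

Calling `ensure_models(["llama3:8b"])` would pull the model only when it is not already listed.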

2. Writing a Custom `Modelfile`

Similar to a Dockerfile, an Ollama Modelfile allows you to create custom, pre-configured models by defining baseline weights, system instructions, and execution boundaries.

Example: Creating a Custom Senior Developer Assistant

Create a file named `Modelfile` in your working directory:

```
# 1. Base weights from the official registry
FROM llama3:8b

# 2. Configure model parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
# Llama 3 marks turn boundaries with these tokens; [INST] and [/INST] are Llama 2 markers
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|eot_id|>"

# 3. Inject baseline system instructions
SYSTEM """
You are a senior systems engineer. You write DRY, optimized, and heavily commented
code. You do not explain basic concepts; you focus strictly on advanced solutions.
Never use generic marketing phrases or conversational filler words.
"""
```

Build the model locally:
```
ollama create senior-dev -f ./Modelfile
```
Run your custom model:
```
ollama run senior-dev
```
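
If you maintain several custom models, the Modelfile text can be generated instead of written by hand. This is a minimal sketch that only emits the `FROM`, `PARAMETER`, and `SYSTEM` instructions used above; `render_modelfile` is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base: str, parameters: dict[str, object], system: str) -> str:
    """Render Modelfile text from a base model, parameters, and a system prompt."""
    lines = [f"FROM {base}"]
    for key, value in parameters.items():
        # A list value emits one PARAMETER line per entry (e.g. several stop strings).
        values = value if isinstance(value, list) else [value]
        for item in values:
            # String values are quoted; numbers are written as-is.
            rendered = f'"{item}"' if isinstance(item, str) else str(item)
            lines.append(f"PARAMETER {key} {rendered}")
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "llama3:8b",
    {"temperature": 0.2, "num_ctx": 8192},
    "You are a senior systems engineer.",
)
print(modelfile_text, end="")
```

Writing `modelfile_text` to `./Modelfile` and running `ollama create senior-dev -f ./Modelfile` would build the model as before.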

3. Server Configuration & GPU Tuning

Ollama detects supported GPUs automatically, but multi-user or high-throughput deployments usually need manual tuning through environment variables.

Linux Service Configuration (`systemd`)

If Ollama runs as a systemd service on a Linux host, create a service override to inject custom environment variables:

```
sudo systemctl edit ollama.service
```

Add the environment blocks inside the editor:

```
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_MODELS=/mnt/fast-nvme/ollama-models"
```

Save and restart the daemon:

```
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

Essential Environment Tuning Flags

- **OLLAMA_HOST**: Sets the address and port the server binds to. `0.0.0.0:11434` listens on all network interfaces, which allows remote connections.
- **OLLAMA_NUM_PARALLEL**: Sets how many requests are processed in parallel. Higher values reduce queueing but increase GPU memory consumption.
- **OLLAMA_MAX_LOADED_MODELS**: Sets how many models can stay loaded in memory at the same time.
- **OLLAMA_MODELS**: Changes the directory where model weights are stored (useful for pointing at a fast NVMe volume).
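
When provisioning several hosts, the `[Service]` block can be rendered from a dictionary rather than typed into the editor each time. This is a minimal sketch; `render_override` is a hypothetical helper, and the values simply mirror the example configuration in this section.

```python
def render_override(env: dict[str, str]) -> str:
    """Render a systemd [Service] block with one Environment= line per variable."""
    lines = ["[Service]"]
    for key, value in env.items():
        # Keep the sketch simple: refuse values that would need escaping.
        if '"' in value or "\n" in value:
            raise ValueError(f"unsupported character in value for {key}")
        lines.append(f'Environment="{key}={value}"')
    return "\n".join(lines) + "\n"


settings = {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_NUM_PARALLEL": "4",
    "OLLAMA_MAX_LOADED_MODELS": "2",
    "OLLAMA_MODELS": "/mnt/fast-nvme/ollama-models",
}
print(render_override(settings), end="")
```

The printed text can be pasted into the editor opened by `sudo systemctl edit ollama.service`.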

4. Local REST API Integration

Ollama runs a local HTTP server by default. You can easily communicate with it from standard tools or custom scripts.

Generating a Chat Response (CURL)

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {
      "role": "user",
      "content": "Explain symmetric encryption in one sentence."
    }
  ],
  "stream": false
}'
```
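
The same request can be sent from a script. This is a minimal sketch using only the Python standard library; it assumes the server is listening on the default `http://localhost:11434`, and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint


def build_chat_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server running, `chat("llama3", "Explain symmetric encryption in one sentence.")` would return the assistant's reply as a string.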

Response Object (JSON)

```
{
  "model": "llama3",
  "created_at": "2026-06-01T15:20:00.123456Z",
  "message": {
    "role": "assistant",
    "content": "Symmetric encryption uses a single shared key to both encrypt and decrypt data."
  },
  "done": true,
  "total_duration": 450123456,
  "load_duration": 120456,
  "prompt_eval_count": 22,
  "eval_count": 14
}
```
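
A script usually needs only the reply text and a few timing figures from this object. This is a minimal sketch, assuming the duration fields are reported in nanoseconds; `summarize_response` is a hypothetical helper, and `eval_duration` is a field that does not appear in the sample above, so it is treated as optional.

```python
import json


def summarize_response(raw: str) -> dict:
    """Pull the reply text and basic timing out of a non-streaming /api/chat response."""
    data = json.loads(raw)
    summary = {
        "content": data["message"]["content"],
        # Assumption: durations are expressed in nanoseconds.
        "total_seconds": data["total_duration"] / 1e9,
        "output_tokens": data["eval_count"],
    }
    # eval_duration is not part of the sample object, so use it only when present.
    eval_ns = data.get("eval_duration")
    if eval_ns:
        summary["tokens_per_second"] = data["eval_count"] / (eval_ns / 1e9)
    return summary
```

For the sample object, `total_seconds` works out to roughly 0.45 and `output_tokens` to 14.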

5. Recommended Coding & Agent Models

For developer workflows, consider pulling these code-focused models:

- **qwen2.5-coder:7b**: A strong lightweight coding assistant.
- **deepseek-coder:6.7b**: Well suited to code editing, refactoring, and logic-heavy tasks.
- **codegemma:7b**: Google's lightweight open-weights model tailored for IDE completions.
