
Ollama & Local LLM Management Cheatsheet: Self-Hosted AI Guide

By Darsh Jariwala

Ollama makes it simple to package, distribute, and run large language models (LLMs) on standard desktop hardware or private cloud servers.

This reference sheet covers the essential Ollama CLI commands, custom Modelfile configuration parameters, GPU acceleration environment variables, and local API integrations.


- **ollama run <model>**: Start a model and open an interactive session in your terminal.
- **ollama pull <model>**: Download the latest weights for a specific model from the library.
- **Modelfile**: A configuration file that defines the system prompt, temperature, and other parameters for custom local models.
- **OLLAMA_NUM_PARALLEL**: Set this environment variable to allow serving multiple concurrent requests.
- **Local Endpoint**: By default, Ollama hosts a local REST API endpoint at `http://localhost:11434`.

Before diving into this cheatsheet, check out my previous deep-dive on Ollama Cheat Sheet: Local LLMs, Models, API & Integration (2026) to see how we structured these patterns in practice.

1. Core CLI Commands

Manage the lifecycle of your local models directly from the command line.

| Command | Action | Description |
| --- | --- | --- |
| `ollama serve` | Start Server | Start the background daemon service for the API |
| `ollama run <model>` | Run Model | Download (if missing) and start an interactive session |
| `ollama pull <model>` | Pull Model | Fetch updated model weights without running |
| `ollama rm <model>` | Delete Model | Delete a model and free up disk space |
| `ollama list` | List Models | Display all local models currently available |
| `ollama ps` | Check Active | Show which models are actively loaded into CPU/GPU |
| `ollama show <model>` | Show Info | Display architecture, parameters, and license metadata |
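For scripted model management, the non-interactive commands can be driven from Python. The following is a minimal sketch, assuming the `ollama` binary is on your `PATH`; the helper names `build_command` and `run_ollama` are illustrative and not part of Ollama itself.

```python
import subprocess
from typing import List, Optional

# Non-interactive subcommands from the table; `run` and `serve` are left out
# because they keep the terminal occupied.
WITH_MODEL = {"pull", "rm", "show"}
WITHOUT_MODEL = {"list", "ps"}


def build_command(action: str, model: Optional[str] = None) -> List[str]:
    """Return the argv list for one ollama subcommand."""
    if action in WITH_MODEL:
        if not model:
            raise ValueError(f"'ollama {action}' needs a model name")
        return ["ollama", action, model]
    if action in WITHOUT_MODEL:
        return ["ollama", action]
    raise ValueError(f"unsupported action: {action}")


def run_ollama(action: str, model: Optional[str] = None) -> str:
    """Run the command and return its stdout; raises if the CLI exits non-zero."""
    result = subprocess.run(
        build_command(action, model),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_ollama("pull", "llama3")` followed by `print(run_ollama("list"))` would refresh a model and then print the local inventory, provided the CLI is installed.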

2. Writing a Custom Modelfile

Similar to a Dockerfile, an Ollama Modelfile allows you to create custom, pre-configured models by defining baseline weights, system instructions, and execution boundaries.

Example: Creating a Custom Senior Developer Assistant

  1. Create a file named Modelfile in your directory:

    # 1. Base weights from official registry
    FROM llama3:8b
    
    # 2. Configure model parameters
    PARAMETER temperature 0.2
    PARAMETER num_ctx 8192
    PARAMETER stop "<|start_header_id|>"
    PARAMETER stop "<|eot_id|>"
    
    # 3. Inject baseline system instructions
    SYSTEM """
    You are a senior systems engineer. You write dry, optimized, and heavily commented 
    code. You do not explain basic concepts; you focus strictly on advanced solutions.
    Never use generic marketing phrases or conversational filler words.
    """
  2. Build the custom model from the Modelfile:

    ollama create senior-dev -f ./Modelfile
  3. Run your custom model:

    ollama run senior-dev
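If you maintain several custom assistants, generating the Modelfile from code keeps them consistent. This is a rough sketch that only emits the `FROM`, `PARAMETER`, and `SYSTEM` instructions used in the example; the function name `render_modelfile` and its arguments are hypothetical.

```python
from typing import Dict, Tuple


def render_modelfile(
    base: str,
    system: str,
    parameters: Dict[str, object],
    stops: Tuple[str, ...] = (),
) -> str:
    """Build Modelfile text with FROM, PARAMETER, and SYSTEM instructions."""
    if '"""' in system:
        # A triple quote inside the prompt would end the SYSTEM block early.
        raise ValueError('system prompt must not contain """')
    lines = [f"FROM {base}"]
    for key, value in parameters.items():
        lines.append(f"PARAMETER {key} {value}")
    for token in stops:
        # Each stop sequence gets its own PARAMETER line.
        lines.append(f'PARAMETER stop "{token}"')
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"
```

Writing the returned text to `./Modelfile` and then running `ollama create senior-dev -f ./Modelfile` follows the same steps as the manual workflow.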

3. Server Configuration & GPU Tuning

Ollama uses supported GPUs automatically, but multi-user or high-throughput deployments usually need manual tuning via environment variables.

Linux Service Configuration (systemd)

If Ollama runs as a systemd service on a Linux host, open an override for the unit to set custom environment variables:

sudo systemctl edit ollama.service

Add the environment blocks inside the editor:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_MODELS=/mnt/fast-nvme/ollama-models"

Save and restart the daemon:

sudo systemctl daemon-reload
sudo systemctl restart ollama
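When provisioning several hosts, the override content can be generated rather than typed into the editor. The sketch below renders the same `[Service]` block from a dictionary; `render_override` is an illustrative helper, and the target path mentioned afterwards is an assumption based on where `systemctl edit` normally stores overrides.

```python
from typing import Dict


def render_override(env: Dict[str, str]) -> str:
    """Render a systemd drop-in with one Environment= line per variable."""
    lines = ["[Service]"]
    for name, value in env.items():
        if '"' in value or "\n" in value:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError(f"unsupported characters in value for {name}")
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


settings = {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_NUM_PARALLEL": "4",
    "OLLAMA_MAX_LOADED_MODELS": "2",
    "OLLAMA_MODELS": "/mnt/fast-nvme/ollama-models",
}
override_text = render_override(settings)
```

The resulting text would typically be written to `/etc/systemd/system/ollama.service.d/override.conf`, followed by the same `daemon-reload` and `restart` commands.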

Essential Environment Tuning Variables

  • OLLAMA_HOST: Sets the address and port the server binds to; 0.0.0.0:11434 listens on all network interfaces so remote clients can connect.
  • OLLAMA_NUM_PARALLEL: Sets how many requests each loaded model processes in parallel (higher values use more GPU memory but reduce queuing).
  • OLLAMA_MAX_LOADED_MODELS: Sets the maximum number of models that can stay loaded in memory at the same time.
  • OLLAMA_MODELS: Changes the storage path for model weights (useful for pointing at a fast NVMe volume).
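A parallel setting on the server only helps if the client also sends requests concurrently. A minimal client-side sketch, assuming a `send` callable that posts one prompt to the server and returns the reply text (hypothetical, supplied by you), with the worker count matched to `OLLAMA_NUM_PARALLEL=4`:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ask_all(
    prompts: List[str],
    send: Callable[[str], str],
    workers: int = 4,
) -> List[str]:
    """Send prompts concurrently; results keep the order of `prompts`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, even if they finish out of order.
        return list(pool.map(send, prompts))
```

Using more workers than the server's parallel limit should not break anything, but the extra requests are expected to wait until a slot frees up.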

4. Local REST API Integration

Ollama runs a local HTTP server by default. You can easily communicate with it from standard tools or custom scripts.

Generating a Chat Response (CURL)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {
      "role": "user",
      "content": "Explain symmetric encryption in one sentence."
    }
  ],
  "stream": false
}'
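The same call can be made from a script with only the Python standard library. This sketch mirrors the curl request above; `build_chat_request` and `chat` are illustrative names, and the timeout value is an arbitrary assumption.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "llama3",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a POST request for /api/chat with streaming disabled."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "llama3") -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["message"]["content"]
```

With the server running and the model pulled, `chat("Explain symmetric encryption in one sentence.")` would return the text found under `message.content`.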

Response Object (JSON)

{
  "model": "llama3",
  "created_at": "2026-06-01T15:20:00.123456Z",
  "message": {
    "role": "assistant",
    "content": "Symmetric encryption uses a single shared key to both encrypt and decrypt data."
  },
  "done": true,
  "total_duration": 450123456,
  "load_duration": 120456,
  "prompt_eval_count": 22,
  "eval_count": 14
}
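The timing fields are easier to read once converted. The sketch below assumes the duration fields are reported in nanoseconds and reads only the keys present in the sample response; `summarize` is a hypothetical helper name.

```python
from typing import Any, Dict

NANOSECONDS_PER_SECOND = 1_000_000_000


def summarize(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the reply text and convert duration fields to seconds."""
    return {
        "text": response["message"]["content"],
        # Assumption: durations are nanoseconds.
        "total_seconds": response["total_duration"] / NANOSECONDS_PER_SECOND,
        "load_seconds": response["load_duration"] / NANOSECONDS_PER_SECOND,
        "prompt_tokens": response["prompt_eval_count"],
        "output_tokens": response["eval_count"],
    }
```

Applied to the sample object, `total_seconds` comes out at roughly 0.45 and `output_tokens` at 14.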

For developer workflows, consider pulling these code-focused models:

  • qwen2.5-coder:7b: A strong lightweight coding assistant.
  • deepseek-coder:6.7b: Well suited to code editing, refactoring, and logic resolution.
  • codegemma:7b: Google’s lightweight open-weights model tailored for IDE completions.