Running local LLMs with Ollama is no longer experimental. By mid-2026, models that run on consumer hardware match cloud models from two years ago. For many tasks — coding, summarization, classification, structured data extraction — local models are the practical choice.
But the model landscape changes fast. Six months ago’s best model is today’s baseline. This guide benchmarks the current field for real-world use.
| Task | Best Model | Size | Notes |
|---|---|---|---|
| General coding | DeepSeek R1 | 7B-14B | Best reasoning-to-size ratio |
| Chat / assistant | Qwen 3 | 8B-32B | Multilingual, strong instruction following |
| Structured extraction | Llama 4 Scout | 17B active | MoE consistency across runs |
| Reasoning | DeepSeek R1 | 14B-32B | Chain-of-thought optimized |
| RAG embedding | Qwen 3 embeddings | — | Best retrieval for code/docs |
| Code generation | DeepSeek Coder V3 | 6.7B-33B | Beats all comparably-sized models on code |
Hardware Guidelines
Consumer GPUs in 2026 range from 8GB (RTX 4060) to 24GB (RTX 5090). Apple Silicon runs models efficiently via unified memory (16GB-128GB).
| VRAM | Max Active Params | Quantization | Example Models |
|---|---|---|---|
| 8GB | 7B-8B | Q4_K_M | DeepSeek R1 7B, Qwen 3 8B |
| 12GB | 14B | Q4_K_M | DeepSeek R1 14B, Qwen 3 14B |
| 16GB | 14B-20B | Q4_K_M | DeepSeek R1 14B, Qwen 3 14B |
| 24GB | 32B | Q4_K_M | DeepSeek R1 32B, Qwen 3 32B, Gemma 4 26B |
| 48GB+ | 70B+ | Q3-Q4 | DeepSeek R1 70B, Llama 4 Maverick |
Apple Silicon note: A Mac with 64GB unified memory can run 32B models at Q4. M4 Ultra with 128GB can run 70B models.
Model-by-Model Benchmarks
Tests run on RTX 5090 (24GB VRAM) with Ollama 0.6.x, Q4_K_M quantization unless noted.
Llama 4 (Meta)
| Variant | Active Params | Total Params | File Size | Speed | Quality | Strengths |
|---|---|---|---|---|---|---|
| Scout | 17B | 109B | ~60GB | 22 t/s | Very good | 10M context, consistency |
| Maverick | 17B | 400B | ~220GB | 8 t/s | Excellent | Frontier-level, 128 experts |
Llama 4 (released April 2025) is Meta’s first MoE architecture. Scout fits on a single H100 with int4 quantization and supports a 10M token context window. Maverick uses 128 experts with 17B active parameters. The “8B” and “70B” sizes from Llama 3 do not exist in Llama 4 — Scout and Maverick are the only variants.
Best for: Long-context document processing (Scout), frontier quality on enterprise hardware (Maverick).
Qwen 3 (Alibaba)
| Variant | Size | Speed (tokens/s) | Quality | Strengths |
|---|---|---|---|---|
| Qwen 3 8B | 5.2GB | 90 t/s | Very good | Multilingual, chat |
| Qwen 3 14B | 9.3GB | 52 t/s | Excellent | Instruction following |
| Qwen 3 32B | 20GB | 28 t/s | Excellent | Complex reasoning |
| Qwen 3 30B-A3B | 19GB | 35 t/s | Very good | MoE efficiency (3B active) |
Qwen 3 (released 2025) offers both dense and MoE variants. The 30B-A3B MoE activates only 3B parameters per token, making it faster than the 32B dense model. Qwen 3 leads on multilingual performance across 119 languages.
Best for: Multilingual applications, general chat, assistant use cases.
DeepSeek R1 (Deep Seek)
| Variant | Size | Speed (tokens/s) | Quality | Strengths |
|---|---|---|---|---|
| R1 7B | 4.7GB | 92 t/s | Very good | Chain-of-thought reasoning |
| R1 14B | 9.0GB | 48 t/s | Excellent | Coding + reasoning |
| R1 32B | 20GB | 26 t/s | Excellent | Complex problem solving |
| R1 70B | 43GB | 14 t/s | Top-tier | Multi-step reasoning |
DeepSeek R1 was the breakthrough model of 2025-2026. The R1-0528 update (May 2025) improved math accuracy from 70% to 87.5% on AIME 2025 and reduced hallucinations by ~45%. Its chain-of-thought distillation means smaller models (7B, 14B) perform reasoning tasks that require much larger models from other families.
Best for: Coding tasks, multi-step reasoning, problem decomposition.
Mistral Small 3 (Mistral AI)
| Variant | Size | Speed (tokens/s) | Quality | Strengths |
|---|---|---|---|---|
| Small 3 | 24B (14GB) | 75 t/s | Very good | Fast, efficient |
| Small 3.1 | 24B (15GB) | 70 t/s | Very good | Vision + text, 128K context |
| Small 3.2 | 24B (15GB) | 70 t/s | Very good | Better function calling |
Mistral Small 3.x (released 2025-2026) is the current generation. Small 3.1 added multimodal understanding. Small 3.2 improved function calling and instruction following. All variants run on a single RTX 4090 or Mac with 32GB RAM.
Best for: General-purpose fallback model, low-resource deployments, agentic workflows.
Gemma 4 (Google)
| Variant | Size | Speed (tokens/s) | Quality | Strengths |
|---|---|---|---|---|
| E2B | 2.3B eff (1.5GB) | 150 t/s | Good | Mobile/edge, 140 languages |
| E4B | 4.5B eff (3GB) | 120 t/s | Good | On-device, audio support |
| 26B A4B | 26B total (15GB) | 45 t/s | Excellent | Agentic, tool use |
| 31B | 31B dense (17GB) | 38 t/s | Excellent | Frontier reasoning, #3 open model |
Gemma 4 (released April 2, 2026) is Google DeepMind’s latest open model family under Apache 2.0. The 31B dense model ranks #3 on Arena AI leaderboard. The 26B A4B MoE model excels at agentic workflows. Both outperform models 20x their size.
Best for: Agentic workflows (26B A4B), frontier reasoning on consumer hardware (31B dense).
Phi-4 (Microsoft)
| Variant | Size | Speed (tokens/s) | Quality | Strengths |
|---|---|---|---|---|
| Phi-4 14B | 8.0GB | 50 t/s | Very good | Math, logic, reasoning |
Phi-4 is specialized for STEM reasoning. It outperforms much larger models on math and logic benchmarks. Limited for general chat and creative tasks.
Best for: Math, logic puzzles, code logic, structured problem solving.
Recommendations by Use Case
Best for Coding: DeepSeek R1 14B
DeepSeek R1 14B at Q4_K_M (9GB VRAM) is the best coding model available for consumer hardware. The R1-0528 update made it even stronger on math and code.
Best for Chat: Qwen 3 14B
Qwen 3 14B has the best instruction following and conversation quality for its size. Multilingual support across 119 languages handles code-switching naturally.
Best for Agentic Workflows: Gemma 4 26B A4B
Gemma 4 26B A4B (4B active) excels at tool use and multi-step agentic tasks. It ranks #6 on Arena AI among open models.
Best for Small VRAM (8GB): DeepSeek R1 7B
At 4.7GB Q4, DeepSeek R1 7B fits in any GPU. It outperforms all other 7B-class models on reasoning tasks.
Best for Apple Silicon: Qwen 3 14B or DeepSeek R1 14B
Both run well on 24GB+ Macs at Q4. For 16GB Macs, use the 7B variants (Q4) or try Mistral Small 3 (24B at Q3).
What to Read Next
- Claude Code + Ollama Setup — Run Claude Code with local Ollama models
- Aider Setup Guide — Open-source AI coding agent for terminal users
- DevOps Pipeline with Free Tools — CI/CD setup for your AI-powered workflows
Related Articles
Deepen your understanding with these curated continuations.
The Trust Crisis in AI Coding: 84% Use It, 3% Trust It
84% of developers use AI coding tools but only 3% highly trust the output. Why trust is so low, real examples of failures, and how to build a healthy skepticism into your workflow.
Best AI Code Review Tools in 2026: Comparison & Guide
Compare CodeRabbit, GitHub Copilot Code Review, Amazon CodeGuru, Qodo, and GitLab Duo. Pricing, accuracy benchmarks, integration depth, and which to use for your team.
Aider: The Open-Source Claude Code Alternative You Should Know (2026)
Aider is the closest open-source equivalent to Claude Code — 45K GitHub stars, works with any model, auto-commits to Git. Complete setup guide with architect mode, model comparison, and workflow tips.