Topic · A5
The Best AI Coding Agents for Ollama and Local Models (2026)
Running an AI coding agent against a local LLM via Ollama is finally viable in 2026. This is the picker — aider, Cline, Continue, Roo Code — covering which models actually work, what hardware you need, and the cost math vs cloud APIs.
Local AI coding was a marginal use case in 2024 because local models couldn't write production code without falling apart inside multi-file edits. By mid-2026 that has changed. Qwen2.5-Coder-32B and DeepSeek-Coder-V2-Lite, paired with the right client, produce output close enough to GPT-5 / Sonnet 4.7 quality on common stacks that the trade-offs become legitimate.
The trade-offs: hardware cost, throughput, and the long tail of cases where a local model still misfires. This page is the picker — which clients, which models, what hardware, and the cost math.
This page does not cover Claude Code or Cursor for local models — both are cloud-API-only. The four clients that meaningfully support local models in May 2026 are aider, Cline, Continue, and Roo Code (a Cline fork).
The four clients
aider
paul-gauthier/aider — the terminal coding agent that pre-dates Claude Code, Cursor, and Codex CLI. It predates the local-model trend too; aider has had Ollama support for over a year.
How it talks to Ollama:
aider --model ollama/qwen2.5-coder:32b
The ollama/ prefix tells aider to use the Ollama HTTP API at http://localhost:11434. Aider's architect/editor mode works against two different local models if you want (e.g., 32B model as architect, 7B as editor for cheap completions).
What aider does well with local models: it is terminal-native with no GUI, has full file-tree awareness, integrates with git for atomic commits, and the CONVENTIONS.md pattern translates 1:1. The architect/editor pairing is a unique benefit — you can have a stronger model plan and a faster model edit, which on local hardware also means the better model isn't resident in VRAM constantly.
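A minimal sketch of that pairing, assuming a recent aider release with the --architect and --editor-model flags and both Qwen tags already pulled locally:

# 32B model plans the change, 7B model applies the edits
aider --architect \
  --model ollama/qwen2.5-coder:32b \
  --editor-model ollama/qwen2.5-coder:7b

Ollama loads models on demand, which is what the point above about the better model not being loaded constantly relies on.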
What it doesn't do: rich IDE integration. If you want inline completion and a chat sidebar inside VS Code, aider is the wrong client.
Cline (and forks Roo Code, Kilo Code)
cline/cline — the VS Code extension that brought the multi-step agent flow inside the IDE. Cline has had Ollama support since v2.x via Ollama's OpenAI-compatible endpoint.
Config:
{
"cline.modelProvider": "ollama",
"cline.ollama.baseURL": "http://localhost:11434/v1",
"cline.ollama.model": "qwen2.5-coder:32b"
}
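Before pointing Cline at it, it's worth confirming the endpoint is up and the model tag exists. A quick check, assuming Ollama's OpenAI-compatible routes and the default port:

# OpenAI-compatible route Cline talks to
curl http://localhost:11434/v1/models
# Native Ollama route, lists pulled models and tags
curl http://localhost:11434/api/tags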
Roo Code and Kilo Code are forks with similar Ollama support. The forks tend to merge upstream Cline changes within days, so feature parity is high.
What Cline does well: full-IDE integration with VS Code, automatic file selection from open editor tabs, terminal command execution from the chat, browser-use integration (when paired with a vision model).
What it doesn't do: terminal-only flows. It's a VS Code extension; you'd run something else for non-VS-Code work.
Continue
continuedev/continue — the most-installed VS Code AI extension in May 2026 by some counts, with strong Ollama support and a config-file-driven architecture.
Config in ~/.continue/config.json:
{
"models": [
{
"title": "Qwen Coder local",
"provider": "ollama",
"model": "qwen2.5-coder:32b",
"apiBase": "http://localhost:11434"
}
]
}
What Continue does well: inline completions (the most polished of the four), customizable slash commands, a strong "@-mention" system for adding context files/folders/symbols to the chat.
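If you want those inline completions backed by something smaller and faster than the 32B chat model, the Ollama side is just another pull; Continue's config can then reference that tag as a dedicated completion model (the tabAutocompleteModel entry in current releases, an assumption worth checking against your installed version):

# Small model for inline completion; the 32B model stays on chat duty
ollama pull qwen2.5-coder:1.5b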
What it doesn't do: agentic multi-step work as well as Cline. Continue is closer to "smart autocomplete + chat" than "agent that runs commands and edits files autonomously."
Roo Code
RooCodeInc/Roo-Code — Cline fork. Functionally similar to Cline; some teams prefer Roo Code's UI or its slightly different defaults. Same Ollama setup.
The Cline/Roo/Kilo trio is essentially one tool with three skins as of May 2026. Pick whichever UI you prefer; the underlying agent loop is the same.
Which model
The model picker, ranked by what we'd actually run on local hardware in May 2026:
Qwen2.5-Coder-32B
The strongest open-weights coding model for agentic work as of mid-2026. Tool calling support is robust. Performs well on multi-file edits in TypeScript, Python, Go, Rust. Falls behind GPT-5 / Sonnet 4.7 on complex architectural decisions; matches them on routine work.
Hardware: 22-24GB VRAM at 4-bit quantization. RTX 3090 / 4090 / 5090, or Apple Silicon with 32GB+ unified memory.
Throughput: 20-40 tokens/sec on RTX 4090, 15-25 on M3 Max.
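Getting it serving is a single pull (roughly a 20GB download at Ollama's default quantization) plus whichever client config above; a quick interactive run confirms the model answers before you wire up an agent:

ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b "write a Go function that reverses a slice in place"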
DeepSeek-Coder-V2-Lite-Instruct (16B)
Second choice. Slightly weaker than Qwen2.5-Coder-32B but fits in 16GB VRAM. Useful for users with 16GB cards (RTX 4070 Ti Super, 5070, M-series base models).
Hardware: 12-16GB VRAM.
Throughput: 30-50 tokens/sec on RTX 4090, ~20 on M-series.
Llama 3.3 70B
Strong for general work, marginal for coding specifically. Has reliable tool calling. The 70B size makes it expensive to run locally — needs 40GB+ VRAM or significant unified memory (M3 Max 64GB+ or dual-GPU setup).
Hardware: dual 24GB GPUs, or 48GB+ unified memory.
Use case: if you have the hardware, Llama 3.3 70B is more capable on broad tasks than Qwen2.5-Coder-32B; if coding is the primary use, Qwen-Coder-32B beats it.
Qwen2.5-Coder-7B / DeepSeek-Coder-V2-7B
Floor for "this is usable." Good for completion and Q&A; degrades on agentic multi-file work. Run on 8-12GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB).
The 7B class is what makes local AI accessible on consumer hardware. It's not what you want for serious work; it is what you want if your alternative is "no local model at all."
Below 7B parameters
Not recommended for agentic coding work. Completion-only at best.
Hardware reality
Three realistic local setups in May 2026:
Setup A: budget — 12-16GB VRAM ($800-1200 GPU). RTX 4060 Ti 16GB or RTX 4070 Super 12GB. Run DeepSeek-Coder-V2-Lite 16B at Q4. Usable but tight; expect to swap the model out regularly to free VRAM.
Setup B: solid — 24GB VRAM ($1500-2000). RTX 3090 or RTX 4090. Run Qwen2.5-Coder-32B at Q4 comfortably. This is the sweet spot for solo-developer local AI coding.
Setup C: serious — 48GB+ ($3000+). Dual 3090s, RTX 5090, or Apple Silicon with 64GB+ unified memory. Run Llama 3.3 70B, or Qwen2.5-Coder-32B at higher quantization (Q6/Q8) for better quality.
CPU-only inference exists but is impractical for 32B+ models — single-digit tokens/sec means waiting 30 seconds for short responses and minutes for refactors.
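A rough fit check before spending the money: at 4-bit quantization, weights take about half a gigabyte per billion parameters, plus a few gigabytes for KV cache and runtime overhead, which is why a 32B model lands in the 20-24GB range quoted above. On an existing NVIDIA card:

# Card name plus total and used VRAM
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
# What Ollama currently has loaded and how much memory each model takes
ollama ps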
Cost math vs cloud APIs
The crossover where local hardware pays off:
Cloud API cost (Claude Opus 4.7):
- Daily 200k input + 50k output tokens
- Cost: $1.00 + $1.25 = $2.25/day = ~$820/year
Local setup (24GB GPU rig):
- GPU: $1,800 (one-time)
- Electricity: ~$300/year at 300W avg, 8 hours/day, $0.15/kWh
- Amortize GPU over 3 years: $600/year
- Total: $900/year
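The same arithmetic as a throwaway script, so you can plug in your own daily token counts, prices, and GPU cost (the numbers below are the illustrative ones above, not measurements):

# Daily cloud cost at Opus-class pricing ($5/M input, $25/M output)
IN=200000; OUT=50000
DAILY=$(echo "scale=2; $IN*5/1000000 + $OUT*25/1000000" | bc)
YEARLY=$(echo "$DAILY*365/1" | bc)
# Local: $1,800 GPU amortized over 3 years plus ~$300/yr electricity
LOCAL=$(echo "1800/3 + 300" | bc)
echo "cloud: \$$DAILY/day, ~\$$YEARLY/yr; local: ~\$$LOCAL/yr"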
The break-even shifts based on model. If you can use Claude Sonnet 4.7 ($3/$15) instead of Opus ($5/$25), cloud wins at higher usage. If you're running Llama 3.3 70B locally, you need more hardware and electricity, so the break-even moves up.
For most solo developers, cloud API is cheaper unless you're running an agentic workflow at scale (autoresearch, automated PR generation, large refactors). For team setups where 5 developers share one local rig via a local API server, the math flips earlier.
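The shared-rig setup is just Ollama bound to the LAN instead of loopback, with each client's base URL pointed at the rig. A sketch, with an illustrative hostname:

# On the GPU machine: listen on all interfaces, not just localhost
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# On each developer machine (aider shown; Cline/Continue change the baseURL/apiBase instead)
export OLLAMA_API_BASE=http://gpu-rig.local:11434
aider --model ollama/qwen2.5-coder:32b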
Where local AI coding still loses
- Latency on long tasks. A multi-file refactor that takes 90 seconds on Claude Opus takes 4-6 minutes locally. The throughput gap matters more on agentic work than on completion.
- Tool calling fragility. Local models can produce malformed tool calls under context pressure. Aider's auto-retry handles common cases; some tasks still wedge.
- Quality on architecturally novel work. When the task is "write standard CRUD," local 32B models match cloud Sonnet. When the task is "design a novel concurrency pattern for this codebase," cloud Opus pulls ahead measurably.
- Context windows. Most local models cap at 32K-128K tokens; cloud offers 200K-1M. For large codebases, the limit forces more aggressive context pruning locally.
- Update cadence. Cloud models update silently. Local models require a manual download of new versions; a new Qwen-Coder release in October won't help you until you ollama pull it.
The privacy case
Where local AI coding wins decisively: when the code cannot leave your machine.
- Legal: contracts under attorney-client privilege
- Healthcare: code touching PHI
- Defense: classified codebases
- Corporate: pre-public-release code in regulated industries
- Personal: side projects you don't want OpenAI to see
Where this fails
- Local model rankings change quickly. Qwen2.5-Coder-32B is the May 2026 leader. By Q4 2026 there will likely be a stronger option. Re-evaluate quarterly.
- Hardware advice is approximate. Tokens/sec figures are illustrative; your throughput depends on quantization, context length, and batch size. Run your own benchmarks before committing $2000 to a GPU.
- Client ecosystem is shifting. Cline forked into Roo Code and Kilo Code; Continue has been actively rewriting its agent loop. The four-client picture is current; in 6 months one of these may absorb the others or fork further.
- Multi-tool workflows don't fully translate. A skill or rule that depends on Claude-specific features (the Task tool for subagents, Claude's prompt-caching) doesn't port to local. AGENTS.md works everywhere; Claude Code skills are Claude-Code-specific. See /topic/agents-md.
- Power and noise. A 24GB GPU running flat-out for hours pulls 350W+ and produces audible fan noise. For laptop or shared-office setups this matters.
What to read next
- /topic/aider-conventions-md — the configuration file aider reads
- /topic/agents-md — the cross-tool config standard
- /topic/claude-md — Claude Code's equivalent (not portable to local)
- /topic/llm-cost-tracking — measuring cloud vs local cost in production
- /topic/llm-gateway-decision — routing across local and cloud models
- /topic/mcp-servers — MCP servers work locally too
- /for/security-conscious-ai-team — the audience that requires local
Sources
- aider. paul-gauthier/aider repository. Architect/editor mode, Ollama support.
- Cline. cline/cline repository. VS Code extension.
- Roo Code. RooCodeInc/Roo-Code repository. Cline fork.
- Continue. continuedev/continue repository. Config-driven IDE assistant.
- Ollama. ollama/ollama repository. Local LLM runtime.
- Qwen Team. Qwen2.5-Coder-32B model card.
- DeepSeek. DeepSeek-Coder-V2 model card.
- Meta. Llama 3.3 70B model card.
- r/LocalLLaMA. Community consensus on local coding setups, aggregated through secondary dev.to and blog posts.
Frequently asked
- Can I use Claude Code with Ollama?
- No. Claude Code only talks to the Anthropic API. For local-model coding you need a different client. aider supports Ollama directly. Cline (and its forks Roo Code and Kilo Code) supports Ollama via the OpenAI-compatible endpoint Ollama exposes. Continue supports it as a configured provider. These are the four mainstream options.
- Which local model is good enough for coding?
- As of May 2026, the practical floor is Qwen2.5-Coder-32B or DeepSeek-Coder-V2-Lite for general work. For Rust/Go/TypeScript specifically, Qwen2.5-Coder-32B is the strongest open model. Below 32B parameters, output quality drops sharply on multi-file refactors. 7B models are usable for completion and Q&A; not for agentic work.
- What hardware do I need to run a 32B model?
- Realistic minimums: 24GB VRAM (an RTX 3090, 4090, or Apple Silicon with 32GB unified memory). 32B models at 4-bit quantization fit in ~22GB. 16GB VRAM forces you down to 13B-14B parameter models which lose significantly on agentic tasks. CPU-only inference works for 7B models at painful speeds (10-30 tokens/sec); not practical for 32B+ without GPU.
- Is local AI coding cheaper than Claude API?
- Only at high usage. The crossover is roughly 200k-500k tokens/day. Below that, Claude API ($5/$25 per million tokens for Opus) is cheaper than the electricity + hardware amortization for a local 24GB GPU rig. Above that, local wins because you've already paid for the hardware. Most solo developers do not cross the threshold; teams with 5+ developers usually do.
- Do local models support tool calling?
- Yes, increasingly well. Qwen2.5-Coder-32B and DeepSeek-Coder-V2 both have native tool-calling support that works in aider's tool-use flow. Llama 3.3 70B supports tool calling. Older or smaller models often produce malformed tool calls — aider's auto-retry handles this but adds latency.
- Why would I use local models instead of just calling Claude or GPT?
- Three reasons. (1) Privacy — code never leaves your machine. (2) Cost at scale. (3) Air-gapped or restricted environments where API access isn't allowed. The case is strongest in regulated industries (legal, healthcare, defense) and for developers whose code is contractually private.