Topic · A5
The Best AI Coding Agents for Ollama and Local Models (2026)
Running an AI coding agent against a local LLM via Ollama is finally viable in 2026. This is the picker — aider, Cline, Continue, Roo Code — covering which models actually work, what hardware you need, and the cost math vs cloud APIs.
Local AI coding was a marginal use case in 2024 because local models couldn't write production code without falling apart inside multi-file edits. By mid-2026 that has changed. Qwen2.5-Coder-32B and DeepSeek-Coder-V2-Lite, paired with the right client, produce output close enough to GPT-5 / Sonnet 4.7 quality on common stacks that the trade-offs become legitimate.
The trade-offs: hardware cost, throughput, and the long tail of cases where a local model still misfires. This page is the picker — which clients, which models, what hardware, and the cost math.
This page does not cover Claude Code or Cursor for local models — both are cloud-API-only. The four clients that meaningfully support local models in May 2026 are aider, Cline, Continue, and Roo Code (a Cline fork).
The four clients
aider
paul-gauthier/aider — the terminal coding agent that pre-dates Claude Code, Cursor, and Codex CLI. It predates the local-model trend too; aider has had Ollama support for over a year.
How it talks to Ollama:
aider --model ollama/qwen2.5-coder:32b
The ollama/ prefix tells aider to use the Ollama HTTP API at http://localhost:11434. Aider's architect/editor mode works against two different local models if you want (e.g., 32B model as architect, 7B as editor for cheap completions).
What aider does well with local models: it is terminal-native with no GUI, has full file-tree awareness, integrates with git for atomic commits, and the CONVENTIONS.md pattern translates 1:1. The architect/editor pairing is a unique benefit — you can have a stronger model plan and a faster model edit, which on local hardware also means the better model isn't resident in VRAM constantly.
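A minimal sketch of that pairing, assuming a recent aider release with the --architect and --editor-model flags and both Qwen tags already pulled locally:

# 32B model plans the change, 7B model applies the edits
aider --architect \
  --model ollama/qwen2.5-coder:32b \
  --editor-model ollama/qwen2.5-coder:7b

Ollama loads models on demand, which is what the point above about the better model not being loaded constantly relies on.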
What it doesn't do: rich IDE integration. If you want inline completion and a chat sidebar inside VS Code, aider is the wrong client.
Cline (and forks Roo Code, Kilo Code)
cline/cline — the VS Code extension that brought the multi-step agent flow inside the IDE. Cline has had Ollama support since v2.x via Ollama's OpenAI-compatible endpoint.
Config:
{
"cline.modelProvider": "ollama",
"cline.ollama.baseURL": "http://localhost:11434/v1",
"cline.ollama.model": "qwen2.5-coder:32b"
}
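Before pointing Cline at it, it's worth confirming the endpoint is up and the model tag exists. A quick check, assuming Ollama's OpenAI-compatible routes and the default port:

# OpenAI-compatible route Cline talks to
curl http://localhost:11434/v1/models
# Native Ollama route, lists pulled models and tags
curl http://localhost:11434/api/tags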
Roo Code and Kilo Code are forks with similar Ollama support. The forks tend to merge upstream Cline changes within days, so feature parity is high.
What Cline does well: full-IDE integration with VS Code, automatic file selection from open editor tabs, terminal command execution from the chat, browser-use integration (when paired with a vision model).
What it doesn't do: terminal-only flows. It's a VS Code extension; you'd run something else for non-VS-Code work.
Continue
continuedev/continue — the most-installed VS Code AI extension in May 2026 by some counts, with strong Ollama support and a config-file-driven architecture.
Config in ~/.continue/config.json:
{
"models": [
{
"title": "Qwen Coder local",
"provider": "ollama",
"model": "qwen2.5-coder:32b",
"apiBase": "http://localhost:11434"
}
]
}
What Continue does well: inline completions (the most polished of the four), customizable slash commands, a strong "@-mention" system for adding context files/folders/symbols to the chat.
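If you want those inline completions backed by something smaller and faster than the 32B chat model, the Ollama side is just another pull; Continue's config can then reference that tag as a dedicated completion model (the tabAutocompleteModel entry in current releases, an assumption worth checking against your installed version):

# Small model for inline completion; the 32B model stays on chat duty
ollama pull qwen2.5-coder:1.5b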
What it doesn't do: agentic multi-step work as well as Cline. Continue is closer to "smart autocomplete + chat" than "agent that runs commands and edits files autonomously."
Roo Code
RooCodeInc/Roo-Code — Cline fork. Functionally similar to Cline; some teams prefer Roo Code's UI or its slightly different defaults. Same Ollama setup.
The Cline/Roo/Kilo trio is essentially one tool with three skins as of May 2026. Pick whichever UI you prefer; the underlying agent loop is the same.
Which model
The model picker, ranked by what we'd actually run on local hardware in May 2026:
Qwen2.5-Coder-32B
The strongest open-weights coding model for agentic work as of mid-2026. Tool calling support is robust. Performs well on multi-file edits in TypeScript, Python, Go, Rust. Falls behind GPT-5 / Sonnet 4.7 on complex architectural decisions; matches them on routine work.
Hardware: 22-24GB VRAM at 4-bit quantization. RTX 3090 / 4090 / 5090, or Apple Silicon with 32GB+ unified memory.
Throughput: 20-40 tokens/sec on RTX 4090, 15-25 on M3 Max.
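Getting it serving is a single pull (roughly a 20GB download at Ollama's default quantization) plus whichever client config above; a quick interactive run confirms the model answers before you wire up an agent:

ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b "write a Go function that reverses a slice in place"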
DeepSeek-Coder-V2-Lite-Instruct (16B)
Second choice. Slightly weaker than Qwen2.5-Coder-32B but fits in 16GB VRAM. Useful for users with 16GB cards (RTX 4070 Ti Super, 5070, M-series base models).
Hardware: 12-16GB VRAM.
Throughput: 30-50 tokens/sec on RTX 4090, ~20 on M-series.
Llama 3.3 70B
Strong for general work, marginal for coding specifically. Has reliable tool calling. The 70B size makes it expensive to run locally — needs 40GB+ VRAM or significant unified memory (M3 Max 64GB+ or dual-GPU setup).
Hardware: dual 24GB GPUs, or 48GB+ unified memory.
Use case: if you have the hardware, Llama 3.3 70B is more capable on broad tasks than Qwen2.5-Coder-32B; if coding is the primary use, Qwen-Coder-32B beats it.
Qwen2.5-Coder-7B / DeepSeek-Coder-V2-7B
Floor for "this is usable." Good for completion and Q&A; degrades on agentic multi-file work. Run on 8-12GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB).
The 7B class is what makes local AI accessible on consumer hardware. It's not what you want for serious work; it is what you want if your alternative is "no local model at all."
Below 7B parameters
Not recommended for agentic coding work. Completion-only at best.
Hardware reality
Three realistic local setups in May 2026:
Setup A: budget — 12-16GB VRAM ($800-1200 GPU). RTX 4060 Ti 16GB or RTX 4070 Super 12GB. Run DeepSeek-Coder-V2-Lite 16B at Q4. Usable but tight; expect to swap the model out regularly to free VRAM.
Setup B: solid — 24GB VRAM ($1500-2000). RTX 3090 or RTX 4090. Run Qwen2.5-Coder-32B at Q4 comfortably. This is the sweet spot for solo-developer local AI coding.
Setup C: serious — 48GB+ ($3000+). Dual 3090s, RTX 5090, or Apple Silicon with 64GB+ unified memory. Run Llama 3.3 70B, or Qwen2.5-Coder-32B at higher quantization (Q6/Q8) for better quality.
CPU-only inference exists but is impractical for 32B+ models — single-digit tokens/sec means waiting 30 seconds for short responses and minutes for refactors.
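A rough fit check before spending the money: at 4-bit quantization, weights take about half a gigabyte per billion parameters, plus a few gigabytes for KV cache and runtime overhead, which is why a 32B model lands in the 20-24GB range quoted above. On an existing NVIDIA card:

# Card name plus total and used VRAM
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
# What Ollama currently has loaded and how much memory each model takes
ollama ps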
Cost math vs cloud APIs
The crossover where local hardware pays off:
Cloud API cost (Claude Opus 4.7):
- Daily 200k input + 50k output tokens
- Cost: $1.00 + $1.25 = $2.25/day = ~$820/year
Local setup (24GB GPU rig):
- GPU: $1,800 (one-time)
- Electricity: ~$300/year at 300W avg, 8 hours/day, $0.15/kWh
- Amortize GPU over 3 years: $600/year
- Total: $900/year
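The same arithmetic as a throwaway script, so you can plug in your own daily token counts, prices, and GPU cost (the numbers below are the illustrative ones above, not measurements):

# Daily cloud cost at Opus-class pricing ($5/M input, $25/M output)
IN=200000; OUT=50000
DAILY=$(echo "scale=2; $IN*5/1000000 + $OUT*25/1000000" | bc)
YEARLY=$(echo "$DAILY*365/1" | bc)
# Local: $1,800 GPU amortized over 3 years plus ~$300/yr electricity
LOCAL=$(echo "1800/3 + 300" | bc)
echo "cloud: \$$DAILY/day, ~\$$YEARLY/yr; local: ~\$$LOCAL/yr"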
The break-even shifts based on model. If you can use Claude Sonnet 4.7 ($3/$15) instead of Opus ($5/$25), cloud wins at higher usage. If you're running Llama 3.3 70B locally, you need more hardware and electricity, so the break-even moves up.
For most solo developers, cloud API is cheaper unless you're running an agentic workflow at scale (autoresearch, automated PR generation, large refactors). For team setups where 5 developers share one local rig via a local API server, the math flips earlier.
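The shared-rig setup is just Ollama bound to the LAN instead of loopback, with each client's base URL pointed at the rig. A sketch, with an illustrative hostname:

# On the GPU machine: listen on all interfaces, not just localhost
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# On each developer machine (aider shown; Cline/Continue change the baseURL/apiBase instead)
export OLLAMA_API_BASE=http://gpu-rig.local:11434
aider --model ollama/qwen2.5-coder:32b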
Where local AI coding still loses
- Latency on long tasks. A multi-file refactor that takes 90 seconds on Claude Opus takes 4-6 minutes locally. The throughput gap matters more on agentic work than on completion.
- Tool calling fragility. Local models can produce malformed tool calls under context pressure. Aider's auto-retry handles common cases; some tasks still wedge.
- Quality on architecturally novel work. When the task is "write standard CRUD," local 32B models match cloud Sonnet. When the task is "design a novel concurrency pattern for this codebase," cloud Opus pulls ahead measurably.
- Context windows. Most local models cap at 32K-128K tokens; cloud offers 200K-1M. For large codebases, the limit forces more aggressive context pruning locally.
- Update cadence. Cloud models update silently. Local models require a manual download of new versions; a new Qwen-Coder release in October won't help you until you ollama pull it.
The privacy case
Where local AI coding wins decisively: when the code cannot leave your machine.
- Legal: contracts under attorney-client privilege
- Healthcare: code touching PHI
- Defense: classified codebases
- Corporate: pre-public-release code in regulated industries
- Personal: side projects you don't want OpenAI to see
Where this fails
- Local model rankings change quickly. Qwen2.5-Coder-32B is the May 2026 leader. By Q4 2026 there will likely be a stronger option. Re-evaluate quarterly.
- Hardware advice is approximate. Tokens/sec figures are illustrative; your throughput depends on quantization, context length, and batch size. Run your own benchmarks before committing $2000 to a GPU.
- Client ecosystem is shifting. Cline forked into Roo Code and Kilo Code; Continue has been actively rewriting its agent loop. The four-client picture is current; in 6 months one of these may absorb the others or fork further.
- Multi-tool workflows don't fully translate. A skill or rule that depends on Claude-specific features (the Task tool for subagents, Claude's prompt-caching) doesn't port to local. AGENTS.md works everywhere; Claude Code skills are Claude-Code-specific. See /topic/agents-md.
- Power and noise. A 24GB GPU running flat-out for hours pulls 350W+ and produces audible fan noise. For laptop or shared-office setups this matters.
What to read next
- /topic/aider-conventions-md — the configuration file aider reads
- /topic/agents-md — the cross-tool config standard
- /topic/claude-md — Claude Code's equivalent (not portable to local)
- /topic/llm-cost-tracking — measuring cloud vs local cost in production
- /topic/llm-gateway-decision — routing across local and cloud models
- /topic/mcp-servers — MCP servers work locally too
- /for/security-conscious-ai-team — the audience that requires local
Sources
- aider. paul-gauthier/aider repository. Architect/editor mode, Ollama support.
- Cline. cline/cline repository. VS Code extension.
- Roo Code. RooCodeInc/Roo-Code repository. Cline fork.
- Continue. continuedev/continue repository. Config-driven IDE assistant.
- Ollama. ollama/ollama repository. Local LLM runtime.
- Qwen Team. Qwen2.5-Coder-32B model card.
- DeepSeek. DeepSeek-Coder-V2 model card.
- Meta. Llama 3.3 70B model card.
- r/LocalLLaMA. Community consensus on local coding setups, aggregated through secondary dev.to and blog posts.
Frequently asked
- Can I use Claude Code with Ollama?
- No. Claude Code only talks to the Anthropic API. For local-model coding you need a different client. aider supports Ollama directly. Cline (and its forks Roo Code and Kilo Code) supports Ollama via the OpenAI-compatible endpoint Ollama exposes. Continue supports it as a configured provider. These are the four mainstream options.
- Which local model is good enough for coding?
- As of May 2026, the practical floor is Qwen2.5-Coder-32B or DeepSeek-Coder-V2-Lite for general work. For Rust/Go/TypeScript specifically, Qwen2.5-Coder-32B is the strongest open model. Below 32B parameters, output quality drops sharply on multi-file refactors. 7B models are usable for completion and Q&A; not for agentic work.
- What hardware do I need to run a 32B model?
- Realistic minimums: 24GB VRAM (an RTX 3090, 4090, or Apple Silicon with 32GB unified memory). 32B models at 4-bit quantization fit in ~22GB. 16GB VRAM forces you down to 13B-14B parameter models which lose significantly on agentic tasks. CPU-only inference works for 7B models at painful speeds (10-30 tokens/sec); not practical for 32B+ without GPU.
- Is local AI coding cheaper than Claude API?
- Only at high usage. The crossover is roughly 200k-500k tokens/day. Below that, Claude API ($5/$25 per million tokens for Opus) is cheaper than the electricity + hardware amortization for a local 24GB GPU rig. Above that, local wins because you've already paid for the hardware. Most solo developers do not cross the threshold; teams with 5+ developers usually do.
- Do local models support tool calling?
- Yes, increasingly well. Qwen2.5-Coder-32B and DeepSeek-Coder-V2 both have native tool-calling support that works in aider's tool-use flow. Llama 3.3 70B supports tool calling. Older or smaller models often produce malformed tool calls — aider's auto-retry handles this but adds latency.
- Why would I use local models instead of just calling Claude or GPT?
- Three reasons. (1) Privacy — code never leaves your machine. (2) Cost at scale. (3) Air-gapped or restricted environments where API access isn't allowed. The case is strongest in regulated industries (legal, healthcare, defense) and for developers whose code is contractually private.