Skip to main content
Local models let you use Cline completely offline. Your code never leaves your machine, there are no per-token charges, and you can run as many tasks as your hardware allows. The tradeoff is hardware requirements and speed. Local models generate tokens at 5–20 tokens/second on consumer hardware, compared to 50–200+ tokens/second from cloud APIs. With the right setup, they handle real development work reliably.

Hardware requirements

Your available RAM determines which models you can run:
RAMRecommended modelQuantizationNotes
32 GBQwen3 Coder 30B4-bit (~17 GB)Entry-level local coding
64 GBQwen3 Coder 30B8-bit (~32 GB)Better quality, full Cline features
128 GB+GLM-4.5-Air4-bitCloud-competitive performance
Most models under 30B parameters fail with Cline’s tool-use format — they produce broken outputs, refuse commands, or lose context across files. Stick to Qwen3 Coder 30B or larger unless you’ve specifically verified a smaller model works.
After extensive testing, Qwen3 Coder 30B is the most reliable model under 70B parameters for use with Cline:
  • 256K native context window — handles entire repositories
  • Strong tool-use — reliable command execution with Cline’s format
  • Consistent outputs — maintains context across multi-file tasks
Download sizes: 4-bit (~17 GB), 8-bit (~32 GB), 16-bit (~60 GB)

Choose your runtime

Ollama is a command-line tool for running models locally. It has lower memory overhead than LM Studio and works well for scripting and server deployments.Best for: Power users, terminal-comfortable developers, server setups.
1

Install Ollama

Visit ollama.com and download the installer for your operating system (Windows, macOS, or Linux). Run the installer and follow the prompts.
2

Download a model

Open your terminal and pull the recommended model:
ollama pull qwen2.5-coder:32b
Browse all available models at ollama.com/search. Other capable options:
ollama pull mistral-small:latest
ollama pull codellama:34b-code
The first download may take several minutes depending on model size and your connection.
3

Start the model

Ollama starts automatically after installation and runs in the background. To run the model interactively to verify it works:
ollama run qwen2.5-coder:32b
Keep Ollama running whenever you use it with Cline. The default server port is 11434.
4

Configure Cline

  1. Click the settings gear icon (⚙️) in the Cline panel.
  2. Set API Provider to Ollama.
  3. Set Base URL to http://localhost:11434/ (the default — no change needed unless you’re running Ollama on a different host).
  4. Select your downloaded model from the Model dropdown.
  5. Save settings.
5

Set the context window

By default, Ollama uses a small context window. For Cline, set it to the model’s maximum:Create a Modelfile to override the context size:
FROM qwen2.5-coder:32b
PARAMETER num_ctx 262144
Then build a named model from it:
ollama create qwen-coder-256k -f Modelfile
Select qwen-coder-256k in Cline’s model dropdown.
6

Enable compact prompts

In Cline settings, go to Features → Use Compact Prompt and toggle it on. This reduces prompt size by ~90% — essential for getting acceptable speed with local models.
Troubleshooting Ollama:
  • Cline can’t connect — Verify Ollama is running (ollama list should return results). Check that the base URL matches.
  • Slow responses — This is normal for local models. Enable compact prompts and consider using 4-bit quantization.
  • Model errors or broken tool use — Verify num_ctx is set correctly. Models with insufficient context cause truncated outputs.

The OpenAI-compatible endpoint pattern

Both Ollama and LM Studio expose a local HTTP server that follows the OpenAI API format. This means you can also configure them as a custom OpenAI-compatible provider in Cline if the dedicated provider option doesn’t appear:
RuntimeBase URLModel name format
Ollamahttp://localhost:11434/v1qwen2.5-coder:32b
LM Studiohttp://localhost:1234/v1Name shown in LM Studio UI
To use this approach, select OpenAI Compatible as the provider in Cline, enter the base URL above, and enter any non-empty string as the API key (local servers don’t authenticate).

Performance expectations

MetricTypical range
Initial model load10–30 seconds
Token generation5–20 tokens/sec
Context processing (large)Slower — scales with context size
Memory usageClose to quantization download size

Tips for better performance

  • Use 4-bit quantization to maximize speed; use 8-bit if you have 64 GB+ RAM and want better quality.
  • Enable compact prompts in Cline — this is the single biggest improvement you can make.
  • Store models on an NVMe SSD, not a spinning hard drive.
  • Close other applications to free RAM for the model.
  • Enable Flash Attention if your hardware supports it.

When to use local vs. cloud models

Local models are better when…Cloud models are better when…
Code must stay on your machineCodebase exceeds 256K tokens
You work offlineYou need consistent team performance
You want zero API costYou need the latest model capabilities
You’re experimenting with many iterationsSpeed is critical

Next steps

Model selection guide

Compare all models and find the right one for your workflow.

Context windows

Understand context limits and how to stay within them with large codebases.