Hardware requirements
Your available RAM determines which models you can run:| RAM | Recommended model | Quantization | Notes |
|---|---|---|---|
| 32 GB | Qwen3 Coder 30B | 4-bit (~17 GB) | Entry-level local coding |
| 64 GB | Qwen3 Coder 30B | 8-bit (~32 GB) | Better quality, full Cline features |
| 128 GB+ | GLM-4.5-Air | 4-bit | Cloud-competitive performance |
Recommended model: Qwen3 Coder 30B
After extensive testing, Qwen3 Coder 30B is the most reliable model under 70B parameters for use with Cline:- 256K native context window — handles entire repositories
- Strong tool-use — reliable command execution with Cline’s format
- Consistent outputs — maintains context across multi-file tasks
Choose your runtime
- Ollama
- LM Studio
Ollama is a command-line tool for running models locally. It has lower memory overhead than LM Studio and works well for scripting and server deployments.Best for: Power users, terminal-comfortable developers, server setups.Troubleshooting Ollama:
Install Ollama
Visit ollama.com and download the installer for your operating system (Windows, macOS, or Linux). Run the installer and follow the prompts.
Download a model
Open your terminal and pull the recommended model:Browse all available models at ollama.com/search. Other capable options:The first download may take several minutes depending on model size and your connection.
Start the model
Ollama starts automatically after installation and runs in the background. To run the model interactively to verify it works:Keep Ollama running whenever you use it with Cline. The default server port is
11434.Configure Cline
- Click the settings gear icon (⚙️) in the Cline panel.
- Set API Provider to
Ollama. - Set Base URL to
http://localhost:11434/(the default — no change needed unless you’re running Ollama on a different host). - Select your downloaded model from the Model dropdown.
- Save settings.
Set the context window
By default, Ollama uses a small context window. For Cline, set it to the model’s maximum:Create a Then build a named model from it:Select
Modelfile to override the context size:qwen-coder-256k in Cline’s model dropdown.- Cline can’t connect — Verify Ollama is running (
ollama listshould return results). Check that the base URL matches. - Slow responses — This is normal for local models. Enable compact prompts and consider using 4-bit quantization.
- Model errors or broken tool use — Verify
num_ctxis set correctly. Models with insufficient context cause truncated outputs.
The OpenAI-compatible endpoint pattern
Both Ollama and LM Studio expose a local HTTP server that follows the OpenAI API format. This means you can also configure them as a custom OpenAI-compatible provider in Cline if the dedicated provider option doesn’t appear:| Runtime | Base URL | Model name format |
|---|---|---|
| Ollama | http://localhost:11434/v1 | qwen2.5-coder:32b |
| LM Studio | http://localhost:1234/v1 | Name shown in LM Studio UI |
Performance expectations
| Metric | Typical range |
|---|---|
| Initial model load | 10–30 seconds |
| Token generation | 5–20 tokens/sec |
| Context processing (large) | Slower — scales with context size |
| Memory usage | Close to quantization download size |
Tips for better performance
- Use 4-bit quantization to maximize speed; use 8-bit if you have 64 GB+ RAM and want better quality.
- Enable compact prompts in Cline — this is the single biggest improvement you can make.
- Store models on an NVMe SSD, not a spinning hard drive.
- Close other applications to free RAM for the model.
- Enable Flash Attention if your hardware supports it.
When to use local vs. cloud models
| Local models are better when… | Cloud models are better when… |
|---|---|
| Code must stay on your machine | Codebase exceeds 256K tokens |
| You work offline | You need consistent team performance |
| You want zero API cost | You need the latest model capabilities |
| You’re experimenting with many iterations | Speed is critical |
Next steps
Model selection guide
Compare all models and find the right one for your workflow.
Context windows
Understand context limits and how to stay within them with large codebases.