Running Models Locally

Local models let you use Cline completely offline. Your code never leaves your machine, there are no per-token charges, and you can run as many tasks as your hardware allows. The tradeoff is hardware requirements and speed. Local models generate tokens at 5–20 tokens/second on consumer hardware, compared to 50–200+ tokens/second from cloud APIs. With the right setup, they handle real development work reliably.

Hardware requirements

Your available RAM determines which models you can run:

RAM	Recommended model	Quantization	Notes
32 GB	Qwen3 Coder 30B	4-bit (~17 GB)	Entry-level local coding
64 GB	Qwen3 Coder 30B	8-bit (~32 GB)	Better quality, full Cline features
128 GB+	GLM-4.5-Air	4-bit	Cloud-competitive performance

Most models under 30B parameters fail with Cline’s tool-use format — they produce broken outputs, refuse commands, or lose context across files. Stick to Qwen3 Coder 30B or larger unless you’ve specifically verified a smaller model works.

Recommended model: Qwen3 Coder 30B

After extensive testing, Qwen3 Coder 30B is the most reliable model under 70B parameters for use with Cline:

256K native context window — handles entire repositories
Strong tool-use — reliable command execution with Cline’s format
Consistent outputs — maintains context across multi-file tasks

Download sizes: 4-bit (~17 GB), 8-bit (~32 GB), 16-bit (~60 GB)

Choose your runtime

Ollama
LM Studio

Ollama is a command-line tool for running models locally. It has lower memory overhead than LM Studio and works well for scripting and server deployments.Best for: Power users, terminal-comfortable developers, server setups.

Install Ollama

Visit ollama.com and download the installer for your operating system (Windows, macOS, or Linux). Run the installer and follow the prompts.

Download a model

Open your terminal and pull the recommended model:

ollama pull qwen2.5-coder:32b

Browse all available models at ollama.com/search. Other capable options:

ollama pull mistral-small:latest
ollama pull codellama:34b-code

The first download may take several minutes depending on model size and your connection.

Start the model

Ollama starts automatically after installation and runs in the background. To run the model interactively to verify it works:

ollama run qwen2.5-coder:32b

Keep Ollama running whenever you use it with Cline. The default server port is 11434.

Configure Cline

Click the settings gear icon (⚙️) in the Cline panel.
Set API Provider to Ollama.
Set Base URL to http://localhost:11434/ (the default — no change needed unless you’re running Ollama on a different host).
Select your downloaded model from the Model dropdown.
Save settings.

Set the context window

By default, Ollama uses a small context window. For Cline, set it to the model’s maximum:Create a Modelfile to override the context size:

FROM qwen2.5-coder:32b
PARAMETER num_ctx 262144

Then build a named model from it:

ollama create qwen-coder-256k -f Modelfile

Select qwen-coder-256k in Cline’s model dropdown.

Enable compact prompts

In Cline settings, go to Features → Use Compact Prompt and toggle it on. This reduces prompt size by ~90% — essential for getting acceptable speed with local models.

Troubleshooting Ollama:

Cline can’t connect — Verify Ollama is running (ollama list should return results). Check that the base URL matches.
Slow responses — This is normal for local models. Enable compact prompts and consider using 4-bit quantization.
Model errors or broken tool use — Verify num_ctx is set correctly. Models with insufficient context cause truncated outputs.

LM Studio is a desktop application with a graphical interface for downloading and running local models. It’s the easiest way to get started.Best for: Desktop users who prefer a GUI, quick experimentation.

Install LM Studio

Visit lmstudio.ai and download the installer for your operating system. LM Studio requires AVX2 CPU support (most computers from 2013 onward).

Download a model

Open LM Studio.
Navigate to the Discover tab.
Search for Qwen3-Coder-30B-A3B-Instruct.
Select the appropriate quantization for your RAM:
- 32 GB RAM — 4-bit GGUF (~17 GB download)
- 64 GB RAM — 8-bit GGUF (~32 GB download)
- Mac (Apple Silicon) — MLX format for better performance
Click Download and wait for it to complete.

Start the local server

Navigate to the Developer tab.
Load your downloaded model.
Toggle the server switch to Running.

The server starts at http://localhost:1234. Keep LM Studio open while using Cline.

Configure model settings

After loading your model in the Developer tab, configure these critical settings:

Setting	Value	Why
Context Length	`262144`	Set to model maximum for best results
KV Cache Quantization	Off (unchecked)	Critical — enables consistent tool use
Flash Attention	On (if available)	Improves inference speed

Configure Cline

Click the settings gear icon (⚙️) in the Cline panel.
Set API Provider to LM Studio.
The Base URL defaults to http://localhost:1234 — leave this as-is.
Select your loaded model from the Model dropdown.
Save settings.

Enable compact prompts

In Cline settings, go to Features → Use Compact Prompt and toggle it on. This reduces prompt size by ~90%, which is essential for local inference speed.

Troubleshooting LM Studio:

Cline can’t connect — Verify the server is running in the Developer tab and a model is loaded.
Inconsistent tool use or errors — Make sure KV Cache Quantization is off. This is the most common cause of tool-use failures with LM Studio.
Out of memory — Switch to a lower quantization (4-bit instead of 8-bit) or close other applications to free RAM.

The OpenAI-compatible endpoint pattern

Both Ollama and LM Studio expose a local HTTP server that follows the OpenAI API format. This means you can also configure them as a custom OpenAI-compatible provider in Cline if the dedicated provider option doesn’t appear:

Runtime	Base URL	Model name format
Ollama	`http://localhost:11434/v1`	`qwen2.5-coder:32b`
LM Studio	`http://localhost:1234/v1`	Name shown in LM Studio UI

To use this approach, select OpenAI Compatible as the provider in Cline, enter the base URL above, and enter any non-empty string as the API key (local servers don’t authenticate).

Performance expectations

Metric	Typical range
Initial model load	10–30 seconds
Token generation	5–20 tokens/sec
Context processing (large)	Slower — scales with context size
Memory usage	Close to quantization download size

Tips for better performance

Use 4-bit quantization to maximize speed; use 8-bit if you have 64 GB+ RAM and want better quality.
Enable compact prompts in Cline — this is the single biggest improvement you can make.
Store models on an NVMe SSD, not a spinning hard drive.
Close other applications to free RAM for the model.
Enable Flash Attention if your hardware supports it.

When to use local vs. cloud models

Local models are better when…	Cloud models are better when…
Code must stay on your machine	Codebase exceeds 256K tokens
You work offline	You need consistent team performance
You want zero API cost	You need the latest model capabilities
You’re experimenting with many iterations	Speed is critical

Get Started

Core Workflows

Features

Customization

MCP

Models & Providers

Cline CLI

Troubleshooting

Running Models Locally

Hardware requirements

Recommended model: Qwen3 Coder 30B

Choose your runtime

The OpenAI-compatible endpoint pattern

Performance expectations

Tips for better performance

When to use local vs. cloud models

Next steps

Model selection guide

Context windows

​Hardware requirements

​Recommended model: Qwen3 Coder 30B

​Choose your runtime

​The OpenAI-compatible endpoint pattern

​Performance expectations

​Tips for better performance

​When to use local vs. cloud models

​Next steps

Model selection guide

Context windows

Hardware requirements

Recommended model: Qwen3 Coder 30B

Choose your runtime

The OpenAI-compatible endpoint pattern

Performance expectations

Tips for better performance

When to use local vs. cloud models

Next steps