LocalLLMGear

How to Run Local LLMs in VS Code (2026)

By LocalLLMGear Editorial · Editorial Team · Updated 2026-06-29

We test hardware hands-on and may use AI tools in research — every guide is human-reviewed. Editorial policy.

We may earn a commission from links in this article, at no extra cost to you. Disclosure.

A local LLM inside VS Code gives you most of what a paid cloud copilot does — inline autocomplete, chat about your code, even multi-file edits — except it runs entirely on your own machine. No per-seat fee, no code leaving your laptop, and it works on a plane. The setup is genuinely simple: run a model with Ollama, then point a free VS Code extension at it. This guide walks the whole path, start to finish.

The 30-second answer: Install Ollama and pull a code model (ollama pull qwen2.5-coder). In VS Code, install the Continue (or Cline) extension from the marketplace, then set its provider to Ollama and pick your model. That’s it — you get private autocomplete and chat, fully offline. All three tools are free.

What you’re actually wiring together

Three free pieces, each doing one job:

  • Ollama runs the model and exposes a local API at http://localhost:11434. This is the engine — see the complete Ollama guide for install and how the API works.
  • VS Code is your editor (nothing special needed — the stock install).
  • An extension, Continue or Cline, is the bridge. It lives in the editor, sends your code and prompts to Ollama, and shows you completions, chat, and diffs.

None of these cost anything, and none of them require an account. The model weights are open, the tools are open source, and the whole loop stays on your computer.
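To make the "bridge" role concrete, here is a minimal Python sketch of the kind of request an extension sends to Ollama on your behalf. It assumes Ollama's /api/generate endpoint with a non-streaming reply; the function names are illustrative, and real extensions may use other endpoints and options.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # the local endpoint named in this guide


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.load(resp)["response"]
```

With Ollama running, generate("qwen2.5-coder", "Write a Python hello world") returns the model's reply as a string, and the request only ever goes to localhost.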

Step 1 — Get a model running with Ollama

If you haven’t already, install Ollama and pull a code-focused model. Code models track code structure and handle fill-in-the-middle completion better than general chat models:

# a capable all-round coder for chat/refactors
ollama pull qwen2.5-coder

# a small, fast model for low-latency autocomplete
ollama pull qwen2.5-coder:1.5b

Confirm the server is alive — the extension talks to this exact endpoint:

ollama list                          # models you have
curl http://localhost:11434/api/tags # should return JSON, not "connection refused"

If that curl fails, Ollama isn’t running. Launch the menu-bar/tray app (macOS/Windows) or run sudo systemctl start ollama (Linux). Full troubleshooting is in the Ollama guide. Not sure which model to pull for your hardware? Our best local LLM for coding guide breaks down the families by quality and VRAM.
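If you would rather script that check, the same endpoint can be read from Python. This is a small sketch using only the standard library; it assumes the /api/tags response is a JSON object with a "models" list whose entries carry a "name" field, and the function names are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # endpoint used by the curl check


def parse_model_names(payload: dict) -> list[str]:
    """Pull the model names out of an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> list[str]:
    """Ask the local server which models it has.

    Raises urllib.error.URLError if the server is not running,
    which is the scripted equivalent of "connection refused".
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))
```

Calling installed_models() should return roughly the names that ollama list prints.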

Step 2 — Install the VS Code extension

Open VS Code, go to the Extensions panel (Ctrl/Cmd+Shift+X), and search for either:

  • Continue — the easiest on-ramp. Great inline autocomplete and a chat sidebar that can see your open files and selection. Start here if you’re unsure.
  • Cline — more agentic. It’s built around multi-file edits and running tasks across a project, so it shines when you want the assistant to do things, not just suggest.

Click Install. They don’t conflict, so you can keep both and use Continue for quick completions and Cline for larger, agent-style edits. Here’s the quick split:

Continue vs Cline for local models in VS Code

Extension             Type                  Best for
Continue (our pick)   Autocomplete + chat   Easiest start, inline suggestions, chat-with-your-code
Cline                 Agentic edits         Multi-file changes, task-style "do this across the repo"

Step 3 — Point the extension at your local Ollama

This is the only configuration that matters: tell the extension to use Ollama as the provider instead of a cloud API.

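In Continue: open its config (the gear/settings icon in the Continue sidebar) and add Ollama as a model provider. Older Continue releases keep this in a JSON file (config.json), while newer ones use YAML (config.yaml) with the same provider and model fields. In the JSON form it looks roughly like this: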

{
  "models": [
    {
      "title": "Qwen Coder (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Fast autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  }
}

The key fields are "provider": "ollama" and a "model" name that matches what ollama list shows. Continue defaults to the local http://localhost:11434 endpoint, so you usually don’t need to set a URL at all.
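Name mismatches are the usual failure here, so a quick scripted comparison can help. The sketch below assumes the JSON layout shown above ("models" and "tabAutocompleteModel") and that Ollama treats an untagged name as the :latest tag; the helper names are hypothetical.

```python
def normalize(name: str) -> str:
    """Treat an untagged model name as its ':latest' tag."""
    return name if ":" in name else f"{name}:latest"


def configured_models(config: dict) -> list[str]:
    """Collect every model name the config refers to."""
    names = [m["model"] for m in config.get("models", [])]
    autocomplete = config.get("tabAutocompleteModel")
    if autocomplete:
        names.append(autocomplete["model"])
    return names


def missing_models(config: dict, installed: list[str]) -> list[str]:
    """Return configured names that do not appear among the installed models."""
    have = {normalize(n) for n in installed}
    return [n for n in configured_models(config) if normalize(n) not in have]
```

Pass it the parsed config and the names from ollama list; anything it returns still needs an ollama pull or a corrected spelling.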

In Cline: open its settings, choose Ollama as the API provider, and select your pulled model from the dropdown. If it asks for a base URL, use http://localhost:11434.

Save, and the extension reconnects to your local server. No API key needed — and if a field demands one, type any dummy value; Ollama ignores it.

Step 4 — Use it: autocomplete and chat

Now it works where you already do:

  • Autocomplete: start typing a function and the extension proposes a completion inline. Press Tab to accept. This fires constantly, so it should point at your small fast model — latency matters more than raw quality here.
  • Chat: open the extension’s sidebar, select a block of code (or a whole file), and ask “explain this,” “add error handling,” or “write a test.” Continue and Cline can read your selection and open files for context.
  • Edits / refactors: ask for a change and the extension shows a diff you can apply. This is where a larger chat model earns its keep.

A common, comfortable setup: a tiny model (1B–3B) for autocomplete and a bigger coder (7B–32B) for chat. Just remember two loaded models use roughly the sum of their memory.

Hardware: it all comes down to VRAM

The single thing that decides whether this feels snappy or sluggish is whether the model fits in your VRAM. If it does, completions are near-instant; if it overflows into system RAM, everything crawls. Rough guide: a 7B coder in 4-bit wants ~6–8 GB, a 14B wants ~10–12 GB, and a 32B wants ~20–24 GB (numbers approximate; verify against the exact build you download).

That’s why the autocomplete-vs-chat split helps: a 1.5B model sips VRAM and stays responsive while you type, leaving room for the bigger model on demand. Picking or upgrading a card? Our hardware hub maps VRAM tiers to the models they unlock.
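The arithmetic is simple enough to sketch. The figures below are the upper ends of this guide's rough 4-bit ranges; the 1.5B entry is an assumption (no figure is quoted for it here), so replace it with what you actually observe on your machine.

```python
# Approximate upper-end VRAM needs in GB for 4-bit builds, taken from the
# rough ranges quoted in this guide. Ballpark figures, not measurements.
APPROX_VRAM_GB = {
    "7b": 8.0,
    "14b": 12.0,
    "32b": 24.0,
    # Assumption: placeholder for the small autocomplete model; the guide
    # gives no number for it, so adjust to what `ollama ps` reports.
    "1.5b": 2.0,
}


def total_vram_gb(sizes: list[str]) -> float:
    """Two loaded models use roughly the sum of their memory."""
    return sum(APPROX_VRAM_GB[s] for s in sizes)


def fits_in_vram(sizes: list[str], vram_gb: float) -> bool:
    """True if the estimated total stays within the card's VRAM."""
    return total_vram_gb(sizes) <= vram_gb
```

Under these estimates, fits_in_vram(["7b", "1.5b"], 12.0) is true, while a 32B chat model plus the autocomplete model does not fit on a 24 GB card at the upper estimate.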

Troubleshooting

  • “Connection refused” / no model found: Ollama isn’t running, or the model name in the config doesn’t match ollama list. Fix the name or start the server.
  • Autocomplete lags: the autocomplete model is too big — switch it to a 1B–3B model. Check loaded models with ollama ps.
  • Chat answers are weak: use a dedicated code model (Qwen Coder, DeepSeek Coder) at the largest size your VRAM allows — see best local LLM for coding.
  • First response is slow, then fast: that’s the model loading into memory the first time. Subsequent calls are quicker while it stays resident.
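For the lag case, you can also check programmatically whether a loaded model has spilled out of VRAM. This sketch assumes Ollama's /api/ps endpoint (the API counterpart of ollama ps) returns a "models" list whose entries include "size" and "size_vram" in bytes; treat the field names as an assumption to verify against your Ollama version.

```python
import json
import urllib.request


def spilled_models(ps_payload: dict) -> list[str]:
    """Names of loaded models that are only partly resident in VRAM."""
    return [
        m["name"]
        for m in ps_payload.get("models", [])
        if m.get("size_vram", 0) < m.get("size", 0)
    ]


def check_local_server(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch the running-model list from the local server and report spills."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=3.0) as resp:
        return spilled_models(json.load(resp))
```

An empty list from check_local_server() suggests everything loaded is sitting fully on the GPU; any name it returns is a candidate for a smaller size.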

You’re done

That’s the whole loop: a model in Ollama, a free extension in VS Code, the provider set to local. You now have private, offline autocomplete and chat that costs nothing per month — a real alternative to cloud copilots for a large chunk of day-to-day work. The gap to the very best cloud models narrows every release, and on a capable GPU it’s small for everyday coding.


Frequently asked questions

Is running a local LLM in VS Code free?

Yes. Ollama and the Continue and Cline extensions are free and open source, and VS Code is a free download built on an open-source core. You only pay for the hardware you already own: there's no subscription, API key, or per-seat fee, and nothing leaves your machine.

Continue or Cline: which should I use?

Continue is the easiest start for inline autocomplete and chat-with-your-code, so begin there. Cline leans toward agentic, multi-file edits and running tasks across your project. They don't conflict, so many people install both and use each where it shines.

Why is my local autocomplete slow or laggy in VS Code?

Almost always the model is too big for your VRAM and has spilled into system RAM. Use a small, fast model (a 1B–3B coder) for autocomplete and save a larger one for chat. Check what's loaded with ollama ps and watch your GPU memory.
