What's the best small LLM to start with?

For most people in 2026, Llama 3.1 8B and Qwen2.5 7B are the safest first picks: both are strong all-rounders with wide tool support in Ollama and LM Studio. If you're on very limited hardware, drop to a 3B–4B model such as Phi or Gemma. These families update often, so check for the latest version before you download.

How much VRAM do I need for an 8B model?

A 4-bit quantized 8B model needs roughly 5–6 GB of VRAM, so an 8 GB card handles it comfortably with room for context. Smaller 3B–4B models run in 3–4 GB, and the tiniest 1B–2B models run on almost any modern laptop, even CPU-only.

Can I run a small LLM without a GPU?

Yes. Quantized 1B–4B models run on CPU via Ollama or LM Studio, just slower (a few tokens per second). A modern laptop with 16 GB of RAM can run a 7B–8B model on CPU too, though a GPU or Apple Silicon makes it far more pleasant.

The Best Small LLMs You Can Run on Almost Anything (2026)

By LocalLLMGear Editorial · Editorial Team · Updated 2026-06-29

We test hardware hands-on and may use AI tools in research — every guide is human-reviewed. Editorial policy.

We may earn a commission from links in this article, at no extra cost to you. Disclosure.

You don’t need a 24 GB GPU or a giant model to run a genuinely useful AI locally. In 2026, small LLMs (8B parameters and under) have gotten so good that a 7B or 8B model runs a chatbot, summarizes documents and writes code on hardware you probably already own — a mid-range GPU, a gaming laptop, or an Apple Silicon Mac. This guide covers the best small models from the families you can trust, and exactly what runs on what.

The 30-second answer: Start with Llama 3.1 8B or Qwen2.5 7B. Both are excellent all-rounders that fit in 8 GB of VRAM when quantized to 4-bit. Tight on hardware? Drop to Phi or Gemma at 3B–4B. These families update constantly, so check for the latest version before you download.

Why small models are worth your time

A few years ago “small” meant “dumb.” Not anymore. The 7B–8B models from major labs now handle everyday tasks — drafting, summarizing, Q&A, light coding — well enough that most people never need a bigger model running locally. They’re also fast, private (your data never leaves the machine), and free to run once you own the hardware.

The trick that makes them fit is quantization: compressing the model’s weights from 16-bit down to roughly 4-bit (the common GGUF Q4 formats) shrinks memory use by roughly 3–4x with only a small quality hit. That’s why an 8B model that would need ~16 GB at 16-bit precision squeezes into about 5–6 GB quantized, runtime overhead included. Tools like Ollama and LM Studio typically download pre-quantized builds, so you rarely have to quantize anything yourself.
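
The arithmetic behind those numbers can be sketched in a few lines. This is a rough estimate, not a measurement: the 4.5 bits-per-weight figure (4-bit formats store some scaling data alongside the weights) and the flat 1 GB runtime allowance are illustrative assumptions, and `estimate_vram_gb` is a hypothetical helper name.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Rough memory estimate: weight storage plus a flat runtime allowance.

    overhead_gb is an assumed allowance for buffers and a short context,
    not a measured value.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb


# 8B model, weights only, at 16 bits per weight
full = estimate_vram_gb(8, 16, overhead_gb=0.0)
# Same model at an assumed ~4.5 bits per weight, plus the allowance
q4 = estimate_vram_gb(8, 4.5)

print(f"8B at 16-bit: ~{full:.1f} GB")
print(f"8B at ~4.5 bits/weight: ~{q4:.1f} GB")
```

Under these assumptions the estimate lands at about 16 GB unquantized and about 5.5 GB quantized, in line with the ballpark figures above.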

The best small LLM families (≤8B)

These are established families with active releases and broad tooling support. Pick by what you want to do — they’re all close enough that you can swap between them in minutes.

Llama (Meta) — the safe default. The Llama 3.x 8B is the most widely supported small model in the ecosystem. It’s a strong generalist with reliable instruction-following, and almost every local tool, front-end and tutorial assumes you can run it. If you don’t know where to start, start here.

Qwen (Alibaba) — the multilingual all-rounder. The Qwen2.5 family at 7B punches above its size, especially on reasoning, math and coding, and it’s genuinely good across many languages. It comes in a wide range of sizes (0.5B up to large), so you can scale down to tiny variants on weak hardware and keep the same family.

Mistral 7B — the efficient classic. Mistral’s 7B is the model that proved small could be smart. It’s lean, fast and a great fit for 8 GB cards. Newer small Mistral variants keep the same efficiency-first character — a solid pick when you want speed.

Gemma (Google) — light and tidy. Gemma ships in small sizes (around 2B and 7B-class) that are easy to run and behave well for chat and summarization. The smaller variants are a nice middle ground when 7B is a touch heavy for your machine.

Phi (Microsoft) — the tiny overachiever. The Phi family (around 3.8B) is trained to do a lot with very few parameters. It runs on almost anything — modest laptops, even CPU-only — and is a great choice when VRAM is scarce but you still want coherent answers.

What runs on what (by VRAM)

The single number that decides what you can run is VRAM (or unified memory on a Mac). Here’s the practical mapping for 4-bit quantized models, with approximate sizes — treat these as ballpark, not benchmarks:

Small LLM size vs. hardware (4-bit quantized, approximate)

Model size (examples)	VRAM (approx.)	Best for
1B–2B (Gemma 2B, Qwen 0.5–1.5B)	~1–2 GB	Any laptop, even CPU-only — quick tasks
3B–4B (Phi ~3.8B, Llama 3B) ★ Our pick	~3–4 GB	Old/entry GPUs, light laptops
7B–8B (Llama 8B, Qwen 7B, Mistral 7B)	~5–6 GB	8 GB GPUs, gaming laptops, M-series Macs
8B at higher quality (Q6/Q8)	~8–10 GB	12 GB+ GPUs — best small-model quality
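
If you would rather look this up programmatically, the table can be turned into a small helper. This is a sketch only: it uses the upper end of each VRAM range as the cutoff to leave some headroom, which is an assumption rather than a tested threshold, and `largest_tier` is a hypothetical name.

```python
# (tier label, assumed cutoff in GB) taken from the table, largest first.
# Cutoffs use the upper end of each approximate range.
TIERS = [
    ("8B at higher quality (Q6/Q8)", 10.0),
    ("7B-8B at 4-bit", 6.0),
    ("3B-4B at 4-bit", 4.0),
    ("1B-2B at 4-bit", 2.0),
]


def largest_tier(free_gb: float) -> str:
    """Return the largest table tier whose cutoff fits in free_gb."""
    for label, cutoff_gb in TIERS:
        if free_gb >= cutoff_gb:
            return label
    return "below 2 GB: try the smallest 1B-class models on CPU"


for gb in (4, 8, 12):
    print(f"{gb} GB -> {largest_tier(gb)}")
```

Pass in the memory that is actually free, not the card's headline figure, since the desktop and other apps take a share.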

A few rules of thumb that hold in 2026 (verify against your exact card and the latest model release):

8 GB GPU (e.g. an RTX 4060 or RTX 3060 Ti): runs any 7B–8B model at 4-bit comfortably with room for a decent context window. This is the small-LLM sweet spot.
Gaming laptop with 6 GB: stick to 4B–7B; close other apps to free VRAM.
Apple Silicon Mac: unified memory is your VRAM. An 8 GB Mac runs 3B–7B; 16 GB+ handles 8B with ease and bigger context. Macs are quietly excellent for this.
No GPU at all: 1B–4B models run on CPU; a 16 GB-RAM machine can even run a 7B model, just slowly.
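
How much "room for context" is left over can be estimated as well. The context window lives in a key-value cache that grows with every token. The sketch below assumes Llama 3.1 8B's published shape (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache; other models and cache settings will give different numbers, and `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate key-value cache size in GiB for a given context length.

    Defaults are assumptions matching Llama 3.1 8B with a 16-bit cache.
    """
    # Factor of 2: one key vector and one value vector per layer per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30


for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens: ~{kv_cache_gib(ctx):.2f} GiB")
```

Under these assumptions an 8K-token context adds about 1 GiB on top of the weights, which is why an 8 GB card still has headroom after loading a 5–6 GB model, while very long contexts can eat that headroom quickly.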

Which small model should you pick?

Just want it to work? Llama 3.x 8B. Maximum compatibility and a strong all-rounder.
Coding, math or multilingual? Qwen2.5 7B.
Speed on an 8 GB card? Mistral 7B.
Weak hardware / CPU-only? Phi (~3.8B) or Gemma 2B.

Don’t overthink it — pull two and compare on your own prompts. With Ollama that’s two commands; our Ollama guide walks through it from zero.
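
Once both models are pulled, a side-by-side comparison can be scripted against Ollama's local HTTP API. This is a minimal sketch, assuming Ollama is running on its default port (11434) and that the tags `llama3.1:8b` and `qwen2.5:7b` are the ones you pulled; check the Ollama library for current tag names. The helper names `build_request`, `ask` and `compare` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if you changed the port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for one model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to a locally running Ollama and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def compare(models: list[str], prompt: str) -> dict[str, str]:
    """Run the same prompt through each model and collect the answers."""
    return {model: ask(model, prompt) for model in models}


# Example (requires a running Ollama with both models pulled):
# answers = compare(["llama3.1:8b", "qwen2.5:7b"],
#                   "Summarize the trade-offs of 4-bit quantization in two sentences.")
# for model, text in answers.items():
#     print(f"--- {model} ---\n{text}\n")
```

Using the same prompts you would actually use day to day tells you more than any general benchmark about which model suits you.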

The bottom line

Small LLMs are the best entry point to running AI locally: they’re fast, private, and most of them fit on hardware you already have. Start with an 8B model if your GPU has 8 GB, drop to 3B–4B if it doesn’t, and only reach for bigger models when a small one genuinely isn’t keeping up.

When that day comes (bigger models, longer context, or local training), the bottleneck is usually VRAM. See our best budget GPU for AI pick to add headroom cheaply, or browse all our hardware guides to plan an upgrade that won’t box you in.