Ollama: The Complete Guide to Running LLMs Locally (2026)
By LocalLLMGear Editorial · Editorial Team · Updated 2026-06-29
We test hardware hands-on and may use AI tools in research — every guide is human-reviewed. Editorial policy.
We may earn a commission from links in this article, at no extra cost to you. Disclosure.
Ollama is the simplest way to run open large language models on your
own machine — and once you go past the first ollama run, it’s also a quietly powerful
local AI server. This is the deeper reference: the API, Modelfiles, a real UI, and how to
match models to your hardware. If you just want the fastest possible start, our
step-by-step Ollama quick start gets you
chatting in two minutes — come back here when you want to build on top of it.
The 30-second answer: Ollama is a free, open-source tool that downloads and runs open models with one command (
ollama run llama3) and exposes a local REST API athttp://localhost:11434so your own apps can use them — fully private, no cloud, no keys.
What Ollama actually is
Two things in one package. First, a CLI for pulling and chatting with models. Second, a background server that exposes those models over a local HTTP API. That second part is what makes Ollama more than a toy: any app on your machine — a chat UI, a code assistant, a Python script — can talk to it without sending a byte to the cloud. The models themselves are open weights (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek and more), and Ollama ships quantized versions by default so they fit on normal GPUs.
Installing Ollama
- macOS / Windows: download the installer from ollama.com and run it. It installs a menu-bar/tray app that keeps the server running.
- Linux:
curl -fsSL https://ollama.com/install.sh | sh. It sets up asystemdservice, so the API is available on boot.
After install, confirm it’s alive:
ollama --version
ollama list # shows installed models (empty at first)
Pulling and running models
ollama run is the do-everything command — it downloads the model if you don’t have it,
then drops you into a chat:
ollama run llama3
If you only want to download (for later, or to script with), use pull:
ollama pull qwen2.5:14b
ollama pull mistral
ollama pull gemma2:2b
A model name can carry a tag for size or variant — llama3.1:8b, qwen2.5:32b,
phi3:mini. No tag means the default (usually a sensible mid-size, 4-bit quantized build).
Other handy commands:
ollama list # what you have
ollama ps # what's loaded in memory right now
ollama rm <model> # delete a model to reclaim disk
ollama show <model> # license, params, quantization, context length
Using the local API in your apps
Whenever Ollama is running, it serves an HTTP API on port 11434. The simplest call:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain RAG in one sentence.",
"stream": false
}'
There’s also a /api/chat endpoint for multi-turn conversations, plus an
OpenAI-compatible path at /v1/chat/completions — point an existing OpenAI client at
http://localhost:11434/v1 with any dummy API key and most code “just works.” The official
ollama Python and JavaScript libraries wrap all of this if you’d rather not hand-roll
requests. This is the bridge that lets local models power editors, note apps and
VS Code assistants without a cloud bill.
Customizing with a Modelfile
A Modelfile is Ollama’s recipe format — think Dockerfile, but for a model. It lets you bake in a system prompt, default parameters, or a fine-tuned/GGUF base into a reusable named model:
FROM llama3
PARAMETER temperature 0.3
SYSTEM "You are a terse senior Go engineer. Answer with code first, prose second."
Build and run it:
ollama create go-helper -f ./Modelfile
ollama run go-helper
You can also point FROM at a downloaded GGUF file to import models that aren’t in the
Ollama library — handy for community fine-tunes from Hugging Face.
Adding a UI: Open WebUI
The terminal is great, but a ChatGPT-style interface makes local models feel finished. Open WebUI is the popular, self-hosted front-end; it auto-detects Ollama on the same machine. The quickest path is Docker:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data --name open-webui \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 — model picker, chat history, document upload and prompt
presets, all running locally against your Ollama server.
Choosing a model size for your VRAM
The single rule that decides everything: the quantized model has to fit in your VRAM, with a little headroom for context. Go over, and Ollama offloads layers to system RAM/CPU — it still runs, just much slower. Rough, approximate guidance:
Popular Ollama models by size (sizes/VRAM are approximate, 4-bit quantized)
| GPU / Option | VRAM | Best for |
|---|---|---|
| gemma2:2b | 2B · ~2 GB | Tiny/fast — laptops, CPU-friendly |
| llama3.1:8b ★ Our pick | 8B · ~6 GB | Great all-rounder — the default pick |
| qwen2.5:14b | 14B · ~10 GB | Stronger reasoning + coding |
| qwen2.5:32b | 32B · ~20 GB | High quality, needs a 24 GB card |
| llama3.1:70b | 70B · ~40 GB+ | Top tier — multi-GPU or big unified RAM |
Not sure your card can keep up? Our best GPU for local LLMs guide maps VRAM tiers to real models, and on Apple Silicon the unified memory pool changes the math entirely.
Common troubleshooting
- Model runs slowly / pegs the CPU: it didn’t fit in VRAM and offloaded. Drop to a
smaller model or a heavier quant, or close other GPU apps. Check with
ollama ps. connection refusedon port 11434: the server isn’t running. Launch the app (macOS/Windows) orsudo systemctl start ollama(Linux). Runollama servemanually to see logs.- Out of memory mid-generation: lower the context window, use a smaller model, or reduce how much you’re feeding it at once.
- Another app can’t reach Ollama: by default it binds to localhost. To expose it to
other machines on your LAN, set
OLLAMA_HOST=0.0.0.0before starting the server (only on networks you trust). - Out of disk: models are big. Prune with
ollama rm;ollama listshows sizes.
Where to go next
Ollama is the engine; the rest is matching it to hardware and learning to build on it. Deciding between tools? Read LM Studio vs Ollama. Need the right card first? Start with best GPU for local LLMs or browse the full hardware hub.
And if you want to genuinely understand prompting, RAG and fine-tuning on top of local models — not just run them — a structured course shortcuts months of trial and error:
Learn the fundamentals on DataCamp Ad