Running Local AI: How to Run LLMs Privately on Your Own Hardware
I’ve been running AI models locally for about a year now, and it completely changed how I think about privacy. Every word I type stays on my machine — no servers, no telemetry, no third parties.
Cloud tools like ChatGPT and Claude are powerful, but many services store your input on remote servers. That trade-off bothered me when working with sensitive code or private documents, so I started exploring local AI.
Why Go Local?
- Absolute Privacy: My prompts never leave my machine. No data goes to any cloud provider, period.
- No Recurring Costs: After the initial hardware investment, running models costs nothing beyond electricity.
- Offline Capability: I’ve used local AI on flights and during internet outages. It just works.
- Full Control: Run any open-source model without rate limits or usage caps.
What I Actually Use It For
This isn’t just theory — I use local AI daily in two practical ways:
Speech-to-text on My Mac: A local model on my Mac handles dictation. I speak, it writes. That’s how I’m writing this post right now.
News Summarization on My iPhone: I run a small model as an iPhone app. Long articles get fed in, and I get a clean 150-200 word summary. Fully offline.
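The iPhone app itself isn't something I can show here, but the same idea can be sketched on a desktop with Ollama. This is a minimal example under stated assumptions: the `summary_prompt` helper and the `article.txt` filename are hypothetical, and `llama3` stands in for whatever small model you have pulled.

```sh
#!/bin/sh
# Hypothetical helper: prints a summarization prompt followed by the article text.
summary_prompt() {
    printf 'Summarize the following article in 150-200 words.\n\n'
    cat "$1"
}

# Assumption: the article has been saved as plain text in article.txt.
article="${1:-article.txt}"

# Only call the model if Ollama is installed and the file exists.
if command -v ollama >/dev/null 2>&1 && [ -f "$article" ]; then
    ollama run llama3 "$(summary_prompt "$article")" \
        || echo "ollama run failed (is the Ollama server running?)"
fi
```

Passing the prompt as an argument makes `ollama run` answer once and exit instead of opening an interactive chat.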
Model Sizes
The “B” stands for billions of parameters. More parameters generally mean a more capable model, but also more memory.
- 7B to 8B (Mistral 7B, Llama 3 8B): My go-to. Runs on consumer hardware, handles summarization, coding, and conversation well.
- 13B (Llama 2 13B, CodeLlama 13B): Smarter, especially for complex reasoning and longer code. Needs more VRAM.
- 70B (Llama 2 70B): Approaching cloud-level quality, but needs roughly 48GB of VRAM even with 4-bit quantization. Unquantized 16-bit weights alone take about 140GB.
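- 141B total / 39B active (Mixtral 8x22B): A Mixture-of-Experts model with 8 experts, 2 of which are used per token, so about 39B parameters are active per forward pass. All 141B weights still have to sit in memory, so expect roughly 70GB or more even at 4-bit quantization.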
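A rough way to sanity-check these sizes is parameters × bits per weight ÷ 8, which gives the size of the weights in gigabytes. The sketch below covers weights only. The KV cache and runtime overhead come on top, and real quantized files run slightly larger than the estimate. The function name `weights_gb` is made up for illustration.

```sh
#!/bin/sh
# weights_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# Prints the approximate size of the model weights in GB (1 GB = 10^9 bytes).
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 7 4      # 3.5   (7B at 4-bit)
weights_gb 13 16    # 26.0  (13B at 16-bit)
weights_gb 70 4     # 35.0  (70B at 4-bit)
weights_gb 141 4    # 70.5  (Mixtral 8x22B at 4-bit, all experts loaded)
```

For a Mixture-of-Experts model, use the total parameter count rather than the active count, since every expert has to be loaded.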
VRAM Requirements
- 8GB: Sweet spot for quantized 7B models — what I started with.
- 16GB: Handles quantized 13B models and some aggressively quantized 30B models.
- 24GB+: Unlocks larger quantized models and 13B at 8-bit. Full 16-bit precision for 13B needs about 26GB for the weights alone, so it does not fit.
- 48GB+: Quantized 70B models become accessible, especially on Apple Silicon with unified memory.
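To see where your own machine lands in this list, something like the following could work. It is only a sketch: `vram_tier` and `mib_to_gb` are hypothetical helpers, the tiers simply mirror the list above, and the detection step assumes an NVIDIA card with `nvidia-smi` installed. Apple Silicon is not handled here.

```sh
#!/bin/sh
# Maps whole gigabytes of VRAM to the tiers listed above.
vram_tier() {
    gb="$1"
    if [ "$gb" -ge 48 ]; then echo "quantized 70B"
    elif [ "$gb" -ge 24 ]; then echo "larger quantized models"
    elif [ "$gb" -ge 16 ]; then echo "13B"
    elif [ "$gb" -ge 8 ]; then echo "quantized 7B"
    else echo "below the tiers listed"
    fi
}

# Rounds MiB (the unit nvidia-smi reports) to the nearest GiB, so a card
# that reports 24564 MiB still counts as 24.
mib_to_gb() { echo $(( ($1 + 512) / 1024 )); }

# Detection runs only when nvidia-smi is available; uses the first GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
    if [ -n "$mib" ]; then
        gb=$(mib_to_gb "$mib")
        echo "Detected ${gb} GB: $(vram_tier "$gb")"
    fi
fi
```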
Getting Started
- Install Ollama: Head to ollama.com and download the installer. Takes about two minutes.
- Pull your first model: Run `ollama run llama3` in a terminal. Downloads the 8B model (~4.7GB) and starts chatting. That’s it.
- Add a web interface: Run `docker run -d -p 3000:8080 ghcr.io/open-webui/open-webui:main` and open http://localhost:3000 for a ChatGPT-like experience.
- Explore: Try `ollama pull mistral` or `ollama pull codellama` for different use cases.
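Once a model is pulled, Ollama also exposes a local HTTP API, by default on port 11434, which is handy for scripting. Here is a minimal sketch that checks whether the server is up and then asks for one non-streaming completion. The `ollama_ready` helper is hypothetical, and the example assumes `curl` is installed and that `llama3` is the model you pulled.

```sh
#!/bin/sh
# Assumption: Ollama is listening on its default local address.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Returns 0 if an Ollama server answers at the given base URL.
# /api/tags lists the locally available models.
ollama_ready() {
    curl -fsS --max-time 2 "${1:-$OLLAMA_URL}/api/tags" >/dev/null 2>&1
}

if ollama_ready; then
    # One non-streaming completion; the response is a single JSON object.
    curl -s "$OLLAMA_URL/api/generate" \
        -d '{"model": "llama3", "prompt": "Say hello in five words.", "stream": false}'
else
    echo "No Ollama server at $OLLAMA_URL (is the app or 'ollama serve' running?)"
fi
```

The request never leaves localhost, so this fits the same privacy model as the chat interface.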
Troubleshooting
Running on CPU instead of GPU? On Linux, install the NVIDIA drivers first. Out of memory? Try a smaller model or a more heavily quantized variant, such as a tag ending in q4_0 (4-bit). Once a model is downloaded, everything runs fully offline.
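A quick diagnostic along these lines can help narrow down the CPU-versus-GPU question. Treat it as a sketch: `check_gpu_tooling` is a hypothetical helper, and it assumes a recent Ollama version where `ollama ps` lists loaded models along with a PROCESSOR column showing the CPU/GPU split.

```sh
#!/bin/sh
# Reports whether the NVIDIA driver tooling is visible on this machine.
check_gpu_tooling() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi found: NVIDIA driver appears to be installed"
    else
        echo "nvidia-smi not found: install the NVIDIA driver on Linux (not needed on Apple Silicon)"
    fi
}

check_gpu_tooling

# With a model loaded (for example, right after a chat), check where it is running.
if command -v ollama >/dev/null 2>&1; then
    ollama ps || true
fi
```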
Personal Take
Local AI shines for sensitive data, offline work, and experimenting without API costs. That said, I still use cloud AI when I need the latest models or large context windows — it’s not either-or.
Note: Hardware prices and model availability may change. Always check current pricing before purchasing.