Running Local AI: How to Run LLMs Privately on Your Own Hardware

I’ve been running AI models locally for about a year now, and it completely changed how I think about privacy. Every word I type stays on my machine — no servers, no telemetry, no third parties.

Cloud tools like ChatGPT and Claude are powerful, but many services store your input on remote servers. That trade-off bothered me when working with sensitive code or private documents, so I started exploring local AI.

Why Go Local?

  • Absolute Privacy: My prompts never leave my machine. No data goes to any cloud provider, period.
  • No Recurring Costs: After the initial hardware investment, running models costs nothing beyond electricity.
  • Offline Capability: I’ve used local AI on flights and during internet outages. It just works.
  • Full Control: Run any open-source model without rate limits or usage caps.

What I Actually Use It For

This isn’t just theory — I use local AI daily in two practical ways:

Speech-to-text on My Mac: I have a local model on my Mac for speech-to-text. I speak, it writes — that’s how I’m writing this post right now.

News Summarization on My iPhone: I run a small model as an iPhone app. Long articles get fed in, and I get a clean 150-200 word summary. Fully offline.

Model Sizes

The “B” stands for billions of parameters — more is generally smarter, but needs more hardware.

  • 7B-8B (Mistral 7B, Llama 3 8B): My go-to. Runs on consumer hardware, handles summarization, coding, and conversation well.
  • 13B (Llama 2 13B, CodeLlama 13B): Smarter, especially for complex reasoning and longer code. Needs more VRAM.
  • 70B (Llama 2 70B): Approaching cloud-level quality, but the weights alone take roughly 35-40GB even at 4-bit quantization, so plan on 48GB+ of VRAM or unified memory.
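  • 141B total / 39B active (Mixtral 8x22B): A Mixture-of-Experts model with 8 experts, of which 2 are used per token, so only about 39B parameters are active per forward pass. All 141B parameters still have to sit in memory, so it needs well over 48GB even with heavy quantization.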
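As a rough sanity check on sizes like these, here is a minimal Python sketch that estimates the memory taken by the weights alone from the parameter count and the bits stored per weight. The function name weight_memory_gb is made up for illustration, and the estimate ignores the context cache and runtime overhead, so real usage will be somewhat higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    This is a back-of-the-envelope figure: it ignores the KV cache,
    activations, and any per-block overhead a quantization format adds.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts (in billions) for the tiers discussed in this section.
for name, size_b in [("7B", 7), ("13B", 13), ("70B", 70), ("Mixtral 8x22B", 141)]:
    fp16 = weight_memory_gb(size_b, 16)
    q4 = weight_memory_gb(size_b, 4)
    print(f"{name:>14}: {fp16:6.1f} GB at 16-bit, {q4:5.1f} GB at 4-bit")
```

Note that a Mixture-of-Experts model is sized by its total parameter count here, not its active count, because every expert has to be loaded even though only some run per token.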

VRAM Requirements

  • 8GB: Sweet spot for quantized 7B models — what I started with.
  • 16GB: Handles quantized 13B models comfortably; 4-bit 30B models are a tight fit.
  • 24GB+: Unlocks larger quantized models (around 30B) and 13B at 8-bit. 13B at 16-bit needs about 26GB for the weights alone.
  • 48GB+: Quantized 70B models become accessible, especially on Apple Silicon with unified memory.
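To turn a VRAM budget into a concrete choice, a small helper can pick the highest precision that still fits. This is only a sketch: best_precision is a hypothetical name, and the 20% headroom for context cache and runtime overhead is an assumption, not a measured figure.

```python
from typing import Optional

# Assumed headroom on top of the raw weights (context cache, buffers).
# 1.2 is a guess for illustration; real overhead depends on context length.
OVERHEAD = 1.2


def best_precision(params_billion: float, vram_gb: float) -> Optional[int]:
    """Return the widest of 16/8/4 bits per weight that fits, or None."""
    for bits in (16, 8, 4):
        needed_gb = params_billion * bits / 8 * OVERHEAD
        if needed_gb <= vram_gb:
            return bits
    return None


for size_b, vram in [(7, 8), (13, 16), (13, 24), (70, 24), (70, 48)]:
    bits = best_precision(size_b, vram)
    verdict = f"{bits}-bit" if bits else "does not fit"
    print(f"{size_b}B model on {vram}GB: {verdict}")
```

Treat the result as a starting point and leave extra room if you plan to use long prompts.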

Getting Started

  1. Install Ollama: Head to ollama.com and download the installer. Takes about two minutes.
  2. Pull your first model: Run ollama run llama3 in a terminal. This downloads the 8B model (~4.7GB) and drops you into a chat session. That’s it.
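  3. Add a web interface: With Docker installed, run docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main and open http://localhost:3000 for a ChatGPT-like experience. The --add-host flag lets the container reach Ollama on the host, and the -v volume keeps your chats across container restarts.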
  4. Explore: Try ollama pull mistral or ollama pull codellama for different use cases.
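Once Ollama is running, you can also script against it. The sketch below uses only the Python standard library and assumes Ollama is listening on its default local port (11434) with the llama3 model already pulled. The helper names build_request and ask are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if you changed the port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running Ollama server, so it is left commented out):
# print(ask("Explain quantization in one sentence."))
```

The request never leaves localhost, so the privacy story is the same as in the terminal.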

Troubleshooting

Running on CPU instead of GPU? On Linux, install the NVIDIA drivers first. Out of memory? Try a smaller model, or a more heavily quantized variant (tags ending in q4_0, for example, use 4-bit weights). Once a model is downloaded, everything runs fully offline.
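If you're not sure whether the NVIDIA driver is installed at all, a quick check is to see whether nvidia-smi is on your PATH and what it reports. The sketch below is one way to do that from Python. The function name nvidia_gpu_names is hypothetical, and it only covers NVIDIA cards, not Apple Silicon or AMD.

```python
import shutil
import subprocess


def nvidia_gpu_names() -> list[str]:
    """Return one line per GPU reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools not installed or not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return []  # tool present but failed, e.g. driver not loaded
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = nvidia_gpu_names()
print(gpus if gpus else "No NVIDIA GPU visible; models will likely run on CPU.")
```

An empty result usually means the driver is the thing to fix before blaming Ollama.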

Personal Take

Local AI shines for sensitive data, offline work, and experimenting without API costs. That said, I still use cloud AI when I need the latest models or large context windows — it’s not either-or.

Note: Hardware prices and model availability may change. Always check current pricing before purchasing.

Disclosure: Some links in this post may be affiliate links. I may earn a commission at no extra cost to you.

💻 Hardware Recommendation: To run local LLMs smoothly, you’ll need a GPU with at least 16GB of VRAM or an Apple Silicon Mac. See my GPU Buying Guide for recommendations.