The Hardware You Actually Need
Three tiers with exact specs and VRAM math. From $5/month to RTX 4090 — pick what fits your scale.
The Three Tiers at a Glance
Cloud API Mode — No GPU Needed
Open Notebook app + SurrealDB. All AI inference happens on the provider's servers.
A $5/month DigitalOcean droplet or Hetzner CX22 runs this fine. Use OpenAI API, Groq (free tier), or Anthropic for inference. Open Notebook itself needs ~2 GB RAM; the rest is headroom for SurrealDB and file processing.
Local Ollama Models — Consumer GPU
Run models locally for privacy and zero per-token cost. One user, one GPU.
GPU recommendations:
| GPU | VRAM | Best Model Fit | Approx. Price |
|---|---|---|---|
| RTX 3060 | 12 GB | 7B models @ 8192 ctx | ~$280 |
| RTX 4070 | 12 GB | 7B–13B models @ 4096 ctx | ~$550 |
| RTX 4090 | 24 GB | 20B models @ 8192 ctx, or 7B @ 128K | ~$1,700 |
| Apple M3/M4 | 32 GB unified | 7B–13B models @ 8192 ctx | ~$1,600+ |
| 2× RTX 3090 | 48 GB total | Mixtral 8×7B, 34B models | ~$1,400 used |
Multi-User, High-Volume — Server GPU
5+ concurrent users running local models. Separate Ollama machine recommended.
For 5+ concurrent users, separate Ollama onto a dedicated GPU machine and connect via network URL. Run Open Notebook + SurrealDB + nginx on the app server, Ollama on the compute server. This prevents a single long inference from queuing all other users.
# On the Ollama machine:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# In Open Notebook docker-compose.yml: OLLAMA_BASE_URL=http://10.0.0.5:11434
VRAM Math: What Actually Matters
The three variables that determine VRAM usage:
| Factor | How It Affects VRAM | Rule of Thumb |
|---|---|---|
| Model parameters | Determines base memory. A 7B param model @ 4-bit quantization ≈ 4 GB. | ~0.5 GB per 1B params (4-bit quantized) |
| Context window (num_ctx) | Each additional 2048 tokens ≈ 1 GB VRAM for 7B. Scales with parameters. | 1 GB per 2048 ctx @ 7B |
| Concurrent requests | Ollama processes one request at a time per model. Multiple users queue. | No VRAM multiplier, but latency grows |
num_ctx from 128000 to 8192 to prevent OOM on consumer GPUs. If you bump it back up, calculate your VRAM first. A 7B model at 128K context needs ~64 GB VRAM — that's 2× RTX 4090 territory.
Quick formula for 4-bit quantized models:
VRAM ≈ (params_in_B × 0.5) + (num_ctx / 2048) × (params_in_B / 7) GB
Example: 7B model @ 8192 ctx → 3.5 + (8192/2048) × 1 = 7.5 GB VRAM. Fits in 8 GB.
Cloud vs Local: Cost Breakeven
If you're processing documents all day, local models pay for themselves quickly:
| Scenario | Cloud API (monthly) | Local Ollama (monthly) | Breakeven |
|---|---|---|---|
| Light use (50 docs, 200 queries) | $3–8 | $0 + electricity | Never — stay cloud |
| Medium use (200 docs, 1K queries) | $15–30 | $0 + elec | ~2 years vs RTX 3060 |
| Heavy use (1K docs, 5K queries) | $50–100 | $0 + elec | ~6 months vs RTX 4090 |
| Team of 5, heavy use | $200–500 | $0 + elec + infra | ~3 months |
Assumes GPT-4o-mini pricing (~$0.15/1M input, ~$0.60/1M output). Electricity: ~$15–40/month for a GPU machine under load. Detailed model cost comparison →
Picked Your Hardware?
Next: follow the deployment guide → or get the Production Manual for monitoring, CI/CD, and the 30+ errors that hit at scale.
Get the Production Manual — $19