The Hardware You Actually Need

Three tiers with exact specs and VRAM math. From a $5/month VPS to an RTX 4090: pick the tier that fits your scale.

The Three Tiers at a Glance

Minimum

Cloud API Mode — No GPU Needed

Open Notebook app + SurrealDB. All AI inference happens on the provider's servers.

RAM	Disk	GPU	VPS Cost
4 GB	2+ GB	None	$5/mo

A $5/month DigitalOcean droplet or Hetzner CX22 runs this fine. Use OpenAI API, Groq (free tier), or Anthropic for inference. Open Notebook itself needs ~2 GB RAM; the rest is headroom for SurrealDB and file processing.

Recommended

Local Ollama Models — Consumer GPU

Run models locally for privacy and zero per-token cost. One user, one GPU.

RAM	Disk	VRAM	API Cost
16–32 GB	20+ GB	8–24 GB	$0/mo

GPU recommendations:

GPU	VRAM	Best Model Fit	Approx. Price
RTX 3060	12 GB	7B models @ 8192 ctx	~$280
RTX 4070	12 GB	7B–13B models @ 4096 ctx	~$550
RTX 4090	24 GB	20B models @ 8192 ctx, or 7B @ 32K ctx	~$1,700
Apple M3/M4	32 GB unified	7B–13B models @ 8192 ctx	~$1,600+
2× RTX 3090	48 GB total	Mixtral 8×7B, 34B models	~$1,400 used
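
To see how much context a given card leaves room for, the rule of thumb from the VRAM Math section (about 0.5 GB per 1B parameters at 4-bit, plus about 1 GB per 2048 tokens at 7B) can be inverted. This is a rough sketch: the function name is made up for illustration, and it ignores runtime overhead, so treat the results as upper bounds rather than guarantees.

```python
def max_context_tokens(params_b: float, vram_gb: float) -> int:
    """Largest num_ctx the rule of thumb allows for a 4-bit model on a given card.

    params_b: model size in billions of parameters
    vram_gb:  VRAM available on the GPU, in GB
    """
    weights_gb = params_b * 0.5          # ~0.5 GB per 1B params at 4-bit
    free_gb = vram_gb - weights_gb       # what is left for the context window
    if free_gb <= 0:
        return 0                         # the weights alone do not fit
    gb_per_2048_tokens = params_b / 7    # ~1 GB per 2048 tokens at 7B, scaled by size
    return int(free_gb / gb_per_2048_tokens * 2048)


print(max_context_tokens(7, 12))    # 7B on a 12 GB card
print(max_context_tokens(13, 12))   # 13B on a 12 GB card
print(max_context_tokens(20, 24))   # 20B on a 24 GB card
```

By this estimate a 13B model on a 12 GB card has room for a bit over 4096 tokens but not for 8192, which matches the table's split between the 7B and 13B rows.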

Production

Multi-User, High-Volume — Server GPU

5+ concurrent users running local models. Separate Ollama machine recommended.

RAM	NVMe SSD	VRAM	Infra
64 GB	100+ GB	24–48 GB	$200+/mo

For 5+ concurrent users, separate Ollama onto a dedicated GPU machine and connect via network URL. Run Open Notebook + SurrealDB + nginx on the app server, Ollama on the compute server. This prevents a single long inference from queuing all other users.

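# On the Ollama machine: listen on all interfaces.
# Ollama has no built-in authentication, so firewall port 11434 to the app server only.
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# In Open Notebook's docker-compose.yml environment (10.0.0.5 is the Ollama machine's address):
OLLAMA_BASE_URL=http://10.0.0.5:11434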
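
Before pointing Open Notebook at the remote machine, it can help to confirm from the app server that Ollama is reachable. The sketch below queries Ollama's /api/tags endpoint, which lists the models installed on that host. The helper names are illustrative, and the address is the example one used above.

```python
import json
import urllib.request


def tags_url(base_url: str) -> str:
    """Build the URL of Ollama's model-listing endpoint."""
    return base_url.rstrip("/") + "/api/tags"


def parse_model_names(body: bytes) -> list[str]:
    """Extract model names from an /api/tags JSON response."""
    payload = json.loads(body)
    return [model["name"] for model in payload.get("models", [])]


def list_remote_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model names served by the Ollama instance at base_url.

    Raises OSError (for example urllib.error.URLError) if the host
    cannot be reached within the timeout.
    """
    with urllib.request.urlopen(tags_url(base_url), timeout=timeout) as resp:
        return parse_model_names(resp.read())
```

Calling list_remote_models("http://10.0.0.5:11434") from the app server should return the installed model names. A connection error usually points to the firewall or to OLLAMA_HOST still being bound to localhost.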

VRAM Math: What Actually Matters

The three variables that determine VRAM usage:

Factor	How It Affects VRAM	Rule of Thumb
Model parameters	Determines base memory. A 7B param model @ 4-bit quantization ≈ 4 GB.	~0.5 GB per 1B params (4-bit quantized)
Context window (num_ctx)	Each additional 2048 tokens ≈ 1 GB VRAM for 7B. Scales with parameters.	1 GB per 2048 ctx @ 7B
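Concurrent requests	With OLLAMA_NUM_PARALLEL=1, Ollama processes one request at a time per model, so multiple users queue. Higher values serve requests in parallel but multiply the context memory.	No VRAM multiplier at one request at a time, but latency grows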

Critical: Open Notebook v1.8+ changed the default num_ctx from 128000 to 8192 to prevent out-of-memory (OOM) errors on consumer GPUs. If you raise it again, calculate your VRAM first. By the formula below, a 7B model at 128K context needs ~66 GB of VRAM, which is more than two RTX 4090s (48 GB combined) provide.

Quick formula for 4-bit quantized models:

VRAM ≈ (params_in_B × 0.5) + (num_ctx / 2048) × (params_in_B / 7) GB

Example: 7B model @ 8192 ctx → 3.5 + (8192/2048) × 1 = 7.5 GB VRAM. That fits on an 8 GB card, with little headroom.
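
The same formula as a small Python helper, for checking a model and context size against a specific card. This is a sketch of the rule of thumb only: the function names are illustrative, and real usage varies with quantization format and runtime overhead.

```python
def estimate_vram_gb(params_b: float, num_ctx: int) -> float:
    """Rule-of-thumb VRAM estimate in GB for a 4-bit quantized model."""
    weights_gb = params_b * 0.5                      # ~0.5 GB per 1B params
    context_gb = (num_ctx / 2048) * (params_b / 7)   # ~1 GB per 2048 tokens at 7B
    return weights_gb + context_gb


def fits(params_b: float, num_ctx: int, vram_gb: float) -> bool:
    """True if the estimate is within the card's VRAM."""
    return estimate_vram_gb(params_b, num_ctx) <= vram_gb


print(estimate_vram_gb(7, 8192))              # 7.5
print(round(estimate_vram_gb(13, 4096), 1))   # 10.2
print(fits(7, 128000, 24))                    # False
```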

Cloud vs Local: Cost Breakeven

If you process documents all day, local hardware can pay for itself. At light use it never does:

Scenario	Cloud API (monthly)	Local Ollama (monthly)	Breakeven
Light use (50 docs, 200 queries)	$3–8	$0 + electricity	Never — stay cloud
Medium use (200 docs, 1K queries)	$15–30	$0 + elec	~2 years vs RTX 3060
Heavy use (1K docs, 5K queries)	$50–100	$0 + elec	~1.5–3 years vs RTX 4090, before electricity
Team of 5, heavy use	$200–500	$0 + elec + infra	~3 months

Assumes GPT-4o-mini pricing (~$0.15/1M input, ~$0.60/1M output). Electricity: ~$15–40/month for a GPU machine under load. Detailed model cost comparison →
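
To run the breakeven estimate with your own numbers, a simple payback calculation is enough: divide the one-time hardware cost by what you save each month after electricity. This is a minimal sketch with an illustrative function name, and it ignores resale value, hardware you already own, and price changes.

```python
from typing import Optional


def payback_months(hardware_cost: float, cloud_monthly: float,
                   electricity_monthly: float) -> Optional[float]:
    """Months until a one-time hardware purchase is offset by saved API fees.

    Returns None when local running costs meet or exceed the cloud bill,
    meaning the purchase never pays for itself.
    """
    monthly_savings = cloud_monthly - electricity_monthly
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


# RTX 3060 (~$280), medium use at the upper cloud estimate, low electricity
print(round(payback_months(280, 30, 15), 1))     # 18.7
# RTX 4090 (~$1,700), team of 5 at the upper cloud estimate
print(round(payback_months(1700, 500, 40), 1))   # 3.7
# Light use: electricity alone exceeds the cloud bill
print(payback_months(280, 8, 15))                # None
```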

Picked Your Hardware?

Next: follow the deployment guide → or get the Production Manual for monitoring, CI/CD, and the 30+ errors that hit at scale.

Get the Production Manual — $19