How we calculate LLM hardware requirements

Every number on these pages comes from the formulas below — no hidden magic, no scraped spec sheets. They are approximations, and we explain their limits.

Download size by quantization

size_GB = params_B × bits_per_weight ÷ 8

A model's file size in GB is its parameter count in billions × bits per weight ÷ 8. The bits-per-weight values include GGUF format overhead: Q4_K_M, for example, is 4.85 effective bits, so an 8B model is 8 × 4.85 ÷ 8 = 4.85 GB, or about 4.9 GB. Real GGUF builds vary by a few percent.
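As a rough illustration, the formula can be sketched in Python. The bits-per-weight values are copied from the table on this page; the names BITS_PER_WEIGHT and download_size_gb are made up for this example and are not part of any tool.

```python
# Effective bits per weight, copied from the table on this page.
BITS_PER_WEIGHT = {
    "Q2_K": 3.35,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.65,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def download_size_gb(params_b: float, quant: str = "Q4_K_M") -> float:
    """Estimated file size in GB for a model with params_b billion parameters."""
    return params_b * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    # Print the estimate for an 8B model at every quantization level.
    for quant in BITS_PER_WEIGHT:
        print(f"8B at {quant}: {download_size_gb(8, quant):.2f} GB")
```

Because the result is a planning estimate, it is best compared against the real file size of the build you intend to download.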

Minimum system RAM

min_RAM = size_GB × 1.25 + 1.5 → next standard tier

We take the Q4_K_M in-memory size, add 25% runtime overhead (activations, buffers) plus 1.5 GB for the operating system, then round up to the next standard memory size (8, 12, 16, 24, 32 GB and so on). That rounded value is the 'Min RAM' shown in every table.
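The rounding step can be sketched as below. The first five tiers (8 to 32 GB) come from the text; the tiers after 32 GB are an assumed continuation of "and so on", and the names STANDARD_TIERS_GB and min_ram_gb are hypothetical.

```python
import bisect
from typing import Optional

# 8 to 32 GB are listed in the text; everything after 32 is an assumption.
STANDARD_TIERS_GB = [8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256]


def min_ram_gb(size_gb: float) -> Optional[int]:
    """Round size x 1.25 + 1.5 GB up to the next standard memory tier.

    Returns None when the requirement exceeds the largest tier in the list.
    """
    needed = size_gb * 1.25 + 1.5
    # bisect_left finds the first tier that is >= needed.
    idx = bisect.bisect_left(STANDARD_TIERS_GB, needed)
    if idx == len(STANDARD_TIERS_GB):
        return None
    return STANDARD_TIERS_GB[idx]


if __name__ == "__main__":
    # An 8B model at Q4_K_M is about 4.85 GB in memory.
    print(min_ram_gb(4.85))
```

For the 8B example, 4.85 × 1.25 + 1.5 ≈ 7.6 GB, which lands on the 8 GB tier.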

KV cache by context length

kv_bytes/token ≈ 131 072 × (params_B ÷ 8)^0.45

The KV cache grows linearly with context length: total cache size is kv_bytes/token × context tokens. We anchor to Llama 3.1 8B with grouped-query attention (32 layers × 8 KV heads × 128 head dimension × 2 for K and V × 2 bytes = 131 072 bytes, about 131 kB per token) and scale sub-linearly with parameter count (power 0.45), because depth and KV width grow more slowly than total parameters. This is why a model that fits at 4K context can run out of memory at 32K: the cache is eight times larger.
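A minimal sketch of the KV cache estimate, assuming 1 GB = 10^9 bytes (the page does not state which convention it uses). The function names are invented for this example.

```python
ANCHOR_BYTES_PER_TOKEN = 131_072  # Llama 3.1 8B: 32 x 8 x 128 x 2 x 2 bytes
ANCHOR_PARAMS_B = 8
SCALING_POWER = 0.45


def kv_bytes_per_token(params_b: float) -> float:
    """Estimated KV cache bytes per token, scaled sub-linearly from the 8B anchor."""
    return ANCHOR_BYTES_PER_TOKEN * (params_b / ANCHOR_PARAMS_B) ** SCALING_POWER


def kv_cache_gb(params_b: float, context_tokens: int) -> float:
    """Estimated KV cache size in GB (assumed decimal: 1 GB = 1e9 bytes)."""
    return kv_bytes_per_token(params_b) * context_tokens / 1e9


if __name__ == "__main__":
    # The cache grows linearly with context, so 32K costs 8x what 4K does.
    for context in (4096, 32768):
        print(f"8B at {context} tokens: {kv_cache_gb(8, context):.2f} GB")
```

For an 8B model this gives roughly 0.54 GB at 4K context and 4.29 GB at 32K, on top of the weights.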

Speed (tok/s) estimates

tok/s ≈ bandwidth_GBs × 0.85 ÷ active_size_GB

Token generation is memory-bandwidth bound: producing one token reads all active weights once. So tok/s ≈ bandwidth × 0.85 ÷ active model size at Q4_K_M, where 0.85 is an empirical efficiency factor relative to a raw copy benchmark. For mixture-of-experts models only the active parameters count toward speed, although the full model must still fit in memory, which is why a 30B MoE can be faster than a dense 8B.
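The speed estimate can be sketched as follows. The 400 GB/s bandwidth and the 3B active-parameter count for the 30B MoE are hypothetical inputs chosen for illustration, not measurements, and tokens_per_second is an invented name.

```python
def tokens_per_second(
    bandwidth_gbs: float,
    active_params_b: float,
    bits_per_weight: float = 4.85,  # Q4_K_M, from the table on this page
    efficiency: float = 0.85,       # empirical factor from the text
) -> float:
    """Estimated generation speed for a bandwidth-bound model."""
    active_size_gb = active_params_b * bits_per_weight / 8
    return bandwidth_gbs * efficiency / active_size_gb


if __name__ == "__main__":
    bandwidth = 400.0  # GB/s, hypothetical machine
    dense_8b = tokens_per_second(bandwidth, active_params_b=8)
    # Hypothetical 30B MoE that activates about 3B parameters per token.
    moe_30b = tokens_per_second(bandwidth, active_params_b=3)
    print(f"dense 8B: {dense_8b:.0f} tok/s")
    print(f"30B MoE (3B active): {moe_30b:.0f} tok/s")
```

Under these assumptions the MoE reads less than half as many bytes per token as the dense 8B, so its estimate comes out higher despite the larger total size.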

The in-browser bandwidth benchmark

The optional benchmark measures effective GPU memory bandwidth with repeated large WebGPU buffer-to-buffer copies and takes about 1–2 seconds. It runs entirely in your browser; nothing is uploaded and nothing is stored. On Apple Silicon the measured bandwidth also refines the chip-class guess (base / Pro / Max / Ultra).

Known limits

These are planning estimates, not benchmarks of your exact machine. Real speed varies with the runtime (llama.cpp, MLX, vLLM), context length, batch size and thermals. Fit verdicts assume the recommended Q4_K_M build and a mostly idle machine — when a model is borderline, expect to close apps or drop one quantization level.

Effective bits per weight

Quantization	Bits/weight	Quality
Q2_K	3.35	Noticeable loss
Q4_K_M	4.85	Recommended
Q5_K_M	5.65	High
Q8_0	8.5	Near-original
F16	16	Original

Sizes are estimates from parameter count × bits per weight; real GGUF builds vary slightly. · Data updated: 2026-06-11