How much RAM do I need to run Mellum 2 12B-A2.5B?

About 12 GB of total system memory for the recommended 4-bit (Q4_K_M) build, which is a 7.3 GB download. More RAM lets you use higher-quality quantizations or longer context.

Can Mellum 2 12B-A2.5B run without a dedicated GPU?

Yes — tools like Ollama and llama.cpp run it on the CPU as long as it fits in RAM. A GPU or Apple Silicon makes generation several times faster, but it's optional.

Which quantization of Mellum 2 12B-A2.5B should I download?

Q4_K_M is the sweet spot for almost everyone: roughly 3.3× smaller than the F16 original (7.3 GB versus 24.0 GB) with minimal quality loss. Pick Q5_K_M or Q8_0 if you have plenty of RAM, or Q2_K only when nothing else fits.
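As a rough illustration of that rule of thumb, the sketch below picks the largest quantization whose minimum RAM fits a given machine. The rows are copied from the size table on this page; `pick_quant` is a hypothetical helper written for this example, not part of any tool.

```python
# Rows copied from this page's "Size by quantization" table,
# ordered from smallest to largest: (name, download GB, minimum RAM GB).
QUANTS = [
    ("Q2_K", 5.0, 8),
    ("Q4_K_M", 7.3, 12),
    ("Q5_K_M", 8.5, 16),
    ("Q8_0", 12.8, 24),
    ("F16", 24.0, 32),
]


def pick_quant(ram_gb):
    """Return the largest quantization whose minimum RAM fits, or None.

    Hypothetical helper: it only compares against the table's Min RAM
    column and ignores context length and other running programs.
    """
    fitting = [name for name, _download_gb, min_ram_gb in QUANTS
               if min_ram_gb <= ram_gb]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for ram in (4, 8, 12, 16, 32):
        print(f"{ram} GB RAM -> {pick_quant(ram)}")
```

A machine that only fits Q2_K is the case where a smaller model at Q4_K_M is usually the better choice.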

Can I fine-tune Mellum 2 12B-A2.5B on my own machine?

Fine-tuning needs far more memory than inference. Full fine-tuning of Mellum 2 12B-A2.5B takes roughly 144 GB of GPU memory, while QLoRA brings it down to about 18 GB. For most people, QLoRA on a rented GPU is the practical path.
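Both figures are consistent with a simple bytes-per-parameter rule applied to the full 12B parameters. The sketch below reproduces them; the per-parameter costs (12 bytes for full fine-tuning, 1.5 bytes for QLoRA) are assumptions back-derived from the numbers above, not published values, and real usage depends on optimizer, batch size, and sequence length.

```python
TOTAL_PARAMS = 12e9  # all experts count here, not only the ~2.5B active ones

# Assumed GPU bytes per parameter, chosen so the results match the
# 144 GB and 18 GB figures quoted on this page.
BYTES_PER_PARAM = {
    "full": 12.0,   # assumption: weights + gradients + optimizer state
    "qlora": 1.5,   # assumption: 4-bit base weights + adapters + overhead
}


def finetune_memory_gb(method, params=TOTAL_PARAMS):
    """Rough GPU memory estimate in GB for a fine-tuning method."""
    return params * BYTES_PER_PARAM[method] / 1e9


if __name__ == "__main__":
    for method in BYTES_PER_PARAM:
        print(f"{method}: ~{finetune_memory_gb(method):.0f} GB")
```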

Is a bigger model at Q2/Q3 better than a smaller one at Q4/Q5?

Usually no. Below Q3, quality degrades sharply — a smaller model at Q4_K_M typically beats a bigger one squeezed into Q2. Drop below Q4 only when nothing else fits in your memory.


Can I run Mellum 2 12B-A2.5B?

Mellum 2 12B-A2.5B by JetBrains needs around 12 GB of RAM at the recommended 4-bit quantization (7.3 GB download). Expect roughly 202 tok/s on an NVIDIA RTX 3060 12GB.


Real-world notes

Mellum 2 is JetBrains' coding-focused model, built as a mixture-of-experts: 12B parameters total but only about 2.5B active per token. That's the whole point of the design. You get the speed of a roughly 2-3B model, but you still hold the full 12B in memory, so don't be fooled by the active count. At a 4-bit quant it lands around 7.3 GB, with a practical floor of about 12 GB RAM. That fits a 12 GB card like an RTX 3060 or unified memory on an Apple Silicon Mac, but 8 GB is too tight for anything above Q2_K. If you live in JetBrains IDEs and want local code completion, it's aimed squarely at you.

In daily use the MoE design pays off: it feels much faster than its size suggests. On an RTX 3060 12GB you can expect around 202 tokens per second at 4-bit, and an RTX 4090 reaches about 565, well into the range where completions land before you've finished typing the next line. The 128K context is genuinely large for a coding model, handy for feeding it whole files or a repo's worth of headers, but it isn't free. Fill it all the way and total memory climbs to roughly 27.4 GB, far past what a 12 GB card holds and more than a 24 GB GPU can hold either. A 32K context already needs about 12.3 GB, so keep working context modest unless you have generous unified memory or more than 24 GB of VRAM.

It's worth being clear about scope: this is a coding specialist, not a general assistant. For chat, reasoning, or anything with images, a broader 12B like Gemma 4 12B generally serves you better, and Mistral Nemo 12B tends to be the friendlier pick for open-ended conversation. Mellum 2's standout trait is that MoE speed-to-size ratio on completion-style work, paired with first-class IDE integration from the people who make your editor. And the license is the easy part: Apache 2.0, which permits commercial and production use. If your main job is code and you have a 12 GB card, it's a strong, fast local choice.

Specifications

Parameters: 12B (2.5B active)

Context window: 128K tokens

Provider: JetBrains

License: Apache 2.0

Released: 2026-06

Best for: Coding

Size by quantization

Quantization	Bits/weight	Download	Min RAM	Quality
Q2_K	3.35	5.0 GB	8 GB	Noticeable loss
Q4_K_M	4.85	7.3 GB	12 GB	Recommended
Q5_K_M	5.65	8.5 GB	16 GB	High
Q8_0	8.5	12.8 GB	24 GB	Near-original
F16	16	24.0 GB	32 GB	Original

Sizes are estimates from parameter count × bits per weight; real GGUF builds vary slightly. Data updated: 2026-06-11.
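The download column can be reproduced from that formula. Here is a minimal sketch, assuming exactly 12e9 parameters and decimal gigabytes (1 GB = 1e9 bytes); the function name `download_gb` is made up for this example.

```python
TOTAL_PARAMS = 12e9  # assumption: exactly 12 billion parameters

# Bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "Q2_K": 3.35,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.65,
    "Q8_0": 8.5,
    "F16": 16,
}


def download_gb(bits_per_weight, params=TOTAL_PARAMS):
    """Estimated file size in GB: parameters x bits per weight / 8 bits per byte."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bpw in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{download_gb(bpw):.1f} GB")
```

The size depends on the total parameter count, not the 2.5B active per token, because every expert has to be stored.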

Memory needed by context length

Context	KV cache (est.)	Total memory (Q4)
4K tokens	~0.6 GB	~7.9 GB
8K tokens	~1.3 GB	~8.6 GB
32K tokens	~5.0 GB	~12.3 GB
128K tokens	~20.1 GB	~27.4 GB

The KV cache grows with context length — a model that fits at 4K can run out of memory at 32K. Estimates assume an FP16 cache with grouped-query attention; actual usage varies by runtime.
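The rows above scale close to linearly with context, so total memory can be approximated as weights plus a fixed cost per 1K tokens. The sketch below uses a per-1K rate back-derived from the 128K row (20.1 GB / 128); that rate is an assumption for illustration, since the model's layer and head counts are not given here, and the helper names are hypothetical.

```python
WEIGHTS_GB = 7.3  # Q4_K_M size from the quantization table

# Assumption: linear KV cache growth, with the rate back-derived
# from the 128K row of the table above (about 0.157 GB per 1K tokens).
KV_GB_PER_1K_TOKENS = 20.1 / 128


def kv_cache_gb(context_k):
    """Estimated KV cache size in GB for a context of `context_k` thousand tokens."""
    return context_k * KV_GB_PER_1K_TOKENS


def total_memory_gb(context_k):
    """Estimated total memory at Q4: weights plus KV cache."""
    return WEIGHTS_GB + kv_cache_gb(context_k)


def max_context_k(memory_gb):
    """Largest context (in K tokens) that fits, ignoring runtime overhead."""
    return max(0.0, (memory_gb - WEIGHTS_GB) / KV_GB_PER_1K_TOKENS)


if __name__ == "__main__":
    for ctx in (4, 8, 32, 128):
        print(f"{ctx}K tokens: ~{total_memory_gb(ctx):.1f} GB total")
    print(f"12 GB budget: ~{max_context_k(12):.0f}K tokens")
```

Under these assumptions a 12 GB budget tops out just short of 32K tokens, which matches the 32K row needing about 12.3 GB.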

Estimated speed by hardware

Hardware	Bandwidth	~Speed
NVIDIA RTX 3060 12GB	360 GB/s	~202 tok/s
NVIDIA RTX 4090 24GB	1008 GB/s	~565 tok/s
Apple M-series (base)	100 GB/s	~56 tok/s
Apple M-series Pro	270 GB/s	~151 tok/s
Apple M-series Max	410 GB/s	~230 tok/s
CPU only (dual-channel DDR5)	60 GB/s	~34 tok/s
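These speeds track memory bandwidth almost exactly, which is what you would expect if generation is bandwidth-bound and each token reads only the active weights. The sketch below reproduces the table under that model; the 0.85 efficiency factor is an assumption back-derived from the numbers above, and real throughput depends on the runtime, batch size, and context length.

```python
ACTIVE_PARAMS = 2.5e9   # parameters read per token (MoE active count)
BITS_PER_WEIGHT = 4.85  # Q4_K_M, from the quantization table
EFFICIENCY = 0.85       # assumption: fraction of peak bandwidth actually achieved


def tokens_per_second(bandwidth_gb_s):
    """Bandwidth-bound estimate: usable bandwidth / GB of weights read per token."""
    gb_read_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
    return bandwidth_gb_s * EFFICIENCY / gb_read_per_token


if __name__ == "__main__":
    hardware = {
        "NVIDIA RTX 3060 12GB": 360,
        "NVIDIA RTX 4090 24GB": 1008,
        "CPU only (dual-channel DDR5)": 60,
    }
    for name, bandwidth in hardware.items():
        print(f"{name}: ~{tokens_per_second(bandwidth):.0f} tok/s")
```

This is also why the MoE design is fast: the divisor uses the 2.5B active parameters, while memory requirements use the full 12B.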