Can the NVIDIA RTX 4090 run Llama 3.1 8B?

Yes. The 4-bit build is a 4.9 GB download and fits in the card's 24 GB of VRAM. Expect roughly 177 tok/s.

What is the biggest LLM the NVIDIA RTX 4090 can run?

Qwen 3.5 35B-A3B is the largest model in our catalog that fits (21.2 GB at 4-bit). Expect about 471 tok/s.

How fast is the NVIDIA RTX 4090 for local LLMs?

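Token generation is memory-bandwidth bound. At roughly 1008 GB/s, the NVIDIA RTX 4090 generates about 177 tok/s on an 8B-class dense model at 4-bit. Speed scales inversely with the amount of weight data read per token: larger dense models are proportionally slower, while mixture-of-experts models (such as the "A3B" entries) read only their active parameters per token and so run faster than their download size suggests.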
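The bandwidth-bound reasoning can be sketched as a back-of-envelope calculation: if every generated token requires reading the model's weights once, throughput cannot exceed bandwidth divided by the bytes read per token. The Python below is an illustration under that assumption, not the method used to produce the catalog figures; the function name and the `efficiency` parameter are hypothetical.

```python
def estimate_tok_per_s(bandwidth_gb_s: float,
                       weights_read_gb: float,
                       efficiency: float = 1.0) -> float:
    """Rough tokens/second for bandwidth-bound generation.

    bandwidth_gb_s  : memory bandwidth in GB/s
    weights_read_gb : weight data read per token in GB (for a dense model,
                      roughly the size of the quantized file)
    efficiency      : assumed fraction of peak bandwidth actually achieved
                      (hypothetical knob; 1.0 gives the theoretical ceiling)
    """
    if bandwidth_gb_s <= 0 or weights_read_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return efficiency * bandwidth_gb_s / weights_read_gb


# Ceiling for a 4.9 GB 4-bit model on ~1008 GB/s of bandwidth.
ceiling = estimate_tok_per_s(1008, 4.9)
print(f"theoretical ceiling: {ceiling:.0f} tok/s")  # about 206 tok/s
```

The listed ~177 tok/s for an 8B model is about 86% of this ceiling, which is consistent with real hardware not reaching peak bandwidth. Doubling `weights_read_gb` halves the estimate, which is the inverse scaling described above.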

Does the whole model need to fit in VRAM?

For full GPU speed, yes. Runtimes like llama.cpp can split layers between VRAM and system RAM, but every layer that spills to RAM slows generation down sharply.
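To see why spilling hurts so much, here is a minimal sketch that treats per-token time as the time to read the GPU-resident weights plus the time to read the spilled weights from system RAM. It is a simplification: it ignores compute, PCIe transfers, and the KV cache, and the 60 GB/s system RAM bandwidth and the 30 GB model are assumed example values, not measurements.

```python
def split_tok_per_s(model_gb: float,
                    vram_budget_gb: float,
                    gpu_bw_gb_s: float,
                    ram_bw_gb_s: float) -> float:
    """Rough tokens/second when part of a model spills to system RAM.

    Assumes each token reads every weight once, and that the slow (RAM)
    and fast (VRAM) portions are read one after the other.
    """
    if min(model_gb, vram_budget_gb, gpu_bw_gb_s, ram_bw_gb_s) <= 0:
        raise ValueError("all arguments must be positive")
    on_gpu_gb = min(model_gb, vram_budget_gb)   # portion held in VRAM
    spilled_gb = model_gb - on_gpu_gb           # portion left in system RAM
    seconds_per_token = on_gpu_gb / gpu_bw_gb_s + spilled_gb / ram_bw_gb_s
    return 1.0 / seconds_per_token


# Hypothetical 30 GB model on a 24 GB card: 6 GB spills to RAM.
# 1008 GB/s is the card's bandwidth; 60 GB/s for system RAM is an assumption.
print(round(split_tok_per_s(30, 24, 1008, 60), 1))  # about 8 tok/s
```

Under these assumptions, the 6 GB that spills accounts for roughly four times as much time per token as the 24 GB held in VRAM, so even a small overflow dominates generation time.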


What LLMs can the NVIDIA RTX 4090 run?

The NVIDIA RTX 4090 has 24 GB VRAM and roughly 1008 GB/s of memory bandwidth. Below is every model in our catalog that fits, with estimated generation speed. Biggest pick: Qwen 3.5 35B-A3B at ~471 tok/s.

Specifications

Memory: 24 GB VRAM

Bandwidth: ~1008 GB/s

Memory type: Dedicated VRAM

Released: 2022-10

Models on the NVIDIA RTX 4090

62 of the 73 models in our catalog fit.

Model	Vendor	Download (Q4)	Fits?	Est. speed
Qwen 3.5 35B-A3B	Alibaba	21.2 GB	Runs	~471 tok/s
Qwen 3.6 35B-A3B	Alibaba	21.2 GB	Runs	~471 tok/s
Command R 35B	Cohere	21.2 GB	Runs	~40 tok/s
Qwen3-VL 32B	Alibaba	20.0 GB	Runs	~43 tok/s
EXAONE 4.5 33B	LG AI Research	20.0 GB	Runs	~43 tok/s
Qwen 3 32B	Alibaba	19.9 GB	Runs	~43 tok/s
Qwen 2.5 Coder 32B	Alibaba	19.9 GB	Runs	~43 tok/s
QwQ 32B	Alibaba	19.9 GB	Runs	~43 tok/s
DeepSeek R1 32B	DeepSeek	19.9 GB	Runs	~43 tok/s
Granite 4.0 H Small	IBM	19.4 GB	Runs	~157 tok/s
Nemotron 3 Nano 30B-A3B	NVIDIA	19.2 GB	Runs	~393 tok/s
Gemma 4 31B	Google	18.6 GB	Runs	~46 tok/s
Qwen 3 30B-A3B	Alibaba	18.5 GB	Runs	~428 tok/s
Qwen3-VL 30B-A3B	Alibaba	18.2 GB	Runs	~471 tok/s
Gemma 3 27B	Google	16.6 GB	Runs	~52 tok/s
Qwen 3.5 27B	Alibaba	16.4 GB	Runs	~52 tok/s
Qwen 3.6 27B	Alibaba	16.4 GB	Runs	~52 tok/s
Gemma 4 26B A4B	Google	15.3 GB	Runs	~372 tok/s
Mistral Small 3.1 24B	Mistral AI	14.6 GB	Runs	~59 tok/s
Devstral 24B	Mistral AI	14.6 GB	Runs	~59 tok/s
Magistral Small 1.2	Mistral AI	14.6 GB	Runs	~59 tok/s
Devstral Small 2 24B	Mistral AI	14.6 GB	Runs	~59 tok/s
Codestral 22B	Mistral AI	13.5 GB	Runs	~64 tok/s
GPT-OSS 20B	OpenAI	12.7 GB	Runs	~393 tok/s
Phi-4 Reasoning Vision 15B	Microsoft	9.1 GB	Runs	~94 tok/s
Qwen 3 14B	Alibaba	9.0 GB	Runs	~95 tok/s
DeepSeek R1 14B	DeepSeek	9.0 GB	Runs	~95 tok/s
Phi-4 14B	Microsoft	8.9 GB	Runs	~96 tok/s
Ministral 3 14B	Mistral AI	8.5 GB	Runs	~101 tok/s
OLMo 2 13B	Ai2	8.3 GB	Runs	~103 tok/s
Gemma 3 12B	Google	7.4 GB	Runs	~116 tok/s
Mistral Nemo 12B	Mistral AI	7.4 GB	Runs	~116 tok/s
Gemma 4 12B	Google	7.3 GB	Runs	~118 tok/s
Mellum 2 12B-A2.5B	JetBrains	7.3 GB	Runs	~565 tok/s
Qwen 3.5 9B	Alibaba	5.5 GB	Runs	~157 tok/s
GLM-4.6V-Flash	Z.ai	5.5 GB	Runs	~157 tok/s
Qwen 2.5 VL 7B	Alibaba	5.0 GB	Runs	~170 tok/s
Qwen 3 8B	Alibaba	5.0 GB	Runs	~172 tok/s
Granite 3.3 8B	IBM	5.0 GB	Runs	~172 tok/s
Llama 3.1 8B	Meta	4.9 GB	Runs	~177 tok/s
DeepSeek R1 8B	DeepSeek	4.9 GB	Runs	~177 tok/s
Gemma 4 E4B	Google	4.9 GB	Runs	~314 tok/s
Qwen3-VL 8B	Alibaba	4.9 GB	Runs	~177 tok/s
Ministral 3 8B	Mistral AI	4.9 GB	Runs	~177 tok/s
Gemma 3n E4B	Google	4.7 GB	Runs	~353 tok/s
Qwen 2.5 Coder 7B	Alibaba	4.6 GB	Runs	~186 tok/s
DeepSeek R1 7B	DeepSeek	4.6 GB	Runs	~186 tok/s
Mistral 7B	Mistral AI	4.4 GB	Runs	~196 tok/s
Gemma 4 E2B	Google	3.1 GB	Runs	~614 tok/s
Gemma 3 4B	Google	2.6 GB	Runs	~329 tok/s
Qwen 3 4B	Alibaba	2.4 GB	Runs	~353 tok/s
Qwen 3.5 4B	Alibaba	2.4 GB	Runs	~353 tok/s
Phi-4 Mini 3.8B	Microsoft	2.3 GB	Runs	~372 tok/s
Llama 3.2 3B	Meta	1.9 GB	Runs	~442 tok/s
DeepSeek-OCR	DeepSeek	1.8 GB	Runs	~2479 tok/s
Ministral 3 3B	Mistral AI	1.8 GB	Runs	~471 tok/s
DeepSeek R1 1.5B	DeepSeek	1.1 GB	Runs	~785 tok/s
Qwen 3 1.7B	Alibaba	1.0 GB	Runs	~831 tok/s
SmolLM2 1.7B	Hugging Face	1.0 GB	Runs	~831 tok/s
Llama 3.2 1B	Meta	0.7 GB	Runs	~1178 tok/s
Gemma 3 1B	Google	0.6 GB	Runs	~1413 tok/s
Qwen 3 0.6B	Alibaba	0.4 GB	Runs	~2355 tok/s

To run fully on the GPU, the 4-bit build must fit in VRAM. Models that don't fit can still run on CPU + system RAM, several times slower.

Data updated: 2026-06-11
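The fit rule can be sketched as a simple size comparison. The Python below is a hedged illustration, not the catalog's actual logic: the `headroom_gb` parameter (space reserved for the KV cache and runtime buffers) is an assumption introduced here, and the last entry in the sample list is a hypothetical placeholder rather than a catalog model.

```python
from typing import List, Optional, Tuple

# (model name, 4-bit download size in GB); the first three rows are from the table.
CATALOG_SAMPLE: List[Tuple[str, float]] = [
    ("Qwen 3.5 35B-A3B", 21.2),
    ("Llama 3.1 8B", 4.9),
    ("Qwen 3 0.6B", 0.4),
    ("hypothetical-oversized-model", 40.0),  # placeholder, not a catalog entry
]


def fits_in_vram(download_gb: float, vram_gb: float, headroom_gb: float = 0.0) -> bool:
    """True if the weights plus any reserved headroom fit in VRAM."""
    return download_gb + headroom_gb <= vram_gb


def biggest_fit(models: List[Tuple[str, float]],
                vram_gb: float,
                headroom_gb: float = 0.0) -> Optional[Tuple[str, float]]:
    """Return the largest model that fits, or None if nothing fits."""
    fitting = [m for m in models if fits_in_vram(m[1], vram_gb, headroom_gb)]
    return max(fitting, key=lambda m: m[1], default=None)


print(biggest_fit(CATALOG_SAMPLE, vram_gb=24.0))
```

With no headroom, the 21.2 GB model is the largest fit on a 24 GB card. Reserving a few GB for long contexts changes the answer, so treat the headroom value as something to tune for your own workload.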