The Memory Problem
A 7-billion-parameter model in full precision (FP16) needs roughly 14GB of GPU memory just to load — before you’ve run a single token. Most consumer GPUs don’t have that. Quantization shrinks each weight from 16 bits to 8, 4, or even fewer bits, cutting memory use by 2–4x with a small, often negligible, accuracy cost.
This is the same model-shrinking idea behind running DeepSeek V4 Flash locally — quantization is the mechanism that makes that possible on consumer hardware.
1
| pip install transformers bitsandbytes accelerate torch
|
Precision Levels, Visualized
1
2
3
4
| FP32 (32-bit): 4 bytes/param → 7B model ≈ 28GB
FP16 (16-bit): 2 bytes/param → 7B model ≈ 14GB
INT8 (8-bit): 1 byte/param → 7B model ≈ 7GB
INT4 (4-bit): 0.5 byte/param → 7B model ≈ 3.5GB
|
Each step down roughly halves memory. The cost is precision in the weight values themselves — quantization rounds them into a smaller set of discrete levels.
8-bit Quantization with bitsandbytes
1
2
3
4
5
6
7
8
9
10
11
12
13
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
quantization_config=quant_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
|
1
| Memory footprint: 8.45 GB
|
4-bit Quantization (QLoRA-style)
4-bit is where most local deployments land — it’s the sweet spot between memory savings and output quality:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| quant_config_4bit = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # normalized float 4 — better than plain int4
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # quantize the quantization constants too
)
model_4bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
quantization_config=quant_config_4bit,
device_map="auto",
)
print(f"Memory footprint: {model_4bit.get_memory_footprint() / 1e9:.2f} GB")
|
1
| Memory footprint: 5.59 GB
|
nf4 (Normal Float 4) outperforms naive 4-bit rounding because it allocates more precision to the weight values that actually occur most often in trained models — most weights cluster near zero.
GGUF: Quantization for CPU and llama.cpp
GGUF is the format used by llama.cpp and Ollama — designed for fast CPU inference, not just GPU:
1
2
3
4
| pip install llama-cpp-python
# Download a pre-quantized GGUF model (Q4_K_M is a good default)
huggingface-cli download TheBloke/Llama-3.1-8B-Instruct-GGUF llama-3.1-8b-instruct.Q4_K_M.gguf --local-dir ./models
|
1
2
3
4
5
6
| from llama_cpp import Llama
llm = Llama(model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf", n_ctx=2048, n_gpu_layers=0)
output = llm("Explain quantization in one sentence.", max_tokens=60)
print(output["choices"][0]["text"])
|
The Q4_K_M suffix tells you the quantization scheme — K_M quantization mixes precision levels across layers, keeping more bits where they matter most for output quality.
Benchmarking Quality vs Speed
1
2
3
4
5
6
7
8
9
10
11
12
| import time
def benchmark(model, tokenizer, prompt, max_new_tokens=50):
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
start = time.time()
output = model.generate(**inputs, max_new_tokens=max_new_tokens)
elapsed = time.time() - start
tokens_per_sec = max_new_tokens / elapsed
return tokenizer.decode(output[0], skip_special_tokens=True), tokens_per_sec
text, speed = benchmark(model_4bit, tokenizer, "The future of AI is")
print(f"Speed: {speed:.1f} tokens/sec")
|
In practice, 4-bit models run noticeably faster than FP16 on memory-bandwidth-limited consumer GPUs, because moving less data per token is often the bigger bottleneck than raw compute.
Quantization Method Comparison
| Method | Bits | Best for | Quality loss |
|---|
| bitsandbytes INT8 | 8 | Quick GPU memory savings | Minimal |
| bitsandbytes NF4 | 4 | GPU inference, fine-tuning (QLoRA) | Small |
| GPTQ | 4 | Pre-quantized GPU checkpoints | Small |
| GGUF (llama.cpp) | 2–8 (mixed) | CPU/Mac inference, Ollama | Small–moderate |
Key Takeaways
- Quantization reduces weight precision to cut memory use, typically 2–4x with small accuracy loss
- 4-bit (
nf4) is the practical default for running large models on consumer GPUs - GGUF is the format to use for CPU-only or Apple Silicon inference via
llama.cpp or Ollama - Double quantization compresses the quantization metadata itself for extra savings
- Lower precision often means faster inference too, since memory bandwidth is usually the bottleneck
- Always benchmark quality on your actual task — quantization quality loss varies by model and domain
Related Posts