Quantization
ELI5 — The Vibe Check
Quantization is the art of making AI models smaller and faster by using less precise numbers. Instead of storing each weight as a super-detailed 32-bit number, you round it to 8-bit or even 4-bit. It's like going from 'the temperature is 72.3847°F' to 'it's about 72°F' — close enough to be useful, but takes way less memory. This is how you run a 70 billion parameter model on a laptop.
Real Talk
Quantization reduces the precision of model weights and activations from higher-bit formats (FP32/FP16) to lower-bit formats (INT8, INT4, NF4). Post-training quantization (PTQ) converts a model after training; quantization-aware training (QAT) simulates quantization during training so the model learns to tolerate it. Techniques like GPTQ and AWQ, and formats like GGUF, enable running large models on consumer hardware with minimal quality loss.
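The core trick is simpler than it sounds. Here's a minimal pure-Python sketch of symmetric (absmax) INT8 quantization — not any library's actual implementation, just the idea: pick one scale per tensor, round each weight to an integer in [-127, 127], and keep the scale so you can approximately recover the floats later.

```python
def quantize_int8(weights):
    """Symmetric (absmax) INT8 quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127  # one scale factor per tensor
    q = [round(w / scale) for w in weights]     # each value now fits in one byte
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [qi * scale for qi in q]

weights = [0.1, -0.5, 0.3847, 0.72]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# approx is close to weights, but storage dropped from 32 bits to 8 bits per value
```

Real quantizers refine this with per-channel or per-block scales, outlier handling, and smarter data types like NF4, but the round-trip above is the essence of PTQ.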
Show Me The Code
# Quantize a model to 4-bit with bitsandbytes via Transformers
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # do the math in FP16
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=quant_config,
    device_map="auto",                     # spread layers across available devices
)
When You'll Hear This
"4-bit quantization lets you run Llama 70B on a pair of 24GB GPUs." / "The quantized model is 95% as good at 25% of the size."
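The memory math behind claims like these is back-of-the-envelope stuff: parameters times bits per weight. This sketch counts weights only — it ignores activations, the KV cache, and quantization overhead like stored scales, so real footprints run somewhat higher.

```python
def model_memory_gb(n_params, bits_per_weight):
    """Rough weight-only memory footprint: parameters x bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70-billion-parameter model at different precisions (weights only):
print(model_memory_gb(70e9, 16))  # FP16 -> 140.0 GB
print(model_memory_gb(70e9, 8))   # INT8 -> 70.0 GB
print(model_memory_gb(70e9, 4))   # INT4 -> 35.0 GB
```

That's where "25% of the size" comes from: 4 bits is a quarter of 16.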
Related Terms
GGUF
GGUF is a file format for running AI models on your laptop — it's like the MP3 of AI models.
Inference
Inference is when the AI actually runs and generates output — as opposed to training, which is when it's learning.
LLM (Large Language Model)
An LLM is a humongous AI that read basically the entire internet and learned to predict what words come next, really really well.
LoRA (Low-Rank Adaptation)
LoRA is how you fine-tune a massive AI model without needing a massive GPU budget.
Model
A model is the trained AI — the finished product.