Tokenizer
ELI5 — The Vibe Check
A tokenizer chops text into pieces that the AI model can understand — but not in ways humans would expect. 'Hello' might be one token, but 'unbelievable' might be three: 'un', 'believ', 'able'. Spaces count. Emojis are expensive. And 'ChatGPT' is somehow 3 tokens but 'the' is just 1. It's like the model's weird dictionary where common words are cheap and rare words are pricey.
Real Talk
A tokenizer converts text into a sequence of discrete tokens (subword units) that a language model can process. Common algorithms include BPE (Byte Pair Encoding), WordPiece, and SentencePiece. The tokenizer's vocabulary and encoding scheme directly affect how much text fits in a model's context window, API pricing (billing is per token), and multilingual capability. Different models use different tokenizers, so the same text can yield different token counts.
Show Me The Code
# Counting tokens with tiktoken (OpenAI)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, world!")
print(len(tokens)) # 4 tokens
print(tokens) # [9906, 11, 1917, 0]
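To see where subword vocabularies like the one above come from, here's a toy sketch of the BPE training loop mentioned in Real Talk: repeatedly merge the most frequent adjacent pair of symbols. This is a simplified illustration with a made-up corpus, not how any production tokenizer is actually implemented.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {w: list(w) for w in word_freqs}  # start from single characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for w, toks in vocab.items():
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += word_freqs[w]
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Apply the merge everywhere in the vocabulary
        for w, toks in vocab.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(toks[i] + toks[i + 1])
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            vocab[w] = out
    return vocab, merges

# Tiny made-up corpus: word -> frequency
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab, merges = train_bpe(corpus, 10)
print(merges[:2])        # [('e', 's'), ('es', 't')] — frequent endings merge first
print(vocab["newest"])   # ['newest'] — after enough merges, a whole word is one token
```

Notice how frequent fragments ('es', 'est') become tokens early — this is exactly why common words end up as single cheap tokens while rare words stay split into several pieces.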
When You'll Hear This
"The prompt is 4,000 tokens — we need to trim it." / "The tokenizer splits 'JavaScript' into 'Java' and 'Script.'"
Related Terms
Context Window
A context window is how much text an AI can 'see' at once — its working memory.
Embedding
An embedding is turning words, sentences, or entire documents into lists of numbers (vectors) that capture their meaning.
LLM (Large Language Model)
An LLM is a humongous AI that read basically the entire internet and learned to predict what words come next, really really well.
Token
In AI-land, a token is a chunk of text — roughly 3/4 of a word.
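That "3/4 of a word" rule of thumb can be turned into a quick back-of-envelope estimator. This is a heuristic only (the function name is ours) — accurate counts require the model's actual tokenizer, as in the tiktoken example above.

```python
def rough_token_estimate(text: str) -> int:
    # Heuristic: English text averages roughly 4 characters per token
    # (equivalently, ~3/4 of a word per token). Real counts vary by model.
    return max(1, round(len(text) / 4))

print(rough_token_estimate("Hello, world!"))  # 3 — tiktoken says 4; it's only an estimate
```

Good enough for ballpark cost or context-limit checks; never rely on it for hard limits.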