
Tokenizer

Medium — good to know · AI & ML

ELI5 — The Vibe Check

A tokenizer chops text into pieces that the AI model can understand — but not in ways humans would expect. 'Hello' might be one token, but 'unbelievable' might be three: 'un', 'believ', 'able'. Spaces count. Emojis are expensive. And 'ChatGPT' is somehow 3 tokens but 'the' is just 1. It's like the model's weird dictionary where common words are cheap and rare words are pricey.

Real Talk

A tokenizer converts text into a sequence of discrete tokens (subword units) that a language model can process. Common algorithms include BPE (Byte Pair Encoding), WordPiece, and SentencePiece. The tokenizer's vocabulary and encoding scheme directly affect model context limits, pricing (tokens = cost), and multilingual capability. Different models use different tokenizers.
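BPE is the easiest of these to picture: start from characters, then repeatedly merge the most frequent adjacent pair into a new vocabulary symbol. Here's a minimal training sketch in plain Python — the toy corpus and merge count are made up for illustration, and real tokenizers (like OpenAI's byte-level BPE) operate on bytes with many extra details:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: learns merge rules from a word list.
    Each word is a tuple of symbols ending in an end-of-word marker."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocab with the best pair fused into one symbol
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = bpe_train(["low", "low", "lower", "newest", "newest", "newest"], 5)
print(merges[0])  # ('w', 'e') — the most frequent pair gets merged first
```

This is why common strings end up as single cheap tokens while rare words get split into several pieces: frequent sequences accumulate merges, rare ones never do.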

Show Me The Code

# Counting tokens with tiktoken (OpenAI)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, world!")
print(len(tokens))  # 4 tokens
print(tokens)       # [9906, 11, 1917, 0]
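When tiktoken isn't installed (or you're budgeting for a model whose tokenizer you don't have), a common rule of thumb is roughly 4 characters per token for English text. This is a coarse assumption, not an exact count — useful only for ballpark estimates:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate via the ~4 chars/token heuristic for English."""
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, world!"))  # 3 — close to tiktoken's exact count of 4
```

The heuristic drifts badly for code, non-English languages, and emoji-heavy text, where tokenizers are far less efficient — always use the real tokenizer before trusting a limit.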

When You'll Hear This

"The prompt is 4,000 tokens — we need to trim it." / "The tokenizer splits 'JavaScript' into 'Java' and 'Script.'"

Made with passive-aggressive love by manoga.digital. Powered by Claude.