Rate Limit
ELI5 — The Vibe Check
A rate limit is the AI provider saying 'slow down, buddy.' You can only make a certain number of API calls per minute, or use a certain number of tokens per day, before you get a 429 error. It's how providers prevent one user from hogging all the compute. When you hit it, implement retry logic with exponential backoff.
Real Talk
Rate limits are caps on API usage imposed by LLM providers, typically enforced per API key or per organization and scaled by account tier. They are measured in requests per minute (RPM), tokens per minute (TPM), or tokens per day (TPD). Exceeding a limit returns HTTP 429 (Too Many Requests). The standard mitigation is retrying with exponential backoff plus jitter.
Show Me The Code
import random
import time

import anthropic

def call_with_retry(client, max_retries=3, **kwargs):
    """Call the Messages API, retrying on 429s with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429 to the caller
            # exponential backoff with jitter: ~1s, ~2s, ~4s, plus up to 1s of noise
            time.sleep(2 ** attempt + random.random())
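Backoff reacts to 429s after the fact; you can also stay under an RPM limit proactively by pacing calls on the client side. A minimal token-bucket sketch (the RateLimiter class and its rpm parameter are hypothetical helpers, not part of any provider SDK):

```python
import threading
import time

class RateLimiter:
    """Client-side token bucket: allow at most `rpm` requests per minute.

    Hypothetical helper for illustration; call acquire() before each API request.
    """

    def __init__(self, rpm: int):
        self.capacity = rpm
        self.tokens = float(rpm)          # start with a full bucket
        self.refill_rate = rpm / 60.0     # tokens added per second
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a request slot is available, then consume one token."""
        while True:
            with self.lock:
                now = time.monotonic()
                # refill based on elapsed time, capped at bucket capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.refill_rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.refill_rate
            time.sleep(wait)  # sleep outside the lock, then re-check
```

Usage: create one limiter per API key, e.g. limiter = RateLimiter(rpm=60), and call limiter.acquire() immediately before each request. Pacing and backoff are complementary: pacing keeps you under the limit, backoff handles the 429s that slip through anyway.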
When You'll Hear This
"We're hitting the rate limit — add backoff logic." / "Upgrade the tier to increase rate limits."
Related Terms
API Key
An API key is your password to use an AI service. You include it in every request to prove you're allowed to use the API and so they know who to charge.
Chat Completion
Chat Completion is the API pattern for having a back-and-forth conversation with an AI.
LLM (Large Language Model)
An LLM is a humongous AI that read basically the entire internet and learned to predict what words come next, really really well.
Token
In AI-land, a token is a chunk of text — roughly 3/4 of a word.
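That 3/4-of-a-word rule gives a quick back-of-envelope estimate in code (estimate_tokens is a hypothetical helper; only the model's own tokenizer gives an exact count):

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: 1 token is roughly 3/4 of a word, so tokens ~= words / 0.75.
    # Real tokenizers vary by model; use this only for rough budgeting.
    return round(len(text.split()) / 0.75)

estimate_tokens("Rate limits cap how fast you call the API")  # 9 words -> 12 tokens
```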