CLIP
ELI5 — The Vibe Check
CLIP connects text and images in one shared understanding — it can look at a photo and know what text describes it, or read text and find matching images. Think of it as teaching an AI to reason in words and pictures at the same time. It's the secret ingredient behind many AI tools: DALL-E uses it to understand prompts, and it powers zero-shot image classification.
Real Talk
CLIP (Contrastive Language-Image Pre-training) is a model by OpenAI that learns visual concepts from natural language supervision. Trained on 400M image-text pairs using contrastive learning, it produces aligned embeddings for text and images in a shared vector space. This enables zero-shot classification and image search, and CLIP's text encoder underpins many image generation models.
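The shared-vector-space idea can be sketched in a few lines. This is a toy illustration, not the real model: the hand-written 4-dimensional vectors and the "dog"/"cat"/"car" labels below are made-up stand-ins for what CLIP's image and text encoders would actually produce, and the temperature value is illustrative.

```python
import numpy as np

def normalize(v):
    # L2-normalize so dot products become cosine similarities,
    # mirroring how CLIP compares image and text embeddings
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Pretend image embedding (in a real pipeline: the image encoder's output)
image_emb = normalize(np.array([0.9, 0.1, 0.0, 0.2]))

# Pretend text embeddings for candidate captions
# (in a real pipeline: the text encoder run on "a photo of a {label}")
labels = ["dog", "cat", "car"]
text_embs = normalize(np.array([
    [0.8, 0.2, 0.1, 0.1],   # "a photo of a dog"
    [0.1, 0.9, 0.0, 0.3],   # "a photo of a cat"
    [0.0, 0.1, 0.9, 0.1],   # "a photo of a car"
]))

# Cosine similarity of the image to each caption, scaled by a
# temperature and softmaxed into per-label probabilities —
# zero-shot classification with no task-specific training
logits = 100.0 * text_embs @ image_emb
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(labels[int(np.argmax(probs))])  # prints: dog
```

Swapping in different label strings is all it takes to "retarget" the classifier — that's why no fine-tuning is needed.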
When You'll Hear This
"CLIP embeddings power our image search — users type text and find matching photos." / "We use CLIP for zero-shot image classification without any fine-tuning."
Related Terms
Computer Vision
Computer Vision is teaching AI to understand images and video. How does your phone unlock with your face? Computer Vision.
DALL-E
DALL-E is OpenAI's AI image generator — describe an image in words and it creates it from scratch. Want 'an avocado armchair'? Done.
Embedding
An embedding is turning words, sentences, or entire documents into lists of numbers (vectors) that capture their meaning.
Zero-Shot Learning
Zero-shot learning is when you ask an AI to do something it was never explicitly trained on — and it just... does it.