Vision Model
ELI5 — The Vibe Check
A vision model is an AI that can understand images — it's got eyes, basically. Show it a photo and it can tell you what's in it, read text from a receipt, spot a crack in a bridge, or even understand a hand-drawn wireframe. It's Computer Vision all grown up and integrated into the same models that handle text.
Real Talk
Vision models (or vision-language models) are AI systems that process and understand visual inputs. Modern approaches include vision transformers (ViT), CLIP-style contrastive learning, and natively multimodal LLMs. They handle tasks like image classification, object detection, OCR, visual question answering, and image generation. Most frontier LLMs now include vision capabilities.
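In practice, "sending an image to a vision model" usually means base64-encoding the image and placing it in the request alongside your question. Here's a minimal sketch using the content-block shape from the Anthropic Messages API; the model id, image bytes, and question are placeholders for illustration, and the payload is only built, not sent:

```python
import base64
import json

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Wrap raw image bytes and a text question into one multimodal message."""
    return {
        "model": "claude-sonnet-4-5",  # any vision-capable model id
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                # The image travels as a base64 content block...
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(image_bytes).decode()}},
                # ...followed by the text prompt about it.
                {"type": "text", "text": question},
            ],
        }],
    }

payload = build_vision_request(b"\x89PNG...", "What's wrong with this CSS layout?")
print(json.dumps(payload, indent=2)[:120])
```

The same pattern (base64 image block plus text block in one user turn) appears, with minor naming differences, across most vision-capable LLM APIs.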
When You'll Hear This
"The vision model can read the handwritten notes from the whiteboard." / "Send the screenshot to Claude's vision — it'll spot the CSS bug."
Related Terms
CLIP
CLIP connects text and images in one shared understanding — it can look at a photo and know what text describes it, or read text and find matching images.
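CLIP's trick can be mimicked in miniature: embed an image and several candidate captions in the same vector space, then pick the caption whose embedding points in the same direction as the image's. The vectors below are made-up stand-ins for real CLIP embeddings, which would come from its image and text encoders:

```python
import math

def cosine(a, b):
    """Cosine similarity: how closely two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (invented numbers, not real CLIP output).
image_emb = [0.9, 0.1, 0.2]
captions = {
    "a photo of a dog": [0.88, 0.15, 0.18],
    "a photo of a car": [0.10, 0.90, 0.30],
}

# Zero-shot classification: the best caption is the most similar one.
best = max(captions, key=lambda c: cosine(image_emb, captions[c]))
print(best)  # → "a photo of a dog"
```

This is exactly how CLIP does zero-shot classification: no retraining, just "which caption embedding sits closest to this image embedding?"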
Computer Vision
Computer Vision is teaching AI to understand images and video. How does your phone unlock with your face? Computer Vision.
Image Classification
Image classification is teaching a computer to look at a picture and say what it is — 'that's a cat,' 'that's a dog,' 'that's a suspicious mole you should get checked.'
Multimodal
Multimodal AI can see, hear, AND read — it's not limited to just text. It's like the difference between texting someone and FaceTiming them.
OCR (Optical Character Recognition)
OCR reads text from images — take a photo of a document, receipt, or sign, and OCR turns the pixels into actual text your computer can search, copy, and edit.