Multimodal
ELI5 — The Vibe Check
Multimodal AI can see, hear, AND read — it's not limited to just text. It's like the difference between texting someone and FaceTiming them. A multimodal model can look at a photo and describe it, listen to audio and transcribe it, or take a screenshot of your code and explain the bug. One model, multiple senses.
Real Talk
Multimodal AI refers to models that can process and generate content across multiple data types (modalities) — text, images, audio, video, and code. Modern multimodal models use unified architectures or adapters to handle different input/output types. Examples include GPT-4o (text+vision+audio), Claude (text+vision), and Gemini (text+vision+audio+video).
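In practice, "multiple modalities in one request" usually means packing text and non-text data into a single structured message. Here's a minimal sketch of what that looks like, assuming an OpenAI-style chat payload shape (the exact field names vary by provider; `build_vision_message` is a hypothetical helper, not a library function):

```python
import base64

def build_vision_message(prompt: str, image_bytes: bytes) -> dict:
    """Pack text + an image into one chat message (OpenAI-style shape)."""
    # Images are commonly sent inline as base64 data URLs next to the text part.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }

# Fake PNG bytes just to show the shape; a real call would read an actual file.
msg = build_vision_message("What's the bug in this screenshot?", b"\x89PNG_fake")
print(msg["content"][0]["type"], msg["content"][1]["type"])
```

The key idea: the model receives one message containing heterogeneous parts, and the unified architecture handles each part with the appropriate encoder.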
When You'll Hear This
"Use a multimodal model — it can analyze the screenshot directly." / "Multimodal AI lets users upload images instead of describing their problem."
Related Terms
Computer Vision
Computer Vision is teaching AI to understand images and video. How does your phone unlock with your face? Computer Vision.
Embedding
An embedding turns words, sentences, or entire documents into lists of numbers (vectors) that capture their meaning — similar meanings end up as nearby vectors.
GPT-4o
GPT-4o is OpenAI's 'omni' model — the Swiss Army knife of AI: one model that handles text, images, and audio.
LLM (Large Language Model)
An LLM is a humongous AI that read basically the entire internet and learned to predict what words come next, really really well.
Vision Model
A vision model is an AI that can understand images — it's got eyes, basically.