Cosine Similarity
ELI5 — The Vibe Check
Cosine similarity measures how similar two things are by comparing the angles of their vectors. If two vectors point in the same direction, they're similar (score near 1). If they're perpendicular, they're unrelated (score near 0). If they point opposite directions, they're opposites (score near -1). It's the math behind 'these two documents are about the same topic.'
Real Talk
Cosine similarity computes the cosine of the angle between two vectors, measuring their directional alignment regardless of magnitude. It is defined as the dot product of the two vectors divided by the product of their magnitudes, and yields a value between -1 and 1. It's the standard similarity metric for comparing text embeddings in semantic search and RAG systems.
Show Me The Code
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example: compare two embeddings. embed() is a placeholder for any
# embedding model; the scores are illustrative.
sim = cosine_similarity(embed("dog"), embed("puppy"))  # high, ~0.92
sim = cosine_similarity(embed("dog"), embed("SQL"))    # low, ~0.15
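To see the three regimes from the definition above (same direction, perpendicular, opposite) without needing an embedding model, here is a minimal sketch using made-up 3-d vectors. It also demonstrates the "regardless of magnitude" property: doubling a vector's length leaves the score unchanged.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length
c = np.array([3.0, -1.5, 0.0])  # perpendicular to a (dot product is 0)

cosine_similarity(a, b)   # ≈ 1.0  — scaling doesn't change the angle
cosine_similarity(a, c)   # ≈ 0.0  — perpendicular, unrelated
cosine_similarity(a, -a)  # ≈ -1.0 — opposite directions
```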
When You'll Hear This
"The cosine similarity between these two documents is 0.94 — they're nearly identical." / "We use cosine similarity to find the most relevant chunks for RAG."
Related Terms
Embedding
An embedding turns words, sentences, or entire documents into lists of numbers (vectors) that capture their meaning.
RAG (Retrieval Augmented Generation)
RAG is how you give an AI access to your private documents without retraining it.
Semantic Search
Semantic search finds results based on meaning, not just keyword matching.
Vector Database
A vector database is a special database built to store and search embeddings.