Model Inversion
ELI5 — The Vibe Check
Model inversion is reconstructing training data from a trained ML model — the privacy attack that makes ML teams sweat. You trained a model on private medical records. Someone probes your model carefully, analyzing its outputs and confidence scores across thousands of queries. Over time, they reconstruct data that looks suspiciously like your private training set. The model learned the data too well and is now accidentally leaking it.
Real Talk
Model inversion attacks work by querying a model repeatedly and using its outputs (predictions, confidence scores, embeddings) to infer information about the training data. Fredrikson et al. (2015) demonstrated this by reconstructing recognizable facial images from a face-recognition model, using only a target's name and the confidence scores returned by the API. Defenses include differential privacy (adding calibrated noise during training), output perturbation, confidence score masking (rounding or withholding raw scores), and rate-limiting API queries. The attack is particularly relevant for models trained on PII, medical, or financial data.
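The attack loop described above can be sketched as a toy experiment: train a tiny logistic-regression "victim" on synthetic two-cluster data, then run gradient ascent on the *input* to maximize the model's confidence for a target class. The recovered point drifts toward the private class-1 cluster. Everything here (data, model, step sizes) is illustrative, not a real attack implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "private" training data: two Gaussian clusters standing in for records.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Train a minimal logistic regression (the "victim" model).
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def confidence(x, target=1):
    """All the attacker sees: the model's confidence for the target class."""
    p = 1 / (1 + np.exp(-(x @ w + b)))
    return p if target == 1 else 1 - p

# Inversion: gradient ascent on the INPUT (via finite differences, since the
# attacker has only query access) to maximize target-class confidence.
x_hat = np.zeros(2)
eps = 1e-4
for _ in range(200):
    grad = np.array([
        (confidence(x_hat + eps * np.eye(2)[i]) - confidence(x_hat)) / eps
        for i in range(2)
    ])
    x_hat += 1.0 * grad

print(x_hat)  # a point resembling the private class-1 cluster
```

Real attacks follow the same shape against neural networks, querying for scores and climbing them; this is why coarse or masked outputs blunt the attack.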
When You'll Hear This
"Model inversion is why we don't expose raw confidence scores in the API." / "Fine-tuning on customer data without differential privacy is a model inversion risk."
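The first quote can be made concrete with a minimal sketch of confidence score masking: rounding the returned score makes nearby probe queries indistinguishable, starving finite-difference attackers of gradient signal. The `masked_confidence` helper is a hypothetical illustration, not a real API.

```python
import numpy as np

def masked_confidence(p, decimals=1):
    """Return a coarsened confidence score instead of the raw probability.

    Nearby probes collapse to the same value, so an attacker estimating
    gradients by finite differences sees zero almost everywhere.
    """
    return float(np.round(p, decimals))

# Two probe inputs whose raw scores differ only slightly...
raw_a, raw_b = 0.7312, 0.7315
# ...become identical after masking, leaking no direction to climb.
print(masked_confidence(raw_a), masked_confidence(raw_b))
```

The trade-off is utility: legitimate consumers also lose score precision, which is why masking is usually combined with rate limits or differential privacy rather than used alone.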
Related Terms
AI Safety
AI Safety is the field of making sure AI doesn't go off the rails.
Alignment
Alignment is the AI safety challenge of making sure AI does what we actually want, not just what we literally said.
Machine Learning (ML)
Machine Learning is teaching a computer by showing it thousands of examples instead of writing out every rule.