Data Pipeline
ELI5 — The Vibe Check
An assembly line for data. Raw data goes in one end, gets cleaned, transformed, enriched, and validated at each station, and comes out the other end ready to use. Think of it like a factory: dirty data in, clean data out. If one station breaks, everything downstream stops.
Real Talk
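A data pipeline is a series of automated steps that extract data from sources, transform it, and load it into a destination. If the transform happens before the load, that's ETL; if raw data is loaded first and transformed inside the warehouse, that's ELT. Pipelines can be batch (run on a schedule) or streaming (processing records continuously as they arrive). Common tools include Apache Airflow (orchestration), dbt (in-warehouse transformations), Apache Spark (distributed processing), and cloud-native services like AWS Glue.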
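To make the extract, transform, load flow concrete, here is a minimal batch sketch using only Python's standard library. Everything in it is an assumption for illustration: the sample CSV, the `payments` table, and the function names are hypothetical, and a real pipeline would usually read from an API or database and run under a scheduler or orchestrator.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would pull this from an API, file, or database.
RAW_CSV = """email,amount
 Alice@Example.com ,10.50
bob@example.com,not-a-number
carol@example.com,7
"""

def extract(raw):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: normalize emails and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validation: skip records that fail the numeric check
        clean.append((row["email"].strip().lower(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (email TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

def run_pipeline(raw, conn):
    """Run the three stations in order; an exception in any one stops the rest."""
    load(transform(extract(raw)), conn)

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
run_pipeline(RAW_CSV, conn)
result = conn.execute("SELECT email, amount FROM payments ORDER BY email").fetchall()
```

Because each stage is a plain function, an exception in `extract` or `transform` prevents `load` from running, which is the "one station breaks, everything downstream stops" behavior in miniature.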
When You'll Hear This
"The data pipeline broke overnight — the dashboard shows yesterday's numbers." / "We need a pipeline to sync customer data from Stripe to our warehouse."
Related Terms
Cron Job
A cron job is a task that runs on a schedule automatically. 'Every day at midnight, clean up old sessions.' 'Every hour, send digest emails.'
Database
A database is like a super-organized filing cabinet for your app's data.
ETL
ETL stands for Extract, Transform, Load. You extract data from sources, transform it (clean, reshape, calculate), then load it into your warehouse.
Queue
A queue is like a line at a coffee shop — first come, first served. The first person to get in line is the first to get their coffee.
Worker
A worker is a background process that picks up jobs from a queue and does the heavy lifting.
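The Queue and Worker entries above fit together, and a small sketch can show how. This one uses Python's standard `queue` and `threading` modules; the job names and the uppercase "processing" step are made up for illustration, and production systems typically use a separate queue service and worker processes rather than a thread in the same program.

```python
import queue
import threading

jobs = queue.Queue()  # FIFO: the first job put in is the first one taken out
processed = []

def worker():
    """Pull jobs off the queue until a None sentinel says to stop."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        processed.append(job.upper())  # stand-in for the heavy lifting
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# Hypothetical job names, enqueued in order.
for name in ["resize-image", "send-email", "build-report"]:
    jobs.put(name)
jobs.put(None)  # sentinel: tells the worker to shut down

jobs.join()  # wait until every queued item has been marked done
t.join()
```

With a single worker, jobs finish in the same order they were queued. Adding more workers speeds things up but means completion order is no longer guaranteed.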