Alerting
ELI5 — The Vibe Check
Alerting is the part of monitoring that actually wakes people up when something goes wrong. You define rules: 'If error rate > 1% for 5 minutes, send a PagerDuty alert.' Without alerting, monitoring is like a smoke detector with no alarm — useless when the house is on fire.
Real Talk
Alerting is the monitoring subsystem that evaluates rules against collected metrics and sends notifications when thresholds are breached. Alert rules define conditions, severities, and notification channels (Slack, PagerDuty, email). Good alerting is actionable: every alert should demand a human response, and anything that doesn't belongs on a dashboard, not a pager.
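The core of rule evaluation is the "hold for N minutes" logic: a breach must persist for the full `for` duration before the alert fires, which filters out momentary spikes. A minimal sketch of that logic (the `AlertRule` class and sample format here are illustrative, not any real library's API):

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """Hypothetical rule: fire only after the threshold is breached
    continuously for `for_seconds` (Prometheus's `for:` clause)."""
    name: str
    threshold: float   # e.g. 0.01 for a 1% error rate
    for_seconds: int   # how long the breach must persist before firing

def evaluate(rule: AlertRule, samples: list[tuple[int, float]]) -> bool:
    """Samples are (unix_timestamp, value), sorted by time.
    Returns True if the rule is firing after the last sample."""
    breach_start = None
    firing = False
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # breach just began; start the clock
            if ts - breach_start >= rule.for_seconds:
                firing = True      # breach has lasted long enough
        else:
            breach_start = None    # breach ended; reset the clock
            firing = False
    return firing

rule = AlertRule("HighErrorRate", threshold=0.01, for_seconds=300)
# Sustained 5-minute breach -> fires:
print(evaluate(rule, [(0, 0.02), (150, 0.03), (300, 0.02)]))   # True
# Breach recovers mid-window -> stays silent:
print(evaluate(rule, [(0, 0.02), (150, 0.003), (300, 0.02)]))  # False
```

The reset-on-recovery behavior is exactly what makes `for: 5m` different from a bare threshold check: a 30-second spike never pages anyone.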
Show Me The Code
# Prometheus alert rule: fire when the error ratio stays above 1% for 5 minutes
groups:
  - name: app.rules
    rules:
      - alert: HighErrorRate
        # Divide errors by total requests: rate(http_errors_total[5m]) alone is
        # errors per second, not a percentage. Assumes a matching
        # http_requests_total counter exists.
        expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"
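Once a rule fires, a separate component decides who gets notified; in the Prometheus stack that's Alertmanager. A minimal routing sketch, assuming a Slack webhook and a PagerDuty integration key (both placeholders here):

```yaml
# alertmanager.yml — page on critical alerts, post the rest to Slack
route:
  receiver: slack-default
  routes:
    - match:
        severity: critical   # matches the label set by the rule above
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX  # placeholder webhook
        channel: '#alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: YOUR_PAGERDUTY_KEY                # placeholder
```

Routing on the `severity` label is the usual pattern: critical wakes a human, warning lands in a channel someone reads in the morning.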
When You'll Hear This
"Set up alerting so the on-call engineer gets paged if the service is down." / "Too many false-positive alerts cause alert fatigue — tune them carefully."
Related Terms
Incident
An incident is when something has gone wrong in production and users are affected.
Metrics
Metrics are the numbers your app tracks about itself over time — requests per second, error rate, CPU usage, response time, active users.
Monitoring
Monitoring is keeping a constant eye on your app while it runs — tracking whether it's up, how fast it responds, how many errors it throws, and how much memory it's using.
On-call
On-call means it's your turn to be the person who gets woken up at 3am if production breaks.
Pager
A pager (or more likely PagerDuty/OpsGenie today) is the alert that goes off on the on-call engineer's phone when something breaks in production.
Prometheus
Prometheus scrapes your services every 15 seconds asking 'how are you?' and stores the answers (metrics) as time series.