Project 01: System Health Monitor
Build a script that monitors CPU, memory, and disk usage on demand or via cron, logs the results, and alerts you when thresholds are exceeded.
What You Will Build
A production-ready health monitoring script that:
- Checks CPU load, memory usage, and disk space
- Compares each metric against configurable thresholds
- Logs results with timestamps to a rotating log file
- Sends an alert (to a log file, email, or Slack webhook) when thresholds are breached
- Runs unattended via cron every 5 minutes
Skills Covered
This project touches every major topic from both weeks:
| Topic | Where it appears |
|---|---|
| Variables & arithmetic | Threshold comparisons |
| Control flow | Metric evaluation |
| Functions | Modular metric collection |
| Error handling | Graceful failure on missing tools |
| Logging | Timestamped output |
| cron | Scheduling |
| curl | Slack/webhook alerting |
Project Files
| File | Purpose |
|---|---|
| `health_check.sh` | Main script |
| `health_check.conf` | Configuration (thresholds, alert targets) |
| `health_check_test.sh` | Test cases using bats |
| `install.sh` | Installs cron job and creates log directory |
| `README.md` | Usage documentation |
Getting Started
Step 1: Basic Metrics Collection
Start with a script that just collects and prints metrics:
```bash
#!/usr/bin/env bash
set -euo pipefail

# CPU load (1-minute average), parsed from uptime
cpu_load=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | tr -d ',')

# Memory usage percentage: used / total, as reported by free (Linux)
mem_pct=$(free | awk '/^Mem:/ { printf "%.0f", $3/$2*100 }')

# Disk usage percentage for /; -P keeps each filesystem on a single line
disk_pct=$(df -P / | awk 'NR==2 { gsub(/%/,"",$5); print $5 }')

echo "CPU load: $cpu_load"
echo "Memory: ${mem_pct}%"
echo "Disk (/): ${disk_pct}%"
```
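The skills table lists graceful failure on missing tools. One way to get that is to check for each external command before Step 1 calls it. The following is a minimal sketch; the helper name `require_cmd` is hypothetical and not part of the original project.

```bash
# Hypothetical helper: succeeds if every named command is on PATH.
# Otherwise it reports each missing command on stderr and returns 1.
require_cmd() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "ERROR: required command not found: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: guard the tools that Step 1 relies on.
# require_cmd uptime free df awk || exit 1
```

Checking all commands before exiting means one run reports every missing tool, rather than failing on the first.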
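Step 2: Add Thresholds and Alerting
```bash
THRESHOLD_MEM=85
THRESHOLD_DISK=90

# (( )) is integer-only, which suits the whole-number percentages from Step 1
if (( mem_pct > THRESHOLD_MEM )); then
  echo "ALERT: Memory at ${mem_pct}% (threshold: ${THRESHOLD_MEM}%)" >&2
fi
if (( disk_pct > THRESHOLD_DISK )); then
  echo "ALERT: Disk (/) at ${disk_pct}% (threshold: ${THRESHOLD_DISK}%)" >&2
fi
```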
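The CPU load from Step 1 is a decimal such as `0.45`, and shell arithmetic with `(( ))` only handles integers, so the same comparison needs a different tool. A minimal sketch that delegates the comparison to `awk`; the names `THRESHOLD_CPU` and `exceeds` are assumptions for illustration, and the threshold value is arbitrary.

```bash
THRESHOLD_CPU=4.0   # hypothetical: 1-minute load average that triggers an alert

# Succeeds (exit 0) when $1 is numerically greater than $2.
# Adding 0 forces awk to compare the values as numbers, not strings.
exceeds() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}

cpu_load=${cpu_load:-0.45}   # placeholder so the sketch runs on its own
if exceeds "$cpu_load" "$THRESHOLD_CPU"; then
  echo "ALERT: CPU load at ${cpu_load} (threshold: ${THRESHOLD_CPU})" >&2
fi
```

A sensible load threshold depends on the number of CPU cores, so the value is a natural candidate for the config file.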
Step 3: Functions, Logging, Config File
Refactor the script into functions with a `log()` helper, and load the thresholds from a config file (`health_check.conf`) instead of hard-coding them.
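A minimal sketch of this step, assuming the config file contains plain shell variable assignments (for example `THRESHOLD_MEM=80`) that the script sources. The names `CONFIG_FILE` and `load_config` are hypothetical, and the timestamp format is chosen to match the sample output further down.

```bash
# Defaults from Step 2; a config file may override any of them.
THRESHOLD_MEM=85
THRESHOLD_DISK=90
CONFIG_FILE=${CONFIG_FILE:-./health_check.conf}   # hypothetical location

# Print a message prefixed with a timestamp like [2024-01-15 09:30:00].
log() {
  printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
}

# Source the config file if it is readable; otherwise keep the defaults.
# Sourcing runs the file as shell code, so it must come from a trusted place.
load_config() {
  local file="${1:-$CONFIG_FILE}"
  if [ -r "$file" ]; then
    # shellcheck source=/dev/null
    . "$file"
    log "Loaded config from $file"
  else
    log "No config at $file, using defaults"
  fi
}
```

Setting the defaults before loading the file keeps the script usable when no config exists, which matters for a job that runs unattended.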
Step 4: cron Integration
```cron
# User crontab entry (add it with crontab -e).
# A file in /etc/cron.d/ needs a user name between the schedule and the command.
*/5 * * * * /home/user/scripts/health_check.sh >> /var/log/health_check.log 2>&1
```
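The project files list an `install.sh` that installs the cron job. A minimal sketch of how such a script might add the entry without creating duplicates on repeated runs; the function name `merge_cron_entry` and the variables are hypothetical, and the paths are copied from the entry above.

```bash
# Hypothetical values; adjust to wherever the script is installed.
SCRIPT_PATH=/home/user/scripts/health_check.sh
LOG_FILE=/var/log/health_check.log
CRON_LINE="*/5 * * * * $SCRIPT_PATH >> $LOG_FILE 2>&1"

# Read an existing crontab on stdin and print it with exactly one entry
# for the health check: lines mentioning the script are dropped, then the
# current entry is appended.
merge_cron_entry() {
  grep -vF -- "$SCRIPT_PATH" || true   # grep exits 1 when no lines remain; not an error here
  printf '%s\n' "$CRON_LINE"
}

# To install for the current user (commented out so the sketch has no side effects):
# mkdir -p "$(dirname "$LOG_FILE")"
# { crontab -l 2>/dev/null || true; } | merge_cron_entry | crontab -
```

Keeping the merge as a filter from stdin to stdout makes it possible to test the logic without touching a real crontab.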
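Step 5: Optional Slack Webhook Alert
```bash
send_alert() {
  local message="$1"
  # No webhook configured: skip silently
  [[ -n "${SLACK_WEBHOOK_URL:-}" ]] || return 0
  # Escape backslashes and double quotes so the payload stays valid JSON
  message=${message//\\/\\\\}
  message=${message//\"/\\\"}
  curl -sf -X POST "$SLACK_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"[$(hostname)] $message\"}"
}
```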
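The project brief mentions three alert targets: a log file, email, and a Slack webhook. A minimal sketch of a dispatcher that sends one message to every configured target; the names `notify`, `ALERT_LOG`, and `ALERT_EMAIL` are hypothetical, and the email branch assumes a working `mail` command is installed.

```bash
# Hypothetical settings; leave a target empty to disable it.
ALERT_LOG=${ALERT_LOG:-./health_alerts.log}
ALERT_EMAIL=${ALERT_EMAIL:-}

# Send one alert to every configured target. A failing target is reported
# on stderr but does not stop the others (or abort a script using set -e).
notify() {
  local message="$1"
  if [ -n "$ALERT_LOG" ]; then
    printf '[%s] ALERT: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$message" >> "$ALERT_LOG" \
      || echo "WARN: could not write $ALERT_LOG" >&2
  fi
  if [ -n "$ALERT_EMAIL" ] && command -v mail >/dev/null 2>&1; then
    printf '%s\n' "$message" | mail -s "Health alert on $(hostname)" "$ALERT_EMAIL" \
      || echo "WARN: email to $ALERT_EMAIL failed" >&2
  fi
  # send_alert is the Slack function from Step 5; skipped if it is not defined.
  if command -v send_alert >/dev/null 2>&1; then
    send_alert "$message" || echo "WARN: Slack alert failed" >&2
  fi
}
```

Each threshold check can then call `notify "Memory at ${mem_pct}%"` without knowing which targets are active.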
Sample Output
[2024-01-15 09:30:00] Health check started on webserver-01
[2024-01-15 09:30:00] CPU load: 0.45 (OK)
[2024-01-15 09:30:00] Memory: 62% (OK, threshold: 85%)
[2024-01-15 09:30:00] Disk /: 78% (OK, threshold: 90%)
[2024-01-15 09:30:01] Health check complete — all systems normal
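Output like this accumulates every 5 minutes under cron, and the project brief calls for a rotating log file. On many systems `logrotate` handles this. As an alternative that stays inside the script, here is a minimal sketch of size-based rotation; the function name `rotate_log` and both limits are hypothetical.

```bash
# Hypothetical limits: rotate at 1 MiB, keep 3 old files.
MAX_LOG_BYTES=${MAX_LOG_BYTES:-1048576}
MAX_LOG_FILES=${MAX_LOG_FILES:-3}

# Rotate "$1" once it reaches MAX_LOG_BYTES: file.2 -> file.3,
# file.1 -> file.2, file -> file.1. The oldest copy is overwritten.
rotate_log() {
  local file="$1" size i
  [ -f "$file" ] || return 0
  size=$(( $(wc -c < "$file") ))   # arithmetic strips any padding from wc
  [ "$size" -ge "$MAX_LOG_BYTES" ] || return 0
  i=$MAX_LOG_FILES
  while [ "$i" -gt 1 ]; do
    if [ -f "$file.$((i - 1))" ]; then
      mv -f "$file.$((i - 1))" "$file.$i"
    fi
    i=$((i - 1))
  done
  mv -f "$file" "$file.1"
}
```

When cron redirects output with `>>`, the log file is opened before the script starts, so a rotation made from inside the script takes effect for the next run.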
Stretch Goals
- Add per-process CPU/memory checks (alert if a specific process exceeds a threshold)
- Add network checks: ping a host, check HTTP endpoint response time
- Generate an HTML report and serve it with `python3 -m http.server`
- Add a `--history` flag that shows the last N log entries formatted as a table