Skip to content

Project 01 — System Health Monitor

Build a script that monitors CPU, memory, and disk usage on demand or via cron, logs the results, and alerts you when thresholds are exceeded.

What You Will Build

A production-ready health monitoring script that: - Checks CPU load, memory usage, and disk space - Compares each metric against configurable thresholds - Logs results with timestamps to a rotating log file - Sends an alert (to a log file, email, or Slack webhook) when thresholds are breached - Runs unattended via cron every 5 minutes

Skills Covered

This project touches every major topic from both weeks:

Topic Where it appears
Variables & arithmetic Threshold comparisons
Control flow Metric evaluation
Functions Modular metric collection
Error handling Graceful failure on missing tools
Logging Timestamped output
cron Scheduling
curl Slack/webhook alerting

Project Files

File Purpose
health_check.sh Main script
health_check.conf Configuration (thresholds, alert targets)
health_check_test.sh Test cases using bats
install.sh Installs cron job and creates log directory
README.md Usage documentation

Getting Started

Step 1 — Basic Metrics Collection

Start with a script that just collects and prints metrics:

#!/usr/bin/env bash
set -euo pipefail

# CPU load (1-minute average)
cpu_load=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | tr -d ',')

# Memory usage percentage
mem_pct=$(free | awk '/^Mem:/ { printf "%.0f", $3/$2*100 }')

# Disk usage percentage for /
disk_pct=$(df / | awk 'NR==2 { gsub(/%/,"",$5); print $5 }')

echo "CPU load: $cpu_load"
echo "Memory: ${mem_pct}%"
echo "Disk (/): ${disk_pct}%"

Step 2 — Add Thresholds and Alerting

THRESHOLD_MEM=85
THRESHOLD_DISK=90

if (( mem_pct > THRESHOLD_MEM )); then
    echo "ALERT: Memory at ${mem_pct}% (threshold: ${THRESHOLD_MEM}%)" >&2
fi

Step 3 — Functions, Logging, Config File

Refactor into functions with a log() helper. Load thresholds from a config file using:

[[ -f "$CONFIG_FILE" ]] && source "$CONFIG_FILE"

Step 4 — cron Integration

# /etc/cron.d/health_check or crontab -e
*/5 * * * * /home/user/scripts/health_check.sh >> /var/log/health_check.log 2>&1

Step 5 — Optional: Slack Webhook Alert

send_alert() {
    local message="$1"
    [[ -n "${SLACK_WEBHOOK_URL:-}" ]] || return 0
    curl -sf -X POST "$SLACK_WEBHOOK_URL" \
         -H "Content-Type: application/json" \
         -d "{\"text\": \"[$(hostname)] $message\"}"
}

Sample Output

[2024-01-15 09:30:00] Health check started on webserver-01
[2024-01-15 09:30:00] CPU load: 0.45 (OK)
[2024-01-15 09:30:00] Memory: 62% (OK, threshold: 85%)
[2024-01-15 09:30:00] Disk /: 78% (OK, threshold: 90%)
[2024-01-15 09:30:01] Health check complete — all systems normal

Stretch Goals

  • Add per-process CPU/memory checks (alert if a specific process exceeds a threshold)
  • Add network checks: ping a host, check HTTP endpoint response time
  • Generate an HTML report and serve it with python3 -m http.server
  • Add a --history flag that shows the last N log entries formatted as a table

index | [[02-file-organizer]]