File Operations at Scale

Batch renaming 1,000 files, syncing directories across machines, compressing archives — these are the tasks where shell scripting pays back its learning curve in minutes.

Learning Objectives

  • Batch rename files safely with loops and parameter expansion
  • Sync directories with rsync
  • Create and extract archives with tar, gzip, and zip
  • Verify file integrity with checksums (md5sum, sha256sum)
  • Handle filenames with spaces, special characters, and Unicode

Batch Renaming Files

Safe Rename Pattern

#!/usr/bin/env bash
set -euo pipefail

# Rename all .jpeg files to .jpg
for file in *.jpeg; do
    [[ -e "$file" ]] || continue
    newname="${file%.jpeg}.jpg"
    mv -v -- "$file" "$newname"
done

The -- before $file prevents filenames starting with - from being treated as flags.
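One gap remains in the loop above: mv overwrites an existing photo.jpg when it renames photo.jpeg. The following is a minimal sketch of a guard against that, not a definitive implementation. The helper name rename_ext, the variable demo_dir, and the demo filenames are all hypothetical and chosen for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# rename_ext OLD NEW: rename every *.OLD in the current directory to *.NEW,
# skipping any file whose target name already exists.
# (rename_ext is a hypothetical helper name, not part of the original script.)
rename_ext() {
    local old="$1" new="$2" file target
    for file in *."$old"; do
        [[ -e "$file" ]] || continue          # glob matched nothing
        target="${file%."$old"}.$new"
        if [[ -e "$target" ]]; then
            echo "skip: $target already exists" >&2
            continue
        fi
        mv -- "$file" "$target"
    done
}

# Demo in a throwaway directory
demo_dir="$(mktemp -d)"
cd "$demo_dir"
touch -- "a.jpeg" "b b.jpeg" "-c.jpeg" "a.jpg"   # a.jpg forces one collision
rename_ext jpeg jpg
ls -1
```

In the demo, a.jpeg is left alone because a.jpg already exists, while the names containing a space and a leading dash are renamed normally.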

Add Timestamp Prefix

TIMESTAMP=$(date +%Y%m%d)
for file in *.log; do
    [[ -f "$file" ]] || continue
    mv -- "$file" "${TIMESTAMP}_${file}"
done

Lowercase All Filenames

for file in *; do
    [[ -f "$file" ]] || continue
    lower="${file,,}"               # bash 4+ lowercase expansion
    [[ "$file" == "$lower" ]] || mv -- "$file" "$lower"
done

Test with echo first

Before running destructive mv or rm in a loop, replace mv with echo mv to preview what would happen. Only remove echo when you are confident the logic is correct.
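Instead of editing echo in and out by hand, the preview can be put behind a switch. This is one possible sketch, assuming bash; the function name run and the DRY_RUN variable are illustrative choices, not something the original scripts define.

```shell
#!/usr/bin/env bash
set -euo pipefail

DRY_RUN="${DRY_RUN:-true}"    # default to preview; set DRY_RUN=false to apply

# run CMD ARGS...: print the command in preview mode, execute it otherwise.
# (run and DRY_RUN are illustrative names, not from the original text.)
run() {
    if [[ "$DRY_RUN" == true ]]; then
        printf 'would run:'
        printf ' %q' "$@"     # %q shows each argument shell-quoted
        printf '\n'
    else
        "$@"
    fi
}

for file in *.jpeg; do
    [[ -e "$file" ]] || continue
    run mv -- "$file" "${file%.jpeg}.jpg"
done
```

Running the script as-is only prints the planned mv commands; running it with DRY_RUN=false in the environment performs them. Printing with %q makes spaces and other unusual characters in filenames visible in the preview.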


rsync — Synchronize Directories

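rsync transfers only files that have changed (by default, those whose size or modification time differs between source and destination), which makes it well suited to backups and deployments.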

# Basic sync (local to local)
rsync -av source/ destination/

# Common flags
rsync -avz source/ user@remote:~/dest/    # compress during transfer
rsync -av --delete source/ dest/          # mirror: delete files not in source
rsync -av --exclude='*.log' source/ dest/ # skip log files
rsync -av --dry-run source/ dest/         # preview without copying

Key flags:

  • -a: archive mode (recursive; preserves permissions, timestamps, symlinks)
  • -v: verbose
  • -z: compress during transfer (useful over slow links)
  • --delete: remove files from dest that do not exist in source
  • --dry-run: show what would happen without doing it

Trailing slash on source matters

rsync -av src/ dest/ copies the contents of src into dest. rsync -av src dest/ copies the src directory itself into dest, creating dest/src/. The trailing slash on the source is a common gotcha.
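The difference is easy to confirm in a scratch directory. The sketch below assumes rsync is installed and exits quietly if it is not; the directory names work, with_slash, and without_slash are made up for the demonstration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip the demo cleanly on machines without rsync.
command -v rsync >/dev/null || { echo "rsync not installed" >&2; exit 0; }

work="$(mktemp -d)"
mkdir -p "$work/src" "$work/with_slash" "$work/without_slash"
touch "$work/src/a.txt"

rsync -a "$work/src/" "$work/with_slash/"      # copies the contents of src
rsync -a "$work/src"  "$work/without_slash/"   # copies the src directory itself

# Show where a.txt ended up in each destination
find "$work/with_slash" "$work/without_slash" -type f
```

The first destination holds a.txt directly; the second holds it under an extra src/ level.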


Archives with tar

# Create an archive
tar -czf archive.tar.gz /path/to/directory/     # gzip compressed
tar -cjf archive.tar.bz2 /path/to/directory/    # bzip2 compressed
tar -cJf archive.tar.xz /path/to/directory/     # xz compressed (usually smaller, slower to create)

# Extract an archive
tar -xzf archive.tar.gz                         # extract here
tar -xzf archive.tar.gz -C /target/dir/         # extract to specific dir

# List contents without extracting
tar -tzf archive.tar.gz

Flag mnemonics: c=create, x=extract, t=list, z=gzip, j=bzip2, J=xz, f=filename, v=verbose.

Timestamped Backup

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
tar -czf "backup_${TIMESTAMP}.tar.gz" ~/documents/
echo "Archive: backup_${TIMESTAMP}.tar.gz ($(du -sh "backup_${TIMESTAMP}.tar.gz" | cut -f1))"
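A backup is only useful if it restores, so it can help to exercise the full create, list, and extract cycle once. The sketch below does that in a temporary directory; the paths (work, project, restore) are hypothetical. It also uses -C when creating the archive so that members are stored with short relative paths.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
mkdir -p "$work/project/docs" "$work/restore"
echo "hello" > "$work/project/docs/note.txt"

archive="$work/backup_$(date +%Y%m%d_%H%M%S).tar.gz"

# -C switches directory before archiving, so members are stored as
# project/... rather than under the full temporary path.
tar -czf "$archive" -C "$work" project

listing="$(tar -tzf "$archive")"          # list members without extracting
echo "$listing"

tar -xzf "$archive" -C "$work/restore"    # extract into restore/

restored="$(cat "$work/restore/project/docs/note.txt")"
echo "restored: $restored"
```

Listing the archive before extracting is a cheap way to see exactly which paths will be created in the target directory.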

Checksums

# Generate
md5sum file.txt > file.txt.md5
sha256sum archive.tar.gz > archive.tar.gz.sha256

# Verify
md5sum -c file.txt.md5        # OK if it prints "file.txt: OK"
sha256sum -c archive.tar.gz.sha256

# Generate for multiple files
sha256sum *.tar.gz > checksums.sha256
sha256sum -c checksums.sha256
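Because the -c mode reports its result through the exit status, a script can use it as a gate before acting on a file. The sketch below assumes the sha256sum command from GNU coreutils (on macOS, shasum -a 256 is the usual equivalent); the file name data.bin and the status variables are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
cd "$work"
printf 'payload\n' > data.bin
sha256sum data.bin > data.bin.sha256

# sha256sum -c exits 0 when every listed file matches and non-zero
# otherwise, so it can be used directly as an if condition.
if sha256sum -c data.bin.sha256 >/dev/null 2>&1; then
    status_before="intact"
else
    status_before="corrupt"
fi

printf 'tampered\n' >> data.bin          # simulate corruption

if sha256sum -c data.bin.sha256 >/dev/null 2>&1; then
    status_after="intact"
else
    status_after="corrupt"
fi

echo "before: $status_before, after: $status_after"
```

The same pattern applies to md5sum -c. Note that the checksum file records the filename as given, so verification should run from the directory where the checksum was generated.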

Handling Problematic Filenames

Filenames can contain spaces, newlines, special characters, and Unicode. These patterns handle all of them safely:

# Always use -- to stop flag processing
mv -- "$file" "$newname"
rm -- "$file"

# Use null-delimited lists with find
find . -name "*.tmp" -print0 | xargs -0 -r rm --   # -r (GNU xargs): do not run rm when nothing matches

# Quote every variable reference
for f in "$dir"/*; do
    process "$f"      # not $f
done
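When each file needs more than a single command, the null-delimited list from find can be consumed by a bash while loop instead of xargs. This is a sketch assuming bash and a find that supports -print0 (GNU and BSD find both do); the directory and filenames are invented for the demonstration.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
touch -- "$work/plain.tmp" "$work/with space.tmp" "$work/"$'new\nline.tmp'

count=0
# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal;
# -d '' makes read stop at the NUL bytes that -print0 emits.
while IFS= read -r -d '' f; do
    count=$((count + 1))
    printf 'found: %q\n' "$f"
done < <(find "$work" -name '*.tmp' -print0)

echo "total: $count"
```

Feeding the loop through process substitution rather than a pipe keeps it in the current shell, so count is still set after the loop ends.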

Practice Exercises

Main (write a short script)

Build the File Organizer project prototype:

#!/usr/bin/env bash
set -euo pipefail

SOURCE="${1:-.}"
DRY_RUN=false
[[ "${2:-}" == "--dry-run" ]] && DRY_RUN=true

declare -A EXTENSIONS
EXTENSIONS=([jpg]="images" [jpeg]="images" [png]="images"
            [mp3]="music" [wav]="music" [flac]="music"
            [pdf]="documents" [docx]="documents" [txt]="documents"
            [zip]="archives" [tar]="archives" [gz]="archives")

for file in "$SOURCE"/*; do
    [[ -f "$file" ]] || continue
    ext="${file##*.}"
    ext="${ext,,}"
    dest_dir="${EXTENSIONS[$ext]:-other}"
    if "$DRY_RUN"; then
        echo "Would move: $file -> $SOURCE/$dest_dir/"
    else
        mkdir -p "$SOURCE/$dest_dir"
        mv -- "$file" "$SOURCE/$dest_dir/"
    fi
done

Stretch

  1. Add incremental backup logic to your backup script: only archive files newer than the last backup run (use find -newer).
  2. Research rsync --link-dest. How does it enable space-efficient incremental backups?

Interview Questions

  1. What does the trailing slash mean in rsync -av src/ dest/?

A trailing slash on the source (src/) means "copy the contents of src." Without it (src), rsync copies the directory itself. So rsync src/ dest/ and rsync src dest/ produce different results: the first puts files directly in dest, the second creates dest/src.

  2. What does tar -czf stand for?

c = create, z = compress with gzip, f = the next argument is the filename. So tar -czf archive.tar.gz dir/ creates a gzip-compressed tar archive named archive.tar.gz from dir/. To extract: tar -xzf archive.tar.gz.

  3. Why use find -print0 | xargs -0 instead of a for loop or find -exec?

find -print0 | xargs -0 handles filenames with spaces, newlines, and special characters correctly by using null bytes as delimiters. It is also more efficient than -exec ... \; (which spawns a process per file) because xargs -0 batches multiple files per invocation; find -exec ... {} + batches in the same way and is equally safe. A for loop over $(find ...) breaks on filenames with spaces because the command output is word-split.

