Regular Expressions¶

A regular expression is a search pattern. The core syntax carries over, with minor dialect differences, to grep, sed, awk, Python, JavaScript, and most other text-processing tools.

Learning Objectives¶

Understand the difference between BRE and ERE
Write patterns using character classes, quantifiers, and anchors
Use grep -E (ERE) for modern regex syntax
Use sed with regex for substitution
Use awk regex patterns for field-based filtering
Match patterns in bash with =~

BRE vs ERE¶

Feature	BRE (Basic)	ERE (Extended)
Tool default	`grep`, `sed`	`grep -E`, `awk`
Grouping	`\(` `\)`	`(` `)`
Alternation	not standard (`\|` is a GNU extension)	`|`
One or more	`\+` (GNU extension)	`+`
Zero or one	`\?` (GNU extension)	`?`
Grouping with alternation	`\(a\|b\)`	`(a|b)`

Use ERE (grep -E) for all new patterns

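ERE is cleaner and more readable, and its syntax is closer to that of most other tools and languages. Prefer grep -E over plain grep. grep -P (Perl-compatible regex) is also available in GNU grep, but it is not portable to other grep implementations.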
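To see the two dialects side by side, here is a minimal sketch that runs the same search once in BRE and once in ERE. The sample lines and the variable names (`sample`, `bre_hits`, `ere_hits`) are made up for illustration. It uses only grouping and intervals, which are standard in both dialects.

```shell
# Made-up sample input: only the first line contains "ab" twice in a row
sample='abab
ab
xyz'

# BRE: grouping and intervals must be backslash-escaped
bre_hits=$(printf '%s\n' "$sample" | grep '\(ab\)\{2\}')

# ERE: the same metacharacters are written bare
ere_hits=$(printf '%s\n' "$sample" | grep -E '(ab){2}')

echo "BRE: $bre_hits"
echo "ERE: $ere_hits"
```

Both commands select the same line. The only difference is how much escaping the pattern needs.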

Core Regex Syntax¶

Anchors and Wildcards¶

^ start of line
$ end of line
. any single character (except newline)

grep "^root" /etc/passwd          # lines starting with "root"
grep "bash$" /etc/passwd          # lines ending with "bash"
grep "^$" file.txt                # blank lines
grep "^..$" file.txt              # lines with exactly 2 characters

Character Classes¶

[abc] a, b, or c
[a-z] any lowercase letter
[A-Z] any uppercase letter
[0-9] any digit
[^abc] NOT a, b, or c

grep -E "[0-9]{4}-[0-9]{2}-[0-9]{2}" log.txt    # dates like 2024-01-15
grep -E "^[A-Za-z_][A-Za-z0-9_]*=" config.txt   # config key=value lines

POSIX Classes (portable)¶

[:alpha:] letters
[:digit:] digits
[:alnum:] letters and digits
[:space:] whitespace
[:upper:] uppercase letters
[:lower:] lowercase letters

grep -E "[[:upper:]]{3,}" file.txt     # three or more uppercase letters in a row
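A POSIX class name is only valid inside a bracket expression, which is why the working form has double brackets: `[[:digit:]]`, not `[:digit:]`. The sketch below illustrates this on two made-up input strings. The variable names are illustrative only.

```shell
# [[:digit:]] is a bracket expression that contains the digit class
digits=$(printf 'abc123\n' | grep -oE '[[:digit:]]+')

# A class can sit next to other members inside the same bracket expression:
# here, a letter or underscore followed by letters, digits, or underscores
ident=$(printf 'my_var2 = 10\n' | grep -oE '[[:alpha:]_][[:alnum:]_]*')

echo "$digits"   # 123
echo "$ident"    # my_var2
```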

Quantifiers¶

* zero or more
+ one or more (ERE)
? zero or one (ERE)
{n} exactly n
{n,} n or more
{n,m} between n and m

grep -E "error{1,3}" log.txt      # "error", "errorr", "errorrr" ({1,3} applies only to the final "r")
grep -E "[0-9]+" file.txt         # one or more digits
grep -E "https?" url.txt          # http or https
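A quantifier binds to the single item immediately before it: one character, one bracket expression, or one parenthesized group. The following sketch shows this with grep -o on made-up input. The variable names are illustrative only.

```shell
words='erro
error
errorr
errorrrr'

# {1,3} applies only to the final "r": "erro" followed by one to three "r"
r_matches=$(printf '%s\n' "$words" | grep -oE 'error{1,3}')

# Parentheses make the quantifier apply to the whole group "ab"
group_match=$(printf 'abab ab\n' | grep -oE '(ab){2}')

echo "$r_matches"     # error, errorr, errorrr (one per line)
echo "$group_match"   # abab
```

The line "erro" produces no match because at least one more "r" is required, and "errorrrr" yields only "errorrr" because the quantifier stops at three.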

Grouping and Alternation¶

grep -E "(error|warning|critical)" log.txt    # any of these words
grep -E "^(GET|POST|PUT|DELETE) " access.log  # HTTP methods

Practical Patterns¶

# IP address (simple)
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log

# Email address
grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt

# ISO date (YYYY-MM-DD)
grep -oE "[0-9]{4}-[0-9]{2}-[0-9]{2}" log.txt

# Lines with exactly 3 fields (tab-separated; -F sets the field separator)
awk -F'\t' 'NF == 3' data.tsv

# HTTP 4xx errors in access log
grep -E '"[A-Z]+ .+ HTTP/[0-9.]+" 4[0-9]{2}' access.log
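awk can also apply a regex to a single field with the `~` operator, which is often more precise than matching the whole line. The sketch below assumes a simplified, made-up log format (method, path, status separated by spaces), not a real access log layout.

```shell
# Hypothetical three-column log: method, path, status
log='GET /index.html 200
POST /login 401
GET /missing 404
GET /api 500'

# $3 ~ /regex/ tests only the third field; print the path of each 4xx request
client_errors=$(printf '%s\n' "$log" | awk '$3 ~ /^4[0-9][0-9]$/ { print $2 }')

echo "$client_errors"
```

The pattern spells out `[0-9][0-9]` instead of `[0-9]{2}` because some older awk implementations do not support interval expressions.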

Regex in `sed`¶

# Remove HTML tags
sed 's/<[^>]*>//g' page.html

# Extract text between brackets (the last pair on each line, because the leading .* is greedy)
sed -n 's/.*\[\([^]]*\)\].*/\1/p' file.txt

# Collapse runs of whitespace into a single space (-E enables ERE, so + needs no backslash)
sed -E 's/[[:space:]]+/ /g' file.txt
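With sed -E, capture groups are written as plain parentheses and referenced in the replacement as \1, \2, and so on. As a sketch on made-up input, the following rearranges an ISO date and collapses mixed whitespace. The variable names are illustrative only.

```shell
# Reorder YYYY-MM-DD into MM/DD/YYYY using three capture groups.
# '#' is used as the s-command delimiter so the slashes need no escaping.
us_date=$(printf '2024-01-15\n' | sed -E 's#^([0-9]{4})-([0-9]{2})-([0-9]{2})$#\2/\3/\1#')

# Collapse any run of spaces and tabs into a single space
squeezed=$(printf 'a \t  b\n' | sed -E 's/[[:space:]]+/ /g')

echo "$us_date"    # 01/15/2024
echo "$squeezed"   # a b
```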

Regex in bash `[[ =~ ]]`¶

if [[ "$input" =~ ^[0-9]+$ ]]; then
    echo "Input is a number"
fi

if [[ "$email" =~ ^[^@]+@[^@]+\.[^@]{2,}$ ]]; then
    echo "Looks like an email"
fi

# Capture groups via BASH_REMATCH
if [[ "$date" =~ ([0-9]{4})-([0-9]{2})-([0-9]{2}) ]]; then
    year="${BASH_REMATCH[1]}"
    month="${BASH_REMATCH[2]}"
    day="${BASH_REMATCH[3]}"
fi
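The capture-group pattern above matches a date anywhere in the string. If the whole value should be a date, anchoring the pattern is one option. The sketch below wraps that idea in a function. The name `parse_iso_date` is hypothetical, and the code requires bash because of `[[ =~ ]]`.

```shell
# parse_iso_date is a hypothetical helper name used only for this sketch.
# It prints "year month day" and returns 0 if the whole argument is YYYY-MM-DD.
parse_iso_date() {
    local re='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
    if [[ "$1" =~ $re ]]; then
        printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        return 1
    fi
}

parse_iso_date "2024-01-15"                          # prints: 2024 01 15
parse_iso_date "on 2024-01-15" || echo "no match"    # prints: no match
```

Storing the pattern in a variable keeps the regex unquoted on the right-hand side of `=~` while avoiding shell parsing surprises with special characters.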

Common Mistakes¶

Anchoring matters more than you think

grep "error" log.txt matches any line containing "error" anywhere, including inside longer words such as "directory_errors". Use grep -w "error" for whole-word matching, or grep "\berror\b" with GNU grep (\b is a GNU extension).
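The difference is easy to check by counting matching lines with and without -w. This is a small sketch on made-up log lines. The variable names are illustrative only.

```shell
lines='disk error on sda
directory_errors=3
no problems'

# Substring match: counts both the real error and "directory_errors"
anywhere=$(printf '%s\n' "$lines" | grep -c 'error')

# Whole-word match: "error" must not touch letters, digits, or underscores
whole_word=$(printf '%s\n' "$lines" | grep -cw 'error')

echo "substring matches: $anywhere, whole-word matches: $whole_word"
# substring matches: 2, whole-word matches: 1
```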

Regex in [[ =~ ]] must not be quoted on the right side

[[ "$str" =~ "pattern" ]] treats the pattern as a literal string, not a regex. The pattern must be unquoted or stored in a variable: pattern="^[0-9]+$"; [[ "$str" =~ $pattern ]].
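The effect of quoting can be observed directly. The sketch below tests the same string against a quoted and an unquoted pattern. It requires bash, and the variable names are illustrative only.

```shell
str="12345"
pattern='^[0-9]+$'

# Quoted right-hand side: compared as a literal string, so "12345" does not match
if [[ "$str" =~ '^[0-9]+$' ]]; then quoted="match"; else quoted="no match"; fi

# Unquoted variable: interpreted as a regex
if [[ "$str" =~ $pattern ]]; then unquoted="match"; else unquoted="no match"; fi

echo "quoted: $quoted, unquoted: $unquoted"
# quoted: no match, unquoted: match
```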

Practice Exercises¶

Main (write a short script)¶

Create ~/scripts/validate_config.sh that reads a config file (format KEY=value) and validates each line:

#!/usr/bin/env bash
set -euo pipefail

FILE="${1:?Usage: $0 <config_file>}"
errors=0

while IFS= read -r line || [[ -n "$line" ]]; do
    [[ "$line" =~ ^[[:space:]]*# ]] && continue   # skip comments
    [[ -z "$line" ]] && continue                   # skip blank lines
    if ! [[ "$line" =~ ^[A-Z_][A-Z0-9_]*=.+ ]]; then
        echo "Invalid line: $line" >&2
        (( errors++ )) || true
    fi
done < "$FILE"

echo "Validation complete. Errors: $errors"
(( errors == 0 ))

Stretch¶

Use grep -P (Perl-compatible regex) to find all lines in a log file where the response time exceeds 1000ms. What pattern would you use?
Rewrite grep -E "(error|warning)" log.txt using awk pattern matching.

Interview Questions¶

What is the difference between BRE and ERE in grep?

BRE (Basic Regular Expressions, used by plain grep) treats (, ), {, and } as literal characters unless they are escaped with a backslash, and has no standard +, ?, or |. GNU grep accepts \+, \?, and \| in BRE as extensions. ERE (Extended Regular Expressions, used by grep -E; egrep is the older, deprecated spelling) treats all of these as special without backslashes. ERE is cleaner for complex patterns. For example, BRE: grep '\(foo\|bar\)'; ERE: grep -E '(foo|bar)'.

What does grep -o do?

-o (only-matching) prints only the matched portion of each line, one match per output line, instead of the full line. It is useful for extracting specific data: grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' log.txt extracts every string that looks like an IPv4 address, one per line.

How do you use regex capture groups in bash?

Use [[ string =~ pattern ]] with capturing groups in the pattern. After a successful match, ${BASH_REMATCH[0]} is the full match, ${BASH_REMATCH[1]} is the first group, etc. The pattern must not be quoted. Example: [[ "2024-01-15" =~ ([0-9]{4})-([0-9]{2}) ]] && echo "${BASH_REMATCH[1]}" prints 2024.
