
Regular Expressions

A regular expression is a search pattern. The core syntax carries over to grep, sed, awk, Python, JavaScript, and most other text-processing tools, although each one uses a slightly different dialect.

Learning Objectives

  • Understand the difference between BRE and ERE
  • Write patterns using character classes, quantifiers, and anchors
  • Use grep -E (ERE) for modern regex syntax
  • Use sed with regex for substitution
  • Use awk regex patterns for field-based filtering
  • Match patterns in bash with =~

BRE vs ERE

Feature             BRE (Basic)                  ERE (Extended)
Tool default        grep, sed                    grep -E, sed -E, awk
Grouping            \( \)                        ( )
Alternation         not standard (GNU: \|)       |
One or more         \+ (GNU extension)           +
Zero or one         \? (GNU extension)           ?
Grouping with |     \(a\|b\) (GNU extension)     (a|b)

Use ERE (grep -E) for all new patterns

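ERE is cleaner, more readable, and closer to the syntax of other tools, so prefer grep -E over plain grep. grep -P (Perl-compatible regex) is a separate dialect with extra features such as lookarounds; it is a GNU grep option and is not available in every grep.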
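
To make the difference concrete, here is a minimal sketch that runs the same pattern (two consecutive copies of "ab") once as BRE and once as ERE. The sample lines and the helper name bre_ere_sample are made up for illustration; both spellings use only POSIX-standard syntax.

```bash
# Hypothetical sample input: two of the three lines contain "abab".
bre_ere_sample() { printf '%s\n' 'abab' 'ab' 'xababx'; }

# BRE: grouping and interval braces need backslashes.
bre_count=$(bre_ere_sample | grep -c '\(ab\)\{2\}')

# ERE: the same pattern without backslashes.
ere_count=$(bre_ere_sample | grep -Ec '(ab){2}')

echo "BRE matched $bre_count lines, ERE matched $ere_count lines"
```

Both commands select the same lines; only the amount of escaping differs.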


Core Regex Syntax

Anchors and Wildcards

^        start of line
$        end of line
.        any single character (except newline)
grep "^root" /etc/passwd          # lines starting with "root"
grep "bash$" /etc/passwd          # lines ending with "bash"
grep "^$" file.txt                # blank lines
grep "^..$" file.txt              # lines with exactly 2 characters
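
The effect of an anchor is easiest to see by running the same word with and without it. The sketch below uses made-up input (the helper anchor_sample is hypothetical) instead of /etc/passwd so the counts are predictable.

```bash
# Hypothetical sample: four lines, one of them empty.
anchor_sample() { printf '%s\n' 'root:x:0:0' 'chroot helper' '' 'ab'; }

starts_with_root=$(anchor_sample | grep -c '^root')  # only "root:x:0:0"
contains_root=$(anchor_sample | grep -c 'root')      # also "chroot helper"
blank_lines=$(anchor_sample | grep -c '^$')          # the empty line
two_chars=$(anchor_sample | grep -c '^..$')          # only "ab"

echo "$starts_with_root $contains_root $blank_lines $two_chars"
```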

Character Classes

[abc]       a, b, or c
[a-z]       any lowercase letter
[A-Z]       any uppercase letter
[0-9]       any digit
[^abc]      NOT a, b, or c
grep -E "[0-9]{4}-[0-9]{2}-[0-9]{2}" log.txt    # dates like 2024-01-15
grep -E "^[A-Za-z_][A-Za-z0-9_]*=" config.txt   # config key=value lines
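
A negated class is often the simplest way to say "everything up to a delimiter". As a rough sketch, the pipeline below keeps only lines that look like key=value and then prints the key with [^=]+. The sample lines and the helper name config_sample are assumptions for illustration, and grep -o is assumed to be available (GNU and BSD grep both provide it).

```bash
# Hypothetical config lines: two valid keys, one key starting with a digit, one comment.
config_sample() { printf '%s\n' 'PORT=8080' '9LIVES=no' 'db_host=localhost' '# comment'; }

# First filter to valid key=value lines, then print everything before the first "=".
valid_keys=$(config_sample \
    | grep -E '^[A-Za-z_][A-Za-z0-9_]*=' \
    | grep -oE '^[^=]+')

echo "$valid_keys"
```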

POSIX Classes (portable)

[:alpha:]   letters
[:digit:]   digits
[:alnum:]   letters and digits
[:space:]   whitespace
[:upper:]   uppercase letters
[:lower:]   lowercase letters
grep -E "[[:upper:]]{3,}" file.txt     # three or more uppercase letters in a row
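
Note that a POSIX class is only valid inside a bracket expression, which is why it is written with double brackets: [[:digit:]], not [:digit:]. Unlike ranges such as [a-z], whose meaning can vary with the locale, these classes are defined by the locale's character types. A small sketch with made-up input (class_sample is a hypothetical helper):

```bash
# Hypothetical sample lines.
class_sample() { printf '%s\n' 'HTTP 200' 'id: 42' 'ok' 'NASA launch'; }

# Runs of three or more uppercase letters, one match per line of output.
upper_runs=$(class_sample | grep -oE '[[:upper:]]{3,}')

# Number of lines containing at least one digit.
digit_lines=$(class_sample | grep -cE '[[:digit:]]')

echo "$upper_runs"
echo "$digit_lines"
```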

Quantifiers

*       zero or more
+       one or more (ERE)
?       zero or one (ERE)
{n}     exactly n
{n,}    n or more
{n,m}   between n and m
grep -E "error{1,3}" log.txt      # "error", "errorr", "errorrr" ({1,3} applies only to the final r)
grep -E "[0-9]+" file.txt         # one or more digits
grep -E "https?" url.txt          # http or https
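
A quantifier binds to the single character or group immediately before it, and it is greedy. The sketch below prints the matched text with grep -o so this is visible; the input and the helper name quant_sample are made up for illustration.

```bash
# Hypothetical sample: "erro" has no trailing r, "errorrrr" has four.
quant_sample() { printf '%s\n' 'erro' 'error' 'errorrrr' 'http://a' 'https://b'; }

# {1,3} applies to the last "r" only: "erro" followed by one to three r's.
r_matches=$(quant_sample | grep -oE 'error{1,3}')

# "?" makes the preceding "s" optional.
scheme_matches=$(quant_sample | grep -oE 'https?')

echo "$r_matches"
echo "$scheme_matches"
```

The line "erro" produces no match, and "errorrrr" yields "errorrr" because the quantifier stops at three.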

Grouping and Alternation

grep -E "(error|warning|critical)" log.txt    # any of these words
grep -E "^(GET|POST|PUT|DELETE) " access.log  # HTTP methods
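
Alternation has the lowest precedence of any operator, so an anchor outside a group applies only to the first alternative. A minimal sketch of the difference, using made-up log lines (method_sample is a hypothetical helper):

```bash
# Hypothetical sample: the third line mentions POST but does not start with it.
method_sample() { printf '%s\n' 'GET /index.html' 'POST /login' 'note: POST failed'; }

# Without a group this reads as: "^GET" OR "POST" anywhere in the line.
ungrouped=$(method_sample | grep -cE '^GET|POST')

# With a group the anchor applies to both alternatives.
grouped=$(method_sample | grep -cE '^(GET|POST) ')

echo "ungrouped: $ungrouped, grouped: $grouped"
```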

Practical Patterns

# IP address (simple)
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log

# Email address
grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt

# ISO date (YYYY-MM-DD)
grep -oE "[0-9]{4}-[0-9]{2}-[0-9]{2}" log.txt

# Lines with exactly 3 tab-separated fields
awk -F'\t' 'NF == 3' data.tsv

# HTTP 4xx errors in access log
grep -E '"[A-Z]+ .+ HTTP/[0-9.]+" 4[0-9]{2}' access.log
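
For field-based filtering, awk can apply a regex to a single field with the ~ operator instead of matching the whole line. The sketch below assumes the common access-log layout in which the request path is field 7 and the status code is field 9 when splitting on whitespace; the sample lines and the helper name access_sample are made up for illustration.

```bash
# Hypothetical access-log lines in the common log format.
access_sample() {
    printf '%s\n' \
        '10.0.0.1 - - [15/Jan/2024:10:00:00 +0000] "GET /ok HTTP/1.1" 200 512' \
        '10.0.0.2 - - [15/Jan/2024:10:00:01 +0000] "GET /missing HTTP/1.1" 404 128' \
        '10.0.0.3 - - [15/Jan/2024:10:00:02 +0000] "POST /admin HTTP/1.1" 403 64'
}

# Print path and status where the status field is a 4xx code.
client_errors=$(access_sample | awk '$9 ~ /^4[0-9][0-9]$/ { print $7, $9 }')

echo "$client_errors"
```

Writing [0-9][0-9] rather than [0-9]{2} is a deliberate choice here, since some older awk implementations do not support interval expressions.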

Regex in sed

# Remove HTML tags
sed 's/<[^>]*>//g' page.html

# Extract text between brackets
sed -n 's/.*\[\([^]]*\)\].*/\1/p' file.txt

# Normalize whitespace (collapse each run to a single space; -E avoids the GNU-only \+)
sed -E 's/[[:space:]]+/ /g' file.txt
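
Capture groups in sed become back-references (\1, \2, ...) in the replacement. As a sketch, the hypothetical helper iso_to_dmy below rewrites ISO dates as DD/MM/YYYY; it assumes a sed that accepts -E (GNU and BSD sed both do) and uses # as the delimiter so the slashes in the replacement need no escaping.

```bash
# Hypothetical helper: reorder YYYY-MM-DD into DD/MM/YYYY on every input line.
iso_to_dmy() {
    sed -E 's#([0-9]{4})-([0-9]{2})-([0-9]{2})#\3/\2/\1#g'
}

converted=$(printf '%s\n' 'released 2024-01-15' | iso_to_dmy)
echo "$converted"   # released 15/01/2024
```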

Regex in bash [[ =~ ]]

if [[ "$input" =~ ^[0-9]+$ ]]; then
    echo "Input is a number"
fi

if [[ "$email" =~ ^[^@]+@[^@]+\.[^@]{2,}$ ]]; then
    echo "Looks like an email"
fi

# Capture groups via BASH_REMATCH
if [[ "$date" =~ ([0-9]{4})-([0-9]{2})-([0-9]{2}) ]]; then
    year="${BASH_REMATCH[1]}"
    month="${BASH_REMATCH[2]}"
    day="${BASH_REMATCH[3]}"
fi
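
The capture example above is unanchored, so it also matches a date embedded in a longer string. One possible way to make it stricter is to anchor the pattern and wrap it in a function; parse_iso_date below is a hypothetical helper, not part of bash, and it checks only the shape of the date, not whether the month and day are in range.

```bash
# Requires bash: [[ =~ ]] and BASH_REMATCH are not available in plain sh.
parse_iso_date() {
    # Store the regex in a variable so it can be used unquoted.
    local iso_re='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
    if [[ "$1" =~ $iso_re ]]; then
        printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        return 1
    fi
}

parsed=$(parse_iso_date "2024-01-15")
echo "$parsed"   # 2024 01 15
```

Calling parse_iso_date "x2024-01-15" returns a non-zero status because of the ^ anchor.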

Common Mistakes

Anchoring matters more than you think

grep "error" log.txt matches lines containing "error" anywhere, including inside longer words such as "directory_errors". Use grep -w "error" for whole-word matching, or grep "\berror\b" where \b is supported (it is a GNU extension).
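
A quick sketch of the difference, using made-up lines (word_sample is a hypothetical helper):

```bash
# Hypothetical sample: only the first line contains "error" as a whole word.
word_sample() { printf '%s\n' 'disk error on sda' 'directory_errors=0' 'no errors'; }

substring_hits=$(word_sample | grep -c 'error')   # matches inside longer words too
word_hits=$(word_sample | grep -cw 'error')       # whole-word matches only

echo "substring: $substring_hits, whole word: $word_hits"
```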

Regex in [[ =~ ]] must not be quoted on the right side

[[ "$str" =~ "pattern" ]] treats the pattern as a literal string, not a regex. The pattern must be unquoted or stored in a variable: pattern="^[0-9]+$"; [[ "$str" =~ $pattern ]].
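
The following sketch shows both forms side by side. The variable names are arbitrary, and the behaviour described assumes bash 3.2 or later with default compatibility settings.

```bash
# Requires bash.
str="12345"
pattern='^[0-9]+$'

# Unquoted: $pattern is interpreted as a regex.
if [[ "$str" =~ $pattern ]]; then unquoted="match"; else unquoted="no match"; fi

# Quoted: the pattern is compared as the literal text ^[0-9]+$.
if [[ "$str" =~ "$pattern" ]]; then quoted="match"; else quoted="no match"; fi

echo "unquoted: $unquoted, quoted: $quoted"
```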


Practice Exercises

Main (write a short script)

Create ~/scripts/validate_config.sh that reads a config file (format KEY=value) and validates each line:

#!/usr/bin/env bash
set -euo pipefail

FILE="${1:?Usage: $0 <config_file>}"
errors=0

while IFS= read -r line || [[ -n "$line" ]]; do
    [[ "$line" =~ ^[[:space:]]*# ]] && continue   # skip comments
    [[ "$line" =~ ^[[:space:]]*$ ]] && continue   # skip blank and whitespace-only lines
    if ! [[ "$line" =~ ^[A-Z_][A-Z0-9_]*=.+ ]]; then
        echo "Invalid line: $line" >&2
        (( errors++ )) || true
    fi
done < "$FILE"

echo "Validation complete. Errors: $errors"
(( errors == 0 ))

Stretch

  1. Use grep -P (Perl-compatible regex) to find all lines in a log file where the response time exceeds 1000ms. What pattern would you use?
  2. Rewrite grep -E "(error|warning)" log.txt using awk pattern matching.

Interview Questions

  1. What is the difference between BRE and ERE in grep?

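BRE (Basic Regular Expressions, the default for plain grep) treats (, ), {, and } as literals unless they are backslash-escaped. The characters +, ?, and | have no special meaning in POSIX BRE, although GNU grep accepts \+, \?, and \| as extensions. ERE (Extended Regular Expressions, used by grep -E; egrep is a deprecated alias) treats all of these as special without backslashes. ERE is cleaner for complex patterns. For example, BRE with GNU grep: grep '\(foo\|bar\)'; ERE: grep -E '(foo|bar)'.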

  2. What does grep -o do?

-o (only-matching) prints only the matched portion of each line, one match per line, instead of the full line. Useful for extracting specific data: grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' log.txt extracts all IP addresses, one per line.

  3. How do you use regex capture groups in bash?

Use [[ string =~ pattern ]] with capturing groups in the pattern. After a successful match, ${BASH_REMATCH[0]} is the full match, ${BASH_REMATCH[1]} is the first group, etc. The pattern must not be quoted. Example: [[ "2024-01-15" =~ ([0-9]{4})-([0-9]{2}) ]] && echo "${BASH_REMATCH[1]}" prints 2024.

