Regular Expressions¶
A regular expression is a search pattern. Learn the core syntax once and it carries over, with small dialect differences, to grep, sed, awk, Python, JavaScript, and most other text-processing tools.
Learning Objectives¶
- Understand the difference between BRE and ERE
- Write patterns using character classes, quantifiers, and anchors
- Use grep -E (ERE) for modern regex syntax
- Use sed with regex for substitution
- Use awk regex patterns for field-based filtering
- Match patterns in bash with =~
BRE vs ERE¶
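| Feature | BRE (Basic) | ERE (Extended) |
|---|---|---|
| Tool default | grep, sed | grep -E, awk |
| Grouping | \( \) | ( ) |
| Alternation | not in POSIX BRE; GNU tools accept a backslash-escaped pipe | unescaped pipe |
| One or more | \+ (GNU extension) | + |
| Zero or one | \? (GNU extension) | ? |

Grouping with alternation, to match a or b: \(a\|b\) in GNU BRE, (a|b) in ERE.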
Use ERE (grep -E) for all new patterns
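ERE is cleaner, more readable, and closer to the regex syntax of most other tools. Prefer grep -E over plain grep. grep -P (Perl-compatible regex) is another option where it is available, but it is a GNU grep feature and is missing from some other implementations.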
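The difference is easiest to see with one pattern written both ways. The following is a minimal sketch, assuming made-up sample lines fed through printf; the variable names are illustrative only. It sticks to grouping and interval braces, which are standard in both dialects.

```shell
# Hypothetical sample input: three short lines.
sample=$(printf '%s\n' 'abcbc' 'abc' 'a(bc){2}')

# BRE: grouping and interval braces need backslashes.
bre_hits=$(printf '%s\n' "$sample" | grep 'a\(bc\)\{2\}')

# ERE: the same metacharacters are written bare.
ere_hits=$(printf '%s\n' "$sample" | grep -E 'a(bc){2}')

# Left unescaped in BRE, ( ) { } are ordinary literal characters.
literal_hits=$(printf '%s\n' "$sample" | grep 'a(bc){2}')

echo "BRE:            $bre_hits"
echo "ERE:            $ere_hits"
echo "BRE, unescaped: $literal_hits"
```

The first two commands select the line abcbc, while the unescaped BRE pattern selects only the line that contains the parentheses and braces verbatim.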
Core Regex Syntax¶
Anchors and Wildcards¶
grep "^root" /etc/passwd # lines starting with "root"
grep "bash$" /etc/passwd # lines ending with "bash"
grep "^$" file.txt # blank lines
grep "^..$" file.txt # lines with exactly 2 characters
Character Classes¶
[abc] a, b, or c
[a-z] any lowercase letter
[A-Z] any uppercase letter
[0-9] any digit
[^abc] NOT a, b, or c
grep -E "[0-9]{4}-[0-9]{2}-[0-9]{2}" log.txt # dates like 2024-01-15
grep -E "^[A-Za-z_][A-Za-z0-9_]*=" config.txt # config key=value lines
POSIX Classes (portable)¶
[:alpha:] letters
[:digit:] digits
[:alnum:] letters and digits
[:space:] whitespace
[:upper:] uppercase letters
[:lower:] lowercase letters
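A POSIX class name only has its special meaning inside a bracket expression, so the usable form is [[:digit:]], not [:digit:]. Written with single brackets, the pattern is either rejected or read as a set of literal characters, depending on the implementation. The sketch below assumes a few made-up input lines; the variable names are hypothetical.

```shell
# Hypothetical sample input.
sample=$(printf '%s\n' 'abc123' 'ABC' '42' 'hello world')

# The outer brackets form the bracket expression; [:digit:] is the class inside it.
only_digits=$(printf '%s\n' "$sample" | grep -E '^[[:digit:]]+$')

# Several classes (or classes plus plain characters) can share one bracket expression.
upper_or_digit=$(printf '%s\n' "$sample" | grep -E '^[[:upper:][:digit:]]+$')

# Lines that contain at least one whitespace character.
has_space=$(printf '%s\n' "$sample" | grep -E '[[:space:]]')

echo "only digits:     $only_digits"
echo "upper or digit:  $upper_or_digit"
echo "has whitespace:  $has_space"
```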
Quantifiers¶
* zero or more
+ one or more (ERE)
? zero or one (ERE)
{n} exactly n
{n,} n or more
{n,m} between n and m
grep -E "error{1,3}" log.txt # "error", "errorr", "errorrr" ({1,3} applies only to the final "r")
grep -E "[0-9]+" file.txt # one or more digits
grep -E "https?" url.txt # http or https
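A quantifier applies only to the single atom immediately before it, which is one character unless parentheses widen the scope. The sketch below is an illustration under assumed sample input; it anchors each pattern with ^ and $ so that the counts reflect whole-line matches.

```shell
# Hypothetical sample input.
sample=$(printf '%s\n' 'erro' 'error' 'errorrr' 'errorerror')

# {1,3} binds to the final "r" only: "erro" followed by one to three "r".
last_char=$(printf '%s\n' "$sample" | grep -cE '^error{1,3}$')

# Parentheses make the whole word the repeated unit.
whole_word=$(printf '%s\n' "$sample" | grep -cE '^(error){2}$')

echo "lines matching error{1,3}: $last_char"
echo "lines matching (error){2}: $whole_word"
```

The first count covers error and errorrr but not erro, because at least one r must follow "erro". The second count covers only errorerror.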
Grouping and Alternation¶
grep -E "(error|warning|critical)" log.txt # any of these words
grep -E "^(GET|POST|PUT|DELETE) " access.log # HTTP methods
Practical Patterns¶
# IP address (simple)
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log
# Email address
grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt
# ISO date (YYYY-MM-DD)
grep -oE "[0-9]{4}-[0-9]{2}-[0-9]{2}" log.txt
# Lines with exactly 3 fields (tab-separated; -F sets the field separator)
awk -F'\t' 'NF == 3' data.tsv
# HTTP 4xx errors in access log
grep -E '"[A-Z]+ .+ HTTP/[0-9.]+" 4[0-9]{2}' access.log
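The simple IP pattern above also accepts strings such as 999.1.1.1, because [0-9]{1,3} does not limit the value of each octet. One possible tightening is sketched below; OCTET, IPV4_PATTERN, and is_ipv4 are hypothetical names, and the choice to reject leading zeros is an assumption rather than a requirement.

```shell
# One octet, 0-255, without leading zeros.
OCTET='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'

# Four octets separated by literal dots, anchored to the whole line.
IPV4_PATTERN="^${OCTET}(\.${OCTET}){3}\$"

# Hypothetical helper: exit status 0 only for a well-formed dotted quad.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq "$IPV4_PATTERN"
}

for candidate in 192.168.1.10 999.1.1.1 1.2.3; do
    if is_ipv4 "$candidate"; then
        echo "$candidate: valid"
    else
        echo "$candidate: invalid"
    fi
done
```

The anchors matter here: without ^ and $, a valid address embedded in a longer string of digits and dots would still count as a match.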
Regex in sed¶
# Remove HTML tags
sed 's/<[^>]*>//g' page.html
# Extract text between brackets
sed -n 's/.*\[\([^]]*\)\].*/\1/p' file.txt
# Normalize whitespace
sed -E 's/[[:space:]]+/ /g' file.txt # -E (ERE) avoids the GNU-only \+ of BRE
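Capture groups become more useful when the replacement rearranges them. The sketch below assumes a sed that accepts -E (current GNU and BSD versions do) and uses a made-up input line; # serves as the s command delimiter so the slashes in the replacement need no escaping.

```shell
# Hypothetical input line.
line='deployed on 2024-01-15 by alice'

# -E switches sed to ERE, so ( ) and { } need no backslashes.
# \1, \2, \3 refer to the three capture groups in order.
reordered=$(printf '%s\n' "$line" |
    sed -E 's#([0-9]{4})-([0-9]{2})-([0-9]{2})#\3/\2/\1#')

echo "$reordered"
```

This rewrites the ISO date as 15/01/2024 and leaves the rest of the line untouched.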
Regex in bash [[ =~ ]]¶
if [[ "$input" =~ ^[0-9]+$ ]]; then
    echo "Input is a number"
fi
if [[ "$email" =~ ^[^@]+@[^@]+\.[^@]{2,}$ ]]; then
    echo "Looks like an email"
fi
# Capture groups via BASH_REMATCH
if [[ "$date" =~ ([0-9]{4})-([0-9]{2})-([0-9]{2}) ]]; then
    year="${BASH_REMATCH[1]}"
    month="${BASH_REMATCH[2]}"
    day="${BASH_REMATCH[3]}"
fi
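The date pattern above is unanchored, so it also matches a date buried inside a longer string. When the whole value must be a date, the same idea can be wrapped in a function with an anchored pattern. This is a sketch only: parse_iso_date is a hypothetical helper name, it requires bash, and it checks the shape of the string, not whether the month or day is in range.

```shell
# Hypothetical helper: print "YYYY MM DD" for a well-formed ISO date, else fail.
parse_iso_date() {
    # Keeping the regex in a variable avoids shell quoting surprises.
    local pattern='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
    [[ "$1" =~ $pattern ]] || return 1
    # BASH_REMATCH[0] is the whole match; groups start at index 1.
    printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
}

if parts=$(parse_iso_date "2024-01-15"); then
    read -r year month day <<< "$parts"
    echo "year=$year month=$month day=$day"
fi

# Anchoring rejects input with extra text around the date.
parse_iso_date "log 2024-01-15" || echo "not a bare ISO date"
```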
Common Mistakes¶
Anchoring matters more than you think
grep "error" log.txt matches lines containing "error" anywhere, including "directory_errors". Use grep "\berror\b" or grep -w "error" for whole-word matching.
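The effect of whole-word matching can be checked by counting matching lines with and without -w. The sample lines below are made up for illustration.

```shell
# Hypothetical sample input: "error" appears in all four lines,
# but as a standalone word only in the first.
sample=$(printf '%s\n' 'error: disk full' 'directory_errors=3' 'no errors here' 'terror alert')

# Substring match: every line contains the letters "error".
substring_count=$(printf '%s\n' "$sample" | grep -c 'error')

# Whole-word match: neighbours must not be letters, digits, or underscores.
word_count=$(printf '%s\n' "$sample" | grep -cw 'error')

echo "substring matches: $substring_count"
echo "whole-word matches: $word_count"
```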
Regex in [[ =~ ]] must not be quoted on the right side
[[ "$str" =~ "pattern" ]] treats the pattern as a literal string, not a regex. The pattern must be unquoted or stored in a variable: pattern="^[0-9]+$"; [[ "$str" =~ $pattern ]].
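The quoting rule can be observed directly by testing the same string against the same pattern twice. This sketch assumes bash 3.2 or later, where a quoted right-hand side is matched as a literal string; the variable names are illustrative.

```shell
str="12345"
pattern='^[0-9]+$'

# Unquoted expansion: the value is interpreted as a regex, so this matches.
if [[ "$str" =~ $pattern ]]; then unquoted="match"; else unquoted="no match"; fi

# Quoted expansion: the value is a literal string, so "12345" would have to
# contain the characters ^[0-9]+$ verbatim.
if [[ "$str" =~ "$pattern" ]]; then quoted="match"; else quoted="no match"; fi

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```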
Practice Exercises¶
Main (write a short script)¶
Create ~/scripts/validate_config.sh that reads a config file (format KEY=value) and validates each line:
#!/usr/bin/env bash
set -euo pipefail
FILE="${1:?Usage: $0 <config_file>}"
errors=0
# "|| [[ -n "$line" ]]" also processes a final line that lacks a trailing newline
while IFS= read -r line || [[ -n "$line" ]]; do
    [[ "$line" =~ ^[[:space:]]*# ]] && continue  # skip comments
    [[ -z "$line" ]] && continue                 # skip blank lines
    if ! [[ "$line" =~ ^[A-Z_][A-Z0-9_]*=.+ ]]; then
        echo "Invalid line: $line" >&2
        (( errors++ )) || true                   # "|| true" keeps set -e from aborting when the old value is 0
    fi
done < "$FILE"
echo "Validation complete. Errors: $errors"
(( errors == 0 ))                                # exit status 0 only when no invalid lines were found
Stretch¶
- Use grep -P (Perl-compatible regex) to find all lines in a log file where the response time exceeds 1000ms. What pattern would you use?
- Rewrite grep -E "(error|warning)" log.txt using awk pattern matching.
Interview Questions¶
- What is the difference between BRE and ERE in grep?
BRE (Basic Regular Expressions, used by plain grep) treats (, ), {, and } as literal characters unless they are backslash-escaped, and has no standard +, ?, or | operators; GNU grep accepts \+, \?, and \| as extensions. ERE (Extended Regular Expressions, used by grep -E or the older egrep) treats all of these as special without backslashes. ERE is cleaner for complex patterns. For example, BRE (GNU): grep '\(foo\|bar\)'; ERE: grep -E '(foo|bar)'.
- What does grep -o do?
-o (only-matching) prints only the matched portion of each line, one match per line, instead of the full line. Useful for extracting specific data: grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' log.txt extracts all IP addresses, one per line.
- How do you use regex capture groups in bash?
Use [[ string =~ pattern ]] with capturing groups in the pattern. After a successful match, ${BASH_REMATCH[0]} is the full match, ${BASH_REMATCH[1]} is the first group, etc. The pattern must not be quoted. Example: [[ "2024-01-15" =~ ([0-9]{4})-([0-9]{2}) ]] && echo "${BASH_REMATCH[1]}" prints 2024.