Text Processing
Raw text is the universal format of Unix — mastering grep, cut, sort, and wc lets you answer questions about data without writing a single line of "real" code.
Learning Objectives
- Search files for patterns with grep
- Count lines, words, and characters with wc
- Extract columns from delimited data with cut
- Sort and deduplicate data with sort and uniq
- Chain these tools together into one-liner pipelines
grep: Search for Patterns
grep searches files (or stdin) for lines matching a pattern and prints them. For example, grep "error" /var/log/syslog prints lines such as:
Jan 15 09:12:03 host kernel: error: device not found
Jan 15 09:14:55 host app[1234]: error: connection refused
grep -i "error" /var/log/syslog # case-insensitive
grep -n "error" /var/log/syslog # show line numbers
grep -c "error" /var/log/syslog # count matching lines only
grep -v "error" /var/log/syslog # invert — lines that do NOT match
grep -r "TODO" ~/scripts/ # recursive search through directory
grep -l "error" /var/log/*.log # print only filenames, not lines
Grep with context
-A N shows N lines after each match, -B N shows N lines before, and -C N shows N lines on both sides. This is useful for seeing what surrounds an error.
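A minimal sketch of the context flags, using a made-up five-line log written to a temporary file (the log contents and variable names are illustrative only):

```shell
# Build a small sample log (hypothetical contents) in a temporary file.
log=$(mktemp)
printf '%s\n' \
  'start backup' \
  'mount disk' \
  'error: disk full' \
  'retry backup' \
  'backup done' > "$log"

# One line before and one line after the match.
context=$(grep -C 1 "error" "$log")
printf '%s\n' "$context"

# The matching line plus the two lines that follow it.
after=$(grep -A 2 "error" "$log")
printf '%s\n' "$after"

rm -f "$log"
```

The first command prints "mount disk", the error line, and "retry backup"; the second prints the error line followed by the last two lines of the log.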
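wc: Word Count
Run without options (for example, wc notes.txt), wc prints three numbers followed by the file name: lines, words, and bytes. Bytes equal characters only in single-byte encodings such as ASCII; use -m to count characters instead.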
wc -l /etc/passwd # lines only — most common use
wc -w notes.txt # words only
wc -c file.bin # bytes only
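As a quick check of what each flag reports, the following sketch counts a two-line sample written to a temporary file. The sample text is made up; reading from stdin is a choice made here so that wc does not print a file name.

```shell
sample=$(mktemp)
printf 'one two\nthree four five\n' > "$sample"

# Redirecting the file into wc keeps the file name out of the output.
lines=$(wc -l < "$sample")
words=$(wc -w < "$sample")
bytes=$(wc -c < "$sample")

# $((...)) drops the leading padding that some wc implementations print.
echo "lines=$((lines)) words=$((words)) bytes=$((bytes))"   # lines=2 words=5 bytes=24

rm -f "$sample"
```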
cut: Extract Fields from Delimited Text
# /etc/passwd fields are colon-separated: username:x:uid:gid:comment:home:shell
cut -d: -f1 /etc/passwd # extract field 1 (username)
cut -d: -f1,7 /etc/passwd # extract fields 1 and 7 (username and shell)
cut -d, -f2 data.csv # field 2 of a simple CSV (cut does not understand quoted commas)
cut -c1-10 file.txt # extract first 10 characters of each line
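To make the field numbering concrete, here is a small sketch that runs cut over three invented records in /etc/passwd format (the accounts are not real). It also shows a field range, -f1-3, which selects fields 1 through 3.

```shell
# Three made-up records in /etc/passwd format (not real accounts).
records='alice:x:1001:1001:Alice:/home/alice:/bin/bash
bob:x:1002:1002:Bob:/home/bob:/bin/zsh
carol:x:1003:1003:Carol:/home/carol:/bin/bash'

# Fields 1 and 7; cut reuses the input delimiter in its output.
user_shell=$(printf '%s\n' "$records" | cut -d: -f1,7)
printf '%s\n' "$user_shell"

# A field range: fields 1 through 3.
first_three=$(printf '%s\n' "$records" | cut -d: -f1-3)
printf '%s\n' "$first_three"
```

The first command prints lines such as alice:/bin/bash, and the second prints lines such as alice:x:1001.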
sort: Sort Lines
sort names.txt # alphabetical
sort -r names.txt # reverse order
sort -n numbers.txt # numeric sort (not lexicographic)
sort -k2 data.txt # sort by a key that starts at field 2 and runs to the end of the line
sort -k2,2 data.txt # sort by field 2 only
sort -t: -k3,3n /etc/passwd # sort passwd by UID (field 3 only, numeric)
sort -u names.txt # sort and remove duplicates
Numeric vs lexicographic sorting
Without -n, sort sorts lexicographically: 10 comes before 9 because 1 < 9. Always use -n when sorting numbers.
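The difference is easy to see on four sample numbers. This sketch joins the sorted output onto one line with tr purely for display:

```shell
# Sort the same four numbers with and without -n.
lexical=$(printf '%s\n' 10 9 100 2 | sort | tr '\n' ' ')
numeric=$(printf '%s\n' 10 9 100 2 | sort -n | tr '\n' ' ')

echo "without -n: $lexical"   # 10 100 2 9
echo "with -n:    $numeric"   # 2 9 10 100
```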
uniq: Remove Duplicate Lines
uniq only removes adjacent duplicates. Always sort first.
sort names.txt | uniq # sort then deduplicate
sort names.txt | uniq -c # count occurrences of each line
sort names.txt | uniq -d # show only lines that appear more than once
sort names.txt | uniq -u # show only lines that appear exactly once
# Find the most common words in a file
tr ' ' '\n' < essay.txt | sort | uniq -c | sort -rn | head -10
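The flags above can be compared side by side on a short, made-up list of names. This is only a sketch; the exact padding in front of the uniq -c counts varies between implementations.

```shell
# A made-up list with bob three times, alice twice, and carol once.
names=$(printf '%s\n' bob alice bob carol alice bob)

# Count each name, most frequent first.
counts=$(printf '%s\n' "$names" | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

# Names that occur more than once: alice and bob.
repeated=$(printf '%s\n' "$names" | sort | uniq -d)
printf '%s\n' "$repeated"

# Names that occur exactly once: carol.
singles=$(printf '%s\n' "$names" | sort | uniq -u)
printf '%s\n' "$singles"
```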
Combining Tools: Pipelines
These tools become powerful when combined with pipes (|):
# Which login shells are in use, and how many accounts use each?
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
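The pipeline above lists each shell together with the number of accounts that use it. If a single number is wanted instead (how many distinct shells there are), one possible variation is to deduplicate with sort -u and count the remaining lines. The sketch below uses invented passwd-style records so the result is predictable:

```shell
# Made-up records in /etc/passwd format so the result is predictable.
records='alice:x:1001:1001::/home/alice:/bin/bash
bob:x:1002:1002::/home/bob:/bin/zsh
carol:x:1003:1003::/home/carol:/bin/bash'

# Per-shell counts, most common first.
per_shell=$(printf '%s\n' "$records" | cut -d: -f7 | sort | uniq -c | sort -rn)
printf '%s\n' "$per_shell"

# Number of distinct shells: deduplicate, then count the remaining lines.
distinct=$(printf '%s\n' "$records" | cut -d: -f7 | sort -u | wc -l)
echo "distinct shells: $((distinct))"   # distinct shells: 2
```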
Common Mistakes
grep without quotes
grep error log.txt works, but grep my error log.txt passes three arguments: grep takes my as the pattern and treats error and log.txt as files to search. Always quote the pattern: grep "my error" log.txt.
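A small sketch of the difference, run inside a temporary directory that holds a made-up log.txt (the file contents are illustrative). The unquoted form also complains on stderr that no file named error exists, which is silenced here.

```shell
dir=$(mktemp -d)
printf '%s\n' 'my error happened' 'another error' 'my success' > "$dir/log.txt"

# Quoted: the whole phrase is one pattern, and one line contains it.
quoted=$(cd "$dir" && grep -c "my error" log.txt)

# Unquoted: "my" is the pattern; "error" and "log.txt" are treated as files.
# grep exits non-zero because "error" does not exist, hence "|| true".
unquoted=$(cd "$dir" && grep -c my error log.txt 2>/dev/null || true)

echo "quoted:   $quoted"     # 1
echo "unquoted: $unquoted"   # log.txt:2

rm -rf "$dir"
```

The unquoted count is 2 because two lines contain "my", which is not what was asked.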
uniq without sort first
uniq only removes consecutive duplicate lines. echo -e "a\nb\na" | uniq outputs all three lines because the two a lines are not adjacent. Always pipe through sort first.
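The same three-line input can be run both ways to confirm this. The sketch uses printf rather than echo -e, since printf interprets \n consistently across shells, and joins the output onto one line with tr for display:

```shell
# The duplicate "a" lines are separated by "b", so uniq alone keeps both.
unsorted=$(printf 'a\nb\na\n' | uniq | tr '\n' ' ')

# Sorting first makes the duplicates adjacent, so uniq can drop one.
sorted=$(printf 'a\nb\na\n' | sort | uniq | tr '\n' ' ')

echo "uniq alone:  $unsorted"   # a b a
echo "sort | uniq: $sorted"     # a b
```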
Practice Exercises
Warm-Up (run and observe)
- Run wc -l /etc/passwd. Then run grep -c "" /etc/passwd. Do they agree? Why?
- Run cut -d: -f1,3 /etc/passwd | sort -t: -k2 -n | head -10. What does this show?
- Run ls /bin | sort | uniq -c | sort -rn | head -5. What does the count tell you?
Main (write a short script)
Create ~/scripts/log_summary.sh:
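#!/usr/bin/env bash
set -euo pipefail
LOGFILE="${1:-/var/log/syslog}"
echo "=== Log Summary: $LOGFILE ==="
echo "Total lines: $(wc -l < "$LOGFILE")"
# grep -c already prints 0 (and exits 1) when nothing matches, so "|| echo 0" would print a second 0
echo "Error lines: $(grep -c -i "error" "$LOGFILE" || true)"
echo "Warning lines: $(grep -c -i "warn" "$LOGFILE" || true)"
echo ""
echo "=== Top 5 sources ==="
# Extract program[pid] tags with -E (portable; -P needs GNU grep); "|| true" keeps pipefail from failing the script when there are none
{ grep -oE '[[:alnum:]_]+\[[0-9]+\]' "$LOGFILE" || true; } | sort | uniq -c | sort -rn | head -5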
Stretch
- Using only cut, sort, and uniq, find all unique file extensions in your ~/Downloads folder.
- Write a one-liner that counts how many lines in /etc/passwd have /bin/bash as their shell.
- The comm command compares two sorted files. Use sort and comm to find lines that appear in file1.txt but not in file2.txt.
Interview Questions
- What is the difference between grep -c and grep | wc -l?
Show answer
grep -c counts matching lines per file and handles multiple files correctly, printing filename:count for each. grep pattern file | wc -l counts the output lines of grep, which gives the same number for a single file but does not handle multiple files as cleanly. grep -c is preferred.
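As an illustration of the multi-file case, this sketch creates two made-up log files in a temporary directory and compares the two approaches (file names and contents are hypothetical):

```shell
dir=$(mktemp -d)
printf 'error one\nok\n' > "$dir/a.log"
printf 'error two\nerror three\n' > "$dir/b.log"

# grep -c reports one count per file.
per_file=$(cd "$dir" && grep -c "error" a.log b.log)
printf '%s\n' "$per_file"

# Piping to wc -l gives a single combined total.
total=$(cd "$dir" && grep "error" a.log b.log | wc -l)
echo "total: $((total))"   # total: 3

rm -rf "$dir"
```

Here grep -c prints a.log:1 and b.log:2, while the wc -l form only reports the sum.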
- Why must you sort before using uniq?
Show answer
uniq only removes adjacent duplicate lines. If duplicates are scattered throughout the file, uniq alone will not catch them. sort brings identical lines together so uniq can remove them.
- What does sort -k2 -t, do?
Show answer
It sorts lines using a comma as the field delimiter (-t,) with a sort key that starts at the second field and runs to the end of the line (-k2). To compare only the second field, write -k2,2. Without -t, fields are separated by blanks (spaces and tabs).
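How far the key extends can change the result. The sketch below sorts three made-up comma-separated rows with -k2 and with -k2,2; in the second case the two rows that tie on field 2 fall back to a whole-line comparison.

```shell
rows='x,b,1
y,a,2
z,b,0'

# Key runs from field 2 to the end of the line: "a,2" < "b,0" < "b,1".
by_k2=$(printf '%s\n' "$rows" | sort -t, -k2 | tr '\n' ' ')

# Key is field 2 only: the two "b" rows tie and are ordered by the whole line.
by_k2_only=$(printf '%s\n' "$rows" | sort -t, -k2,2 | tr '\n' ' ')

echo "-k2:   $by_k2"        # y,a,2 z,b,0 x,b,1
echo "-k2,2: $by_k2_only"   # y,a,2 x,b,1 z,b,0
```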