Text Processing

Raw text is the universal format of Unix — mastering grep, cut, sort, and wc lets you answer questions about data without writing a single line of "real" code.

Learning Objectives

Search files for patterns with grep
Count lines, words, and characters with wc
Extract columns from delimited data with cut
Sort and deduplicate data with sort and uniq
Chain these tools together into one-liner pipelines

`grep` — Search for Patterns

grep searches files (or stdin) for lines matching a pattern and prints them.

grep "error" /var/log/syslog

Jan 15 09:12:03 host kernel: error: device not found
Jan 15 09:14:55 host app[1234]: error: connection refused

grep -i "error" /var/log/syslog    # case-insensitive
grep -n "error" /var/log/syslog    # show line numbers
grep -c "error" /var/log/syslog    # count matching lines only
grep -v "error" /var/log/syslog    # invert — lines that do NOT match
grep -r "TODO" ~/scripts/          # recursive search through directory
grep -l "error" /var/log/*.log     # print only filenames, not lines

# Count errors in a log file
grep -c "ERROR" /var/log/app.log

47

Grep with context

-A N shows N lines after a match, -B N shows N lines before, -C N shows N lines on both sides. Useful for seeing what surrounds an error:

grep -C 3 "segfault" /var/log/syslog
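To try these flags without touching real system logs, the sketch below builds a small throwaway log and runs a few of the options against it. The log entries and the `log` variable are invented for illustration.

```shell
# Build a small sample log; the entries are made up for this example.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 15 09:12:03 host kernel: error: device not found
Jan 15 09:13:10 host app[1234]: started worker
Jan 15 09:14:55 host app[1234]: Error: connection refused
Jan 15 09:15:02 host cron[88]: job finished
EOF

grep -c "error" "$log"       # case-sensitive: only the kernel line matches
grep -ci "error" "$log"      # case-insensitive: "Error:" now matches too
grep -n "refused" "$log"     # prefixes the match with its line number (3)
grep -A 1 "device" "$log"    # the matching line plus the one after it
```

Short flags can be combined, so `-ci` is the same as `-c -i`.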

`wc` — Word Count

wc /etc/passwd

45 90 2543 /etc/passwd

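The three numbers are lines, words, and bytes. The byte count equals the character count only for single-byte encodings such as ASCII; use wc -m to count characters.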

wc -l /etc/passwd      # lines only — most common use
wc -w notes.txt        # words only
wc -c file.bin         # bytes only

# Count the accounts listed in /etc/passwd (one per line)
wc -l /etc/passwd

45 /etc/passwd
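Two details are easy to miss: reading from stdin makes wc omit the filename, and wc -l counts newline characters rather than visible lines. The sketch below shows both on a throwaway file (the `sample` variable and its contents are made up for this example).

```shell
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

wc -l < "$sample"    # 2 lines; no filename is printed when reading stdin
wc -w < "$sample"    # 3 words
wc -c < "$sample"    # 14 bytes, newlines included

# wc -l counts newline characters, so a final line that lacks a trailing
# newline is not counted.
printf 'a\nb' | wc -l    # reports 1, although two lines of text are visible
```

Some implementations pad the number with leading spaces, so strip whitespace before comparing the result in a script.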

`cut` — Extract Fields from Delimited Text

# /etc/passwd fields are colon-separated: username:x:uid:gid:comment:home:shell
cut -d: -f1 /etc/passwd        # extract field 1 (username)
cut -d: -f1,7 /etc/passwd      # extract fields 1 and 7 (username and shell)
cut -d, -f2 data.csv           # CSV, field 2
cut -c1-10 file.txt            # extract first 10 characters of each line

cut -d: -f1 /etc/passwd | head -5

root
daemon
bin
sys
sync
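The field and character selectors can be tried on a small passwd-style file. The three accounts below are hypothetical, as is the `users` variable. Note that cut splits on every delimiter it sees, so it is only safe for CSV data whose fields contain no quoted commas.

```shell
# A passwd-style sample; these accounts are invented for illustration.
users=$(mktemp)
cat > "$users" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice Example:/home/alice:/bin/bash
EOF

cut -d: -f1,7 "$users"    # username and shell, still joined by a colon
cut -d: -f5- "$users"     # field 5 through the end of the line
cut -c1-4 "$users"        # first four characters of each line
```

Selected fields are always printed in their original order: `-f7,1` gives the same output as `-f1,7`.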

`sort` — Sort Lines

sort names.txt                   # alphabetical
sort -r names.txt                # reverse order
sort -n numbers.txt              # numeric sort (not lexicographic)
sort -k2 data.txt                # sort on a key running from field 2 to end of line
sort -k2,2 data.txt              # sort on field 2 only
sort -t: -k3,3n /etc/passwd      # sort passwd by UID (field 3 only, numeric)
sort -u names.txt                # sort and remove duplicates

echo -e "banana\napple\ncherry" | sort

apple
banana
cherry

Numeric vs lexicographic sorting

Without -n, sort compares lines character by character, so 10 comes before 9 because the character 1 sorts before 9. Always use -n when sorting numbers.

echo -e "10\n9\n100" | sort      # wrong: 10, 100, 9
echo -e "10\n9\n100" | sort -n   # correct: 9, 10, 100
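The same distinction applies when the numbers sit in one field of a delimited line. A minimal sketch, assuming a hypothetical file of name:uid pairs; the `n` suffix on the key makes only that field numeric.

```shell
# Hypothetical name:uid pairs, invented for this example.
ids=$(mktemp)
printf '%s\n' 'carol:1000' 'root:0' 'daemon:1' 'bob:200' > "$ids"

sort -t: -k2,2 "$ids"     # character order on field 2: 0, 1, 1000, 200
sort -t: -k2,2n "$ids"    # numeric order on field 2:   0, 1, 200, 1000
```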

`uniq` — Remove Duplicate Lines

uniq only removes adjacent duplicates. Always sort first.

sort names.txt | uniq            # sort then deduplicate
sort names.txt | uniq -c         # count occurrences of each line
sort names.txt | uniq -d         # show only lines that appear more than once
sort names.txt | uniq -u         # show only lines that appear exactly once

# Find the most common words in a file
# (-s squeezes repeated separators so blank lines are not counted as words)
tr -s ' ' '\n' < essay.txt | sort | uniq -c | sort -rn | head -10

42 the
31 a
28 to
19 in
15 and
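The four uniq variants are easiest to compare on the same input. The name list below is made up, and deliberately has no two identical lines next to each other.

```shell
names=$(mktemp)
printf '%s\n' alice bob alice carol bob alice > "$names"

uniq "$names"              # prints all six lines: no duplicates are adjacent
sort "$names" | uniq       # alice, bob, carol
sort "$names" | uniq -c    # 3 alice, 2 bob, 1 carol
sort "$names" | uniq -d    # alice, bob (each appears more than once)
sort "$names" | uniq -u    # carol (appears exactly once)
```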

Combining Tools: Pipelines

These tools become powerful when combined with pipes (|):

# Which login shells are in use, and how many accounts use each?
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn

28 /bin/bash
9 /usr/sbin/nologin
4 /bin/sh
1 /bin/false

# Find the 5 largest files in the current directory
# (tail -n +2 skips the "total" line that ls -l prints first)
ls -lS | tail -n +2 | head -n 5

# Count log lines by hour
grep "Jan 15" /var/log/syslog | cut -d: -f1 | sort | uniq -c
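To see what each stage of the by-hour pipeline contributes, it can be run against a small sample. The log lines below are invented; the point is that cutting at the first colon leaves the date plus the hour, which then becomes the grouping key.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 15 09:12:03 host kernel: error: device not found
Jan 15 09:14:55 host app[1234]: error: connection refused
Jan 15 10:01:17 host app[1234]: retrying
Jan 16 10:05:00 host cron[88]: job finished
EOF

grep "Jan 15" "$log" |    # keep one day's entries
  cut -d: -f1 |           # text before the first colon, e.g. "Jan 15 09"
  sort |                  # group identical hours together
  uniq -c                 # count the lines in each group
```

For this sample the result is two lines for hour 09 and one for hour 10; the Jan 16 entry is filtered out by the first stage.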

Common Mistakes

grep without quotes

grep error log.txt works, but grep my error log.txt passes three arguments: grep treats my as the pattern and both error and log.txt as files to search. Always quote the pattern: grep "my error" log.txt.
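The difference is easy to reproduce on a throwaway file. In this sketch the directory and log.txt are created just for the demonstration, and no file named error exists, which is why the unquoted form would complain.

```shell
dir=$(mktemp -d)
printf 'my error\nanother error\n' > "$dir/log.txt"

# Quoted: one pattern ("my error") and one file.
grep "my error" "$dir/log.txt"

# Unquoted: the pattern is "my"; "error" and "log.txt" are both file operands.
# grep reports that the file "error" cannot be read (hidden here by 2>/dev/null)
# and prefixes the match with its filename because several files were named.
grep my "$dir/error" "$dir/log.txt" 2>/dev/null || true
```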

uniq without sort first

uniq only removes consecutive duplicate lines. echo -e "a\nb\na" | uniq outputs all three lines because the two a lines are not adjacent. Always pipe through sort first.

Practice Exercises

Warm-Up (run and observe)

Run wc -l /etc/passwd. Then run grep -c "" /etc/passwd. Do they agree? Why?
Run cut -d: -f1,3 /etc/passwd | sort -t: -k2 -n | head -10. What does this show?
Run ls /bin | sort | uniq -c | sort -rn | head -5. What does the count tell you?

Main (write a short script)

Create ~/scripts/log_summary.sh:

#!/usr/bin/env bash
set -euo pipefail

LOGFILE="${1:-/var/log/syslog}"

echo "=== Log Summary: $LOGFILE ==="
echo "Total lines: $(wc -l < "$LOGFILE")"
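# grep -c prints 0 itself when nothing matches (while exiting 1),
# so the fallback must not print a second 0.
echo "Error lines: $(grep -c -i "error" "$LOGFILE" || true)"
echo "Warning lines: $(grep -c -i "warn" "$LOGFILE" || true)"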

echo ""
echo "=== Top 5 sources ==="
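# -E avoids -P, which only GNU grep supports; || true keeps pipefail from
# failing the script when no line matches.
grep -oE '[[:alnum:]_]+\[[0-9]+\]' "$LOGFILE" | sort | uniq -c | sort -rn | head -5 || true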
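One detail is worth checking when grep -c sits inside a command substitution: with no matches, grep still prints 0 but exits with status 1, so any fallback that prints its own 0 ends up producing two lines. The sketch below compares two fallbacks on a throwaway file; the variable names are made up for this example.

```shell
f=$(mktemp)
printf 'all good\n' > "$f"

# Fallback that echoes: grep prints "0", exits 1, and echo adds another "0".
bad=$(grep -c "error" "$f" || echo 0)

# Fallback that only resets the exit status: the output stays a single "0".
good=$(grep -c "error" "$f" || true)

printf 'bad:  %s\n' "$bad"     # two lines of output
printf 'good: %s\n' "$good"    # one line of output
```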

Stretch

Using only cut, sort, and uniq, find all unique file extensions in your ~/Downloads folder.
Write a one-liner that counts how many lines in /etc/passwd have /bin/bash as their shell.
The comm command compares two sorted files. Use sort and comm to find lines that appear in file1.txt but not in file2.txt.

Interview Questions

What is the difference between grep -c and grep | wc -l?

grep -c counts matching lines per file; given several files, it prints filename:count for each. grep pattern file | wc -l counts grep's output lines, which gives the same number for a single file but only a combined total for several files. grep -c is preferred, and it also avoids an extra process.
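A quick way to see the two behaviours side by side, using two small files created just for this illustration:

```shell
dir=$(mktemp -d)
printf 'error one\nok\nerror two\n' > "$dir/a.log"
printf 'ok\nerror three\n' > "$dir/b.log"

# One "filename:count" line per file (2 for a.log, 1 for b.log).
grep -c "error" "$dir/a.log" "$dir/b.log"

# A single combined total of matching lines across both files (3).
grep "error" "$dir/a.log" "$dir/b.log" | wc -l
```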

Why must you sort before using uniq?

uniq only removes adjacent duplicate lines. If duplicates are scattered throughout the file, uniq alone will not catch them. sort brings identical lines together so uniq can remove them.

What does sort -k2 -t, do?

It sorts lines using comma (,) as the field delimiter (-t,) with a sort key that starts at the second field and runs to the end of the line (-k2). Use -k2,2 to restrict the key to the second field alone. Without -t, fields are separated by whitespace.
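How far the key extends matters when two lines share the same second field. A minimal sketch on made-up CSV rows; the `csv` variable and its contents are assumptions for this example.

```shell
csv=$(mktemp)
printf '%s\n' 'x,b,2' 'y,a,9' 'z,b,1' > "$csv"

# Key runs from field 2 to the end of the line ("b,2", "a,9", "b,1"),
# so the third field decides between the two "b" rows: y, z, x.
sort -t, -k2 "$csv"

# Key is field 2 only; the tie between the "b" rows falls back to
# comparing the whole line: y, x, z.
sort -t, -k2,2 "$csv"
```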
