
Text Processing

Raw text is the universal format of Unix — mastering grep, cut, sort, and wc lets you answer questions about data without writing a single line of "real" code.

Learning Objectives

  • Search files for patterns with grep
  • Count lines, words, and characters with wc
  • Extract columns from delimited data with cut
  • Sort and deduplicate data with sort and uniq
  • Chain these tools together into one-liner pipelines

grep — Search for Patterns

grep searches files (or stdin) for lines matching a pattern and prints them.

grep "error" /var/log/syslog
Jan 15 09:12:03 host kernel: error: device not found
Jan 15 09:14:55 host app[1234]: error: connection refused

grep -i "error" /var/log/syslog    # case-insensitive
grep -n "error" /var/log/syslog    # show line numbers
grep -c "error" /var/log/syslog    # count matching lines only
grep -v "error" /var/log/syslog    # invert — lines that do NOT match
grep -r "TODO" ~/scripts/          # recursive search through directory
grep -l "error" /var/log/*.log     # print only filenames, not lines

# Count errors in a log file
grep -c "ERROR" /var/log/app.log
47
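To try these flags without touching real logs, the sketch below runs them against a throwaway file. The file contents and variable names are invented for illustration, and it assumes mktemp is available (it is on common Linux and macOS systems).

```shell
# Build a small sample log in a temporary file (contents are made up).
log=$(mktemp)
printf '%s\n' \
  'INFO service started' \
  'ERROR disk full' \
  'info heartbeat' \
  'error: connection refused' > "$log"

matches=$(grep -c "error" "$log")            # case-sensitive: lowercase "error" only
matches_i=$(grep -c -i "error" "$log")       # -i also matches "ERROR"
non_matches=$(grep -c -v -i "error" "$log")  # -v counts the lines that do NOT match
numbered=$(grep -n "error" "$log")           # -n prefixes each match with its line number

echo "case-sensitive:   $matches"
echo "case-insensitive: $matches_i"
echo "not matching:     $non_matches"
echo "with line number: $numbered"

rm -f "$log"
```

Note that the flags combine freely: -c -v -i counts the lines that contain neither "error" nor "ERROR".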

Grep with context

-A N shows N lines after a match, -B N shows N lines before, -C N shows N lines on both sides. Useful for seeing what surrounds an error:

grep -C 3 "segfault" /var/log/syslog
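The three context flags are easiest to compare side by side. This is a small sketch on a made-up five-line file; the -A, -B, and -C options are assumed to be supported by your grep (GNU and BSD grep both provide them).

```shell
# Five-line sample file with a single matching line in the middle.
log=$(mktemp)
printf '%s\n' 'line 1' 'line 2' 'segfault at 0' 'line 4' 'line 5' > "$log"

after=$(grep -A 1 "segfault" "$log")    # the match plus 1 line after it
before=$(grep -B 1 "segfault" "$log")   # 1 line before the match, then the match
around=$(grep -C 1 "segfault" "$log")   # 1 line on each side of the match

printf 'after:\n%s\n\nbefore:\n%s\n\naround:\n%s\n' "$after" "$before" "$around"

rm -f "$log"
```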


wc — Word Count

wc /etc/passwd
  45   90 2543 /etc/passwd

The three numbers are lines, words, and bytes. The byte count equals the character count only for single-byte text; use wc -m to count characters.

wc -l /etc/passwd      # lines only — most common use
wc -w notes.txt        # words only
wc -c file.bin         # bytes only

# Count how many users are in /etc/passwd
wc -l /etc/passwd
45 /etc/passwd
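When wc is given a file name it prints that name after the count; when it reads standard input it prints only the number, which is easier to use in scripts. The sketch below shows this on a made-up two-line file (the variable names are illustrative).

```shell
# Two lines, three words, 14 bytes (newlines included).
f=$(mktemp)
printf 'one two\nthree\n' > "$f"

# Redirecting the file into wc suppresses the file name in the output.
lines=$(wc -l < "$f")
words=$(wc -w < "$f")
bytes=$(wc -c < "$f")

# Some wc implementations pad the number with spaces;
# arithmetic expansion normalizes it to a plain integer.
lines=$((lines)); words=$((words)); bytes=$((bytes))

echo "lines=$lines words=$words bytes=$bytes"

rm -f "$f"
```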


cut — Extract Fields from Delimited Text

# /etc/passwd fields are colon-separated: username:x:uid:gid:comment:home:shell
cut -d: -f1 /etc/passwd        # extract field 1 (username)
cut -d: -f1,7 /etc/passwd      # extract fields 1 and 7 (username and shell)
cut -d, -f2 data.csv           # CSV, field 2
cut -c1-10 file.txt            # extract first 10 characters of each line

cut -d: -f1 /etc/passwd | head -5
root
daemon
bin
sys
sync
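To see what each cut option produces without depending on your own /etc/passwd, here is a sketch on three made-up passwd-style lines. The account names are hypothetical.

```shell
# Three invented lines in /etc/passwd format.
data=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/bash' > "$data"

names=$(cut -d: -f1 "$data")          # field 1 only
name_shell=$(cut -d: -f1,7 "$data")   # fields 1 and 7, joined by the same delimiter
first3=$(cut -c1-3 "$data")           # first three characters of each line

printf '%s\n---\n%s\n---\n%s\n' "$names" "$name_shell" "$first3"

rm -f "$data"
```

When several fields are selected, cut joins them in the output with the input delimiter, so -f1,7 prints lines such as root:/bin/bash.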


sort — Sort Lines

sort names.txt                   # alphabetical
sort -r names.txt                # reverse order
sort -n numbers.txt              # numeric sort (not lexicographic)
sort -k2 data.txt                # sort by field 2 through end of line (-k2,2 for field 2 only)
sort -t: -k3 -n /etc/passwd      # sort passwd by UID (field 3, numeric)
sort -u names.txt                # sort and remove duplicates

echo -e "banana\napple\ncherry" | sort
apple
banana
cherry

Numeric vs lexicographic sorting

Without -n, sort sorts lexicographically: 10 comes before 9 because 1 < 9. Always use -n when sorting numbers.

echo -e "10\n9\n100" | sort      # wrong: 10, 100, 9
echo -e "10\n9\n100" | sort -n   # correct: 9, 10, 100
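The same distinction applies when the numbers sit inside a field. This sketch sorts three made-up passwd-style lines by their third field, with and without -n. LC_ALL=C is set here only to make the ordering independent of your locale; that choice is an assumption of the example, not something the commands above require.

```shell
# Invented name:x:uid lines, deliberately out of order.
data=$(mktemp)
printf '%s\n' 'alice:x:1000' 'root:x:0' 'bob:x:99' > "$data"

# Without -n the UIDs compare as text: "0" < "1000" < "99".
lexical=$(LC_ALL=C sort -t: -k3 "$data" | cut -d: -f1 | paste -s -d ' ' -)

# With -n they compare as numbers: 0 < 99 < 1000.
numeric=$(LC_ALL=C sort -t: -k3 -n "$data" | cut -d: -f1 | paste -s -d ' ' -)

echo "lexical: $lexical"
echo "numeric: $numeric"

rm -f "$data"
```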


uniq — Remove Duplicate Lines

uniq only removes adjacent duplicates. Always sort first.

sort names.txt | uniq            # sort then deduplicate
sort names.txt | uniq -c         # count occurrences of each line
sort names.txt | uniq -d         # show only lines that appear more than once
sort names.txt | uniq -u         # show only lines that appear exactly once

# Find the most common words in a file
tr ' ' '\n' < essay.txt | sort | uniq -c | sort -rn | head -5
     42 the
     31 a
     28 to
     19 in
     15 and
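The uniq flags are easier to tell apart on a tiny input. Below is a sketch using six made-up names; the sed step only strips the padding that uniq -c puts in front of each count, so the result is easy to compare.

```shell
# bob appears 3 times, alice twice, carol once.
names=$(mktemp)
printf '%s\n' bob alice bob carol alice bob > "$names"

counts=$(LC_ALL=C sort "$names" | uniq -c | sort -rn | sed 's/^ *//')
repeated=$(LC_ALL=C sort "$names" | uniq -d)   # lines that occur more than once
once=$(LC_ALL=C sort "$names" | uniq -u)       # lines that occur exactly once

printf 'counts:\n%s\n\nrepeated:\n%s\n\nonce:\n%s\n' "$counts" "$repeated" "$once"

rm -f "$names"
```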


Combining Tools: Pipelines

These tools become powerful when combined with pipes (|):

# Which login shells are in use, and how many accounts use each?
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
     28 /bin/bash
      9 /usr/sbin/nologin
      4 /bin/sh
      1 /bin/false

# Find the 5 largest files in the current directory
# (ls -l prints a "total" line first, so take 6 lines and drop the first)
ls -lS | head -6 | tail -5

# Count log lines by hour
grep "Jan 15" /var/log/syslog | cut -d: -f1 | sort | uniq -c
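A pipeline is easiest to understand by looking at what each stage hands to the next. The sketch below runs the count-by-hour pipeline on four made-up syslog-style lines and keeps the intermediate result; the log contents are invented.

```shell
# Four invented log lines: two at 09:xx and one at 10:xx on Jan 15, one on Jan 16.
log=$(mktemp)
printf '%s\n' \
  'Jan 15 09:12:03 host kernel: error: device not found' \
  'Jan 15 09:14:55 host app[1234]: error: connection refused' \
  'Jan 15 10:01:10 host app[1234]: started' \
  'Jan 16 09:00:00 host cron[77]: job ran' > "$log"

# Stage 1-2: keep Jan 15 only, then cut at the first colon,
# which leaves "Jan 15 HH" (date plus hour).
hours=$(grep "Jan 15" "$log" | cut -d: -f1)

# Stage 3-4: group identical hours and count them
# (sed strips the padding uniq -c adds before the count).
by_hour=$(printf '%s\n' "$hours" | sort | uniq -c | sed 's/^ *//')

printf 'after cut:\n%s\n\nby hour:\n%s\n' "$hours" "$by_hour"

rm -f "$log"
```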

Common Mistakes

grep without quotes

grep error log.txt works, but grep my error log.txt passes three arguments: grep treats my as the pattern and error and log.txt as the files to search. Always quote the pattern: grep "my error" log.txt.
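The difference is easy to reproduce. In this sketch (file contents invented), the unquoted command searches for "my" in two files, one of which does not exist, while the quoted command searches for the full phrase.

```shell
dir=$(mktemp -d)
printf '%s\n' 'my error here' 'my warning here' > "$dir/log.txt"

# Unquoted: the pattern is "my"; "error" and "log.txt" are treated as files.
# The complaint about the missing file "error" goes to stderr, which is discarded here.
unquoted=$(cd "$dir" && grep my error log.txt 2>/dev/null)

# Quoted: the pattern is "my error".
quoted=$(cd "$dir" && grep "my error" log.txt)

printf 'unquoted:\n%s\n\nquoted:\n%s\n' "$unquoted" "$quoted"

rm -rf "$dir"
```

Because the unquoted form names more than one file, grep also prefixes each match with the file name, which is a useful hint that the arguments were split.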

uniq without sort first

uniq only removes consecutive duplicate lines. echo -e "a\nb\na" | uniq outputs all three lines because the two a lines are not adjacent. Always pipe through sort first.
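A quick way to confirm this is to count the surviving lines with and without sort. The sketch uses printf rather than echo -e only because printf behaves the same in every shell.

```shell
# Same three-line input each time: a, b, a.
without_sort=$(printf 'a\nb\na\n' | uniq | wc -l)
with_sort=$(printf 'a\nb\na\n' | sort | uniq | wc -l)
sort_u=$(printf 'a\nb\na\n' | sort -u | wc -l)   # sort -u does both steps at once

# Normalize any padding wc may add.
without_sort=$((without_sort)); with_sort=$((with_sort)); sort_u=$((sort_u))

echo "uniq alone:  $without_sort lines"
echo "sort | uniq: $with_sort lines"
echo "sort -u:     $sort_u lines"
```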


Practice Exercises

Warm-Up (run and observe)

  1. Run wc -l /etc/passwd. Then run grep -c "" /etc/passwd. Do they agree? Why?
  2. Run cut -d: -f1,3 /etc/passwd | sort -t: -k2 -n | head -10. What does this show?
  3. Run ls /bin | sort | uniq -c | sort -rn | head -5. What does the count tell you?

Main (write a short script)

Create ~/scripts/log_summary.sh:

#!/usr/bin/env bash
set -euo pipefail

LOGFILE="${1:-/var/log/syslog}"

echo "=== Log Summary: $LOGFILE ==="
echo "Total lines: $(wc -l < "$LOGFILE")"
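# grep -c prints 0 itself when nothing matches but exits with status 1,
# so "|| echo 0" would print 0 twice; "|| true" only swallows the status.
echo "Error lines: $(grep -c -i "error" "$LOGFILE" || true)"
echo "Warning lines: $(grep -c -i "warn" "$LOGFILE" || true)"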

echo ""
echo "=== Top 5 sources ==="
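# -P (Perl-style regex) requires GNU grep; "|| true" keeps a log with no name[pid] tags from failing the script
grep -oP '\w+\[\d+\]' "$LOGFILE" | sort | uniq -c | sort -rn | head -5 || true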

Stretch

  1. Using only cut, sort, and uniq, find all unique file extensions in your ~/Downloads folder.
  2. Write a one-liner that counts how many lines in /etc/passwd have /bin/bash as their shell.
  3. The comm command compares two sorted files. Use sort and comm to find lines that appear in file1.txt but not in file2.txt.

Interview Questions

  1. What is the difference between grep -c and grep | wc -l?

grep -c counts matching lines per file and handles multiple files correctly, printing filename:count for each. grep pattern file | wc -l counts the output lines of grep, which gives the same number for a single file; with several files it yields only one grand total. grep -c also avoids starting a second process, so it is preferred.
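The multi-file difference can be checked with two small files. This is a sketch with invented file names and contents.

```shell
dir=$(mktemp -d)
printf 'error a\nok\n' > "$dir/one.log"
printf 'error b\nerror c\n' > "$dir/two.log"

# grep -c reports one count per file, as filename:count.
per_file=$(cd "$dir" && grep -c "error" one.log two.log)

# Piping into wc -l collapses everything into a single total.
total=$(cd "$dir" && grep "error" one.log two.log | wc -l)
total=$((total))   # normalize any padding wc may add

printf 'grep -c:\n%s\n\ngrep | wc -l: %s\n' "$per_file" "$total"

rm -rf "$dir"
```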

  2. Why must you sort before using uniq?

uniq only removes adjacent duplicate lines. If duplicates are scattered throughout the file, uniq alone will not catch them. sort brings identical lines together so uniq can remove them.

  3. What does sort -k2 -t, do?

It sorts lines using comma (,) as the field delimiter (-t,) and uses a sort key that starts at the second field (-k2). The key runs from field 2 to the end of the line; write -k2,2 to compare field 2 alone. Without -t, fields are separated by whitespace.
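How far the key extends matters when two lines share the same second field. The sketch below compares -k2 with -k2,2 on three made-up CSV lines; LC_ALL=C is set only to keep the ordering independent of locale.

```shell
# Lines a and b share the same second field ("x").
csv=$(mktemp)
printf '%s\n' 'a,x,2' 'b,x,1' 'c,w,9' > "$csv"

# -k2: key is "field 2 to end of line", so "x,1" sorts before "x,2".
open_key=$(LC_ALL=C sort -t, -k2 "$csv" | paste -s -d ' ' -)

# -k2,2: key is field 2 only; the tie between a and b is then
# broken by comparing the whole line, so "a,x,2" comes first.
closed_key=$(LC_ALL=C sort -t, -k2,2 "$csv" | paste -s -d ' ' -)

echo "-k2:   $open_key"
echo "-k2,2: $closed_key"

rm -f "$csv"
```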

