sort, uniq & wc – Counting & Ordering#
sort#
Common flags#
| Flag | Meaning |
|---|---|
| -n | Numeric sort |
| -r | Reverse order |
| -k N | Sort on field N |
| -k N,M | Sort on fields N through M |
| -t SEP | Field delimiter (default: whitespace) |
| -u | Unique – remove duplicate lines |
| -f | Case-insensitive (fold case) |
| -h | Human-readable sizes (2K, 3M, 1G) |
| -V | Version sort (1.2 < 1.10) |
| -R | Random shuffle |
| -s | Stable sort (preserve input order of equal lines) |
| -c | Check if already sorted; exit 1 if not |
| -m | Merge pre-sorted files (no sort step) |
| -o FILE | Write output to FILE (can be the same as the input) |
| -z | NUL-terminated lines |
Basic sort#
sort file.txt # lexicographic ascending
sort -r file.txt # reverse
sort -u file.txt # unique lines only
sort -f file.txt # case-insensitive
sort -n numbers.txt # numeric
sort -rn numbers.txt # numeric descending
sort -h sizes.txt # human sizes: 1K < 2M < 3G
sort -V versions.txt # version: 1.9 < 1.10 < 2.0
sort -R file.txt # shuffle
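A quick sanity check of the lexicographic-vs-numeric pitfall, using throwaway inline data:

```shell
# Without -n, numbers are compared as strings, so "10" sorts before "2"
printf '10\n2\n1\n' | sort      # 1, 10, 2
printf '10\n2\n1\n' | sort -n   # 1, 2, 10
```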
Multi-key sort#
# Sort by field 2 numerically, then field 1 lexicographically
sort -t, -k2,2n -k1,1 data.csv
# Sort by field 3 descending, field 1 ascending
sort -k3,3rn -k1,1 data.txt
# Sort CSV by 4th column (numeric) descending
sort -t, -k4,4rn report.csv
# Sort by month name
sort -M months.txt # Jan < Feb < ... < Dec
# Sort /etc/passwd by username (first colon-delimited field)
sort -t: -k1,1 /etc/passwd
# Sort IP addresses correctly (4-field numeric)
sort -t. -k1,1n -k2,2n -k3,3n -k4,4n ips.txt
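To see multi-key sorting in action, a sketch with made-up name,score pairs: primary key is field 2 numeric descending, ties broken by field 1 ascending:

```shell
# bob,12 first (highest score); ann,7 before cy,7 (tie broken by name)
printf 'ann,7\nbob,12\ncy,7\n' | sort -t, -k2,2rn -k1,1
```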
Sort by partial field#
# -k START.CHAR,END.CHAR
sort -k1.3,1.5 file # characters 3β5 of field 1
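For example, with hypothetical IDs whose numeric suffix starts at character 3, sorting on characters 3–5 orders by that suffix:

```shell
# Compare only characters 3-5 of field 1 ("300", "100", "200")
printf 'id300\nid100\nid200\n' | sort -k1.3,1.5
# id100, id200, id300
```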
In-place sort#
sort -o file.txt file.txt # overwrite in-place
sort file.txt | sponge file.txt # with moreutils
uniq#
uniq collapses consecutive duplicate lines, so the input must be sorted first for a full deduplication.
Common flags#
| Flag | Meaning |
|---|---|
| -c | Prefix each line with its occurrence count |
| -d | Print only duplicate lines (once each) |
| -D | Print all copies of duplicate lines |
| -u | Print only unique (non-repeated) lines |
| -i | Case-insensitive comparison |
| -f N | Skip the first N fields |
| -s N | Skip the first N characters |
| -w N | Compare only the first N characters |
sort file.txt | uniq # deduplicate
sort file.txt | uniq -c # count occurrences
sort file.txt | uniq -cd # count + only duplicates
sort file.txt | uniq -u # lines appearing exactly once
sort -f file.txt | uniq -i # case-insensitive dedup
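The "sort first" rule matters because uniq only compares adjacent lines; a small made-up example:

```shell
# Unsorted input: duplicates are not adjacent, so nothing collapses
printf 'b\na\nb\na\n' | uniq -c         # each line counted once
# Sorted input: duplicates become adjacent and are counted correctly
printf 'b\na\nb\na\n' | sort | uniq -c  # 2 a, 2 b
```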
Frequency table pattern#
# Most common words in a file
tr -s '[:space:]' '\n' < file.txt | sort | uniq -c | sort -rn | head -20
# Most frequent IPs in access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
# Most common HTTP status codes
awk '{print $9}' access.log | sort | uniq -c | sort -rn
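The same pattern on an inline sample sentence, splitting on non-letters rather than whitespace:

```shell
# "the" appears twice and ranks first; the other words appear once
printf 'the cat and the hat\n' | tr -sc '[:alpha:]' '\n' \
  | sort | uniq -c | sort -rn | head -3
```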
wc – Word Count#
| Flag | Counts |
|---|---|
| -l | Lines |
| -w | Words |
| -c | Bytes |
| -m | Characters (multibyte-aware) |
| -L | Length of the longest line |
wc -l file.txt # line count
wc -w file.txt # word count
wc -c file.txt # byte count
wc file.txt # lines + words + bytes
wc -l *.log # count per file + total
# Count matching lines
grep -c "ERROR" app.log
# Count files in a directory (miscounts names containing newlines)
ls | wc -l
# Length of longest line (useful for column-width decisions)
wc -L report.txt
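A quick check of the line/word distinction on inline input:

```shell
printf 'one two\nthree\n' | wc -l   # 2 lines
printf 'one two\nthree\n' | wc -w   # 3 words
```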
nl – Number Lines#
nl file.txt # number non-empty lines (default)
nl -b a file.txt # number all lines including empty
nl -b p'^[A-Z]' file.txt # number lines matching pattern
nl -v 0 file.txt # start numbering at 0
nl -s '. ' file.txt # custom separator after number
nl -n rz file.txt # right-justified, zero-padded (000001)
nl -n ln file.txt # left-justified
nl -w 3 file.txt # width of line number field
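The formatting flags combine, e.g. zero-padded width-3 numbers with a custom separator:

```shell
printf 'alpha\nbeta\n' | nl -w 3 -n rz -s ': '
# 001: alpha
# 002: beta
```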
Practical pipelines#
# 10 largest entries (files or directories) in the current directory
du -sh * 2>/dev/null | sort -rh | head -10
# Count unique visitors in access log (by IP)
awk '{print $1}' access.log | sort -u | wc -l
# Distribution of response sizes
awk '{print $10}' access.log | grep -v '-' | sort -n | uniq -c
# Find the 5 most recently modified files
ls -lt | grep '^-' | head -5
# Sort a CSV by 3rd column (numeric), keep header
{ head -1 data.csv; tail -n +2 data.csv | sort -t, -k3,3n; }
# Check if a file is already sorted
sort -c file.txt && echo "sorted" || echo "not sorted"
# Merge two pre-sorted files
sort -m sorted1.txt sorted2.txt
# Deduplicate IPs while preserving first-seen order
awk '!seen[$0]++' ips.txt
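A minimal check of the awk dedup idiom, using made-up IPs with a repeat:

```shell
# First occurrence of each line is kept; later repeats are dropped
printf '1.1.1.1\n2.2.2.2\n1.1.1.1\n' | awk '!seen[$0]++'
# 1.1.1.1
# 2.2.2.2
```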
# Rank word frequency across multiple files
cat *.txt | tr '[:upper:]' '[:lower:]' | tr -sc '[:alpha:]' '\n' \
| sort | uniq -c | sort -rn | head -30
# Show only lines that appear in both files
sort file1.txt > /tmp/s1; sort file2.txt > /tmp/s2
comm -12 /tmp/s1 /tmp/s2
# Lines only in file1 (not in file2)
comm -23 <(sort file1.txt) <(sort file2.txt)
comm – Compare Sorted Files#
comm compares two sorted files line by line, outputting three columns.
comm file1.txt file2.txt # col1: only in f1, col2: only in f2, col3: both
comm -12 f1 f2 # only lines in BOTH (suppress cols 1 and 2)
comm -23 f1 f2 # only in f1 (suppress cols 2 and 3)
comm -13 f1 f2 # only in f2
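A small worked example with throwaway set files (the /tmp paths are arbitrary):

```shell
printf 'a\nb\nc\n' > /tmp/set1
printf 'b\nc\nd\n' > /tmp/set2
comm -12 /tmp/set1 /tmp/set2   # in both: b, c
comm -23 /tmp/set1 /tmp/set2   # only in set1: a
```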
[!TIP] The idiom `sort file | uniq -c | sort -rn` (sort → count → sort by count descending) is one of the most useful pipelines for log analysis and data exploration. Appending `| head -20` gives the top 20 most frequent items.