Text Processing

1. grep - Pattern Search

grep searches files (or standard input) for lines that match a pattern and prints the matching lines.

Basic Usage

# Basic search
grep "pattern" file.txt

# Search multiple files
grep "error" *.log

# Recursive directory search
grep -r "TODO" ./src/

Main Options

Option Description
-i Ignore case
-r, -R Recursive search
-n Show line numbers
-v Invert match (exclude pattern)
-c Count matching lines
-l Show only filenames with matches
-L Show filenames without matches
-w Match whole words
-A n Show n lines after match
-B n Show n lines before match
-C n Show n lines before and after match (context)
-E Extended regular expressions
-o Show only matching part

Option Usage Examples

# Case insensitive
grep -i "error" log.txt

# Show line numbers
grep -n "function" script.js

# Lines not matching
grep -v "comment" code.py

# Count matching lines
grep -c "import" *.py

# Show only filenames
grep -l "password" *.conf

# Whole word match
grep -w "log" file.txt    # Matches "log" only, excludes "logging"

# Show context
grep -A 3 "ERROR" app.log    # 3 lines after
grep -B 2 "ERROR" app.log    # 2 lines before
grep -C 2 "ERROR" app.log    # 2 lines before and after

# Recursive search with line numbers
grep -rn "TODO" ./
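
The -L option from the table above has no example yet; a minimal sketch, with *.conf standing in for any set of files:

# List files that do NOT contain the pattern (opposite of -l)
grep -L "timeout" *.conf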

2. Basic Regular Expressions

Pattern Description Examples
. Any single character a.c → abc, adc
* Zero or more of preceding ab*c → ac, abc, abbc
^ Start of line ^Error → Error at line start
$ End of line end$ → end at line end
[ ] Character class [aeiou] → vowels
[^ ] Negated character class [^0-9] → non-digits
\ Escape \. → literal dot
# Line starts with Error
grep "^Error" log.txt

# Line ends with ;
grep ";$" code.c

# 3 characters: a, any, t
grep "a.t" file.txt    # ant, art, act

# Lines starting with digit
grep "^[0-9]" data.txt

# Find empty lines
grep "^$" file.txt

# Comment lines (starting with #)
grep "^#" config.conf

Extended Regular Expressions (-E)

Pattern Description Examples
+ One or more of preceding ab+c → abc, abbc (not ac)
? Zero or one of preceding colou?r → color, colour
| OR cat|dog
( ) Group (ab)+ → ab, abab
{n} Exactly n times a{3} → aaa
{n,m} n to m times a{2,4} → aa, aaa, aaaa
# Use extended regex
grep -E "error|warning|critical" log.txt

# IP address pattern
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log

# Email pattern (simple)
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" file.txt

# Phone number pattern
grep -E "[0-9]{3}-[0-9]{3,4}-[0-9]{4}" contacts.txt

3. cut - Field Extraction

Basic Usage

# Extract field by delimiter
cut -d'delimiter' -ffield_number file

# Extract by character position
cut -cstart-end file

Main Options

Option Description
-d Specify delimiter
-f Field number
-c Character position
# Colon delimiter, field 1 (username)
cut -d':' -f1 /etc/passwd

# Multiple fields
cut -d':' -f1,3,6 /etc/passwd

# Field range
cut -d',' -f2-4 data.csv

# Character position
cut -c1-10 file.txt

# Tab delimiter (default)
cut -f2 file.tsv

Example (/etc/passwd):

cat /etc/passwd | head -3
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
ubuntu:x:1000:1000:Ubuntu:/home/ubuntu:/bin/bash

cut -d':' -f1,6 /etc/passwd | head -3
root:/root
daemon:/usr/sbin
ubuntu:/home/ubuntu
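
As one more sketch, cut feeds naturally into the tools covered next; field 7 of /etc/passwd is the login shell, and sort -u (section 4) collapses the repeats:

# List every login shell in use on the system
cut -d':' -f7 /etc/passwd | sort -u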

4. sort - Sorting

Basic Usage

# Basic sort (alphabetical)
sort file.txt

# Reverse sort
sort -r file.txt

# Numeric sort
sort -n numbers.txt

# Remove duplicates
sort -u file.txt

Main Options

Option Description
-r Reverse order
-n Numeric sort
-k Sort by specific field (key)
-t Specify delimiter
-u Remove duplicates (unique)
-h Human-readable sizes
# Numeric sort
sort -n scores.txt

# Reverse numeric sort
sort -rn scores.txt

# Sort by 2nd field (comma delimiter)
sort -t',' -k2 data.csv

# Numeric sort by 3rd field
sort -t':' -k3 -n /etc/passwd

# Human-readable size sort
du -h | sort -h

# Reverse sort by file size
ls -lh | sort -k5 -hr

5. uniq - Remove Duplicates

uniq only handles consecutive duplicates, so it's usually used with sort.
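
A quick demonstration of why the sort comes first (printf just fabricates three lines of input):

printf "a\nb\na\n" | uniq           # prints a, b, a - the two a's are not adjacent
printf "a\nb\na\n" | sort | uniq    # prints a, b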

# Remove consecutive duplicates
uniq file.txt

# Show count of duplicates
uniq -c file.txt

# Show only duplicated lines
uniq -d file.txt

# Show only unique lines
uniq -u file.txt
# Use with sort
sort file.txt | uniq

# Count duplicates then sort
sort file.txt | uniq -c | sort -rn

# Top 10 most frequent IPs
cat access.log | cut -d' ' -f1 | sort | uniq -c | sort -rn | head -10

6. wc - Count

# Lines, words, bytes
wc file.txt

Output:

  100   500  3000 file.txt
   │     │     │
   │     │     └── Byte count
   │     └── Word count
   └── Line count
# Line count only
wc -l file.txt

# Word count only
wc -w file.txt

# Byte count only
wc -c file.txt

# Multiple files
wc -l *.txt

# With pipe
cat /etc/passwd | wc -l
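
Counting the results of a pipeline is the most common use of wc -l; for a single file and pattern, grep -c gives the same number more directly:

# Count lines mentioning "error" (case insensitive)
grep -i "error" app.log | wc -l     # equivalent to: grep -ic "error" app.log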

7. sed - Stream Editor

sed is a stream editor: it reads input line by line and applies editing commands such as substitution and deletion to each line.

Basic Substitution

# Syntax: s/pattern/replacement/flags
sed 's/old/new/' file.txt        # First occurrence per line
sed 's/old/new/g' file.txt       # All occurrences (global)
sed 's/old/new/gi' file.txt      # Case insensitive
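
A substitution can be tried out on a pipe before touching any file (the sample string here is made up):

echo "the old value" | sed 's/old/new/'    # prints: the new value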

Main Options

Option Description
-i In-place edit (modify file directly)
-e Multiple commands
-n Suppress output
# Modify file directly
sed -i 's/old/new/g' file.txt

# Modify with backup
sed -i.bak 's/old/new/g' file.txt

# Multiple substitutions
sed -e 's/a/A/g' -e 's/b/B/g' file.txt

# Substitute only specific lines
sed '5s/old/new/' file.txt       # Line 5 only
sed '1,10s/old/new/g' file.txt   # Lines 1-10

Line Deletion

# Delete specific lines
sed '5d' file.txt               # Delete line 5
sed '1,5d' file.txt             # Delete lines 1-5
sed '/pattern/d' file.txt       # Delete lines containing pattern

# Delete empty lines
sed '/^$/d' file.txt

# Delete comment lines
sed '/^#/d' config.conf
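
Several deletions can be combined in a single call by separating the commands with a semicolon:

# Delete comment lines and empty lines in one pass
sed '/^#/d; /^$/d' config.conf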

Line Printing

# Print specific lines
sed -n '5p' file.txt            # Line 5 only
sed -n '1,10p' file.txt         # Lines 1-10
sed -n '/pattern/p' file.txt    # Lines containing pattern

# With line numbers (each number is printed on its own line above the text)
sed -n '=;p' file.txt

8. awk - Pattern Processing

awk is a small programming language for text processing: it splits each input line into fields and runs pattern/action rules against them.

Basic Structure

awk 'pattern { action }' file
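
Either part may be omitted: with no pattern the action runs on every line, and with no action the matching lines are simply printed (file.txt and log.txt are placeholders):

awk '{print $1}' file.txt     # action only: applied to every line
awk '/error/' log.txt         # pattern only: prints matching lines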

Field Variables

Variable Description
$0 Entire line
$1 First field
$2 Second field
$NF Last field
NR Current line number
NF Number of fields
# Print first field
awk '{print $1}' file.txt

# Multiple fields
awk '{print $1, $3}' file.txt

# Specify delimiter
awk -F':' '{print $1, $6}' /etc/passwd

# Last field
awk '{print $NF}' file.txt

# With line numbers
awk '{print NR, $0}' file.txt
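
NF without the $ is the field count itself, which is handy for spotting malformed records; a sketch against /etc/passwd, which always has 7 colon-separated fields:

# Report any line that does not have exactly 7 fields
awk -F':' 'NF != 7 {print NR": "$0}' /etc/passwd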

Conditional Output

# Conditional filtering
awk '$3 > 100 {print $0}' data.txt

# Pattern matching
awk '/error/ {print $0}' log.txt

# Specific field pattern
awk '$1 ~ /^192/ {print $0}' access.log

# Multiple conditions
awk '$2 > 50 && $3 < 100 {print $1}' data.txt

Calculations

# Sum
awk '{sum += $1} END {print sum}' numbers.txt

# Average
awk '{sum += $1; count++} END {print sum/count}' numbers.txt

# Maximum
awk 'BEGIN {max=0} $1 > max {max=$1} END {print max}' numbers.txt

Formatting

# Formatted output
awk '{printf "%-10s %5d\n", $1, $2}' data.txt

# Add header
awk 'BEGIN {print "Name\tScore"} {print $1"\t"$2}' data.txt
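
A quick way to see the formatting is to pipe in a couple of made-up records:

# %-10s left-aligns the name in 10 columns; %5d right-aligns the score in 5
printf "alice 95\nbob 7\n" | awk '{printf "%-10s %5d\n", $1, $2}'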

9. Pipes and Redirection

Pipe (|)

A pipe passes one command's standard output to the next command's standard input.

# Connect commands
ls -l | grep "\.txt"
cat file.txt | sort | uniq
ps aux | grep nginx | grep -v grep

Output Redirection

Symbol Description
> Redirect to file (overwrite)
>> Append to file
2> Redirect error output
2>&1 Redirect error to standard output
&> Redirect both standard and error (bash)
# Output to file
ls -l > filelist.txt

# Append to file
echo "new line" >> file.txt

# Error only to file
command 2> error.log

# Both output and error
command > output.txt 2>&1
command &> all.log

# Ignore errors
command 2>/dev/null

# Ignore all output
command > /dev/null 2>&1
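
Note that the order of redirections matters: 2>&1 duplicates wherever standard output points at that moment (command is a placeholder, as above):

command > out.log 2>&1    # stdout goes to out.log, then stderr follows it
command 2>&1 > out.log    # stderr still goes to the terminal; only stdout reaches out.log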

Input Redirection

# Input from file
sort < unsorted.txt

# Here Document
cat << EOF
Multiple lines
of text
input
EOF
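
Quoting the delimiter (as the exercises below do with 'EOF') disables variable and command expansion inside the block:

cat << 'EOF'
$HOME is printed literally here, not expanded
EOF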

10. Practice Exercises

Exercise 1: Log Analysis

# Create sample log
cat << 'EOF' > access.log
192.168.1.10 - - [23/Jan/2024:10:15:32] "GET /index.html" 200
192.168.1.20 - - [23/Jan/2024:10:15:33] "GET /api/users" 200
192.168.1.10 - - [23/Jan/2024:10:15:34] "POST /api/login" 401
192.168.1.30 - - [23/Jan/2024:10:15:35] "GET /style.css" 200
192.168.1.10 - - [23/Jan/2024:10:15:36] "GET /api/data" 500
192.168.1.20 - - [23/Jan/2024:10:15:37] "GET /index.html" 200
EOF

# Find errors (4xx, 5xx)
grep -E " [45][0-9]{2}$" access.log

# Requests per IP
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn

# Statistics by status code
awk '{print $NF}' access.log | sort | uniq -c | sort -rn

Exercise 2: Extract User Information

# Regular users only (UID >= 1000)
awk -F':' '$3 >= 1000 {print $1, $6}' /etc/passwd

# User count by shell
cut -d':' -f7 /etc/passwd | sort | uniq -c | sort -rn

# Users in /home
grep "/home/" /etc/passwd | cut -d':' -f1

Exercise 3: Data Transformation

# Create CSV
cat << 'EOF' > data.csv
name,score,grade
Alice,95,A
Bob,82,B
Charlie,78,C
David,91,A
EOF

# Score sum
awk -F',' 'NR>1 {sum+=$2} END {print "Total:", sum}' data.csv

# Average
awk -F',' 'NR>1 {sum+=$2; c++} END {print "Average:", sum/c}' data.csv

# A grade students
awk -F',' '$3=="A" {print $1}' data.csv

# Sort by score descending (with -n the header line falls to the bottom)
sort -t',' -k2 -rn data.csv | head -5

Exercise 4: Text Transformation

# All lowercase to uppercase
cat file.txt | tr 'a-z' 'A-Z'

# Replace specific word
sed 's/error/ERROR/g' log.txt

# Multiple replacements
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt

# Remove empty lines
sed '/^$/d' file.txt

# Squeeze runs of spaces into a single tab
sed 's/  */\t/g' file.txt

Exercise 5: Complex Pipelines

# Top 10 largest files
find /var/log -type f -exec ls -l {} \; 2>/dev/null | \
  sort -k5 -rn | head -10

# Memory usage of specific process
ps aux | grep nginx | grep -v grep | \
  awk '{sum += $6} END {print sum/1024 " MB"}'

# Filter errors from real-time log
tail -f /var/log/syslog | grep --line-buffered -i error

Next Steps

Learn about file permissions and ownership management in 05_Permissions_Ownership.md!
