Text Processing
1. grep - Text Search
grep searches files for lines that match a pattern and prints the matching lines.
Basic Usage
# Basic search
grep "pattern" file.txt
# Search multiple files
grep "error" *.log
# Recursive directory search
grep -r "TODO" ./src/
Main Options
| Option | Description |
|---|---|
| `-i` | Ignore case |
| `-r`, `-R` | Recursive search |
| `-n` | Show line numbers |
| `-v` | Invert match (exclude pattern) |
| `-c` | Count matching lines |
| `-l` | Show only filenames with matches |
| `-L` | Show filenames without matches |
| `-w` | Match whole words |
| `-A n` | Show n lines after match |
| `-B n` | Show n lines before match |
| `-C n` | Show n lines before and after match (context) |
| `-E` | Extended regular expressions |
| `-o` | Show only matching part |
Option Usage Examples
# Case insensitive
grep -i "error" log.txt
# Show line numbers
grep -n "function" script.js
# Lines not matching
grep -v "comment" code.py
# Count matching lines
grep -c "import" *.py
# Show only filenames
grep -l "password" *.conf
# Whole word match
grep -w "log" file.txt # Matches "log" only, excludes "logging"
# Show context
grep -A 3 "ERROR" app.log # 3 lines after
grep -B 2 "ERROR" app.log # 2 lines before
grep -C 2 "ERROR" app.log # 2 lines before and after
# Recursive search with line numbers
grep -rn "TODO" ./
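The options `-o` and `-L` from the table are not exercised above. The following is a minimal sketch, assuming a scratch directory and two hypothetical files (`a.log`, `b.log`) created only for the demonstration:

```shell
# Work in a scratch directory so no real files are touched
tmp=$(mktemp -d)
printf 'ERROR disk full\nINFO ok\nerror again\n' > "$tmp/a.log"
printf 'INFO all good\n' > "$tmp/b.log"

# -o prints only the matching part, one match per line; -i ignores case
matches=$(grep -io 'error' "$tmp/a.log")
echo "$matches"

# -c counts matching lines (not individual matches)
count=$(grep -ic 'error' "$tmp/a.log")
echo "$count"

# -L lists the files that contain no matching line
clean=$(cd "$tmp" && grep -iL 'error' a.log b.log)
echo "$clean"

rm -rf "$tmp"
```

Here `-o` should print `ERROR` and `error` on separate lines, `-c` should report 2, and `-L` should name only `b.log`.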
2. Basic Regular Expressions
| Pattern | Description | Examples |
|---|---|---|
| `.` | Any single character | `a.c` → abc, adc |
| `*` | Zero or more of preceding | `ab*c` → ac, abc, abbc |
| `^` | Start of line | `^Error` → Error at line start |
| `$` | End of line | `end$` → end at line end |
| `[ ]` | Character class | `[aeiou]` → vowels |
| `[^ ]` | Negated character class | `[^0-9]` → non-digits |
| `\` | Escape | `\.` → literal dot |
# Line starts with Error
grep "^Error" log.txt
# Line ends with ;
grep ";$" code.c
# 3 characters: a, any, t
grep "a.t" file.txt # ant, art, act
# Lines starting with digit
grep "^[0-9]" data.txt
# Find empty lines
grep "^$" file.txt
# Comment lines (starting with #)
grep "^#" config.conf
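The difference between `.` and `\.` is easy to miss in practice. The sketch below uses made-up input to compare the two, and also combines two anchored patterns with `-v` to drop empty lines and comment lines in one pass:

```shell
tmp=$(mktemp)
printf 'v1.2\nv1x2\n# note\n\n42 items\n' > "$tmp"

# Unescaped dot matches any character, so both version-like lines match
loose=$(grep -c 'v1.2' "$tmp")

# Escaped dot matches only a literal "."
strict=$(grep -c 'v1\.2' "$tmp")

# Keep lines that are neither empty nor comments
content=$(grep -v -e '^$' -e '^#' "$tmp")

echo "$loose $strict"
echo "$content"
rm -f "$tmp"
```

The loose pattern counts 2 lines and the strict one counts 1, which is why literal dots in file names or version strings should be escaped.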
Extended Regular Expressions (-E)
| Pattern | Description | Examples |
|---|---|---|
| `+` | One or more of preceding | `ab+c` → abc, abbc (not ac) |
| `?` | Zero or one of preceding | `colou?r` → color, colour |
| `\|` | OR | `cat\|dog` → cat, dog |
| `( )` | Group | `(ab)+` → ab, abab |
| `{n}` | Exactly n times | `a{3}` → aaa |
| `{n,m}` | n to m times | `a{2,4}` → aa, aaa, aaaa |
# Use extended regex
grep -E "error|warning|critical" log.txt
# IP address pattern
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log
# Email pattern (simple)
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" file.txt
# Phone number pattern
grep -E "[0-9]{3}-[0-9]{3,4}-[0-9]{4}" contacts.txt
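Quantifiers apply only to the item immediately before them, which is where grouping becomes useful. A small sketch on made-up input, with `^` and `$` added so each pattern has to match the whole line:

```shell
tmp=$(mktemp)
printf 'color\ncolour\ncolouur\nabab\n' > "$tmp"

# ? makes the preceding "u" optional
optional=$(grep -E '^colou?r$' "$tmp")

# {2,} repeats the preceding "u" at least twice
repeated=$(grep -E '^colou{2,}r$' "$tmp")

# ( ) groups "ab" so that + applies to the whole group
grouped=$(grep -E '^(ab)+$' "$tmp")

echo "$optional"
echo "$repeated"
echo "$grouped"
rm -f "$tmp"
```

Without the anchors, the same patterns would also match lines that merely contain a matching substring, as the IP address pattern above does.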
3. cut - Field Extraction
Basic Usage
# Extract field by delimiter
cut -d'delimiter' -ffield_number file
# Extract by character position
cut -cstart-end file
Main Options
| Option | Description |
|---|---|
| `-d` | Specify delimiter |
| `-f` | Field number |
| `-c` | Character position |
# Colon delimiter, field 1 (username)
cut -d':' -f1 /etc/passwd
# Multiple fields
cut -d':' -f1,3,6 /etc/passwd
# Field range
cut -d',' -f2-4 data.csv
# Character position
cut -c1-10 file.txt
# Tab delimiter (default)
cut -f2 file.tsv
Example (/etc/passwd):
cat /etc/passwd | head -3
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
ubuntu:x:1000:1000:Ubuntu:/home/ubuntu:/bin/bash
cut -d':' -f1,6 /etc/passwd | head -3
root:/root
daemon:/usr/sbin
ubuntu:/home/ubuntu
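One detail worth knowing is that cut always prints the selected fields in input order, whatever order they are listed in after `-f`. The sketch below shows this on a single sample line and uses awk as one possible way to reorder fields:

```shell
line='ubuntu:x:1000:1000:Ubuntu:/home/ubuntu:/bin/bash'

# Fields come out in input order, so -f6,1 gives the same result as -f1,6
cut_out=$(printf '%s\n' "$line" | cut -d':' -f6,1)

# awk can emit the fields in any order
awk_out=$(printf '%s\n' "$line" | awk -F':' '{print $6 ":" $1}')

# An open-ended range selects from a field to the end of the line
tail_fields=$(printf '%s\n' "$line" | cut -d':' -f6-)

echo "$cut_out"
echo "$awk_out"
echo "$tail_fields"
```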
4. sort - Sorting
Basic Usage
# Basic sort (alphabetical)
sort file.txt
# Reverse sort
sort -r file.txt
# Numeric sort
sort -n numbers.txt
# Remove duplicates
sort -u file.txt
Main Options
| Option | Description |
|---|---|
| `-r` | Reverse order |
| `-n` | Numeric sort |
| `-k` | Sort by specific field (key) |
| `-t` | Specify delimiter |
| `-u` | Remove duplicates (unique) |
| `-h` | Human-readable sizes |
# Numeric sort
sort -n scores.txt
# Reverse numeric sort
sort -rn scores.txt
# Sort by 2nd field only (comma delimiter); plain -k2 would extend the key to the end of the line
sort -t',' -k2,2 data.csv
# Numeric sort by 3rd field
sort -t':' -k3,3n /etc/passwd
# Human-readable size sort
du -h | sort -h
# Reverse sort by file size
ls -lh | sort -k5 -hr
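Text and numeric comparison give different orders for the same field. The sketch below uses made-up CSV data, restricts the key to field 2 with `-k2,2`, and sets `LC_ALL=C` so the text comparison does not depend on the local collation rules:

```shell
tmp=$(mktemp)
printf 'carol,9\nalice,100\nbob,25\n' > "$tmp"

# Without n, field 2 is compared as text: "100" < "25" < "9"
lexical=$(LC_ALL=C sort -t',' -k2,2 "$tmp")

# With n on the key, field 2 is compared as a number: 9 < 25 < 100
numeric=$(LC_ALL=C sort -t',' -k2,2n "$tmp")

echo "$lexical"
echo "$numeric"
rm -f "$tmp"
```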
5. uniq - Remove Duplicates
uniq only handles consecutive duplicates, so it's usually used with sort.
# Remove consecutive duplicates
uniq file.txt
# Show count of duplicates
uniq -c file.txt
# Show only duplicated lines
uniq -d file.txt
# Show only unique lines
uniq -u file.txt
# Use with sort
sort file.txt | uniq
# Count duplicates then sort
sort file.txt | uniq -c | sort -rn
# Top 10 most frequent IPs
cat access.log | cut -d' ' -f1 | sort | uniq -c | sort -rn | head -10
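To see why the sort step matters, the following sketch runs uniq on the same made-up input with and without sorting. The final awk step is only there to trim the padding that `uniq -c` puts in front of each count:

```shell
tmp=$(mktemp)
printf 'a\nb\na\na\n' > "$tmp"

# Unsorted: the first "a" is not adjacent to the others, so it survives
unsorted=$(uniq "$tmp")

# Sorted first: all "a" lines are adjacent and collapse into one
sorted=$(LC_ALL=C sort "$tmp" | uniq)

# -c prefixes each line with its count
counts=$(LC_ALL=C sort "$tmp" | uniq -c | awk '{print $1, $2}')

echo "$unsorted"
echo "$sorted"
echo "$counts"
rm -f "$tmp"
```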
6. wc - Count
# Lines, words, bytes
wc file.txt
Output:
 100  500 3000 file.txt
  │    │   │
  │    │   └── Byte count
  │    └── Word count
  └── Line count
# Line count only
wc -l file.txt
# Word count only
wc -w file.txt
# Byte count only
wc -c file.txt
# Multiple files
wc -l *.txt
# With pipe
cat /etc/passwd | wc -l
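When the count is needed in a script, reading from standard input keeps the file name out of the output. A minimal sketch on a two-line temporary file:

```shell
tmp=$(mktemp)
printf 'one two\nthree\n' > "$tmp"

# Reading from stdin leaves no file name in the output
lines=$(wc -l < "$tmp")
words=$(wc -w < "$tmp")
bytes=$(wc -c < "$tmp")

# Some implementations pad the number with spaces; arithmetic expansion strips them
lines=$((lines)); words=$((words)); bytes=$((bytes))

echo "$lines $words $bytes"
rm -f "$tmp"
```

The file holds 2 lines, 3 words, and 14 bytes (8 for the first line and 6 for the second, newlines included).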
7. sed - Stream Editor
sed is a stream editor: it reads input line by line, applies editing commands, and writes the result to standard output.
Basic Substitution
# Syntax: s/pattern/replacement/flags
sed 's/old/new/' file.txt # First occurrence per line
sed 's/old/new/g' file.txt # All occurrences (global)
sed 's/old/new/gi' file.txt # All occurrences, case insensitive (the i flag is not part of POSIX sed)
Main Options
| Option | Description |
|---|---|
| `-i` | In-place edit (modify file directly) |
| `-e` | Add an editing command (repeat for multiple commands) |
| `-n` | Suppress automatic printing of each line |
# Modify file directly (GNU sed; BSD/macOS sed requires an explicit suffix argument: -i '')
sed -i 's/old/new/g' file.txt
# Modify with backup
sed -i.bak 's/old/new/g' file.txt
# Multiple substitutions
sed -e 's/a/A/g' -e 's/b/B/g' file.txt
# Substitute only specific lines
sed '5s/old/new/' file.txt # Line 5 only
sed '1,10s/old/new/g' file.txt # Lines 1-10
Line Deletion
# Delete specific lines
sed '5d' file.txt # Delete line 5
sed '1,5d' file.txt # Delete lines 1-5
sed '/pattern/d' file.txt # Delete lines containing pattern
# Delete empty lines
sed '/^$/d' file.txt
# Delete comment lines
sed '/^#/d' config.conf
Line Printing
# Print specific lines
sed -n '5p' file.txt # Line 5 only
sed -n '1,10p' file.txt # Lines 1-10
sed -n '/pattern/p' file.txt # Lines containing pattern
# With line numbers (each number is printed on its own line before the text)
sed -n '=;p' file.txt
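The commands in this section can be combined. The sketch below uses a made-up config file and also shows `|` as the delimiter of the `s` command, which avoids having to escape the slashes in a path:

```shell
tmp=$(mktemp)
printf '# settings\n\npath=/usr/local/bin\nmode=old old\n' > "$tmp"

# Two commands with -e: drop comment lines, then drop empty lines
stripped=$(sed -e '/^#/d' -e '/^$/d' "$tmp")

# With -n and the p flag, only lines where a substitution happened are printed
moved=$(sed -n 's|/usr/local|/opt|p' "$tmp")

# An address limits the substitution; without g only the first match changes
first_only=$(sed -n '/^mode=/s/old/new/p' "$tmp")

echo "$stripped"
echo "$moved"
echo "$first_only"
rm -f "$tmp"
```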
8. awk - Pattern Processing
awk is a small programming language for text processing: it splits each input line into fields and runs every rule whose pattern matches that line.
Basic Structure
awk 'pattern { action }' file
Field Variables
| Variable | Description |
|---|---|
| `$0` | Entire line |
| `$1` | First field |
| `$2` | Second field |
| `$NF` | Last field |
| `NR` | Current line number |
| `NF` | Number of fields |
# Print first field
awk '{print $1}' file.txt
# Multiple fields
awk '{print $1, $3}' file.txt
# Specify delimiter
awk -F':' '{print $1, $6}' /etc/passwd
# Last field
awk '{print $NF}' file.txt
# With line numbers
awk '{print NR, $0}' file.txt
Conditional Output
# Conditional filtering
awk '$3 > 100 {print $0}' data.txt
# Pattern matching
awk '/error/ {print $0}' log.txt
# Specific field pattern
awk '$1 ~ /^192/ {print $0}' access.log
# Multiple conditions
awk '$2 > 50 && $3 < 100 {print $1}' data.txt
Calculations
# Sum
awk '{sum += $1} END {print sum}' numbers.txt
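# Average (prints nothing for empty input instead of dividing by zero)
awk '{sum += $1; count++} END {if (count) print sum/count}' numbers.txt
# Maximum (seeded from the first line so all-negative input works)
awk 'NR == 1 || $1 > max {max = $1} END {print max}' numbers.txt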
Formatting
# Formatted output
awk '{printf "%-10s %5d\n", $1, $2}' data.txt
# Add header
awk 'BEGIN {print "Name\tScore"} {print $1"\t"$2}' data.txt
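The pieces of this section fit together in a single program. The following sketch, using made-up names and scores, prints a formatted table and then a summary line, with `BEGIN` running before any input and `END` running after the last line:

```shell
tmp=$(mktemp)
printf 'alice 90\nbob 72\ncarol 85\n' > "$tmp"

report=$(awk '
  BEGIN { printf "%-6s %5s\n", "Name", "Score" }     # runs once before any input
  { printf "%-6s %5d\n", $1, $2; sum += $2 }         # runs for every line
  NR == 1 || $2 > max { max = $2 }                   # track the largest score
  END { printf "avg=%.1f max=%d\n", sum / NR, max }  # runs once after the last line
' "$tmp")

echo "$report"
rm -f "$tmp"
```

The last line of the report should read `avg=82.3 max=90`, since the scores sum to 247 over 3 lines.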
9. Pipes and Redirection
Pipe (|)
A pipe passes one command's standard output to the next command's standard input.
# Connect commands (the dot is escaped and anchored so only names ending in .txt match)
ls -l | grep '\.txt$'
cat file.txt | sort | uniq
ps aux | grep nginx | grep -v grep
Output Redirection
| Symbol | Description |
|---|---|
| `>` | Redirect standard output to file (overwrite) |
| `>>` | Append standard output to file |
| `2>` | Redirect standard error |
| `2>&1` | Redirect standard error to wherever standard output currently points |
| `&>` | Redirect both standard output and standard error (bash) |
# Output to file
ls -l > filelist.txt
# Append to file
echo "new line" >> file.txt
# Error only to file
command 2> error.log
# Both output and error
command > output.txt 2>&1
command &> all.log
# Ignore errors
command 2>/dev/null
# Ignore all output
command > /dev/null 2>&1
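Redirections are processed left to right, so the position of `2>&1` matters. The sketch below defines a hypothetical function `both` that writes one line to each stream, then compares the two orderings:

```shell
tmp=$(mktemp -d)

# Hypothetical command that writes one line to each stream
both() { echo "to stdout"; echo "to stderr" >&2; }

# stdout goes to the file first, then stderr is pointed at the same place
both > "$tmp/right.log" 2>&1

# Reversed: stderr is pointed at the current stdout (captured in a variable here),
# and only afterwards is stdout sent to the file
leaked=$(both 2>&1 > "$tmp/wrong.log")

right=$(cat "$tmp/right.log")
wrong=$(cat "$tmp/wrong.log")
echo "$leaked"
rm -rf "$tmp"
```

In the reversed form the file receives only the standard output line, and the error line ends up wherever standard output pointed before the file redirection.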
Input Redirection
# Input from file
sort < unsorted.txt
# Here Document
cat << EOF
Multiple lines
of text
input
EOF
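Whether the delimiter is quoted decides if the shell expands the body of a here document. A minimal sketch, with `name` as an illustrative variable:

```shell
name="world"

# Unquoted delimiter: variables and command substitutions are expanded
expanded=$(cat << EOF
hello $name
EOF
)

# Quoted delimiter: the body is taken literally
literal=$(cat << 'EOF'
hello $name
EOF
)

echo "$expanded"
echo "$literal"
```

The quoted form is the safer choice when the body contains `$` or backticks that should reach the file unchanged, such as sample log lines or scripts.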
10. Practice Exercises
Exercise 1: Log Analysis
# Create sample log
cat << 'EOF' > access.log
192.168.1.10 - - [23/Jan/2024:10:15:32] "GET /index.html" 200
192.168.1.20 - - [23/Jan/2024:10:15:33] "GET /api/users" 200
192.168.1.10 - - [23/Jan/2024:10:15:34] "POST /api/login" 401
192.168.1.30 - - [23/Jan/2024:10:15:35] "GET /style.css" 200
192.168.1.10 - - [23/Jan/2024:10:15:36] "GET /api/data" 500
192.168.1.20 - - [23/Jan/2024:10:15:37] "GET /index.html" 200
EOF
# Find errors (4xx, 5xx)
grep -E " [45][0-9]{2}$" access.log
# Requests per IP
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn
# Statistics by status code
awk '{print $NF}' access.log | sort | uniq -c | sort -rn
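As one possible extension of this exercise, the error filter and the per-IP count can be combined to show which clients cause the errors. The sketch recreates a subset of the sample log in a temporary file so it can run on its own:

```shell
tmp=$(mktemp)
cat << 'EOF' > "$tmp"
192.168.1.10 - - [23/Jan/2024:10:15:32] "GET /index.html" 200
192.168.1.10 - - [23/Jan/2024:10:15:34] "POST /api/login" 401
192.168.1.30 - - [23/Jan/2024:10:15:35] "GET /style.css" 200
192.168.1.10 - - [23/Jan/2024:10:15:36] "GET /api/data" 500
EOF

# Keep lines whose last field is a 4xx/5xx status, then count them per client IP
errors_per_ip=$(awk '$NF ~ /^[45][0-9][0-9]$/ {print $1}' "$tmp" \
  | sort | uniq -c | awk '{print $2, $1}')

echo "$errors_per_ip"
rm -f "$tmp"
```

For this subset the result is a single line, `192.168.1.10 2`.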
Exercise 2: Extract User Information
# Regular users only (UID >= 1000)
awk -F':' '$3 >= 1000 {print $1, $6}' /etc/passwd
# User count by shell
cut -d':' -f7 /etc/passwd | sort | uniq -c | sort -rn
# Users in /home
grep "/home/" /etc/passwd | cut -d':' -f1
Exercise 3: Data Transformation
# Create CSV
cat << 'EOF' > data.csv
name,score,grade
Alice,95,A
Bob,82,B
Charlie,78,C
David,91,A
EOF
# Score sum
awk -F',' 'NR>1 {sum+=$2} END {print "Total:", sum}' data.csv
# Average
awk -F',' 'NR>1 {sum+=$2; c++} END {print "Average:", sum/c}' data.csv
# A grade students
awk -F',' '$3=="A" {print $1}' data.csv
# Sort by score descending (skip the header row so it is not sorted with the data)
tail -n +2 data.csv | sort -t',' -k2,2nr | head -5
Exercise 4: Text Transformation
# All lowercase to uppercase
cat file.txt | tr 'a-z' 'A-Z'
# Replace specific word
sed 's/error/ERROR/g' log.txt
# Multiple replacements
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt
# Remove empty lines
sed '/^$/d' file.txt
# Runs of spaces to a single tab (\t in the replacement requires GNU sed)
sed 's/  */\t/g' file.txt
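Because `\t` in a sed replacement is not portable across sed implementations, `tr` is one alternative for the same conversion. A minimal sketch on made-up input, where `-s` squeezes each run of translated characters into one:

```shell
# Translate spaces to tabs, then squeeze repeated tabs into a single tab
converted=$(printf 'a   b c\n' | tr -s ' ' '\t')

# Make the tabs visible by swapping them for a marker character
visible=$(printf '%s\n' "$converted" | tr '\t' '|')

echo "$visible"
```

The visible form should read `a|b|c`, showing that the three spaces collapsed into one tab.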
Exercise 5: Complex Pipelines
# Top 10 largest files
find /var/log -type f -exec ls -l {} \; 2>/dev/null | \
sort -k5 -rn | head -10
# Memory usage of specific process
ps aux | grep nginx | grep -v grep | \
awk '{sum += $6} END {print sum/1024 " MB"}'
# Filter errors from real-time log
tail -f /var/log/syslog | grep --line-buffered -i error
Next Steps
Learn about file permissions and ownership management in 05_Permissions_Ownership.md!