Troubleshooting Guide¶

Learning Objectives¶

Through this document, you will learn:

Systematic problem diagnosis methodology
Resolving boot issues
Diagnosing network, disk, and memory problems
Performance bottleneck analysis

Difficulty: ⭐⭐⭐ (Intermediate-Advanced)

Table of Contents¶

Problem Solving Methodology
Boot Issues
Network Issues
Disk Issues
Memory Issues
Process Issues
Performance Analysis

1. Problem Solving Methodology¶

Systematic Approach¶

┌─────────────────────────────────────────────────────────────┐
│                    Problem Solving Process                   │
│                                                             │
│  1. Problem Definition                                       │
│     └── What are the symptoms?                              │
│     └── When did it start?                                  │
│     └── What changes were made?                             │
│                                                             │
│  2. Information Gathering                                    │
│     └── Check logs                                          │
│     └── Check system status                                 │
│     └── Review configuration                                │
│                                                             │
│  3. Hypothesis Formation                                     │
│     └── List possible causes                                │
│     └── Prioritize                                          │
│                                                             │
│  4. Testing and Verification                                 │
│     └── Test hypothesis                                     │
│     └── Verify results                                      │
│                                                             │
│  5. Resolution and Documentation                             │
│     └── Apply fix                                           │
│     └── Establish prevention measures                       │
│     └── Documentation                                       │
└─────────────────────────────────────────────────────────────┘

Basic Diagnostic Commands¶

# System overview
uptime                    # Uptime, load average
uname -a                  # Kernel version
hostnamectl               # Host information
dmidecode -t system       # Hardware information

# Resource overview
free -h                   # Memory
df -h                     # Disk
top -bn1 | head -20       # Top CPU/memory processes

# Recent logs
journalctl -p err --since "1 hour ago"
dmesg | tail -50

# Service status
systemctl --failed
systemctl status <service>
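
Not every tool above is installed on every system, and a missing binary should not abort a whole snapshot. The following is a minimal sketch under that assumption; `run_if_present` and `snapshot` are hypothetical helper names, not part of any standard tool.

```shell
#!/bin/sh
# run_if_present (hypothetical helper): run a diagnostic command only if its
# binary exists, printing a header line so the output stays readable.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "### $*"
        "$@" 2>&1
    else
        echo "### $1: not installed, skipped"
    fi
}

# snapshot (hypothetical helper): the command list mirrors the overview above.
snapshot() {
    run_if_present uptime
    run_if_present uname -a
    run_if_present free -h
    run_if_present df -h
    run_if_present systemctl --failed
}
```

A possible invocation is `snapshot > /tmp/snapshot.txt`, which keeps a record of the state at the time the problem was reported.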

Log Checking Priority¶

# 1. System logs
journalctl -xe                    # Recent errors
journalctl -b                     # Current boot
journalctl -p err --since today   # Today's errors

# 2. Kernel messages
dmesg --level=err,warn
dmesg -T | tail -100

# 3. Service-specific logs
journalctl -u nginx -f
journalctl -u postgresql --since "1 hour ago"

# 4. Application logs
tail -f /var/log/nginx/error.log
tail -f /var/log/syslog
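
When a log contains thousands of error lines, it often helps to see which messages repeat most. A small sketch, assuming the messages arrive one per line on stdin; `top_messages` is a hypothetical helper name.

```shell
#!/bin/sh
# top_messages (hypothetical helper): read log lines on stdin and print the
# most frequent distinct lines, most common first. Default limit is 10.
top_messages() {
    limit=${1:-10}
    sort | uniq -c | sort -rn | head -n "$limit"
}
```

For example, `journalctl -p err -b -o cat | top_messages 5` would summarize the current boot's errors; `-o cat` prints only the message text, so identical messages group together.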

2. Boot Issues¶

Boot Process¶

BIOS/UEFI → GRUB → Kernel → systemd → Services → Login
    │          │       │        │          │
    └──────────┴───────┴────────┴──────────┴── Failure possible at each stage

GRUB Recovery¶

# Press 'e' in GRUB menu to edit
# Add to kernel line:
linux /vmlinuz... root=... single    # Single user mode
linux /vmlinuz... root=... init=/bin/bash  # Direct shell

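# Reinstall GRUB (in recovery mode)
# Device names are examples: here /dev/sda2 is root and /dev/sda1 is /boot
mount /dev/sda2 /mnt
mount /dev/sda1 /mnt/boot
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt
grub-install /dev/sda     # grub2-install on RHEL/CentOS
update-grub               # grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL/CentOS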
exit
reboot

Filesystem Recovery¶

# Run fsck (on unmounted filesystem)
fsck -y /dev/sda1

# Check root filesystem from live environment
# 1. Boot to recovery mode or Live USB
# 2. Unmount filesystem then check
umount /dev/sda2
fsck -y /dev/sda2

# XFS recovery
xfs_repair /dev/sda2

# ext4 superblock recovery
mke2fs -n /dev/sda2  # Find backup superblock locations
e2fsck -b 32768 /dev/sda2  # Recover with backup superblock
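
Running fsck on a mounted filesystem can corrupt it, so a guard before the repair step is worth having. This is a sketch only: `is_mounted` and `safe_fsck` are hypothetical helpers, the check compares the device name literally against the first column of the mounts table (so a device mounted under another name, such as a /dev/mapper path, is not detected), and `safe_fsck` only prints the command it would run.

```shell
#!/bin/sh
# is_mounted (hypothetical helper): succeed if the device appears in the
# first column of the mounts table (default: /proc/mounts).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# safe_fsck (hypothetical helper): dry run only, nothing is modified.
safe_fsck() {
    if is_mounted "$1" "${2:-/proc/mounts}"; then
        echo "refusing: $1 is mounted"
        return 1
    fi
    echo "would run: fsck -y $1"
}
```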

systemd Boot Issues¶

# Boot analysis
systemd-analyze
systemd-analyze blame
systemd-analyze critical-chain

# Check failed units
systemctl --failed
systemctl reset-failed

# Debug specific service
systemctl status nginx.service
journalctl -u nginx.service -b

# Boot to emergency mode
# Add to kernel line in GRUB:
systemd.unit=emergency.target
# Or
systemd.unit=rescue.target

Password Reset¶

# 1. Press 'e' in GRUB to edit
# 2. Add to end of linux line: init=/bin/bash
# 3. Press Ctrl+X to boot

# Remount root filesystem as read-write
mount -o remount,rw /

# Change password
passwd root
passwd username

# SELinux relabeling (RHEL/CentOS)
touch /.autorelabel

# Reboot
exec /sbin/init
# Or
reboot -f

3. Network Issues¶

Step-by-Step Diagnosis¶

# 1. Interface status
ip link show
ip addr show
ethtool eth0

# 2. Routing table
ip route show
ip route get 8.8.8.8

# 3. DNS verification
cat /etc/resolv.conf
nslookup google.com
dig google.com

# 4. Connectivity test
ping -c 3 8.8.8.8           # IP connectivity
ping -c 3 google.com        # DNS resolution + connectivity
traceroute google.com       # Path tracing

# 5. Port check
ss -tlnp                    # Listening ports
ss -tnp                     # Connected sockets
netstat -anp                # Full status
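
The five steps above work from the lowest layer upward, so the first failing check usually points at the broken layer. A minimal sketch of that idea; `run_layers` and `diagnose_network` are hypothetical helpers, and the targets 8.8.8.8 and google.com are taken from the examples above.

```shell
#!/bin/sh
# run_layers (hypothetical helper): each argument is "label::command".
# Checks run in order and stop at the first failure.
run_layers() {
    for entry in "$@"; do
        label=${entry%%::*}
        cmd=${entry#*::}
        if sh -c "$cmd" >/dev/null 2>&1; then
            echo "ok   $label"
        else
            echo "FAIL $label"
            return 1
        fi
    done
}

# diagnose_network (hypothetical helper): one possible ordering of checks.
diagnose_network() {
    run_layers \
        "default route::ip route show default | grep -q ." \
        "IP connectivity::ping -c 1 -W 2 8.8.8.8" \
        "DNS resolution::getent hosts google.com"
}
```

Stopping at the first failure keeps the output short: if there is no default route, the later checks would fail anyway and add no information.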

Connection Problem Diagnosis¶

# Test specific port connection
nc -zv 192.168.1.100 80
telnet 192.168.1.100 80

# TCP connection status
ss -tn state established
ss -tn state time-wait | wc -l

# Packet capture
tcpdump -i eth0 port 80
tcpdump -i eth0 host 192.168.1.100
tcpdump -i eth0 -w capture.pcap

# MTU issue check
ping -M do -s 1472 192.168.1.1  # 1500 - 28 = 1472
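
The 1472 in the MTU check is the MTU minus header overhead. A small sketch of that arithmetic, assuming headers without options; `icmp_payload` is a hypothetical helper name.

```shell
#!/bin/sh
# icmp_payload (hypothetical helper): largest ICMP echo payload that fits in
# one unfragmented packet for a given MTU.
# IPv4: 20-byte IP header + 8-byte ICMP header   = 28 bytes of overhead.
# IPv6: 40-byte IP header + 8-byte ICMPv6 header = 48 bytes of overhead.
icmp_payload() {
    mtu=$1
    family=${2:-4}
    if [ "$family" = 6 ]; then
        echo $((mtu - 48))
    else
        echo $((mtu - 28))
    fi
}
```

For example, `ping -M do -c 1 -s "$(icmp_payload 1500)" 192.168.1.1` repeats the check above, and `icmp_payload 9000` gives the payload size for a jumbo-frame link.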

Firewall Issues¶

# Check iptables
iptables -L -n -v
iptables -t nat -L -n -v

# Check nftables
nft list ruleset

# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all
firewall-cmd --get-active-zones

# Check UFW (Ubuntu)
ufw status verbose

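# Temporarily disable firewall (for testing only; re-enable afterwards)
systemctl stop firewalld
# Caution: if a chain's default policy is DROP, flushing its rules blocks all
# traffic, including your SSH session. Check policies with `iptables -L -n` first.
iptables -F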

DNS Issues¶

# DNS resolution test
nslookup example.com
dig example.com
host example.com

# Use specific DNS server
nslookup example.com 8.8.8.8
dig @8.8.8.8 example.com

# DNS cache check/flush (systemd-resolved)
resolvectl statistics
resolvectl flush-caches
# Older systemd releases provide the same functions as:
systemd-resolve --statistics
systemd-resolve --flush-caches

# Check /etc/hosts
cat /etc/hosts

# Check nsswitch configuration
cat /etc/nsswitch.conf | grep hosts
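
The order of sources on the `hosts:` line decides whether /etc/hosts or DNS is consulted first. A sketch that prints that order one source per line, skipping bracketed actions such as `[NOTFOUND=return]`; `hosts_lookup_order` is a hypothetical helper name.

```shell
#!/bin/sh
# hosts_lookup_order (hypothetical helper): print the name-resolution sources
# from the hosts: line of nsswitch.conf, in lookup order.
hosts_lookup_order() {
    awk '$1 == "hosts:" {
        for (i = 2; i <= NF; i++)
            if ($i !~ /^\[/) print $i
    }' "${1:-/etc/nsswitch.conf}"
}
```

If `files` comes before `dns`, a stale entry in /etc/hosts wins over the DNS answer, which is a common reason for `dig` and `ping` disagreeing about an address.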

4. Disk Issues¶

Checking Disk Status¶

# Disk usage
df -h
df -i                       # Inode usage

# Partition check
lsblk
fdisk -l
parted -l

# Disk health (SMART)
smartctl -H /dev/sda
smartctl -a /dev/sda

# Disk I/O statistics
iostat -xz 1
iotop -o

Disk Space Issues¶

# Find large files
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null

# Directory-wise usage
du -h --max-depth=1 / 2>/dev/null | sort -hr | head -20
du -sh /var/log/*

# Deleted files still holding space (open file handles)
lsof | grep deleted
lsof +L1

# Log file cleanup
journalctl --vacuum-size=100M
find /var/log -name "*.gz" -mtime +30 -delete

# Inode issue (too many files)
find / -xdev -type d -exec sh -c 'echo "$(find "$1" -maxdepth 1 | wc -l) $1"' _ {} \; | sort -rn | head
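
Both block usage (`df`) and inode usage (`df -i`) can fill up, so the same threshold check applies to each. A minimal sketch, assuming POSIX-format output (`df -P`) and mount points without spaces; `over_threshold` is a hypothetical helper name.

```shell
#!/bin/sh
# over_threshold (hypothetical helper): read `df -P` or `df -Pi` output on
# stdin and print "mountpoint use%" for every filesystem at or above LIMIT.
over_threshold() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}
```

Typical use would be `df -P | over_threshold 90` for space and `df -Pi | over_threshold 90` for inodes.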

Filesystem Recovery¶

# Check read-only mode
mount | grep ' / '

# Remount as read-write
mount -o remount,rw /

# Filesystem errors
dmesg | grep -i "error\|fail\|corrupt"

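# Force fsck on next boot
touch /forcefsck    # Honored by older init systems; may be ignored under systemd
# On systemd-based systems, add fsck.mode=force to the kernel command line instead
reboot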

# Or from recovery mode
fsck -y /dev/sda1

LVM Issues¶

# Check LVM status
pvs; vgs; lvs
pvdisplay; vgdisplay; lvdisplay

# LVM metadata recovery
vgcfgrestore -l vg_name              # List backups
vgcfgrestore -f /etc/lvm/archive/... vg_name

# Activate VG
vgchange -ay vg_name

# Activate LV
lvchange -ay /dev/vg_name/lv_name

5. Memory Issues¶

Checking Memory Status¶

# Memory overview
free -h
cat /proc/meminfo

# Per-process memory
ps aux --sort=-%mem | head -20
top -o %MEM

# Swap usage
swapon --show               # Replaces the deprecated `swapon -s`
cat /proc/swaps

# Detailed memory usage
smem -t -k
pmap -x <PID>

OOM Killer Diagnosis¶

# Check OOM occurrence
dmesg | grep -i "out of memory"
journalctl -k | grep -i "oom"

# Check OOM score
cat /proc/<PID>/oom_score
cat /proc/<PID>/oom_score_adj

# Adjust OOM score (protect)
echo -1000 > /proc/<PID>/oom_score_adj

# Or in systemd service
# [Service]
# OOMScoreAdjust=-500
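
To see which processes the OOM killer actually terminated, the kernel log can be reduced to PID and name. This sketch assumes the kernel message contains the text `Killed process <PID> (<name>)`, which is the usual form but may differ between kernel versions; `oom_victims` is a hypothetical helper name.

```shell
#!/bin/sh
# oom_victims (hypothetical helper): read kernel log lines on stdin and print
# "PID name" for each process the OOM killer terminated.
oom_victims() {
    grep -Eo 'Killed process [0-9]+ \([^)]*\)' |
        sed -E 's/^Killed process ([0-9]+) \((.*)\)$/\1 \2/'
}
```

For example, `journalctl -k | oom_victims` lists the victims recorded in the kernel journal.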

Memory Leak Diagnosis¶

# Track process memory
while true; do
    ps -o pid,vsz,rss,comm -p <PID>
    sleep 60
done

# Valgrind (development environment)
valgrind --leak-check=full ./myapp

# Check USS/PSS with smem
smem -P nginx
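
The tracking loop above runs forever and prints no timestamps, which makes the output hard to graph. A bounded variant might look like the sketch below; it reads `VmRSS` from /proc instead of calling `ps`, and `rss_kb` and `sample_rss` are hypothetical helper names.

```shell
#!/bin/sh
# rss_kb (hypothetical helper): print the resident set size in kB for PID.
# The optional second argument overrides the /proc root.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "${2:-/proc}/$1/status" 2>/dev/null
}

# sample_rss (hypothetical helper): print "epoch-seconds rss-kB" COUNT times,
# INTERVAL seconds apart; return 1 if the process disappears.
sample_rss() {
    pid=$1; count=$2; interval=$3; root=${4:-/proc}
    i=0
    while [ "$i" -lt "$count" ]; do
        rss=$(rss_kb "$pid" "$root")
        [ -n "$rss" ] || return 1
        echo "$(date +%s) $rss"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
}
```

Something like `sample_rss <PID> 60 60 > rss.log` collects one hour of samples. A value that keeps climbing under steady load suggests a leak; a value that rises and then levels off usually does not.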

Cache/Buffer Cleanup¶

# Cache status
cat /proc/meminfo | grep -E "Cached|Buffers|SReclaimable"

# Clear cache (caution in production)
sync
echo 1 > /proc/sys/vm/drop_caches  # Page cache
echo 2 > /proc/sys/vm/drop_caches  # dentries, inodes
echo 3 > /proc/sys/vm/drop_caches  # All

# Swap cleanup (when memory is available)
swapoff -a && swapon -a

6. Process Issues¶

Checking Process Status¶

# Process list
ps aux
ps -ef
ps auxf  # Tree format

# Find specific process
pgrep -a nginx
pidof nginx

# Process status
cat /proc/<PID>/status
cat /proc/<PID>/limits

# Process environment variables
cat /proc/<PID>/environ | tr '\0' '\n'

Zombie/Orphan Processes¶

# Find zombie processes
ps aux | awk '$8 ~ /^Z/'    # STAT can be Z, Z+, Zs, ...

# Find zombie's parent process
ps -o ppid= -p <ZOMBIE_PID>

# Check parent process
grep PPid /proc/<ZOMBIE_PID>/status

# Ask the parent to reap the zombie (SIGCHLD does not terminate the parent)
kill -s CHLD <PARENT_PID>
# If the parent ignores the signal, restart or terminate it; init then reaps the zombie
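
Since a zombie can only be cleared through its parent, listing each zombie together with its parent PID saves a lookup step. A minimal sketch; `zombie_parents` is a hypothetical helper that filters `ps` output supplied on stdin.

```shell
#!/bin/sh
# zombie_parents (hypothetical helper): read `ps -eo pid,ppid,stat,comm`
# output on stdin and print "zombie-pid parent-pid command" for each zombie.
zombie_parents() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}
```

Run it as `ps -eo pid,ppid,stat,comm | zombie_parents`. Many zombies sharing one parent PID point to a parent that is not calling wait() on its children.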

strace/lsof Debugging¶

# Trace system calls
strace -p <PID>
strace -p <PID> -e trace=openat,read,write  # Modern libc opens files via openat
strace -f -p <PID>  # Include child processes

# Check open files
lsof -p <PID>
lsof -c nginx
lsof -i :80
lsof +D /var/log

# Check file descriptors
ls -la /proc/<PID>/fd
cat /proc/<PID>/limits | grep "open files"
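
"Too many open files" errors occur when the descriptor count reaches the soft limit, so it helps to print both numbers together. A sketch assuming the Linux /proc layout; `fd_usage` is a hypothetical helper, and reading another user's process requires root.

```shell
#!/bin/sh
# fd_usage (hypothetical helper): print "open-descriptors/soft-limit" for PID.
# The optional second argument overrides the /proc root.
fd_usage() {
    pid=$1
    root=${2:-/proc}
    used=$(ls "$root/$pid/fd" | wc -l | tr -d ' ')
    soft=$(awk '/^Max open files/ { print $4 }' "$root/$pid/limits")
    echo "$used/$soft"
}
```

For example, `fd_usage "$(pidof -s nginx)"` reports usage for one nginx process. A count that grows steadily toward the limit suggests a descriptor leak rather than a limit that is simply too low.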

Service Issues¶

# Service status
systemctl status nginx

# Service logs
journalctl -u nginx -f
journalctl -u nginx --since "10 minutes ago"

# Service restart (detailed)
systemctl restart nginx
systemctl daemon-reload && systemctl restart nginx

# Check service configuration
systemctl cat nginx
systemctl show nginx

7. Performance Analysis¶

System Overview¶

# Combined status
vmstat 1 10
mpstat -P ALL 1 5
iostat -xz 1 5

# Load average interpretation
# load average: 1.00, 0.75, 0.50
# 1-minute, 5-minute, 15-minute average
# Compare with CPU core count (load equal to the core count ≈ 100% utilization)
# On Linux, tasks in uninterruptible I/O wait also count toward the load
nproc  # CPU core count
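
Dividing the load average by the core count makes values comparable across machines. A small sketch of that calculation; `load_per_core` is a hypothetical helper name.

```shell
#!/bin/sh
# load_per_core (hypothetical helper): divide a load average by the number of
# CPU cores and print the result with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}
```

For example, `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"` gives the 1-minute load per core. Values near or above 1.00 mean runnable or I/O-blocked tasks are queuing.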

CPU Analysis¶

# CPU usage
top -bn1 | head -20
htop

# Per-process CPU
pidstat 1 5
ps aux --sort=-%cpu | head -10

# Detailed CPU info
mpstat -P ALL 1

# Hotspot detection (perf)
perf top
perf record -g -p <PID> -- sleep 30
perf report

I/O Analysis¶

# Disk I/O
iostat -xz 1
iotop -o

# Per-process I/O
pidstat -d 1

# Wait time check
# Look at await (r_await/w_await) and %util in the iostat output;
# svctm is unreliable and absent from newer sysstat releases
# await > 10ms: Slow disk
# %util > 80%: Possible bottleneck

# I/O profiling
blktrace -d /dev/sda -o - | blkparse -i -

Network Analysis¶

# Network statistics
netstat -s
ss -s

# Bandwidth monitoring
iftop -i eth0
nethogs eth0

# Connection status
ss -tn state established | wc -l
ss -tn state time-wait | wc -l

# Packet loss check
netstat -s | grep -i "packet loss\|retrans"
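
Rather than counting one TCP state at a time, all states can be tallied in a single pass. A minimal sketch, assuming the state is the first column as in `ss -tan` output; `count_states` is a hypothetical helper name.

```shell
#!/bin/sh
# count_states (hypothetical helper): read `ss -tan` output on stdin and print
# "count state" for each TCP state, largest count first.
count_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}
```

Run it as `ss -tan | count_states`. A very large TIME-WAIT count usually reflects many short-lived connections, while a growing CLOSE-WAIT count points at an application that is not closing its sockets.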

Bottleneck Analysis Checklist¶

#!/bin/bash
# bottleneck-check.sh

echo "=== System Overview ==="
uptime
echo

echo "=== CPU ==="
mpstat 1 3 | tail -4
echo

echo "=== Memory ==="
free -h
echo

echo "=== Disk I/O ==="
iostat -xz 1 3 | tail -10
echo

echo "=== Network ==="
ss -s
echo

echo "=== Top Processes (CPU) ==="
ps aux --sort=-%cpu | head -6
echo

echo "=== Top Processes (Memory) ==="
ps aux --sort=-%mem | head -6
echo

echo "=== Failed Services ==="
systemctl --failed
echo

echo "=== Recent Errors ==="
journalctl -p err --since "1 hour ago" | tail -20

Performance Baseline¶

#!/bin/bash
# Record normal state (regularly)
DATE=$(date +%Y%m%d-%H%M)
OUTPUT_DIR=/var/log/baseline

mkdir -p "$OUTPUT_DIR"

# System information (collected in parallel for 60 seconds)
vmstat 1 60 > "$OUTPUT_DIR/vmstat-$DATE.log" &
iostat -xz 1 60 > "$OUTPUT_DIR/iostat-$DATE.log" &
mpstat -P ALL 1 60 > "$OUTPUT_DIR/mpstat-$DATE.log" &
sar -n DEV 1 60 > "$OUTPUT_DIR/sar-net-$DATE.log" &

wait

# Snapshots
ps aux > "$OUTPUT_DIR/ps-$DATE.log"
free -h > "$OUTPUT_DIR/memory-$DATE.log"
df -h > "$OUTPUT_DIR/disk-$DATE.log"
ss -s > "$OUTPUT_DIR/network-$DATE.log"

Practice Problems¶

Problem 1: Boot Issue¶

The system boots to emergency mode. Explain the steps to find and resolve the cause.

Problem 2: Disk Space¶

The /var partition is 100% full. Write commands to find the cause and resolve it.

Problem 3: Network Connection¶

The system cannot access external websites. Write a step-by-step diagnostic procedure.

Answers¶

Problem 1 Answer¶

# 1. Check error messages
journalctl -xb
dmesg | grep -i error

# 2. Common causes
# - /etc/fstab errors
# - Filesystem corruption
# - SELinux issues

# 3. Check /etc/fstab
cat /etc/fstab
# Verify UUID and devices are correct

# 4. Check filesystem
fsck -y /dev/sda1

# 5. Fix fstab (if problematic)
# Root is often mounted read-only in emergency mode; remount before editing
mount -o remount,rw /
# Add nofail option or comment out problematic entries

# 6. Reboot
reboot

Problem 2 Answer¶

# 1. Check overall usage
df -h /var

# 2. Find large directories
du -h --max-depth=1 /var | sort -hr | head -10

# 3. Find large files
find /var -type f -size +100M -exec ls -lh {} \;

# 4. Check common causes
du -sh /var/log
du -sh /var/cache
du -sh /var/lib/docker  # If using Docker

# 5. Clean up logs
journalctl --vacuum-size=100M
find /var/log -name "*.gz" -mtime +7 -delete
truncate -s 0 /var/log/large-file.log

# 6. Check deleted file handles
lsof +L1 | grep /var
# Restart service to release handles

Problem 3 Answer¶

# 1. Interface status
ip addr show
ip link show

# 2. Check default gateway
ip route show
ping -c 3 <gateway-ip>

# 3. External IP connectivity
ping -c 3 8.8.8.8

# 4. DNS check (if IP works but domain doesn't)
nslookup google.com
cat /etc/resolv.conf

# 5. Firewall check
iptables -L -n
firewall-cmd --list-all

# 6. Test specific port
nc -zv google.com 443

# 7. Check routing path
traceroute google.com

# Actions based on diagnosis results:
# - No IP: DHCP or manual IP configuration
# - Gateway unreachable: Check network cable/switch
# - External IP unreachable: Check router/firewall
# - Only DNS failing: Fix resolv.conf

Conclusion¶

This document concludes the Linux learning series.

Complete Learning Content:

01-03: Linux Basics
04-08: Intermediate Administration
09-12: Advanced Server Management
13-16: Advanced Topics (systemd, Performance, Containers, Storage)
17-26: Expert Level (Security, Virtualization, Automation, HA, Troubleshooting)

Return to 00_Overview.md to review the complete learning roadmap.