Troubleshooting Guide
Troubleshooting Guide¶
Learning Objectives¶
Through this document, you will learn:
- Systematic problem diagnosis methodology
- Resolving boot issues
- Diagnosing network, disk, and memory problems
- Performance bottleneck analysis
Difficulty: ⭐⭐⭐ (Intermediate-Advanced)
Table of Contents¶
- Problem Solving Methodology
- Boot Issues
- Network Issues
- Disk Issues
- Memory Issues
- Process Issues
- Performance Analysis
1. Problem Solving Methodology¶
Systematic Approach¶
┌─────────────────────────────────────────────────────────────┐
│ Problem Solving Process │
│ │
│ 1. Problem Definition │
│ └── What are the symptoms? │
│ └── When did it start? │
│ └── What changes were made? │
│ │
│ 2. Information Gathering │
│ └── Check logs │
│ └── Check system status │
│ └── Review configuration │
│ │
│ 3. Hypothesis Formation │
│ └── List possible causes │
│ └── Prioritize │
│ │
│ 4. Testing and Verification │
│ └── Test hypothesis │
│ └── Verify results │
│ │
│ 5. Resolution and Documentation │
│ └── Apply fix │
│ └── Establish prevention measures │
│ └── Documentation │
└─────────────────────────────────────────────────────────────┘
Basic Diagnostic Commands¶
# System overview
uptime # Uptime, load average
uname -a # Kernel version
hostnamectl # Host information
dmidecode -t system # Hardware information
# Resource overview
free -h # Memory
df -h # Disk
top -bn1 | head -20 # Top CPU/memory processes
# Recent logs
journalctl -p err -since "1 hour ago"
dmesg | tail -50
# Service status
systemctl --failed
systemctl status <service>
Log Checking Priority¶
# 1. System logs
journalctl -xe # Recent errors
journalctl -b # Current boot
journalctl -p err --since today # Today's errors
# 2. Kernel messages
dmesg --level=err,warn
dmesg -T | tail -100
# 3. Service-specific logs
journalctl -u nginx -f
journalctl -u postgresql --since "1 hour ago"
# 4. Application logs
tail -f /var/log/nginx/error.log
tail -f /var/log/syslog
2. Boot Issues¶
Boot Process¶
BIOS/UEFI → GRUB → Kernel → systemd → Services → Login
│ │ │ │ │
└──────────┴───────┴────────┴──────────┴── Failure possible at each stage
GRUB Recovery¶
# Press 'e' in GRUB menu to edit
# Add to kernel line:
linux /vmlinuz... root=... single # Single user mode
linux /vmlinuz... root=... init=/bin/bash # Direct shell
# Reinstall GRUB (in recovery mode)
mount /dev/sda2 /mnt
mount /dev/sda1 /mnt/boot
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt
grub-install /dev/sda
update-grub
exit
reboot
Filesystem Recovery¶
# Run fsck (on unmounted filesystem)
fsck -y /dev/sda1
# Check root filesystem from live environment
# 1. Boot to recovery mode or Live USB
# 2. Unmount filesystem then check
umount /dev/sda2
fsck -y /dev/sda2
# XFS recovery
xfs_repair /dev/sda2
# ext4 superblock recovery
mke2fs -n /dev/sda2 # Find backup superblock locations
e2fsck -b 32768 /dev/sda2 # Recover with backup superblock
systemd Boot Issues¶
# Boot analysis
systemd-analyze
systemd-analyze blame
systemd-analyze critical-chain
# Check failed units
systemctl --failed
systemctl reset-failed
# Debug specific service
systemctl status nginx.service
journalctl -u nginx.service -b
# Boot to emergency mode
# Add to kernel line in GRUB:
systemd.unit=emergency.target
# Or
systemd.unit=rescue.target
Password Reset¶
# 1. Press 'e' in GRUB to edit
# 2. Add to end of linux line: init=/bin/bash
# 3. Press Ctrl+X to boot
# Remount root filesystem as read-write
mount -o remount,rw /
# Change password
passwd root
passwd username
# SELinux relabeling (RHEL/CentOS)
touch /.autorelabel
# Reboot
exec /sbin/init
# Or
reboot -f
3. Network Issues¶
Step-by-Step Diagnosis¶
# 1. Interface status
ip link show
ip addr show
ethtool eth0
# 2. Routing table
ip route show
ip route get 8.8.8.8
# 3. DNS verification
cat /etc/resolv.conf
nslookup google.com
dig google.com
# 4. Connectivity test
ping -c 3 8.8.8.8 # IP connectivity
ping -c 3 google.com # DNS resolution + connectivity
traceroute google.com # Path tracing
# 5. Port check
ss -tlnp # Listening ports
ss -tnp # Connected sockets
netstat -anp # Full status
Connection Problem Diagnosis¶
# Test specific port connection
nc -zv 192.168.1.100 80
telnet 192.168.1.100 80
# TCP connection status
ss -tn state established
ss -tn state time-wait | wc -l
# Packet capture
tcpdump -i eth0 port 80
tcpdump -i eth0 host 192.168.1.100
tcpdump -i eth0 -w capture.pcap
# MTU issue check
ping -M do -s 1472 192.168.1.1 # 1500 - 28 = 1472
Firewall Issues¶
# Check iptables
iptables -L -n -v
iptables -t nat -L -n -v
# Check nftables
nft list ruleset
# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all
firewall-cmd --get-active-zones
# Check UFW (Ubuntu)
ufw status verbose
# Temporarily disable firewall (for testing)
systemctl stop firewalld
iptables -F
DNS Issues¶
# DNS resolution test
nslookup example.com
dig example.com
host example.com
# Use specific DNS server
nslookup example.com 8.8.8.8
dig @8.8.8.8 example.com
# DNS cache check/flush
systemd-resolve --statistics
systemd-resolve --flush-caches
# Or
resolvectl flush-caches
# Check /etc/hosts
cat /etc/hosts
# Check nsswitch configuration
cat /etc/nsswitch.conf | grep hosts
4. Disk Issues¶
Checking Disk Status¶
# Disk usage
df -h
df -i # Inode usage
# Partition check
lsblk
fdisk -l
parted -l
# Disk health (SMART)
smartctl -H /dev/sda
smartctl -a /dev/sda
# Disk I/O statistics
iostat -xz 1
iotop -o
Disk Space Issues¶
# Find large files
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null
# Directory-wise usage
du -h --max-depth=1 / 2>/dev/null | sort -hr | head -20
du -sh /var/log/*
# Deleted files still holding space (open file handles)
lsof | grep deleted
lsof +L1
# Log file cleanup
journalctl --vacuum-size=100M
find /var/log -name "*.gz" -mtime +30 -delete
# Inode issue (too many files)
find / -xdev -type d -exec sh -c 'echo "$(find "{}" -maxdepth 1 | wc -l) {}"' \; | sort -rn | head
Filesystem Recovery¶
# Check read-only mode
mount | grep ' / '
# Remount as read-write
mount -o remount,rw /
# Filesystem errors
dmesg | grep -i "error\|fail\|corrupt"
# Force fsck
touch /forcefsck
reboot
# Or from recovery mode
fsck -y /dev/sda1
LVM Issues¶
# Check LVM status
pvs; vgs; lvs
pvdisplay; vgdisplay; lvdisplay
# LVM metadata recovery
vgcfgrestore -l vg_name # List backups
vgcfgrestore -f /etc/lvm/archive/... vg_name
# Activate VG
vgchange -ay vg_name
# Activate LV
lvchange -ay /dev/vg_name/lv_name
5. Memory Issues¶
Checking Memory Status¶
# Memory overview
free -h
cat /proc/meminfo
# Per-process memory
ps aux --sort=-%mem | head -20
top -o %MEM
# Swap usage
swapon -s
cat /proc/swaps
# Detailed memory usage
smem -t -k
pmap -x <PID>
OOM Killer Diagnosis¶
# Check OOM occurrence
dmesg | grep -i "out of memory"
journalctl -k | grep -i "oom"
# Check OOM score
cat /proc/<PID>/oom_score
cat /proc/<PID>/oom_score_adj
# Adjust OOM score (protect)
echo -1000 > /proc/<PID>/oom_score_adj
# Or in systemd service
# [Service]
# OOMScoreAdjust=-500
Memory Leak Diagnosis¶
# Track process memory
while true; do
ps -o pid,vsz,rss,comm -p <PID>
sleep 60
done
# Valgrind (development environment)
valgrind --leak-check=full ./myapp
# Check USS/PSS with smem
smem -P nginx
Cache/Buffer Cleanup¶
# Cache status
cat /proc/meminfo | grep -E "Cached|Buffers|SReclaimable"
# Clear cache (caution in production)
sync
echo 1 > /proc/sys/vm/drop_caches # Page cache
echo 2 > /proc/sys/vm/drop_caches # dentries, inodes
echo 3 > /proc/sys/vm/drop_caches # All
# Swap cleanup (when memory is available)
swapoff -a && swapon -a
6. Process Issues¶
Checking Process Status¶
# Process list
ps aux
ps -ef
ps auxf # Tree format
# Find specific process
pgrep -a nginx
pidof nginx
# Process status
cat /proc/<PID>/status
cat /proc/<PID>/limits
# Process environment variables
cat /proc/<PID>/environ | tr '\0' '\n'
Zombie/Orphan Processes¶
# Find zombie processes
ps aux | awk '$8=="Z"'
# Find zombie's parent process
ps -ef | grep <ZOMBIE_PID>
# Check parent process
cat /proc/<ZOMBIE_PID>/status | grep PPid
# Remove zombie (terminate parent)
kill -SIGCHLD <PARENT_PID>
# Or restart parent process
strace/lsof Debugging¶
# Trace system calls
strace -p <PID>
strace -p <PID> -e open,read,write
strace -f -p <PID> # Include child processes
# Check open files
lsof -p <PID>
lsof -c nginx
lsof -i :80
lsof +D /var/log
# Check file descriptors
ls -la /proc/<PID>/fd
cat /proc/<PID>/limits | grep "open files"
Service Issues¶
# Service status
systemctl status nginx
# Service logs
journalctl -u nginx -f
journalctl -u nginx --since "10 minutes ago"
# Service restart (detailed)
systemctl restart nginx
systemctl daemon-reload && systemctl restart nginx
# Check service configuration
systemctl cat nginx
systemctl show nginx
7. Performance Analysis¶
System Overview¶
# Combined status
vmstat 1 10
mpstat -P ALL 1 5
iostat -xz 1 5
# Load average interpretation
# load average: 1.00, 0.75, 0.50
# 1-minute, 5-minute, 15-minute average
# Compare with CPU core count (1.0 = 100% utilization)
nproc # CPU core count
CPU Analysis¶
# CPU usage
top -bn1 | head -20
htop
# Per-process CPU
pidstat 1 5
ps aux --sort=-%cpu | head -10
# Detailed CPU info
mpstat -P ALL 1
# Hotspot detection (perf)
perf top
perf record -g -p <PID> -- sleep 30
perf report
I/O Analysis¶
# Disk I/O
iostat -xz 1
iotop -o
# Per-process I/O
pidstat -d 1
# Wait time check
await, svctm in iostat output
# await > 10ms: Slow disk
# %util > 80%: Possible bottleneck
# I/O profiling
blktrace -d /dev/sda -o - | blkparse -i -
Network Analysis¶
# Network statistics
netstat -s
ss -s
# Bandwidth monitoring
iftop -i eth0
nethogs eth0
# Connection status
ss -tn state established | wc -l
ss -tn state time-wait | wc -l
# Packet loss check
netstat -s | grep -i "packet loss\|retrans"
Bottleneck Analysis Checklist¶
#!/bin/bash
# bottleneck-check.sh
echo "=== System Overview ==="
uptime
echo
echo "=== CPU ==="
mpstat 1 3 | tail -4
echo
echo "=== Memory ==="
free -h
echo
echo "=== Disk I/O ==="
iostat -xz 1 3 | tail -10
echo
echo "=== Network ==="
ss -s
echo
echo "=== Top Processes (CPU) ==="
ps aux --sort=-%cpu | head -6
echo
echo "=== Top Processes (Memory) ==="
ps aux --sort=-%mem | head -6
echo
echo "=== Failed Services ==="
systemctl --failed
echo
echo "=== Recent Errors ==="
journalctl -p err --since "1 hour ago" | tail -20
Performance Baseline¶
# Record normal state (regularly)
#!/bin/bash
DATE=$(date +%Y%m%d-%H%M)
OUTPUT_DIR=/var/log/baseline
mkdir -p $OUTPUT_DIR
# System information
vmstat 1 60 > $OUTPUT_DIR/vmstat-$DATE.log &
iostat -xz 1 60 > $OUTPUT_DIR/iostat-$DATE.log &
mpstat -P ALL 1 60 > $OUTPUT_DIR/mpstat-$DATE.log &
sar -n DEV 1 60 > $OUTPUT_DIR/sar-net-$DATE.log &
wait
# Snapshots
ps aux > $OUTPUT_DIR/ps-$DATE.log
free -h > $OUTPUT_DIR/memory-$DATE.log
df -h > $OUTPUT_DIR/disk-$DATE.log
ss -s > $OUTPUT_DIR/network-$DATE.log
Practice Problems¶
Problem 1: Boot Issue¶
The system boots to emergency mode. Explain the steps to find and resolve the cause.
Problem 2: Disk Space¶
The /var partition is 100% full. Write commands to find the cause and resolve it.
Problem 3: Network Connection¶
Cannot access external websites. Write step-by-step diagnostic procedures.
Answers¶
Problem 1 Answer¶
# 1. Check error messages
journalctl -xb
dmesg | grep -i error
# 2. Common causes
# - /etc/fstab errors
# - Filesystem corruption
# - SELinux issues
# 3. Check /etc/fstab
cat /etc/fstab
# Verify UUID and devices are correct
# 4. Check filesystem
fsck -y /dev/sda1
# 5. Fix fstab (if problematic)
# Add nofail option or comment out problematic entries
# 6. Reboot
reboot
Problem 2 Answer¶
# 1. Check overall usage
df -h /var
# 2. Find large directories
du -h --max-depth=1 /var | sort -hr | head -10
# 3. Find large files
find /var -type f -size +100M -exec ls -lh {} \;
# 4. Check common causes
du -sh /var/log
du -sh /var/cache
du -sh /var/lib/docker # If using Docker
# 5. Clean up logs
journalctl --vacuum-size=100M
find /var/log -name "*.gz" -mtime +7 -delete
truncate -s 0 /var/log/large-file.log
# 6. Check deleted file handles
lsof +L1 | grep /var
# Restart service to release handles
Problem 3 Answer¶
# 1. Interface status
ip addr show
ip link show
# 2. Check default gateway
ip route show
ping -c 3 <gateway-ip>
# 3. External IP connectivity
ping -c 3 8.8.8.8
# 4. DNS check (if IP works but domain doesn't)
nslookup google.com
cat /etc/resolv.conf
# 5. Firewall check
iptables -L -n
firewall-cmd --list-all
# 6. Test specific port
nc -zv google.com 443
# 7. Check routing path
traceroute google.com
# Actions based on diagnosis results:
# - No IP: DHCP or manual IP configuration
# - Gateway unreachable: Check network cable/switch
# - External IP unreachable: Check router/firewall
# - Only DNS failing: Fix resolv.conf
References¶
- Linux Performance
- Red Hat System Administrator's Guide
- Ubuntu Server Guide
man strace,man lsof,man perf
Conclusion¶
This document concludes the Linux learning series.
Complete Learning Content: - 01-03: Linux Basics - 04-08: Intermediate Administration - 09-12: Advanced Server Management - 13-16: Advanced Topics (systemd, Performance, Containers, Storage) - 17-26: Expert Level (Security, Virtualization, Automation, HA, Troubleshooting)
Return to 00_Overview.md to review the complete learning roadmap.