12. Recovery Systems¶
Previous: Concurrency Control | Next: NoSQL and NewSQL
Learning Objectives¶
- Classify types of failures and understand storage hierarchy
- Master log-based recovery techniques: deferred and immediate modification
- Understand the Write-Ahead Logging (WAL) protocol and why it is essential
- Learn checkpoint mechanisms including fuzzy checkpoints
- Study the ARIES recovery algorithm in detail: Analysis, Redo, Undo phases
- Compare recovery approaches: shadow paging vs. WAL-based recovery
- Understand buffer management policies: force/no-force, steal/no-steal
- Learn media recovery and backup strategies
1. Failure Classification¶
Database systems must handle various types of failures gracefully. Understanding failure types is essential for designing appropriate recovery mechanisms.
1.1 Transaction Failure¶
Failures within a single transaction. The rest of the system is unaffected.
Logical errors:
- Division by zero
- Constraint violation (e.g., unique key, foreign key)
- Application-level assertion failure
- User-initiated abort (ROLLBACK)

System errors affecting a transaction:
- Deadlock victim selection (transaction is aborted to break deadlock)
- Timeout expiration
- Out-of-memory for the transaction's workspace
Transaction Failure Recovery:
T₁: read(A) write(A) read(B) ← constraint violation!
        ↓
UNDO T₁'s write to A
Release T₁'s locks
T₁ enters ABORTED state
1.2 System Failure (Crash)¶
The entire system halts unexpectedly. Volatile storage (main memory, buffer pool) is lost, but disk contents survive.
Causes:
- Operating system crash
- Power failure (without UPS)
- Hardware failure (CPU, memory)
- Software bug in the DBMS
System Failure:
┌───────────────────────────────────┐
│        Main Memory (LOST)         │
│  ┌─────────────┐ ┌─────────────┐  │
│  │ Buffer Pool │ │ Lock Table  │  │
│  │ (dirty      │ │ (all locks  │  │
│  │  pages)     │ │  lost)      │  │
│  └─────────────┘ └─────────────┘  │
└───────────────────────────────────┘
            ↓ CRASH ↓
┌───────────────────────────────────┐
│          Disk (SURVIVES)          │
│  ┌─────────────┐ ┌─────────────┐  │
│  │ Database    │ │ Log File    │  │
│  │ Files       │ │ (WAL)       │  │
│  └─────────────┘ └─────────────┘  │
└───────────────────────────────────┘
After recovery:
- Committed transactions: REDO (effects must persist)
- Uncommitted transactions: UNDO (effects must be reversed)
1.3 Disk Failure (Media Failure)¶
Physical destruction of disk storage. Data on the affected disk is lost.
Causes:
- Disk head crash
- Controller failure
- Fire, flood, or other physical damage
Disk Failure Recovery:
Primary disk destroyed
        ↓
Restore from backup (tape, remote replica)
        ↓
Replay log from backup point to failure point
        ↓
Database restored to consistent state
1.4 Failure Summary¶
| Failure Type | What is Lost | Recovery Method |
|---|---|---|
| Transaction | Transaction's partial work | Undo (rollback) using log |
| System (crash) | Volatile storage (buffer pool) | Redo committed + Undo uncommitted |
| Disk (media) | Disk contents | Restore from backup + replay log |
2. Storage Types¶
2.1 Storage Hierarchy¶
Storage Hierarchy:
┌─────────────────────┐
│  Volatile Storage   │  CPU registers, cache, main memory
│  (fast, lost on     │  Access: nanoseconds
│  power failure)     │
└──────────┬──────────┘
           │
┌──────────┴──────────┐
│ Nonvolatile Storage │  SSD, HDD, flash
│ (survives power     │  Access: microseconds (SSD) to milliseconds (HDD)
│ failure, not disk   │
│ failure)            │
└──────────┬──────────┘
           │
┌──────────┴──────────┐
│   Stable Storage    │  Mirrored disks, RAID, remote replicas, tape
│ (ideally survives   │  Access: milliseconds to hours
│ ALL failures)       │
└─────────────────────┘
2.2 Stable Storage¶
True stable storage is an idealization -- no real storage survives all possible failures. In practice, we approximate stable storage using:
- RAID (Redundant Array of Independent Disks):
  - RAID 1: Mirroring (two copies of every block)
  - RAID 5/6: Parity-based redundancy
- Remote replication: Write to geographically separate locations
- Multiple copies: Maintain copies on different media (disk + tape + cloud)
For the log file, we need the strongest durability guarantee, so the log is typically stored on the most reliable storage available.
3. Log-Based Recovery¶
3.1 The Log¶
The log (also called journal or write-ahead log) is a sequential record of all database modifications. It is the foundation of recovery.
Log Record Types:
┌────────────────────────────────────────────────────────────┐
│ <T_i, start>             Transaction T_i has started       │
│ <T_i, X, V_old, V_new>   T_i changed X from V_old to V_new │
│ <T_i, commit>            Transaction T_i has committed     │
│ <T_i, abort>             Transaction T_i has aborted       │
│ <checkpoint L>           Checkpoint with active txn list L │
└────────────────────────────────────────────────────────────┘
Example log:
LSN  Log Record
───  ───────────────────────────
 1   <T₁, start>
 2   <T₁, A, 100, 50>      T₁ changed A from 100 to 50
 3   <T₁, B, 200, 250>     T₁ changed B from 200 to 250
 4   <T₂, start>
 5   <T₂, C, 300, 400>     T₂ changed C from 300 to 400
 6   <T₁, commit>
 7   <T₂, D, 500, 600>     T₂ changed D from 500 to 600
     ← CRASH (T₂ not committed)
3.2 Log Properties¶
Key properties:
1. Append-only: New records are always added at the end
2. Sequential writes: Efficient for disk I/O
3. Write-ahead: Log records are written BEFORE the corresponding data modifications (WAL rule)
4. Force at commit: A transaction's log records must be on stable storage before the commit completes
3.3 Deferred Database Modification¶
In the deferred modification scheme, all writes are deferred until the transaction commits. During execution, modifications are recorded only in the log.
Deferred Modification:
T₁ executes:
  Log: <T₁, start>
  Log: <T₁, A, _, 50>      Record intent to write (old value not needed)
  Log: <T₁, B, _, 250>     Record intent to write
  Log: <T₁, commit>
  ── NOW apply writes to database ──
  Database: A = 50, B = 250

Recovery needs only REDO (no UNDO needed because uncommitted
transactions never wrote to the database).
Recovery after crash:
- Committed transactions: Redo all their writes (from the log)
- Uncommitted transactions: Ignore (they never modified the database)

Advantages:
- No UNDO needed (simpler recovery)
- No dirty data in the database

Disadvantages:
- All writes must fit in memory until commit
- Long transactions require large buffers
- Cannot release locks early (must hold all locks until commit + write)
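The deferred scheme can be captured in a few lines of Python. This is a toy sketch with invented names (`DeferredTxn`, `redo_recover`), not any real DBMS API: writes accumulate in the log and a pending list, the database is touched only after the commit record is logged, and recovery is therefore redo-only.

```python
class DeferredTxn:
    """Toy deferred-modification transaction: writes are buffered until commit."""

    def __init__(self, tid, log):
        self.tid, self.log = tid, log
        self.pending = []                      # buffered writes: (item, new_value)
        log.append((tid, "start"))

    def write(self, item, value):
        # Log the intended write; the database itself is NOT touched yet.
        self.log.append((self.tid, "write", item, value))
        self.pending.append((item, value))

    def commit(self, db):
        self.log.append((self.tid, "commit"))  # commit record first
        for item, value in self.pending:       # then apply the deferred writes
            db[item] = value

def redo_recover(log, db):
    # Redo-only recovery: replay writes of committed transactions, ignore the rest.
    committed = {r[0] for r in log if r[1] == "commit"}
    for r in log:
        if r[1] == "write" and r[0] in committed:
            db[r[2]] = r[3]
    return db

log, db = [], {"A": 100, "B": 200}
t1 = DeferredTxn("T1", log)
t1.write("A", 50)
t1.write("B", 250)
t1.commit(db)                       # db now holds A=50, B=250
t2 = DeferredTxn("T2", log)
t2.write("A", 999)                  # T2 never commits -> crash here
print(redo_recover(log, {"A": 100, "B": 200}))   # {'A': 50, 'B': 250}
```

Because T2 never wrote to `db`, recovery simply ignores it; there is nothing to undo, which is exactly the appeal of the scheme.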
3.4 Immediate Database Modification¶
In the immediate modification scheme, writes are applied to the database as they occur (possibly before commit). This is the approach used by all major database systems.
Immediate Modification:
T₁ executes:
  Log: <T₁, start>
  Log: <T₁, A, 100, 50>    Record both old and new values
  Database: A = 50         ← written to database immediately
  Log: <T₁, B, 200, 250>
  Database: B = 250        ← written to database immediately
  Log: <T₁, commit>

Recovery may need both REDO and UNDO.
Recovery after crash:
- Committed transactions: REDO their writes (to ensure changes persist)
- Uncommitted transactions: UNDO their writes (to reverse partial changes)
Recovery from the example log:
LSN  Log Record
───  ─────────────────────
 1   <T₁, start>
 2   <T₁, A, 100, 50>
 3   <T₁, B, 200, 250>
 4   <T₂, start>
 5   <T₂, C, 300, 400>
 6   <T₁, commit>
 7   <T₂, D, 500, 600>
     ← CRASH

Recovery:
  T₁ committed (found <T₁, commit>) → REDO:
    A = 50   (from log record 2)
    B = 250  (from log record 3)
  T₂ not committed (no <T₂, commit>) → UNDO:
    D = 500  (reverse log record 7: restore old value)
    C = 300  (reverse log record 5: restore old value)

REDO is forward (oldest to newest for committed txns)
UNDO is backward (newest to oldest for uncommitted txns)
Why REDO is needed even for committed transactions: Because a committed transaction's writes might still be in the buffer pool (volatile memory) and not yet flushed to disk when the crash occurred.
Why UNDO is needed for uncommitted transactions: Because an uncommitted transaction's writes might have been flushed to disk (by the buffer manager) before the crash.
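Both passes follow mechanically from these two observations. Below is a minimal sketch (the tuple record format and the `recover` helper are invented for illustration): a forward pass re-applies committed writes, then a backward pass restores before-images for everyone else.

```python
# Toy redo/undo recovery for immediate modification.
# Log records: ("start", T), ("update", T, item, old, new), ("commit", T).

def recover(log, db):
    committed = {r[1] for r in log if r[0] == "commit"}
    # REDO: forward pass, reapply the writes of committed transactions
    for r in log:
        if r[0] == "update" and r[1] in committed:
            db[r[2]] = r[4]          # install the new value
    # UNDO: backward pass, restore old values of uncommitted transactions
    for r in reversed(log):
        if r[0] == "update" and r[1] not in committed:
            db[r[2]] = r[3]          # restore the old value
    return db

# The example log: T1 commits, T2 does not.
log = [
    ("start", "T1"),
    ("update", "T1", "A", 100, 50),
    ("update", "T1", "B", 200, 250),
    ("start", "T2"),
    ("update", "T2", "C", 300, 400),
    ("commit", "T1"),
    ("update", "T2", "D", 500, 600),
]
# At crash time the disk may hold any mix of old/new values;
# recovery repairs both cases.
db = {"A": 100, "B": 200, "C": 400, "D": 600}
print(recover(log, db))   # {'A': 50, 'B': 250, 'C': 300, 'D': 500}
```

Note that the starting `db` deliberately mixes stale committed values (A, B) with leaked uncommitted ones (C, D): redo fixes the former, undo the latter.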
4. Write-Ahead Logging (WAL) Protocol¶
4.1 The WAL Rule¶
The Write-Ahead Logging (WAL) protocol is the fundamental rule that makes log-based recovery work:
WAL Rule: Before a modified data page is written from the buffer pool to disk, all log records pertaining to that page must be flushed to stable storage.
WAL Protocol:
Step 1: Write log record to log buffer (in memory)
Step 2: Flush log record to disk (stable storage)
Step 3: THEN (and only then) write modified data to disk
Correct order:
Log buffer: <T₁, A, 100, 50> → Log on disk: <T₁, A, 100, 50> → Data on disk: A=50
                                          ↑
                            Must happen BEFORE data write
Why? If data is written first and the system crashes before the log record
is written, we cannot undo the change (we don't know the old value).
4.2 Commit Rule (Force-at-Commit)¶
When a transaction commits, all its log records must be flushed to stable storage before the commit is acknowledged to the user:
Commit Protocol:
T₁: ... operations ...
    Log: <T₁, commit>
Step 1: Flush ALL of T₁'s log records to disk (including <T₁, commit>)
Step 2: Acknowledge commit to user ("Transaction committed")
Step 3: Modified data pages may be flushed later (lazy)

If crash occurs after Step 1 but before Step 3:
    The log has <T₁, commit> on disk → REDO T₁'s changes during recovery ✓
If crash occurs before Step 1:
    No <T₁, commit> on disk → UNDO T₁'s changes during recovery ✓
4.3 Group Commit¶
Flushing the log to disk for every single commit is expensive. Group commit batches multiple commits into a single disk write:
Without Group Commit:
  T₁ commit → flush → ack
  T₂ commit → flush → ack
  T₃ commit → flush → ack
  = 3 disk flushes (each ~5-10ms for HDD)

With Group Commit:
  T₁ commit → buffer ─┐
  T₂ commit → buffer ─┼─→ single flush → ack T₁, T₂, T₃
  T₃ commit → buffer ─┘
  = 1 disk flush

Trade-off: slightly higher latency for individual commits,
but much higher throughput (more commits per second).
PostgreSQL performs group commit automatically when commits arrive close together; the commit_delay and commit_siblings settings can be tuned to batch more commits per flush.
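The batching idea itself is tiny. A toy sketch (illustrative class names, not PostgreSQL internals): commit records accumulate in a queue, and one physical flush makes the whole batch durable and acknowledgeable.

```python
class GroupCommitLog:
    """Toy group-commit log: one flush acknowledges every queued commit."""

    def __init__(self):
        self.queue = []      # commit records waiting for a flush
        self.flushes = 0     # number of physical log flushes performed
        self.acked = []      # transactions acknowledged to their clients

    def commit(self, tid):
        self.queue.append(tid)       # buffer the commit record in memory

    def flush(self):
        # One sequential write makes every queued commit durable at once.
        self.flushes += 1
        self.acked.extend(self.queue)
        self.queue.clear()

log = GroupCommitLog()
for tid in ("T1", "T2", "T3"):
    log.commit(tid)
log.flush()
print(log.flushes, log.acked)    # 1 ['T1', 'T2', 'T3']
```

Three commits cost one flush instead of three; a real system would trigger the flush from a timer or a dedicated log-writer thread rather than an explicit call.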
4.4 WAL in PostgreSQL¶
PostgreSQL WAL Architecture:
┌───────────────────────────────────────────┐
│        Shared Buffers (Buffer Pool)       │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐          │
│  │Page1│ │Page2│ │Page3│ │Page4│ ...      │
│  │dirty│ │clean│ │dirty│ │clean│          │
│  └─────┘ └─────┘ └─────┘ └─────┘          │
└─────────────────┬─────────────────────────┘
                  │ WAL must be flushed BEFORE
                  │ a dirty page can be written
                  ↓
┌───────────────────────────────────────────┐
│  WAL Buffers         WAL Segment Files    │
│  ┌─────────────┐    ┌────────────────────┐│
│  │ WAL buffer  │ →  │ 000000010000000042 ││
│  │ (in memory) │    │ 000000010000000043 ││
│  └─────────────┘    │ ... (16MB each)    ││
│                     └────────────────────┘│
└───────────────────────────────────────────┘
-- PostgreSQL WAL configuration
SHOW wal_level; -- minimal, replica, logical
SHOW max_wal_size; -- max WAL size before checkpoint (default: 1GB)
SHOW wal_buffers; -- WAL buffer size (default: ~16MB)
SHOW synchronous_commit; -- on (default), off (async), remote_write, etc.
-- View WAL statistics
SELECT * FROM pg_stat_wal;
5. Checkpoints¶
5.1 Why Checkpoints?¶
Without checkpoints, recovery must scan the entire log from the beginning to determine which transactions to redo and undo. For a database that has been running for months, this could take hours.
A checkpoint establishes a known good state, limiting how far back recovery must scan.
5.2 Simple (Quiescent) Checkpoint¶
The simplest checkpoint protocol:
SIMPLE-CHECKPOINT():
1. Temporarily stop accepting new transactions
2. Wait for all active transactions to complete (commit or abort)
3. Flush all dirty pages from buffer pool to disk
4. Write <checkpoint> record to log
5. Flush log to disk
6. Resume accepting transactions
Log with simple checkpoint:
... <T₁,start> <T₁,A,100,50> <T₁,commit> <checkpoint> <T₂,start> ...
                                  ↑
                   All prior transactions are
                   fully reflected on disk.
                   Recovery starts here.
Problem: Requires stopping all transactions -- unacceptable for production systems.
5.3 Non-Quiescent (Active) Checkpoint¶
A more practical checkpoint that does not require stopping all transactions:
ACTIVE-CHECKPOINT():
1. Write <checkpoint L> to log, where L = list of active transactions
2. Flush all dirty pages from buffer pool to disk
(new transactions and writes can continue during this flush)
3. Write <end_checkpoint> to log when flush is complete
Log with active checkpoint:
<T₁,start> <T₁,A,100,50> <checkpoint {T₁,T₂}> <T₂,B,200,300>
                                ↑
                 T₁ and T₂ were active when
                 the checkpoint started

Recovery:
  Must scan back to the earliest start of the transactions in L
  (i.e., <T₁,start>, since T₁ is in the active list)
5.4 Fuzzy Checkpoints¶
A fuzzy checkpoint is the most practical approach, used by ARIES and most modern systems:
FUZZY-CHECKPOINT():
1. Write <begin_checkpoint> to log
2. Record:
- Active transaction table (ATT): all active transactions and their status
- Dirty page table (DPT): all dirty pages in the buffer pool
3. Write <end_checkpoint ATT, DPT> to log
4. Flush log to stable storage
5. Update master record (on disk) to point to <begin_checkpoint>
Note: Dirty pages are NOT flushed during checkpoint!
They are flushed asynchronously by the background writer.
Fuzzy Checkpoint Contents:
<begin_checkpoint>
<end_checkpoint>

Active Transaction Table:
┌─────────┬──────────┬──────────┐
│ TxnID   │ Status   │ LastLSN  │
├─────────┼──────────┼──────────┤
│ T₁      │ Active   │ LSN 150  │
│ T₂      │ Active   │ LSN 180  │
└─────────┴──────────┴──────────┘

Dirty Page Table:
┌─────────┬──────────────┐
│ PageID  │ RecoveryLSN  │
├─────────┼──────────────┤
│ Page 5  │ LSN 120      │
│ Page 12 │ LSN 145      │
│ Page 8  │ LSN 170      │
└─────────┴──────────────┘
Advantages of fuzzy checkpoints:
- No transaction stalling
- No forced page flushing during checkpoint
- Very fast checkpoint operation
- Recovery uses the checkpoint tables to determine the starting point
6. ARIES Recovery Algorithm¶
6.1 Overview¶
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the most widely used recovery algorithm. Developed at IBM Research by C. Mohan et al. in the early 1990s, it is the basis for recovery in DB2, SQL Server, PostgreSQL (adapted), and many other systems.
Core principles:
1. Write-Ahead Logging (WAL): Log before data
2. Repeating history during redo: Redo all actions (including those of uncommitted transactions) to restore the exact pre-crash state
3. Logging changes during undo: Write Compensation Log Records (CLRs) during undo to ensure undo operations are idempotent
6.2 Key Data Structures¶
Log Sequence Number (LSN): Every log record has a unique, monotonically increasing LSN. Each data page stores the LSN of the most recent log record that modified it (pageLSN).
Log Record:
┌───────┬───────┬──────────┬────────┬──────────────────┐
│ LSN   │ TxnID │ Type     │ PageID │ PrevLSN          │
│       │       │ (update, │        │ (previous log    │
│       │       │ commit,  │        │  record of same  │
│       │       │ CLR,...) │        │  txn)            │
├───────┴───────┴──────────┴────────┴──────────────────┤
│ Before Image (old value)                             │
│ After Image (new value)                              │
│ UndoNextLSN (for CLR only)                           │
└──────────────────────────────────────────────────────┘
Transaction Table (ATT - Active Transaction Table): Maintained in memory. For each active transaction:
- TransID
- Status (active, committing, aborting)
- LastLSN: LSN of the most recent log record for this transaction

Dirty Page Table (DPT): Maintained in memory. For each dirty page in the buffer pool:
- PageID
- RecoveryLSN (recLSN): LSN of the first log record that dirtied this page since it was last flushed
Transaction Table:                   Dirty Page Table:
┌─────────┬─────────┬─────────┐      ┌────────┬─────────┐
│ TransID │ Status  │ LastLSN │      │ PageID │ recLSN  │
├─────────┼─────────┼─────────┤      ├────────┼─────────┤
│ T₁      │ active  │ 45      │      │ P5     │ 20      │
│ T₂      │ active  │ 60      │      │ P3     │ 35      │
│ T₃      │ commit  │ 55      │      │ P8     │ 42      │
└─────────┴─────────┴─────────┘      └────────┴─────────┘
6.3 Compensation Log Records (CLR)¶
A CLR is a special log record written during the undo phase. It records the undo of a previous update and includes a pointer (UndoNextLSN) to the next log record to undo.
CLR Structure:
┌───────────────────────────────────────────────┐
│ LSN: 100                                      │
│ TransID: T₁                                   │
│ Type: CLR                                     │
│ UndoNextLSN: 30  (→ next record to undo)      │
│ Redo info: "restore A to old value 50"        │
│ PrevLSN: 80                                   │
└───────────────────────────────────────────────┘
Purpose:
- If the system crashes DURING recovery (during undo),
the CLR ensures we don't undo the same operation twice.
- CLRs are REDO-only (they are never undone themselves).
6.4 The Three Phases¶
ARIES recovery proceeds in three phases:
Log Timeline:
────────────┬──────────────────┬──────────────────────────┬──→ time
            │                  │                          │
     Begin Checkpoint    End Checkpoint                 CRASH

  Phase 1: ANALYSIS  ──────────────────────────────────────→
           (scan forward from the checkpoint)
  Phase 2: REDO      ──────────────────────────────────────→
           (from min recLSN in DPT, forward)
  Phase 3: UNDO      ←──────────────────────────────────────
           (from the end of the log, backward)
6.5 Phase 1: Analysis¶
Goal: Determine exactly which transactions were active at crash time and which pages were dirty.
ANALYSIS-PHASE():
// Start from the most recent checkpoint
Read <end_checkpoint ATT, DPT>
Initialize ATT and DPT from checkpoint data
// Scan forward from checkpoint to end of log
for each log record in forward order:
if record is <T_i, start>:
Add T_i to ATT (status = active)
if record is an UPDATE or CLR:
if T_i not in ATT:
Add T_i to ATT
ATT[T_i].LastLSN = record.LSN
if record.PageID not in DPT:
Add page to DPT with recLSN = record.LSN
if record is <T_i, commit>:
ATT[T_i].status = committed
if record is <T_i, abort>:
ATT[T_i].status = aborting
if record is <T_i, end>:
Remove T_i from ATT
// After analysis:
// ATT contains all transactions that were active at crash time
// DPT contains all pages that might have been dirty at crash time
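Under the same assumptions as the pseudocode (an illustrative tuple format for log records, invented helper names), the analysis pass fits in one short function. Running it on the log tail of the Section 6.8 example reproduces the ATT and DPT derived there.

```python
def analysis(ckpt_att, ckpt_dpt, log_tail):
    """Toy ARIES analysis pass. Records: (lsn, kind, tid[, page])."""
    att = {t: list(v) for t, v in ckpt_att.items()}   # tid -> [status, lastLSN]
    dpt = dict(ckpt_dpt)                              # page -> recLSN
    for rec in log_tail:
        lsn, kind, tid = rec[0], rec[1], rec[2]
        if kind == "start":
            att[tid] = ["active", lsn]
        elif kind == "update":
            att.setdefault(tid, ["active", lsn])[1] = lsn   # track LastLSN
            dpt.setdefault(rec[3], lsn)   # recLSN = first LSN that dirtied the page
        elif kind == "commit":
            att[tid][0] = "committed"
        elif kind == "end":
            att.pop(tid, None)            # fully finished: drop from ATT
    return att, dpt

# Log tail after the checkpoint in the Section 6.8 example:
tail = [
    (60, "update", "T2", "P3"),
    (70, "start", "T3"),
    (80, "update", "T1", "P5"),
    (90, "update", "T3", "P8"),
    (100, "commit", "T1"),
    (110, "update", "T3", "P8"),
]
att, dpt = analysis({"T1": ["active", 20], "T2": ["active", 40]},
                    {"P5": 20, "P3": 40}, tail)
# att == {'T1': ['committed', 80], 'T2': ['active', 60], 'T3': ['active', 110]}
# dpt == {'P5': 20, 'P3': 40, 'P8': 90}
```

Note how P8 enters the DPT with recLSN 90 (its first dirtying update), while P5's recLSN stays at the checkpoint value 20 even after the later update at LSN 80.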
6.6 Phase 2: Redo (Repeating History)¶
Goal: Restore the database to the exact state it was in at the moment of the crash. This includes redoing ALL modifications, even those of uncommitted transactions.
REDO-PHASE():
// Start from the smallest recLSN in the DPT
start_LSN = min(recLSN for all pages in DPT)
// Scan forward from start_LSN to end of log
for each UPDATE or CLR log record with LSN ≥ start_LSN:
// Three conditions to SKIP redo:
if record.PageID not in DPT:
skip (page is clean)
if DPT[record.PageID].recLSN > record.LSN:
skip (page was already flushed after this update)
// Read the page from disk
page = read_page(record.PageID)
if page.pageLSN ≥ record.LSN:
skip (this update is already reflected on the page)
// Apply the redo
Apply the REDO information from the log record to the page
page.pageLSN = record.LSN
// After redo:
// Database is in the exact pre-crash state
// (including effects of uncommitted transactions)
Why redo uncommitted transactions? ARIES "repeats history" to ensure that the undo phase works correctly. The undo phase assumes the database is in the exact pre-crash state, including all effects of all transactions (committed and uncommitted).
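The three skip conditions can be collapsed into a single predicate (a sketch; the record fields and table shapes are illustrative, and the on-disk pageLSN is passed in rather than read from a real page):

```python
def should_redo(lsn, page_id, dpt, page_lsn_on_disk):
    """Return True iff an update at `lsn` on `page_id` must be redone."""
    if page_id not in dpt:                # condition 1: page was clean at crash
        return False
    if dpt[page_id] > lsn:                # condition 2: page flushed after this update
        return False
    if page_lsn_on_disk >= lsn:           # condition 3: update already on the page
        return False
    return True

dpt = {"P5": 20, "P3": 40, "P8": 90}
print(should_redo(60, "P3", dpt, page_lsn_on_disk=40))   # True: needs redo
print(should_redo(60, "P3", dpt, page_lsn_on_disk=60))   # False: already applied
print(should_redo(30, "P8", dpt, page_lsn_on_disk=0))    # False: recLSN 90 > 30
```

The first two conditions only consult the in-memory DPT, so they cheaply avoid reading pages from disk; only condition 3 requires the page itself.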
6.7 Phase 3: Undo¶
Goal: Roll back all transactions that were active at crash time (not committed).
UNDO-PHASE():
// Build the undo list: all uncommitted transactions in ATT
undo_list = {T_i in ATT where status != committed}
// Collect the LastLSN for each transaction to undo
// Process from the end of the log backward
ToUndo = {ATT[T_i].LastLSN for T_i in undo_list}
while ToUndo is not empty:
// Pick the largest LSN (most recent operation)
lsn = max(ToUndo)
record = log_record at lsn
T_i = record.TransID
if record is a CLR:
// CLRs are not undone; follow UndoNextLSN
if record.UndoNextLSN != null:
replace lsn in ToUndo with record.UndoNextLSN
else:
// All of T_i's updates have been undone
Write <T_i, end> to log
Remove T_i's entries from ToUndo
else if record is an UPDATE:
// Undo this update
// Step 1: Write a CLR
CLR = create_CLR(
TransID = T_i,
UndoNextLSN = record.PrevLSN,
Redo_info = "restore to before-image"
)
Write CLR to log
// Step 2: Apply the undo (restore before-image)
Restore the page to its before-image value
// Step 3: Follow the chain
if record.PrevLSN != null:
replace lsn in ToUndo with record.PrevLSN
else:
Write <T_i, end> to log
Remove T_i's entries from ToUndo
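The backward pass can be exercised on the loser transactions of the Section 6.8 log. This is a toy sketch with an invented dict-based record format; new CLR and end records simply get LSNs allocated in increments of 10.

```python
def undo(log, loser_last_lsns, db):
    """Toy ARIES undo: follow PrevLSN chains backward, writing CLRs."""
    by_lsn = {r["lsn"]: r for r in log}
    to_undo = set(loser_last_lsns)
    lsn_counter = max(by_lsn)
    while to_undo:
        lsn = max(to_undo)                       # most recent action first
        to_undo.remove(lsn)
        rec = by_lsn[lsn]
        if rec["type"] == "update":
            db[rec["item"]] = rec["old"]         # restore the before-image
            lsn_counter += 10
            clr = {"lsn": lsn_counter, "type": "clr", "tid": rec["tid"],
                   "undo_next": rec["prev"]}     # next record of this txn to undo
            log.append(clr)
            by_lsn[clr["lsn"]] = clr
            to_undo.add(rec["prev"])
        elif rec["type"] == "clr" and rec["undo_next"] is not None:
            to_undo.add(rec["undo_next"])        # CLRs are never undone; skip ahead
        else:                                    # "start": txn fully undone
            lsn_counter += 10
            log.append({"lsn": lsn_counter, "type": "end", "tid": rec["tid"]})
    return db

log = [
    {"lsn": 30, "type": "start", "tid": "T2"},
    {"lsn": 40, "type": "update", "tid": "T2", "item": "B", "old": 30, "prev": 30},
    {"lsn": 60, "type": "update", "tid": "T2", "item": "B", "old": 40, "prev": 40},
    {"lsn": 70, "type": "start", "tid": "T3"},
    {"lsn": 90, "type": "update", "tid": "T3", "item": "D", "old": 80, "prev": 70},
    {"lsn": 110, "type": "update", "tid": "T3", "item": "E", "old": 15, "prev": 90},
]
db = {"B": 50, "D": 90, "E": 25}     # state after the redo phase
undo(log, [60, 110], db)             # losers: T2 (LastLSN 60), T3 (LastLSN 110)
print(db)                            # {'B': 30, 'D': 80, 'E': 15}
```

The run writes four CLRs and two end records, mirroring the worked example that follows; if recovery itself crashed mid-undo, restarting from the surviving CLRs' `undo_next` pointers would skip the already-undone updates.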
6.8 Complete ARIES Example¶
Log at time of crash:
LSN  Record                                          PrevLSN  UndoNextLSN
───  ──────────────────────────────────────────────  ───────  ───────────
 10  <T₁, start>                                     -
 20  <T₁, P5, A: 10→20>                              10
 30  <T₂, start>                                     -
 40  <T₂, P3, B: 30→40>                              30
 50  <begin_checkpoint>
 55  <end_checkpoint ATT={T₁,T₂}, DPT={P5:20, P3:40}>
 60  <T₂, P3, B: 40→50>                              40
 70  <T₃, start>                                     -
 80  <T₁, P5, C: 60→70>                              20
 90  <T₃, P8, D: 80→90>                              70
100  <T₁, commit>
110  <T₃, P8, E: 15→25>                              90
     ← CRASH
Phase 1: Analysis (scan from LSN 55 to 110)
LSN  60: T₂ update → ATT: T₂.LastLSN=60
LSN  70: T₃ start  → ATT adds T₃
LSN  80: T₁ update → ATT: T₁.LastLSN=80, DPT: P5.recLSN stays 20
LSN  90: T₃ update → ATT: T₃.LastLSN=90, DPT adds P8.recLSN=90
LSN 100: T₁ commit → ATT: T₁.status=committed
LSN 110: T₃ update → ATT: T₃.LastLSN=110

Result:
ATT = {T₁(committed, LSN=80), T₂(active, LSN=60), T₃(active, LSN=110)}
DPT = {P5(recLSN=20), P3(recLSN=40), P8(recLSN=90)}
Phase 2: Redo (from min recLSN = 20)
LSN  20: P5 in DPT, recLSN=20 ≤ 20,  check pageLSN → redo if needed
LSN  40: P3 in DPT, recLSN=40 ≤ 40,  check pageLSN → redo if needed
LSN  60: P3 in DPT, recLSN=40 ≤ 60,  check pageLSN → redo if needed
LSN  80: P5 in DPT, recLSN=20 ≤ 80,  check pageLSN → redo if needed
LSN  90: P8 in DPT, recLSN=90 ≤ 90,  check pageLSN → redo if needed
LSN 110: P8 in DPT, recLSN=90 ≤ 110, check pageLSN → redo if needed
Phase 3: Undo (uncommitted: T₂, T₃)
ToUndo = {60 (T₂), 110 (T₃)}

Step 1: Undo LSN 110 (T₃, P8, E: 15→25)
        Write CLR: <LSN=120, T₃, CLR, UndoNext=90>
        Restore E to 15
        ToUndo = {60, 90}
Step 2: Undo LSN 90 (T₃, P8, D: 80→90)
        Write CLR: <LSN=130, T₃, CLR, UndoNext=70>
        Restore D to 80
        ToUndo = {60, 70}
Step 3: LSN 70 is <T₃, start> (reached via the PrevLSN of LSN 90)
        T₃ is fully undone → Write <T₃, end>
        ToUndo = {60}
Step 4: Undo LSN 60 (T₂, P3, B: 40→50)
        Write CLR: <LSN=140, T₂, CLR, UndoNext=40>
        Restore B to 40
        ToUndo = {40}
Step 5: Undo LSN 40 (T₂, P3, B: 30→40)
        Write CLR: <LSN=150, T₂, CLR, UndoNext=30>
        Restore B to 30
        ToUndo = {30}
Step 6: LSN 30 is <T₂, start>
        Write <T₂, end>
        ToUndo = {} → DONE

Final state:
  T₁'s changes persist (committed)
  T₂'s changes undone (B restored to 30)
  T₃'s changes undone (D restored to 80, E restored to 15)
6.9 Crash During Recovery¶
One of ARIES's key strengths is handling crashes during recovery:
Scenario: System crashes during the UNDO phase

Original crash recovery was in Phase 3 (Undo):
  Undo of LSN 110 → wrote CLR (LSN 120)
  Undo of LSN 90  → wrote CLR (LSN 130)
  ← CRASH AGAIN

Second recovery:
  Phase 1 (Analysis): Finds T₂, T₃ still active
  Phase 2 (Redo): Redoes ALL log records including the CLRs at LSN 120, 130
                  (This re-applies the undo operations)
  Phase 3 (Undo): For T₃, follows the CLR at LSN 130 → UndoNextLSN = 70
                  T₃'s LSNs 110 and 90 are NOT undone again (CLRs skip them)

The CLR chain ensures undo operations are IDEMPOTENT.
No work is repeated, no matter how many times recovery crashes.
7. Shadow Paging¶
7.1 Concept¶
Shadow paging is an alternative to WAL-based recovery. Instead of logging changes, it maintains two page tables:
Shadow Paging:
Current Page Table:               Shadow Page Table (on disk):
┌───┬───────┐                     ┌───┬───────┐
│ 1 │ → P1' │ (modified)          │ 1 │ → P1  │ (original)
│ 2 │ → P2  │ (unchanged)         │ 2 │ → P2  │
│ 3 │ → P3' │ (modified)          │ 3 │ → P3  │ (original)
│ 4 │ → P4  │ (unchanged)         │ 4 │ → P4  │
└───┴───────┘                     └───┴───────┘

On commit: Replace shadow page table with current page table
On abort:  Discard current page table, revert to shadow
7.2 Shadow Paging vs. WAL¶
| Aspect | Shadow Paging | WAL (ARIES) |
|---|---|---|
| Recovery speed | Fast (just use shadow) | Slower (three phases) |
| Normal operation | Slower (copy-on-write) | Faster (in-place update) |
| Concurrent txns | Difficult to support | Naturally supports |
| Fragmentation | High (scattered pages) | Low |
| Commit overhead | Atomic page table swap | Log flush |
| Used in | SQLite (journal mode), CouchDB | PostgreSQL, MySQL, Oracle, SQL Server |
Shadow paging is rarely used in modern multi-user database systems due to poor support for concurrent transactions and page fragmentation.
8. Buffer Management Policies¶
8.1 The Four Combinations¶
Buffer management policies determine when dirty pages are written to disk:
Steal Policy: Can the buffer manager flush a dirty page belonging to an uncommitted transaction?
- Steal: Yes, dirty pages can be flushed at any time (requires UNDO capability)
- No-Steal: No, dirty pages of uncommitted transactions stay in the buffer

Force Policy: Must all dirty pages of a transaction be flushed to disk at commit time?
- Force: Yes, flush all dirty pages on commit (no REDO needed after crash)
- No-Force: No, dirty pages may be flushed later (requires REDO capability)
            │  No-Steal            │  Steal
────────────┼──────────────────────┼────────────────────────
  Force     │  No undo, no redo    │  Undo, no redo
            │  (simplest recovery  │  (rare in practice)
            │   but worst perf)    │
────────────┼──────────────────────┼────────────────────────
  No-Force  │  No undo, redo       │  Undo AND redo
            │  (deferred mod.)     │  (ARIES, most systems)
            │                      │  ← Best performance
8.2 Why Steal/No-Force is Best¶
Steal advantage:
  Without steal: an uncommitted transaction's dirty pages MUST stay in buffer
    → Large transactions can exhaust the buffer pool
    → Limits the number of concurrent transactions
  With steal: the buffer manager can evict any page as needed
    → Better buffer utilization
    → Supports larger transactions

No-Force advantage:
  With force: ALL dirty pages flushed at commit
    → Very slow commits (especially if many pages were modified)
    → Random I/O pattern (scattered dirty pages)
  With no-force: pages flushed lazily by the background writer
    → Fast commits (just flush the log)
    → Sequential I/O for the log
    → Pages flushed in batches (more efficient)
8.3 Interaction with Recovery¶
| Policy | UNDO needed? | REDO needed? | Explanation |
|---|---|---|---|
| Steal | Yes | - | Stolen page may have uncommitted data on disk |
| No-Steal | No | - | Uncommitted data never reaches disk |
| Force | - | No | All committed data is on disk at commit |
| No-Force | - | Yes | Committed data may still be only in buffer |
ARIES uses Steal + No-Force, which requires both UNDO and REDO capabilities -- but provides the best runtime performance.
9. Media Recovery and Backup Strategies¶
9.1 The Need for Backups¶
WAL-based recovery handles transaction failures and system crashes, but NOT media failures (disk destruction). For media recovery, we need backups.
Media Recovery:
Time: ───────────────────────────────────────────────→
        │               │                  │
   Full Backup     Incremental        Disk Failure
       B₀           Backup B₁              │
        │               │             Restore B₀
        │               │             Apply B₁
        │               │             Replay log from B₁ to failure
        │               │                  ↓
        │               │             Database recovered!
9.2 Backup Types¶
Full Backup (Base Backup):
- Complete copy of the entire database
- Self-contained: can restore from this alone
- Large size, time-consuming
- Typically done weekly or monthly

Incremental Backup:
- Only pages/blocks changed since the last backup
- Smaller and faster than a full backup
- Requires the base backup + all incrementals to restore
- Typically done daily

Continuous Archiving (WAL Archiving):
- Archive WAL segments as they are completed
- Combined with a base backup, allows Point-in-Time Recovery (PITR)
PostgreSQL Backup Strategy:
┌──────────────────────────────────────────────────────┐
│ Sunday:   Full base backup (pg_basebackup)           │
│ Mon-Sat:  Continuous WAL archiving                   │
│                                                      │
│ To restore to Wednesday 3:00 PM:                     │
│   1. Restore Sunday's base backup                    │
│   2. Replay WAL from Sunday to Wed 3:00 PM           │
│   3. Database is at the exact state of Wed 3:00 PM   │
└──────────────────────────────────────────────────────┘
9.3 PostgreSQL Backup Commands¶
# Full base backup
pg_basebackup -D /backup/base -Ft -z -P
# Configure WAL archiving (in postgresql.conf)
# archive_mode = on
# archive_command = 'cp %p /backup/wal/%f'
# Point-in-Time Recovery
# 1. Stop PostgreSQL
# 2. Restore base backup to data directory
# 3. Create recovery.signal file
# 4. Set restore_command and recovery_target_time in postgresql.conf
# restore_command = 'cp /backup/wal/%f %p'
# recovery_target_time = '2025-03-15 15:00:00'
# 5. Start PostgreSQL β automatic recovery to target time
9.4 Logical vs. Physical Backup¶
Physical Backup (pg_basebackup):
┌─────────────────────────────────────────────┐
│ Copies raw data files (binary)              │
│ + Fast backup and restore                   │
│ + Supports PITR with WAL archiving          │
│ - Same PostgreSQL major version required    │
│ - Same architecture (OS, endianness)        │
└─────────────────────────────────────────────┘

Logical Backup (pg_dump):
┌─────────────────────────────────────────────┐
│ Exports SQL statements or custom format     │
│ + Version-independent                       │
│ + Can restore to different architecture     │
│ + Can selectively restore tables            │
│ - Slower backup and restore                 │
│ - No PITR (point-in-time snapshot only)     │
└─────────────────────────────────────────────┘
9.5 Backup Best Practices¶
The 3-2-1 Rule:
3 copies of data (original + 2 backups)
2 different storage media (disk + tape/cloud)
1 offsite copy (different physical location)
Additional guidelines:
- Test restores regularly (untested backup = no backup)
- Monitor WAL archiving (gaps mean data loss risk)
- Retention policy: keep backups for regulatory period
- Encrypt backups (especially offsite ones)
- Document the recovery procedure
10. Recovery in Modern Systems¶
10.1 Recovery and Replication¶
Modern systems often combine recovery with replication for both durability and high availability:
Synchronous Replication:
  Primary ──WAL──→ Standby
     │                │
     │  Commit only after standby
     │  confirms WAL receipt
     │                │
     ▼                ▼
  Data File       Data File

If primary fails:
  Standby already has all committed WAL
  Promote standby to primary (seconds)
  No data loss (RPO = 0)
10.2 Performance Considerations¶
Recovery Time Components:
┌──────────────────────────────────────────┐
│ Analysis Phase: Fast (scan log once)     │
│   Time: proportional to log written      │
│   since the last checkpoint              │
├──────────────────────────────────────────┤
│ Redo Phase: Can be slow                  │
│   Time: proportional to # of log records │
│   since min(recLSN in DPT)               │
│   Optimization: parallel redo by page    │
├──────────────────────────────────────────┤
│ Undo Phase: Usually fast                 │
│   Time: proportional to # of updates     │
│   by uncommitted transactions            │
│   Note: Can be deferred (lazy undo)      │
└──────────────────────────────────────────┘
To minimize recovery time:
- Frequent checkpoints (less to redo)
- Short transactions (less to undo)
- Sufficient WAL buffer (fewer forced flushes)
10.3 Parallel Recovery¶
Modern systems perform redo in parallel:
Parallel Redo:
Log records: [P1, P3, P1, P2, P3, P1, P2, ...]
Thread 1 (Page P1): redo [LSN1, LSN3, LSN6, ...]
Thread 2 (Page P2): redo [LSN4, LSN7, ...]
Thread 3 (Page P3): redo [LSN2, LSN5, ...]
Records for the SAME page must be applied in order.
Records for DIFFERENT pages can be applied in parallel.
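The partitioning step can be sketched as follows (an illustrative helper, not any real system's code): the global LSN order is preserved within each page's queue, which is exactly the invariant parallel redo needs before handing the queues to worker threads.

```python
from collections import defaultdict

def partition_by_page(records):
    """Split a redo stream into per-page LSN lists, preserving per-page order."""
    queues = defaultdict(list)
    for lsn, page in records:       # records arrive in global LSN order
        queues[page].append(lsn)    # appending keeps per-page LSN order
    return dict(queues)

# The stream from the diagram: page touched at each LSN
stream = [(1, "P1"), (2, "P3"), (3, "P1"), (4, "P2"), (5, "P3"), (6, "P1"), (7, "P2")]
print(partition_by_page(stream))
# {'P1': [1, 3, 6], 'P3': [2, 5], 'P2': [4, 7]}
```

Each resulting list can be replayed by an independent worker with no cross-page coordination, since redo of one page never reads another page.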
11. Exercises¶
Conceptual Questions¶
Exercise 1: Classify each of the following failures and describe the appropriate recovery action:
(a) A transaction divides by zero
(b) A power outage occurs
(c) A disk head crashes, destroying the data disk
(d) A deadlock is detected
(e) The operating system kernel panics
Exercise 2: Explain the difference between volatile, nonvolatile, and stable storage. Why is "true" stable storage impossible to achieve in practice? How do real systems approximate it?
Exercise 3: Explain why the WAL (Write-Ahead Logging) protocol requires log records to be flushed to stable storage BEFORE the corresponding data pages. What could go wrong if data pages were flushed first?
Log-Based Recovery¶
Exercise 4: Given the following log (no checkpoints):
LSN  Record
 1   <T₁, start>
 2   <T₁, A, 10, 20>
 3   <T₂, start>
 4   <T₂, B, 30, 40>
 5   <T₁, C, 50, 60>
 6   <T₁, commit>
 7   <T₂, D, 70, 80>
 8   <T₃, start>
 9   <T₃, A, 20, 30>
10   <T₃, commit>
     ← CRASH
(a) Which transactions must be redone? Which must be undone?
(b) What are the final values of A, B, C, D after recovery?
(c) Show the complete recovery process step by step.

Exercise 5: Repeat Exercise 4, but with a checkpoint <checkpoint {T₂}> inserted between LSN 6 and LSN 7. How does the checkpoint change the recovery process?
ARIES Algorithm¶
Exercise 6: Given the following log with a fuzzy checkpoint:
LSN  Record                                           PrevLSN
 10  <T₁, start>                                      -
 20  <T₁, P1, X: 5→10>                                10
 30  <T₂, start>                                      -
 40  <T₂, P2, Y: 15→25>                               30
 50  <begin_checkpoint>
 55  <end_checkpoint ATT={T₁(20),T₂(40)}, DPT={P1:20, P2:40}>
 60  <T₁, P3, Z: 35→45>                               20
 70  <T₂, commit>
 80  <T₃, start>                                      -
 90  <T₃, P1, X: 10→20>                               80
100  <T₁, P2, W: 50→60>                               60
     ← CRASH
(a) Perform the Analysis phase. Show the final ATT and DPT.
(b) Determine the starting LSN for the Redo phase.
(c) Perform the Redo phase. List which log records are redone (assume all pages need redo).
(d) Perform the Undo phase. Show all CLRs written and the final ToUndo set at each step.
(e) What are the final values of X, Y, Z, W after recovery?
Exercise 7: Explain why ARIES "repeats history" during the redo phase (i.e., redoes even uncommitted transactions' updates). Why not skip uncommitted transactions' updates and only redo committed ones?
Exercise 8: A system crashes during the UNDO phase of ARIES recovery. At the time of the second crash, CLRs have been written for some (but not all) undo operations. Explain step by step what happens during the second recovery attempt. Why is no work repeated?
Buffer Management¶
Exercise 9: For each combination of steal/no-steal and force/no-force policies, explain:
(a) Whether UNDO capability is needed and why
(b) Whether REDO capability is needed and why
(c) The impact on runtime performance (commit latency, buffer utilization)
(d) Name a system or context where this combination is appropriate
Exercise 10: A transaction T modifies 1000 pages. Under a force policy, all 1000 pages must be flushed to disk at commit time. Under a no-force policy, only the log records (~10 KB) need to be flushed. If each page flush takes 5ms (random I/O) and the log flush takes 2ms (sequential I/O):
(a) Calculate the commit latency under force vs. no-force policies
(b) If the system processes 100 such transactions per second, what is the I/O bandwidth requirement under each policy?
(c) Why do virtually all modern database systems use no-force?
Backup and Media Recovery¶
Exercise 11: A PostgreSQL database has the following backup schedule:
- Full base backup every Sunday at 2:00 AM
- Continuous WAL archiving
On Wednesday at 4:00 PM, a junior DBA accidentally runs DROP TABLE orders; and commits. The error is discovered at 5:00 PM. Describe the step-by-step Point-in-Time Recovery process to restore the database to Wednesday 3:59 PM (one minute before the DROP TABLE).
Exercise 12: Compare the recovery time objectives for three failure scenarios in a system with:
- Checkpoints every 5 minutes
- Base backup every week
- WAL archived continuously
- Synchronous standby replica

(a) Transaction failure (single transaction aborted): Estimated recovery time?
(b) System failure (server crashes): Estimated recovery time?
(c) Disk failure (primary data disk destroyed): Estimated recovery time? What is the difference between promoting the standby vs. restoring from backup?