Lesson 5: Top-Down Parsing
Lesson 5: Top-Down Parsing¶
Learning Objectives¶
After completing this lesson, you will be able to:
- Understand the principles of top-down parsing and how it constructs parse trees from the root downward
- Compute FIRST and FOLLOW sets for any context-free grammar
- Construct LL(1) parsing tables and identify LL(1) grammars
- Implement recursive descent parsers by hand for small languages
- Implement table-driven LL(1) parsers from constructed parsing tables
- Eliminate left recursion and apply left factoring to make grammars LL(1)-compatible
- Resolve LL(1) conflicts and apply error recovery strategies
- Explain the extensions LL(k) and ALL(*) parsing provide beyond LL(1)
1. Introduction to Top-Down Parsing¶
Top-down parsing is a parsing strategy that builds the parse tree starting from the root (the start symbol) and works its way down toward the leaves (the terminal symbols). At each step, the parser attempts to predict which production to apply based on the current input token.
The fundamental question in top-down parsing is:
Given a nonterminal $A$ on top of the parsing stack and the current lookahead token $a$, which production $A \to \alpha$ should the parser apply?
Top-down parsers come in two main forms:
- Recursive descent parsers -- a collection of mutually recursive procedures, one per nonterminal
- Table-driven predictive parsers -- driven by an LL(1) parsing table and an explicit stack
Both forms require the grammar to be LL(1) (or close to it). The name LL(1) stands for:
- L: scan input Left to right
- L: produce a Leftmost derivation
- 1: use 1 token of lookahead
Top-Down Parse Tree Construction (for input "a + b * c"):
Step 1: E Step 2: E
/|\ /|\
? ? ? T E'
/
?
Step 3: E Step 4: E
/|\ /|\
T E' T E'
/ | /| |\
F T' F T' + T E'
| | |
a ... a Ξ΅ ...
The parser proceeds in a leftmost derivation: at every step, it expands the leftmost nonterminal.
2. Recursive Descent Parsing¶
2.1 Basic Concept¶
In a recursive descent parser, each nonterminal in the grammar corresponds to a function. The function for nonterminal $A$ examines the current lookahead token and decides which production to use, then calls functions for each symbol in the chosen production's right-hand side.
Consider the classic expression grammar (after eliminating left recursion):
$$ \begin{aligned} E &\to T \; E' \\ E' &\to + \; T \; E' \;\mid\; \varepsilon \\ T &\to F \; T' \\ T' &\to * \; F \; T' \;\mid\; \varepsilon \\ F &\to ( \; E \; ) \;\mid\; \textbf{id} \end{aligned} $$
2.2 Implementation¶
Here is a complete recursive descent parser for arithmetic expressions:
"""
Recursive Descent Parser for Arithmetic Expressions
Grammar (LL(1)-compatible):
E -> T E'
E' -> + T E' | Ξ΅
T -> F T'
T' -> * F T' | Ξ΅
F -> ( E ) | id | num
"""
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional
# βββ Token Types βββ
class TokenType(Enum):
PLUS = auto()
STAR = auto()
LPAREN = auto()
RPAREN = auto()
ID = auto()
NUM = auto()
EOF = auto()
@dataclass
class Token:
type: TokenType
value: str
pos: int # position in input for error reporting
# βββ Lexer βββ
class Lexer:
"""Simple tokenizer for arithmetic expressions."""
def __init__(self, text: str):
self.text = text
self.pos = 0
def _skip_whitespace(self):
while self.pos < len(self.text) and self.text[self.pos].isspace():
self.pos += 1
def next_token(self) -> Token:
self._skip_whitespace()
if self.pos >= len(self.text):
return Token(TokenType.EOF, "", self.pos)
ch = self.text[self.pos]
start = self.pos
if ch == '+':
self.pos += 1
return Token(TokenType.PLUS, "+", start)
elif ch == '*':
self.pos += 1
return Token(TokenType.STAR, "*", start)
elif ch == '(':
self.pos += 1
return Token(TokenType.LPAREN, "(", start)
elif ch == ')':
self.pos += 1
return Token(TokenType.RPAREN, ")", start)
elif ch.isdigit():
while self.pos < len(self.text) and self.text[self.pos].isdigit():
self.pos += 1
return Token(TokenType.NUM, self.text[start:self.pos], start)
elif ch.isalpha() or ch == '_':
while self.pos < len(self.text) and (
self.text[self.pos].isalnum() or self.text[self.pos] == '_'
):
self.pos += 1
return Token(TokenType.ID, self.text[start:self.pos], start)
else:
raise SyntaxError(
f"Unexpected character '{ch}' at position {start}"
)
def tokenize(self) -> list[Token]:
tokens = []
while True:
tok = self.next_token()
tokens.append(tok)
if tok.type == TokenType.EOF:
break
return tokens
# βββ AST Nodes βββ
@dataclass
class ASTNode:
pass
@dataclass
class NumLit(ASTNode):
value: int
@dataclass
class Ident(ASTNode):
name: str
@dataclass
class BinOp(ASTNode):
op: str
left: ASTNode
right: ASTNode
# βββ Recursive Descent Parser βββ
class RecursiveDescentParser:
"""
Parses arithmetic expressions using recursive descent.
Each nonterminal in the grammar has a corresponding method:
E -> T E' => parse_E()
E' -> + T E' | Ξ΅ => parse_E_prime(left)
T -> F T' => parse_T()
T' -> * F T' | Ξ΅ => parse_T_prime(left)
F -> ( E ) | id => parse_F()
"""
def __init__(self, tokens: list[Token]):
self.tokens = tokens
self.pos = 0
@property
def current(self) -> Token:
return self.tokens[self.pos]
def eat(self, expected: TokenType) -> Token:
"""Consume the current token if it matches the expected type."""
tok = self.current
if tok.type != expected:
raise SyntaxError(
f"Expected {expected.name} but found {tok.type.name} "
f"('{tok.value}') at position {tok.pos}"
)
self.pos += 1
return tok
def parse(self) -> ASTNode:
"""Entry point: parse the entire expression."""
tree = self.parse_E()
self.eat(TokenType.EOF)
return tree
def parse_E(self) -> ASTNode:
"""E -> T E'"""
left = self.parse_T()
return self.parse_E_prime(left)
def parse_E_prime(self, left: ASTNode) -> ASTNode:
"""E' -> + T E' | Ξ΅"""
if self.current.type == TokenType.PLUS:
self.eat(TokenType.PLUS)
right = self.parse_T()
node = BinOp("+", left, right)
return self.parse_E_prime(node) # left-associate
else:
# Ξ΅-production: return what we have
return left
def parse_T(self) -> ASTNode:
"""T -> F T'"""
left = self.parse_F()
return self.parse_T_prime(left)
def parse_T_prime(self, left: ASTNode) -> ASTNode:
"""T' -> * F T' | Ξ΅"""
if self.current.type == TokenType.STAR:
self.eat(TokenType.STAR)
right = self.parse_F()
node = BinOp("*", left, right)
return self.parse_T_prime(node) # left-associate
else:
return left
def parse_F(self) -> ASTNode:
"""F -> ( E ) | id | num"""
if self.current.type == TokenType.LPAREN:
self.eat(TokenType.LPAREN)
node = self.parse_E()
self.eat(TokenType.RPAREN)
return node
elif self.current.type == TokenType.NUM:
tok = self.eat(TokenType.NUM)
return NumLit(int(tok.value))
elif self.current.type == TokenType.ID:
tok = self.eat(TokenType.ID)
return Ident(tok.name if hasattr(tok, 'name') else tok.value)
else:
raise SyntaxError(
f"Unexpected token {self.current.type.name} "
f"('{self.current.value}') at position {self.current.pos}"
)
# βββ Pretty Printer βββ
def pretty_print(node: ASTNode, indent: int = 0) -> str:
"""Produce a human-readable representation of the AST."""
prefix = " " * indent
if isinstance(node, NumLit):
return f"{prefix}NumLit({node.value})"
elif isinstance(node, Ident):
return f"{prefix}Ident({node.name})"
elif isinstance(node, BinOp):
left_str = pretty_print(node.left, indent + 1)
right_str = pretty_print(node.right, indent + 1)
return (
f"{prefix}BinOp({node.op})\n"
f"{left_str}\n"
f"{right_str}"
)
return f"{prefix}Unknown({node})"
# βββ Demo βββ
if __name__ == "__main__":
examples = [
"3 + 4 * 5",
"(a + b) * c",
"x + y + z",
"2 * (3 + 4)",
]
for expr in examples:
print(f"\nInput: {expr}")
tokens = Lexer(expr).tokenize()
print(f"Tokens: {[(t.type.name, t.value) for t in tokens]}")
ast = RecursiveDescentParser(tokens).parse()
print(f"AST:\n{pretty_print(ast)}")
Output for "3 + 4 * 5":
Input: 3 + 4 * 5
Tokens: [('NUM', '3'), ('PLUS', '+'), ('NUM', '4'), ('STAR', '*'), ('NUM', '5'), ('EOF', '')]
AST:
BinOp(+)
NumLit(3)
BinOp(*)
NumLit(4)
NumLit(5)
2.3 Advantages and Limitations¶
Advantages:
- Simple to understand and implement by hand
- Natural error messages (each function knows what it expects)
- Easy to embed semantic actions (AST construction, type checking)
- Can handle some non-LL(1) constructs with manual backtracking
Limitations:
- Requires the grammar to be (approximately) LL(1)
- Cannot handle left-recursive grammars directly
- Performance may degrade with excessive backtracking
- Maintaining large parsers by hand is error-prone
3. FIRST and FOLLOW Sets¶
The FIRST and FOLLOW sets are the key data structures that make predictive parsing possible. They answer two essential questions:
- FIRST($\alpha$): What terminals can appear as the first symbol of a string derived from $\alpha$?
- FOLLOW($A$): What terminals can appear immediately after nonterminal $A$ in some sentential form?
3.1 FIRST Sets¶
Definition. For a string of grammar symbols $\alpha$, $\text{FIRST}(\alpha)$ is the set of terminals that can begin strings derived from $\alpha$. If $\alpha \Rightarrow^* \varepsilon$, then $\varepsilon \in \text{FIRST}(\alpha)$.
Algorithm to compute FIRST sets:
Algorithm: Compute FIRST(X) for all grammar symbols X
1. If X is a terminal:
FIRST(X) = {X}
2. If X is a nonterminal:
For each production X -> Y1 Y2 ... Yk:
Add FIRST(Y1) - {Ξ΅} to FIRST(X)
If Ξ΅ β FIRST(Y1):
Add FIRST(Y2) - {Ξ΅} to FIRST(X)
If Ξ΅ β FIRST(Y2):
Add FIRST(Y3) - {Ξ΅} to FIRST(X)
...continue until Yi where Ξ΅ β FIRST(Yi)
If Ξ΅ β FIRST(Yi) for all i = 1..k:
Add Ξ΅ to FIRST(X)
3. If X -> Ξ΅ is a production:
Add Ξ΅ to FIRST(X)
Repeat until no FIRST set changes.
Extending FIRST to strings:
For a string $\alpha = X_1 X_2 \cdots X_n$:
$$ \text{FIRST}(X_1 X_2 \cdots X_n) = \begin{cases} \text{FIRST}(X_1) & \text{if } \varepsilon \notin \text{FIRST}(X_1) \\ (\text{FIRST}(X_1) - \{\varepsilon\}) \cup \text{FIRST}(X_2 \cdots X_n) & \text{if } \varepsilon \in \text{FIRST}(X_1) \end{cases} $$
3.2 FOLLOW Sets¶
Definition. For a nonterminal $A$, $\text{FOLLOW}(A)$ is the set of terminals that can appear immediately to the right of $A$ in some sentential form. The end-of-input marker $\ LATEX2 (columns). Each entry $M[A, a]$ contains a production to apply.
Algorithm: LL(1) Table Construction
For each production A -> Ξ± in the grammar:
1. For each terminal a in FIRST(Ξ±):
Add "A -> Ξ±" to M[A, a]
2. If Ξ΅ β FIRST(Ξ±):
For each terminal b in FOLLOW(A):
Add "A -> Ξ±" to M[A, b]
If $ β FOLLOW(A):
Add "A -> Ξ±" to M[A, $]
Definition. A grammar is LL(1) if and only if every entry in the parsing table $M$ contains at most one production.
4.2 Python Implementation¶
def build_ll1_table(
grammar: Grammar,
first: dict[str, set[str]],
follow: dict[str, set[str]],
) -> tuple[dict[tuple[str, str], list[str]], list[str]]:
"""
Build the LL(1) parsing table.
Returns:
(table, conflicts) where:
- table: dict mapping (nonterminal, terminal) -> production RHS
- conflicts: list of conflict descriptions (empty if grammar is LL(1))
"""
table: dict[tuple[str, str], list[list[str]]] = {}
conflicts: list[str] = []
for lhs, rhs_list in grammar.productions.items():
for rhs in rhs_list:
# Compute FIRST of the RHS
rhs_first = first_of_string(rhs, first)
# Rule 1: For each terminal a in FIRST(rhs)
for a in rhs_first:
if a == EPSILON:
continue
key = (lhs, a)
if key not in table:
table[key] = []
table[key].append(rhs)
# Rule 2: If Ξ΅ in FIRST(rhs), for each b in FOLLOW(lhs)
if EPSILON in rhs_first:
for b in follow[lhs]:
key = (lhs, b)
if key not in table:
table[key] = []
table[key].append(rhs)
# Check for conflicts
final_table: dict[tuple[str, str], list[str]] = {}
for key, prods in table.items():
if len(prods) > 1:
conflict_strs = [" ".join(p) for p in prods]
conflicts.append(
f"Conflict at M[{key[0]}, {key[1]}]: "
f"{' vs '.join(conflict_strs)}"
)
final_table[key] = prods[0]
return final_table, conflicts
def print_ll1_table(
table: dict[tuple[str, str], list[str]],
grammar: Grammar,
):
"""Pretty-print the LL(1) parsing table."""
terminals_list = sorted(grammar.terminals) + [EOF]
nonterminals_list = sorted(grammar.nonterminals)
# Compute column widths
col_width = max(
max(len(t) for t in terminals_list),
max(
len(" ".join(table.get((nt, t), [])))
for nt in nonterminals_list
for t in terminals_list
),
8
) + 2
# Header
header = f"{'':>6}" + "".join(f"{t:>{col_width}}" for t in terminals_list)
print(header)
print("-" * len(header))
# Rows
for nt in nonterminals_list:
row = f"{nt:>6}"
for t in terminals_list:
entry = table.get((nt, t))
if entry:
cell = f"{nt}->{' '.join(entry)}"
else:
cell = ""
row += f"{cell:>{col_width}}"
print(row)
if __name__ == "__main__":
grammar = build_expression_grammar()
first = compute_first(grammar)
follow = compute_follow(grammar, first)
table, conflicts = build_ll1_table(grammar, first, follow)
if conflicts:
print("LL(1) conflicts found:")
for c in conflicts:
print(f" {c}")
else:
print("Grammar is LL(1)!")
print("\nLL(1) Parsing Table:")
print_ll1_table(table, grammar)
4.3 The Expression Grammar Table¶
For the expression grammar, the LL(1) table is:
( ) * + id $
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
E E->T E' E->T E'
E' E'->Ξ΅ E'->+T E' E'->Ξ΅
F F->(E) F->id
T T->F T' T->F T'
T' T'->Ξ΅ T'->*F T' T'->Ξ΅ T'->Ξ΅
Reading the table: If the parser sees nonterminal $E$ on top of the stack and the lookahead is id, it applies the production $E \to T\ E'$.
5. Table-Driven LL(1) Parser¶
5.1 The Parsing Algorithm¶
The table-driven LL(1) parser uses an explicit stack and the parsing table to simulate a leftmost derivation.
Algorithm: Table-Driven LL(1) Parsing
Input: string w$, parsing table M
Output: leftmost derivation of w or error
Initialize stack = [S, $] (start symbol on top)
Set ip to first symbol of w$
repeat:
let X = top of stack, a = current input symbol
if X is a terminal:
if X == a:
pop X from stack
advance ip ("match")
else:
error()
else if X == $:
if a == $:
accept() ("success")
else:
error()
else: (X is a nonterminal)
if M[X, a] is defined as X -> Y1 Y2 ... Yk:
pop X from stack
push Yk, ..., Y2, Y1 onto stack (Y1 on top)
output "X -> Y1 Y2 ... Yk"
else:
error()
5.2 Complete Implementation¶
"""
Table-Driven LL(1) Parser
Implements a complete LL(1) parser using the parsing table constructed
from FIRST and FOLLOW sets.
"""
class LL1Parser:
"""
A table-driven LL(1) parser.
Uses an explicit stack and a parsing table to parse input strings.
"""
def __init__(
self,
grammar: Grammar,
table: dict[tuple[str, str], list[str]],
):
self.grammar = grammar
self.table = table
def parse(self, tokens: list[str], verbose: bool = False) -> bool:
"""
Parse a list of terminal symbols.
Args:
tokens: list of terminal symbols (without $)
verbose: if True, print each step
Returns:
True if the input is accepted, False otherwise.
"""
# Add end marker
input_symbols = tokens + [EOF]
ip = 0 # input pointer
# Initialize stack with $ and start symbol
stack = [EOF, self.grammar.start]
step = 0
if verbose:
print(f"{'Step':>4} {'Stack':<30} {'Input':<30} {'Action'}")
print("-" * 90)
while True:
top = stack[-1]
current = input_symbols[ip]
remaining = " ".join(input_symbols[ip:])
if verbose:
stack_str = " ".join(reversed(stack))
print(
f"{step:>4} {stack_str:<30} {remaining:<30}",
end="",
)
# Case 1: Top is $ (end marker)
if top == EOF:
if current == EOF:
if verbose:
print("ACCEPT")
return True
else:
if verbose:
print("ERROR: unexpected input after stack empty")
return False
# Case 2: Top is a terminal
if top in self.grammar.terminals:
if top == current:
stack.pop()
ip += 1
if verbose:
print(f"Match '{top}'")
else:
if verbose:
print(
f"ERROR: expected '{top}', found '{current}'"
)
return False
# Case 3: Top is a nonterminal
elif top in self.grammar.nonterminals:
key = (top, current)
if key in self.table:
production = self.table[key]
stack.pop()
# Push RHS in reverse order (so first symbol is on top)
if production != [EPSILON]:
for symbol in reversed(production):
stack.append(symbol)
prod_str = " ".join(production)
if verbose:
print(f"Apply {top} -> {prod_str}")
else:
if verbose:
print(
f"ERROR: no entry for M[{top}, {current}]"
)
return False
else:
if verbose:
print(f"ERROR: unknown symbol '{top}'")
return False
step += 1
# Safety: prevent infinite loops
if step > 10000:
if verbose:
print("ERROR: too many steps, possible infinite loop")
return False
# βββ Demo βββ
def demo_ll1_parser():
"""Demonstrate the LL(1) parser on the expression grammar."""
grammar = build_expression_grammar()
first = compute_first(grammar)
follow = compute_follow(grammar, first)
table, conflicts = build_ll1_table(grammar, first, follow)
parser = LL1Parser(grammar, table)
# Test inputs
tests = [
(["id", "+", "id", "*", "id"], True),
(["(", "id", "+", "id", ")", "*", "id"], True),
(["id", "+", "+"], False),
(["(", "id"], False),
]
for tokens, expected in tests:
print(f"\n{'='*90}")
print(f"Input: {' '.join(tokens)}")
print(f"{'='*90}")
result = parser.parse(tokens, verbose=True)
status = "ACCEPTED" if result else "REJECTED"
assert result == expected, f"Expected {expected}, got {result}"
print(f"\nResult: {status}")
if __name__ == "__main__":
demo_ll1_parser()
Example trace for "id + id * id":
Step Stack Input Action
------------------------------------------------------------------------------------------
0 E $ id + id * id $ Apply E -> T E'
1 T E' $ id + id * id $ Apply T -> F T'
2 F T' E' $ id + id * id $ Apply F -> id
3 id T' E' $ id + id * id $ Match 'id'
4 T' E' $ + id * id $ Apply T' -> Ξ΅
5 E' $ + id * id $ Apply E' -> + T E'
6 + T E' $ + id * id $ Match '+'
7 T E' $ id * id $ Apply T -> F T'
8 F T' E' $ id * id $ Apply F -> id
9 id T' E' $ id * id $ Match 'id'
10 T' E' $ * id $ Apply T' -> * F T'
11 * F T' E' $ * id $ Match '*'
12 F T' E' $ id $ Apply F -> id
13 id T' E' $ id $ Match 'id'
14 T' E' $ $ Apply T' -> Ξ΅
15 E' $ $ Apply E' -> Ξ΅
16 $ $ ACCEPT
6. Left Recursion Elimination¶
6.1 The Problem¶
A grammar is left-recursive if there exists a nonterminal $A$ such that $A \Rightarrow^+ A\alpha$ for some string $\alpha$. Top-down parsers cannot handle left recursion because parsing $A$ would immediately require parsing $A$ again, leading to infinite recursion.
Immediate left recursion has the form:
$$A \to A\alpha_1 \mid A\alpha_2 \mid \cdots \mid A\alpha_m \mid \beta_1 \mid \beta_2 \mid \cdots \mid \beta_n$$
where none of the $\beta_i$ starts with $A$.
6.2 Eliminating Immediate Left Recursion¶
Transformation rule:
Replace: $$A \to A\alpha_1 \mid A\alpha_2 \mid \cdots \mid \beta_1 \mid \beta_2 \mid \cdots$$
With: $$ \begin{aligned} A &\to \beta_1 A' \mid \beta_2 A' \mid \cdots \\ A' &\to \alpha_1 A' \mid \alpha_2 A' \mid \cdots \mid \varepsilon \end{aligned} $$
Example: The left-recursive expression grammar:
$$E \to E + T \mid T$$
Becomes:
$$ \begin{aligned} E &\to T\ E' \\ E' &\to +\ T\ E' \mid \varepsilon \end{aligned} $$
6.3 Eliminating Indirect Left Recursion¶
Indirect (or hidden) left recursion occurs when the cycle involves multiple nonterminals:
$$ \begin{aligned} A &\to B\alpha \\ B &\to A\beta \end{aligned} $$
Here $A \Rightarrow B\alpha \Rightarrow A\beta\alpha$, so $A$ is indirectly left-recursive.
Algorithm: Eliminate All Left Recursion
Input: Grammar G with no cycles or Ξ΅-productions
(except Ξ΅-productions introduced by the algorithm)
1. Order the nonterminals: A1, A2, ..., An
2. For i = 1 to n:
For j = 1 to i-1:
Replace each production Ai -> Aj Ξ³ with:
Ai -> Ξ΄1 Ξ³ | Ξ΄2 Ξ³ | ... | Ξ΄k Ξ³
where Aj -> Ξ΄1 | Ξ΄2 | ... | Ξ΄k are all Aj-productions
Eliminate immediate left recursion from Ai productions
6.4 Python Implementation¶
def eliminate_immediate_left_recursion(
lhs: str, productions: list[list[str]]
) -> tuple[str, list[list[str]], list[list[str]]]:
"""
Eliminate immediate left recursion from productions of a nonterminal.
Args:
lhs: the nonterminal
productions: list of RHS alternatives
Returns:
(new_nonterminal, new_productions_for_lhs, productions_for_new)
"""
left_recursive = []
non_recursive = []
for rhs in productions:
if rhs[0] == lhs:
# A -> A Ξ± => Ξ± part
left_recursive.append(rhs[1:])
else:
non_recursive.append(rhs)
if not left_recursive:
# No left recursion to eliminate
return lhs, productions, []
# Create new nonterminal A'
new_nt = lhs + "'"
# A -> Ξ²1 A' | Ξ²2 A' | ...
new_lhs_prods = []
for beta in non_recursive:
if beta == [EPSILON]:
new_lhs_prods.append([new_nt])
else:
new_lhs_prods.append(beta + [new_nt])
# A' -> Ξ±1 A' | Ξ±2 A' | ... | Ξ΅
new_nt_prods = []
for alpha in left_recursive:
new_nt_prods.append(alpha + [new_nt])
new_nt_prods.append([EPSILON])
return new_nt, new_lhs_prods, new_nt_prods
def eliminate_all_left_recursion(grammar: Grammar) -> Grammar:
"""
Eliminate all left recursion (immediate and indirect) from a grammar.
Returns a new grammar with no left recursion.
"""
new_grammar = Grammar()
nonterminals = list(grammar.nonterminals)
# Working copy of productions
current_prods: dict[str, list[list[str]]] = {
nt: list(rhs_list) for nt, rhs_list in grammar.productions.items()
}
for i, ai in enumerate(nonterminals):
# Step 1: Replace Ai -> Aj Ξ³ for j < i
for j in range(i):
aj = nonterminals[j]
new_prods = []
for rhs in current_prods[ai]:
if rhs[0] == aj:
# Substitute: Ai -> Aj Ξ³ becomes Ai -> Ξ΄k Ξ³
gamma = rhs[1:]
for delta in current_prods[aj]:
if delta == [EPSILON]:
new_prods.append(gamma if gamma else [EPSILON])
else:
new_prods.append(delta + gamma)
else:
new_prods.append(rhs)
current_prods[ai] = new_prods
# Step 2: Eliminate immediate left recursion
new_nt, lhs_prods, nt_prods = eliminate_immediate_left_recursion(
ai, current_prods[ai]
)
current_prods[ai] = lhs_prods
if nt_prods:
current_prods[new_nt] = nt_prods
# Build new grammar
for lhs, rhs_list in current_prods.items():
for rhs in rhs_list:
new_grammar.add_production(lhs, rhs)
new_grammar.start = grammar.start
new_grammar.finalize()
return new_grammar
# Example:
if __name__ == "__main__":
g = Grammar()
# Left-recursive expression grammar
g.add_production("E", ["E", "+", "T"])
g.add_production("E", ["T"])
g.add_production("T", ["T", "*", "F"])
g.add_production("T", ["F"])
g.add_production("F", ["(", "E", ")"])
g.add_production("F", ["id"])
g.finalize()
print("Original grammar:")
print(g)
print()
g2 = eliminate_all_left_recursion(g)
print("After left-recursion elimination:")
print(g2)
7. Left Factoring¶
7.1 The Problem¶
Left factoring is needed when two or more productions for a nonterminal share a common prefix, causing the parser to be unable to decide which production to use.
Example:
$$ \text{Stmt} \to \textbf{if}\ E\ \textbf{then}\ S\ \textbf{else}\ S \mid \textbf{if}\ E\ \textbf{then}\ S $$
Both alternatives start with if E then S, so with a single lookahead token the parser cannot choose.
7.2 The Transformation¶
For productions $A \to \alpha\beta_1 \mid \alpha\beta_2$, replace with:
$$ \begin{aligned} A &\to \alpha\ A' \\ A' &\to \beta_1 \mid \beta_2 \end{aligned} $$
After left factoring the if-then-else:
$$ \begin{aligned} \text{Stmt} &\to \textbf{if}\ E\ \textbf{then}\ S\ \text{Stmt'} \\ \text{Stmt'} &\to \textbf{else}\ S \mid \varepsilon \end{aligned} $$
7.3 Implementation¶
def left_factor(grammar: Grammar) -> Grammar:
"""
Apply left factoring to the grammar.
Returns a new grammar with common prefixes factored out.
"""
new_grammar = Grammar()
counter = 0
for lhs, rhs_list in grammar.productions.items():
remaining = list(rhs_list)
while remaining:
# Find the longest common prefix among remaining productions
if len(remaining) == 1:
new_grammar.add_production(lhs, remaining[0])
remaining = []
continue
# Group by first symbol
groups: dict[str, list[list[str]]] = {}
for rhs in remaining:
key = rhs[0] if rhs else EPSILON
if key not in groups:
groups[key] = []
groups[key].append(rhs)
new_remaining = []
for first_sym, group in groups.items():
if len(group) == 1:
new_grammar.add_production(lhs, group[0])
else:
# Find longest common prefix
prefix = list(group[0])
for rhs in group[1:]:
new_prefix = []
for a, b in zip(prefix, rhs):
if a == b:
new_prefix.append(a)
else:
break
prefix = new_prefix
if not prefix:
# No common prefix; just add all
for rhs in group:
new_grammar.add_production(lhs, rhs)
else:
# Factor out the prefix
counter += 1
new_nt = f"{lhs}_F{counter}"
new_grammar.add_production(
lhs, prefix + [new_nt]
)
for rhs in group:
suffix = rhs[len(prefix):]
if not suffix:
suffix = [EPSILON]
new_grammar.add_production(new_nt, suffix)
remaining = new_remaining
new_grammar.start = grammar.start
new_grammar.finalize()
return new_grammar
8. LL(1) Conflicts and Resolution¶
8.1 Types of Conflicts¶
An LL(1) conflict occurs when a parsing table entry $M[A, a]$ has multiple productions. This can happen for two reasons:
FIRST-FIRST Conflict: Two productions for $A$ have overlapping FIRST sets.
$$A \to \alpha \mid \beta \quad \text{where } \text{FIRST}(\alpha) \cap \text{FIRST}(\beta) \neq \emptyset$$
FIRST-FOLLOW Conflict: A nullable production's FIRST set overlaps with the nonterminal's FOLLOW set.
$$A \to \alpha \mid \varepsilon \quad \text{where } \text{FIRST}(\alpha) \cap \text{FOLLOW}(A) \neq \emptyset$$
8.2 Resolution Strategies¶
| Conflict Type | Resolution Strategy |
|---|---|
| FIRST-FIRST | Left factoring, grammar rewriting |
| FIRST-FOLLOW | Grammar restructuring, or use LL(k)/ALL(*) |
| Dangling else | Convention: match with nearest unmatched if |
| Inherent ambiguity | Rewrite grammar to remove ambiguity |
8.3 The Classic Dangling Else¶
The dangling else problem is perhaps the most famous LL(1) conflict:
Stmt -> if Expr then Stmt else Stmt
| if Expr then Stmt
| other
After left factoring:
Stmt -> if Expr then Stmt Else | other
Else -> else Stmt | Ξ΅
This still has a FIRST-FOLLOW conflict for $\text{Else}$ when the lookahead is else: both $\text{Else} \to \textbf{else}\ \text{Stmt}$ and $\text{Else} \to \varepsilon$ apply.
Resolution: By convention, always choose the non-epsilon production (match else with the nearest if). In the parsing table, simply prefer the else Stmt production.
9. Error Recovery¶
When the parser encounters an error (no entry in the parsing table, or a terminal mismatch), it must recover gracefully rather than simply halting.
9.1 Panic Mode Recovery¶
Panic mode is the simplest and most commonly used error recovery strategy. When an error occurs, the parser discards input symbols until it finds a synchronization token -- typically a member of the FOLLOW set of the nonterminal being expanded.
def panic_mode_recovery(
self,
nonterminal: str,
input_symbols: list[str],
ip: int,
follow: dict[str, set[str]],
) -> int:
"""
Panic mode error recovery.
Skip input tokens until we find one in FOLLOW(nonterminal),
then pop the nonterminal from the stack and continue.
Returns the new input pointer position.
"""
sync_set = follow.get(nonterminal, set())
print(
f" Error: skipping input, looking for sync tokens "
f"in FOLLOW({nonterminal}) = {sync_set}"
)
while ip < len(input_symbols):
if input_symbols[ip] in sync_set:
print(f" Synchronized on '{input_symbols[ip]}'")
return ip
print(f" Skipping '{input_symbols[ip]}'")
ip += 1
return ip
9.2 Phrase-Level Recovery¶
Phrase-level recovery fills in error entries in the parsing table with specific error-handling routines. For example:
| Situation | Recovery Action |
|---|---|
Missing operand (+ * x) |
Insert a dummy operand |
Missing operator (x y) |
Insert a dummy operator |
| Unbalanced parenthesis | Insert missing ) or discard extra ( |
| Missing semicolon | Insert ; and continue |
# Phrase-level error recovery entries for expression grammar
ERROR_ACTIONS = {
# M[E, +]: missing operand before +
("E", "+"): ("insert_id", "Missing operand before '+'"),
# M[F, +]: missing operand
("F", "+"): ("insert_id", "Missing operand"),
# M[F, *]: missing operand
("F", "*"): ("insert_id", "Missing operand"),
# M[E, )]: extra closing paren
("E", ")"): ("skip", "Unexpected ')'"),
# M[F, )]: missing operand before )
("F", ")"): ("insert_id", "Missing operand before ')'"),
}
9.3 Error Productions¶
Some parser generators allow adding error productions to the grammar:
Stmt -> error ; // skip everything until ';'
When the parser cannot match any normal production, the error token matches and consumes input until the synchronizing token (; in this case).
10. LL(k) and ALL(*) Parsing¶
10.1 LL(k) Parsing¶
LL(1) uses a single lookahead token. LL(k) extends this to $k$ tokens of lookahead. The parsing table becomes indexed by a nonterminal and a $k$-length prefix of the remaining input.
Definition. A grammar is LL(k) if for any two leftmost derivations:
$$ \begin{aligned} S &\Rightarrow^*_{lm} wA\alpha \Rightarrow_{lm} w\beta\alpha \Rightarrow^*_{lm} wx \\ S &\Rightarrow^*_{lm} wA\alpha \Rightarrow_{lm} w\gamma\alpha \Rightarrow^*_{lm} wy \end{aligned} $$
If $\text{FIRST}_k(x) = \text{FIRST}_k(y)$, then $\beta = \gamma$ (the same production was used).
Practical impact: As $k$ increases, the parsing table grows exponentially ($O(|\Sigma|^k)$ columns). In practice, $k > 2$ is rare for hand-built LL parsers.
10.2 ALL(*) Parsing¶
ALL(*) (Adaptive LL) is the parsing algorithm used by ANTLR 4. It provides the power of unlimited lookahead without the exponential table size.
Key ideas:
- Arbitrary lookahead: Instead of a fixed $k$, ALL(*) examines as many tokens as needed to resolve the prediction
- Augmented Transition Networks (ATNs): The grammar is represented as a set of recursive state machines
- Dynamic analysis: At runtime, if the ATN simulation encounters a prediction that requires more context, it performs a lookahead DFA construction
- Caching: Once a prediction is resolved for a given lookahead context, the result is cached as a DFA for future reuse
ALL(*) Decision Process:
Regular LL(1) ALL(*) Adaptive
βββββββββββ βββββββββββββββ
β Single β β ATN β
β token β ββresolveβββΆ prod β simulation β
β lookup β β β
βββββββββββ β If ambig: β
β build DFA ββββΆ prod
β cache it β
βββββββββββββββ
Advantages of ALL(*):
- Handles any deterministic context-free language
- No manual left-factoring or left-recursion elimination needed (ANTLR 4 handles these automatically for direct left recursion)
- Error messages can be very precise
- Practical performance is often near-linear
Example in ANTLR 4:
// ANTLR 4 grammar -- no left factoring needed
expr : expr '*' expr // direct left recursion is OK
| expr '+' expr
| '(' expr ')'
| ID
| NUM
;
10.3 Comparison of Top-Down Parsing Strategies¶
| Feature | LL(1) | LL(k) | ALL(*) |
|---|---|---|---|
| Lookahead | 1 token | $k$ tokens | Unbounded |
| Table size | $O(|\Sigma|)$ | $O(|\Sigma|^k)$ | Dynamic DFA |
| Left recursion | Must eliminate | Must eliminate | Direct LR auto-handled |
| Left factoring | Required | Sometimes | Not needed |
| Parser generator | Hand-written / simple | Uncommon | ANTLR 4 |
| Error recovery | Manual | Manual | Automatic (ANTLR) |
| Performance | $O(n)$ | $O(n)$ | $O(n)$ typical, $O(n^2)$ worst |
11. Putting It All Together¶
Let us build a complete mini-language parser that demonstrates the full pipeline from grammar to parsed output.
"""
Complete LL(1) Parser for a Mini-Language
Language syntax:
Program -> StmtList
StmtList -> Stmt StmtList | Ξ΅
Stmt -> id = Expr ;
| print ( Expr ) ;
Expr -> Term Expr'
Expr' -> + Term Expr' | Ξ΅
Term -> Factor Term'
Term' -> * Factor Term' | Ξ΅
Factor -> ( Expr ) | id | num
"""
class MiniLangParser:
"""LL(1) parser for a mini-language with assignments and print."""
def __init__(self):
self.grammar = Grammar()
self._build_grammar()
self.grammar.finalize()
self.first = compute_first(self.grammar)
self.follow = compute_follow(self.grammar, self.first)
self.table, conflicts = build_ll1_table(
self.grammar, self.first, self.follow
)
if conflicts:
raise ValueError(f"Grammar is not LL(1): {conflicts}")
def _build_grammar(self):
g = self.grammar
g.add_production("Program", ["StmtList"])
g.add_production("StmtList", ["Stmt", "StmtList"])
g.add_production("StmtList", [EPSILON])
g.add_production("Stmt", ["id", "=", "Expr", ";"])
g.add_production("Stmt", ["print", "(", "Expr", ")", ";"])
g.add_production("Expr", ["Term", "Expr'"])
g.add_production("Expr'", ["+", "Term", "Expr'"])
g.add_production("Expr'", [EPSILON])
g.add_production("Term", ["Factor", "Term'"])
g.add_production("Term'", ["*", "Factor", "Term'"])
g.add_production("Term'", [EPSILON])
g.add_production("Factor", ["(", "Expr", ")"])
g.add_production("Factor", ["id"])
g.add_production("Factor", ["num"])
def parse(self, tokens: list[str]) -> list[str]:
"""
Parse input tokens and return the list of productions applied.
"""
parser = LL1Parser(self.grammar, self.table)
result = parser.parse(tokens, verbose=True)
return result
if __name__ == "__main__":
mini = MiniLangParser()
# Parse: x = 3 + 4 * 5 ; print ( x ) ;
tokens = [
"id", "=", "num", "+", "num", "*", "num", ";",
"print", "(", "id", ")", ";",
]
print("Parsing: x = 3 + 4 * 5 ; print ( x ) ;")
print("=" * 90)
mini.parse(tokens)
12. Summary¶
Top-down parsing builds parse trees from the root to the leaves, guided by lookahead tokens and prediction tables.
Key concepts:
| Concept | Description |
|---|---|
| Recursive descent | One function per nonterminal; the most intuitive parsing method |
| FIRST sets | Terminals that can begin strings derived from a grammar symbol |
| FOLLOW sets | Terminals that can appear after a nonterminal in any derivation |
| LL(1) table | Maps (nonterminal, terminal) pairs to productions |
| Left recursion elimination | Required for top-down parsing; transforms $A \to A\alpha$ patterns |
| Left factoring | Factors out common prefixes to resolve FIRST-FIRST conflicts |
| Panic mode | Error recovery by skipping to synchronization tokens |
| ALL(*) | Adaptive LL parsing with unbounded lookahead (ANTLR 4) |
When to use top-down parsing:
- When the grammar is naturally LL(1) or close to it
- When you want to write a parser by hand (recursive descent)
- When you need fine-grained control over error messages
- When using ANTLR 4 (which uses ALL(*) internally)
When NOT to use top-down parsing:
- Grammars with inherent left recursion that is hard to eliminate
- Operator-heavy languages where precedence climbing or Pratt parsing is simpler
- When a parser generator like Yacc/Bison (LR-based) is already in your toolchain
Exercises¶
Exercise 1: FIRST and FOLLOW Computation¶
Compute the FIRST and FOLLOW sets for the following grammar:
$$ \begin{aligned} S &\to A\ B \\ A &\to a\ A \mid \varepsilon \\ B &\to b\ B \mid c \end{aligned} $$
Verify your answer by implementing the grammar in the Python code provided and running the computation.
Exercise 2: LL(1) Table Construction¶
Given the grammar:
$$ \begin{aligned} S &\to i\ E\ t\ S\ S' \mid a \\ S' &\to e\ S \mid \varepsilon \\ E &\to b \end{aligned} $$
(where $i$ = if, $t$ = then, $e$ = else, $a$ = assignment, $b$ = boolean expression)
- Compute FIRST and FOLLOW sets.
- Construct the LL(1) parsing table.
- Identify any conflicts. How would you resolve the dangling else?
Exercise 3: Left Recursion Elimination¶
Eliminate all left recursion from the following grammar:
$$ \begin{aligned} S &\to A\ a \mid b \\ A &\to A\ c \mid S\ d \mid e \end{aligned} $$
Note that $A \to S\ d$ creates indirect left recursion. Show each step of the algorithm.
Exercise 4: Recursive Descent Parser Extension¶
Extend the recursive descent parser provided in Section 2.2 to support:
- Subtraction (
-) with the same precedence as addition - Division (
/) with the same precedence as multiplication - Unary negation (
-x) - Integer literals and variable names
Write the modified grammar and the corresponding Python code.
Exercise 5: Error Recovery¶
Implement panic-mode error recovery in the table-driven LL(1) parser. Your implementation should:
- When $M[A, a]$ is empty, report the error with the position
- Skip input tokens until finding one in $\text{FOLLOW}(A)$
- Pop $A$ from the stack and continue parsing
- Report all errors found (not just the first one)
Test with the input "id + * id" (missing operand between + and *).
Exercise 6: Grammar Design Challenge¶
Design an LL(1) grammar for the following language features:
- Variable declarations:
let x = expr; - Assignment:
x = expr; - If-else statements:
if (expr) { stmts } else { stmts } - While loops:
while (expr) { stmts } - Print statements:
print(expr); -
Arithmetic expressions with
+,-,*,/ -
Write the grammar.
- Verify it is LL(1) by computing FIRST and FOLLOW sets.
- If not LL(1), apply left factoring and/or left recursion elimination.
Previous: 04_Context_Free_Grammars.md | Next: 06_Bottom_Up_Parsing.md | Overview