Lesson 4: Context-Free Grammars¶
Learning Objectives¶
After completing this lesson, you will be able to:
- Give the formal definition of a context-free grammar (CFG)
- Use BNF and EBNF notation to write grammars
- Construct leftmost and rightmost derivations for a given string
- Build parse trees and understand their relationship to derivations
- Identify and resolve ambiguity in grammars
- Convert grammars to Chomsky Normal Form (CNF) and Greibach Normal Form (GNF)
- Apply the CYK parsing algorithm
- Understand the connection between CFGs and pushdown automata
- Apply the pumping lemma for context-free languages
- Implement grammar manipulation and CYK parsing in Python
1. Formal Definition of a Context-Free Grammar¶
A context-free grammar (CFG) is a 4-tuple:
$$G = (V, \Sigma, P, S)$$
where:
- $V$ is a finite set of variables (also called nonterminals)
- $\Sigma$ is a finite set of terminal symbols (the alphabet), with $V \cap \Sigma = \emptyset$
- $P$ is a finite set of productions (rules), each of the form $A \rightarrow \alpha$ where $A \in V$ and $\alpha \in (V \cup \Sigma)^*$
- $S \in V$ is the start symbol
Notation Conventions¶
Throughout this lesson, we use:
| Symbol | Represents |
|---|---|
| $A, B, C, S$ | Variables (nonterminals), uppercase |
| $a, b, c$ | Terminals, lowercase |
| $\alpha, \beta, \gamma$ | Strings in $(V \cup \Sigma)^*$ |
| $w, x, y, z$ | Strings in $\Sigma^*$ (terminal strings) |
| $\epsilon$ | The empty string |
Example CFG¶
A grammar for simple arithmetic expressions:
$$G = (\{E, T, F\}, \{+, *, (, ), \texttt{id}\}, P, E)$$
Productions $P$:
$$E \rightarrow E + T \mid T$$ $$T \rightarrow T * F \mid F$$ $$F \rightarrow (E) \mid \texttt{id}$$
This grammar defines the language of arithmetic expressions with +, *, parentheses, and identifiers, respecting the conventional precedence (* binds tighter than +).
Derivation¶
A derivation is a sequence of steps that transforms the start symbol into a string of terminals by repeatedly replacing a variable with the right-hand side of one of its productions.
If $A \rightarrow \beta$ is a production and $\alpha A \gamma$ is a sentential form, then:
$$\alpha A \gamma \Rightarrow \alpha \beta \gamma$$
We write $\alpha \xRightarrow{*} \beta$ for zero or more derivation steps.
Language of a Grammar¶
The language generated by $G$ is:
$$L(G) = \{w \in \Sigma^* \mid S \xRightarrow{*} w\}$$
That is, $L(G)$ is the set of all terminal strings derivable from the start symbol.
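As a concrete example of $L(G)$, here is a minimal standalone sketch (independent of the `Grammar` class defined below) that enumerates the language of $S \rightarrow aSb \mid \epsilon$, which is exactly $\{a^n b^n \mid n \geq 0\}$:

```python
from collections import deque

# Grammar: S -> a S b | ε, written as production bodies over {a, b, S}
productions = {"S": [("a", "S", "b"), ()]}

def enumerate_language(start, max_len):
    """BFS over sentential forms, collecting all-terminal strings up to max_len."""
    results = set()
    queue = deque([(start,)])
    seen = {(start,)}
    while queue:
        form = queue.popleft()
        if len(form) > max_len + 1:   # +1 allows one variable awaiting expansion
            continue
        if all(s not in productions for s in form):
            results.add("".join(form))
            continue
        # Expand the leftmost variable with every alternative
        i = next(idx for idx, s in enumerate(form) if s in productions)
        for body in productions[form[i]]:
            new_form = form[:i] + body + form[i + 1:]
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results

print(sorted(enumerate_language("S", 6), key=len))  # ['', 'ab', 'aabb', 'aaabbb']
```

Every generated string has matching numbers of `a`s and `b`s, as the definition of $L(G)$ predicts.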
Python Representation¶
from dataclasses import dataclass, field
from typing import List, Set, Dict, Tuple, Optional
@dataclass
class Production:
"""A grammar production A -> symbols."""
head: str # Left-hand side (nonterminal)
body: Tuple[str, ...] # Right-hand side (tuple of symbols)
def __repr__(self):
body_str = ' '.join(self.body) if self.body else 'ε'
return f"{self.head} -> {body_str}"
def __hash__(self):
return hash((self.head, self.body))
def __eq__(self, other):
return (isinstance(other, Production) and
self.head == other.head and self.body == other.body)
class Grammar:
"""Context-Free Grammar."""
def __init__(self, variables: Set[str], terminals: Set[str],
productions: List[Production], start: str):
self.variables = variables
self.terminals = terminals
self.productions = productions
self.start = start
# Index productions by head
self.prod_index: Dict[str, List[Production]] = {}
for p in productions:
self.prod_index.setdefault(p.head, []).append(p)
@classmethod
def from_string(cls, grammar_str: str, start: str = None) -> 'Grammar':
"""
Parse a grammar from a multi-line string.
Format: 'A -> α | β | γ' (one nonterminal per line).
Nonterminals are uppercase single letters or quoted names.
Use 'ε' or 'epsilon' for epsilon productions.
"""
variables = set()
terminals = set()
productions = []
for line in grammar_str.strip().split('\n'):
line = line.strip()
if not line or line.startswith('#'):
continue
head, _, body_str = line.partition('->')
head = head.strip()
variables.add(head)
for alt in body_str.split('|'):
alt = alt.strip()
if alt in ('ε', 'epsilon', ''):
productions.append(Production(head, ()))
else:
symbols = tuple(alt.split())
productions.append(Production(head, symbols))
# Infer terminals
for p in productions:
for sym in p.body:
if sym not in variables:
terminals.add(sym)
if start is None:
start = productions[0].head
return cls(variables, terminals, productions, start)
def __repr__(self):
lines = [f"Grammar(start={self.start})"]
for var in sorted(self.variables):
prods = self.prod_index.get(var, [])
bodies = [' '.join(p.body) if p.body else 'ε' for p in prods]
lines.append(f" {var} -> {' | '.join(bodies)}")
return '\n'.join(lines)
def is_terminal(self, symbol: str) -> bool:
return symbol in self.terminals
def is_variable(self, symbol: str) -> bool:
return symbol in self.variables
# Example: Arithmetic expression grammar
expr_grammar = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")
print(expr_grammar)
2. BNF and EBNF Notation¶
Backus-Naur Form (BNF)¶
BNF (Backus-Naur Form), introduced by John Backus for the ALGOL 60 report, is the standard notation for context-free grammars.
<expression> ::= <expression> "+" <term> | <term>
<term> ::= <term> "*" <factor> | <factor>
<factor> ::= "(" <expression> ")" | <identifier>
<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>
<letter> ::= "a" | "b" | ... | "z"
<digit> ::= "0" | "1" | ... | "9"
BNF conventions:
- Nonterminals are enclosed in angle brackets: <expression>
- Terminals are enclosed in quotes: "+"
- ::= means "is defined as"
- | separates alternatives
- Each rule defines one nonterminal
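The `<identifier>` rule above is left-recursive, but since it only appends letters and digits, it describes the same set as the regular expression `[a-z][a-z0-9]*`. A quick standalone check (the helper name is ours, not part of the BNF standard):

```python
import re

LETTERS = set("abcdefghijklmnopqrstuvwxyz")
DIGITS = set("0123456789")

def is_identifier_bnf(s):
    """Match <identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>
    by unrolling the left recursion: one letter, then any letters/digits."""
    if not s or s[0] not in LETTERS:
        return False
    return all(c in LETTERS | DIGITS for c in s[1:])

IDENT_RE = re.compile(r"[a-z][a-z0-9]*")

for s in ["x", "x1", "abc42", "1abc", "", "a_b"]:
    assert is_identifier_bnf(s) == bool(IDENT_RE.fullmatch(s)), s
print("BNF rules and regex agree on all samples")
```

This is no accident: `<identifier>` uses only right-linear/left-linear recursion, so it defines a regular language.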
Extended BNF (EBNF)¶
EBNF adds convenience notation to BNF, reducing the number of rules needed:
| EBNF Notation | Meaning | BNF Equivalent |
|---|---|---|
| {X} or X* | Zero or more repetitions of X | New nonterminal with recursion |
| [X] or X? | Optional (zero or one) | New nonterminal with epsilon alternative |
| (X \| Y) | Grouping | New nonterminal |
| X+ | One or more repetitions | New nonterminal with recursion |
Example: The expression grammar in EBNF:
expression = term { "+" term } .
term = factor { "*" factor } .
factor = "(" expression ")" | identifier .
identifier = letter { letter | digit } .
This is more concise because repetition ({...}) replaces explicit recursion.
EBNF to BNF Conversion¶
EBNF: A = α { β } γ .
BNF: A -> α A' γ
A' -> β A' | ε
EBNF: A = α [ β ] γ .
BNF: A -> α β γ | α γ
EBNF: A = α ( β | γ ) δ .
BNF: A -> α A' δ
A' -> β | γ
Python Implementation¶
def ebnf_to_bnf(ebnf_rules: str) -> str:
"""
Convert EBNF notation to BNF.
Handles: { } (repetition), [ ] (optional), ( | ) (grouping).
This is a simplified converter for demonstration.
"""
bnf_rules = []
aux_counter = [0]
def new_aux(base_name: str) -> str:
aux_counter[0] += 1
return f"{base_name}_{aux_counter[0]}"
for line in ebnf_rules.strip().split('\n'):
line = line.strip()
if not line:
continue
head, _, body = line.partition('=')
head = head.strip()
body = body.strip().rstrip('.')
# For simplicity, just handle the basic transformations
# In practice, you'd need a proper EBNF parser
# Handle { X } -> X_rep where X_rep -> X X_rep | ε
import re
while '{' in body:
match = re.search(r'\{([^{}]+)\}', body)
if not match:
break
inner = match.group(1).strip()
aux = new_aux(head + "_rep")
body = body[:match.start()] + aux + body[match.end():]
bnf_rules.append(f"{aux} -> {inner} {aux} | ε")
# Handle [ X ] -> X_opt where X_opt -> X | ε
while '[' in body:
match = re.search(r'\[([^\[\]]+)\]', body)
if not match:
break
inner = match.group(1).strip()
aux = new_aux(head + "_opt")
body = body[:match.start()] + aux + body[match.end():]
bnf_rules.append(f"{aux} -> {inner} | ε")
bnf_rules.insert(0, f"{head} -> {body}")
return '\n'.join(bnf_rules)
ebnf = """
expr = term { + term } .
term = factor { * factor } .
factor = ( expr ) | id .
"""
print("=== EBNF to BNF Conversion ===\n")
print("EBNF:")
print(ebnf)
print("BNF:")
print(ebnf_to_bnf(ebnf))
3. Derivations¶
Leftmost Derivation¶
In a leftmost derivation, at each step we replace the leftmost variable in the sentential form.
Notation: $\alpha \xRightarrow{lm} \beta$
Example: Derive id + id * id using the expression grammar.
$$E \xRightarrow{lm} E + T \xRightarrow{lm} T + T \xRightarrow{lm} F + T \xRightarrow{lm} \texttt{id} + T$$ $$\xRightarrow{lm} \texttt{id} + T * F \xRightarrow{lm} \texttt{id} + F * F \xRightarrow{lm} \texttt{id} + \texttt{id} * F \xRightarrow{lm} \texttt{id} + \texttt{id} * \texttt{id}$$
Rightmost Derivation¶
In a rightmost derivation (also called canonical derivation), at each step we replace the rightmost variable.
Notation: $\alpha \xRightarrow{rm} \beta$
$$E \xRightarrow{rm} E + T \xRightarrow{rm} E + T * F \xRightarrow{rm} E + T * \texttt{id} \xRightarrow{rm} E + F * \texttt{id}$$ $$\xRightarrow{rm} E + \texttt{id} * \texttt{id} \xRightarrow{rm} T + \texttt{id} * \texttt{id} \xRightarrow{rm} F + \texttt{id} * \texttt{id} \xRightarrow{rm} \texttt{id} + \texttt{id} * \texttt{id}$$
Both derivations produce the same parse tree (for unambiguous grammars).
Why Derivation Order Matters¶
- Top-down parsers (LL) construct a leftmost derivation
- Bottom-up parsers (LR) construct a rightmost derivation in reverse
- For unambiguous grammars, both yield the same parse tree
- For ambiguous grammars, different derivation orders may yield different parse trees
Python Implementation¶
def leftmost_derivation(grammar: Grammar, target: str) -> Optional[List[str]]:
"""
Find a leftmost derivation of target string.
Uses brute-force search (only practical for short strings).
Returns list of sentential forms, or None if not derivable.
"""
target_symbols = tuple(target.split())
# BFS
from collections import deque
queue = deque()
queue.append(((grammar.start,), [(grammar.start,)]))
visited = {(grammar.start,)}
max_len = len(target_symbols) * 3 # Bound to prevent infinite search
while queue:
current, history = queue.popleft()
# Check if we've derived the target
if current == target_symbols:
return [' '.join(sf) for sf in history]
# If current is all terminals, no more derivations possible
if all(grammar.is_terminal(s) for s in current):
continue
# If too long, skip
if len(current) > max_len:
continue
# Find leftmost variable
for i, sym in enumerate(current):
if grammar.is_variable(sym):
# Try all productions for this variable
for prod in grammar.prod_index.get(sym, []):
new_form = current[:i] + prod.body + current[i+1:]
if new_form not in visited:
visited.add(new_form)
queue.append((new_form, history + [new_form]))
break # Only expand leftmost variable
return None
# Example
print("=== Leftmost Derivation ===\n")
derivation = leftmost_derivation(expr_grammar, "id + id * id")
if derivation:
for i, step in enumerate(derivation):
arrow = " " if i == 0 else "=>"
print(f" {arrow} {step}")
4. Parse Trees¶
A parse tree (also called a derivation tree or concrete syntax tree) is a graphical representation of a derivation.
Definition¶
For a grammar $G = (V, \Sigma, P, S)$, a parse tree has:
- The root is labeled with the start symbol $S$
- Each interior node is labeled with a variable $A \in V$
- Each leaf is labeled with a terminal $a \in \Sigma$ or $\epsilon$
- If interior node $A$ has children $X_1, X_2, \ldots, X_k$, then $A \rightarrow X_1 X_2 \cdots X_k \in P$
Example Parse Tree¶
For id + id * id with the expression grammar:
E
/ | \
E + T
| /|\
T T * F
| | |
F F id
| |
id id
Reading the leaves left-to-right gives the derived string: id + id * id.
Relationship to Derivations¶
- Every derivation corresponds to exactly one parse tree
- Every parse tree corresponds to exactly one leftmost derivation and one rightmost derivation
- Different derivations (leftmost, rightmost, or arbitrary order) may correspond to the same parse tree
Python Implementation¶
@dataclass
class ParseTreeNode:
"""A node in a parse tree."""
symbol: str
children: List['ParseTreeNode'] = field(default_factory=list)
is_terminal: bool = False
def __repr__(self):
if self.is_terminal:
return f"'{self.symbol}'"
return f"{self.symbol}"
def leaves(self) -> List[str]:
"""Return the leaves (yield) of the parse tree."""
if self.is_terminal:
return [self.symbol] if self.symbol != 'ε' else []
result = []
for child in self.children:
result.extend(child.leaves())
return result
def pretty_print(self, prefix="", is_last=True):
"""Print the tree in a readable format."""
connector = "└── " if is_last else "├── "
label = f"'{self.symbol}'" if self.is_terminal else self.symbol
print(prefix + connector + label)
new_prefix = prefix + (" " if is_last else "│ ")
for i, child in enumerate(self.children):
child.pretty_print(new_prefix, i == len(self.children) - 1)
def build_parse_tree(grammar: Grammar, derivation_steps: list) -> ParseTreeNode:
"""
Build a parse tree from a leftmost derivation.
Each step is a sentential form (list of symbols).
"""
# Start with root
root = ParseTreeNode(grammar.start)
# Track which nodes need expansion
def find_leftmost_unexpanded(node):
"""Find the leftmost nonterminal leaf that hasn't been expanded."""
if node.is_terminal:
return None
if not node.children:
if grammar.is_variable(node.symbol):
return node
return None
for child in node.children:
result = find_leftmost_unexpanded(child)
if result is not None:
return result
return None
# Process each derivation step
for i in range(1, len(derivation_steps)):
current = derivation_steps[i].split()
prev = derivation_steps[i-1].split()
# Find which variable was expanded
node = find_leftmost_unexpanded(root)
if node is None:
break
# Find the production used
for prod in grammar.prod_index.get(node.symbol, []):
# Check if this production could have been applied
expanded = []
for j, sym in enumerate(prev):
if sym == node.symbol and j == prev.index(node.symbol):
expanded.extend(prod.body if prod.body else ['ε'])
else:
expanded.append(sym)
if expanded == current or (not prod.body and len(current) == len(prev) - 1):
# This is the right production
if prod.body:
for sym in prod.body:
child = ParseTreeNode(
sym,
is_terminal=grammar.is_terminal(sym)
)
node.children.append(child)
else:
node.children.append(
ParseTreeNode('ε', is_terminal=True)
)
break
return root
# Manually build a parse tree for "id + id * id"
def make_expr_tree():
"""Build the parse tree for 'id + id * id' manually."""
# F -> id (three instances)
id1 = ParseTreeNode('id', is_terminal=True)
id2 = ParseTreeNode('id', is_terminal=True)
id3 = ParseTreeNode('id', is_terminal=True)
f1 = ParseTreeNode('F', [id1])
f2 = ParseTreeNode('F', [id2])
f3 = ParseTreeNode('F', [id3])
# T -> F (left), T -> T * F (right)
t1 = ParseTreeNode('T', [f1])
t2 = ParseTreeNode('T', [
ParseTreeNode('T', [f2]),
ParseTreeNode('*', is_terminal=True),
f3
])
# E -> E + T
plus = ParseTreeNode('+', is_terminal=True)
e = ParseTreeNode('E', [
ParseTreeNode('E', [t1]),
plus,
t2
])
return e
tree = make_expr_tree()
print("=== Parse Tree for 'id + id * id' ===\n")
tree.pretty_print(prefix="", is_last=True)
print(f"\nYield: {' '.join(tree.leaves())}")
5. Ambiguity¶
Definition¶
A grammar $G$ is ambiguous if there exists a string $w \in L(G)$ that has two or more distinct parse trees (equivalently, two or more distinct leftmost derivations).
Example of an Ambiguous Grammar¶
Consider this simpler expression grammar:
$$E \rightarrow E + E \mid E * E \mid (E) \mid \texttt{id}$$
The string id + id * id has two parse trees:
Tree 1 (multiplication binds tighter):
E
/|\
E + E
| /|\
id E * E
| |
id id
Interpretation: id + (id * id)
Tree 2 (addition binds tighter):
E
/|\
E * E
/|\ |
E + E id
| |
id id
Interpretation: (id + id) * id
These two parse trees give different meanings to the same expression!
Why Ambiguity Matters¶
- Different parse trees $\rightarrow$ different semantics
- A compiler must produce a unique interpretation
- Ambiguity in a grammar makes parsing nondeterministic
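To see the semantic stakes concretely, here is a small standalone sketch (tuple-encoded trees, numbers standing in for `id`) evaluating the two parse trees of the same token string:

```python
# The two parse trees assign different values to the same token string.
# Using numbers in place of id: 2 + 3 * 4
tree1 = ("+", 2, ("*", 3, 4))   # id + (id * id)  -- * parsed deeper
tree2 = ("*", ("+", 2, 3), 4)   # (id + id) * id  -- + parsed deeper

def evaluate(t):
    """Evaluate a tuple-encoded expression tree."""
    if isinstance(t, int):
        return t
    op, left, right = t
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r

print(evaluate(tree1), evaluate(tree2))   # 14 20
```

Same tokens, two trees, two answers: this is why a compiler needs an unambiguous grammar (or explicit disambiguation rules).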
Inherent Ambiguity¶
A context-free language is inherently ambiguous if every grammar for it is ambiguous. (This is different from a grammar being ambiguous -- you might be able to find an unambiguous grammar for the same language.)
A classic example:
$$L = \{a^i b^j c^k \mid i = j \text{ or } j = k\}$$
This language is inherently ambiguous: every string $a^n b^n c^n$ satisfies both conditions ($i = j$ and $j = k$), so any grammar for $L$ derives these strings in two essentially different ways, and it can be shown that no unambiguous grammar for $L$ exists.
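The following standalone sketch illustrates the overlap that causes the ambiguity (it does not prove inherent ambiguity, which requires showing *every* grammar is ambiguous): any grammar built as a union of an $i = j$ part and a $j = k$ part derives $a^n b^n c^n$ through both parts.

```python
import re

def counts(s):
    """Return (i, j, k) if s = a^i b^j c^k, else None."""
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    return tuple(map(len, m.groups())) if m else None

def in_L1(s):
    """i == j branch: {a^i b^i c^k}."""
    c = counts(s)
    return c is not None and c[0] == c[1]

def in_L2(s):
    """j == k branch: {a^i b^j c^j}."""
    c = counts(s)
    return c is not None and c[1] == c[2]

# Every a^n b^n c^n lies in BOTH sub-languages.
for n in range(5):
    w = "a" * n + "b" * n + "c" * n
    assert in_L1(w) and in_L2(w)
print("a^n b^n c^n is in both branches for n = 0..4")
```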
Detecting Ambiguity¶
Undecidable: There is no algorithm that, given an arbitrary CFG, determines whether it is ambiguous. (This is proved by reduction from Post's Correspondence Problem.)
However, specific classes of grammars (LL(1), LR(1)) are guaranteed to be unambiguous.
def check_ambiguity_brute_force(grammar: Grammar, max_length: int = 6) -> list:
"""
Brute-force check: enumerate all leftmost derivations up to a given
string length and check for strings with multiple derivations.
WARNING: Exponential complexity. Only for very small grammars.
Returns list of (string, count) for ambiguous strings.
"""
from collections import defaultdict
derivation_count = defaultdict(int)
def derive(sentential_form, depth):
"""Enumerate all leftmost derivations."""
# Check if all terminals
if all(grammar.is_terminal(s) for s in sentential_form):
if len(sentential_form) <= max_length:
key = ' '.join(sentential_form)
derivation_count[key] += 1
return
# Prune: too many symbols
terminal_count = sum(1 for s in sentential_form if grammar.is_terminal(s))
if terminal_count > max_length:
return
if depth > max_length * 3:
return
# Find leftmost variable
for i, sym in enumerate(sentential_form):
if grammar.is_variable(sym):
for prod in grammar.prod_index.get(sym, []):
new_form = sentential_form[:i] + list(prod.body) + sentential_form[i+1:]
derive(new_form, depth + 1)
break
derive([grammar.start], 0)
ambiguous = [(s, c) for s, c in derivation_count.items() if c > 1]
return ambiguous
# Test with the ambiguous grammar
ambig_grammar = Grammar.from_string("""
E -> E + E | E * E | ( E ) | id
""")
print("=== Ambiguity Check ===\n")
results = check_ambiguity_brute_force(ambig_grammar, max_length=5)
if results:
print("Ambiguous strings found:")
for s, count in sorted(results, key=lambda x: len(x[0])):
print(f" '{s}' has {count} distinct leftmost derivations")
else:
print("No ambiguity found (up to given length)")
6. Resolving Ambiguity¶
Strategy 1: Operator Precedence¶
Introduce a hierarchy of nonterminals, one per precedence level:
Level 0 (lowest): E -> E + T | T (addition)
Level 1: T -> T * F | F (multiplication)
Level 2 (highest): F -> ( E ) | id (atoms)
Higher-precedence operators are "deeper" in the grammar, so they bind tighter.
For a language with four levels of precedence:
expr -> expr || or_t | or_t // Level 0: logical OR
or_t -> or_t && and_t | and_t // Level 1: logical AND
and_t -> and_t == rel_t | rel_t // Level 2: equality
rel_t -> rel_t < factor | factor // Level 3: relational
factor -> ( expr ) | id | num // Level 4: atoms
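The "deeper binds tighter" idea maps directly onto a recursive-descent parser: each nonterminal becomes a function, and the lower-precedence function calls the higher-precedence one. A minimal standalone sketch for the two-level E/T/F hierarchy (evaluating as it parses; function names are ours):

```python
def parse_expr(tokens):
    """expr -> term { '+' term } : lowest precedence, handled outermost."""
    val, rest = parse_term(tokens)
    while rest and rest[0] == '+':
        rhs, rest = parse_term(rest[1:])
        val = val + rhs            # left-associative fold
    return val, rest

def parse_term(tokens):
    """term -> factor { '*' factor } : binds tighter because it sits deeper."""
    val, rest = parse_factor(tokens)
    while rest and rest[0] == '*':
        rhs, rest = parse_factor(rest[1:])
        val = val * rhs
    return val, rest

def parse_factor(tokens):
    """factor -> '(' expr ')' | number"""
    if tokens[0] == '(':
        val, rest = parse_expr(tokens[1:])
        return val, rest[1:]       # skip the closing ')'
    return int(tokens[0]), tokens[1:]

value, _ = parse_expr("2 + 3 * 4".split())
print(value)  # 14: * is parsed at a deeper level than +
```

Parenthesized input reaches `parse_expr` again through `parse_factor`, which is exactly the role of `F -> ( E )` in the grammar.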
Strategy 2: Associativity¶
- Left-associative operators use left recursion: $E \rightarrow E + T$
  - `a + b + c` parses as `(a + b) + c`
- Right-associative operators use right recursion: $E \rightarrow T = E$
  - `a = b = c` parses as `a = (b = c)`
- Non-associative operators use no recursion at the same level:
  - `a < b < c` is a syntax error
# Left-associative: a + b + c = (a + b) + c
E -> E + T | T
# Right-associative: a = b = c parses as a = (b = c)
E -> T = E | T
# Non-associative: a < b is ok, a < b < c is error
E -> T < T | T
Strategy 3: Disambiguation Rules (External)¶
Some parser generators allow you to specify precedence and associativity externally rather than encoding them in the grammar:
/* Yacc/Bison precedence declarations */
%left '+' '-' /* lowest precedence, left-associative */
%left '*' '/' /* higher precedence, left-associative */
%right UMINUS /* highest precedence, right-associative (unary minus) */
%%
expr : expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '-' expr %prec UMINUS
| '(' expr ')'
| NUMBER
;
This approach keeps the grammar simple (and ambiguous) but resolves ambiguity with explicit rules.
Strategy 4: The Dangling Else Problem¶
The dangling else is a classic ambiguity:
stmt -> if expr then stmt
| if expr then stmt else stmt
| other
For if E1 then if E2 then S1 else S2, two parse trees exist:
Parse tree 1: if E1 then (if E2 then S1 else S2) -- else matches inner if
Parse tree 2: if E1 then (if E2 then S1) else S2 -- else matches outer if
Resolution: The convention is that else matches the nearest unmatched if. This can be encoded in the grammar:
stmt -> matched | unmatched
matched -> if expr then matched else matched | other
unmatched -> if expr then stmt
| if expr then matched else unmatched
Or resolved externally: most parser generators associate else with the nearest if by default.
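The nearest-`if` convention falls out naturally from a greedy recursive-descent parser: whichever invocation sees the `else` first consumes it, and that is always the innermost open `if`. A minimal standalone sketch (single-token conditions assumed; parenthesization shows the attachment):

```python
def parse_stmt(toks):
    """Parse: stmt -> 'if' COND 'then' stmt ['else' stmt] | WORD.
    Greedy 'else' consumption attaches each else to the nearest open if.
    Assumes COND is a single token, so toks[3:] skips 'if' COND 'then'."""
    if toks[0] == 'if':
        cond, toks = toks[1], toks[3:]
        then_branch, toks = parse_stmt(toks)
        if toks and toks[0] == 'else':   # nearest-if rule: grab the else now
            else_branch, toks = parse_stmt(toks[1:])
            return f"if {cond} then ({then_branch}) else ({else_branch})", toks
        return f"if {cond} then ({then_branch})", toks
    return toks[0], toks[1:]

tree, _ = parse_stmt("if E1 then if E2 then S1 else S2".split())
print(tree)  # if E1 then (if E2 then (S1) else (S2))
```

The inner recursive call is still active when `else` appears, so it claims the token, matching parse tree 1 above.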
Python Example: Resolving Ambiguity¶
# Ambiguous grammar for expressions
ambiguous = Grammar.from_string("""
E -> E + E | E * E | ( E ) | id
""")
# Unambiguous grammar (precedence + left-associativity)
unambiguous = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")
print("=== Ambiguous Grammar ===")
print(ambiguous)
print()
print("=== Unambiguous Grammar (same language) ===")
print(unambiguous)
print()
# Verify they generate the same strings (for small strings)
def generate_strings(grammar, max_depth=8):
"""Generate all strings derivable up to a certain derivation depth."""
strings = set()
def derive(form, depth):
if depth > max_depth:
return
if all(grammar.is_terminal(s) for s in form):
strings.add(tuple(form))
return
for i, sym in enumerate(form):
if grammar.is_variable(sym):
for prod in grammar.prod_index.get(sym, []):
new_form = form[:i] + list(prod.body) + form[i+1:]
derive(new_form, depth + 1)
break
derive([grammar.start], 0)
return {' '.join(s) for s in strings}
strings_ambig = generate_strings(ambiguous, 6)
strings_unambig = generate_strings(unambiguous, 8)
print(f"Ambiguous grammar generates {len(strings_ambig)} strings (depth 6)")
print(f"Unambiguous grammar generates {len(strings_unambig)} strings (depth 8)")
print(f"Strings in ambiguous but not unambiguous: {strings_ambig - strings_unambig}")
print(f"Strings in unambiguous but not ambiguous: {strings_unambig - strings_ambig}")
# Note: the depth bounds differ because the unambiguous grammar needs more
# derivation steps per string; any discrepancy reflects the cutoff, not the languages.
7. Chomsky Normal Form (CNF)¶
A CFG is in Chomsky Normal Form if every production has one of these forms:
- $A \rightarrow BC$ where $B, C \in V$ (two nonterminals)
- $A \rightarrow a$ where $a \in \Sigma$ (one terminal)
- $S \rightarrow \epsilon$ (only if $\epsilon \in L(G)$, and $S$ does not appear on any right-hand side)
Why CNF Matters¶
- CYK parsing algorithm requires CNF (see Section 9)
- Every parse tree in CNF is a binary tree (useful for analysis)
- Every CFG can be converted to an equivalent grammar in CNF
- A derivation of a string of length $n$ in CNF has exactly $2n - 1$ steps
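The $2n - 1$ count follows from the two production shapes: building a string of length $n$ takes $n - 1$ applications of $A \rightarrow BC$ (a binary tree with $n$ leaves has $n - 1$ internal binary nodes) plus $n$ applications of $A \rightarrow a$. A standalone brute-force check on a small CNF grammar for $\{a^n b^n \mid n \geq 1\}$ (the grammar is our own illustrative example):

```python
# CNF grammar for {a^n b^n | n >= 1}:
#   S -> A T | A B,  T -> S B,  A -> a,  B -> b
prods = {"S": [("A", "T"), ("A", "B")], "T": [("S", "B")],
         "A": [("a",)], "B": [("b",)]}

def count_steps(form, target):
    """Leftmost depth-first search; returns the number of steps to derive target."""
    if form == target:
        return 0
    if len(form) > len(target):   # CNF never shrinks a sentential form
        return None
    for i, sym in enumerate(form):
        if sym in prods:          # leftmost variable
            for body in prods[sym]:
                sub = count_steps(form[:i] + body + form[i + 1:], target)
                if sub is not None:
                    return 1 + sub
            return None
    return None

for n in (1, 2, 3):
    w = tuple("a" * n + "b" * n)
    steps = count_steps(("S",), w)
    assert steps == 2 * len(w) - 1
    print(f"|w| = {len(w)}: {steps} steps")
```

Since every CNF derivation of $w$ has the same length, finding any one derivation suffices to confirm the count.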
Conversion to CNF¶
Input: CFG $G = (V, \Sigma, P, S)$ Output: Equivalent CFG $G'$ in CNF
Step 1: Remove $\epsilon$-productions (except possibly $S \rightarrow \epsilon$)
Find all nullable variables: $A$ is nullable if $A \xRightarrow{*} \epsilon$.
For each production $B \rightarrow \alpha$, add new productions with every possible subset of nullable symbols removed. Then delete all $\epsilon$-productions (except $S \rightarrow \epsilon$ if needed).
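The nullable set is computed by a fixed-point iteration, shown here as a standalone sketch on a small grammar of our own (productions as `(head, body)` pairs, with `()` for an $\epsilon$-production):

```python
# Example: S -> AB, A -> aA | ε, B -> b | AA
prods = [("S", ("A", "B")), ("A", ("a", "A")), ("A", ()),
         ("B", ("b",)), ("B", ("A", "A"))]

def nullable_vars(prods):
    """A is nullable iff some body of A consists entirely of nullable symbols.
    (An empty body vacuously satisfies this, seeding the iteration.)"""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, body in prods:
            if head not in nullable and all(s in nullable for s in body):
                nullable.add(head)
                changed = True
    return nullable

print(sorted(nullable_vars(prods)))   # ['A', 'B', 'S']
```

Here `A` is directly nullable, `B` becomes nullable via `B -> AA`, and then `S` via `S -> AB`: each round of the loop propagates nullability one production deeper.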
Step 2: Remove unit productions ($A \rightarrow B$)
Find all unit pairs: $(A, B)$ such that $A \xRightarrow{*} B$ using only unit productions. Replace each unit chain with direct productions.
Step 3: Replace long productions
Replace $A \rightarrow B_1 B_2 \cdots B_k$ (where $k > 2$) with:
$$A \rightarrow B_1 C_1$$ $$C_1 \rightarrow B_2 C_2$$ $$\vdots$$ $$C_{k-2} \rightarrow B_{k-1} B_k$$
Step 4: Replace terminal-nonterminal mixtures
In any production $A \rightarrow \alpha$ with $|\alpha| \geq 2$, replace each terminal $a$ with a new variable $T_a$ and add $T_a \rightarrow a$.
Python Implementation¶
def to_cnf(grammar: Grammar) -> Grammar:
"""
Convert a grammar to Chomsky Normal Form.
"""
variables = set(grammar.variables)
terminals = set(grammar.terminals)
productions = list(grammar.productions)
start = grammar.start
aux_counter = [0]
def new_var(prefix="X"):
aux_counter[0] += 1
name = f"{prefix}{aux_counter[0]}"
variables.add(name)
return name
# ============================================================
# Step 0: New start symbol (ensure S doesn't appear on RHS)
# ============================================================
rhs_symbols = set()
for p in productions:
rhs_symbols.update(p.body)
if start in rhs_symbols:
new_start = new_var("S")
productions.append(Production(new_start, (start,)))
start = new_start
# ============================================================
# Step 1: Remove ε-productions
# ============================================================
# Find nullable variables
nullable = set()
changed = True
while changed:
changed = False
for p in productions:
if not p.body or all(s in nullable for s in p.body):
if p.head not in nullable:
nullable.add(p.head)
changed = True
# Generate new productions with nullable symbols optionally removed
new_prods = set()
for p in productions:
if not p.body:
# ε-production: only keep for start symbol
if p.head == start:
new_prods.add(p)
continue
# Find all positions with nullable symbols
nullable_positions = [i for i, s in enumerate(p.body) if s in nullable]
# Generate all subsets of nullable positions
from itertools import combinations
for r in range(len(nullable_positions) + 1):
for positions_to_remove in combinations(nullable_positions, r):
new_body = tuple(
s for i, s in enumerate(p.body)
if i not in positions_to_remove
)
if new_body: # Don't add empty unless it's start
new_prods.add(Production(p.head, new_body))
elif p.head == start:
new_prods.add(Production(p.head, ()))
productions = list(new_prods)
# ============================================================
# Step 2: Remove unit productions (A -> B)
# ============================================================
# Find unit pairs using transitive closure
unit_pairs = set()
for v in variables:
unit_pairs.add((v, v)) # Reflexive
changed = True
while changed:
changed = False
for p in productions:
if len(p.body) == 1 and p.body[0] in variables:
for (a, b) in list(unit_pairs):
if b == p.head:
new_pair = (a, p.body[0])
if new_pair not in unit_pairs:
unit_pairs.add(new_pair)
changed = True
# Replace unit productions
new_prods = set()
for p in productions:
if len(p.body) == 1 and p.body[0] in variables:
continue # Skip unit productions
# For each unit pair (A, B) where B is the head of this production
for (a, b) in unit_pairs:
if b == p.head:
new_prods.add(Production(a, p.body))
productions = list(new_prods)
# ============================================================
# Step 3: Replace terminals in mixed productions
# ============================================================
term_vars = {} # terminal -> variable name
additional = []
final_prods = []
for p in productions:
if len(p.body) <= 1:
final_prods.append(p)
continue
# Replace terminals with new variables
new_body = []
for sym in p.body:
if sym in terminals:
if sym not in term_vars:
tv = new_var("T")
term_vars[sym] = tv
additional.append(Production(tv, (sym,)))
new_body.append(term_vars[sym])
else:
new_body.append(sym)
final_prods.append(Production(p.head, tuple(new_body)))
productions = final_prods + additional
# ============================================================
# Step 4: Break long productions into binary
# ============================================================
final_prods = []
for p in productions:
if len(p.body) <= 2:
final_prods.append(p)
continue
# A -> B1 B2 B3 ... Bk => A -> B1 C1, C1 -> B2 C2, ..., Ck-2 -> Bk-1 Bk
symbols = list(p.body)
current_head = p.head
while len(symbols) > 2:
new_head = new_var("C")
final_prods.append(Production(current_head, (symbols[0], new_head)))
symbols = symbols[1:]
current_head = new_head
final_prods.append(Production(current_head, tuple(symbols)))
productions = final_prods
return Grammar(variables, terminals, productions, start)
# Example
print("=== CNF Conversion ===\n")
expr_g = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")
print("Original grammar:")
print(expr_g)
print()
cnf = to_cnf(expr_g)
print("CNF grammar:")
for p in sorted(cnf.productions, key=lambda p: (p.head, p.body)):
print(f" {p}")
# Verify: check that some strings are still derivable
print("\nVerification (derivability check):")
for w in ["id", "id + id", "id * id", "id + id * id"]:
symbols = w.split()
# We'll verify using CYK (next section)
8. Greibach Normal Form (GNF)¶
A CFG is in Greibach Normal Form if every production has the form:
$$A \rightarrow a B_1 B_2 \cdots B_k$$
where $a \in \Sigma$ and $B_1, \ldots, B_k \in V$ (with $k \geq 0$).
In other words, every production starts with exactly one terminal, followed by zero or more nonterminals.
Properties¶
- Every derivation step in GNF consumes exactly one terminal
- A string of length $n$ requires exactly $n$ derivation steps (no $\epsilon$-productions, except possibly $S \rightarrow \epsilon$)
- GNF is useful for constructing pushdown automata
- Conversion to GNF requires eliminating left recursion (covered in Lesson 5)
Comparison of Normal Forms¶
| Property | CNF | GNF |
|---|---|---|
| Production form | $A \rightarrow BC$ or $A \rightarrow a$ | $A \rightarrow a\alpha$ ($\alpha \in V^*$) |
| Parse tree shape | Binary tree | High branching factor |
| Derivation length for $|w| = n$ | $2n - 1$ | $n$ |
| Primary use | CYK algorithm | PDA construction |
| Left recursion | Allowed | Not allowed |
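The "$n$ derivation steps" row can be checked directly. Below is a standalone sketch using a GNF grammar of our own for $\{a^n b^n \mid n \geq 1\}$, where every production starts with exactly one terminal, so each step of a derivation contributes exactly one symbol of the final string:

```python
# GNF grammar for {a^n b^n | n >= 1}:  S -> a S B | a B,  B -> b
prods = {"S": [("a", "S", "B"), ("a", "B")], "B": [("b",)]}

def count_steps(form, target):
    """Leftmost depth-first search; returns the number of steps to derive target."""
    if form == target:
        return 0
    if len(form) > len(target):
        return None
    for i, sym in enumerate(form):
        if sym in prods:          # leftmost variable
            for body in prods[sym]:
                sub = count_steps(form[:i] + body + form[i + 1:], target)
                if sub is not None:
                    return 1 + sub
            return None
    return None

for n in (1, 2, 4):
    w = tuple("a" * n + "b" * n)
    assert count_steps(("S",), w) == len(w)   # exactly |w| steps in GNF
print("each GNF derivation step consumes exactly one terminal")
```

Compare with the CNF bound of $2|w| - 1$ steps for the same language: GNF trades longer right-hand sides for shorter derivations.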
9. The CYK Algorithm¶
The Cocke-Younger-Kasami (CYK) algorithm is a dynamic programming algorithm that determines whether a string $w$ is in $L(G)$ for a grammar $G$ in CNF. It also constructs a parse tree.
Algorithm¶
Input: Grammar $G$ in CNF, string $w = a_1 a_2 \cdots a_n$ Output: Whether $w \in L(G)$, and a parse table
Idea: Build a triangular table $T$ where $T[i][j]$ is the set of variables that can derive the substring $a_i a_{i+1} \cdots a_j$.
$$T[i][j] = \{A \in V \mid A \xRightarrow{*} a_i a_{i+1} \cdots a_j\}$$
Algorithm:
1. // Base case: substrings of length 1
for i = 1 to n:
T[i][i] = {A | A -> aᵢ ∈ P}
2. // Inductive case: substrings of length 2, 3, ..., n
for length = 2 to n:
for i = 1 to n - length + 1:
j = i + length - 1
T[i][j] = ∅
for k = i to j - 1:
for each production A -> BC:
if B ∈ T[i][k] and C ∈ T[k+1][j]:
T[i][j] = T[i][j] ∪ {A}
3. // Accept if start symbol is in top-right cell
return S ∈ T[1][n]
Time complexity: $O(n^3 \cdot |P|)$ Space complexity: $O(n^2)$
Worked Example¶
Grammar in CNF:
$$S \rightarrow AB \mid BC$$ $$A \rightarrow BA \mid a$$ $$B \rightarrow CC \mid b$$ $$C \rightarrow AB \mid a$$
Parse "baaba":
String: b a a b a
Index: 1 2 3 4 5
Table T[i][j]:
 i\j |    1       2       3       4       5
-----+---------------------------------------------
  1  |   {B}    {S,A}     ∅       ∅    {S,A,C}
  2  |          {A,C}    {B}     {B}   {S,A,C}
  3  |                  {A,C}   {S,C}    {B}
  4  |                           {B}    {S,A}
  5  |                                  {A,C}
Step-by-step:
Length 1 (diagonal): - $T[1][1]$: $b$ matched by $B \rightarrow b$, so $\{B\}$ - $T[2][2]$: $a$ matched by $A \rightarrow a$ and $C \rightarrow a$, so $\{A, C\}$ - Similarly for positions 3, 4, 5
Length 2: - $T[1][2]$: the only split is $k = 1$, pairing $T[1][1] = \{B\}$ with $T[2][2] = \{A, C\}$. The candidate bodies are $BA$ and $BC$: $A \rightarrow BA$ contributes $A$, and $S \rightarrow BC$ contributes $S$. Result: $\{S, A\}$. - $T[2][3]$: the pairs from $\{A, C\} \times \{A, C\}$ are $AA, AC, CA, CC$; only $B \rightarrow CC$ matches. Result: $\{B\}$.
(Continuing similarly for all cells...)
Since $S \in T[1][5]$, the string "baaba" is in the language.
Python Implementation¶
def cyk_parse(grammar: Grammar, word: str) -> Tuple[bool, list]:
"""
CYK parsing algorithm.
Grammar must be in Chomsky Normal Form.
word is a space-separated string of terminals.
Returns (accepted, table) where table[i][j] is the set of
variables that can derive word[i..j].
"""
symbols = word.split()
n = len(symbols)
if n == 0:
# Check if S -> ε exists
for p in grammar.prod_index.get(grammar.start, []):
if not p.body:
return (True, [])
return (False, [])
# Initialize table: T[i][j] = set of variables
# Using 0-indexed
T = [[set() for _ in range(n)] for _ in range(n)]
# Also store back-pointers for parse tree construction
# back[i][j][A] = (B, C, k) meaning A -> BC, B derives [i..k], C derives [k+1..j]
back = [[{} for _ in range(n)] for _ in range(n)]
# Step 1: Base case (length 1)
for i in range(n):
for p in grammar.productions:
if len(p.body) == 1 and p.body[0] == symbols[i]:
T[i][i].add(p.head)
back[i][i][p.head] = ('terminal', symbols[i])
# Step 2: Fill table for increasing lengths
for length in range(2, n + 1):
for i in range(n - length + 1):
j = i + length - 1
for k in range(i, j):
# Try all productions A -> BC
for p in grammar.productions:
if len(p.body) == 2:
B, C = p.body
if B in T[i][k] and C in T[k+1][j]:
T[i][j].add(p.head)
if p.head not in back[i][j]:
back[i][j][p.head] = ('split', B, C, k)
accepted = grammar.start in T[0][n-1]
# Print table
print(f"CYK Table for '{word}':\n")
col_width = 12
header = "i\\j |" + "|".join(f"{j:^{col_width}}" for j in range(n))
print(header)
print("-" * len(header))
for i in range(n):
cells = []
for j in range(n):
if j < i:
cells.append(" " * col_width)
else:
cell = "{" + ",".join(sorted(T[i][j])) + "}" if T[i][j] else "∅"
cells.append(f"{cell:^{col_width}}")
print(f" {i} |" + "|".join(cells))
print(f"\nAccepted: {accepted}")
# Build parse tree
def build_tree(var, i, j):
info = back[i][j].get(var)
if info is None:
return ParseTreeNode(var, [ParseTreeNode("?", is_terminal=True)])
if info[0] == 'terminal':
return ParseTreeNode(var, [ParseTreeNode(info[1], is_terminal=True)])
else:
_, B, C, k = info
left = build_tree(B, i, k)
right = build_tree(C, k+1, j)
return ParseTreeNode(var, [left, right])
if accepted:
tree = build_tree(grammar.start, 0, n-1)
print("\nParse tree:")
tree.pretty_print()
return (accepted, T)
# Example: CYK parsing
cnf_grammar = Grammar.from_string("""
S -> A B | B C
A -> B A | a
B -> C C | b
C -> A B | a
""")
print("=== CYK Parsing ===\n")
print("Grammar:")
print(cnf_grammar)
print()
cyk_parse(cnf_grammar, "b a a b a")
print()
cyk_parse(cnf_grammar, "a b")
print()
cyk_parse(cnf_grammar, "b b b")
10. Pushdown Automata (Overview)¶
A pushdown automaton (PDA) is a finite automaton augmented with a stack. PDAs recognize exactly the context-free languages.
Formal Definition¶
A PDA is a 7-tuple:
$$M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F)$$
where:
- $Q$ is a finite set of states
- $\Sigma$ is the input alphabet
- $\Gamma$ is the stack alphabet
- $\delta: Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma \rightarrow \mathcal{P}(Q \times \Gamma^*)$ is the transition function, mapping each (state, input-or-$\epsilon$, stack-top) triple to a finite set of (next state, stack replacement) pairs
- $q_0 \in Q$ is the start state
- $Z_0 \in \Gamma$ is the initial stack symbol
- $F \subseteq Q$ is the set of accepting states
How a PDA Works¶
At each step, a PDA move depends on:
1. The current state
2. The current input symbol (or $\epsilon$ for no input)
3. The top of the stack

The PDA then:
1. Moves to a new state
2. Optionally consumes the input symbol
3. Replaces the top of the stack with zero or more symbols
Equivalence with CFGs¶
Theorem: A language is context-free if and only if it is recognized by some PDA.
CFG $\rightarrow$ PDA: For every CFG, we can construct a PDA that simulates leftmost derivations. The PDA uses the stack to track the remaining symbols to be derived.
PDA $\rightarrow$ CFG: For every PDA, we can construct a CFG. The construction creates variables $[q_i, A, q_j]$ representing "the PDA goes from state $q_i$ to state $q_j$ while popping $A$ off the stack."
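The CFG $\rightarrow$ PDA direction can be made concrete with a small simulation. The sketch below simulates the standard construction: the stack holds the part of a leftmost derivation still to be produced, a variable on top is expanded by an $\epsilon$-move, and a terminal on top is matched against the input. The grammar representation (a dict of tuple bodies) and the names `cfg_to_pda_accepts` and `max_stack` are illustrative assumptions, not from any library:

```python
from collections import deque

def cfg_to_pda_accepts(productions, start, word, max_stack=50):
    """Simulate the textbook CFG -> PDA construction: expand variables,
    match terminals. Nondeterminism is explored by BFS over configurations,
    with a stack-depth cap so the search terminates."""
    variables = set(productions)
    queue = deque([(0, (start,))])        # (input position, stack; top last)
    seen = set()
    while queue:
        pos, stack = queue.popleft()
        if (pos, stack) in seen:
            continue
        seen.add((pos, stack))
        if not stack:
            if pos == len(word):
                return True               # input consumed, stack empty
            continue
        top, rest = stack[-1], stack[:-1]
        if top in variables:
            # Epsilon-move: replace the variable by a production body
            for body in productions[top]:
                new_stack = rest + tuple(reversed(body))
                if len(new_stack) <= max_stack:
                    queue.append((pos, new_stack))
        elif pos < len(word) and word[pos] == top:
            # Match a terminal against the next input symbol
            queue.append((pos + 1, rest))
    return False

# {a^n b^n}: S -> a S b | ε
prods = {'S': [('a', 'S', 'b'), ()]}
print(cfg_to_pda_accepts(prods, 'S', 'aabb'))  # True
print(cfg_to_pda_accepts(prods, 'S', 'aab'))   # False
```

The stack-depth cap is a pragmatic bound for this sketch; the true construction is unbounded.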
Python: Simple PDA Simulation¶
class PDA:
"""Simple Pushdown Automaton simulation."""
def __init__(self, transitions, start_state, start_stack, accept_states):
"""
transitions: dict mapping (state, input_or_eps, stack_top) ->
list of (new_state, stack_push)
stack_push is a tuple: () means pop, ('A',) means replace with A,
('A', 'B') means replace top with A then push B
"""
self.transitions = transitions
self.start_state = start_state
self.start_stack = start_stack
self.accept_states = accept_states
def accepts(self, word: str) -> bool:
"""Check if the PDA accepts the word (by final state)."""
# Configuration: (state, remaining_input_pos, stack)
# Use BFS to explore all nondeterministic choices
from collections import deque
initial = (self.start_state, 0, (self.start_stack,))
queue = deque([initial])
visited = set()
symbols = list(word) if word else []
while queue:
state, pos, stack = queue.popleft()
config = (state, pos, stack)
if config in visited:
continue
visited.add(config)
# Check acceptance
if pos == len(symbols) and state in self.accept_states:
return True
if not stack:
continue
top = stack[-1]
rest = stack[:-1]
# Try epsilon transitions
for new_state, push in self.transitions.get(
(state, 'ε', top), []
):
new_stack = rest + push
if len(new_stack) < 100: # Bound stack size
queue.append((new_state, pos, new_stack))
# Try input transitions
if pos < len(symbols):
ch = symbols[pos]
for new_state, push in self.transitions.get(
(state, ch, top), []
):
new_stack = rest + push
if len(new_stack) < 100:
queue.append((new_state, pos + 1, new_stack))
return False
# PDA for {a^n b^n | n >= 0}
# States: q0 (push a's), q1 (pop a's matching b's), q_accept
pda_anbn = PDA(
transitions={
# In q0, push a's onto stack
('q0', 'a', 'Z'): [('q0', ('Z', 'A'))], # Push A, keep Z
('q0', 'a', 'A'): [('q0', ('A', 'A'))], # Push A
# Switch to q1 when seeing b
('q0', 'b', 'A'): [('q1', ())], # Pop A
# In q0, accept empty string
('q0', 'ε', 'Z'): [('q_accept', ('Z',))],
# In q1, pop a's matching b's
('q1', 'b', 'A'): [('q1', ())], # Pop A
# Accept when stack has only Z
('q1', 'ε', 'Z'): [('q_accept', ('Z',))],
},
start_state='q0',
start_stack='Z',
accept_states={'q_accept'}
)
print("=== PDA for {a^n b^n | n >= 0} ===\n")
for w in ["", "ab", "aabb", "aaabbb", "aab", "abb", "ba", "abab"]:
result = "accept" if pda_anbn.accepts(w) else "reject"
print(f" '{w}': {result}")
11. The Pumping Lemma for Context-Free Languages¶
Statement¶
If $L$ is a context-free language, then there exists a constant $p \geq 1$ such that every string $w \in L$ with $|w| \geq p$ can be split into five parts:
$$w = uvxyz$$
satisfying:
- $|vy| > 0$ (at least one of $v$ and $y$ is nonempty)
- $|vxy| \leq p$
- $\forall i \geq 0: uv^ixy^iz \in L$ (pumping $v$ and $y$ together)
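For a language that IS context-free, such a decomposition always exists. A quick positive sanity check (the helper `in_anbn` is an illustrative name) for $\{a^n b^n\}$, taking $w = a^3 b^3$ with $u = aa$, $v = a$, $x = \epsilon$, $y = b$, $z = bb$:

```python
def in_anbn(s: str) -> bool:
    """Membership test for {a^n b^n | n >= 0}."""
    n = len(s) // 2
    return s == 'a' * n + 'b' * n

# |vy| = 2 > 0 and |vxy| = 2; pumping v and y together stays in the language
u, v, x, y, z = 'aa', 'a', '', 'b', 'bb'
print(all(in_anbn(u + v * i + x + y * i + z) for i in range(6)))  # True
```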
Proof Idea¶
Any sufficiently long string must have a derivation tree of height greater than $|V|$ (the number of variables). By the pigeonhole principle, some variable $A$ must repeat on a path from root to leaf:
              S
            / | \
          u   A   z     <-- first occurrence of A
            / | \
          v   A   y     <-- second occurrence of A
              |
              x
The subtree rooted at the first $A$ generates $vxy$, and the subtree rooted at the second $A$ generates $x$. We can "pump" by repeating or removing the portion between the two $A$'s.
Using the CFL Pumping Lemma¶
To prove a language is NOT context-free:
- Assume $L$ is context-free
- Let $p$ be the pumping length
- Choose a specific $w \in L$ with $|w| \geq p$
- Show that for all decompositions $w = uvxyz$ with $|vy| > 0$ and $|vxy| \leq p$, there exists some $i$ where $uv^ixy^iz \notin L$
- Contradiction
Example: $L = \{a^n b^n c^n \mid n \geq 0\}$ Is Not Context-Free¶
Proof:
- Assume $L$ is context-free with pumping length $p$.
- Choose $w = a^p b^p c^p \in L$, with $|w| = 3p \geq p$.
- Let $w = uvxyz$ with $|vy| > 0$ and $|vxy| \leq p$.
- Since $|vxy| \leq p$, the substring $vxy$ can span at most two of the three groups ($a$'s, $b$'s, $c$'s).
- Case analysis:
- $vxy$ is within $a^p$: pumping changes $a$ count but not $b$ or $c$.
- $vxy$ is within $a^p b^p$: pumping changes $a$ and/or $b$ but not $c$.
- $vxy$ is within $b^p$: pumping changes $b$ but not $a$ or $c$.
- $vxy$ is within $b^p c^p$: pumping changes $b$ and/or $c$ but not $a$.
- $vxy$ is within $c^p$: pumping changes $c$ but not $a$ or $b$.
- In all cases, pumping with $i = 2$ produces a string not in $L$: either the three counts become unequal, or (when $v$ or $y$ straddles a group boundary) the symbols are no longer in the order $a^* b^* c^*$. So $uv^2xy^2z \notin L$.
- Contradiction. $\blacksquare$
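The case analysis above can be verified exhaustively for a small pumping length. The brute-force checker below (hypothetical helper names `in_L` and `every_split_fails`) enumerates every legal split of $w = a^p b^p c^p$ and confirms that some $i$ pumps it out of the language:

```python
def in_L(s: str) -> bool:
    """Membership test for {a^n b^n c^n | n >= 0}."""
    n = len(s) // 3
    return len(s) == 3 * n and s == 'a' * n + 'b' * n + 'c' * n

def every_split_fails(w: str, p: int) -> bool:
    """For every w = uvxyz with |vy| > 0 and |vxy| <= p, check that
    pumping with some i in {0, 2} leaves the language."""
    n = len(w)
    for a in range(n + 1):                          # u = w[:a]
        for b in range(a, n + 1):                   # v = w[a:b]
            for c in range(b, n + 1):               # x = w[b:c]
                for d in range(c, min(a + p, n) + 1):   # y = w[c:d], |vxy| <= p
                    if b == a and d == c:
                        continue                    # |vy| = 0: not a legal split
                    u, v, x, y, z = w[:a], w[a:b], w[b:c], w[c:d], w[d:]
                    if all(in_L(u + v * i + x + y * i + z) for i in (0, 2)):
                        return False                # this split survives pumping
    return True

p = 3
w = 'a' * p + 'b' * p + 'c' * p
print(every_split_fails(w, p))  # True
```

This is evidence for one small $p$, not a proof; the proof is the case analysis above, which holds for every $p$.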
Languages Beyond Context-Free¶
| Language | Type | Why Not CF |
|---|---|---|
| $\{a^n b^n c^n \mid n \geq 0\}$ | Context-sensitive | Needs three-way matching |
| $\{ww \mid w \in \{a,b\}^*\}$ | Context-sensitive | Copy language |
| $\{a^{2^n} \mid n \geq 0\}$ | Context-sensitive | Exponential growth |
| Many other decidable languages | Context-sensitive or Type 0 | Exceed single-stack (pushdown) power |
Note: $\{a^n b^n \mid n \geq 0\}$ IS context-free. The limitation is three or more matched groups.
12. The Chomsky Hierarchy¶
The Chomsky hierarchy classifies formal languages into four levels:
| Type | Name | Grammar restriction | Automaton | Example |
|---|---|---|---|---|
| 3 | Regular | $A \rightarrow aB$ or $A \rightarrow a$ | Finite automaton (DFA/NFA) | $a^*b^*$ |
| 2 | Context-free | $A \rightarrow \alpha$ ($\alpha \in (V \cup \Sigma)^*$) | Pushdown automaton | $a^nb^n$ |
| 1 | Context-sensitive | $\alpha A \beta \rightarrow \alpha \gamma \beta$ ($|\gamma| \geq 1$) | Linear-bounded automaton | $a^nb^nc^n$ |
| 0 | Recursively enumerable | No restriction | Turing machine | The halting problem (r.e. but undecidable) |
Hierarchy (proper inclusions):
Regular ⊂ Context-Free ⊂ Context-Sensitive ⊂ Recursively Enumerable
(Type 3) (Type 2) (Type 1) (Type 0)
Relevance to Compiler Design¶
| Phase | Language Type | Formalism |
|---|---|---|
| Lexical analysis | Regular (Type 3) | DFA, regular expressions |
| Syntax analysis | Context-free (Type 2) | CFG, pushdown automata |
| Semantic analysis | Context-sensitive (Type 1) | Attribute grammars, type systems |
| Full language semantics | Beyond CF | Turing machines (undecidable in general) |
13. Grammar Transformations¶
Eliminating Left Recursion¶
A grammar has left recursion if there exists a derivation $A \xRightarrow{+} A\alpha$ for some $\alpha$. Left recursion causes problems for top-down parsers (infinite loop).
Direct left recursion ($A \rightarrow A\alpha \mid \beta$):
Replace with: $$A \rightarrow \beta A'$$ $$A' \rightarrow \alpha A' \mid \epsilon$$
Example: $$E \rightarrow E + T \mid T$$
becomes: $$E \rightarrow T E'$$ $$E' \rightarrow + T E' \mid \epsilon$$
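The transformation preserves the generated language, and that can be spot-checked empirically. The enumerator below (a rough sketch; `derive_strings` is a hypothetical helper with a form-length cap so the search terminates, and `x` stands in for `id`) compares the two grammars up to a string length:

```python
from collections import deque

def derive_strings(prods, start, max_len):
    """Collect every terminal string of length <= max_len derivable from
    `prods` (dict: variable -> list of tuple bodies) by breadth-first
    expansion of sentential forms, capped slightly above max_len."""
    variables = set(prods)
    results, seen = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if form in seen or len(form) > max_len + 2:
            continue
        seen.add(form)
        # Expand the leftmost variable, if any
        idx = next((i for i, s in enumerate(form) if s in variables), None)
        if idx is None:
            if len(form) <= max_len:
                results.add(''.join(form))
            continue
        for body in prods[form[idx]]:
            queue.append(form[:idx] + tuple(body) + form[idx + 1:])
    return results

# E -> E + T | T,  T -> x           (left-recursive)
left_rec = {'E': [('E', '+', 'T'), ('T',)], 'T': [('x',)]}
# E -> T E',  E' -> + T E' | ε      (after elimination)
no_rec = {'E': [('T', "E'")], "E'": [('+', 'T', "E'"), ()], 'T': [('x',)]}

print(derive_strings(left_rec, 'E', 5) == derive_strings(no_rec, 'E', 5))  # True
```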
Left Factoring¶
When two productions for the same nonterminal share a common prefix:
$$A \rightarrow \alpha \beta_1 \mid \alpha \beta_2$$
Replace with: $$A \rightarrow \alpha A'$$ $$A' \rightarrow \beta_1 \mid \beta_2$$
Python Implementation¶
def eliminate_left_recursion(grammar: Grammar) -> Grammar:
"""
Eliminate immediate left recursion from all productions.
(Does not handle indirect left recursion.)
"""
new_productions = []
new_variables = set(grammar.variables)
for var in sorted(grammar.variables):
prods = grammar.prod_index.get(var, [])
# Separate left-recursive and non-left-recursive productions
left_recursive = [] # A -> A α
non_recursive = [] # A -> β
for p in prods:
if p.body and p.body[0] == var:
left_recursive.append(p.body[1:]) # α (without leading A)
else:
non_recursive.append(p.body)
if not left_recursive:
# No left recursion for this variable
new_productions.extend(prods)
continue
# Eliminate left recursion
# A -> β1 | β2 | ... | A α1 | A α2 | ...
# becomes:
# A -> β1 A' | β2 A' | ...
# A' -> α1 A' | α2 A' | ... | ε
new_var = var + "'"
while new_var in new_variables:
new_var += "'"
new_variables.add(new_var)
# A -> βi A'
for beta in non_recursive:
new_body = beta + (new_var,) if beta else (new_var,)
new_productions.append(Production(var, new_body))
# A' -> αi A'
for alpha in left_recursive:
new_body = alpha + (new_var,)
new_productions.append(Production(new_var, new_body))
# A' -> ε
new_productions.append(Production(new_var, ()))
return Grammar(
new_variables, grammar.terminals,
new_productions, grammar.start
)
def left_factor(grammar: Grammar) -> Grammar:
"""
Apply left factoring to a grammar.
"""
new_productions = list(grammar.productions)
new_variables = set(grammar.variables)
aux_counter = [0]
changed = True
while changed:
changed = False
# Index by head
prod_index = {}
for p in new_productions:
prod_index.setdefault(p.head, []).append(p)
next_prods = []
processed_vars = set()
for var in sorted(prod_index.keys()):
if var in processed_vars:
continue
prods = prod_index[var]
# Find longest common prefix among any two productions
best_prefix = None
best_group = None
for i in range(len(prods)):
for j in range(i + 1, len(prods)):
# Find common prefix
prefix = []
for k in range(min(len(prods[i].body), len(prods[j].body))):
if prods[i].body[k] == prods[j].body[k]:
prefix.append(prods[i].body[k])
else:
break
if prefix and (best_prefix is None or len(prefix) > len(best_prefix)):
best_prefix = tuple(prefix)
# Find all productions sharing this prefix
group = [p for p in prods if p.body[:len(prefix)] == best_prefix]
best_group = group
if best_prefix and best_group and len(best_group) > 1:
changed = True
processed_vars.add(var)
# Create new variable
aux_counter[0] += 1
new_var = f"{var}_LF{aux_counter[0]}"
new_variables.add(new_var)
# A -> prefix A_LF
next_prods.append(Production(var, best_prefix + (new_var,)))
# A_LF -> remaining1 | remaining2 | ...
for p in best_group:
remaining = p.body[len(best_prefix):]
next_prods.append(Production(new_var, remaining if remaining else ()))
# Keep non-matching productions
for p in prods:
if p not in best_group:
next_prods.append(p)
else:
next_prods.extend(prods)
new_productions = next_prods
return Grammar(
new_variables, grammar.terminals,
new_productions, grammar.start
)
# Example: Eliminate left recursion
print("=== Eliminating Left Recursion ===\n")
print("Original:")
print(expr_grammar)
print()
no_lr = eliminate_left_recursion(expr_grammar)
print("After eliminating left recursion:")
print(no_lr)
print()
# Example: Left factoring
lf_grammar = Grammar.from_string("""
S -> if E then S else S | if E then S | other
E -> id
""")
print("=== Left Factoring ===\n")
print("Original:")
print(lf_grammar)
print()
factored = left_factor(lf_grammar)
print("After left factoring:")
print(factored)
Summary¶
- A context-free grammar $G = (V, \Sigma, P, S)$ generates a language through derivations
- BNF and EBNF are standard notations for writing grammars
- Leftmost and rightmost derivations correspond to different parsing strategies
- Parse trees are the structural representation of derivations
- A grammar is ambiguous if some string has multiple parse trees; ambiguity is undecidable in general
- Ambiguity is resolved through precedence, associativity, and grammar restructuring
- Chomsky Normal Form (binary productions) enables the CYK algorithm for parsing in $O(n^3)$
- Greibach Normal Form (terminal-first productions) is useful for PDA construction
- Pushdown automata recognize exactly the context-free languages
- The CFL pumping lemma proves that languages like $\{a^n b^n c^n\}$ are not context-free
- The Chomsky hierarchy classifies languages into four types: regular, context-free, context-sensitive, and recursively enumerable
- Left recursion elimination and left factoring prepare grammars for top-down parsing
Exercises¶
Exercise 1: Grammar Writing¶
Write a context-free grammar for each of the following languages:
- Palindromes over $\{a, b\}$
- Balanced parentheses (possibly nested)
- $\{a^i b^j c^k \mid i = j \text{ or } j = k\}$
- A simplified subset of Python `if`/`elif`/`else` statements
- Arithmetic expressions with `+`, `-`, `*`, `/`, unary `-`, and parentheses, respecting standard precedence and left-associativity
Exercise 2: Derivations and Parse Trees¶
Using the grammar $E \rightarrow E + T \mid T$, $T \rightarrow T * F \mid F$, $F \rightarrow (E) \mid \texttt{id} \mid \texttt{num}$:
- Find the leftmost derivation for `id * (num + id)`
- Find the rightmost derivation for the same string
- Draw the parse tree
- Verify that both derivations correspond to the same parse tree
Exercise 3: Ambiguity Analysis¶
Consider the grammar:
$$S \rightarrow aSb \mid aSbb \mid \epsilon$$
- What language does this grammar generate?
- Is the grammar ambiguous? If so, find a string with two parse trees.
- Can you find an unambiguous grammar for the same language?
Exercise 4: CNF Conversion¶
Convert the following grammar to Chomsky Normal Form:
$$S \rightarrow ASA \mid aB$$ $$A \rightarrow B \mid S$$ $$B \rightarrow b \mid \epsilon$$
Show each step of the conversion process.
Exercise 5: CYK Algorithm¶
Using the CNF grammar from Exercise 4 (or the one provided in Section 9), run the CYK algorithm on the strings:
1. "aabb"
2. "abab"
3. "baba"
Show the complete table for each string.
Exercise 6: Pumping Lemma¶
Use the pumping lemma for CFLs to prove that the following languages are NOT context-free:
- $L_1 = \{a^n b^n c^n d^n \mid n \geq 0\}$
- $L_2 = \{a^i b^j c^k \mid 0 \leq i \leq j \leq k\}$
- $L_3 = \{ww \mid w \in \{a, b\}^*\}$ (the copy language)
Previous: Finite Automata | Next: Top-Down Parsing | Overview