Lesson 4: Context-Free Grammars¶

Learning Objectives¶

After completing this lesson, you will be able to:

Give the formal definition of a context-free grammar (CFG)
Use BNF and EBNF notation to write grammars
Construct leftmost and rightmost derivations for a given string
Build parse trees and understand their relationship to derivations
Identify and resolve ambiguity in grammars
Convert grammars to Chomsky Normal Form (CNF) and Greibach Normal Form (GNF)
Apply the CYK parsing algorithm
Understand the connection between CFGs and pushdown automata
Apply the pumping lemma for context-free languages
Implement grammar manipulation and CYK parsing in Python

1. Formal Definition of a Context-Free Grammar¶

A context-free grammar (CFG) is a 4-tuple:

$$G = (V, \Sigma, P, S)$$

where:

$V$ is a finite set of variables (also called nonterminals)
$\Sigma$ is a finite set of terminal symbols (the alphabet), with $V \cap \Sigma = \emptyset$
$P$ is a finite set of productions (rules), each of the form $A \rightarrow \alpha$ where $A \in V$ and $\alpha \in (V \cup \Sigma)^*$
$S \in V$ is the start symbol

Notation Conventions¶

Throughout this lesson, we use:

Symbol	Represents
$A, B, C, S$	Variables (nonterminals), uppercase
$a, b, c$	Terminals, lowercase
$\alpha, \beta, \gamma$	Strings in $(V \cup \Sigma)^*$
$w, x, y, z$	Strings in $\Sigma^*$ (terminal strings)
$\epsilon$	The empty string

Example CFG¶

A grammar for simple arithmetic expressions:

$$G = (\{E, T, F\}, \{+, *, (, ), \texttt{id}\}, P, E)$$

Productions $P$:

$$E \rightarrow E + T \mid T$$ $$T \rightarrow T * F \mid F$$ $$F \rightarrow (E) \mid \texttt{id}$$

This grammar defines the language of arithmetic expressions with +, *, parentheses, and identifiers, respecting the conventional precedence (* binds tighter than +).

Derivation¶

A derivation is a sequence of steps that transforms the start symbol into a string of terminals by repeatedly replacing a variable with the right-hand side of one of its productions.

If $A \rightarrow \beta$ is a production and $\alpha A \gamma$ is a sentential form, then:

$$\alpha A \gamma \Rightarrow \alpha \beta \gamma$$

We write $\alpha \xRightarrow{*} \beta$ for zero or more derivation steps.

Language of a Grammar¶

The language generated by $G$ is:

$$L(G) = \{w \in \Sigma^* \mid S \xRightarrow{*} w\}$$

The set of all terminal strings derivable from the start symbol.

Python Representation¶

from dataclasses import dataclass, field
from typing import List, Set, Dict, Tuple, Optional


@dataclass
class Production:
    """A grammar production A -> symbols."""
    head: str                 # Left-hand side (nonterminal)
    body: Tuple[str, ...]     # Right-hand side (tuple of symbols)

    def __repr__(self):
        body_str = ' '.join(self.body) if self.body else 'ε'
        return f"{self.head} -> {body_str}"

    def __hash__(self):
        return hash((self.head, self.body))

    def __eq__(self, other):
        return (isinstance(other, Production) and
                self.head == other.head and self.body == other.body)


class Grammar:
    """Context-Free Grammar."""

    def __init__(self, variables: Set[str], terminals: Set[str],
                 productions: List[Production], start: str):
        self.variables = variables
        self.terminals = terminals
        self.productions = productions
        self.start = start

        # Index productions by head
        self.prod_index: Dict[str, List[Production]] = {}
        for p in productions:
            self.prod_index.setdefault(p.head, []).append(p)

    @classmethod
    def from_string(cls, grammar_str: str, start: str = None) -> 'Grammar':
        """
        Parse a grammar from a multi-line string.
        Format: 'A -> α | β | γ' (one nonterminal per line).
        Nonterminals are uppercase single letters or quoted names.
        Use 'ε' or 'epsilon' for epsilon productions.
        """
        variables = set()
        terminals = set()
        productions = []

        for line in grammar_str.strip().split('\n'):
            line = line.strip()
            if not line or line.startswith('#'):
                continue

            head, _, body_str = line.partition('->')
            head = head.strip()
            variables.add(head)

            for alt in body_str.split('|'):
                alt = alt.strip()
                if alt in ('ε', 'epsilon', ''):
                    productions.append(Production(head, ()))
                else:
                    symbols = tuple(alt.split())
                    productions.append(Production(head, symbols))

        # Infer terminals
        for p in productions:
            for sym in p.body:
                if sym not in variables:
                    terminals.add(sym)

        if start is None:
            start = productions[0].head

        return cls(variables, terminals, productions, start)

    def __repr__(self):
        lines = [f"Grammar(start={self.start})"]
        for var in sorted(self.variables):
            prods = self.prod_index.get(var, [])
            bodies = [' '.join(p.body) if p.body else 'ε' for p in prods]
            lines.append(f"  {var} -> {' | '.join(bodies)}")
        return '\n'.join(lines)

    def is_terminal(self, symbol: str) -> bool:
        return symbol in self.terminals

    def is_variable(self, symbol: str) -> bool:
        return symbol in self.variables


# Example: Arithmetic expression grammar
expr_grammar = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")

print(expr_grammar)

2. BNF and EBNF Notation¶

Backus-Naur Form (BNF)¶

BNF (Backus-Naur Form), introduced by John Backus for the ALGOL 60 report, is the standard notation for context-free grammars.

<expression>  ::= <expression> "+" <term> | <term>
<term>        ::= <term> "*" <factor> | <factor>
<factor>      ::= "(" <expression> ")" | <identifier>
<identifier>  ::= <letter> | <identifier> <letter> | <identifier> <digit>
<letter>      ::= "a" | "b" | ... | "z"
<digit>       ::= "0" | "1" | ... | "9"

BNF conventions: - Nonterminals are enclosed in angle brackets: <expression> - Terminals are enclosed in quotes: "+" - ::= means "is defined as" - | separates alternatives - Each rule defines one nonterminal

Extended BNF (EBNF)¶

EBNF adds convenience notation to BNF, reducing the number of rules needed:

EBNF Notation	Meaning	BNF Equivalent
`{X}` or `X*`	Zero or more repetitions of X	New nonterminal with recursion
`[X]` or `X?`	Optional (zero or one)	New nonterminal with epsilon alternative
`(X \\| Y)`	Grouping	New nonterminal
`X+`	One or more repetitions	New nonterminal with recursion

Example: The expression grammar in EBNF:

expression = term { "+" term } .
term       = factor { "*" factor } .
factor     = "(" expression ")" | identifier .
identifier = letter { letter | digit } .

This is more concise because repetition ({...}) replaces explicit recursion.

EBNF to BNF Conversion¶

EBNF: A = α { β } γ .
BNF:  A  -> α A' γ
      A' -> β A' | ε

EBNF: A = α [ β ] γ .
BNF:  A -> α β γ | α γ

EBNF: A = α ( β | γ ) δ .
BNF:  A  -> α A' δ
      A' -> β | γ

Python Implementation¶

def ebnf_to_bnf(ebnf_rules: str) -> str:
    """
    Convert EBNF notation to BNF.
    Handles: { } (repetition), [ ] (optional), ( | ) (grouping).

    This is a simplified converter for demonstration.
    """
    bnf_rules = []
    aux_counter = [0]

    def new_aux(base_name: str) -> str:
        aux_counter[0] += 1
        return f"{base_name}_{aux_counter[0]}"

    for line in ebnf_rules.strip().split('\n'):
        line = line.strip()
        if not line:
            continue

        head, _, body = line.partition('=')
        head = head.strip()
        body = body.strip().rstrip('.')

        # For simplicity, just handle the basic transformations
        # In practice, you'd need a proper EBNF parser

        # Handle { X } -> X_rep where X_rep -> X X_rep | ε
        import re
        while '{' in body:
            match = re.search(r'\{([^{}]+)\}', body)
            if not match:
                break
            inner = match.group(1).strip()
            aux = new_aux(head + "_rep")
            body = body[:match.start()] + aux + body[match.end():]
            bnf_rules.append(f"{aux} -> {inner} {aux} | ε")

        # Handle [ X ] -> X_opt where X_opt -> X | ε
        while '[' in body:
            match = re.search(r'\[([^\[\]]+)\]', body)
            if not match:
                break
            inner = match.group(1).strip()
            aux = new_aux(head + "_opt")
            body = body[:match.start()] + aux + body[match.end():]
            bnf_rules.append(f"{aux} -> {inner} | ε")

        bnf_rules.insert(0, f"{head} -> {body}")

    return '\n'.join(bnf_rules)


ebnf = """
expr = term { + term } .
term = factor { * factor } .
factor = ( expr ) | id .
"""

print("=== EBNF to BNF Conversion ===\n")
print("EBNF:")
print(ebnf)
print("BNF:")
print(ebnf_to_bnf(ebnf))

3. Derivations¶

Leftmost Derivation¶

In a leftmost derivation, at each step we replace the leftmost variable in the sentential form.

Notation: $\alpha \xRightarrow{lm} \beta$

Example: Derive id + id * id using the expression grammar.

$$E \xRightarrow{lm} E + T \xRightarrow{lm} T + T \xRightarrow{lm} F + T \xRightarrow{lm} \texttt{id} + T$$ $$\xRightarrow{lm} \texttt{id} + T * F \xRightarrow{lm} \texttt{id} + F * F \xRightarrow{lm} \texttt{id} + \texttt{id} * F \xRightarrow{lm} \texttt{id} + \texttt{id} * \texttt{id}$$

Rightmost Derivation¶

In a rightmost derivation (also called canonical derivation), at each step we replace the rightmost variable.

Notation: $\alpha \xRightarrow{rm} \beta$

$$E \xRightarrow{rm} E + T \xRightarrow{rm} E + T * F \xRightarrow{rm} E + T * \texttt{id} \xRightarrow{rm} E + F * \texttt{id}$$ $$\xRightarrow{rm} E + \texttt{id} * \texttt{id} \xRightarrow{rm} T + \texttt{id} * \texttt{id} \xRightarrow{rm} F + \texttt{id} * \texttt{id} \xRightarrow{rm} \texttt{id} + \texttt{id} * \texttt{id}$$

Both derivations produce the same parse tree (for unambiguous grammars).

Why Derivation Order Matters¶

Top-down parsers (LL) construct a leftmost derivation
Bottom-up parsers (LR) construct a rightmost derivation in reverse
For unambiguous grammars, both yield the same parse tree
For ambiguous grammars, different derivation orders may yield different parse trees

Python Implementation¶

def leftmost_derivation(grammar: Grammar, target: str) -> Optional[List[str]]:
    """
    Find a leftmost derivation of target string.
    Uses brute-force search (only practical for short strings).
    Returns list of sentential forms, or None if not derivable.
    """
    target_symbols = tuple(target.split())

    # BFS
    from collections import deque

    queue = deque()
    queue.append(((grammar.start,), [(grammar.start,)]))
    visited = {(grammar.start,)}

    max_len = len(target_symbols) * 3  # Bound to prevent infinite search

    while queue:
        current, history = queue.popleft()

        # Check if we've derived the target
        if current == target_symbols:
            return [' '.join(sf) for sf in history]

        # If current is all terminals, no more derivations possible
        if all(grammar.is_terminal(s) for s in current):
            continue

        # If too long, skip
        if len(current) > max_len:
            continue

        # Find leftmost variable
        for i, sym in enumerate(current):
            if grammar.is_variable(sym):
                # Try all productions for this variable
                for prod in grammar.prod_index.get(sym, []):
                    new_form = current[:i] + prod.body + current[i+1:]
                    if new_form not in visited:
                        visited.add(new_form)
                        queue.append((new_form, history + [new_form]))
                break  # Only expand leftmost variable

    return None


# Example
print("=== Leftmost Derivation ===\n")
derivation = leftmost_derivation(expr_grammar, "id + id * id")
if derivation:
    for i, step in enumerate(derivation):
        arrow = "  " if i == 0 else "=>"
        print(f"  {arrow} {step}")

4. Parse Trees¶

A parse tree (also called a derivation tree or concrete syntax tree) is a graphical representation of a derivation.

Definition¶

For a grammar $G = (V, \Sigma, P, S)$, a parse tree has:

The root is labeled with the start symbol $S$
Each interior node is labeled with a variable $A \in V$
Each leaf is labeled with a terminal $a \in \Sigma$ or $\epsilon$
If interior node $A$ has children $X_1, X_2, \ldots, X_k$, then $A \rightarrow X_1 X_2 \cdots X_k \in P$

Example Parse Tree¶

For id + id * id with the expression grammar:

Reading the leaves left-to-right gives the derived string: id + id * id.

Relationship to Derivations¶

Every derivation corresponds to exactly one parse tree (for unambiguous grammars)
Every parse tree corresponds to exactly one leftmost derivation and one rightmost derivation
Different derivations (leftmost, rightmost, or arbitrary order) may correspond to the same parse tree

Python Implementation¶

@dataclass
class ParseTreeNode:
    """A node in a parse tree."""
    symbol: str
    children: List['ParseTreeNode'] = field(default_factory=list)
    is_terminal: bool = False

    def __repr__(self):
        if self.is_terminal:
            return f"'{self.symbol}'"
        return f"{self.symbol}"

    def leaves(self) -> List[str]:
        """Return the leaves (yield) of the parse tree."""
        if self.is_terminal:
            return [self.symbol] if self.symbol != 'ε' else []
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result

    def pretty_print(self, prefix="", is_last=True):
        """Print the tree in a readable format."""
        connector = "└── " if is_last else "├── "
        label = f"'{self.symbol}'" if self.is_terminal else self.symbol
        print(prefix + connector + label)
        new_prefix = prefix + ("    " if is_last else "│   ")
        for i, child in enumerate(self.children):
            child.pretty_print(new_prefix, i == len(self.children) - 1)


def build_parse_tree(grammar: Grammar, derivation_steps: list) -> ParseTreeNode:
    """
    Build a parse tree from a leftmost derivation.
    Each step is a sentential form (list of symbols).
    """
    # Start with root
    root = ParseTreeNode(grammar.start)

    # Track which nodes need expansion
    def find_leftmost_unexpanded(node):
        """Find the leftmost nonterminal leaf that hasn't been expanded."""
        if node.is_terminal:
            return None
        if not node.children:
            if grammar.is_variable(node.symbol):
                return node
            return None
        for child in node.children:
            result = find_leftmost_unexpanded(child)
            if result is not None:
                return result
        return None

    # Process each derivation step
    for i in range(1, len(derivation_steps)):
        current = derivation_steps[i].split()
        prev = derivation_steps[i-1].split()

        # Find which variable was expanded
        node = find_leftmost_unexpanded(root)
        if node is None:
            break

        # Find the production used
        for prod in grammar.prod_index.get(node.symbol, []):
            # Check if this production could have been applied
            expanded = []
            for j, sym in enumerate(prev):
                if sym == node.symbol and j == prev.index(node.symbol):
                    expanded.extend(prod.body if prod.body else ['ε'])
                else:
                    expanded.append(sym)

            if expanded == current or (not prod.body and len(current) == len(prev) - 1):
                # This is the right production
                if prod.body:
                    for sym in prod.body:
                        child = ParseTreeNode(
                            sym,
                            is_terminal=grammar.is_terminal(sym)
                        )
                        node.children.append(child)
                else:
                    node.children.append(
                        ParseTreeNode('ε', is_terminal=True)
                    )
                break

    return root


# Manually build a parse tree for "id + id * id"
def make_expr_tree():
    """Build the parse tree for 'id + id * id' manually."""
    # F -> id (three instances)
    id1 = ParseTreeNode('id', is_terminal=True)
    id2 = ParseTreeNode('id', is_terminal=True)
    id3 = ParseTreeNode('id', is_terminal=True)

    f1 = ParseTreeNode('F', [id1])
    f2 = ParseTreeNode('F', [id2])
    f3 = ParseTreeNode('F', [id3])

    # T -> F (left), T -> T * F (right)
    t1 = ParseTreeNode('T', [f1])
    t2 = ParseTreeNode('T', [
        ParseTreeNode('T', [f2]),
        ParseTreeNode('*', is_terminal=True),
        f3
    ])

    # E -> E + T
    plus = ParseTreeNode('+', is_terminal=True)
    e = ParseTreeNode('E', [
        ParseTreeNode('E', [t1]),
        plus,
        t2
    ])

    return e

tree = make_expr_tree()
print("=== Parse Tree for 'id + id * id' ===\n")
tree.pretty_print(prefix="", is_last=True)
print(f"\nYield: {' '.join(tree.leaves())}")

5. Ambiguity¶

Definition¶

A grammar $G$ is ambiguous if there exists a string $w \in L(G)$ that has two or more distinct parse trees (equivalently, two or more distinct leftmost derivations).

Example of an Ambiguous Grammar¶

Consider this simpler expression grammar:

$$E \rightarrow E + E \mid E * E \mid (E) \mid \texttt{id}$$

The string id + id * id has two parse trees:

Tree 1 (multiplication binds tighter):

       E
      /|\
     E + E
     |  /|\
    id E * E
       |   |
      id  id

Interpretation: id + (id * id)

Tree 2 (addition binds tighter):

       E
      /|\
     E * E
    /|\   |
   E + E  id
   |   |
  id  id

Interpretation: (id + id) * id

These two parse trees give different meanings to the same expression!

Why Ambiguity Matters¶

Different parse trees $\rightarrow$ different semantics
A compiler must produce a unique interpretation
Ambiguity in a grammar makes parsing nondeterministic

Inherent Ambiguity¶

A context-free language is inherently ambiguous if every grammar for it is ambiguous. (This is different from a grammar being ambiguous -- you might be able to find an unambiguous grammar for the same language.)

A classic example:

$$L = \{a^i b^j c^k \mid i = j \text{ or } j = k\}$$

This language is inherently ambiguous. The string $a^n b^n c^n$ can be parsed by matching $i = j$ or $j = k$, and no unambiguous grammar exists for this language.

Detecting Ambiguity¶

Undecidable: There is no algorithm that, given an arbitrary CFG, determines whether it is ambiguous. (This can be reduced to Post's Correspondence Problem.)

However, specific classes of grammars (LL(1), LR(1)) are guaranteed to be unambiguous.

def check_ambiguity_brute_force(grammar: Grammar, max_length: int = 6) -> list:
    """
    Brute-force check: enumerate all leftmost derivations up to a given
    string length and check for strings with multiple derivations.

    WARNING: Exponential complexity. Only for very small grammars.
    Returns list of (string, count) for ambiguous strings.
    """
    from collections import defaultdict

    derivation_count = defaultdict(int)

    def derive(sentential_form, depth):
        """Enumerate all leftmost derivations."""
        # Check if all terminals
        if all(grammar.is_terminal(s) for s in sentential_form):
            if len(sentential_form) <= max_length:
                key = ' '.join(sentential_form)
                derivation_count[key] += 1
            return

        # Prune: too many symbols
        terminal_count = sum(1 for s in sentential_form if grammar.is_terminal(s))
        if terminal_count > max_length:
            return

        if depth > max_length * 3:
            return

        # Find leftmost variable
        for i, sym in enumerate(sentential_form):
            if grammar.is_variable(sym):
                for prod in grammar.prod_index.get(sym, []):
                    new_form = sentential_form[:i] + list(prod.body) + sentential_form[i+1:]
                    derive(new_form, depth + 1)
                break

    derive([grammar.start], 0)

    ambiguous = [(s, c) for s, c in derivation_count.items() if c > 1]
    return ambiguous


# Test with the ambiguous grammar
ambig_grammar = Grammar.from_string("""
E -> E + E | E * E | ( E ) | id
""")

print("=== Ambiguity Check ===\n")
results = check_ambiguity_brute_force(ambig_grammar, max_length=5)
if results:
    print("Ambiguous strings found:")
    for s, count in sorted(results, key=lambda x: len(x[0])):
        print(f"  '{s}' has {count} distinct leftmost derivations")
else:
    print("No ambiguity found (up to given length)")

6. Resolving Ambiguity¶

Strategy 1: Operator Precedence¶

Introduce a hierarchy of nonterminals, one per precedence level:

Level 0 (lowest):   E -> E + T | T        (addition)
Level 1:            T -> T * F | F        (multiplication)
Level 2 (highest):  F -> ( E ) | id       (atoms)

Higher-precedence operators are "deeper" in the grammar, so they bind tighter.

For a language with four levels of precedence:

expr   -> expr || or_t | or_t            // Level 0: logical OR
or_t   -> or_t && and_t | and_t          // Level 1: logical AND
and_t  -> and_t == rel_t | rel_t         // Level 2: equality
rel_t  -> rel_t < factor | factor        // Level 3: relational
factor -> ( expr ) | id | num            // Level 4: atoms

Strategy 2: Associativity¶

Left-associative operators use left recursion: $E \rightarrow E + T$
a + b + c parses as (a + b) + c
Right-associative operators use right recursion: $E \rightarrow T = E$
a = b = c parses as a = (b = c)
Non-associative operators: no recursion on the same level
a < b < c is a syntax error

# Left-associative: a + b + c = (a + b) + c
E -> E + T | T

# Right-associative: a = b = c = a = (b = c)
E -> T = E | T

# Non-associative: a < b is ok, a < b < c is error
E -> T < T | T

Strategy 3: Disambiguation Rules (External)¶

Some parser generators allow you to specify precedence and associativity externally rather than encoding them in the grammar:

/* Yacc/Bison precedence declarations */
%left '+' '-'         /* lowest precedence, left-associative */
%left '*' '/'         /* higher precedence, left-associative */
%right UMINUS         /* highest precedence, right-associative (unary minus) */

%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '-' expr %prec UMINUS
     | '(' expr ')'
     | NUMBER
     ;

This approach keeps the grammar simple (and ambiguous) but resolves ambiguity with explicit rules.

Strategy 4: The Dangling Else Problem¶

The dangling else is a classic ambiguity:

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other

For if E1 then if E2 then S1 else S2, two parse trees exist:

Parse tree 1: if E1 then (if E2 then S1 else S2)     -- else matches inner if
Parse tree 2: if E1 then (if E2 then S1) else S2      -- else matches outer if

Resolution: The convention is that else matches the nearest unmatched if. This can be encoded in the grammar:

stmt         -> matched | unmatched
matched      -> if expr then matched else matched | other
unmatched    -> if expr then stmt
              | if expr then matched else unmatched

Or resolved externally: most parser generators associate else with the nearest if by default.

Python Example: Resolving Ambiguity¶

# Ambiguous grammar for expressions
ambiguous = Grammar.from_string("""
E -> E + E | E * E | ( E ) | id
""")

# Unambiguous grammar (precedence + left-associativity)
unambiguous = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")

print("=== Ambiguous Grammar ===")
print(ambiguous)
print()

print("=== Unambiguous Grammar (same language) ===")
print(unambiguous)
print()

# Verify they generate the same strings (for small strings)
def generate_strings(grammar, max_depth=8):
    """Generate all strings derivable up to a certain derivation depth."""
    strings = set()

    def derive(form, depth):
        if depth > max_depth:
            return
        if all(grammar.is_terminal(s) for s in form):
            strings.add(tuple(form))
            return
        for i, sym in enumerate(form):
            if grammar.is_variable(sym):
                for prod in grammar.prod_index.get(sym, []):
                    new_form = form[:i] + list(prod.body) + form[i+1:]
                    derive(new_form, depth + 1)
                break

    derive([grammar.start], 0)
    return {' '.join(s) for s in strings}

strings_ambig = generate_strings(ambiguous, 6)
strings_unambig = generate_strings(unambiguous, 8)

print(f"Ambiguous grammar generates {len(strings_ambig)} strings (depth 6)")
print(f"Unambiguous grammar generates {len(strings_unambig)} strings (depth 8)")
print(f"Strings in ambiguous but not unambiguous: {strings_ambig - strings_unambig}")
print(f"Strings in unambiguous but not ambiguous: {strings_unambig - strings_ambig}")

7. Chomsky Normal Form (CNF)¶

A CFG is in Chomsky Normal Form if every production has one of these forms:

$A \rightarrow BC$ where $B, C \in V$ (two nonterminals)
$A \rightarrow a$ where $a \in \Sigma$ (one terminal)
$S \rightarrow \epsilon$ (only if $\epsilon \in L(G)$, and $S$ does not appear on any right-hand side)

Why CNF Matters¶

CYK parsing algorithm requires CNF (see Section 9)
Every parse tree in CNF is a binary tree (useful for analysis)
Every CFG can be converted to an equivalent grammar in CNF
A derivation of a string of length $n$ in CNF has exactly $2n - 1$ steps

Conversion to CNF¶

Input: CFG $G = (V, \Sigma, P, S)$ Output: Equivalent CFG $G'$ in CNF

Step 1: Remove $\epsilon$-productions (except possibly $S \rightarrow \epsilon$)

Find all nullable variables: $A$ is nullable if $A \xRightarrow{*} \epsilon$.

For each production $B \rightarrow \alpha$, add new productions with every possible subset of nullable symbols removed. Then delete all $\epsilon$-productions (except $S \rightarrow \epsilon$ if needed).

Step 2: Remove unit productions ($A \rightarrow B$)

Find all unit pairs: $(A, B)$ such that $A \xRightarrow{*} B$ using only unit productions. Replace each unit chain with direct productions.

Step 3: Replace long productions

Replace $A \rightarrow B_1 B_2 \cdots B_k$ (where $k > 2$) with:

$$A \rightarrow B_1 C_1$$ $$C_1 \rightarrow B_2 C_2$$ $$\vdots$$ $$C_{k-2} \rightarrow B_{k-1} B_k$$

Step 4: Replace terminal-nonterminal mixtures

In any production $A \rightarrow \alpha$ with $|\alpha| \geq 2$, replace each terminal $a$ with a new variable $T_a$ and add $T_a \rightarrow a$.

Python Implementation¶

def to_cnf(grammar: Grammar) -> Grammar:
    """
    Convert a grammar to Chomsky Normal Form.
    """
    variables = set(grammar.variables)
    terminals = set(grammar.terminals)
    productions = list(grammar.productions)
    start = grammar.start

    aux_counter = [0]
    def new_var(prefix="X"):
        aux_counter[0] += 1
        name = f"{prefix}{aux_counter[0]}"
        variables.add(name)
        return name

    # ============================================================
    # Step 0: New start symbol (ensure S doesn't appear on RHS)
    # ============================================================
    rhs_symbols = set()
    for p in productions:
        rhs_symbols.update(p.body)
    if start in rhs_symbols:
        new_start = new_var("S")
        productions.append(Production(new_start, (start,)))
        start = new_start

    # ============================================================
    # Step 1: Remove ε-productions
    # ============================================================

    # Find nullable variables
    nullable = set()
    changed = True
    while changed:
        changed = False
        for p in productions:
            if not p.body or all(s in nullable for s in p.body):
                if p.head not in nullable:
                    nullable.add(p.head)
                    changed = True

    # Generate new productions with nullable symbols optionally removed
    new_prods = set()
    for p in productions:
        if not p.body:
            # ε-production: only keep for start symbol
            if p.head == start:
                new_prods.add(p)
            continue

        # Find all positions with nullable symbols
        nullable_positions = [i for i, s in enumerate(p.body) if s in nullable]

        # Generate all subsets of nullable positions
        from itertools import combinations
        for r in range(len(nullable_positions) + 1):
            for positions_to_remove in combinations(nullable_positions, r):
                new_body = tuple(
                    s for i, s in enumerate(p.body)
                    if i not in positions_to_remove
                )
                if new_body:  # Don't add empty unless it's start
                    new_prods.add(Production(p.head, new_body))
                elif p.head == start:
                    new_prods.add(Production(p.head, ()))

    productions = list(new_prods)

    # ============================================================
    # Step 2: Remove unit productions (A -> B)
    # ============================================================

    # Find unit pairs using transitive closure
    unit_pairs = set()
    for v in variables:
        unit_pairs.add((v, v))  # Reflexive

    changed = True
    while changed:
        changed = False
        for p in productions:
            if len(p.body) == 1 and p.body[0] in variables:
                for (a, b) in list(unit_pairs):
                    if b == p.head:
                        new_pair = (a, p.body[0])
                        if new_pair not in unit_pairs:
                            unit_pairs.add(new_pair)
                            changed = True

    # Replace unit productions
    new_prods = set()
    for p in productions:
        if len(p.body) == 1 and p.body[0] in variables:
            continue  # Skip unit productions

        # For each unit pair (A, B) where B is the head of this production
        for (a, b) in unit_pairs:
            if b == p.head:
                new_prods.add(Production(a, p.body))

    productions = list(new_prods)

    # ============================================================
    # Step 3: Replace terminals in mixed productions
    # ============================================================
    term_vars = {}  # terminal -> variable name
    additional = []

    final_prods = []
    for p in productions:
        if len(p.body) <= 1:
            final_prods.append(p)
            continue

        # Replace terminals with new variables
        new_body = []
        for sym in p.body:
            if sym in terminals:
                if sym not in term_vars:
                    tv = new_var("T")
                    term_vars[sym] = tv
                    additional.append(Production(tv, (sym,)))
                new_body.append(term_vars[sym])
            else:
                new_body.append(sym)
        final_prods.append(Production(p.head, tuple(new_body)))

    productions = final_prods + additional

    # ============================================================
    # Step 4: Break long productions into binary
    # ============================================================
    final_prods = []
    for p in productions:
        if len(p.body) <= 2:
            final_prods.append(p)
            continue

        # A -> B1 B2 B3 ... Bk  =>  A -> B1 C1, C1 -> B2 C2, ..., Ck-2 -> Bk-1 Bk
        symbols = list(p.body)
        current_head = p.head
        while len(symbols) > 2:
            new_head = new_var("C")
            final_prods.append(Production(current_head, (symbols[0], new_head)))
            symbols = symbols[1:]
            current_head = new_head
        final_prods.append(Production(current_head, tuple(symbols)))

    productions = final_prods

    return Grammar(variables, terminals, productions, start)


# Example
print("=== CNF Conversion ===\n")

expr_g = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")

print("Original grammar:")
print(expr_g)
print()

cnf = to_cnf(expr_g)
print("CNF grammar:")
for p in sorted(cnf.productions, key=lambda p: (p.head, p.body)):
    print(f"  {p}")

# Verify: check that some strings are still derivable
print("\nVerification (derivability check):")
for w in ["id", "id + id", "id * id", "id + id * id"]:
    symbols = w.split()
    # We'll verify using CYK (next section)

8. Greibach Normal Form (GNF)¶

A CFG is in Greibach Normal Form if every production has the form:

$$A \rightarrow a B_1 B_2 \cdots B_k$$

where $a \in \Sigma$ and $B_1, \ldots, B_k \in V$ (with $k \geq 0$).

In other words, every production starts with exactly one terminal, followed by zero or more nonterminals.

Properties¶

Every derivation step in GNF consumes exactly one terminal
A string of length $n$ requires exactly $n$ derivation steps (no $\epsilon$-productions, except possibly $S \rightarrow \epsilon$)
GNF is useful for constructing pushdown automata
Conversion to GNF requires eliminating left recursion (covered in Lesson 5)

Comparison of Normal Forms¶

Property	CNF	GNF
Production form	$A \rightarrow BC$ or $A \rightarrow a$	$A \rightarrow a\alpha$ ($\alpha \in V^*$)
Parse tree shape	Binary tree	High branching factor
Derivation length for $\|w\| = n$	$2n - 1$	$n$
Primary use	CYK algorithm	PDA construction
Left recursion	Allowed	Not allowed

9. The CYK Algorithm¶

The Cocke-Younger-Kasami (CYK) algorithm is a dynamic programming algorithm that determines whether a string $w$ is in $L(G)$ for a grammar $G$ in CNF. It also constructs a parse tree.

Algorithm¶

Input: Grammar $G$ in CNF, string $w = a_1 a_2 \cdots a_n$ Output: Whether $w \in L(G)$, and a parse table

Idea: Build a triangular table $T$ where $T[i][j]$ is the set of variables that can derive the substring $a_i a_{i+1} \cdots a_j$.

$$T[i][j] = \{A \in V \mid A \xRightarrow{*} a_i a_{i+1} \cdots a_j\}$$

Algorithm:

1. // Base case: substrings of length 1
   for i = 1 to n:
       T[i][i] = {A | A -> aᵢ ∈ P}

2. // Inductive case: substrings of length 2, 3, ..., n
   for length = 2 to n:
       for i = 1 to n - length + 1:
           j = i + length - 1
           T[i][j] = ∅
           for k = i to j - 1:
               for each production A -> BC:
                   if B ∈ T[i][k] and C ∈ T[k+1][j]:
                       T[i][j] = T[i][j] ∪ {A}

3. // Accept if start symbol is in top-right cell
   return S ∈ T[1][n]

Time complexity: $O(n^3 \cdot |P|)$ Space complexity: $O(n^2)$

Worked Example¶

Grammar in CNF:

$$S \rightarrow AB \mid BC$$ $$A \rightarrow BA \mid a$$ $$B \rightarrow CC \mid b$$ $$C \rightarrow AB \mid a$$

Parse "baaba":

String:  b   a   a   b   a
Index:   1   2   3   4   5

Table T[i][j]:

i\j |  1      2       3       4       5
----+-------------------------------------------
 1  | {B}   {S,A}  {B}     {S,A,C} {B}
 2  |        {A,C}  {S,C}   {B}    {S,A,C}
 3  |               {A,C}   {S,A}  {B}
 4  |                        {B}   {S,A,C}
 5  |                               {A,C}

Step-by-step:

Length 1 (diagonal): - $T[1][1]$: $b$ matched by $B \rightarrow b$, so $\{B\}$ - $T[2][2]$: $a$ matched by $A \rightarrow a$ and $C \rightarrow a$, so $\{A, C\}$ - Similarly for positions 3, 4, 5

Length 2: - $T[1][2]$: split at $k=1$: $T[1][1] \times T[2][2] = \{B\} \times \{A,C\}$ - $B, A$: Check $? \rightarrow BA$: $A \rightarrow BA$ and $S \rightarrow BC$... Actually $A \rightarrow BA$ gives $A$, and any production giving $BA$? Let's check: $S \rightarrow AB$ needs $(A,B)$, $S \rightarrow BC$ needs $(B,C)$: $B \in T[1][1]$ and $C \in T[2][2]$, so $S$. - $A \rightarrow BA$: $B \in T[1][1]$ and $A \in T[2][2]$, so $A$. - Result: $\{S, A\}$

(Continuing similarly for all cells...)

Since $S \in T[1][5]$, the string "baaba" is in the language.

Python Implementation¶

def cyk_parse(grammar: Grammar, word: str) -> Tuple[bool, list]:
    """
    CYK parsing algorithm.
    Grammar must be in Chomsky Normal Form.
    word is a space-separated string of terminals.

    Returns (accepted, table) where table[i][j] is the set of
    variables that can derive word[i..j].
    """
    symbols = word.split()
    n = len(symbols)

    if n == 0:
        # Check if S -> ε exists
        for p in grammar.prod_index.get(grammar.start, []):
            if not p.body:
                return (True, [])
        return (False, [])

    # Initialize table: T[i][j] = set of variables
    # Using 0-indexed
    T = [[set() for _ in range(n)] for _ in range(n)]

    # Also store back-pointers for parse tree construction
    # back[i][j][A] = (B, C, k) meaning A -> BC, B derives [i..k], C derives [k+1..j]
    back = [[{} for _ in range(n)] for _ in range(n)]

    # Step 1: Base case (length 1)
    for i in range(n):
        for p in grammar.productions:
            if len(p.body) == 1 and p.body[0] == symbols[i]:
                T[i][i].add(p.head)
                back[i][i][p.head] = ('terminal', symbols[i])

    # Step 2: Fill table for increasing lengths
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                # Try all productions A -> BC
                for p in grammar.productions:
                    if len(p.body) == 2:
                        B, C = p.body
                        if B in T[i][k] and C in T[k+1][j]:
                            T[i][j].add(p.head)
                            if p.head not in back[i][j]:
                                back[i][j][p.head] = ('split', B, C, k)

    accepted = grammar.start in T[0][n-1]

    # Print table
    print(f"CYK Table for '{word}':\n")
    col_width = 12
    header = "i\\j |" + "|".join(f"{j:^{col_width}}" for j in range(n))
    print(header)
    print("-" * len(header))
    for i in range(n):
        cells = []
        for j in range(n):
            if j < i:
                cells.append(" " * col_width)
            else:
                cell = "{" + ",".join(sorted(T[i][j])) + "}" if T[i][j] else "∅"
                cells.append(f"{cell:^{col_width}}")
        print(f" {i}  |" + "|".join(cells))

    print(f"\nAccepted: {accepted}")

    # Build parse tree
    def build_tree(var, i, j):
        info = back[i][j].get(var)
        if info is None:
            return ParseTreeNode(var, [ParseTreeNode("?", is_terminal=True)])
        if info[0] == 'terminal':
            return ParseTreeNode(var, [ParseTreeNode(info[1], is_terminal=True)])
        else:
            _, B, C, k = info
            left = build_tree(B, i, k)
            right = build_tree(C, k+1, j)
            return ParseTreeNode(var, [left, right])

    if accepted:
        tree = build_tree(grammar.start, 0, n-1)
        print("\nParse tree:")
        tree.pretty_print()

    return (accepted, T)


# Example: CYK parsing
cnf_grammar = Grammar.from_string("""
S -> A B | B C
A -> B A | a
B -> C C | b
C -> A B | a
""")

print("=== CYK Parsing ===\n")
print("Grammar:")
print(cnf_grammar)
print()

cyk_parse(cnf_grammar, "b a a b a")
print()
cyk_parse(cnf_grammar, "a b")
print()
cyk_parse(cnf_grammar, "b b b")

10. Pushdown Automata (Overview)¶

A pushdown automaton (PDA) is a finite automaton augmented with a stack. PDAs recognize exactly the context-free languages.

Formal Definition¶

A PDA is a 7-tuple:

$$M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F)$$

where:

$Q$ is a finite set of states
$\Sigma$ is the input alphabet
$\Gamma$ is the stack alphabet
$\delta: Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma \rightarrow \mathcal{P}(Q \times \Gamma^*)$ is the transition function
$q_0 \in Q$ is the start state
$Z_0 \in \Gamma$ is the initial stack symbol
$F \subseteq Q$ is the set of accepting states

How a PDA Works¶

At each step, based on: 1. The current state 2. The current input symbol (or $\epsilon$ for no input) 3. The top of the stack

The PDA: 1. Moves to a new state 2. Optionally consumes an input symbol 3. Replaces the top of the stack with zero or more symbols

Equivalence with CFGs¶

Theorem: A language is context-free if and only if it is recognized by some PDA.

CFG $\rightarrow$ PDA: For every CFG, we can construct a PDA that simulates leftmost derivations. The PDA uses the stack to track the remaining symbols to be derived.

PDA $\rightarrow$ CFG: For every PDA, we can construct a CFG. The construction creates variables $[q_i, A, q_j]$ representing "the PDA goes from state $q_i$ to state $q_j$ while popping $A$ off the stack."

Python: Simple PDA Simulation¶

class PDA:
    """Simple Pushdown Automaton simulation."""

    def __init__(self, transitions, start_state, start_stack, accept_states):
        """
        transitions: dict mapping (state, input_or_eps, stack_top) ->
                     list of (new_state, stack_push)
        stack_push is a tuple: () means pop, ('A',) means replace with A,
                                ('A', 'B') means replace top with A then push B
        """
        self.transitions = transitions
        self.start_state = start_state
        self.start_stack = start_stack
        self.accept_states = accept_states

    def accepts(self, word: str) -> bool:
        """Check if the PDA accepts the word (by final state)."""
        # Configuration: (state, remaining_input_pos, stack)
        # Use BFS to explore all nondeterministic choices
        from collections import deque

        initial = (self.start_state, 0, (self.start_stack,))
        queue = deque([initial])
        visited = set()

        symbols = list(word) if word else []

        while queue:
            state, pos, stack = queue.popleft()

            config = (state, pos, stack)
            if config in visited:
                continue
            visited.add(config)

            # Check acceptance
            if pos == len(symbols) and state in self.accept_states:
                return True

            if not stack:
                continue

            top = stack[-1]
            rest = stack[:-1]

            # Try epsilon transitions
            for new_state, push in self.transitions.get(
                (state, 'ε', top), []
            ):
                new_stack = rest + push
                if len(new_stack) < 100:  # Bound stack size
                    queue.append((new_state, pos, new_stack))

            # Try input transitions
            if pos < len(symbols):
                ch = symbols[pos]
                for new_state, push in self.transitions.get(
                    (state, ch, top), []
                ):
                    new_stack = rest + push
                    if len(new_stack) < 100:
                        queue.append((new_state, pos + 1, new_stack))

        return False


# PDA for {a^n b^n | n >= 0}
# States: q0 (push a's), q1 (pop a's matching b's), q_accept
pda_anbn = PDA(
    transitions={
        # In q0, push a's onto stack
        ('q0', 'a', 'Z'): [('q0', ('Z', 'A'))],    # Push A, keep Z
        ('q0', 'a', 'A'): [('q0', ('A', 'A'))],    # Push A
        # Switch to q1 when seeing b
        ('q0', 'b', 'A'): [('q1', ())],             # Pop A
        # In q0, accept empty string
        ('q0', 'ε', 'Z'): [('q_accept', ('Z',))],
        # In q1, pop a's matching b's
        ('q1', 'b', 'A'): [('q1', ())],             # Pop A
        # Accept when stack has only Z
        ('q1', 'ε', 'Z'): [('q_accept', ('Z',))],
    },
    start_state='q0',
    start_stack='Z',
    accept_states={'q_accept'}
)

print("=== PDA for {a^n b^n | n >= 0} ===\n")
for w in ["", "ab", "aabb", "aaabbb", "aab", "abb", "ba", "abab"]:
    result = "accept" if pda_anbn.accepts(w) else "reject"
    print(f"  '{w}': {result}")

11. The Pumping Lemma for Context-Free Languages¶

Statement¶

If $L$ is a context-free language, then there exists a constant $p \geq 1$ such that every string $w \in L$ with $|w| \geq p$ can be split into five parts:

$$w = uvxyz$$

satisfying:

$|vy| > 0$ (at least one of $v$ and $y$ is nonempty)
$|vxy| \leq p$
$\forall i \geq 0: uv^ixy^iz \in L$ (pumping $v$ and $y$ together)

Proof Idea¶

Any sufficiently long string must have a derivation tree of height greater than $|V|$ (the number of variables). By the pigeonhole principle, some variable $A$ must repeat on a path from root to leaf:

        S
       / \
      /   \
     /     \
    A       <-- first occurrence of A
   / \
  v   A    <-- second occurrence of A
     / \
    x   y

The subtree rooted at the first $A$ generates $vxy$, and the subtree rooted at the second $A$ generates $x$. We can "pump" by repeating or removing the portion between the two $A$'s.

Using the CFL Pumping Lemma¶

To prove a language is NOT context-free:

Assume $L$ is context-free
Let $p$ be the pumping length
Choose a specific $w \in L$ with $|w| \geq p$
Show that for all decompositions $w = uvxyz$ with $|vy| > 0$ and $|vxy| \leq p$, there exists some $i$ where $uv^ixy^iz \notin L$
Contradiction

Example: $L = \{a^n b^n c^n \mid n \geq 0\}$ Is Not Context-Free¶

Proof:

Assume $L$ is context-free with pumping length $p$.
Choose $w = a^p b^p c^p \in L$, with $|w| = 3p \geq p$.
Let $w = uvxyz$ with $|vy| > 0$ and $|vxy| \leq p$.
Since $|vxy| \leq p$, the substring $vxy$ can span at most two of the three groups ($a$'s, $b$'s, $c$'s).
Case analysis:
$vxy$ is within $a^p$: pumping changes $a$ count but not $b$ or $c$.
$vxy$ is within $a^p b^p$: pumping changes $a$ and/or $b$ but not $c$.
$vxy$ is within $b^p$: pumping changes $b$ but not $a$ or $c$.
$vxy$ is within $b^p c^p$: pumping changes $b$ and/or $c$ but not $a$.
$vxy$ is within $c^p$: pumping changes $c$ but not $a$ or $b$.
In all cases, pumping ($i = 2$) produces a string where the three counts are unequal, so $uv^2xy^2z \notin L$.
Contradiction. $\blacksquare$

Languages Beyond Context-Free¶

Language	Type	Why Not CF
$\{a^n b^n c^n \mid n \geq 0\}$	Context-sensitive	Needs three-way matching
$\{ww \mid w \in \{a,b\}^*\}$	Context-sensitive	Copy language
$\{a^{2^n} \mid n \geq 0\}$	Context-sensitive	Exponential growth
Any decidable language	Type 0 or below	Turing machine required

Note: $\{a^n b^n \mid n \geq 0\}$ IS context-free. The limitation is three or more matched groups.

12. The Chomsky Hierarchy¶

The Chomsky hierarchy classifies formal languages into four levels:

Type	Name	Grammar restriction	Automaton	Example
3	Regular	$A \rightarrow aB$ or $A \rightarrow a$	Finite automaton (DFA/NFA)	$a^b^$
2	Context-free	$A \rightarrow \alpha$ ($\alpha \in (V \cup \Sigma)^*$)	Pushdown automaton	$a^nb^n$
1	Context-sensitive	$\alpha A \beta \rightarrow \alpha \gamma \beta$ ($\|\gamma\| \geq 1$)	Linear-bounded automaton	$a^nb^nc^n$
0	Recursively enumerable	No restriction	Turing machine	Halting problem complement

Hierarchy (proper inclusions):

Regular  ⊂  Context-Free  ⊂  Context-Sensitive  ⊂  Recursively Enumerable
  (Type 3)    (Type 2)         (Type 1)              (Type 0)

Relevance to Compiler Design¶

Phase	Language Type	Formalism
Lexical analysis	Regular (Type 3)	DFA, regular expressions
Syntax analysis	Context-free (Type 2)	CFG, pushdown automata
Semantic analysis	Context-sensitive (Type 1)	Attribute grammars, type systems
Full language semantics	Beyond CF	Turing machines (undecidable in general)

13. Grammar Transformations¶

Eliminating Left Recursion¶

A grammar has left recursion if there exists a derivation $A \xRightarrow{+} A\alpha$ for some $\alpha$. Left recursion causes problems for top-down parsers (infinite loop).

Direct left recursion ($A \rightarrow A\alpha \mid \beta$):

Replace with: $$A \rightarrow \beta A'$$ $$A' \rightarrow \alpha A' \mid \epsilon$$

Example: $$E \rightarrow E + T \mid T$$

becomes: $$E \rightarrow T E'$$ $$E' \rightarrow + T E' \mid \epsilon$$

Left Factoring¶

When two productions for the same nonterminal share a common prefix:

$$A \rightarrow \alpha \beta_1 \mid \alpha \beta_2$$

Replace with: $$A \rightarrow \alpha A'$$ $$A' \rightarrow \beta_1 \mid \beta_2$$

Python Implementation¶

def eliminate_left_recursion(grammar: Grammar) -> Grammar:
    """
    Eliminate immediate left recursion from all productions.
    (Does not handle indirect left recursion.)
    """
    new_productions = []
    new_variables = set(grammar.variables)

    for var in sorted(grammar.variables):
        prods = grammar.prod_index.get(var, [])

        # Separate left-recursive and non-left-recursive productions
        left_recursive = []  # A -> A α
        non_recursive = []   # A -> β

        for p in prods:
            if p.body and p.body[0] == var:
                left_recursive.append(p.body[1:])  # α (without leading A)
            else:
                non_recursive.append(p.body)

        if not left_recursive:
            # No left recursion for this variable
            new_productions.extend(prods)
            continue

        # Eliminate left recursion
        # A -> β1 | β2 | ... | A α1 | A α2 | ...
        # becomes:
        # A  -> β1 A' | β2 A' | ...
        # A' -> α1 A' | α2 A' | ... | ε

        new_var = var + "'"
        while new_var in new_variables:
            new_var += "'"
        new_variables.add(new_var)

        # A -> βi A'
        for beta in non_recursive:
            new_body = beta + (new_var,) if beta else (new_var,)
            new_productions.append(Production(var, new_body))

        # A' -> αi A'
        for alpha in left_recursive:
            new_body = alpha + (new_var,)
            new_productions.append(Production(new_var, new_body))

        # A' -> ε
        new_productions.append(Production(new_var, ()))

    return Grammar(
        new_variables, grammar.terminals,
        new_productions, grammar.start
    )


def left_factor(grammar: Grammar) -> Grammar:
    """
    Apply left factoring to a grammar.
    """
    new_productions = list(grammar.productions)
    new_variables = set(grammar.variables)
    aux_counter = [0]

    changed = True
    while changed:
        changed = False

        # Index by head
        prod_index = {}
        for p in new_productions:
            prod_index.setdefault(p.head, []).append(p)

        next_prods = []
        processed_vars = set()

        for var in sorted(prod_index.keys()):
            if var in processed_vars:
                continue
            prods = prod_index[var]

            # Find longest common prefix among any two productions
            best_prefix = None
            best_group = None

            for i in range(len(prods)):
                for j in range(i + 1, len(prods)):
                    # Find common prefix
                    prefix = []
                    for k in range(min(len(prods[i].body), len(prods[j].body))):
                        if prods[i].body[k] == prods[j].body[k]:
                            prefix.append(prods[i].body[k])
                        else:
                            break

                    if prefix and (best_prefix is None or len(prefix) > len(best_prefix)):
                        best_prefix = tuple(prefix)
                        # Find all productions sharing this prefix
                        group = [p for p in prods if p.body[:len(prefix)] == best_prefix]
                        best_group = group

            if best_prefix and best_group and len(best_group) > 1:
                changed = True
                processed_vars.add(var)

                # Create new variable
                aux_counter[0] += 1
                new_var = f"{var}_LF{aux_counter[0]}"
                new_variables.add(new_var)

                # A -> prefix A_LF
                next_prods.append(Production(var, best_prefix + (new_var,)))

                # A_LF -> remaining1 | remaining2 | ...
                for p in best_group:
                    remaining = p.body[len(best_prefix):]
                    next_prods.append(Production(new_var, remaining if remaining else ()))

                # Keep non-matching productions
                for p in prods:
                    if p not in best_group:
                        next_prods.append(p)
            else:
                next_prods.extend(prods)

        new_productions = next_prods

    return Grammar(
        new_variables, grammar.terminals,
        new_productions, grammar.start
    )


# Example: Eliminate left recursion
print("=== Eliminating Left Recursion ===\n")
print("Original:")
print(expr_grammar)
print()

no_lr = eliminate_left_recursion(expr_grammar)
print("After eliminating left recursion:")
print(no_lr)
print()

# Example: Left factoring
lf_grammar = Grammar.from_string("""
S -> if E then S else S | if E then S | other
E -> id
""")

print("=== Left Factoring ===\n")
print("Original:")
print(lf_grammar)
print()

factored = left_factor(lf_grammar)
print("After left factoring:")
print(factored)

Summary¶

A context-free grammar $G = (V, \Sigma, P, S)$ generates a language through derivations
BNF and EBNF are standard notations for writing grammars
Leftmost and rightmost derivations correspond to different parsing strategies
Parse trees are the structural representation of derivations
A grammar is ambiguous if some string has multiple parse trees; ambiguity is undecidable in general
Ambiguity is resolved through precedence, associativity, and grammar restructuring
Chomsky Normal Form (binary productions) enables the CYK algorithm for parsing in $O(n^3)$
Greibach Normal Form (terminal-first productions) is useful for PDA construction
Pushdown automata recognize exactly the context-free languages
The CFL pumping lemma proves that languages like $\{a^n b^n c^n\}$ are not context-free
The Chomsky hierarchy classifies languages into four types: regular, context-free, context-sensitive, and recursively enumerable
Left recursion elimination and left factoring prepare grammars for top-down parsing

Exercises¶

Exercise 1: Grammar Writing¶

Write a context-free grammar for each of the following languages:

Palindromes over $\{a, b\}$
Balanced parentheses (possibly nested)
$\{a^i b^j c^k \mid i = j \text{ or } j = k\}$
A simplified subset of Python if/elif/else statements
Arithmetic expressions with +, -, *, /, unary -, and parentheses, respecting standard precedence and left-associativity

Exercise 2: Derivations and Parse Trees¶

Using the grammar $E \rightarrow E + T \mid T$, $T \rightarrow T * F \mid F$, $F \rightarrow (E) \mid \texttt{id} \mid \texttt{num}$:

Find the leftmost derivation for id * (num + id)
Find the rightmost derivation for the same string
Draw the parse tree
Verify that both derivations correspond to the same parse tree

Exercise 3: Ambiguity Analysis¶

Consider the grammar:

$$S \rightarrow aSb \mid aSbb \mid \epsilon$$

What language does this grammar generate?
Is the grammar ambiguous? If so, find a string with two parse trees.
Can you find an unambiguous grammar for the same language?

Exercise 4: CNF Conversion¶

Convert the following grammar to Chomsky Normal Form:

$$S \rightarrow ASA \mid aB$$ $$A \rightarrow B \mid S$$ $$B \rightarrow b \mid \epsilon$$

Show each step of the conversion process.

Exercise 5: CYK Algorithm¶

Using the CNF grammar from Exercise 4 (or the one provided in Section 9), run the CYK algorithm on the strings: 1. "aabb" 2. "abab" 3. "baba"

Show the complete table for each string.

Exercise 6: Pumping Lemma¶

Use the pumping lemma for CFLs to prove that the following languages are NOT context-free:

$L_1 = \{a^n b^n c^n d^n \mid n \geq 0\}$
$L_2 = \{a^i b^j c^k \mid 0 \leq i \leq j \leq k\}$
$L_3 = \{ww \mid w \in \{a, b\}^*\}$ (the copy language)

Previous: Finite Automata | Next: Top-Down Parsing | Overview