Lesson 4: Context-Free Grammars

Lesson 4: Context-Free Grammars

Learning Objectives

After completing this lesson, you will be able to:

  1. Give the formal definition of a context-free grammar (CFG)
  2. Use BNF and EBNF notation to write grammars
  3. Construct leftmost and rightmost derivations for a given string
  4. Build parse trees and understand their relationship to derivations
  5. Identify and resolve ambiguity in grammars
  6. Convert grammars to Chomsky Normal Form (CNF) and Greibach Normal Form (GNF)
  7. Apply the CYK parsing algorithm
  8. Understand the connection between CFGs and pushdown automata
  9. Apply the pumping lemma for context-free languages
  10. Implement grammar manipulation and CYK parsing in Python

1. Formal Definition of a Context-Free Grammar

A context-free grammar (CFG) is a 4-tuple:

$$G = (V, \Sigma, P, S)$$

where:

  • $V$ is a finite set of variables (also called nonterminals)
  • $\Sigma$ is a finite set of terminal symbols (the alphabet), with $V \cap \Sigma = \emptyset$
  • $P$ is a finite set of productions (rules), each of the form $A \rightarrow \alpha$ where $A \in V$ and $\alpha \in (V \cup \Sigma)^*$
  • $S \in V$ is the start symbol

Notation Conventions

Throughout this lesson, we use:

Symbol Represents
$A, B, C, S$ Variables (nonterminals), uppercase
$a, b, c$ Terminals, lowercase
$\alpha, \beta, \gamma$ Strings in $(V \cup \Sigma)^*$
$w, x, y, z$ Strings in $\Sigma^*$ (terminal strings)
$\epsilon$ The empty string

Example CFG

A grammar for simple arithmetic expressions:

$$G = (\{E, T, F\}, \{+, *, (, ), \texttt{id}\}, P, E)$$

Productions $P$:

$$E \rightarrow E + T \mid T$$ $$T \rightarrow T * F \mid F$$ $$F \rightarrow (E) \mid \texttt{id}$$

This grammar defines the language of arithmetic expressions with +, *, parentheses, and identifiers, respecting the conventional precedence (* binds tighter than +).

Derivation

A derivation is a sequence of steps that transforms the start symbol into a string of terminals by repeatedly replacing a variable with the right-hand side of one of its productions.

If $A \rightarrow \beta$ is a production and $\alpha A \gamma$ is a sentential form, then:

$$\alpha A \gamma \Rightarrow \alpha \beta \gamma$$

We write $\alpha \xRightarrow{*} \beta$ for zero or more derivation steps.

Language of a Grammar

The language generated by $G$ is:

$$L(G) = \{w \in \Sigma^* \mid S \xRightarrow{*} w\}$$

The set of all terminal strings derivable from the start symbol.

Python Representation

from dataclasses import dataclass, field
from typing import List, Set, Dict, Tuple, Optional


@dataclass
class Production:
    """A grammar production A -> symbols."""
    head: str                 # Left-hand side (nonterminal)
    body: Tuple[str, ...]     # Right-hand side (tuple of symbols)

    def __repr__(self):
        body_str = ' '.join(self.body) if self.body else 'ε'
        return f"{self.head} -> {body_str}"

    def __hash__(self):
        return hash((self.head, self.body))

    def __eq__(self, other):
        return (isinstance(other, Production) and
                self.head == other.head and self.body == other.body)


class Grammar:
    """Context-Free Grammar."""

    def __init__(self, variables: Set[str], terminals: Set[str],
                 productions: List[Production], start: str):
        self.variables = variables
        self.terminals = terminals
        self.productions = productions
        self.start = start

        # Index productions by head
        self.prod_index: Dict[str, List[Production]] = {}
        for p in productions:
            self.prod_index.setdefault(p.head, []).append(p)

    @classmethod
    def from_string(cls, grammar_str: str, start: str = None) -> 'Grammar':
        """
        Parse a grammar from a multi-line string.
        Format: 'A -> α | β | γ' (one nonterminal per line).
        Nonterminals are uppercase single letters or quoted names.
        Use 'ε' or 'epsilon' for epsilon productions.
        """
        variables = set()
        terminals = set()
        productions = []

        for line in grammar_str.strip().split('\n'):
            line = line.strip()
            if not line or line.startswith('#'):
                continue

            head, _, body_str = line.partition('->')
            head = head.strip()
            variables.add(head)

            for alt in body_str.split('|'):
                alt = alt.strip()
                if alt in ('ε', 'epsilon', ''):
                    productions.append(Production(head, ()))
                else:
                    symbols = tuple(alt.split())
                    productions.append(Production(head, symbols))

        # Infer terminals
        for p in productions:
            for sym in p.body:
                if sym not in variables:
                    terminals.add(sym)

        if start is None:
            start = productions[0].head

        return cls(variables, terminals, productions, start)

    def __repr__(self):
        lines = [f"Grammar(start={self.start})"]
        for var in sorted(self.variables):
            prods = self.prod_index.get(var, [])
            bodies = [' '.join(p.body) if p.body else 'ε' for p in prods]
            lines.append(f"  {var} -> {' | '.join(bodies)}")
        return '\n'.join(lines)

    def is_terminal(self, symbol: str) -> bool:
        return symbol in self.terminals

    def is_variable(self, symbol: str) -> bool:
        return symbol in self.variables


# Example: Arithmetic expression grammar
expr_grammar = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")

print(expr_grammar)

2. BNF and EBNF Notation

Backus-Naur Form (BNF)

BNF (Backus-Naur Form), introduced by John Backus for the ALGOL 60 report, is the standard notation for context-free grammars.

<expression>  ::= <expression> "+" <term> | <term>
<term>        ::= <term> "*" <factor> | <factor>
<factor>      ::= "(" <expression> ")" | <identifier>
<identifier>  ::= <letter> | <identifier> <letter> | <identifier> <digit>
<letter>      ::= "a" | "b" | ... | "z"
<digit>       ::= "0" | "1" | ... | "9"

BNF conventions: - Nonterminals are enclosed in angle brackets: <expression> - Terminals are enclosed in quotes: "+" - ::= means "is defined as" - | separates alternatives - Each rule defines one nonterminal

Extended BNF (EBNF)

EBNF adds convenience notation to BNF, reducing the number of rules needed:

EBNF Notation Meaning BNF Equivalent
{X} or X* Zero or more repetitions of X New nonterminal with recursion
[X] or X? Optional (zero or one) New nonterminal with epsilon alternative
(X \| Y) Grouping New nonterminal
X+ One or more repetitions New nonterminal with recursion

Example: The expression grammar in EBNF:

expression = term { "+" term } .
term       = factor { "*" factor } .
factor     = "(" expression ")" | identifier .
identifier = letter { letter | digit } .

This is more concise because repetition ({...}) replaces explicit recursion.

EBNF to BNF Conversion

EBNF: A = α { β } γ .
BNF:  A  -> α A' γ
      A' -> β A' | ε

EBNF: A = α [ β ] γ .
BNF:  A -> α β γ | α γ

EBNF: A = α ( β | γ ) δ .
BNF:  A  -> α A' δ
      A' -> β | γ

Python Implementation

def ebnf_to_bnf(ebnf_rules: str) -> str:
    """
    Convert EBNF notation to BNF.
    Handles: { } (repetition), [ ] (optional), ( | ) (grouping).

    This is a simplified converter for demonstration.
    """
    bnf_rules = []
    aux_counter = [0]

    def new_aux(base_name: str) -> str:
        aux_counter[0] += 1
        return f"{base_name}_{aux_counter[0]}"

    for line in ebnf_rules.strip().split('\n'):
        line = line.strip()
        if not line:
            continue

        head, _, body = line.partition('=')
        head = head.strip()
        body = body.strip().rstrip('.')

        # For simplicity, just handle the basic transformations
        # In practice, you'd need a proper EBNF parser

        # Handle { X } -> X_rep where X_rep -> X X_rep | ε
        import re
        while '{' in body:
            match = re.search(r'\{([^{}]+)\}', body)
            if not match:
                break
            inner = match.group(1).strip()
            aux = new_aux(head + "_rep")
            body = body[:match.start()] + aux + body[match.end():]
            bnf_rules.append(f"{aux} -> {inner} {aux} | ε")

        # Handle [ X ] -> X_opt where X_opt -> X | ε
        while '[' in body:
            match = re.search(r'\[([^\[\]]+)\]', body)
            if not match:
                break
            inner = match.group(1).strip()
            aux = new_aux(head + "_opt")
            body = body[:match.start()] + aux + body[match.end():]
            bnf_rules.append(f"{aux} -> {inner} | ε")

        bnf_rules.insert(0, f"{head} -> {body}")

    return '\n'.join(bnf_rules)


ebnf = """
expr = term { + term } .
term = factor { * factor } .
factor = ( expr ) | id .
"""

print("=== EBNF to BNF Conversion ===\n")
print("EBNF:")
print(ebnf)
print("BNF:")
print(ebnf_to_bnf(ebnf))

3. Derivations

Leftmost Derivation

In a leftmost derivation, at each step we replace the leftmost variable in the sentential form.

Notation: $\alpha \xRightarrow{lm} \beta$

Example: Derive id + id * id using the expression grammar.

$$E \xRightarrow{lm} E + T \xRightarrow{lm} T + T \xRightarrow{lm} F + T \xRightarrow{lm} \texttt{id} + T$$ $$\xRightarrow{lm} \texttt{id} + T * F \xRightarrow{lm} \texttt{id} + F * F \xRightarrow{lm} \texttt{id} + \texttt{id} * F \xRightarrow{lm} \texttt{id} + \texttt{id} * \texttt{id}$$

Rightmost Derivation

In a rightmost derivation (also called canonical derivation), at each step we replace the rightmost variable.

Notation: $\alpha \xRightarrow{rm} \beta$

$$E \xRightarrow{rm} E + T \xRightarrow{rm} E + T * F \xRightarrow{rm} E + T * \texttt{id} \xRightarrow{rm} E + F * \texttt{id}$$ $$\xRightarrow{rm} E + \texttt{id} * \texttt{id} \xRightarrow{rm} T + \texttt{id} * \texttt{id} \xRightarrow{rm} F + \texttt{id} * \texttt{id} \xRightarrow{rm} \texttt{id} + \texttt{id} * \texttt{id}$$

Both derivations produce the same parse tree (for unambiguous grammars).

Why Derivation Order Matters

  • Top-down parsers (LL) construct a leftmost derivation
  • Bottom-up parsers (LR) construct a rightmost derivation in reverse
  • For unambiguous grammars, both yield the same parse tree
  • For ambiguous grammars, different derivation orders may yield different parse trees

Python Implementation

def leftmost_derivation(grammar: Grammar, target: str) -> Optional[List[str]]:
    """
    Find a leftmost derivation of target string.
    Uses brute-force search (only practical for short strings).
    Returns list of sentential forms, or None if not derivable.
    """
    target_symbols = tuple(target.split())

    # BFS
    from collections import deque

    queue = deque()
    queue.append(((grammar.start,), [(grammar.start,)]))
    visited = {(grammar.start,)}

    max_len = len(target_symbols) * 3  # Bound to prevent infinite search

    while queue:
        current, history = queue.popleft()

        # Check if we've derived the target
        if current == target_symbols:
            return [' '.join(sf) for sf in history]

        # If current is all terminals, no more derivations possible
        if all(grammar.is_terminal(s) for s in current):
            continue

        # If too long, skip
        if len(current) > max_len:
            continue

        # Find leftmost variable
        for i, sym in enumerate(current):
            if grammar.is_variable(sym):
                # Try all productions for this variable
                for prod in grammar.prod_index.get(sym, []):
                    new_form = current[:i] + prod.body + current[i+1:]
                    if new_form not in visited:
                        visited.add(new_form)
                        queue.append((new_form, history + [new_form]))
                break  # Only expand leftmost variable

    return None


# Example
print("=== Leftmost Derivation ===\n")
derivation = leftmost_derivation(expr_grammar, "id + id * id")
if derivation:
    for i, step in enumerate(derivation):
        arrow = "  " if i == 0 else "=>"
        print(f"  {arrow} {step}")

4. Parse Trees

A parse tree (also called a derivation tree or concrete syntax tree) is a graphical representation of a derivation.

Definition

For a grammar $G = (V, \Sigma, P, S)$, a parse tree has:

  1. The root is labeled with the start symbol $S$
  2. Each interior node is labeled with a variable $A \in V$
  3. Each leaf is labeled with a terminal $a \in \Sigma$ or $\epsilon$
  4. If interior node $A$ has children $X_1, X_2, \ldots, X_k$, then $A \rightarrow X_1 X_2 \cdots X_k \in P$

Example Parse Tree

For id + id * id with the expression grammar:

             E
           / | \
          E  +  T
          |    /|\
          T  T  *  F
          |  |     |
          F  F    id
          |  |
         id  id

Reading the leaves left-to-right gives the derived string: id + id * id.

Relationship to Derivations

  • Every derivation corresponds to exactly one parse tree (for unambiguous grammars)
  • Every parse tree corresponds to exactly one leftmost derivation and one rightmost derivation
  • Different derivations (leftmost, rightmost, or arbitrary order) may correspond to the same parse tree

Python Implementation

@dataclass
class ParseTreeNode:
    """A node in a parse tree."""
    symbol: str
    children: List['ParseTreeNode'] = field(default_factory=list)
    is_terminal: bool = False

    def __repr__(self):
        if self.is_terminal:
            return f"'{self.symbol}'"
        return f"{self.symbol}"

    def leaves(self) -> List[str]:
        """Return the leaves (yield) of the parse tree."""
        if self.is_terminal:
            return [self.symbol] if self.symbol != 'ε' else []
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result

    def pretty_print(self, prefix="", is_last=True):
        """Print the tree in a readable format."""
        connector = "└── " if is_last else "├── "
        label = f"'{self.symbol}'" if self.is_terminal else self.symbol
        print(prefix + connector + label)
        new_prefix = prefix + ("    " if is_last else "│   ")
        for i, child in enumerate(self.children):
            child.pretty_print(new_prefix, i == len(self.children) - 1)


def build_parse_tree(grammar: Grammar, derivation_steps: list) -> ParseTreeNode:
    """
    Build a parse tree from a leftmost derivation.
    Each step is a sentential form (list of symbols).
    """
    # Start with root
    root = ParseTreeNode(grammar.start)

    # Track which nodes need expansion
    def find_leftmost_unexpanded(node):
        """Find the leftmost nonterminal leaf that hasn't been expanded."""
        if node.is_terminal:
            return None
        if not node.children:
            if grammar.is_variable(node.symbol):
                return node
            return None
        for child in node.children:
            result = find_leftmost_unexpanded(child)
            if result is not None:
                return result
        return None

    # Process each derivation step
    for i in range(1, len(derivation_steps)):
        current = derivation_steps[i].split()
        prev = derivation_steps[i-1].split()

        # Find which variable was expanded
        node = find_leftmost_unexpanded(root)
        if node is None:
            break

        # Find the production used
        for prod in grammar.prod_index.get(node.symbol, []):
            # Check if this production could have been applied
            expanded = []
            for j, sym in enumerate(prev):
                if sym == node.symbol and j == prev.index(node.symbol):
                    expanded.extend(prod.body if prod.body else ['ε'])
                else:
                    expanded.append(sym)

            if expanded == current or (not prod.body and len(current) == len(prev) - 1):
                # This is the right production
                if prod.body:
                    for sym in prod.body:
                        child = ParseTreeNode(
                            sym,
                            is_terminal=grammar.is_terminal(sym)
                        )
                        node.children.append(child)
                else:
                    node.children.append(
                        ParseTreeNode('ε', is_terminal=True)
                    )
                break

    return root


# Manually build a parse tree for "id + id * id"
def make_expr_tree():
    """Build the parse tree for 'id + id * id' manually."""
    # F -> id (three instances)
    id1 = ParseTreeNode('id', is_terminal=True)
    id2 = ParseTreeNode('id', is_terminal=True)
    id3 = ParseTreeNode('id', is_terminal=True)

    f1 = ParseTreeNode('F', [id1])
    f2 = ParseTreeNode('F', [id2])
    f3 = ParseTreeNode('F', [id3])

    # T -> F (left), T -> T * F (right)
    t1 = ParseTreeNode('T', [f1])
    t2 = ParseTreeNode('T', [
        ParseTreeNode('T', [f2]),
        ParseTreeNode('*', is_terminal=True),
        f3
    ])

    # E -> E + T
    plus = ParseTreeNode('+', is_terminal=True)
    e = ParseTreeNode('E', [
        ParseTreeNode('E', [t1]),
        plus,
        t2
    ])

    return e

tree = make_expr_tree()
print("=== Parse Tree for 'id + id * id' ===\n")
tree.pretty_print(prefix="", is_last=True)
print(f"\nYield: {' '.join(tree.leaves())}")

5. Ambiguity

Definition

A grammar $G$ is ambiguous if there exists a string $w \in L(G)$ that has two or more distinct parse trees (equivalently, two or more distinct leftmost derivations).

Example of an Ambiguous Grammar

Consider this simpler expression grammar:

$$E \rightarrow E + E \mid E * E \mid (E) \mid \texttt{id}$$

The string id + id * id has two parse trees:

Tree 1 (multiplication binds tighter):

       E
      /|\
     E + E
     |  /|\
    id E * E
       |   |
      id  id

Interpretation: id + (id * id)

Tree 2 (addition binds tighter):

       E
      /|\
     E * E
    /|\   |
   E + E  id
   |   |
  id  id

Interpretation: (id + id) * id

These two parse trees give different meanings to the same expression!

Why Ambiguity Matters

  • Different parse trees $\rightarrow$ different semantics
  • A compiler must produce a unique interpretation
  • Ambiguity in a grammar makes parsing nondeterministic

Inherent Ambiguity

A context-free language is inherently ambiguous if every grammar for it is ambiguous. (This is different from a grammar being ambiguous -- you might be able to find an unambiguous grammar for the same language.)

A classic example:

$$L = \{a^i b^j c^k \mid i = j \text{ or } j = k\}$$

This language is inherently ambiguous. The string $a^n b^n c^n$ can be parsed by matching $i = j$ or $j = k$, and no unambiguous grammar exists for this language.

Detecting Ambiguity

Undecidable: There is no algorithm that, given an arbitrary CFG, determines whether it is ambiguous. (This can be reduced to Post's Correspondence Problem.)

However, specific classes of grammars (LL(1), LR(1)) are guaranteed to be unambiguous.

def check_ambiguity_brute_force(grammar: Grammar, max_length: int = 6) -> list:
    """
    Brute-force check: enumerate all leftmost derivations up to a given
    string length and check for strings with multiple derivations.

    WARNING: Exponential complexity. Only for very small grammars.
    Returns list of (string, count) for ambiguous strings.
    """
    from collections import defaultdict

    derivation_count = defaultdict(int)

    def derive(sentential_form, depth):
        """Enumerate all leftmost derivations."""
        # Check if all terminals
        if all(grammar.is_terminal(s) for s in sentential_form):
            if len(sentential_form) <= max_length:
                key = ' '.join(sentential_form)
                derivation_count[key] += 1
            return

        # Prune: too many symbols
        terminal_count = sum(1 for s in sentential_form if grammar.is_terminal(s))
        if terminal_count > max_length:
            return

        if depth > max_length * 3:
            return

        # Find leftmost variable
        for i, sym in enumerate(sentential_form):
            if grammar.is_variable(sym):
                for prod in grammar.prod_index.get(sym, []):
                    new_form = sentential_form[:i] + list(prod.body) + sentential_form[i+1:]
                    derive(new_form, depth + 1)
                break

    derive([grammar.start], 0)

    ambiguous = [(s, c) for s, c in derivation_count.items() if c > 1]
    return ambiguous


# Test with the ambiguous grammar
ambig_grammar = Grammar.from_string("""
E -> E + E | E * E | ( E ) | id
""")

print("=== Ambiguity Check ===\n")
results = check_ambiguity_brute_force(ambig_grammar, max_length=5)
if results:
    print("Ambiguous strings found:")
    for s, count in sorted(results, key=lambda x: len(x[0])):
        print(f"  '{s}' has {count} distinct leftmost derivations")
else:
    print("No ambiguity found (up to given length)")

6. Resolving Ambiguity

Strategy 1: Operator Precedence

Introduce a hierarchy of nonterminals, one per precedence level:

Level 0 (lowest):   E -> E + T | T        (addition)
Level 1:            T -> T * F | F        (multiplication)
Level 2 (highest):  F -> ( E ) | id       (atoms)

Higher-precedence operators are "deeper" in the grammar, so they bind tighter.

For a language with four levels of precedence:

expr   -> expr || or_t | or_t            // Level 0: logical OR
or_t   -> or_t && and_t | and_t          // Level 1: logical AND
and_t  -> and_t == rel_t | rel_t         // Level 2: equality
rel_t  -> rel_t < factor | factor        // Level 3: relational
factor -> ( expr ) | id | num            // Level 4: atoms

Strategy 2: Associativity

  • Left-associative operators use left recursion: $E \rightarrow E + T$
  • a + b + c parses as (a + b) + c

  • Right-associative operators use right recursion: $E \rightarrow T = E$

  • a = b = c parses as a = (b = c)

  • Non-associative operators: no recursion on the same level

  • a < b < c is a syntax error
# Left-associative: a + b + c = (a + b) + c
E -> E + T | T

# Right-associative: a = b = c = a = (b = c)
E -> T = E | T

# Non-associative: a < b is ok, a < b < c is error
E -> T < T | T

Strategy 3: Disambiguation Rules (External)

Some parser generators allow you to specify precedence and associativity externally rather than encoding them in the grammar:

/* Yacc/Bison precedence declarations */
%left '+' '-'         /* lowest precedence, left-associative */
%left '*' '/'         /* higher precedence, left-associative */
%right UMINUS         /* highest precedence, right-associative (unary minus) */

%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '-' expr %prec UMINUS
     | '(' expr ')'
     | NUMBER
     ;

This approach keeps the grammar simple (and ambiguous) but resolves ambiguity with explicit rules.

Strategy 4: The Dangling Else Problem

The dangling else is a classic ambiguity:

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other

For if E1 then if E2 then S1 else S2, two parse trees exist:

Parse tree 1: if E1 then (if E2 then S1 else S2)     -- else matches inner if
Parse tree 2: if E1 then (if E2 then S1) else S2      -- else matches outer if

Resolution: The convention is that else matches the nearest unmatched if. This can be encoded in the grammar:

stmt         -> matched | unmatched
matched      -> if expr then matched else matched | other
unmatched    -> if expr then stmt
              | if expr then matched else unmatched

Or resolved externally: most parser generators associate else with the nearest if by default.

Python Example: Resolving Ambiguity

# Ambiguous grammar for expressions
ambiguous = Grammar.from_string("""
E -> E + E | E * E | ( E ) | id
""")

# Unambiguous grammar (precedence + left-associativity)
unambiguous = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")

print("=== Ambiguous Grammar ===")
print(ambiguous)
print()

print("=== Unambiguous Grammar (same language) ===")
print(unambiguous)
print()

# Verify they generate the same strings (for small strings)
def generate_strings(grammar, max_depth=8):
    """Generate all strings derivable up to a certain derivation depth."""
    strings = set()

    def derive(form, depth):
        if depth > max_depth:
            return
        if all(grammar.is_terminal(s) for s in form):
            strings.add(tuple(form))
            return
        for i, sym in enumerate(form):
            if grammar.is_variable(sym):
                for prod in grammar.prod_index.get(sym, []):
                    new_form = form[:i] + list(prod.body) + form[i+1:]
                    derive(new_form, depth + 1)
                break

    derive([grammar.start], 0)
    return {' '.join(s) for s in strings}

strings_ambig = generate_strings(ambiguous, 6)
strings_unambig = generate_strings(unambiguous, 8)

print(f"Ambiguous grammar generates {len(strings_ambig)} strings (depth 6)")
print(f"Unambiguous grammar generates {len(strings_unambig)} strings (depth 8)")
print(f"Strings in ambiguous but not unambiguous: {strings_ambig - strings_unambig}")
print(f"Strings in unambiguous but not ambiguous: {strings_unambig - strings_ambig}")

7. Chomsky Normal Form (CNF)

A CFG is in Chomsky Normal Form if every production has one of these forms:

  1. $A \rightarrow BC$ where $B, C \in V$ (two nonterminals)
  2. $A \rightarrow a$ where $a \in \Sigma$ (one terminal)
  3. $S \rightarrow \epsilon$ (only if $\epsilon \in L(G)$, and $S$ does not appear on any right-hand side)

Why CNF Matters

  • CYK parsing algorithm requires CNF (see Section 9)
  • Every parse tree in CNF is a binary tree (useful for analysis)
  • Every CFG can be converted to an equivalent grammar in CNF
  • A derivation of a string of length $n$ in CNF has exactly $2n - 1$ steps

Conversion to CNF

Input: CFG $G = (V, \Sigma, P, S)$ Output: Equivalent CFG $G'$ in CNF

Step 1: Remove $\epsilon$-productions (except possibly $S \rightarrow \epsilon$)

Find all nullable variables: $A$ is nullable if $A \xRightarrow{*} \epsilon$.

For each production $B \rightarrow \alpha$, add new productions with every possible subset of nullable symbols removed. Then delete all $\epsilon$-productions (except $S \rightarrow \epsilon$ if needed).

Step 2: Remove unit productions ($A \rightarrow B$)

Find all unit pairs: $(A, B)$ such that $A \xRightarrow{*} B$ using only unit productions. Replace each unit chain with direct productions.

Step 3: Replace long productions

Replace $A \rightarrow B_1 B_2 \cdots B_k$ (where $k > 2$) with:

$$A \rightarrow B_1 C_1$$ $$C_1 \rightarrow B_2 C_2$$ $$\vdots$$ $$C_{k-2} \rightarrow B_{k-1} B_k$$

Step 4: Replace terminal-nonterminal mixtures

In any production $A \rightarrow \alpha$ with $|\alpha| \geq 2$, replace each terminal $a$ with a new variable $T_a$ and add $T_a \rightarrow a$.

Python Implementation

def to_cnf(grammar: Grammar) -> Grammar:
    """
    Convert a grammar to Chomsky Normal Form.
    """
    variables = set(grammar.variables)
    terminals = set(grammar.terminals)
    productions = list(grammar.productions)
    start = grammar.start

    aux_counter = [0]
    def new_var(prefix="X"):
        aux_counter[0] += 1
        name = f"{prefix}{aux_counter[0]}"
        variables.add(name)
        return name

    # ============================================================
    # Step 0: New start symbol (ensure S doesn't appear on RHS)
    # ============================================================
    rhs_symbols = set()
    for p in productions:
        rhs_symbols.update(p.body)
    if start in rhs_symbols:
        new_start = new_var("S")
        productions.append(Production(new_start, (start,)))
        start = new_start

    # ============================================================
    # Step 1: Remove ε-productions
    # ============================================================

    # Find nullable variables
    nullable = set()
    changed = True
    while changed:
        changed = False
        for p in productions:
            if not p.body or all(s in nullable for s in p.body):
                if p.head not in nullable:
                    nullable.add(p.head)
                    changed = True

    # Generate new productions with nullable symbols optionally removed
    new_prods = set()
    for p in productions:
        if not p.body:
            # ε-production: only keep for start symbol
            if p.head == start:
                new_prods.add(p)
            continue

        # Find all positions with nullable symbols
        nullable_positions = [i for i, s in enumerate(p.body) if s in nullable]

        # Generate all subsets of nullable positions
        from itertools import combinations
        for r in range(len(nullable_positions) + 1):
            for positions_to_remove in combinations(nullable_positions, r):
                new_body = tuple(
                    s for i, s in enumerate(p.body)
                    if i not in positions_to_remove
                )
                if new_body:  # Don't add empty unless it's start
                    new_prods.add(Production(p.head, new_body))
                elif p.head == start:
                    new_prods.add(Production(p.head, ()))

    productions = list(new_prods)

    # ============================================================
    # Step 2: Remove unit productions (A -> B)
    # ============================================================

    # Find unit pairs using transitive closure
    unit_pairs = set()
    for v in variables:
        unit_pairs.add((v, v))  # Reflexive

    changed = True
    while changed:
        changed = False
        for p in productions:
            if len(p.body) == 1 and p.body[0] in variables:
                for (a, b) in list(unit_pairs):
                    if b == p.head:
                        new_pair = (a, p.body[0])
                        if new_pair not in unit_pairs:
                            unit_pairs.add(new_pair)
                            changed = True

    # Replace unit productions
    new_prods = set()
    for p in productions:
        if len(p.body) == 1 and p.body[0] in variables:
            continue  # Skip unit productions

        # For each unit pair (A, B) where B is the head of this production
        for (a, b) in unit_pairs:
            if b == p.head:
                new_prods.add(Production(a, p.body))

    productions = list(new_prods)

    # ============================================================
    # Step 3: Replace terminals in mixed productions
    # ============================================================
    term_vars = {}  # terminal -> variable name
    additional = []

    final_prods = []
    for p in productions:
        if len(p.body) <= 1:
            final_prods.append(p)
            continue

        # Replace terminals with new variables
        new_body = []
        for sym in p.body:
            if sym in terminals:
                if sym not in term_vars:
                    tv = new_var("T")
                    term_vars[sym] = tv
                    additional.append(Production(tv, (sym,)))
                new_body.append(term_vars[sym])
            else:
                new_body.append(sym)
        final_prods.append(Production(p.head, tuple(new_body)))

    productions = final_prods + additional

    # ============================================================
    # Step 4: Break long productions into binary
    # ============================================================
    final_prods = []
    for p in productions:
        if len(p.body) <= 2:
            final_prods.append(p)
            continue

        # A -> B1 B2 B3 ... Bk  =>  A -> B1 C1, C1 -> B2 C2, ..., Ck-2 -> Bk-1 Bk
        symbols = list(p.body)
        current_head = p.head
        while len(symbols) > 2:
            new_head = new_var("C")
            final_prods.append(Production(current_head, (symbols[0], new_head)))
            symbols = symbols[1:]
            current_head = new_head
        final_prods.append(Production(current_head, tuple(symbols)))

    productions = final_prods

    return Grammar(variables, terminals, productions, start)


# Example
print("=== CNF Conversion ===\n")

expr_g = Grammar.from_string("""
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
""")

print("Original grammar:")
print(expr_g)
print()

cnf = to_cnf(expr_g)
print("CNF grammar:")
for p in sorted(cnf.productions, key=lambda p: (p.head, p.body)):
    print(f"  {p}")

# Verify: check that some strings are still derivable
print("\nVerification (derivability check):")
for w in ["id", "id + id", "id * id", "id + id * id"]:
    symbols = w.split()
    # We'll verify using CYK (next section)

8. Greibach Normal Form (GNF)

A CFG is in Greibach Normal Form if every production has the form:

$$A \rightarrow a B_1 B_2 \cdots B_k$$

where $a \in \Sigma$ and $B_1, \ldots, B_k \in V$ (with $k \geq 0$).

In other words, every production starts with exactly one terminal, followed by zero or more nonterminals.

Properties

  1. Every derivation step in GNF consumes exactly one terminal
  2. A string of length $n$ requires exactly $n$ derivation steps (no $\epsilon$-productions, except possibly $S \rightarrow \epsilon$)
  3. GNF is useful for constructing pushdown automata
  4. Conversion to GNF requires eliminating left recursion (covered in Lesson 5)

Comparison of Normal Forms

Property CNF GNF
Production form $A \rightarrow BC$ or $A \rightarrow a$ $A \rightarrow a\alpha$ ($\alpha \in V^*$)
Parse tree shape Binary tree High branching factor
Derivation length for $|w| = n$ $2n - 1$ $n$
Primary use CYK algorithm PDA construction
Left recursion Allowed Not allowed

9. The CYK Algorithm

The Cocke-Younger-Kasami (CYK) algorithm is a dynamic programming algorithm that determines whether a string $w$ is in $L(G)$ for a grammar $G$ in CNF. It also constructs a parse tree.

Algorithm

Input: Grammar $G$ in CNF, string $w = a_1 a_2 \cdots a_n$ Output: Whether $w \in L(G)$, and a parse table

Idea: Build a triangular table $T$ where $T[i][j]$ is the set of variables that can derive the substring $a_i a_{i+1} \cdots a_j$.

$$T[i][j] = \{A \in V \mid A \xRightarrow{*} a_i a_{i+1} \cdots a_j\}$$

Algorithm:

1. // Base case: substrings of length 1
   for i = 1 to n:
       T[i][i] = {A | A -> aᵢ  P}

2. // Inductive case: substrings of length 2, 3, ..., n
   for length = 2 to n:
       for i = 1 to n - length + 1:
           j = i + length - 1
           T[i][j] = 
           for k = i to j - 1:
               for each production A -> BC:
                   if B  T[i][k] and C  T[k+1][j]:
                       T[i][j] = T[i][j]  {A}

3. // Accept if start symbol is in top-right cell
   return S  T[1][n]

Time complexity: $O(n^3 \cdot |P|)$ Space complexity: $O(n^2)$

Worked Example

Grammar in CNF:

$$S \rightarrow AB \mid BC$$ $$A \rightarrow BA \mid a$$ $$B \rightarrow CC \mid b$$ $$C \rightarrow AB \mid a$$

Parse "baaba":

String:  b   a   a   b   a
Index:   1   2   3   4   5

Table T[i][j]:

i\j |  1      2       3       4       5
----+-------------------------------------------
 1  | {B}   {S,A}  {B}     {S,A,C} {B}
 2  |        {A,C}  {S,C}   {B}    {S,A,C}
 3  |               {A,C}   {S,A}  {B}
 4  |                        {B}   {S,A,C}
 5  |                               {A,C}

Step-by-step:

Length 1 (diagonal): - $T[1][1]$: $b$ matched by $B \rightarrow b$, so $\{B\}$ - $T[2][2]$: $a$ matched by $A \rightarrow a$ and $C \rightarrow a$, so $\{A, C\}$ - Similarly for positions 3, 4, 5

Length 2: - $T[1][2]$: split at $k=1$: $T[1][1] \times T[2][2] = \{B\} \times \{A,C\}$ - $B, A$: Check $? \rightarrow BA$: $A \rightarrow BA$ and $S \rightarrow BC$... Actually $A \rightarrow BA$ gives $A$, and any production giving $BA$? Let's check: $S \rightarrow AB$ needs $(A,B)$, $S \rightarrow BC$ needs $(B,C)$: $B \in T[1][1]$ and $C \in T[2][2]$, so $S$. - $A \rightarrow BA$: $B \in T[1][1]$ and $A \in T[2][2]$, so $A$. - Result: $\{S, A\}$

(Continuing similarly for all cells...)

Since $S \in T[1][5]$, the string "baaba" is in the language.

Python Implementation

def cyk_parse(grammar: Grammar, word: str) -> Tuple[bool, list]:
    """
    CYK parsing algorithm.
    Grammar must be in Chomsky Normal Form.
    word is a space-separated string of terminals.

    Returns (accepted, table) where table[i][j] is the set of
    variables that can derive word[i..j].
    """
    symbols = word.split()
    n = len(symbols)

    if n == 0:
        # Check if S -> ε exists
        for p in grammar.prod_index.get(grammar.start, []):
            if not p.body:
                return (True, [])
        return (False, [])

    # Initialize table: T[i][j] = set of variables
    # Using 0-indexed
    T = [[set() for _ in range(n)] for _ in range(n)]

    # Also store back-pointers for parse tree construction
    # back[i][j][A] = (B, C, k) meaning A -> BC, B derives [i..k], C derives [k+1..j]
    back = [[{} for _ in range(n)] for _ in range(n)]

    # Step 1: Base case (length 1)
    for i in range(n):
        for p in grammar.productions:
            if len(p.body) == 1 and p.body[0] == symbols[i]:
                T[i][i].add(p.head)
                back[i][i][p.head] = ('terminal', symbols[i])

    # Step 2: Fill table for increasing lengths
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                # Try all productions A -> BC
                for p in grammar.productions:
                    if len(p.body) == 2:
                        B, C = p.body
                        if B in T[i][k] and C in T[k+1][j]:
                            T[i][j].add(p.head)
                            if p.head not in back[i][j]:
                                back[i][j][p.head] = ('split', B, C, k)

    accepted = grammar.start in T[0][n-1]

    # Print table
    print(f"CYK Table for '{word}':\n")
    col_width = 12
    header = "i\\j |" + "|".join(f"{j:^{col_width}}" for j in range(n))
    print(header)
    print("-" * len(header))
    for i in range(n):
        cells = []
        for j in range(n):
            if j < i:
                cells.append(" " * col_width)
            else:
                cell = "{" + ",".join(sorted(T[i][j])) + "}" if T[i][j] else "∅"
                cells.append(f"{cell:^{col_width}}")
        print(f" {i}  |" + "|".join(cells))

    print(f"\nAccepted: {accepted}")

    # Build parse tree
    def build_tree(var, i, j):
        info = back[i][j].get(var)
        if info is None:
            return ParseTreeNode(var, [ParseTreeNode("?", is_terminal=True)])
        if info[0] == 'terminal':
            return ParseTreeNode(var, [ParseTreeNode(info[1], is_terminal=True)])
        else:
            _, B, C, k = info
            left = build_tree(B, i, k)
            right = build_tree(C, k+1, j)
            return ParseTreeNode(var, [left, right])

    if accepted:
        tree = build_tree(grammar.start, 0, n-1)
        print("\nParse tree:")
        tree.pretty_print()

    return (accepted, T)


# Example: CYK parsing
cnf_grammar = Grammar.from_string("""
S -> A B | B C
A -> B A | a
B -> C C | b
C -> A B | a
""")

print("=== CYK Parsing ===\n")
print("Grammar:")
print(cnf_grammar)
print()

cyk_parse(cnf_grammar, "b a a b a")
print()
cyk_parse(cnf_grammar, "a b")
print()
cyk_parse(cnf_grammar, "b b b")

10. Pushdown Automata (Overview)

A pushdown automaton (PDA) is a finite automaton augmented with a stack. PDAs recognize exactly the context-free languages.

Formal Definition

A PDA is a 7-tuple:

$$M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F)$$

where:

  • $Q$ is a finite set of states
  • $\Sigma$ is the input alphabet
  • $\Gamma$ is the stack alphabet
  • $\delta: Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma \rightarrow \mathcal{P}(Q \times \Gamma^*)$ is the transition function
  • $q_0 \in Q$ is the start state
  • $Z_0 \in \Gamma$ is the initial stack symbol
  • $F \subseteq Q$ is the set of accepting states

How a PDA Works

At each step, based on: 1. The current state 2. The current input symbol (or $\epsilon$ for no input) 3. The top of the stack

The PDA: 1. Moves to a new state 2. Optionally consumes an input symbol 3. Replaces the top of the stack with zero or more symbols

Equivalence with CFGs

Theorem: A language is context-free if and only if it is recognized by some PDA.

CFG $\rightarrow$ PDA: For every CFG, we can construct a PDA that simulates leftmost derivations. The PDA uses the stack to track the remaining symbols to be derived.

PDA $\rightarrow$ CFG: For every PDA, we can construct a CFG. The construction creates variables $[q_i, A, q_j]$ representing "the PDA goes from state $q_i$ to state $q_j$ while popping $A$ off the stack."

Python: Simple PDA Simulation

class PDA:
    """Simple Pushdown Automaton simulation."""

    def __init__(self, transitions, start_state, start_stack, accept_states):
        """
        transitions: dict mapping (state, input_or_eps, stack_top) ->
                     list of (new_state, stack_push)
        stack_push is a tuple: () means pop, ('A',) means replace with A,
                                ('A', 'B') means replace top with A then push B
        """
        self.transitions = transitions
        self.start_state = start_state
        self.start_stack = start_stack
        self.accept_states = accept_states

    def accepts(self, word: str) -> bool:
        """Check if the PDA accepts the word (by final state)."""
        # Configuration: (state, remaining_input_pos, stack)
        # Use BFS to explore all nondeterministic choices
        from collections import deque

        initial = (self.start_state, 0, (self.start_stack,))
        queue = deque([initial])
        visited = set()

        symbols = list(word) if word else []

        while queue:
            state, pos, stack = queue.popleft()

            config = (state, pos, stack)
            if config in visited:
                continue
            visited.add(config)

            # Check acceptance
            if pos == len(symbols) and state in self.accept_states:
                return True

            if not stack:
                continue

            top = stack[-1]
            rest = stack[:-1]

            # Try epsilon transitions
            for new_state, push in self.transitions.get(
                (state, 'ε', top), []
            ):
                new_stack = rest + push
                if len(new_stack) < 100:  # Bound stack size
                    queue.append((new_state, pos, new_stack))

            # Try input transitions
            if pos < len(symbols):
                ch = symbols[pos]
                for new_state, push in self.transitions.get(
                    (state, ch, top), []
                ):
                    new_stack = rest + push
                    if len(new_stack) < 100:
                        queue.append((new_state, pos + 1, new_stack))

        return False


# PDA for {a^n b^n | n >= 0}
# States: q0 (push a's), q1 (pop a's matching b's), q_accept
pda_anbn = PDA(
    transitions={
        # In q0, push a's onto stack
        ('q0', 'a', 'Z'): [('q0', ('Z', 'A'))],    # Push A, keep Z
        ('q0', 'a', 'A'): [('q0', ('A', 'A'))],    # Push A
        # Switch to q1 when seeing b
        ('q0', 'b', 'A'): [('q1', ())],             # Pop A
        # In q0, accept empty string
        ('q0', 'ε', 'Z'): [('q_accept', ('Z',))],
        # In q1, pop a's matching b's
        ('q1', 'b', 'A'): [('q1', ())],             # Pop A
        # Accept when stack has only Z
        ('q1', 'ε', 'Z'): [('q_accept', ('Z',))],
    },
    start_state='q0',
    start_stack='Z',
    accept_states={'q_accept'}
)

print("=== PDA for {a^n b^n | n >= 0} ===\n")
for w in ["", "ab", "aabb", "aaabbb", "aab", "abb", "ba", "abab"]:
    result = "accept" if pda_anbn.accepts(w) else "reject"
    print(f"  '{w}': {result}")

11. The Pumping Lemma for Context-Free Languages

Statement

If $L$ is a context-free language, then there exists a constant $p \geq 1$ such that every string $w \in L$ with $|w| \geq p$ can be split into five parts:

$$w = uvxyz$$

satisfying:

  1. $|vy| > 0$ (at least one of $v$ and $y$ is nonempty)
  2. $|vxy| \leq p$
  3. $\forall i \geq 0: uv^ixy^iz \in L$ (pumping $v$ and $y$ together)

Proof Idea

Any sufficiently long string must have a derivation tree of height greater than $|V|$ (the number of variables). By the pigeonhole principle, some variable $A$ must repeat on a path from root to leaf:

        S
       / \
      /   \
     /     \
    A       <-- first occurrence of A
   / \
  v   A    <-- second occurrence of A
     / \
    x   y

The subtree rooted at the first $A$ generates $vxy$, and the subtree rooted at the second $A$ generates $x$. We can "pump" by repeating or removing the portion between the two $A$'s.

Using the CFL Pumping Lemma

To prove a language is NOT context-free:

  1. Assume $L$ is context-free
  2. Let $p$ be the pumping length
  3. Choose a specific $w \in L$ with $|w| \geq p$
  4. Show that for all decompositions $w = uvxyz$ with $|vy| > 0$ and $|vxy| \leq p$, there exists some $i$ where $uv^ixy^iz \notin L$
  5. Contradiction

Example: $L = \{a^n b^n c^n \mid n \geq 0\}$ Is Not Context-Free

Proof:

  1. Assume $L$ is context-free with pumping length $p$.
  2. Choose $w = a^p b^p c^p \in L$, with $|w| = 3p \geq p$.
  3. Let $w = uvxyz$ with $|vy| > 0$ and $|vxy| \leq p$.
  4. Since $|vxy| \leq p$, the substring $vxy$ can span at most two of the three groups ($a$'s, $b$'s, $c$'s).
  5. Case analysis:
  6. $vxy$ is within $a^p$: pumping changes $a$ count but not $b$ or $c$.
  7. $vxy$ is within $a^p b^p$: pumping changes $a$ and/or $b$ but not $c$.
  8. $vxy$ is within $b^p$: pumping changes $b$ but not $a$ or $c$.
  9. $vxy$ is within $b^p c^p$: pumping changes $b$ and/or $c$ but not $a$.
  10. $vxy$ is within $c^p$: pumping changes $c$ but not $a$ or $b$.
  11. In all cases, pumping ($i = 2$) produces a string where the three counts are unequal, so $uv^2xy^2z \notin L$.
  12. Contradiction. $\blacksquare$

Languages Beyond Context-Free

Language Type Why Not CF
$\{a^n b^n c^n \mid n \geq 0\}$ Context-sensitive Needs three-way matching
$\{ww \mid w \in \{a,b\}^*\}$ Context-sensitive Copy language
$\{a^{2^n} \mid n \geq 0\}$ Context-sensitive Exponential growth
Any decidable language Type 0 or below Turing machine required

Note: $\{a^n b^n \mid n \geq 0\}$ IS context-free. The limitation is three or more matched groups.


12. The Chomsky Hierarchy

The Chomsky hierarchy classifies formal languages into four levels:

Type Name Grammar restriction Automaton Example
3 Regular $A \rightarrow aB$ or $A \rightarrow a$ Finite automaton (DFA/NFA) $a^*b^*$
2 Context-free $A \rightarrow \alpha$ ($\alpha \in (V \cup \Sigma)^*$) Pushdown automaton $a^nb^n$
1 Context-sensitive $\alpha A \beta \rightarrow \alpha \gamma \beta$ ($|\gamma| \geq 1$) Linear-bounded automaton $a^nb^nc^n$
0 Recursively enumerable No restriction Turing machine Halting problem complement
Hierarchy (proper inclusions):

Regular  ⊂  Context-Free  ⊂  Context-Sensitive  ⊂  Recursively Enumerable
  (Type 3)    (Type 2)         (Type 1)              (Type 0)

Relevance to Compiler Design

Phase Language Type Formalism
Lexical analysis Regular (Type 3) DFA, regular expressions
Syntax analysis Context-free (Type 2) CFG, pushdown automata
Semantic analysis Context-sensitive (Type 1) Attribute grammars, type systems
Full language semantics Beyond CF Turing machines (undecidable in general)

13. Grammar Transformations

Eliminating Left Recursion

A grammar has left recursion if there exists a derivation $A \xRightarrow{+} A\alpha$ for some $\alpha$. Left recursion causes problems for top-down parsers (infinite loop).

Direct left recursion ($A \rightarrow A\alpha \mid \beta$):

Replace with: $$A \rightarrow \beta A'$$ $$A' \rightarrow \alpha A' \mid \epsilon$$

Example: $$E \rightarrow E + T \mid T$$

becomes: $$E \rightarrow T E'$$ $$E' \rightarrow + T E' \mid \epsilon$$

Left Factoring

When two productions for the same nonterminal share a common prefix:

$$A \rightarrow \alpha \beta_1 \mid \alpha \beta_2$$

Replace with: $$A \rightarrow \alpha A'$$ $$A' \rightarrow \beta_1 \mid \beta_2$$

Python Implementation

def eliminate_left_recursion(grammar: Grammar) -> Grammar:
    """
    Eliminate immediate left recursion from all productions.
    (Does not handle indirect left recursion.)
    """
    new_productions = []
    new_variables = set(grammar.variables)

    for var in sorted(grammar.variables):
        prods = grammar.prod_index.get(var, [])

        # Separate left-recursive and non-left-recursive productions
        left_recursive = []  # A -> A α
        non_recursive = []   # A -> β

        for p in prods:
            if p.body and p.body[0] == var:
                left_recursive.append(p.body[1:])  # α (without leading A)
            else:
                non_recursive.append(p.body)

        if not left_recursive:
            # No left recursion for this variable
            new_productions.extend(prods)
            continue

        # Eliminate left recursion
        # A -> β1 | β2 | ... | A α1 | A α2 | ...
        # becomes:
        # A  -> β1 A' | β2 A' | ...
        # A' -> α1 A' | α2 A' | ... | ε

        new_var = var + "'"
        while new_var in new_variables:
            new_var += "'"
        new_variables.add(new_var)

        # A -> βi A'
        for beta in non_recursive:
            new_body = beta + (new_var,) if beta else (new_var,)
            new_productions.append(Production(var, new_body))

        # A' -> αi A'
        for alpha in left_recursive:
            new_body = alpha + (new_var,)
            new_productions.append(Production(new_var, new_body))

        # A' -> ε
        new_productions.append(Production(new_var, ()))

    return Grammar(
        new_variables, grammar.terminals,
        new_productions, grammar.start
    )


def left_factor(grammar: Grammar) -> Grammar:
    """
    Apply left factoring to a grammar.
    """
    new_productions = list(grammar.productions)
    new_variables = set(grammar.variables)
    aux_counter = [0]

    changed = True
    while changed:
        changed = False

        # Index by head
        prod_index = {}
        for p in new_productions:
            prod_index.setdefault(p.head, []).append(p)

        next_prods = []
        processed_vars = set()

        for var in sorted(prod_index.keys()):
            if var in processed_vars:
                continue
            prods = prod_index[var]

            # Find longest common prefix among any two productions
            best_prefix = None
            best_group = None

            for i in range(len(prods)):
                for j in range(i + 1, len(prods)):
                    # Find common prefix
                    prefix = []
                    for k in range(min(len(prods[i].body), len(prods[j].body))):
                        if prods[i].body[k] == prods[j].body[k]:
                            prefix.append(prods[i].body[k])
                        else:
                            break

                    if prefix and (best_prefix is None or len(prefix) > len(best_prefix)):
                        best_prefix = tuple(prefix)
                        # Find all productions sharing this prefix
                        group = [p for p in prods if p.body[:len(prefix)] == best_prefix]
                        best_group = group

            if best_prefix and best_group and len(best_group) > 1:
                changed = True
                processed_vars.add(var)

                # Create new variable
                aux_counter[0] += 1
                new_var = f"{var}_LF{aux_counter[0]}"
                new_variables.add(new_var)

                # A -> prefix A_LF
                next_prods.append(Production(var, best_prefix + (new_var,)))

                # A_LF -> remaining1 | remaining2 | ...
                for p in best_group:
                    remaining = p.body[len(best_prefix):]
                    next_prods.append(Production(new_var, remaining if remaining else ()))

                # Keep non-matching productions
                for p in prods:
                    if p not in best_group:
                        next_prods.append(p)
            else:
                next_prods.extend(prods)

        new_productions = next_prods

    return Grammar(
        new_variables, grammar.terminals,
        new_productions, grammar.start
    )


# Example: Eliminate left recursion
print("=== Eliminating Left Recursion ===\n")
print("Original:")
print(expr_grammar)
print()

no_lr = eliminate_left_recursion(expr_grammar)
print("After eliminating left recursion:")
print(no_lr)
print()

# Example: Left factoring
lf_grammar = Grammar.from_string("""
S -> if E then S else S | if E then S | other
E -> id
""")

print("=== Left Factoring ===\n")
print("Original:")
print(lf_grammar)
print()

factored = left_factor(lf_grammar)
print("After left factoring:")
print(factored)

Summary

  • A context-free grammar $G = (V, \Sigma, P, S)$ generates a language through derivations
  • BNF and EBNF are standard notations for writing grammars
  • Leftmost and rightmost derivations correspond to different parsing strategies
  • Parse trees are the structural representation of derivations
  • A grammar is ambiguous if some string has multiple parse trees; ambiguity is undecidable in general
  • Ambiguity is resolved through precedence, associativity, and grammar restructuring
  • Chomsky Normal Form (binary productions) enables the CYK algorithm for parsing in $O(n^3)$
  • Greibach Normal Form (terminal-first productions) is useful for PDA construction
  • Pushdown automata recognize exactly the context-free languages
  • The CFL pumping lemma proves that languages like $\{a^n b^n c^n\}$ are not context-free
  • The Chomsky hierarchy classifies languages into four types: regular, context-free, context-sensitive, and recursively enumerable
  • Left recursion elimination and left factoring prepare grammars for top-down parsing

Exercises

Exercise 1: Grammar Writing

Write a context-free grammar for each of the following languages:

  1. Palindromes over $\{a, b\}$
  2. Balanced parentheses (possibly nested)
  3. $\{a^i b^j c^k \mid i = j \text{ or } j = k\}$
  4. A simplified subset of Python if/elif/else statements
  5. Arithmetic expressions with +, -, *, /, unary -, and parentheses, respecting standard precedence and left-associativity

Exercise 2: Derivations and Parse Trees

Using the grammar $E \rightarrow E + T \mid T$, $T \rightarrow T * F \mid F$, $F \rightarrow (E) \mid \texttt{id} \mid \texttt{num}$:

  1. Find the leftmost derivation for id * (num + id)
  2. Find the rightmost derivation for the same string
  3. Draw the parse tree
  4. Verify that both derivations correspond to the same parse tree

Exercise 3: Ambiguity Analysis

Consider the grammar:

$$S \rightarrow aSb \mid aSbb \mid \epsilon$$

  1. What language does this grammar generate?
  2. Is the grammar ambiguous? If so, find a string with two parse trees.
  3. Can you find an unambiguous grammar for the same language?

Exercise 4: CNF Conversion

Convert the following grammar to Chomsky Normal Form:

$$S \rightarrow ASA \mid aB$$ $$A \rightarrow B \mid S$$ $$B \rightarrow b \mid \epsilon$$

Show each step of the conversion process.

Exercise 5: CYK Algorithm

Using the CNF grammar from Exercise 4 (or the one provided in Section 9), run the CYK algorithm on the strings: 1. "aabb" 2. "abab" 3. "baba"

Show the complete table for each string.

Exercise 6: Pumping Lemma

Use the pumping lemma for CFLs to prove that the following languages are NOT context-free:

  1. $L_1 = \{a^n b^n c^n d^n \mid n \geq 0\}$
  2. $L_2 = \{a^i b^j c^k \mid 0 \leq i \leq j \leq k\}$
  3. $L_3 = \{ww \mid w \in \{a, b\}^*\}$ (the copy language)

Previous: Finite Automata | Next: Top-Down Parsing | Overview

to navigate between lessons