Lesson 2: Lexical Analysis
Lesson 2: Lexical Analysis¶
Learning Objectives¶
After completing this lesson, you will be able to:
- Explain the role of the lexer (scanner) in the compilation pipeline
- Distinguish between tokens, lexemes, and patterns
- Define regular expressions formally and use them to specify token patterns
- Apply Thompson's construction to convert a regular expression to an NFA
- Apply the subset construction to convert an NFA to a DFA
- Apply Hopcroft's algorithm to minimize a DFA
- Implement the longest-match rule and handle token priorities
- Build a complete lexer for a simple programming language in Python
1. The Role of the Lexer¶
The lexer (also called scanner or tokenizer) is the first phase of a compiler. It reads the source program as a stream of characters and groups them into meaningful sequences called tokens.
Source characters: "if (x >= 42) return y + 1;"
|
[Lexer / Scanner]
|
v
Token stream: KW_IF LPAREN ID(x) GEQ INT(42) RPAREN
KW_RETURN ID(y) PLUS INT(1) SEMI
Why Separate Lexical Analysis?¶
-
Simplicity: The parser operates on tokens (a finite, structured alphabet) rather than raw characters. This drastically simplifies the grammar.
-
Efficiency: Lexical patterns (identifiers, numbers, strings) are regular and can be recognized by finite automata -- much faster than context-free parsing.
-
Portability: Character-set issues (ASCII, UTF-8, line endings) are confined to the lexer.
-
Modularity: The lexer and parser can be developed independently. Different source encodings require changes only in the lexer.
What the Lexer Does¶
- Groups characters into tokens
- Strips whitespace and comments
- Handles line counting (for error messages)
- Recognizes keywords vs. identifiers
- Handles string and character literal escaping
- Reports lexical errors (illegal characters, unterminated strings)
What the Lexer Does NOT Do¶
- Check syntax (that is the parser's job)
- Check types (that is the semantic analyzer's job)
- Handle operator precedence (that is the parser's job)
2. Tokens, Lexemes, and Patterns¶
Three related but distinct concepts:
Token¶
A token is an abstract symbol representing a class of lexical units. It is the output of the lexer and the input to the parser.
A token typically has:
- A token type (or token name): ID, INT, PLUS, KW_IF, etc.
- An optional attribute value: the actual text, numeric value, or symbol table pointer
Lexeme¶
A lexeme is the actual substring of the source program that matches a token pattern.
Pattern¶
A pattern is a rule (usually a regular expression) that describes the set of lexemes belonging to a token type.
Example¶
| Token Type | Pattern (informal) | Example Lexemes |
|---|---|---|
ID |
Letter followed by letters/digits | x, count, myVar |
INT |
One or more digits | 0, 42, 1000 |
FLOAT |
Digits, dot, digits | 3.14, 0.001 |
STRING |
" ... " |
"hello", "" |
KW_IF |
if |
if |
KW_WHILE |
while |
while |
PLUS |
+ |
+ |
GEQ |
>= |
>= |
ASSIGN |
= |
= |
LPAREN |
( |
( |
SEMI |
; |
; |
Token Data Structure¶
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Optional
class TokenType(Enum):
# Literals
INT_LIT = auto()
FLOAT_LIT = auto()
STRING_LIT = auto()
CHAR_LIT = auto()
# Identifiers
IDENTIFIER = auto()
# Keywords
KW_IF = auto()
KW_ELSE = auto()
KW_WHILE = auto()
KW_FOR = auto()
KW_RETURN = auto()
KW_INT = auto()
KW_FLOAT = auto()
KW_VOID = auto()
# Operators
PLUS = auto() # +
MINUS = auto() # -
STAR = auto() # *
SLASH = auto() # /
ASSIGN = auto() # =
EQ = auto() # ==
NEQ = auto() # !=
LT = auto() # <
GT = auto() # >
LEQ = auto() # <=
GEQ = auto() # >=
AND = auto() # &&
OR = auto() # ||
NOT = auto() # !
# Delimiters
LPAREN = auto() # (
RPAREN = auto() # )
LBRACE = auto() # {
RBRACE = auto() # }
LBRACKET = auto() # [
RBRACKET = auto() # ]
COMMA = auto() # ,
SEMI = auto() # ;
# Special
EOF = auto()
ERROR = auto()
@dataclass
class Token:
type: TokenType
lexeme: str
value: Any = None # Computed value (e.g., int 42, not string "42")
line: int = 0
column: int = 0
def __repr__(self):
if self.value is not None:
return f"Token({self.type.name}, {self.lexeme!r}, value={self.value})"
return f"Token({self.type.name}, {self.lexeme!r})"
3. Regular Expressions -- Formal Definition¶
Token patterns are specified using regular expressions (regex). Here we define them formally, as used in compiler theory -- not the extended regex syntax of tools like Python's re module.
Alphabet and Strings¶
- An alphabet $\Sigma$ is a finite, nonempty set of symbols (characters).
- A string over $\Sigma$ is a finite sequence of symbols from $\Sigma$.
- The empty string is denoted $\epsilon$ (epsilon).
- The length of string $w$ is $|w|$. We have $|\epsilon| = 0$.
- $\Sigma^*$ denotes the set of all strings over $\Sigma$ (including $\epsilon$).
- $\Sigma^+ = \Sigma^* \setminus \{\epsilon\}$ is the set of all nonempty strings.
Language¶
A language $L$ over $\Sigma$ is any subset of $\Sigma^*$ -- that is, any set of strings.
Regular Expression Definition¶
A regular expression $r$ over alphabet $\Sigma$ is defined inductively:
Base cases:
- $\epsilon$ is a regular expression denoting the language $\{\epsilon\}$
- For each $a \in \Sigma$, the symbol $a$ is a regular expression denoting $\{a\}$
Inductive cases (if $r$ and $s$ are regular expressions denoting $L(r)$ and $L(s)$):
- Union (alternation): $r \mid s$ denotes $L(r) \cup L(s)$
- Concatenation: $r \cdot s$ (or simply $rs$) denotes $L(r) \cdot L(s) = \{xy \mid x \in L(r), y \in L(s)\}$
- Kleene star (closure): $r^*$ denotes $L(r)^* = \bigcup_{i=0}^{\infty} L(r)^i$
where $L^0 = \{\epsilon\}$ and $L^{i+1} = L^i \cdot L$.
Operator precedence (highest to lowest): 1. Kleene star $*$ (postfix, binds tightest) 2. Concatenation (implicit, juxtaposition) 3. Union $\mid$ (binds loosest)
All operators are left-associative.
Examples¶
| Regex | Language | Description |
|---|---|---|
| $a$ | $\{a\}$ | Just the string "a" |
| $a \mid b$ | $\{a, b\}$ | Either "a" or "b" |
| $ab$ | $\{ab\}$ | The string "ab" |
| $a^*$ | $\{\epsilon, a, aa, aaa, \ldots\}$ | Zero or more a's |
| $(a \mid b)^*$ | All strings over $\{a, b\}$ | Any combination of a's and b's |
| $a(a \mid b)^*$ | Strings over $\{a,b\}$ starting with $a$ | Starts with a |
| $(0 \mid 1)^* 0$ | Binary strings ending in 0 | Even binary numbers |
Convenient Shorthands¶
These are not part of the formal definition but are commonly used:
| Shorthand | Meaning | Formal equivalent |
|---|---|---|
| $r^+$ | One or more | $rr^*$ |
| $r?$ | Zero or one | $r \mid \epsilon$ |
| $[a\text{-}z]$ | Character class | $a \mid b \mid \cdots \mid z$ |
| $[0\text{-}9]$ | Digit | $0 \mid 1 \mid \cdots \mid 9$ |
| $.$ | Any character | $\Sigma$ (union of all) |
| $r\{n\}$ | Exactly $n$ copies | $\underbrace{rr\cdots r}_{n}$ |
Token Patterns as Regular Expressions¶
Identifier: [a-zA-Z_][a-zA-Z0-9_]*
Integer: [0-9]+
Float: [0-9]+\.[0-9]+([eE][+-]?[0-9]+)?
String: "[^"\\]*(\\.[^"\\]*)*"
Comment: //[^\n]*
Whitespace: [ \t\n\r]+
Algebraic Laws of Regular Expressions¶
Regular expressions obey the following identities (useful for simplification):
| Law | Statement |
|---|---|
| Union is commutative | $r \mid s = s \mid r$ |
| Union is associative | $(r \mid s) \mid t = r \mid (s \mid t)$ |
| Concatenation is associative | $(rs)t = r(st)$ |
| Concatenation distributes over union | $r(s \mid t) = rs \mid rt$ |
| $\epsilon$ is identity for concatenation | $\epsilon r = r\epsilon = r$ |
| $\emptyset$ is identity for union | $r \mid \emptyset = r$ |
| $\emptyset$ is zero for concatenation | $\emptyset r = r\emptyset = \emptyset$ |
| Star idempotence | $(r^*)^* = r^*$ |
| Star of epsilon | $\epsilon^* = \epsilon$ |
4. From Regular Expressions to NFA: Thompson's Construction¶
Thompson's construction (Ken Thompson, 1968) converts a regular expression into an equivalent NFA (nondeterministic finite automaton) with $\epsilon$-transitions.
NFA Definition (Brief)¶
An NFA is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$ where: - $Q$ is a finite set of states - $\Sigma$ is the input alphabet - $\delta: Q \times (\Sigma \cup \{\epsilon\}) \rightarrow \mathcal{P}(Q)$ is the transition function - $q_0 \in Q$ is the start state - $F \subseteq Q$ is the set of accept states
Construction Rules¶
Each rule produces an NFA with exactly one start state and one accept state. No transitions enter the start state or leave the accept state.
Base case 1: Empty string $\epsilon$
-->(q0)---ε--->(q1)
start accept
Base case 2: Single symbol $a$
-->(q0)---a--->(q1)
start accept
Inductive case 1: Union $r \mid s$
Given NFAs $N(r)$ with states $(r_0, r_f)$ and $N(s)$ with states $(s_0, s_f)$:
ε ---> [N(r)] ---ε--->
/ \
-->(q0) (q_f)
\ /
ε ---> [N(s)] ---ε--->
start accept
Inductive case 2: Concatenation $rs$
Merge the accept state of $N(r)$ with the start state of $N(s)$:
-->(r0)---[N(r)]--->(r_f = s_0)---[N(s)]--->(s_f)
start accept
Inductive case 3: Kleene star $r^*$
ε ---> [N(r)] ---ε--->
/ | \
-->(q0) <---ε-----+ (q_f)
\ /
----------ε---------->
start accept
Properties of Thompson's Construction¶
- The resulting NFA has at most $2n$ states for a regex of length $n$
- Each state has at most two outgoing transitions
- The NFA has exactly one start state and one accept state
- Construction is linear in the size of the regex: $O(n)$
Python Implementation¶
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple, List, Optional
@dataclass
class NFAState:
"""A state in an NFA."""
id: int
is_accept: bool = False
transitions: Dict[str, List['NFAState']] = field(default_factory=dict)
def add_transition(self, symbol: str, target: 'NFAState'):
"""Add a transition on symbol (use 'ε' for epsilon)."""
if symbol not in self.transitions:
self.transitions[symbol] = []
self.transitions[symbol].append(target)
def __hash__(self):
return hash(self.id)
def __eq__(self, other):
return isinstance(other, NFAState) and self.id == other.id
def __repr__(self):
suffix = " (accept)" if self.is_accept else ""
return f"q{self.id}{suffix}"
class NFA:
"""Nondeterministic Finite Automaton."""
_state_counter = 0
def __init__(self, start: NFAState, accept: NFAState):
self.start = start
self.accept = accept
@classmethod
def _new_state(cls, is_accept=False) -> NFAState:
state = NFAState(id=cls._state_counter, is_accept=is_accept)
cls._state_counter += 1
return state
@classmethod
def reset_counter(cls):
cls._state_counter = 0
@classmethod
def from_epsilon(cls) -> 'NFA':
"""Thompson's construction: epsilon."""
start = cls._new_state()
accept = cls._new_state(is_accept=True)
start.add_transition('ε', accept)
return cls(start, accept)
@classmethod
def from_symbol(cls, symbol: str) -> 'NFA':
"""Thompson's construction: single symbol."""
start = cls._new_state()
accept = cls._new_state(is_accept=True)
start.add_transition(symbol, accept)
return cls(start, accept)
@classmethod
def union(cls, nfa1: 'NFA', nfa2: 'NFA') -> 'NFA':
"""Thompson's construction: r | s."""
start = cls._new_state()
accept = cls._new_state(is_accept=True)
start.add_transition('ε', nfa1.start)
start.add_transition('ε', nfa2.start)
nfa1.accept.is_accept = False
nfa1.accept.add_transition('ε', accept)
nfa2.accept.is_accept = False
nfa2.accept.add_transition('ε', accept)
return cls(start, accept)
@classmethod
def concatenation(cls, nfa1: 'NFA', nfa2: 'NFA') -> 'NFA':
"""Thompson's construction: r s."""
nfa1.accept.is_accept = False
nfa1.accept.add_transition('ε', nfa2.start)
return cls(nfa1.start, nfa2.accept)
@classmethod
def kleene_star(cls, nfa1: 'NFA') -> 'NFA':
"""Thompson's construction: r*."""
start = cls._new_state()
accept = cls._new_state(is_accept=True)
start.add_transition('ε', nfa1.start)
start.add_transition('ε', accept)
nfa1.accept.is_accept = False
nfa1.accept.add_transition('ε', nfa1.start)
nfa1.accept.add_transition('ε', accept)
return cls(start, accept)
def regex_to_nfa(regex: str) -> NFA:
"""
Convert a simple regex to NFA using Thompson's construction.
Supported operators: | (union), * (star), concatenation (implicit),
( ) (grouping)
This parser handles operator precedence correctly:
* > concatenation > |
"""
NFA.reset_counter()
pos = 0
def peek():
nonlocal pos
return regex[pos] if pos < len(regex) else None
def consume():
nonlocal pos
ch = regex[pos]
pos += 1
return ch
def parse_expr() -> NFA:
"""expr = term ('|' term)*"""
left = parse_term()
while peek() == '|':
consume() # eat '|'
right = parse_term()
left = NFA.union(left, right)
return left
def parse_term() -> NFA:
"""term = factor+"""
# A term is one or more concatenated factors
result = parse_factor()
while peek() is not None and peek() not in ('|', ')'):
right = parse_factor()
result = NFA.concatenation(result, right)
return result
def parse_factor() -> NFA:
"""factor = base ('*')?"""
base = parse_base()
while peek() == '*':
consume()
base = NFA.kleene_star(base)
return base
def parse_base() -> NFA:
"""base = '(' expr ')' | symbol"""
if peek() == '(':
consume() # eat '('
nfa = parse_expr()
if peek() != ')':
raise SyntaxError("Expected ')'")
consume() # eat ')'
return nfa
elif peek() is not None and peek() not in ('|', ')', '*'):
ch = consume()
return NFA.from_symbol(ch)
else:
# Empty: return epsilon NFA
return NFA.from_epsilon()
result = parse_expr()
if pos != len(regex):
raise SyntaxError(f"Unexpected character at position {pos}: {regex[pos]}")
return result
def epsilon_closure(states: Set[NFAState]) -> Set[NFAState]:
"""Compute the epsilon-closure of a set of NFA states."""
closure = set(states)
stack = list(states)
while stack:
state = stack.pop()
for target in state.transitions.get('ε', []):
if target not in closure:
closure.add(target)
stack.append(target)
return closure
def nfa_simulate(nfa: NFA, input_string: str) -> bool:
"""Simulate an NFA on an input string."""
current = epsilon_closure({nfa.start})
for ch in input_string:
next_states = set()
for state in current:
for target in state.transitions.get(ch, []):
next_states.add(target)
current = epsilon_closure(next_states)
return any(s.is_accept for s in current)
# Example usage
nfa = regex_to_nfa("(a|b)*abb")
print("NFA for (a|b)*abb")
print(f" 'abb': {nfa_simulate(nfa, 'abb')}") # True
print(f" 'aabb': {nfa_simulate(nfa, 'aabb')}") # True
print(f" 'babb': {nfa_simulate(nfa, 'babb')}") # True
print(f" 'ab': {nfa_simulate(nfa, 'ab')}") # False
print(f" 'abab': {nfa_simulate(nfa, 'abab')}") # False
print(f" '': {nfa_simulate(nfa, '')}") # False
5. From NFA to DFA: Subset Construction¶
While NFAs are easy to construct from regular expressions, they are nondeterministic -- a state can have multiple transitions on the same symbol. DFAs (deterministic finite automata) are more efficient to simulate because each state has exactly one transition per symbol.
The subset construction (also called the powerset construction) converts an NFA to an equivalent DFA.
Algorithm¶
Input: NFA N = (Q_N, Σ, δ_N, q_0, F_N)
Output: DFA D = (Q_D, Σ, δ_D, d_0, F_D)
1. d_0 = ε-closure({q_0}) // Start state of DFA
2. Q_D = {d_0} // Set of DFA states (each is a set of NFA states)
3. worklist = [d_0] // States to process
4. while worklist is not empty:
5. S = worklist.pop()
6. for each symbol a in Σ:
7. T = ε-closure(move(S, a)) // move(S,a) = union of δ_N(s,a) for all s in S
8. if T is not empty:
9. if T not in Q_D:
10. Q_D.add(T)
11. worklist.append(T)
12. δ_D(S, a) = T
13. F_D = {S in Q_D | S ∩ F_N ≠ ∅} // DFA state is accepting if it contains an NFA accept state
Complexity¶
In the worst case, a DFA can have $2^n$ states for an NFA with $n$ states (hence "powerset" construction). In practice, only a small fraction of these states are reachable.
Python Implementation¶
from typing import FrozenSet
@dataclass
class DFAState:
"""A state in a DFA, corresponding to a set of NFA states."""
id: int
nfa_states: FrozenSet[NFAState]
is_accept: bool = False
transitions: Dict[str, 'DFAState'] = field(default_factory=dict)
def __repr__(self):
nfa_ids = sorted(s.id for s in self.nfa_states)
suffix = " (accept)" if self.is_accept else ""
return f"D{self.id}{{{','.join(map(str, nfa_ids))}}}{suffix}"
class DFA:
"""Deterministic Finite Automaton."""
def __init__(self, start: DFAState, states: List[DFAState], alphabet: Set[str]):
self.start = start
self.states = states
self.alphabet = alphabet
@classmethod
def from_nfa(cls, nfa: NFA) -> 'DFA':
"""Convert NFA to DFA using subset construction."""
# Collect alphabet (all symbols except epsilon)
alphabet = set()
visited_nfa = set()
stack = [nfa.start]
while stack:
state = stack.pop()
if state in visited_nfa:
continue
visited_nfa.add(state)
for symbol, targets in state.transitions.items():
if symbol != 'ε':
alphabet.add(symbol)
for t in targets:
stack.append(t)
# Subset construction
dfa_id = 0
start_nfa_states = frozenset(epsilon_closure({nfa.start}))
start_is_accept = any(s.is_accept for s in start_nfa_states)
start_dfa = DFAState(dfa_id, start_nfa_states, start_is_accept)
dfa_id += 1
# Map from frozenset of NFA states -> DFA state
state_map = {start_nfa_states: start_dfa}
dfa_states = [start_dfa]
worklist = [start_dfa]
while worklist:
current = worklist.pop()
for symbol in sorted(alphabet):
# Compute move(current.nfa_states, symbol)
move_result = set()
for nfa_state in current.nfa_states:
for target in nfa_state.transitions.get(symbol, []):
move_result.add(target)
if not move_result:
continue
# Compute epsilon-closure of the move result
closure = frozenset(epsilon_closure(move_result))
if closure not in state_map:
is_accept = any(s.is_accept for s in closure)
new_state = DFAState(dfa_id, closure, is_accept)
dfa_id += 1
state_map[closure] = new_state
dfa_states.append(new_state)
worklist.append(new_state)
current.transitions[symbol] = state_map[closure]
return cls(start_dfa, dfa_states, alphabet)
def simulate(self, input_string: str) -> bool:
"""Simulate the DFA on an input string."""
current = self.start
for ch in input_string:
if ch not in current.transitions:
return False
current = current.transitions[ch]
return current.is_accept
def print_table(self):
"""Print the DFA transition table."""
symbols = sorted(self.alphabet)
header = f"{'State':>10} | {'Accept':>6} | " + " | ".join(
f"{s:>5}" for s in symbols
)
print(header)
print("-" * len(header))
for state in sorted(self.states, key=lambda s: s.id):
row = f"{'D' + str(state.id):>10} | {'yes' if state.is_accept else 'no':>6} | "
cells = []
for s in symbols:
if s in state.transitions:
cells.append(f"{'D' + str(state.transitions[s].id):>5}")
else:
cells.append(f"{'--':>5}")
row += " | ".join(cells)
print(row)
# Example: Convert NFA for (a|b)*abb to DFA
print("=== NFA to DFA: (a|b)*abb ===\n")
nfa = regex_to_nfa("(a|b)*abb")
dfa = DFA.from_nfa(nfa)
dfa.print_table()
print()
test_strings = ["abb", "aabb", "babb", "ababb", "ab", "ba", ""]
for s in test_strings:
result = dfa.simulate(s)
print(f" '{s}': {'accepted' if result else 'rejected'}")
6. DFA Minimization: Hopcroft's Algorithm¶
A DFA produced by subset construction may have more states than necessary. DFA minimization produces the smallest DFA that recognizes the same language.
Equivalent States¶
Two states $p$ and $q$ are equivalent (indistinguishable) if for every string $w$:
$$\hat{\delta}(p, w) \in F \iff \hat{\delta}(q, w) \in F$$
In other words, starting from $p$ or $q$, the DFA accepts exactly the same set of strings.
Hopcroft's Algorithm (Partition Refinement)¶
The algorithm starts with a coarse partition and refines it:
Input: DFA D = (Q, Σ, δ, q_0, F)
Output: Minimized DFA D' with equivalent states merged
1. Initial partition P = {F, Q \ F} // Accepting and non-accepting states
2. worklist W = {F} // (or the smaller of F, Q\F)
3. while W is not empty:
4. A = W.pop()
5. for each symbol c in Σ:
6. X = {q in Q | δ(q, c) in A} // States that transition to A on c
7. for each group Y in P:
8. Y1 = Y ∩ X // States in Y that go to A on c
9. Y2 = Y \ X // States in Y that don't go to A on c
10. if Y1 ≠ ∅ and Y2 ≠ ∅: // Y is split
11. Replace Y in P with Y1 and Y2
12. if Y in W:
13. Replace Y in W with Y1 and Y2
14. else:
15. Add smaller of Y1, Y2 to W
16. Return DFA with each partition group as a single state
Time complexity: $O(n \log n)$ where $n = |Q|$
Python Implementation¶
def minimize_dfa(dfa: DFA) -> DFA:
"""
Minimize a DFA using Hopcroft's algorithm (partition refinement).
Returns a new minimized DFA.
"""
states = dfa.states
alphabet = sorted(dfa.alphabet)
# Remove unreachable states
reachable = set()
stack = [dfa.start]
while stack:
s = stack.pop()
if s in reachable:
continue
reachable.add(s)
for sym in alphabet:
if sym in s.transitions:
stack.append(s.transitions[sym])
states = [s for s in states if s in reachable]
# Add dead state if needed (for completeness)
dead_state = None
for s in states:
for sym in alphabet:
if sym not in s.transitions:
if dead_state is None:
dead_state = DFAState(
id=-1,
nfa_states=frozenset(),
is_accept=False
)
for a in alphabet:
dead_state.transitions[a] = dead_state
states.append(dead_state)
s.transitions[sym] = dead_state
# Initial partition: {accept states, non-accept states}
accept_group = frozenset(s for s in states if s.is_accept)
non_accept_group = frozenset(s for s in states if not s.is_accept)
partition = set()
if accept_group:
partition.add(accept_group)
if non_accept_group:
partition.add(non_accept_group)
# Worklist
worklist = set()
if accept_group and non_accept_group:
# Add the smaller group
if len(accept_group) <= len(non_accept_group):
worklist.add(accept_group)
else:
worklist.add(non_accept_group)
elif accept_group:
worklist.add(accept_group)
elif non_accept_group:
worklist.add(non_accept_group)
# Refinement
while worklist:
A = worklist.pop()
for c in alphabet:
# X = states that transition into A on symbol c
X = frozenset(
s for s in states
if c in s.transitions and s.transitions[c] in A
)
new_partition = set()
for Y in partition:
Y1 = Y & X
Y2 = Y - X
if Y1 and Y2:
new_partition.add(Y1)
new_partition.add(Y2)
if Y in worklist:
worklist.discard(Y)
worklist.add(Y1)
worklist.add(Y2)
else:
if len(Y1) <= len(Y2):
worklist.add(Y1)
else:
worklist.add(Y2)
else:
new_partition.add(Y)
partition = new_partition
# Build minimized DFA
# Map each old state to its partition representative
state_to_group = {}
for group in partition:
representative = min(group, key=lambda s: s.id)
for s in group:
state_to_group[s] = representative
# Create new DFA states
new_dfa_id = 0
group_to_new_state = {}
for group in partition:
rep = min(group, key=lambda s: s.id)
if rep == dead_state:
continue # Skip the dead state
is_accept = any(s.is_accept for s in group)
new_state = DFAState(
id=new_dfa_id,
nfa_states=frozenset().union(*(s.nfa_states for s in group)),
is_accept=is_accept
)
group_to_new_state[frozenset(group)] = new_state
new_dfa_id += 1
# Add transitions
for group in partition:
rep = min(group, key=lambda s: s.id)
if rep == dead_state:
continue
new_state = group_to_new_state[frozenset(group)]
for c in alphabet:
if c in rep.transitions:
target = rep.transitions[c]
target_group = None
for g in partition:
if target in g:
target_group = g
break
if target_group and frozenset(target_group) in group_to_new_state:
new_state.transitions[c] = group_to_new_state[frozenset(target_group)]
# Find the new start state
start_group = None
for group in partition:
if dfa.start in group:
start_group = group
break
new_start = group_to_new_state[frozenset(start_group)]
new_states = list(group_to_new_state.values())
return DFA(new_start, new_states, dfa.alphabet)
# Example
print("\n=== DFA Minimization ===\n")
print("Before minimization:")
dfa.print_table()
min_dfa = minimize_dfa(dfa)
print(f"\nAfter minimization ({len(min_dfa.states)} states):")
min_dfa.print_table()
# Verify correctness
print("\nVerification:")
for s in ["abb", "aabb", "babb", "ababb", "ab", "ba", ""]:
orig = dfa.simulate(s)
mini = min_dfa.simulate(s)
status = "OK" if orig == mini else "MISMATCH!"
print(f" '{s}': original={orig}, minimized={mini} [{status}]")
7. Token Recognition¶
Given DFAs for all token patterns, how does the lexer decide which token to return?
The Longest Match Rule¶
The lexer always returns the longest possible token starting at the current position.
Input: "iffy = 10;"
Without longest match: KW_IF("if") ID("fy") ... WRONG!
With longest match: ID("iffy") ASSIGN("=") ... CORRECT!
Priority Rule¶
When two patterns match the same longest lexeme, the one listed first (highest priority) wins. Keywords are typically listed before identifiers.
Input: "if"
Pattern order:
1. KW_IF: "if" <--- wins (higher priority)
2. ID: [a-z]+ <--- also matches "if"
Combined DFA Approach¶
In practice, all token pattern DFAs are combined into a single DFA:
- Build an NFA for each token pattern, marking the accept state with the token type
- Create a combined NFA with a new start state and epsilon-transitions to each pattern NFA
- Convert to DFA using subset construction
- When a DFA state contains accept states from multiple patterns, choose the highest-priority one
ε ---> NFA(KW_IF)
/
start --ε ---> NFA(KW_WHILE)
\
ε ---> NFA(ID)
...
Implementing the Longest Match¶
def longest_match(dfa: DFA, source: str, pos: int) -> Tuple[Optional[str], int]:
"""
Find the longest match starting at pos.
Returns (token_type, length) or (None, 0) if no match.
"""
state = dfa.start
last_accept = None
last_accept_pos = pos
current_pos = pos
while current_pos < len(source):
ch = source[current_pos]
if ch not in state.transitions:
break
state = state.transitions[ch]
current_pos += 1
if state.is_accept:
last_accept = state # Remember this accept position
last_accept_pos = current_pos
if last_accept is not None:
return (last_accept, last_accept_pos - pos)
return (None, 0)
Error Recovery in Lexers¶
When the lexer encounters a character that doesn't match any token pattern, it has several options:
- Panic mode: Skip the offending character and report an error
- Insert: Insert a missing character (rare)
- Delete: Delete the offending character
- Replace: Replace the offending character with a valid one
Most compilers use panic mode: report the error and skip one character.
def scan_with_recovery(source: str) -> List[Token]:
"""Scan with panic-mode error recovery."""
tokens = []
pos = 0
line = 1
col = 1
while pos < len(source):
# Skip whitespace
if source[pos] in ' \t\r':
pos += 1
col += 1
continue
if source[pos] == '\n':
pos += 1
line += 1
col = 1
continue
# Try to match a token
best_match = None
best_length = 0
best_type = None
for token_type, pattern_dfa in token_patterns:
match_state, length = longest_match(pattern_dfa, source, pos)
if length > best_length:
best_match = match_state
best_length = length
best_type = token_type
if best_match:
lexeme = source[pos:pos + best_length]
tokens.append(Token(best_type, lexeme, line=line, column=col))
pos += best_length
col += best_length
else:
# Error recovery: skip one character
print(f"Lexical error at line {line}, col {col}: "
f"unexpected character '{source[pos]}'")
pos += 1
col += 1
return tokens
8. A Complete Lexer Implementation¶
Here is a complete, self-contained lexer for a C-like language, built from the ground up using the techniques discussed.
"""
Complete lexer for a simple C-like language.
Uses a table-driven approach with longest-match and priority rules.
"""
import re
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional, Tuple, Any
class TT(Enum):
"""Token types."""
# Keywords (listed first for priority)
KW_IF = auto()
KW_ELSE = auto()
KW_WHILE = auto()
KW_FOR = auto()
KW_RETURN = auto()
KW_INT = auto()
KW_FLOAT = auto()
KW_VOID = auto()
KW_CHAR = auto()
KW_BREAK = auto()
KW_CONTINUE = auto()
# Literals
FLOAT_LIT = auto()
INT_LIT = auto()
STRING_LIT = auto()
CHAR_LIT = auto()
# Identifiers
IDENT = auto()
# Multi-character operators (before single-char ones for longest match)
LEQ = auto() # <=
GEQ = auto() # >=
EQ = auto() # ==
NEQ = auto() # !=
AND = auto() # &&
OR = auto() # ||
ARROW = auto() # ->
INC = auto() # ++
DEC = auto() # --
PLUS_ASSIGN = auto() # +=
MINUS_ASSIGN = auto() # -=
STAR_ASSIGN = auto() # *=
SLASH_ASSIGN = auto() # /=
# Single-character operators and delimiters
PLUS = auto() # +
MINUS = auto() # -
STAR = auto() # *
SLASH = auto() # /
PERCENT = auto() # %
ASSIGN = auto() # =
LT = auto() # <
GT = auto() # >
NOT = auto() # !
AMP = auto() # &
PIPE = auto() # |
LPAREN = auto() # (
RPAREN = auto() # )
LBRACE = auto() # {
RBRACE = auto() # }
LBRACKET = auto() # [
RBRACKET = auto() # ]
COMMA = auto() # ,
SEMI = auto() # ;
DOT = auto() # .
# Special
EOF = auto()
ERROR = auto()
@dataclass
class Token:
type: TT
lexeme: str
value: Any = None
line: int = 0
column: int = 0
def __repr__(self):
val = f", value={self.value!r}" if self.value is not None else ""
return f"Token({self.type.name}, {self.lexeme!r}{val}, {self.line}:{self.column})"
class LexerError(Exception):
def __init__(self, message, line, column):
super().__init__(message)
self.line = line
self.column = column
class Lexer:
"""
A complete lexer for a C-like language.
Uses a hand-written state machine approach for correctness
and efficiency. Implements longest-match and keyword priority.
"""
KEYWORDS = {
'if': TT.KW_IF,
'else': TT.KW_ELSE,
'while': TT.KW_WHILE,
'for': TT.KW_FOR,
'return': TT.KW_RETURN,
'int': TT.KW_INT,
'float': TT.KW_FLOAT,
'void': TT.KW_VOID,
'char': TT.KW_CHAR,
'break': TT.KW_BREAK,
'continue': TT.KW_CONTINUE,
}
TWO_CHAR_OPS = {
'<=': TT.LEQ,
'>=': TT.GEQ,
'==': TT.EQ,
'!=': TT.NEQ,
'&&': TT.AND,
'||': TT.OR,
'->': TT.ARROW,
'++': TT.INC,
'--': TT.DEC,
'+=': TT.PLUS_ASSIGN,
'-=': TT.MINUS_ASSIGN,
'*=': TT.STAR_ASSIGN,
'/=': TT.SLASH_ASSIGN,
}
SINGLE_CHAR_OPS = {
'+': TT.PLUS,
'-': TT.MINUS,
'*': TT.STAR,
'/': TT.SLASH,
'%': TT.PERCENT,
'=': TT.ASSIGN,
'<': TT.LT,
'>': TT.GT,
'!': TT.NOT,
'&': TT.AMP,
'|': TT.PIPE,
'(': TT.LPAREN,
')': TT.RPAREN,
'{': TT.LBRACE,
'}': TT.RBRACE,
'[': TT.LBRACKET,
']': TT.RBRACKET,
',': TT.COMMA,
';': TT.SEMI,
'.': TT.DOT,
}
def __init__(self, source: str, filename: str = "<input>"):
self.source = source
self.filename = filename
self.pos = 0
self.line = 1
self.column = 1
self.errors: List[str] = []
def peek(self, offset=0) -> Optional[str]:
"""Look at the character at current position + offset."""
idx = self.pos + offset
if idx < len(self.source):
return self.source[idx]
return None
def advance(self) -> str:
"""Consume and return the current character."""
ch = self.source[self.pos]
self.pos += 1
if ch == '\n':
self.line += 1
self.column = 1
else:
self.column += 1
return ch
def make_token(self, token_type: TT, lexeme: str,
value: Any = None, line: int = 0, col: int = 0) -> Token:
return Token(token_type, lexeme, value, line, col)
def skip_whitespace_and_comments(self):
"""Skip whitespace, single-line comments, and multi-line comments."""
while self.pos < len(self.source):
ch = self.peek()
# Whitespace
if ch in ' \t\r\n':
self.advance()
continue
# Single-line comment: //
if ch == '/' and self.peek(1) == '/':
self.advance() # skip /
self.advance() # skip /
while self.pos < len(self.source) and self.peek() != '\n':
self.advance()
continue
# Multi-line comment: /* ... */
if ch == '/' and self.peek(1) == '*':
start_line = self.line
start_col = self.column
self.advance() # skip /
self.advance() # skip *
depth = 1
while self.pos < len(self.source) and depth > 0:
if self.peek() == '/' and self.peek(1) == '*':
depth += 1
self.advance()
self.advance()
elif self.peek() == '*' and self.peek(1) == '/':
depth -= 1
self.advance()
self.advance()
else:
self.advance()
if depth > 0:
self.errors.append(
f"{self.filename}:{start_line}:{start_col}: "
f"unterminated comment"
)
continue
break # Not whitespace or comment
def scan_number(self) -> Token:
"""Scan an integer or float literal."""
start_line = self.line
start_col = self.column
start_pos = self.pos
is_float = False
# Integer part
while self.pos < len(self.source) and self.peek().isdigit():
self.advance()
# Fractional part
if self.peek() == '.' and self.peek(1) is not None and self.peek(1).isdigit():
is_float = True
self.advance() # skip '.'
while self.pos < len(self.source) and self.peek().isdigit():
self.advance()
# Exponent part
if self.peek() in ('e', 'E'):
is_float = True
self.advance() # skip 'e'/'E'
if self.peek() in ('+', '-'):
self.advance()
if not (self.pos < len(self.source) and self.peek().isdigit()):
self.errors.append(
f"{self.filename}:{self.line}:{self.column}: "
f"expected digit in exponent"
)
while self.pos < len(self.source) and self.peek().isdigit():
self.advance()
lexeme = self.source[start_pos:self.pos]
if is_float:
return self.make_token(TT.FLOAT_LIT, lexeme, float(lexeme),
start_line, start_col)
else:
return self.make_token(TT.INT_LIT, lexeme, int(lexeme),
start_line, start_col)
def scan_string(self) -> Token:
"""Scan a string literal (double-quoted)."""
start_line = self.line
start_col = self.column
self.advance() # skip opening "
chars = []
while self.pos < len(self.source) and self.peek() != '"':
if self.peek() == '\n':
self.errors.append(
f"{self.filename}:{self.line}:{self.column}: "
f"unterminated string literal"
)
break
if self.peek() == '\\':
self.advance() # skip backslash
esc = self.advance() if self.pos < len(self.source) else ''
escape_map = {
'n': '\n', 't': '\t', 'r': '\r',
'\\': '\\', '"': '"', '0': '\0',
}
chars.append(escape_map.get(esc, esc))
else:
chars.append(self.advance())
if self.pos < len(self.source) and self.peek() == '"':
self.advance() # skip closing "
else:
self.errors.append(
f"{self.filename}:{start_line}:{start_col}: "
f"unterminated string literal"
)
value = ''.join(chars)
lexeme = self.source[
(start_col - 1) + sum(1 for _ in range(start_line - 1)):self.pos
]
return self.make_token(TT.STRING_LIT, f'"{value}"', value,
start_line, start_col)
def scan_char(self) -> Token:
"""Scan a character literal (single-quoted)."""
start_line = self.line
start_col = self.column
self.advance() # skip opening '
if self.peek() == '\\':
self.advance()
esc = self.advance() if self.pos < len(self.source) else ''
escape_map = {
'n': '\n', 't': '\t', 'r': '\r',
'\\': '\\', "'": "'", '0': '\0',
}
value = escape_map.get(esc, esc)
elif self.pos < len(self.source):
value = self.advance()
else:
value = ''
self.errors.append(
f"{self.filename}:{start_line}:{start_col}: "
f"empty character literal"
)
if self.pos < len(self.source) and self.peek() == "'":
self.advance() # skip closing '
else:
self.errors.append(
f"{self.filename}:{start_line}:{start_col}: "
f"unterminated character literal"
)
return self.make_token(TT.CHAR_LIT, f"'{value}'", value,
start_line, start_col)
def scan_identifier_or_keyword(self) -> Token:
"""Scan an identifier or keyword."""
start_line = self.line
start_col = self.column
start_pos = self.pos
while (self.pos < len(self.source) and
(self.peek().isalnum() or self.peek() == '_')):
self.advance()
lexeme = self.source[start_pos:self.pos]
# Check if it's a keyword
if lexeme in self.KEYWORDS:
return self.make_token(self.KEYWORDS[lexeme], lexeme,
line=start_line, col=start_col)
else:
return self.make_token(TT.IDENT, lexeme,
line=start_line, col=start_col)
def next_token(self) -> Token:
"""Return the next token from the source."""
self.skip_whitespace_and_comments()
if self.pos >= len(self.source):
return self.make_token(TT.EOF, "", line=self.line, col=self.column)
start_line = self.line
start_col = self.column
ch = self.peek()
# Numbers
if ch.isdigit():
return self.scan_number()
# Identifiers and keywords
if ch.isalpha() or ch == '_':
return self.scan_identifier_or_keyword()
# String literals
if ch == '"':
return self.scan_string()
# Character literals
if ch == "'":
return self.scan_char()
# Two-character operators (check before single-char)
two_char = self.source[self.pos:self.pos + 2]
if two_char in self.TWO_CHAR_OPS:
self.advance()
self.advance()
return self.make_token(self.TWO_CHAR_OPS[two_char], two_char,
line=start_line, col=start_col)
# Single-character operators and delimiters
if ch in self.SINGLE_CHAR_OPS:
self.advance()
return self.make_token(self.SINGLE_CHAR_OPS[ch], ch,
line=start_line, col=start_col)
# Error: unexpected character
self.advance()
self.errors.append(
f"{self.filename}:{start_line}:{start_col}: "
f"unexpected character '{ch}'"
)
return self.make_token(TT.ERROR, ch, line=start_line, col=start_col)
def tokenize(self) -> List[Token]:
"""Tokenize the entire source and return a list of tokens."""
tokens = []
while True:
tok = self.next_token()
tokens.append(tok)
if tok.type == TT.EOF:
break
return tokens
# ============================================================
# Example usage
# ============================================================
source_code = """
// Compute factorial iteratively
int factorial(int n) {
int result = 1;
for (int i = 2; i <= n; i++) {
result *= i;
}
return result;
}
/* Multi-line
comment */
int main() {
int x = factorial(10);
float pi = 3.14159;
char ch = 'A';
char newline = '\\n';
if (x >= 3628800 && pi != 0.0) {
return 0;
} else {
return -1;
}
}
"""
print("=== Complete Lexer Demo ===\n")
lexer = Lexer(source_code, "example.c")
tokens = lexer.tokenize()
for tok in tokens:
print(f" {tok}")
if lexer.errors:
print("\nErrors:")
for err in lexer.errors:
print(f" {err}")
else:
print(f"\nTokenization successful: {len(tokens)} tokens, no errors.")
# Summary statistics
from collections import Counter
type_counts = Counter(tok.type for tok in tokens)
print("\nToken type distribution:")
for tt, count in type_counts.most_common():
print(f" {tt.name:20s} {count}")
9. Lexer Generator Specification¶
While hand-written lexers offer maximum control, lexer generators (Lex, Flex, PLY) automate the process by taking a specification of token patterns and generating lexer code.
Lex/Flex Specification Format¶
%{
/* C declarations */
#include "parser.tab.h"
%}
/* Definitions */
DIGIT [0-9]
LETTER [a-zA-Z_]
ID {LETTER}({LETTER}|{DIGIT})*
%%
/* Rules: pattern action */
"if" { return KW_IF; }
"else" { return KW_ELSE; }
"while" { return KW_WHILE; }
"return" { return KW_RETURN; }
"int" { return KW_INT; }
"float" { return KW_FLOAT; }
{DIGIT}+ { yylval.ival = atoi(yytext); return INT_LIT; }
{DIGIT}+"."{DIGIT}+ { yylval.fval = atof(yytext); return FLOAT_LIT; }
{ID} { yylval.sval = strdup(yytext); return IDENT; }
"<=" { return LEQ; }
">=" { return GEQ; }
"==" { return EQ; }
"!=" { return NEQ; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return STAR; }
"/" { return SLASH; }
"=" { return ASSIGN; }
"(" { return LPAREN; }
")" { return RPAREN; }
"{" { return LBRACE; }
"}" { return RBRACE; }
";" { return SEMI; }
"," { return COMMA; }
"//".* { /* skip single-line comment */ }
[ \t\r]+ { /* skip whitespace */ }
\n { yylineno++; }
. { fprintf(stderr, "Unexpected: %s\n", yytext); }
%%
PLY (Python Lex-Yacc) Specification¶
"""
Token specification using PLY (Python Lex-Yacc).
Install: pip install ply
"""
import ply.lex as lex
# List of token names (required by PLY)
tokens = [
'IDENT', 'INT_LIT', 'FLOAT_LIT', 'STRING_LIT',
'PLUS', 'MINUS', 'STAR', 'SLASH',
'ASSIGN', 'EQ', 'NEQ', 'LT', 'GT', 'LEQ', 'GEQ',
'LPAREN', 'RPAREN', 'LBRACE', 'RBRACE',
'SEMI', 'COMMA',
]
# Reserved words
reserved = {
'if': 'KW_IF',
'else': 'KW_ELSE',
'while': 'KW_WHILE',
'for': 'KW_FOR',
'return': 'KW_RETURN',
'int': 'KW_INT',
'float': 'KW_FLOAT',
'void': 'KW_VOID',
}
tokens += list(reserved.values())
# Simple tokens (single characters)
t_PLUS = r'\+'
t_MINUS = r'-'
t_STAR = r'\*'
t_SLASH = r'/'
t_ASSIGN = r'='
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_LBRACE = r'\{'
t_RBRACE = r'\}'
t_SEMI = r';'
t_COMMA = r','
# Multi-character operators (functions, checked before simple tokens)
def t_LEQ(t):
r'<='
return t
def t_GEQ(t):
r'>='
return t
def t_EQ(t):
r'=='
return t
def t_NEQ(t):
r'!='
return t
def t_LT(t):
r'<'
return t
def t_GT(t):
r'>'
return t
# Float literal (must be before INT_LIT)
def t_FLOAT_LIT(t):
r'\d+\.\d+([eE][+-]?\d+)?'
t.value = float(t.value)
return t
# Integer literal
def t_INT_LIT(t):
r'\d+'
t.value = int(t.value)
return t
# String literal
def t_STRING_LIT(t):
r'"[^"\\]*(\\.[^"\\]*)*"'
t.value = t.value[1:-1] # Strip quotes
return t
# Identifier (and keyword check)
def t_IDENT(t):
r'[a-zA-Z_][a-zA-Z0-9_]*'
t.type = reserved.get(t.value, 'IDENT') # Check for keywords
return t
# Track line numbers
def t_newline(t):
r'\n+'
t.lexer.lineno += len(t.value)
# Ignored characters (whitespace)
t_ignore = ' \t\r'
# Comments
def t_COMMENT(t):
r'//[^\n]*'
pass # Discard
def t_BLOCK_COMMENT(t):
r'/\*[\s\S]*?\*/'
t.lexer.lineno += t.value.count('\n')
# Error handling
def t_error(t):
print(f"Illegal character '{t.value[0]}' at line {t.lineno}")
t.lexer.skip(1)
# Build the lexer
lexer = lex.lex()
# Test it
lexer.input("int x = 42 + y * 3.14;")
while True:
tok = lexer.token()
if not tok:
break
print(tok)
10. Handling Special Cases¶
String Literals with Escapes¶
String scanning requires handling escape sequences, which makes the pattern context-sensitive at the character level:
def scan_string_literal(source: str, pos: int) -> Tuple[str, str, int]:
"""
Scan a string literal starting at pos (which should be '"').
Returns (raw_lexeme, processed_value, new_pos).
"""
assert source[pos] == '"'
start = pos
pos += 1 # skip opening quote
chars = []
ESCAPE_MAP = {
'n': '\n',
't': '\t',
'r': '\r',
'\\': '\\',
'"': '"',
'0': '\0',
'a': '\a',
'b': '\b',
'f': '\f',
'v': '\v',
}
while pos < len(source):
ch = source[pos]
if ch == '"':
pos += 1 # skip closing quote
raw = source[start:pos]
return (raw, ''.join(chars), pos)
elif ch == '\\':
pos += 1
if pos >= len(source):
raise LexerError("Unterminated escape in string", 0, 0)
esc = source[pos]
if esc in ESCAPE_MAP:
chars.append(ESCAPE_MAP[esc])
elif esc == 'x':
# Hex escape: \xNN
hex_str = source[pos+1:pos+3]
chars.append(chr(int(hex_str, 16)))
pos += 2
elif esc == 'u':
# Unicode escape: \uNNNN
hex_str = source[pos+1:pos+5]
chars.append(chr(int(hex_str, 16)))
pos += 4
else:
chars.append(esc) # Unknown escape, keep literal
pos += 1
elif ch == '\n':
raise LexerError("Unterminated string (newline)", 0, 0)
else:
chars.append(ch)
pos += 1
raise LexerError("Unterminated string (EOF)", 0, 0)
Multi-Line Strings¶
Some languages support multi-line strings:
def scan_multiline_string(source: str, pos: int) -> Tuple[str, str, int]:
"""Scan a triple-quoted string (Python-style)."""
assert source[pos:pos+3] == '"""'
start = pos
pos += 3 # skip opening """
while pos < len(source) - 2:
if source[pos:pos+3] == '"""':
pos += 3
raw = source[start:pos]
value = raw[3:-3] # strip quotes
return (raw, value, pos)
pos += 1
raise LexerError("Unterminated multi-line string", 0, 0)
Nested Comments¶
Some languages (Haskell, Swift, Rust) support nested comments:
def skip_nested_comment(source: str, pos: int) -> int:
"""Skip a nested block comment /* ... /* ... */ ... */"""
assert source[pos:pos+2] == '/*'
pos += 2
depth = 1
while pos < len(source) - 1 and depth > 0:
if source[pos:pos+2] == '/*':
depth += 1
pos += 2
elif source[pos:pos+2] == '*/':
depth -= 1
pos += 2
else:
pos += 1
if depth > 0:
raise LexerError("Unterminated nested comment", 0, 0)
return pos
Significant Whitespace (Python-style Indentation)¶
Python's lexer generates INDENT and DEDENT tokens based on whitespace changes:
def tokenize_with_indentation(source: str) -> List[Token]:
"""
Tokenize with Python-style indentation handling.
Generates INDENT and DEDENT tokens.
"""
tokens = []
indent_stack = [0] # Stack of indentation levels
lines = source.split('\n')
for line_no, line in enumerate(lines, 1):
# Skip blank lines
stripped = line.lstrip()
if not stripped or stripped.startswith('#'):
continue
# Calculate indentation level
indent = len(line) - len(stripped)
if indent > indent_stack[-1]:
indent_stack.append(indent)
tokens.append(Token(TT.IDENT, "<INDENT>", line=line_no, column=1))
else:
while indent < indent_stack[-1]:
indent_stack.pop()
tokens.append(Token(TT.IDENT, "<DEDENT>", line=line_no, column=1))
if indent != indent_stack[-1]:
raise LexerError(
f"Inconsistent indentation at line {line_no}", line_no, 1
)
# Tokenize the rest of the line normally
# (omitted for brevity -- use the main lexer)
# Emit remaining DEDENTs
while len(indent_stack) > 1:
indent_stack.pop()
tokens.append(Token(TT.IDENT, "<DEDENT>", line=len(lines), column=1))
return tokens
11. Performance Considerations¶
Table-Driven vs. Direct-Coded Lexers¶
Table-driven (generated by Lex/Flex):
- Transition table stored as a 2D array: next_state = table[state][char]
- Compact, easy to generate
- Indirect memory access on every character
Direct-coded (hand-written or re2c):
- Transitions encoded as switch statements or goto chains
- Faster: branch prediction works better, no table lookup
- Larger code size
// Table-driven (simplified)
int state = 0;
while (state != -1) {
char c = *input++;
state = table[state][c]; // indirect lookup
}
// Direct-coded
s0:
char c = *input++;
if (c >= 'a' && c <= 'z') goto s1;
if (c >= '0' && c <= '9') goto s2;
goto error;
s1:
...
re2c generates direct-coded lexers and is typically 2-3x faster than Flex.
Buffer Management¶
Reading one character at a time from the file system is slow. Lexers use double buffering:
+-----------------------------------+-----------------------------------+
| Buffer 1 (4096 bytes) | Buffer 2 (4096 bytes) |
+-----------------------------------+-----------------------------------+
^ ^
lexeme_begin forward
When forward reaches the end of Buffer 1, load next block into Buffer 2.
When forward reaches the end of Buffer 2, load next block into Buffer 1.
This is called the sentinel technique: place a special EOF character at the end of each buffer so the main scanning loop doesn't need a bounds check on every character.
Summary¶
- The lexer is the first phase of compilation, transforming characters into tokens
- Tokens have a type and an optional attribute value; lexemes are the actual text; patterns are the rules
- Token patterns are specified using regular expressions, which support union, concatenation, and Kleene star
- Thompson's construction converts a regex to an NFA in $O(n)$ time and space
- Subset construction converts an NFA to a DFA; worst case $2^n$ states, but usually much fewer
- Hopcroft's algorithm minimizes the DFA in $O(n \log n)$ time
- The longest match rule ensures the lexer always returns the longest possible token
- Priority rules resolve ties between patterns (keywords before identifiers)
- Hand-written lexers offer flexibility; lexer generators (Lex, Flex, PLY) offer convenience
- Special cases include string escapes, nested comments, and significant whitespace
Exercises¶
Exercise 1: Regular Expressions¶
Write regular expressions for each of the following:
- Binary strings with an even number of 0's
- C-style identifiers that start with an underscore
- Email addresses (simplified:
name@domain.tld) - Floating-point numbers in scientific notation (e.g.,
1.5e-3,-2.0E+10) - C-style comments:
/* ... */(non-nested)
Exercise 2: Thompson's Construction¶
Apply Thompson's construction to the regex a(b|c)*d. Draw the NFA (label all states). Then simulate the NFA on the inputs:
- "ad" -- should it accept?
- "abcd" -- should it accept?
- "abbd" -- should it accept?
- "abc" -- should it accept?
Exercise 3: Subset Construction¶
Given the NFA from Exercise 2, apply the subset construction algorithm. Show the DFA transition table. How many states does the resulting DFA have?
Exercise 4: Lexer Extension¶
Extend the complete lexer in Section 8 to handle:
1. Hexadecimal integer literals (e.g., 0xFF, 0x1A3)
2. Octal integer literals (e.g., 0o77, 0o12)
3. Binary integer literals (e.g., 0b1010, 0b11001)
4. Numeric separators for readability (e.g., 1_000_000, 0xFF_FF)
Exercise 5: Error Recovery¶
Design and implement an error recovery strategy for the lexer that:
1. Reports unterminated string literals with the line number where the string started
2. Suggests corrections for common typos (e.g., retrun -> return)
3. Groups consecutive illegal characters into a single error message rather than reporting each one individually
Exercise 6: Performance Comparison¶
Implement a simple benchmark that:
1. Generates a large (1MB+) source file with random tokens
2. Tokenizes it using the hand-written lexer from Section 8
3. Tokenizes it using Python's re module with a combined regex
4. Compare the tokenization times and verify both produce the same tokens
Previous: Introduction to Compilers | Next: Finite Automata | Overview