Lesson 1: Introduction to Compilers¶

Learning Objectives¶

After completing this lesson, you will be able to:

Define what a compiler is and distinguish it from an interpreter
Trace the historical development of compilers from Fortran to modern systems
Identify and describe each phase of the compilation process
Explain the distinction between front-end and back-end
Understand single-pass vs. multi-pass compilation strategies
Describe bootstrapping and cross-compilation
Read and construct T-diagrams (tombstone diagrams)
Trace a simple program through every phase of compilation

1. What Is a Compiler?¶

A compiler is a program that translates source code written in one language (the source language) into another language (the target language), while reporting errors detected during the translation process.

$$\text{Compiler}: \text{Source Language} \longrightarrow \text{Target Language}$$

More precisely, a compiler is a function that maps programs:

$$C: \mathcal{P}_S \rightarrow \mathcal{P}_T$$

where $\mathcal{P}_S$ is the set of valid programs in the source language and $\mathcal{P}_T$ is the set of programs in the target language, such that for every input $x$:

$$\text{meaning}(P_S, x) = \text{meaning}(C(P_S), x)$$

The semantic equivalence requirement is crucial: the compiled program must produce the same observable behavior as the source program for all valid inputs.

Common Source-Target Pairs¶

Source Language	Target Language	Example
C	x86 machine code	GCC, Clang
Java	JVM bytecode	javac
TypeScript	JavaScript	tsc
Python	Python bytecode (.pyc)	CPython
Rust	LLVM IR, then machine code	rustc
SASS	CSS	sass compiler
LaTeX	DVI/PDF	pdflatex

Compiler vs. Transpiler¶

A transpiler (source-to-source compiler) translates between languages at a similar level of abstraction. TypeScript to JavaScript and CoffeeScript to JavaScript are transpilation. The boundary between compilation and transpilation is blurry; what matters is that both preserve semantics.

2. Compiler vs. Interpreter¶

An interpreter executes the source program directly, without producing a separate target program.

Compiler workflow:
  Source Code  --->  [Compiler]  --->  Target Code  --->  [CPU/VM]  --->  Output
                                          (stored)

Interpreter workflow:
  Source Code  --->  [Interpreter + Input]  --->  Output
                     (no separate target)

Key Differences¶

Aspect	Compiler	Interpreter
Output	Produces target program	No target program produced
Execution	Target runs independently	Interpreter must be present
Speed (startup)	Slow (compilation overhead)	Fast (immediate execution)
Speed (runtime)	Fast (optimized target code)	Slow (repeated analysis)
Error reporting	All errors before execution	Errors at runtime
Memory	Target program stored	Source re-analyzed each time

Hybrid Approaches¶

Most modern language implementations use a hybrid approach:

Java: Compile to bytecode (javac), then interpret/JIT-compile on JVM
Python: Compile to bytecode (.pyc), then interpret on CPython VM
JavaScript: Parse, then JIT-compile hot paths (V8, SpiderMonkey)
C#/.NET: Compile to CIL, then JIT-compile to native code

# Python's hybrid approach: you can see the bytecode
import dis

def add(a, b):
    return a + b

# Show the compiled bytecode instructions
dis.dis(add)
# Output:
#   2           0 LOAD_FAST                0 (a)
#               2 LOAD_FAST                1 (b)
#               4 BINARY_ADD
#               6 RETURN_VALUE

The Compilation Spectrum¶

Rather than a binary distinction, there is a spectrum:

Pure Interpreter  <----------------------------->  Ahead-of-Time Compiler
     |                    |              |                |
   Shell scripts    CPython (bytecode)  JIT (V8, JVM)   GCC/Clang

3. Historical Development¶

The Early Days (1950s)¶

Before compilers, programmers wrote machine code or assembly language by hand. The idea that a machine could translate high-level notation into efficient machine code was met with skepticism.

1957 -- FORTRAN Compiler (John Backus, IBM) - The first optimizing compiler - Took 18 person-years to develop - Goal: produce code as efficient as hand-written assembly - Success proved that high-level languages were practical

1960 -- COBOL Compiler - Grace Hopper's work on A-0 (1952) laid groundwork - First language designed for portability across machines

1962 -- LISP Compiler - First self-hosting compiler (compiler written in the language it compiles) - Introduced garbage collection

Formal Foundations (1960s-1970s)¶

1960s -- Chomsky Hierarchy and Parsing Theory - Noam Chomsky's classification of formal languages - Direct application to compiler design - Development of context-free grammars for syntax

1965 -- Knuth's LR Parsing - Donald Knuth introduced LR parsing - Provided a systematic way to build parsers for deterministic context-free languages

1970 -- Yacc (Yet Another Compiler-Compiler) - Stephen C. Johnson at Bell Labs - LALR(1) parser generator - Made parser construction practical

1975 -- Lex - Mike Lesk and Eric Schmidt at Bell Labs - Lexical analyzer generator - Works with Yacc: Lex handles tokens, Yacc handles grammar

Modern Era (1980s-Present)¶

1987 -- GCC (GNU Compiler Collection) - Richard Stallman - First major open-source compiler - Supports multiple languages and targets

2003 -- LLVM - Chris Lattner at UIUC - Modular compiler infrastructure - Revolutionized compiler construction with reusable components

2010s -- Rust, Go, Swift - Languages designed with modern compiler technology - Rust: ownership-based memory safety (no GC) - Go: fast compilation as a design goal - Swift: built on LLVM

2020s -- AI-Assisted Compilation - Machine learning for optimization heuristics - Neural program synthesis - LLM-based code generation (a different paradigm from traditional compilation)

4. Phases of Compilation¶

A compiler is organized as a sequence of phases, each of which transforms the program from one representation to another.

Source Code (characters)
    |
    v
+-------------------+
| Lexical Analysis  |  (Scanner / Lexer)
+-------------------+
    |  tokens
    v
+-------------------+
| Syntax Analysis   |  (Parser)
+-------------------+
    |  parse tree / AST
    v
+-------------------+
| Semantic Analysis |  (Type checker)
+-------------------+
    |  annotated AST
    v
+-------------------+
| IR Generation     |  (Intermediate Representation)
+-------------------+
    |  IR (three-address code, SSA, ...)
    v
+-------------------+
| Optimization      |  (Machine-independent)
+-------------------+
    |  optimized IR
    v
+-------------------+
| Code Generation   |  (Machine-dependent)
+-------------------+
    |  target code (assembly / machine code)
    v
+-------------------+
| Target Optimization |  (Peephole, scheduling)
+-------------------+
    |
    v
Target Code (machine instructions)

Two additional components run throughout all phases:

Symbol Table Manager: Maintains information about identifiers (names, types, scopes)
Error Handler: Detects, reports, and (where possible) recovers from errors

Phase 1: Lexical Analysis (Scanning)¶

The lexer (or scanner) reads the source program as a stream of characters and groups them into tokens -- the smallest meaningful units of the language.

Input: Stream of characters Output: Stream of tokens

Source:  position = initial + rate * 60

Tokens:  [ID:"position"] [ASSIGN:"="] [ID:"initial"] [PLUS:"+"]
         [ID:"rate"] [STAR:"*"] [INT:"60"]

Each token has: - A token type (or token class): ID, INT, ASSIGN, PLUS, etc. - An optional attribute value: the actual lexeme or its value

The lexer also: - Strips whitespace and comments - Handles preprocessing directives (in C) - Reports lexical errors (illegal characters, unterminated strings)

Phase 2: Syntax Analysis (Parsing)¶

The parser takes the token stream and builds a parse tree (or directly an AST) according to the grammar rules of the language.

Input: Stream of tokens Output: Parse tree or AST

Parse tree for: position = initial + rate * 60

        assignment_stmt
       /       |       \
     ID       '='      expr
  "position"          /  |  \
                   expr  '+'  term
                    |        / | \
                   ID    term '*' factor
                "initial"  |       |
                          ID     INT
                        "rate"   60

The parser: - Verifies that the token sequence conforms to the language grammar - Reports syntax errors ("expected ; before }") - May perform error recovery to continue parsing after errors

Phase 3: Semantic Analysis¶

Semantic analysis checks the program for semantic consistency that cannot be captured by context-free grammars.

Input: AST Output: Annotated/decorated AST

Key tasks: - Type checking: Is rate * 60 valid? (Can you multiply a float by an int?) - Type coercion: Insert implicit conversions (int 60 -> float 60.0) - Name resolution: Which declaration does each use of rate refer to? - Scope checking: Is position visible in this scope? - Definite assignment: Is initial initialized before use?

Annotated AST:

        assignment_stmt
       /       |       \
     ID       '='      expr : float
  "position"          /  |  \
  (type:float)     expr  '+'  term : float
                    |        / | \
                   ID    term '*' factor : float
                "initial"  |       |
               (float)    ID    intToFloat(INT)
                        "rate"     60 -> 60.0
                       (float)

Phase 4: Intermediate Representation (IR) Generation¶

The compiler generates a machine-independent intermediate representation of the program.

Input: Annotated AST Output: IR (e.g., three-address code)

Three-address code for: position = initial + rate * 60

    t1 = intToFloat(60)
    t2 = rate * t1
    t3 = initial + t2
    position = t3

Each instruction has at most three operands (hence "three-address code"). This flat representation is much easier to optimize than a tree.

Phase 5: Optimization¶

The optimizer transforms the IR to improve performance without changing the program's semantics.

Input: IR Output: Optimized IR

Before optimization:           After optimization:
    t1 = intToFloat(60)           t1 = rate * 60.0
    t2 = rate * t1                position = initial + t1
    t3 = initial + t2
    position = t3

Optimizations applied: - Constant folding: intToFloat(60) -> 60.0 (computed at compile time) - Dead code elimination: removed unnecessary temporaries - Copy propagation: replaced uses of t3 with its definition

Phase 6: Code Generation¶

The code generator maps the optimized IR to the target machine's instruction set.

Input: Optimized IR Output: Target code (assembly or machine code)

; x86-64 assembly for: position = initial + rate * 60.0
    movsd   xmm0, QWORD PTR [rate]       ; load rate into xmm0
    mulsd   xmm0, QWORD PTR [.LC0]       ; xmm0 = rate * 60.0
    addsd   xmm0, QWORD PTR [initial]    ; xmm0 = initial + rate * 60.0
    movsd   QWORD PTR [position], xmm0   ; store result in position
.LC0:
    .double 60.0

The code generator must: - Select appropriate instructions (instruction selection) - Assign variables to registers (register allocation) - Determine the order of instructions (instruction scheduling)

5. A Complete Example: From Source to Target¶

Let us trace the expression a = b + c * 2 through every phase.

Source Code¶

int a, b, c;
b = 3;
c = 5;
a = b + c * 2;

Phase 1: Lexical Analysis¶

Token Stream:
  [KW_INT:"int"] [ID:"a"] [COMMA:","] [ID:"b"] [COMMA:","] [ID:"c"] [SEMI:";"]
  [ID:"b"] [ASSIGN:"="] [INT:"3"] [SEMI:";"]
  [ID:"c"] [ASSIGN:"="] [INT:"5"] [SEMI:";"]
  [ID:"a"] [ASSIGN:"="] [ID:"b"] [PLUS:"+"] [ID:"c"] [STAR:"*"] [INT:"2"] [SEMI:";"]

Phase 2: Syntax Analysis (AST)¶

Program
 |- VarDecl(type=int, names=[a, b, c])
 |- Assign(target=b, value=IntLit(3))
 |- Assign(target=c, value=IntLit(5))
 |- Assign(target=a,
           value=BinOp(+,
                        left=Var(b),
                        right=BinOp(*,
                                     left=Var(c),
                                     right=IntLit(2))))

Phase 3: Semantic Analysis¶

Symbol Table:
  a : int (declared, scope=global)
  b : int (declared, scope=global)
  c : int (declared, scope=global)

Type checking:
  b + c * 2  =>  int + (int * int)  =>  int + int  =>  int  ✓
  a = (int)  =>  int = int  ✓

Phase 4: IR Generation (Three-Address Code)¶

    b = 3
    c = 5
    t1 = c * 2
    t2 = b + t1
    a = t2

Phase 5: Optimization¶

Constant propagation: b=3, c=5 are known constants

    t1 = 5 * 2     (c replaced with 5)
    t2 = 3 + t1    (b replaced with 3)
    a = t2

Constant folding:

    t1 = 10         (5 * 2 computed at compile time)
    t2 = 13         (3 + 10 computed at compile time)
    a = 13

Copy propagation + dead code elimination:

    a = 13

The entire computation was resolved at compile time.

Phase 6: Code Generation¶

; x86-64
    mov DWORD PTR [a], 13

Python Simulation of All Phases¶

"""
A simplified demonstration of compiler phases.
This processes the expression: a = b + c * 2
"""

# ============================================================
# Phase 1: Lexical Analysis
# ============================================================

import re
from dataclasses import dataclass
from typing import List, Optional, Any

@dataclass
class Token:
    type: str
    value: str
    line: int = 0
    col: int = 0

def tokenize(source: str) -> List[Token]:
    """Simple lexer for a tiny language."""
    token_spec = [
        ('INT',     r'\d+'),
        ('ID',      r'[a-zA-Z_]\w*'),
        ('ASSIGN',  r'='),
        ('PLUS',    r'\+'),
        ('STAR',    r'\*'),
        ('SEMI',    r';'),
        ('COMMA',   r','),
        ('SKIP',    r'[ \t]+'),
        ('NEWLINE', r'\n'),
        ('MISMATCH', r'.'),
    ]
    keywords = {'int'}
    tok_regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_spec)
    tokens = []
    for mo in re.finditer(tok_regex, source):
        kind = mo.lastgroup
        value = mo.group()
        if kind == 'SKIP' or kind == 'NEWLINE':
            continue
        if kind == 'MISMATCH':
            raise SyntaxError(f'Unexpected character: {value!r}')
        if kind == 'ID' and value in keywords:
            kind = f'KW_{value.upper()}'
        tokens.append(Token(kind, value))
    return tokens

source = """int a, b, c;
b = 3;
c = 5;
a = b + c * 2;"""

tokens = tokenize(source)
print("=== Phase 1: Lexical Analysis ===")
for tok in tokens:
    print(f"  {tok.type:10s} {tok.value!r}")


# ============================================================
# Phase 2: Syntax Analysis (Build AST)
# ============================================================

@dataclass
class ASTNode:
    """Base class for AST nodes."""
    pass

@dataclass
class IntLiteral(ASTNode):
    value: int

@dataclass
class Identifier(ASTNode):
    name: str
    resolved_type: Optional[str] = None

@dataclass
class BinOp(ASTNode):
    op: str
    left: ASTNode
    right: ASTNode
    resolved_type: Optional[str] = None

@dataclass
class Assignment(ASTNode):
    target: str
    value: ASTNode

@dataclass
class VarDecl(ASTNode):
    type_name: str
    names: List[str]

@dataclass
class Program(ASTNode):
    statements: List[ASTNode]

class Parser:
    """Recursive descent parser for our tiny language."""

    def __init__(self, tokens: List[Token]):
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[Token]:
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return None

    def consume(self, expected_type: str = None) -> Token:
        tok = self.tokens[self.pos]
        if expected_type and tok.type != expected_type:
            raise SyntaxError(
                f"Expected {expected_type}, got {tok.type} ({tok.value!r})"
            )
        self.pos += 1
        return tok

    def parse_program(self) -> Program:
        stmts = []
        while self.pos < len(self.tokens):
            stmts.append(self.parse_statement())
        return Program(stmts)

    def parse_statement(self):
        tok = self.peek()
        if tok.type == 'KW_INT':
            return self.parse_var_decl()
        else:
            return self.parse_assignment()

    def parse_var_decl(self) -> VarDecl:
        self.consume('KW_INT')
        names = [self.consume('ID').value]
        while self.peek() and self.peek().type == 'COMMA':
            self.consume('COMMA')
            names.append(self.consume('ID').value)
        self.consume('SEMI')
        return VarDecl('int', names)

    def parse_assignment(self) -> Assignment:
        name = self.consume('ID').value
        self.consume('ASSIGN')
        expr = self.parse_expr()
        self.consume('SEMI')
        return Assignment(name, expr)

    def parse_expr(self) -> ASTNode:
        """expr = term (('+') term)*"""
        left = self.parse_term()
        while self.peek() and self.peek().type == 'PLUS':
            self.consume('PLUS')
            right = self.parse_term()
            left = BinOp('+', left, right)
        return left

    def parse_term(self) -> ASTNode:
        """term = factor (('*') factor)*"""
        left = self.parse_factor()
        while self.peek() and self.peek().type == 'STAR':
            self.consume('STAR')
            right = self.parse_factor()
            left = BinOp('*', left, right)
        return left

    def parse_factor(self) -> ASTNode:
        """factor = INT | ID"""
        tok = self.peek()
        if tok.type == 'INT':
            self.consume()
            return IntLiteral(int(tok.value))
        elif tok.type == 'ID':
            self.consume()
            return Identifier(tok.value)
        else:
            raise SyntaxError(f"Unexpected token: {tok}")

parser = Parser(tokens)
ast = parser.parse_program()

def print_ast(node, indent=0):
    prefix = "  " * indent
    if isinstance(node, Program):
        print(f"{prefix}Program")
        for s in node.statements:
            print_ast(s, indent + 1)
    elif isinstance(node, VarDecl):
        print(f"{prefix}VarDecl(type={node.type_name}, names={node.names})")
    elif isinstance(node, Assignment):
        print(f"{prefix}Assign(target={node.target})")
        print_ast(node.value, indent + 1)
    elif isinstance(node, BinOp):
        print(f"{prefix}BinOp({node.op})")
        print_ast(node.left, indent + 1)
        print_ast(node.right, indent + 1)
    elif isinstance(node, IntLiteral):
        print(f"{prefix}IntLit({node.value})")
    elif isinstance(node, Identifier):
        print(f"{prefix}Id({node.name})")

print("\n=== Phase 2: Syntax Analysis (AST) ===")
print_ast(ast)


# ============================================================
# Phase 3: Semantic Analysis
# ============================================================

class SemanticAnalyzer:
    def __init__(self):
        self.symbol_table = {}  # name -> type

    def analyze(self, node: ASTNode):
        if isinstance(node, Program):
            for stmt in node.statements:
                self.analyze(stmt)
        elif isinstance(node, VarDecl):
            for name in node.names:
                if name in self.symbol_table:
                    raise NameError(f"Variable '{name}' already declared")
                self.symbol_table[name] = node.type_name
        elif isinstance(node, Assignment):
            if node.target not in self.symbol_table:
                raise NameError(f"Variable '{node.target}' not declared")
            expr_type = self.check_type(node.value)
            target_type = self.symbol_table[node.target]
            if expr_type != target_type:
                raise TypeError(
                    f"Cannot assign {expr_type} to {target_type} variable '{node.target}'"
                )
        return node

    def check_type(self, node: ASTNode) -> str:
        if isinstance(node, IntLiteral):
            return 'int'
        elif isinstance(node, Identifier):
            if node.name not in self.symbol_table:
                raise NameError(f"Undeclared variable '{node.name}'")
            node.resolved_type = self.symbol_table[node.name]
            return node.resolved_type
        elif isinstance(node, BinOp):
            left_type = self.check_type(node.left)
            right_type = self.check_type(node.right)
            if left_type != right_type:
                raise TypeError(
                    f"Type mismatch: {left_type} {node.op} {right_type}"
                )
            node.resolved_type = left_type
            return left_type
        raise TypeError(f"Unknown node type: {type(node)}")

analyzer = SemanticAnalyzer()
analyzer.analyze(ast)

print("\n=== Phase 3: Semantic Analysis ===")
print("Symbol table:", analyzer.symbol_table)
print("Type checking passed!")


# ============================================================
# Phase 4: IR Generation (Three-Address Code)
# ============================================================

class IRGenerator:
    def __init__(self):
        self.instructions = []
        self.temp_count = 0

    def new_temp(self) -> str:
        self.temp_count += 1
        return f"t{self.temp_count}"

    def generate(self, node: ASTNode):
        if isinstance(node, Program):
            for stmt in node.statements:
                self.generate(stmt)
        elif isinstance(node, VarDecl):
            pass  # Declarations don't generate code
        elif isinstance(node, Assignment):
            result = self.gen_expr(node.value)
            self.instructions.append(f"{node.target} = {result}")

    def gen_expr(self, node: ASTNode) -> str:
        if isinstance(node, IntLiteral):
            return str(node.value)
        elif isinstance(node, Identifier):
            return node.name
        elif isinstance(node, BinOp):
            left = self.gen_expr(node.left)
            right = self.gen_expr(node.right)
            temp = self.new_temp()
            self.instructions.append(f"{temp} = {left} {node.op} {right}")
            return temp
        raise ValueError(f"Unknown node: {type(node)}")

ir_gen = IRGenerator()
ir_gen.generate(ast)

print("\n=== Phase 4: IR Generation ===")
for instr in ir_gen.instructions:
    print(f"  {instr}")


# ============================================================
# Phase 5: Optimization (Constant Folding + Propagation)
# ============================================================

def optimize(instructions: List[str]) -> List[str]:
    """Simple constant folding and propagation optimizer."""
    constants = {}  # variable -> known constant value
    optimized = []

    for instr in instructions:
        parts = instr.split(' = ', 1)
        target = parts[0].strip()
        expr = parts[1].strip()

        # Substitute known constants
        for var, val in constants.items():
            # Replace variable with its constant value (whole word only)
            import re as _re
            expr = _re.sub(r'\b' + _re.escape(var) + r'\b', str(val), expr)

        # Try to evaluate the expression (constant folding)
        try:
            # Only evaluate if it looks like a pure arithmetic expression
            result = eval(expr)
            if isinstance(result, (int, float)):
                constants[target] = result
                optimized.append(f"{target} = {result}")
                continue
        except:
            pass

        optimized.append(f"{target} = {expr}")

    # Remove assignments to temporaries that are only used once
    # (simplified dead code elimination)
    final = []
    used_vars = set()
    for instr in reversed(optimized):
        parts = instr.split(' = ', 1)
        target = parts[0].strip()
        if target.startswith('t') and target not in used_vars:
            continue  # Dead temporary
        final.append(instr)
        # Track variable uses in the expression
        for word in re.findall(r'\b[a-zA-Z_]\w*\b', parts[1]):
            used_vars.add(word)
    final.reverse()

    return final

optimized_ir = optimize(ir_gen.instructions)

print("\n=== Phase 5: Optimization ===")
for instr in optimized_ir:
    print(f"  {instr}")


# ============================================================
# Phase 6: Code Generation (Simple Stack Machine)
# ============================================================

def generate_code(instructions: List[str]) -> List[str]:
    """Generate simple stack-machine instructions."""
    code = []
    for instr in instructions:
        parts = instr.split(' = ', 1)
        target = parts[0].strip()
        expr = parts[1].strip()

        # Try to parse as "left op right"
        match = re.match(r'(\w+)\s*([+\-*/])\s*(\w+)', expr)
        if match:
            left, op, right = match.groups()
            code.append(f"  LOAD  {left}")
            code.append(f"  LOAD  {right}")
            op_map = {'+': 'ADD', '-': 'SUB', '*': 'MUL', '/': 'DIV'}
            code.append(f"  {op_map[op]}")
            code.append(f"  STORE {target}")
        else:
            # Simple assignment: target = value
            code.append(f"  LOAD  {expr}")
            code.append(f"  STORE {target}")
    return code

target_code = generate_code(optimized_ir)

print("\n=== Phase 6: Code Generation (Stack Machine) ===")
for line in target_code:
    print(line)

Running this produces:

=== Phase 1: Lexical Analysis ===
  KW_INT     'int'
  ID         'a'
  COMMA      ','
  ID         'b'
  ...

=== Phase 2: Syntax Analysis (AST) ===
Program
  VarDecl(type=int, names=['a', 'b', 'c'])
  Assign(target=b)
    IntLit(3)
  Assign(target=c)
    IntLit(5)
  Assign(target=a)
    BinOp(+)
      Id(b)
      BinOp(*)
        Id(c)
        IntLit(2)

=== Phase 3: Semantic Analysis ===
Symbol table: {'a': 'int', 'b': 'int', 'c': 'int'}
Type checking passed!

=== Phase 4: IR Generation ===
  b = 3
  c = 5
  t1 = c * 2
  t2 = b + t1
  a = t2

=== Phase 5: Optimization ===
  a = 13

=== Phase 6: Code Generation (Stack Machine) ===
  LOAD  13
  STORE a

The optimizer reduced the entire computation to a single constant assignment.

6. Front-End vs. Back-End¶

The compiler is logically divided into two halves:

                    Front-End                    Back-End
              (language-dependent)          (machine-dependent)
         +---------------------------+  +---------------------------+
Source -->| Lexer | Parser | Semantic |->| Optimizer | Code Gen     |--> Target
         | Analysis       | Analysis |  |           | Register Alloc|
         +---------------------------+  +---------------------------+
                     |                              |
                     v                              v
               Intermediate                    Target Code
              Representation (IR)

Front-End¶

The front-end is source-language dependent and target-machine independent:

Lexical analysis
Syntax analysis
Semantic analysis
IR generation

The front-end produces an intermediate representation that captures the meaning of the program without committing to any particular target machine.

Back-End¶

The back-end is source-language independent and target-machine dependent:

Machine-independent optimization
Machine-dependent optimization
Instruction selection
Register allocation
Instruction scheduling

The $M \times N$ Problem¶

Without an IR, supporting $M$ source languages and $N$ target machines requires $M \times N$ compilers. With a shared IR, you need only $M$ front-ends and $N$ back-ends:

Without IR:  M x N compilers needed

  C    ---->  x86
  C    ---->  ARM
  C    ---->  MIPS
  Java ---->  x86
  Java ---->  ARM
  Java ---->  MIPS
  Rust ---->  x86
  Rust ---->  ARM
  Rust ---->  MIPS

  3 languages x 3 targets = 9 compilers


With IR:  M + N components needed

  C    ---->  \
  Java ---->   ----> IR ----> x86
  Rust ---->  /          ---> ARM
                         ---> MIPS

  3 front-ends + 3 back-ends = 6 components

This is exactly the design philosophy behind LLVM: many front-ends (Clang for C/C++, rustc for Rust, swiftc for Swift) share a common IR (LLVM IR) and common back-ends.

7. Single-Pass vs. Multi-Pass Compilation¶

Single-Pass Compiler¶

A single-pass compiler processes the source code exactly once, performing all phases simultaneously as it reads the input from left to right.

Advantages: - Fast compilation (only one pass over the source) - Low memory usage - Simple implementation

Disadvantages: - Limited optimization opportunities - Language must be designed for single-pass compilation (e.g., forward declarations in C/Pascal) - Cannot perform global analysis

Example: Early Pascal compilers were single-pass. This is why Pascal requires forward declarations:

{ Pascal: must declare before use }
procedure B; forward;  { forward declaration needed }

procedure A;
begin
    B;  { calls B, which hasn't been fully defined yet }
end;

procedure B;
begin
    A;
end;

Multi-Pass Compiler¶

A multi-pass compiler processes the source code multiple times, each pass performing a specific transformation.

Advantages: - Better optimization (global view of the program) - Cleaner separation of concerns - Language doesn't need forward declarations

Disadvantages: - Slower compilation - Higher memory usage (must store intermediate representations)

Example: Modern compilers like GCC and Clang use many passes:

GCC compilation passes (simplified):
  Pass 1: Preprocessing (cpp)
  Pass 2: Parsing -> GENERIC IR
  Pass 3: GENERIC -> GIMPLE (lowering)
  Pass 4: GIMPLE optimizations (SSA, constant prop, DCE, ...)
  Pass 5: GIMPLE -> RTL (Register Transfer Language)
  Pass 6: RTL optimizations (CSE, loop opt, register alloc)
  Pass 7: RTL -> Assembly

8. Bootstrapping¶

Bootstrapping is the process of writing a compiler for a language using the language itself. This creates a chicken-and-egg problem: how do you compile the compiler if the compiler doesn't exist yet?

The Bootstrap Process¶

Step 1: Write a simple compiler C0 for language L in an existing language E
        (e.g., write a C compiler in assembly)

        C0 written in E,  compiles subset of L

Step 2: Use C0 to compile a better compiler C1 written in L
        (e.g., write a C compiler in C, compile it with C0)

        C1 written in L,  compiled by C0

Step 3: Use C1 to recompile itself
        (the compiler compiles its own source code)

        C1 written in L,  compiled by C1

Step 4: Iterate to improve the compiler
        C2, C3, ... each version compiled by the previous

Historical Example: The First C Compiler¶

Ken Thompson and Dennis Ritchie bootstrapped C as follows:

Wrote a minimal C compiler in PDP-11 assembly (B language actually preceded C)
Wrote a better C compiler in C
Compiled the C compiler with the assembly version
Recompiled the C compiler with itself

Trust Problem (Thompson's Hack)¶

In his 1984 Turing Award lecture "Reflections on Trusting Trust," Ken Thompson demonstrated that a bootstrapped compiler can contain hidden backdoors:

Modify the compiler to insert a backdoor when compiling a specific program (e.g., login)
Modify the compiler to insert modification (1) when compiling itself
Remove the modifications from the source code
The compiled compiler binary still contains both modifications, even though the source is clean

This is why reproducible builds and diverse double-compilation are important for security.

9. Cross-Compilation¶

A cross-compiler runs on one platform (the host) but generates code for a different platform (the target).

$$\text{Cross-compiler}: \text{Source} \xrightarrow{\text{runs on Host}} \text{Target code for Target platform}$$

Terminology¶

We use three terms: - Build: The platform where the compiler is built - Host: The platform where the compiler runs - Target: The platform for which the compiler generates code

Build = Host = Target	Name
All same	Native compiler
Build = Host, different Target	Cross-compiler
All different	Canadian cross

Use Cases¶

Embedded systems: Develop on x86 PC, deploy on ARM microcontroller
Mobile development: Compile on macOS, run on iOS/Android
Operating systems: Compile kernel for a platform that has no compiler yet

# Example: Cross-compiling C for ARM on an x86 Linux host
arm-linux-gnueabihf-gcc -o hello hello.c

# The resulting 'hello' binary runs on ARM, not on the x86 host
file hello
# hello: ELF 32-bit LSB executable, ARM, EABI5, ...

10. T-Diagrams (Tombstone Diagrams)¶

T-diagrams are a visual notation for describing compilers, interpreters, and programs in terms of their source language, target language, and implementation language.

Notation¶

A compiler is drawn as a T-shape:

+-------------------+
|  Source  | Target  |
+----+-----+----+---+
     | Impl     |
     +----------+

Source: Language the compiler accepts
Target: Language the compiler produces
Impl: Language the compiler is written in

A program is drawn as an inverted trapezoid:

+---------+
| Program |
+----+----+
     | Lang
     +----+

An interpreter is drawn as a rounded T:

+-------------------+
|  Language          |
+----+---------+----+
     | Impl    |
     +---------+

Example: Bootstrapping with T-Diagrams¶

Step 1: Write a C compiler in assembly that targets x86.

+-------------------+
|    C     |  x86   |
+----+-----+----+---+
     |   ASM    |
     +----------+

Step 2: Write a C compiler in C that targets x86.

+-------------------+
|    C     |  x86   |
+----+-----+----+---+
     |    C      |
     +----------+

Step 3: Compile the C compiler (Step 2) using the assembly compiler (Step 1). The result is a C compiler written in x86 machine code.

+-------------------+       +-------------------+       +-------------------+
|    C     |  x86   |       |    C     |  x86   |       |    C     |  x86   |
+----+-----+----+---+  ==>  +----+-----+----+---+  ==>  +----+-----+----+---+
     |    C      |                |   ASM    |                |   x86    |
     +----+------+                +----------+                +----------+
          |
    compiled by
          |
     +----+-----+----+
     |    C     | x86 |
     +----+-----+----+
          | ASM  |
          +------+

Python Representation of T-Diagrams¶

from dataclasses import dataclass

@dataclass
class Compiler:
    """Represents a compiler as a T-diagram."""
    source: str      # Source language
    target: str      # Target language
    impl: str        # Implementation language

    def __repr__(self):
        return f"Compiler({self.source} -> {self.target}, written in {self.impl})"

@dataclass
class Program:
    """Represents a program."""
    name: str
    language: str

    def __repr__(self):
        return f"Program({self.name}, in {self.language})"

def compile_with(compiler: Compiler, artifact):
    """
    Simulate compilation:
    - If artifact is a Program in compiler.source, produce Program in compiler.target
    - If artifact is a Compiler in compiler.source, produce Compiler with impl=compiler.target
    """
    if isinstance(artifact, Program):
        if artifact.language != compiler.source:
            raise ValueError(
                f"Cannot compile {artifact.language} program "
                f"with {compiler.source} compiler"
            )
        return Program(artifact.name, compiler.target)

    elif isinstance(artifact, Compiler):
        if artifact.impl != compiler.source:
            raise ValueError(
                f"Cannot compile compiler written in {artifact.impl} "
                f"with {compiler.source} compiler"
            )
        return Compiler(artifact.source, artifact.target, compiler.target)

# Bootstrapping example
print("=== Bootstrapping a C Compiler ===\n")

# Step 1: We have a C compiler written in ASM that targets x86
c_compiler_asm = Compiler("C", "x86", "ASM")
print(f"Step 1 (existing):  {c_compiler_asm}")

# Step 2: We write a C compiler in C that targets x86
c_compiler_in_c = Compiler("C", "x86", "C")
print(f"Step 2 (source):    {c_compiler_in_c}")

# Step 3: Compile the C compiler with the ASM compiler
c_compiler_native = compile_with(c_compiler_asm, c_compiler_in_c)
print(f"Step 3 (compiled):  {c_compiler_native}")

# Step 4: The native compiler can now compile itself
c_compiler_self = compile_with(c_compiler_native, c_compiler_in_c)
print(f"Step 4 (self-comp): {c_compiler_self}")

Output:

=== Bootstrapping a C Compiler ===

Step 1 (existing):  Compiler(C -> x86, written in ASM)
Step 2 (source):    Compiler(C -> x86, written in C)
Step 3 (compiled):  Compiler(C -> x86, written in x86)
Step 4 (self-comp): Compiler(C -> x86, written in x86)

11. Compiler Construction Tools¶

Over the decades, many tools have been developed to automate parts of compiler construction:

Lexer Generators¶

Tool	Input	Output	Notes
Lex	Regular expressions	C scanner	Original Unix tool (1975)
Flex	Regular expressions	C scanner	Fast Lex (GNU replacement)
re2c	Regular expressions	C/C++ scanner	Generates direct code, no tables
ANTLR	Grammar	Java/Python/... scanner+parser	Combined lexer+parser
PLY	Python regex	Python scanner	Python Lex-Yacc

Parser Generators¶

Tool	Grammar type	Output	Notes
Yacc	LALR(1)	C parser	Original Unix tool (1975)
Bison	LALR(1)/GLR	C/C++/Java parser	GNU replacement for Yacc
ANTLR	LL(*)	Multi-language parser	Popular, user-friendly
Lark	Earley/LALR	Python parser	Modern Python tool
tree-sitter	LR(1)/GLR	C parser + bindings	Incremental parsing for editors

Compiler Frameworks¶

Framework	Type	Notes
LLVM	Compiler infrastructure	Modular, widely used
GCC	Compiler collection	Mature, many targets
Cranelift	Code generator	Used in Wasmtime, fast compile times
QBE	Compiler backend	Lightweight alternative to LLVM
MLIR	IR framework	Multi-level IR (part of LLVM project)

12. Why Study Compilers?¶

Even if you never build a production compiler, compiler techniques appear everywhere in software engineering:

Configuration file parsers: YAML, TOML, JSON, INI parsers use lexer+parser techniques
Query languages: SQL, GraphQL, Elasticsearch queries are compiled/interpreted
Template engines: Jinja2, Handlebars, JSX all involve parsing and code generation
Build systems: Make, CMake, Bazel parse their own DSLs
Regular expressions: Every regex engine implements finite automata
IDEs: Syntax highlighting, code completion, and refactoring use compiler front-ends
Static analysis: Linters and security scanners perform semantic analysis
Code generation: ORMs, protocol buffers, and API generators produce code from specifications
Optimization: Understanding compiler optimizations helps you write faster code
Formal verification: Type systems and program analysis build on compiler techniques

Summary¶

A compiler translates source code from one language to another while preserving semantics
Compilation proceeds through well-defined phases: lexical analysis, parsing, semantic analysis, IR generation, optimization, and code generation
The front-end (language-dependent) and back-end (machine-dependent) are separated by an intermediate representation
Bootstrapping allows a compiler to compile itself, but raises trust issues
Cross-compilation generates code for a platform different from the one the compiler runs on
T-diagrams provide a visual notation for reasoning about compilers, programs, and their relationships
Modern compilers like GCC and LLVM/Clang use dozens of passes over multiple intermediate representations
Compiler techniques are broadly applicable beyond compiler construction itself

Exercises¶

Exercise 1: Phase Identification¶

For the following program, identify what each compiler phase would do. Write down the output of each phase (tokens, AST structure, type checking results, three-address code, optimized code).

float area;
float radius = 5.0;
area = 3.14159 * radius * radius;

Exercise 2: Compiler vs. Interpreter¶

Consider the following scenarios. For each, explain whether a compiler, interpreter, or hybrid approach would be most appropriate and why:

A scripting language for automating system administration tasks
A language for writing high-performance game engines
A configuration language for specifying build rules
A language running in a web browser (sandboxed)

Exercise 3: T-Diagrams¶

Draw T-diagrams for the following scenario:

You have: - A Python interpreter written in C that runs on x86 - A C compiler written in C that targets x86

Show how you would: 1. Run a Python program on an x86 machine 2. Compile the C compiler using itself 3. Create a cross-compiler that runs on x86 but targets ARM

Exercise 4: Bootstrapping Sequence¶

Suppose you are creating a new language called "Nova" and you want to write the Nova compiler in Nova itself. Describe a concrete bootstrapping strategy with at least three steps. What is the minimum you need to implement in an existing language?

Exercise 5: Front-End / Back-End Separation¶

A company has compilers for 4 source languages targeting 3 architectures. How many compiler components are needed: 1. Without a shared IR? 2. With a shared IR?

If they add 2 more languages and 2 more architectures, how do the numbers change in each case?

Exercise 6: Compilation Phases in Practice¶

Use Python's dis module and ast module to explore how CPython compiles Python code:

import ast
import dis

source = "x = [i**2 for i in range(10)]"

# 1. Parse to AST
tree = ast.parse(source)
print(ast.dump(tree, indent=2))

# 2. Compile to bytecode
code = compile(source, "<string>", "exec")

# 3. Disassemble
dis.dis(code)

Study the output and identify which compiler phases are visible. What optimizations (if any) did CPython perform?

Next: Lexical Analysis | Overview