05. Temporal Difference Learning

Difficulty: ⭐⭐⭐ (Intermediate)

Learning Objectives

  • Understand basic concepts of TD learning and the TD(0) algorithm
  • Grasp TD Target and bootstrapping concepts
  • Compare differences between TD, MC, and DP
  • Understand TD's bias-variance tradeoff
  • Learn n-step TD and TD(λ)

1. What is TD Learning?

1.1 Overview

Temporal Difference (TD) learning combines DP's bootstrapping with MC's sampling.

     DP: V(s) ← E[R + γV(s')]                  (model required, bootstrap)
     MC: V(s) ← Average(G)                     (model-free, full return)
     TD: V(s) ← V(s) + α[R + γV(s') - V(s)]    (model-free, bootstrap)

1.2 TD vs MC vs DP Comparison

Feature             DP           MC                    TD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Environment model   Required     Not required          Not required
Bootstrapping       Yes          No                    Yes
Sampling            No           Yes                   Yes
Update timing       Every step   Episode termination   Every step
Continuous tasks    Yes          No                    Yes
Bias                Yes          No                    Yes
Variance            Low          High                  Medium

2. TD(0) Algorithm

2.1 TD Target

TD(0) updates the current state's value using the immediate reward and the estimated value of the next state.

TD Target: $R_{t+1} + \gamma V(S_{t+1})$

TD Error (δ): $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$

Update Rule: $$V(S_t) \leftarrow V(S_t) + \alpha \cdot \delta_t$$ $$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
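As a quick sanity check, the snippet below runs a single TD(0) update with illustrative numbers (chosen for this example, not taken from the text):

# One TD(0) update with illustrative values
V_s, V_s_next = 0.5, 0.8        # current estimates V(S_t), V(S_{t+1})
reward, gamma, alpha = 1.0, 0.9, 0.1

td_target = reward + gamma * V_s_next   # 1.72
td_error = td_target - V_s              # 1.22
V_s = V_s + alpha * td_error            # 0.622

print(f"target={td_target:.2f}, delta={td_error:.2f}, new V={V_s:.3f}")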

2.2 TD(0) Implementation

import numpy as np
import gymnasium as gym
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=10000, alpha=0.1, gamma=0.99):
    """
    TD(0) Policy Evaluation

    Args:
        env: Gymnasium environment
        policy: Policy function policy(state) -> action
        n_episodes: Number of episodes
        alpha: Learning rate
        gamma: Discount factor

    Returns:
        V: State value function
    """
    V = defaultdict(float)

    for episode in range(n_episodes):
        state, _ = env.reset()
        done = False

        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # TD(0) update
            if done:
                td_target = reward  # treat V(next state) as 0 at episode end (truncation is handled the same way here for simplicity)
            else:
                td_target = reward + gamma * V[next_state]

            td_error = td_target - V[state]
            V[state] = V[state] + alpha * td_error

            state = next_state

        if (episode + 1) % 1000 == 0:
            print(f"Episode {episode + 1}/{n_episodes}")

    return dict(V)


# Usage example
env = gym.make('CliffWalking-v0')

def random_policy(state):
    return env.action_space.sample()

V = td0_prediction(env, random_policy, n_episodes=5000)
print(f"Number of learned states: {len(V)}")

2.3 TD(0) vs MC Comparison Visualization

import matplotlib.pyplot as plt
import numpy as np

def compare_td_mc(env, policy, n_episodes=500, alpha=0.1, gamma=0.99):
    """Compare learning curves of TD(0) and MC"""

    # TD(0)
    V_td = defaultdict(float)
    td_estimates = []   # value estimate of the start state after each episode

    # MC
    V_mc = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    mc_estimates = []

    for episode in range(n_episodes):
        # Generate episode
        episode_data = []
        state, _ = env.reset()
        done = False

        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_data.append((state, action, reward, next_state, done))
            state = next_state

        # TD(0) update (per transition; applied after episode generation so both methods see the same data)
        for state, action, reward, next_state, done in episode_data:
            if done:
                td_target = reward
            else:
                td_target = reward + gamma * V_td[next_state]
            V_td[state] += alpha * (td_target - V_td[state])

        # MC update (offline)
        G = 0
        for state, action, reward, _, _ in reversed(episode_data):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_count[state] += 1
            V_mc[state] = returns_sum[state] / returns_count[state]

        # Record the start-state value estimate after each episode (for comparison)
        start_state = episode_data[0][0]
        td_estimates.append(V_td[start_state])
        mc_estimates.append(V_mc[start_state])

    return td_estimates, mc_estimates


# Visualize learning curves
# env = gym.make('CliffWalking-v0')
# td_curve, mc_curve = compare_td_mc(env, random_policy)
#
# plt.figure(figsize=(10, 5))
# plt.plot(td_curve, label='TD(0)', alpha=0.7)
# plt.plot(mc_curve, label='MC', alpha=0.7)
# plt.xlabel('Episode')
# plt.ylabel('Value Estimate')
# plt.legend()
# plt.title('TD(0) vs MC Learning Curves')
# plt.show()

3. Bootstrapping

3.1 Concept

Bootstrapping means updating an estimate on the basis of other estimates, rather than waiting for a complete return.

MC:  V(s) ← V(s) + α[G - V(s)]
     uses the actual return G (no bootstrapping)

TD:  V(s) ← V(s) + α[R + γV(s') - V(s)]
     uses the estimate V(s') (bootstrapping)
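To make the contrast concrete, the small sketch below builds both targets from the same hypothetical transition data (the rewards and the value estimate are made up for illustration):

# MC vs TD targets from the same (hypothetical) trajectory
gamma = 0.9
rewards = [1.0, 0.0, 2.0]          # R_1, R_2, R_3 observed until termination
V_next = 0.4                       # current estimate V(S_1)

mc_target = sum(gamma**k * r for k, r in enumerate(rewards))   # full return G_0 = 2.62
td_target = rewards[0] + gamma * V_next                        # bootstrapped target = 1.36

print(round(mc_target, 2), round(td_target, 2))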

3.2 Impact of Bootstrapping

"""
Advantages of bootstrapping:
1. Can learn before episode termination
2. Applicable to continuous tasks
3. Lower variance (use only one-step reward)

Disadvantages of bootstrapping:
1. Introduces bias (propagates if V(s') is inaccurate)
2. Sensitive to initial estimates
3. Convergence guarantee more complex than MC
"""

# Bootstrapping visualization
def visualize_bootstrapping():
    """
    MC: S₀ → S₁ → S₂ → S₃ → Terminal (G = r₁ + γr₂ + γ²r₃ + γ³r₄)
                                       └── Use full return

    TD: S₀ → S₁ → S₂ → ...
        V(S₀) ← R₁ + γV(S₁)
                     └── Use estimate (bootstrap)
    """
    pass

4. Meaning of TD Error

4.1 TD Error Analysis

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

  • δ > 0: Better than expected → Increase V(s)
  • δ < 0: Worse than expected → Decrease V(s)
  • δ = 0: Matches expectations → No change (a small numeric check follows below)
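A tiny numeric check of the three cases, with values made up for illustration:

# Sign of the TD error for three hypothetical transitions, with V(s) = 0.5 and γ = 0.9
V_s, gamma = 0.5, 0.9
cases = {
    "better than expected": (1.0, 0.5),   # (reward, V(s'))
    "as expected":          (0.5, 0.0),
    "worse than expected":  (0.0, 0.1),
}

for name, (r, V_next) in cases.items():
    delta = r + gamma * V_next - V_s
    print(f"{name}: delta = {delta:+.2f}")   # +0.95, +0.00, -0.41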

4.2 TD Error and Neuroscience

Dopamine neuron responses are similar to TD Error!

Unexpected reward → Dopamine increase (δ > 0)
Expected reward received → No dopamine change (δ ≈ 0)
Expected reward not received → Dopamine decrease (δ < 0)

→ TD learning may be similar to brain learning mechanisms

5. Advantages of TD Learning

5.1 Random Walk Example

def random_walk_comparison():
    """
    Compare TD and MC on Random Walk

    Environment: [Left terminal] - A - B - C - D - E - [Right terminal]
    Each episode starts at C and moves left or right with equal probability.
    Reaching the left terminal gives reward 0, the right terminal reward 1.
    """
    import numpy as np

    # States: 0=left terminal, 1-5=A-E, 6=right terminal
    n_states = 7
    true_values = np.array([0, 1/6, 2/6, 3/6, 4/6, 5/6, 1])  # True values

    def run_episode():
        """Generate one episode as a list of (state, reward) pairs.

        Shown for illustration; the TD loop below samples transitions inline.
        """
        state = 3  # start at C
        episode = []

        while 0 < state < 6:
            state += -1 if np.random.random() < 0.5 else 1  # move left or right with equal probability
            reward = 1 if state == 6 else 0
            episode.append((state, reward))

        return episode

    # TD(0) learning
    V_td = np.full(n_states, 0.5)
    V_td[0] = V_td[6] = 0

    alpha = 0.1
    n_episodes = 100

    for _ in range(n_episodes):
        state = 3
        while 0 < state < 6:
            if np.random.random() < 0.5:
                next_state = state - 1
            else:
                next_state = state + 1

            reward = 1 if next_state == 6 else 0
            # Undiscounted TD(0) update (γ = 1, as in the classic random walk example)
            V_td[state] += alpha * (reward + V_td[next_state] - V_td[state])
            state = next_state

    print("True Values:", true_values[1:6])
    print("TD Estimates:", V_td[1:6].round(3))
    print("TD RMSE:", np.sqrt(np.mean((V_td[1:6] - true_values[1:6])**2)))

    return V_td


V_td = random_walk_comparison()

5.2 Advantages of TD Summary

Advantage                 Description
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Online learning           Can learn before episode termination
Continuous tasks          Applicable to tasks without termination
Lower variance            Uses only the one-step reward
Incremental improvement   Can improve the policy in real time

6. Batch TD vs Batch MC

6.1 Batch Learning

When learning repeatedly from the same dataset, TD and MC converge to different values.

def batch_td_mc_comparison():
    """
    Compare TD and MC in batch learning

    Example data (γ = 1):
    Episode 1: A → B → Terminal (rewards 0, 0)
    Episode 2: B → Terminal (reward 1)

    Batch MC: V(A) = 0,   V(B) = 1/2   (fits the observed returns exactly)
    Batch TD: V(B) = 1/2, V(A) = 0 + V(B) = 1/2   (A always transitions to B with reward 0)
    """
    # Simple example
    episodes = [
        [('A', 'B', 0), ('B', 'terminal', 0)],  # A → B → Terminal(0)
        [('B', 'terminal', 1)]                   # B → Terminal(1)
    ]

    # Batch MC
    V_mc = {'A': 0, 'B': 0}
    returns = {'A': [], 'B': []}

    for ep in episodes:
        G = 0
        for state, next_state, reward in reversed(ep):
            G = reward + G
            if state != 'terminal':
                returns[state].append(G)

    for state in V_mc:
        if returns[state]:
            V_mc[state] = np.mean(returns[state])

    # Batch TD (iterate)
    V_td = {'A': 0, 'B': 0, 'terminal': 0}
    alpha = 0.1

    for _ in range(100):  # Batch iterations
        for ep in episodes:
            for state, next_state, reward in ep:
                if state != 'terminal':
                    V_td[state] += alpha * (reward + V_td[next_state] - V_td[state])

    print("Batch MC:", V_mc)
    print("Batch TD:", {k: round(v, 3) for k, v in V_td.items() if k != 'terminal'})


batch_td_mc_comparison()

6.2 Convergence Properties

  • MC: Converges to the estimates that minimize mean squared error on the observed returns
  • TD: Converges to the value function of the maximum-likelihood MDP built from the data (the certainty-equivalence estimate); a small check of this on the Section 6.1 data follows below
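The sketch below reconstructs this for the two-episode example of Section 6.1 (γ = 1): it estimates the MDP from the data and solves it, which reproduces the batch TD answer rather than the batch MC one.

# Certainty-equivalence estimate from the Section 6.1 batch data (γ = 1)
# Observed: from A, one transition A → B with reward 0;
#           from B, two terminal transitions with rewards 0 and 1.
p_A_to_B = 1.0                      # estimated P(B | A)
r_from_A = 0.0                      # average observed reward leaving A
r_from_B = (0.0 + 1.0) / 2          # average observed reward leaving B

V_B = r_from_B                      # B always terminates in the data
V_A = r_from_A + p_A_to_B * V_B     # Bellman equation of the estimated MDP

print(V_A, V_B)   # 0.5 0.5 — matches batch TD; batch MC instead gives V(A) = 0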

7. n-step TD

7.1 Concept

While TD(0) bootstraps after a single step, n-step TD accumulates n actual rewards before bootstrapping with $V(S_{t+n})$.

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

1-step: G_t^(1) = R_{t+1} + γV(S_{t+1})                            TD(0)
2-step: G_t^(2) = R_{t+1} + γR_{t+2} + γ²V(S_{t+2})
3-step: G_t^(3) = R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³V(S_{t+3})
...
∞-step: G_t^(∞) = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...             MC
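As a small helper sketch with made-up inputs (for simplicity the same bootstrap value is used for every n, whereas in practice it would be the value of the corresponding state S_{t+n}), the n-step return can be computed directly from a list of rewards:

def n_step_return(rewards, v_boot, n, gamma=0.99):
    """n-step return G_t^(n) from rewards R_{t+1}..R_{t+n} and a bootstrap value."""
    G = sum(gamma**i * r for i, r in enumerate(rewards[:n]))
    return G + gamma**n * v_boot

rewards = [1.0, 0.0, 2.0]                          # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
print(n_step_return(rewards, v_boot=0.5, n=1))     # 1-step (TD(0)-style) target
print(n_step_return(rewards, v_boot=0.5, n=3))     # 3-step target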

7.2 n-step TD Implementation

def n_step_td(env, policy, n=3, n_episodes=1000, alpha=0.1, gamma=0.99):
    """
    n-step TD Policy Evaluation

    Args:
        n: Number of steps (n=1 is TD(0), n=∞ is MC)
    """
    V = defaultdict(float)

    for episode in range(n_episodes):
        states = []
        rewards = [0]  # R_0 = 0 (not used)
        T = float('inf')  # Terminal time
        t = 0

        state, _ = env.reset()
        states.append(state)

        while True:
            if t < T:
                action = policy(state)
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated

                states.append(next_state)
                rewards.append(reward)

                if done:
                    T = t + 1

                state = next_state

            tau = t - n + 1  # Time to update

            if tau >= 0:
                # Calculate n-step return
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))

                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]

                # Update
                V[states[tau]] += alpha * (G - V[states[tau]])

            t += 1

            if tau == T - 1:
                break

    return dict(V)

7.3 Choice of n

n value          Characteristics
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
n = 1 (TD(0))    High bias, low variance
n = ∞ (MC)       No bias, high variance
Intermediate n   Balance point (the optimal n depends on the environment)

8. TD(λ) - Eligibility Traces

8.1 Concept

TD(λ) uses a weighted average of all n-step returns, called the λ-return.

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

  • λ=0: TD(0)
  • λ=1: MC
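A quick check of the weights $(1-\lambda)\lambda^{n-1}$ makes these two special cases concrete (for λ = 1 every finite-n weight vanishes and, in the episodic case, all the mass goes to the full MC return):

import numpy as np

def lambda_weights(lambd, n_max):
    """Weights (1 - λ)·λ^(n-1) placed on the n-step returns G_t^(n), n = 1..n_max."""
    n = np.arange(1, n_max + 1)
    return (1 - lambd) * lambd ** (n - 1)

for lam in [0.0, 0.5, 0.9]:
    w = lambda_weights(lam, n_max=50)
    print(f"λ={lam}: first weights {w[:3].round(3)}, sum ≈ {w.sum():.3f}")
# λ=0 puts all weight on the 1-step return (TD(0)); larger λ spreads weight over longer returns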

8.2 Eligibility Trace

The λ-return can be computed efficiently online by tracking the "eligibility" of each state.

def td_lambda(env, policy, lambd=0.8, n_episodes=1000, alpha=0.1, gamma=0.99):
    """
    TD(λ) with Eligibility Traces

    Args:
        lambd: λ value (0 ≤ λ ≤ 1)
    """
    V = defaultdict(float)

    for episode in range(n_episodes):
        # Initialize eligibility trace
        E = defaultdict(float)

        state, _ = env.reset()
        done = False

        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # TD error
            if done:
                delta = reward - V[state]
            else:
                delta = reward + gamma * V[next_state] - V[state]

            # Increase eligibility for current state
            E[state] += 1  # accumulating traces

            # Update all states
            for s in E:
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lambd  # Decay trace

            state = next_state

    return dict(V)

8.3 Types of Eligibility Traces

"""
1. Accumulating traces
   E(s) ← E(s) + 1       (add 1 on each visit)

2. Replacing traces
   E(s) ← 1              (reset to 1 on each visit)

3. Dutch traces
   E(s) ← (1-α)E(s) + 1
"""

def accumulating_trace(E, state):
    E[state] += 1
    return E

def replacing_trace(E, state):
    E[state] = 1
    return E

def dutch_trace(E, state, alpha):
    E[state] = (1 - alpha) * E[state] + 1
    return E
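As a small usage check (ignoring the γλ decay that would normally be applied between updates), the sketch below shows how the three trace types defined above grow over three consecutive visits to the same state:

from collections import defaultdict

E_acc, E_rep, E_dutch = defaultdict(float), defaultdict(float), defaultdict(float)

for _ in range(3):  # three consecutive visits to state 's', no decay in between
    accumulating_trace(E_acc, 's')
    replacing_trace(E_rep, 's')
    dutch_trace(E_dutch, 's', alpha=0.1)

print(E_acc['s'], E_rep['s'], round(E_dutch['s'], 3))   # 3.0, 1, 2.71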

9. Example: Cliff Walking

import gymnasium as gym
import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt

def cliff_walking_td():
    """TD learning in Cliff Walking"""

    env = gym.make('CliffWalking-v0')

    # State: 0-47 (4x12 grid)
    # Actions: 0=up, 1=right, 2=down, 3=left

    Q = defaultdict(lambda: np.zeros(4))
    epsilon = 0.1
    alpha = 0.5
    gamma = 0.99
    n_episodes = 500

    episode_rewards = []

    for ep in range(n_episodes):
        state, _ = env.reset()
        done = False
        total_reward = 0

        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward

            # Q-learning update (detailed in next lesson)
            best_next = np.max(Q[next_state]) if not done else 0
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

            state = next_state

        episode_rewards.append(total_reward)

        if (ep + 1) % 100 == 0:
            avg = np.mean(episode_rewards[-100:])
            print(f"Episode {ep + 1}: avg_reward = {avg:.1f}")

    # Visualize learned policy
    print("\n=== Learned Policy (4x12 grid) ===")
    arrows = {0: '^', 1: '>', 2: 'v', 3: '<'}

    for row in range(4):
        line = ""
        for col in range(12):
            state = row * 12 + col
            if state == 36:  # Start
                line += " S "
            elif state == 47:  # Goal
                line += " G "
            elif 37 <= state <= 46:  # Cliff
                line += " C "
            else:
                action = np.argmax(Q[state])
                line += f" {arrows[action]} "
        print(line)

    env.close()
    return Q, episode_rewards


Q, rewards = cliff_walking_td()

10. Summary

Key Formulas

Method      Update Rule
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TD(0)       V(s) ← V(s) + α[R + γV(s') - V(s)]
n-step TD   V(s) ← V(s) + α[G^(n) - V(s)]
TD(λ)       V(s) ← V(s) + α·δ·E(s)

TD Error

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

Method Comparison

                    TD(0)        n-step        MC
                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bias                High         Medium        None
Variance            Low          Medium        High
Update timing       Every step   After n steps Episode end
Continuous tasks    Possible     Possible      Impossible

11. Practice Problems

  1. TD Error: What is the TD error when V(s)=5, R=1, γ=0.9, V(s')=6?

  2. n-step: Write the return formula for n=2.

  3. TD(λ): What is the weight ratio of 1-step and 2-step returns when λ=0.5?

  4. Eligibility Trace: What is the value of accumulating trace if state s is visited 2 times consecutively?


Next Steps

In the next lesson 06_Q_Learning_SARSA.md, we'll learn Q-Learning and SARSA algorithms that apply TD to control.


References

  • Sutton & Barto, "Reinforcement Learning: An Introduction", Chapters 6, 7
  • David Silver's RL Course, Lectures 4 & 5
  • Gymnasium CliffWalking