01. Introduction to Reinforcement Learning¶
Difficulty: ⭐ (Beginner)
Learning Objectives¶
- Understand the definition and characteristics of reinforcement learning
- Grasp the agent-environment interaction paradigm
- Learn the concepts of rewards, episodes, and continuous tasks
- Understand differences from supervised/unsupervised learning
1. What is Reinforcement Learning?¶
1.1 Definition¶
Reinforcement Learning (RL) is a branch of machine learning where an Agent learns action policies that maximize Rewards through interaction with an Environment.
┌─────────────────────────────────────────────────────┐
│             Reinforcement Learning Loop             │
└─────────────────────────────────────────────────────┘

                      Action
           ┌─────────────────────────┐
           │                         ▼
     ┌───────────┐            ┌─────────────┐
     │           │            │             │
     │   Agent   │            │ Environment │
     │           │            │             │
     └───────────┘            └─────────────┘
           ▲                         │
           └─────────────────────────┘
                 State + Reward
1.2 Key Characteristics¶
- Trial and Error Learning: Learn directly from actions and their consequences
- Delayed Reward: Consider not only immediate rewards but also future rewards
- Exploration vs Exploitation: Balance between trying new things and using existing knowledge
- Sequential Decision Making: Consider the cumulative effect of consecutive decisions
1.3 Comparison with Machine Learning Paradigms¶
| Feature | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Data | Labeled | Unlabeled | Reward signal |
| Feedback | Immediate correct answer | None | Delayed reward |
| Goal | Prediction/Classification | Pattern/Structure discovery | Maximize cumulative reward |
| Example | Image classification | Clustering | Game playing |
2. Agent-Environment Interaction¶
2.1 Basic Components¶
# Basic elements of reinforcement learning
class RLComponents:
    """
    1. State (s): Current situation of the environment
       - Game: screen pixels, score, position, etc.
       - Robot: joint angles, velocity, sensor values, etc.

    2. Action (a): Actions the agent can take
       - Discrete: up/down/left/right movement
       - Continuous: motor torque values

    3. Reward (r): Numerical feedback for actions
       - Positive: encourage good actions
       - Negative: discourage bad actions

    4. Policy (π): Strategy for selecting actions in states
       - Deterministic: π(s) = a
       - Stochastic: π(a|s) = P(A=a|S=s)

    5. Value Function (V, Q): Estimate of long-term value
       - V(s): value of a state
       - Q(s,a): value of a state-action pair
    """
    pass
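To make the policy distinction above concrete, here is a minimal sketch. The state dictionary and the 0/1 action encoding are made up for illustration: a deterministic policy maps each state to exactly one action, while a stochastic policy samples from the state-conditional distribution π(a|s).

```python
import random

def deterministic_policy(state):
    # π(s) = a: always push right (1) when the pole leans right
    return 1 if state["pole_angle"] > 0 else 0

def stochastic_policy(state):
    # π(a|s): push right with probability 0.8 when the pole leans right
    p_right = 0.8 if state["pole_angle"] > 0 else 0.2
    return 1 if random.random() < p_right else 0

state = {"pole_angle": 0.05}
print(deterministic_policy(state))  # always 1 for this state
print(stochastic_policy(state))     # usually 1, sometimes 0
```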
2.2 Interaction Process¶
import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1")

# Initialize
state, info = env.reset()
total_reward = 0
done = False

while not done:
    # 1. Agent selects action (random here)
    action = env.action_space.sample()

    # 2. Apply action to environment
    next_state, reward, terminated, truncated, info = env.step(action)

    # 3. Accumulate reward
    total_reward += reward

    # 4. Update state
    state = next_state

    # 5. Check termination condition
    done = terminated or truncated

print(f"Total reward: {total_reward}")
env.close()
2.3 Gymnasium Environment Structure¶
import gymnasium as gym
# Check environment information
env = gym.make("CartPole-v1")
print("=== Environment Info ===")
print(f"Observation space: {env.observation_space}")
# Box([-4.8, -inf, -0.418, -inf], [4.8, inf, 0.418, inf], (4,), float32)
print(f"Action space: {env.action_space}")
# Discrete(2) - 0: left, 1: right
print(f"Reward range: {env.reward_range}")
# (-inf, inf)
# Meaning of observation space (CartPole)
# [cart position, cart velocity, pole angle, pole angular velocity]
3. Reward¶
3.1 Role of Reward¶
Reward is a scalar signal that tells the agent which behaviors are good and which are bad.
# Reward design examples
class RewardExamples:
    """
    Game example:
    - Score gained: +10
    - Enemy defeated: +100
    - Goal reached: +1000
    - Hit: -50
    - Game over: -100

    Robot example:
    - Moving toward goal: +1
    - Collision with obstacle: -10
    - Energy consumption: -0.1
    - Goal reached: +100
    """
    pass
3.2 Reward Hypothesis¶
"All goals can be described as the maximization of expected cumulative reward." - Richard Sutton
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots$$
3.3 Discounted Return¶
Apply a discount factor γ (0 ≤ γ ≤ 1) to future rewards to convert them to present value.
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
def calculate_return(rewards, gamma=0.99):
    """
    Calculate the discounted cumulative reward.

    Args:
        rewards: Reward sequence [r1, r2, r3, ...]
        gamma: Discount factor (0 ~ 1)

    Returns:
        G: Discounted cumulative reward
    """
    G = 0
    # Work backwards: G_t = r_t+1 + gamma * G_t+1
    for r in reversed(rewards):
        G = r + gamma * G
    return G
# Example
rewards = [1, 1, 1, 1, 10] # Large reward at the end
gamma = 0.9
G = calculate_return(rewards, gamma)
print(f"Discounted cumulative reward (γ={gamma}): {G:.2f}")
# 1 + 0.9 + 0.81 + 0.729 + 6.561 = 10.0
3.4 Meaning of Discount Factor¶
| γ value | Characteristic | Application |
|---|---|---|
| γ = 0 | Myopic (consider only immediate reward) | When short-term optimization needed |
| γ = 0.9 | Moderately consider future rewards | General cases |
| γ = 0.99 | Long-term perspective | When episodes are long |
| γ = 1 | Equal evaluation of future rewards | Only for episodic tasks |
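The table's effect can be checked numerically. This small sketch re-implements the discounted return from section 3.3 and evaluates the same reward sequence under several γ values: at γ = 0 only the first reward counts, and at γ = 1 the return is the plain sum.

```python
def discounted_return(rewards, gamma):
    # Accumulate from the last reward backwards: G = r + gamma * G
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1, 1, 1, 1, 10]
for gamma in (0.0, 0.9, 0.99, 1.0):
    print(f"gamma={gamma}: G = {discounted_return(rewards, gamma):.3f}")
# gamma=0.0 gives 1.000 (myopic), gamma=1.0 gives 14.000 (plain sum)
```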
4. Episodes and Continuous Tasks¶
4.1 Episodic Tasks¶
Tasks with clear start and end points.
# Episodic task example: one game round
import gymnasium as gym

def episodic_task_example():
    env = gym.make("CartPole-v1")
    episodes = 10

    for episode in range(episodes):
        state, _ = env.reset()  # Episode start
        episode_reward = 0
        step = 0

        while True:
            action = env.action_space.sample()
            state, reward, terminated, truncated, _ = env.step(action)
            episode_reward += reward
            step += 1

            # Episode termination condition
            if terminated or truncated:
                print(f"Episode {episode + 1}: "
                      f"Steps = {step}, Reward = {episode_reward}")
                break

    env.close()
# Episodic task examples:
# - Games (start → game over or clear)
# - Maze escape (start point → exit)
# - Go/Chess (game start → win/loss/draw)
4.2 Continuing Tasks¶
Tasks that continue indefinitely without termination.
# Continuing task example: server load management
def continuing_task_example():
    """
    Continuing tasks have no natural termination point
    - Server load balancing
    - Temperature control system
    - Stock trading
    - Robot walking (infinite walking)
    """
    # Artificial step limit for simulation
    max_steps = 10000
    state = initialize_system()

    for step in range(max_steps):
        action = select_action(state)
        next_state, reward = environment_step(action)

        # Agent update
        update_agent(state, action, reward, next_state)
        state = next_state

    # Discount factor γ < 1 required for continuing tasks
    # (γ = 1 leads to infinite reward → divergence)

def initialize_system():
    return None  # placeholder

def select_action(state):
    return None  # placeholder

def environment_step(action):
    return None, 0  # placeholder

def update_agent(*args):
    pass  # placeholder
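The comment about γ = 1 diverging can be verified numerically. With a constant reward of 1 per step, the undiscounted return grows without bound as steps accumulate, while the discounted return converges to the geometric-series limit 1/(1 − γ).

```python
# Constant reward of 1 per step: accumulate the discounted return and
# compare it against the closed-form limit 1 / (1 - gamma).
gamma = 0.99
g, discount = 0.0, 1.0
for _ in range(10_000):
    g += discount * 1.0
    discount *= gamma

print(f"Discounted return after 10k steps: {g:.2f}")  # ~100.00
print(f"Limit 1/(1-gamma): {1 / (1 - gamma):.2f}")    # 100.00
```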
4.3 Comparison¶
| Feature | Episodic | Continuing |
|---|---|---|
| Termination | Natural termination point | None (artificial truncation possible) |
| Return | Finite (γ=1 possible) | Possibly infinite (γ<1 required) |
| Examples | Games, mazes, dialogues | Server management, trading |
| Learning | Episode-based updates | Continuous updates |
5. Exploration vs Exploitation¶
5.1 The Dilemma¶
- Exploration: Try new actions to discover better strategies
- Exploitation: Perform the best action known so far
import numpy as np

class EpsilonGreedy:
    """
    ε-greedy strategy: Most basic exploration-exploitation balance method
    """
    def __init__(self, n_actions, epsilon=0.1):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.q_values = np.zeros(n_actions)  # Estimated value of each action
        self.action_counts = np.zeros(n_actions)

    def select_action(self):
        """
        With probability ε: random action (exploration)
        With probability 1-ε: best action (exploitation)
        """
        if np.random.random() < self.epsilon:
            # Exploration: random action
            return np.random.randint(self.n_actions)
        else:
            # Exploitation: highest value action
            return np.argmax(self.q_values)

    def update(self, action, reward):
        """Update action value (incremental average)"""
        self.action_counts[action] += 1
        n = self.action_counts[action]
        self.q_values[action] += (reward - self.q_values[action]) / n
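A quick way to see ε-greedy working is a toy multi-armed bandit. The sketch below is self-contained (it inlines the same select/update logic as the class above) and uses made-up Gaussian arms; with enough steps the value estimates concentrate on the best arm.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.1, 0.5, 0.9]          # arm 2 pays the most on average
epsilon, n_actions = 0.1, 3
q = np.zeros(n_actions)               # estimated value per arm
counts = np.zeros(n_actions)

for t in range(2000):
    if rng.random() < epsilon:
        a = int(rng.integers(n_actions))      # explore: random arm
    else:
        a = int(np.argmax(q))                 # exploit: best-known arm
    reward = rng.normal(true_means[a], 0.1)   # noisy payoff
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]       # incremental average

print("Estimated values:", np.round(q, 2))
print("Best arm found:", int(np.argmax(q)))   # expect arm 2
```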
5.2 Exploration Strategies¶
class ExplorationStrategies:
    """Various exploration strategies"""

    @staticmethod
    def epsilon_greedy(q_values, epsilon):
        """ε-greedy"""
        if np.random.random() < epsilon:
            return np.random.randint(len(q_values))
        return np.argmax(q_values)

    @staticmethod
    def softmax(q_values, temperature=1.0):
        """
        Softmax (Boltzmann) exploration
        - High temperature: more exploration
        - Low temperature: more exploitation
        """
        exp_q = np.exp(q_values / temperature)
        probs = exp_q / np.sum(exp_q)
        return np.random.choice(len(q_values), p=probs)

    @staticmethod
    def ucb(q_values, action_counts, t, c=2.0):
        """
        Upper Confidence Bound (UCB)
        - Prioritize exploration of actions with high uncertainty
        """
        ucb_values = q_values + c * np.sqrt(np.log(t + 1) / (action_counts + 1e-5))
        return np.argmax(ucb_values)
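To illustrate the temperature comment in the softmax strategy, the sketch below computes the action distribution for fixed, made-up Q-values at several temperatures: low temperature is nearly greedy, high temperature is nearly uniform. It also subtracts the maximum before exponentiating, a standard numerical-stability tweak.

```python
import numpy as np

def softmax_probs(q_values, temperature):
    # Subtract the max for numerical stability before exponentiating
    z = (q_values - np.max(q_values)) / temperature
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 2.0, 3.0])
for t in (0.1, 1.0, 10.0):
    print(f"T={t}: {np.round(softmax_probs(q, t), 3)}")
# T=0.1 puts almost all mass on action 2; T=10 is close to uniform
```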
5.3 Epsilon Decay¶
class EpsilonDecay:
    """Decrease exploration rate over time"""

    def __init__(self, start=1.0, end=0.01, decay=0.995):
        self.epsilon = start
        self.end = end
        self.decay = decay

    def get_epsilon(self):
        return self.epsilon

    def update(self):
        """Call at the end of each episode"""
        self.epsilon = max(self.end, self.epsilon * self.decay)

# Usage example
epsilon_scheduler = EpsilonDecay(start=1.0, end=0.01, decay=0.995)

for episode in range(1000):
    epsilon = epsilon_scheduler.get_epsilon()
    # ... run episode ...
    epsilon_scheduler.update()

    if episode % 100 == 0:
        print(f"Episode {episode}: ε = {epsilon:.4f}")
6. Real-World Applications¶
6.1 Game AI¶
"""
AlphaGo (DeepMind, 2016)
- Defeated human world champion in Go
- Monte Carlo Tree Search + Deep Learning + RL
- Learning through self-play
OpenAI Five (2019)
- Defeated professional team in Dota 2
- Distributed PPO algorithm
- Trained with approximately 45,000 years of gameplay
AlphaStar (DeepMind, 2019)
- Achieved StarCraft II Grandmaster rank
- Multi-agent RL + League Training
"""
6.2 Robotics¶
"""
Robot Locomotion
- Learning in physics simulation then transferring to real robots
- Sim-to-Real Transfer
Robot Manipulation
- Object grasping, assembly, etc.
- Continuous action space (joint torques)
Autonomous Driving
- Lane keeping, obstacle avoidance
- Simulation-based learning due to safety importance
"""
6.3 LLM and RLHF¶
"""
RLHF (Reinforcement Learning from Human Feedback)
- Core learning method for ChatGPT, Claude, etc.
- Learning reward model from human feedback
- Fine-tuning language model with PPO
Process:
1. Pre-trained LLM
2. Humans rank response quality
3. Train reward model from ranking data
4. Fine-tune LLM with RL using reward model
"""
7. Challenges in Reinforcement Learning¶
7.1 Major Issues¶
| Problem | Description | Solution Direction |
|---|---|---|
| Sample Efficiency | Requires many experiences for learning | Model-based RL, transfer learning |
| Credit Assignment | Which actions contributed to reward | TD learning, GAE |
| Stability | Learning can be unstable | PPO, TRPO, and stabilization techniques |
| Reward Design | Defining correct rewards is difficult | Inverse RL, reward shaping |
| Safety | Preventing dangerous actions | Constrained RL, Safe RL |
7.2 Reward Hacking¶
"""
Phenomenon where agents maximize rewards in unintended ways
Examples:
- Cleaning robot: hiding trash (mistaken as cleaning completed)
- Boat racing: circling to collect power-up items only
- Tetris: pausing game to avoid game over
Lesson: Reward design must be careful!
"""
8. Summary¶
Key Concepts Summary¶
- Reinforcement Learning: Agent learns to maximize rewards through environment interaction
- MDP Components: State, action, reward, transition probability, discount factor
- Reward: Signal indicating good or bad actions
- Discount Factor (γ): Determines present value of future rewards
- Exploration-Exploitation: Balance between trying new things and known best
Formula Summary¶
| Concept | Formula |
|---|---|
| Cumulative Reward | $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ |
| State Value | $V^\pi(s) = \mathbb{E}_\pi[G_t \| S_t = s]$ |
| Action Value | $Q^\pi(s,a) = \mathbb{E}_\pi[G_t \| S_t = s, A_t = a]$ |
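The value-function definitions in the table can be sanity-checked by Monte Carlo: averaging sampled returns estimates the expectation. The toy process below (assumed purely for illustration) pays a Bernoulli(p) reward for three steps, so the exact value is V = p·(1 + γ + γ²), and the empirical average should recover it.

```python
import random

random.seed(0)
p, gamma, n_episodes = 0.5, 0.9, 100_000

def sample_return():
    # One episode: 3 steps, each paying Bernoulli(p), discounted by gamma
    g, discount = 0.0, 1.0
    for _ in range(3):
        r = 1.0 if random.random() < p else 0.0
        g += discount * r
        discount *= gamma
    return g

v_estimate = sum(sample_return() for _ in range(n_episodes)) / n_episodes
v_exact = p * (1 + gamma + gamma**2)
print(f"MC estimate: {v_estimate:.3f}, exact: {v_exact:.3f}")
```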
9. Exercises¶
1. Concept Check: Explain 3 main differences between supervised learning and reinforcement learning.
2. Discount Factor Calculation: Calculate the discounted cumulative reward for the reward sequence [1, 2, 3, 4, 5] with γ = 0.9.
3. Exploration-Exploitation: With ε = 0.2 in the ε-greedy strategy, how many exploration actions are expected in 100 actions?
4. Reward Design: Design an appropriate reward function for a maze escape problem.
Next Steps¶
In the next lesson 02_MDP.md, we will learn about Markov Decision Processes (MDP) and Bellman equations, which are the mathematical foundations of reinforcement learning.
References¶
- Sutton & Barto, "Reinforcement Learning: An Introduction", Chapter 1
- David Silver's RL Course, Lecture 1: Introduction to RL
- Gymnasium Documentation