Reinforcement Learning Overview
Introduction
This folder contains materials for learning Reinforcement Learning (RL) systematically, from the basics to advanced topics. It covers the core concepts and algorithms of RL, in which an agent learns to maximize cumulative reward by interacting with an environment.
Target Audience
- Learners with foundational knowledge in machine learning/deep learning
- Developers interested in game AI, robotics, autonomous driving, etc.
- Those who want to understand the techniques behind systems such as AlphaGo and ChatGPT (RLHF)
Prerequisites
- Required: Python programming, basic probability/statistics
- Recommended: the lessons in the Deep_Learning folder, PyTorch basics
Learning Roadmap
RL Foundations (01-04)
    RL Intro (01) → MDP & Bellman (02) → Dynamic Programming (03) → Monte Carlo Methods (04)
        ↓
Value-based Methods (05-07)
    TD Learning (05) → Q-Learning & SARSA (06) → Deep Q-Network (07)
        ↓
Policy-based Methods (08-10)
    Policy Gradient (08) → Actor-Critic A2C/A3C (09) → PPO & TRPO (10)
        ↓
Advanced Topics (11-12)
    Multi-Agent RL (11)  |  Practical Project (12)
File List
| # | Filename | Topic | Difficulty | Key Content |
|---|---|---|---|---|
| 00 | Overview.md | Overview | - | Learning guide, roadmap, environment setup |
| 01 | RL_Introduction.md | RL Intro | ⭐ | Agent-environment interaction, rewards, episodic/continuing tasks |
| 02 | MDP_Basics.md | MDP Basics | ⭐⭐ | Markov Decision Process, Bellman equations, V/Q functions |
| 03 | Dynamic_Programming.md | Dynamic Programming | ⭐⭐ | Policy iteration, value iteration, DP limitations |
| 04 | Monte_Carlo_Methods.md | Monte Carlo Methods | ⭐⭐ | Sample-based learning, First-visit/Every-visit MC |
| 05 | TD_Learning.md | TD Learning | ⭐⭐⭐ | TD(0), TD Target, Bootstrapping, TD vs MC |
| 06 | Q_Learning_SARSA.md | Q-Learning & SARSA | ⭐⭐⭐ | Off-policy, On-policy, Epsilon-greedy |
| 07 | Deep_Q_Network.md | DQN | ⭐⭐⭐ | Experience Replay, Target Network, Double/Dueling DQN |
| 08 | Policy_Gradient.md | Policy Gradient | ⭐⭐⭐⭐ | REINFORCE, Baseline, policy gradient theorem |
| 09 | Actor_Critic.md | Actor-Critic | ⭐⭐⭐⭐ | A2C, A3C, Advantage function, GAE |
| 10 | PPO_TRPO.md | PPO & TRPO | ⭐⭐⭐⭐ | Clipping, KL Divergence, Proximal Policy Optimization |
| 11 | Multi_Agent_RL.md | Multi-Agent RL | ⭐⭐⭐⭐ | Cooperation/Competition, Self-Play, MARL algorithms |
| 12 | Practical_RL_Project.md | Practical Projects | ⭐⭐⭐⭐ | Gymnasium environments, Atari games, comprehensive projects |
| 13 | Model_Based_RL.md | Model-Based RL | ⭐⭐⭐⭐ | Dyna architecture, world models, MBPO, MuZero, Dreamer |
| 14 | Soft_Actor_Critic.md | SAC | ⭐⭐⭐⭐ | Maximum entropy RL, auto temperature, continuous control |
Difficulty Guide
| Difficulty | Description | Expected Study Time |
|---|---|---|
| ⭐ | Beginner - Focus on concepts | 1-2 hours |
| ⭐⭐ | Basics - Mathematical foundations and basic algorithms | 2-3 hours |
| ⭐⭐⭐ | Intermediate - Core algorithm implementation | 3-4 hours |
| ⭐⭐⭐⭐ | Advanced - Latest algorithms and practical applications | 4-6 hours |
Environment Setup
Installing Required Packages
# Basic environment
pip install gymnasium
pip install torch torchvision
pip install numpy matplotlib
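# Additional environments (Atari games, etc.)
pip install "gymnasium[atari]"
# Older Gymnasium releases only: recent ale-py versions ship the ROMs,
# so this extra may be unnecessary or unavailable on a current install
pip install "gymnasium[accept-rom-license]"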
# Multi-agent RL
pip install pettingzoo
# Visualization and logging
pip install tensorboard
pip install wandb # optional
Environment Testing
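import gymnasium as gym
import torch

# Gymnasium test: take 100 random steps in CartPole.
# render_mode="human" opens a window and needs pygame
# (pip install "gymnasium[classic-control]"); omit the argument to run headless.
env = gym.make("CartPole-v1", render_mode="human")
observation, info = env.reset()
for _ in range(100):
    action = env.action_space.sample()  # sample a random action
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        # The episode ended (failure state or time limit), so start a new one
        observation, info = env.reset()
env.close()

# PyTorch test
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")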
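The loop above handles `terminated` and `truncated` identically, but the two flags mean different things: the first marks a true end state of the task, the second an external cutoff such as a time limit. The sketch below illustrates the difference with a toy environment that only imitates the `reset`/`step` return shapes used above. `CorridorEnv` and `run_episode` are hypothetical names invented for this illustration and are not part of Gymnasium.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, finish on reaching cell `length`."""

    def __init__(self, length=3, max_steps=10):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action):
        # Action 1 moves right; any other action moves left (clamped at cell 0).
        self.position = max(0, self.position + (1 if action == 1 else -1))
        self.steps += 1
        terminated = self.position == self.length  # goal reached: a true end state
        truncated = not terminated and self.steps >= self.max_steps  # time limit hit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return (total reward, terminated, truncated)."""
    observation, info = env.reset()
    total = 0.0
    while True:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total += reward
        if terminated or truncated:
            return total, terminated, truncated


env = CorridorEnv()
print(run_episode(env, lambda obs: 1))  # always right: (1.0, True, False)
print(run_episode(env, lambda obs: 0))  # always left:  (0.0, False, True)
```

The distinction matters later for value-based methods, which typically bootstrap from the next state after a truncation but not after a termination.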
Recommended Development Environment
| Tool | Purpose | Installation |
|---|---|---|
| Jupyter Notebook | Experimentation and visualization | pip install jupyter |
| VS Code | Code editing | Official website |
| TensorBoard | Training monitoring | pip install tensorboard |
Recommended Learning Order
Stage 1: Building Foundations (1-2 weeks)
- 01_RL_Introduction.md - Understanding basic RL concepts
- 02_MDP_Basics.md - Learning MDP and Bellman equations
- 03_Dynamic_Programming.md - Understanding policy/value iteration
- 04_Monte_Carlo_Methods.md - Introduction to sample-based learning
Stage 2: Value-based Methods (2-3 weeks)
- 05_TD_Learning.md - Core principles of TD learning
- 06_Q_Learning_SARSA.md - Tabular Q-Learning and SARSA
- 07_Deep_Q_Network.md - Combining deep learning with RL
Stage 3: Policy-based Methods (2-3 weeks)
- 08_Policy_Gradient.md - Direct policy optimization
- 09_Actor_Critic.md - Combining value and policy
- 10_PPO_TRPO.md - Stable policy learning
Stage 4: Advanced Topics (3 weeks)
- 11_Multi_Agent_RL.md - Multi-agent environments
- 12_Practical_RL_Project.md - Comprehensive project execution
- 13_Model_Based_RL.md - Planning with learned models
- 14_Soft_Actor_Critic.md - Maximum entropy for continuous control
Algorithm Comparison
| Algorithm | Type | On/Off-Policy | Continuous Actions | Features |
|---|---|---|---|---|
| Q-Learning | Value-based | Off | No | Simple, tabular |
| SARSA | Value-based | On | No | Safer, more conservative learning |
| DQN | Value-based | Off | No | Deep learning integration |
| REINFORCE | Policy-based | On | Yes | Direct policy optimization |
| A2C/A3C | Actor-Critic | On | Yes | Parallel (distributed) learning |
| PPO | Actor-Critic | On | Yes | Stable, versatile |
| TRPO | Actor-Critic | On | Yes | Theoretical guarantees |
| SAC | Actor-Critic | Off | Yes | Maximum entropy RL |
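The On/Off-Policy column is easiest to see in the tabular case. The sketch below contrasts the standard Q-Learning and SARSA update targets. The function names, the dictionary-based Q-table, and all numbers are assumptions made for this illustration, not code from the lessons.

```python
from collections import defaultdict


def q_learning_target(Q, reward, next_state, actions, gamma):
    """Off-policy target: bootstrap from the highest-valued next action."""
    return reward + gamma * max(Q[(next_state, a)] for a in actions)


def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the next action actually chosen."""
    return reward + gamma * Q[(next_state, next_action)]


def td_update(Q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha toward the target."""
    Q[(state, action)] += alpha * (target - Q[(state, action)])


# Illustrative Q-table: in state "s1", "right" looks better than "left".
actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 3.0

# The agent moved from "s0" to "s1" with reward 1.0, then explored with "left".
q_target = q_learning_target(Q, reward=1.0, next_state="s1", actions=actions, gamma=0.5)
s_target = sarsa_target(Q, reward=1.0, next_state="s1", next_action="left", gamma=0.5)
print(q_target)  # 2.5: uses max over next actions, ignoring the exploratory move
print(s_target)  # 1.5: uses the exploratory action that was actually taken

td_update(Q, "s0", "right", q_target, alpha=0.5)
print(Q[("s0", "right")])  # 1.25
```

Because SARSA's target includes the cost of exploratory actions, it tends to learn more cautious behavior, which is what the "Features" entry for SARSA refers to.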
References
Textbooks
- Sutton & Barto: "Reinforcement Learning: An Introduction" (2nd Edition), available as a free PDF
- OpenAI: "Spinning Up in Deep RL"
Online Courses
- David Silver's RL Course (DeepMind/UCL)
- CS285: Deep Reinforcement Learning (UC Berkeley)
- Hugging Face Deep RL Course
Libraries
Key Terms
| Term | Description |
|---|---|
| Agent | Entity that learns through interaction with the environment |
| Environment | World in which the agent acts |
| State | Current situation of the environment |
| Action | Decision made by the agent |
| Reward | Immediate feedback for an action |
| Policy | Strategy for selecting actions in each state |
| Value Function | Expected long-term return of states or state-action pairs |
| Discount Factor (γ) | Weight between 0 and 1 that scales down future rewards relative to immediate ones |
| Episode | One interaction sequence from start to termination |
| Exploration/Exploitation | Trying new actions vs choosing actions already known to be good |
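Two of these terms are simple enough to show in a few lines. The sketch below computes a discounted return for one episode and selects an action with epsilon-greedy exploration. The function names and the sample numbers are assumptions made for this illustration.

```python
import random


def discounted_return(rewards, gamma):
    """Sum the rewards of one episode, weighting the k-th reward by gamma**k."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: uniform random action
    # Exploitation: index of the largest estimated value
    return max(range(len(q_values)), key=q_values.__getitem__)


episode_rewards = [1.0, 1.0, 1.0]
print(discounted_return(episode_rewards, gamma=1.0))  # 3.0
print(discounted_return(episode_rewards, gamma=0.5))  # 1.75

rng = random.Random(0)
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))  # 1 (always greedy)
```

A smaller γ makes the agent short-sighted, and a larger epsilon makes it explore more. Both are hyperparameters tuned per task.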
Related Folders
- Deep_Learning/: Deep learning basics (neural networks, CNN, RNN)
- Machine_Learning/: Machine learning basics (supervised/unsupervised learning)
- Python/: Advanced Python syntax
- Statistics/: Probability and statistics
Last updated: 2026-02