Reinforcement Learning (RL) Overview

Introduction

This folder contains materials for studying reinforcement learning (RL) systematically, from the basics through advanced topics. It covers the core concepts and algorithms of RL, in which an agent learns to maximize reward by interacting with an environment.

Target Audience

- Learners who already understand the basics of machine learning / deep learning
- Developers interested in game AI, robotics, autonomous driving, and similar fields
- Anyone who wants to understand how technologies such as AlphaGo and ChatGPT (RLHF) work

Prerequisites

- Required: Python programming, basic probability and statistics
- Recommended: the Deep_Learning folder materials, PyTorch basics
Learning Roadmap

```
RL Fundamentals (01-04)
  RL Introduction (01) → MDP & Bellman (02) → Dynamic Programming (03) → Monte Carlo Methods (04)
        │
        ▼
Value-Based Methods (05-07)
  TD Learning (05) → Q-Learning & SARSA (06) → Deep Q-Network (07)
        │
        ▼
Policy-Based Methods (08-10)
  Policy Gradient (08) → Actor-Critic A2C/A3C (09) → PPO & TRPO (10)
        │
        ▼
Advanced Topics (11-12)
  Multi-Agent RL (11)    Practical RL Project (12)
```
File List

| No. | File | Topic | Difficulty | Main Contents |
|-----|------|-------|------------|---------------|
| 00 | Overview.md | Overview | - | Study guide, roadmap, environment setup |
| 01 | RL_Introduction.md | Introduction to RL | ★ | Agent-environment interaction, rewards, episodic/continuing tasks |
| 02 | MDP_Basics.md | MDP basics | ★★ | Markov Decision Process, Bellman equations, V/Q functions |
| 03 | Dynamic_Programming.md | Dynamic programming | ★★ | Policy iteration, value iteration, limitations of DP |
| 04 | Monte_Carlo_Methods.md | Monte Carlo methods | ★★ | Sample-based learning, first-visit/every-visit MC |
| 05 | TD_Learning.md | TD learning | ★★★ | TD(0), TD target, bootstrapping, TD vs MC |
| 06 | Q_Learning_SARSA.md | Q-Learning & SARSA | ★★★ | Off-policy, on-policy, epsilon-greedy |
| 07 | Deep_Q_Network.md | DQN | ★★★ | Experience replay, target network, Double/Dueling DQN |
| 08 | Policy_Gradient.md | Policy gradient | ★★★★ | REINFORCE, baselines, policy gradient theorem |
| 09 | Actor_Critic.md | Actor-Critic | ★★★★ | A2C, A3C, advantage function, GAE |
| 10 | PPO_TRPO.md | PPO & TRPO | ★★★★ | Clipping, KL divergence, Proximal Policy Optimization |
| 11 | Multi_Agent_RL.md | Multi-agent RL | ★★★★ | Cooperation/competition, self-play, MARL algorithms |
| 12 | Practical_RL_Project.md | Practical project | ★★★★ | Gymnasium environments, Atari games, capstone project |
| 13 | Model_Based_RL.md | Model-based RL | ★★★★ | Dyna architecture, world models, MBPO, MuZero, Dreamer |
| 14 | Soft_Actor_Critic.md | SAC | ★★★★ | Maximum entropy RL, automatic temperature tuning, continuous control |
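File 03 covers policy iteration and value iteration; as a small taste, here is a minimal value-iteration sketch. The two-state MDP below (its states, actions, and rewards) is invented purely for illustration:

```python
# Minimal value-iteration sketch on a hypothetical 2-state MDP.
GAMMA, THETA = 0.9, 1e-8  # discount factor, convergence threshold

# P[state][action] = list of (probability, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},  # state 0: stay, or move for reward 1
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.5)]},  # state 1: go back, or stay for reward 0.5
}

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        # Bellman optimality backup: best expected return over actions
        v_new = max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < THETA:  # updates have converged
        break

print({s: round(V[s], 3) for s in P})  # → {0: 5.5, 1: 5.0}
```

Since γ < 1, each sweep is a contraction, so the loop is guaranteed to terminate; here V(1) converges to 0.5/(1-0.9) = 5.0 and V(0) to 1 + 0.9·5.0 = 5.5.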
Difficulty Guide

| Difficulty | Description | Estimated Study Time |
|------------|-------------|----------------------|
| ★ | Introductory: focused on understanding concepts | 1-2 hours |
| ★★ | Basic: mathematical foundations and basic algorithms | 2-3 hours |
| ★★★ | Intermediate: implementing the core algorithms | 3-4 hours |
| ★★★★ | Advanced: state-of-the-art algorithms and practical applications | 4-6 hours |
Environment Setup

Installing Required Packages

```shell
# Base environment
pip install gymnasium
pip install torch torchvision
pip install numpy matplotlib

# Extras (Atari games, etc.)
pip install "gymnasium[atari]"
pip install "gymnasium[accept-rom-license]"

# Multi-agent RL
pip install pettingzoo

# Visualization and logging
pip install tensorboard
pip install wandb  # optional
```
Testing the Environment

```python
import gymnasium as gym
import torch

# Gymnasium test
env = gym.make("CartPole-v1", render_mode="human")
observation, info = env.reset()
for _ in range(100):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()
env.close()

# PyTorch test
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```
Recommended Development Tools

| Tool | Purpose | Installation |
|------|---------|--------------|
| Jupyter Notebook | Experiments and visualization | pip install jupyter |
| VS Code | Code editing | Official website |
| TensorBoard | Training monitoring | pip install tensorboard |
Recommended Study Order

Stage 1: Building the Foundations (1-2 weeks)
- 01_RL_Introduction.md - understand the basic concepts of RL
- 02_MDP_Basics.md - study MDPs and the Bellman equations
- 03_Dynamic_Programming.md - understand policy/value iteration
- 04_Monte_Carlo_Methods.md - introduction to sample-based learning

Stage 2: Value-Based Methods (2-3 weeks)
- 05_TD_Learning.md - the core idea of TD learning
- 06_Q_Learning_SARSA.md - tabular Q-Learning
- 07_Deep_Q_Network.md - combining deep learning with RL

Stage 3: Policy-Based Methods (2-3 weeks)
- 08_Policy_Gradient.md - direct policy optimization
- 09_Actor_Critic.md - combining value and policy learning
- 10_PPO_TRPO.md - stable policy learning

Stage 4: Advanced Topics (2 weeks)
- 11_Multi_Agent_RL.md - multi-agent environments
- 12_Practical_RL_Project.md - capstone project
Algorithm Comparison

| Algorithm | Type | On/Off-Policy | Continuous Actions | Characteristics |
|-----------|------|---------------|--------------------|-----------------|
| Q-Learning | Value-based | Off | X | Simple, table-based |
| SARSA | Value-based | On | X | More conservative (safer) learning |
| DQN | Value-based | Off | X | Combines deep learning |
| REINFORCE | Policy-based | On | O | Direct policy optimization |
| A2C/A3C | Actor-Critic | On | O | Supports distributed training |
| PPO | Actor-Critic | On | O | Stable, general-purpose |
| TRPO | Actor-Critic | On | O | Theoretical guarantees |
| SAC | Actor-Critic | Off | O | Maximum entropy RL |
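The table's distinctions become concrete with tabular Q-learning, the simplest off-policy entry. Below is a minimal sketch with epsilon-greedy exploration; the 5-state chain environment (move left/right, reward 1 at the right end) is invented purely for illustration:

```python
import random

# Hypothetical 5-state chain MDP for illustration.
N_STATES, ACTIONS = 5, [0, 1]          # action 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Deterministic chain dynamics: reward 1 on reaching the right end."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def greedy_action(state):
    """Argmax over Q with random tie-breaking."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

random.seed(0)
for _ in range(500):                   # episodes, each starting at the left end
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        action = random.choice(ACTIONS) if random.random() < EPSILON else greedy_action(state)
        nxt, reward, done = step(state, action)
        # Q-learning target: max over next actions (off-policy)
        best_next = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)  # learned greedy policy: move right in every non-terminal state
```

SARSA would differ only in the target line, using the Q-value of the action actually taken next instead of the max (on-policy).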
References

Textbooks
- Sutton & Barto, "Reinforcement Learning: An Introduction" (2nd Edition) - free PDF
- Deep RL: "Spinning Up in Deep RL" by OpenAI - link

Online Courses
- David Silver's RL Course (DeepMind/UCL)
- CS285: Deep Reinforcement Learning (UC Berkeley)
- Hugging Face Deep RL Course
Libraries
- Gymnasium, PyTorch, PettingZoo, TensorBoard, wandb (see the package list under "Environment Setup")
Key Terminology

| Term | Description |
|------|-------------|
| Agent | The entity that learns by interacting with the environment |
| Environment | The world in which the agent acts |
| State | The current situation of the environment |
| Action | The decision the agent takes |
| Reward | Immediate feedback for an action |
| Policy | The strategy for choosing an action in a given state |
| Value Function | The long-term value of a state or action |
| Discount Factor (γ) | How much future rewards are worth relative to present ones |
| Episode | One interaction sequence from start to termination |
| Exploration/Exploitation | Trying new actions vs. taking known good ones |
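The discount factor γ above can be made concrete by computing a discounted return for a reward sequence; the rewards below are invented for illustration:

```python
# Discounted return: G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ...
GAMMA = 0.9

def discounted_return(rewards, gamma=GAMMA):
    """Fold from the end of the episode: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]  # hypothetical episode rewards
print(discounted_return(rewards))  # ≈ 8.29  (1 + 0.9**3 * 10)
```

A large reward three steps away is worth only 0.9³ ≈ 0.73 of its face value now; the smaller γ is, the more short-sighted the agent.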
Related Folders
- Deep_Learning/: deep learning basics (neural networks, CNN, RNN)
- Machine_Learning/: machine learning basics (supervised/unsupervised learning)
- Python/: advanced Python
- Statistics/: probability and statistics

Last updated: 2026-02