08. ์ •์ฑ… ๊ฒฝ์‚ฌ (Policy Gradient)

08. ์ •์ฑ… ๊ฒฝ์‚ฌ (Policy Gradient)

๋‚œ์ด๋„: โญโญโญโญ (๊ณ ๊ธ‰)

ํ•™์Šต ๋ชฉํ‘œ

  • ์ •์ฑ… ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•์˜ ์žฅ๋‹จ์  ์ดํ•ด
  • ์ •์ฑ… ๊ฒฝ์‚ฌ ์ •๋ฆฌ (Policy Gradient Theorem) ์œ ๋„
  • REINFORCE ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„
  • Baseline์„ ํ†ตํ•œ ๋ถ„์‚ฐ ๊ฐ์†Œ ๊ธฐ๋ฒ•
  • Actor-Critic์œผ๋กœ์˜ ์—ฐ๊ฒฐ

1. ๊ฐ€์น˜ ๊ธฐ๋ฐ˜ vs ์ •์ฑ… ๊ธฐ๋ฐ˜

1.1 ๋น„๊ต

ํŠน์„ฑ ๊ฐ€์น˜ ๊ธฐ๋ฐ˜ (DQN) ์ •์ฑ… ๊ธฐ๋ฐ˜
ํ•™์Šต ๋Œ€์ƒ Q(s, a) ฯ€(a|s)
์ •์ฑ… ๋„์ถœ Q์—์„œ ๊ฐ„์ ‘ ์œ ๋„ ์ง์ ‘ ํ•™์Šต
ํ–‰๋™ ๊ณต๊ฐ„ ์ด์‚ฐ (์ฃผ๋กœ) ์ด์‚ฐ + ์—ฐ์†
ํ™•๋ฅ ์  ์ •์ฑ… ์–ด๋ ค์›€ ์ž์—ฐ์Šค๋Ÿฌ์›€
์ˆ˜๋ ด ๋ถˆ์•ˆ์ • ๊ฐ€๋Šฅ ์ง€์—ญ ์ตœ์ 

1.2 ์ •์ฑ… ๊ธฐ๋ฐ˜์˜ ์žฅ์ 

1. ์—ฐ์† ํ–‰๋™ ๊ณต๊ฐ„ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ (๋กœ๋ด‡ ์ œ์–ด)
2. ํ™•๋ฅ ์  ์ •์ฑ… ํ•™์Šต ๊ฐ€๋Šฅ (๊ฐ€์œ„๋ฐ”์œ„๋ณด)
3. ์ •์ฑ… ๊ณต๊ฐ„์ด ๋” ๋‹จ์ˆœํ•  ์ˆ˜ ์žˆ์Œ
4. ๋” ๋‚˜์€ ์ˆ˜๋ ด ๋ณด์žฅ (์ผ๋ถ€ ๊ฒฝ์šฐ)

2. ์ •์ฑ…์˜ ํŒŒ๋ผ๋ฏธํ„ฐํ™”

2.1 ์†Œํ”„ํŠธ๋งฅ์Šค ์ •์ฑ… (์ด์‚ฐ ํ–‰๋™)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscretePolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state):
        logits = self.network(state)
        return F.softmax(logits, dim=-1)

    def get_action(self, state):
        probs = self.forward(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob
```
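The sampling logic inside get_action can be checked in isolation. A minimal sketch with made-up action probabilities:

```python
import torch

probs = torch.tensor([0.7, 0.2, 0.1])   # softmax output for 3 hypothetical actions
dist = torch.distributions.Categorical(probs)

action = dist.sample()                   # tensor(0), tensor(1), or tensor(2)
log_prob = dist.log_prob(action)         # log π(a)

# log_prob is exactly the log of the chosen entry's probability
assert torch.isclose(log_prob, probs[action].log())
```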

2.2 ๊ฐ€์šฐ์‹œ์•ˆ ์ •์ฑ… (์—ฐ์† ํ–‰๋™)

```python
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean_layer = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        features = self.shared(state)
        mean = self.mean_layer(features)
        std = self.log_std.exp()
        return mean, std

    def get_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)
        return action, log_prob
```
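Because the action dimensions are modeled as independent Gaussians, the joint log-probability is the sum of the per-dimension log-densities, which is why get_action calls .sum(-1). A self-contained check (the mean and std values are arbitrary placeholders):

```python
import math
import torch

mean = torch.tensor([0.0, 1.0])           # placeholder policy outputs (2-D action)
std = torch.tensor([1.0, 0.5])
dist = torch.distributions.Normal(mean, std)

action = dist.sample()                    # one 2-D action vector
log_prob = dist.log_prob(action).sum(-1)  # joint log-density = sum over dimensions

# Cross-check against the Gaussian log-density written out by hand
manual = (-0.5 * ((action - mean) / std) ** 2
          - std.log() - 0.5 * math.log(2 * math.pi)).sum()
assert torch.isclose(log_prob, manual)
```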

3. ์ •์ฑ… ๊ฒฝ์‚ฌ ์ •๋ฆฌ

3.1 ๋ชฉํ‘œ ํ•จ์ˆ˜

์ •์ฑ… ฯ€_ฮธ์˜ ์„ฑ๋Šฅ์„ ์ตœ๋Œ€ํ™”:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]$$

์—ฌ๊ธฐ์„œ ฯ„ = (sโ‚€, aโ‚€, rโ‚€, sโ‚, aโ‚, rโ‚, ...) ๋Š” ๊ถค์ (trajectory)

3.2 ์ •์ฑ… ๊ฒฝ์‚ฌ ์ •๋ฆฌ

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$

์ง๊ด€์  ํ•ด์„: - ์ข‹์€ ๊ฒฐ๊ณผ(๋†’์€ G_t)๋ฅผ ๊ฐ€์ ธ์˜จ ํ–‰๋™์˜ ํ™•๋ฅ ์„ ๋†’์ž„ - ๋‚˜์œ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ ธ์˜จ ํ–‰๋™์˜ ํ™•๋ฅ ์„ ๋‚ฎ์ถค

3.3 ์œ ๋„ (Log-derivative trick)

โˆ‡_ฮธ ฯ€(a|s;ฮธ) = ฯ€(a|s;ฮธ) ยท โˆ‡_ฮธ log ฯ€(a|s;ฮธ)

๋”ฐ๋ผ์„œ:
โˆ‡_ฮธ J(ฮธ) = E[R ยท โˆ‡_ฮธ log ฯ€(a|s;ฮธ)]
         = E[โˆ‡_ฮธ log ฯ€(a|s;ฮธ) ยท R]

4. The REINFORCE Algorithm

4.1 Basic REINFORCE

REINFORCE is the Monte Carlo policy gradient method: it samples whole episodes and uses the observed returns G_t to weight the gradient estimate.

```python
class REINFORCE:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.policy = DiscretePolicy(state_dim, action_dim)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma

        # Per-episode storage
        self.log_probs = []
        self.rewards = []

    def choose_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob = self.policy.get_action(state_tensor)
        self.log_probs.append(log_prob)
        return action

    def store_reward(self, reward):
        self.rewards.append(reward)

    def compute_returns(self):
        """Compute discounted returns."""
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns)
        # Normalize (optional but recommended)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self):
        returns = self.compute_returns()

        # Policy loss
        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)  # negative sign: gradient ascent

        loss = torch.stack(policy_loss).sum()

        # Gradient step
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Clear episode data
        self.log_probs = []
        self.rewards = []

        return loss.item()
```

4.2 ํ•™์Šต ๋ฃจํ”„

```python
import gymnasium as gym
import numpy as np

def train_reinforce(env_name='CartPole-v1', n_episodes=1000):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    agent = REINFORCE(state_dim, action_dim, lr=1e-3)

    scores = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False

        while not done:
            action = agent.choose_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            agent.store_reward(reward)
            state = next_state
            total_reward += reward

        # Update once the episode ends
        loss = agent.update()
        scores.append(total_reward)

        if (episode + 1) % 100 == 0:
            print(f"Episode {episode + 1}, Avg Score: {np.mean(scores[-100:]):.2f}")

    return agent, scores
```

5. Variance Reduction with a Baseline

5.1 The Variance Problem

The REINFORCE gradient estimate has high variance:

$$\mathrm{Var}(\nabla_\theta J) \propto \mathbb{E}[(G - b)^2]$$

5.2 Introducing a Baseline

Subtracting a constant b does not change the expectation of the gradient, but it does reduce the variance:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b) \right]$$

The best choice of baseline is the state value, b = V(s).

```python
class REINFORCEWithBaseline:
    def __init__(self, state_dim, action_dim, lr_policy=1e-3, lr_value=1e-3, gamma=0.99):
        self.policy = DiscretePolicy(state_dim, action_dim)
        self.value = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

        self.policy_optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr_policy)
        self.value_optimizer = torch.optim.Adam(self.value.parameters(), lr=lr_value)
        self.gamma = gamma

        self.log_probs = []
        self.values = []
        self.rewards = []
        self.states = []

    def choose_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0)

        # Sample an action from the policy
        action, log_prob = self.policy.get_action(state_tensor)

        # Predict the state value (the baseline)
        value = self.value(state_tensor)

        self.log_probs.append(log_prob)
        self.values.append(value)
        self.states.append(state_tensor)

        return action

    def store_reward(self, reward):
        self.rewards.append(reward)

    def compute_returns(self):
        """Discounted returns; left unnormalized because they are also the value targets."""
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        return torch.tensor(returns, dtype=torch.float32)

    def update(self):
        returns = self.compute_returns()

        values = torch.cat(self.values).squeeze(-1)
        log_probs = torch.stack(self.log_probs)

        # Advantage = Return - Baseline (Value)
        advantages = returns - values.detach()

        # Policy loss
        policy_loss = -(log_probs * advantages).mean()

        # Value loss
        value_loss = F.mse_loss(values, returns)

        # Update the policy
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        # Update the value function
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()

        # Clear episode data
        self.log_probs = []
        self.values = []
        self.rewards = []
        self.states = []

        return policy_loss.item(), value_loss.item()
```

6. ์—ฐ์† ํ–‰๋™ ๊ณต๊ฐ„ ์˜ˆ์ œ

6.1 ์—ฐ์† ํ–‰๋™ REINFORCE

```python
class ContinuousREINFORCE:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.policy = GaussianPolicy(state_dim, action_dim)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma

        self.log_probs = []
        self.rewards = []

    def choose_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob = self.policy.get_action(state_tensor)

        self.log_probs.append(log_prob)
        # Drop the batch dimension, keep the action dimension for env.step
        return action.detach().numpy().squeeze(0)

    def compute_returns(self):
        """Discounted, normalized returns (same as in REINFORCE)."""
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self):
        returns = self.compute_returns()

        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)

        loss = torch.stack(policy_loss).sum()

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        self.log_probs = []
        self.rewards = []
```

6.2 MountainCarContinuous Example

```python
def train_continuous():
    env = gym.make('MountainCarContinuous-v0')
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]

    agent = ContinuousREINFORCE(state_dim, action_dim, lr=1e-3)

    for episode in range(500):
        state, _ = env.reset()
        total_reward = 0

        while True:
            action = agent.choose_action(state)
            action = np.clip(action, env.action_space.low, env.action_space.high)

            next_state, reward, terminated, truncated, _ = env.step(action)
            agent.rewards.append(reward)

            state = next_state
            total_reward += reward

            if terminated or truncated:
                break

        agent.update()
        print(f"Episode {episode + 1}, Reward: {total_reward:.2f}")
```

7. ๊ณ ๊ธ‰ ๊ธฐ๋ฒ•

7.1 ์—”ํŠธ๋กœํ”ผ ์ •๊ทœํ™”

ํƒ์ƒ‰์„ ์žฅ๋ คํ•˜๊ธฐ ์œ„ํ•ด ์ •์ฑ…์˜ ์—”ํŠธ๋กœํ”ผ๋ฅผ ์†์‹ค์— ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

```python
def compute_entropy(probs):
    """Entropy of the policy distribution."""
    return -(probs * probs.log()).sum(dim=-1).mean()

# Combined loss: subtracting entropy (scaled by a coefficient) rewards exploration
total_loss = policy_loss - entropy_coef * entropy
```

7.2 Reward Shaping

A reward transformation for tackling sparse-reward problems:

```python
def shape_reward(reward, state, next_state, done):
    """Example of reward shaping."""
    # Add an extra signal on top of the original reward
    position_reward = abs(next_state[0] - state[0])  # encourage movement

    if done and reward > 0:
        bonus = 100  # bonus for reaching the goal
    else:
        bonus = 0

    return reward + 0.1 * position_reward + bonus
```

8. REINFORCE์˜ ํ•œ๊ณ„

8.1 ๋ฌธ์ œ์ 

  1. ๋†’์€ ๋ถ„์‚ฐ: ์—ํ”ผ์†Œ๋“œ ์ „์ฒด๋ฅผ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ๋ถ„์‚ฐ์ด ํผ
  2. ์ƒ˜ํ”Œ ๋น„ํšจ์œจ: ์—ํ”ผ์†Œ๋“œ ์ข…๋ฃŒ๊นŒ์ง€ ๊ธฐ๋‹ค๋ ค์•ผ ํ•จ
  3. ํฌ๋ ˆ๋”ง ํ• ๋‹น: ์–ด๋–ค ํ–‰๋™์ด ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ ธ์™”๋Š”์ง€ ํŒŒ์•… ์–ด๋ ค์›€

8.2 ํ•ด๊ฒฐ์ฑ… โ†’ Actor-Critic

  • TD ํ•™์Šต๊ณผ ์ •์ฑ… ๊ฒฝ์‚ฌ์˜ ๊ฒฐํ•ฉ
  • ๋ถ€ํŠธ์ŠคํŠธ๋ž˜ํ•‘์œผ๋กœ ๋ถ„์‚ฐ ๊ฐ์†Œ
  • ์Šคํ…๋งˆ๋‹ค ์—…๋ฐ์ดํŠธ ๊ฐ€๋Šฅ

์š”์•ฝ

์•Œ๊ณ ๋ฆฌ์ฆ˜ ์—…๋ฐ์ดํŠธ ์‹œ์  Baseline ํŠน์ง•
REINFORCE ์—ํ”ผ์†Œ๋“œ ์ข…๋ฃŒ ์—†์Œ ๋‹จ์ˆœ, ๋†’์€ ๋ถ„์‚ฐ
REINFORCE + Baseline ์—ํ”ผ์†Œ๋“œ ์ข…๋ฃŒ V(s) ๋‚ฎ์€ ๋ถ„์‚ฐ
Actor-Critic ๋งค ์Šคํ… V(s) ๋˜๋Š” Q(s,a) ํšจ์œจ์ 

ํ•ต์‹ฌ ๊ณต์‹:

โˆ‡_ฮธ J(ฮธ) = E[โˆ‡_ฮธ log ฯ€_ฮธ(a|s) ยท (G - b)]

๋‹ค์Œ ๋‹จ๊ณ„

to navigate between lessons