06. Q-Learning and SARSA

Difficulty: ⭐⭐⭐ (Intermediate)

Learning Objectives

  • Understand Q-Learning and its off-policy nature
  • Understand SARSA and its on-policy nature
  • Compare Q-Learning and SARSA
  • Implement the ε-greedy exploration strategy
  • Know the convergence conditions and practical tips

1. The Action-Value Function (Q Function)

1.1 Definition of the Q Function

The Q function evaluates a state-action pair: the expected return from taking action a in state s.

$$Q(s, a) = \mathbb{E}[G_t | S_t = s, A_t = a]$$

import numpy as np

class QTable:
    def __init__(self, n_states, n_actions):
        # Initialize the Q-table (zeros or small random values)
        self.q_table = np.zeros((n_states, n_actions))

    def get_q(self, state, action):
        return self.q_table[state, action]

    def get_best_action(self, state):
        return np.argmax(self.q_table[state])

    def update(self, state, action, value):
        self.q_table[state, action] = value

1.2 Relationship Between V and Q

V(s) = max_a Q(s, a)           # under the optimal policy
V(s) = Σ_a π(a|s) Q(s, a)      # under a general policy π
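These two identities can be checked numerically on a made-up Q-row (the values below are purely illustrative):

```python
import numpy as np

# Hypothetical Q(s, ·) for a single state with 3 actions
q = np.array([1.0, 3.0, 2.0])

# Under the optimal (greedy) policy: V(s) = max_a Q(s, a)
v_greedy = np.max(q)

# Under a uniform-random policy: V(s) = Σ_a π(a|s) Q(s, a)
pi = np.full(3, 1 / 3)
v_uniform = np.sum(pi * q)

print(v_greedy, v_uniform)
```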

2. Q-Learning (Off-Policy TD)

2.1 The Q-Learning Algorithm

Off-policy: the behavior policy (used to select actions) differs from the target policy (the one being learned).

Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]

  • Behavior policy: ε-greedy (for exploration)
  • Target policy: greedy (what is learned)

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        self.n_actions = n_actions

    def choose_action(self, state):
        """ฮต-greedy ์ •์ฑ…์œผ๋กœ ํ–‰๋™ ์„ ํƒ"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # explore
        return np.argmax(self.q_table[state])          # exploit

    def update(self, state, action, reward, next_state, done):
        """Q-Learning ์—…๋ฐ์ดํŠธ"""
        if done:
            target = reward
        else:
            # Off-policy: bootstrap on the max Q-value in the next state
            target = reward + self.gamma * np.max(self.q_table[next_state])

        # TD update
        td_error = target - self.q_table[state, action]
        self.q_table[state, action] += self.alpha * td_error

        return td_error

2.2 Q-Learning Training Loop

def train_qlearning(env, agent, n_episodes=1000):
    rewards_history = []

    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False

        while not done:
            # Select an action with ε-greedy
            action = agent.choose_action(state)

            # ํ™˜๊ฒฝ์—์„œ ํ•œ ์Šคํ… ์ง„ํ–‰
            next_state, reward, done, _ = env.step(action)

            # Update the Q-table (independent of the next action)
            agent.update(state, action, reward, next_state, done)

            state = next_state
            total_reward += reward

        rewards_history.append(total_reward)

        # Epsilon decay (optional)
        agent.epsilon = max(0.01, agent.epsilon * 0.995)

    return rewards_history

3. SARSA (On-Policy TD)

3.1 The SARSA Algorithm

On-policy: the behavior policy and the target policy are the same.

The name comes from the quintuple it uses: State, Action, Reward, State, Action.

Q(s, a) ← Q(s, a) + α[r + γ Q(s', a') - Q(s, a)]

Here a' is the next action that will actually be taken.

class SARSA:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions

    def choose_action(self, state):
        """ฮต-greedy ์ •์ฑ…"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, next_action, done):
        """SARSA ์—…๋ฐ์ดํŠธ"""
        if done:
            target = reward
        else:
            # On-policy: bootstrap on the Q-value of the action actually taken
            target = reward + self.gamma * self.q_table[next_state, next_action]

        td_error = target - self.q_table[state, action]
        self.q_table[state, action] += self.alpha * td_error

        return td_error

3.2 SARSA Training Loop

def train_sarsa(env, agent, n_episodes=1000):
    rewards_history = []

    for episode in range(n_episodes):
        state = env.reset()
        action = agent.choose_action(state)  # select the initial action
        total_reward = 0
        done = False

        while not done:
            # ํ™˜๊ฒฝ์—์„œ ํ•œ ์Šคํ… ์ง„ํ–‰
            next_state, reward, done, _ = env.step(action)

            # Select the next action (before the update)
            next_action = agent.choose_action(next_state)

            # SARSA update (needs the next action)
            agent.update(state, action, reward, next_state, next_action, done)

            state = next_state
            action = next_action
            total_reward += reward

        rewards_history.append(total_reward)

    return rewards_history

4. Q-Learning vs SARSA Comparison

4.1 Key Differences

Property             Q-Learning                        SARSA
Policy type          Off-policy                        On-policy
Target term          max_a' Q(s', a')                  Q(s', a')
What it learns       the optimal policy                the current policy
Exploration effect   target ignores exploratory moves  exploration enters the target
Risk profile         more aggressive                   more conservative
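The difference in the target term is easiest to see on a single made-up transition: both algorithms look at the same Q(s', ·), but build different targets.

```python
import numpy as np

gamma = 0.9
reward = 1.0
q_next = np.array([0.0, 2.0, 5.0])  # hypothetical Q(s', ·)

# Q-Learning: bootstrap on the greedy action in s'
q_learning_target = reward + gamma * np.max(q_next)

# SARSA: bootstrap on the action actually taken in s'
next_action = 1  # suppose ε-greedy happened to explore
sarsa_target = reward + gamma * q_next[next_action]

print(q_learning_target, sarsa_target)
```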

4.2 Cliff Walking Example

[S][.][.][.][.][.][.][.][.][.][.][G]
[C][C][C][C][C][C][C][C][C][C][C][C]

S: start, G: goal, C: cliff (large negative reward)

def cliff_walking_comparison():
    """
    Q-Learning: ์ ˆ๋ฒฝ ๊ฐ€์žฅ์ž๋ฆฌ์˜ ์ตœ๋‹จ ๊ฒฝ๋กœ ์„ ํ˜ธ (์œ„ํ—˜ํ•˜์ง€๋งŒ ๋น ๋ฆ„)
    SARSA: ์ ˆ๋ฒฝ์—์„œ ๋–จ์–ด์ง„ ์•ˆ์ „ํ•œ ๊ฒฝ๋กœ ์„ ํ˜ธ (๋А๋ฆฌ์ง€๋งŒ ์•ˆ์ „)
    """
    # Q-Learning์€ ์ตœ์  ๊ฒฝ๋กœ๋ฅผ ํ•™์Šตํ•˜์ง€๋งŒ
    # ฮต-greedy ํƒ์ƒ‰ ์ค‘ ์ ˆ๋ฒฝ์œผ๋กœ ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ์Œ

    # SARSA๋Š” ํƒ์ƒ‰์„ ๊ณ ๋ คํ•˜์—ฌ
    # ์ ˆ๋ฒฝ์—์„œ ๋–จ์–ด์ง„ ๊ฒฝ๋กœ๋ฅผ ํ•™์Šต
    pass
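The stub above can be backed by an actual environment. Below is a minimal sketch assuming the standard 4×12 layout (the figure shows only the goal row and the cliff row); it returns the same (next_state, reward, done, info) tuple the training loops above expect. The class name and reward values are illustrative.

```python
class CliffWalkingEnv:
    """Minimal cliff-walking grid: start at bottom-left, goal at
    bottom-right, cliff cells in between on the bottom row."""
    ROWS, COLS = 4, 12

    def __init__(self):
        self.n_states = self.ROWS * self.COLS
        self.n_actions = 4  # 0: up, 1: right, 2: down, 3: left

    def reset(self):
        self.pos = (self.ROWS - 1, 0)  # start S
        return self._to_state(self.pos)

    def _to_state(self, pos):
        return pos[0] * self.COLS + pos[1]

    def step(self, action):
        moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        dr, dc = moves[action]
        r = min(max(self.pos[0] + dr, 0), self.ROWS - 1)
        c = min(max(self.pos[1] + dc, 0), self.COLS - 1)

        if r == self.ROWS - 1 and 0 < c < self.COLS - 1:
            # Stepped into the cliff: large penalty, back to start
            self.pos = (self.ROWS - 1, 0)
            return self._to_state(self.pos), -100.0, False, {}

        self.pos = (r, c)
        done = (r, c) == (self.ROWS - 1, self.COLS - 1)  # reached G
        return self._to_state(self.pos), -1.0, done, {}
```

Plugged into train_qlearning and train_sarsa, this reproduces the classic result: Q-Learning's greedy policy hugs the cliff edge, SARSA's keeps a safety margin.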

4.3 Visual Comparison

import matplotlib.pyplot as plt

def compare_algorithms(env, n_episodes=500, n_runs=10):
    q_rewards = np.zeros((n_runs, n_episodes))
    sarsa_rewards = np.zeros((n_runs, n_episodes))

    for run in range(n_runs):
        q_agent = QLearning(env.n_states, env.n_actions)
        sarsa_agent = SARSA(env.n_states, env.n_actions)

        q_rewards[run] = train_qlearning(env, q_agent, n_episodes)
        sarsa_rewards[run] = train_sarsa(env, sarsa_agent, n_episodes)

    # ํ‰๊ท  ๋ฐ ํ‘œ์ค€ํŽธ์ฐจ
    plt.figure(figsize=(10, 6))

    q_mean = q_rewards.mean(axis=0)
    sarsa_mean = sarsa_rewards.mean(axis=0)

    plt.plot(q_mean, label='Q-Learning', alpha=0.8)
    plt.plot(sarsa_mean, label='SARSA', alpha=0.8)

    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Q-Learning vs SARSA')
    plt.legend()
    plt.show()

5. Exploration Strategies

5.1 Epsilon-Greedy

def epsilon_greedy(q_values, epsilon):
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))
    return np.argmax(q_values)

# Epsilon decay schedule
def get_epsilon(episode, min_eps=0.01, max_eps=1.0, decay=0.995):
    return max(min_eps, max_eps * (decay ** episode))
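Besides geometric decay, a linear schedule is also common in practice; a minimal sketch (the function name and defaults are illustrative, not part of the lesson):

```python
def linear_epsilon(step, total_steps, start=1.0, end=0.01, frac=0.5):
    """Decay linearly from start to end over the first `frac` of
    training, then hold at end."""
    decay_steps = int(total_steps * frac)
    if step >= decay_steps:
        return end
    return start + (end - start) * step / decay_steps
```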

5.2 Softmax (Boltzmann) Exploration

def softmax_action(q_values, temperature=1.0):
    """Stochastic action selection; lower temperature is greedier."""
    # Subtract the max before exponentiating for numerical stability
    exp_q = np.exp((q_values - np.max(q_values)) / temperature)
    probs = exp_q / np.sum(exp_q)
    return np.random.choice(len(q_values), p=probs)
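The effect of the temperature can be checked directly on the probabilities (the Q-values are made up):

```python
import numpy as np

def softmax_probs(q_values, temperature):
    # Subtract the max before exponentiating for numerical stability
    z = (q_values - np.max(q_values)) / temperature
    exp_q = np.exp(z)
    return exp_q / exp_q.sum()

q = np.array([1.0, 2.0, 3.0])
low_t = softmax_probs(q, 0.1)    # near-greedy: mass concentrates on the best action
high_t = softmax_probs(q, 10.0)  # near-uniform: encourages exploration
```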

5.3 UCB (Upper Confidence Bound)

class UCBAgent:
    def __init__(self, n_states, n_actions, c=2.0):
        self.q_table = np.zeros((n_states, n_actions))
        self.n_visits = np.zeros((n_states, n_actions))
        self.total_visits = np.zeros(n_states)
        self.c = c

    def choose_action(self, state):
        self.total_visits[state] += 1

        # If any action is still unvisited, try it first
        if 0 in self.n_visits[state]:
            action = np.argmin(self.n_visits[state])
        else:
            # UCB score: value estimate plus an exploration bonus
            ucb_values = self.q_table[state] + self.c * np.sqrt(
                np.log(self.total_visits[state]) / self.n_visits[state]
            )
            action = np.argmax(ucb_values)

        self.n_visits[state, action] += 1  # count the visit, or the bonus never shrinks
        return action

6. Expected SARSA

6.1 Concept

A middle ground between SARSA and Q-Learning: instead of bootstrapping on a sampled next action, it uses the expected Q-value under the current policy.

Q(s, a) ← Q(s, a) + α[r + γ Σ_a' π(a'|s') Q(s', a') - Q(s, a)]

class ExpectedSARSA:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions

    def get_policy_probs(self, state):
        """ฮต-greedy ์ •์ฑ…์˜ ํ™•๋ฅ  ๋ถ„ํฌ"""
        probs = np.ones(self.n_actions) * self.epsilon / self.n_actions
        best_action = np.argmax(self.q_table[state])
        probs[best_action] += 1 - self.epsilon
        return probs

    def update(self, state, action, reward, next_state, done):
        if done:
            target = reward
        else:
            # Expected Q-value under the current policy
            probs = self.get_policy_probs(next_state)
            expected_q = np.sum(probs * self.q_table[next_state])
            target = reward + self.gamma * expected_q

        td_error = target - self.q_table[state, action]
        self.q_table[state, action] += self.alpha * td_error

7. ์ˆ˜๋ ด์„ฑ๊ณผ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

7.1 ์ˆ˜๋ ด ์กฐ๊ฑด

Q-Learning์ด ์ตœ์  Q*๋กœ ์ˆ˜๋ ดํ•˜๊ธฐ ์œ„ํ•œ ์กฐ๊ฑด:

  1. ๋ชจ๋“  ์ƒํƒœ-ํ–‰๋™ ์Œ์„ ๋ฌดํ•œํžˆ ๋ฐฉ๋ฌธ
  2. ํ•™์Šต๋ฅ  ์กฐ๊ฑด: ฮฃ ฮฑ = โˆž, ฮฃ ฮฑยฒ < โˆž (์˜ˆ: ฮฑ = 1/n)
  3. ๋ณด์ƒ์ด ์œ ๊ณ„
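Condition 2 can be met with a per-pair visit count, α = 1/n(s, a). A minimal sketch of how that plugs into the tabular update (the class name is illustrative):

```python
import numpy as np

class DecayingAlphaQ:
    """Q-Learning update with alpha = 1/n(s, a), which satisfies
    Σ α = ∞ and Σ α² < ∞ for every state-action pair."""
    def __init__(self, n_states, n_actions, gamma=0.99):
        self.q_table = np.zeros((n_states, n_actions))
        self.counts = np.zeros((n_states, n_actions))
        self.gamma = gamma

    def update(self, state, action, reward, next_state, done):
        self.counts[state, action] += 1
        alpha = 1.0 / self.counts[state, action]  # decaying step size
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.q_table[next_state])
        self.q_table[state, action] += alpha * (target - self.q_table[state, action])
```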

7.2 Hyperparameter Tuning

# Common starting points
config = {
    'alpha': 0.1,        # learning rate: 0.01 - 0.5
    'gamma': 0.99,       # discount factor: 0.9 - 0.999
    'epsilon': 1.0,      # initial exploration rate
    'epsilon_min': 0.01, # minimum exploration rate
    'epsilon_decay': 0.995  # decay rate
}
}

# Learning-rate scheduling
def learning_rate_schedule(episode, initial_lr=0.5, decay=0.001):
    return initial_lr / (1 + decay * episode)

8. Hands-On: FrozenLake

import gymnasium as gym

def train_frozen_lake():
    env = gym.make('FrozenLake-v1', is_slippery=True)

    agent = QLearning(
        n_states=env.observation_space.n,
        n_actions=env.action_space.n,
        alpha=0.1,
        gamma=0.99,
        epsilon=1.0
    )

    n_episodes = 10000
    rewards = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False

        while not done:
            action = agent.choose_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            agent.update(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

        rewards.append(total_reward)
        agent.epsilon = max(0.01, agent.epsilon * 0.9995)

        if (episode + 1) % 1000 == 0:
            avg_reward = np.mean(rewards[-100:])
            print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.3f}")

    return agent, rewards

if __name__ == "__main__":
    agent, rewards = train_frozen_lake()
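Once training finishes, it is worth evaluating the learned policy with exploration switched off. A minimal sketch (the function name is illustrative, not part of the lesson):

```python
import numpy as np

def evaluate(env, agent, n_episodes=100):
    """Run the greedy policy (no ε) and return the average episode reward."""
    total = 0.0
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = int(np.argmax(agent.q_table[state]))  # always greedy
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        total += reward  # FrozenLake: reward is 1.0 only on reaching the goal
    return total / n_episodes
```

On FrozenLake the returned value is the success rate, since each episode's reward is 0 or 1.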

Summary

Algorithm        Target term      Notes
Q-Learning       max Q(s', a')    Off-policy; learns the optimal policy
SARSA            Q(s', a')        On-policy; evaluates the current policy
Expected SARSA   E[Q(s', a')]     Usable off-policy; lower variance

Key points:
  • Q-Learning learns the optimal policy directly
  • SARSA learns more safely by accounting for exploration
  • A good exploration-exploitation balance is essential

