Post | RLG

简介

强化学习方法既有具有环境模型的算法（基于模型），比如动态规划和启发式搜索；也有没有环境模型的算法（无模型），比如蒙特卡洛和时序差分。
基于模型的方法将规划作为其主要组成部分，而无模型的方法则主要依赖于学习。

8.1 模型和规划

模型：指的是一个智能体可以用来预测环境对其动作的反应的工具。给定一个状态和一个动作，模型就会产生_后继状态和下一个收益的预测_ 。

分布模型：描述了各种状态转移以及奖励反馈的概率，也就是提供了具体的分布$p(s’,r | s, a ) $。
采样模型：基于概率产生一种随机结果。

分布模型比采样模型更强大，因为它们可以用来生成采样模型。然而在许多场景下，获得采样模型会比获得分布模型容易得多。
规划：指将模型作为输入，进而与之交互，对策略进行优化改进的计算过程。

规划（Planning）方法和后面要提到的学习（Learning）方法核心思想都一样：通过反向更新来估计价值函数。区别仅在于，前者通过 model 生成模拟经验数据，而后者直接与环境交互，生成真实经验数据。

8.2 Dyna-Q：集成在一起的规划、动作和学习

智能体与环境交互时获得的新信息可能会改变模型，我们需要以某种方式对规划进行调整【模型学习和决策确定】，使其适应当前或后继预期的状态或决策。
Dyna-Q集成了在线规划的智能体所需要的主要功能，将规划和学习集成到了一起。它有以下三个概念：

Experience. 实际经验可以用于改进模型（模型学习）；也能用于直接改进价值函数和策略（直接强化学习）
Model. 模型，对环境的模拟
Value/policy. 价值函数和策略

Dyna-Q集成了规划和学习：

规划采用了随机单步表格型 Q-learning
学习采用了单步表格型 Q-learning

下图是Dyna-Q算法，a-c是与环境交互，d是直接强化学习，e是模型学习，
f是规划（均匀随机采样，Q-learning）

8.2.1 例8.1 Dyna迷宫

为什么带规划的智能体比无规划的智能体更快找到解决方案？如果没有规划 (n=0) ，则每一幕只会给策略学习增加一次学习机会，即智能体仅仅在一个特定时刻(该幕最后一个时刻)进行了学习。而有规划的时候，虽然在第一幕中仍然只有一步的学习，但是到第二幕时，可以学出一个宽泛的策略，于是不用等到这一幕结束就可以做多次回溯更新计算。

8.2.2 代码实现

class QLearningAgent(BaseAgent):
    def __init__(self, run_episode_fn=run_qlearning_episode, **kwargs):
        super().__init__(run_episode_fn=run_qlearning_episode, **kwargs)

    def update(self, state, action, reward, next_state):
        """ Q learning update to the policy -- eq 6.8 """

        q_t0 = self.get_q_value(state, action)
        q_t1 = self.get_value(next_state)

        # q learning update per eq 6.8 -- greedy policy after the current step
        new_value = q_t0 + self.alpha * (reward + self.discount * q_t1 - q_t0)

        # perform update
        self.q_values[(state, action)] = new_value

        self.num_updates += 1

        return new_value

class DynaQAgent(QLearningAgent):
    """ Tabular Dyna-Q algorithm per Section 8.2 """
    def __init__(self, n_planning_steps, **kwargs):
        super().__init__(**kwargs)
        self.n_planning_steps = n_planning_steps

    def reset(self):
        super().reset()
        self.model = {}

    def sample_model(self):
        # sample state
        past_states = [k[0] for k in self.model.keys()]
        sampled_state = past_states[np.random.choice(len(past_states))]
        # sample action, previously taken from the sampled state
        past_actions = [k[1] for k in self.model.keys() if k[0] == sampled_state]
        sampled_action = past_actions[np.random.choice(len(past_actions))]
        # model assumes deterministic environment so no need to sample from the (R,S') pair under model(S,A)
        reward, next_state = self.model[(sampled_state, sampled_action)][1]
        return sampled_state, sampled_action, reward, next_state

    def update(self, state, action, reward, next_state):
        """ Execute the Q-learning off-policy algorithm in Section 6.5 with Dyna-Q model update/planning in Section 8.2 """

        # perform q-learning update (Section 8.2 - Tabular Dyna-Q algorithm line (d))
        super().update(state, action, reward, next_state)  # note this is stepping the num_updates counter

        # update model (Sec 8.2 - Dyna-Q line (e))
        # model assumes deterministic environment
        self.model[(state, action)] = self.num_updates, (reward, next_state)

        # perform planning (Sec 8.2 - Dyna-Q line(f))
        # Loop repeat n times for the n_planning_steps
        for i in range(self.n_planning_steps):
            # sample randomly previously observed state (S) and sample randomly action previously taken at S
            super().update(*self.sample_model())  # update q_values with the planning sample

        self.mdp.step()      # keep track of mdp number of update steps to change the mdp dynamically per example 8.2 blocking maze

8.3 模型错误的时候

环境有时会发生改变，导致模型不正确，当模型不正确时，规划就可能算出次优的策略。

8.3.1 例8.3 捷径迷宫

这其实又是一个探索与开发之间的矛盾问题。
Dyna-Q+算法会对每一个”状态-动作”二元组进行跟踪，记录它自上一次在与环境进行真实交互时出现以来，已经过了多少时刻。时间越长，我们就越有理由推测这个二元组相关的环境动态特性会产生变化，也就是它的模型是不正确的。对于长期未出现过的动作，会设置”额外收益”。
如果模型对单步转移的收益是 r ，而这个转移在τ时刻内没有在真实环境尝试，那么在规划更新时就会采用如下收益：
$R = r + k \sqrt{τ}$

8.3.2 代码实现

class DynaQPlusAgent(DynaQAgent):
    """ Dyna-Q+ algorithm per Section 8.3 and footnote 1 """
    def __init__(self, k, **kwargs):
        super().__init__(**kwargs)
        self.k = k      # scale multiplier tying reward and timesteps: reward + k*sqrt(time delta)

    def sample_model(self):
        # Sec 8.3 + footnote 1:agent changed in the following ways:
        # 1. actions that have never been tried before from a state are allowed to be considered
        # 2. initial model for such actions is that they lead back to the same state with a reward of zero
        # 3. 'bonus reward' for long-untried actions -- planning updates done with new_reward = reward + k * sqrt(time delta)

        # sample a state
        past_states = [k[0] for k in self.model.keys()]
        sampled_state = past_states[np.random.choice(len(past_states))]

        # sample action from all possible action
        # 1. actions that have never been tried before from a state are allowed to be considered
        possible_actions = self.mdp.get_possible_actions(sampled_state)
        past_actions = [k[1] for k in self.model.keys() if k[0] == sampled_state]
        sampled_action = possible_actions[np.random.choice(len(possible_actions))]
        if sampled_action not in past_actions:
            # 2. initial model for such actions is that they lead back to the same state with a reward of zero
            reward = 0
            next_state = sampled_state
            # since this state-action has never been tried, add it to the model
            self.model[(sampled_state, sampled_action)] = self.num_updates, (reward, next_state)
        else:
            # model assumes deterministic environment so no need to sample from the (R,S') pair under model(S,A)
            t_last_update, (reward, next_state) = self.model[(sampled_state, sampled_action)]
            # 3. 'bonus reward' for long-untried actions -- planning updates done with new_reward = reward + k * sqrt(time delta)
            reward += self.k * np.sqrt(self.num_updates - t_last_update)
        return sampled_state, sampled_action, reward, next_state

增加探索意味着 Dyna-Q+ 比 Dyna-Q 更快地找到最优策略。 Dyna-Q 可能会找到一条有效但不是最优的轨迹，然后必须等待很长时间才能采取足够的探索性行动来找到最优策略。

Dyna-Q+ 将采取次优行动以进行探索（当 τ 变大时）。 Dyna-Q 不会这样做，因此具有更好的渐近性能。

8.4 优先遍历

无论是Dyna-Q还是Dyna-Q+，规划过程中的模拟转移都是从先前经历的“状态-动作”二元组中随机均匀采样得到的，但均匀采样方法在达到其中一个有用的更新之前将进行很多没用的更新，比如在Dyna迷宫问题中的第二幕开始的时候，有用的转移只有最后一步，因为之前绝大多数转移的价值状态都是从0到0。

IDEA：如果能够在规划过程中，更有针对性地优先考虑那些 value 发生过较明显改变的状态，便能提升规划学习的效率。如果这些动作的价值被更新，则它们的前导状态的价值也会依次改变

4.1 代码实现

class PrioritizedSweepingAgent(QLearningAgent):
    def __init__(self, n_planning_steps, theta, **kwargs):
        super().__init__(**kwargs)
        self.n_planning_steps = n_planning_steps
        self.theta = theta  # the minimum magnitude of q_value update to be performed

    def reset(self):
        super().reset()
        self.model = {}
        self.pq = PriorityQueue()
        self.predecessors = defaultdict(set)

    def _update_predecessors(self, state, action, next_state):
        # add predecessors as a set of (state, action) tuples
        self.predecessors[next_state].add((state, action))

    def update(self, state, action, reward, next_state):
        """ Execute the Q-learning off-policy algorithm in Section 6.5 with
        Prioritized Sweeping model update/planning in Section 8.4 """

        # update model (Sec 8.4 - line (d))
        # model assumes deterministic environment
        self.model[(state, action)] = (reward, next_state)
        # keep track of predecessors for the pq loop below
        self._update_predecessors(state, action, next_state)

        # compute q value proposed update and update priority queue (Sec 8.4 - line (e-f))
        proposed_update = reward + self.discount * self.get_value(next_state) - self.get_q_value(state, action)
        if abs(proposed_update) > self.theta:
            self.pq.push((state,action), -abs(proposed_update))

        # loop over n_planning steps while pq is not empty (Sec 8.4 - line(g)
        for i in range(self.n_planning_steps):
            if self.pq.is_empty():
                break

            # pop best update from queue and transition from model
            state, action = self.pq.pop()
            reward, next_state = self.model[(state, action)]

            # update q values for this state-action pair
            super().update(state, action, reward, next_state)

            # loop for all S', A' predicted to lead to the above state
            for s, a in self.predecessors[state]:
                # get predicted reward from the predecessor leading to `state`
                r, _ = self.model[(s, a)]
                # calculate the proposed update to (s,a)
                proposed_update = r + self.discount * self.get_value(state) - self.get_q_value(s, a)
                # add to priority queue if greater than min threshold
                if abs(proposed_update) > self.theta:
                    self.pq.push((s,a), -abs(proposed_update))

8.5 期望更新与采样更新

前面的 Dyna-Q 算法中，更新方法是采样更新 ，也就是选取某一个具体的样本来更新价值函数，与其对应地，就是期望更新，对下一步所有可能的动作都遍历一次，按概率分布取得期望后更新价值函数。

eg：对于最优策略下动作价值的更新，期望更新和采样更新的更新公式如下：

期望更新

$Q(s, a) = \sum_{s’, r}^{}\hat{p}(s’, r|s,a)[r+\gamma \mathop{max}\limits_{a’}Q(s’,a’)]$

采样更新

$Q(s, a) = Q(s, a) + \alpha [R+\gamma \mathop{max}\limits_{a’}Q(s’,a’) - Q(s, a)]$
下面对比两种更新方法。

如果没有模型，那么无法使用期望更新，只能和环境交互，做采样更新。
如果有模型，期望更新的估计准确性更高，避免了采样误差，而采样更新的运算量则会很小，速度快。对于一个给定的二元组s,a，b是分支因子，则这个二元组的期望更新需要的计算量大约是采样更新的b 倍。

8.6 轨迹采样

这一节主要解决『选取哪些状态、动作用于更新价值函数』这一问题。
基于动态规划的经典方法是通常需要遍历整个状态空间或状态动作二元组空间，每一次遍历都需要对每个状态或状态动作的价值进行更新，但是大规模任务根本没有时间完成一次遍历，并且绝大多数状态都是无关紧要的。所以需要将计算时间集中到需要的地方
为了避免这类问题，考虑以某种『合适的』概率分布在状态空间中采样，避免『盲目性』。
根据当前策略的概率分布进行采样操作（轨迹采样）值得考虑，因为至少比均匀采样好，比如我们下棋时考虑的是真实游戏中可能出现的位置，而不是棋子的随机位置。

8.7 决策时规划

规划有两种运行方式：

从环境模型生成模拟经验，并以此为基础逐步改进策略或价值函数。规划不仅聚焦于当前状态，还要预先在后台处理其他的多个状态，也称后台规划
每遇到一个新状态，才开始规划，直接选出当前状态下的最优策略，这里的规划聚焦于特定状态，也就是决策时规划

决策时规划的优点在于，他重点关注了即时状态，而没有太注重综合性的信息。由于在一些问题中，一些状态出现一次后要很长时间才能再出现一次，这时决策时规划就能较好地处理这种情况。
决策时规划适合那些不太需要快速反应的场景，比如一些棋类游戏，在每一步都允许较多思考时间。

8.7.1 启发式搜索

遇到一个状态，建立一个树结构（包含了后面各种可能的延续），搜索树深度k越大，每一次更新时的信息量也更大，显然更深的启发式搜索能够生成更好的结果，搜索树中的回溯更新与在本书中讨论的最大值的期望更新完全相同

8.7.2 预演算法

预演（rollout）算法是一种基于MC控制的决策时规划算法，在仿真轨迹层面上进行更新学习。rollout算法利用MC采样得到很多从当前状态开始的仿真轨迹（当前状态下有多个动作，因此对于每个动作都会得到一些仿真的轨迹），然后分别用这些轨迹回报的均值来估计每个动作的值。当这个估计值足够准确了，规划算法会执行具有最高估计值的动作。

MC 中，采样的目的是估计一个完整的，最优动作价值函数。MC是用于学习，所以我们要估计出所有状态动作的值函数，然后才能找到一个最优策略。
rollout 中，目标不是估计完整的最优动作价值函数q*，用于规划的，而且是决策时规划。对于这种规划，我们只在当前状态下搜索所有可能的plan，然后选择最有利的，因此就只侧重于当前状态周围的动作价值函数的值。

RLG

基于表格型方法的规划与学习

简介