Policy Iteration

1 Introduction

Most reinforcement learning problems are formally defined as sequential decision-making problems, and hence they are mathematically modelled by a Markov Decision Process (MDP), which is a tuple \( \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma) \) where \( \mathcal{S} \) is the state space, \( \mathcal{A} \) the action space, \( P \) the state transition probability distribution, \( R \) the reward function, and \( \gamma \) the discount factor.

At state \( s_t \), the agent chooses an action \( a_t \in \mathcal{A} \), receives a reward \( r_t \), and moves to the next state \( s_{t+1} \) drawn according to \( P \).
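To make the notation concrete, here is a minimal sketch of how such a tuple can be stored for a small tabular problem. The sizes, transition probabilities, and rewards below are made up purely for illustration, and the array layout (P[s, a, s'], R[s, a]) is just one convenient convention.

```python
import numpy as np

# Hypothetical tabular MDP with 2 states and 2 actions (made-up numbers).
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# R[s, a]     = expected immediate reward for taking action a in state s.
n_states, n_actions = 2, 2

P = np.array([
    [[0.8, 0.2],    # state 0, action 0
     [0.1, 0.9]],   # state 0, action 1
    [[0.5, 0.5],    # state 1, action 0
     [0.0, 1.0]],   # state 1, action 1
])
R = np.array([
    [1.0, 0.0],     # rewards for (s=0, a=0) and (s=0, a=1)
    [0.0, 2.0],     # rewards for (s=1, a=0) and (s=1, a=1)
])
gamma = 0.9         # discount factor

# Each (s, a) pair must define a valid distribution over next states.
assert np.allclose(P.sum(axis=-1), 1.0)
```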

2 Value functions

In the previous article, we mentioned how RL is based on the idea that an agent has to choose actions to maximize the expected sum of discounted rewards. If we follow the actions selected by a policy \( \pi \), the expected return from a given state is called the value function:

\[ V_\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s \right] \]

This is the expected return if we start in state \( s \) and keep selecting actions according to \( \pi \). Similarly, we define the action-value function, where we start in state \( s \), take a first action \( a \), and then follow the policy \( \pi \):

\[ Q_\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\, a_0 = a \right] \]
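As a quick illustration of these definitions, the sketch below samples one discounted return \( \sum_t \gamma^t r_t \) by rolling out a policy in a tabular MDP; averaging many such samples gives a Monte Carlo estimate of \( V_\pi(s) \). The arrays P, R, and pi (with pi[s, a] the probability of action a in state s) follow the hypothetical layout of the previous sketch, and the infinite sum is truncated at a finite horizon.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(P, R, pi, s0, gamma, horizon=200):
    """Sample one discounted return sum_t gamma^t * r_t from state s0 under policy pi."""
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):                      # truncate the infinite sum
        a = rng.choice(len(pi[s]), p=pi[s])       # draw an action from pi(. | s)
        ret += discount * R[s, a]                 # accumulate gamma^t * r_t
        discount *= gamma
        s = rng.choice(P.shape[-1], p=P[s, a])    # sample the next state from P
    return ret

# V_pi(s) is then approximated by averaging many sampled returns, e.g.:
# np.mean([rollout_return(P, R, pi, s0=0, gamma=0.9) for _ in range(1_000)])
```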

3 Bellman Equations

Both the value and action-value satisfy the following equations called Bellman equations:

\[ V_\pi(s) = \mathbb{E}_\pi \big[ R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s \big]. \]

\[ Q_\pi(s,a) = \mathbb{E}_\pi \big[ R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a \big]. \]
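In the tabular setting, the Bellman equation for \( V_\pi \) is just a weighted average over actions and next states. The sketch below applies that backup once to a current estimate of \( V_\pi \), again assuming the hypothetical arrays P, R, and pi from the sketch above.

```python
import numpy as np

def bellman_expectation_backup(V, P, R, pi, gamma):
    """One application of the Bellman expectation operator for V_pi.

    V  : current value estimates, shape (n_states,)
    pi : policy probabilities pi[s, a], shape (n_states, n_actions)
    Returns E_pi[R_{t+1} + gamma * V(S_{t+1}) | S_t = s] for every state s.
    """
    backup_sa = R + gamma * P @ V           # R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
    return np.sum(pi * backup_sa, axis=1)   # average over the policy's action probabilities
```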

Since we want to maximise these quantities, we look for an optimal policy \( \pi^* \), i.e. a policy such that \( V_{\pi^*}(s) \geq V_{\pi}(s) \) for every policy \( \pi \) and every state \( s \in \mathcal{S} \).

The value function shared by all optimal policies is unique and denoted by \( V^* \); similarly, the optimal action-value function is denoted by \( Q^* \). Analogously to the Bellman equations above, the optimal value functions satisfy the following, termed the Bellman optimality equations:

\[ V^*(s) = \max_a \Bigl[ R(s,a) + \gamma \, \mathbb{E} \bigl[ V^*(S_{t+1}) \mid S_t = s,\, A_t = a \bigr] \Bigr] \]

\[ Q^*(s,a) = R(s,a) + \gamma \, \mathbb{E} \Bigl[ \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s,\, A_t = a \Bigr] \]
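Under the same tabular assumptions, one application of each optimality backup can be sketched as follows; note that, unlike the equations for \( V_\pi \) and \( Q_\pi \), neither refers to any particular policy. The function names are my own.

```python
import numpy as np

def v_star_backup(V, P, R, gamma):
    """Bellman optimality backup for V*: max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    return np.max(R + gamma * P @ V, axis=1)

def q_star_backup(Q, P, R, gamma):
    """Bellman optimality backup for Q*: R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    return R + gamma * P @ np.max(Q, axis=1)
```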

4 Policy Iteration

In this section we assume that we already know the world model (i.e., the state transition probability distribution \( P \)), that the state and action sets are discrete, and that \( \gamma < 1 \). Under these assumptions we can compute the value function of a given policy (policy evaluation) and then improve the policy using the computed value function (policy improvement).

First we choose an arbitrary initial value function \( V_0 \), and then, for every state \( s \), we repeatedly update it according to:

\[ V_{k+1}(s) = \mathbb{E}_\pi\bigl[R_{t+1} + \gamma \, V_{k}(S_{t+1}) \mid S_t = s \bigr] \]

Thanks to the Banach fixed-point theorem (the Bellman expectation operator is a \( \gamma \)-contraction), we know that \( V_{k} \to V_{\pi} \) as \( k \to +\infty \).
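Put in code, iterative policy evaluation under this section's assumptions might look like the sketch below; the initial guess \( V_0 = 0 \), the sup-norm stopping tolerance, and the array names P, R, and pi are my own choices.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8, max_iters=10_000):
    """Iterate V_{k+1}(s) = E_pi[R_{t+1} + gamma * V_k(S_{t+1}) | S_t = s] until convergence."""
    V = np.zeros(P.shape[0])                              # arbitrary initial guess V_0
    for _ in range(max_iters):
        V_new = np.sum(pi * (R + gamma * P @ V), axis=1)  # one Bellman expectation backup
        if np.max(np.abs(V_new - V)) < tol:               # sup-norm change small enough
            return V_new
        V = V_new
    return V
```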

Once we have computed the value function of our policy, we can build a new policy that, in every state, picks the action maximizing \( R(s,a) + \gamma \, \mathbb{E}\bigl[ V_{\pi}(S_{t+1}) \mid S_t = s,\, A_t = a \bigr] \) instead of the action dictated by \( \pi \); we call this the deterministic policy \( \pi' \) that acts greedily with respect to \( V_{\pi} \). By the policy improvement theorem, \( V_{\pi'}(s) \geq V_{\pi}(s) \) for every state, so \( \pi' \) is at least as good as \( \pi \).
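A sketch of this greedy improvement step, under the same assumptions, could look like the following; it returns \( \pi' \) as a one-hot array over actions.

```python
import numpy as np

def policy_improvement(V, P, R, gamma):
    """Deterministic policy acting greedily with respect to V (one-hot over actions)."""
    q = R + gamma * P @ V                               # Q_pi(s, a) implied by V = V_pi
    greedy_actions = np.argmax(q, axis=1)               # argmax_a Q_pi(s, a) for every state
    pi_new = np.zeros_like(q)
    pi_new[np.arange(q.shape[0]), greedy_actions] = 1.0
    return pi_new
```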

We then keep alternating between policy evaluation and policy improvement until we reach a policy that acts greedily with respect to its own value function; such a policy satisfies the Bellman optimality equation, so it is the optimal policy.
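Putting the two steps together, a sketch of the full loop might look like this; it relies on the hypothetical policy_evaluation and policy_improvement helpers from the sketches above and stops as soon as the greedy policy stops changing.

```python
import numpy as np

def policy_iteration(P, R, gamma, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform policy
    while True:
        V = policy_evaluation(P, R, pi, gamma, tol=tol)   # evaluate the current policy
        pi_new = policy_improvement(V, P, R, gamma)       # act greedily w.r.t. V_pi
        if np.array_equal(pi_new, pi):                    # greedy w.r.t. its own value function
            return pi, V                                  # -> optimal policy and its value
        pi = pi_new
```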