Introducing Policy Gradient
Today, we’ll learn a policy-based reinforcement learning technique called Policy Gradients. In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we learn directly the policy function that maps states to actions (and select actions without using a value function).
This means that we directly try to optimize our policy function π without worrying about a value function: we parameterize π itself and select actions from it.
Sure, we can use a value function to optimize the policy parameters. But the value function will not be used to select an action.
A policy can be either deterministic or stochastic.
A deterministic policy is a policy that maps each state to a single action. You give it a state and the function returns the action to take.
Deterministic policies are used in deterministic environments. These are environments where the actions taken determine the outcome. There is no uncertainty. For instance, when you play chess and you move your pawn from A2 to A3, you’re sure that your pawn will move to A3.
On the other hand, a stochastic policy outputs a probability distribution over actions.
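To make this concrete, here is a minimal sketch of a stochastic policy in Java (the language used in the assignment below): a tabular softmax policy that keeps one preference value θ[state][action] and samples actions from the resulting distribution. The class and method names (SoftmaxPolicy, sampleAction) are illustrative choices, not part of any particular library.

```java
import java.util.Random;

/** A minimal sketch of a tabular stochastic (softmax) policy. */
public class SoftmaxPolicy {
    private final double[][] theta;   // theta[state][action]: one preference per (state, action)
    private final Random rng = new Random();

    public SoftmaxPolicy(int numStates, int numActions) {
        theta = new double[numStates][numActions];
    }

    /** pi(a|s): softmax over the preferences of state s. */
    public double[] actionProbabilities(int state) {
        double[] prefs = theta[state];
        double[] probs = new double[prefs.length];
        double sum = 0.0;
        for (int a = 0; a < prefs.length; a++) { probs[a] = Math.exp(prefs[a]); sum += probs[a]; }
        for (int a = 0; a < probs.length; a++) probs[a] /= sum;
        return probs;
    }

    /** Sample an action from pi(.|s) instead of always taking the arg-max. */
    public int sampleAction(int state) {
        double[] probs = actionProbabilities(state);
        double u = rng.nextDouble(), cumulative = 0.0;
        for (int a = 0; a < probs.length; a++) {
            cumulative += probs[a];
            if (u <= cumulative) return a;
        }
        return probs.length - 1;
    }
}
```

A deterministic policy would instead always return the arg-max action for the state; sampling from the distribution is what lets the agent keep exploring.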
There are three main advantages in using Policy Gradients.
For one, policy-based methods have better convergence properties.
The problem with value-based methods is that they can oscillate a lot during training. This is because the choice of action may change dramatically for an arbitrarily small change in the estimated action values.
On the other hand, with policy gradient, we just follow the gradient to find the best parameters. We see a smooth update of our policy at each step.
Because we follow the gradient to find the best parameters, we’re guaranteed to converge on a local maximum (worst case) or global maximum (best case).
The second advantage is that policy gradients are more effective in high dimensional action spaces, or when using continuous actions.
The problem with Deep Q-Learning is that it assigns a score (the maximum expected future reward) to each possible action, at each time step, given the current state.
But what if we have an infinite number of possible actions?
For instance, with a self-driving car, at each state you can have a (near) infinite choice of actions (turning the wheel by 15°, 17.2°, 19.4°, honking…). We’d need to output a Q-value for each possible action!
On the other hand, in policy-based methods, you just adjust the parameters directly: the policy itself moves toward the maximum, rather than computing (estimating) the maximum explicitly at every step.
A third advantage is that policy gradient can learn a stochastic policy, while value functions can’t. This has two consequences.
One of these is that we don’t need to hand-code an exploration/exploitation trade-off. Because a stochastic policy outputs a probability distribution over actions, our agent can explore the state space without always taking the same action; the trade-off is handled without hard coding it.
We also get rid of the problem of perceptual aliasing. Perceptual aliasing is when we have two states that seem to be (or actually are) the same, but need different actions.
Let’s take an example. We have an intelligent vacuum cleaner, and its goal is to suck up the dust while avoiding killing the hamsters.
Our vacuum cleaner can only perceive where the walls are.
The problem: the two red cells are aliased states, because in both of them the agent perceives only an upper and a lower wall.
Under a deterministic policy, the agent in a red state will either always move right or always move left. Either way, it will get stuck and never suck up the dust.
Under a value-based RL algorithm, we learn a quasi-deterministic policy (“epsilon greedy strategy”). As a consequence, our agent can spend a lot of time before finding the dust.
On the other hand, an optimal stochastic policy will randomly move left or right in the red (aliased) states. As a consequence it will not get stuck and will reach the goal state with high probability.
Disadvantages
Naturally, policy gradients have one big disadvantage: a lot of the time, they converge to a local maximum rather than the global optimum.
Unlike Deep Q-Learning, which always tries to reach the maximum, policy gradients converge more slowly, step by step. They can take longer to train.
However, we’ll see there are solutions to this problem.
We have our policy π with parameters θ. This π outputs a probability distribution over actions.
Awesome! But how do we know if our policy is good?
Remember that policy search can be seen as an optimization problem. We must find the best parameters θ that maximize a score function, J(θ).
There are two steps: first, measure the quality of the policy π with a policy score function J(θ); second, use policy gradient ascent to find the parameters θ that improve it.
The main idea here is that J(θ) tells us how good our π is, and policy gradient ascent helps us find the policy parameters that make good actions more likely.
To measure how good our policy is, we use a function called the objective function (or policy score function) that computes the expected reward of the policy.
There are three methods for measuring how good a policy is, and they work equally well. The choice depends only on the environment and the objectives you have.
First, in an episodic environment, we can use the start value. Calculate the mean of the return from the first time step (G1). This is the cumulative discounted reward for the entire episode.
The idea is simple. If I always start in some state s1, what’s the total reward I’ll get from that start state until the end?
We want to find the policy that maximizes G1, because it will be the optimal policy.
Second, in a continuing environment, we can use the average value: the value of each state, weighted by how often the policy visits that state.
Third, we can use the average reward per time step. The idea here is that we want to get the most reward per time step.
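In the usual notation, writing d(s) for how often the policy visits state s, V(s) for the value of s, and R(s, a) for the expected immediate reward, these three score functions are commonly written as:

$$J_1(\theta) = \mathbb{E}_{\pi_\theta}\big[G_1\big] = \mathbb{E}_{\pi_\theta}\big[V^{\pi_\theta}(s_1)\big]$$

$$J_{avgV}(\theta) = \sum_{s} d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$$

$$J_{avgR}(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, \mathcal{R}(s, a)$$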
We have a Policy score function that tells us how good our policy is. Now, we want to find a parameter θ that maximizes this score function. Maximizing the score function means finding the optimal policy.
To maximize the score function J(θ), we need to do gradient ascent on policy parameters.
Gradient ascent is the inverse of gradient descent. Remember that the gradient always points in the direction of the steepest change.
In gradient descent, we take the direction of the steepest decrease in the function. In gradient ascent we take the direction of the steepest increase of the function.
Why gradient ascent and not gradient descent? Because we use gradient descent when we have an error function that we want to minimize.
But, the score function is not an error function! It’s a score function, and because we want to maximize the score, we need gradient ascent.
The idea is to compute the gradient of the score with respect to the current policy π’s parameters, update the parameters in the direction of the greatest increase, and iterate.
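Concretely, with α as the learning rate, each iteration applies the usual gradient ascent update:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$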
Okay, now let’s implement that mathematically. This part is a bit hard, but it’s fundamental to understand how we arrive at our gradient formula.
Our score function can be defined as:
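One common way to write it, using r_t for the reward at time step t and γ for the discount factor:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t \ge 0} \gamma^{t}\, r_t\right]$$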
This is the total expected reward under the policy.
Now, because we want to do gradient ascent, we need to differentiate our score function J(θ).
Our score function J(θ) can also be defined as:
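A common trajectory-based form, writing τ = (s0, a0, r0, s1, …) for a trajectory, π(τ ; θ) for its probability under the current parameters, and R(τ) for the total reward collected along it:

$$J(\theta) = \sum_{\tau} \pi(\tau; \theta)\, R(\tau)$$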
We wrote the function in this way to show the problem we face here.
We know that policy parameters change how actions are chosen, and as a consequence, what rewards we get and which states we will see and how often.
So, it can be challenging to change the policy in a way that ensures improvement. This is because the performance depends both on the action selections and on the distribution of states in which those selections are made.
Both of these are affected by the policy parameters. The effect of the policy parameters on the actions is simple to find, but how do we find their effect on the state distribution, when the dynamics of the environment are unknown?
As a consequence, we face a problem: how do we estimate the gradient ∇ of the performance with respect to the policy parameters θ, when the gradient depends on the unknown effect of policy changes on the state distribution?
The solution is the Policy Gradient Theorem. It provides an analytic expression for the gradient of J(θ) (the performance) with respect to the policy parameters θ that does not involve differentiating the state distribution.
So let’s calculate:
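Starting from the trajectory-based form of J(θ) above and moving the gradient inside the sum:

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} \pi(\tau;\theta)\, R(\tau) = \sum_{\tau} \nabla_\theta\, \pi(\tau;\theta)\, R(\tau)$$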
Remember, we’re in a situation of stochastic policy. This means that our policy induces a probability distribution π(τ ; θ) over trajectories: the probability of taking a particular series of steps (s0, a0, r0…), given our current parameters θ.
But, differentiating a probability function is hard, unless we can transform it into a logarithm. This makes it much simpler to differentiate.
Here we’ll use the likelihood ratio trick, which rewrites the resulting fraction as the gradient of a log probability.
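In symbols, we multiply and divide by π(τ ; θ) and use the identity ∇ log π(τ ; θ) = ∇π(τ ; θ) / π(τ ; θ):

$$\sum_{\tau} \nabla_\theta \pi(\tau;\theta)\, R(\tau) = \sum_{\tau} \pi(\tau;\theta)\, \frac{\nabla_\theta \pi(\tau;\theta)}{\pi(\tau;\theta)}\, R(\tau) = \sum_{\tau} \pi(\tau;\theta)\, \nabla_\theta \log \pi(\tau;\theta)\, R(\tau)$$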
Now let’s convert the summation back to an expectation:
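Because the sum weights each trajectory by its probability π(τ ; θ), it is exactly an expectation over trajectories sampled from the policy:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi(\tau;\theta)\, R(\tau)\big] = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]$$

The second equality holds because the environment’s transition probabilities inside log π(τ ; θ) do not depend on θ, so they disappear when we take the gradient.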
This policy gradient tells us how we should shift the policy distribution, by changing the parameters θ, if we want to achieve a higher score.
R(τ) is a scalar score: the total reward collected along the trajectory τ.
This policy gradient causes the parameters to move most in the direction that favors the actions with the highest return.
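Putting the pieces together, here is a minimal REINFORCE-style sketch in Java, assuming the tabular softmax policy from the earlier sketch: after each finished episode it computes the discounted returns G_t and moves θ in the direction of G_t ∇ log π(a_t | s_t). Class and method names (ReinforceUpdater, Step, update) are again illustrative, not from any specific library.

```java
import java.util.List;

/** A minimal REINFORCE-style update for a tabular softmax policy. */
public class ReinforceUpdater {
    /** One (state, action, reward) step of a finished episode. */
    public record Step(int state, int action, double reward) {}

    private final double[][] theta;   // theta[state][action], the softmax preferences
    private final double alpha;       // learning rate
    private final double gamma;       // discount factor

    public ReinforceUpdater(double[][] theta, double alpha, double gamma) {
        this.theta = theta; this.alpha = alpha; this.gamma = gamma;
    }

    private double[] softmax(int state) {
        double[] p = new double[theta[state].length];
        double sum = 0.0;
        for (int a = 0; a < p.length; a++) { p[a] = Math.exp(theta[state][a]); sum += p[a]; }
        for (int a = 0; a < p.length; a++) p[a] /= sum;
        return p;
    }

    /** Gradient ascent on J(theta): theta <- theta + alpha * G_t * grad log pi(a_t|s_t). */
    public void update(List<Step> episode) {
        // Discounted return G_t, computed backwards over the episode.
        double[] returns = new double[episode.size()];
        double g = 0.0;
        for (int t = episode.size() - 1; t >= 0; t--) {
            g = episode.get(t).reward() + gamma * g;
            returns[t] = g;
        }
        // For a tabular softmax policy, d log pi(a|s) / d theta[s][b] = 1{b==a} - pi(b|s).
        for (int t = 0; t < episode.size(); t++) {
            Step step = episode.get(t);
            double[] probs = softmax(step.state());
            for (int b = 0; b < probs.length; b++) {
                double gradLogPi = (b == step.action() ? 1.0 : 0.0) - probs[b];
                theta[step.state()][b] += alpha * returns[t] * gradLogPi;
            }
        }
    }
}
```

In practice, subtracting a baseline from G_t (for example, an estimated state value) reduces the variance of this update without changing its expectation.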
Policy Gradient Assignment
1) Implement a Java-based Policy Gradient() function with the specified interface and parameters;
2) For the iMADE implementation, build a multiagent-based Policy Gradient application.
In order to ensure a well-defined multiagent system (on top of the JADE platform), the application must consist of at least two iMADE agents, e.g. a Data-miner agent that gets the data from the iMADE server and a Policy Gradient agent that performs the Policy Gradient operations.
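As a starting point, a minimal JADE agent for the second part might look like the sketch below. It only uses standard JADE calls (Agent, OneShotBehaviour, ACLMessage); the iMADE-specific parts (the Data-miner agent and the connection to the iMADE server) are left as comments because their APIs are platform-specific.

```java
import jade.core.Agent;
import jade.core.behaviours.OneShotBehaviour;
import jade.lang.acl.ACLMessage;

/** Minimal JADE agent skeleton for the Policy Gradient role (iMADE-specific code omitted). */
public class PolicyGradientAgent extends Agent {

    @Override
    protected void setup() {
        System.out.println(getLocalName() + ": policy gradient agent started.");

        addBehaviour(new OneShotBehaviour() {
            @Override
            public void action() {
                // Wait for a training-data message from the Data-miner agent.
                ACLMessage msg = myAgent.blockingReceive();
                String data = msg.getContent();

                // TODO: parse the data and run the policy gradient update here,
                // e.g. using the SoftmaxPolicy / ReinforceUpdater sketches above.

                // Reply to the sender when the update is done.
                ACLMessage reply = msg.createReply();
                reply.setPerformative(ACLMessage.INFORM);
                reply.setContent("policy updated");
                myAgent.send(reply);
            }
        });
    }

    @Override
    protected void takeDown() {
        System.out.println(getLocalName() + ": shutting down.");
    }
}
```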