Actor-Critic Methods with Continuous Action Spaces — Having Too Many Things to Do

Kae
Feb 13, 2021

Actor-critic methods are becoming the standard for policy-optimization reinforcement learning, and they show up in value-based reinforcement learning too. Beginner tutorials and guides usually explain and implement them only for discrete action spaces. Reading complicated codebases like stable-baselines or spinning-up rarely makes it clear how these algorithms actually get ported to continuous action spaces, and even when you spot the port, you probably have no idea how or why it works. So that’s what we’re covering in this blog post.

Note: This post is best suited for beginners in reinforcement learning. You can find excellent resources online for getting started. It's also best if you already have a good understanding of actor-critic methods.

Another short note: this post doesn't cover DDPG or TD3. Those algorithms don't sample actions from a distribution and have their own exploration schemes. This post covers A2C, A3C, vanilla actor-critic, PPO, etc.

Gimme the solution

Honestly, the solution to porting these methods over to continuous action spaces is quite easy. The Categorical distribution in actor-critic architectures

# PyTorch syntax
from torch.distributions import Categorical

probs = self.model(state)   # network outputs one probability per discrete action
dist = Categorical(probs)
action = dist.sample()

can simply be replaced with a Normal or Gaussian distribution

# PyTorch syntax
import torch
from torch.distributions import Normal

mu = self.model(state)                    # network now outputs the mean of each action dimension
std = torch.tensor([0.1]).expand_as(mu)   # fixed standard deviation; why this? see below
dist = Normal(mu, std)
action = dist.sample()

This seems pretty easy to do, right?

Well, not exactly.

The std variable you see in the code plays a vital part. If you already know how Gaussian (Normal) distributions work, you can skip this part, but I'd still suggest sticking around for a refresher.

A Gaussian or Normal distribution is defined by a mean and a standard deviation. The mean is the center of the curve, and the standard deviation controls how spread out the curve is. Values are sampled from the whole distribution, and the closer a value is to the mean, the more likely it is to be picked: roughly 68% of samples land within one standard deviation of the mean, and almost all (about 99.7%) land within three. Samples can go further out than that, but it is very unlikely.

[Figure: the normal distribution bell curve. Credits: Scribbr]
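If you want to see those numbers for yourself, here is a minimal sanity check (my own sketch, not from the original post) that samples from a Normal distribution and measures how often the samples stay close to the mean:

# Minimal sketch: how samples from Normal(mu, std) spread around the mean
import torch
from torch.distributions import Normal

mu, std = torch.tensor(0.0), torch.tensor(0.1)
dist = Normal(mu, std)
samples = dist.sample((100_000,))

within_1_std = ((samples - mu).abs() <= std).float().mean()
within_3_std = ((samples - mu).abs() <= 3 * std).float().mean()
print(within_1_std.item())  # roughly 0.68
print(within_3_std.item())  # roughly 0.997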

This means that we have a solid exploration strategy coming in with our Normal Distribution.

So what's the problem?

Standard Deviation

Choosing the standard deviation (std) of the distribution is key. In my experience, if your problem is not complex, you can keep the std constant forever. But if your problem is quite complex, has sparse rewards, or needs a lot of exploration, the best idea is usually to make the std an output of the neural network. Let's see why.

The std controls the exploration of the agent. If the std is 0, the agent does not explore at all; if the std stays at 1.0, it explores far too much. If you are familiar with epsilon-greedy strategies, gradually annealing the std down to some small value like 0.001 will seem very attractive, and that instinct is not wrong. But the most stable and most powerful solution seems to be keeping the std as an output of the network. Putting exploration in the hands of the network itself tends to work better: a well-trained agent in a video game can learn to explore more when it encounters a part of the state space it has not seen before. One downside is that learning slows down drastically, because instead of predicting action A given state S and a fixed std T, the network now has to find the best combination of action A and std T given state S. A sketch of what this can look like is shown below.
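Here is a minimal sketch of a policy that learns its own std. The class name GaussianPolicy, the layer sizes, and the state-independent log_std parameter are my own illustrative choices, not anyone's reference implementation; learning log(std) rather than std directly is a common trick that keeps the std positive by construction.

# Minimal sketch (illustrative names): a policy that learns its own std
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.mu_head = nn.Linear(64, act_dim)
        # state-independent learnable log_std, one value per action dimension;
        # a state-dependent head (another nn.Linear) is the other common option
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, state):
        h = self.body(state)
        mu = self.mu_head(h)
        std = self.log_std.exp().expand_as(mu)
        return Normal(mu, std)

policy = GaussianPolicy(obs_dim=8, act_dim=2)
dist = policy(torch.randn(1, 8))
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)  # what the actor loss is built from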

This solution also turns out to be magically more stable. An environment that requires a lot of exploration will never converge with a fixed std because even when the agent is quite good, it will continue to explore forever.

So what do I do?

It depends on the environment. If it's complicated or requires a lot of exploration, let the std be an output of the network. If the environment is relatively simple and you are just trying to test out a new algorithm, keep the std fixed. A solid middle ground is gradually annealing the std down to something like 0.001, as in the sketch below.
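If you go the annealing route, a simple linear schedule is enough. This is just a sketch; the starting value, the 0.001 floor, and the number of steps are arbitrary choices you would tune per environment:

# Minimal sketch: linearly anneal a fixed std from 1.0 down to 0.001 over training
import torch
from torch.distributions import Normal

std_start, std_end, anneal_steps = 1.0, 0.001, 100_000

def current_std(step):
    # linear interpolation from std_start to std_end, then hold at std_end
    frac = min(step / anneal_steps, 1.0)
    return std_start + frac * (std_end - std_start)

mu = torch.zeros(2)                     # stand-in for self.model(state)
for step in (0, 50_000, 200_000):
    std = torch.full_like(mu, current_std(step))
    action = Normal(mu, std).sample()
    print(step, round(current_std(step), 4))  # 1.0 -> 0.5005 -> 0.001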

Conclusion

Hopefully, this left you with a good understanding and enough to put together a working implementation of actor-critic architectures for continuous action spaces. For getting started with actor-critic algorithms, check this article out. Thanks for reading!
