Sticks And Carrots — An introduction to Reinforcement Learning for the Average Machine Learning Joe

Kae
Jan 14, 2021

What on earth is this Reinforcement Learning?

Reinforcement learning, in psychological terms, means learning particular actions and behaviors by positively or negatively reinforcing them. For example, if a child throws a tantrum in a shop, his parent rebukes him. If the child behaves, however, the parent rewards him with a cookie. This is reinforcement learning in action. The field of Reinforcement Learning, or RL, in machine learning tries to bring this sort of learning to machines, as a way of teaching them tasks that other existing algorithms handle poorly. In terms of where it sits, Reinforcement Learning is a major subfield of Machine Learning in its own right; when combined with Deep Learning (itself a HUGE subfield of Machine Learning), it becomes Deep Reinforcement Learning.

Note: Moving forward with this post, it is assumed that you have a working knowledge of basic machine learning and some deep learning algorithms. If you want to learn more about those, check this post out for some introductions.

In a basic Reinforcement Learning setting, there is an agent, which interacts with an environment. Given a state (Sₜ) from the environment (a game screen, the location of the food in Snake, etc.), the agent processes the state and returns an action (Aₜ) to be performed in the environment. The environment then returns a reward (Rₜ), based on which the agent learns and updates its policy. It might be hard to wrap your head around at first, but this loop is the base of all Reinforcement Learning. In the context of RL, it is formalized as a Markov Decision Process, or MDP.

Alongside the reward, the environment often returns the next state, as well as a done flag, which tells the agent whether or not the action taken ended the episode.
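If you prefer code, here is a minimal sketch of that loop using OpenAI Gym's classic API (gym before 0.26). The CartPole-v1 environment and the random "agent" are just placeholders for illustration:

```python
import gym

env = gym.make("CartPole-v1")

state = env.reset()                      # initial state S_0 from the environment
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()   # the agent picks an action A_t (random here)
    state, reward, done, info = env.step(action)  # environment returns R_t, S_t+1, done
    total_reward += reward               # a real agent would learn from (S, A, R, S', done)

print(f"Episode finished with return {total_reward}")
```

A real agent would replace the random sampling with its policy and use the returned samples to update that policy.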

Types of RL

There are two basic types of RL:

  1. Model-Based Learning
  2. Model-Free Learning
(Figure: a taxonomy of RL algorithms. Credit: OpenAI's Spinning Up in Deep RL)

Model-Based Reinforcement Learning gives the agent a model of the environment, whose predictions provide the rewards and the next state, and the agent plans with that model. The model may be given up front (like the rules of a board game) or learned from experience.

Model-Free Learning, however, relies purely on samples from the environment. A sample is a tuple of (state, action, reward, next state, done). Model-Free Learning is currently the bigger field of research, and the main focus of this post.

Model-Free Learning, in turn, has two major segments:

  1. Policy-Optimization Reinforcement learning
  2. Value-Based Reinforcement learning

A policy is the main decision-making part of an agent. In Deep Reinforcement Learning algorithms, a neural network is often used as the core of this policy. On-policy methods learn from the samples the current policy collects while playing the environment. Off-policy methods can also learn from samples collected earlier, often by an older version of the policy, which are stored in a replay memory filled with samples from the environment (see the sketch below).
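To make the replay-memory idea concrete, here is a minimal sketch; the class name and capacity are illustrative, not from any particular library:

```python
import random
from collections import deque

class ReplayMemory:
    """A minimal replay memory: stores (state, action, reward, next_state, done)
    samples so an off-policy agent can learn from randomly drawn past experience."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest samples are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # unzip into separate tuples of states, actions, rewards, next_states, dones
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```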

Policy Optimization Reinforcement Learning

On-policy methods such as Policy Gradients, Actor-Critic methods, etc. aim to directly evaluate or improve the policy that is making the decisions. Some popular examples are the vanilla Policy Gradient (PG), Advantage Actor-Critic (A2C) and its asynchronous variant (A3C), and Proximal Policy Optimization (PPO).

Value-Based Reinforcement Learning

Off-policy methods such as Q Learning, Dueling Q Learning, etc. are methods in which the policy that collects experience is not the one being updated directly. For example, in Deep Q Learning the behaviour policy is epsilon-greedy: a percentage of the time a random action is taken, and the rest of the time a neural network predicts the "optimal" action (at least according to the network). The training loop, however, does not update that behaviour policy; it updates the Q network directly, using targets that assume greedy behaviour. That is what makes it off-policy. A sketch of epsilon-greedy action selection follows below.
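As a sketch (assuming q_network is any PyTorch module mapping a state tensor to one Q-value per action, and state is a 1-D tensor), epsilon-greedy selection might look like this:

```python
import random
import torch

def select_action(q_network, state, epsilon, n_actions):
    """Epsilon-greedy behaviour policy: explore with probability epsilon,
    otherwise act greedily with respect to the Q network's predictions."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # random exploratory action
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))    # shape: (1, n_actions)
        return int(q_values.argmax(dim=1).item())   # greedy action
```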

Deep Reinforcement Learning, or DRL, is Reinforcement Learning that utilizes Deep Learning.

Deep Q Learning

Deep Q Learning is an off-policy, value-based reinforcement learning algorithm that has grown in popularity over the past few years. The original paper came out in 2013, released by DeepMind. The central idea of the algorithm is actually quite simple.

For action selection, there is a neural network (the Q network) which, when given a state, predicts the expected return (Q-value) for taking each action. The action with the greatest predicted Q-value is then selected. For training, the TD target:

td_target = reward + gamma * max_next_q

is used as the target for the Q network (where gamma is a hyperparameter, usually 0 < gamma < 1). Here max_next_q is the largest Q-value predicted for the next state, either by the policy Q network itself or by a secondary target network. I won't go into much detail here; we will dig deeper in the next post.
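For the curious, here is a minimal sketch of that target computation in PyTorch, assuming a batch of samples stored as float tensors and a separate target network (a common, but optional, choice):

```python
import torch

def td_targets(rewards, next_states, dones, target_network, gamma=0.99):
    """Compute r + gamma * max_a' Q_target(s', a') for a batch of samples,
    zeroing the bootstrap term for terminal transitions."""
    with torch.no_grad():
        next_q = target_network(next_states).max(dim=1).values  # best Q-value in next state
    return rewards + gamma * next_q * (1.0 - dones)             # no future reward if done
```

The policy Q network's prediction for the action actually taken is then regressed toward these targets, for example with a mean-squared-error loss.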

Policy Gradients (REINFORCE)

Policy Gradient (REINFORCE) is another popular reinforcement learning algorithm that has been a foundation for both on-policy and off-policy methods. In REINFORCE, action selection is relatively simple: a neural network's outputs are turned into a probability distribution (for example, via a softmax), and the action is sampled from that distribution.

For the training loop, this algorithm updates after every episode, instead of after every step like most other algorithms. A buffer stores the rewards collected and the actions taken throughout the episode. The rewards are then discounted at each step: the further a reward is from the current timestep, the more heavily it is discounted. Finally, the network is updated so that the probabilities of the actions that produced better discounted returns are increased, as in the sketch below. Again, more details in the upcoming posts.
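Here is a minimal sketch of that update in PyTorch, assuming log_probs is the list of log-probabilities (scalar tensors from the sampled distribution) of the actions taken during one episode, and rewards is the matching list of rewards:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Compute the REINFORCE loss for one episode. Minimizing it increases the
    probability of actions that led to higher discounted returns."""
    returns, g = [], 0.0
    for r in reversed(rewards):          # discount from the end of the episode backwards
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Normalizing the returns is a common (optional) variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```

Backpropagating this loss through the policy network once per episode is the whole training step.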

(Figure: where popular algorithms fall in the RL taxonomy. Credit: OpenAI's Spinning Up in Deep RL)

There is some overlap between the two types of methods. Algorithms such as DDPG, TD3, and SAC combine components from both sides, such as learned value functions (or Q functions) and actor-critic architectures, so they don't fit neatly into either category.

Model-Based Reinforcement Learning

As I mentioned before, model-based algorithms use a model's predictions to compute rewards and next states. Some popular examples include AlphaGo, AlphaZero, World Models, and MuZero. These algorithms often require a lot of computing power, and they tend to be highly environment-specific. That said, they are often much more stable and sample-efficient than model-free algorithms.

Well, what next?

I hope reading this post left you with a general overview of Reinforcement Learning. In the following posts, I am going to go into detail about specific algorithms and how to implement them using Python and PyTorch. Look here if you are interested in learning and implementing some of these algorithms as well. Until next time!
