In this article, I will explain the idea behind Reinforcement Learning (RL) and how it differs from classic machine learning. I will also present and explain some key formulas that are important to understand before going deeper into Reinforcement Learning.
What is the difference between RL and other Machine Learning fields?
Machine Learning is commonly split into three major domains: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. In both supervised and unsupervised learning, we learn from data contained in a dataset in order to build a model that generalizes to similar data not seen during training. In contrast, in Reinforcement Learning, there is no pre-collected dataset to learn from. Instead, we have an environment with which we can interact. After each interaction, the environment returns feedback. Through this process of trial and error, we gradually learn how the environment works.
How does Reinforcement Learning work?
As mentioned in the previous section, the goal is to build an agent that learns how the environment works through trial and error. At each time step, the agent performs an action in the environment. If the action is good, the agent receives positive feedback (a reward); otherwise, it receives a negative reward. The role of the agent is to maximize the total reward it collects from the environment. The environment can be represented as a set of states, and from each state the agent can reach some other states. For those who have studied Markov Decision Processes (MDPs), this is exactly the setting we are working in.
To summarize, at time t, the agent is in state S_t and takes an action a_t. The environment processes that action, tells the agent how good the action was (the reward), and tells it which state it will be in at the next time step t+1. The agent keeps repeating this process until the environment signals that the episode is done. The following figure illustrates the whole process.
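The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `CorridorEnv`, `run_episode`, and `random_policy` are hypothetical names, the corridor world and its reward values are invented for the example, and the `reset`/`step` interface is a simplified convention, not the API of any particular library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Start every episode in the leftmost cell.
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (walls clamp the position).
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # Assumed reward scheme: small penalty per step, +1 at the goal.
        reward = 1.0 if done else -0.1
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode: observe state, act, receive reward and next state."""
    state = env.reset()
    total_reward = 0.0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                        # agent picks a_t in S_t
        next_state, reward, done = env.step(action)   # environment responds
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
        if done:                                      # environment says "done"
            break
    return total_reward, trajectory


def random_policy(state):
    # An agent that has learned nothing yet: pick an action at random.
    return random.choice([0, 1])


random.seed(0)
total, trajectory = run_episode(CorridorEnv(), random_policy)
print(f"steps: {len(trajectory)}, total reward: {total:.1f}")
```

A random agent wanders for a while before reaching the goal. Learning consists of replacing `random_policy` with something that collects more reward.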
How does the agent learn?
As we saw previously, the goal of the agent is to maximize the total reward it gets from the environment. A naive strategy would be, in state s, to choose the action that yields the highest immediate reward. But an agent that does this ignores the impact of its action on the rewards it will receive later, since it only looks one step ahead. To overcome this problem, the agent should choose the action that maximizes its reward in the long term. We therefore introduce the expected discounted return, which the agent has to maximize.
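In standard notation, the discounted return from time t can be written as follows. The symbols G_t (the return) and R_{t+1} (the reward received after the action taken at time t) are assumed here, and the agent maximizes the expected value of G_t:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```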
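Here, γ (gamma) is the discount factor, which belongs to (0,1]. It matters because it gives more weight to immediate rewards than to distant ones: each additional step of delay multiplies a reward's weight by γ. This makes sense, since rewards far in the future are less certain than rewards the agent can collect now, and with γ < 1 the sum stays finite even when the interaction never ends, as long as the rewards are bounded.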
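To see the effect of γ concretely, here is a minimal sketch that computes the discounted sum of a finite reward sequence. The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted sum r_0 + gamma*r_1 + gamma^2*r_2 + ... of a reward list."""
    g = 0.0
    # Work backwards so each step applies one extra factor of gamma
    # to everything that comes after it.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0, 1.0]           # hypothetical rewards, one per step
print(discounted_return(rewards, 1.0))   # 4.0: no discounting, plain sum
print(discounted_return(rewards, 0.5))   # 1.875 = 1 + 0.5 + 0.25 + 0.125
```

With γ = 1 every reward counts equally. With γ = 0.5 the fourth reward contributes only 0.125, so the agent cares far more about what happens soon.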
To maximize the expected discounted return, the agent has to learn which action to choose in each state s. We can express this as finding the best probability distribution over actions for each state. This distribution is called the policy and is commonly denoted 𝜋: 𝜋(a,s)=Pr(a_t=a|S_t=s)
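As a small illustration, a policy over a handful of states can be stored as a table of probabilities, and the agent acts by sampling from the row of its current state. The state names, action names, probabilities, and the helper `sample_action` below are all hypothetical.

```python
import random

# Hypothetical tabular policy: policy[state][action] = probability of
# choosing that action in that state. Each row must sum to 1.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng=random):
    """Draw one action according to the policy's distribution for `state`."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]


rng = random.Random(0)
print([sample_action(policy, "s0", rng) for _ in range(10)])
```

In state `s0` the agent will pick `right` about 80% of the time. Learning a better policy means adjusting these probabilities.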
We could stop here and use algorithms such as Policy Gradient or Proximal Policy Optimization, which optimize the policy directly. But we can go deeper and view the problem differently.
Let’s introduce two other functions: the State-Value function and the Action-Value function. The State-Value function estimates how good it is to be in a certain state s under a policy 𝜋, that is, the expected discounted return when starting from s and following 𝜋.
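In standard notation, the State-Value function can be written as follows. The symbols V^𝜋, G_t (the discounted return from time t), and R_{t+k+1} (the reward received k+1 steps after time t) are assumed here:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \;\middle|\; S_t = s \right]
           = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right]
```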
The Action-Value function estimates how good it is for the agent to take a given action a from a given state s and then follow a certain policy 𝜋.
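In standard notation, the Action-Value function can be written as follows. The symbols Q^𝜋, G_t (the discounted return from time t), and R_{t+k+1} are assumed here. The only difference from the State-Value function is that the first action is fixed to a instead of being drawn from 𝜋:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \;\middle|\; S_t = s,\ a_t = a \right]
              = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s,\ a_t = a \right]
```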
Now that the Action-Value function is defined, we can introduce the Bellman Optimality Equation, which is widely used in Reinforcement Learning algorithms. Let’s see what it looks like.
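In standard notation, with Q^* denoting the Action-Value function of the optimal policy and R_{t+1} the reward received after the action taken at time t (both symbols are assumed here), the equation reads:

```latex
Q^{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \;\middle|\; S_t = s,\ a_t = a \right]
```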
It tells us that under the optimal policy, for any state-action pair (s,a), the Action-Value function is equal to the expected reward for taking action a in state s plus the discounted maximum expected return achievable from the next state s', where the maximum is taken over the actions a' available there. If we find an Action-Value function that satisfies this equation, acting greedily with respect to it gives an optimal policy, and we can say that we won.
The agent's task is now to solve the Bellman Optimality Equation. Many algorithms have been designed to do this, such as Q-Learning, which I will explain in another post.
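To make "solving the equation" concrete before getting to Q-Learning, here is a minimal sketch of a simpler method: repeatedly applying the Bellman Optimality update to a table of Action-Values until it stops changing. This is not Q-Learning. It assumes the transitions and rewards of the environment are fully known, which is usually not the case in Reinforcement Learning. The tiny deterministic MDP and the names `transitions` and `q_value_iteration` are invented for the example.

```python
# Hypothetical deterministic MDP: transitions[state][action] = (next_state, reward).
# States 0..2 form a corridor; state 3 is terminal and reached with reward 1.
GAMMA = 0.9
TERMINAL = 3
transitions = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 0.0)},
    2: {"left": (1, 0.0), "right": (3, 1.0)},
}


def q_value_iteration(transitions, gamma, terminal, tol=1e-10, max_iters=10_000):
    """Iterate Q(s,a) <- r + gamma * max_a' Q(s',a') until convergence."""
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(max_iters):
        max_change = 0.0
        for s, acts in transitions.items():
            for a, (s_next, r) in acts.items():
                # No future reward can be collected after the terminal state.
                future = 0.0 if s_next == terminal else max(q[s_next].values())
                new_q = r + gamma * future
                max_change = max(max_change, abs(new_q - q[s][a]))
                q[s][a] = new_q
        if max_change < tol:
            break
    return q


q_star = q_value_iteration(transitions, GAMMA, TERMINAL)
# Acting greedily with respect to the converged table gives the policy.
greedy_policy = {s: max(acts, key=acts.get) for s, acts in q_star.items()}
print(q_star)
print(greedy_policy)
```

In this toy problem the greedy policy is to move right in every state, and the values shrink by a factor of γ for each extra step away from the goal.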
I hope you now understand the logic and the key equations behind Reinforcement Learning.