Deep Reinforcement Learning Explained: From Theory to Application

Deep Reinforcement Learning (Deep RL) represents a fascinating frontier in Artificial Intelligence, merging the capabilities of Reinforcement Learning (RL) with the power of Deep Learning. This combination allows agents to learn optimal behaviors through trial and error in complex environments, making decisions based on unstructured input data. This article explores the fundamentals of deep reinforcement learning, its historical context, key concepts, and diverse applications.

Introduction: Learning Through Interaction

Imagine teaching a robot to perform a complex task, such as landing a drone, without explicitly programming every move. This is the essence of Reinforcement Learning. Unlike other machine learning approaches, RL involves an agent learning from the environment by interacting with it, receiving rewards or penalties as feedback for its actions.

Inspiration from Behavioral Psychology

The core idea of RL can be traced back to Pavlov's experiments with dogs and Skinner's experiments with rats. The principle is simple: reward the subject when it performs a desired action (positive reinforcement) and penalize it for undesirable actions (punishment).

In computational terms, the goal is to create a system that learns to perform tasks in a way that maximizes rewards and minimizes penalties.

Key Concepts in Reinforcement Learning

To understand how RL systems work, it's essential to define several key concepts:

Agent (or Actor): The entity that interacts with the environment. This could be a robot, a software program, or any decision-making system.
Environment (or the world): The surroundings in which the agent operates. This could be a physical space, a simulation, or a virtual game world. The agent's entire existence is confined within this environment.
State: The agent's perception or knowledge of its current situation within the environment. It's what the agent "sees" or "knows" about its surroundings.
Action: A step the agent takes based on its current state. For instance, an agent might decide to move, interact with an object, or perform a specific operation.
Reward: Feedback the agent receives after taking an action. This can be positive (reward) or negative (penalty), indicating whether the action was beneficial or detrimental to achieving the agent's goal.

The Reinforcement Learning Framework: A Cycle of Interaction

The RL process can be visualized as a continuous loop:

The agent receives a state S0 from the environment.
Based on the state S0, the agent takes an action A0.
The environment transitions to a new state S1.
The environment provides a reward R1 to the agent.

This loop continues, generating a sequence of states, actions, rewards, and next states. The agent's ultimate goal is to maximize its expected cumulative reward, known as the expected return.
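The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CoinFlipEnv`, its `reset`/`step` methods, and `random_policy` are hypothetical names invented for this example, not part of any library, and the episode is cut off after a fixed number of steps so the loop terminates.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a coin flip each step."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        # Start a new episode and return the initial state S0 (here, the step index).
        self.t = 0
        return self.t

    def step(self, action):
        # The environment flips a coin: +1 for a correct guess, -1 otherwise.
        coin = self.rng.randint(0, 1)
        reward = 1 if action == coin else -1
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def random_policy(state):
    # Placeholder agent: ignores the state and guesses at random.
    return random.choice([0, 1])


def run_episode(env, policy):
    """Run the state -> action -> reward -> next state loop until the episode ends."""
    state = env.reset()
    trajectory = []
    total_reward = 0
    done = False
    while not done:
        action = policy(state)                       # agent picks A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    return trajectory, total_reward


trajectory, total_reward = run_episode(CoinFlipEnv(), random_policy)
```

Each entry of `trajectory` is one (state, action, reward, next state) tuple, and `total_reward` is the undiscounted cumulative reward for that episode.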

The Reward Hypothesis: Maximizing Cumulative Reward

RL is based on the reward hypothesis, which posits that all goals can be described as maximizing the expected return (expected cumulative reward). Achieving the best behavior in RL requires maximizing this expected cumulative reward.

Deep Learning Integration

Deep Reinforcement Learning leverages deep neural networks to solve reinforcement learning problems. Instead of relying on tabular methods that store a separate value for every state, Deep RL employs neural networks to approximate value functions or policies, enabling agents to handle complex, high-dimensional state spaces.

The "Deep" in Reinforcement Learning

The integration of deep neural networks is what gives rise to "deep" reinforcement learning. Deep RL uses deep learning to estimate components of the reinforcement learning problem, such as a model of the environment, the policy, or the expected rewards and penalties. For instance, classic Q-Learning builds a Q-table that stores a value for each state-action pair, from which the best action for each state is chosen. In Deep Q-Learning, a neural network approximates the Q-values instead.
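To make the contrast concrete, here is a minimal sketch of a table lookup next to a parametric approximator. All states, actions, table entries, and weights are made up for illustration. The approximator is a single linear layer standing in for a neural network, so it only shows the idea of computing Q-values from state features rather than storing them. It is not a real Deep Q-Learning implementation and involves no training.

```python
# Tabular view: one stored number per (state, action) pair (hypothetical values).
q_table = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.7,
    ("s1", "left"): 0.4, ("s1", "right"): 0.2,
}


def best_action_tabular(state, actions):
    # Look up every action's Q-value for this state and pick the largest.
    return max(actions, key=lambda a: q_table[(state, a)])


# Approximation view: Q-values are computed from state features by a
# parametric function. These weights are hypothetical and untrained.
weights = {"left": [0.5, -1.0], "right": [-0.5, 1.0]}


def q_values_approx(features):
    # One dot product per action: Q(s, a) is approximated by w_a . features(s).
    return {
        action: sum(w * x for w, x in zip(ws, features))
        for action, ws in weights.items()
    }


def best_action_approx(features):
    q = q_values_approx(features)
    return max(q, key=q.get)
```

The table needs one entry per state-action pair, while the approximator can produce a value for any feature vector, including states it has never seen.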

Observations and Action Spaces

Observations/States Space

Observations/states are the information our agent gets from the environment. In a video game, this can be a frame (a screenshot); for a trading agent, it can be the value of a certain stock.

There is a distinction to make between state and observation:

State s: a complete description of the state of the world (there is no hidden information). In a chess game, we are in a fully observed environment and receive a state, since we have access to the whole chessboard.
Observation o: a partial description of the state. In Super Mario Bros, we are in a partially observed environment and receive an observation, since we only see the part of the level close to the player.

In practice, this article uses the term state for both, but the distinction matters in implementations.

Action Space

The action space is the set of all possible actions in an environment. The actions can come from a discrete or continuous space:

Discrete space: the number of possible actions is finite. In Super Mario Bros, we have a finite set of actions, since only four directions and jump are possible.
Continuous space: the number of possible actions is infinite. A self-driving car agent has an infinite number of possible actions, since it can turn left 20°, 21°, 22°, honk, turn right 20°, 20.1°…

This distinction matters because it affects which RL algorithm we can choose later.
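As a rough illustration of the two kinds of action space, the sketch below samples one action from each. The action names and the steering range are assumptions chosen for the example, not values taken from any real game or vehicle.

```python
import random

# Discrete space: a finite list of choices (hypothetical action names).
DISCRETE_ACTIONS = ["left", "right", "up", "down", "jump"]

# Continuous space: any real number inside an interval, here a steering
# angle in degrees (the range is an assumption for illustration).
STEERING_RANGE = (-30.0, 30.0)


def sample_discrete(rng):
    # Pick one of finitely many actions.
    return rng.choice(DISCRETE_ACTIONS)


def sample_continuous(rng):
    # Pick any angle in the interval; there are infinitely many possibilities.
    low, high = STEERING_RANGE
    return rng.uniform(low, high)


rng = random.Random(0)
discrete_action = sample_discrete(rng)
continuous_action = sample_continuous(rng)
```

An algorithm that enumerates every action to compare them works for the first case but cannot be applied directly to the second.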

Rewards and Discounting

The reward is fundamental in RL because it’s the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.

The cumulative reward at each time step t, written G_t here, can be written as:

$$G_t = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + \dots$$

The cumulative reward is equal to the sum of all rewards of the sequence, which is equivalent to:

$$G_t = \sum_{k=0}^{\infty} r_{t+k+1}$$

However, in reality, we can’t just add them like that. Rewards that come sooner (at the beginning of the game) are more likely to happen, since they are more predictable than long-term future rewards.

Let’s say your agent is a small mouse that can move one tile each time step, and your opponent is a cat (that can move too). Your goal is to eat the maximum amount of cheese before being eaten by the cat.

In this setting, we are more likely to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

As a consequence, the reward near the cat, even if it is bigger (more cheese), will be more heavily discounted, since we’re not really sure we’ll be able to eat it.

To discount the rewards, we proceed like this:

We define a discount rate called gamma (γ). It must be between 0 and 1. The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.

On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short-term reward (the nearest cheese).

Then, each reward is discounted by gamma raised to the power of the time step.

As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.

Our discounted expected cumulative reward from time step t, written G_t, is:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$$
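As a small worked sketch of the discounting described above, the function below computes the discounted sum for a finite list of rewards. The function name and the sample reward lists are assumptions for illustration. The list is taken to start with the first reward received after the current time step, so the k-th entry (counting from 0) is weighted by gamma to the power k.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k] for a finite reward list."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    total = 0.0
    # Work backwards through the list: the return from one step equals
    # that step's reward plus gamma times the return from the next step.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# With gamma = 1 nothing is discounted; with gamma = 0.5 later rewards count less.
undiscounted = discounted_return([1, 1, 1], 1.0)   # 1 + 1 + 1
discounted = discounted_return([1, 1, 1], 0.5)     # 1 + 0.5 + 0.25
```

With a small gamma, a large reward that arrives late (the cheese near the cat) can be worth less than a small reward that arrives immediately.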

Types of Tasks

A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuous.

Episodic Tasks

In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and New States.

For instance, in Super Mario Bros, an episode begins at the launch of a new level and ends when you’re killed or you reach the end of the level.

Continuous Tasks

These are tasks that continue forever (no terminal state). In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.

For instance, consider an agent that does automated stock trading. For this task, there is no starting point or terminal state. The agent keeps running until we decide to stop it.

Exploration/Exploitation Trade-off

Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.

Exploration is exploring the environment by trying random actions in order to find more information about the environment.
Exploitation is exploiting known information to maximize the reward.

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.

Let’s take an example:

In this game, our mouse can collect an unlimited number of small pieces of cheese (+1 each). But at the top of the maze, there is a gigantic pile of cheese (+1000).

However, if we only focus on exploitation, our agent will never reach the gigantic pile of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small.

But if our agent does a little bit of exploration, it can discover the big reward (the big pile of cheese).

This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we know about the environment.

Therefore, we must define a rule that helps to handle this trade-off. We’ll see in future chapters different ways to handle it.

If it’s still confusing, think of a real problem, the choice of a restaurant:

Exploitation: You go every day to the same restaurant that you know is good, at the risk of missing a better one.
Exploration: You try restaurants you have never been to before, with the risk of a bad experience but the chance of an amazing one.
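One common rule for balancing the two is epsilon-greedy selection. It is not a method the text above prescribes, just a widely used example: with probability epsilon the agent explores by picking a random action, and otherwise it exploits the action with the best current estimate. The function name, the action names, and the estimates below are hypothetical.

```python
import random


def epsilon_greedy(estimated_rewards, epsilon, rng=random):
    """Pick an action from a dict mapping action -> current reward estimate."""
    if rng.random() < epsilon:
        # Explore: try any action, regardless of its current estimate.
        return rng.choice(list(estimated_rewards))
    # Exploit: take the action that currently looks best.
    return max(estimated_rewards, key=estimated_rewards.get)


# The mouse has only ever tasted the small cheese, so the estimate for the
# unexplored path is still 0 (hypothetical numbers).
estimates = {"small_cheese": 1.0, "top_of_maze": 0.0}
always_exploit = epsilon_greedy(estimates, epsilon=0.0)
sometimes_explore = epsilon_greedy(estimates, epsilon=0.1)
```

With `epsilon=0.0` the agent never leaves the small cheese. Any positive epsilon gives it some chance of trying the path to the top of the maze and updating its estimate.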

Approaches for Solving RL Problems

Now that we have learned the RL framework, how do we solve the RL problem? In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?

The Policy π: the agent’s brain

The policy π is the brain of our agent: it’s the function that tells us what action to take given the state we are in. So it defines the agent’s behavior at a given time.

This policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes the expected return when the agent acts according to it. We find this π* through training.

There are two approaches to train our agent to find this optimal policy π*:

Directly, by teaching the agent to learn which action to take, given the state it is in: Policy-Based Methods.
Indirectly, by teaching the agent to learn which states are more valuable and then take the action that leads to the more valuable states: Value-Based Methods.

Policy-Based Methods

In Policy-Based Methods, we learn a policy function directly. This function maps each state to the best corresponding action at that state, or to a probability distribution over the set of possible actions at that state.

A deterministic policy directly indicates the action to take at each step.

We have two types of policy:

Deterministic: a policy that, at a given state, always returns the same action: action = policy(state)
Stochastic: a policy that outputs a probability distribution over actions: policy(actions | state) = probability distribution over the set of actions given the current state

Given an initial state, our stochastic policy outputs a probability distribution over the possible actions at that state.
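The difference between the two policy types can be sketched as follows. The states, actions, and probabilities are made up for illustration, and the policies are hard-coded rather than learned, so this shows only the interface of each type.

```python
import random


def deterministic_policy(state):
    # action = policy(state): the same state always yields the same action.
    table = {"near_cheese": "eat", "near_cat": "run"}
    return table[state]


def stochastic_policy(state):
    # policy(actions | state): a probability distribution over actions.
    distributions = {
        "near_cheese": {"eat": 0.9, "run": 0.1},
        "near_cat": {"eat": 0.2, "run": 0.8},
    }
    return distributions[state]


def sample_action(distribution, rng):
    # Draw one action according to the probabilities in the distribution.
    actions = list(distribution)
    probabilities = [distribution[a] for a in actions]
    return rng.choices(actions, weights=probabilities, k=1)[0]


rng = random.Random(0)
fixed_action = deterministic_policy("near_cat")
sampled_action = sample_action(stochastic_policy("near_cat"), rng)
```

Calling `deterministic_policy("near_cat")` repeatedly always gives the same answer, while repeated calls to `sample_action` on the same state can return different actions.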

Value-Based Methods

In Value-Based Methods, instead of training a policy function, we train a value function that maps a state to the expected value of being at that state.
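A minimal sketch of how a value function can drive behavior is shown below. It assumes the state values are already known and that each action leads deterministically to a known next state. Both are simplifications made for this example, and all names and numbers are hypothetical.

```python
# Hypothetical value function: expected value of being in each state.
state_values = {"start": 0.1, "dead_end": 0.0, "corridor": 0.5, "cheese": 1.0}

# Hypothetical, deterministic transitions: state -> {action: next state}.
transitions = {
    "start": {"left": "dead_end", "right": "corridor"},
    "corridor": {"back": "start", "forward": "cheese"},
}


def greedy_action(state):
    # Choose the action whose resulting state has the highest value.
    options = transitions[state]
    return max(options, key=lambda action: state_values[options[action]])


first_move = greedy_action("start")
second_move = greedy_action("corridor")
```

No policy is stored here. The behavior follows entirely from comparing the values of the states each action leads to.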
