Comprehensive Overview of Reinforcement Learning Algorithms
Machine learning is typically divided into three categories: supervised learning, unsupervised learning, and reinforcement learning (RL). Supervised learning uses labeled data for classification and regression tasks, while unsupervised learning uncovers patterns in unlabeled data, like clustering and principal component analysis (PCA). Reinforcement learning differs significantly, as it relies on evaluative feedback rather than explicit labels. This feedback doesn't directly indicate the correctness of a decision, presenting unique challenges such as credit assignment and the exploration-exploitation trade-off. Deep RL is an integration of deep learning and RL.
Core Concepts of Reinforcement Learning
An RL agent learns a policy through trial and error by interacting with an environment over time to maximize its long-term cumulative reward. At each time step, the agent observes the environment, selects an action based on its policy (a mapping from observations to actions), and executes it. The environment then provides a scalar reward and transitions to a new state, guided by its dynamics.
In episodic environments, this process continues until the agent reaches a terminal state, after which it restarts. Continuing environments, conversely, lack a terminal state. A discount factor is used to weigh the influence of future rewards. The model encompasses the transition probability and the reward function.
The RL framework is highly versatile, accommodating discrete or continuous state and action spaces. The environment can take various forms, including multi-armed bandits, Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and games. Furthermore, RL problems can be deterministic, stochastic, dynamic, or adversarial.
Value Functions: Guiding the Agent
Value functions play a critical role in RL by quantifying the "goodness" of states or state-action pairs. These functions predict the expected cumulative, discounted future reward, often referred to as the return. The action-value function, commonly known as the Q-function, is particularly important.
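As a concrete illustration, the return can be computed recursively from the end of a reward sequence. The following is a minimal sketch; the reward values and discount factor are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a reward sequence, computed backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + gamma * G_{t+1}
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```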
An optimal value function represents the best achievable value under any policy, with the corresponding policy being the optimal policy. Optimal value functions encode global optimal information, simplifying the process of finding an optimal policy. The agent's primary objective is to maximize the expected long-term return or to identify an optimal policy.
Methods for Solving Reinforcement Learning Problems
Several approaches exist for tackling RL problems, each with its strengths and weaknesses.
Dynamic Programming
When the system model is available, dynamic programming methods can be employed. These methods involve policy evaluation to calculate the value/action value function for a given policy, and value iteration or policy iteration to find an optimal policy. Policy iteration consists of policy evaluation and policy improvement.
Monte Carlo Methods
Monte Carlo (MC) methods learn from complete episodes of experience, do not require knowledge of the transition or reward models, and use sample means for estimation. Monte Carlo methods are applicable only to episodic tasks.
Model-Free vs. Model-Based Methods
In model-free methods, the agent learns directly from experience through trial and error, without knowing the state transition model. Conversely, model-based methods utilize a model, either given or learned from experience, to guide learning.
Prediction and Control
The prediction problem, also known as policy evaluation, involves computing the state or action value function for a given policy. The control problem focuses on finding the optimal policy. Planning involves constructing a value function or a policy using a model.
Online vs. Offline Learning
In online mode, algorithms are trained on data received sequentially. In offline mode, also known as batch mode, algorithms are trained on a collection of data.
Temporal Difference Learning
Temporal difference (TD) learning learns state value functions directly from experience, bootstrapping from its own estimates in a model-free, online, and fully incremental manner. Bootstrapping means updating estimates of state or action values based on subsequent estimates. TD learning is an on-policy method, using samples generated by the target policy itself.
Q-learning, a temporal difference control method, learns an optimal action value function to find an optimal policy. Q-learning is an off-policy method, learning from experience trajectories generated by some behavior policy, which may differ from the target policy. The concepts of on-policy and off-policy can be understood as "same-policy" and "different-policy," respectively.
Function Approximation
In the tabular case, value functions and policies are stored in tables. However, when the state and/or action spaces are large or continuous, function approximation is used for generalization: it generalizes from examples of a function to construct an approximation of the entire function. Linear function approximation is a popular choice, partly due to its desirable theoretical properties. A function is approximated by a linear combination of basis functions, which usually need to be designed manually; the coefficients, or weights, of the combination are found by learning algorithms.
Non-linear function approximation, particularly with Deep Neural Networks (DNNs), can be used to represent states, observations, and actions, and to approximate value functions, policies, and models (state transition and reward functions). In this case, the weights in the DNNs need to be learned. Integrating deep learning with RL leads to deep RL methods, which have gained significant popularity and achieved remarkable results.
Policy-Based vs. Value-Based Methods
TD learning and Q-learning are value-based methods. Policy-based methods, in contrast, optimize the policy directly, such as policy gradient methods. Actor-critic algorithms update both the value function and the policy.
Popular Deep Reinforcement Learning Algorithms
Several deep RL algorithms have emerged as prominent solutions for complex problems.
Deep Q-Network (DQN)
DQN integrates Q-learning with DNNs and uses experience replay and a target network to stabilize learning. Experience replay stores transitions (state, action, reward, next state) in a replay buffer and samples them randomly for training. A target network maintains a separate copy of the network parameters for computing training targets and updates it only periodically, rather than at every iteration.
Asynchronous Advantage Actor-Critic (A3C)
Mnih et al. (2016) introduced Asynchronous Advantage Actor-Critic (A3C), where parallel actors employ different exploration policies to stabilize training, without using experience replay.
Deterministic Policy Gradient (DPG) and Deep DPG (DDPG)
Deterministic policy gradient can help estimate policy gradients more efficiently. Silver et al. (2014) presented Deterministic Policy Gradient (DPG), which Lillicrap et al. (2016) extended to Deep DPG (DDPG).
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO)
Trust region methods stabilize policy optimization by constraining gradient updates. Schulman et al. (2015) introduced Trust Region Policy Optimization (TRPO), and Schulman et al. (2017) presented Proximal Policy Optimization (PPO).
Soft Actor-Critic (SAC)
Haarnoja et al. (2018) presented Soft Actor-Critic (SAC), an off-policy algorithm that aims to simultaneously succeed at the task and act as randomly as possible.
Twin Delayed Deep Deterministic Policy Gradient (TD3)
Fujimoto et al. (2018) developed the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) to minimize the effects of overestimation on both the actor and the critic. Fujimoto and Gu (2021) introduced a variant of TD3 for offline RL.
Exploration vs. Exploitation Trade-off
A fundamental dilemma in RL is the exploration vs. exploitation trade-off. To maximize rewards, the agent needs to exploit the currently best action greedily, yet it must also explore the environment to discover potentially better actions, especially when the policy is not yet optimal or the system is non-stationary. A simple exploration approach is ε-greedy, where the agent selects a greedy action with probability 1 − ε and a random action otherwise.
Reinforcement Learning in Practice: The Shortest Path Problem
Consider the shortest path problem as an example. The single-pair shortest path problem aims to find a path between a given pair of nodes that minimizes the total distance, i.e., the sum of edge weights along the path. This problem can be formulated as an RL problem.
In this formulation, the state represents the current node. At each node, taking an action involves following a link to a neighboring node. The transition model dictates that the agent moves to a neighbor after selecting a link. The reward is the negative of the link cost. Setting the discount factor to 1 ensures that immediate and future rewards are treated equally, which is appropriate for episodic problems. The goal is to maximize the negative of the total cost, effectively minimizing the total distance.
An optimal policy involves choosing the best neighbor to traverse to achieve the shortest path. For each state/node, the optimal value represents the negative of the shortest distance from that node to the destination.
Why Not Dijkstra's Algorithm?
Dijkstra's algorithm is an efficient solution for the shortest path problem when global information about the graph (nodes, edges, and weights) is available. RL, however, can operate without such global information, exploring the graph according to a policy in a model-free manner. RL also offers a more general approach, applicable to problems with stochastic edge weights and, through meta-learning, to new problem instances.
Foundations and Influences of Reinforcement Learning
Reinforcement Learning (RL) is rooted in optimal control (in particular, dynamic programming and Markov decision processes), learning by trial and error, and temporal difference learning, the latter two being related to animal learning. RL has foundations in theoretical computer science/machine learning, classical AI, optimization, statistics, optimal control, operations research, neuroscience, and psychology. (At the same time, RL also influences these disciplines.) RL has both art and science components; rather than treating it as pure science, we need to accept that part of it is art.
Richard Sutton briefly discusses David Marr's three levels at which any information-processing machine must be understood: computational theory, representation and algorithm, and hardware implementation. AI has surprisingly little computational theory; the big ideas are mostly at the middle level of representation and algorithm. Sutton argues that RL is the first computational theory of intelligence: RL is explicitly about the goal, the whats and whys of intelligence.
RL methods deal with sampled, evaluative and sequential feedback, three weak forms of feedback simultaneously, compared with exhaustive, supervised and one-shot feedback (Littman, 2015).
Traditionally, optimal control (OC) and operations research (OR) require a perfect model for the problem formulation. RL can work in a model-free mode and can handle very complex problems via simulation, which may be intractable for model-based OC or OR approaches. AlphaGo, like many games, is a huge problem. RL with a perfect model can return a perfect solution, in the limit. RL with a decent simulator or a sufficiently large amount of data can provide a decent solution. Admittedly, there are adaptive and/or approximate approaches in OC and OR that work with imperfect models and/or for large-scale problems. We see that RL, optimal control, and operations research are merging and advancing together, which is beneficial and fruitful for all these scientific communities.
The Agent-Environment Interaction in Detail
A basic reinforcement learning agent interacts with its environment in discrete time steps. At each time step t, the agent receives the current state and reward. It then chooses an action from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state and the reward associated with the transition is determined.
Formulating the problem as a Markov decision process assumes the agent directly observes the current environmental state; in this case, the problem is said to have full observability. If the agent only has access to a subset of states, or if the observed states are corrupted by noise, the agent is said to have partial observability, and formally the problem must be formulated as a partially observable Markov decision process. In both cases, the set of actions available to the agent can be restricted.
When the agent's performance is compared to that of an agent that acts optimally, the difference in performance yields the notion of regret. Reinforcement learning is particularly well-suited to problems that involve a long-term versus short-term reward trade-off.
Two elements make reinforcement learning powerful: the use of samples to optimize performance, and the use of function approximation to deal with large environments.
Exploration Mechanisms
Reinforcement learning requires clever exploration mechanisms; randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite Markov decision processes is relatively well understood. One such method is ε-greedy, where 0 < ε < 1 is a parameter controlling the amount of exploration vs. exploitation. With probability 1 − ε, exploitation is chosen, and the agent chooses the action that it believes has the best long-term effect (ties between actions are broken uniformly at random). Alternatively, with probability ε, exploration is chosen, and the action is chosen uniformly at random.
Value Iteration and Policy Iteration
Assuming full knowledge of the Markov decision process, the two basic approaches to compute the optimal action-value function are value iteration and policy iteration. Both algorithms compute a sequence of functions Q_k (k = 0, 1, 2, …) that converge to Q*. Computing these functions involves computing expectations over the whole state space, which is impractical for all but the smallest (finite) Markov decision processes.
Monte Carlo Methods in Detail
Monte Carlo methods are used to solve reinforcement learning problems by averaging sample returns. Unlike methods that require full knowledge of the environment's dynamics, Monte Carlo methods rely solely on actual or simulated experience: sequences of states, actions, and rewards obtained from interaction with an environment. This makes them applicable in situations where the complete dynamics are unknown. Learning from actual experience does not require prior knowledge of the environment and can still lead to optimal behavior.
Monte Carlo methods apply to episodic tasks, where experience is divided into episodes that eventually terminate. Policy and value function updates occur only after the completion of an episode, making these methods incremental on an episode-by-episode basis, though not on a step-by-step (online) basis. These methods function similarly to bandit algorithms, in which returns are averaged for each state-action pair. The key difference is that actions taken in one state affect the returns of subsequent states within the same episode, making the problem non-stationary. To address this non-stationarity, Monte Carlo methods use the framework of generalized policy iteration (GPI).
Temporal Difference Methods in Detail
The computation in TD methods can be incremental (when after each transition the memory is changed and the transition is thrown away), or batch (when the transitions are batched and the estimates are computed once based on the batch). Batch methods, such as the least-squares temporal difference method, may use the information in the samples better, while incremental methods are the only choice when batch methods are infeasible due to their high computational or memory complexity. Some methods try to combine the two approaches.
Another problem specific to TD methods comes from their reliance on the recursive Bellman equation. Most TD methods have a so-called λ parameter that can continuously interpolate between Monte Carlo methods, which do not rely on the Bellman equations, and the basic TD methods, which rely entirely on the Bellman equations.
Function Approximation Methods in Detail
To handle large or continuous state spaces, function approximation methods are used. Linear function approximation starts with a mapping φ that assigns a finite-dimensional feature vector to each state-action pair. The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs. The problem with using action values is that they may need highly precise estimates of the competing action values, which can be hard to obtain when the returns are noisy, though this problem is mitigated to some extent by temporal difference methods.
Policy Search Methods
An alternative method is to search directly in (some subset of) the policy space, in which case the problem becomes a case of stochastic optimization. Gradient-based methods (policy gradient methods) start with a mapping from a finite-dimensional (parameter) space to the space of policies: given the parameter vector θ, let π_θ denote the policy associated with θ. Defining the performance function ρ(θ) as the expected return of π_θ, under mild conditions this function will be differentiable as a function of θ. If the gradient of ρ were known, one could use gradient ascent. Since an analytic expression for the gradient is not available, only a noisy estimate can be used.
A large class of methods avoids relying on gradient information. These include simulated annealing, cross-entropy search or methods of evolutionary computation. Policy search methods may converge slowly given noisy data. For example, this happens in episodic problems when the trajectories are long and the variance of the returns is large. Value-function based methods that rely on temporal differences might help in this case.
Multiagent Reinforcement Learning
All of the above methods can be combined with algorithms that first learn a model of the Markov decision process, i.e., the probability of each next state given an action taken from an existing state. Both the asymptotic and finite-sample behaviors of most algorithms are well understood. Beyond the single-agent setting, multiagent/distributed reinforcement learning is a topic of active interest.
Applications of Reinforcement Learning
RL has a wide variety of applications across many different fields.
Neuroscience
TD learning has been used to model dopamine-based learning in the brain.
Robotics
Deep RL has enabled the reinforcement learning community to move from playing video games to solving complex real-life problems in autonomous systems such as self-driving cars, delivery drones, and automated robotics.
Game Playing
Several exciting developments in Artificial Intelligence (AI) have happened in recent years. AlphaGo defeated the best professional human players in the game of Go. Soon afterwards, the extended algorithm AlphaGo Zero beat AlphaGo 100-0 without supervised learning from human knowledge. Top professional game players lost to the bot developed by OpenAI in DOTA 2 1v1 competition.

