Reinforcement Learning: An Overview

Reinforcement learning (RL) is a dynamic field within artificial intelligence, focused on training models, often referred to as "agents," to make optimal decisions in an environment to achieve a specific goal. This article provides a comprehensive overview of reinforcement learning, covering its fundamentals, applications, and key challenges.

Introduction to Reinforcement Learning

At its core, reinforcement learning is about optimizing outcomes through interaction with the world. It involves learning techniques that allow an agent to navigate an environment and take actions to maximize cumulative rewards. In simpler terms, RL enables AI to "learn by doing," similar to how humans learn through trial and error.

Fundamentals of Reinforcement Learning

Core Concepts

Reinforcement learning is a general framework for learning tasks through interaction. In theory, RL can solve any problem that can be phrased as a Markov Decision Process (MDP). The concept of the agent should be taken very broadly here: an agent can be a robot, a chatbot, a virtual character, etc. At every timestep t, the agent chooses an action a. After this action, it might receive a reward r, and we get a new observation of its state s. The new state can be determined both by the agent's action and by the environment the agent is operating in. The RL problem is to maximize the cumulative reward the agent collects over time.
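Concretely, "cumulative reward" is usually formalized as a discounted return: rewards further in the future are weighted down by a discount factor gamma. A minimal sketch in Python (the reward sequence and gamma below are made up for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...

    Computed backwards: G_t = r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 * 0 + 0.25 * 1 = 1.25
print(discounted_return([1.0, 0.0, 1.0], gamma=0.5))
```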

Consider an example where the agent is a monkey, and the task is to pick up as many bananas as possible. At every timestep, the monkey must choose an action: step towards the tree, grab something, climb, and so on. The reward at each timestep can be defined as the number of bananas the monkey collected at that timestep. After every action, the monkey will also be in a new state. If we define the monkey's state as its position in the world, then when the monkey takes a step, the state at the next timestep is its new coordinates. We are now searching for the optimal behavior: the sequence of actions that maximizes the cumulative number of bananas the monkey will collect.
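The interaction loop above can be sketched in a few lines of Python. The one-dimensional "banana world" below is a made-up toy: the tree's position, the reward, and the fixed always-step-right policy are all illustrative assumptions.

```python
class BananaWorld:
    """Toy 1-D environment: the monkey walks along a line; a banana tree sits at position 5."""

    def __init__(self):
        self.pos = 0                          # state: the monkey's position

    def step(self, action):
        """Apply an action (-1 = step left, +1 = step right); return (new state, reward)."""
        self.pos += action
        reward = 1 if self.pos == 5 else 0    # one banana when the tree is reached
        return self.pos, reward

env = BananaWorld()
total = 0
for t in range(10):                           # one action per timestep
    action = +1                               # a trivial fixed policy: always step right
    state, reward = env.step(action)
    total += reward                           # the cumulative reward we want to maximize
print(state, total)                           # ends at position 10 with 1 banana
```

A real agent would, of course, choose actions based on the observed state rather than follow a fixed rule.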

Model-Based vs. Model-Free Methods

Reinforcement learning methods fall into two broad families. Model-based methods, such as value iteration and policy iteration, assume access to a model of the environment's dynamics and use it to plan. Model-free methods, such as Q-learning (the classic, tabular approach) and deep Q-learning, learn directly from interaction with the environment, without such a model.
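As a taste of the tabular, model-free approach, here is a minimal sketch of a single Q-learning update (the states, actions, and hyperparameters are made up for illustration):

```python
alpha, gamma = 0.5, 0.9            # learning rate and discount factor (illustrative)
Q = {}                             # Q[(state, action)] -> current value estimate

def q_update(s, a, r, s2, actions):
    """Classic Q-learning: move Q(s,a) toward the target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

# One transition: from state "s0", action "right" earned reward 1.0 and led to "s1".
q_update("s0", "right", 1.0, "s1", ["left", "right"])
print(Q[("s0", "right")])          # 0 + 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```

Repeating this update over many sampled transitions is what lets the table converge toward the true action values.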


Deep Reinforcement Learning

Deep Reinforcement Learning (DRL) combines RL with deep neural networks: instead of storing a value for every state in a table, a network approximates the value function or the policy, which lets RL scale to large or continuous state spaces such as raw pixels.
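The core idea can be sketched with NumPy: instead of a lookup table, a small network maps a state vector to one Q-value per action. The layer sizes and random (untrained) weights below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 4, 16, 2              # illustrative sizes

# Randomly initialized (untrained) weights for a one-hidden-layer Q-network.
W1, b1 = rng.normal(size=(n_state, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_actions)), np.zeros(n_actions)

def q_values(state):
    """Forward pass: state vector -> one Q-value estimate per action."""
    h = np.maximum(0.0, state @ W1 + b1)             # ReLU hidden layer
    return h @ W2 + b2

q = q_values(np.ones(n_state))
action = int(np.argmax(q))                           # greedy action w.r.t. the network
```

In a real DRL algorithm such as DQN, these weights would be trained by gradient descent on the same kind of temporal-difference target that tabular Q-learning uses.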

Applications of Reinforcement Learning

Reinforcement learning has found success in various domains, including:

Games

One of the early successes of deep RL was learning to play Atari games straight from pixels. Later on, researchers moved beyond the (relatively) simple Atari games and evaluated RL on the hardest competitive games out there. The assumption is that if RL can solve these complex games, it can also generalize to challenging real-world settings. A well-known example is DeepMind's AlphaStar taking on professional players in StarCraft II.

Robotics

Solving tasks in simulations and video games is one thing, but what about real life? Another popular field where RL is often applied (or at least holds great promise) is robotics. Robotics is significantly harder than simulation for various reasons: think, for example, of the time it takes to repeatedly make a robot try out an action, or of the safety requirements involved. A notable example is the ANYmal robot from the Robotic Systems Lab in Zürich, which learned to recover from a fall.

Real-World Applications

RL can be applied to many domains beyond the ones just mentioned: advertising, finance, and healthcare, to name a few.


Key Challenges in Reinforcement Learning

Sample Inefficiency

It is generally known that RL is very sample-inefficient. We regard a "sample" as one interaction with the environment, and RL needs a great many samples to solve a task. In this sense, RL is very inefficient compared to humans: it doesn't take a human dozens of hours to learn how to play an Atari game. This sample inefficiency can in part be explained by the fact that humans leverage a lot of prior knowledge when they encounter a new task. A human can, for example, reuse skills from previous games or concepts acquired from other experiences throughout their life. An RL agent, in contrast, starts the learning process without any such priors. Leveraging knowledge from previous tasks is itself an active research topic.

Exploration-Exploitation Trade-off

While the previous problem may sound like a matter of engineering (it's not), the exploration-exploitation trade-off is more clearly fundamental. Whenever we train an RL agent, it needs time to explore: it must take actions it hasn't taken before in order to discover how to solve the problem. On the other hand, we can't let the agent always take random actions, because random actions might lead nowhere; sometimes we want the agent to exploit what it has already learned to optimize further. This is the exploration-exploitation trade-off: we want an automated way to strike a good balance between letting the agent explore and letting it take actions whose outcomes it already knows. For many problems, the agent can easily get stuck in a local optimum. The trade-off sounds quite tractable at first, but it turns out to be one of the hardest problems in RL.
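One common, simple heuristic for this trade-off is epsilon-greedy action selection. A minimal sketch (the Q-values below are made up):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore (random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# With epsilon = 0 the agent purely exploits: it picks action 1, the highest value.
print(epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0))
```

In practice, epsilon is often decayed over training, so the agent explores a lot early on and exploits more as its estimates improve.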

Sparse-Reward Problem

Another rather fundamental problem is the so-called sparse-reward problem. As the name implies, it occurs when the RL agent receives so few rewards that it effectively gets no feedback on how to improve. Imagine, for example, the classic mountain-car task: the agent needs to move the car left and right to build enough momentum to reach the top. Initially, the agent doesn't know that it needs to rock the car back and forth. If we only give the agent a reward (a positive feedback signal) when it reaches the flag, it might never get one, simply because random actions (exploration) may never reach the flag. A common counter-measure is "reward shaping": we modify the reward so the agent gets more feedback signals to learn from. For the mountain car, we could for example also reward the agent based on the speed or altitude it achieves. However, reward shaping is not a scalable solution, and other solutions are being sought.
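A sketch of what reward shaping could look like for the mountain-car example. The constants, the goal position, and the speed bonus below are illustrative assumptions, not the standard environment's reward:

```python
def shaped_reward(position, velocity, goal_position=0.5):
    """Sparse goal reward plus a dense shaping bonus for building momentum."""
    sparse = 100.0 if position >= goal_position else 0.0   # original sparse signal
    dense = 10.0 * abs(velocity)                           # shaping term: reward speed
    return sparse + dense

print(shaped_reward(position=0.1, velocity=0.02))   # far from the flag, small speed bonus
print(shaped_reward(position=0.6, velocity=0.0))    # at the flag: the sparse reward fires
```

The dense term gives the agent a gradient to follow long before it ever reaches the flag, which is exactly the feedback the sparse reward fails to provide.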

Reinforcement Learning Course Overview

Course Structure

The course is composed of:

  • A theory part: where you learn a concept in theory.
  • A hands-on part: where you’ll learn to use well-known Deep RL libraries to train your agents in unique environments. These hands-on exercises will be Google Colab notebooks with companion tutorial videos, if you prefer learning in video format!
  • Challenges: you’ll pit your agent against other agents in different challenges. There will also be a leaderboard where you can compare the agents’ performance.

Syllabus

The class teaches the fundamentals of reinforcement learning, starting with model-based methods such as value iteration and policy iteration, followed by model-free methods such as Q-learning (the classic, tabular approach) and deep Q-learning (personally, I applied a Double DQN in the second project). In addition, the class provided significant practice with reading, understanding, and replicating papers. We learned that this is a very difficult task, as most papers do not clearly document their parameters, libraries, and code. Replicating the experiments was made harder by advances in technology: certain algorithms that performed poorly (e.g., failed to converge) at the time of publication might converge on current hardware.


Learning Outcomes

At the end of this course, you’ll have a solid foundation, from the basics up to state-of-the-art (SOTA) methods.

Certification Process

The certification process is completely free:

  • To get a certificate of completion: you need to complete 80% of the assignments.
  • To get a certificate of honors: you need to complete 100% of the assignments.

Tools Needed

You need only 3 things:

  • A computer with an internet connection.
  • Google Colab (free version): most of our hands-on will use Google Colab, the free version is enough.
  • A Hugging Face Account: to push and load models. If you don’t have an account yet, you can create one here (it’s free).

Recommended Pace

Each chapter in this course is designed to be completed in 1 week, with approximately 3-4 hours of work per week. However, you can take as much time as necessary to complete the course. If you want to dive into a topic more in-depth, we’ll provide additional resources to help you achieve that.

tags: #reinforcement #learning #class #overview
