Mastering the One-Step Reinforcement Learning Bandit Problem
The one-step reinforcement learning bandit problem is a foundational concept in reinforcement learning (RL), offering a simplified environment to explore the critical balance between exploration and exploitation. This article provides a tutorial on understanding and solving this problem, from the basic formalization to action-value estimation and action selection strategies.
Introduction: The Exploration vs. Exploitation Dilemma
The exploration vs. exploitation dilemma is a fundamental challenge in decision-making, where one must choose between exploiting known options for immediate rewards and exploring unknown options for potentially greater long-term gains. This dilemma permeates many aspects of life.
Imagine you tried two pizzerias, and the second one is pretty good 🍕. So you keep going there. Makes sense, right? But what if there’s a cheaper, better one just around the corner that you’ll never find because you stopped exploring? That’s not just a human dilemma. It’s the foundation of a simple yet powerful concept in machine learning: learning by trial and error using Multi-Armed Bandits (MABs).
In a scenario where all information about the environment is known, finding the optimal strategy is straightforward. However, the challenge arises from incomplete information, necessitating a balance between gathering enough information to make informed decisions and managing the associated risks. Exploitation involves leveraging the best known option, while exploration entails taking risks to discover potentially better, unknown options. The best long-term strategy may involve short-term sacrifices.
The Multi-Armed Bandit Problem
The multi-armed bandit problem is a classic example that illustrates the exploration vs. exploitation dilemma. Imagine a gambler facing multiple slot machines (bandits) in a casino, each with an unknown probability of yielding a reward. The goal is to maximize the total reward by strategically choosing which machines to play and how many times to play each.
Multi-armed bandits offer a lightweight framework for decision-making under uncertainty. They are often used in real-world applications like ad placement, recommendation systems or A/B testing, especially when decisions need to be made sequentially.
Multi-Armed Bandits: Imagine a slot machine with multiple levers. Each lever gives a different reward, but you don’t know the probabilities. You can pull the levers as often as you like.
The goal? Win as much as possible. So the challenge is to figure out which lever pays off best in the long run.
In this setting, we typically consider an infinite number of trials. Restricting the number of trials introduces a new type of exploration problem: for instance, if the number of trials is smaller than the number of slot machines, we cannot even try every machine once to estimate its reward probability, and hence we have to behave smartly with respect to a limited set of knowledge and resources (i.e., our remaining trials).
Formalizing the Problem
Let \( \mathcal{A} \) be the set of actions, each referring to the interaction with one slot machine. The value of action \( a \) is its expected reward, \( Q(a) = \mathbb{E}[r \vert a] = \theta \), and \( \mathcal{R} \) is the reward function. In the case of a Bernoulli bandit, the reward \( r \) is observed in a stochastic fashion: it is 1 with probability \( \theta \) and 0 otherwise.
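As a concrete illustration, a Bernoulli bandit arm can be sketched in a few lines of Python; the class and variable names here are illustrative, not from any particular library:

```python
import random

class BernoulliBandit:
    """One slot machine: pays reward r = 1 with probability theta, else r = 0."""
    def __init__(self, theta):
        self.theta = theta  # hidden reward probability

    def pull(self):
        # The reward is observed in a stochastic fashion
        return 1 if random.random() < self.theta else 0

# A set of actions, one per machine, each with an unknown theta
bandits = [BernoulliBandit(theta) for theta in (0.2, 0.5, 0.8)]
reward = bandits[2].pull()  # r is 0 or 1
```

The agent never sees the `theta` values directly; it can only estimate them from observed rewards.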
Action-Value Methods
To maximize the total reward in the k-armed bandit problem, we need a way to estimate the values of different actions based on the rewards we’ve received. One natural method is to average the rewards received from selecting an action. These estimated action values can then be used to guide future decisions.
Sample Average Estimation
The simplest approach to estimate the value of an action is to take the average of the rewards received each time that action is selected. For an action a, the value estimate at time step t is denoted as Qₜ(a), and can be calculated using the sample average
\( Q_t(a) = \dfrac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}(A_i = a)}{\sum_{i=1}^{t-1} \mathbb{1}(A_i = a)} \)
Where 𝟙 is the indicator function, which returns 1 if the action a was selected at time i, and 0 otherwise. This formula therefore counts how many times action a has been chosen up to time t, and computes the mean reward over those selections.
If the denominator (the number of times action a has been selected) is zero, meaning the action has not yet been chosen, the value estimate Qₜ(a) can be initialized to a default value, such as zero.
As we select actions and accumulate rewards, by the law of large numbers, the estimate Qₜ(a) will converge to the true value q⁎(a). This method is called the sample-average method because each action’s estimate is the average of the rewards received when that action was taken.
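The sample-average formula can be written directly as a short Python function; a minimal sketch, assuming the history is stored as two parallel lists of actions and rewards (names illustrative):

```python
def sample_average(actions, rewards, a, default=0.0):
    """Q_t(a): mean of the rewards received when action a was selected."""
    selected = [r for taken, r in zip(actions, rewards) if taken == a]
    if not selected:        # denominator is zero: a has never been chosen
        return default      # fall back to a default value, e.g. zero
    return sum(selected) / len(selected)

# Example history over five time steps
actions = [0, 1, 0, 1, 1]
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
q1 = sample_average(actions, rewards, 1)  # (0.0 + 1.0 + 1.0) / 3
```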
Computationally Efficient Estimate Updates
While the sample-average method gives a reliable estimate of action values, it requires storing all past rewards and recomputing the average at every time step. This can become computationally expensive, especially as the number of actions and time steps increases. To overcome this, we can use an incremental update method that computes the action value in a computationally and memory-efficient manner.
Incremental Update
Instead of recalculating the entire average every time a new reward is received, we can update the estimate incrementally. Let Rₙ be the reward received after the nᵗʰ action selection. At time step n+1, the action value estimate Qₙ₊₁ can be updated based on the previous estimate Qₙ and the new reward Rₙ using the formula
\( Q_{n+1} = Q_n + \dfrac{1}{n} \left( R_n - Q_n \right) \)
This update rule can be interpreted as follows:
- Old Estimate: Qₙ is the previous estimate of the action’s value.
- Error: Rₙ - Qₙ represents the difference between the most recent reward and the current estimate - this is the error in the estimate.
- Step Size: ¹⁄ₙ is the step size, which becomes smaller as more rewards are collected.
This rule efficiently incorporates the latest reward into the running average: since the step size ¹⁄ₙ shrinks as n grows, each new reward has a diminishing influence on the estimate, and the result is exactly the sample average of all rewards received so far.
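As a sketch, the incremental rule reproduces the sample average exactly while storing only the current estimate and a count (function name illustrative):

```python
def incremental_update(q, r, n):
    """New estimate = old estimate + (1/n) * (reward - old estimate)."""
    return q + (r - q) / n

# Feeding rewards one at a time yields the running average
q, n = 0.0, 0
for r in [1.0, 0.0, 1.0, 1.0]:
    n += 1
    q = incremental_update(q, r, n)
# q == 0.75, the mean of the four rewards
```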
Basic Action Selection Methods
The effectiveness of an RL agent in the k-armed bandit problem depends not only on how accurately it estimates the value of each action but also on how it chooses which action to take. This choice, known as the action selection strategy, directly influences the agent’s ability to balance exploration and exploitation - a fundamental challenge in reinforcement learning.
Greedy Action Selection
Once we have obtained estimates for the values of all actions, a straightforward way to choose the next action is to select the one with the highest estimated value - this is known as greedy action selection. At any time step t, the greedy action is defined as
\( A_t = \operatorname{argmax}_a Q_t(a) \)
This approach exploits the current knowledge of the action values to maximize the immediate reward. However, it comes with a significant limitation: by always selecting the greedy action, the agent never explores other actions, which may yield even higher rewards in the long run. Thus, greedy action selection is prone to getting stuck in local optima.
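A minimal sketch of greedy selection over a list of value estimates, with ties broken at random (a common convention, assumed here rather than stated in the text):

```python
import random

def greedy_action(q_values):
    """Return the index of the action with the highest estimated value."""
    best = max(q_values)
    return random.choice([a for a, q in enumerate(q_values) if q == best])

a = greedy_action([0.1, 0.9, 0.4])  # -> 1, the action with the highest estimate
```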
Epsilon-Greedy Action Selection
The epsilon-greedy action selection method addresses the limitations of purely greedy action selection, balancing exploitation with exploration by occasionally selecting a random action with probability epsilon (ϵ).
Formally, at each time step t, with probability (1 - ϵ), the agent selects the greedy action. With probability ϵ, the agent selects an action randomly from the set of all possible actions. This is expressed as
\( A_t = \begin{cases} \operatorname{argmax}_a Q_t(a) & \text{with probability } 1 - \epsilon \\ \text{a random action from } \mathcal{A} & \text{with probability } \epsilon \end{cases} \)
The benefit of ϵ-greedy action selection is that, as the number of time steps grows, every action will be sampled an infinite number of times, allowing the estimate Qₜ(a) to converge to the true value q⁎(a) for all actions. This ensures that the agent has sufficient exploration while still mostly exploiting what it has learned.
The value of ϵ controls the balance between exploration and exploitation:
- A high ϵ leads to more exploration.
- A low ϵ leads to more exploitation.
Choosing an appropriate ϵ is crucial for achieving a good balance between learning and immediate rewards.
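The ϵ-greedy rule adds one coin flip on top of greedy selection; a minimal sketch (names illustrative):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Explore uniformly with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: uniform random action
    best = max(q_values)                        # exploit: greedy action
    return random.choice([a for a, q in enumerate(q_values) if q == best])

a = epsilon_greedy_action([0.1, 0.9, 0.4], epsilon=0.1)
```

Setting ϵ = 0 recovers the purely greedy agent; ϵ = 1 gives pure exploration.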
Performance of Greedy vs Epsilon-Greedy Agents
To assess the effectiveness of the greedy and ϵ-greedy action-value methods, we compare them numerically using 1000 randomly generated k-armed bandit problems with k=10. The true action values q⁎(a) for each action were sampled from a normal distribution with mean 0 and variance 1. At each time step, the actual reward for a selected action was drawn from a normal distribution with mean q⁎(a) and unit variance.
Similarly to Sutton and Barto, for each learning method, we measure its performance over 1000 time steps, averaged across 1000 independent runs of randomly generated bandit problems. This setup allows us to observe how well each method learns to maximize rewards over time and how frequently it selects the optimal action.
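A single run of this testbed can be sketched as follows. This is a simplified illustration of the experiment described above (one run, sample-average estimates, ϵ-greedy selection), not the exact code behind the plots:

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1):
    """One k-armed testbed run with an epsilon-greedy, sample-average agent."""
    q_star = [random.gauss(0, 1) for _ in range(k)]  # true values q*(a) ~ N(0, 1)
    q_est = [0.0] * k                                # estimates Q(a)
    counts = [0] * k
    best_action = q_star.index(max(q_star))
    rewards, optimal = [], []
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                  # explore
        else:
            a = q_est.index(max(q_est))              # exploit
        r = random.gauss(q_star[a], 1)               # reward ~ N(q*(a), 1)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]       # incremental update
        rewards.append(r)
        optimal.append(a == best_action)
    return rewards, optimal

rewards, optimal = run_bandit()  # average many such runs to reproduce the curves
```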
Average Reward vs Steps Taken
The following plot shows the average reward received over time for ϵ-greedy agents configured with ϵ-values of 0.01, 0.05 and 0.1 compared to a purely greedy agent:
Epsilon-greedy agents tend to obtain a better average reward, outperforming purely greedy agents.
As we can see, the ϵ-greedy methods outperform the purely greedy method in terms of the average reward over time. The purely greedy agent (ϵ = 0) tends to quickly converge to a suboptimal action due to the lack of exploration. In contrast, agents with small but positive ϵ values (e.g., ϵ = 0.01 or ϵ = 0.05) explore enough to find the optimal action more reliably, leading to higher cumulative rewards over time.
Agents with a higher ϵ value (ϵ = 0.1) initially sacrifice immediate rewards for exploration, but over the 1000-step horizon shown they outperform agents with smaller or zero ϵ, as they are more likely to have discovered the optimal action.
Optimal Action Selection Percentage
The next plot shows the percentage of time the optimal action was selected, averaged over the 1000 runs:
While epsilon-greedy agents with smaller ϵ-values improve slowly, they eventually outperform other agents.
Again, we observe that the ϵ-greedy agents perform better in selecting the optimal action over time. The greedy agent performs the worst, as it typically gets stuck selecting a suboptimal action early in the learning process. Among the ϵ-greedy agents, higher ϵ-values (e.g., ϵ = 0.1) initially result in less optimal action selection, but this improves over time as the exploration phase helps the agent discover the best actions. Although agents with smaller ϵ-values have a slower rate of improvement, eventually they will outperform those with higher ϵ-values, as they will select the optimal action more frequently due to the lower probability of exploration.
Interpretation of Results
From these results, it is clear that ϵ-greedy agents consistently outperform a purely greedy agent in the k-armed bandit problem. This is because greedy agents exploit their current knowledge without exploring alternative actions, and often converge on suboptimal actions. In contrast, ϵ-greedy agents are able to balance exploration and exploitation, which allows them to converge on the optimal action more reliably.
However, the best ϵ-value depends on the specifics of the problem. For example:
- In noisier environments with higher variance, it might take more exploration to find the optimal action, meaning agents with higher ϵ-values should fare better.
- If the rewards are deterministic (with no variance), a greedy method could perform best since the optimal action would be found quickly and reliably.
- In cases where rewards are non-stationary (where the reward probabilities change over time) exploration becomes even more critical. Without continued exploration, the agent may get stuck exploiting outdated knowledge, missing out on better actions that could yield better rewards.
Tracking Non-Stationary Problems
In some cases, it may be beneficial to fix the step-size parameter, rather than adjusting it dynamically as ¹⁄ₙ. A constant step-size, denoted by α can be used to allow recent rewards to have a stronger influence on the estimate than older ones. In this case, the update rule becomes
\( Q_{n+1} = Q_n + \alpha \left( R_n - Q_n \right) \)
where 0 < α ≤ 1 is a constant. This form of update rule is particularly useful in non-stationary environments, where the underlying reward probabilities may change over time. By keeping the step size constant, the estimate remains responsive to recent changes.
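The constant-α rule is a one-line change to the incremental update. The sketch below shows how the estimate tracks a shifted reward level; after n updates with a constant reward r, the estimate is r − (r − Q₁)(1 − α)ⁿ:

```python
def constant_step_update(q, r, alpha=0.1):
    """Q_{n+1} = Q_n + alpha * (R_n - Q_n): recent rewards weigh more."""
    return q + alpha * (r - q)

# The reward level jumps to 1.0; the estimate follows it exponentially
q = 0.0
for r in [1.0] * 20:
    q = constant_step_update(q, r, alpha=0.1)
# q == 1 - 0.9**20, roughly 0.878: most of the gap is closed after 20 steps
```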
Optimistic Initial Values
All the methods discussed so far depend on the initial estimates of the action values, denoted Q₁(a). These initial estimates introduce a bias that can impact the agent’s early decisions and influence which actions are selected more frequently. For the sample-average method, this bias disappears once all actions have been selected at least once. For methods using a constant step size, however, the bias never disappears entirely, although its effect diminishes over time.
One technique to encourage exploration early in the learning process is to set optimistic initial values - estimates that are higher than the agent’s expectations. Setting an optimistic initial value makes the agent more likely to try out different actions, since every action initially appears to offer a high reward. This drives the agent to explore all actions multiple times until the value estimates converge to the true action values.
For an action a, if we initialize the action value estimate with a high value, such as Q₁(a) = 5, this signals to the agent that every action starts with an expected reward of 5. As the agent tries different actions and receives lower rewards, the estimates will decrease and eventually converge to the true values q⁎(a). However, during the early stages, the optimistic initial values prevent the agent from prematurely exploiting a suboptimal action.
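A sketch of a purely greedy agent with optimistic initialisation, using the constant step-size update from the previous section (function name and parameters illustrative):

```python
import random

def optimistic_greedy_run(k=10, steps=200, q_init=5.0, alpha=0.1):
    """Greedy agent whose estimates all start at an optimistic Q1(a) = q_init."""
    q_star = [random.gauss(0, 1) for _ in range(k)]  # true action values
    q_est = [q_init] * k                             # every action looks great
    for _ in range(steps):
        a = q_est.index(max(q_est))                  # purely greedy selection
        r = random.gauss(q_star[a], 1)
        q_est[a] += alpha * (r - q_est[a])           # estimate drops toward q*(a)
    return q_est

# Early rewards disappoint, so the agent cycles through all arms before settling
q_est = optimistic_greedy_run()
```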
Optimistic Initialisation vs ϵ-Greedy Agent
The following plot shows the average reward received over time for an optimistic initialisation agent with Q₁(a) = 5, compared to an ϵ-greedy agent configured with ϵ = 0.1:
The optimistic initialisation method leads to higher rewards after initial exploration
When both agents are configured with a constant step-size of α = 0.1, the optimistic-initialisation method outperforms the ϵ-greedy agent in terms of the average reward received over 1000 time steps, although it identified the optimal action more slowly than the ϵ-greedy agent due to the extra exploration in the early stages.
tags: #one-step #reinforcement #learning #bandit #tutorial

