Human-Level Control Through Deep Reinforcement Learning: An Explanation
Reinforcement learning (RL) enables agents to learn how to make decisions in an environment to maximize a cumulative reward. Unlike supervised learning, which relies on labeled training data, RL agents learn through trial and error, receiving feedback in the form of rewards or penalties based on their actions. Deep reinforcement learning (DRL) combines RL with deep learning, allowing agents to learn directly from high-dimensional sensory inputs, such as images or video, and has achieved remarkable success in various domains, including game playing.
The Essence of Reinforcement Learning
In reinforcement learning, an agent interacts with an environment over time. At each time step, the agent observes the current state of the environment, selects an action, and receives a reward. The goal of the agent is to learn a policy, which is a mapping from states to actions, that maximizes the expected cumulative reward over the long term.
Exploration vs. Exploitation
A fundamental challenge in reinforcement learning is balancing exploration and exploitation. Exploration involves trying out new actions to discover potentially better strategies, while exploitation involves using the current best strategy to maximize immediate reward.
Consider a robot mouse in a maze seeking cheese (🧀 +1000 points) or water (💧+10 points) while avoiding electric shocks (⚡ -100 points). The mouse might find a cluster of water sources near the entrance and spend its time exploiting this discovery, racking up small rewards but missing the cheese further in the maze.
One simple strategy for exploration is the epsilon-greedy strategy. The mouse takes the best-known action most of the time (with probability 1 − ε, e.g., 80%), but occasionally (with probability ε, e.g., 20%) explores a new, randomly selected direction. Initially, exploration should be high, so ε starts large. Over time, as the mouse learns more about the maze, ε can be reduced, and the mouse settles into exploiting what it knows.
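The strategy fits in a few lines of Python (a minimal sketch; the function names, the 0.2 starting value, and the decay schedule are illustrative assumptions, not from the original):

```python
import random

def decayed_epsilon(step, start=0.2, floor=0.01, decay=0.999):
    """Start with heavy exploration and settle toward exploitation over time."""
    return max(floor, start * decay ** step)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

The decay schedule is what lets the mouse explore widely early on without wandering randomly forever.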
The reward is not always immediate. The robot mouse might have to navigate a long stretch of the maze before reaching the cheese.
Markov Decision Processes (MDPs)
The mouse's journey through the maze can be formalized as a Markov Decision Process (MDP), which specifies transition probabilities from state to state. MDPs include:
- A finite set of states: These are the possible positions of the mouse within the maze.
- A set of actions available in each state: This is {forward, back} in a corridor and {forward, back, left, right} at a crossroads.
- Transitions between states: For example, going left at a crossroads leads to a new position. These can be probabilities linking to multiple states (e.g., an attack in Pokémon can miss, inflict some damage, or knock out the opponent).
- Rewards associated with each transition: In the robot-mouse example, most rewards are 0, but reaching water or cheese is positive, and an electric shock is negative.
- A discount factor γ between 0 and 1: This quantifies the difference in importance between immediate and future rewards. If γ is 0.9, a reward of 5 after 3 steps has a present value of 0.9³*5.
- Memorylessness: Once the current state is known, the history of the mouse's travels can be erased because the current Markov state contains all useful information from the history. "The future is independent of the past given the present."
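The ingredients above can be written down as a toy data structure for the mouse's maze (an illustrative sketch with made-up state names; a fully probabilistic MDP would map each state-action pair to a distribution over next states rather than a single one):

```python
# A deterministic toy MDP for the robot mouse, as a plain dictionary.
mdp = {
    "states": ["entrance", "corridor", "crossroads", "water", "cheese"],
    "actions": {"corridor": ["forward", "back"],
                "crossroads": ["forward", "back", "left", "right"]},
    # Deterministic transitions: each (state, action) leads to one next state.
    "transitions": {("corridor", "forward"): "crossroads",
                    ("crossroads", "left"): "water",
                    ("crossroads", "forward"): "cheese"},
    # Most rewards are 0 (absent from the dict); water and cheese are positive.
    "rewards": {("crossroads", "left"): 10, ("crossroads", "forward"): 1000},
    "gamma": 0.9,  # discount factor
}
```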
The objective is to maximize the sum of rewards in the long term:
Σ_t [γ^t · r(x_t, a_t)], summed over all time steps t
Where:
- γ is the discount factor.
- r(x, a) is the reward function: for state x and action a, it gives the reward received for taking action a in state x.
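Computing this discounted sum is a one-liner (an illustrative sketch):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over time steps t = 0, 1, 2, ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 0.9, a reward of 5 arriving after 3 steps is worth
# 0.9**3 * 5 = 3.645 from today's point of view.
```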
Core Techniques in Reinforcement Learning
Q-Learning: Learning the Action-Value Function
Q-learning is a technique that evaluates which action to take based on an action-value function that determines the value of being in a certain state and taking a certain action at that state.
The function Q takes a state and an action as input and returns the expected reward of that action (and all subsequent actions) at that state. Initially, Q assigns the same (arbitrary) fixed value to every state-action pair. As the agent explores the environment, Q becomes a better approximation of the value of an action a in a state s, and it is updated as the agent goes.
The update rule for Q-learning is:
Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ · max_a Q(s_t+1, a) − Q(s_t, a_t)]
Where:
- α is the learning rate, determining how aggressively the value is updated.
- r_t is the reward received by taking action a_t in state s_t.
- γ is the discount factor.
- max_a Q(s_t+1, a) is the maximum achievable Q-value over all actions available in the next state s_t+1.
Having a value estimate for each state-action pair, the agent selects an action according to an action-selection strategy (e.g., epsilon-greedy exploration).
In the robot mouse example, Q-learning figures out the value of each position in the maze and the value of the actions {forward, backward, left, right} at each position. Then, the action-selection strategy chooses what the mouse does at each time step.
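The update rule can be sketched as a tabular update on a dictionary of Q-values (a minimal illustration using made-up states from the mouse example; the learning rate, discount factor, and action list are assumptions):

```python
from collections import defaultdict

def q_learning_step(Q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to a tabular Q dictionary in place."""
    # Best achievable Q-value from the next state (unseen pairs default to 0).
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Every state-action pair starts at the same arbitrary value (here, 0).
Q = defaultdict(float)
q_learning_step(Q, "crossroads", "left", 10, "water",
                ["forward", "back", "left", "right"])
```

After one step, Q("crossroads", "left") moves a fraction α of the way toward the observed reward plus the discounted value of the next state.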
Policy Learning: Mapping State to Action
In Q-learning, a value function estimates the value of each state-action pair. Policy learning is a direct alternative where a policy function, π, maps each state to the best corresponding action. This can be thought of as a behavioral policy: "when I observe state s, the best thing to do is take action a". For example, an autonomous vehicle's policy might include: "if I see a yellow light and I am more than 100 feet from the intersection, I should brake. Otherwise, keep moving forward."
So, the agent learns a function that will maximize expected reward. Deep neural networks are suitable for learning complex functions.
Andrej Karpathy’s Pong from Pixels uses deep reinforcement learning to learn a policy for the Atari game Pong. The input is raw pixels from the game (state), and the output is a probability of moving the paddle up or down (action).
In a policy gradient network, the agent learns the optimal policy by adjusting its weights through gradient descent based on reward signals from the environment.
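As in Pong from Pixels, an up/down decision can be modeled as a Bernoulli policy with a single logistic unit, which makes the gradient step easy to write down (an illustrative sketch, not Karpathy's actual code; the state vector and learning rate are placeholders):

```python
import numpy as np

def policy(weights, state):
    """Probability of moving the paddle up, from a single logistic unit."""
    return 1.0 / (1.0 + np.exp(-state @ weights))

def reinforce_update(weights, state, action, reward, lr=0.01):
    """One policy-gradient step: make rewarded actions more likely."""
    p_up = policy(weights, state)
    # Gradient of log pi(action | state) for a Bernoulli policy.
    grad_log = (action - p_up) * state
    return weights + lr * reward * grad_log
```

A positive reward pushes the weights so the taken action becomes more probable; a negative reward pushes them the other way.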
Deep Reinforcement Learning: Combining RL with Deep Learning
Deep reinforcement learning (DRL) leverages the power of deep neural networks to handle complex, high-dimensional state spaces. Deep neural networks can learn intricate patterns and representations from raw sensory data, enabling RL agents to make informed decisions in complex environments.
Deep Q-Networks (DQNs)
In 2015, DeepMind used deep Q-networks (DQN), which approximate Q-functions using deep neural networks, to surpass human benchmarks across many Atari games.
The DQN agent, receiving only the pixels and the game score as inputs, surpassed the performance of all previous algorithms and achieved a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture, and hyperparameters. This bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent capable of excelling at a diverse array of challenging tasks. (Mnih et al., 2015)
DQNs incorporate techniques to improve performance and stability:
- Experience replay: The agent learns from randomly sampled batches of earlier observations and their rewards rather than only from the most recent ones, which avoids overfitting to recent experiences. This is inspired by biological brains: rats traversing mazes, for example, “replay” patterns of neural activity during sleep to optimize future behavior in the maze.
- Recurrent neural networks (RNNs) augmenting DQNs: When an agent can only see its immediate surroundings (e.g., the robot mouse seeing one segment of the maze rather than a bird's-eye view of the whole maze), it needs to remember the bigger picture so it knows where things are. This is similar to how human babies develop object permanence, knowing things exist even after they leave the baby's visual field. RNNs are “recurrent”: they allow information to persist over longer time scales.
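An experience replay buffer can be as simple as a bounded queue sampled uniformly (a minimal sketch; the capacity and the transition tuple layout are illustrative choices, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store past (state, action, reward, next_state) transitions
    and sample them uniformly at random for training."""

    def __init__(self, capacity=10000):
        # deque with maxlen silently drops the oldest transitions when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive frames that would otherwise bias the updates.
        return random.sample(list(self.buffer), batch_size)
```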
Asynchronous Advantage Actor-Critic (A3C)
In 2016, DeepMind revealed Asynchronous Advantage Actor-Critic (A3C), which surpassed state-of-the-art performance on Atari games after training for half as long (Mnih et al., 2016). A3C is an actor-critic algorithm that combines the best of both approaches: it uses an actor (a policy network that decides how to act) AND a critic (a value network that estimates how valuable states are). A3C is the algorithm behind OpenAI’s Universe Starter Agent.
Advancements in Deep Reinforcement Learning and Modeling Human Behavior
Recent research explores the connection between advancements in DRL and the modeling of human behavior. Deep Q-networks (DQNs) enable the investigation of how humans transform high-dimensional, time-continuous visual stimuli into appropriate motor responses.
Researchers recorded motor responses in human participants while playing three distinct arcade games. Stimulus features generated by a DQN were used as predictors for human data by fitting the DQN’s response probabilities to human motor responses using a linear model. The hypothesis was that advancements in RL models would lead to better prediction of human motor responses. Features from two recently developed DQN models (Ape-X and SEED) and a third baseline DQN were used to compare prediction accuracy.
Ape-X and SEED involved additional structures, such as dueling and double Q-learning, and a long short-term memory, which considerably improved their performances when playing arcade games. The experimental tasks were time-continuous, so the effect of temporal resolution on prediction accuracy was also analyzed by smoothing the model and human data to varying degrees.
The study found that all three models predict human behavior significantly above chance level. SEED, the most complex model, outperformed the others in prediction accuracy of human behavior across all three games.
Key Architectural Elements in Advanced DQNs
- Dueling Architecture: The estimation of Q-values is split into two streams: one evaluates the advantage of each action, and the other evaluates the value of the state itself.
- Long Short-Term Memory (LSTM): SEED introduces an LSTM, a recurrent neural network (RNN) that incorporates past experiences into decision-making.
- Double Q-Learning: Ape-X and SEED employ double Q-learning by utilizing separate networks for action evaluation and action selection.
- Experience Prioritization: They collect more gameplay data, evaluate the data according to its importance, and store it in a shared buffer.
- Reward Clipping: Unlike the other DQNs, SEED does not suffer from the limitation of reward clipping, which restricts the received reward to a fixed range.
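The dueling combination of the two streams is commonly written Q(s, a) = V(s) + A(s, a) − mean over a' of A(s, a'); here is a minimal sketch (the mean-subtraction convention is the standard identifiability trick from the dueling-network literature, not stated in the text above):

```python
import numpy as np

def dueling_q(state_value, advantages):
    """Combine a state-value stream and an action-advantage stream into Q-values.

    Subtracting the mean advantage keeps V and A identifiable:
    adding a constant to A and subtracting it from V would otherwise
    leave Q unchanged."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + advantages - advantages.mean()
```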
Experimental Setup and Results
The baseline DQN and Ape-X were individually trained to play arcade games, and a pre-trained SEED was used. The baseline DQN, compared to the other two DQNs, did not show stable learning curves. After sufficient training, Ape-X and SEED achieved performance significantly higher than human-level.
Behavioral data recorded from subjects playing the arcade games Breakout, Space Invaders, and Enduro were analyzed. The trained DQN models were used as a nonlinear, feature-generating mapping by processing the video screens seen by subjects through the DQNs. The features generated by DQNs were used to predict human responses using a GLM, which fitted the Q-values to human data.
A comparison between the games revealed that all three models achieved their highest prediction accuracy in the game Enduro, compared to Breakout and Space Invaders. A post hoc paired t-test revealed a significantly higher correlation of SEED compared to the baseline DQN and Ape-X in Breakout, Space Invaders, and Enduro. These findings demonstrate that SEED, the most advanced of the three DQNs, provided the most effective feature-generating mapping for the GLM.
The Significance of Human-Level Control
The ability of DRL agents to achieve human-level control has significant implications for various real-world applications. DRL can be used to train robots to perform complex tasks in manufacturing, logistics, and healthcare. It can also be used to develop intelligent assistants that can help people with daily tasks, such as scheduling appointments, managing finances, and providing personalized recommendations.
Challenges and Future Directions
Despite the remarkable progress in DRL, there are still several challenges that need to be addressed. One challenge is the sample efficiency of DRL algorithms. DRL agents typically require a large amount of training data to learn effectively, which can be a bottleneck in many real-world applications. Another challenge is the exploration-exploitation trade-off. DRL agents need to explore the environment to discover new strategies, but they also need to exploit their current knowledge to maximize rewards. Balancing exploration and exploitation is a difficult problem, and there is no one-size-fits-all solution.
Future research in DRL will likely focus on developing more sample-efficient algorithms, more effective exploration strategies, and more robust methods for handling noisy and uncertain environments. Researchers are also exploring the use of hierarchical reinforcement learning, which involves breaking down complex tasks into simpler subtasks, and meta-learning, which involves learning how to learn.