A Comprehensive Survey of Constraint Formulations in Safe Reinforcement Learning
Introduction
Reinforcement learning (RL) has demonstrated remarkable capabilities across various domains, including games, robotics, and autonomous systems. However, deploying RL in real-world, safety-critical applications necessitates ensuring that learning agents avoid catastrophic failures or unsafe behaviors. Safe Reinforcement Learning (SafeRL) has emerged as a subfield of reinforcement learning that explicitly addresses safety constraints during the learning and deployment phases of agents. This article provides a structured overview of SafeRL formulations, primarily focusing on Constrained Markov Decision Processes (CMDPs) and their extensions to Multi-Agent Safe RL (SafeMARL).
This survey aims to provide researchers familiar with fundamental RL concepts with a deeper understanding of how to incorporate safety into RL. It assumes knowledge of basic RL principles, such as Markov decision processes and policy optimization.
Motivation
The need for SafeRL arises from the potential risks associated with deploying standard RL algorithms in environments where safety is paramount. Examples of such applications include:
- Autonomous Driving: Ensuring vehicles avoid collisions and adhere to traffic laws.
- Healthcare: Optimizing treatment plans while minimizing risks to patients.
- Robotics: Controlling robots to perform tasks without causing damage or injury.
In these scenarios, even brief periods of unsafe behavior can have severe consequences. Therefore, SafeRL seeks to develop algorithms that can learn optimal policies while simultaneously satisfying predefined safety constraints.
SafeRL: A Constrained Criterion Approach
A prevalent approach in SafeRL is based on a constrained criterion, which seeks to maximize the expected cumulative reward subject to specific safety constraints. A common framework for SafeRL is the Constrained Markov Decision Process (CMDP). In a CMDP, an agent seeks to maximize expected return subject to one or more constraints (e.g., bounds on certain costs or probabilities of unsafe events). This framework allows formalizing safety requirements as mathematical constraints and provides tools from constrained optimization and control theory to enforce them. SafeRL algorithms often leverage CMDP theory to find policies that respect constraints (at least approximately) while learning efficiently.
Constrained Markov Decision Processes (CMDPs)
The Constrained Markov Decision Process (CMDP) provides a mathematical framework for formalizing SafeRL problems. A CMDP extends the standard Markov Decision Process (MDP) by incorporating cost functions and constraints.
Formal Definition
A CMDP can be defined as a tuple (𝒮, 𝒜, P, r, c, γ, d), where:
- 𝒮 is the state space.
- 𝒜 is the action space.
- P(s' | s, a) is the transition probability function, representing the probability of transitioning to state s' after taking action a in state s.
- r(s, a) is the reward function, representing the immediate reward received after taking action a in state s.
- c(s, a) is the cost function, encoding the aspects of the task we want to constrain. Each cost function usually corresponds to a particular notion of “unsafe” behavior or resource usage that should be limited; for example, c^(1)(s, a) might be an indicator of entering an unsafe state or a measure of damage/risk at state s. In general, c : 𝒮 × 𝒜 → [0, 1] is a safety cost function. Without loss of generality, we assume that c is bounded between 0 and 1 for all (s, a), as is the reward function r.
- γ ∈ [0, 1] is a discount factor that weighs immediate vs. future rewards and costs.
- d is a vector of cost thresholds, where d_i is the specified threshold (safety limit) for the i-th cost.
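As a concrete representation, the tuple above can be held in a small container class. This is purely an illustrative sketch — the class, array shapes, and the stacked-cost layout are assumptions, not a standard API:

```python
# Illustrative sketch (not from the literature): a CMDP as a data container.
from dataclasses import dataclass

import numpy as np


@dataclass
class CMDP:
    P: np.ndarray      # transitions, shape (|S|, |A|, |S|); each P[s, a] sums to 1
    r: np.ndarray      # rewards r(s, a), shape (|S|, |A|)
    c: np.ndarray      # safety costs c^(i)(s, a) in [0, 1], shape (|S|, |A|, k)
    gamma: float       # discount factor in [0, 1]
    d: np.ndarray      # cost thresholds d_i, shape (k,) for k constraints
```

Keeping the k cost functions stacked in one array mirrors the vector threshold d in the definition.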
Objective
The goal in a CMDP is to find a policy π𝜋\piitalic_π that maximizes the expected cumulative discounted reward while satisfying the cost constraints. Formally, the problem can be expressed as:
max_π V^π(ρ)
subject to:
J_{c^(i)}(π) ≤ d_i for all i
where:
- V^π(ρ) is the expected cumulative discounted reward starting from initial state distribution ρ and following policy π.
- J_{c^(i)}(π) is the expected cumulative discounted cost for the i-th cost function under policy π.
- A (stationary) policy π is a mapping from states to a distribution over actions.
The set of policies that satisfy all cost constraints is called the feasible policy set. It is assumed that this set is non-empty (the constraints are attainable).
Problem (1) is the standard formulation of SafeRL as a CMDP optimization problem Altman (1999). It is a constrained Markov decision problem which, in principle, can be solved via dynamic programming or linear programming if the model is known and state-action spaces are small.
Solving CMDPs
Eitan Altman’s foundational work Altman (1999) established that for finite CMDPs, there exists an optimal policy that is stationary (time-independent) and, if multiple constraints are present, possibly stochastic (randomized). A common theoretical approach to solve CMDPs is to form the Lagrangian of (1).
Lagrangian Formulation
The Lagrangian of the CMDP problem is defined as:
L(π, λ) = J(π) − Σ_i λ_i (J_{c^(i)}(π) − d_i)
where:
- λ_i ≥ 0 are the Lagrange multipliers associated with each cost constraint.
The Lagrangian represents a trade-off between maximizing the reward and satisfying the constraints; the Lagrange multipliers λ_i determine the “price” of violating each constraint.
The dual function is defined as:
g(λ) = max_π L(π, λ)
Let π(λ) denote the maximizing policy; it is the optimal policy for the unconstrained MDP with penalized reward r_λ(s, a) = r(s, a) − Σ_i λ_i c^(i)(s, a). In other words, π(λ) is the unconstrained optimal policy if we treat −λ_i as a weight (penalty) for cost c^(i).
Under certain conditions (convexity or linearity of the CMDP problem), strong duality holds and solving the dual yields the primal optimum Altman (1999); Achiam et al. (2017). The optimal policy π* for the CMDP is then π(λ*), where λ* is the optimal multiplier (or a mixture of such policies when the optimum is not unique).
Lagrange Multiplier Methods
Lagrange multiplier methods are commonly used in safe RL: one maintains estimates of λ_i during learning and adjusts them based on constraint violations. This gives insight into how costs trade off with reward, since λ_i can be interpreted as the “price” of violating constraint i. The gradient of the dual can be derived as ∇_{λ_i} g(λ) = d_i − J_{c^(i)}(π(λ)).
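A minimal sketch of this multiplier update on a hypothetical one-step problem (the two actions, their rewards and costs, the budget, and the step size below are all made up for illustration): action 0 earns reward 1 at cost 1, action 1 earns reward 0.3 at cost 0, and the budget is d = 0.4. The primal step is an exact best response to the penalized reward, and the dual step lowers the dual by raising λ when the constraint is violated:

```python
# Illustrative dual-ascent sketch on a toy two-action CMDP (a one-step bandit).
# Assumed setup: action 0 -> reward 1, cost 1; action 1 -> reward 0.3, cost 0.
def dual_ascent(d=0.4, lr=0.01, iters=200):
    lam = 0.0
    for _ in range(iters):
        # Primal step: best response to penalized reward r - lam * c.
        # Action 0 is preferred iff 1 - lam * 1 > 0.3 - lam * 0.
        p = 1.0 if 1.0 - lam > 0.3 else 0.0   # probability of picking action 0
        cost = p                              # expected cost J_c(pi)
        # Dual step: since grad g(lam) = d - J_c, descend the dual by
        # increasing lam exactly when the constraint is violated.
        lam = max(0.0, lam + lr * (cost - d))
    return lam

lam = dual_ascent()   # lam settles near 0.7, where the two penalized rewards tie
```

The oscillation of the primal best response around the tie point is a known feature of this scheme; in practice one averages iterates or uses a stochastic policy class.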
Linear Programming Formulation
For finite-state CMDPs, an alternative formulation is via occupancy measures and linear programming. One can define variables x(s, a) representing the discounted visitation frequency of state-action pair (s, a) under a stationary policy.
The objective function can be expressed as:
Σ_{s,a} r(s, a) x(s, a)
This expression is the key to formulating the CMDP as a linear program since the objective becomes linear in x(s,a)𝑥𝑠𝑎x(s,a)italicx ( italics , italic_a ).
The constraints can be written as:
Σ_{s,a} c^(i)(s, a) x(s, a) ≤ d_i
which are also linear in x(s,a)𝑥𝑠𝑎x(s,a)italicx ( italics , italic_a ).
The flow constraints ensure that the visitation frequencies are consistent with the transition dynamics:
Σ_a x(s′, a) = ρ(s′) + γ Σ_{s,a} P(s′ | s, a) x(s, a) for all s′
where ρ(s) is the initial state distribution.
This linear program can be solved efficiently for moderate state-action sizes and yields an optimal (potentially stochastic) policy for the CMDP Altman (1999).
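A minimal sketch of this linear program, solved with scipy.optimize.linprog on a hypothetical two-state, two-action CMDP (the dynamics, rewards, costs, and budget below are invented for illustration):

```python
# Sketch: occupancy-measure LP for a toy CMDP. All numbers are illustrative.
import numpy as np
from scipy.optimize import linprog

nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.2, 0.8]],      # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])       # reward r(s, a)
c = np.array([[0.0, 1.0], [1.0, 0.0]])       # safety cost c(s, a) in [0, 1]
rho = np.array([0.5, 0.5])                   # initial state distribution
d = 2.0                                      # cost budget

# Variables x(s, a) flattened row-major to length nS * nA.
# Maximize sum r(s,a) x(s,a): linprog minimizes, so negate the objective.
obj = -r.flatten()

# Flow constraints: sum_a x(s',a) - gamma * sum_{s,a} P(s'|s,a) x(s,a) = rho(s')
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = float(sp == s) - gamma * P[s, a, sp]

# Cost constraint: sum_{s,a} c(s,a) x(s,a) <= d
res = linprog(obj, A_ub=c.flatten()[None, :], b_ub=[d],
              A_eq=A_eq, b_eq=rho, bounds=(0, None))

x = res.x.reshape(nS, nA)
pi = x / x.sum(axis=1, keepdims=True)        # recover a (possibly stochastic) policy
```

Normalizing each row of x recovers the stationary policy, which may be stochastic precisely because the cost constraint can force randomization.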
Types of Safety Constraints
SafeRL formulations can be categorized based on how safety constraints are defined and enforced. This section reviews representative constraint formulations.
Expected Cumulative Costs as Constraints
The formulation above uses expected cumulative costs as constraints.
Instantaneous Constraints
Instead of a long-term expected cost, one could require c(s_t, a_t) ≤ d at every time step t (almost surely). This is a stricter requirement (no violations at all). Such hard constraints are challenging for learning and often enforced via external mechanisms (like safety filters).
Instantaneous constraints refer to safety requirements that must hold at every time step during the agent’s execution, rather than only in expectation over a trajectory.
Examples of Instantaneous Constraints
- Robotics - Torque or Force Limits: Robotic manipulators and mobile robots have strict actuator limits. A common constraint is ‖τ_t‖ ≤ τ_max, where τ_t is the torque vector applied at time t. Exceeding these limits even once can cause irreversible hardware damage.
- Autonomous Driving - Collision Avoidance: Autonomous vehicles must avoid collisions at all times. This is often modeled as a minimum distance constraint, distance(s_t) ≥ d_safe, where d_safe is a safety margin.
- Aerial Vehicles (Drones) - Altitude Constraints: Drones often operate within restricted altitude corridors, leading to constraints of the form z_min ≤ z_t ≤ z_max, where z_t is the drone's altitude at time t.
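Instantaneous box constraints like these are often enforced outside the learner by projecting each proposed command back into the safe set before it is executed. A minimal sketch of such a projection, with purely illustrative limit values:

```python
# Sketch: projecting proposed commands into the safe set (a simple safety
# filter). The limits tau_max, z_min, z_max below are illustrative values.
import numpy as np


def clamp_torque(tau, tau_max=5.0):
    """Scale a torque command so that ||tau|| <= tau_max (norm projection)."""
    tau = np.asarray(tau, dtype=float)
    norm = np.linalg.norm(tau)
    return tau if norm <= tau_max else tau * (tau_max / norm)


def clamp_altitude(z_cmd, z_min=10.0, z_max=120.0):
    """Project a commanded altitude into the corridor [z_min, z_max]."""
    return min(max(z_cmd, z_min), z_max)
```

Projection guarantees the constraint holds at every step, but it can bias learning if the agent never observes the consequences of its unclamped actions.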
Other SafeRL Formulations
Different Safe RL formulations based on the constrained criterion and associated representative work are described in Table 1.
SafeRL Algorithms
SafeRL research in the last decade has focused on developing new algorithms. Practically, many well-known algorithms, such as constrained policy optimization (CPO, Achiam et al. (2017)) or reward-constrained policy optimization (RCPO, Tessler et al. (2019)), are built on the constrained criterion described above.
Multi-Agent Safe Reinforcement Learning (SafeMARL)
An emerging frontier is Multi-Agent Safe Reinforcement Learning (SafeMARL), which considers multiple agents learning and interacting under safety constraints. SafeMARL is crucial for applications like coordinated robotics, drone swarms, and autonomous driving with multiple vehicles, where safety conditions involve interactions among agents. SafeMARL introduces additional challenges such as coordinating safety in a team, handling the coupling of constraints across agents, and new solution concepts (like safe equilibria Ganzfried (2022) in competitive settings). While single-agent SafeRL is relatively well-studied, SafeMARL remains a young research area with many open problems ElSayed-Aly et al. (2021); Gu et al. (2023).
Open Research Problems in SafeRL and SafeMARL
We highlight five open research problems that we believe are important for advancing SafeRL and SafeMARL:
- Safe Exploration in Multi-Agent Settings
- Scalable SafeMARL Algorithms
- Safe Transfer Learning for RL
- Formal Verification of SafeRL Policies
- Safe Human-Robot Interaction
Connections to Other Areas
Other related areas include robust RL (handling model uncertainty or adversarial disturbances) and reward hacking / alignment (ensuring the specified reward leads to intended safe behavior). While robust RL (e.g., solving worst-case MDPs) and SafeRL share some techniques (like min-max optimization), they address different problem formulations (uncertainty vs. explicit constraints). Similarly, reward specification and alignment problems are complementary to SafeRL: one can combine learned reward shaping with SafeRL constraints to yield agents that both seek correct objectives and stay safe Amodei et al. (2016); Achiam et al. (2017). We touch upon these connections where relevant.

