Define Observation and Reward Signals in Custom Environments
When designing a reinforcement learning environment, the choice of observation and reward signals is crucial because they directly influence the agent's learning efficiency and its ability to generalize and perform well in the task. Poorly designed observation and reward signals can lead to suboptimal learning, convergence to unintended behaviors, or failure to learn altogether.
While the observation signals might be mostly determined by the underlying physical systems and the available sensors, a good observation signal should provide the agent with sufficient and relevant information about the environment's state, enabling it to make informed decisions.
You typically have more flexibility to design an appropriate reward signal. A good reward signal should accurately reflect the agent's performance in achieving the desired outcome, encouraging behaviors that lead to success while discouraging those that do not. It should be neither too sparse nor too dense, providing timely feedback to guide the agent effectively.
An important consideration is that a reinforcement learning environment is normally assumed to be strictly causal with respect to the action. That is, the current observation must not depend on the current action (while the next state generally does). In other words, there must be no direct feedthrough from the current action to the current observation.
Strict Causality and Time-Discretization
Reinforcement Learning Toolbox™ environments must be strictly causal. This means that the current observation must not depend on the current action, or, in other words, there must not be a direct feedthrough between action and observation. For a step-by-step explanation of how the environment interacts with an agent, see Reinforcement Learning Environments.
When you implement a custom MATLAB® environment using custom step and reset functions or a class template, the training or simulation function interprets the first and second outputs of your step function as the next observation O(t+1) and its associated reward R(t+1). While they are typically functions of the current action A(t), they cannot be a function of the next action A(t+1), which is not yet available to your step function. Therefore, strict causality is enforced by the software.
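For example, a minimal sketch of a custom step function might look like the following (the dynamics, reward, and termination condition are placeholders, and the function uses the Action/LoggedSignals calling convention of custom function environments):

function [NextObs,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)
    % The next observation and reward depend only on the current state x(t)
    % (carried in LoggedSignals) and the current action A(t).
    x = LoggedSignals.State;          % current state x(t)
    xNext = 0.9*x + 0.1*Action;       % placeholder linear dynamics
    NextObs = xNext;                  % O(t+1)
    Reward  = -xNext^2;               % R(t+1), associated with O(t+1)
    IsDone  = abs(xNext) > 10;        % terminate if the state diverges
    LoggedSignals.State = xNext;      % store x(t+1) for the next call
end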
When you implement a custom Simulink® environment, you must make sure that there is no direct feedthrough between action and observation, to avoid potentially unsolvable algebraic loops. For more information, see Create Custom Simulink Environments.
If your MATLAB environment relies on continuous-time equations to represent the dynamics of the underlying system, you must use an integration technique such as the Euler or the trapezoidal method to calculate the next state x(t+1), the next observation O(t+1), and its associated reward R(t+1) as functions of the current state x(t). Here, you typically assume that the action stays constant in the interval between t and t+1.
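As an illustration, the following sketch applies a forward Euler step inside a custom step function. The pendulum-like dynamics, sample time, and reward weights are assumptions made for this example, and the action is held constant over the integration interval:

function [NextObs,Reward,IsDone,LoggedSignals] = pendulumStep(Action,LoggedSignals)
    Ts = 0.05;                            % sample time between t and t+1
    theta  = LoggedSignals.State(1);      % current angle
    dtheta = LoggedSignals.State(2);      % current angular velocity
    % Continuous-time dynamics, evaluated at the current state and action
    ddtheta = -10*sin(theta) - 0.2*dtheta + Action;
    % Forward Euler step: x(t+1) = x(t) + Ts*dx/dt
    thetaNext  = theta  + Ts*dtheta;
    dthetaNext = dtheta + Ts*ddtheta;
    NextObs = [thetaNext; dthetaNext];    % O(t+1)
    Reward  = -(thetaNext^2 + 0.1*dthetaNext^2 + 0.01*Action^2);  % R(t+1)
    IsDone  = false;
    LoggedSignals.State = [thetaNext; dthetaNext];
end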
Simulink environments can also directly rely on continuous-time equations to represent the system dynamics. In this case, during simulation, the Simulink solver integrates the continuous-time equations. The RL Agent block always executes at discrete intervals according to the sample time specified in the SampleTime property of the agent object.
For an example on different ways to implement the same environment in Simulink, see Create and Simulate Same Environment in Both MATLAB and Simulink.
Observation
Choosing the right observations for the agent is crucial for effective training and performance in reinforcement learning. The observations should provide sufficient information about the current environment state (independently of past environment states) for the agent to make informed decisions.
For example, for control system applications, while the observation depends on your application and on the measurements that are available for the underlying physical system, the integrals (and sometimes derivatives) of error signals are often useful observations. Also, for reference-tracking applications, having a time-varying reference signal as an observation is important.
If the environment state is low-dimensional, and all the states are available for measurement, it is best practice to include all the available environment states in the observation vector. This ensures that the agent can capture all necessary information about the environment. Failure to do so can lead to situations in which different environment states result in the same observation. For such states, the agent policy (assuming it is a static function of the observation) can only return the same action. Such a policy is typically unsuccessful, because a successful policy normally needs to respond to different environment states with different actions.
For example, an image observation of a swinging pendulum has position information but does not have enough information, by itself, to determine the pendulum velocity. In this case, a static policy that cannot sense the velocity would not be able to stabilize the pendulum. But if the velocity can be measured or estimated, adding it as an additional entry in the observation vector will provide a static policy with enough information to stabilize the pendulum.
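For instance, a sketch of an observation specification that includes both quantities might look like this (the limits and names are illustrative):

% Observation vector: [angle; angular velocity]
obsInfo = rlNumericSpec([2 1], ...
    'LowerLimit',[-pi; -Inf], ...
    'UpperLimit',[ pi;  Inf]);
obsInfo.Name = "pendulum states";
obsInfo.Description = "angle, angular velocity";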
When not all states are available as observation signals (for example because it would be unrealistic to measure them), a possible workaround is to use an estimator (as a part of the environment) that estimates the values of the unmeasured states, and makes such estimates available to the agent as observations. Alternatively, you can use recurrent networks such as an LSTM in your policy. Doing so results in a policy that has states, and that might therefore be able to use its state as an internal representation of the environment state. Such a policy can consequently return different actions (based on different values of its internal state) even when there is not enough information to reconstruct the correct environment state from the current observation.
When the state or action space is large, it becomes challenging for the agent to explore and learn effectively. The curse of dimensionality can make it difficult to find an optimal policy within a reasonable time frame. Therefore, if the observation space is high dimensional (for example containing images) it may be necessary to preprocess the observations. When images are used as observations, techniques like image resizing, cropping, or converting to grayscale can help reduce dimensionality and focus on relevant features.
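A minimal preprocessing sketch, assuming an RGB frame as input, might convert the image to grayscale, downsample it, and rescale the pixel values (imresize requires Image Processing Toolbox™):

function obs = preprocessImage(rgbFrame)
    gray  = rgb2gray(rgbFrame);      % discard color information
    small = imresize(gray,[64 64]);  % downsample to 64-by-64 pixels
    obs   = double(small)/255;       % rescale pixel values to [0,1]
end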
Depending on the complexity of the task, you may need to perform feature engineering to extract relevant information from the observations. This can involve transforming or combining the raw observations to create more informative features. In general, it is best practice to leverage any domain knowledge or insights you have about the task to guide the selection of observations. Understanding the underlying environment dynamics and relevant factors can help you identify the most informative observations for the agent.
The selection of observation (and action) signals has an important effect on agent training and convergence. For more information, see the last section of Train Reinforcement Learning Agents.
Reward
To guide the learning process, reinforcement learning uses a scalar reward signal generated from the environment.
This signal measures the immediate performance of the agent with respect to the task goals. Specifically, the reward R(t+1) measures the immediate effectiveness of the transition from the observation O(t) to the next observation O(t+1) caused by the agent action A(t). Note that the reward R(t+1) is synchronized with the next observation O(t+1), and cannot depend on A(t+1). In other words, the reward is strictly causal with respect to the action.
During training, an agent updates its policy based on the rewards received for different state-action combinations. For an introduction to different types of agents and how they use the reward signal during training, see Reinforcement Learning Agents.
In general, you provide a positive reward to encourage certain agent actions and a negative reward (penalty) to discourage other actions. A well-designed reward signal guides the agent to maximize the expectation of the (possibly discounted) cumulative long-term reward. What constitutes a well-designed reward depends on your application and the agent goals.
For example, when an agent must perform a task for as long as possible, a common strategy is to provide a small positive reward for each time step that the agent successfully performs the task and a large penalty when the agent fails. This approach encourages longer training episodes while heavily discouraging actions that lead to episodes in which the agent fails. For an example that uses this approach, see Train DQN Agent to Balance Discrete Cart-Pole System.
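A sketch of this pattern, with illustrative thresholds and reward values rather than the ones used in the cart-pole example, might look like this:

theta = 0.05;   % current pole angle (rad), illustrative value
x     = 1.2;    % current cart position (m), illustrative value

isFailed = abs(theta) > 0.26 || abs(x) > 2.4;   % pole fell or cart left the track
if isFailed
    Reward = -100;   % large penalty heavily discourages failure
else
    Reward = 1;      % small positive reward for each successful time step
end
IsDone = isFailed;   % also terminate the episode on failure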
If your reward function incorporates multiple signals, such as position, velocity, and control effort, you must consider the relative sizes of the signals and scale their contributions to the reward signal accordingly.
You can specify either continuous or discrete reward signals. In either case, the reward signal must convey rich information when the action and observation signals change.
For control system applications in which cost functions and constraints are already available, you can also use built-in functions to generate rewards from such specifications.
Continuous Reward Functions
A continuous reward function varies continuously with changes in the environment observations and actions. In general, continuous reward signals improve convergence during training and can lead to simpler network structures.
An example of a continuous reward is the quadratic regulator (QR) cost function, where the cumulative long-term reward can be expressed as:

$$J = -\left( s_\tau^T Q_\tau s_\tau + \sum_{j=0}^{\tau-1} \left( s_j^T Q s_j + a_j^T R a_j + 2 s_j^T N a_j \right) \right)$$

Here, Qτ, Q, R, and N are the weight matrices. Qτ is the terminal weight matrix, applied only at the end of the episode. Also, s is the observation vector, a is the action vector, and τ is the terminal iteration of the episode. The (instantaneous) reward for this cost function is:

$$r_j = -\left( s_j^T Q s_j + a_j^T R a_j + 2 s_j^T N a_j \right)$$
This QR reward structure encourages an agent to drive s to zero with minimal action effort. A QR-based reward structure is a good reward to choose for regulation or stationary point problems, such as pendulum swing-up or regulating the position of the double integrator. For training examples that use a QR reward, see Train DQN Agent to Swing Up and Balance Pendulum and Compare DDPG Agent to LQR Controller.
Smooth continuous rewards, such as the QR regulator, are good for fine-tuning parameters and can provide policies similar to optimal controllers (LQR/MPC).
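For example, a sketch of the instantaneous QR reward for a two-state, single-action system, with illustrative weight matrices, is:

Q = diag([1 0.1]);    % state weights
R = 0.01;             % action weight
N = zeros(2,1);       % cross-term weights
s = [0.5; -0.2];      % current observation vector
a = 0.3;              % current action
reward = -(s'*Q*s + a'*R*a + 2*s'*N*a);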
Discrete Reward Functions
A discrete reward function varies discontinuously with changes in the environment observations or actions. These types of reward signals can make convergence slower and can require more complex network structures. Discrete rewards are usually implemented as events that occur in the environment—for example, when an agent receives a positive reward if it exceeds some target value or a penalty when it violates some performance constraint.
While discrete rewards can slow down convergence, they can also guide the agent toward better reward regions in the state space of the environment. For example, a region-based reward, such as a fixed reward when the agent is near a target location, can emulate final-state constraints. Also, a region-based penalty can encourage an agent to avoid certain areas of the state space.
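For example, a sketch of region-based discrete reward terms, with illustrative locations and values, is:

pos    = [0.3; -0.1];     % current position
target = [0; 0];          % target location
bonus   = 10*(norm(pos - target) < 0.5);              % fixed reward near the target
penalty = -50*(abs(pos(1)) > 5 || abs(pos(2)) > 5);   % penalty for leaving the allowed region
reward  = bonus + penalty;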
Mixed Reward Functions
In many cases, providing a mixed reward signal that has a combination of continuous and discrete reward components is beneficial. The discrete reward signal can be used to drive the system away from bad states, and the continuous reward signal can improve convergence by providing a smooth reward near target states. For example, in Train DDPG Agent to Control Sliding Robot, the reward function has three components, r1, r2, and r3, described in the following list (see the sketch after the list).
Here:
r1 is a region-based continuous reward that applies only near the target location of the robot.
r2 is a discrete signal that provides a large penalty when the robot moves far from the target location.
r3 is a continuous QR penalty that applies for all robot states.
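A sketch of such a mixed reward, with weights, thresholds, and signal names chosen for illustration rather than taken from the sliding robot example, might look like this:

x = 0.4; y = -0.2; theta = 0.1;   % robot pose (illustrative values)
u = [0.5; -0.3];                  % current control action
d2 = x^2 + y^2 + theta^2;         % squared distance from the target pose
r1 = (d2 < 0.5)*10*(0.5 - d2);                % continuous bonus, active only near the target
r2 = -100*(abs(x) >= 20 || abs(y) >= 20);     % large penalty far from the target
r3 = -(0.05*d2 + 0.02*(u'*u));                % QR-style penalty on all states and the control effort
reward = r1 + r2 + r3;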
Reward Generation from Control Specifications
For applications where a working control system already exists, specifications such as cost functions or constraints might already be available. In these cases, you can use generateRewardFunction to automatically generate a reward function, coded in MATLAB, that you can use as a starting point for reward design. This function allows you to generate reward functions from:
Cost and constraint specifications defined in an mpc (Model Predictive Control Toolbox) or nlmpc (Model Predictive Control Toolbox) controller object. This feature requires Model Predictive Control Toolbox™ software.
Performance constraints defined in Simulink Design Optimization™ model verification blocks.
In both cases, when constraints are violated, a negative reward is calculated using penalty functions such as exteriorPenalty (the default), hyperbolicPenalty, or barrierPenalty.
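For example, a sketch of using exteriorPenalty to turn a bound violation into a negative reward (the bounds and signal value are illustrative) is:

x = 1.3;                                        % current value of the constrained signal
reward = -exteriorPenalty(x,-1,1,'quadratic');  % zero inside [-1,1], increasingly negative outside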
Starting from the generated reward function, you can tune the cost and penalty weights, use a different penalty function, and then use the resulting reward function within an environment to train an agent.
See Also
Functions
generateRewardFunction | exteriorPenalty | hyperbolicPenalty | barrierPenalty | rlSimulinkEnv | rlCreateEnvTemplate | validateEnvironment