Define Reward Signals

To guide the learning process, reinforcement learning uses a scalar reward signal generated from the environment. This signal measures the performance of the agent with respect to the task goals. In other words, for a given observation (state), the reward measures how good it is to take a particular action. During training, an agent updates its policy based on the rewards received for different state-action combinations. For more information on the different types of agents and how they use the reward signal during training, see Reinforcement Learning Agents.

In general, you provide a positive reward to encourage certain agent actions and a negative reward (penalty) to discourage other actions. A well-designed reward signal guides the agent to maximize the expectation of the long-term reward. What constitutes a well-designed reward depends on your application and the agent goals.

For example, when an agent must perform a task for as long as possible, a common strategy is to provide a small positive reward for each time step that the agent successfully performs the task and a large penalty when the agent fails. This approach encourages longer training episodes while heavily discouraging episodes that fail. For an example that uses this approach, see Train DQN Agent to Balance Cart-Pole System.
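
The per-step strategy described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `survival_reward` and `episode_return` and the values +1 and -50 are assumptions chosen for the illustration, not values taken from the referenced example.

```python
def survival_reward(failed: bool,
                    step_reward: float = 1.0,
                    failure_penalty: float = -50.0) -> float:
    """Reward for one time step of a 'stay alive as long as possible' task.

    step_reward and failure_penalty are hypothetical values for illustration.
    """
    return failure_penalty if failed else step_reward


def episode_return(num_successful_steps: int, ended_in_failure: bool) -> float:
    """Undiscounted sum of rewards over one episode."""
    total = sum(survival_reward(False) for _ in range(num_successful_steps))
    if ended_in_failure:
        # The failure step contributes the large penalty once.
        total += survival_reward(True)
    return total
```

With these assumed values, an episode that survives 200 steps accumulates far more reward than one that fails after 20 steps, which is the incentive the strategy is meant to create.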

If your reward function incorporates multiple signals, such as position, velocity, and control effort, it is important to consider the relative sizes of the signals and scale their contributions to the reward signal accordingly.
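
One way to balance signals of different magnitudes is to divide each signal by a typical magnitude before weighting it. The following Python sketch assumes this normalization approach; the function name `scaled_reward`, the scale values, and the weights are all hypothetical and would need to be chosen for a specific application.

```python
def scaled_reward(position_error: float,
                  velocity: float,
                  control_effort: float,
                  position_scale: float = 10.0,
                  velocity_scale: float = 2.0,
                  effort_scale: float = 50.0,
                  weights: tuple = (1.0, 0.5, 0.1)) -> float:
    """Quadratic penalty on three signals after normalizing each one.

    Each *_scale is an assumed typical magnitude of its signal, so that the
    normalized terms are of order one before the weights are applied.
    """
    terms = (position_error / position_scale,
             velocity / velocity_scale,
             control_effort / effort_scale)
    return -sum(w * t * t for w, t in zip(weights, terms))
```

Without the normalization, a control effort of 50 would contribute 2500 to the squared penalty and swamp a position error of 10, which contributes only 100. After normalization, the weights alone determine the relative importance of each signal.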

You can specify either continuous or discrete reward signals. In either case, it is important for the reward signal to provide rich information when the action and observation signals change.

Continuous Rewards

A continuous reward function varies continuously with changes in the environment observations and actions. In general, continuous reward signals improve convergence during training and can lead to simpler network structures.

An example of a continuous reward is the Quadratic Regulator (QR) cost function, where the long-term reward can be expressed as:

$$J_i = -\left( s_\tau^T Q_\tau s_\tau + \sum_{j=i}^{\tau} \left( s_j^T Q_j s_j + a_j^T R_j a_j + 2 s_j^T N_j a_j \right) \right)$$

Here, $Q_\tau$, $Q$, $R$, and $N$ are the weight matrices ($Q_\tau$ is the terminal weight matrix, applied only at the end of the episode if applicable), $s$ is the observation vector, $a$ is the action vector, and $\tau$ is the terminal iteration of the episode. The instantaneous reward for this cost function is:

$$r_i = -\left( s_i^T Q_i s_i + a_i^T R_i a_i + 2 s_i^T N_i a_i \right)$$

This QR reward structure encourages driving s to zero with minimal action effort. A QR-based reward structure is a good reward to choose for regulation or stationary point problems, such as pendulum swing up or regulating the position of the double integrator. For training examples that use a QR reward, see Train DQN Agent to Swing Up and Balance Pendulum and Train DDPG Agent to Control Double Integrator System.
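
As an illustration, the instantaneous QR reward can be computed as follows. This Python sketch assumes the reward is the negative of the quadratic cost, so that maximizing the reward drives s toward zero. The helper names `quad_form` and `qr_reward` are hypothetical, and vectors and matrices are represented as plain lists to keep the example self-contained.

```python
def quad_form(x, M, y):
    """Return x^T M y for vectors x, y (lists) and matrix M (list of rows)."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))


def qr_reward(s, a, Q, R, N):
    """Instantaneous QR reward: -(s^T Q s + a^T R a + 2 s^T N a)."""
    return -(quad_form(s, Q, s) + quad_form(a, R, a) + 2.0 * quad_form(s, N, a))


# Illustrative weights for a two-state, one-action system (assumed values).
Q = [[1.0, 0.0],
     [0.0, 0.1]]      # penalize the first state more than the second
R = [[0.01]]          # small penalty on control effort
N = [[0.0],
     [0.0]]           # no state-action cross term

reward_at_origin = qr_reward([0.0, 0.0], [0.0], Q, R, N)
reward_off_target = qr_reward([1.0, 0.5], [0.2], Q, R, N)
```

The reward is largest (zero) when both the observation and the action are zero, and it decreases smoothly as the state moves away from the origin or the action grows.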

Smooth continuous rewards, such as the QR reward, are good for fine-tuning parameters and can produce policies similar to those of optimal controllers (LQR/MPC).

Discrete Rewards

A discrete reward function varies discontinuously with changes in the environment observations or actions. These types of reward signals can make convergence slower and can require more complex network structures. Discrete rewards are usually implemented as events that occur in the environment. For example, an agent could receive a positive reward if it exceeds some target value or a penalty if it violates some performance constraint.

While discrete rewards can slow down convergence, they can also guide the agent toward better reward regions in the state space of the environment. For example, a region-based reward, such as a fixed reward when the agent is near a target location, can emulate final-state constraints. Also, a region-based penalty can encourage an agent to avoid certain areas of the state space.
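
A region-based reward and a region-based penalty can be sketched as simple event checks. In this Python sketch, the function names, the circular target region, the box-shaped forbidden region, and the reward values are all assumptions made for illustration.

```python
import math


def region_reward(position, target, radius=0.5, bonus=10.0):
    """Fixed bonus when the agent is within `radius` of the target, else 0."""
    return bonus if math.dist(position, target) <= radius else 0.0


def region_penalty(position, lower, upper, penalty=-100.0):
    """Fixed penalty when the agent is inside the box [lower, upper], else 0."""
    inside = all(lo <= p <= hi for p, lo, hi in zip(position, lower, upper))
    return penalty if inside else 0.0
```

Both functions jump between two values as the agent crosses a region boundary, which is what makes these reward signals discrete.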

Mixed Rewards

In many cases, it is beneficial to provide a mixed reward signal that has a combination of continuous and discrete reward components. The discrete reward signal can be used to drive the system away from bad states, and the continuous reward signal can improve convergence by providing a smooth reward near target states. For example, in Train DDPG Agent to Control Flying Robot, the reward function has three components: r1, r2, and r3.

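$$
\begin{aligned}
r_1 &= 10\left(\left(x_t^2 + y_t^2 + \theta_t^2\right) < 0.5\right) \\
r_2 &= -100\left(|x_t| \ge 20 \;||\; |y_t| \ge 20\right) \\
r_3 &= -\left(0.2\,(R_{t-1} + L_{t-1})^2 + 0.3\,(R_{t-1} - L_{t-1})^2 + 0.03\,x_t^2 + 0.03\,y_t^2 + 0.02\,\theta_t^2\right) \\
r &= r_1 + r_2 + r_3
\end{aligned}
$$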

Here:

  • r1 is a region-based discrete reward: a fixed reward of 10 that applies only near the target location of the robot.

  • r2 is a discrete signal that provides a large penalty when the robot moves far from the target location.

  • r3 is a continuous QR penalty that applies for all robot states.
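
The three components can be combined in code as shown in the following Python sketch. It is an illustrative transcription of the formulas, not the code used in the referenced example. It assumes that r2 and r3 enter with negative signs because they are penalties, and that r2 triggers when |x| or |y| reaches 20. The function name and argument names are hypothetical; `R_prev` and `L_prev` stand for the two control inputs from the previous time step.

```python
def flying_robot_reward(x, y, theta, R_prev, L_prev):
    """Mixed reward r = r1 + r2 + r3 for the flying robot formulas."""
    # r1: fixed bonus inside a small region around the target (origin).
    r1 = 10.0 if (x**2 + y**2 + theta**2) < 0.5 else 0.0

    # r2: large fixed penalty when the robot moves far from the target.
    r2 = -100.0 if (abs(x) >= 20 or abs(y) >= 20) else 0.0

    # r3: smooth quadratic penalty on control effort and distance,
    # applied in every state.
    r3 = -(0.2 * (R_prev + L_prev) ** 2
           + 0.3 * (R_prev - L_prev) ** 2
           + 0.03 * x**2
           + 0.03 * y**2
           + 0.02 * theta**2)

    return r1 + r2 + r3
```

At the target with zero control effort the reward is 10. Away from the target only the smooth penalty r3 is active until the robot leaves the allowed region, at which point r2 adds a large negative jump.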
