Create Policies and Value Functions

A reinforcement learning policy is a mapping from the current environment observation to a probability distribution of the actions to be taken. A value function is a mapping from the current environment observation to the expected value (the cumulative long-term reward) of the current policy.

Reinforcement learning agents use parametrized policies and value functions, which are implemented by function approximators called actors and critics, respectively. During training, an agent updates the parameters of its actor and critic to maximize the expected cumulative long-term reward.

Before creating a non-default agent, you must create the actor and critic using approximation models such as deep neural networks, linear basis functions, or lookup tables. The type of function approximator and model you can use depends on the type of agent that you want to create.

You can also create policy objects from agents, actors, or critics. You can train these objects using custom loops and deploy them in applications.

For more information on agents, see Reinforcement Learning Agents.

Actors and Critics

Reinforcement Learning Toolbox™ software supports the following types of actors and critics:

  • V(S|θV) — Critics that estimate the expected cumulative long-term reward based on a given observation S. You can create these critics using rlValueFunction.

  • Q(S,A|θQ) — Critics that estimate the expected cumulative long-term reward for a given discrete action A and a given observation S. You can create these critics using rlQValueFunction.

  • Qi(S,Ai|θQ) — Multi-output critics that estimate the expected cumulative long-term reward for all possible discrete actions Ai given the observation S. You can create these critics using rlVectorQValueFunction.

  • π(S|θπ) — Actors with a continuous action space that select an action deterministically based on a given observation S. You can create these actors using rlContinuousDeterministicActor.

  • π(S|θπ) — Actors that select an action stochastically (the action is sampled from a probability distribution) based on a given observation S. You can create these actors using either rlDiscreteCategoricalActor (for discrete action spaces) or rlContinuousGaussianActor (for continuous action spaces).

Each approximator uses a set of parameters (θV, θQ, θπ), which are computed during the learning process.

Four different diagrams showing inputs and outputs of three critics (value functions, scalar q-value functions, vector q-value functions), and one actor.

For systems with a limited number of discrete observations and discrete actions, you can store value functions in a lookup table. For systems that have many discrete observations and actions and for observation and action spaces that are continuous, storing the observations and actions is impractical. For such systems, you can represent your actors and critics using deep neural networks or custom (linear in the parameters) basis functions.

The following table summarizes the way in which you can use the six approximator objects available with Reinforcement Learning Toolbox software, depending on the action and observation spaces of your environment, and on the approximation model and agent that you want to use.

How Function Approximators (Actors or Critics) are Used in Agents

| Approximator (Actor or Critic) | Supported Model | Observation Space | Action Space | Supported Agents |
|---|---|---|---|---|
| Value function critic V(S), which you create using rlValueFunction | Table | Discrete | Not applicable | PG, AC, PPO |
| | Deep neural network or custom basis function | Discrete or continuous | Not applicable | PG, AC, PPO |
| | Deep neural network | Discrete or continuous | Not applicable | TRPO |
| Q-value function critic Q(S,A), which you create using rlQValueFunction | Table | Discrete | Discrete | Q, DQN, SARSA |
| | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
| | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3, SAC |
| Multi-output Q-value function critic with a discrete action space Q(S,A), which you create using rlVectorQValueFunction | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
| Deterministic policy actor with a continuous action space π(S), which you create using rlContinuousDeterministicActor | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3 |
| Stochastic policy actor with a discrete action space π(S), which you create using rlDiscreteCategoricalActor | Deep neural network or custom basis function | Discrete or continuous | Discrete | PG, AC, PPO |
| | Deep neural network | Discrete or continuous | Discrete | TRPO |
| Stochastic policy actor with a continuous action space π(S), which you create using rlContinuousGaussianActor | Deep neural network | Discrete or continuous | Continuous | PG, AC, PPO, SAC, TRPO |

You can configure the actor and critic optimization options using the rlOptimizerOptions object within an agent option object.

Specifically, you can create an agent options object and set its CriticOptimizerOptions and ActorOptimizerOptions properties to appropriate rlOptimizerOptions objects. Then you pass the agent options object to the function that creates the agent.

Alternatively, you can create the agent and then use dot notation to access the optimization options for the agent actor and critic, for example: agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;.
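The two approaches described above can be sketched as follows. This is a minimal illustration, not a recommended configuration: the observation and action specifications, the learning rates, and the choice of a default DDPG agent are assumptions made only so that the snippet is self-contained.

```matlab
% Hypothetical specifications (assumed for illustration only).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);

% Approach 1: set the optimizer options in the agent options object,
% then pass the options object when creating the agent.
agentOpts = rlDDPGAgentOptions;
agentOpts.ActorOptimizerOptions  = rlOptimizerOptions('LearnRate',1e-4);
agentOpts.CriticOptimizerOptions = rlOptimizerOptions('LearnRate',1e-3);
agent = rlDDPGAgent(obsInfo,actInfo,agentOpts);  % default actor and critic

% Approach 2: change the options of an existing agent using dot notation.
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;
```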

For more information on agents, see Reinforcement Learning Agents.

Policy Objects

You can extract a policy object from an agent or create it from an actor or critic. You can then use getAction to generate deterministic or stochastic actions from the policy, given an input observation. Unlike function approximator objects such as actors and critics, policy objects do not have functions that you can use to easily calculate gradients with respect to parameters. Therefore, policy objects are better suited to application deployment than to training. The following table describes the available policy objects.

Policy Objects

| Policy Object and getAction Behavior | Distribution and Exploration | Action Space | Approximator Objects Used for Creation | Agents Needed for Extraction |
|---|---|---|---|---|
| rlMaxQPolicy: generates actions that maximize a discrete action-space Q-value function | Deterministic (no exploration) | Discrete | rlQValueFunction or rlVectorQValueFunction | Q, DQN, SARSA |
| rlEpsilonGreedyPolicy: generates either actions that maximize a discrete action-space Q-value function with probability 1-Epsilon or random actions otherwise | Default: Stochastic (random actions help exploration) | Discrete | rlQValueFunction or rlVectorQValueFunction | Q, DQN, SARSA |
| rlDeterministicActorPolicy: generates continuous deterministic actions | Deterministic (no exploration) | Continuous | rlContinuousDeterministicActor | DDPG, TD3 |
| rlAdditiveNoisePolicy: generates continuous deterministic actions with added noise according to an internal noise model | Default: Stochastic (noise helps exploration) | Continuous | rlContinuousDeterministicActor | DDPG, TD3 |
| rlStochasticActorPolicy: generates stochastic actions according to a probability distribution | Default: Stochastic (random actions help exploration) | Discrete | rlDiscreteCategoricalActor | PG, AC, PPO, TRPO |
| | | Continuous | rlContinuousGaussianActor | PG, AC, PPO, TRPO, SAC |

Each one of the stochastic policy objects has an option to enable deterministic behavior, thereby disabling exploration. Except for rlEpsilonGreedyPolicy and rlAdditiveNoisePolicy, you can use generatePolicyBlock and generatePolicyFunction to generate a Simulink® block or a function that evaluates the policy, returning an action, for a given observation input. You can then use the generated function or block to generate code for application deployment. For more information, see Deploy Trained Reinforcement Learning Policies.
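As a minimal sketch of creating a policy object from a critic, the following code builds a greedy policy on top of a table-based Q-value critic. The observation and action sets and the table entry are hypothetical values chosen only to make the example self-contained.

```matlab
% Hypothetical discrete observation and action sets (assumed).
obsInfo = rlFiniteSetSpec([1 2 3 4]);
actInfo = rlFiniteSetSpec([-1 1]);

% Q-table critic with one entry made larger than the others, so that the
% greedy choice for observation 3 is the second action in the set.
qTable = rlTable(obsInfo,actInfo);
qTable.Table(3,2) = 5;
critic = rlQValueFunction(qTable,obsInfo,actInfo);

% Create a deterministic (greedy) policy from the critic and query it.
policy = rlMaxQPolicy(critic);
act = getAction(policy,{3});   % returned as a cell array
disp(act{1})
```

Because rlMaxQPolicy involves no exploration, repeated calls with the same observation should return the same action.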

Table Models

Value function approximators (critics) based on lookup table models are appropriate for environments with a limited number of discrete observations and actions. You can create two types of lookup tables:

  • Value tables, which store rewards for corresponding observations

  • Q-tables, which store rewards for corresponding observation-action pairs

To create a table-based critic, first create a value table or Q-table using the rlTable function. Then use the table object as an input argument for either rlValueFunction or rlQValueFunction to create the approximator object.
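For instance, a value table critic might be created as in the sketch below. The four-element observation set and the stored values are assumptions used only for illustration.

```matlab
% Hypothetical discrete observation set (assumed).
obsInfo = rlFiniteSetSpec([1 2 3 4]);

% Create a value table with one entry per observation and fill it
% with illustrative values.
vTable = rlTable(obsInfo);
vTable.Table = [0; 1; 2; 3];

% Wrap the table in a value function critic and evaluate it.
critic = rlValueFunction(vTable,obsInfo);
v = getValue(critic,{2});   % value stored for observation 2
```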

Neural Network Models

You can create actor and critic function approximators using deep neural network models. Doing so uses Deep Learning Toolbox™ software features.

Network Input and Output Dimensions

The dimensions of the network input and output layers for your actor and critic must match the dimension of the corresponding environment observation and action channels, respectively. To obtain the action and observation specifications from the environment env, use the getActionInfo and getObservationInfo functions, respectively.

actInfo = getActionInfo(env);
obsInfo = getObservationInfo(env);

Access the Dimensions property of each channel. For example, get the size of the first action and observation channels:

actSize = actInfo(1).Dimensions;
obsSize = obsInfo(1).Dimensions;

In general, actSize and obsSize are row vectors whose elements are the lengths of the corresponding dimensions. For example, if the first observation channel is a 256-by-256 RGB image, obsSize is the vector [256 256 3]. To calculate the total number of dimensions of the channel, use prod. For example, assuming the environment has only one observation channel:

obsDimensions = prod(obsInfo.Dimensions);

For rlVectorQValueFunction critics and rlDiscreteCategoricalActor actors, you need to obtain the number of possible elements of the action set. You can do so by accessing the Elements property of the action channel. For example, assuming the environment has only one action channel:

actNumElements = numel(actInfo.Elements);
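The following sketch reproduces these queries without an environment by constructing the specification objects directly. The image-sized observation channel and the three-element action set are assumptions for illustration; in practice you would obtain obsInfo and actInfo from your environment.

```matlab
% Hypothetical specifications standing in for getObservationInfo(env)
% and getActionInfo(env).
obsInfo = rlNumericSpec([256 256 3]);   % 256-by-256 RGB image channel
actInfo = rlFiniteSetSpec([-1 0 1]);    % three possible discrete actions

obsSize        = obsInfo(1).Dimensions;       % [256 256 3]
obsDimensions  = prod(obsInfo.Dimensions);    % 196608
actNumElements = numel(actInfo.Elements);     % 3
```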

Networks for value function critics (such as the ones used in AC, PG, PPO or TRPO agents) must take only observations as inputs and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment observation channels. For more information, see rlValueFunction.

Networks for single-output Q-value function critics (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents) must take both observations and actions as inputs, and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment channels for both observations and actions. For more information, see rlQValueFunction.

Networks for multi-output Q-value function critics (such as those used in Q, DQN, and SARSA agents) take only observations as inputs and must have a single output layer with output size equal to the number of possible discrete actions. For these networks, the dimensions of the input layers must match the dimensions of the environment observation channels. For more information, see rlVectorQValueFunction.

For actor networks, the dimensions of the input layers must match the dimensions of the environment observation channels and the dimension of the output layer must be as follows.

  • Networks used in actors with a discrete action space (such as the ones in PG, AC, and PPO agents) must have a single output layer with an output size equal to the number of possible discrete actions. For more information, see rlDiscreteCategoricalActor.

  • Networks used in deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents) must have a single output layer with an output size matching the dimension of the action space defined in the environment action specification. For more information, see rlContinuousDeterministicActor.

  • Networks used in stochastic actors with a continuous action space (such as the ones in PG, AC, PPO, and SAC agents) must have two output layers, each with as many elements as the dimension of the action space, as defined in the environment specification. One output layer must produce the mean values (which must be scaled to the output range of the action), and the other must produce the standard deviations of the actions (which must be non-negative). For more information, see rlContinuousGaussianActor.
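The two-output requirement for continuous stochastic actors can be sketched as follows. The specifications, layer sizes, and layer names are assumptions for illustration. Because the assumed action range is [-1,1], a tanhLayer alone bounds the mean; for other ranges you would typically add a scalingLayer.

```matlab
% Hypothetical specifications (assumed for illustration only).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1],'LowerLimit',-1,'UpperLimit',1);
nAct = prod(actInfo.Dimensions);

% Shared input path.
commonPath = [
    featureInputLayer(prod(obsInfo.Dimensions),'Name','observation')
    fullyConnectedLayer(32,'Name','fc')
    reluLayer('Name','relu')];

% Mean output path, bounded to [-1,1] by the tanh layer.
meanPath = [
    fullyConnectedLayer(nAct,'Name','meanFC')
    tanhLayer('Name','mean')];

% Standard deviation output path, kept positive by the softplus layer.
stdPath = [
    fullyConnectedLayer(nAct,'Name','stdFC')
    softplusLayer('Name','std')];

net = layerGraph(commonPath);
net = addLayers(net,meanPath);
net = addLayers(net,stdPath);
net = connectLayers(net,'relu','meanFC');
net = connectLayers(net,'relu','stdFC');
net = dlnetwork(net);

actor = rlContinuousGaussianActor(net,obsInfo,actInfo, ...
    'ObservationInputNames','observation', ...
    'ActionMeanOutputNames','mean', ...
    'ActionStandardDeviationOutputNames','std');

act = getAction(actor,{rand(4,1)});   % sampled action, as a cell array
```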

Deep Neural Networks

Deep neural networks consist of a series of interconnected layers. You can specify a deep neural network in several forms, for example as an array of Layer objects, a layerGraph object, or a dlnetwork object.

Note

Among the different network objects, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other network objects to dlnetwork explicitly before using them to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet = dlnetwork(net), where net is any neural network object from the Deep Learning Toolbox. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice gives you more insight and control in cases where the conversion is not straightforward and might require additional specifications.

Typically, you build your neural network by stacking together a number of layers in an array of Layer objects, possibly adding these arrays to a layerGraph object, and then converting the final result to a dlnetwork object.

For agents that need multiple input or output layers, you create an array of Layer objects for each input path (observations or actions) and for each output path (estimated rewards or actions). You then add these arrays to a layerGraph object and connect the paths together using the connectLayers function.

You can also create your deep neural network using the Deep Network Designer app. For an example, see Create Agent Using Deep Network Designer and Train Using Image Observations.

The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers.

| Layer | Description |
|---|---|
| featureInputLayer | Inputs feature data and applies normalization. |
| imageInputLayer | Inputs vectors and 2-D images and applies normalization. |
| sigmoidLayer | Applies a sigmoid function to the input such that the output is bounded in the interval (0,1). |
| tanhLayer | Applies a hyperbolic tangent activation to the input. |
| reluLayer | Sets any input values that are less than zero to zero. |
| fullyConnectedLayer | Multiplies the input vector by a weight matrix and adds a bias vector. |
| softmaxLayer | Applies a softmax function to the input, normalizing it to a probability distribution. |
| convolution2dLayer | Applies sliding convolutional filters to the input. |
| additionLayer | Adds the outputs of multiple layers together. |
| concatenationLayer | Concatenates inputs along a specified dimension. |
| sequenceInputLayer | Inputs sequence data to a network. |
| lstmLayer | Applies a Long Short-Term Memory layer to the input. Supported for DQN and PPO agents. |

The bilstmLayer and batchNormalizationLayer layers are not supported for reinforcement learning.

The Reinforcement Learning Toolbox software provides the following layers, which contain no tunable parameters (that is, parameters that change during training).

| Layer | Description |
|---|---|
| scalingLayer | Applies a linear scale and bias to an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoidLayer. |
| quadraticLayer | Creates a vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller. |
| softplusLayer | Implements the softplus activation Y = log(1 + e^X), which ensures that the output is always positive. This function is a smoothed version of the rectified linear unit (ReLU). |

You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers.

When you create a deep neural network, it is good practice to specify names for the first layer of each input path and the final layer of the output path.

The following code creates and connects the following input and output paths:

  • An observation input path, observationPath, with the first layer named 'observation'.

  • An action input path, actionPath, with the first layer named 'action'.

  • An estimated value function output path, commonPath, which takes the outputs of observationPath and actionPath as inputs. The final layer of this path is named 'output'.

% Observation input path. The first layer is named 'observation'.
observationPath = [
    featureInputLayer(4,'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];

% Action input path. The first layer is named 'action'.
actionPath = [
    featureInputLayer(1,'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActFC1')];

% Common output path. Adds the two paths and returns a scalar value.
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];

% Assemble the paths in a layer graph and connect them.
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');

% Convert the layer graph to a dlnetwork object.
criticNetwork = dlnetwork(criticNetwork);

For all observation and action input paths, you must specify a featureInputLayer as the first layer in the path, with a number of input neurons equal to the number of dimensions of the corresponding environment channel.

You can view the structure of your deep neural network using the plot function.

plot(layerGraph(criticNetwork))

Structure of the critic network, with observation and action as inputs and a scalar output value.

Since the output of a network in an rlDiscreteCategoricalActor actor must represent the probability of executing each possible action, the software automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly. When computing the action, the actor then randomly samples the distribution to return an action.
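A minimal sketch of such a discrete stochastic actor follows. The specifications and layer sizes are assumptions for illustration; the network deliberately ends with a fully connected layer and omits the softmax layer.

```matlab
% Hypothetical specifications (assumed for illustration only).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

% The output size equals the number of possible discrete actions.
net = [
    featureInputLayer(prod(obsInfo.Dimensions),'Name','observation')
    fullyConnectedLayer(16,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(numel(actInfo.Elements),'Name','fcOut')];
net = dlnetwork(net);

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

% The returned action is sampled from the categorical distribution, so
% repeated calls with the same observation can return different actions.
act = getAction(actor,{rand(4,1)});
```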

Determining the number, type, and size of layers for your deep neural network can be difficult and is application dependent. However, the most critical component in deciding the characteristics of the function approximator is whether it is able to approximate the optimal policy or discounted value function for your application, that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.

Consider the following tips when constructing your network.

  • For continuous action spaces, bound actions with a tanhLayer followed by a scalingLayer to scale the action to desired values, if necessary.

  • Deep dense networks with reluLayer layers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.

  • Start with the smallest possible network that you think can approximate the optimal policy or value function.

  • When you approximate strong nonlinearities or systems with algebraic constraints, adding more layers is often better than increasing the number of outputs per layer. In general, the ability of the approximator to represent more complex (compositional) functions grows only polynomially in the size of the layers, but grows exponentially with the number of layers. In other words, more layers allow approximating more complex and nonlinear compositional functions, although this generally requires more data and longer training times. Given a total number of neurons and comparable approximation tasks, networks with fewer layers can require exponentially more units to successfully approximate the same class of functions, and might fail to learn and generalize correctly.

  • For on-policy agents (the ones that learn only from experience collected while following the current policy), such as AC and PG agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has on the order of a thousand parameters). On-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. If the network is small, the worker updates can correlate with each other and make training unstable.
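As an illustration of the first tip, the following sketch bounds the output of a deterministic actor with a tanhLayer and then rescales it with a scalingLayer. The specifications, the action range of [-2,2], and the layer sizes are assumptions chosen for the example.

```matlab
% Hypothetical specifications with a symmetric action range (assumed).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1],'LowerLimit',-2,'UpperLimit',2);

actorNet = [
    featureInputLayer(prod(obsInfo.Dimensions),'Name','observation')
    fullyConnectedLayer(24,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(prod(actInfo.Dimensions),'Name','fc2')
    tanhLayer('Name','tanh')                               % bounds to [-1,1]
    scalingLayer('Name','action','Scale',actInfo.UpperLimit)]; % rescales to [-2,2]
actorNet = dlnetwork(actorNet);

actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
act = getAction(actor,{rand(4,1)});
```

For an asymmetric action range, the Bias property of scalingLayer can shift the output in addition to scaling it.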

Create and Configure Actors and Critics from a Neural Network

To create a critic from your deep neural network, use an rlValueFunction, rlQValueFunction or (whenever possible) an rlVectorQValueFunction object. To create a deterministic actor for a continuous action space from your deep neural network, use an rlContinuousDeterministicActor object. To create a stochastic actor from your deep neural network use either an rlDiscreteCategoricalActor or an rlContinuousGaussianActor object. To configure the learning rate and optimization used by the actor or critic, use an optimizer object within an agent option object.

For example, create a Q-value function object for the critic network criticNetwork. Then create the critic optimizer object criticOpts, specifying a learning rate of 0.02 and a gradient threshold of 1.

critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,...
        'ObservationInputNames',{'observation'},...
        'ActionInputNames',{'action'});

criticOpts = rlOptimizerOptions('LearnRate',0.02,...
                                'GradientThreshold',1);

Then create an agent option object, and set the CriticOptimizerOptions property of the agent option object to criticOpts. Finally, when you create the agent, pass the agent option object as the last input argument to the agent constructor function.
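The complete sequence might look like the following sketch, shown here for a DQN agent with a multi-output critic. The specifications and network are assumptions made so that the example is self-contained; the optimizer settings are the ones used above.

```matlab
% Hypothetical specifications (assumed for illustration only).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Multi-output Q-value critic: one output per possible action.
net = dlnetwork([
    featureInputLayer(prod(obsInfo.Dimensions),'Name','observation')
    fullyConnectedLayer(24,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(numel(actInfo.Elements),'Name','output')]);
critic = rlVectorQValueFunction(net,obsInfo,actInfo);

% Optimizer options for the critic.
criticOpts = rlOptimizerOptions('LearnRate',0.02,'GradientThreshold',1);

% Store the optimizer options in the agent options object, then pass the
% options object as the last argument when creating the agent.
agentOpts = rlDQNAgentOptions;
agentOpts.CriticOptimizerOptions = criticOpts;
agent = rlDQNAgent(critic,agentOpts);
```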

When you create your deep neural network and configure your actor or critic, consider using the following approach as a starting point.

  1. Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see if the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.

  2. Once you settle on a good network architecture, a low initial learning rate can allow you to see if the agent is on the right track, and help you check that your network architecture is satisfactory for the problem. A low learning rate makes tuning parameters easier, especially for difficult problems.

Also, consider the following tips when configuring your deep neural network agent.

  • Be patient with DDPG and DQN agents, since they might not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.

  • For DDPG and DQN agents, promoting exploration of the agent is critical.

  • For agents with both actor and critic networks, set the initial learning rates of both actor and critic to the same value. However, for some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.

Recurrent Neural Networks

When creating actors or critics for use with any agent except Q and SARSA, you can use recurrent neural networks (RNN). These networks are deep neural networks with a sequenceInputLayer input layer and at least one layer that has hidden state information, such as an lstmLayer. They can be especially useful when the environment has states that cannot be included in the observation vector.

For agents that have both an actor and a critic, you must either use an RNN for both of them or for neither of them. You cannot use an RNN only for the critic or only for the actor.

When using PG agents, the learning trajectory length for the RNN is the whole episode. For an AC agent, the NumStepsToLookAhead property of its options object is treated as the training trajectory length. For a PPO agent, the trajectory length is the MiniBatchSize property of its options object.

For DQN, DDPG, SAC, and TD3 agents, you must specify the training trajectory length as an integer greater than one in the SequenceLength property of their options object.
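As a minimal sketch of a recurrent critic for a DQN agent, the following code uses a sequenceInputLayer and an lstmLayer and sets SequenceLength in the agent options. The specifications, layer sizes, and the sequence length of 20 are assumptions for illustration.

```matlab
% Hypothetical specifications (assumed for illustration only).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Recurrent network: sequence input layer plus a layer with hidden state.
net = [
    sequenceInputLayer(prod(obsInfo.Dimensions),'Name','observation')
    fullyConnectedLayer(24,'Name','fc1')
    reluLayer('Name','relu1')
    lstmLayer(16,'OutputMode','sequence','Name','lstm')
    fullyConnectedLayer(numel(actInfo.Elements),'Name','output')];
net = dlnetwork(net);

critic = rlVectorQValueFunction(net,obsInfo,actInfo);

% The training trajectory length must be an integer greater than one.
agentOpts = rlDQNAgentOptions('SequenceLength',20);
agent = rlDQNAgent(critic,agentOpts);
```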

Note that code generation is not supported for continuous action space PG, AC, PPO and TRPO agents, and SAC agents using a recurrent neural network (RNN), or for any agent having multiple input paths and containing an RNN in any of the paths.

For more information and examples on policies and value functions, see rlValueFunction, rlQValueFunction, rlVectorQValueFunction, rlContinuousDeterministicActor, rlDiscreteCategoricalActor, and rlContinuousGaussianActor.

Custom Basis Function Models

Custom (linear in the parameters) basis function approximation models have the form f = W'B, where W is a weight array and B is the column vector output of a custom basis function that you must create. The learnable parameters of a linear basis function are the elements of W.

For value function critics (such as the ones used in AC, PG, or PPO agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of the observation. For more information and examples, see rlValueFunction.

For single-output Q-value function critics (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of both the observation and action. For more information and examples, see rlQValueFunction.

For multi-output Q-value function critics with discrete action spaces (such as those used in Q, DQN, and SARSA agents), f is a vector with as many elements as the number of possible actions. Therefore, W must be a matrix with as many columns as the number of possible actions and as many rows as the length of B. B must be a function of the observation only. For more information and examples, see rlVectorQValueFunction.

  • For deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents), the dimensions of f must match the dimensions of the agent action specification, which is either a scalar or a column vector. For more information and examples, see rlContinuousDeterministicActor.

  • For stochastic actors with a discrete action space (such as the ones in PG, AC, and PPO agents), f must be a column vector with length equal to the number of possible discrete actions. The output of the actor is softmax(f), which represents the probability of selecting each possible action. For more information and examples, see rlDiscreteCategoricalActor.

  • Stochastic actors with continuous action spaces cannot rely on custom basis functions. They can only use neural network approximators, because the standard deviations must be constrained to be positive. For more information and examples, see rlContinuousGaussianActor.

For any actor, W must have as many columns as the number of elements in f, and as many rows as the number of elements in B. B must be only a function of the observation.
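The dimension rules for W can be illustrated with a small sketch that uses the same basis function for a value function critic and for a deterministic actor. The basis function, the specifications, and the initial weights are hypothetical choices made only for this example.

```matlab
% Hypothetical specifications (assumed for illustration only).
obsInfo = rlNumericSpec([2 1]);
actInfo = rlNumericSpec([1 1]);

% Custom basis function B(obs): returns a column vector with 4 elements.
basisFcn = @(obs) [obs(1); obs(2); obs(1)^2; obs(2)^2];

% Value function critic: f is scalar, so W is a 4-by-1 column vector.
Wc = ones(4,1);
critic = rlValueFunction({basisFcn,Wc},obsInfo);
v = getValue(critic,{[1;2]});    % evaluates W'*B for this observation

% Deterministic actor: W has one row per element of B and one column
% per element of the action (here a single action element).
Wa = zeros(4,1);
actor = rlContinuousDeterministicActor({basisFcn,Wa},obsInfo,actInfo);
a = getAction(actor,{[1;2]});
```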

For an example that trains a custom agent that uses a linear basis function, see Train Custom LQR Agent.

Create an Agent

Once you create your actor and critic, you can create a reinforcement learning agent that uses them. For example, create a PG agent using a given actor and critic (baseline) network.

agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

You can obtain the actor and critic from an existing agent using getActor and getCritic, respectively.

You can also set the actor and critic of an existing agent using setActor and setCritic, respectively. The input and output layers of the actor and critic must match the observation and action specifications of the original agent.
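The following sketch shows one possible round trip: create an agent from an actor and a critic, extract the critic, and set it back. An AC agent is used here instead of the PG agent shown above, and the specifications and networks are assumptions made to keep the example self-contained.

```matlab
% Hypothetical specifications (assumed for illustration only).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Value function critic.
criticNet = dlnetwork([
    featureInputLayer(4,'Name','observation')
    fullyConnectedLayer(16,'Name','cfc1')
    reluLayer('Name','crelu')
    fullyConnectedLayer(1,'Name','value')]);
critic = rlValueFunction(criticNet,obsInfo);

% Discrete stochastic actor.
actorNet = dlnetwork([
    featureInputLayer(4,'Name','observation')
    fullyConnectedLayer(16,'Name','afc1')
    reluLayer('Name','arelu')
    fullyConnectedLayer(numel(actInfo.Elements),'Name','actOut')]);
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);

agent = rlACAgent(actor,critic);

% Extract the critic, read and write back its parameters, then return
% the critic to the agent.
critic = getCritic(agent);
params = getLearnableParameters(critic);
critic = setLearnableParameters(critic,params);
agent  = setCritic(agent,critic);

% The actor can be retrieved in the same way.
actor = getActor(agent);
```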
