Create Policy and Value Function Representations

A reinforcement learning policy is a mapping that selects an action to take based on observations from the environment. During training, the agent tunes the parameters of its policy representation to maximize the long-term reward.

Depending on the type of reinforcement learning agent you are using, you define actor and critic function approximators, which the agent uses to represent and train its policy. The actor represents the policy that selects the best action to take. The critic represents the value function that estimates the long-term reward for the current policy. Depending on your application and selected agent, you can define policy and value functions using deep neural networks, linear basis functions, or look-up tables.

For more information on agents, see Reinforcement Learning Agents.

Function Approximation

Depending on the type of agent you are using, Reinforcement Learning Toolbox™ software supports the following types of function approximators:

  • V(S|θV) — Critics that estimate the expected long-term reward based on observation S

  • Q(S,A|θQ) — Critics that estimate the expected long-term reward based on observation S and action A

  • μ(S|θμ) — Actors that select an action based on observation S

Each function approximator has a corresponding set of parameters (θV, θQ, or θμ), which are computed during the learning process.

For systems with a limited number of discrete observations and discrete actions, you can store value functions in a look-up table. For systems that have many discrete observations and actions or with observation and action spaces that are continuous, storing the observations and actions becomes impractical. For such systems, you can represent your actors and critics using deep neural networks or linear basis functions.

Table Representations

You can create two types of table representations:

  • Value tables, which store rewards for corresponding observations.

  • Q-tables, which store rewards for corresponding observation-action pairs.

To create a table representation, first create a value table or Q table using the rlTable function. Then, create a representation for the table using the rlRepresentation function. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
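As a sketch of this workflow for a Q-table (the environment env is assumed to exist, and the exact rlRepresentation signature can vary by release):

```matlab
% Assumes env is an environment with finite observation and action spaces.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Create a Q-table from the observation and action specifications.
qTable = rlTable(obsInfo,actInfo);

% Create the table representation with a configured learning rate.
opt = rlRepresentationOptions('LearnRate',0.01);
critic = rlRepresentation(qTable,opt);
```

For a value table instead of a Q-table, pass only the observation specification to rlTable.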

Deep Neural Network Representations

You can create actor and critic function approximators using deep neural network representations. Doing so uses Deep Learning Toolbox™ software features.

Network Input and Output Dimensions

The dimensions of your actor and critic networks must match the corresponding action and observation specifications from the training environment object. To obtain the action and observation dimensions for environment env, use the getActionInfo and getObservationInfo functions, respectively. Then, access the Dimensions property of the specification objects:

actInfo = getActionInfo(env);
actDimensions = actInfo.Dimensions;

obsInfo = getObservationInfo(env);
obsDimensions = obsInfo.Dimensions;

For critic networks that take only observations as inputs, such as those used in AC or PG agents, the dimensions of the input layers must match the dimensions of the environment observation specifications. The critic output layer must return a scalar, which represents the value function.

For critic networks that take both observations and actions as inputs, such as those used in DQN or DDPG agents, the dimensions of the input layers must match the dimensions of the corresponding environment observation and action specifications. In this case as well, the critic output layer must return a scalar value.

For actor networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. If the actor has a:

  • Discrete action space, then its output size must equal the number of discrete actions.

  • Continuous action space, then its output size must be a scalar or vector value, as defined in the action specification.
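As a sketch combining these rules, consider a hypothetical environment with a 4-element observation vector and two discrete actions; the layer names and sizes here are illustrative:

```matlab
obsInfo = getObservationInfo(env);   % assume obsInfo.Dimensions is [4 1]
actInfo = getActionInfo(env);        % assume two discrete actions

actorNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(2,'Name','action')];  % one output per discrete action
```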

Create Deep Neural Network

Deep neural networks consist of a series of interconnected layers. The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers (Deep Learning Toolbox).

  • imageInputLayer: Input vectors and 2-D images, and normalize the data.

  • tanhLayer: Apply a hyperbolic tangent activation function to the layer inputs.

  • reluLayer: Set any input values that are less than zero to zero.

  • fullyConnectedLayer: Multiply the input vector by a weight matrix, and add a bias vector.

  • convolution2dLayer: Apply sliding convolutional filters to the input.

  • additionLayer: Add the outputs of multiple layers together.

  • concatenationLayer: Concatenate inputs along a specified dimension.

The lstmLayer, bilstmLayer, and batchNormalizationLayer layers are not supported for reinforcement learning.

You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers (Deep Learning Toolbox). Reinforcement Learning Toolbox software provides the following custom layers.

  • scalingLayer: Linearly scale and bias an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoid layers.

  • quadraticLayer: Create a vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller.

The scalingLayer and quadraticLayer custom layers do not contain tunable parameters; that is, they do not change during training.

For reinforcement learning applications, you construct your deep neural network by connecting a series of layers for each input path (observations or actions) and for each output path (estimated rewards or actions). You then connect these paths together using the connectLayers function.

When you create a deep neural network, you must specify names for the first layer of each input path and the final layer of the output path.

The following code creates and connects the following input and output paths:

  • An observation input path, observationPath, with the first layer named 'observation'.

  • An action input path, actionPath, with the first layer named 'action'.

  • An estimated value function output path, commonPath, which takes the outputs of observationPath and actionPath as inputs. The final layer of this path is named 'output'.

observationPath = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticObsRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];
actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu1')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');

For all observation and action input paths, you must specify an imageInputLayer as the first layer in the path.

You can view the structure of your deep neural network using the plot function.


For PG and AC agents, the final output layers of your deep neural network actor representation are a fullyConnectedLayer and a softmaxLayer. When you specify the layers for your network, you must specify the fullyConnectedLayer, and you can optionally specify the softmaxLayer. If you omit the softmaxLayer, the software automatically adds one for you.
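For example, the output path of such an actor might end as follows; the softmaxLayer is written out explicitly here, though it can be omitted (the names and sizes are illustrative):

```matlab
% Final layers of a PG or AC actor with four discrete actions.
actorOutputPath = [
    fullyConnectedLayer(4,'Name','ActorOutput')
    softmaxLayer('Name','ActionProbability')];
```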

Determining the number, type, and size of layers for your deep neural network representation can be difficult and is application-dependent. However, the most important consideration for any function approximator is whether it can approximate the optimal policy or discounted value function for your application; that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.

Consider the following tips when constructing your network.

  • For continuous action spaces, bound the actions with a tanhLayer followed by a scalingLayer, if necessary.

  • Deep dense networks with reluLayers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.

  • When approximating strong nonlinearities or systems with algebraic constraints, it is often better to add more layers rather than to increase the number of outputs per layer. Adding more layers promotes exponential exploration, while adding layer outputs promotes polynomial exploration.

  • For on-policy agents, such as AC and PG agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has a few hundred parameters). On-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. If the network is small, the worker updates might correlate with each other and make training unstable.
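As an illustration of the first tip above, a continuous-action output path can bound the action to [-2, 2] with a tanhLayer followed by a scalingLayer (the scale factor and layer names are illustrative):

```matlab
% tanhLayer outputs values in [-1, 1]; scalingLayer maps them to [-2, 2].
actionOutputPath = [
    fullyConnectedLayer(1,'Name','ActorFC2')
    tanhLayer('Name','ActorTanh')
    scalingLayer('Name','ActorScaling','Scale',2)];
```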

Create and Configure Representation

To create an actor or critic representation object for your deep neural network, use the rlRepresentation function. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.

For example, create a representation object for the critic network criticNetwork, specifying a learning rate of 0.0001. When you create the representation, pass the environment action and observation specifications to the rlRepresentation function, and specify the names of the network layers to which the actions and observations are connected.

opt = rlRepresentationOptions('LearnRate',0.0001);
critic = rlRepresentation(criticNetwork,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'action'},opt);

When creating your deep neural network and configuring your representation object, consider using one of the following approaches as a starting point:

  1. Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see if the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, resize the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.

  2. Initially configure the agent to learn slowly by setting a low learning rate. By learning slowly, you can check to see if the agent is on the right track, which can help verify whether your network architecture is satisfactory for the problem. For difficult problems, tuning parameters is much easier once you settle on a good network architecture.
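For instance, the first approach might start from options such as these (the gradient threshold value is an assumption here, not a recommendation):

```matlab
% Small network, high initial learning rate, clipped gradients.
initialOpt = rlRepresentationOptions('LearnRate',0.01,'GradientThreshold',1);
```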

Also, consider the following tips when configuring your deep neural network representation:

  • Be patient with DDPG and DQN agents, since they may not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.

  • For DDPG and DQN agents, promoting exploration of the agent is critical.

  • For agents with both actor and critic networks, set the initial learning rates of both representations to the same value. For some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.

Linear Basis Function Representations

Linear basis function representations have the form f = W'B, where W is a weight array, and B is the column vector output of a custom basis function. The learnable parameters of a linear basis function representation are the elements of W.

For critic representations, f is a scalar value and W is a column vector with the same length as B.

For actor representations, with a:

  • Continuous action space, the dimensions of f match the dimensions of the agent action specification, which is either a scalar or a column vector.

  • Discrete action space, f is a column vector with length equal to the number of discrete actions.

For actor representations, the number of columns in W equals the number of elements in f.

To create a linear basis function representation, first create a custom basis function that returns a column vector. The signature of this basis function depends on what type of function approximator you are creating. When you create:

  • A critic representation with observation inputs only or an actor representation, your basis function must have the following signature.

    B = myBasisFunction(obs1,obs2,...,obsN)
  • A critic representation with observation and action inputs, your basis function must have the following signature.

    B = myBasisFunction(obs1,obs2,...,obsN,act)

Here obs1 to obsN are observations in the same order and with the same data type and dimensions as the observation specifications of the agent, and act has the same data type and dimensions as the agent action specification.

Each element of B can be any function of the observation and action signals, depending on the requirements of your application.
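As a sketch, a quadratic basis for a critic with a single scalar observation and a scalar action might look like the following; the basis function handle, initial weights, and the exact rlRepresentation call are assumptions (see rlRepresentation for the signature in your release):

```matlab
% Basis: quadratic monomials of a scalar observation obs and scalar action act.
myBasisFcn = @(obs,act) [obs; act; obs^2; act^2; obs*act];

% W must be a column vector with the same length as the basis output.
W0 = zeros(5,1);

% Create the critic representation from the basis function and initial weights.
critic = rlRepresentation({myBasisFcn,W0},obsInfo,actInfo);
```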

For more information on creating such a representation, see rlRepresentation.

For an example that trains a custom agent that uses a linear basis function representation, see Train Custom LQR Agent.

Specify Agent Representations

Once you create your actor and critic representations, you can create a reinforcement learning agent that uses these representations. For example, create a PG agent that uses a baseline critic, given an actor representation actor and a critic representation baseline.

agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

You can obtain the actor and critic representations from an existing agent using getActor and getCritic, respectively.

You can also set the actor and critic of an existing agent using setActor and setCritic, respectively. When you specify a representation using these functions, the input and output layers of the specified representation must match the observation and action specifications of the original agent.
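For instance, a sketch of retrieving and replacing an agent's actor (assuming agent exists and newActor matches its observation and action specifications):

```matlab
% Inspect the current actor and critic representations.
actor  = getActor(agent);
critic = getCritic(agent);

% Replace the actor with an updated representation.
agent = setActor(agent,newActor);
```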
