A reinforcement learning policy is a mapping that selects an action to take based on observations from the environment. During training, the agent tunes the parameters of its policy representation to maximize the long-term reward.

Depending on the type of reinforcement learning agent you are using, you define actor and critic function approximators, which the agent uses to represent and train its policy. The actor represents the policy that selects the best action to take. The critic represents the value function that estimates the long-term reward for the current policy. Depending on your application and selected agent, you can define policy and value functions using deep neural networks, linear basis functions, or look-up tables.

For more information on agents, see Reinforcement Learning Agents.

Depending on the type of agent you are using, Reinforcement Learning Toolbox™ software supports the following types of function approximators:

*V*(*S*|*θ _{V}*) — Critics that estimate the expected long-term reward based on observation *S*.

*Q*(*S*,*A*|*θ _{Q}*) — Critics that estimate the expected long-term reward based on observation *S* and action *A*.

*μ*(*S*|*θ _{μ}*) — Actors that select an action based on observation *S*.

Each function approximator has a corresponding set of parameters (*θ _{V}*, *θ _{Q}*, or *θ _{μ}*), which are tuned during the learning process.

For systems with a limited number of discrete observations and discrete actions, you can store value functions in a look-up table. For systems that have many discrete observations and actions, or whose observation and action spaces are continuous, storing a separate value for every observation and action becomes impractical. For such systems, you can represent your actors and critics using deep neural networks or linear basis functions.

You can create two types of table representations:

Value tables, which store rewards for corresponding observations.

Q-tables, which store rewards for corresponding observation-action pairs.

To create a table representation, first create a value table or Q-table using the `rlTable` function. Then, create a representation for the table using the `rlRepresentation` function. To configure the learning rate and optimization used by the representation, use an `rlRepresentationOptions` object.
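As a minimal sketch, assuming a predefined environment `env` with discrete observation and action spaces; the environment and the option value are illustrative, and the exact `rlRepresentation` argument list for tables can vary by release.

```
% Assumption: env is an existing environment with discrete
% observation and action spaces.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Create a Q-table over all observation-action pairs.
qTable = rlTable(obsInfo,actInfo);

% Configure the learning rate, then create the table representation.
tableOpts = rlRepresentationOptions('LearnRate',1);
critic = rlRepresentation(qTable,tableOpts);
```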

You can create actor and critic function approximators using deep neural network representations. Doing so uses Deep Learning Toolbox™ software features.

The dimensions of your actor and critic networks must match the corresponding action and observation specifications from the training environment object. To obtain the action and observation dimensions for environment `env`, use the `getActionInfo` and `getObservationInfo` functions, respectively. Then access the `Dimensions` property of the specification objects.

```
actInfo = getActionInfo(env);
actDimensions = actInfo.Dimensions;
obsInfo = getObservationInfo(env);
obsDimensions = obsInfo.Dimensions;
```

For critic networks that take only observations as inputs, such as those used in AC or PG agents, the dimensions of the input layers must match the dimensions of the environment observation specifications. The output of the critic must be a scalar value function.

For critic networks that take both observations and actions as inputs, such as those used in DQN or DDPG agents, the dimensions of the input layers must match the dimensions of the corresponding environment observation and action specifications. In this case, the output of the critic must also be a scalar value function.

For actor networks the dimensions of the input layers must match the dimensions of the environment observation specifications. If the actor has a:

Discrete action space, then its output size must equal the number of discrete actions.

Continuous action space, then its output size must be a scalar or vector value, as defined in the action specification.
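For example, the discrete case above can be sketched as a minimal actor network for a hypothetical environment with a 4-element observation vector and two discrete actions; all sizes and layer names here are illustrative.

```
% Input layer matches a [4 1 1] observation specification.
actorNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(16,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(2,'Name','action')];  % output size = number of discrete actions
```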

Deep neural networks consist of a series of interconnected layers. The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers (Deep Learning Toolbox).

Layer | Description |
---|---|
`imageInputLayer` | Input vectors and 2-D images, and normalize the data. |
`tanhLayer` | Apply a hyperbolic tangent activation layer to the layer inputs. |
`reluLayer` | Set any input values that are less than zero to zero. |
`fullyConnectedLayer` | Multiply the input vector by a weight matrix, and add a bias vector. |
`convolution2dLayer` | Apply sliding convolutional filters to the input. |
`additionLayer` | Add the outputs of multiple layers together. |
`concatenationLayer` | Concatenate inputs along a specified dimension. |

The `lstmLayer`, `bilstmLayer`, and `batchNormalizationLayer` layers are not supported for reinforcement learning.

You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers (Deep Learning Toolbox). Reinforcement Learning Toolbox software provides the following custom layers.

Layer | Description |
---|---|
`scalingLayer` | Linearly scale and bias an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as `tanhLayer` and sigmoid layers. |
`quadraticLayer` | Create a vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller. |

The `scalingLayer` and `quadraticLayer` custom layers do not contain tunable parameters; that is, they do not change during training.

For reinforcement learning applications, you construct your deep neural network by connecting a series of layers for each input path (observations or actions) and for each output path (estimated rewards or actions). You then connect these paths together using the `connectLayers` function.

When you create a deep neural network, you must specify names for the first layer of each input path and the final layer of the output path.

The following code creates and connects the following input and output paths:

An observation input path, `observationPath`, with the first layer named `'observation'`.

An action input path, `actionPath`, with the first layer named `'action'`.

An estimated value function output path, `commonPath`, which takes the outputs of `observationPath` and `actionPath` as inputs. The final layer of this path is named `'output'`.

```
observationPath = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];
actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
```

For all observation and action input paths, you must specify an `imageInputLayer` as the first layer in the path.

You can view the structure of your deep neural network using the `plot` function.

plot(criticNetwork)

For PG and AC agents, the final output layers of your deep neural network actor representation are a `fullyConnectedLayer` and a `softmaxLayer`. When you specify the layers for your network, you must specify the `fullyConnectedLayer`, and you can optionally specify the `softmaxLayer`. If you omit the `softmaxLayer`, the software automatically adds one for you.

Determining the number, type, and size of layers for your deep neural network representation can be difficult and is application-dependent. However, the most critical consideration for any function approximator is whether it can approximate the optimal policy or discounted value function for your application; that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.

Consider the following tips when constructing your network.

For continuous action spaces, bound actions with a `tanhLayer` followed by a `scalingLayer`, if necessary.

Deep dense networks with `reluLayer` layers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.

When approximating strong nonlinearities or systems with algebraic constraints, it is often better to add more layers rather than to increase the number of outputs per layer. Adding more layers promotes exponential exploration, while adding layer outputs promotes polynomial exploration.
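The action-bounding tip can be sketched as follows for a scalar continuous action bounded to [-2, 2]; the bound value and layer names are illustrative.

```
actionBound = 2;
outputPath = [
    fullyConnectedLayer(1,'Name','ActorFC2')
    tanhLayer('Name','ActorTanh')                               % squashes output to [-1,1]
    scalingLayer('Name','ActorScaling','Scale',actionBound)];   % rescales to [-2,2]
```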

For on-policy agents, such as AC and PG agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has a few hundred parameters). On-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. If the network is small, the worker updates might correlate with each other and make training unstable.

To create an actor or critic representation object for your deep neural network, use the `rlRepresentation` function. To configure the learning rate and optimization used by the representation, use an `rlRepresentationOptions` object.

For example, create a representation object for the critic network `criticNetwork`, specifying a learning rate of `0.0001`. When you create the representation, pass the environment action and observation specifications to the `rlRepresentation` function, and specify the names of the network layers to which the actions and observations are connected.

```
opt = rlRepresentationOptions('LearnRate',0.0001);
critic = rlRepresentation(criticNetwork,'Observation',{'observation'},obsInfo,...
    'Action',{'action'},actInfo,opt);
```

When creating your deep neural network and configuring your representation object, consider using one of the following approaches as a starting point:

Start with the smallest possible network and a high learning rate (`0.01`). Train this initial network to see if the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.

Initially configure the agent to learn slowly by setting a low learning rate. By learning slowly, you can check whether the agent is on the right track, which can help verify whether your network architecture is satisfactory for the problem. For difficult problems, tuning parameters is much easier once you settle on a good network architecture.

Also, consider the following tips when configuring your deep neural network representation:

Be patient with DDPG and DQN agents, since they may not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.

For DDPG and DQN agents, promoting exploration of the agent is critical.

For agents with both actor and critic networks, set the initial learning rates of both representations to the same value. For some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.

Linear basis function representations have the form `f = W'B`, where `W` is a weight array and `B` is the column vector output of a custom basis function. The learnable parameters of a linear basis function representation are the elements of `W`.

For critic representations, `f` is a scalar value and `W` is a column vector with the same length as `B`.

For actor representations, with a:

Continuous action space, the dimensions of `f` match the dimensions of the agent action specification, which is either a scalar or a column vector.

Discrete action space, `f` is a column vector with length equal to the number of discrete actions.

In either case, the number of columns in `W` equals the number of elements in `f`.

To create a linear basis function representation, first create a custom basis function that returns a column vector. The signature of this basis function depends on what type of function approximator you are creating. When you create:

A critic representation with observation inputs only or an actor representation, your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

A critic representation with observation and action inputs, your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN,act)

Here `obs1` to `obsN` are observations in the same order and with the same data type and dimensions as the observation specifications of the agent, and `act` has the same data type and dimensions as the agent action specification.

Each element of `B` can be any function of the observation and action signals, depending on the requirements of your application.
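As an illustration, here is a basis function sketch for a critic that takes a single two-element observation vector and a scalar action; the particular choice of monomials is an assumption for this example.

```
function B = myBasisFunction(obs,act)
% Column vector of basis elements: constant, linear observation terms,
% the action, and observation-action cross terms.
    obs = obs(:);             % ensure column vector
    B = [1; obs; act; obs*act];
end
```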

For more information on creating such a representation, see `rlRepresentation`.

For an example that trains a custom agent that uses a linear basis function representation, see Train Custom LQR Agent.

Once you create your actor and critic representations, you can create a reinforcement learning agent that uses these representations. For example, create a PG agent using a given actor and baseline critic representation.

```
agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);
```

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

You can obtain the actor and critic representations from an existing agent using `getActor` and `getCritic`, respectively.

You can also set the actor and critic of an existing agent using `setActor` and `setCritic`, respectively. When you specify a representation using these functions, the input and output layers of the specified representation must match the observation and action specifications of the original agent.

`rlRepresentation` | `rlRepresentationOptions`