rlValueFunction

Value function approximator object for reinforcement learning agents

Description

This object implements a value function approximator that you can use as a critic for a reinforcement learning agent. A value function maps an environment state to a scalar value. The output represents the predicted discounted cumulative long-term reward when the agent starts from the given state and takes the best possible action. After you create an rlValueFunction critic, use it to create an agent such as an rlACAgent, rlPGAgent, or rlPPOAgent. For an example of this workflow, see Create Actor and Critic Representations. For more information on creating value functions, see Create Policies and Value Functions.

Creation

Description

critic = rlValueFunction(net,observationInfo) creates the value-function object critic from the deep neural network net and sets the ObservationInfo property of critic to the observationInfo input argument. The network input layers are automatically associated with the environment observation channels according to the dimension specifications in observationInfo.

critic = rlValueFunction(net,ObservationInputNames=netObsNames) specifies the network input layer names to be associated with the environment observation channels. The function assigns, in sequential order, each environment observation channel specified in observationInfo to the layer specified by the corresponding name in the string array netObsNames. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation channels, as ordered in observationInfo.
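As an illustration of this syntax, the following sketch names the network input layer explicitly and passes that name to the constructor. The layer name "obsInLyr" and the network architecture are hypothetical, chosen only for this sketch.

```matlab
% Hypothetical sketch: name the input layer so it can be matched
% explicitly to the environment observation channel.
obsInfo = rlNumericSpec([4 1]);
net = dlnetwork([
    featureInputLayer(prod(obsInfo.Dimension),Name="obsInLyr")
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(1)]);
critic = rlValueFunction(net,obsInfo, ...
    ObservationInputNames="obsInLyr");
```

Explicit naming is most useful when the network has several input layers and automatic association by dimension would be ambiguous.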

critic = rlValueFunction(tab,observationInfo) creates the value function object critic with a discrete observation space, from the table tab, which is an rlTable object containing a column array with as many elements as the number of possible observations. The function sets the ObservationInfo property of critic to the observationInfo input argument, which in this case must be a scalar rlFiniteSetSpec object.

critic = rlValueFunction({basisFcn,W0},observationInfo) creates the value function object critic using a custom basis function as underlying approximator. The first input argument is a two-element cell array whose first element is the handle basisFcn to a custom basis function and whose second element is the initial weight vector W0. The function sets the ObservationInfo property of critic to the observationInfo input argument.

critic = rlValueFunction(___,UseDevice=useDevice) specifies the device used to perform computations for the critic object, and sets the UseDevice property of critic to the useDevice input argument. You can use this syntax with any of the previous input-argument combinations.
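As a minimal sketch of this syntax, assuming net and obsInfo are defined as in the previous syntaxes:

```matlab
% Sketch: place critic computations on a GPU. This requires Parallel
% Computing Toolbox and a supported NVIDIA GPU; otherwise use "cpu".
critic = rlValueFunction(net,obsInfo,UseDevice="gpu");
```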

Input Arguments

Deep neural network used as the underlying approximator within the critic, specified as a dlnetwork object or as another Deep Learning Toolbox neural network object.

Note

Among the different network representation options, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other representations to dlnetwork explicitly before using it to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any Deep Learning Toolbox™ neural network object. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice allows a greater level of insight and control for cases in which the conversion is not straightforward and might require additional specifications.

The network must have the environment observation channels as inputs and a single scalar as output.

rlValueFunction objects support recurrent deep neural networks.

The learnable parameters of the critic are the weights of the deep neural network. For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Policies and Value Functions.

Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of character vectors. When you use the ObservationInputNames argument, the function assigns, in sequential order, each environment observation channel specified in observationInfo to the network input layer specified by the corresponding name in the string array netObsNames. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation specifications, as ordered in observationInfo.

Note

Of the information specified in observationInfo, the function uses only the data type and dimension of each channel, but not its (optional) name or description.

Example: {"NetInput1_airspeed","NetInput2_altitude"}

Value table, specified as an rlTable object containing a column vector with length equal to the number of possible observations from the environment. Each element is the predicted discounted cumulative long-term reward when the agent starts from the given observation and takes the best possible action. The elements of this vector are the learnable parameters of the representation.

Custom basis function, specified as a function handle to a user-defined function. The user-defined function can either be an anonymous function or a function on the MATLAB path. The output of the critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters and B is the column vector returned by the custom basis function.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the environment observation channels defined in observationInfo.

For an example on how to use a basis function to create a value function critic with a mixed continuous and discrete observation space, see Create Mixed Observation Space Value Function Critic from Custom Basis Function.

Example: @(obs1,obs2,obs3) [obs3(1)*obs1(1)^2; abs(obs2(5)+obs1(2))]

Initial value of the basis function weights W, specified as a column vector having the same length as the vector returned by the basis function.

Properties

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name. Note that only the data type and dimension of a channel are used by the software to create actors or critics, but not its (optional) name and description.

rlValueFunction sets the ObservationInfo property of critic to the input argument observationInfo.

You can extract ObservationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually.

Computation device used to perform operations such as gradient computation, parameter update and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs, see GPU Computing Requirements (Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB®.

Note

Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations on a CPU.

To speed up training by using parallel processing over multiple cores, you do not need this property. Instead, when training your agent, use an rlTrainingOptions object in which the UseParallel option is set to true. For more information about training using multicore processors and GPUs, see Train Agents Using Parallel Computing and GPUs.

Example: "gpu"
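As a sketch, multicore parallel training is enabled through the training options rather than through UseDevice. The option values shown here are illustrative, not defaults.

```matlab
% Sketch: enable parallel training via rlTrainingOptions instead of
% setting UseDevice on the critic. MaxEpisodes is illustrative.
trainOpts = rlTrainingOptions( ...
    UseParallel=true, ...
    MaxEpisodes=1000);
% trainingStats = train(agent,env,trainOpts);
```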

Object Functions

rlACAgent - Actor-critic reinforcement learning agent
rlPGAgent - Policy gradient reinforcement learning agent
rlPPOAgent - Proximal policy optimization reinforcement learning agent
getValue - Obtain estimated value from a critic given environment observations and actions
evaluate - Evaluate function approximator object given observation (or observation-action) input data
gradient - Evaluate gradient of function approximator object given observation and action input data
accelerate - Option to accelerate computation of gradient for approximator object based on neural network
getLearnableParameters - Obtain learnable parameter values from agent, function approximator, or policy object
setLearnableParameters - Set learnable parameter values of agent, function approximator, or policy object
setModel - Set function approximation model for actor or critic
getModel - Get function approximator model from actor or critic

Examples

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create a deep neural network to approximate the value function within the critic, as a column vector of layer objects. The network input layer must accept a four-element vector (the observation vector defined by obsInfo), and the output must be a scalar (the value, representing the expected cumulative long-term reward when the agent starts from the given observation).

You can also obtain the number of observations from the obsInfo specification (regardless of whether the observation space is a column vector, row vector, or matrix, prod(obsInfo.Dimension) is its total number of elements, in this case equal to 4).

net = [ featureInputLayer(prod(obsInfo.Dimension));
        fullyConnectedLayer(10);
        reluLayer;
        fullyConnectedLayer(1,Name="value")];

Convert the network to a dlnetwork object.

dlnet = dlnetwork(net);

You can plot the network using plot and display its main characteristics, like the number of weights, using summary.

plot(dlnet)

summary(dlnet)
   Initialized: true

   Number of learnables: 61

   Inputs:
      1   'input'   4 features

Create the critic using the network and the observation specification object. When you use this syntax, the network input layer is automatically associated with the environment observation according to the dimension specifications in obsInfo.

critic = rlValueFunction(dlnet,obsInfo)
critic = 
  rlValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
          UseDevice: "cpu"

To check your critic, use getValue to return the value of a random observation, using the current network weights.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
    0.5196

You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).

Create an actor and a critic that you can use to define a reinforcement learning agent such as an Actor Critic (AC) agent. For this example, create an actor and a critic for an agent that can be trained against the cart-pole environment described in Train AC Agent to Balance Cart-Pole System.

First, create the environment. Then, extract the observation and action specifications from the environment. You need these specifications to define the agent and critic.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

A state-value-function critic, such as those used for AC or PG agents, has the current observation as input and the state value, a scalar, as output. For this example, to approximate the value function within the critic, create a deep neural network with one output (the value) and four inputs (the environment observation signals x, xdot, theta, and thetadot).

Create the network as a column vector of layer objects. You can obtain the number of observations from the obsInfo specification (regardless of whether the observation space is a column vector, row vector, or matrix, prod(obsInfo.Dimension) is its total number of elements). Name the network input layer criticNetInput.

criticNetwork = [
    featureInputLayer(prod(obsInfo.Dimension),...
        Name="criticNetInput");
    fullyConnectedLayer(10);
    reluLayer;        
    fullyConnectedLayer(1,Name="CriticFC")];

Convert the network to a dlnetwork object.

criticNetwork = dlnetwork(criticNetwork);

To display the network main characteristics, use summary.

summary(criticNetwork)
   Initialized: true

   Number of learnables: 61

   Inputs:
      1   'criticNetInput'   4 features

Create the critic using the specified neural network. Also, specify the observation information for the critic. Set the observation input name to criticNetInput, which is the name of the criticNetwork input layer.

critic = rlValueFunction(criticNetwork,obsInfo,...
    ObservationInputNames={'criticNetInput'})
critic = 
  rlValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
          UseDevice: "cpu"

Check your critic using getValue to return the value of a random observation, given the current network weights.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
    0.5196

Specify the critic optimization options using rlOptimizerOptions. These options control the learning of the critic network parameters. For this example, set the learning rate to 0.01 and the gradient threshold to 1.

criticOpts = rlOptimizerOptions( ...
    LearnRate=1e-2,...
    GradientThreshold=1);

An AC agent decides which action to take given observations using a policy, which is represented by an actor. For an actor, the inputs are the environment observations, and the output depends on whether the action space is discrete or continuous. The actor in this example has two possible discrete actions, –10 or 10. To create the actor, use a deep neural network that can output these two values given the same observation input as the critic.

Create the network using a column vector of two layer objects. You can obtain the number of actions from the actInfo specification. Name the network output actorNetOutput.

actorNetwork = [
    featureInputLayer( ...
        prod(obsInfo.Dimension),...
        Name="actorNetInput")
    fullyConnectedLayer( ...
        numel(actInfo.Elements), ...
        Name="actorNetOutput")];

Convert the network to a dlnetwork object.

actorNetwork = dlnetwork(actorNetwork);

To display the network main characteristics, use summary.

summary(actorNetwork)
   Initialized: true

   Number of learnables: 10

   Inputs:
      1   'actorNetInput'   4 features

Create the actor using rlDiscreteCategoricalActor together with the observation and action specifications, and the name of the network input layer to be associated with the environment observation channel.

actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo,...
        ObservationInputNames={'actorNetInput'})
actor = 
  rlDiscreteCategoricalActor with properties:

       Distribution: [1x1 rl.distribution.rlDiscreteGaussianDistribution]
    ObservationInfo: [1x1 rl.util.rlNumericSpec]
         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
          UseDevice: "cpu"

To check your actor, use getAction to return a random action from a given observation, using the current network weights.

a = getAction(actor,{rand(obsInfo.Dimension)})
a = 1x1 cell array
    {[-10]}

Specify the actor optimization options using rlOptimizerOptions. These options control the learning of the actor network parameters. For this example, set the learning rate to 0.05 and the gradient threshold to 1.

actorOpts = rlOptimizerOptions( ...
    LearnRate=5e-2,...
    GradientThreshold=1);

Create an AC agent using the actor and critic. Use the optimizer options objects previously created for both actor and critic.

agentOpts = rlACAgentOptions(...
    NumStepsToLookAhead=32,...
    DiscountFactor=0.99,...
    CriticOptimizerOptions=criticOpts,...
    ActorOptimizerOptions=actorOpts);
agent = rlACAgent(actor,critic,agentOpts)
agent = 
  rlACAgent with properties:

            AgentOptions: [1x1 rl.option.rlACAgentOptions]
    UseExplorationPolicy: 1
         ObservationInfo: [1x1 rl.util.rlNumericSpec]
              ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
              SampleTime: 1

To check your agent, use getAction to return a random action from a given observation, using the current actor and critic network weights.

act = getAction(agent,{rand(obsInfo.Dimension)})
act = 1x1 cell array
    {[-10]}

For additional examples showing how to create actors and critics for different agent types, see Create Policies and Value Functions.

Create a finite set observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment with a discrete observation space). For this example, define the observation space as a finite set consisting of four possible values: 1, 3, 5, and 7.

obsInfo = rlFiniteSetSpec([1 3 5 7]);

Create a table to approximate the value function within the critic.

vTable = rlTable(obsInfo);

The table is a column vector in which each entry stores the predicted cumulative long-term reward for each possible observation as defined by obsInfo. You can access the table using the Table property of the vTable object. The initial value of each element is zero.

vTable.Table
ans = 4×1

     0
     0
     0
     0

You can also initialize the table to any values; in this case, set it to an array containing the integers from 1 to 4.

vTable.Table = reshape(1:4,4,1)
vTable = 
  rlTable with properties:

    Table: [4x1 double]

Create the critic using the table and the observation specification object.

critic = rlValueFunction(vTable,obsInfo)
critic = 
  rlValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlFiniteSetSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation, using the current table entries.

v = getValue(critic,{7})
v = 4

You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create a custom basis function to approximate the value function within the critic. The custom basis function must return a column vector. Each vector element must be a function of the observations defined by obsInfo.

myBasisFcn = @(myobs) [myobs(2)^2; myobs(3)+exp(myobs(1)); abs(myobs(4))]
myBasisFcn = function_handle with value:
    @(myobs)[myobs(2)^2;myobs(3)+exp(myobs(1));abs(myobs(4))]

The output of the critic is the scalar W'*myBasisFcn(myobs), where W is a weight column vector which must have the same size as the custom basis function output. This output is the expected cumulative long term reward when the agent starts from the given observation and takes the best possible action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = [3;5;2];

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial weight vector. The second argument is the observation specification object.

critic = rlValueFunction({myBasisFcn,W0},obsInfo)
critic = 
  rlValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation, using the current parameter vector.

v = getValue(critic,{[2 4 6 8]'})
v = 130.9453

You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).

Create an environment and obtain observation and action information.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);

To approximate the value function within the critic, create a recurrent deep neural network as a column vector of layer objects. Use a sequenceInputLayer as the input layer (obsInfo.Dimension(1) is the dimension of the observation space) and include at least one lstmLayer.

myNet = [
    sequenceInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(8, Name="fc")
    reluLayer(Name="relu")
    lstmLayer(8,OutputMode="sequence")
    fullyConnectedLayer(1,Name="output")];

Convert the network to a dlnetwork object.

dlCriticNet = dlnetwork(myNet);

Display a summary of network characteristics.

summary(dlCriticNet)
   Initialized: true

   Number of learnables: 593

   Inputs:
      1   'sequenceinput'   Sequence input with 4 dimensions

Create a value function representation object for the critic.

critic = rlValueFunction(dlCriticNet,obsInfo)
critic = 
  rlValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a random observation, using the current network weights.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
    0.0017

You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).

Create a finite-set observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as two channels, where the first carries a single observation that can take one of the four values 7, 5, 3, or 1, and the second is a vector over a continuous three-dimensional space.

obsInfo = [rlFiniteSetSpec([7 5 3 1]) rlNumericSpec([3 1])];

Create a custom basis function to approximate the value function within the critic. The custom basis function must return a column vector. Each vector element must be a function of the observations defined by obsInfo.

myBasisFcn = @(obsA,obsB) [ obsA(1) + norm(obsB);
                            obsA(1) - norm(obsB);
                            obsA(1)^2 + obsB(3);
                            obsA(1)^2 - obsB(3)];

The output of the critic is the scalar W'*myBasisFcn(obsA,obsB), where W is a weight column vector which must have the same size as the custom basis function output. This output is the expected cumulative long term reward when the agent starts from the given observation and takes the best possible action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = ones(4,1);

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial weight vector. The second argument is the observation specification object.

critic = rlValueFunction({myBasisFcn,W0},obsInfo)
critic = 
  rlValueFunction with properties:

    ObservationInfo: [2x1 rl.util.RLDataSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation, using the current parameter vector.

v = getValue(critic,{5,[0.1 0.1 0.1]'})
v = 60

Note that the critic does not enforce the set constraint for the discrete set element.

v = getValue(critic,{-3,[0.1 0.1 0.1]'})
v = 12

You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).

Version History

Introduced in R2022a