
getValue

Obtain estimated value from a critic given environment observations and actions

Description

Value Function Critic

value = getValue(ValueFcn,obs) evaluates the value function critic ValueFcn and returns the value corresponding to the observation obs. In this case, ValueFcn is an rlValueFunction approximator object.

example

Q-Value Function Critics

value = getValue(VQValueFcn,obs) evaluates the discrete-action-space Q-value function critic VQValueFcn and returns the vector value, in which each element represents the estimated value given the state corresponding to the observation obs and the action corresponding to the element number of value. In this case, VQValueFcn is an rlVectorQValueFunction approximator object.

example

value = getValue(QValueFcn,obs,act) evaluates the Q-value function critic QValueFcn and returns the scalar value, representing the value given the observation obs and action act. In this case, QValueFcn is an rlQValueFunction approximator object.

example

Return Recurrent Neural Network State

[value,state] = getValue(___) also returns the updated state of the critic object when it contains a recurrent neural network.
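For instance, the following sketch builds a hypothetical LSTM-based value critic (the network layout is assumed for illustration, not taken from this page) and retrieves both the value and the updated network state.

```matlab
% Assumed example: a value critic containing a recurrent network.
obsInfo = rlNumericSpec([4 1]);
net = dlnetwork([ ...
    sequenceInputLayer(4)
    lstmLayer(8)
    fullyConnectedLayer(1)]);
critic = rlValueFunction(net,obsInfo);

% The second output is the updated hidden state of the recurrent layers.
[val,state] = getValue(critic,{rand(4,1)});

% You can write the state back to the critic using dot notation.
critic.State = state;
```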

Use Forward

___ = getValue(___,UseForward=useForward) allows you to explicitly call a forward pass when computing gradients.
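As a sketch of when this matters, consider a critic whose network contains a dropout layer (a hypothetical network, assumed here for illustration): with UseForward=true the evaluation uses forward, so dropout is active as it would be during training; without it, predict is used and dropout is disabled.

```matlab
% Assumed example: a value critic with a dropout layer.
obsInfo = rlNumericSpec([4 1]);
net = dlnetwork([ ...
    featureInputLayer(4)
    fullyConnectedLayer(10)
    dropoutLayer(0.5)
    fullyConnectedLayer(1)]);
critic = rlValueFunction(net,obsInfo);

% Training-style evaluation: forward pass, dropout active.
valTrain = getValue(critic,{rand(4,1)},UseForward=true);

% Inference-style evaluation (default): predict, dropout inactive.
valInfer = getValue(critic,{rand(4,1)});
```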

Examples


Create an observation specification object (or alternatively use the getObservationInfo function to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

To approximate the value function within the critic, create a neural network. Define a single path from the network input (the observation) to its output (the value), as an array of layer objects.

net = [ featureInputLayer(4) ...
        fullyConnectedLayer(1)];

Convert the network to a dlnetwork object and display the number of weights.

net = dlnetwork(net);
summary(net);
   Initialized: true

   Number of learnables: 5

   Inputs:
      1   'input'   4 features

Create a critic using the network and the observation specification object. When you use this syntax, the network input layer is automatically associated with the environment observation according to the dimension specifications in obsInfo.

critic = rlValueFunction(net,obsInfo);

Obtain a value function estimate for a random single observation. Use an observation array with the same dimensions as the observation specification.

val = getValue(critic,{rand(4,1)})
val = single
    0.7904

You can also obtain value function estimates for a batch of observations. For example, obtain value functions for a batch of 20 observations.

batchVal = getValue(critic,{rand(4,1,20)});
size(batchVal)
ans = 1×2

     1    20

batchVal contains one value function estimate for each observation in the batch.

Create observation and action specification objects (or alternatively use the getObservationInfo and getActionInfo functions to extract the specification objects from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles, and the action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);

Create a vector Q-value function approximator to use as a critic. A vector Q-value function takes only the observation as input and returns as output a single vector with as many elements as the number of possible actions. The value of each output element represents the expected discounted cumulative long-term reward for taking the action from the state corresponding to the current observation, and following the policy afterward.

To model the parameterized vector Q-value function within the critic, use a neural network. The network must have one input layer that accepts a four-element vector, as defined by obsInfo, and a single output layer having as many elements as the number of possible discrete actions (three in this case, as defined by actInfo).

Define a single path from the network input to its output as an array of layer objects.

net = [
    featureInputLayer(4) 
    fullyConnectedLayer(3)
    ];

Convert the network to a dlnetwork object and display the number of weights.

net = dlnetwork(net);
summary(net)
   Initialized: true

   Number of learnables: 15

   Inputs:
      1   'input'   4 features

Create the critic using the network and the observation and action specification objects. The network input layer is automatically associated with the observation channel according to the dimension specifications in obsInfo.

critic = rlVectorQValueFunction(net,obsInfo,actInfo);

Use getValue to return the values of a random observation, using the current network weights.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3×1 single column vector

    0.7232
    0.8177
   -0.2212

v contains three value function estimates, one for each possible discrete action.

You can also obtain value function estimates for a batch of observations. For example, obtain value function estimates for a batch of 10 observations.

batchV = getValue(critic,{rand([obsInfo.Dimension 10])});
size(batchV)
ans = 1×2

     3    10

batchV contains three value function estimates for each observation in the batch.

Create observation and action specification objects (or alternatively use the getObservationInfo and getActionInfo functions to extract the specification objects from an environment). For this example, define the observation space as having two continuous channels, the first carrying an 8-by-3 matrix and the second a continuous four-dimensional vector. The action specification is a continuous column vector containing two doubles.

obsInfo = [rlNumericSpec([8 3]), rlNumericSpec([4 1])];
actInfo = rlNumericSpec([2 1]);

Create a custom basis function and its initial weight matrix. Note that although a single observation in the first channel is a 2-D matrix, each input to myBasisFcn also carries batch and sequence dimensions.

myBasisFcn = @(obsA,obsB,act) [...
    ones(30,1,size(obsA,3),like=obsA);
    reshape(obsA,24,1,[]); 
    reshape(obsB,4,1,[]); 
    reshape(act,2,1,[]);
    reshape(obsA,24,1,[]).^2; 
    reshape(obsB,4,1,[]).^2; 
    reshape(act,2,1,[]).^2;
    sin(reshape(obsA,24,1,[])); 
    sin(reshape(obsB,4,1,[])); 
    sin(reshape(act,2,1,[]));
    cos(reshape(obsA,24,1,[])); 
    cos(reshape(obsB,4,1,[])); 
    cos(reshape(act,2,1,[]))];
W0 = rand(150,1);

The output of the critic is the scalar W'*myBasisFcn(obs,act), representing the Q-value function to be approximated.

Create the critic.

critic = rlQValueFunction({myBasisFcn,W0}, ...
    obsInfo,actInfo);

Use getValue to return the value of a random observation-action pair, using the current parameter matrix.

v = getValue(critic,{rand(8,3),(1:4)'},{rand(2,1)})
v = 
72.7248

Create a random observation set of batch size 64 for each channel. The third dimension is the batch size, while the fourth is the sequence length for any recurrent neural network used by the critic (not used in this case).

batchobs_ch1 = rand(8,3,64,1);
batchobs_ch2 = rand(4,1,64,1);

Create a random action set of batch size 64.

batchact = rand(2,1,64,1);

Obtain the state-action value function estimate for the batch of observations and actions.

bv = getValue(critic,{batchobs_ch1,batchobs_ch2},{batchact});
size(bv)
ans = 1×2

     1    64

bv(23)
ans = 
44.8497

Input Arguments


Value function critic, specified as an rlValueFunction approximator object.

Example: vf = rlValueFunction(dlnetwork([featureInputLayer(2) fullyConnectedLayer(10) reluLayer fullyConnectedLayer(1)]),rlNumericSpec([2 1])) creates an rlValueFunction object and assigns it to the variable vf.

Vector Q-value function critic, specified as an rlVectorQValueFunction approximator object.

Example: vqvf = rlVectorQValueFunction(dlnetwork([featureInputLayer(4) fullyConnectedLayer(10) reluLayer fullyConnectedLayer(2)]),rlNumericSpec([4 1]),rlFiniteSetSpec([-1 1])) creates an rlVectorQValueFunction object and assigns it to the variable vqvf.

Q-value function critic, specified as an rlQValueFunction object.

Example: qvf = rlQValueFunction(rlTable(rlFiniteSetSpec([-1 0 1]),rlFiniteSetSpec([-1 1])),rlFiniteSetSpec([-1 0 1]),rlFiniteSetSpec([1 -1])) creates an rlQValueFunction object and assigns it to the variable qvf.

Environment observations, specified as a cell array with as many elements as there are observation input channels. Each element of obs contains an array of observations for a single observation input channel.

The dimensions of each element in obs are MO-by-LB-by-LS, where:

  • MO corresponds to the dimensions of the associated observation input channel.

  • LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations, specify LB > 1. If your approximator object has multiple observation input channels, then LB must be the same for all elements of obs.

  • LS specifies the sequence length for a recurrent neural network. If your approximator object does not use a recurrent neural network, then LS = 1. If the approximator has multiple observation input channels, then LS must be the same for all elements of obs.

LB and LS must be the same for all the approximator input channels (both observation and, if needed, action).

For more information on input and output formats for recurrent neural networks, see the Algorithms section of lstmLayer.

Example: {rand(8,3,64,1),rand(4,1,64,1)}

Environment action, specified as a cell array with as many elements as there are action input channels. Note that hybrid action spaces are represented with exactly two action channels (the first discrete, the second continuous); non-hybrid action spaces must have only one channel.

Each element of act contains an array of actions for a single action input channel.

The dimensions of each element in act are MA-by-LB-by-LS, where:

  • MA corresponds to the dimensions of the associated action input channel.

  • LB is the batch size. To specify a single action, set LB = 1. To specify a batch of actions, specify LB > 1. If your approximator object has multiple action input channels, then LB must be the same for all elements of act.

  • LS specifies the sequence length for a recurrent neural network. If your approximator object does not use a recurrent neural network, then LS = 1. If the approximator has multiple action input channels, then LS must be the same for all elements of act.

LB and LS must be the same for all the approximator input channels (observations and actions).

For more information on input and output formats for recurrent neural networks, see the Algorithms section of lstmLayer.

Example: {rand(2,1,64,1)}

Option to use a forward pass, specified as a logical value. When you specify UseForward=true, the function calculates its outputs using forward instead of predict. This allows layers such as batch normalization and dropout to appropriately change their behavior for training.

Example: true

Output Arguments


Estimated value function, returned as an array with dimensions N-by-LB-by-LS, where:

  • N is the number of outputs of the critic network.

    • For a state value critic (ValueFcn), N = 1.

    • For a single-output state-action value function critic (QValueFcn), N = 1.

    • For a multi-output state-action value function critic (VQValueFcn), N is the number of discrete actions.

  • LB is the batch size.

  • LS is the sequence length for a recurrent neural network.

Updated state of the critic, returned as a cell array. If the critic does not use a recurrent neural network, then state is an empty cell array.

You can set the state of the critic to state using dot notation. For example:

ValueFcn.State=state;


Version History

Introduced in R2020a