In TrainMBPOAgentToBalanceCartPoleSystemExample/cartPoleRewardFunction, what is nextObs?

Question

Lin on 25 Oct 2024


https://es.mathworks.com/matlabcentral/answers/2162165-in-trainmbpoagenttobalancecartpolesystemexample-cartpolerewardfunction-nextobs-is-what

Commented: Lin on 14 Nov 2024

Accepted Answer: Ayush Aniket

function reward = cartPoleRewardFunction(obs,action,nextObs)
% Compute reward value based on the next observation.
    if iscell(nextObs)
        nextObs = nextObs{1};
    end
    % Distance at which to fail the episode
    xThreshold = 2.4;
    % Reward each time step the cart-pole is balanced
    rewardForNotFalling = 1;
    % Penalty when the cart-pole fails to balance
    penaltyForFalling = -50;
    x = nextObs(1,:);
    distReward = 1 - abs(x)/xThreshold;
    isDone = cartPoleIsDoneFunction(obs,action,nextObs);
    reward = zeros(size(isDone));
    reward(logical(isDone)) = penaltyForFalling;
    reward(~logical(isDone)) = ...
        0.5 * rewardForNotFalling + 0.5 * distReward(~logical(isDone));
end

I really want to know where nextObs is passed into this function from. Why can't I find this variable in the main script?

If my environment is built in Simulink, how do I get the nextObs variable?


Answer 1

Ayush Aniket on 28 Oct 2024


https://es.mathworks.com/matlabcentral/answers/2162165-in-trainmbpoagenttobalancecartpolesystemexample-cartpolerewardfunction-nextobs-is-what#answer_1537810

Hi Lin,

The nextObs variable holds the next observation: the state the environment reaches after the Reinforcement Learning (RL) agent applies its action in the current state. During training with the train function, the environment's step function is called implicitly. It takes the environment and the action as inputs and returns three outputs: nextObs, reward, and isdone. To compute that reward, the current observation, the action, and the next observation are passed to the reward function.
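To make the data flow concrete, here is a minimal sketch of what one step does conceptually. It is not the toolbox implementation: transitionFcn, isDoneFcn, rewardFcn, the sample time Ts, and the toy two-state dynamics are all hypothetical stand-ins chosen for illustration.

```matlab
% Hypothetical stand-ins, for illustration only (not the example's models).
% State is [position; velocity], action is a scalar force.
Ts = 0.02;                                                      % assumed sample time
transitionFcn = @(obs,action) obs + Ts*[obs(2,:); action];      % predicts nextObs
isDoneFcn     = @(obs,action,nextObs) abs(nextObs(1,:)) > 2.4;  % position limit only
rewardFcn     = @(obs,action,nextObs) 1 - abs(nextObs(1,:))/2.4;

obs    = [0.1; 0.5];   % current observation
action = 2;            % action chosen by the agent

% What one environment step does, conceptually:
nextObs = transitionFcn(obs,action);       % 1) the model produces nextObs
reward  = rewardFcn(obs,action,nextObs);   % 2) nextObs is passed to the reward function
isDone  = isDoneFcn(obs,action,nextObs);   % 3) and to the is-done function

fprintf('nextObs = [%.2f; %.2f], reward = %.4f, isDone = %d\n', ...
    nextObs(1),nextObs(2),reward,isDone);
```

The point of the sketch is that nextObs never appears in the main script because it is created inside the step and handed straight to the reward function.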

The Train MBPO Agent to Balance Continuous Cart-Pole System example uses an rlNeuralNetworkEnvironment object to create the environment. When you create this object, you can supply a custom reward function as a function handle. See the following documentation link for this input parameter:

https://www.mathworks.com/help/reinforcement-learning/ref/rl.env.rlneuralnetworkenvironment.html#mw_cafd380f-20d7-4fbb-8134-fbb5843c2c37

Once a custom reward function handle is provided, it is implicitly fed the input arguments (obs,action,nextObs) during training.
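As a rough illustration of what the reward function then does with those arguments, the sketch below reproduces the arithmetic of cartPoleRewardFunction on three samples at once. The nextObs(1,:) indexing in the function suggests it is written to accept one sample per column. The isDoneFcn handle is a hypothetical stand-in for cartPoleIsDoneFunction (it checks only the cart position), and the two-row observation is illustrative.

```matlab
xThreshold = 2.4;
rewardForNotFalling = 1;
penaltyForFalling = -50;

% Hypothetical stand-in for cartPoleIsDoneFunction: checks the cart position only.
isDoneFcn = @(obs,action,nextObs) abs(nextObs(1,:)) > xThreshold;

% Three samples, one per column; the first row is the cart position x.
nextObs = [0 1.2 3.0; 0 0 0];
obs     = zeros(2,3);   % not used by this reward
action  = zeros(1,3);   % not used by this reward

x = nextObs(1,:);
distReward = 1 - abs(x)/xThreshold;       % 1 at the center, 0 at the threshold
isDone = isDoneFcn(obs,action,nextObs);
reward = zeros(size(isDone));
reward(isDone)  = penaltyForFalling;      % failed samples get the penalty
reward(~isDone) = 0.5*rewardForNotFalling + 0.5*distReward(~isDone);
disp(reward)   % 1.00  0.75  -50.00
```

A centered cart earns the full reward of 1, a cart halfway to the threshold earns 0.75, and a cart past the threshold receives the penalty.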

However, you can also evaluate these functions yourself by calling the step function (and so obtain the nextObs variable), as shown in the following documentation section:

https://www.mathworks.com/help/reinforcement-learning/ug/train-mbpo-agent-to-balance-cart-pole-system-example.html#TrainMBPOAgentToBalanceCartPoleSystemExample-11

3 Comments

Ayush Aniket on 12 Nov 2024

Can you share the custom reward function you are using?

Lin on 14 Nov 2024

I used a Simulink environment; the state is a 2×1 vector and the action is a 1×1 vector.

Main function call

useGroundTruthReward = true;
if useGroundTruthReward
    rewardFcn = @RewardFunction;
else
    % This neural network uses action and next observation as inputs.
    rewardnet = createRewardNetworkActionNextObs(numObservations,numActions);
    rewardFcn = rlContinuousDeterministicRewardFunction(rewardnet,...
        obsInfo,...
        actInfo, ...
        ActionInputNames="action",...
        NextObservationInputNames="nextState");
end

RewardFunction

function reward = cartPoleRewardFunction(obs,action,nextObs)
% Compute reward value based on the next observation.
    if iscell(nextObs)
        nextObs = nextObs{1};
    end
    % Distance at which to fail the episode
    xThreshold = 2400;
    % Reward each time step the cart-pole is balanced
    rewardForNotFalling = 0;
    % Penalty when the cart-pole fails to balance
    penaltyForFalling = -50;
    x = nextObs(1,:);
    distReward = -log2(10000*abs(x)+1);
    isDone = cartPoleIsDoneFunction(obs,action,nextObs);
    reward = zeros(size(isDone));
    reward(logical(isDone)) = penaltyForFalling;
    reward(~logical(isDone)) = ...
        0.5 * rewardForNotFalling + 1* distReward(~logical(isDone));
end
    
    % reward = 1/(abs(x)+0.000001);
