How to normalize the rewards in RL
49 views (last 30 days)
Danial Kazemikia
on 4 Aug 2024
Commented: Camilla Ancona
on 31 Jan 2025 at 9:43
I recently learned that normalizing the rewards is a key step in RL, since rewards can vary over a large range of magnitudes and the function approximators used in RL are usually not invariant to the scale of their input, so normalization typically results in faster learning. I also learned that to normalize all discounted rewards across all episodes, we compute the mean and standard deviation of all the discounted rewards, then subtract the mean from each discounted reward and divide by the standard deviation.
How can I implement this in MATLAB? Is it implemented internally in MATLAB? If not, how can I transfer the necessary variables between episodes?
2 comments
Umar
on 4 Aug 2024
Edited: Walter Roberson
on 4 Aug 2024
Hi @Danial Kazemikia ,
Calculate the mean and standard deviation of all discounted rewards across episodes. You can use the MATLAB functions mean() and std() to compute these statistics; for more information, please refer to their documentation pages.
Then, subtract the mean from each discounted reward and divide by the standard deviation to normalize the rewards. Afterwards, to transfer the necessary variables between episodes, you can store them in the MATLAB workspace or in data structures such as arrays or cell arrays. For example, you can store the mean and standard deviation values for each episode in arrays, as in the sketch below.
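A minimal, untested sketch of that bookkeeping (here discountedRewards is a hypothetical cell array collecting each episode's discounted rewards):
numEp = numel(discountedRewards);
mu = zeros(numEp,1);      % per-episode means, kept between episodes
sigma = zeros(numEp,1);   % per-episode standard deviations
for ep = 1:numEp
    mu(ep) = mean(discountedRewards{ep});
    sigma(ep) = std(discountedRewards{ep});
    % Standardize: subtract the mean, divide by the standard deviation.
    % eps guards against division by zero when all rewards are equal.
    discountedRewards{ep} = (discountedRewards{ep} - mu(ep))/(sigma(ep) + eps);
end
Because mu and sigma live in arrays indexed by episode, they persist across episodes and can be reused later in training.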
Please let me know if you have any further questions.
Answers (1)
Kaustab Pal
on 6 Aug 2024
Edited: Kaustab Pal
on 6 Aug 2024
Reward normalization is a crucial step in reinforcement learning (RL) as it stabilizes the training process by ensuring rewards are on a consistent scale, and it improves convergence by providing a more uniform gradient signal, among other benefits.
You can implement this in MATLAB using a custom training loop. An example of writing a custom training loop can be found in the documentation here: https://www.mathworks.com/help/releases/R2024a/reinforcement-learning/ug/train-reinforcement-learning-policy-using-custom-training.html
The code below is a slight modification of that custom training loop showing how to normalize the rewards:
for episodeCt = 1:numEpisodes
    episodeOffset = ...
        mod(episodeCt-1,trajectoriesForLearning)*maxStepsPerEpisode;
    % 1. Reset the environment at the start of the episode.
    obs = reset(env);
    episodeReward = zeros(maxStepsPerEpisode,1);
    % 2. Generate experiences
    %    for the maximum number of steps per episode
    %    or until a terminal condition is reached.
    for stepCt = 1:maxStepsPerEpisode
        % Compute an action using the policy
        % based on the current observation.
        action = getAction(policy,{obs});
        % Apply the action to the environment
        % and obtain the resulting observation and reward.
        [nextObs,reward,isdone] = step(env,action{1});
        % Store the action, observation,
        % and reward experiences in their buffers.
        j = episodeOffset + stepCt;
        observationBuffer(:,:,j) = obs;
        actionBuffer(:,:,j) = action{1};
        episodeReward(stepCt) = reward;
        maskBuffer(:,j) = 1;
        obs = nextObs;
        % Stop if a terminal condition is reached.
        if isdone
            break;
        end
    end
    %%%%%%%%%% REWARD NORMALIZATION %%%%%%%%%%
    % Discount the episode rewards, then standardize them to zero mean
    % and unit standard deviation. The small constant guards against
    % division by zero when all rewards are equal.
    tstep = (1:maxStepsPerEpisode)';
    discount = 0.99.^tstep;
    discountedReward = discount.*episodeReward;
    normalizedReward = (discountedReward - mean(discountedReward)) ...
        /(std(discountedReward) + 1e-10);
    % Store the normalized rewards at this episode's buffer indices.
    rewardBuffer(:,episodeOffset+(1:maxStepsPerEpisode)) = normalizedReward.';
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%%%%% REMAINING PART OF THE CUSTOM TRAINING LOOP GOES HERE %%%%%
end
Hope this helps.
With regards,
Kaustab
1 comment
Camilla Ancona
on 31 Jan 2025 at 9:43
Thanks for your answer. I can only see the code for a custom training loop for an actor policy at the link you posted. Do you know how to adapt the code, and in particular this part, for a DQN algorithm?
% Learn the set of aggregated trajectories.
if mod(episodeCt,trajectoriesForLearning) == 0
    % Get the indices of each action taken in the action buffer.
    actionIndicationMatrix = dlarray(single(actionBuffer(:,:) == actionSet));
    % 5. Compute the gradient of the loss with respect to the actor
    %    learnable parameters.
    actorGradient = dlfeval(actorGradFcn,...
        actor,{observationBuffer},actionIndicationMatrix,returnBuffer,maskBuffer);
    % 6. Update the actor using the computed gradients.
    [actor,actorOptimizer] = update( ...
        actorOptimizer, ...
        actor, ...
        actorGradient);
    % Update the policy from the actor.
    policy = rlStochasticActorPolicy(actor);
    % Flush the mask and reward buffers.
    maskBuffer(:) = 0;
    rewardBuffer(:) = 0;
end
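For a DQN-style update, the actor gradient step would be replaced with a critic update toward TD targets. Below is a minimal, untested sketch, assuming critic is an rlVectorQValueFunction over the discrete action set, criticOptimizer is the corresponding rlOptimizer, nextObservationBuffer mirrors observationBuffer, and criticGradFcn is a hypothetical function returning the gradient of the mean-squared TD error:
if mod(episodeCt,trajectoriesForLearning) == 0
    % TD targets: r + gamma*max_a' Q(s',a'). The 0.99 matches the
    % discount used above; the mask zeroes out padded steps (terminal
    % transitions would additionally need their own done mask).
    [maxQ,~] = getMaxQValue(critic,{nextObservationBuffer});
    targetBuffer = rewardBuffer + 0.99*maxQ.*maskBuffer;
    % Gradient of the mean-squared TD error with respect to the critic
    % learnable parameters (criticGradFcn is a hypothetical helper).
    criticGradient = dlfeval(criticGradFcn, ...
        critic,{observationBuffer},actionBuffer,targetBuffer);
    % Update the critic and refresh the exploration policy.
    [critic,criticOptimizer] = update(criticOptimizer,critic,criticGradient);
    policy = rlEpsilonGreedyPolicy(critic);
    % Flush the mask and reward buffers.
    maskBuffer(:) = 0;
    rewardBuffer(:) = 0;
end
Inside criticGradFcn you would select, via the same kind of action indication matrix as above, the Q-values of the actions actually taken before comparing them with the targets.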