How to bound DQN critic estimate or RL training progress y-axis

31 views (last 30 days)
I'm training a DQN agent from the new Reinforcement Learning toolbox. During training, the critic network generates long-term reward estimates (Q0) throughout each episode - these are displayed in green on the training progress plot. In blue and red are the episode and running average reward, respectively. As you can see, the actual rewards average around -1000, but the first few estimates were orders of magnitude greater, and so they permanently skew the y-axis. Therefore we cannot discern the progress of actual rewards in training.
[Attached screenshot: large_Q0_ex.PNG — training progress plot with the y-axis skewed by early Q0 estimates]
It seems I either need to bound the critic's estimate, or set limits on the Reinforcement Learning Episode Manager's y-axis. I haven't found a way to do either.

Accepted Answer

Emmanouil Tzorakoleftherakis on 1 Aug 2019
Hello,
I believe the best approach here is to figure out why the critic estimate takes such large values. Even if you scale the plot window, off-scale critic estimates will still affect training. Bounding the estimate values would not be ideal either, because you lose information (one action may be better than another, but this won't be reflected in the estimate). A few things to try (a minimal sketch of points 1, 3, and 4 follows this list):
1) Make sure that the gradient threshold option in the representation options of the network is finite, e.g., set it to 1. This will prevent the weights from changing too much during training.
2) Try reducing the number of layers/nodes.
3) Try providing initial (small) values for the network weights (especially the last FC layer).
4) Maybe adding a scaling layer towards the end would be helpful as well.
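For illustration, here is a rough sketch of what points 1, 3, and 4 could look like for a DQN critic. The observation/action specs, layer sizes, and layer names below are placeholders, not taken from your model:
% Illustrative specs -- replace with your environment's actual specs
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([1 2]);
numObs = obsInfo.Dimension(1);
numAct = numel(actInfo.Elements);
% Multi-output Q network: one Q value per discrete action
criticNetwork = [
    imageInputLayer([numObs 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(24,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(numAct,'Name','fcOut', ...
        'Weights',1e-2*randn(numAct,24),'Bias',zeros(numAct,1)) % 3) small initial weights
    scalingLayer('Name','qScale','Scale',10)];                  % 4) keeps Q estimates in a modest range
% 1) a finite gradient threshold limits how much the weights change per update
criticOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(layerGraph(criticNetwork),obsInfo,actInfo, ...
    'Observation',{'state'},criticOpts);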
6 comments
Sourabh on 24 May 2023
Sir, during training I get some rewards as high as 10^16 (see the attached screenshot). Can you please help me figure out what I am doing wrong?
This is the code I am using:
Tf = 10;
Ts = 0.1;
mdl = 'rl_exam2';
%% Observation and action specifications
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, Response';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([1 1],'LowerLimit',0,'UpperLimit',1);
actInfo.Name = 'Control Input';
numActions = actInfo.Dimension(1);
%% To Create Environment
env = rlSimulinkEnv(mdl,[mdl '/RL Agent'],obsInfo,actInfo);
%% Fix the random seed for reproducibility
rng(0)
%% To Create Critic Network
statePath = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','State')
    fullyConnectedLayer(50,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(40,'Name','CriticStateFC2')];
actionPath = [
    imageInputLayer([numActions 1 1],'Normalization','none','Name','Action')
    fullyConnectedLayer(40,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},criticOpts);
%% To Create Actor Network
actorNetwork = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','State')
    fullyConnectedLayer(40,'Name','actorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(numActions,'Name','actorFC2')
    tanhLayer('Name','actorTanh')
    scalingLayer('Name','Action','Scale',0.5,'Bias',0.5)];
actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},actorOptions);
%% To Create Agent
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',1,...
    'ExperienceBufferLength',1e6,...
    'MiniBatchSize',64);
agentOpts.NoiseOptions.Variance = 0.08;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
agent = rlDDPGAgent(actor,critic,agentOpts);
%% Training Options
maxepisodes = 3000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',20,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeCount',...
    'StopTrainingValue',1500);
%% To Train
doTraining = true;
if doTraining
    trainingStats = train(agent,env,trainingOpts);
    % save('agent_new.mat','agent') % to save the trained agent
else
    % Load pretrained agent for the example.
    load('agent_old.mat','agent')
end
Sir, can I please get a mail ID, or you can mail me at sourabhy711@gmai.com? I would be very grateful.
dani ansari on 25 Aug 2023
The size of your rewards is simply a consequence of how you designed your reward function. A reward that large in magnitude shows how badly your agent is doing in some episodes (probably while exploring). You should stop such an episode with a reasonable isdone condition and a reasonable negative penalty for terminating it.
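As a rough illustration (not taken from the model in this thread), the termination logic inside a MATLAB Function block or custom step function could look something like the following; the error threshold and penalty values are assumptions you would tune for your plant:
function [reward, isDone] = stepRewardAndIsDone(err)
% Illustrative only: the threshold and penalty values below are placeholders
maxErr  = 10;    % stop the episode once the error grows past this bound
penalty = -100;  % bounded terminal penalty instead of letting the reward blow up

reward = -err^2;              % ordinary per-step reward
isDone = abs(err) > maxErr;   % terminate the episode on divergence
if isDone
    reward = reward + penalty;
end
end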
