DDPG has two different policies

5 views (last 30 days)
Jorge De la Rosa Padrón on 22 May 2023
Answered: awcii on 20 Jul 2023
Hello,
I'm training a DDPG agent for autonomous driving. The problem is that when I save the agent, the policy is very different from what I expected. I know that an exploration policy is used during training and that the saved agent uses a greedy policy. However, the two policies are very different. The following graph shows what I mean: the greedy policy is in red and the exploration policy is in blue. As I understand from the documentation, the greedy policy should be the exploration policy without the added noise. So why is the greedy policy above the exploration policy? In fact, when I run a simulation, if I shift the greedy policy down a little (-150) I get better results. I would like to know why this is happening, how the greedy policy is built, why the two are so different, and how to fix it.
Here is my main code:
clear all; clc
rng(6);

epochs = 80; %30
mdl = 'MODELO';
stoptrainingcriteria = "AverageReward";
stoptrainingvalue = 2000000;
load_system(mdl);

% Observation and action specifications
numObs = 1;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';
ActionInfo = rlNumericSpec([1 1], ...
    LowerLimit=[1]', ...
    UpperLimit=[1000]');
ActionInfo.Name = 'alfa';

% Simulink environment
blk = [mdl,'/RL Agent'];
env = rlSimulinkEnv(mdl,blk,obsInfo,ActionInfo);
env.ResetFcn = @(in) resetfunction(in, mdl);

% DDPG agent and its options
initOpts = rlAgentInitializationOptions('NumHiddenUnit',32); %32
agent = rlDDPGAgent(obsInfo, ActionInfo, initOpts);
agent.SampleTime = 1; % -1
agent.AgentOptions.NoiseOptions.MeanAttractionConstant = 1/30; % 1/30
agent.AgentOptions.NoiseOptions.StandardDeviation = 41; % 41
agent.AgentOptions.NoiseOptions.StandardDeviationDecayRate = 0.00001; % 0
agent.AgentOptions.NumStepsToLookAhead = 32; % 32
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

% Training
opt = rlTrainingOptions(...
    'MaxEpisodes', epochs,...
    'MaxStepsPerEpisode', 1000,... % 1000
    'StopTrainingCriteria', stoptrainingcriteria,...
    'StopTrainingValue', stoptrainingvalue,...
    'Verbose', true,...
    'Plots', "training-progress");
trainResults = train(agent,env,opt);
generatePolicyFunction(agent);
Here is the code I use to create the graph:
policy1 = getGreedyPolicy(agent);
policy2 = getExplorationPolicy(agent);
x_values = 0:0.1:120;
actions1 = zeros(length(x_values), 1);
actions2 = zeros(length(x_values), 1);
for i = 1:length(x_values)
    actions1(i) = cell2mat(policy1.getAction(x_values(i)));
    actions2(i) = cell2mat(policy2.getAction(x_values(i)));
end
hold on
plot(x_values, actions2);
plot(x_values, actions1, 'LineWidth', 2);
hold off
Thanks in advance!

Accepted Answer

Jorge De la Rosa Padrón on 15 Jun 2023
Edited: Jorge De la Rosa Padrón on 15 Jun 2023
UPDATE:
I recently found this article, which summarizes all of my problems. It seems that DDPG struggles to solve simple 1D problems because of its own design; I recommend having a look at the article. I'm facing a sparse-reward problem in a deterministic environment. There doesn't seem to be a thorough explanation of how to solve this issue, but I've seen people use either PPO or A2C instead (a rough sketch of that swap is below).
I hope this helps anyone else facing this problem.
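For anyone curious, this is roughly what the swap would look like, reusing obsInfo, ActionInfo, env and opt from my main script above. This is an untested sketch and I haven't tuned the PPO options:
% Rough sketch: replace the DDPG agent with a PPO agent, reusing the
% specs, environment and training options defined in the main script.
initOpts = rlAgentInitializationOptions('NumHiddenUnit',32);
ppoAgent = rlPPOAgent(obsInfo, ActionInfo, initOpts);
ppoAgent.AgentOptions.SampleTime = 1;
ppoAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
ppoAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
trainResults = train(ppoAgent, env, opt);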

More Answers (2)

Emmanouil Tzorakoleftherakis on 23 May 2023
The comparison plot is not set up correctly. The noisy policy also has a noise state, which needs to be propagated after each call. This explains why there is an offset between the greedy and exploration policies.
The right way to get the actions and propagate the noise state would be
[action,policy] = getAction(policy,observation)
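Applied to your plotting script, it would look roughly like this (an untested sketch; I kept your variable names where possible):
policy1 = getGreedyPolicy(agent);
policy2 = getExplorationPolicy(agent);
x_values = 0:0.1:120;
actions1 = zeros(length(x_values), 1);
actions2 = zeros(length(x_values), 1);
for i = 1:length(x_values)
    obs = {x_values(i)};
    actions1(i) = cell2mat(getAction(policy1, obs));
    % capture the updated exploration policy so its noise state carries over
    [a, policy2] = getAction(policy2, obs);
    actions2(i) = cell2mat(a);
end
% plot as before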
  4 comments
Jorge De la Rosa Padrón on 24 May 2023
Edited: Jorge De la Rosa Padrón on 24 May 2023
Sure,
Here is the new plot after the change you suggested, after 700 iterations:
As you can see, the greedy and exploratory policies are now the same. This is the training session plot:
What I said in my previous post is that if I take my greedy policy and subtract, for example, 70 from each action, I get much better results than with the original. Plotting what I mean gives this:
It happens in every training run I execute. I don't know why my agent is learning worse actions than it could, because all it takes is subtracting 70 from each action to get better results. I've tried changing the noise parameters multiple times with no success. My neural network should not be a problem, since it takes just 1 input and returns 1 output. I'm running out of ideas to make my agent get better results and not fall into these suboptimal solutions.
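Roughly, this is how I produce that comparison, reusing x_values and the greedy actions (actions1) from my plotting code above; the 70 is just an example offset:
offset = 70; % constant subtracted from every greedy action
hold on
plot(x_values, actions1, 'LineWidth', 2);          % original greedy policy
plot(x_values, actions1 - offset, 'LineWidth', 2); % shifted policy that performs better
legend('greedy policy', 'greedy policy - 70')
hold off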
Emmanouil Tzorakoleftherakis on 24 May 2023
I see. That's a separate question with no definitive answer. If you only have a 1-to-1 map, consider using an even smaller network than the 32 hidden units you have. Maybe some of the bias you observed will be eliminated.
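For example, something along these lines when creating the agent (the 8 is arbitrary, just noticeably smaller than 32):
initOpts = rlAgentInitializationOptions('NumHiddenUnit',8); % smaller default networks
agent = rlDDPGAgent(obsInfo, ActionInfo, initOpts);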



awcii on 20 Jul 2023
Did you solve the problem?
Can you try changing the actor and critic learn rates?
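For example, something like this before retraining (the values are just a guess to try):
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-03;  % was 1e-04
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-03; % was 1e-03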
