Train PPO Agent to Swing Up and Balance Pendulum

11 views (last 30 days)
shoki kobayashi on 2 Jan 2021
Commented: shoki kobayashi on 8 Jan 2021
Hello,
I am trying to train the Swing Up and Balance Pendulum example with a PPO agent, but the agent does not train properly. How can I improve it? I suspect the network structure is the problem.
I am attaching the code below.
clear all
rng(0)
%Environment
mdl = 'rlSimplePendulumModel';
open_system(mdl)
env = rlPredefinedEnv('SimplePendulumModel-Continuous');
set_param('rlSimplePendulumModel/create observations','ThetaObservationHandling','sincos');
obsInfo = getObservationInfo(env);
numObs = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numAct = actInfo.Dimension(1);
env.ResetFcn = @(in)setVariable(in,'theta0',pi,'Workspace',mdl);
Ts = 0.05;
Tf = 20;
%PPOAgent
criticLayerSizes = [200 100];
actorLayerSizes = [200 100];
criticNetwork = [imageInputLayer([numObs 1 1],'Normalization','none','Name','observations')
fullyConnectedLayer(criticLayerSizes(1),'Name','CriticFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(criticLayerSizes(2),'Name','CriticFC2')
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(1,'Name','CriticOutput')
];
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
critic = rlValueRepresentation(criticNetwork,obsInfo, ...
'Observation',{'observations'},criticOpts);
%ActorNetwork
inPath = [ imageInputLayer([numObs 1 1],'Normalization','none','Name','observations')
fullyConnectedLayer(numAct,'Name','infc') ]; % numAct-by-1 (here 1-by-1) output
% path layers for the mean value
% using scalingLayer to scale the range
meanPath = [ tanhLayer('Name','tanh') % output range: (-1,1)
scalingLayer('Name','scale','Scale',actInfo.UpperLimit) ]; % output scaled to the action range
% path layers for the standard deviation
% using softplusLayer to keep it nonnegative
sdevPath = softplusLayer('Name','splus');
outLayer = concatenationLayer(3,2,'Name','mean&sdev');
% add layers to network object
net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = addLayers(net,outLayer);
% connect layers: the mean path output MUST be connected to the FIRST input of the concatenationLayer
net = connectLayers(net,'infc','tanh/in'); % connect inPath output to meanPath input
net = connectLayers(net,'infc','splus/in'); % connect inPath output to sdevPath input
net = connectLayers(net,'scale','mean&sdev/in1'); % meanPath output to 'mean&sdev' input #1
net = connectLayers(net,'splus','mean&sdev/in2'); % sdevPath output to 'mean&sdev' input #2
actorOptions = rlRepresentationOptions('LearnRate',1e-3);
actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,...
'Observation',{'observations'}, actorOptions);
opt = rlPPOAgentOptions('ExperienceHorizon',512,...
'ClipFactor',0.2,...
'EntropyLossWeight',0.02,...
'MiniBatchSize',64,...
'NumEpoch',3,...
'AdvantageEstimateMethod','gae',...
'GAEFactor',0.95,...
'SampleTime',0.05,...
'DiscountFactor',0.9995);
agent = rlPPOAgent(actor,critic,opt);
%Train
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'ScoreAveragingWindowLength',5,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-740,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',-740);
trainingStats = train(agent,env,trainOpts);
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);

Answers (2)

Emmanouil Tzorakoleftherakis on 4 Jan 2021
I think the fully connected layer in the actor may not have enough nodes (actorLayerSizes is not used anywhere).
Regardless, you can let Reinforcement Learning Toolbox create an initial architecture for the actor and the critic by following the example here. You can provide some initialization options to tailor the default architecture, or you can modify it afterwards.
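For reference, letting the toolbox generate the default architecture might look like the sketch below (assuming R2020b or later, where `rlAgentInitializationOptions` is available; `obsInfo`, `actInfo`, and the agent options `opt` are the ones from the code above):

```matlab
% Let the toolbox build default actor/critic networks, sized via NumHiddenUnit.
initOpts = rlAgentInitializationOptions('NumHiddenUnit',256);
agent = rlPPOAgent(obsInfo,actInfo,initOpts,opt);
% The generated representations can then be inspected or modified afterwards:
% actor  = getActor(agent);
% critic = getCritic(agent);
```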

shoki kobayashi on 6 Jan 2021
Emmanouil Tzorakoleftherakis, thanks for the reply.
I have increased the number of nodes in the actor's fully connected layers as you suggested, but training still does not work. I would like to know how to improve it.
I am attaching the code below
clear all
rng(0)
%Environment
mdl = 'rlSimplePendulumModel';
open_system(mdl)
env = rlPredefinedEnv('SimplePendulumModel-Continuous');
set_param('rlSimplePendulumModel/create observations','ThetaObservationHandling','sincos');
obsInfo = getObservationInfo(env);
numObs = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numAct = actInfo.Dimension(1);
env.ResetFcn = @(in)setVariable(in,'theta0',pi,'Workspace',mdl);
Ts = 0.05;
Tf = 20;
%PPOAgent
% network architecture reference: https://github.com/gouxiangchen/ac-ppo/blob/master/PPO.py
criticNetwork = [imageInputLayer([numObs 1 1],'Normalization','none','Name','observations')
fullyConnectedLayer(64,'Name','CriticFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(256,'Name','CriticFC2')
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(1,'Name','CriticOutput')
];
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
critic = rlValueRepresentation(criticNetwork,obsInfo, ...
'Observation',{'observations'},criticOpts);
%ActorNetwork
inPath = [ imageInputLayer([numObs 1 1],'Normalization','none','Name','observations')
fullyConnectedLayer(128,'Name','fc_1')
reluLayer('Name','ActorRelu1')
fullyConnectedLayer(numAct,'Name','fc')
reluLayer('Name','ActorRelu2') ]; % numAct-by-1 (here 1-by-1) output
% path layers for the mean value
% using scalingLayer to scale the range
meanPath = [ fullyConnectedLayer(numAct,'Name','fc_mean')
tanhLayer('Name','tanh') % output range: (-1,1)
scalingLayer('Name','scale','Scale',max(actInfo.UpperLimit)) ]; % output scaled to the action range
% path layers for the standard deviation
% using softplusLayer to keep it nonnegative
sdevPath = [ fullyConnectedLayer(numAct,'Name','fc_std')
softplusLayer('Name','splus') ];
outLayer = concatenationLayer(3,2,'Name','mean&sdev');
% add layers to the layer graph
net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = addLayers(net,outLayer);
% connect layers: the mean path output MUST be connected to the FIRST input of the concatenationLayer
net = connectLayers(net,'ActorRelu2','fc_mean/in'); % connect inPath output to meanPath input
net = connectLayers(net,'ActorRelu2','fc_std/in'); % connect inPath output to sdevPath input
net = connectLayers(net,'scale','mean&sdev/in1'); % meanPath output to 'mean&sdev' input #1
net = connectLayers(net,'splus','mean&sdev/in2'); % sdevPath output to 'mean&sdev' input #2
actorOptions = rlRepresentationOptions('LearnRate',1e-3);
Actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,...
'Observation',{'observations'}, actorOptions);
%PPO
opt = rlPPOAgentOptions('ExperienceHorizon',512,...
'ClipFactor',0.2,...
'EntropyLossWeight',0.02,...
'MiniBatchSize',64,...
'NumEpoch',3,...
'AdvantageEstimateMethod','gae',...
'GAEFactor',0.95,...
'SampleTime',0.05,...
'DiscountFactor',0.9995);
agent = rlPPOAgent(Actor,critic,opt);
maxepisodes = 10000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'ScoreAveragingWindowLength',5,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-740,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',-740);
trainingStats = train(agent,env,trainOpts);
plot(env)
2 comments
Emmanouil Tzorakoleftherakis on 6 Jan 2021
Seeing the plot from the Episode Manager would help as well. But since the actor/critic structure seems OK, and assuming the reward is similar to the one in the respective examples, it now comes down to tweaking PPO hyperparameters or stopping criteria. For example, how did you come up with the -740 average reward? Was it from a shipping example? For hyperparameters there is really no recipe; it is effectively trial and error. I have seen the clip factor play a big role, so you may want to try 0.1 or 0.3 to see if that changes things. Depending on how the Episode Manager plot looks, the agent may also need to explore more, so increasing the entropy loss weight would be another thing to try.
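In code, these suggestions amount to a few one-line changes to the agent options before rebuilding the agent. A sketch (the specific values are illustrative guesses to try, not a known-good configuration; `Actor`, `critic`, and `opt` are the variables from the code above):

```matlab
% Illustrative hyperparameter tweaks for the rlPPOAgentOptions object.
opt.ClipFactor = 0.1;          % tighter policy updates (also try 0.3)
opt.EntropyLossWeight = 0.05;  % encourage more exploration than the original 0.02
agent = rlPPOAgent(Actor,critic,opt); % rebuild the agent with the new options
```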
shoki kobayashi on 8 Jan 2021
Emmanouil Tzorakoleftherakis, thanks for your reply.
The average reward of -740 was taken from the example at https://jp.mathworks.com/help/reinforcement-learning/ug/train-ddpg-agent-to-swing-up-and-balance-pendulum.html.
The result of setting ClipFactor to 0.3 is attached below. As the figure shows, learning is not stable and the pendulum does not stand upright. A ClipFactor of 0.1 gives the same result. I also tried increasing the entropy loss weight, but that failed too. How can I improve this?
