PPO | RL | A single policy controlled many agents during training
Muhammad Fairuz Abdul Jalal
on 4 Dec 2023
Commented: Muhammad Fairuz Abdul Jalal
on 4 Jan 2024
Hi,
I am currently working on a PPO agent using the RL and Parallel Computing toolboxes. I read about a shared policy being used to control 20 agents (as quoted below).
"During training, a single policy controlled 20 agents that interact with the enviroment. Though the 20 agents shared a single policy and same measured dataset, actions of each agent varied during a training session because of entropy regularization simulation samples and converging speed."
I wonder how to set up this condition while using the RL Toolbox.
Thank you in advance.
0 comments
Accepted Answer
Shivansh
on 27 Dec 2023
Hi Muhammad,
I understand that you want to implement a model where 20 agents share a single policy. This involves training one agent with multiple parallel environments, where each environment is a separate instance in which the agent can interact and learn.
This approach can help with improved efficiency and stabilized training in policy gradient methods like Proximal Policy Optimization (PPO).
You can set up a model with the required conditions using the RL Toolbox by following the steps below:
- Create or define an environment for your problem statement. If you are using a custom environment, make sure it is compatible with the RL Toolbox. You can read more about RL environments here: https://www.mathworks.com/help/reinforcement-learning/environments.html.
- Define the PPO agent with the desired policy representation.
- Use the 'parpool' function to create a parallel pool with the desired number of workers (in your case, 20). You can read more about the 'parpool' function here: https://www.mathworks.com/help/parallel-computing/parpool.html.
- Use the 'rlTrainingOptions' function to set up your training options. Make sure to set the 'UseParallel' option to 'true' and specify the 'ParallelizationOptions' to use 'async' updates. You can read more about 'rlTrainingOptions' here: https://www.mathworks.com/help/reinforcement-learning/ref/rl.option.rltrainingoptions.html.
- Call the 'train' function with your agent, environment, and training options. You can read more about 'train' here: https://www.mathworks.com/help/reinforcement-learning/ref/rl.agent.rlqagent.train.html.
You can refer to the example code below for the implementation:
% Assuming you have already created your custom environment 'myEnv'
env = myEnv();
% Create the PPO agent from your actor (policy) and critic representations
agent = rlPPOAgent(actor,critic);
% Set up the parallel training options
numEnvironments = 20; % Number of parallel environments (one per worker)
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',500,...
    'ScoreAveragingWindowLength',20,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',500,...
    'UseParallel',true);
% Configure the parallelization through the ParallelizationOptions property
trainOpts.ParallelizationOptions.Mode = 'async';
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Gradients';
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
trainOpts.ParallelizationOptions.WorkerRandomSeeds = randi([1,1e6],numEnvironments,1);
% Create a parallel pool
parpool(numEnvironments);
% Train the agent
trainingStats = train(agent,env,trainOpts);
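The per-worker action diversity mentioned in your quote comes from the stochastic PPO policy and its entropy regularization. If you want to control how exploratory the shared policy is, that term is exposed through the 'EntropyLossWeight' option of 'rlPPOAgentOptions'. A minimal sketch (the 'actor' and 'critic' variables and the option values are placeholders for your own setup):
% Entropy regularization keeps the shared policy stochastic, so the
% trajectories collected by different workers differ even though they
% all use the same policy parameters.
agentOpts = rlPPOAgentOptions(...
    'ExperienceHorizon',512,...
    'MiniBatchSize',128,...
    'EntropyLossWeight',0.02); % larger values encourage more exploration
agent = rlPPOAgent(actor,critic,agentOpts);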
You can refer to the following Reinforcement Learning Toolbox documentation for more information: https://www.mathworks.com/help/reinforcement-learning/index.html.
Hope it helps!
3 comments
Shivansh
on 3 Jan 2024
Hi Muhammad!
- GPU and Parallel Environments: The parpool function does not automatically distribute environments across all available GPUs. Distribution of computations to GPUs has to be managed manually within your environment setup code (see the sketch after this list). You can refer to the following link for more information: https://blogs.mathworks.com/loren/2013/06/24/running-monte-carlo-simulations-on-multiple-gpus/.
- DataToSendFromWorkers and StepsUntilDataIsSent: These options are no longer available in current MATLAB releases. You can refer to the "rlTrainingOptions" documentation for more information: https://www.mathworks.com/help/reinforcement-learning/ref/rl.option.rltrainingoptions.html.
- WorkerRandomSeeds: Using the default -1 for WorkerRandomSeeds assigns seeds based on the worker ID, ensuring each worker has a different seed for diversity in exploration. This can be beneficial for training stability and exploration.
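Putting these points together, here is a small sketch for a recent release: it assigns a GPU to each worker manually (parpool will not do this for you), and it sets up parallel training without the removed 'DataToSendFromWorkers' and 'StepsUntilDataIsSent' properties, leaving 'WorkerRandomSeeds' at its default of -1. The worker-to-GPU mapping and the option values are assumptions to adapt to your own hardware and problem:
% Start the pool and assign a GPU to each worker (simple round-robin mapping)
parpool(20);
spmd
    nGpu = gpuDeviceCount; % number of GPUs visible to this worker
    if nGpu > 0
        gpuDevice(mod(spmdIndex-1,nGpu)+1); % use labindex instead of spmdIndex on older releases
    end
end
% Parallel training options on a recent release: the removed properties are simply omitted
trainOpts = rlTrainingOptions('MaxEpisodes',1000,'UseParallel',true);
trainOpts.ParallelizationOptions.WorkerRandomSeeds = -1; % default: each worker gets a seed based on its ID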
More Answers (0)