Understanding Entropy Loss for PPO Agents Exploration
Hello,
I have been experimenting with a PPO agent training on a continuous action space. I am a little confused about how exploration works when using entropy loss. I have mostly used epsilon-greedy exploration in the past, which seems easier to understand in terms of how the agent explores (it takes random actions with probability epsilon, and the epsilon decay is easy to calculate given the decay rate). This means I know exactly the number of training iterations after which the agent should start relying on the trained policy instead of exploring. I'm not able to understand how the entropy term controls exploration in the same sense.
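For comparison, the epsilon schedule described above can be sketched in a few lines. This is only an illustration, assuming a multiplicative decay of the form epsilon = epsilon * (1 - decay_rate) applied once per update; the function name and the example numbers are hypothetical and not taken from any toolbox.

```python
import math

def steps_until_epsilon(eps_start, eps_min, decay_rate):
    """Count the updates needed for epsilon to first reach eps_min or below.

    Assumes the multiplicative rule eps <- eps * (1 - decay_rate) per update.
    """
    if not (0.0 < eps_min <= eps_start):
        raise ValueError("require 0 < eps_min <= eps_start")
    if not (0.0 < decay_rate < 1.0):
        raise ValueError("require 0 < decay_rate < 1")
    # Solve eps_start * (1 - decay_rate)**n <= eps_min for the smallest integer n.
    return math.ceil(math.log(eps_min / eps_start) / math.log(1.0 - decay_rate))

# Hypothetical settings, chosen only for illustration.
n = steps_until_epsilon(eps_start=1.0, eps_min=0.01, decay_rate=0.005)
print("updates until epsilon reaches its minimum:", n)
```

There is no equivalent closed-form count for an entropy term, because the entropy weight only adds a term to the loss and does not prescribe a schedule.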
Answers (1)
Emmanouil Tzorakoleftherakis
on 11 Oct 2023
Hi,
In PPO, training strikes a balance between the entropy term and fine-tuning the probabilities of all available actions. This balancing happens throughout training because, unlike with an epsilon-greedy approach, exploration in PPO is not reduced over time on a preset schedule. This page and references therein should be helpful.
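To make the role of the entropy term more concrete, here is a minimal sketch for a continuous action space. It assumes a diagonal Gaussian policy, whose per-dimension entropy is 0.5 * ln(2 * pi * e * sigma^2), and an entropy loss of the form -w * H added to the actor loss. The names `gaussian_entropy`, `entropy_loss`, and `entropy_weight` are illustrative and are not a toolbox API.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian with standard deviation sigma."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def entropy_loss(sigmas, entropy_weight):
    """Entropy loss term -w * H for a diagonal Gaussian policy.

    Minimizing this term rewards larger standard deviations,
    which means wider action sampling (more exploration).
    """
    total_entropy = sum(gaussian_entropy(s) for s in sigmas)
    return -entropy_weight * total_entropy

# The loss term rises as the policy narrows, so the optimizer is
# discouraged from collapsing sigma too early.
for sigma in (1.0, 0.5, 0.1, 0.01):
    print(sigma, gaussian_entropy(sigma), entropy_loss([sigma], entropy_weight=0.01))
```

Under these assumptions, a larger entropy weight keeps the standard deviation wider for longer, while a smaller weight lets the policy sharpen sooner. The amount of exploration is therefore an outcome of the optimization rather than a fixed schedule.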
Also, don't forget that the PPO policy is stochastic, so there is always some exploration happening when actions are sampled from the action distribution. If, after training, you want to use just the action mean (i.e., not sample to get the policy output), you can set this option to 0.
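The difference between sampling and using the mean can be sketched as follows. This assumes a Gaussian policy with a known mean and standard deviation; the `explore` flag is a hypothetical stand-in for the option mentioned above, not the actual toolbox setting.

```python
import random

def select_action(mean, sigma, explore=True, rng=random):
    """Return an action from a 1-D Gaussian policy.

    explore=True  -> sample from N(mean, sigma^2), as during training
    explore=False -> return the mean, i.e. a deterministic policy output
    """
    if explore:
        return rng.gauss(mean, sigma)
    return mean

rng = random.Random(0)  # seeded only to make the illustration repeatable
sampled = [select_action(0.3, 0.5, explore=True, rng=rng) for _ in range(5)]
greedy = [select_action(0.3, 0.5, explore=False) for _ in range(5)]
print("sampled actions:", sampled)
print("mean actions:   ", greedy)
```

With sampling enabled, the actions scatter around the mean with a spread set by sigma. With it disabled, every call returns the same value for the same observation.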
Hope this helps
4 comments
Mohammed Mohiuddin
on 15 Apr 2024
Thank you for your suggestion. I tried this approach and it seemed to work, but as you said, it is not a very efficient approach.