Reinforcement Learning -- Rocket Lander
The "Rocket Lander" example does not converge with the stated hyperparameters. Someone was helpful enough to give me the following values:
learning rate = 1e-4
clip factor = 0.1
mini-batch size = 128
Although these values work better, the algorithm still does not converge. After about 14,000 episodes there are many successful landings, but they are interspersed with violent crash landings. Does anybody at MathWorks, or elsewhere, have any suggestions? Thank you.
Averill M. Law
7 comments
Emmanouil Tzorakoleftherakis
19 May 2020
Hi Averill,
We are looking into it - give me a couple of days and I will get back to you.
Averill Law
19 May 2020
Averill Law
21 May 2020
Emmanouil Tzorakoleftherakis
22 May 2020
Hi Averill,
The model I sent has multiple changes including the reward and other hyperparameters. We ran it multiple times and got convergence every time. Can you please send a screenshot of the episode manager from the example I sent?
Averill Law
26 May 2020
Averill Law
26 May 2020
Emmanouil Tzorakoleftherakis
27 May 2020
Hi Averill,
I am not sure why you do not get convergence, but comparing the screenshot you sent with the one in the live script I sent, you can clearly see that the episode rewards are on a different scale (~7000 vs. ~300-400). I would suggest starting fresh: delete temp files, then download and run the example I sent below. You shouldn't need to change the clip factor or any other hyperparameter in that example.
The reason we made changes to the example is that some of the latest underlying optimizations changed the numerical behavior of training (which is why the example was no longer converging), so we made these changes to get a more robust result. The reward is typically the most important thing to get right in order to achieve the desired behavior, and it is usually the first thing that needs retuning if you don't get it.
In terms of epsilon, I think you may be confusing epsilon-greedy exploration, which is used, e.g., in DQN and Q-learning, with the clip-factor epsilon in PPO (please correct me if I am wrong). The former does indeed decay over time in the current implementation, but the latter is fixed. They share the same letter, which can be confusing, but the two hyperparameters serve very different purposes. PPO does not use the "exploration epsilon" because it handles exploration through the stochastic nature of the actor, as well as through an additional entropy term in the objective. PPO uses the clip-factor epsilon to limit how much the policy (and hence the neural network weights) can change in a single update.
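The role of the clip-factor epsilon that Emmanouil describes can be sketched with the standard PPO clipped surrogate objective. This is a generic illustration in Python/NumPy, not the Reinforcement Learning Toolbox implementation; the function name and the clip factor of 0.1 (taken from the hyperparameters above) are chosen for illustration.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_epsilon=0.1):
    """PPO clipped surrogate objective for a single sample (illustrative sketch).

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage A(s, a).
    Clipping the ratio to [1 - eps, 1 + eps] bounds how far one update
    can push the policy away from the old policy.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantage
    # Taking the minimum makes the objective pessimistic: large policy
    # changes get no extra credit, so the update stays conservative.
    return np.minimum(unclipped, clipped)

# With clip factor 0.1, a ratio of 1.5 is truncated to 1.1 for a positive
# advantage: min(1.5 * 2.0, 1.1 * 2.0) = 2.2
print(clipped_surrogate(1.5, 2.0))
```

A larger clip factor permits bigger policy updates per iteration (faster but less stable); the suggested 0.1 is on the conservative side.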
Hope that helps.
Accepted answer
More answers (2)
Averill Law
22 May 2020
0 votes
Averill Law
1 June 2020
0 votes
7 comments
Emmanouil Tzorakoleftherakis
1 June 2020
Hi Averill,
Make sure you run the example in R2020a; 'rlValueRepresentation' was not available in earlier releases. The epsilon value continues to decay across time steps within an episode as well as across episodes. The state is initialized at the beginning of each episode; you can use the reset function to randomize initial conditions if you want.
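The decay behavior described above (epsilon shrinking at every agent step, not just at episode boundaries) can be sketched as a multiplicative schedule. The Python below is only an illustration under that assumption, not toolbox code; the parameter names and default values are made up for the example.

```python
def decay_epsilon(epsilon, epsilon_decay=0.005, epsilon_min=0.01):
    """One multiplicative decay step, applied after every agent step,
    so epsilon keeps shrinking both within and across episodes."""
    if epsilon > epsilon_min:
        epsilon = epsilon * (1.0 - epsilon_decay)
    # Never explore less than the floor value.
    return max(epsilon, epsilon_min)

eps = 1.0
for _ in range(1000):      # steps accumulate across episode boundaries
    eps = decay_epsilon(eps)
print(round(eps, 4))       # decayed all the way down to the floor: 0.01
```

Because the counter never resets, a long training run eventually pins epsilon at its minimum, after which exploration comes only from that residual floor.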
Averill Law
2 June 2020
Emmanouil Tzorakoleftherakis
3 June 2020
Hi Averill,
Let me schedule a call with you to discuss this in more detail.
Emmanouil
Averill Law
3 June 2020
Emmanouil Tzorakoleftherakis
3 June 2020
Glad to help. I sent an invite for tomorrow to the email address found on your webpage.
Thank you,
Emmanouil
Averill Law
3 June 2020
Emmanouil Tzorakoleftherakis
3 June 2020
Of course. Talk to you tomorrow,
Emmanouil