Reinforcement Learning -- Rocket Lander

The "Rocket Lander" example does not converge with the stated hyperparameters. Someone was helpful enough to give me the following values:
learning rate = 1e-4
clip factor = 0.1
mini-batch size = 128
Although these values work better, the algorithm still does not converge. After about 14,000 episodes there are many successful landings, but they are interspersed with violent crash landings. Does anybody at MathWorks or elsewhere have any suggestions? Thank you.
Averill M. Law

7 comments

Hi Averill,
We are looking into it - give me a couple of days and I will get back to you.
Averill Law on 19 May 2020
I got it to converge for clip factor = 0.15 at about 17,900 episodes, but there are still a lot of crashes (see enclosed). Note that learning rate = 1e-4 and mini-batch size = 128.
Any word on the "Stochastic Waterfall Grid World" example? Thank you.
Averill M. Law
Averill Law on 21 May 2020
Hi Emmanouil,
Thank you for your assistance. Unfortunately, the Rocket Lander model you sent me does not converge in Version 2020a. However, I do get great convergence with the following parameters (see enclosed results):
Clip rate = 0.125
Learning rate = 1e-4
Mini-batch size = 128
Also, in 2020a the convergence criterion is "average score of 10,000 for 100 episodes." The criterion was different in the model you sent me, which was from 2018.
There is a "Stochastic Waterfall Grid World" example in the RL Toolbox, but it does not converge with the stated hyperparameters. Thank you again.
Averill M. Law
Hi Averill,
The model I sent has multiple changes including the reward and other hyperparameters. We ran it multiple times and got convergence every time. Can you please send a screenshot of the episode manager from the example I sent?
Averill Law on 26 May 2020
Hi Emmanouil,
I could not get your model to work, which is probably the result of me doing something wrong. Instead, I have included results for the Rocket Lander model in Version 2020a. With a Clip Rate of 0.125, it converged in 13,065 episodes, which was the best case. It did not converge for Clip Rate = 0.02, and I have enclosed the corresponding plot. Why are you changing the reward structure and hyperparameters for this model?
You can see that convergence is very sensitive to the value of Clip Rate. Do you have any plans to add a "formal" mechanism for performing hyperparameter tuning in the next release? Thank you.
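Until such a mechanism exists, sensitivity like this is usually probed with a manual sweep over the hyperparameter. A minimal Python sketch of the pattern (illustrative only, not the Toolbox API; `train_fn` is a hypothetical stand-in for a full training run, and the toy numbers below simply mimic the results reported in this thread):

```python
def sweep_clip_factors(train_fn, clip_factors):
    """Run one training per clip factor and collect episodes-to-converge.

    train_fn is a hypothetical callable: clip_factor -> episodes needed,
    or None if training did not converge for that value.
    """
    results = {cf: train_fn(cf) for cf in clip_factors}
    # Pick the converging clip factor that needed the fewest episodes.
    best = min((cf for cf, ep in results.items() if ep is not None),
               key=lambda cf: results[cf], default=None)
    return results, best

# Toy stand-in mimicking the sensitivity reported above:
# no convergence at 0.02, 13,065 episodes at 0.125, 17,900 at 0.15.
fake_train = {0.02: None, 0.125: 13065, 0.15: 17900}.get
results, best = sweep_clip_factors(fake_train, [0.02, 0.125, 0.15])
# best -> 0.125 (fastest-converging clip factor in this toy sweep)
```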
Averill M. Law
Averill Law on 26 May 2020
Hi Emmanouil,
Presumably, the value of epsilon is decayed during an episode according to the formula in the documentation. Is it decayed after each state transition? What happens to epsilon at the beginning of a new episode? Thank you very much for your assistance.
Best regards,
Averill M. Law
Hi Averill,
I am not sure why you do not get convergence, but comparing the screenshot you sent with the one in the live script I sent, you can clearly see that the episode rewards are on a different scale (~7000 vs. ~300-400). I would suggest starting fresh: delete any temp files, then download and run the example I sent below. You shouldn't need to change the clip factor or any other hyperparameter in that example.
The reason we made changes to the example is that some of the latest underlying optimizations changed the numerical behavior of training (which is why the example was not converging), so we made these changes to get a more robust result. The reward is typically the most important thing to get right in order to obtain the desired behavior, and it is usually the first thing that needs to be retuned when training does not produce the behavior you want.
In terms of epsilon, I think you may be confusing the epsilon used in epsilon-greedy exploration (e.g., in DQN and Q-learning) with the clip-factor epsilon in PPO (please correct me if I am wrong). The former does indeed change over time in the current implementation, but the latter is fixed. They share the same letter, which can be confusing, but the two hyperparameters serve very different purposes. PPO does not use an "exploration epsilon" because it handles exploration through the stochastic nature of the actor as well as through an additional entropy term in the objective. PPO uses the clip-factor epsilon to control how much the policy (i.e., the neural network weights) can change at each update.
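To make the distinction concrete, here is a minimal Python sketch of PPO's clipped surrogate objective (illustrative only, not the Toolbox implementation). The clip epsilon caps how far the probability ratio between the new and old policy can move the objective, and it stays fixed throughout training:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, clip_epsilon=0.2):
    """Per-sample PPO clipped surrogate (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage.
    clip_epsilon is the fixed 'clip factor' -- it does NOT decay over time,
    unlike the epsilon in epsilon-greedy exploration.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantage
    # Taking the minimum removes the incentive to push the policy
    # further than the clipped region in a single update.
    return np.minimum(unclipped, clipped)

val = ppo_clipped_objective(np.array([1.5]), np.array([2.0]), 0.2)
# a large policy change (ratio 1.5, clip 0.2) is limited to 1.2 * 2.0 = 2.4
```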
Hope that helps.


Accepted Answer

Emmanouil Tzorakoleftherakis on 20 May 2020


Hi Averill,
Here is a version that converges in ~18-20k episodes - thank you for pointing out that this example was not converging properly. This version will also be included in the next R2020a update in a few weeks. We changed some of the hyperparameters as well as the reward signal.
For the stochastic grid world I don't think we have a published example (if I recall correctly). If you used the basic grid world example as reference, you will likely need to make some changes over there as well.
Hope that helps

More Answers (2)

Averill Law on 22 May 2020


Hi Emmanouil,
You are absolutely right about the "Waterfall Grid World" examples. There is a basic discussion but no complete programs. I got the Deterministic version to converge, but not the Stochastic version.
I look forward to hearing from you further on the "Rocket Lander" example, as per my comments earlier today. Using clip rate = 0.125 I got convergence in 13,065 episodes. Thank you.
Averill M. Law
Averill Law on 1 June 2020


Hi Emmanouil,
Your new version of the "Rocket Lander" example does not work on my computer. At line 25 I get the error message:
Unrecognized function or variable 'RLValueRepresentation'
I do not know how to delete Temp files.
I was, in fact, interested in the epsilon for an epsilon-greedy policy when using Q-learning. As episode 1 ends and episode 2 begins, does the value of epsilon continue to be decayed? Is it correct that the state S is NOT reinitialized at this time?
Thank you very much for your assistance.
Averill M. Law

7 comments

Hi Averill,
Make sure you run the example in R2020a; 'rlValueRepresentation' was not available in previous releases. The epsilon value continues to decay across time steps within the same episode as well as across episodes. The state is reinitialized at the beginning of each episode; you can use the reset function to randomize the initial conditions if you want to.
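That schedule can be sketched in a few lines of Python (illustrative only; the parameter names here are not the Toolbox API). The key point is that epsilon decays after every state transition and simply carries over across episode boundaries, even though the environment state is re-initialized:

```python
def run_epsilon_greedy(num_episodes, steps_per_episode,
                       epsilon=1.0, epsilon_decay=0.01, epsilon_min=0.05):
    """Sketch of a typical epsilon-greedy decay schedule.

    Returns the epsilon value at the end of each episode. Epsilon is NOT
    reset when a new episode starts; it keeps decaying from its previous
    value until it reaches epsilon_min.
    """
    trace = []
    for episode in range(num_episodes):
        # state = env.reset()  # the state IS re-initialized each episode ...
        for step in range(steps_per_episode):
            # ... but epsilon continues decaying across the boundary.
            epsilon = max(epsilon_min, epsilon * (1.0 - epsilon_decay))
        trace.append(epsilon)
    return trace

trace = run_epsilon_greedy(num_episodes=3, steps_per_episode=100)
# end-of-episode epsilon keeps shrinking toward epsilon_min
```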
Averill Law on 2 June 2020
Hi Emmanouil,
I got version 2 of the Rocket Lander example to run in Release 2020a (I had to delete all older releases). However, my run stopped after 17,038 episodes (see the enclosed plot), whereas in your write-up it converged after 18,471 episodes. Why? Does MATLAB's random-number generator give the same results on all computers? With all due respect, I'm not sure that either result is completely appropriate based on the plots (maybe I'm missing something). Upon "convergence" at 430, there are apparently still a lot of crashes occurring. See my enclosed plot using version 1 of the Rocket Lander example, which appears to consistently give successful landings for a Clip Factor of 0.125.
I then ran version 2 with StopTrainingValue = 500 and it never converged or reached an average value of 430 (unless it missed it). Please see the enclosed corresponding plot. Thank you.
Averill M. Law
Hi Averill,
Let me schedule a call with you to discuss this in more detail.
Emmanouil
Averill Law on 3 June 2020
Hi Emmanouil,
I hope that I'm not annoying or bothering you with my comments. I'm very interested in RL from an intellectual point of view. By the way, using "430" as a stopping rule seems very strange. I also looked at the Lunar Lander model in OpenAI Gym. As a point of reference, I'm the author of the book "Simulation Modeling and Analysis" (5th edition, McGraw-Hill, 2015), which has been cited 20,600 times. This is three times more than any other book on discrete-event simulation.
When do you want to have a phone conversation? I'm in Arizona, which is three hours behind the east coast. Today or tomorrow would be good. Thank you.
Averill M. Law
520-795-6265
Glad to help. I sent an invite for tomorrow to the email address found on your webpage.
Thank you,
Emmanouil
Averill Law on 3 June 2020
Hi Emmanouil,
If I understand correctly, you will call me at 520-795-6265 on Thursday at 11AM Arizona time.
I'm totally amazed that you want your whole team to discuss the convergence of the Rocket Lander example with me. Thank you.
Averill M. Law
520-795-6265
Of course. Talk to you tomorrow,
Emmanouil

