What exactly is Episode Q0? What information does it give?

202 views (last 30 days)
Reading the documentation, I find that "For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed, Episode Q0 approaches the true discounted long-term reward."
But I cannot grasp exactly what Q0 is because, except in a few examples (like this one) where it "converges" to some value rather quickly, I have seen the Q0 value do different things that I cannot understand or interpret (like the two examples shown here). I also don't understand what "true discounted reward" means exactly. Is it per episode, an average, or something cumulative?
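For concreteness, what I would call the discounted return of a single episode, starting from the first time step, is just the discounted sum of the rewards collected during that episode. A minimal sketch (r is a hypothetical vector of the rewards logged over one episode, and gamma is the agent's discount factor):

gamma = 0.99;                              % discount factor (example value)
G0 = sum(gamma.^(0:numel(r)-1) .* r(:).'); % G0 = r(1) + gamma*r(2) + gamma^2*r(3) + ...

Is Q0 supposed to approach this per-episode quantity, its average over episodes, or something else?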
In this answer it is suggested that Q0 should track the average episode reward, but I don't see that in the examples.
For example, in the cart-pole example, if one continues training for more episodes (changing the stop-training criterion so it does not stop at the average-reward threshold), the Q0 value reaches very high values that have nothing to do with the average reward or the episode rewards. I simulated 1000 episodes of the cart-pole example and the Q0 values even mess up the plot scale because they go way too high. The agent seems to learn properly, and it even manages to get out of some local minima successfully, but I still cannot grasp what information Q0 yields.
I have not found Q0 defined in the reinforcement learning literature either. Could you please clarify a bit, or point me to some bibliography where I can read further about this specific quantity?

Accepted Answer

Emmanouil Tzorakoleftherakis on 22 Jun 2021
Q0 is calculated by performing inference on the critic at the beginning of each episode. Effectively, it is a metric that tells you how well the critic has been trained. If you had the perfect critic that could accurately predict the expected long term reward based on the current observation at the beginning of the episode, this value should overlap with the actual total reward collected during that same episode.
In general, it is not required for this to happen for actor-critic methods. The actor may converge first, and at that point it would be totally fine to stop training.
Hope that helps
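For anyone who wants to see roughly where the number comes from, here is a minimal sketch of the idea above. This is not the toolbox's internal code; env and agent are placeholders, and depending on the agent type the critic is a state-value function (as assumed here) or a Q-function that also needs an action as input:

obs0   = reset(env);               % initial observation of the episode
critic = getCritic(agent);         % extract the critic from the trained agent
q0     = getValue(critic,{obs0});  % critic's estimate of the discounted return from obs0

The value logged as Episode Q0 during training corresponds to this kind of evaluation at the first step of each episode, which can then be compared with the reward actually accumulated over that same episode.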
  7 Comments
Arman Ali on 19 Aug 2022
Same question here. Can anyone show how Episode Q0 is calculated?
轩 on 30 Dec 2023
I think Q0 is still useful in actor-critic methods. Can I take it this way: in an AC algorithm, if Q0 does not converge to the average reward, that means the critic's convergence rate is slower than the actor's, so we should set a bigger learning rate for the critic, or something along those lines?
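For example (illustrative only; criticNet, actorNet, obsInfo, actInfo, and the layer name 'state' are placeholders), with R2021a-style representation options one could give the critic a larger learning rate than the actor:

criticOpts = rlRepresentationOptions('LearnRate',1e-3);  % faster critic updates
actorOpts  = rlRepresentationOptions('LearnRate',1e-4);  % slower actor updates
critic = rlValueRepresentation(criticNet,obsInfo,'Observation',{'state'},criticOpts);
actor  = rlStochasticActorRepresentation(actorNet,obsInfo,actInfo,'Observation',{'state'},actorOpts);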
