What exactly is Episode Q0? What information does it give?

202 views (last 30 days)
Reading the documentation, I find that "For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed, Episode Q0 approaches the true discounted long-term reward."
But I cannot grasp exactly what Q0 is because, except in a few examples (like this one) where it "converges" to some value rather quickly, I have seen the Q0 value do different things that I cannot understand or interpret (like the two examples shown here). I also don't understand what "true discounted reward" means exactly. Is it per episode, an average, or something cumulative?
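For concreteness, what I would call the discounted return of a single episode, starting from the first time step, is just the discounted sum of the rewards collected during that episode. A minimal sketch (r is a hypothetical vector of the rewards logged over one episode, and gamma is the agent's discount factor):

gamma = 0.99;                              % discount factor (example value)
G0 = sum(gamma.^(0:numel(r)-1) .* r(:).'); % G0 = r(1) + gamma*r(2) + gamma^2*r(3) + ...

Is Q0 supposed to approach this per-episode quantity, its average over episodes, or something else?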
In this answer it is suggested that Q0 should track the average episode reward, but I don't see that in the examples.
For example, in the cart-pole example, if one continues training for more episodes (changing the stop-training criterion so it does not stop at the average-reward threshold), the Q0 value reaches very high values that have nothing to do with the average reward or the episode rewards. I simulated 1000 episodes of the cart-pole example and the Q0 values even mess up the plot scale because they go way too high. The agent seems to learn properly, and it even manages to get out of some local minima successfully, but I still cannot grasp what information Q0 yields.
I have not found Q0 defined in the reinforcement learning literature either. Could you please clarify a bit, or point me to some bibliography where I can read further about this specific quantity?

Accepted Answer

Emmanouil Tzorakoleftherakis on 22 Jun 2021
Q0 is calculated by performing inference on the critic at the beginning of each episode. Effectively, it is a metric that tells you how well the critic has been trained. If you had the perfect critic that could accurately predict the expected long term reward based on the current observation at the beginning of the episode, this value should overlap with the actual total reward collected during that same episode.
In general, it is not required for this to happen for actor-critic methods. The actor may converge first, and at that point it would be totally fine to stop training.
Hope that helps
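For anyone who wants to see roughly where the number comes from, here is a minimal sketch of the idea above. This is not the toolbox's internal code; env and agent are placeholders, and depending on the agent type the critic is a state-value function (as assumed here) or a Q-function that also needs an action as input:

obs0   = reset(env);               % initial observation of the episode
critic = getCritic(agent);         % extract the critic from the trained agent
q0     = getValue(critic,{obs0});  % critic's estimate of the discounted return from obs0

The value logged as Episode Q0 during training corresponds to this kind of evaluation at the first step of each episode, which can then be compared with the reward actually accumulated over that same episode.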
  7 Comments
Arman Ali on 19 Aug 2022
Same question here. Can anyone show how Episode Q0 is calculated?
轩 on 30 Dec 2023
I think Q0 is still useful in actor-critic methods. Can I take it this way: in an AC algorithm, if Q0 does not converge to the average reward, that means the critic's convergence rate is slower than the actor's, so we should set a bigger learning rate for the critic, or something along those lines?
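For example (illustrative only; criticNet, actorNet, obsInfo, actInfo, and the layer name 'state' are placeholders), with R2021a-style representation options one could give the critic a larger learning rate than the actor:

criticOpts = rlRepresentationOptions('LearnRate',1e-3);  % faster critic updates
actorOpts  = rlRepresentationOptions('LearnRate',1e-4);  % slower actor updates
critic = rlValueRepresentation(criticNet,obsInfo,'Observation',{'state'},criticOpts);
actor  = rlStochasticActorRepresentation(actorNet,obsInfo,actInfo,'Observation',{'state'},actorOpts);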
