Reinforcement Learning toolbox simple Q learning

Question

YANGZHE LIU el 9 de Mzo. de 2022

0
Enlazar

Enlace directo a esta pregunta

https://es.mathworks.com/matlabcentral/answers/1667204-reinforcement-learning-toolbox-simple-q-learning

Respondida: arushi el 21 de Dic. de 2023

Hello everyone, i am a newbie in reinforcement learning and I am trying to use the Matlab RL toolbox to solve some simple problem.

However, I meet some problem during the process. I follow the document to establish my own Environment (step function and reset function). And I apply a Q learning on it. However, the result I get from RL toolbox is quite different with that I wrote the algorithm by myself. I am trying to use 1 episode - 350 steps simple Q learning with epsilon greedy.

I wonder if anyone could know what caused this problem. Here I post my driver script, there are three section: 1,import environment 2,Q learning from RL toolbox 3,Q learning wrote by myself.

I do not upload my step function and reset function here.

Thank you so much for the help.

clear all

close all

clc

%%

%%section 1: Import Env and Create action/state

IsDone=0;

Observinfo = rlFiniteSetSpec([1 2 3 4]);

Observinfo.Name = 'Swimmer States';

Observinfo.Description = 'state representation';

ActionInfo = rlFiniteSetSpec([1 2]);

ActionInfo.Name = 'Link Action';

env = rlFunctionEnv(Observinfo,ActionInfo,'myStepFunction','myResetFunction');

%%

%%section 2:Q learning using RL toolbox

qTable = rlTable(getObservationInfo(env),getActionInfo(env));

qRepresentation = rlQValueRepresentation(qTable,getObservationInfo(env),getActionInfo(env));

qRepresentation.Options.LearnRate = 1;

agentOpts = rlQAgentOptions;

agentOpts.EpsilonGreedyExploration.Epsilon = .05;

agentOpts.EpsilonGreedyExploration.EpsilonMin = .05;

agentOpts.DiscountFactor=0.7;

qAgent = rlQAgent(qRepresentation,agentOpts);

trainOpts = rlTrainingOptions;

trainOpts.MaxStepsPerEpisode = 400;

trainOpts.MaxEpisodes= 30;

trainOpts.StopTrainingCriteria = "GlobalStepCount" ;

trainOpts.StopTrainingValue = 350 ;

trainOpts.ScoreAveragingWindowLength = 28;

trainingStats = train(qAgent,env,trainOpts);

critic = getCritic(qAgent);

qtable = getLearnableParameters(critic);

%%

%%section 3:Q learning by myself

InitialObs = reset(env)

Q=zeros(4,2);

alpha=1;

gamma=0.7;

epsilon=0.05;

iter=350;

for i=1:iter;

randcheck=rand;

s=env.LoggedSignals.State

if randcheck>epsilon

a=find(Q(s,:)==max(Q(s,:)));

elseif randcheck<=epsilon

a=randi(2);

end

[NextObs,Reward,IsDone,LoggedSignals] = step(env,a);

s2=LoggedSignals.State

r=Reward;

Q(s,a)=Q(s,a)+alpha*(r+gamma*max(Q(s2,:))-Q(s,a));

end

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

arushi el 21 de Dic. de 2023

0
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1667204-reinforcement-learning-toolbox-simple-q-learning#answer_1375342

Hi Yangzhe,

I understand that you are getting different results in both the cases with the same approach. Given the code you've provided, here are a few potential considerations that might lead to differences in the results between the MATLAB RL toolbox implementation and your custom Q-learning implementation:

Epsilon-Greedy Action Selection:

In your custom code, if multiple actions have the same maximum Q-value, the find function will return all indices where this condition is true. If there is more than one, this could lead to unintended behavior because you do not explicitly handle choosing one action from the resulting array.
To fix this, you can use randi(length(a)) to randomly select an action from those with the maximum Q-value:
Ensure that the MATLAB RL toolbox also picks one action randomly when there is a tie for the maximum Q-value.

Initial Observation:

The custom code does not seem to use InitialObs after resetting the environment. Make sure that you're starting from the same initial state in both implementations.

Learning Rate (Alpha):

You have set the learning rate to 1 in both implementations, which is fine as long as this is intentional. A learning rate of 1 means that the Q-values are updated to fully reflect the most recent information, without considering the old value.

Stopping Criteria:

In the MATLAB RL toolbox code, you've set MaxStepsPerEpisode to 400 and MaxEpisodes to 30 but then stop after 350 global steps. This could potentially result in stopping mid-episode.
In your custom code, you run for a flat 350 iterations, which may not correspond to the same number of episodes or steps per episode as in the MATLAB RL toolbox code.
Ensure that the number of episodes and steps per episode is consistent between both implementations.

Reward Processing:

Verify that the reward is being processed and applied in the same way in both implementations.

Hope these suggestions help.

Thank you

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Reinforcement Learning toolbox simple Q learning

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuestas (1)

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

Reinforcement Learning toolbox simple Q learning

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuestas (1)

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos