Hello everyone, I am a newbie in reinforcement learning and I am trying to use the MATLAB Reinforcement Learning Toolbox to solve a simple problem.
However, I have run into a problem along the way. Following the documentation, I set up my own environment (a step function and a reset function) and applied Q-learning to it. However, the result I get from the RL Toolbox is quite different from what I get when I write the algorithm myself. I am running simple Q-learning with epsilon-greedy exploration for a single episode of 350 steps.
I wonder if anyone knows what might be causing this discrepancy. Below is my driver script, which has three sections: 1) import the environment, 2) Q-learning using the RL Toolbox, 3) Q-learning written by myself.
I have not uploaded my step function and reset function here (a rough placeholder sketch of their general shape is included below).
Thank you so much for the help.
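The sketch below only illustrates the rlFunctionEnv signatures that my two functions follow; the transition and reward rules are placeholders, not my actual environment logic (each function lives in its own file):

% myResetFunction.m -- placeholder only, real logic differs
function [InitialObservation,LoggedSignals] = myResetFunction()
LoggedSignals.State = 1;                     % start in state 1
InitialObservation = LoggedSignals.State;
end

% myStepFunction.m -- placeholder only, real logic differs
function [NextObs,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)
s = LoggedSignals.State;
s2 = mod(s + Action - 1, 4) + 1;             % made-up transition over states 1..4
Reward = double(s2 == 4);                    % made-up reward
IsDone = false;                              % episode never terminates early
LoggedSignals.State = s2;
NextObs = s2;
end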
clear all
close all
clc
%%
%% Section 1: Import environment and create state/action specs
IsDone=0;
Observinfo = rlFiniteSetSpec([1 2 3 4]);   % four discrete states
Observinfo.Name = 'Swimmer States';
Observinfo.Description = 'state representation';
ActionInfo = rlFiniteSetSpec([1 2]);       % two discrete actions
ActionInfo.Name = 'Link Action';
env = rlFunctionEnv(Observinfo,ActionInfo,'myStepFunction','myResetFunction');
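% Optional sanity check: validateEnvironment runs the custom reset/step
% functions and errors if their outputs do not match the specs above.
validateEnvironment(env)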
%%
%% Section 2: Q-learning using the RL Toolbox
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
qRepresentation = rlQValueRepresentation(qTable,getObservationInfo(env),getActionInfo(env));
qRepresentation.Options.LearnRate = 1;              % same learning rate as alpha in section 3
agentOpts = rlQAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon = .05;   % epsilon stays fixed at 0.05, since Epsilon == EpsilonMin
agentOpts.EpsilonGreedyExploration.EpsilonMin = .05;
agentOpts.DiscountFactor = 0.7;                     % same discount factor as gamma in section 3
qAgent = rlQAgent(qRepresentation,agentOpts);
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 400;
trainOpts.MaxEpisodes= 30;
trainOpts.StopTrainingCriteria = "GlobalStepCount";
trainOpts.StopTrainingValue = 350;                  % stop after 350 total steps, matching the 350-step loop in section 3
trainOpts.ScoreAveragingWindowLength = 28;
trainingStats = train(qAgent,env,trainOpts);
critic = getCritic(qAgent);
qtable = getLearnableParameters(critic);
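% Display the learned parameters (the toolbox Q table) so they can be
% compared against the hand-written table from section 3 below.
disp(qtable)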
%%
%% Section 3: Q-learning written by myself
InitialObs = reset(env);    % reset the environment before the hand-written loop
Q = zeros(4,2);             % 4 states x 2 actions
alpha = 1;                  % learning rate (matches LearnRate above)
gamma = 0.7;                % discount factor (matches DiscountFactor above)
epsilon = 0.05;             % exploration probability (matches Epsilon above)
iter = 350;                 % total steps (matches StopTrainingValue above)
for i = 1:iter
    randcheck = rand;
    s = env.LoggedSignals.State;   % current state (1..4)
    if randcheck > epsilon
        % greedy action; break ties at random so a single index is returned
        greedy = find(Q(s,:) == max(Q(s,:)));
        a = greedy(randi(numel(greedy)));
    else
        a = randi(2);              % explore: pick one of the two actions at random
    end
    [NextObs,Reward,IsDone,LoggedSignals] = step(env,a);
    s2 = LoggedSignals.State;      % next state
    r = Reward;
    % tabular Q-learning update
    % (IsDone is ignored here because the run is a single 350-step episode)
    Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(s2,:)) - Q(s,a));
end
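% Show the final hand-written Q table so it can be compared with the
% toolbox result from section 2.
disp(Q)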