Optimal path problem using Q-learning algorithm

30 views (last 30 days)
Aristi Christoforou on 24 Jun 2022
Answered: Pratik on 14 Nov 2023
Hello, I want to find the optimal path for my target to the nearest corner of an environment with 3 obstacles using Q-learning. For some reason the code only works when the nearest corner is the upper-left or upper-right corner of the environment, and in some cases it goes into an infinite loop when the nearest corner is the lower-left corner. In general, for the lower-left and lower-right corners the agent prefers to take a step up first, so it takes many more actions than the optimal path needs. I don't know what I can do to fix this problem :( I would really appreciate it if anyone could help me find my mistake so I can get the optimal path.
n = 12;
dim=n*n;
maze = ones(n,n);
%%
%{
variables:
r = row of agent
c = column of agent
n = number of rows and columns
rt = row of thief
ct = column of thief
ruc = thief's distance from the upper-right corner (1,n)
rdc = thief's distance from the lower-right corner (n,n)
luc = thief's distance from the upper-left corner (1,1)
ldc = thief's distance from the lower-left corner (n,1)
rpt = row of the thief's position in state path(i)
goalt = position of the goal (I take one of the 4 corners of the maze and
convert it into a single linear index of the matrix using the tool 1)
rewardt = reward matrix of the thief
distgt = goal of the thief (distance number: how far away the closest
corner is)
cornerofgoal = corner with the best distance for the thief (gives the
index of the optimal corner: ruc=1, rdc=2, luc=3, ldc=4)
q = Q-value matrix
gamma = discount factor
alpha = learning rate
maxItr = max number of iterations
csa = current state of agent
nsa = next state of agent
cst = current state of thief
nst = next state of thief
n_actions = possible actions of the agent for the chosen state
distta = distance between thief and agent
fores, ypol = quotient and remainder of the division used to find the row
and column of each position of the thief along its path
%}
%% creating maze
for i=1:3
    x=randi([1,n]); %row
    y=randi([1,n]); %column
    while (x==1 && y==1) || (x==1 && y==n) || (x==n && y==1) || (x==n && y==n)
        x=randi([1,n]);
        y=randi([1,n]);
    end
    maze(x,y)=-50;
end
%disp(maze)
n=length(maze);
figure
imagesc(maze)
colormap(spring)
colorbar
set(gca,'xtick',0:1:n)
set(gca,'ytick',0:1:n)
grid on
% position the agent randomly
r=randi([1,n]);
c=randi([1,n]);
while maze(r,c)<-30
    r=randi([1,n]);
    c=randi([1,n]);
end
text(c,r,'\bullet','Color','cyan','FontSize',28,'HorizontalAlignment','center')
% position the thief randomly
rt=randi([1,n]);
ct=randi([1,n]);
while maze(rt,ct)<-30
    rt=randi([1,n]);
    ct=randi([1,n]);
end
disp(rt)
disp(ct)
text(ct,rt,'\bullet','Color','red','FontSize',28,'HorizontalAlignment','center')
%% finding the distances of the thief's position from each corner
ruc=norm([rt ct]-[1 n])
rdc=norm([rt ct]-[n n])
luc=norm([rt ct]-[1 1])
ldc=norm([rt ct]-[n 1])
%% Setting the goal for the thief
goalt=0;
[distgt,cornerofgoal]=min([ruc rdc luc ldc])
if cornerofgoal==1
    goalt=n
elseif cornerofgoal==2
    goalt=n*n
elseif cornerofgoal==3
    goalt=1
elseif cornerofgoal==4
    goalt=1+n*(n-1)
end
%%
% find the path of the thief (best path)
%{
for every action of the thief I will find the reward
creating the reward matrix for the maze (for the thief):
Up (i-n)
Down (i+n)
Left (i-1)
Right (i+1)
Reward is -Inf (~ no reward) for any other action, so no action other than the above will ever be taken.
%}
if distgt==0
    x=['the thief has won, he was already in corner ',num2str(cornerofgoal),' of the maze'];
    disp(x)
else
    rewardt=zeros(n*n);
    for i=1:dim % why is this needed?
        rewardt(i,:)=reshape(maze',1,n*n); % not sure about n*n
    end
    for i=1:dim
        for j=1:dim
            if j~=i-n && j~=i+n && j~=i-1 && j~=i+1 % if the action is not one of the above, the reward is -Inf
                rewardt(i,j)=-Inf;
            end
        end
    end
    for i=1:n:dim % should I remove the j = NW, SW cases?
        for j=1:i+n
            if j==i-1
                rewardt(i,j)=-Inf;
                rewardt(j,i)=-Inf;
            end
        end
    end
    %% Q-Learning algorithm
    q = randn(size(rewardt));
    gamma = 0.95;
    alpha = 0.2;
    maxItr = 50;
    for i=1:maxItr
        % Starting from the start position
        cs=ct+n*(rt-1); % convert the thief's position into a linear index of the n*n reward matrix
        % Repeat until the goal state is reached
        while(1)
            % possible actions for the chosen state
            n_actions = find(rewardt(cs,:)>=0);
            % choose an action at random and set it as the next state
            ns = n_actions(randi([1 length(n_actions)]));
            % find all the possible actions for the selected state
            n_actions = find(rewardt(ns,:)>=0);
            % find the maximum Q-value, i.e. the next state with the best action
            max_q = 0;
            for j=1:length(n_actions)
                max_q = max(max_q,q(ns,n_actions(j)));
            end
            % update the Q-values as per the Bellman equation
            q(cs,ns)=rewardt(cs,ns)+gamma*max_q;
            % check whether the episode has finished, i.e. the goal has been reached
            if(cs == goalt)
                break;
            end
            % set the current state to the next state
            cs=ns;
        end
    end
    %% Solving the maze
    % * starting from the first position
    start =ct+n*(rt-1);
    move = 0;
    path = start;
    %%
    % * iterating until the goal state is reached
    while(move~=goalt)
        [~,move]=max(q(start,:));
        % deleting chances of getting stuck in small loops (up to order 4)
        if ismember(move,path)
            [~,x]=sort(q(start,:),'descend');
            move=x(2);
            if ismember(move,path)
                [~,x]=sort(q(start,:),'descend');
                move=x(3);
                %{
                if ismember(move,path)
                    [~,x]=sort(q(start,:),'descend');
                    move=x(4);
                    if ismember(move,path)
                        [~,x]=sort(q(start,:),'descend');
                        move=x(5);
                    end
                end
                %}
            end
        end
        % appending the next action/move to the path
        path=[path,move];
        start=move;
    end
    %% Solution of the maze
    % i.e. the optimal path from START to GOAL
    %%
    fprintf('Final path: %s\n',num2str(path))
    fprintf('Total steps: %d\n',length(path))
    % reproduce the path as the plot matrix pmat
    pmat=zeros(n,n);
    [q, r]=quorem(sym(path),sym(n)); % requires the Symbolic Math Toolbox; note this overwrites q and r
    q=double(q);r=double(r);
    q(r~=0)=q(r~=0)+1;r(r==0)=n;
    for i=1:length(q)
        pmat(q(i),r(i))=50;
    end
    %% Final plot of the maze
    %%
    figure
    imagesc(pmat)
    colormap(white)
    for i=1:n
        for j=1:n
            if maze(i,j)==min(maze(:)) % mark the obstacle cells
                text(j,i,'X','HorizontalAlignment','center')
            end
            if pmat(i,j)==50
                text(j,i,'\bullet','Color','red','FontSize',28)
            end
        end
    end
    hold on
    imagesc(maze,'AlphaData',0.2)
    colorbar
    colormap(winter)
    grid on
    set(gca,'xtick',0:1:n)
    set(gca,'ytick',0:1:n)
    hold off
end
4 comments
Sam Chak on 24 Jun 2022
Not yet, @Aristi Christoforou. I'm still studying the code.
Which parameters influence the learning process most strongly?
Aristi Christoforou on 24 Jun 2022
Edited: Aristi Christoforou on 24 Jun 2022
gamma, alpha, rewardt and maxItr. The problem is that, for some reason, the learning process always starts with a step up, which is not ideal when the goal corner is rdc or ldc. Thanks in advance for the help, @Sam Chak. My solution is based on this example: https://www.mathworks.com/matlabcentral/fileexchange/67989-q-learning-example
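For reference, a minimal way to check which move the trained policy prefers from the thief's start cell (this assumes the workspace variables q, rewardt, rt, ct and n produced by the code above; it is only a diagnostic sketch, not part of my script):

% diagnostic sketch: rank the allowed moves from the start state by Q-value
cs = ct + n*(rt-1);                   % linear index of the thief's start cell
valid = find(rewardt(cs,:) >= 0);     % moves the code treats as allowed
[vals, order] = sort(q(cs,valid), 'descend');
fprintf('Start state %d, candidate moves ranked by Q-value:\n', cs);
for k = 1:numel(valid)
    fprintf('  move to state %d : Q = %.3f\n', valid(order(k)), vals(k));
end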


Answers (1)

Pratik on 14 Nov 2023
Hi Aristi,
As I understand it, you want to find the most optimal path to the nearest corner of the grid while avoiding the obstacles, using Q-learning. However, the algorithm does not always find the optimal path and gets stuck in an infinite loop in some cases.
The example this code is based on aims to solve a maze; finding the most optimal path is not its primary objective. To find the optimal path, the following methods can be used (see the sketch after this list):
1. Assign a negative reward to each step in the path; when the agent tries to maximise the total reward, it will then prefer the shortest (optimal) path.
2. The current code chooses the next action at random from the set of possible actions. An epsilon-greedy strategy can be implemented instead, which balances exploration and exploitation.
3. Experimenting with different values of the learning rate and discount factor, and increasing the number of iterations, can also improve performance.
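A minimal sketch of points 1-3 follows. It reuses rewardt, goalt, rt, ct and n from the question's code, re-initialises q to zeros, and replaces the maze cell value with a flat -1 reward per move; the parameter values are only assumptions to experiment with, not a tested fix:

epsilon = 0.1;                 % exploration probability (assumed value)
stepPenalty = -1;              % flat negative reward per move (assumed value)
gamma = 0.9;                   % discount factor (tune as needed)
alpha = 0.2;                   % learning rate (tune as needed)
maxItr = 500;                  % more episodes than the original 50
q = zeros(n*n);                % zero init, so the terminal (goal) row stays at 0
for ep = 1:maxItr
    cs = ct + n*(rt-1);                        % thief's start state (linear index)
    while cs ~= goalt
        n_actions = find(rewardt(cs,:) >= 0);  % valid moves from cs
        if rand < epsilon                      % explore: random valid move
            ns = n_actions(randi(numel(n_actions)));
        else                                   % exploit: current best estimate
            [~,k] = max(q(cs,n_actions));
            ns = n_actions(k);
        end
        max_q = max(q(ns, rewardt(ns,:) >= 0));
        % standard Q-learning update with learning rate alpha and a cost per step
        q(cs,ns) = q(cs,ns) + alpha*(stepPenalty + gamma*max_q - q(cs,ns));
        cs = ns;
    end
end

The flat per-step cost makes shorter paths strictly better, and the zero initialisation combined with negative rewards pushes the agent towards moves it has not tried yet.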
The infinite loop can occur when the algorithm is stuck in a sub-optimal cycle where the best action in state 'A' leads to state 'B' and the best action in state 'B' leads back to state 'A'. To avoid this, the path can be tracked and states that were already visited can be excluded (or given a strongly negative value), as sketched below.
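A minimal sketch of that idea for the path-extraction step, assuming q, rewardt, goalt, rt, ct and n from the code above (just the masking idea, not a drop-in replacement):

% greedy path extraction that masks already-visited states
cur  = ct + n*(rt-1);                  % thief's start state
path = cur;
while cur ~= goalt
    qrow = q(cur,:);
    qrow(rewardt(cur,:) < 0) = -Inf;   % forbid invalid moves and obstacles
    qrow(path) = -Inf;                 % forbid states already on the path
    [best, move] = max(qrow);
    if isinf(best)                     % dead end: every neighbour already visited
        warning('No unvisited move left; the Q-table needs more training.');
        break;
    end
    path = [path, move];               %#ok<AGROW>
    cur  = move;
end
disp(path)

If the warning fires, the Q-table simply has not learned a consistent route yet; training for more episodes (for example with the sketch above) usually resolves it.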
Hope this helps!
