Why does my Custom TD3 not learn like the built-in TD3 agent?
I have tried to code my custom TD3 agent to behave as much like the built-in TD3 agent as possible in the same Simulink environment. The only difference between them is that, for the custom agent, I had to use a Rate Transition block to perform a zero-order hold between the states, rewards, and done signal and the custom agent. I used the Rate Transition block's "Specify mode for output port sample time" option to set the custom agent's sample time.

My code for my custom TD3 agent is below. I tried to make it as much like the built-in TD3 as possible; the ep_counter and num_of_ep properties are unused.
classdef test_TD3Agent_V2 < rl.agent.CustomAgent
    properties
        %neural networks
        actor
        critic1
        critic2
        %target networks
        target_actor
        target_critic1
        target_critic2
        %dimensions
        statesize
        actionsize
        %optimizers
        actor_optimizer
        critic1_optimizer
        critic2_optimizer
        %buffer
        statebuffer
        nextstatebuffer
        actionbuffer
        rewardbuffer
        donebuffer
        counter %keeps count of number experiences encountered
        index %keeps track of current available index in buffer
        buffersize
        batchsize
        %episodes
        num_of_ep
        ep_counter
        %keep count of critic number of updates
        num_critic_update
    end
    methods
        %constructor
        function obj = test_TD3Agent_V2(actor,critic1,critic2,target_actor,target_critic1,target_critic2,actor_opt,critic1_opt,critic2_opt,statesize,actionsize,buffer_size,batchsize,num_of_ep)
            %(required) call abstract class constructor
            obj = obj@rl.agent.CustomAgent();
            %define observation + action space
            obj.ObservationInfo = rlNumericSpec([statesize 1]);
            obj.ActionInfo = rlNumericSpec([actionsize 1],LowerLimit = -1,UpperLimit = 1);
            obj.SampleTime = -1; %determined by rate transition block
            %define networks
            obj.actor = actor;
            obj.critic1 = critic1;
            obj.critic2 = critic2;
            %define target networks
            obj.target_actor = target_actor;
            obj.target_critic1 = target_critic1;
            obj.target_critic2 = target_critic2;
            %define optimizer
            obj.actor_optimizer = actor_opt;
            obj.critic1_optimizer = critic1_opt;
            obj.critic2_optimizer = critic2_opt;
            %record dimensions
            obj.statesize = statesize;
            obj.actionsize = actionsize;
            %initialize buffer
            obj.statebuffer = dlarray(zeros(statesize,1,buffer_size));
            obj.nextstatebuffer = dlarray(zeros(statesize,1,buffer_size));
            obj.actionbuffer = dlarray(zeros(actionsize,1,buffer_size));
            obj.rewardbuffer = dlarray(zeros(1,buffer_size));
            obj.donebuffer = zeros(1,buffer_size);
            obj.buffersize = buffer_size;
            obj.batchsize = batchsize;
            obj.counter = 0;
            obj.index = 1;
            %episodes (unused)
            obj.num_of_ep = num_of_ep;
            obj.ep_counter = 1;
            %used for delay actor update and target network soft transfer
            obj.num_critic_update = 0;
        end
    end
    methods (Access = protected)
        %Action method
        function action = getActionImpl(obj,Observation)
            % Given the current state of the system, return an action 
            action = getAction(obj.actor,Observation);
        end
        %Action with noise method
        function action = getActionWithExplorationImpl(obj,Observation)
            % Given the current observation, select an action
            action = getAction(obj.actor,Observation);
            % Add random noise to action
        end
        %Learn method
        function action = learnImpl(obj,Experience)
            %parse experience 
            state = Experience{1};
            action_ = Experience{2};
            reward = Experience{3};
            next_state = Experience{4};
            isdone = Experience{5};
            %buffer operations
            %check if index wraps around
            if (obj.index > obj.buffersize)
                obj.index = 1;
            end       
            %record experience in buffer
            obj.statebuffer(:,:,obj.index) = state{1};
            obj.actionbuffer(:,:,obj.index) = action_{1};
            obj.rewardbuffer(:,obj.index) = reward;
            obj.nextstatebuffer(:,:,obj.index) = next_state{1};
            obj.donebuffer(:,obj.index) = isdone;
            %increment index and counter
            obj.counter = obj.counter + 1;
            obj.index = obj.index + 1;
            %if non terminal state
            if (isdone == false)
                action = getAction(obj.actor,next_state); %select next action 
                noise = randn([6,1]).*0.1; %gaussian noise with standard dev of 0.1
                action{1} = action{1} + noise; %add noise
                action{1} = clip(action{1},-1,1); %clip action noise
            else
                %learning at the end of episode
                if (obj.counter >= obj.batchsize)
                    max_index = min([obj.counter obj.buffersize]); %range of index 1 to max_index for buffer
                    %sample experience randomly from buffer
                    sample_index_vector = randsample(max_index,obj.batchsize); %vector of index experience to sample
                    %create buffer mini batch dlarrays
                    state_batch = dlarray(zeros(obj.statesize,1,obj.batchsize));
                    nextstate_batch = dlarray(zeros(obj.statesize,1,obj.batchsize));
                    action_batch = dlarray(zeros(obj.actionsize,1,obj.batchsize));
                    reward_batch = dlarray(zeros(1,obj.batchsize));
                    done_batch = zeros(1,obj.batchsize);
                    for i = 1:obj.batchsize %iterate through buffer and transfer experience over to mini batch
                        state_batch(:,:,i) = obj.statebuffer(:,:,sample_index_vector(i));
                        nextstate_batch(:,:,i) = obj.nextstatebuffer(:,:,sample_index_vector(i));
                        action_batch(:,:,i) = obj.actionbuffer(:,:,sample_index_vector(i));
                        reward_batch(:,i) = obj.rewardbuffer(:,sample_index_vector(i));
                        done_batch(:,i) = obj.donebuffer(:,sample_index_vector(i));
                    end
                    %update critic networks
                    criticgrad1 = dlfeval(@critic_gradient,obj.critic1,obj.target_actor,obj.target_critic1,obj.target_critic2,{state_batch},{nextstate_batch},{action_batch},reward_batch,done_batch,obj.batchsize);
                    [obj.critic1,obj.critic1_optimizer] = update(obj.critic1_optimizer,obj.critic1,criticgrad1);
                    criticgrad2 = dlfeval(@critic_gradient,obj.critic2,obj.target_actor,obj.target_critic1,obj.target_critic2,{state_batch},{nextstate_batch},{action_batch},reward_batch,done_batch,obj.batchsize);
                    [obj.critic2,obj.critic2_optimizer] = update(obj.critic2_optimizer,obj.critic2,criticgrad2);
                    %update num of critic updates
                    obj.num_critic_update = obj.num_critic_update + 1;
                    %delayed actor update + target network transfer
                    if (mod(obj.num_critic_update,2) == 0)
                        actorgrad = dlfeval(@actor_gradient,obj.actor,obj.critic1,obj.critic2,{state_batch});
                        [obj.actor,obj.actor_optimizer] = update(obj.actor_optimizer,obj.actor,actorgrad);
                        target_soft_transfer(obj);
                    end
                end
            end
        end
        %function used to soft transfer over to target networks
        function target_soft_transfer(obj)
            smooth_factor = 0.005;
            for i = 1:6
                obj.target_actor.Learnables{i} = smooth_factor*obj.actor.Learnables{i} + (1 - smooth_factor)*obj.target_actor.Learnables{i};
                obj.target_critic1.Learnables{i} = smooth_factor*obj.critic1.Learnables{i} + (1 - smooth_factor)*obj.target_critic1.Learnables{i};
                obj.target_critic2.Learnables{i} = smooth_factor*obj.critic2.Learnables{i} + (1 - smooth_factor)*obj.target_critic2.Learnables{i};
            end
        end
    end
end
%obtain gradient of Q value wrt actor
function actorgradient = actor_gradient(actorNet,critic1,critic2,states,batchsize)
    actoraction = getAction(actorNet,states); %obtain actor action
    %obtain Q values
    Q1 = getValue(critic1,states,actoraction); 
    Q2 = getValue(critic2,states,actoraction);
    %obtain min Q values + reverse sign for gradient ascent
    Qmin = min(Q1,Q2);
    Q = -1*mean(Qmin);
    gradient = dlgradient(Q,actorNet.Learnables); %calculate gradient of Q value wrt NN learnables
    actorgradient = gradient;
end
%obtain gradient of critic NN
function criticgradient = critic_gradient(critic,target_actor,target_critic_1,target_critic_2,states,nextstates,actions,rewards,dones,batchsize)
    %obtain target action
    target_actions = getAction(target_actor,nextstates);
    %target policy smoothing  
    for i = 1:batchsize
        target_noise = randn([6,1]).*sqrt(0.2);
        target_noise = clip(target_noise,-0.5,0.5);
        target_actions{1}(:,:,i) = target_actions{1}(:,:,i) + target_noise; %add noise to action for smoothing
    end
    target_actions{1}(:,:,:) = clip(target_actions{1}(:,:,:),-1,1); %clip btw -1 and 1
    %obtain Q values
    Qtarget1 = getValue(target_critic_1,nextstates,target_actions);
    Qtarget2 = getValue(target_critic_2,nextstates,target_actions);
    Qmin = min(Qtarget1,Qtarget2);
    Qoptimal = rewards + 0.99*Qmin.*(1 - dones);
    Qpred = getValue(critic,states,actions);
    %obtain critic loss
    criticLoss = 0.5*mean((Qoptimal - Qpred).^2);
    criticgradient = dlgradient(criticLoss,critic.Learnables);
end
And here is my code for training the built-in TD3 agent:
clc
%define times
dt = 0.1; %time steps
Tf = 7; %simulation time
%create stateInfo and actionInfo objects
statesize = 38;
actionsize = 6;
stateInfo = rlNumericSpec([statesize 1]);
actionInfo = rlNumericSpec([actionsize 1],LowerLimit = -1,UpperLimit = 1);
mdl = 'KUKA_EE_Controller_v18_disturbed';
blk = 'KUKA_EE_Controller_v18_disturbed/RL Agent';
%create environment object
env = rlSimulinkEnv(mdl,blk,stateInfo,actionInfo);
%assign reset function
env.ResetFcn = @ResetFunction;
% %create actor network
actorlayers = [
   featureInputLayer(statesize)
   fullyConnectedLayer(800)
   reluLayer
   fullyConnectedLayer(600)
   reluLayer
   fullyConnectedLayer(actionsize)
   tanhLayer
   ];
actorNet = dlnetwork;
actorNet = addLayers(actorNet, actorlayers);
actorNet = initialize(actorNet);
actor = rlContinuousDeterministicActor(actorNet, stateInfo, actionInfo);
%create critic networks
statelayers = [
   featureInputLayer(statesize, Name='states')
   concatenationLayer(1, 2, Name='concat')
   fullyConnectedLayer(400)
   reluLayer
   fullyConnectedLayer(400)
   reluLayer
   fullyConnectedLayer(1, Name='Qvalue')
   ];
actionlayers = featureInputLayer(actionsize, Name='actions');
criticNet = dlnetwork;
criticNet = addLayers(criticNet, statelayers);
criticNet = addLayers(criticNet, actionlayers);
criticNet = connectLayers(criticNet, 'actions', 'concat/in2');
criticNet = initialize(criticNet);
critic1 = rlQValueFunction(criticNet,stateInfo,actionInfo,ObservationInputNames='states',ActionInputNames='actions');
criticNet2 = dlnetwork;
criticNet2 = addLayers(criticNet2, statelayers);
criticNet2 = addLayers(criticNet2, actionlayers);
criticNet2 = connectLayers(criticNet2, 'actions', 'concat/in2');
criticNet2 = initialize(criticNet2);
critic2 = rlQValueFunction(criticNet2,stateInfo,actionInfo,ObservationInputNames='states',ActionInputNames='actions');
%create options object for actor and critic
actoroptions = rlOptimizerOptions(Optimizer='adam',LearnRate=0.001);
criticoptions = rlOptimizerOptions(Optimizer='adam',LearnRate=0.003);
agentoptions = rlTD3AgentOptions;
agentoptions.SampleTime = dt;
agentoptions.ActorOptimizerOptions = actoroptions;
agentoptions.CriticOptimizerOptions = criticoptions;
agentoptions.DiscountFactor = 0.99;
agentoptions.TargetSmoothFactor = 0.005;
agentoptions.ExperienceBufferLength = 1000000;
agentoptions.MiniBatchSize = 250;
agentoptions.ExplorationModel.StandardDeviation = 0.1;
agentoptions.ExplorationModel.StandardDeviationDecayRate = 1e-4;
agent = rlTD3Agent(actor, [critic1 critic2], agentoptions);
%create training options object
trainOpts = rlTrainingOptions(MaxEpisodes=20,MaxStepsPerEpisode=floor(Tf/dt),StopTrainingCriteria='none',SimulationStorageType='none');
%train agent
trainresults = train(agent,env,trainOpts);
I made my custom TD3 agent with the same actor and critic structures, the same hyperparameters, and the same agent options, but it doesn't seem to learn and I don't know why. I don't know if the rate transition block is having a negative impact on the training. One difference between my custom TD3 and the built-in TD3 is the actor gradient: the MATLAB documentation on the TD3 agent says the gradient is calculated for every sample in the mini-batch, and then the gradients are accumulated and averaged.
https://www.mathworks.com/help/reinforcement-learning/ug/td3-agents.html (TD3 documentation)

But in the actor_gradient function in my code above, I averaged the Q values over the mini-batch first and then performed a single gradient operation, so maybe that's one possible reason why my custom TD3 agent isn't learning. Here are my reward plots.
Built-in TD3: [training reward plot]

Custom TD3 (stopped early because it wasn't learning): [training reward plot]
I would appreciate any help because I have been stuck for months.
3 comments
Answers (1)
Emmanouil Tzorakoleftherakis on 21 Aug 2025 (edited 21 Aug 2025)
      A few things:
- The target action policy smoothing can be vectorized in the critic loss function.
- Sampling experiences into the mini-batch can be vectorized.
- The critic loss function doesn't account for truncated episodes (e.g. isdone == 2).
- I think this line will always return false: "if (isdone == false)". I don't think it particularly matters in this case, however.
- You are only sampling one mini-batch at the end of each episode. We sample up to MaxMiniBatchPerEpoch mini-batches per episode.
- Your getActionWithExplorationImpl does not add noise, so there is no exploration.
- You don't implement action noise decay the way you configured it for the built-in agent.
There may still be small details here and there. I would focus more on getting your custom agent to start learning, rather than trying to replicate the built-in one, at least initially. A rough sketch of the vectorization and exploration points is below.
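For the sampling and target-smoothing points, a minimal sketch based on the variable and property names from the question (not the toolbox implementation) could replace the corresponding loops:
%vectorized mini-batch sampling inside learnImpl (replaces the for-loop copy)
max_index = min(obj.counter, obj.buffersize);
idx = randsample(max_index, obj.batchsize);
state_batch     = obj.statebuffer(:,:,idx);
nextstate_batch = obj.nextstatebuffer(:,:,idx);
action_batch    = obj.actionbuffer(:,:,idx);
reward_batch    = obj.rewardbuffer(:,idx);
done_batch      = obj.donebuffer(:,idx);
%vectorized target policy smoothing inside the critic loss: one draw for the whole batch
noise = clip(sqrt(0.2)*randn(size(target_actions{1})), -0.5, 0.5);
target_actions{1} = clip(target_actions{1} + noise, -1, 1);
For exploration, one possibility is Gaussian noise with a decaying standard deviation inside getActionWithExplorationImpl. NoiseStd and NoiseDecay here are hypothetical extra properties you would add to the agent class; they are not toolbox options:
%sketch of an exploration method; NoiseStd and NoiseDecay are assumed new properties
function action = getActionWithExplorationImpl(obj,Observation)
    %given the current observation, select an action and add exploration noise
    action = getAction(obj.actor,Observation);
    action{1} = clip(action{1} + obj.NoiseStd*randn(obj.actionsize,1), -1, 1);
    %decay the noise each step, roughly mirroring StandardDeviationDecayRate
    obj.NoiseStd = obj.NoiseStd*(1 - obj.NoiseDecay);
end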
5 comments
Drew Davis on 22 Aug 2025
Your original actor gradients look generally correct. While your modified actor gradient looks conceptually correct, it will be horribly inefficient because it does not take advantage of vectorized gradient operations.
Also, your critic operations are not consistent with https://www.mathworks.com/help/reinforcement-learning/ug/td3-agents.html. Specifically, you re-compute the targets for each critic loss. In the toolbox implementation, the targets are computed once and then used as the target for each critic loss. This may not make a huge difference, but it is worth pointing out as a difference from the aforementioned doc page; a sketch of that structure is below.
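A minimal sketch of the "compute the target once" structure, reusing the variable names from the question's learnImpl and critic_gradient (critic_gradient_shared is a hypothetical helper, not toolbox code):
%in learnImpl: compute the TD target once (no gradient is needed through it)
target_actions = getAction(obj.target_actor, {nextstate_batch});
%...add clipped target-policy smoothing noise to target_actions{1} here...
Qmin = min(getValue(obj.target_critic1, {nextstate_batch}, target_actions), ...
           getValue(obj.target_critic2, {nextstate_batch}, target_actions));
Qtarget = reward_batch + 0.99*Qmin.*(1 - done_batch); %same target for both critics
%each critic is then regressed against the same target
criticgrad1 = dlfeval(@critic_gradient_shared, obj.critic1, {state_batch}, {action_batch}, Qtarget);
criticgrad2 = dlfeval(@critic_gradient_shared, obj.critic2, {state_batch}, {action_batch}, Qtarget);

%hypothetical helper: loss of one critic against a precomputed target
function criticgradient = critic_gradient_shared(critic, states, actions, Qtarget)
    Qpred = getValue(critic, states, actions);
    criticLoss = 0.5*mean((Qtarget - Qpred).^2);
    criticgradient = dlgradient(criticLoss, critic.Learnables);
end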
You can sample mini-batches at every step, but that would not be consistent with the official toolbox implementation (LearningFrequency defaults to -1, indicating that learning occurs at the end of each episode). If you want to change your implementation to sample mini-batches at every step, consider setting the following on the toolbox agent options to have an apples-to-apples comparison:
agentOptions.LearningFrequency = 1
If you are not convinced your gradients are computed correctly, I suggest creating some simple test cases with simple actor and critic networks for which you can hand-derive the exact analytical gradients. That way you can compare your loss-function gradients against the expected gradients and debug as needed; a small sketch of such a check is below.
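For example, a minimal gradient check of the sort suggested above (my own sketch, not toolbox code): for a single fully connected layer whose bias initializes to zero, Q = W*x, and the gradient of the loss 0.5*(y - Q)^2 with respect to W is -(y - Q)*x', which you can compare against what dlgradient returns.
%minimal sketch of a hand-checked gradient (assumes Deep Learning Toolbox)
layers = [featureInputLayer(3)
          fullyConnectedLayer(1)]; %bias initializes to zero, so Q = W*x at creation
net = dlnetwork(layers);
x = dlarray(rand(3,1),"CB"); %one observation
y = 1.5;                     %made-up target
[loss,grad] = dlfeval(@lossAndGrad,net,x,y);
%hand-derived gradient for the fully connected weights
W = extractdata(net.Learnables.Value{1});
gradW_hand = -(y - W*extractdata(x))*extractdata(x)';
disp(max(abs(extractdata(grad.Value{1}) - gradW_hand),[],"all")) %should be ~0 (single precision)

function [loss,grad] = lossAndGrad(net,x,y)
    Q = forward(net,x);
    loss = 0.5*(y - Q).^2;
    grad = dlgradient(loss,net.Learnables);
end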
You can also use this example as inspiration for writing your loss functions, since the DDPG and TD3 loss functions are generally the same (TD3 has +1 critic, and target action noise excitation). 
Good luck with your thesis
Drew