48 views (last 30 days)

Show older comments

EDIT: Can you at least suggest me where to find the specific matlab function that implements the LM algorithm?

Hi all,

for personal research reasons I'm trying to implement from scratch the training algorithms featured in deep learning toolbox. Everything went well with first order algorithms such as standard backpropagation gradient descent. My custom code converges to the exact same weights/biases than the matlab's one and it's also twice faster. I suppose that the higher speed is due to the greater number of if/else instructions matlab needs to do compared with my code that is suited only for one layered networks with tanh activations.

But I'm having problems with the Levenberg Marquardt. I wrote 3 functions:

- Jacobian_LM
- UpdatesThroughHessianAndGradient
- Levenberg_Marquardt_Backpropagation

The first one just computes the "per-example" Jacobian of the network WRT weights and biases through the standard backpropagation algorithm. It means that it outputs matrices in which every column is a gradient relative to a single input/output couple.

The second one flattens the Jacobians to a single long vector and computes the hessian matrix H and the gradient D that are needed for the LM update rule:

Then it reshapes and splits again the updates into the specific weights and biases matrices of the network with the correct size and shape.

Finally the third one is an implementation of the LM loop that at every iteration updates weights and biases of a standard matlab network object (with one hidden layer and tanh activation) until a performance improvement is achieved through an increase/decrase pattern of mu.

Here is the functions you need to try my algorithm:

function [Net] = Levenberg_Marquardt_Backpropagation(Net,x,y,mu,mu_increase_rate,mu_decrease_rate,max_mu,iterations)

%UNTITLED Summary of this function goes here

% Detailed explanation goes here

%NETWORK WEIGHTS\BIASES STORAGE

IW = Net.IW{1,1};

Ib = Net.b{1,1};

LW = Net.LW{2,1};

Lb = Net.b{2,1};

%%%%%%%%%%%%%%%%%

for i = 1:iterations

while mu <= max_mu && mu > 1e-20

%PREV-PERFORMANCE COMPUTATION

Pred = LW*tansig(IW*x + Ib) + Lb;

Prev_Perf = mean((y-Pred).^2);

%%%%%%%%%%%%%%%%%%%%%%%%

%PREVIOUS WEIGHTS\BIASES STORAGE

Prev_IW = IW;

Prev_Ib = Ib;

Prev_LW = LW;

Prev_Lb = Lb;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%GRADIENT\HESSIAN COMPUTATION

[IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y);

[IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%WEIGHTS\BIASES UPDATE

IW = IW + IWUpdate;

Ib = Ib + IbUpdate;

LW = LW + LWUpdate';

Lb = Lb + LbUpdate;

%%%%%%%%%%%

%PERFORMANCE COMPUTATION

Pred = LW*tansig(IW*x + Ib) + Lb;

Perf = mean((y-Pred).^2);

%%%%%%%%%%%%%%%%%%%%%%%%

%PERFORMANCE CHECK

if(Perf >= Prev_Perf)

IW = Prev_IW;

Ib = Prev_Ib;

LW = Prev_LW;

Lb = Prev_Lb;

mu = mu*mu_increase_rate;

else

mu = mu*mu_decrease_rate;

break;

end

%%%%%%%%%%%%%%%%%%

end

end

%FINAL NET UPDATE

Net.IW{1,1} = IW;

Net.LW{2,1} = LW;

Net.b{1,1} = Ib;

Net.b{2,1} = Lb;

%%%%%%%%%%%

end

function [IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu)

%UNTITLED Summary of this function goes here

% Detailed explanation goes here

s1 = size(IWJ,1);

s2 = size(IWJ,2);

s3 = size(IbJ,1);

s4 = size(LWJ,1);

s5 = size(LbJ,1);

s6 = size(IWJ,3);

Jac = nan(s1*s2 + s3 + s4 + s5,s6);

Jac(1:s1*s2,:) = reshape(IWJ,s1*s2,s6);

Jac(s1*s2+1:s1*s2+s3,:) = IbJ;

Jac(s1*s2+s3+1:s1*s2+s3+s4,:) = LWJ;

Jac(s1*s2+s3+s4+1:s1*s2+s3+s4+s5,:) = LbJ;

H = (Jac*Jac')/s6;

D = mean(Jac.*(Pred - y),2);

Update_Tot = -pinv(H + mu*eye(size(H,1)), min(H(:))/1000)*D;

IWUpdate = reshape(Update_Tot(1:s1*s2),s1,s2);

IbUpdate = Update_Tot(s1*s2+1:s1*s2+s3);

LWUpdate = Update_Tot(s1*s2+s3+1:s1*s2+s3+s4);

LbUpdate = Update_Tot(s1*s2+s3+s4+1:s1*s2+s3+s4+s5);

end

function [IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y) %#ok<INUSL>

%UNTITLED Summary of this function goes here

% Detailed explanation goes here

%FORWARD PASS

zI = IW*x + Ib;

aI = tansig(zI);

%zL = LW*aI + Lb;

%aL = zL;

%%%%%%%%%%%%%%

%BACKPROPAGATION

deltaLW = ones(1,size(y,2));

deltaIW = (1 - aI.^2).*LW'.*deltaLW;

%%%%%%%%%%%%%%%%

%JACOBIAN COMPUTATION

IWJ = nan(size(deltaIW,1),size(x,1),size(x,2));

for i = 1:size(x,2)

IWJ(:,:,i) = deltaIW(:,i).*x(:,i)';

end

IbJ = deltaIW;

LWJ = aI.*deltaLW;

LbJ = deltaLW;

%%%%%%%%%%%%%%%%%%%%%

end

And here's a script that you can just copy/paste to test the code (if you have the file above in your working directory of course):

rng(0)

clear

format long

warning('off')

%DEFINE A SIMPLE PROBLEM

x = rand(2,1000)*10;

x = x-5;

y = x(1,:).^2 + x(2,:).^2;

x = x/10;

%%%%%%%%%%%%%%%%%%%%%%%%

%DEFINE TRAINING PARAMETERS

disp(" ")

disp(" ")

Initial_Mu = 0.001;

Incr_Rate = 10;

Decr_Rate = 0.1;

Max_Mu = 1e10;

Epochs = 1000;

Hidden_Neurons = 20;

disp(['Initial_Mu: ',num2str(Initial_Mu)]);

disp(['Incr_Rate: ',num2str(Incr_Rate)]);

disp(['Decr_Rate: ',num2str(Decr_Rate)]);

disp(['Hidden_Neurons: ',num2str(Hidden_Neurons)]);

%%%%%%%%%%%%%%%%%%%%%%%%%%%

%DEFINE AND INITIALIZE A NETWORK

net = feedforwardnet(Hidden_Neurons);

net.trainParam.epochs = Epochs;

net.trainParam.mu_dec = Decr_Rate;

net.trainParam.mu_inc = Incr_Rate;

net.trainParam.mu = Initial_Mu;

net.trainParam.mu_max = Max_Mu;

%net.trainParam.showWindow = false;

net.inputs{1,1}.processFcns = {};

net.outputs{1,2}.processFcns = {};

net.trainParam.min_grad = 1e-25;

net.trainParam.max_fail = 50;

net.divideParam.trainRatio = 1;

net.divideParam.valRatio = 0;

net.divideParam.testRatio = 0;

net = configure(net,x,y);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%USE THAT INITIALIZED NETWORK FOR TRAINING WITH MATLAB'S TRAINLM

disp(" ")

disp(" ")

netMATLAB = net;

tic

netMATLAB = train(netMATLAB,x,y);

disp("Matlab time:")

toc

disp(" ")

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%USE THE SAME INITIALIZED NETWORK FOR CUSTOM TRAINING

netCUSTOM = net;

tic

[netCUSTOM] = Levenberg_Marquardt_Backpropagation(netCUSTOM,x,y,Initial_Mu,Incr_Rate,Decr_Rate,Max_Mu,Epochs);

disp("Custom time:")

toc

disp(" ")

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%COMPARE RESULTS

Pred_Matlab = netMATLAB(x);

Pred_Custom = netCUSTOM(x);

disp("Absolute difference between Matlab and Custom outputs");

disp(mean(abs(Pred_Matlab - Pred_Custom)));

disp("Matlab MAE")

disp(mean(abs(Pred_Matlab - y)))

disp("Custom MAE")

disp(mean(abs(Pred_Custom - y)))

%%%%%%%%%%%%%%%%

This script compares both execution time and convergence.

When you run the script above you get results like following.

Here's some examples with different settings:

With 3 neurons the 2 algorithms (trainlm and my custom LM) converges to similar (yet not identical) networks:

Initial_Mu: 0.001

Incr_Rate: 10

Decr_Rate: 0.1

Hidden_Neurons: 3

Matlab time:

Elapsed time is 0.793722 seconds.

Custom time:

Elapsed time is 1.711315 seconds.

Absolute difference between Matlab and Custom outputs

7.440071774738044e-07

Matlab MAE

6.179026991132736

Custom MAE

6.179027091263226

With 5 neurons the results are very different. For some reason my algorithm converges to a better network:

Initial_Mu: 0.001

Incr_Rate: 10

Decr_Rate: 0.1

Hidden_Neurons: 5

Matlab time:

Elapsed time is 0.883287 seconds.

Custom time:

Elapsed time is 1.980380 seconds.

Absolute difference between Matlab and Custom outputs

0.007647618697004

Matlab MAE

0.013716265278847

Custom MAE

0.006069555043366

With 10 neurons the networks are pretty much the same:

Initial_Mu: 0.001

Incr_Rate: 10

Decr_Rate: 0.1

Hidden_Neurons: 10

Matlab time:

Elapsed time is 1.011172 seconds.

Custom time:

Elapsed time is 2.766820 seconds.

Absolute difference between Matlab and Custom outputs

1.992397943695323e-08

Matlab MAE

0.030143779923111

Custom MAE

0.030143774946144

With 50 neurons the results are still very similar:

Initial_Mu: 0.001

Incr_Rate: 10

Decr_Rate: 0.1

Hidden_Neurons: 50

Matlab time:

Elapsed time is 2.985799 seconds.

Custom time:

Elapsed time is 7.052290 seconds.

Absolute difference between Matlab and Custom outputs

8.268088436125254e-13

Matlab MAE

2.147009752994717e-04

Custom MAE

2.147009753220598e-04

With 100 neurons, again, the results are close:

Initial_Mu: 0.001

Incr_Rate: 10

Decr_Rate: 0.1

Hidden_Neurons: 100

Matlab time:

Elapsed time is 8.071735 seconds.

Custom time:

Elapsed time is 17.729801 seconds.

Absolute difference between Matlab and Custom outputs

3.318731955914700e-13

Matlab MAE

6.768720617779367e-05

Custom MAE

6.768720614557056e-05

Finally with 200 neurons there's a significant difference:

Initial_Mu: 0.001

Incr_Rate: 10

Decr_Rate: 0.1

Hidden_Neurons: 200

Matlab time:

Elapsed time is 23.504194 seconds.

Custom time:

Elapsed time is 49.679209 seconds.

Absolute difference between Matlab and Custom outputs

1.711683279275178e-07

Matlab MAE

2.712217358494099e-04

Custom MAE

2.712337472282703e-04

BUT, if you force mu to be constant by settings the increase and decrease rate to 1, the results becomes close even with 200 neurons:

Initial_Mu: 0.001

Incr_Rate: 1

Decr_Rate: 1

Hidden_Neurons: 200

Matlab time:

Elapsed time is 16.398573 seconds.

Custom time:

Elapsed time is 24.675034 seconds.

Absolute difference between Matlab and Custom outputs

7.406640634144424e-12

Matlab MAE

0.020155109646983

Custom MAE

0.020155109647200

Observations and questions:

- My algorithm is way slower than trainlm. It's about 2 times slower. There's some clear bottleneck that I'm missing in my code?
- It seems there's some implementation difference between trainlm and my algo. They're definitely not the same. Similar but not the same.
- My first guess is that matlab uses some clever way to invert the hessian matrix. I just used the "inv" function and disabled all the warnings. How matlab invert the hessian matrix exactly? Does it use the Penrose pseudoinverse with some smart tolerance settings?
- Considering that with a costant mu the two algos produce very similar outputs my second guess is that the difference is in the mu update pattern, probably I use a slightly different way to update the mu parameter. At every iteration I just keep updating mu with a while loop until a performance improvement is achieved. How exactly matlab updates the parameter mu through the iterations?
- Does matlab use some implicit minimal mu to limit the mu updates like the max_mu parameter?

I apologize for the length of this post and thank you in advance

Matt J
on 23 Dec 2019

Edited: Matt J
on 23 Dec 2019

My first guess is that matlab uses some clever way to invert the hessian matrix. I just used the "inv" function and disabled all the warnings. How matlab invert the hessian matrix exactly?

It probably doesn't invert the Hessian at all. A complete matrix inversion is a very inefficient way to solve a set of linear equations. More likely, it uses mldivide() or linsolve(). Compare:

>> N=6000; A=rand(N); b=rand(N,1);

>> tic; A\b; toc

Elapsed time is 2.510903 seconds.

>> tic; inv(A)*b; toc

Elapsed time is 10.053053 seconds.

EDIT:

At every iteration I just keep reducing mu with a while loop until a performance improvement is achieved. How exactly matlab updates the parameter mu through the iterations?

I don't think you want to be reducing mu in pursuit of better descent. That will make the inversion more and more singular. You want to be increasing mu instead. The mechanism by which trainlm does this is described in the trainlm documentation, in particular

"The adaptive value mu is increased by mu_inc until the change above results in a reduced performance value. The change is then made to the network and mu is decreased by mu_dec."

Matt J
on 23 Dec 2019

Edited: Matt J
on 23 Dec 2019

AHMED FAKHRI
on 1 Dec 2020

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!