
lbfgsupdate

Update parameters using limited-memory BFGS (L-BFGS)

Since R2023a

    Description

    Update the network learnable parameters in a custom training loop using the limited-memory BFGS (L-BFGS) algorithm.

    The L-BFGS algorithm [1] is a quasi-Newton method that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. The L-BFGS algorithm is best suited for small networks and data sets that you can process in a single batch.

    Note

    This function applies the L-BFGS optimization algorithm to update network parameters in custom training loops that use networks defined as dlnetwork objects or model functions. If you want to train a network defined as a Layer array or as a LayerGraph, use the following functions:

    • Create an SGDM, Adam, or RMSProp training options object using the trainingOptions function.

    • Use the options object with the trainNetwork function, as in the sketch after this list.
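
    A minimal sketch of that workflow, assuming a Layer array named layers and in-memory feature data XTrain with targets TTrain (all placeholders, not defined on this page):

    % Sketch of the trainingOptions/trainNetwork workflow described in the note.
    % The solver choice and option values are illustrative assumptions.
    options = trainingOptions("adam", ...   % "sgdm", "adam", or "rmsprop"
        MaxEpochs=30, ...
        Verbose=false);
    net = trainNetwork(XTrain,TTrain,layers,options);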


    [netUpdated,solverStateUpdated] = lbfgsupdate(net,lossFcn,solverState) updates the learnable parameters of the network net using the L-BFGS algorithm with the specified loss function and solver state. Use this syntax in a training loop to iteratively update a network defined as a dlnetwork object.

    [parametersUpdated,solverStateUpdated] = lbfgsupdate(parameters,lossFcn,solverState) updates the learnable parameters in parameters using the L-BFGS algorithm with the specified loss function and solver state. Use this syntax in a training loop to iteratively update the learnable parameters of a network defined as a function.
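
    As a sketch of this second syntax, the code below stores the learnable parameters in a structure of dlarray objects and defines the model and loss as functions. The parameter names and sizes, the sample data X and T, and the modelLoss2 helper are illustrative assumptions, not part of this reference page.

    % Sketch only: update a model defined as a function by passing a structure
    % of learnable parameters to lbfgsupdate. Names and sizes are assumptions.
    numFeatures = 4;
    numClasses = 3;
    numObservations = 8;
    
    X = dlarray(rand(numFeatures,numObservations),"CB");   % sample predictors
    T = zeros(numClasses,numObservations);                 % sample one-hot targets
    T(1,:) = 1;
    
    parameters.fc.Weights = dlarray(0.01*randn(numClasses,numFeatures));
    parameters.fc.Bias = dlarray(zeros(numClasses,1));
    
    lossFcn = @(parameters) dlfeval(@modelLoss2,parameters,X,T);
    solverState = lbfgsState;
    
    for iteration = 1:30
        [parameters,solverState] = lbfgsupdate(parameters,lossFcn,solverState);
    end
    
    function [loss,gradients] = modelLoss2(parameters,X,T)
    Y = fullyconnect(X,parameters.fc.Weights,parameters.fc.Bias);
    Y = softmax(Y);
    loss = crossentropy(Y,T);
    gradients = dlgradient(loss,parameters);
    end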

    Examples


    Load the iris flower dataset.

    [XTrain, TTrain] = iris_dataset;

    Convert the predictors to a dlarray object with format "CB" (channel, batch).

    XTrain = dlarray(XTrain,"CB");

    Define the network architecture.

    numInputFeatures = size(XTrain,1);
    numClasses = size(TTrain,1);
    numHiddenUnits = 32;
    
    layers = [
        featureInputLayer(numInputFeatures)
        fullyConnectedLayer(numHiddenUnits)
        reluLayer
        fullyConnectedLayer(numHiddenUnits)
        reluLayer
        fullyConnectedLayer(numClasses)
        softmaxLayer];
    
    net = dlnetwork(layers);

    Define the modelLoss function, listed in the Model Loss Function section of the example. This function takes as input a neural network, input data, and targets. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.

    The lbfgsupdate function requires a loss function with the syntax [loss,gradients] = f(net). Create a function handle that evaluates modelLoss using dlfeval and parameterizes it to take the network as its single input argument.

    lossFcn = @(net) dlfeval(@modelLoss,net,XTrain,TTrain);

    Initialize an L-BFGS solver state object with a maximum history size of 3 and an initial inverse Hessian approximation factor of 1.1.

    solverState = lbfgsState( ...
        HistorySize=3, ...
        InitialInverseHessianFactor=1.1);

    Train the network for 10 epochs.

    numEpochs = 10;
    for i = 1:numEpochs
        [net, solverState] = lbfgsupdate(net,lossFcn,solverState);
    end

    Model Loss Function

    The modelLoss function takes as input a neural network net, input data X, and targets T. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.

    function [loss, gradients] = modelLoss(net, X, T)
    
    Y = forward(net,X);
    loss = crossentropy(Y,T);
    gradients = dlgradient(loss,net.Learnables);
    
    end

    Input Arguments


    Neural network, specified as a dlnetwork object.

    The function updates the Learnables property of the dlnetwork object. net.Learnables is a table with three variables (the snippet after this list shows how to inspect it):

    • Layer — Layer name, specified as a string scalar.

    • Parameter — Parameter name, specified as a string scalar.

    • Value — Parameter value, specified as a cell array containing a dlarray object.
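
    For illustration, assuming a dlnetwork object named net (such as the one created in the example above), you can inspect this table directly:

    net.Learnables            % table with Layer, Parameter, and Value variables
    net.Learnables.Value{1}   % dlarray containing the first parameter value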

    Learnable parameters, specified as a dlarray object, a numeric array, a cell array, a structure, or a table.

    If you specify parameters as a table, it must contain these variables:

    • Layer — Layer name, specified as a string scalar.

    • Parameter — Parameter name, specified as a string scalar.

    • Value — Parameter value, specified as a cell array containing a dlarray object.

    You can specify parameters as a container of learnable parameters for your network using a cell array, structure, or table, or using nested cell arrays or structures. The learnable parameters inside the cell array, structure, or table must be dlarray objects or numeric values with the data type double or single.

    If parameters is a numeric array, then lossFcn must not use the dlgradient function.
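
    As an illustration of the numeric-array case, the sketch below minimizes a quadratic objective whose gradients are computed analytically inside the loss function, so neither dlgradient nor dlfeval is needed. The objective, its dimensions, and the quadraticLoss helper are assumptions for illustration only.

    % Sketch only: numeric parameters with analytically computed gradients.
    A = randn(5);
    A = A'*A + eye(5);       % symmetric positive definite matrix (assumption)
    b = randn(5,1);
    
    parameters = zeros(5,1);
    lossFcn = @(p) quadraticLoss(p,A,b);
    solverState = lbfgsState;
    
    for iteration = 1:20
        [parameters,solverState] = lbfgsupdate(parameters,lossFcn,solverState);
    end
    
    function [loss,gradients] = quadraticLoss(p,A,b)
    % Loss 0.5*p'*A*p - b'*p with analytic gradient A*p - b.
    loss = 0.5*(p'*A*p) - b'*p;
    gradients = A*p - b;
    end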

    Loss function, specified as a function handle or an AcceleratedFunction object with the syntax [loss,gradients] = f(net), where loss and gradients correspond to the loss and gradients of the loss with respect to the learnable parameters, respectively.

    To parameterize a model loss function that has a call to the dlgradient function, specify the loss function as @(net) dlfeval(@modelLoss,net,arg1,...,argN), where modelLoss is a function with the syntax [loss,gradients] = modelLoss(net,arg1,...,argN) that returns the loss and gradients of the loss with respect to the learnable parameters in net given arguments arg1,...,argN.

    If parameters is a numeric array, then the loss function must not use the dlgradient or dlfeval functions.

    If the loss function has more than two outputs, also specify the NumLossFunctionOutputs argument.

    Data Types: function_handle

    Solver state, specified as an lbfgsState object or [].

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: lbfgsupdate(net,lossFcn,solverState,LineSearchMethod="strong-wolfe") updates the learnable parameters in net and searches for a learning rate that satisfies the strong Wolfe conditions.

    Method to find suitable learning rate, specified as one of these values:

    • "weak-wolfe" — Search for a learning rate that satisfies the weak Wolfe conditions. This method maintains a positive definite approximation of the inverse Hessian matrix.

    • "strong-wolfe" — Search for a learning rate that satisfies the strong Wolfe conditions. This method maintains a positive definite approximation of the inverse Hessian matrix.

    • "backtracking" — Search for a learning rate that satisfies sufficient decrease conditions. This method does not maintain a positive definite approximation of the inverse Hessian matrix.

    Maximum number of line search iterations to determine learning rate, specified as a positive integer.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Number of loss function outputs, specified as an integer greater than or equal to two. Set this option when lossFcn has more than two output arguments.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
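
    For example, if the loss function also returns the network state as a third output, you might set this argument as in the sketch below; the modelLossWithState helper and the XTrain and TTrain variables are assumptions.

    % Sketch: loss function with three outputs (loss, gradients, network state).
    lossFcn = @(net) dlfeval(@modelLossWithState,net,XTrain,TTrain);
    [net,solverState] = lbfgsupdate(net,lossFcn,solverState, ...
        NumLossFunctionOutputs=3);
    
    function [loss,gradients,state] = modelLossWithState(net,X,T)
    [Y,state] = forward(net,X);
    loss = crossentropy(Y,T);
    gradients = dlgradient(loss,net.Learnables);
    end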

    Output Arguments


    Updated network, returned as a dlnetwork object.

    The function updates the Learnables property of the dlnetwork object.

    Updated learnable parameters, returned as an object with the same type as parameters.

    Updated solver state, returned as an lbfgsState object.

    Algorithms


    Limited-Memory BFGS

    The L-BFGS algorithm [1] is a quasi-Newton method that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. The L-BFGS algorithm is best suited for small networks and data sets that you can process in a single batch.

    The algorithm updates learnable parameters W at iteration k+1 using the update step given by

    $W_{k+1} = W_k - \eta_k B_k^{-1} \nabla J(W_k),$

    where $W_k$ denotes the weights at iteration $k$, $\eta_k$ is the learning rate at iteration $k$, $B_k$ is an approximation of the Hessian matrix at iteration $k$, and $\nabla J(W_k)$ denotes the gradients of the loss with respect to the learnable parameters at iteration $k$.

    The L-BFGS algorithm computes the matrix-vector product $B_k^{-1} \nabla J(W_k)$ directly. The algorithm does not require computing the inverse of $B_k$.

    To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix $B$. Instead, the algorithm uses the approximation $B_{k-m}^{-1} \approx \lambda_k I$, where $m$ is the history size, the inverse Hessian factor $\lambda_k$ is a scalar, and $I$ is the identity matrix, and stores only the scalar inverse Hessian factor. The algorithm updates the inverse Hessian factor at each step.

    To compute the matrix-vector product $B_k^{-1} \nabla J(W_k)$ directly, the L-BFGS algorithm uses this recursive algorithm (a MATLAB sketch follows the steps):

    1. Set $r = B_{k-m}^{-1} \nabla J(W_k)$, where $m$ is the history size.

    2. For $i = m, \ldots, 1$:

      1. Let $\beta = \frac{1}{s_{k-i}^{\top} y_{k-i}} \, y_{k-i}^{\top} r$, where $s_{k-i}$ and $y_{k-i}$ are the step and gradient differences for iteration $k-i$, respectively.

      2. Set $r = r + s_{k-i}(a_{k-i} - \beta)$, where $a$ is derived from $s$, $y$, and the gradients of the loss with respect to the learnable parameters. For more information, see [1].

    3. Return $B_k^{-1} \nabla J(W_k) = r$.
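
    The listing below is a sketch of the standard two-loop recursion from [1] that performs this computation. The inputs grad, S, Y, and lambda (the flattened gradient, the stored step and gradient differences as matrix columns ordered oldest first, and the scalar inverse Hessian factor) are assumptions for illustration; lbfgsupdate performs this computation internally.

    function r = lbfgsDirection(grad,S,Y,lambda)
    % Sketch of the standard L-BFGS two-loop recursion [1].
    % Returns r, an approximation of inv(B)*grad.
    m = size(S,2);                 % history size
    alpha = zeros(m,1);
    rho = 1./sum(S.*Y,1)';         % rho(i) = 1/(s_i'*y_i)
    
    q = grad;
    for i = m:-1:1                 % first loop: newest to oldest
        alpha(i) = rho(i)*(S(:,i)'*q);
        q = q - alpha(i)*Y(:,i);
    end
    
    r = lambda*q;                  % scalar initial inverse Hessian lambda*I
    
    for i = 1:m                    % second loop: oldest to newest
        beta = rho(i)*(Y(:,i)'*r);
        r = r + S(:,i)*(alpha(i) - beta);
    end
    end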

    References

    [1] Liu, Dong C., and Jorge Nocedal. "On the Limited Memory BFGS Method for Large Scale Optimization." Mathematical Programming 45, no. 1 (August 1989): 503–528. https://doi.org/10.1007/BF01589116.


    Version History

    Introduced in R2023a