Background Data Dispatch with Custom Training Loop

Pascal Kutschbach on 11 Nov 2020
Edited: Joss Knight on 18 Dec 2020
I have a question about training a deep neural network with MATLAB.
I have built a custom training loop to train a regression network on a machine with 2 GPUs.
The training loop works fine, but it is rather slow compared to the automatic trainNetwork function.
However, trainNetwork does not provide the kind of training progress monitor I like. It also seems to error unpredictably on my machine, and sometimes the networks are not "finished" properly. This is why I use a custom training loop.
I use a parallel pool with 2 workers and a randomPatchExtractionDatastore (which is partitionable). The parallel operations are written in an spmd block.
What would be the best way to dispatch data in the background in a custom training loop?
I have tried scaling up the number of workers in the parallel pool. This leads to some workers being unable to read data, because the datastores are only partitioned according to the number of GPUs, not the number of workers (see the sketch below).
Which operations do I have to assign to the workers that are supposed to preload data?
Has anybody tried a self-written data dispatch in a custom training loop?
Thanks in advance!
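
A minimal sketch of the partitioning pattern described above, assuming ds is the randomPatchExtractionDatastore (all variable names are placeholders). Because the datastore is split into only gpuDeviceCount parts, any worker whose labindex exceeds the number of GPUs ends up with nothing to read:

numGPUs = gpuDeviceCount;
spmd
    if labindex <= numGPUs
        gpuDevice(labindex);                          % bind this worker to one GPU
        workerDs = partition(ds, numGPUs, labindex);  % data exists only on GPU workers
    end
    % workers with labindex > numGPUs never receive a partition
end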

Accepted Answer

Joss Knight on 22 Nov 2020
Use a minibatchqueue with the DispatchInBackground option.
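
A minimal sketch of what that can look like in a custom training loop, assuming ds is the partitionable datastore and net is a dlnetwork; preprocessMiniBatch and modelLoss are hypothetical helpers you would supply. DispatchInBackground requires Parallel Computing Toolbox and a partitionable (or subsettable) datastore, which randomPatchExtractionDatastore is:

mbq = minibatchqueue(ds, ...
    'MiniBatchSize',        64, ...
    'MiniBatchFcn',         @preprocessMiniBatch, ...  % hypothetical helper
    'MiniBatchFormat',      {'SSCB','SSCB'}, ...
    'OutputEnvironment',    'gpu', ...
    'DispatchInBackground', true);  % pool workers preload mini-batches

averageGrad = [];
averageSqGrad = [];
iteration = 0;
for epoch = 1:numEpochs
    shuffle(mbq);
    while hasdata(mbq)
        iteration = iteration + 1;
        [X, T] = next(mbq);  % already preprocessed and on the GPU
        [loss, gradients] = dlfeval(@modelLoss, net, X, T);
        [net, averageGrad, averageSqGrad] = adamupdate(net, gradients, ...
            averageGrad, averageSqGrad, iteration);
    end
end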
4 Comments
Pascal Kutschbach on 25 Nov 2020
This example definitely helps and will solve my issue.
I have not been able to make it run yet, but I can see the idea behind the scheme. I understand that I have to tell each specific worker when and what to send to the other workers in order to mimic communication between the workers. As of now I run into deadlock issues where (I assume) at some point I want to receive data on one worker while there is no data to be received (yet). This probably results from using labSend and labReceive instead of labSendReceive to make use of NVLink communication between the GPUs.
Thanks again for the help!
Joss Knight on 25 Nov 2020
Great! labSend is blocking, so you can't have both workers 3 and 4 call labSend at the same time. You need to choose which one goes first.
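
For example, a sketch of one safe ordering inside spmd, assuming workers 3 and 4 each hold a batch to exchange (myBatch is a placeholder):

spmd
    if labindex == 3
        labSend(myBatch, 4);           % worker 3 sends first ...
        theirBatch = labReceive(4);    % ... then receives
    elseif labindex == 4
        theirBatch = labReceive(3);    % worker 4 receives first ...
        labSend(myBatch, 3);           % ... then sends
    end
end

Alternatively, labSendReceive(partner, partner, myBatch) pairs the send and the receive in a single call, so this kind of deadlock cannot occur.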
