Background Data Dispatch with Custom Training Loop

Pascal Kutschbach on 11 Nov 2020
Edited: Joss Knight on 18 Dec 2020
I have a question about training a deep neural network with MATLAB.
I have built a custom training loop to train a regression network on a machine with 2 GPUs.
The training loop works fine, but it is rather slow compared to the automatic trainNetwork function.
However, trainNetwork does not provide the kind of training progress monitor I like. It also seems to error unpredictably on my machine, and sometimes the networks are not "finished" properly. This is why I use a custom training loop.
I use a parallel pool with 2 workers and a randomPatchExtractionDatastore (which is partitionable). The parallel operations are written in an spmd block.
What would be the best way to dispatch data in the background in a custom training loop?
I have tried scaling up the number of workers in the parallel pool. This leads to some workers being unable to read data, because the datastores are only partitioned according to the number of GPUs, not the number of workers (see the sketch below).
Which operations do I have to assign to the workers that are supposed to preload data?
Has anybody tried a self-written data dispatch in a custom training loop?
Thanks in advance!
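
A minimal sketch of the partitioning pattern described above, assuming ds is the randomPatchExtractionDatastore (all variable names are placeholders). Because the datastore is split into only gpuDeviceCount parts, any worker whose labindex exceeds the number of GPUs ends up with nothing to read:

numGPUs = gpuDeviceCount;
spmd
    if labindex <= numGPUs
        gpuDevice(labindex);                          % bind this worker to one GPU
        workerDs = partition(ds, numGPUs, labindex);  % data exists only on GPU workers
    end
    % workers with labindex > numGPUs never receive a partition
end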

Accepted Answer

Joss Knight on 22 Nov 2020
Use a minibatchqueue with the DispatchInBackground option.
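
A minimal sketch of what that can look like in a custom training loop, assuming ds is the partitionable datastore and net is a dlnetwork; preprocessMiniBatch and modelLoss are hypothetical helpers you would supply. DispatchInBackground requires Parallel Computing Toolbox and a partitionable (or subsettable) datastore, which randomPatchExtractionDatastore is:

mbq = minibatchqueue(ds, ...
    'MiniBatchSize',        64, ...
    'MiniBatchFcn',         @preprocessMiniBatch, ...  % hypothetical helper
    'MiniBatchFormat',      {'SSCB','SSCB'}, ...
    'OutputEnvironment',    'gpu', ...
    'DispatchInBackground', true);  % pool workers preload mini-batches

averageGrad = [];
averageSqGrad = [];
iteration = 0;
for epoch = 1:numEpochs
    shuffle(mbq);
    while hasdata(mbq)
        iteration = iteration + 1;
        [X, T] = next(mbq);  % already preprocessed and on the GPU
        [loss, gradients] = dlfeval(@modelLoss, net, X, T);
        [net, averageGrad, averageSqGrad] = adamupdate(net, gradients, ...
            averageGrad, averageSqGrad, iteration);
    end
end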
4 Comments
Pascal Kutschbach on 25 Nov 2020
This example definitely helps and will solve my issue.
I have not been able to make it run yet, but I can see the idea behind the scheme. I understand that I have to tell each specific worker when and what to send to the other workers in order to mimic communication between the workers. As of now I run into deadlock issues where (I assume) at some point I want to receive data on one worker while there is no data to be received (yet). This probably results from using labSend and labReceive instead of labSendReceive to make use of NVLink communication between the GPUs.
Thanks again for the help!
Joss Knight on 25 Nov 2020
Great! labSend is blocking, so you can't have both workers 3 and 4 call labSend at the same time. You need to choose which one goes first.
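
For example, a sketch of one safe ordering inside spmd, assuming workers 3 and 4 each hold a batch to exchange (myBatch is a placeholder):

spmd
    if labindex == 3
        labSend(myBatch, 4);           % worker 3 sends first ...
        theirBatch = labReceive(4);    % ... then receives
    elseif labindex == 4
        theirBatch = labReceive(3);    % worker 4 receives first ...
        labSend(myBatch, 3);           % ... then sends
    end
end

Alternatively, labSendReceive(partner, partner, myBatch) pairs the send and the receive in a single call, so this kind of deadlock cannot occur.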
