Cannot fully utilize all GPUs during network training

This is about the performance and utilization of the resources on the machine (an Amazon EC2 instance in our case).
I am using a p3.8xlarge instance on Amazon Web Services, which means I have 4 V100 GPUs,
and I am training a neural network.
using:
mdl(i).Net = trainNetwork(trainData(:, :, :, 1: itStep: end), trainLabels(1: itStep: end, :), layers, options);
In options I set the execution environment to 'multi-gpu'.
I also tried 'parallel' and played with the number of workers, but all I see is more processes waiting in the GPU queue in nvidia-smi.
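For reference, the relevant part of my options looks roughly like this (a sketch only; the 'sgdm' solver and the mini-batch size here are placeholders, and the rest of my options are omitted):

```matlab
% Sketch of the training options ('sgdm' and MiniBatchSize are placeholders):
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'multi-gpu', ...  % also tried 'parallel'
    'MiniBatchSize', 256);
```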
For some reason I see that all GPUs are working (see GPU.png), but only for a limited amount of time (very high usage for 3 seconds, then dropping to 0% for at least 10 seconds).
Looking at the htop output (htop.jpg), I see that not all CPU threads are in use, so I think that is the bottleneck.
This instance has a Xeon processor with 32 cores (16 physical, 32 logical).
When I try to utilize all threads through the local profile (profile local pool.png), it still doesn't seem to respond.
I get more workers because of it (CPU?), but the GPUs still don't seem to improve.
I tried increasing the batch size, but at some point the GPU runs out of memory, so that's not the problem.
How do I utilize all cores of the CPU to transfer data to the GPUs?
I read somewhere that you can also load the data onto the pool itself? Will that help?
I use the NGC MATLAB container (https://ngc.nvidia.com/catalog/containers/partners:matlab/tags), tag matlab:r2019a.
I scanned these already:
Would appreciate your help.
Tomer

9 comments

Joss Knight on 31 Dec 2019
Edited: Joss Knight on 31 Dec 2019
It's hard to know where to start here; to be sure we'd need complete reproduction steps with code and data, and for that process I suggest you contact Technical Support.
Without further knowledge of what is going on, I can only advise getting a profile report to see what your MATLAB workers think they are spending their time doing.
Start your pool independently. I'm assuming you are running on some kind of virtual desktop.
parpool('local', gpuDeviceCount);
Then start the mpiprofiler.
spmd
mpiprofile on
end
Now run your training code. Then afterwards generate the profile report.
spmd
mpiprofile viewer
end
You should be able to use this report to generate a PDF. See the help for mpiprofile for more info. mpiprofile was originally intended for use in pmode, but it can be used successfully in spmd in this way.
Analyzing the report you should be able to see where the bottleneck is. Typically in cases like this you'll find that either
  • Your network is too small and there isn't enough work for the GPU to do, so the time is dominated in other activity such as loading and preprocessing input data, plotting, or sending data between workers.
  • Your input data is taking a long time to load. It may be that trainData is huge and is being stored in virtual memory, so indexing into it is taking a huge amount of time. Very large data shouldn't be stored in memory, it should be stored in files and accessed using a Datastore.
  • Some hardware issue means you're not actually using all the GPUs.
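As a quick sanity check for that last point, something along these lines should confirm each pool worker sees a distinct device (a minimal sketch; it just queries the GPU assigned to each worker):

```matlab
% Sanity check: each pool worker should report a distinct GPU.
parpool('local', gpuDeviceCount);
spmd
    g = gpuDevice;   % the GPU selected on this worker
    fprintf('Worker %d -> GPU %d (%s)\n', labindex, g.Index, g.Name);
end
```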
The fact that you are creating multiple networks and storing them in an array is a concern. It either means you are storing a lot of data in memory, or your network is very small (which means it is unlikely to benefit from parallel training). It would be good to see more of the code in which your call to trainNetwork is embedded.
First of all thank you for your answer Joss!
I had a support session with mathworks today and sent them profile of my run.
We ran it on a single GPU but saw the same behaviour (high utilization, then idle for a few seconds). The batch size wasn't too big and memory usage on the GPU seemed reasonable (11 GB out of the 16 GB available).
It seems like your most important point is using a Datastore; I think that's what causes my bottleneck.
If I load all the data into MATLAB (into virtual memory), aside from the fact that this is not good practice, shouldn't it work faster?
I'll consider moving my data into a datastore object; I just wanted to know what its benefits are regarding the performance issue I currently have.
Thanks again,
Tomer
It's just impossible for me to be certain without seeing your code, and if I speculate too much I could just end up confusing you more. It sounds like you're saying that you have multiple gigabytes of data stored in a MATLAB variable X that you are passing to trainNetwork? And we're guessing that maybe X is so enormous that you're ending up in the machine's swap space, which means you're using the disk, but in a way managed by the operating system rather than MATLAB. Frankly, swap is just not designed for this kind of workload and never behaves well with MATLAB. But you haven't yet profiled your MATLAB as I advised, so we don't have any direct evidence that the bottleneck is the point where trainNetwork tries to load a batch of data.
Is this true? Where did X come from? Was it created from data that was originally on disk? Can we get you to store the data somewhere easily accessible by your Amazon instance, such as Amazon S3, and then use a datastore to load batches directly off the disk?
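For instance, something along these lines (a sketch only; the bucket path is made up, and it assumes each .mat file holds one observation in a variable X):

```matlab
% Hypothetical example: read training observations straight from S3
% (the bucket path 's3://my-bucket/training' and variable X are made up).
ds = fileDatastore('s3://my-bucket/training', ...
    'ReadFcn', @(f) getfield(load(f), 'X'), ...
    'FileExtensions', '.mat');
```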
Anyway, as I say, this is just so much speculation.
Tomer Nahshon on 23 Jan 2020
Edited: Tomer Nahshon on 23 Jan 2020
Hey Joss,
I saw a thread that helped me.
I used the exact solution (with mild changes for the dimensions of the .mat files).
Now I can use more data without worrying about my RAM.
I used this function since I have a regression problem, just as stated in the thread.
function CombinedFileDatastore = CreateCombinedFDS(Path)
    inputData  = fileDatastore(Path, 'ReadFcn', @load, 'FileExtensions', '.mat');
    targetData = fileDatastore([Path, '/Labels'], 'ReadFcn', @load, 'FileExtensions', '.mat');
    inputDatat  = transform(inputData,  @(data) rearrange_datastore(data));
    targetDatat = transform(targetData, @(data) rearrange_datastore(data));
    CombinedFileDatastore = combine(inputDatat, targetDatat);
end

function image = rearrange_datastore(data)
    image = data.temp;
    image = {image};
end
Where Path holds all my input "*.mat" files and Path/Labels holds all my Labels.
1) For some reason this datastore is not parallelized. Why? I thought fileDatastores were parallelized.
2) My GPU is still starved. Is that related to the fact that I am using the 'ReadFcn' argument? How can I solve this to use all available threads and not just the main MATLAB thread?
I sent this to MathWorks Support as well.
I attach a short profileinfo.zip with my profiler output, as suggested by MathWorks support.
Thank you,
Tomer
FileDatastore is partitionable but CombinedDatastore isn't in R2019b. Does the DispatchInBackground training option work for your example? If not, we may be talking about using a custom datastore so you can do the file loading and transform on a parallel pool.
Tomer Nahshon on 26 Jan 2020
Edited: Tomer Nahshon on 26 Jan 2020
When I try to set the DispatchInBackground property in the options struct, I get this error after starting training:
"Datastore distribution is only supported for Partitionable datastores."
I encountered this as well.
The documentation says that:
There are some limitations when using datastores with parallel training, multi-GPU training, and background dispatching:
  • Datastores do not support specifying the 'Shuffle' name-value pair argument of trainingOptions as 'none'.
  • Combined datastores are not partitionable and therefore do not support parallel training, multi-GPU training, or background dispatching.
Is there a way to solve my (custom) image regression problem in a more elegant way?
Like that thread, just with a parallelized datastore?
Or do I need to use a custom datastore, as you stated?
I also sent this correspondence to the MathWorks support engineer currently handling my issue.
Thanks a lot Joss, you are very helpful,
Tomer
You need to follow this documentation on writing custom datastores. This will allow you to create a datastore which manually combines your two datastores, and implement the partition method to divide them. You will, of course, need to partition them ensuring that the data and targets are partitioned in the same way - this is what CombinedDatastore can't do for you.
Once you've done this you ought to get DispatchInBackground, parallel training, and both together. It sounds like DispatchInBackground on its own is what you are after, you need to do a bunch of data preparation and you can use a pool to do that in the background for you. It just depends on being able to divide up the data so that each worker has a different piece of the data.
Tomer Nahshon on 29 Jan 2020
Edited: Tomer Nahshon on 29 Jan 2020
Thanks Joss,
So I went through the documentation and started building the datastore.
But how do I define the labels (targets) in this datastore?
Or manually combine them so they act the same as the combined datastore, but with partition support?
Do you have any pointers for that?
Will this work?
Thanks,
Tomer
Joss Knight on 29 Jan 2020
Edited: Joss Knight on 2 Feb 2020
That will work, or just implement an ordinary datastore with the Partitionable mixin, which is the currently advised approach. All you have to do is implement read() to read the data and the targets that go with it, and return a cell array containing both.
"For networks with a single input, the table or cell array returned by the datastore has two columns that specify the network inputs and expected responses, respectively.
For networks with multiple inputs, the datastore must be a combined or transformed datastore that returns a cell array with (numInputs+1) columns containing the predictors and the responses, where numInputs is the number of network inputs and numResponses is the number of responses. For i less than or equal to numInputs, the ith element of the cell array corresponds to the input layers.InputNames(i), where layers is the layer graph defining the network architecture. The last column of the cell array corresponds to the responses."


Accepted Answer

Tomer Nahshon on 2 Feb 2020
Edited: Tomer Nahshon on 2 Feb 2020
OK, problem solved.
As suggested by Joss, MathWorks Support created a custom datastore inheriting the mixins involved in this procedure.
The labels are a numeric vector passed as an input to the datastore constructor, but they could be loaded from a .mat file as well.
For training, apparently I needed to use the 'parallel' execution environment and set the DispatchInBackground training option to true (probably because I use the AWS cloud service).
classdef matFilesDatastore < matlab.io.Datastore & ...
        matlab.io.datastore.Shuffleable & ...
        matlab.io.datastore.Partitionable

    properties
        Datastore
        Labels
        ReadSize
    end

    properties(SetAccess = protected)
        NumObservations
    end

    properties(Access = private)
        CurrentFileIndex
    end

    methods
        function ds = matFilesDatastore(folder, labels)
            % ds = matFilesDatastore(folder, labels) creates a datastore
            % from the data in folder and labels.

            % Create file datastore.
            fds = fileDatastore(folder, ...
                'ReadFcn', @readData, ...
                'IncludeSubfolders', true);
            ds.Datastore = fds;
            numObservations = numel(fds.Files);

            % Labels.
            ds.Labels = labels;

            % Initialize datastore properties.
            ds.ReadSize = 1;
            ds.NumObservations = numObservations;
            ds.CurrentFileIndex = 1;
        end

        function tf = hasdata(ds)
            % tf = hasdata(ds) returns true if more data is available.
            tf = ds.CurrentFileIndex + ds.ReadSize - 1 ...
                <= ds.NumObservations;
        end

        function [data, info] = read(ds)
            % [data,info] = read(ds) reads one mini-batch of data.
            miniBatchSize = ds.ReadSize;
            info = struct;
            for i = 1:miniBatchSize
                predictors{i, 1} = read(ds.Datastore);
                responses{i, 1} = ds.Labels(ds.CurrentFileIndex);
                ds.CurrentFileIndex = ds.CurrentFileIndex + 1;
            end
            data = table(predictors, responses);
        end

        function reset(ds)
            % reset(ds) resets the datastore to the start of the data.
            reset(ds.Datastore);
            ds.CurrentFileIndex = 1;
        end

        function dsNew = shuffle(ds)
            % dsNew = shuffle(ds) shuffles the files and the
            % corresponding labels in the datastore.

            % Create a copy of the datastore.
            dsNew = copy(ds);
            dsNew.Datastore = copy(ds.Datastore);
            fds = dsNew.Datastore;

            % Shuffle files and corresponding labels.
            numObservations = dsNew.NumObservations;
            idx = randperm(numObservations);
            fds.Files = fds.Files(idx);
            dsNew.Labels = dsNew.Labels(idx);
        end

        function subds = partition(ds, numPartitions, idx)
            subds = copy(ds);
            subds.Datastore = partition(ds.Datastore, numPartitions, idx);
            subds.NumObservations = numel(subds.Datastore.Files);
            indices = pigeonHole(idx, numPartitions, ds.NumObservations);
            subds.Labels = ds.Labels(indices);
            reset(subds);
        end
    end

    methods(Access = protected)
        function n = maxpartitions(ds)
            n = ds.NumObservations;
        end
    end

    methods(Hidden = true)
        function frac = progress(ds)
            % frac = progress(ds) returns the fraction of observations
            % read from the datastore.
            frac = (ds.CurrentFileIndex - 1) / ds.NumObservations;
        end
    end
end

function data = readData(filename)
% data = readData(filename) reads the data from the MAT file filename.
S = load(filename);
data = S.image;
% label = S.label;
end

function observationIndices = pigeonHole(partitionIndex, numPartitions, numObservations)
%pigeonHole Helper function that maps a partition index and numPartitions
% to the corresponding observation indices.
observationIndices = floor((0:numObservations - 1) * numPartitions / numObservations) + 1;
observationIndices = find(observationIndices == partitionIndex);
% Convert to a column vector if observationIndices is empty.
if isempty(observationIndices)
    observationIndices = double.empty(0, 1);
end
end

3 comments

Bravo! Looks good.
You need to use the 'parallel' option if your compute cluster is remote from the client. When your client MATLAB is running on your local machine and your cluster is on EC2, that's your situation. However, it sounded as though you were renting a machine on EC2 and then logging into it with a virtual desktop. In that case the client and cluster are in the same remote location, the pool is 'local', and you can use 'multi-gpu'. In practice, there is basically no difference. The 'multi-gpu' option exists to help people think of multi-GPU training as just a special performance optimization and could theoretically be implemented in future using threads, but really all it does differently is make sure your pool is local.
Thanks Joss,
Just one last question:
I still notice that I don't use all my available threads, although it works faster. I guess that's because I'm using a custom 'ReadFcn' and can't avoid it?
Correct?
Thanks,
Tomer
It could be a number of reasons. With a custom ReadFcn there's no prefetching, so that will limit CPU core usage to one per MATLAB. But also you have 4 GPUs and if they become the compute bottleneck, your CPU cores will be waiting regardless.
You ought to be able to get all the CPU cores working with the DispatchInBackground training option. But that would preclude you using all the GPUs as well. Ideally you would use both DispatchInBackground and multi-gpu training, but I don't think that will work with your custom datastore. To get that going you're going to need to use the MiniBatchable mixin and PartitionableByIndex - because this feature needs to be able to divide your data by index.
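A skeleton of what that would look like (a sketch only; the class name is hypothetical, method bodies are illustrative, and the other required Datastore methods such as hasdata, reset, and progress are omitted):

```matlab
% Sketch: the shape of a datastore that supports DispatchInBackground
% together with parallel training (bodies omitted; names hypothetical).
classdef myMiniBatchDatastore < matlab.io.Datastore & ...
        matlab.io.datastore.MiniBatchable & ...
        matlab.io.datastore.PartitionableByIndex

    properties
        MiniBatchSize        % required by the MiniBatchable mixin
    end
    properties(SetAccess = protected)
        NumObservations      % required by the MiniBatchable mixin
    end

    methods
        function [data, info] = read(ds)
            % Return a table with MiniBatchSize rows of predictors
            % and responses.
        end
        function subds = partitionByIndex(ds, indices)
            % Return a copy of ds restricted to the given observation
            % indices; this is what lets the feature divide the data.
        end
    end
end
```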


More Answers (0)

Version

R2019a

Asked: 22 Dec 2019

Commented: 17 Feb 2020
