trainNetwork error: unable to read file

Hi all,
I am learning to train a convolutional network for image classification in the cloud. As a first step, I am following the example named "Train Network in the Cloud Using Automatic Parallel Support" on the MathWorks website.
I have started my cluster successfully and uploaded the CIFAR-10 image library to my Amazon S3 bucket.
I then successfully create the datastore using:
imdsTrain = imageDatastore('s3://mybucket/cifar10/train', ...
'IncludeSubfolders',true, ...
'LabelSource','foldernames');
My problem comes at the training level, where I use:
options = trainingOptions('sgdm', ...
'ExecutionEnvironment','parallel', ... % Turn on automatic parallel support.
'InitialLearnRate',initialLearnRate, ... % Set the initial learning rate.
'MiniBatchSize',miniBatchSize, ... % Set the MiniBatchSize.
'Verbose',true, ... % Print training progress to the command window.
'Plots','training-progress', ... % Turn on the training progress plot.
'L2Regularization',1e-10, ...
'MaxEpochs',50, ...
'Shuffle','every-epoch', ...
'ValidationData',imdsTest, ...
'ValidationFrequency',floor(numel(imdsTrain.Files)/miniBatchSize), ...
'LearnRateSchedule','piecewise', ...
'LearnRateDropFactor',0.1, ...
'LearnRateDropPeriod',45);
net = trainNetwork(augmentedImdsTrain,layers,options);
Training starts, and the progress display shows the message "Initializing input data normalization".
However, it stops quickly with the error message:
Error in test_parallel_cloud (line 77)
net = trainNetwork(augmentedImdsTrain,layers,options);
Caused by:
Error using nnet.internal.cnn.DistributedDispatcher/computeInParallel (line 193)
Error detected on worker 1.
Error using matlab.io.datastore.ImageDatastore/read (line 77)
Unable to read file: 's3://mybucket/cifar10/train/deer/image35398.png'.
Error using matlab.io.datastore/DsFileReader (line 113)
Could not find file : s3://mybucket/cifar10/train/deer/image35398.png
Every time I rerun the code, it stops on a different image that it cannot read. However, the image is always in the bucket and does not seem to be corrupt when I check it using imshow.
Can you see where the problem is?

7 comments

Fred on 4 Apr 2020
Also, I have no problem starting the network training on my local CPU with the library saved in a local folder. It really seems the problem comes from the data being in the cloud.
It would help to remove trainNetwork from the equation and just see whether your data can be accessed on the cloud:
spmd
data = read(imdsTrain);
end
If this doesn't work then by far the most likely scenario is that you do not have the correct authentication on the cloud instances.
You are right, running your suggestion doesn't work and returns the same error:
Error detected on worker 1.
Caused by:
Error using matlab.io.datastore.ImageDatastore/read (line 77)
Unable to read file: 's3://rumi.test.1/train/airplane/image10009.png'.
Error using matlab.io.datastore/DsFileReader (line 113)
Could not find file : s3://rumi.test.1/train/airplane/image10009.png
The weird thing, though, is that this image is the first one in the imageDatastore, and I can read it and even display it nicely using:
img=imdsTrain.readimage(1)
imshow(img)
If the problem is due to my credentials, wouldn't we expect the readimage function to generate an error as well?
Thanks for the help.
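A note on the question above: readimage is called from the client MATLAB session, whereas the spmd block calls read on the cluster workers, so one can succeed while the other fails if the workers lack credentials that the client has. A minimal sketch for checking this, assuming the credentials are supplied through the standard AWS environment variables and that a parallel pool is already open:

```matlab
% Names of the environment variables assumed to carry the AWS credentials.
vars = {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_DEFAULT_REGION'};

% Check on the client. Only presence is reported, so that the secret
% values themselves are never printed.
for k = 1:numel(vars)
    fprintf('Client: %s set = %d\n', vars{k}, ~isempty(getenv(vars{k})));
end

% Run the same check on every worker of the current pool.
spmd
    for idx = 1:numel(vars)
        fprintf('Worker %d: %s set = %d\n', labindex, vars{idx}, ...
            ~isempty(getenv(vars{idx})));
    end
end
```

If the client reports 1 and the workers report 0, the workers are most likely missing the credentials.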
Fred on 8 Apr 2020
Hi,
Thanks for guiding me in the right direction. The problem is solved by setting the environment variables in the parpool function.
Fred
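For later readers, the fix described above probably amounts to copying the client's AWS environment variables to the workers when the pool is opened. This is a minimal sketch, not necessarily the exact code Fred used. The profile name 'MyCloudCluster' and the credential values are placeholders, and the 'EnvironmentVariables' option of parpool is assumed to be available in your release:

```matlab
% Placeholder values: replace with your own credentials and region.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');
setenv('AWS_DEFAULT_REGION', 'YOUR_AWS_DEFAULT_REGION');

% Close any pool that was started without the variables.
delete(gcp('nocreate'));

% 'MyCloudCluster' is a hypothetical cluster profile name. The listed
% variables are copied from the client to every worker when the pool starts.
pool = parpool('MyCloudCluster', 'EnvironmentVariables', ...
    {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_DEFAULT_REGION'});
```

With the pool opened this way, rerunning the spmd read test is a quick way to confirm that the workers can reach the bucket before calling trainNetwork.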
Joss Knight on 8 Apr 2020
Ah good. That was going to be my next suggestion!
Fouzia Adjailia on 1 May 2020
Hello,
I'm having a similar problem to yours, and I would highly appreciate it if you could help me.
I created an image datastore with a customized read function called @formoccupancygrid. When I run my code in parallel, I get this error:
Error using classifyData (line 33)
Error detected on worker 1.
Caused by:
Error using matlab.io.datastore.ImageDatastore/readall (line 42)
Error using ReadFcn @UNKNOWN Function for file
D:\--*******************************
Undefined function handle.
I solved this problem using parfevalOnAll, which executes the function on all the workers. After that, I got another error, which states that the files don't exist. I added the files to the attached files, and their folder to the additional paths, in the Cluster Profile Manager, but with no luck.
Looking forward to your reply.
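One possible way to make a custom ReadFcn visible to the workers is to attach its file to the running pool. This is a minimal sketch, assuming the function lives in a file named formoccupancygrid.m on the client path (the file name is inferred from the function handle above, not confirmed):

```matlab
% Get the current pool (one is created if none exists).
pool = gcp;

% Send the ReadFcn source file to all workers. The file name is an
% assumption based on the handle @formoccupancygrid.
addAttachedFiles(pool, {'formoccupancygrid.m'});

% Confirm that each worker can now resolve the function (exist returns 2
% for a file on the MATLAB path).
spmd
    fprintf('Worker %d sees ReadFcn: %d\n', labindex, ...
        exist('formoccupancygrid', 'file') == 2);
end
```

Attaching the function does not make the image files themselves reachable. A local path such as one on D:\ is only readable by workers that can access that drive, so for remote workers the data has to be in a location they can reach.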
Daniel Csata on 29 Oct 2022
Hi!
I just ran into this exact same problem. Could you please tell me exactly how you solved it with the parpool function? It seems that it didn't work for me, or I did something wrong.
Thank you,
Daniel


Answers (1)

Harsha Priya Daggubati on 7 Apr 2020

1 comment

Fred on 7 Apr 2020
Hi,
Thanks for the help!
Yes, I carefully followed all the steps mentioned, one by one.
The only deviation is that I had to set the number of workers to 1 instead of 8. That is because AWS limits the number of vCPUs I can use, and the instance I am using (p2.xlarge) has only one GPU.
The problem occurs when running the trainNetwork function on the "train a network in the cloud using a built-in parallel support" page.
Fred


Asked: 4 Apr 2020
Commented: 29 Oct 2022
