Some issues with running parallel MATLAB jobs on a cluster
Michael
on 31 Jan 2019
Commented: Xiaoyang Guo
on 24 Apr 2020
I am trying to run parallel jobs on a cluster.
I launch the MATLAB engine from Python using
matlab.engine.start_matlab()
I submit the Python jobs using Slurm.
Some parallel jobs work, or work for the first several function evaluations.
Other parallel jobs do not work and show the following messages:
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job12.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job44.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job92.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
Starting parallel pool (parpool) using the 'local' profile ...
After that, there is no message like "connected to 4 workers", and the running time is approximately 4 times slower.
Some jobs work perfectly at first, but after some MATLAB evaluations the parallel pool no longer seems to work. They show the following output:
Starting parallel pool (parpool) using the 'local' profile ...
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
Starting parallel pool (parpool) using the 'local' profile ...
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
connected to 4 workers.
best so far in the initial data -0.366481187141
EI, 1th job, 0th iteration, func=overlap, q=4
EI takes 0.589984893799 seconds
EI suggests points:
[[ 5.62950456e-33 3.93339029e-32 1.00000000e-01 1.00000000e-01]
[ 4.47912054e-02 1.10230209e-30 6.65211795e-02 1.00000000e-01]
[ 1.39665061e-33 7.55063601e-33 5.47105991e-31 8.32740650e-33]
[ 2.35956500e-32 1.00000000e-01 1.00000000e-01 1.00000000e-01]]
evaluating takes 6.06351319949 mins
evaluating takes capital 1.0 so far
retraining the model takes 6.38158202171 seconds
But after some MATLAB evaluations, only "Starting parallel pool (parpool) using the 'local' profile ..." appears, without "connected to 4 workers", and the running time is almost 4 times slower.
EI, VOI 0.0, best so far -0.920065378916
EI, 1th job, 44th iteration, func=overlap, q=4
EI takes 1.0830039978 seconds
EI suggests points:
[[ 0.03138878 0.09221666 0.06362966 0.00471924]
[ 0.09320626 0.08688824 0.05820342 0.06846487]
[ 0.04571523 0.00076893 0.01246081 0.00150557]
[ 0.00917403 0.06416236 0.07597119 0.09062772]]
Starting parallel pool (parpool) using the 'local' profile ...
evaluating takes 26.8635237853 mins
evaluating takes capital 45.0 so far
retraining the model takes 1807.75688004 seconds
Accepted Answer
Michael
on 3 Feb 2019
Edited: Walter Roberson
on 3 Feb 2019
4 comments
Guilherme Salvador Vieira
on 5 Apr 2020
Solved my problem as well! Thanks for sharing this link; I was not aware of this overwriting problem between different pools.
Xiaoyang Guo
on 24 Apr 2020
Seems promising... I need two or three days to see whether this works.
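The link from the accepted answer is not reproduced in this thread, so the following is only a guess at the kind of fix the comments refer to. If the "overwriting problem between different pools" comes from several Slurm jobs sharing the default `local_cluster_jobs` folder, a common workaround is to give each job its own `JobStorageLocation` before opening the pool. A minimal sketch under that assumption follows; the helper names `job_storage_dir`, `matlab_pool_setup`, and `start_engine_with_private_pool`, the `matlab_jobs_` folder prefix, and the use of the temporary directory as a base are all hypothetical choices.

```python
import os
import tempfile


def job_storage_dir(base=None, env=None):
    """Return a directory path that is unique to this Slurm job (or process)."""
    env = os.environ if env is None else env
    base = tempfile.gettempdir() if base is None else base
    # Slurm sets SLURM_JOB_ID inside a job; fall back to the PID elsewhere.
    job_id = env.get("SLURM_JOB_ID", "pid%d" % os.getpid())
    return os.path.join(base, "matlab_jobs_%s" % job_id)


def matlab_pool_setup(storage_dir, workers=4):
    """Build MATLAB commands that open a 'local' pool with private job storage."""
    quoted = storage_dir.replace("'", "''")  # escape quotes for a MATLAB char literal
    return (
        "pc = parcluster('local'); "
        "pc.JobStorageLocation = '%s'; "
        "parpool(pc, %d);" % (quoted, workers)
    )


def start_engine_with_private_pool(workers=4):
    """Start MATLAB and open the pool; needs the MATLAB Engine API for Python."""
    import matlab.engine  # imported lazily so the helpers above run without MATLAB

    storage = job_storage_dir()
    os.makedirs(storage, exist_ok=True)  # the storage folder must exist beforehand
    eng = matlab.engine.start_matlab()
    eng.eval(matlab_pool_setup(storage, workers), nargout=0)
    return eng


if __name__ == "__main__":
    # Show the MATLAB commands that would be sent for a hypothetical job ID.
    demo_dir = job_storage_dir(base="/scratch", env={"SLURM_JOB_ID": "123"})
    print(matlab_pool_setup(demo_dir))
```

Opening the pool explicitly this way means a later `parfor` reuses it rather than auto-creating one with the shared default storage folder. Whether a node-local temporary directory or a shared scratch path is the better base depends on the cluster.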