Failing to run a reinforcement learning job on the cluster

Ahmad Momani on 14 May 2024
Commented: Edric Ellis on 15 May 2024
I have a custom reinforcement learning environment in which I train an agent using the SAC algorithm. The training runs smoothly on my desktop with four cores, but attempting to speed up the process on the university cluster has been unsuccessful. Below is some information about the job. Can this issue be resolved?
>> (jobRL5)
jobRL5 =
  Job
    Properties:
      ID: 103
      Type: pool
      Username: amomani1
      State: failed
      SubmitDateTime: 13-May-2024 18:25:52
      StartDateTime: 13-May-2024 18:27:12
      RunningDuration: 0 days 13h 39m 28s
      NumWorkersRange: [11 11]
      NumThreads: 2
      AutoAttachFiles: true
      Auto Attached Files: List files
      AttachedFiles: R:\amomani1\matlabcodes_SI_2023a\talbot_inversion.m
                     R:\amomani1\matlabcodes_SI_2023a\talbot_inversion2.m
                     R:\amomani1\matlabcodes_SI_2023a\talbotcode.m
      AutoAddClientPath: true
      AdditionalPaths: \\lightning.bu.binghamton.edu\matlab\nonshared\23a\IntegrationScripts\spiedie
                       \\lightning.bu.binghamton.edu\matlab\nonshared\23a
                       C:\Users\amomani1\Documents\MATLAB
                       C:\Users\amomani1\AppData\Local\Temp\8\Editor_retgg
      FileStore: [1x1 parallel.FileStore]
      ValueStore: [1x1 parallel.ValueStore]
      EnvironmentVariables: {}
    Associated Tasks:
      Number Pending: 0
      Number Running: 0
      Number Finished: 11
      Task ID of Errors: []
      Task ID of Warnings: []
      Task Scheduler IDs: 4857192
>> c.getDebugLog(jobRL5)
LOG FILE OUTPUT:
Node file: compute[078,162]
Starting SMPD on compute078 compute162 ...
srun --ntasks-per-node=1 --ntasks=2 /cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -debug 0 &
Checking that SMPD processes are running (Attempt 1 of 60)
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute078 > /dev/null 2>&1
No SMPD process running on compute078
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute162 > /dev/null 2>&1
No SMPD process running on compute162
Checking that SMPD processes are running (Attempt 2 of 60)
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute078 > /dev/null 2>&1
SMPD process found running on compute078
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute162 > /dev/null 2>&1
SMPD process found running on compute162
All SMPDs launched
Machine args: -hosts 2 compute078 6 compute162 6
"/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_mpiexec" -smpd -phrase MATLAB -port 27192 -l -hosts 2 compute078 6 compute162 6 -genvlist PARALLEL_SERVER_DECODE_FUNCTION,PARALLEL_SERVER_STORAGE_LOCATION,PARALLEL_SERVER_STORAGE_CONSTRUCTOR,PARALLEL_SERVER_JOB_LOCATION,PARALLEL_SERVER_DEBUG,PARALLEL_SERVER_LICENSE_NUMBER,MLM_WEB_LICENSE,MLM_WEB_USER_CRED,MLM_WEB_ID,TZ,MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,MDCE_DEBUG,MDCE_LICENSE_NUMBER "/cm/shared/apps/Mathworks-MPS/2023a/bin/worker" -parallel
job aborted:
rank: node: exit code[: error message]
0: compute078: -2
1: compute078: -2
2: compute078: -2
3: compute078: -2
4: compute078: -2
5: compute078: -2
6: compute162: -2
7: compute162: -2
8: compute162: -2
9: compute162: -2
10: compute162: -2
11: compute162: 1: process 11 exited without calling finalize
Stopping SMPD ...
srun --ntasks-per-node=1 --ntasks=2 /cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -shutdown -phrase MATLAB -port 27192
Exiting with code: 123
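The rank table in the log above follows the pattern `rank: node: exit code[: error message]`. As a rough aid for reading it, the sketch below tallies the exit codes per node and picks out any rank that carries an error message. It is written in Python purely as a text-processing illustration; the helper names (`parse_ranks`, `ranks_with_messages`, `codes_by_node`) are hypothetical, the table is pasted in by hand, and the script does not interpret what the exit codes mean.

```python
import re
from collections import defaultdict

# Rank table copied verbatim from the debug log above.
LOG = """\
0: compute078: -2
1: compute078: -2
2: compute078: -2
3: compute078: -2
4: compute078: -2
5: compute078: -2
6: compute162: -2
7: compute162: -2
8: compute162: -2
9: compute162: -2
10: compute162: -2
11: compute162: 1: process 11 exited without calling finalize
"""

# Matches "rank: node: exit code[: error message]".
LINE = re.compile(r"^\s*(\d+):\s*(\S+):\s*(-?\d+)(?::\s*(.*))?$")


def parse_ranks(text):
    """Return a list of (rank, node, exit_code, message) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            rank, node, code, msg = m.groups()
            rows.append((int(rank), node, int(code), msg or ""))
    return rows


def ranks_with_messages(rows):
    """Keep only the ranks that reported an error message."""
    return [row for row in rows if row[3]]


def codes_by_node(rows):
    """Group exit codes by the node each rank ran on."""
    grouped = defaultdict(list)
    for _rank, node, code, _msg in rows:
        grouped[node].append(code)
    return dict(grouped)


rows = parse_ranks(LOG)
for rank, node, code, msg in ranks_with_messages(rows):
    print(f"rank {rank} on {node}: exit {code}: {msg}")
for node, codes in codes_by_node(rows).items():
    print(f"{node}: {len(codes)} ranks, exit codes {sorted(set(codes))}")
```

For this log, the only rank with a message is rank 11 on compute162; the other eleven ranks all report exit code -2, split across both nodes.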
1 comment
Edric Ellis on 15 May 2024
It looks like you aren't getting as far as running any sort of job on the cluster. Contact MathWorks support; they can help sort out this kind of problem.


Answers (0)
