Troubleshooting and Debugging

Object Data Size Limitations

The size of data transfers among the parallel computing objects is limited by the Java® Virtual Machine (JVM™) memory allocation. This limit applies to each single transfer of data between the client and workers in any job that uses an MJS cluster. The approximate limit depends on your system architecture:

System Architecture    Maximum Data Size Per Transfer (approx.)
64-bit                 2.0 GB
32-bit                 600 MB
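As a rough pre-flight check, you can compare the in-memory size of a variable against the limit before you send it to the workers. The following is a minimal sketch, not an official procedure: the variable name data and the 2.0 GB threshold (taken from the 64-bit row of the table) are illustrative assumptions, and the size that whos reports is the in-memory size, which might differ from the size of the data once it is prepared for transfer.

```matlab
% Illustrative payload: a 1000-by-1000 double matrix (8 bytes per element).
data = rand(1000);

% whos reports the in-memory size of the variable in bytes.
info = whos('data');
sizeMB = info.bytes / 2^20;

% Assumed threshold: the approximate 64-bit limit from the table above.
limitMB = 2.0 * 1024;

if sizeMB > limitMB
    fprintf('data is %.1f MB, above the approximate limit of %.0f MB.\n', sizeMB, limitMB);
else
    fprintf('data is %.1f MB, within the approximate limit of %.0f MB.\n', sizeMB, limitMB);
end
```

If a variable is close to the limit, consider splitting it into several smaller transfers.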

File Access and Permissions

Ensuring That Workers on Windows Operating Systems Can Access Files

By default, a worker on a Windows® operating system is installed as a service running as LocalSystem, so it does not have access to mapped network drives.

Often a network is configured to not allow services running as LocalSystem to access UNC or mapped network shares. In this case, you must run the mdce service under a different user with rights to log on as a service. See the section Set the User in the MATLAB® Distributed Computing Server™ System Administrator's Guide.

Task Function Is Unavailable

If a worker cannot find the task function, it returns the following error message:

Error using ==> feval
      Undefined command/function 'function_name'.

The worker that ran the task did not have access to the function function_name. One solution is to make sure the location of the function's file, function_name.m, is included in the job's AdditionalPaths property. Another solution is to transfer the function file to the worker by adding function_name.m to the AttachedFiles property of the job.
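The two solutions described above might look like the following sketch. It assumes a default cluster profile, a task function function_name that takes no inputs and returns one output, and a hypothetical shared folder /shared/project/functions that the workers can see. In practice you would normally choose one of the two options rather than both.

```matlab
c = parcluster();          % default cluster profile (assumption)
job = createJob(c);

% Option 1: the file already exists in a folder that the workers can reach.
% '/shared/project/functions' is a hypothetical folder name.
job.AdditionalPaths = {'/shared/project/functions'};

% Option 2: copy the file from the client to the workers along with the job.
job.AttachedFiles = {'function_name.m'};

createTask(job, @function_name, 1, {});   % one output, no inputs (assumption)
submit(job);
```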

Load and Save Errors

If a worker cannot save or load a file, you might see error messages such as:

??? Error using ==> save
Unable to write file myfile.mat: permission denied.
??? Error using ==> load
Unable to read file myfile.mat: No such file or directory.

In determining the cause of this error, consider the following questions:

  • What is the worker's current folder?

  • Can the worker find the file or folder?

  • What user is the worker running as?

  • Does the worker have permission to read or write the file in question?
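One way to answer these questions is to run a small diagnostic job whose tasks report what the worker sees. This is a sketch under assumptions: it uses the default cluster profile, the file name myfile.mat from the messages above, and the USERNAME environment variable, which is normally set on Windows (on UNIX systems, USER is the usual equivalent).

```matlab
c = parcluster();          % default cluster profile (assumption)
job = createJob(c);

createTask(job, @pwd, 1, {});                        % worker's current folder
createTask(job, @exist, 1, {'myfile.mat', 'file'});  % 2 if the worker finds the file, 0 if not
createTask(job, @getenv, 1, {'USERNAME'});           % account name on Windows; try 'USER' on UNIX
createTask(job, @fileattrib, 2, {'myfile.mat'});     % status flag, then attributes or a message

submit(job);
wait(job);

% Display the outputs of each task in turn.
for k = 1:numel(job.Tasks)
    disp(job.Tasks(k).OutputArguments);
end
```

Because each task might run on a different worker, the answers can differ from task to task on a cluster whose machines are not configured identically.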

Tasks or Jobs Remain in Queued State

A job or task might get stuck in the queued state. To investigate the cause of this problem, check the scheduler's logs:

  • Platform LSF® schedulers might send emails with error messages.

  • Microsoft® Windows HPC Server (including CCS), LSF®, PBS Pro®, and TORQUE save output messages in a debug log. See the getDebugLog reference page.

  • If using a generic scheduler, make sure the submit function redirects error messages to a log file.

Possible causes of the problem are:

  • The MATLAB worker failed to start because of licensing errors, because the executable is not on the default path on the worker machine, or because MATLAB is not installed in the location where the scheduler expected it to be.

  • MATLAB could not read or write the job input or output files in the scheduler's job storage location. The storage location might not be accessible to all the worker nodes, or the user that MATLAB runs as might not have permission to read or write the job files.

  • If using a generic scheduler:

    • The environment variable MDCE_DECODE_FUNCTION was not defined before the MATLAB worker started.

    • The decode function was not on the worker's path.
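For the generic scheduler case, the two conditions above can be checked with a few lines of MATLAB. This sketch is only meaningful when it runs in a MATLAB session that was started in the same environment the scheduler uses to start the worker, which is an assumption you need to arrange yourself (for example, from within your submit script).

```matlab
% Read the variable that names the decode function.
decodeFcn = getenv('MDCE_DECODE_FUNCTION');

if isempty(decodeFcn)
    disp('MDCE_DECODE_FUNCTION is not set in this environment.');
elseif isempty(which(decodeFcn))
    % The variable is set, but MATLAB cannot locate the function.
    fprintf('%s is set but is not on the MATLAB path.\n', decodeFcn);
else
    fprintf('Decode function found: %s\n', which(decodeFcn));
end
```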

No Results or Failed Job

Task Errors

If your job returned no results (that is, fetchOutputs(job) returns an empty cell array), the job probably failed and some of its tasks have their Error properties set.

You can use the following code to identify tasks with error messages:

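% Collect the ErrorMessage property of every task in the job.
errmsgs = get(yourjob.Tasks, {'ErrorMessage'});
% Keep only the tasks that actually reported an error.
nonempty = ~cellfun(@isempty, errmsgs);
celldisp(errmsgs(nonempty));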

This code displays the nonempty error messages of the tasks found in the job object yourjob.

Debug Logs

If you are using a supported third-party scheduler, you can use the getDebugLog function to read the debug log from the scheduler for a particular job or task.

For example, find the failed job on your LSF scheduler, and read its debug log:

c = parcluster('my_lsf_profile');
failedjob = findJob(c, 'State', 'failed');   % all jobs in the failed state
message = getDebugLog(c, failedjob(1))       % debug log of the first failed job

Connection Problems Between the Client and MJS

For testing connectivity between the client machine and the machines of your compute cluster, you can use Admin Center. For more information about Admin Center, including how to start it and how to test connectivity, see Start Admin Center and Test Connectivity in the MATLAB Distributed Computing Server documentation.

Detailed instructions for other methods of diagnosing connection problems between the client and MJS can be found in some of the Bug Reports listed on the MathWorks Web site.

The following sections can help you identify the general nature of some connection problems.

Client Cannot See the MJS

If you cannot locate or connect to your MJS with parcluster, the most likely reasons for this failure are:

  • The MJS is currently not running.

  • Firewalls do not allow traffic from the client to the MJS.

  • The client and the MJS are not running the same version of the software.

  • The client and the MJS cannot resolve each other's short hostnames.

  • The MJS is using a nondefault BASE_PORT setting as defined in the mdce_def file, and the Host property in the cluster profile does not specify this port.

MJS Cannot See the Client

If a warning message says that the MJS cannot open a TCP connection to the client computer, the most likely reasons are:

  • Firewalls do not allow traffic from the MJS to the client.

  • The MJS cannot resolve the short hostname of the client computer. Use pctconfig to change the hostname that the MJS will use for contacting the client.
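A minimal sketch of the pctconfig call follows. The host name client01.example.com and the profile name my_mjs_profile are placeholders, not values from your installation. Substitute a name or address that the MJS host can resolve, and call pctconfig before you connect to the cluster.

```matlab
% 'client01.example.com' is a placeholder for a name the MJS host can resolve.
cfg = pctconfig('hostname', 'client01.example.com');

% Connect only after the host name is set. The profile name is hypothetical.
c = parcluster('my_mjs_profile');
```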

SFTP Error: Received Message Too Long

The example code for generic schedulers with non-shared file systems contacts an sftp server to handle the file transfer to and from the cluster's file system. This use of sftp is subject to all the normal sftp vulnerabilities. One problem that can occur results in an error message similar to this:

Caused by:
    Error using ==> RemoteClusterAccess>RemoteClusterAccess.waitForChoreToFinishOrError at 780
    The following errors occurred in the 
         com.mathworks.toolbox.distcomp.clusteraccess.UploadFilesChore:
     Could not send Job3.common.mat for job 3: 
     One of your shell's init files contains a command that is writing to stdout,
        interfering with sftp. Access help
     com.mathworks.toolbox.distcomp.remote.spi.plugin.SftpExtraBytesFromShellException: 
     One of your shell's init files contains a command that is writing to stdout, 
        interfering with sftp.
     Find and wrap the command with a conditional test, such as

    	if ($?TERM != 0) then
    		if ("$TERM" != "dumb") then
    			/your command/
    		endif
    	endif

     : 4: Received message is too long: 1718579037

The telling symptom is the phrase "Received message is too long:" followed by a very large number.

The sftp server starts a shell, usually bash or tcsh, to set your standard read and write permissions appropriately before transferring files. The server initializes the shell in the standard way, calling files like .bashrc and .cshrc. This problem happens if your shell emits text to standard output when it starts. That text is transferred back to the sftp client running inside MATLAB, which interprets the first bytes of the text as the size of the sftp server's response message.

To work around this error, locate the shell startup file code that is emitting the text, and either remove it or bracket it within if statements that check whether the sftp server is starting the shell:
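The very large number can help you find the offending command. The sketch below assumes that the sftp client reads the message size as a 4-byte unsigned integer with the most significant byte first, and decodes the number from the error message above back into the four characters that the shell printed.

```matlab
% Number reported after "Received message is too long:" in the error above.
n = uint32(1718579037);

% Split the 32-bit value into four bytes, most significant byte first.
bytes = bitand(bitshift(n, [-24 -16 -8 0]), 255);

% Interpret the bytes as ASCII characters.
leadingText = char(bytes)   % displays 'foo]'
```

Here the startup output began with the characters foo], so searching your shell startup files for a command that prints text beginning that way is a reasonable starting point.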

if ($?TERM != 0) then
    if ("$TERM" != "dumb") then
        /your command/
    endif
endif

You can test this outside of MATLAB with a standard UNIX or Windows sftp command-line client before trying again in MATLAB. If the problem is not fixed, the error message persists:

> sftp yourSubmitMachine
Connecting to yourSubmitMachine...
Received message too long 1718579042

If the problem is fixed, you should see:

> sftp yourSubmitMachine
Connecting to yourSubmitMachine...