Further Notes on Communicating Jobs
Number of Tasks in a Communicating Job
Although you create only one task for a communicating job, the system copies this
task for each worker that runs the job. For example, if a communicating job runs on
four workers, the Tasks
property of the job contains four task
objects. The first task in the job's Tasks
property corresponds
to the task run by the worker whose spmdIndex
is 1
, and so on, so that the
ID
property for the task object and
spmdIndex
for the worker that ran that task have the same
value. Therefore, the sequence of results returned by the fetchOutputs
function corresponds to
the value of spmdIndex
and to the order of tasks in the job's
Tasks
property.
Avoid Deadlock and Other Dependency Errors
Because code running in one worker for a communicating job can block execution
until some corresponding code executes on another worker, the potential for deadlock
exists in communicating jobs. This is most likely to occur when transferring data
between workers or when making code dependent upon the
spmdIndex
in an if
statement. Some
examples illustrate common pitfalls.
Suppose you have a codistributed array D
, and you want to use
the gather
function to assemble the
entire array in the workspace of a single
worker.
if spmdIndex == 1 assembled = gather(D); end
The reason this fails is because the gather
function requires
communication between all the workers across which the array is distributed. When
the if
statement limits execution to a single worker, the other
workers required for execution of the function are not executing the statement. As
an alternative, you can use gather
itself to collect the data
into the workspace of a single worker: assembled = gather(D,
1)
.
In another example, suppose you want to transfer data from every worker to the
next worker on the right (defined as the next higher
spmdIndex
). First you define for each worker what the workers
on the left and right
are.
from_worker_left = mod(spmdIndex - 2, spmdSize) + 1; to_worker_right = mod(spmdIndex, spmdSize) + 1;
Then try to pass data around the ring.
spmdSend (outdata, to_worker_right); indata = spmdReceive(from_worker_left);
The reason this code might fail is because, depending on the size of the data
being transferred, the spmdSend
function can block execution in a worker until the
corresponding receiving worker executes its spmdReceive
function. In this case, all the workers are attempting
to send at the same time, and none are attempting to receive while
spmdSend
has them blocked. In other words, none of the
workers get to their spmdReceive
statements because they are
all blocked at the spmdSend
statement. To avoid this particular
problem, you can use the spmdSendReceive
function.