Improve `parfor` Performance

You can improve the performance of `parfor`-loops in several ways: creating arrays in parallel on the workers inside the loop, profiling `parfor`-loops, slicing arrays, and optimizing your code on local workers before running on a cluster.

Where to Create Arrays

When you create a large array in the client before your `parfor`-loop and access it within the loop, the array must be transferred to the workers, which can slow execution of your code. To improve performance, tell each MATLAB® worker to create its own arrays, or the portions of them it needs, in parallel inside the loop. Doing so saves the time of transferring the data from client to workers. Therefore, consider changing your usual practice of initializing variables before a `for`-loop; you might find that creating arrays in parallel inside the loop improves performance.
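As a minimal sketch of this idea (the array size and variable names here are illustrative, not taken from a specific example), compare transferring a client-created array with creating the array on the workers:

```
% Client-side creation: B is built once on the client and must be
% transferred to every worker that runs loop iterations.
B = rand(1e6, 1);
parfor i = 1:100
    r1(i) = sum(B) + i;
end

% Worker-side creation: each worker builds its own copy of the array
% inside the loop, so no transfer from the client is needed.
parfor i = 1:100
    B = rand(1e6, 1);   % created on the worker
    r2(i) = sum(B) + i;
end
```

Whether the second form is faster depends on the factors listed below, such as how long the array takes to create compared to how long it takes to transfer.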

Performance improvement depends on several factors, including:

• size of the arrays

• time needed to create arrays

• number of loop iterations that each worker performs

Consider all of these factors when deciding whether to convert a `for`-loop to a `parfor`-loop. For more details, see Convert for-Loops Into parfor-Loops.

As an alternative, consider using the `parallel.pool.Constant` function to establish variables on the pool workers before the loop. Such variables remain on the workers after the loop finishes and stay available for subsequent `parfor`-loops. Using `parallel.pool.Constant` can improve performance because the data is transferred to the workers only once.

In this example, you first create a large data set `D` and execute a `parfor`-loop that accesses `D`. Then you use `D` to build a `parallel.pool.Constant` object, which copies `D` to each worker once so that later `parfor`-loops can reuse the data. Measure the elapsed time of each case using `tic` and `toc`, and note the difference.

```
function constantDemo
    D = rand(1e7, 1);
    tic
    for i = 1:20
        a = 0;
        parfor j = 1:60
            a = a + sum(D);
        end
    end
    toc
    tic
    D = parallel.pool.Constant(D);
    for i = 1:20
        b = 0;
        parfor j = 1:60
            b = b + sum(D.Value);
        end
    end
    toc
end
```
```
>> constantDemo
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Elapsed time is 63.839702 seconds.
Elapsed time is 10.194815 seconds.
```
In the second case, you send the data to the workers only once, which improves the performance of the `parfor`-loop.
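`parallel.pool.Constant` also accepts a function handle, which each worker evaluates locally. This variant builds the data directly on the workers, so nothing is transferred from the client at all (a sketch; the array size is illustrative):

```
% Each worker calls the function handle once to build its own copy of
% the data; no data is transferred from the client.
C = parallel.pool.Constant(@() rand(1e7, 1));
parfor j = 1:60
    s(j) = sum(C.Value);
end
```

Note that with a function handle each worker gets an independently generated array, so this form suits data that workers can construct for themselves, such as lookup tables or preallocated buffers.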

Profiling `parfor`-loops

You can profile a `parfor`-loop by measuring the elapsed time with `tic` and `toc`. You can also measure how much data is transferred to and from the workers in the parallel pool by using `ticBytes` and `tocBytes`. Note that this differs from profiling MATLAB code in the usual sense with the MATLAB profiler; see Profile Your Code to Improve Performance.

This example calculates the spectral radius of a matrix and converts a `for`-loop into a `parfor`-loop. Measure the resulting speedup and the amount of data transferred.

1. In the MATLAB Editor, enter the following `for`-loop. Add `tic` and `toc` to measure the time elapsed. Save the file as `MyForLoop.m`.

```
function a = MyForLoop(A)
    tic
    for i = 1:200
        a(i) = max(abs(eig(rand(A))));
    end
    toc
end
```
2. Run the code, and note the elapsed time.

`a = MyForLoop(500);`
`Elapsed time is 31.935373 seconds.`

3. In `MyForLoop.m`, replace the `for`-loop with a `parfor`-loop. Add `ticBytes` and `tocBytes` to measure how much data is transferred to and from the workers in the parallel pool. Save the file as `MyParforLoop.m`.

```
ticBytes(gcp);
parfor i = 1:200
    a(i) = max(abs(eig(rand(A))));
end
tocBytes(gcp)
```

4. Run the new code, and then run it again. The first run is slower than the second because the parallel pool has to be started and the code has to be made available to the workers. Note the elapsed time for the second run.

By default, MATLAB automatically opens a parallel pool of workers on your local machine.

`a = MyParforLoop(500);`
```
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
...
             BytesSentToWorkers    BytesReceivedFromWorkers
             __________________    ________________________
    1        15340                 7024
    2        13328                 5712
    3        13328                 5704
    4        13328                 5728
    Total    55324                 24168

Elapsed time is 10.760068 seconds.
```
The elapsed time is 31.9 seconds in serial and 10.8 seconds in parallel, which shows that this code benefits from conversion to a `parfor`-loop.

Slicing Arrays

If a variable is initialized before a `parfor`-loop and then used inside it, the variable has to be passed to each MATLAB worker that evaluates loop iterations. Only the variables used inside the loop are passed from the client workspace. However, if every occurrence of the variable is indexed by the loop variable, each worker receives only the part of the array it needs; such a variable is called a sliced variable.

As an example, you first run a `parfor`-loop using a sliced variable and measure the elapsed time.

```
% Sliced version
M = 100;
N = 1e6;
data = rand(M, N);

tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ N;
end
toc
```
`Elapsed time is 2.261504 seconds.`

Now suppose that you accidentally reference the variable `data` instead of `N` inside the `parfor`-loop. The problem is that the call to `size(data, 2)` converts the sliced variable `data` into a broadcast (non-sliced) variable.

```
% Accidentally non-sliced version
clear
M = 100;
N = 1e6;
data = rand(M, N);

tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ size(data, 2);
end
toc
```
`Elapsed time is 8.369071 seconds.`
Note that the elapsed time is greater for the accidentally broadcast variable.

In this case, you can easily avoid the non-sliced use of `data`, because the result of `size(data, 2)` is a constant that you can compute before the loop. In general, you can perform computations that depend only on broadcast data before the loop starts, because broadcast data cannot be modified inside the loop. Here the computation is trivial and yields a scalar, so you benefit from moving it out of the loop.
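Applying this advice to the example above, compute the loop-constant value once before the loop so that `data` stays sliced (the variable name `cols` is illustrative):

```
% Corrected version: the loop-constant value is computed once before
% the loop, so data remains a sliced variable inside the parfor-loop.
M = 100;
N = 1e6;
data = rand(M, N);
cols = size(data, 2);   % computed on the client, before the loop

tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ cols;
end
toc
```

With `cols` a scalar broadcast value, each worker again receives only the rows of `data` it needs.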

Optimizing on Local vs. Cluster Workers

Running your code on local workers offers the convenience of testing your application without using cluster resources. However, local workers have limitations. Because data transfer does not occur over a network, transfer behavior on local workers might not be representative of how transfers occur over a network to cluster workers.

With local workers, all MATLAB worker sessions run on the same machine, so you might not see any improvement in execution time from a `parfor`-loop. This depends on many factors, including how many processors and cores your machine has. The key point is that a cluster might have more cores available than your local machine: if MATLAB can multithread your code, then the only way to go faster is to apply more cores to the problem by using a cluster.

You might experiment to see whether it is faster to create the arrays before the loop (the first example below) or to have each worker create its own arrays inside the loop (the second example).

Try the following examples while running a parallel pool locally, and notice the difference in execution time for each loop. First open a local parallel pool:

`parpool('local')`

Run the following examples, and then run them again. The first run of each case is slower than the second because the parallel pool has to be started and the code has to be made available to the workers. Note the elapsed time of the second run for each case.

```
% Create the arrays before the loop
tic;
n = 200;
M = magic(n);
R = rand(n);
parfor i = 1:n
    A(i) = sum(M(i,:).*R(n+1-i,:));
end
toc
```

```
% Create the arrays inside the loop
tic;
n = 200;
parfor i = 1:n
    M = magic(n);
    R = rand(n);
    A(i) = sum(M(i,:).*R(n+1-i,:));
end
toc
```

Running on a remote cluster, you might find different behavior, because the workers can create their arrays simultaneously, saving transfer time. Therefore, code that is optimized for local workers might not be optimized for cluster workers, and vice versa.