
matlab.mapreduce.DeploySparkMapReducer Class

Namespace: matlab.mapreduce

Configure a MATLAB tall array application with Spark parameters as key-value pairs

Description

A DeploySparkMapReducer object stores the configuration parameters of a tall array application being deployed to Spark™. Every tall array application must be configured before it is deployed on a Spark cluster. Some of the configuration parameters define properties of the application, and some are used by Spark to allocate resources on the cluster. The configuration parameters are passed to a Spark cluster through the mapreducer function.

Construction

conf = matlab.mapreduce.DeploySparkMapReducer('AppName',name,'Master',url,'SparkProperties',prop) creates a DeploySparkMapReducer object with the specified configuration parameters.

conf = matlab.mapreduce.DeploySparkMapReducer('AppName',name,'Master',url,'SparkProperties',prop,Name,Value) creates a DeploySparkMapReducer object with additional configuration parameters specified by one or more Name,Value pair arguments. Name is a property name of the class and Value is the corresponding value. Name must appear inside single quotes (''). You can specify several name-value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
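For instance, assuming prop is a containers.Map of Spark configuration properties (see Input Arguments below), a minimal construction might look like this:

conf = matlab.mapreduce.DeploySparkMapReducer( ...
    'AppName','myApp', ...
    'Master','yarn-client', ...
    'SparkProperties',prop);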

Input Arguments


name

Name of the application, specified as a character vector inside single quotes ('').

Example: 'AppName', 'myApp'

Data Types: char | string

url

Name of the master URL, specified as a character vector inside single quotes ('').

yarn-client
Connect to a Hadoop® YARN cluster in client mode. The cluster location is found based on the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable.

Example: 'Master', 'yarn-client'

Data Types: char | string

prop

Spark configuration properties, specified as a containers.Map object containing key-value pairs.

When deploying to a Hadoop YARN cluster, set the value of prop with the appropriate Spark configuration properties as key-value pairs. The precise set of Spark configuration properties varies from one deployment scenario to another, based on the deployment cluster environment. Verify the Spark setup with your system administrator so that you use the appropriate configuration properties. See the tables below for commonly used Spark properties. For the full set of properties, see the latest Spark documentation.

Running Spark on YARN

spark.executor.cores
Default: 1
The number of cores to use on each executor. For YARN and Spark standalone mode only. In Spark standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application runs on each worker.

spark.executor.instances
Default: 2
The number of executors. Note: this property is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of spark.executor.instances is used.

spark.driver.memory
Default: 1g (2048m recommended)
Amount of memory to use for the driver process. If you get out-of-memory errors while using tall/gather, consider increasing this value.

spark.executor.memory
Default: 1g (2048m recommended)
Amount of memory to use per executor process. If you get out-of-memory errors while using tall/gather, consider increasing this value.

spark.yarn.executor.memoryOverhead
Default: executorMemory * 0.10, with a minimum of 384 (4096m recommended)
The amount of off-heap memory (in MB) to be allocated per executor. If you get out-of-memory errors while using tall/gather, consider increasing this value.

spark.dynamicAllocation.enabled
Default: false
Setting this property to true enables dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload. This option integrates Spark with the YARN resource management: Spark initiates as many executors as possible, given the executor memory requirement and number of cores. This property requires the cluster to be set up for it and requires spark.shuffle.service.enabled to be set to true. The following configurations are also relevant: spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, and spark.dynamicAllocation.initialExecutors.

spark.shuffle.service.enabled
Default: false
Enables the external shuffle service. This service preserves the shuffle files written by executors so that the executors can be safely removed. This property must be set to true if spark.dynamicAllocation.enabled is true. The external shuffle service must be set up on the cluster before you can enable it.
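
As a sketch, the recommended memory settings from the table above can be collected into a single containers.Map and passed as the 'SparkProperties' value. The values are the recommendations shown in the table; tune them for your cluster:

memoryProperties = containers.Map( ...
    {'spark.driver.memory', ...
     'spark.executor.memory', ...
     'spark.yarn.executor.memoryOverhead'}, ...
    {'2048m', ...
     '2048m', ...
     '4096m'});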

MATLAB Specific Properties

spark.matlab.worker.debug
Default: false
For use in standalone/interactive mode only. If set to true, a Spark deployable MATLAB application executed within the MATLAB desktop environment starts another MATLAB session as the worker and enters the debugger. Logging information is directed to log_<nbr>.txt.

spark.matlab.worker.reuse
Default: true
When set to true, a Spark executor pools workers and reuses them from one stage to the next. Workers terminate when the executor under which they are running terminates.

spark.matlab.worker.profile
Default: false
Only valid when using a MATLAB session as a worker. When set to true, turns on the MATLAB Profiler and generates a profile report that is saved to the file profworker_<split_index>_<socket>_<worker pass>.mat.

spark.matlab.worker.numberOfKeys
Default: 10000
Number of unique keys that a containers.Map object can hold while performing *ByKey operations before map data is spilled to a file.

spark.matlab.executor.timeout
Default: 600000
Spark executor timeout in milliseconds. Not applicable when deploying tall arrays.
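
As an illustration, the following sketch turns on worker debugging and profiling for an interactive run. Per the table above, these two properties are only meaningful when the worker is a MATLAB session in standalone/interactive mode:

workerProperties = containers.Map( ...
    {'spark.matlab.worker.debug', ...
     'spark.matlab.worker.profile'}, ...
    {'true', ...
     'true'});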

Monitoring and Logging

spark.history.fs.logDirectory
Default: file:/tmp/spark-events
Directory that contains application event logs to be loaded by the history server.

spark.eventLog.dir
Default: file:///tmp/spark-events
Base directory in which Spark events are logged, if spark.eventLog.enabled is true. Within this base directory, Spark creates a subdirectory for each application and logs the events specific to that application there. You can set this to a unified location, such as an HDFS™ directory, so that history files can be read by the history server.

spark.eventLog.enabled
Default: false
Whether to log Spark events. Logging them is useful for reconstructing the web UI after the application has finished.
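
For example, to make completed runs visible to a Spark history server, you might enable event logging and point both spark.eventLog.dir and spark.history.fs.logDirectory at the same shared location. The HDFS path below is a placeholder:

loggingProperties = containers.Map( ...
    {'spark.eventLog.enabled', ...
     'spark.eventLog.dir', ...
     'spark.history.fs.logDirectory'}, ...
    {'true', ...
     'hdfs://hadoopfs:54310/user/<username>/spark-events', ...
     'hdfs://hadoopfs:54310/user/<username>/spark-events'});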

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

MCRRoot

Path to the MATLAB Runtime installation, specified as a character vector inside single quotes ('').

Example: 'MCRRoot', '/share/MATLAB/MATLAB_Runtime/v91'

Data Types: char | string
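
For example, to point a deployed application at a specific MATLAB Runtime installation, pass 'MCRRoot' alongside the required arguments. This sketch assumes prop is a containers.Map of Spark properties and reuses the example path above:

conf = matlab.mapreduce.DeploySparkMapReducer( ...
    'AppName','myApp', ...
    'Master','yarn-client', ...
    'SparkProperties',prop, ...
    'MCRRoot','/share/MATLAB/MATLAB_Runtime/v91');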

Log level to set, specified as a character vector inside single quotes ('').

Data Types: char | string

Properties

The properties of this class are hidden.

Methods

There are no user-executable methods for this class.

Examples


Define Spark properties and create a DeploySparkMapReducer object.

% Spark configuration properties, specified as key-value pairs
sparkProperties = containers.Map( ...
    {'spark.executor.cores', ...
     'spark.executor.memory', ...
     'spark.yarn.executor.memoryOverhead', ...
     'spark.dynamicAllocation.enabled', ...
     'spark.shuffle.service.enabled', ...
     'spark.eventLog.enabled', ...
     'spark.eventLog.dir'}, ...
    {'1', ...
     '2g', ...
     '1024', ...
     'true', ...
     'true', ...
     'true', ...
     'hdfs://hadoopfs:54310/user/<username>/sparkdeploy'});

% Configure the tall array application with the Spark properties
conf = matlab.mapreduce.DeploySparkMapReducer( ...
    'AppName','myTallApp', ...
    'Master','yarn-client', ...
    'SparkProperties',sparkProperties);

% Set the configured object as the execution environment
mapreducer(conf);
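
After mapreducer(conf) runs, subsequent tall array operations execute on the Spark cluster. A hypothetical continuation is sketched below; the dataset location and variable names are placeholders, not part of the documented example:

% Create a tall table from a datastore on HDFS and gather a result.
% gather triggers the deferred tall computation on the Spark cluster.
ds = datastore('hdfs://hadoopfs:54310/user/<username>/airlinedata/*.csv');
tt = tall(ds);
meanDelay = gather(mean(tt.ArrDelay,'omitnan'));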

Version History

Introduced in R2016b