Extend Tall Arrays with Other Products
Products Used: Statistics and Machine Learning Toolbox™, Database Toolbox™, Parallel Computing Toolbox™, MATLAB® Parallel Server™, MATLAB Compiler™
Several toolboxes enhance the capabilities of tall arrays. These enhancements include writing machine learning algorithms, integrating with big data systems, and deploying standalone apps.
Statistics and Machine Learning
Statistics and Machine Learning Toolbox enables you to perform advanced statistical calculations on tall arrays. Capabilities include:
K-means clustering
Linear regression fitting
Grouped statistics
Classification
See Analysis of Big Data with Tall Arrays (Statistics and Machine Learning Toolbox) for more information.
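As a minimal sketch of one of the capabilities listed above, the following fits a linear regression to a tall table. It assumes the airlinesmall.csv sample file that ships with MATLAB is on the path and that Statistics and Machine Learning Toolbox is installed; the choice of variables and the model formula are illustrative only.

```matlab
% Keep the evaluation in the local MATLAB session for this sketch.
mapreducer(0);

% Create a datastore over the sample file (assumed to be on the path)
% and select a few numeric variables. 'NA' entries are read as NaN.
ds = datastore('airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay', 'DepDelay', 'Distance'});

% Convert the datastore to a tall table and drop rows with missing values.
tt = tall(ds);
tt = rmmissing(tt);

% Fit a linear model on the tall table. The formula is an illustrative
% assumption, not a recommended model for this data.
mdl = fitlm(tt, 'ArrDelay ~ DepDelay + Distance');
disp(mdl.Coefficients)
```

With tall inputs, the fit is evaluated over the data in chunks, so the same call should work whether or not the data fits in memory.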
Control Where Your Code Runs
When you execute calculations on tall arrays, the default execution environment uses either the local MATLAB session, or a local parallel pool if you have Parallel Computing Toolbox. Use the mapreducer function to change the execution environment of tall arrays when using Parallel Computing Toolbox, MATLAB Parallel Server, or MATLAB Compiler:
Parallel Computing Toolbox — Run calculations in parallel using local or cluster workers to speed up large tall array calculations. See Use Tall Arrays on a Parallel Pool (Parallel Computing Toolbox) or Process Big Data in the Cloud (Parallel Computing Toolbox) for more information.
MATLAB Parallel Server — Run tall array calculations on a cluster, including Apache® Spark™ enabled Hadoop® clusters. This can significantly reduce the execution time of very large calculations. See Use Tall Arrays on a Spark Cluster (Parallel Computing Toolbox) for more information.
MATLAB Compiler — Deploy MATLAB applications containing tall arrays as standalone apps on Apache Spark. See Spark Applications (MATLAB Compiler) for more information.
One of the benefits of developing your algorithms with tall arrays is that you only need to write the code once. You can develop your code locally, then use mapreducer to scale up and take advantage of the capabilities offered by Parallel Computing Toolbox, MATLAB Parallel Server, or MATLAB Compiler, without needing to rewrite your algorithm.
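The write-once workflow might look like the sketch below. It assumes the airlinesmall.csv sample file is available; the commented-out line shows where a parallel pool could be substituted if Parallel Computing Toolbox is installed. Only the environment selection changes, while the analysis code stays the same.

```matlab
% Select the execution environment. mapreducer(0) uses the local
% MATLAB session and needs no additional products.
mapreducer(0);
% With Parallel Computing Toolbox, this line could be used instead
% (assumption: a default parallel pool is available):
% mapreducer(gcp);

% The analysis below does not depend on which environment was chosen.
ds = datastore('airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);

% Operations on tall arrays are deferred until gather is called.
meanDelay = gather(mean(tt.ArrDelay, 'omitnan'));
fprintf('Mean arrival delay: %.2f minutes\n', meanDelay);
```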
Note
Each tall array is bound to a single execution environment when it is constructed using tall(ds). If that execution environment is later modified or deleted, then the tall array becomes invalid. For this reason, each time you change the execution environment you must reconstruct the tall array.
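A short sketch of the reconstruction step described in this note, assuming the airlinesmall.csv sample file is available. The environment change is shown as a comment because it depends on which products are installed; the point is that the datastore can be reused while the tall array is rebuilt.

```matlab
ds = datastore('airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'DepDelay'});

% Construct the tall array in the local MATLAB session.
mapreducer(0);
tt = tall(ds);

% Suppose the execution environment is changed here, for example
% (assumption: Parallel Computing Toolbox is installed):
% mapreducer(gcp);

% After any such change, rebuild the tall array from the same
% datastore instead of reusing the old variable.
tt = tall(ds);

% gcmr returns the mapreducer that is currently in effect.
currentEnv = gcmr;
disp(class(currentEnv))
```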
Work with Databases
Database Toolbox enables you to create a tall table from a DatabaseDatastore that is backed by data in a database. For more information, see Analyze Large Data in Database Using Tall Arrays (Database Toolbox).
Note
DatabaseDatastore has these limitations:
DatabaseDatastore must use the local MATLAB session as the execution environment. Set this environment using the command mapreducer(0).
Standalone applications containing tall arrays that use DatabaseDatastore cannot be deployed against Apache Spark using MATLAB Compiler.
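The following sketch shows how a tall table might be created from a DatabaseDatastore under the local-session restriction noted above. The data source name, credentials, table name, and query are hypothetical placeholders and must be replaced with values for a real, configured database connection.

```matlab
% DatabaseDatastore requires the local MATLAB session.
mapreducer(0);

% Hypothetical data source name and credentials (assumptions).
conn = database('myDataSource', 'myUser', 'myPassword');

% Hypothetical query against a hypothetical table.
query = 'SELECT * FROM airlinesmall';
dbds = databaseDatastore(conn, query);

% Create a tall table backed by the database query results.
tt = tall(dbds);

% Preview a few rows; gather triggers the deferred evaluation.
firstRows = gather(head(tt));
disp(firstRows)

% Release the datastore and the connection when finished.
close(dbds);
close(conn);
```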
See Also
mapreducer | gcmr | tall