Process Big Data in the Cloud
This example shows how to access a large data set in the cloud and process it in a cloud cluster using MATLAB® capabilities for big data.
Learn how to:
Access a publicly available large data set on Amazon Cloud.
Find and select an interesting subset of this data set.
Use datastores, tall arrays, and Parallel Computing Toolbox to process this subset in less than 20 minutes.
The public data set in this example is part of the Wind Integration National Dataset Toolkit, or WIND Toolkit [1], [2], [3], [4]. For more information, see Wind Integration National Dataset Toolkit.
Requirements
To run this example, you must set up access to a cluster in Amazon AWS. In MATLAB, you can create clusters in Amazon AWS directly from the MATLAB desktop. On the Home tab, in the Parallel menu, select Create and Manage Clusters. In the Cluster Profile Manager, click Create Cloud Cluster. Alternatively, you can use MathWorks Cloud Center to create and access compute clusters in Amazon AWS. For more information, see Getting Started with Cloud Center.
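After you create the cluster, you can optionally confirm from the MATLAB command line that its profile is available. This is a sketch; myAWSCluster is a placeholder for your own profile name.

% Optional check: list the available cluster profiles and create a
% cluster object from your cloud profile ("myAWSCluster" is a
% placeholder name).
allProfiles = parallel.clusterProfiles
c = parcluster("myAWSCluster")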
Set Up Access to Remote Data
The data set used in this example is the Techno-Economic WIND Toolkit. It contains 2 TB (terabytes) of data for wind power estimates and forecasts, along with atmospheric variables, from 2007 to 2013 within the continental U.S.
The Techno-Economic WIND Toolkit is available via Amazon Web Services, in the location s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data. It contains two data sets:
s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data - Meteorology Data
s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/fcst_data - Forecast Data
To work with remote data in Amazon S3, you must define environment variables for your AWS credentials. For more information on setting up access to remote data, see Work with Remote Data. In the following code, replace YOUR_AWS_ACCESS_KEY_ID and YOUR_AWS_SECRET_ACCESS_KEY with your own Amazon AWS credentials. If you are using temporary AWS security credentials, also set the environment variable AWS_SESSION_TOKEN.
setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID"); setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY");
This data set also requires you to specify its geographic region, so set the corresponding environment variable.
setenv("AWS_DEFAULT_REGION","us-west-2");
To give the workers in your cluster access to the remote data, add these environment variable names to the EnvironmentVariables property of your cluster profile. To edit the properties of your cluster profile, use the Cluster Profile Manager, in Parallel > Create and Manage Clusters. For more information, see Set Environment Variables on Workers.
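Alternatively, in MATLAB R2021a and later you can copy client environment variables to the workers when you create the pool, instead of editing the profile. This is a sketch; myAWSCluster is a placeholder profile name.

% Sketch: copy the AWS credential variables from the client to the
% workers when starting the pool (requires R2021a or later;
% "myAWSCluster" is a placeholder profile name).
p = parpool("myAWSCluster","EnvironmentVariables", ...
    ["AWS_ACCESS_KEY_ID","AWS_SECRET_ACCESS_KEY","AWS_DEFAULT_REGION"]);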
Find Subset of Big Data
Rather than analyze the full 2 TB data set, this example shows how to find and select the subset that you want to analyze. The example focuses on data for the state of Massachusetts.
First, obtain the IDs that identify the meteorological stations in Massachusetts, and determine the files that contain their meteorological information. Metadata for each station is in a file named three_tier_site_metadata.csv. Because this file is small and fits in memory, you can access it from the MATLAB client with readtable. The readtable function can read open data in S3 buckets directly, without any special code.
tMetadata = readtable("s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/three_tier_site_metadata.csv", ...
    "ReadVariableNames",true,"TextType","string");
To find out which states are listed in this data set, use unique.
states = unique(tMetadata.state)
states = 50×1 string array
""
"Alabama"
"Arizona"
"Arkansas"
"California"
"Colorado"
"Connecticut"
"Delaware"
"District of Columbia"
"Florida"
"Georgia"
"Idaho"
"Illinois"
"Indiana"
"Iowa"
"Kansas"
"Kentucky"
"Louisiana"
"Maine"
"Maryland"
"Massachusetts"
"Michigan"
"Minnesota"
"Mississippi"
"Missouri"
"Montana"
"Nebraska"
"Nevada"
"New Hampshire"
"New Jersey"
"New Mexico"
"New York"
"North Carolina"
"North Dakota"
"Ohio"
"Oklahoma"
"Oregon"
"Pennsylvania"
"Rhode Island"
"South Carolina"
"South Dakota"
"Tennessee"
"Texas"
"Utah"
"Vermont"
"Virginia"
"Washington"
"West Virginia"
"Wisconsin"
"Wyoming"
Identify which stations are located in the state of Massachusetts.
index = tMetadata.state == "Massachusetts";
siteId = tMetadata{index,"site_id"};
The data for a given station is contained in a file that follows this naming convention: s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data/folder/site_id.nc, where folder is floor(site_id/500), the greatest integer less than or equal to site_id/500. Using this convention, compose a file location for each station.
folder = floor(siteId/500);
fileLocations = compose("s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data/%d/%d.nc",folder,siteId);
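You can optionally display the first few composed locations to confirm that the pattern looks right.

% Optional sanity check: display the first three file locations.
fileLocations(1:3)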
Process Big Data
You can use datastores and tall arrays to access and process data that does not fit in memory. When performing big data computations, MATLAB accesses smaller portions of the remote data as needed, so you do not need to download the entire data set at once. With tall arrays, MATLAB automatically breaks the data into smaller blocks that fit in memory for processing.
If you have Parallel Computing Toolbox, MATLAB can process the many blocks in parallel. The parallelization enables you to run an analysis on a single desktop with local workers, or scale up to a cluster for more resources. When you use a cluster in the same cloud service as the data, the data stays in the cloud and you benefit from improved data transfer times. Keeping the data in the cloud is also more cost-effective. This example ran in less than 20 minutes using 18 workers on a c4.8xlarge machine in Amazon AWS.
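If you want to try the tall workflow locally before scaling to the cloud, here is a minimal sketch that uses airlinesmall.csv, a sample file that ships with MATLAB, instead of the WIND Toolkit data.

% Minimal local sketch of the deferred tall workflow, using the
% airlinesmall.csv sample file that ships with MATLAB.
ds = tabularTextDatastore("airlinesmall.csv","TreatAsMissing","NA");
tt = tall(ds);                           % no data is read yet
avgDelay = mean(tt.ArrDelay,"omitnan");  % deferred computation
avgDelay = gather(avgDelay)              % triggers block-wise evaluation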
If you use a parallel pool in a cluster, MATLAB processes this data using the workers in the cluster. Create a parallel pool in the cluster. In the following code, replace myAWSCluster with the name of your cluster profile. Attach the script to the pool, because the parallel workers need to access a helper function defined in it.
p = parpool("myAWSCluster");
Starting parallel pool (parpool) using the 'myAWSCluster' profile ... connected to 18 workers.
addAttachedFiles(p,mfilename("fullpath"));
Create a datastore with the meteorological data for the stations in Massachusetts. The data is in the form of Network Common Data Form (NetCDF) files, and you must use a custom read function to interpret them. In this example, this function is named ncReader and reads the NetCDF data into timetables. You can explore its contents at the end of this script.
dsMetrology = fileDatastore(fileLocations,"ReadFcn",@ncReader,"UniformRead",true);
Create a tall timetable with the meteorological data from the datastore.
ttMetrology = tall(dsMetrology)
ttMetrology =

  M×6 tall timetable

            Time             wind_speed    wind_direction    power     density    temperature    pressure
    ____________________    __________    ______________    ______    _______    ___________    ________

    01-Jan-2007 00:00:00      5.905           189.35        3.3254    1.2374       269.74        97963
    01-Jan-2007 00:05:00      5.8898          188.77        3.2988    1.2376       269.73        97959
    01-Jan-2007 00:10:00      5.9447          187.85        3.396     1.2376       269.71        97960
    01-Jan-2007 00:15:00      6.0362          187.05        3.5574    1.2376       269.68        97961
    01-Jan-2007 00:20:00      6.1156          186.49        3.6973    1.2375       269.83        97958
    01-Jan-2007 00:25:00      6.2133          185.71        3.8698    1.2376       270.03        97952
    01-Jan-2007 00:30:00      6.3232          184.29        4.0812    1.2379       270.19        97955
    01-Jan-2007 00:35:00      6.4331          182.51        4.3382    1.2382       270.3         97957
            :                   :               :             :          :            :             :
            :                   :               :             :          :            :             :
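To inspect a few rows without processing the entire data set, you can gather the result of head. This quick check is not part of the main analysis.

% Read and display only the first four rows of the tall timetable.
firstRows = gather(head(ttMetrology,4))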
Get the mean temperature per month using groupsummary, and sort the resulting tall table. For performance, MATLAB defers most tall operations until the data is needed. In this case, plotting the data triggers evaluation of the deferred calculations.
meanTemperature = groupsummary(ttMetrology,"Time","month","mean","temperature");
meanTemperature = sortrows(meanTemperature);
Plot the results.
figure;
plot(meanTemperature.mean_temperature,"*-");
ylim([260 300]);
xlim([1 12*7+1]);
xticks(1:12:12*7+1);
xticklabels(["2007","2008","2009","2010","2011","2012","2013","2014"]);
title("Average Temperature in Massachusetts 2007-2013");
xlabel("Year");
ylabel("Temperature (K)")
Many MATLAB functions support tall arrays, so you can perform a variety of calculations on big data sets using familiar syntax. For more information on supported functions, see Supporting Functions.
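For example, you can queue several deferred statistics on the same tall timetable and evaluate them together in a single pass over the data. This sketch reuses the ttMetrology timetable from above.

% Compute several statistics in one evaluation pass over the data.
maxWind = max(ttMetrology.wind_speed);
meanPressure = mean(ttMetrology.pressure);
[maxWind,meanPressure] = gather(maxWind,meanPressure);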
Define Custom Read Function
The data in the Techno-Economic WIND Toolkit is saved in NetCDF files. Define a custom read function to read its data into a timetable. For more information on reading NetCDF files, see NetCDF Files.
function t = ncReader(filename)
% NCREADER Read NetCDF file (.nc), extract the data set, and save it as a timetable

% Get information about the NetCDF data source
fileInfo = ncinfo(filename);

% Extract variable names and datatypes
varNames = string({fileInfo.Variables.Name});
varTypes = string({fileInfo.Variables.Datatype});

% Transform variable names into valid names for table variables
if any(startsWith(varNames,["4","6"]))
    strVarNames = replace(varNames,["4","6"],["four","six"]);
else
    strVarNames = varNames;
end

% Extract the length of each variable
fileLength = fileInfo.Dimensions.Length;

% Extract the initial timestamp and sample period, and create the time axis
tAttributes = struct2table(fileInfo.Attributes);
startTime = datetime(cell2mat(tAttributes.Value(contains(tAttributes.Name,"start_time"))), ...
    "ConvertFrom","epochtime");
samplePeriod = seconds(cell2mat(tAttributes.Value(contains(tAttributes.Name,"sample_period"))));

% Create the output timetable
numVars = numel(strVarNames);
tableSize = [fileLength numVars];
t = timetable('Size',tableSize,'VariableTypes',varTypes,'VariableNames',strVarNames, ...
    'TimeStep',samplePeriod,'StartTime',startTime);

% Fill in the timetable with the variable data
for k = 1:numVars
    t(:,k) = table(ncread(filename,varNames{k}));
end

end
References
[1] Draxl, C., B. M. Hodge, A. Clifton, and J. McCaa. Overview and Meteorological Validation of the Wind Integration National Dataset Toolkit (Technical Report, NREL/TP-5000-61740). Golden, CO: National Renewable Energy Laboratory, 2015.
[2] Draxl, C., B. M. Hodge, A. Clifton, and J. McCaa. "The Wind Integration National Dataset (WIND) Toolkit." Applied Energy. Vol. 151, 2015, pp. 355-366.
[3] King, J., A. Clifton, and B. M. Hodge. Validation of Power Output for the WIND Toolkit (Technical Report, NREL/TP-5D00-61714). Golden, CO: National Renewable Energy Laboratory, 2014.
[4] Lieberman-Cribbin, W., C. Draxl, and A. Clifton. Guide to Using the WIND Toolkit Validation Code (Technical Report, NREL/TP-5000-62595). Golden, CO: National Renewable Energy Laboratory, 2014.
See Also
tall | datastore | readtable | parpool
More About
- Work with Deep Learning Data in AWS (Deep Learning Toolbox)
- Deep Learning with Big Data (Deep Learning Toolbox)