Develop Custom Datastore

This topic shows how to implement a custom datastore for file-based data. Use this framework only when writing your own custom datastore interface. Otherwise, for standard file formats, such as images or spreadsheets, use an existing datastore from MATLAB®. For more information, see Getting Started with Datastore.

Overview

To build your custom datastore interface, use the custom datastore classes and objects. Then, use the custom datastore to bring your data into MATLAB and leverage the MATLAB big data capabilities such as tall, mapreduce, and Hadoop®.

Designing your custom datastore involves inheriting from one or more abstract classes and implementing the required methods. The specific classes and methods you need depend on your processing needs.

Processing Needs	Classes
Datastore for Serial Processing in MATLAB	`matlab.io.Datastore` See Implement Datastore for Serial Processing
Datastore with support for Parallel Computing Toolbox™ and MATLAB Parallel Server™	`matlab.io.Datastore` and `matlab.io.datastore.Partitionable` See Add Support for Parallel Processing
Datastore with support for Hadoop	`matlab.io.Datastore` and `matlab.io.datastore.HadoopLocationBased` See Add Support for Hadoop
Datastore with support for shuffling samples in a datastore in random order	`matlab.io.Datastore` and `matlab.io.datastore.Shuffleable` See Add Support for Shuffling
Datastore with support for writing files via `writeall`	`matlab.io.Datastore` and `matlab.io.datastore.FileWritable` (Optionally, inheriting from `matlab.io.datastore.FoldersPropertyProvider` adds support for a `Folders` property.) See Add Support for Writing Data

Start by implementing a datastore for serial processing, and then add support for parallel processing, Hadoop, shuffling, or writing as needed.

Implement Datastore for Serial Processing

To implement a custom datastore named MyDatastore, create a class definition file MyDatastore.m. The file must be on the MATLAB path and must contain code that inherits from the appropriate class and defines the required methods. The code for creating a datastore for serial processing in MATLAB must:

Inherit from the base class matlab.io.Datastore.
Define these methods: hasdata, read, reset, and progress.
Define additional properties and methods based on your data processing and analysis needs.

For a sample implementation, follow these steps.

Inherit from the base class `Datastore`.

classdef MyDatastore < matlab.io.Datastore

    properties (Access = private)
        CurrentFileIndex double
        FileSet matlab.io.datastore.DsFileSet
    end

Add this property to create a datastore on one machine that works seamlessly on another machine or cluster that possibly has a different file system or operating system. Add methods to get and set this property in the methods section.

    % Property to support saving, loading, and processing of
    % datastore on different file system machines or clusters.
    % In addition, define the methods get.AlternateFileSystemRoots()
    % and set.AlternateFileSystemRoots() in the methods section.
    properties(Dependent)
        AlternateFileSystemRoots
    end

Implement the function `MyDatastore` that creates the custom datastore.

    methods % begin methods section

        function myds = MyDatastore(location,altRoots)
            myds.FileSet = matlab.io.datastore.DsFileSet(location,...
                'FileExtensions','.bin', ...
                'FileSplitSize',8*1024);
            myds.CurrentFileIndex = 1;

            if nargin == 2
                myds.AlternateFileSystemRoots = altRoots;
            end

            reset(myds);
        end

Implement the `hasdata` method.

        function tf = hasdata(myds)
            % Return true if more data is available.
            tf = hasfile(myds.FileSet);
        end

Implement the `read` method. This method uses `MyFileReader`, which is a function that you must create to read your proprietary file format. See Create Function to Read Your Proprietary File Format.

        function [data,info] = read(myds)
            % Read data and information about the extracted data.
            if ~hasdata(myds)
                error(sprintf(['No more data to read.\nUse the reset ',...
                    'method to reset the datastore to the start of ' ,...
                    'the data. \nBefore calling the read method, ',...
                    'check if data is available to read ',...
                    'by using the hasdata method.']))
            end

            fileInfoTbl = nextfile(myds.FileSet);
            data = MyFileReader(fileInfoTbl);
            info.Size = size(data);
            info.FileName = fileInfoTbl.FileName;
            info.Offset = fileInfoTbl.Offset;

            % Update CurrentFileIndex for tracking progress
            if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                    fileInfoTbl.FileSize
                myds.CurrentFileIndex = myds.CurrentFileIndex + 1 ;
            end
        end

Implement the `reset` method.

        function reset(myds)
            % Reset to the start of the data.
            reset(myds.FileSet);
            myds.CurrentFileIndex = 1;
        end

Define the methods to get and set the `AlternateFileSystemRoots` property. You must reset the datastore in the `set` method.

        % Before defining these methods, add the AlternateFileSystemRoots
        % property in the properties section

        % Getter for AlternateFileSystemRoots property
        function altRoots = get.AlternateFileSystemRoots(myds)
            altRoots = myds.FileSet.AlternateFileSystemRoots;
        end

        % Setter for AlternateFileSystemRoots property
        function set.AlternateFileSystemRoots(myds,altRoots)
            try
                % The DsFileSet object manages the AlternateFileSystemRoots
                % for your datastore
                myds.FileSet.AlternateFileSystemRoots = altRoots;

                % Reset the datastore
                reset(myds);
            catch ME
                throw(ME);
            end
        end

    end

Implement the `progress` method.

    methods (Hidden = true)
        function frac = progress(myds)
            % Determine percentage of data read from datastore
            if hasdata(myds)
                frac = (myds.CurrentFileIndex-1)/...
                    myds.FileSet.NumFiles;
            else
                frac = 1;
            end
        end
    end

Implement the `copyElement` method when you use the `DsFileSet` object as a property in your datastore.

    methods (Access = protected)
        % If you use the DsFileSet object as a property, then
        % you must define the copyElement method. The copyElement
        % method allows methods such as readall and preview to
        % remain stateless
        function dscopy = copyElement(ds)
            dscopy = copyElement@matlab.mixin.Copyable(ds);
            dscopy.FileSet = copy(ds.FileSet);
        end
    end

End the `classdef` section.

end

Create Function to Read Your Proprietary File Format

The implementation of the read method of your custom datastore uses a function called MyFileReader. You must create this function to read your custom or proprietary data. Build this function using a DsFileReader object and its methods. For instance, create a function that reads binary files.

function data = MyFileReader(fileInfoTbl)
% create a reader object using the FileName
reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

% seek to the offset
seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

% read fileInfoTbl.SplitSize amount of data
data = read(reader,fileInfoTbl.SplitSize);
end
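
A quick way to exercise the serial implementation is to point it at a small test file. The following is a minimal sketch, assuming MyDatastore.m and MyFileReader.m from the steps above are saved on the MATLAB path. The temporary folder, the file name sample.bin, and the 20,000-byte file size are illustrative choices and are not part of the original example.

```matlab
% Create a temporary folder that holds one binary test file (hypothetical data).
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder,'sample.bin'),'w');
fwrite(fid,zeros(1,20000,'uint8'),'uint8');   % larger than one 8 KB split
fclose(fid);

% Construct the datastore defined in MyDatastore.m.
ds = MyDatastore(folder);

% Read split by split until the datastore reports no more data.
numBytes = 0;
while hasdata(ds)
    [data,info] = read(ds);
    numBytes = numBytes + numel(data);   % MyFileReader returns uint8 bytes by default
end

% Return to the start of the data so that the datastore can be read again.
reset(ds);
```

Because the constructor sets 'FileSplitSize' to 8*1024, a file larger than 8 KB is returned over several calls to read, and info.Offset reports where each piece starts within the file.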

Add Support for Parallel Processing

To add support for parallel processing with Parallel Computing Toolbox and MATLAB Parallel Server, update your implementation code in MyDatastore.m to:

Inherit from an additional class matlab.io.datastore.Partitionable.
Define two additional methods: maxpartitions and partition.

For a sample implementation, follow these steps.

Update the `classdef` section to inherit from the `Partitionable` class.

classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.Partitionable
     .
     .
     .

Add the definition for `partition` to the `methods` section.

    methods
         .
         .
         .
        function subds = partition(myds,n,ii)
            subds = copy(myds);
            subds.FileSet = partition(myds.FileSet,n,ii);
            reset(subds);
        end
    end

Add the definition for `maxpartitions` to the `methods` section.

    methods (Access = protected)
        function n = maxpartitions(myds)
            n = maxpartitions(myds.FileSet);
        end
    end

End the `classdef` section.

end
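
Once the class inherits from Partitionable, the partitions can be walked one at a time to check that together they cover the data. The following is a minimal sketch, assuming the partition-enabled MyDatastore.m and MyFileReader.m are on the MATLAB path. The test folder and file are hypothetical, and the loop runs serially so that it does not require Parallel Computing Toolbox.

```matlab
% Create a temporary folder that holds one binary test file (hypothetical data).
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder,'sample.bin'),'w');
fwrite(fid,zeros(1,20000,'uint8'),'uint8');
fclose(fid);

ds = MyDatastore(folder);

% numpartitions is provided by Partitionable and relies on maxpartitions.
n = numpartitions(ds);

% Read every partition in turn and count the bytes returned.
totalBytes = 0;
for ii = 1:n
    subds = partition(ds,n,ii);      % independent copy limited to partition ii
    while hasdata(subds)
        totalBytes = totalBytes + numel(read(subds));
    end
end
```

With a parallel pool available, the same loop body could run inside parfor, because each partition is an independent copy of the datastore.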

Add Support for Hadoop

To add support for Hadoop, update your implementation code in MyDatastore.m to:

Inherit from an additional class matlab.io.datastore.HadoopLocationBased.
Define two additional methods: getLocation and initializeDatastore.

For a sample implementation, follow these steps.

Update the classdef section to inherit from the HadoopLocationBased class.

classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.HadoopLocationBased 
     .
     .
     .

Add the definition for getLocation, initializeDatastore, and isfullfile (optional) to the methods section.

 methods (Hidden = true)
     .
     .
     .   

     function initializeDatastore(myds,hadoopInfo)
        import matlab.io.datastore.DsFileSet;
        myds.FileSet = DsFileSet(hadoopInfo,...
             'FileSplitSize',myds.FileSet.FileSplitSize);
        reset(myds);         
     end 
     
     function loc = getLocation(myds)
        loc = myds.FileSet;         
     end 
     
     % isfullfile method is optional
     function tf = isfullfile(myds)
        tf = isequal(myds.FileSet.FileSplitSize,'file');          
     end 
 
 end

End the classdef section.

end

Add Support for Shuffling

To add support for shuffling, update your implementation code in MyDatastore.m to:

Inherit from an additional class matlab.io.datastore.Shuffleable.
Define the additional method shuffle.

For a sample implementation, follow these steps.

Update the classdef section to inherit from the Shuffleable class.

classdef MyDatastore < matlab.io.Datastore & ...
                      matlab.io.datastore.Shuffleable 
     .
     .
     .

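Add the definition for shuffle to the existing methods section. This sample refers to the properties Datastore, Labels, and NumObservations, which the MyDatastore class from the earlier sections does not define, so adapt these names to the properties of your own datastore.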

  methods

        % previously defined methods
        .
        .
        . 
   
        function dsNew = shuffle(ds)
            % dsNew = shuffle(ds) shuffles the files and the
            % corresponding labels in the datastore.
            
            % Create a copy of datastore
            dsNew = copy(ds);
            dsNew.Datastore = copy(ds.Datastore);
            fds = dsNew.Datastore;
            
            % Shuffle files and corresponding labels
            numObservations = dsNew.NumObservations;
            idx = randperm(numObservations);
            fds.Files = fds.Files(idx);
            dsNew.Labels = dsNew.Labels(idx);
        end

  end

End the classdef section.

end
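
The core of the shuffle method is that one random permutation is applied to both the files and the labels, so that each label stays attached to its file. The following self-contained sketch illustrates only that indexing step. The file names and labels are hypothetical and stand in for the Files and Labels properties used in the sample.

```matlab
% Hypothetical file names and labels; label k belongs to file k.
files  = ["file1.bin"; "file2.bin"; "file3.bin"; "file4.bin"];
labels = [1; 2; 3; 4];

% Draw one random permutation and apply it to both arrays.
idx = randperm(numel(files));
shuffledFiles  = files(idx);
shuffledLabels = labels(idx);

% Each label still points at its original file.
stillPaired = all(shuffledFiles == files(shuffledLabels));
```

Indexing both arrays with the same idx is what keeps the pairs aligned. Calling randperm separately for each array would break the correspondence.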

Add Support for Writing Data

To add support for writing data, update your implementation code in MyDatastore.m to follow these requirements:

Inherit from an additional class matlab.io.datastore.FileWritable.
Initialize the properties SupportedOutputFormats and DefaultOutputFormat.
Implement a write method if the datastore writes data to a custom format.
Implement a getFiles method if the datastore does not have a Files property.
Implement a getFolders method if the datastore does not have a Folders property.
The output location is validated as a string. If your datastore requires further validation, you must implement a validateOutputLocation method.
If the datastore is meant for files that require multiple reads per file, then you must implement the methods getCurrentFilename and currentFileIndexComparator.
Optionally, inherit from another class matlab.io.datastore.FoldersPropertyProvider to add support for a Folders property (and thus the FolderLayout name-value pair of writeall). If you do this, then you can use the populateFoldersFromLocation method in the datastore constructor to populate the Folders property.
To add support for the 'UseParallel' option of writeall, you must subclass from both matlab.io.datastore.FileWritable and matlab.io.datastore.Partitionable and implement a partition method in the subclass that supports the syntax partition(ds,'Files',index).

For a sample implementation that inherits from matlab.io.datastore.FileWritable, follow these steps.

Update the `classdef` section to inherit from the `FileWritable` class.

classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.FileWritable
     .
     .
     .

Initialize the properties `SupportedOutputFormats` and `DefaultOutputFormat`. In this example, the datastore supports all of the output formats of `ImageDatastore`, as well as a custom format `"dcm"`, which is also declared as the default output format.

    properties (Constant)
        SupportedOutputFormats = ...
            [matlab.io.datastore.ImageDatastore.SupportedOutputFormats, "dcm"];
        DefaultOutputFormat = "dcm";
    end

Add definitions for `getFiles` and `getFolders` to the existing `methods` section. These methods are required when the datastore does not have `Files` or `Folders` properties.

    methods (Access = {?matlab.io.datastore.FileWritable, ...
            ?matlab.bigdata.internal.executor.FullfileDatastorePartitionStrategy})
        function files = getFiles(ds)
            files = {'data/folder/file1', 'data/folder/file2',...};
        end
    end

    methods (Access = protected)
        function folders = getFolders(ds)
            folders = {'data/folder1/', 'data/folder2/',...};
        end
    end

Add a `write` method when the datastore intends to write data to a custom format. In this example, the method switches between using a custom write function for `"dcm"` and the built-in write function for known formats.

    methods(Access = protected)
        function tf = write(myds, data, writeInfo, outFmt, varargin)
            if outFmt == "dcm" % use custom write fcn for dcm format
                dicomwrite(data, writeInfo.SuggestedOutputName, varargin{:});
            else % callback into built-in for known formats
                write@matlab.io.datastore.FileWritable(myds, data, ...
                    writeInfo, outFmt, varargin{:});
            end
            tf = true;
        end
    end

End the `classdef` section.

end

For a longer example class that inherits from both matlab.io.datastore.FileWritable and matlab.io.datastore.FoldersPropertyProvider, see Develop Custom Datastore for DICOM Data.

Validate Custom Datastore

After following the instructions presented here, the implementation step of your custom datastore is complete. Before using this custom datastore, qualify it using the guidelines presented in Testing Guidelines for Custom Datastores.
