How can I utilize tall arrays for my data structure without it being in a cell array?

I've recently inherited ownership of some legacy software that outputs data in a specific format that can't be modified. It also outputs a lot of data, between 2 GB and 8 GB depending on the session. I have MATLAB code that reads all of this data into memory and plots it. That code works most of the time, but starts hitting memory issues around the 8 GB mark. I've gotten as far as I can with other memory-optimization tricks (mostly making sure I'm not generating copies of such a large array during intermediate calculations), and now I want to see if I can implement a datastore/tall array setup so the code works for arbitrarily large datasets in the future.
The data file format is as follows: there are 30 different data sources that we track over time, so every measurement snapshot stores 31 values (one timestamp and one measurement from each of the 30 sources). Each value is converted to single, and the binary values are stored in sequence in a large .bin file, with the snapshots themselves also stored in sequence. Once a file reaches about half a gigabyte, a new file is opened and filled.
So, to read it back in, I would use (slight pseudocode):
data_ReadIntoMemory = zeros(31,numOfExpectedDataPoints); % numOfExpectedDataPoints can be calculated from the size of all the files
startPoint = 1;
for i = 1:numfiles
    fileID = fopen(filename{i},'r','n');
    currData = fread(fileID,inf,'*single',0,'ieee-le');
    fclose(fileID);
    currData = reshape(currData,31,[]); % one column per snapshot
    endPoint = startPoint + size(currData,2) - 1; % running offset into the full array
    data_ReadIntoMemory(:,startPoint:endPoint) = currData;
    startPoint = endPoint + 1;
end
This lets me take all the files, read them in, and create one massive array when I have the memory for it.
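As a rough sketch of the "can be calculated from the size of all the files" step: each snapshot is 31 single values, so 31*4 = 124 bytes, and the total snapshot count is the summed file size divided by 124. The demo folder, file names, and snapshot counts below are made up purely so the example runs on its own. Note also that zeros without a class argument creates a double array, so preallocating with "single" is one way to halve the footprint of the full array.

```matlab
% Hypothetical demo folder and files, created only for this example
demoDir = fullfile(tempdir, "snapshot_count_demo");
if ~isfolder(demoDir)
    mkdir(demoDir);
end
snapshotsPerFile = [5 8];   % made-up sizes for the two demo files
for k = 1:numel(snapshotsPerFile)
    fid = fopen(fullfile(demoDir, sprintf("part_%d.bin", k)), "w");
    fwrite(fid, zeros(31, snapshotsPerFile(k), "single"), "single");
    fclose(fid);
end

% Each snapshot is 31 single values, i.e. 31*4 = 124 bytes
bytesPerSnapshot = 31 * 4;
listing = dir(fullfile(demoDir, "*.bin"));
numOfExpectedDataPoints = sum([listing.bytes]) / bytesPerSnapshot;

% Preallocate in single precision so the array matches the file data type
data_ReadIntoMemory = zeros(31, numOfExpectedDataPoints, "single");
```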
For situations where there are too many files, here is how I'm trying to use the datastore functions:
function data = ReadBinFile(filename)
    fileID = fopen(filename,'r','n');
    data = fread(fileID,inf,'*single',0,'ieee-le');
    fclose(fileID);
    data = reshape(data,31,[]); % one column per snapshot
end
myDataStore = fileDatastore(myDirectory,"ReadFcn",@ReadBinFile,"FileExtensions",'.bin') %Some non-.bin files are in the directory so I need to filter for them
data_Datastore = tall(myDataStore)
Now, here's where I start to get issues. If I run gather(data_Datastore) on a dataset small enough to fit into memory, I don't get the same full data array as data_ReadIntoMemory. I get a cell array where each cell holds the data from one file, and I would need horzcat(data_Datastore{:}) to get it into the same format as data_ReadIntoMemory.
This is an issue because the intermediate calculations that I need to run before plotting throw errors when evaluated on data_Datastore, precisely because it is in this cell format.
For example, gather(data_Datastore(6,:)) throws an 'exceeding array bounds' error if fewer than 6 files were loaded; the bound is always equal to the number of files.
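The bound in that error can be reproduced without any files. This is a minimal sketch using in-memory stand-ins for two files (the block sizes are made up): a tall cell array has one row per file, so its height is the file count rather than 31, whereas a tall numeric array grows along its first dimension and would need one snapshot per row.

```matlab
% Run tall evaluations in the local MATLAB session
mapreducer(0);

% Two stand-in "files", each a 31-by-N single block (sizes are made up)
perFile = {zeros(31, 5, "single"); zeros(31, 7, "single")};

% A tall cell array has one row per file, like the fileDatastore result
tCell = tall(perFile);
numRows = gather(size(tCell, 1));   % height equals the number of files

% A tall numeric array is tall in its first dimension, so each snapshot
% has to become a row (N-by-31) rather than a column (31-by-N)
tNum = tall(vertcat(perFile{1}.', perFile{2}.'));
sz = gather(size(tNum));
```

Here numRows is 2 and sz is [12 31], so indexing row 6 only makes sense on the numeric version.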
All this to frame the question: is there a way to read my data in so that it isn't a cell array and doesn't need to be fully read in and concatenated with horzcat first (since that would just cause the same low-memory issue that started all this)? I know tall arrays can be read in as tables for "tabular" data, but I can't find a way to get this to work with my current data structure.
3 comments
dpb on 29 Jul 2025
I haven't messed around with datastore yet, as it seems quite complex, and I've managed with memmapfile for the large(ish) datasets I've had to deal with recently.
But the datastore documentation specifically limits the output types to table/timetable or cell. I don't understand that yet, either; if the data are all the same type, not being able to return the underlying data class seems like added complexity to deal with.
I tried writing a .bin file to mimic your description and then creating a TallDatastore on it, but that doesn't work directly, either; one would have to go through the above sequence of reading all the files into memory and then write them back out with the write method of tall. I haven't gotten far enough along to see about the 'file' type.
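For reference, the memmapfile route mentioned in this comment could look roughly like the sketch below. It is only an illustration under assumptions: the demo file name and contents are made up, the field name x is arbitrary, and memmapfile reads in the machine's native byte order, so this assumes a little-endian machine to match the files in the question.

```matlab
% Hypothetical demo file: 6 snapshots of 31 single values each
demoFile = fullfile(tempdir, 'memmap_demo.bin');
snapshots = reshape(single(1:31*6), 31, 6);   % column k is snapshot k
fid = fopen(demoFile, 'w');
fwrite(fid, snapshots, 'single');
fclose(fid);

% Number of snapshots from the file size (31 values * 4 bytes each)
info = dir(demoFile);
numSnapshots = info.bytes / (31 * 4);

% Map the file as a 31-by-numSnapshots single matrix without loading it
m = memmapfile(demoFile, 'Format', {'single', [31 numSnapshots], 'x'});

% Index into the mapped data to pull out just the parts that are needed
thirdSnapshot = m.Data.x(:, 3);   % all 31 values of snapshot 3
sixthValue = m.Data.x(6, :);      % value 6 of every snapshot

clear m   % release the mapping
```

Each .bin file would need its own map, so this suits per-file processing more than treating the whole session as one array.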
dpb on 29 Jul 2025
I just noticed <FEX submittal cell2underlying> that might solve the actual question asked. Looking at the code it seems pretty complex, but it has high marks.


Accepted Answer

Edric Ellis on 30 Jul 2025
Edited: Edric Ellis on 30 Jul 2025
You can use UniformRead=true in your fileDatastore to avoid the need for cell2underlying. Like this:
dirname = fullfile(tempdir, "data_files");
if ~isfolder(dirname)
    mkdir(dirname);
end
makeSomeFiles(dirname);
% Without UniformRead - get a cell array
t1 = tall(fileDatastore(dirname, ReadFcn=@ReadBinFile, FileExtensions=".bin"));
head(t1)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: 0% complete
- Pass 1 of 1: Completed in 0.51 sec
Evaluation completed in 0.67 sec
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
% With UniformRead - get numeric array
t2 = tall(fileDatastore(dirname, ReadFcn=@ReadBinFile, ...
    FileExtensions=".bin", UniformRead=true));
head(t2)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: 0% complete
- Pass 1 of 1: Completed in 0.072 sec
Evaluation completed in 0.16 sec
  Columns 1 through 18
    0.3289    0.9308    0.4696    0.3568    0.6754    0.5802    0.4179    0.6907    0.3941    0.6753    0.6750    0.1697    0.6579    0.7239    0.1972    0.3395    0.5684    0.5779
    0.0016    0.1957    0.1327    0.4781    0.1755    0.6272    0.1420    0.1352    0.0553    0.6035    0.4224    0.7938    0.8141    0.2770    0.0373    0.7989    0.3175    0.3785
    0.4750    0.1540    0.2161    0.7176    0.2297    0.3621    0.8475    0.2358    0.6638    0.1494    0.9366    0.7775    0.3787    0.1691    0.3113    0.2292    0.9661    0.5620
    0.4041    0.6139    0.4539    0.8139    0.1485    0.8899    0.7140    0.9949    0.7610    0.1544    0.9431    0.8052    0.2934    0.3014    0.6858    0.7874    0.7229    0.9528
    0.2361    0.6919    0.2924    0.5637    0.1011    0.0381    0.3046    0.4321    0.2706    0.9510    0.6782    0.1797    0.9313    0.6921    0.2593    0.9228    0.8298    0.9948
    0.9111    0.8507    0.8993    0.2347    0.2011    0.8509    0.6860    0.4809    0.7143    0.5698    0.6604    0.5687    0.8197    0.8609    0.8439    0.3772    0.1328    0.8385
    0.3509    0.2323    0.2379    0.9308    0.5911    0.3410    0.8434    0.3656    0.2713    0.1418    0.6095    0.7457    0.9086    0.1395    0.3453    0.9494    0.2587    0.0088
    0.3134    0.6405    0.2133    0.6634    0.3789    0.6891    0.4150    0.6345    0.8739    0.6589    0.0599    0.3294    0.8263    0.9148    0.9762    0.6857    0.2569    0.6316
  Columns 19 through 31
    0.8879    0.9345    0.1042    0.6832    0.3236    0.6281    0.9984    0.4885    0.9548    0.6040    0.0457    0.6696    0.2132
    0.5941    0.3019    0.5208    0.5237    0.4337    0.2236    0.7818    0.2278    0.4154    0.0931    0.9749    0.3032    0.3566
    0.7178    0.2134    0.2635    0.3125    0.9260    0.6741    0.6713    0.6337    0.0771    0.7141    0.8272    0.6493    0.0761
    0.1354    0.6295    0.0368    0.3487    0.8600    0.3651    0.1600    0.2800    0.0557    0.3963    0.5267    0.1156    0.4217
    0.3538    0.0969    0.8557    0.9881    0.8558    0.2145    0.8945    0.4767    0.1730    0.0303    0.7228    0.4689    0.1463
    0.7149    0.0021    0.3806    0.4231    0.3380    0.0049    0.3643    0.5080    0.7519    0.9627    0.9779    0.6296    0.3157
    0.7360    0.0637    0.9878    0.0726    0.5481    0.3348    0.3130    0.6760    0.6201    0.7845    0.4610    0.1913    0.3020
    0.7800    0.6866    0.3556    0.6867    0.8253    0.4901    0.8228    0.3320    0.5700    0.0769    0.5974    0.5506    0.9041
%% Read binary file
function data = ReadBinFile(filename)
    fh = fopen(filename, "rb");
    data = fread(fh, Inf, "*single");
    fclose(fh); % close the file so handles are not leaked across reads
    data = reshape(data, [], 31);
end
%% Make some random data
function makeSomeFiles(dirname)
    for ii = 1:10
        fh = fopen(fullfile(dirname, sprintf("file_%03d.bin", ii)), "wb");
        assert(fh > 0);
        fwrite(fh, rand(100,31,"single"),"single");
        fclose(fh);
    end
end
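One detail worth checking when adapting the read function above to the files in the question: the demo files here are written from a 100×31 matrix in column-major order, so reshape(data, [], 31) recovers them exactly. If the values are instead stored snapshot by snapshot (31 values, then the next 31, as the question describes), a reshape to 31 rows followed by a transpose would be needed to get one snapshot per row. A small sketch with made-up values, where snapshot k holds k*100 + (0:30):

```matlab
% Run tall evaluations in the local MATLAB session
mapreducer(0);

% Stand-in for what fread returns: 4 snapshots stored one after another.
% The values are made up so that snapshot k holds k*100 + (0:30).
numSnapshots = 4;
raw = single(reshape((1:numSnapshots) * 100 + (0:30).', [], 1));

% Reshaping straight to N-by-31 interleaves values from different snapshots
mixed = reshape(raw, [], 31);

% Reshape to 31-by-N first (one column per snapshot), then transpose
perRow = reshape(raw, 31, []).';

% Row 2 of perRow is snapshot 2; row 2 of mixed is not
expected = single(200 + (0:30));
rowMatches = isequal(perRow(2, :), expected);
mixedMatches = isequal(mixed(2, :), expected);

% With one snapshot per row, "value 6 of every snapshot" is column 6
t = tall(perRow);
col6 = gather(t(:, 6));
```

Under that assumption, the last line of ReadBinFile would become data = reshape(data, 31, []).'; and an expression like data(6,:) on the in-memory array corresponds to t(:,6) on the tall array.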
9 comments
Edric Ellis on 31 Jul 2025
@dpb Ah, I see. Yes, I agree that the first section of the "Description" does definitely imply that tall(ds) is either a tall table, tall timetable, or tall cell. I'll report this to our doc team. Thanks for pointing out exactly where the problem is!
dpb on 31 Jul 2025
Edited: dpb on 31 Jul 2025
Thanks. I knew I was looking directly at the statement when I wrote the original response! <g>.
I stopped looking after seeing that, figuring it was useless; hence I didn't see (or go looking for) the 'UniformRead' named parameter...although it did seem like a strange restriction.

Release: R2022a
