Building tall table from tall arrays generates error

Harry Cho on 14 Mar 2023
Commented: Harry Cho on 15 Mar 2023
clear
dataFile = 'data.csv';
ds = tabularTextDatastore(dataFile, FileExtensions='.csv');
ds.ReadVariableNames = true;
ds.Delimiter = ',';
ds.SelectedVariableNames = ["hash", "count"];
ds.SelectedFormats = {'%s', '%f'};
data = tall(ds);
Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to the parallel pool (number of workers: 2).
[g, THash] = findgroups(data.hash);
TCount = splitapply(@(x) {x}, data.count, g);
%% This works but cannot use it because actual data file is far larger than memory
hash = gather(THash);
Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 1: 0% complete
- Pass 1 of 1: 100% complete
- Pass 1 of 1: Completed in 1.9 sec
Evaluation completed in 2.8 sec
count = gather(TCount);
Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 3: 0% complete
- Pass 1 of 3: 100% complete
- Pass 1 of 3: Completed in 0.54 sec
- Pass 2 of 3: 0% complete
- Pass 2 of 3: 100% complete
- Pass 2 of 3: Completed in 0.46 sec
- Pass 3 of 3: 0% complete
- Pass 3 of 3: 100% complete
- Pass 3 of 3: Completed in 0.58 sec
Evaluation completed in 2.3 sec
T1 = table(hash, count);
%% This is the intended code but doesn't work
TT = table(THash,TCount);
Error using tall/table
Incompatible non-scalar tall array arguments. Each of the tall arrays must be the same size in the first dimension, must be derived from a single tall array, and must not have been indexed
differently in the first dimension (indexing operations include functions such as VERTCAT, SPLITAPPLY, SORT, CELL2MAT, SYNCHRONIZE, RETIME and so on).
write(fullfile(pwd,'data'),TT,FileType="parquet");

Answers (1)

Oguz Kaan Hancioglu on 15 Mar 2023
Your code didn't work because "gather(TCount)" returns a cell array with one cell per element. You are therefore trying to write a double array into a single cell. Instead, you can find the length of each array in the cell array. I hope this solves your problem.
%% This works but cannot use it because actual data file is far larger than memory
hash = gather(THash);
count = gather(TCount);
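% Size vector of the array stored in each cell
cellsz = cellfun(@size,count,'UniformOutput',false);
% Keep only the first dimension (number of rows) of each array
newCount = cellfun(@(x) x(1),cellsz,'UniformOutput',false);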
T1 = table(hash, newCount);
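The length computation above can also be written in one step. The following is only a minimal in-memory sketch: the sample values in hash and count and the variable name groupSize are made up for illustration and do not come from the original data. It assumes the gathered result fits in memory, so it does not address the tall-table error itself.

```matlab
% Hypothetical stand-ins for the gathered results (not the original data).
hash  = {'a'; 'b'; 'c'};
count = {[1; 2; 3]; 4; [5; 6]};

% numel gives the number of elements in each cell. With the default
% uniform output, cellfun returns a numeric column, not a cell array.
groupSize = cellfun(@numel, count);

% One row per hash, holding the number of values collected for that hash.
T1 = table(hash, groupSize);
```

Note that numel counts all elements, which matches the first dimension used above only when each cell holds a column vector.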
1 Comment
Harry Cho on 15 Mar 2023
Thank you for the reply. Unfortunately, I have to collect a cell array in which each cell holds a double array of a different length. My question is why this works for the in-memory table T1 but not for the tall table TT.

Release

R2022b
