Processing Data in a Tall Array
Mostrar comentarios más antiguos
I am working on a text file with 22 GB consisting of around 180 million rows and 17 columns. It's transportation data consisting of "Trip" and "Speed". I want to calculate the average speed of each trip.
clear;clc;
data='D:\Umich\eBlueBus_2019 Summer\UMTRI data\SQLDataExport.txt';
% data='D:\Umich\eBlueBus_2019 Summer\Simulation\Trip_11to12_0702.txt';
ds = tabularTextDatastore(data);
ds.ReadSize=200000;
ds.SelectedVariableNames={'Trip','GpsSpeed'};
ds.SelectedFormats={'%f','%f'};
t_array=tall(ds);
A=matlab.tall.transform(@reduce_fcn,t_array,'OutputsLike',{(table(1,1,'VariableNames',{'Trip','Spd'}))});
Where the @reduce_fcn is the function below:
function TT=reduce_fcn(t_array)
[groups,Y]=findgroups(t_array.Trip);
D=splitapply(@mean,t_array.GpsSpeed,groups);
TT= table(Y,D,'VariableNames',{'Trip','Spd'});
end
After I ran and commanded B=gather(A), I got an error:
>> B=gather(A);
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: 12% complete
Evaluation 12% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 77)
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 58461, field number 2) ==>
Trip,Time,GpsTime,GpsWeek,GpsHeading,GpsSpeed,Latitude,Longitude,Altitude,NumberOfSats,Differential,FixMode,Pdop,GpsBytes,UtcTime,UtcWeek\n
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
[t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}, readFailureSummary] = iGather(varargin{:});
Caused by:
Reading the variable name 'Trip' using format '%f' from file: 'D:\Umich\eBlueBus_2019
Summer\UMTRI data\SQLDataExport.txt' starting at offset 2852126773.
I did check the row 58461 of the data,
the Trip and GpsSpeed are the second column (i.e. 10) and the seventh column (i.e. 1.4299999). They don't seem like a problem to me. I am confused why reading t_array.Trip by %f would be a problem indicated in the error.
the Trip and GpsSpeed are the second column (i.e. 10) and the seventh column (i.e. 1.4299999). They don't seem like a problem to me. I am confused why reading t_array.Trip by %f would be a problem indicated in the error. Besides, when I ran the test data file which is only a portion of the original data consisting only Trip 11 and 12 (i.e. the comment part of the code), I did not get any problem and have the answer I'd expected: a 2 by 2 table where the first column is Trip 11 & 12 and the second is average speed of the Trip 11 & 12.
Appreciate any suggestions. Thank you
1 comentario
Juan-Jie Sun
el 4 de Jul. de 2019
Respuesta aceptada
Más respuestas (1)
Josh Meyer
el 3 de Jul. de 2019
findgroups and splitapply work with tall arrays, so don't worry about using tall.transform or tall.reduce for this simple task.
You can tell which functions work with tall arrays by checking the bottom of the reference page in the doc for an Extended Capabilities section. If there are any limitations of the support it will be detailed there as well.
What I noticed is that
mean_spd = gather(splitapply(@mean,spd,trip));
should be
mean_spd = gather(splitapply(@mean,spd,G));
If this doesn't work (and I expect it will, with the issue being that "trip" isn't a well-defined grouping variable on its own), it makes me start to wonder if the data needs to be cleaned before the operation. You can use ismissing, rmmissing, fillmissing, etc... to do that if necessary.
1 comentario
Juan-Jie Sun
el 4 de Jul. de 2019
Categorías
Más información sobre Text Data Preparation en Centro de ayuda y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!

