Help on parallelizing code - Nested files

4 views (last 30 days)
Juan Bolanos
Juan Bolanos on 8 Oct 2020
Commented: Seth Furman on 12 Oct 2020
Hello friends!
I am working on parsing a large library (tens of thousands) of XML files. My intention is to parse all of them and save the information I need in a single variable for post-processing (and to save it in another format that isn't as nested).
The files are not custom made by me or my team, and they have a rather convoluted nested structure. In pseudocode, my current parfor loop looks like this:
data_out = ["column_header_1", "column_header_2", "column_header_n"]
parfor z = 1:numFiles
    file = xml2struct(fullFileNames{z});
    for i = 1:length(file.logfile.scan) % * outer loop, over scans
        % Header info
        var_1 = convertCharsToStrings(file.logfile.Attributes.var1)
        var_2 = convertCharsToStrings(file.logfile.Attributes.var2)
        var_n = convertCharsToStrings(file.logfile.Attributes.var3)
        % Sometimes the scan is singular and sometimes there are multiple, so I
        % have an if/else to prevent an indexing error. Omitted here; showing
        % only the multiple-scan case.
        first_section_file = file.logfile.scan{1,i}
        try
            for j = 1:length(first_section_file) % ** inner loop
                % Here I need some data from, say, first_section_file.info_1.Attributes.
                % There is also another structure at this point, info_2, that I
                % have to get data out of. As with scan, it can have a single
                % reading or multiple readings, so I have an if/else.
                second_var_1 = first_section_file.info_1.Attributes.var1
                second_var_2 = first_section_file.info_1.Attributes.var2
                second_var_3 = first_section_file.info_1.Attributes.var3
                if reading == 1
                    third_var_1 = convertCharsToStrings(first_section_file.info_2.Attributes.var1)
                    third_var_2 = convertCharsToStrings(first_section_file.info_2.Attributes.var2)
                    third_var_n = convertCharsToStrings(first_section_file.info_2.Attributes.var3)
                else
                    % same code, just getting the information out of the given
                    % reading and iterating over it
                end
                data_out = variables
            end % end of the ** inner loop
        catch
            fprintf("No data\n")
        end % end of the try
    end % end of the * outer loop; end of scan
end % end of code
My intention is to make the first loop the parfor loop, so that each file is handled by a different worker. The problem I have is setting up the "data_out" variable so that the data I need actually gets saved to it. As the code stands, I don't have a problem with the parfor itself; rather, I think it's a "race condition" of sorts: since each iteration resets the value, nothing is ever saved.
I tried setting it up as data_out = [data_out; variables], but that results in an error from the parfor, and using cat doesn't work either. I also tried putting the loops into a separate function, but that gave me more problems than solutions (granted, I could have made a couple of mistakes trying it). Another issue is that the iteration indices are not a good way of addressing data_out, since they reset on every iteration as I go over all the scans; that would overwrite the values already stored, and the z index would be very slow-moving, as it is the file counter.
Maybe someone has dealt with an issue like this and can shed some light? In case anyone has heard of it before, I am working with OpenBMap files. I have a working for-loop version, but as it stands, the duration of each iteration grows as more data is appended to the data_out array (as one would expect). I could preallocate, but I don't know in advance how many data points there will be once all the files are read.
As a side note, in another test I made a separate parfor loop to convert all the XML files into structures, since the profiler pointed to that particular function as the bottleneck of the code, but the long-run performance gain from doing so wasn't as great as expected.
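For context on the concatenation error described above, one common pattern that sidesteps both the growing-array problem and the parfor restriction is to have each iteration fill exactly one cell of a cell array indexed by z (a "sliced" output variable, which parfor does allow), then concatenate once after the loop. A minimal sketch, where parseOneFile is a hypothetical helper standing in for the nested parsing shown earlier and is assumed to return an N-by-K string array per file:

```matlab
% Sketch: collect a variable number of rows per file, combine once at the end.
perFile = cell(numFiles, 1);                      % sliced output variable
parfor z = 1:numFiles
    % each worker writes only to its own cell, so there is no race condition
    perFile{z} = parseOneFile(fullFileNames{z});  % hypothetical helper
end
data_out = vertcat(perFile{:});                   % single concatenation, outside parfor
```

Because the concatenation happens exactly once, outside the parallel loop, no preallocation of the final array is needed even though the per-file row counts are unknown in advance.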

Accepted Answer

Seth Furman
Seth Furman on 12 Oct 2020
A few suggestions:
Keep in mind that:
  • An output variable can't depend on other loop iterations, so things like concatenation across iterations will never work. It is, however, possible to assign to distinct elements of an output variable, as in the following example, and then do some aggregation on the output variable afterwards.
parfor i = 1:10
outputVar(i) = blah(files(i)); % assign to consecutive elements of an output variable
end
% aggregate output var
aggregatedOutputVar = aggregateBlahOutputs(outputVar);
  • There is overhead involved in using parfor loops, and parfor may not necessarily speed up your code, for example when a lot of data has to be copied to and from each worker in the pool. See Decide When to Use parfor when weighing whether parfor makes sense for your use case.
You may want to consider a different approach.
  • Take a look at Large Files and Big Data. You may benefit from using a Datastore, which simplifies processing large collections of files that don't fit into memory.
  2 comments
Juan Bolanos
Juan Bolanos on 12 Oct 2020
Hello Seth!
Thanks a lot for your answer. I don't understand the difference between readstruct and xml2struct (aside from the former being native; I didn't come across it while I was looking into reading XMLs in another format that was a DOM, so thanks a lot for pointing it out). I will test it out and see the differences :)
Looking at when to use parfor, it might not be the way to go for now, since there is sometimes a large amount of worker-to-client data transfer. It might be valuable, however, once I have the data centralized in a single database or a more manageable format.
With regard to the Datastore, I was considering de-nesting (if that's a term) each of the individual XML files to get a more linear representation of the data I need from them, which I can store as a plain matrix of strings and then process as needed. I was thinking of saving separate .mat files (saving the workspaces), but it seems I would need to make them .csv instead. I need to read more on the matter, of course, but would it then be possible to read all the .csv files, make a datastore, and perform post-analysis on it?
For example, selecting data points based on the value of a specific variable? (This part I could run in parallel, feeding each worker a section of the datastore, correct?)
Seth Furman
Seth Furman on 12 Oct 2020
You can create a datastore for XML files so long as you provide a function for reading them into MATLAB (which you already have with "readstruct" in this case). Then you can process the data in the datastore using functions like "transform", "mapreduce", and "partition".
Here are a couple examples of using datastore with parallel algorithms:
Here's also a simple example of defining a datastore for XML files and pulling out specific fields.
% Create 5 example XML files in a new folder
mkdir("example");
cd("example");
for i = 1:5
copyfile(which("music.xml"),"f"+string(i)+".xml");
end
cd("..");
% Create a datastore out of the "example" folder
ds = datastore("example","Type","file","ReadFcn",@readstruct);
preview(ds)
% Define a new datastore that's just the
% <Ensemble><Music>...</Music></Ensemble> fields of each XML file
ensembleMusicDs = transform(ds,@(x) x.Ensemble.Music);
% Read ensembleMusicDs into memory
readall(ensembleMusicDs)
% OR
% Write ensembleMusicDs into a new folder structure
% writeall(ds,...)
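On the earlier question about feeding each worker a section of the datastore: a minimal sketch (assuming the same ds as above, and noting that the choice of 4 partitions is arbitrary, e.g. the parallel pool size) using partition inside a parfor loop:

```matlab
% Sketch: process a datastore in parallel, one partition per worker.
n = 4;                                % number of partitions (assumed pool size)
results = cell(n, 1);
parfor index = 1:n
    subds = partition(ds, n, index);  % this worker's share of the files
    results{index} = readall(subds);  % or any per-partition filtering/processing
end
```

Selecting data points by the value of a specific variable would then just be a filtering step inside each worker's partition before the results are combined.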


Release: R2020b
