Help on parallelizing code - Nested files

4 views (last 30 days)
Juan Bolanos
Juan Bolanos on 8 Oct 2020
Commented: Seth Furman on 12 Oct 2020
Hello friends!
I am working on parsing a large library (tens of thousands) of XML files. My intention is to parse all of them and save the information I need in a single variable for post-processing (and to save it in another format that isn't as nested).
The files are not custom made by me or my team, and they have a rather convoluted nested structure. In pseudocode, my current parfor loop looks like this:
data_out = ["column_header_1", "column_header_2", "column_header_n"]
parfor z = 1:numFiles
    file = xml2struct(fullFileNames{z});
    for i = 1:length(file.logfile.scan) % * outer loop, over scans
        % Header info
        var_1 = convertCharsToStrings(file.logfile.Attributes.var1)
        var_2 = convertCharsToStrings(file.logfile.Attributes.var2)
        var_n = convertCharsToStrings(file.logfile.Attributes.var3)
        % Sometimes the scan is singular and sometimes there are multiple, so I
        % have an if/else to prevent an indexing error. Omitted here; showing
        % only the multiple-scan case.
        first_section_file = file.logfile.scan{1,i}
        try
            for j = 1:length(first_section_file) % ** inner loop
                % Here I need some data from, say, first_section_file.info_1.Attributes.
                % There is also another structure at this point, info_2, that I
                % have to get data out of. As with scan, it can have a single
                % reading or multiple readings, so I have an if/else.
                second_var_1 = first_section_file.info_1.Attributes.var1
                second_var_2 = first_section_file.info_1.Attributes.var2
                second_var_3 = first_section_file.info_1.Attributes.var3
                if reading == 1
                    third_var_1 = convertCharsToStrings(first_section_file.info_2.Attributes.var1)
                    third_var_2 = convertCharsToStrings(first_section_file.info_2.Attributes.var2)
                    third_var_n = convertCharsToStrings(first_section_file.info_2.Attributes.var3)
                else
                    % same code, just getting the information out of the given
                    % reading and iterating over it
                end
                data_out = variables
            end % end of the ** inner loop
        catch
            fprintf("No data\n")
        end % end of the try
    end % end of the * outer loop; end of scan
end % end of code
My intention is to make the first loop the parfor loop, so that each file is handled by a different worker. The problem I have is setting up the "data_out" variable so that the data I need actually gets saved to it. As the code stands, I don't have a problem with the parfor itself; rather, I think it's a "race condition" of sorts: since each iteration resets the value, nothing is ever saved.
I tried setting it up as data_out = [data_out; variables], but that results in an error from the parfor, and using cat doesn't work either. I also tried putting the loops into a separate function, but that gave me more problems than solutions (granted, I could have made a couple of mistakes trying it). Another issue is that the iteration indices are not a good way of addressing data_out, since they reset on every iteration as I go over all the scans; that would overwrite the values already stored, and the z index would be very slow-moving, as it is the file counter.
Maybe someone has dealt with an issue like this and can shed some light? In case anyone has heard of it before, I am working with OpenBMap files. I have a working for-loop version, but as it stands, the duration of each iteration grows as more data is appended to the data_out array (as one would expect). I could preallocate, but I don't know in advance how many data points there will be once all the files are read.
As a side note, in another test I made a separate parfor loop to convert all the XML files into structures, since the profiler pointed to that particular function as the bottleneck of the code, but the long-run performance gain from doing so wasn't as great as expected.
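For context on the concatenation error described above, one common pattern that sidesteps both the growing-array problem and the parfor restriction is to have each iteration fill exactly one cell of a cell array indexed by z (a "sliced" output variable, which parfor does allow), then concatenate once after the loop. A minimal sketch, where parseOneFile is a hypothetical helper standing in for the nested parsing shown earlier and is assumed to return an N-by-K string array per file:

```matlab
% Sketch: collect a variable number of rows per file, combine once at the end.
perFile = cell(numFiles, 1);                      % sliced output variable
parfor z = 1:numFiles
    % each worker writes only to its own cell, so there is no race condition
    perFile{z} = parseOneFile(fullFileNames{z});  % hypothetical helper
end
data_out = vertcat(perFile{:});                   % single concatenation, outside parfor
```

Because the concatenation happens exactly once, outside the parallel loop, no preallocation of the final array is needed even though the per-file row counts are unknown in advance.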

Accepted Answer

Seth Furman
Seth Furman on 12 Oct 2020
A few suggestions:
Keep in mind that:
  • An output variable can't depend on other loop iterations, so things like concatenation across iterations will never work. It is, however, possible to assign to distinct elements of an output variable, as in the following example, and then do some aggregation on the output variable afterwards.
parfor i = 1:10
outputVar(i) = blah(files(i)); % assign to consecutive elements of an output variable
end
% aggregate output var
aggregatedOutputVar = aggregateBlahOutputs(outputVar);
  • There is overhead involved in using parfor loops, and parfor may not necessarily speed up your code, for example when a lot of data has to be copied to and from each worker in the pool. See Decide When to Use parfor when weighing whether parfor makes sense for your use case.
You may want to consider a different approach.
  • Take a look at Large Files and Big Data. You may benefit from using a Datastore, which simplifies processing large collections of files that don't fit into memory.
  2 comments
Juan Bolanos
Juan Bolanos on 12 Oct 2020
Hello Seth!
Thanks a lot for your answer. I don't understand the difference between readstruct and xml2struct (aside from the former being native; I didn't come across it while I was looking into reading XMLs in another format that was a DOM, so thanks a lot for pointing it out). I will test it out and see the differences :)
Looking at when to use parfor, it might not be the way to go for now, since there is sometimes a large amount of worker-to-client data transfer. It might be valuable, however, once I have the data centralized in a single database or a more manageable format.
With regard to the Datastore, I was considering de-nesting (if that's a term) each of the individual XML files to get a more linear representation of the data I need from them, which I can store as a plain matrix of strings and then process as needed. I was thinking of saving separate .mat files (saving the workspaces), but it seems I would need to make them .csv instead. I need to read more on the matter, of course, but would it then be possible to read all the .csv files, make a datastore, and perform post-analysis on it?
For example, selecting data points based on the value of a specific variable? (This part I could run in parallel, feeding each worker a section of the datastore, correct?)
Seth Furman
Seth Furman on 12 Oct 2020
You can create a datastore for XML files so long as you provide a function for reading them into MATLAB (which you already have with "readstruct" in this case). Then you can process the data in the datastore using functions like "transform", "mapreduce", and "partition".
Here are a couple examples of using datastore with parallel algorithms:
Here's also a simple example of defining a datastore for XML files and pulling out specific fields.
% Create 5 example XML files in a new folder
mkdir("example");
cd("example");
for i = 1:5
copyfile(which("music.xml"),"f"+string(i)+".xml");
end
cd("..");
% Create a datastore out of the "example" folder
ds = datastore("example","Type","file","ReadFcn",@readstruct);
preview(ds)
% Define a new datastore that's just the
% <Ensemble><Music>...</Music></Ensemble> fields of each XML file
ensembleMusicDs = transform(ds,@(x) x.Ensemble.Music);
% Read ensembleMusicDs into memory
readall(ensembleMusicDs)
% OR
% Write ensembleMusicDs into a new folder structure
% writeall(ds,...)
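On the earlier question about feeding each worker a section of the datastore: a minimal sketch (assuming the same ds as above, and noting that the choice of 4 partitions is arbitrary, e.g. the parallel pool size) using partition inside a parfor loop:

```matlab
% Sketch: process a datastore in parallel, one partition per worker.
n = 4;                                % number of partitions (assumed pool size)
results = cell(n, 1);
parfor index = 1:n
    subds = partition(ds, n, index);  % this worker's share of the files
    results{index} = readall(subds);  % or any per-partition filtering/processing
end
```

Selecting data points by the value of a specific variable would then just be a filtering step inside each worker's partition before the results are combined.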


Release: R2020b
