I am reading a large text file in Matlab. Is there any way I can use the parallel processing toolbox to speed up the read times?
Stephen Forczyk on 16 May 2019
I routinely read large text and binary files. Read times can be 10-30 minutes. Is there any way to use the Parallel Computing Toolbox to reduce the read times? Once the file is imported it is saved as a MATLAB-formatted file, so reloading is quick after the first read.
Walter Roberson on 16 May 2019
No, not generally.
Reading a file like that has three important phases:
- getting the raw information from disc into the MATLAB process
- interpreting the raw information (e.g. converting the text representation of numbers into binary representation)
- saving the binary representation to disk
The disc controller can typically hold a couple of location+size commands queued, so that it can be seeking and starting to read from disc while the results of the previous query are being transferred to main memory. However, the capacity for commands is limited, and operating systems typically predict that you are going to need more sectors of a file you have been bulk reading, so they often queue commands in advance to keep the command buffer full when there is work to do. Adding more processes reading from the same disc does not increase the size of the hardware command queue.
Files tend to be stored in groups of adjacent sectors (though not strictly so), and not having to seek to another track is always faster than seeking, with the fastest case being to just continue reading the next adjacent sector. So adding multiple processes, each demanding to move to a different section of the disk to read, creates contention for the read head that is going to be slower than just reading contiguous blocks (unless data placement is planned out very carefully ahead of time).
The hardware to transfer data from the disk to the drive controller is always bandwidth limited, and the hardware to do DMA from the controller to main memory is always bandwidth limited. In some cases the controller can DMA faster than one drive can read, so sometimes you can benefit from splitting data onto two drives attached to the same controller. Carefully planned RAID systems can sometimes make the most of a controller with multiple drives. But even then you need careful planning indeed to keep multiple requests for parts of the same file from interfering with each other.
If you do split the reading up then you would need some way to tell each worker how to quickly locate the beginning of a record to read from, so that processes do not start reading in the middle of a line. If the text file does not use fixed length lines and has not been indexed ahead of time, ending up in the middle of a record is more likely than not.
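As a rough illustration of the record-boundary problem, here is a minimal sketch of one way to compute line-aligned start offsets for a file with variable-length lines. The names demoFile, nChunks, starts and stops are hypothetical, and the small demonstration file is generated on the spot. Each approximate split point is pushed forward to the start of the next line. Note that this pre-scan only solves the alignment problem; it does nothing about the hardware contention described above.

```matlab
% Build a small demonstration file with variable-length lines (hypothetical data).
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
for k = 1:1000
    fprintf(fid, '%d,%.6f\n', k, sqrt(k));
end
fclose(fid);

nChunks = 4;                       % hypothetical number of readers
info = dir(demoFile);
fileBytes = info.bytes;

% starts(c) is the byte offset at which chunk c begins; chunk 1 begins at 0.
starts = zeros(1, nChunks);
fid = fopen(demoFile, 'r');
for c = 2:nChunks
    guess = floor((c - 1) * fileBytes / nChunks);
    % Step back one byte so that a guess landing exactly on a line start is kept.
    fseek(fid, guess - 1, 'bof');
    fgetl(fid);                    % discard the remainder of the partial line
    starts(c) = ftell(fid);        % now positioned at the start of a line
end
fclose(fid);
stops = [starts(2:end), fileBytes];   % chunk c covers bytes starts(c) .. stops(c)-1

% Read each chunk separately and count the complete lines it contains.
lineCounts = zeros(1, nChunks);
fid = fopen(demoFile, 'r');
for c = 1:nChunks
    fseek(fid, starts(c), 'bof');
    chunk = fread(fid, stops(c) - starts(c), '*char')';
    lineCounts(c) = sum(chunk == char(10));
end
fclose(fid);
delete(demoFile);
```

Because every chunk starts and stops on a line boundary, the per-chunk line counts add up to the number of lines in the file and no record is split between readers.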
A situation where it is possible to gain by having multiple readers is one where parsing the text into binary is fairly complicated and is slower than reading from disc. In such cases it can make sense to have multiple processes reading, keeping the disk busy while a worker puzzles out the chunk it has read in. However, typical conversion of the text representation of numbers to binary is faster than the disk, so this does not help for most plain text files. Now if you were adding each point into a large kdtree as it is read in...
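To check which phase dominates for a particular file, one rough approach is to time the raw read and the text-to-number conversion separately. The sketch below does this on a generated demonstration file; demoFile and the '%f' format are assumptions, and you would substitute your own file and parsing step. Be aware that a file that was just written or recently read is usually served from the operating system's cache, so the read time measured here will look far better than a first read from disc.

```matlab
% Generate a demonstration file of numbers, one per line (hypothetical data).
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
fprintf(fid, '%.8f\n', rand(1, 200000));
fclose(fid);

% Phase 1: get the raw characters from disc into the MATLAB process.
tic;
fid = fopen(demoFile, 'r');
raw = fread(fid, Inf, '*char')';
fclose(fid);
tRead = toc;

% Phase 2: interpret the text as binary doubles.
tic;
values = sscanf(raw, '%f');
tParse = toc;

fprintf('read: %.3f s, parse: %.3f s, %d values\n', tRead, tParse, numel(values));
delete(demoFile);
```

If the parse time is small compared with a cold read from disc, extra workers have nothing to overlap with and are unlikely to help.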
For typical files, adding more workers makes things slower. There are files that can benefit, but those tend to be delegated to database software that takes care of generating and controlling the threads.
The third phase is writing to disc. If you have one worker writing to the same disc that another is reading from then again you have contention for hardware.