Unreasonably Large MAT File

I am applying a custom-developed, transform-domain compression algorithm to a data file. The algorithm performs as expected: an acceptable compression ratio is achieved, and the original data is reconstructed with small error. I attached plots of the original and reconstructed data; there are no issues with the algorithm's performance. However, I am having issues when I save the reconstructed data to disk as a MAT file. The input data is about 7 MB on disk, while the reconstructed data takes more than 30 MB of space on the hard disk.
The attached plots show two different input data sets along with the corresponding reconstructed data sets. To save the reconstructed data to MAT file, I used MATLAB's "save" command.
save test1.mat reconstructedData;
save test2.mat inputData; % just to verify that this MAT file has the same size as the input MAT file
Why is the reconstructedData much larger on disk than the input data even though the plots tell a different story?

21 comments

Walter Roberson
Walter Roberson on 21 Aug 2020
Is it possible that you are saving as -v7.3 by default?
AdiKaba
AdiKaba on 21 Aug 2020
Saving the variable as -v7 makes no real difference:
File size when saving as -v7.3: 30,862 KB
File size when saving as -v7: 30,064 KB
Walter Roberson
Walter Roberson on 22 Aug 2020
It strikes me that 30 megabytes divided by 7.5 megabytes is 4.
If the original data is 16 bit integer, and you convert it to double precision, then that would multiply the size by 4.
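A quick sketch of that arithmetic (in Python rather than MATLAB, purely for illustration; the `array` typecodes `'h'` and `'d'` stand in for int16 and double):

```python
# Illustration only: 16-bit integers take 2 bytes per sample, doubles take 8,
# so converting int16 data to double precision quadruples the raw byte count.
from array import array

n = 1000
as_int16 = array('h', range(n))                       # 'h' = signed 16-bit int
as_double = array('d', (float(v) for v in range(n)))  # 'd' = 64-bit float

ratio = len(as_double.tobytes()) // len(as_int16.tobytes())
print(as_int16.itemsize, as_double.itemsize, ratio)   # 2 8 4
```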
per isakson
per isakson on 22 Aug 2020
v7.3 creates large files. The HDFView software can help you understand how.
AdiKaba
AdiKaba on 22 Aug 2020
@Walter, you are correct: I extracted the original data from a proprietary file format into a MAT file as 16-bit integers. I will check whether the reconstructed data is being converted to double precision by MATLAB.
Walter Roberson
Walter Roberson on 22 Aug 2020
16-bit integers used to encode floating-point signals are often based on some kind of non-uniform pulse-code modulation of the differences between adjacent samples, or sometimes on non-linear lookups such as u-law https://en.wikipedia.org/wiki/%CE%9C-law_algorithm
It is also not uncommon these days to take a signal, take adjacent differences, do a wavelet transform, and encode the transform coefficients as integers.
In each of these cases, the reconstructed signal is floating point. But the cases vary as to how much information content there is in each floating point value.
When I look at the reconstructed signal plot, it looks to me as if there has been some kind of filtering and some kind of pulse code modulation going on.
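For reference, the μ-law companding mentioned above can be sketched as follows (a generic illustration with μ = 255, the standard telephony value; this is not the poster's algorithm, and the input is assumed normalized to [-1, 1]):

```python
import math

MU = 255.0  # standard telephony value; an assumption for this sketch

def mulaw_compress(x):
    """Non-linear map of [-1, 1] giving small amplitudes more resolution."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse map: recover the original amplitude."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Round-trips to within floating-point error:
print(mulaw_expand(mulaw_compress(0.3)))
```

After compression the values would typically be quantized to 8-bit integers, which is where the actual size reduction comes from.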
AdiKaba
AdiKaba on 22 Aug 2020
Very good observations. The input signal goes through the following processing stages: DWT, thresholding, quantization, and encoding. To reconstruct the signal, the reverse of that processing is performed, except for the thresholding stage: decoding, de-quantization, and IDWT. It is still puzzling why MATLAB incurs so much overhead when saving the file.
Walter Roberson
Walter Roberson on 22 Aug 2020
Unfortunately with that sequence, I am not very good at estimating the actual information content of the samples -- the minimum change in floating point value that would result in a change in the encoding. I suspect it is not uniform anyhow.
AdiKaba
AdiKaba on 22 Aug 2020
Yes, the encoding is not uniform. As you can observe from the plots, the amplitude of the reconstructed signal is lower than that of the original signal. Why is the signal with lower amplitude occupying more storage space?
Walter Roberson
Because of the difference between uint16 (2 bytes per entry), single precision (4 bytes per entry), and double precision (8 bytes per entry).
Consider:
x = uint8(1:16);
information content: 16 distinct possibilities, so ceil(log2(16)) = 4 bits each
Minimum bits required to represent in binary: 5 (0 is unused; representing 16 exactly requires 5 bits)
Representation size: 8 bits each
y = double(x) ./ 2.^(floor(log2(double(x)+1)));
information content: 11 distinct possibilities, so ceil(log2(11)) = 4 bits each. (Clearly with a bit of work I could have ended up with fewer unique values, reducing the information content.)
>> y * 16
ans =
8 16 12 16 20 24 14 16 18 20 22 24 26 28 15 16
So 5 bits required to represent exactly as multiples of 1/16
Representation size: 8 bytes = 64 bits each
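The same example cross-checked in Python (a direct transliteration of the MATLAB expressions above):

```python
import math

x = range(1, 17)
y = [v / 2.0 ** math.floor(math.log2(v + 1)) for v in x]

scaled = [v * 16 for v in y]   # multiples of 1/16, matching the output above
print(scaled)
print(len(set(y)))             # 11 distinct values, so ceil(log2(11)) = 4 bits
```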
AdiKaba
AdiKaba on 24 Aug 2020
Thank you. Your comments would hold true if the types of the input and reconstructed variables were different. However, the input and reconstructed variables have the same size and type, as shown below. Please check the number of bytes required to store the reconstructed signal.
>> whos InputSignal
  Name                     Size          Bytes       Class     Attributes
  InputSignal           2000000x1      32000000      double    complex
>> whos reconstructedSignal
  Name                     Size          Bytes       Class     Attributes
  reconstructedSignal   2000000x1      32000000      double    complex
>> save test.mat InputSignal
>> save test1.mat reconstructedSignal
>> f1=dir('test.mat');
>> fprintf('%d\n',f1.bytes)
4186913
>> f2=dir('test1.mat');
>> fprintf('%d\n',f2.bytes)
30785398
AdiKaba
AdiKaba on 24 Aug 2020
The reconstructed signal can be represented with a smaller number of bytes per sample. Assume the complex input is represented in double precision (32 bytes per sample). For example, if the compression ratio is 5, this indicates that the reconstructed signal can be represented by 32/5 bytes (single precision float).
Walter Roberson
Walter Roberson on 25 Aug 2020
You said it was 16-bit integer input:
https://www.mathworks.com/matlabcentral/answers/582881-unreasonably-large-mat-file#comment_981467
Walter Roberson
Walter Roberson on 25 Aug 2020
save -v7 compresses by default, but not all data is equally compressible with the algorithm used.
AdiKaba
AdiKaba on 25 Aug 2020
Sorry for the confusion; I was talking about how the input data is obtained from a stored file. The input data is complex, where each I/Q sample is represented in double precision. I don't mind if MATLAB further compresses the reconstructed file, but I would be fine if MATLAB simply allocated the proper storage space for the reconstructed signal. The mystery to me is why MATLAB is adding such excessively large overhead when saving the reconstructed signal.
Walter Roberson
Walter Roberson on 25 Aug 2020
Edited: Walter Roberson on 26 Aug 2020
32000000 bytes of double-precision data are being stored in a file of length 30785398. That requires compression on MATLAB's part. It is not unnecessary overhead; it is compression.
You are deriving the signal from a MAT file with disk size 4186913 that decompresses to 32000000, and you are expecting a file size of 4186913/5, or 4186913, or 32000000/5; I am not sure which. That assumes that your transformed signal compresses at least 5 to 1 using the algorithm that MATLAB uses for saving files, which we have no reason to expect.
MATLAB uses standard zlib, which is what is used for most internet compression, as a reasonable trade-off between flexibility, speed, and memory. It is LZSS+Huffman based, which is a dictionary-lookup compression scheme. When you do your IDWT, you spread information across your data in a way that does not happen to align nicely with dictionary compression, whereas the input file happened to be better suited for dictionary compression.
You should not be looking at it as if there is "really" only 4 megabytes of data in the input file. There is really 32 megabytes of data, which happened to compress roughly 8:1 with zlib. The transformed 32-megabyte signal does not happen to compress nearly as well with the zlib compression that MATLAB uses.
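The point about dictionary compression can be illustrated with zlib directly (a Python sketch with made-up signals, not the poster's data):

```python
import math, struct, zlib

n = 50_000
# Dictionary-friendly data: a short repeating pattern of exact values.
nice = struct.pack(f'<{n}d', *[float(i % 8) for i in range(n)])
# "Smeared" data: full-entropy mantissas, loosely like a post-IDWT signal.
smeared = struct.pack(f'<{n}d', *[math.sin(0.1 * i) * math.exp(-1e-6 * i)
                                  for i in range(n)])

print(len(zlib.compress(nice, 9)))     # tiny fraction of the raw 400000 bytes
print(len(zlib.compress(smeared, 9)))  # much closer to the raw size
```

Both inputs are 400000 bytes of doubles; only their byte patterns differ, and that is what determines the compressed size.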
AdiKaba
AdiKaba on 25 Aug 2020
Edited: AdiKaba on 25 Aug 2020
I don't think I agree with some of your comments, in particular the one related to the IDWT. The IDWT synthesizes the decomposed wavelet coefficients back into the time domain, giving a reconstructed signal that is similar to the input signal. This gives a more compact representation of the signal that can be represented with fewer bits than the original signal. That is why I am expecting the file size (not the data type) to decrease by a factor of the compression ratio. As the plots above show, the reconstructed signal has smaller amplitude relative to the input signal, and I expect MATLAB to store it in a MAT file of size at most 4186913/5. If MATLAB applies further compression, that is fine too.
I will state my question again: the reconstructed signal is a compact representation of the input signal, that is, |x̂(n)| ≤ |x(n)| for all n, where x̂(n) and x(n) correspond to the amplitudes of the reconstructed and input signals. In other words, the amplitude of the reconstructed signal is upper bounded by the amplitude of the input signal. Thus, I expect the reconstructed signal to occupy at most the same storage space as the input signal. However, MATLAB is allocating excessively large overhead! Please note that I don't need MATLAB to compress the variable while saving it. Compression is done by the algorithm, as shown in the block diagram below, where thresholding and quantization provide the required compression.
Walter Roberson
Walter Roberson on 26 Aug 2020
Sorry, you will need to open a support case about this, as you clearly are not willing to accept my (correct) explanations for what is happening.
AdiKaba
AdiKaba on 26 Aug 2020
No worries! Thank you.
I didn't accept your explanation because your reasoning regarding the wavelet transform is not mathematically correct. I hope that doesn't offend you. By the way, the problem I have been describing is not limited to the DWT; it extends to DCT-based compression as well.
Walter Roberson
Walter Roberson on 26 Aug 2020
" IDWT synthesizes the decomposed input wavelet coefficients into the time domain giving a reconstructed signal that is similar to the input signal."
Yes.
"This gives a more compact representation of the signal that can be represented with less number of bits compared to the original signal."
The DWT often has that property, but the IDWT does not.
"That is why I am expecting the file size to decrease (not the data type) by a factor of the compression ratio."
Is your file size 4186913 the original signal, or is it the DWT version of the signal, or is it the reconstructed version of the signal?
"That is why I am expecting the file size to decrease"
File size is determined by how much compression zlib can find for the data, which is a different matter than the "information content" (entropy) of the data.
Consider, for example, a 17 Hz sine wave with no phase delay, sampled at 5 megahertz: the "information content" is the fundamental frequency, the sample rate, and the number of samples. If you were in a situation where the only permitted fundamentals were the integers 0 to 31, the only permitted sampling rates were integer megahertz 0 to 7, and the only permitted lengths were "full cycles" 0 to 255, then the "information content" would be only 16 bits (5 bits for the fundamental, 3 for the sampling rate, 8 bits for the number of cycles).
The compression available through a dictionary technique such as the one zlib uses would be at most two copies of each y value (one for the rise, one for the fall) per full cycle; not very good. zlib does not even attempt mathematical calculations to predict values.
A discrete fourier transform (fft) of such a signal would, to within round-off, show a single non-zero bin at 17 Hz and (two-sided transform) at -17 Hz, and if you used find() to locate it you could arrive at a fairly compact representation.
Wavelet transform of the same signal... it would depend which wavelet you choose. The tests I did just now found some that could do a 2:1 compression (cd was small enough to potentially be all zero) but I did not encounter any that could do better.
You are confusing different representations of the data in your signal with the information content of the data.
And you are also confused in thinking that a 5:1 amplitude reduction makes a difference in the information content. There is as much information in the line segment between 1 and 2 as there is between 1/5 and 2/5 (infinite information, if you are talking about real numbers). IEEE 754 floating-point representation does not use fewer bits for a value that is 1/5th of the original.
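That last point is easy to check empirically (a Python sketch with an arbitrary test signal, not the poster's data):

```python
import math, struct, zlib

n = 50_000
sig = [math.sin(0.001 * i * i) for i in range(n)]  # arbitrary test signal
small = [v / 5.0 for v in sig]                     # 5:1 amplitude reduction

c_full = len(zlib.compress(struct.pack(f'<{n}d', *sig), 9))
c_small = len(zlib.compress(struct.pack(f'<{n}d', *small), 9))
print(c_full, c_small)  # nearly identical: scaling does not shrink the bytes
```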
Walter Roberson
Walter Roberson on 26 Aug 2020
"I didn't accept your explanation because your reasoning regarding the wavelet transform is not mathematically correct."
What I wrote about idwt is,
"When you do your idwt you are spreading information across your data in a way that does not happen to align nicely with dictionary compression."
I am distinguishing between information and data.
Consider, for example, a wavelet that is a square wave. If your data happens to be square waves with duty cycle 1/2, then the wavelet can compact the information into a small number of coefficients; just enough to encode the width and length in a structured way. And, similar to the discrete fourier transform I described above, a lot of the coefficients might be zero, which would compress well with the dictionary-based compression scheme used by zlib (and so used by MATLAB) to store .mat files.
Then when you idwt(), the information ("square wave", amplitude, duty cycle, frequency, cycle count) gets spread out over the data that is the reconstructed square wave. And that data might not happen to compress nearly as well with the dictionary compression scheme as the wavelet transform did.
"Please note that I don't need MATLAB to compress the variable while saving it."
Notice how compression of .mat files is on automatically for -v7 and -v7.3 files, unless you specifically ask for -nocompression. The 4186913 byte file size you are seeing is after MATLAB's zlib compression has been used.


Answers (1)

AdiKaba
AdiKaba on 26 Aug 2020
Edited: AdiKaba on 26 Aug 2020


I understand you have an MVP status to protect, as there are some mathematical inaccuracies in your responses. You are referring to my comments as "confused", which I think is a bad choice of words. Again, I disagree with your comments; just because you wrote a long response doesn't mean it is correct. No confusion here. Good luck.

1 comment

Walter Roberson
Walter Roberson on 27 Aug 2020
What result do you get when you save your inputSignal with -nocompression ?
I firmly recommend the book Text Compression, by Bell, Cleary, and Witten (Prentice Hall, 1990), https://books.google.ca/books/about/Text_Compression.html for making clearer the difference between information content and representation.
Reducing the amplitude of a signal does not reduce its entropy (disorder, difficulty of predicting). Filtering can reduce entropy (but does not necessarily do so.)
You have 32000000 bytes of data that, under LZSS+Huffman encoding (zlib), compresses to 4186913 bytes. You process the decompressed signal and expect the stored file to be at most 4186913 bytes; with your expected 5:1 compression, you are hoping for a file on the order of 837383 bytes. But there is no certainty that the processing you do will happen to end up with something that compresses nicely with the LZSS+Huffman encoding scheme.
Let me give another example drawn from the fourier transform (which, as I showed above, could in some cases be a way to get significant compression for some signals). Consider a 50% duty cycle square wave. That is potentially just bi-level: a number of zeros followed by the same number of ones, with the pattern repeated many times. The (non-discrete) fourier transform of a square wave is an infinite series. Suppose we take the dft, and now we process it, filtering out the 4/5 of the coefficients with least absolute value. That would be compression under that model. Now ifft(). The result is not going to be a square wave: it is going to be a waveform with a lot of ringing on it, which does not lend itself nearly as well to LZSS+Huffman dictionary compression. The inaccuracies caused by the approximation get smeared out over all of the data when you ifft() to reconstruct.
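That thought experiment can be sketched numerically. The following Python illustration (naive O(n²) DFT, standard library only; the 80% threshold and the two-tone test signal are arbitrary choices) keeps only the largest transform coefficients, reconstructs, and compares how well zlib compresses the sparse coefficient array versus the reconstructed samples:

```python
import cmath, math, struct, zlib

n = 512
sig = [math.sin(2 * math.pi * 13.7 * t / n) + 0.5 * math.sin(2 * math.pi * 41.3 * t / n)
       for t in range(n)]

# Naive forward DFT (O(n^2) is fine at this size).
coeff = [sum(sig[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
         for k in range(n)]

# "Compression": zero the 80% of coefficients with the smallest magnitude.
cutoff = sorted(abs(c) for c in coeff)[int(0.8 * n)]
kept = [c if abs(c) >= cutoff else 0j for c in coeff]

# Inverse DFT: the discarded detail is smeared across every output sample.
recon = [sum(c * cmath.exp(2j * math.pi * k * t / n)
             for k, c in enumerate(kept) if c).real / n for t in range(n)]

coeff_bytes = struct.pack(f'<{2 * n}d', *[v for c in kept for v in (c.real, c.imag)])
recon_bytes = struct.pack(f'<{n}d', *recon)
# The mostly-zero coefficient array compresses far better than the
# reconstruction, even though it holds twice as many raw bytes.
print(len(zlib.compress(coeff_bytes, 9)), len(zlib.compress(recon_bytes, 9)))
```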
Likewise, wavelets are based upon repeated shapes at different amplitudes and frequencies. Wavelets do not describe individual samples in the signal: they calculate amplitudes at different frequencies that, when reconstructed, try to approximate the signal well, and any change in coefficients (by zeroing them for compression) gets propagated as a subtle change across the entire signal. But "well" for reconstruction is not measured by exact reconstruction: it is measured by error in reconstruction. And although the SSE of the reconstruction may be small, and the wavelet coefficients might be an excellent representation of the "interesting" information in the signal, that does not mean that the reconstructed signal is going to happen to be a good match for the LZSS+Huffman compression scheme that MATLAB automatically applies when you save files, unless you say to use -nocompression.
The processing you do might well have reduced the information in the signal in a way that is useful for your purpose. But that does not mean that the automatic compression MATLAB uses (unless told not to) is a good match for the processed result. What it does mean is that you have the potential to write your own compression routine that does a good job on this signal.
For example you might want to experiment with using fwrite() of the processed data (producing a 32000000 byte file), and then using gzip -9 on the binary file.
MATLAB is not adding overhead to the saved file; you just happen to be using an output signal that does not compress especially well with its built-in compression. And you can demonstrate whether MATLAB's compression is at fault by writing out the binary 32000000 bytes and putting them through some compression tools.


Asked on 21 Aug 2020
Last commented on 27 Aug 2020
