I'm having some trouble with .mat files that loaded in R2020a but no longer load in R2020b. This appears to be due to a UTF-8 encoded string; a small example file is attached. If possible, I would like to get UTF-8 encoded strings in .mat files to load correctly in R2020b.
These files come from software we wrote in-house, which outputs .mat files for later analysis in accordance with the .mat file specification published by MathWorks. The example file contains the string 'test°test', i.e., 'test' + the degree symbol (U+00B0) + 'test', in a variable 'x'. All of this is being done on Windows 10 64-bit version 1909 (build 18363.1379). In R2020a (version = '9.8.0.1451342 (R2020a) Update 5') load('test.mat') gives:
That last character is the 2-byte sequence E9FF (inspected with double(x(end))). In R2020b (version = '9.9.0.1467703 (R2020b)') load('test.mat') gives:
Error using load
Cannot read file D:\temp\mexload_unicode\bin\test.mat.
Obviously R2020a is not loading the string correctly either - I'm not sure why there are random bytes on the end - but it does load, which has been good enough for us so far (we almost never have non-ASCII data).
The hex dump of the bytes encoding the variable 'x' in the .mat file is:
10 00 00 00 0A 00 00 00 74 65 73 74 C2 B0 74 65 73 74 00 00 00 00 00 00
This breaks down as follows (per pages 1-5 and 1-6 of the .mat file specification):
- 10 00 00 00 = (16 decimal) = miUTF8
- 0A 00 00 00 = (10 decimal) bytes
- 74 65 73 74 C2 B0 74 65 73 74 ... = UTF-8 encoded 'test°test' (C2 B0 = degree in UTF-8) + padding to a 64-bit boundary as required by the mat file format
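For anyone who wants to double-check the breakdown above outside MATLAB, here is a minimal Python sketch. It assumes the element uses the normal (non-"small") tag layout and little-endian byte order, as in the dump; `parse_element` is just a name I made up for illustration and is not part of our writer software or of any MATLAB API.

```python
import struct

# Bytes for variable 'x', copied from the hex dump above.
raw = bytes.fromhex(
    "10 00 00 00 0A 00 00 00 74 65 73 74 C2 B0 74 65 73 74 00 00 00 00 00 00"
)

MI_UTF8 = 16  # data type code for miUTF8 per the .mat file specification


def parse_element(buf):
    """Split one little-endian data element into (type, byte count, payload, padding).

    Hypothetical helper: assumes an 8-byte tag (4-byte type + 4-byte count)
    followed by the payload and zero padding.
    """
    data_type, n_bytes = struct.unpack_from("<II", buf, 0)
    payload = buf[8:8 + n_bytes]
    padding = buf[8 + n_bytes:]
    return data_type, n_bytes, payload, padding


data_type, n_bytes, payload, padding = parse_element(raw)
text = payload.decode("utf-8")  # raises UnicodeDecodeError if the bytes are not valid UTF-8

print("type is miUTF8:", data_type == MI_UTF8)
print("byte count:", n_bytes)
print("decoded text:", text)
print("padded to 8-byte boundary:", len(raw) % 8 == 0)
```

If this decodes cleanly, the payload itself is valid UTF-8 and the tag and padding are consistent with the description above, which would suggest the problem is on the loading side rather than in these bytes.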
We've been using the software that produces these files for a long time (since ~R2012) and it's only with R2020b that we've seen failures to load. I've seen some references to UTF-8 in the R2020b release notes but nothing detailed enough to be useful or even specifically related to mat files. Usually Google has all the answers but in this case I can't find anyone with a related issue.
Apart from distilling the problem down to the example above, I've tried:
- Enabling the "Beta: Use Unicode UTF-8 for worldwide language support" option in the "Region" settings of Windows 10 (and restarting). This made no difference.
- Inspecting .mat files made from within MATLAB. These all seem to be UTF-16 encoded, even when the above option was checked, and I can't find a way to force UTF-8 encoding.
- Tweaking the byte count for the field in case MATLAB doesn't count the "C2" of "C2B0". This only corrupted the string further.
Using UTF-16 encoding loads completely correctly (no spurious bytes, in both R2020a and R2020b). However, it takes up twice as much space, and some of our files are large enough, or contain enough strings, for this to matter (mainly when being processed in RAM; it matters less once the files are compressed on disk). So I would like to get the UTF-8 encoding working.
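To put a number on the size difference for the example string, here is a rough Python sketch. It only compares payload sizes; the `padded` helper is a hypothetical name of mine that rounds up to the 8-byte boundary mentioned above, and the 8-byte tag is not included.

```python
text = "test\u00b0test"  # 'test' + degree symbol + 'test'

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark


def padded(n_bytes):
    """Round a payload size up to the next multiple of 8 bytes (hypothetical helper)."""
    return -(-n_bytes // 8) * 8


print("UTF-8 payload: ", len(utf8), "bytes, padded to", padded(len(utf8)))
print("UTF-16 payload:", len(utf16), "bytes, padded to", padded(len(utf16)))
```

For this string the UTF-8 payload is 10 bytes and the UTF-16 payload is 18 bytes. For purely ASCII strings, which is nearly all of our data, UTF-16 is exactly double.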
Is there anything wrong with the UTF-8 encoding above or the .mat file it's in? Or is there any detailed information about the changes between R2020a and R2020b with regard to UTF-8 encoding and .mat file loading?