Level 5 .mat file with UTF-8 encoded character array fails to load on R2020b

Question

Russel Burgess on 8 Mar 2021

Direct link to this question

https://es.mathworks.com/matlabcentral/answers/765746-level-5-mat-file-with-utf-8-encoded-character-array-fails-to-load-on-r2020b

Answered: Russel Burgess on 9 Mar 2021

test.zip

I'm having some trouble with .mat files that loaded in R2020a but no longer load in R2020b. This appears to be due to a UTF-8 encoded string; a small example file is attached. If possible, I would like to get UTF-8 encoded strings in .mat files to load correctly in R2020b.

These files come from software we have written in-house that outputs .mat files for later analysis, in accordance with the MAT-file format specification published by MathWorks. The example file contains the string 'test°test', i.e., 'test' + the degree symbol (U+00B0) + 'test', in a variable 'x'. All of this is being done on Windows 10 64-bit version 1909 (build 18363.1379).

In R2020a (version = '9.8.0.1451342 (R2020a) Update 5') load('test.mat') gives:

x = 'test°test□'

That last character is the 2-byte sequence E9FF (inspected with double(x(end))). In R2020b (version = '9.9.0.1467703 (R2020b)') load('test.mat') gives:

Error using load
Cannot read file D:\temp\mexload_unicode\bin\test.mat.

Obviously R2020a is not loading the string correctly either - I'm not sure why there are random bytes on the end - but it does load, which has been good enough for us so far (we almost never have non-ASCII data).

The hex dump of the bytes encoding the variable 'x' in the .mat file is:

10 00 00 00 0A 00 00 00 74 65 73 74 C2 B0 74 65 73 74 00 00 00 00 00 00

This breaks down as follows (per pages 1-5 and 1-6 of the MAT-file specification):

10 00 00 00 = (16 decimal) = miUTF8
0A 00 00 00 = (10 decimal) bytes
74 65 73 74 C2 B0 74 65 73 74 ... = UTF-8 encoded 'test°test' (C2 B0 = degree in UTF-8) + padding to a 64-bit boundary as required by the mat file format
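As a cross-check of the breakdown above, here is a minimal Python sketch that parses the same bytes as a little-endian data element. The helper name parse_element is hypothetical and only for illustration; it assumes the regular 8-byte tag (type, then byte count) followed by data padded to a 64-bit boundary, as in the dump.

```python
import struct

# Bytes copied from the hex dump above (tag + data + padding).
raw = bytes.fromhex(
    "10 00 00 00 0A 00 00 00 "
    "74 65 73 74 C2 B0 74 65 73 74 00 00 00 00 00 00"
)

MI_UTF8 = 16  # data type code shown in the breakdown above


def parse_element(buf):
    """Split one data element into (type, byte count, payload, padded length).

    Hypothetical helper: assumes little-endian byte order and the
    regular (not small) element format.
    """
    data_type, n_bytes = struct.unpack_from("<II", buf, 0)
    payload = buf[8:8 + n_bytes]
    # Data is padded with zeros up to the next multiple of 8 bytes.
    padded_len = 8 + (n_bytes + 7) // 8 * 8
    return data_type, n_bytes, payload, padded_len


data_type, n_bytes, payload, padded_len = parse_element(raw)
text = payload.decode("utf-8")

print(data_type == MI_UTF8)   # type tag is miUTF8
print(n_bytes, len(text))     # byte count versus character count
print(text)
```

The payload is 10 bytes long but decodes to 9 characters, because the degree symbol occupies two bytes (C2 B0) in UTF-8.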

We've been using the software that produces these files for a long time (since ~R2012) and it's only with R2020b that we've seen failures to load. I've seen some references to UTF-8 in the R2020b release notes but nothing detailed enough to be useful or even specifically related to mat files. Usually Google has all the answers but in this case I can't find anyone with a related issue.

Apart from distilling the problem down to the example above, I've tried:

Enabling the "Beta: Use Unicode UTF-8 for worldwide language support" option in the "Region" settings of Windows 10 (and restarting); this made no difference.
Inspecting .mat files made from within MATLAB; these all seem to be UTF-16 encoded, even when the above option was checked, and I can't find a way to force UTF-8 encoding.
Tweaking the byte count for the field in case MATLAB doesn't count the "C2" of "C2B0"; this only corrupted the string further.

Using UTF-16 encoding loads completely correctly (no spurious bytes, in both R2020a and R2020b); however, it takes up twice as much space, and some of our files are large enough or have enough strings for this to matter (when being processed in RAM; it doesn't matter so much once compressed on disk). So I would like to get the UTF-8 encoding working.
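To put rough numbers on the size difference for the example string, the following Python sketch compares the raw and 8-byte-padded payload sizes under each encoding. The helper name padded is hypothetical, and the figures cover only the character data, not the element tags.

```python
text = "test\u00b0test"  # 'test' + degree symbol + 'test'

utf8_bytes = len(text.encode("utf-8"))
utf16_bytes = len(text.encode("utf-16-le"))  # no byte-order mark


def padded(n):
    """Round a byte count up to the next multiple of 8 (hypothetical helper)."""
    return (n + 7) // 8 * 8


print("UTF-8 :", utf8_bytes, "bytes,", padded(utf8_bytes), "with padding")
print("UTF-16:", utf16_bytes, "bytes,", padded(utf16_bytes), "with padding")
```

For pure ASCII text the UTF-16 payload is exactly double the UTF-8 payload; each non-ASCII character such as the degree symbol narrows the gap slightly.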

Is there anything wrong with the UTF-8 encoding above or the mat file it's in? Or is there any detailed information about the changes between R2020a and R2020b with regards to UTF-8 encoding and mat file loading?


Answer 1

Russel Burgess on 9 Mar 2021

Direct link to this answer

https://es.mathworks.com/matlabcentral/answers/765746-level-5-mat-file-with-utf-8-encoded-character-array-fails-to-load-on-r2020b#answer_643397

I found the issue: it appears that MATLAB counts UTF-8 continuation bytes in the data element size but not in the array dimension size (which makes sense even if not explicitly pointed out anywhere). In other words, the data element size is a byte count, while the dimensions are a character count.

Going further back in the hex dump of test.mat, the breakdown is:

(dimensions array subelement)

05 00 00 00 (miINT32)

08 00 00 00 (8 bytes)

01 00 00 00 (1 row)

0A 00 00 00 (10 columns)

(array name subelement)

01 00 01 00 78 00 00 00 (miINT8, 1 byte, 'x')

(data element)

10 00 00 00 (miUTF8)

0A 00 00 00 (10 bytes)

74 65 73 74 C2 B0 74 65 73 74 00 00 00 00 00 00 ('test°test')

By changing the column count in the dimensions array subelement from 0A to 09 (the number of complete UTF-8 characters in 'test°test') the file loads correctly. Presumably older versions of MATLAB ignored this discrepancy and a check was added in R2020b.
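A writer could apply the fix along the lines of the Python sketch below, which builds the dimensions and data subelements for a 1-by-N character row with the column count taken from the number of characters and the element size taken from the number of UTF-8 bytes. The function names are hypothetical, this is not the in-house software's actual code, and it always uses the regular 8-byte tag. It has only been reasoned through for characters in the Basic Multilingual Plane; how MATLAB expects characters outside it to be counted is not covered by this thread.

```python
import struct

MI_INT32 = 5   # type code of the dimensions subelement in the dump above
MI_UTF8 = 16   # type code of the UTF-8 data element in the dump above


def pad8(b):
    """Zero-pad a byte string to a multiple of 8 bytes."""
    return b + b"\x00" * (-len(b) % 8)


def char_dims_and_data(text):
    """Build (dimensions subelement, data element) for a 1-by-N char row.

    Hypothetical helper, little-endian. Assumption: N is the number of
    characters, while the data element size is the UTF-8 byte count.
    """
    encoded = text.encode("utf-8")
    n_chars = len(text)  # characters, not bytes
    dims = struct.pack("<II", MI_INT32, 8) + struct.pack("<ii", 1, n_chars)
    data = struct.pack("<II", MI_UTF8, len(encoded)) + pad8(encoded)
    return dims, data


dims, data = char_dims_and_data("test\u00b0test")
print(dims.hex())  # column count field should read 09, not 0a
print(data.hex())  # size field stays 0a (10 bytes)
```

The only difference from the failing file is the last 32-bit word of the dimensions subelement: 09 instead of 0A.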
