Why is the string type not implemented as a standard type?

Jan on 12 Dec 2021
Commented: Andrew Janke on 14 Dec 2021
The string type is implemented as an opaque class and not as a standard type like doubles or chars. This makes it inefficient to access the contents of string arrays in Mex functions. Some Matlab functions also need special treatment:
circshift('1234', 2) % '3412'
circshift("1234", 2) % ["1234"]
For the string, the scalar array is shifted, not its contents:
circshift(["1bc", "2bc", "3bc", "4bc"], 2) % ["3bc", "4bc", "1bc", "2bc"]
Which advantages does this method of implementing strings have?

Answers (2)

James Tursa on 12 Dec 2021
Edited: James Tursa on 12 Dec 2021
All of the standard full numeric types as well as char and logical are implemented as simple rectangular data arrays. The string type, by its very nature of being able to handle different length strings in each element, cannot be represented this way. An OOP class is necessary to keep track of the various string lengths. Also, my understanding is that the individual strings are kept in memory in such a way as to make operations on individual elements somewhat efficient without the need to make each string its own mxArray. The behavior of circshift( ) on strings is more akin to what would happen with a cell array of strings ... i.e., it operates on the elements as a group instead of operating on each individual element. But your question does bring up two of my biggest rants that have been going on for a while now ...
1) The fact that you can't get at the string data pointers in a mex routine is not restricted to strings ... it is the general behavior of all classdef objects, and quite frankly why I avoid them. The old @directory style class objects were far superior in this respect since they were essentially held in memory as struct objects with a thin wrapper, so you could get at the data pointers in a mex routine with the struct API functions. But TMW has made it clear up to this point that they aren't going to give us access to those classdef data pointers. My feeling is it's my data ... so please give me direct access to it in a mex routine!
2) The half data type was not implemented as a standard numeric class when it easily could have been. There is no reason it was necessary to implement this as a classdef OOP object. This greatly crippled the use of half data types. Can't use many of the numeric functions on them such as typecast( ). Can't get at the data pointers in a mex routine. Can't easily create them in a mex routine. Pretty much anything you do with them necessitates a deep data copy. Etc. etc. I complained about this very soon after they introduced this class, but it looks as though nothing will change. See this related post for an example of the headaches this implementation decision causes:
Maybe it is time I worked on my own bfloat16 class (based on the old @directory class object style of course) since that seems to be the way computer graphics are headed anyway because of the wider range of bfloat16 types compared to half types (bfloat16 allocates more bits to the exponent field at the expense of fewer significand bits).
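To make the range trade-off concrete, here is a minimal C sketch. It is not MATLAB's half implementation and the function names are made up for illustration. It assumes `float` is a 32-bit IEEE 754 single. bfloat16 is then simply the upper 16 bits of a single: 1 sign bit, 8 exponent bits and 7 stored significand bits, whereas IEEE half has 5 exponent bits and 10 stored significand bits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper names. Assumes float is a 32-bit IEEE 754 single. */

/* single -> bfloat16 by truncation (rounds toward zero, NaN payloads are
 * not special-cased): keep the sign, the 8 exponent bits and the top 7
 * significand bits. */
uint16_t float_to_bf16_trunc(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bit pattern */
    return (uint16_t)(bits >> 16);
}

/* bfloat16 -> single by zero-filling the 16 discarded significand bits. */
float bf16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Because the exponent field is unchanged, a value such as 1e30 survives the round trip with reduced precision, while it would overflow in IEEE half, whose largest finite value is 65504.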
  1 Comment
Jan on 14 Dec 2021
Edited: Jan on 14 Dec 2021
Thanks for your answer.
I've misused the field names of structs to emulate a string object in the past. On the mex level it is no problem to create field names that are not valid Matlab symbols, like empty strings, dots or spaces. Of course the strings are limited to 63 characters. But the names are stored in a fixed-width raster in memory, so they are easy to access. The benefit is small compared to a CHAR matrix that stores the strings columnwise.
Having strings stored in memory including a trailing zero would allow using the pointer to the data directly when calling an external library, e.g. functions of the operating system. Without the delimiter a copy is still required.
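The copy in question can be sketched in plain C. This is a stand-in, not the MEX API (in a real MEX file `mxArrayToString` makes a comparable copy). It assumes the characters arrive as a length plus 16-bit code units without a terminator. The function name is made up, and anything outside 7-bit ASCII is simply rejected to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: build a null-terminated C string from n 16-bit
 * code units. Returns a malloc'ed buffer the caller must free, or NULL if
 * allocation fails or the input contains a NUL or a non-ASCII unit. */
char *ascii_cstring_from_utf16(const uint16_t *s, size_t n)
{
    char *out = malloc(n + 1);          /* +1 for the trailing zero */
    if (out == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; ++i) {
        if (s[i] == 0 || s[i] > 0x7F) { /* embedded NUL or non-ASCII */
            free(out);
            return NULL;
        }
        out[i] = (char)s[i];
    }
    out[n] = '\0';
    return out;
}
```

If the data were already stored with a trailing zero in the encoding the library expects, this whole function would reduce to passing the pointer through.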
The implementation of different string lengths in a standard Matlab type would not be too hard: In a cell string we have a data pointer to Matlab arrays, but this could be data pointers to the mxChar arrays also.
For my needs, the new string class is neither useful nor usable. What a pity.


Andrew Janke on 13 Dec 2021
The older ways of representing arrays of strings have issues. Here's a whole blog post I wrote about it: http://blog.apjanke.net/2019/04/20/matlab-string-representation-is-a-mess.html
Long story short:
  • Storing multiple strings as a 2-D char array is lousy, because a list of N strings is not an N-element long array; it's an N-by-MaxStringLength array, so none of the normal Matlab array operations work on it and you can't write generic code against it. And it's inefficient due to space-padding, and bc the characters of a given string aren't even contiguous in memory, because Matlab arrays are column-major but 2-D char arrays are read row-majorly.
  • Cellstrs are lousy because they're not type-safe, don't support many standard Matlab operations, and they have the overhead of storing a full mxArray inside each single cell.
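The non-contiguity of 2-D char arrays can be sketched in plain C. Assuming an nrows-by-ncols char matrix stored column-major as 16-bit code units (the layout of MATLAB's mxChar data), character j of string i sits at index i + j*nrows, so reading one string means striding through memory. The function name is illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy row `row` of an nrows-by-ncols column-major char matrix into `out`,
 * which must have room for ncols elements. Consecutive characters of one
 * string are nrows elements apart in memory, not adjacent. */
void copy_row_from_char_matrix(const uint16_t *data, size_t nrows,
                               size_t ncols, size_t row, uint16_t *out)
{
    for (size_t j = 0; j < ncols; ++j) {
        out[j] = data[row + j * nrows];
    }
}
```

For the 2-by-3 matrix ['ab ';'cde'] the memory order is a, c, b, d, space, e, so the second string "cde" occupies elements 1, 3 and 5, and the first string carries a padding space.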
So IMHO the new string type is much nicer to use at an M-code level. When you're working on strings, often you'll want to do string-wise operations that treat each element of an array as a full string, instead of exposing the individual characters. E.g. stuff like someArrayOfStrings == "the string I'm looking for". If you really want to do character-wise operations, like concatenating substrings, extracting or replacing individual characters and so on, you can convert your strings to char arrays to do that lower-level work. (Like how Java has separate String and char data types.) E.g. str2 = string(circshift(char(str1), 2)) for a scalar string str1.
And because it's a new, string-specific type instead of being built on cells of chars, it gives Matlab internals room to use a more efficient internal data representation and faster implementations of string operations. (Though this is largely yet to be realized; only some string array operations are big wins over the cellstr equivalent, and in some cases they're even slower.)
Not having access to the raw character data of the strings in a MEX file is a huge bummer, though. I didn't realize this was the case. I can see a reason to not have it return raw 16-bit char data with mxGetData because that'd expose internals in a way that would bind Matlab to a particular internal representation forever. (Which they might not want to do, because e.g. they might want to switch string arrays from storing 2-byte UTF-16 char data to 1-byte UTF-8, or even a "flexible width" string format like Python uses, both of which could be significant wins in efficiency (at least for non-Asian text).)
But there really ought to be a way to get at the underlying string data in a C MEX file, at least in a read-only manner! Especially because "string arrays are the way to go now" seems to be MathWorks' current strategic position. I looked through https://www.mathworks.com/help/matlab/cc-mx-matrix-library.html and don't see a way to do this. Can a MathWorker comment on this?
I do see a way to do it in C++ MEX files: https://www.mathworks.com/help/matlab/matlab-data-array.html says there's a StringArray type, which looks like what is needed here. (Though in the doco I don't see how to extract the underlying character data to something user C++ code can work with?)
I'd like to see something like this back-ported to the C Matrix API, though. Lots of MEX code is still in C. And converting it to C++ is a substantial project: I've found it much harder to write C++ MEX files that are actually fast, compared to C MEX files. Buncha performance gotchas in the C++ MEX/Data API from what I can see.
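On the UTF-16 versus UTF-8 trade-off mentioned above, a rough C sketch shows where the "non-Asian text" caveat comes from. It counts how many bytes a UTF-16 buffer would need in UTF-8. The function name is made up, and counting a lone surrogate as 3 bytes (as if it were replaced by U+FFFD) is an arbitrary simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed to encode n UTF-16
 * code units. The UTF-16 form always costs 2 * n bytes. */
size_t utf8_len_of_utf16(const uint16_t *s, size_t n)
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; ++i) {
        uint16_t c = s[i];
        if (c < 0x80) {
            bytes += 1;            /* ASCII: 1 byte instead of 2 */
        } else if (c < 0x800) {
            bytes += 2;            /* e.g. accented Latin, Greek, Cyrillic */
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;            /* surrogate pair: code point above U+FFFF */
            ++i;                   /* skip the low surrogate */
        } else {
            bytes += 3;            /* rest of the BMP (most CJK), or lone surrogate */
        }
    }
    return bytes;
}
```

Pure ASCII text halves in size, while text made of BMP characters from U+0800 upward, which includes most CJK characters, grows from 2 to 3 bytes per character.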
  2 Comments
Jan on 14 Dec 2021
Thanks for your exhaustive answer.
"Storing multiple strings as a 2-D char array is lousy" - I do agree. The row-wise storage was a strange decision.
"So IMHO the new string type is much nicer to use at an M-code level" - I can confirm this: nicer, more efficient and worth supporting in future additions.
I'd prefer a string type that explicitly stores the string length and a trailing zero, such that the pointer to the string data can be provided to external libraries directly (read-only!), e.g. as a file name for the operating system.
MathWorks could implement a "mxGetStringPtr" without losing the flexibility for future changes, if an additional layer is inserted on demand:
char * mxGetStringPtr(mxArray, "utf16")
This returns a pointer if the string is stored in UTF16, or creates a copy if a conversion is required. This encapsulates the internal representation and allows the data to be used efficiently where possible.
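The pointer-or-copy idea can be sketched in self-contained C. Everything here is hypothetical: no such function exists in the MEX API, and `string_elem`, `encoding_t` and `get_utf16_ptr` are made-up stand-ins for one string element with a private internal encoding. The sketch assumes only two internal formats, 1-byte ASCII and UTF-16.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one element of a string array. */
typedef enum { ENC_ASCII, ENC_UTF16 } encoding_t;

typedef struct {
    encoding_t  enc;   /* internal storage format, free to change later */
    size_t      len;   /* number of code units */
    const void *data;  /* uint8_t[] for ENC_ASCII, uint16_t[] for ENC_UTF16 */
} string_elem;

/* Return read-only UTF-16 data for s (s->len code units).
 * If the element is already stored as UTF-16, the internal pointer is
 * returned and *owned is set to 0. Otherwise a converted, zero-terminated
 * copy is allocated and *owned is set to 1, meaning the caller must free it.
 * Returns NULL if the allocation fails. */
const uint16_t *get_utf16_ptr(const string_elem *s, int *owned)
{
    *owned = 0;
    if (s->enc == ENC_UTF16) {
        return (const uint16_t *)s->data;       /* no copy needed */
    }
    uint16_t *copy = malloc((s->len + 1) * sizeof *copy);
    if (copy == NULL) {
        return NULL;
    }
    const uint8_t *src = (const uint8_t *)s->data;
    for (size_t i = 0; i < s->len; ++i) {
        copy[i] = src[i];                       /* ASCII widens 1:1 to UTF-16 */
    }
    copy[s->len] = 0;
    *owned = 1;
    return copy;
}
```

The caller never learns the internal format: it only sees UTF-16 data and a flag telling it whether the buffer must be freed, so the internal representation stays free to change.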
In some dozens of my C-Mex functions I only obtain file names or use CHARs to provide some options. Some Mex functions such as "startsWith", "stringCompare/i", "Contains" would be sufficient already. For file names a simple "GetString" with conversion to ASCII/UTF8/UTF16 and appending of a 0 would also be enough, because file operations are much slower than copying a string.
It would be easy to implement such functions transparently accepting CHAR (vectors!), cell strings and STRING arrays, if I only had access to the string data.
Andrew Janke on 14 Dec 2021
You're welcome!
Yeah, something like a (probably read-only) mxGetStringPtr to get the underlying character data in a certain format and encoding (which could differ from the string array's internal format) sounds like a good idea and seems like it would take care of this issue. Matlab already has transcoding code to support this (we know this because it's needed for native2unicode, and for where Matlab internals interface with other libraries that take differently-encoded text) so it seems like it wouldn't be a huge lift.

Release

R2016b
