Working with unicode paths

Jim Hokanson on 2 Sep 2013
The following is a follow-up to:
This question, however, is a bit more specific. I have a file that was created by a program on Windows. I can browse to the file in Windows Explorer (Windows 7). However, I am unable to:
  1. Open the file in MATLAB (using fopen).
  2. Change into a directory with the same name. If I create such a directory, cd(directory) fails.
I have uploaded the file to a public folder on my dropbox account. https://www.dropbox.com/sh/d2mghr9xyb426lz/ZEM4DH8XTp
The files are "v. Békésy - 1957.txt" and "v. Békésy - 1957.zip".
I am currently unable to provide instructions for creating such a file from within MATLAB, which is why I am providing the files for download. To preserve the name, I have also included the file in a zip, so that even if the zip is renamed on download, the file inside should keep the same name. Incidentally, extracting the zip to a folder with the same name is what created the folder that I cannot cd to from MATLAB.
Thus, the question is how to get around issues #1 and #2 without renaming the file and folder manually through a Windows interface. I am assuming this might mean using a custom library (mex and/or Java code).
The ideal solution would be generic code for path and file manipulation that actually works, instead of requiring manual intervention every time this problem is encountered.
Thanks, Jim
  4 comments
per isakson on 4 Sep 2013 (edited 4 Sep 2013)
Now I'm on a different computer (same installation: R2013a 64-bit on Windows 7), and I can read your file without problems here too. It is a "standard Swedish" installation, I guess. And:
a = get(0, 'Language')
import java.nio.charset.Charset
b = Charset.defaultCharset()
c = feature('DefaultCharacterSet')
returns
a =
sv_se
b =
windows-1252
c =
windows-1252
Jim Hokanson on 6 Sep 2013 (edited 6 Sep 2013)
EDIT: I am now having trouble reproducing the failure. Transferring the file through Dropbox might be altering the characters in the name. Keep this in mind when reading the original response below.
Thanks Per, setting my language to Swedish works, although this is obviously not ideal.
Specifically, I can read the file if I run: set(0,'Language','sv_se')
The MATLAB documentation is very vague (from what I can tell) about what these options actually do. I would have guessed that this option only changed the way figures are rendered, but apparently not.
The question of how to read the file in a more generic way therefore remains open.


Accepted Answer

Jim Hokanson on 6 Sep 2013
As others have alluded to, the problem seems to be that MATLAB alters the character data in the path. I still don't have a solution for changing the directory, since I don't know of a way to do this without passing the path through a MATLAB string.
Here is how to read a file (in this case into bytes) using Java, which bypasses the Unicode problem.
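% DIR_ROOT: path of the folder that contains the problem file
dir_obj = java.io.File(DIR_ROOT);
dir_files = dir_obj.listFiles;   % java.io.File[], so the file name stays on the Java side
% Read the last entry as raw bytes (int8 from Java) and reinterpret them as uint8
file_bytes = typecast(org.apache.commons.io.FileUtils.readFileToByteArray(dir_files(end)),'uint8');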
NOTE: There are other ways to extract the bytes of a file, but the method used above exists on my system and seemed the most straightforward.
At this point native2unicode() or char() can be used if you want the content as a string.
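As a minimal sketch of that last step, assuming the file content is UTF-8 encoded (the thread does not say which encoding the file uses), the bytes can be decoded by naming the encoding explicitly rather than relying on the platform default. The byte values below are made up for illustration.

```matlab
% Hypothetical sample: the UTF-8 bytes for 'Bék' (é is the two bytes 195 169)
file_bytes = uint8([66 195 169 107]);

% Decode with an explicit encoding instead of the platform default
txt = native2unicode(file_bytes, 'UTF-8');

% Each element of txt is now one character, not one byte
code_points = double(txt);
```

With the column vector returned by the typecast call above, transposing first (file_bytes.') keeps the result a row vector.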
The problem is most likely tied to combining characters, which are one way of adding something like an accent to a "normal" letter.
I believe that the file name on disk that caused the problem actually contains such a combined form, in which a separate character adds the accent to an e. That would explain the code points 101 769, which are the letter e (101) followed by a combining acute accent (769, U+0301).
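To illustrate the 101 769 sequence, here is a small sketch that uses Java's java.text.Normalizer from MATLAB to compose the letter and its combining accent into a single character (Unicode normalization form NFC). It only shows the two spellings of the name; whether normalizing a path is enough for fopen or cd to accept it is not something this thread confirms.

```matlab
% Decomposed spelling: each 'e' (101) is followed by U+0301 (769)
decomposed = char([66 101 769 107 101 769 115 121]);

% Look up the nested enum constant through its binary class name (Normalizer$Form)
nfc = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFC');

% Compose each base letter and its combining accent into one character
composed = char(java.text.Normalizer.normalize(java.lang.String(decomposed), nfc));

n_decomposed   = numel(decomposed);   % 8 code units
n_composed     = numel(composed);     % 6 code units
composed_codes = double(composed);    % é is now the single code point 233
```

The two strings look identical when displayed but do not compare equal, which is consistent with a name that Explorer shows correctly while a lookup by the typed name fails.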

More Answers (2)

Malcolm Lidierth on 6 Sep 2013 (edited 6 Sep 2013)
Jim
MATLAB and Java need to talk to an OS and a file system beneath them, so this is likely to vary across FAT12/16/32, NTFS, etc., as well as across OS and MATLAB/Java versions.
From @Per's comments: the windows-1252 charset is proprietary, not Unicode, and converting a Java String back to the byte[] it originated from requires knowing the Charset.
So, telling the difference between "v. Békésy" on this screen and the byte[] that it was created from requires information that the string does not contain and, AFAIK, neither does the directory entry of any file system.
On my Mac:
>> java.nio.charset.Charset.availableCharsets.size()
ans =
166
The answer, then, is that there is no answer beyond "don't use special characters in file names", as Jan suggested on your first post. But, on the assumption that nobody is likely to have used anything but an 8-bit encoding:
>> java.lang.String('v. Békésy').getBytes()
ans =
118
46
32
66
-23
107
-23
115
121
But compare that with the MATLAB char array:
>> char(java.lang.String('v. Békésy').getBytes())
ans =
v .
B
k
s y
and with:
>> uint8(java.lang.String('v. Békésy').getBytes())
ans =
118
46
32
66
0
107
0
115
121
For this problem, MATLAB's integer conversion rules are not the most useful: uint8() saturates the negative Java byte -23 to 0 instead of reinterpreting its bits as 233.
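A short sketch of the difference, using made-up sample bytes rather than anything read from the file: typecast keeps the bit pattern of each signed byte, whereas uint8() converts by value and clips negative numbers.

```matlab
% 'Bék' as signed Java bytes in an 8-bit encoding such as ISO-8859-1
java_bytes = int8([66 -23 107]);

% Value conversion: negative values are clipped to 0
saturated = uint8(java_bytes);

% Bit reinterpretation: -23 and 233 share the same 8 bits
reinterpreted = typecast(java_bytes, 'uint8');

% 233 is é in ISO-8859-1, and also the Unicode code point of é
txt = char(reinterpreted);
```

This is why the Java-based file read in the accepted answer uses typecast rather than uint8().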
>> java.lang.String(java.lang.String('v. Békésy').getBytes(),java.nio.charset.Charset.defaultCharset())
ans =
v. Békésy
>> java.nio.charset.Charset.defaultCharset()
ans =
ISO-8859-1
but:
>> java.lang.String(java.lang.String('v. Békésy').getBytes(), 'US-ASCII')
ans =
v. Bks
Regards ML
  5 comments
Malcolm Lidierth on 6 Sep 2013
@Jim: this is not a Java issue. On my Mac, for example, with both a Mac HD and a FAT32 drive:
>> java.io.File('v. Békésy')
ans =
v. Békésy
>> ans.exists()
ans =
1
It's a MATLAB issue.
P.S. Scilab works fine, while R gives the name as v. Be\314\201ke\314\201sy but works happily with that.
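The \314\201 escapes in R's output are octal byte values. As a quick check, and assuming R is printing the UTF-8 bytes of the name, they match the UTF-8 encoding of the combining acute accent U+0301 (decimal 769), which would mean the name is stored as a plain e followed by a combining accent:

```matlab
% The combining acute accent, U+0301
combining_acute = char(769);

% Its UTF-8 encoding as raw bytes
utf8_bytes = unicode2native(combining_acute, 'UTF-8');

% The same bytes written in octal, one row per byte
octal_digits = dec2base(double(utf8_bytes), 8);
```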
Jim Hokanson on 6 Sep 2013
@Malcolm,
Thanks for the clarification.
Jim



per isakson on 12 Sep 2013 (edited 13 Sep 2013)
Googling taught me
  • NTFS stores file names in Unicode.
  • Not all zip-tools are Unicode-aware.
  • The name of files transferred in zip-files between systems with different default character sets may be "corrupted".
Links
