
Getting MD5 hash of files with "bad" names (Windows, NTFS)

bbb_bbb on 16 Oct 2017
Edited: Jan on 19 Oct 2017
I tried several variants of getting the MD5 hash of files with "bad" names (e.g. 'C:\wrongname.' (with trailing dots); 'C:\wrongname ' (with trailing spaces); (path)names with accents like 'é', 'é', which occur in French, German and Hungarian; names containing various forms of dashes (–), etc.).
None of them works except Var 5, and that one is too slow (it takes about 10 min for a 10 MB file).
Can you suggest any working variant, or a way to speed up Var 5?
OS: Windows 10, File system: NTFS, Matlab 2015a
% Var 1 ---------
% dirinfo(i).name - is string containing full pathname (e.g. 'C:\myfolder\myfile.ext')
Opt.Format = 'HEX'; Opt.Method = 'MD5'; Opt.Input='file';
hash(i) = DataHash(['\\?\' dirinfo(i).name], Opt); % ERROR - not working with "bad" names
% DataHash.m - https://www.mathworks.com/matlabcentral/fileexchange/31272-datahash
%Var 2 ----------
hash(i)=mMD5(['\\?\' dirinfo(i).name]); % fast, works on 'C:\wrongname.' (with trailing dots) and 'C:\wrongname ' (with trailing spaces), but does NOT work with file names (or paths) with accents like 'é', 'é'
% (mMD5.c, see https://www.mathworks.com/matlabcentral/fileexchange/7919-md5-in-matlab)
% Var 3 ---------
mddigest = java.security.MessageDigest.getInstance('MD5');
bufsize = 8192;
[fid,errmsg] = fopen(['\\?\' dirinfo(i).name]); % ERROR here - MATLAB fopen does not understand "bad" names
if fid>=3 % if success
    while ~feof(fid)
        [currData,len] = fread(fid, bufsize, '*uint8');
        if ~isempty(currData)
            mddigest.update(currData, 0, len);
        end
    end
    fclose(fid);
    hash(i) = reshape(dec2hex(typecast(mddigest.digest(),'uint8'))',1,[]);
else
    disp('can''t open file');
end
% Var 4 ---------
file = java.io.File(['\\?\' dirinfo(i).name]);
digestream = java.security.DigestInputStream(file,mddigest);
file_bytes = typecast(org.apache.commons.io.FileUtils.readFileToByteArray(file),'uint8'); % ERROR: out of memory if BIG file
if ~isempty(file_bytes)
mddigest.update(file_bytes, 0, numel(file_bytes));
end
hash(i) = reshape(dec2hex(typecast(mddigest.digest(),'uint8'))',1,[]);
% Var 5 --------
mddigest = java.security.MessageDigest.getInstance('MD5');
filestream = java.io.FileInputStream(java.io.File(['\\?\' dirinfo(i).name]));
digestream = java.security.DigestInputStream(filestream,mddigest);
while(digestream.read() ~= -1), end % TOO SLOW - reads one byte per call, the loop practically never finishes
hash(i)=reshape(dec2hex(typecast(mddigest.digest(),'uint8'))',1,[]);
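
A note on the speed of Var 5: every digestream.read() call crosses the MATLAB-to-Java boundary for a single byte, so a 10 MB file needs roughly ten million calls. One possible way around this is to let Java drain the stream itself. The sketch below assumes that Apache Commons IO is on MATLAB's Java class path (Var 4 already relies on it) and that java.io.File accepts the \\?\ path as it does in Var 5. md5_of_file_java is a hypothetical helper name, not part of any toolbox.

```matlab
function hash = md5_of_file_java(filename)
% MD5_OF_FILE_JAVA  Hypothetical helper: MD5 of a file, read entirely on the Java side.
% Assumption: org.apache.commons.io (IOUtils, NullOutputStream) is available
% on MATLAB's Java class path.
md  = java.security.MessageDigest.getInstance('MD5');
fis = java.io.FileInputStream(java.io.File(filename));
closeStream = onCleanup(@() fis.close());   % close the file even if an error occurs

% Every byte read through the DigestInputStream also updates the digest.
dis  = java.security.DigestInputStream(fis, md);
sink = org.apache.commons.io.output.NullOutputStream();

% The copy loop runs inside Java with an internal buffer, so MATLAB makes
% one call per file instead of one call per byte.
org.apache.commons.io.IOUtils.copyLarge(dis, sink);

% digest() arrives as int8; reinterpret as uint8 and print two hex digits per byte.
digestBytes = typecast(md.digest(), 'uint8');
hash = lower(reshape(dec2hex(digestBytes, 2).', 1, []));
end
```

It could be called as md5_of_file_java(['\\?\' dirinfo(i).name]). Passing a preallocated buffer to read() from MATLAB is not an alternative, because MATLAB hands arrays to Java by value and the filled buffer does not come back.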

Answers (2)

Guillaume on 16 Oct 2017
Edited: Guillaume on 17 Oct 2017
As others have pointed out, and as Microsoft clearly says:
Do not end a file or directory name with a space or a period. Although the underlying file system may support such names, the Windows shell and user interface does not.
As you've found out, MATLAB's fopen does not support it. .NET (which you can call directly from MATLAB) does not either. From your testing, it looks like Java does not handle it properly either.
The only way to access such a file is directly through the Win32 API, e.g. with CreateFile. So if you really need to handle such paths, you'll have to resort to MEX.
But really, the best fix would be to fix the tool that creates these files in the first place so that it doesn't use bad filenames. It's not just matlab that can't handle them, it's also most backup tools, file transfer tools, etc.
As for path names with accents, there does not appear to be any problem there. Matlab (R2017a tested) handles them fine.
8 comments
bbb_bbb on 17 Oct 2017
Edited: bbb_bbb on 17 Oct 2017
"So I'm not sure what's reliable about a program that creates files that are guaranteed to cause problems for the majority of other programs."
This statement shows that you still have not understood what I want to do. The program must calculate the MD5 sum of existing file sets, and it must be smart enough to process all files, not only 99% of them. Likewise, a backup program does not create files, it just makes copies of existing file sets. A good backup program must do 100% of its job, not 99%.
What do Microsoft's recommendations have to do with it?
Jan on 17 Oct 2017
Edited: Jan on 19 Oct 2017
@bbb_bbb: Reliable programs work on well-formatted input. There is no coding style that can change the old rule:
Garbage in, garbage out.
There are many limitations on file names that can make, for example, backup systems fail. First, there is the old limit of 260 characters for the path, file name included. Using \\?\ you can extend the limit to 32,767 characters: many, but still limited. Then there are the special characters > < ? | : " \ / * and the forbidden file names: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9. Such names can come, for example, from extracting folders that were created under Linux, which has fewer limitations. Backup systems are influenced by the underlying file systems, and I have seen bugs concerning the limits of FAT16, FAT32, NTFS, AFS and BTRFS. Sometimes hard or soft links crash the system, sometimes the Alternate Data Streams of NTFS. Some programs crashed during the daylight saving time change or during a leap second.
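
The naming restrictions listed above can be checked before hashing, so that problematic files are at least reported instead of silently skipped. This is a minimal sketch, assuming the input is a single file or folder name without its path. is_problematic_win_name is a hypothetical helper, and the test for trailing dots and spaces is taken from the question rather than from the list above.

```matlab
function tf = is_problematic_win_name(name)
% IS_PROBLEMATIC_WIN_NAME  Hypothetical helper: true if NAME (no path) breaks
% the Windows naming rules discussed in this thread.

% Reserved device names: CON, PRN, AUX, NUL, COM1..COM9, LPT1..LPT9.
reserved = [{'CON', 'PRN', 'AUX', 'NUL'}, ...
    arrayfun(@(k) sprintf('COM%d', k), 1:9, 'UniformOutput', false), ...
    arrayfun(@(k) sprintf('LPT%d', k), 1:9, 'UniformOutput', false)];

% Part of the name before the first dot; 'NUL.txt' is treated like 'NUL'.
base = regexp(name, '^[^.]*', 'match', 'once');

hasTrailing  = ~isempty(regexp(name, '[ .]$', 'once'));         % trailing dot or space
hasForbidden = ~isempty(regexp(name, '[<>?|:"\\/*]', 'once'));  % > < ? | : " \ / *
isReserved   = any(strcmpi(strtrim(base), reserved));           % case-insensitive match

tf = hasTrailing || hasForbidden || isReserved;
end
```

With this sketch, 'wrongname.', 'wrongname ' and 'NUL.txt' are flagged, while 'myfile.ext' is not. Accented characters are deliberately not flagged, since they are valid in NTFS names.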
Any software has only limited reliability, just as any mechanical tool has only limited applicability. "Reliable" does not mean that you can process everything you want and always get what you expect; it means that the software works correctly inside the specified limits and with inputs inside the specifications.
Therefore the Windows API does work reliably, because it is well documented that trailing dots are not handled. Your input files violate the specifications. Asking for the Windows API and the tools I offer for free to be modified is then the wrong approach. Even if you got the MD5 hashes, you could not copy, move or process these files with standard software, e.g. Windows Explorer or any backup tool.
If you now insist on using such file names, you are acting like someone using a drill to hammer a nail into the wall.
We have suggested several workarounds and explained clearly that the reliable way to solve the problem is not to process weird inputs with even weirder code, but to remove the source of the problem by fixing the program that creates the input data. Your ironic question "do not you want to create reliable programs?" is ignorant. I think it is you who does "not understand the true deepness of the problem".



Jan on 16 Oct 2017
Edited: Jan on 16 Oct 2017
I think that the problem occurs in
% dirinfo(i).name - is string containing full pathname (e.g. 'C:\myfile.ext')
and
['\\?\' dirinfo(i).name]
already. I see no reason to assume that MATLAB or any of the other functions "does not understand bad names". Please post the complete error messages instead of the rough descriptions 'ERROR - not working with "bad" names' and 'ERROR here - matlab fopen don't understand "bad" names'.
Do you obtain dirinfo by the dir command? Then try:
File = ['\\?\', fullfile(dirinfo(i).folder, dirinfo(i).name)];
exist(File, 'file')
disp(['\\?\' dirinfo(i).name])
exist(['\\?\' dirinfo(i).name], 'file')
What do you get as output?
[EDITED, Walter is right: The Windows Command shell suffers from trailing dots and spaces.]
18 comments
bbb_bbb on 19 Oct 2017
copyfile() to tempname()
is acceptable, but only as a last resort or workaround, because it is very time-consuming, e.g. if there are many files in a "bad" folder or the files themselves are large.
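
The copy-then-hash workaround mentioned above might look like the sketch below. It assumes that copyfile can read the problematic source path on the system in question, which this thread does not confirm for every case. hash_via_temp_copy and hashFcn are hypothetical names; hashFcn stands for any function handle that takes a file name and returns a hash, for example a wrapper around DataHash or mMD5.

```matlab
function hash = hash_via_temp_copy(srcFile, hashFcn)
% HASH_VIA_TEMP_COPY  Hypothetical helper: hash a copy of SRCFILE that has a
% well-formed temporary name, leaving the original file untouched.
tmpFile = tempname();                        % unique path in the temp folder
[ok, msg] = copyfile(srcFile, tmpFile);
if ~ok
    error('hashViaTempCopy:copyFailed', ...
        'Could not copy "%s": %s', srcFile, msg);
end
removeCopy = onCleanup(@() delete(tmpFile)); % delete the copy on exit, even on error
hash = hashFcn(tmpFile);
end
```

A call could then read hash_via_temp_copy(['\\?\' dirinfo(i).name], @(f) DataHash(f, Opt)). The cost noted in the comment above remains: every file is read and written once more, so this only makes sense for the few files that fail with the direct methods.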
Stephen23 on 19 Oct 2017
Edited: Stephen23 on 19 Oct 2017
'It violates the rule of good programming: "don't change original data unless necessary."'
No it doesn't, because the rule "fix bugs where they occur" exactly makes this change "necessary". Your filenames are outside of those specified to work correctly with Windows. Solution: change them so that they are suitable for Windows. Not only that, but filenames should not contain data at all, or at most only some very high-level meta-data, so changing the name should make no difference to your data.
"Renaming is not quite good decision"
Ensuring that the names are created correctly in the first place would be the best decision. Anything else will ultimately just waste more of your time, as this discussion already shows.
"because it is very time-consuming, e.g. if there are many files in a "bad" folder or the files themselves are large."
Nope, it would only take a few minutes with the right tool. I do this all the time with quite large files from our test department, to store the files systematically. It doesn't change the data at all.
It is not clear to me what the problem is. Why are you letting something as trivial as filenames get in the way of doing your work?
