How do I parse mixed, dynamic binary & string files?

Question


I'm having trouble parsing files that mix strings and numbers. The data I need is in 3 comma-delimited columns, but the rest of each line isn't in any specific form. A line can have 10 leading gibberish characters, it might have some letters, symbols, or numbers, or it might say "Error: hrere" on 1 line (for example). I've used textscan, strread, !strings, and fgetl with strread, but I can't seem to get what I need out into 3 variables [Col1 Col2 Col3]. I've been racking my brain. How can I do this? Here's a sample of the file:

@#*%;AJ))&3#* a) 24.568, 34.1024, -0.1023

&$@!*(!( (*&Y$)@ 24.568, 34.1020, -0.0888

()(@E$@!*(!( (*&Y$)@ 23.568, 34.1020, -0.0888

$64&$@!*(!( (*&Y$)@ 24.4568, 34.0020, -0.0888

Bad Command

$64&$@!*(!( (*&Y$)@ 24.4568, 34.0020, -0.0888

&!)*~*(ER!( (*6&Y$)@ 24.568, 34.1020, -0.0888

(*!$)^@ 23.568, 34.1020, -0.0888

etc....

The closest I got was using something like:

fid = fopen('file.txt','r');
tline = fgetl(fid);
[c1 c2 c3] = strread(tline,'%f','delimiter',',');
fclose(fid);

but I can't iterate it, and it also quits if it reads a line with a bad string (non-floating point).


dpb on 17 Jan 2014

That's a {insert proverbial adjective here}...

It looks like the one consistent thing is that there's a blank preceding the first floating point value for the lines with valid data. I'd probably try to locate that on each line by a rear-to-front search for the second comma delimiter location and then the preceding blank prior to that, then try to convert that substring.
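The rear-to-front search described above could look roughly like the following. This is only a sketch on one sample line from the question, not a tested parser; the variable names (tline, vals, candidate) are made up for illustration, and it assumes valid lines always end in exactly three comma-separated numbers preceded by a blank.

```matlab
% Sample line copied from the question; in practice it would come from fgetl
tline = '&$@!*(!( (*&Y$)@ 24.568, 34.1020, -0.0888';

vals = [NaN NaN NaN];               % result if the line holds no valid data
commas = find(tline == ',');        % positions of every comma in the line
if numel(commas) >= 2
    secondLast = commas(end-1);     % second comma counting from the rear
    blanks = find(tline(1:secondLast) == ' ');
    if ~isempty(blanks)
        % Everything after the last blank that precedes that comma
        candidate = tline(blanks(end)+1:end);
        nums = str2double(strsplit(candidate, ','));
        if numel(nums) == 3 && ~any(isnan(nums))
            vals = nums;            % accept only three clean numbers
        end
    end
end
```

A line such as "Bad Command" has no commas, so vals stays all NaN and the line can be skipped by the caller.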

I suspect it will then take quite a lot of additional logic to handle all the other special cases you find.

Once, in a former life, I had the problem of processing large amounts of data returned from a power plant monitoring computer via punched paper tape that was always rife with mispunches and the like. It was a similar lot of work to write a reasonably robust processor to salvage them. From that experience, "good luck".

regexp may also be your friend here...
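As a rough illustration of the regexp route, the sketch below pulls three numbers off the end of each line. The pattern is an assumption about the data (plain decimal numbers, no exponents, first value preceded by a blank or the start of the line), and the names sample, num, pat, and data are hypothetical.

```matlab
% A few lines copied from the question's sample file
sample = {'@#*%;AJ))&3#* a) 24.568, 34.1024, -0.1023', ...
          'Bad Command', ...
          '(*!$)^@ 23.568, 34.1020, -0.0888'};

num = '[-+]?\d+\.?\d*';             % optional sign, digits, optional fraction
% Three numbers separated by commas, anchored to the end of the line
pat = ['(?:^|\s)(' num ')\s*,\s*(' num ')\s*,\s*(' num ')\s*$'];

data = nan(numel(sample), 3);       % rows stay NaN when a line does not match
for k = 1:numel(sample)
    tok = regexp(sample{k}, pat, 'tokens', 'once');
    if ~isempty(tok)
        data(k, :) = str2double(tok);   % convert the three captured strings
    end
end
```

Rows of data that remain NaN correspond to lines without three trailing numbers, so they are easy to filter out afterwards.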


Answer 1

Walter Roberson on 17 Jan 2014


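fid = fopen('file.txt','r');
% Read every line as three comma-delimited text fields
datacell = textscan(fid, '%s%s%s', 'Delimiter', ',');
fclose(fid);
% Strip everything up to and including the last whitespace in field 1
col1s = regexprep( datacell{1}, '^.*\s', '' );
% str2double returns NaN for any text that is not a number
Col1 = str2double(col1s);
Col2 = str2double(datacell{2});
Col3 = str2double(datacell{3});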

There will be NaN in any entry that did not match the proper format.
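If the NaN rows are not wanted, one possible follow-up is to drop them with a logical mask. The values below are stand-ins typed in by hand to mimic what the code above might return for two good lines and one "Bad Command" line; the variable valid is a hypothetical name.

```matlab
% Stand-in columns: the middle row mimics a line that failed to convert
Col1 = [24.568; NaN; 24.4568];
Col2 = [34.1024; NaN; 34.0020];
Col3 = [-0.1023; NaN; -0.0888];

% Keep only rows where all three columns converted to numbers
valid = ~(isnan(Col1) | isnan(Col2) | isnan(Col3));
Col1 = Col1(valid);
Col2 = Col2(valid);
Col3 = Col3(valid);
```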


Tom W on 23 Jan 2014

Kudos! I was able to use another method; however, your suggestion worked in far fewer lines of code than what I was using. To learn, I don't quite understand the '^.*\s' syntax: what does it tell the program? I'm wondering whether that syntax would be of use elsewhere if I understood it better. Thanks again!

Walter Roberson on 23 Jan 2014

'^.*\s' is a regular expression, which is a pattern that needs to be matched. The '^' means that the match must occur at the beginning of a line. The . means to match any one character. The * modifier after the . means to extend the previous specification (the dot) as far as possible to the right such that the rest of the pattern afterwards is still satisfied -- so to gobble as many characters as you can such that the rest still works. The \s means any one whitespace character (such as a blank). Re-interpreting this, it means to start at the beginning, find the last space in the string, and take everything from the beginning up to and including that space.

This pattern is inside a regexprep() call, which says to replace the matched string with what is described in the next argument. The next argument I gave is '' which is the empty string. So the effect is to delete all characters from the beginning of the line up to and including the final space, leaving the last series of non-blank characters alone. In other words, it cuts out everything in the first comma-delimited field except the last whitespace-separated item, which on a valid line is the first number.
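To see the pattern at work, here is a small sketch applied to single strings rather than a whole file. The string in field is the first comma-delimited field of one sample line from the question; the variable names are made up for illustration.

```matlab
% First comma-delimited field of a sample line from the question
field = '&$@!*(!( (*&Y$)@ 24.568';

% Delete everything from the start through the last whitespace character
stripped = regexprep(field, '^.*\s', '');   % leaves '24.568'
value = str2double(stripped);               % numeric value 24.568

% A non-data line keeps only its last word, which does not convert
bad = regexprep('Bad Command', '^.*\s', '');  % leaves 'Command'
badValue = str2double(bad);                   % NaN
```

Because .* is greedy, the match runs to the last whitespace character rather than the first, which is why the embedded blank in the gibberish prefix does not matter.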
