Textscan with different formats

Question

Hi,

I'm not familiar with reading text files or with the format specifiers involved. Here is my problem: I have been trying to read a tab-delimited ('\t') text file with an unknown number of rows and columns and more than one header field (the second one is a unit that I don't need; only the first is used). I was using importdata to read the text and the data separately, and it worked fine, but yesterday I found a problem: my input file contains '*' for missing data, which is treated as character data and as a row header during import.

Hundreds of questions have been asked about reading text files. I've found solutions such as readtable, importing as char and converting with str2double (which is slow), and readtext (File Exchange), but none of them is as fast as importdata.

What I want is to read only the numeric data from the text file, replacing any character data with NaN during the import itself, as xlsread does. I understand this can be done with textscan, but I was unable to work out a formatSpec for these files. Alternatively, a faster str2double would help.

When I give the formatSpec as ('%s %f'), is the first row read as a string, or the first column?

Note: the text file size is 100000 by 600 columns. In some files the second column (units) may not be present, and the data starts in the second column itself. Also, if the delimiter changes to ',' in another file, how can I auto-detect the delimiter?

Comments (2 of 4 shown)

Stephen23 on 25 Oct 2017

@surey: are the missing data always in that column, or can they occur in other columns as well?

Vick on 25 Oct 2017

Edited: Vick on 25 Oct 2017

Hi, more than 20 columns in my actual data have missing values, and they can be in any column. Also, the missing data may be a single row in any column, rather than a whole column.

Answer 1

Walter Roberson on 24 Oct 2017

"When i give formatspec as ('%s %f') is the first row is taken as string or the first column?"

No, neither. textscan() loops, continuing from the current file position, which might be in the middle of a line. If your format reads only part of a line, the rest of the line is not discarded before the format is used again; instead the file position is left in the middle of the line, and textscan loops and applies the format again from wherever it is.

For example, in the file

abc 123 456 789 1011
def

then a "%s%f" format would first read 'abc' with %s, then read 123 with %f, temporarily leaving the textscan output as {{'abc'}, [123]}. textscan would then re-apply the format from where it stopped, reading '456' with %s and 789 with %f, updating the output to {{'abc'; '456'}, [123; 789]}. On the next pass, %s would grab 1011 and %f would choke on the def of the next line, leaving you with {{'abc'; '456'; '1011'}, [123; 789]}. Notice that the numeric column is shorter than the text column, because textscan gave up reading before that column was updated.
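The walk-through above can be reproduced without a file, since textscan also accepts a character vector. This is a minimal sketch, assuming the two-line example text is passed in directly rather than read from disk; the variable names are illustrative only.

```matlab
% Build the two-line example text from the answer.
txt = sprintf('abc 123 456 789 1011\ndef');

% Apply a two-item format to a line that has five fields.
C = textscan(txt, '%s%f');

% Per the description above, the text column should end up one entry
% longer than the numeric column.
strs = C{1};   % entries read by %s
nums = C{2};   % entries read by %f
fprintf('%d text entries, %d numeric entries\n', numel(strs), numel(nums));
```

Comparing numel(strs) with numel(nums) is a quick way to spot that the format and the actual column count have drifted apart.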

Now, if you happen to have the same number of format items as you have columns, then the effect is that each format item applies to one column. But if you hit a row with a missing entry that is implied only by spacing (no explicit delimiter between fields), or you have a numeric field specification but encounter a string instead and do not have TreatAsEmpty set, or a %s column unexpectedly has a space in it, then in any of those circumstances the nice correspondence between column and format specifier will get messed up.
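For the tab-delimited data in the question, the TreatAsEmpty option mentioned above is one way to keep the column correspondence intact when '*' marks missing values. This is a minimal sketch, assuming three numeric columns and no header lines. The sample text and variable names are made up for illustration; a real file would need 'HeaderLines' and a format built for its actual column count.

```matlab
% Hypothetical sample: three tab-separated numeric columns, '*' = missing.
txt = sprintf('1\t2\t3\n4\t*\t6');

% One %f per column, built from the (assumed known) column count.
nCols = 3;
fmt = repmat('%f', 1, nCols);

C = textscan(txt, fmt, ...
    'Delimiter', '\t', ...       % fields are separated by tabs
    'TreatAsEmpty', '*', ...     % read '*' as an empty numeric field
    'CollectOutput', true);      % merge the numeric columns into one matrix

data = C{1};                     % NaN appears where '*' was in the input
disp(data)
```

Empty numeric fields are filled with the 'EmptyValue' setting, which defaults to NaN, so no separate str2double pass is needed in this sketch.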

One of the key things you need to know about textscan() is that, unless you have set 'WhiteSpace' to exclude the space character, blanks starting at the current position are discarded at the beginning of every format specifier, even if the format specifier is %c or %s or %[]. This makes it tricky to deal with optional fields that are replaced by blanks (unless you happen to be using a field separator such as comma or tab). The immediate thought might be to just remove space from the 'WhiteSpace' setting, but when that parameter does not include space, leading spaces are an error for numeric fields! I showed how to get around that in https://www.mathworks.com/matlabcentral/answers/361377-textscan-failing-to-read-data-in-text-file#answer_286302
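The effect of the 'Whitespace' setting on %s fields can be checked with a small comparison. This is a sketch under assumptions: the comma-delimited sample text is invented for illustration, and the second call is expected to keep the leading blanks only according to the description above, so it is worth confirming in your own MATLAB release.

```matlab
% Hypothetical sample with a leading blank before each field.
txt = ' abc, def';

% Default 'Whitespace' includes the space character, so leading blanks
% are skipped before each %s field.
C1 = textscan(txt, '%s', 'Delimiter', ',');
trimmed = C1{1};

% With 'Whitespace' set to empty, space is no longer treated as white space.
C2 = textscan(txt, '%s', 'Delimiter', ',', 'Whitespace', '');
kept = C2{1};

% Compare field lengths to see whether the leading blanks were kept.
disp(cellfun(@numel, trimmed).')
disp(cellfun(@numel, kept).')
```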

1 comment

Vick on 25 Oct 2017

Hi Roberson,

Thanks for the detailed explanation. I'm now able to specify the formatSpec for simpler problems, but I'm still struggling to specify it for my problem.

I attached the file to @Stephen Cobeldick's comment: https://in.mathworks.com/matlabcentral/answers/362921-textscan-with-different-formats#comment_496926
