How to read HTML PRE tags and replace some white spaces

I read data from an HTML file, delimited by the following tags:
<pre>
12.0 29132 -60.3 -91.4 1 0.01 260 753.2 753.3 753.2
10.0 30260 -57.9 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 1009.2 1011.8 1009.3
</pre>
using the code:
t = regexp(html, '<PRE[^>]*>(.*?)</PRE>', 'tokens');
where t is a cell array of char.
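For readers following along, here is a minimal sketch of how the text inside the tags can be pulled out of t. The html string below is a made-up stand-in for the real file contents, and the variable name content is only an illustration.

```matlab
% Hypothetical HTML string standing in for the real file contents.
html = sprintf('<html><PRE>\n   12.0  29132\n   10.0  30260\n</PRE></html>');

% With 'tokens', REGEXP returns one cell per match, and each of those
% cells holds the captured groups as char row vectors.
t = regexp(html, '<PRE[^>]*>(.*?)</PRE>', 'tokens');

% First match, first (and only) capture group: the text between the tags.
content = t{1}{1};
disp(content)
```

If the tags in the file are lowercase, as in the sample shown above, the 'ignorecase' option of REGEXP can be appended to the call so that the pattern still matches.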
Now I am trying to replace each blank field with NaN to obtain:
12.0 29132 -60.3 -91.4 1 0.01 260 NaN 753.2 753.3 753.2
10.0 30260 -57.9 NaN 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 NaN NaN 1009.2 1011.8 1009.3
In this data set the columns are not always separated by the same amount of space, and I do not know the length of the white space.
For example, in the last line of my first data set there are two empty fields that I would like to replace with 'NaN'. The position of the elements must not change (I think the textscan function is risky here).
Do you have any suggestions? Maybe I should read the PRE tags another way?
Thank you

Accepted Answer

Cedric on 20 Jun 2014
Edited: Cedric on 21 Jun 2014
I've got to run, but here is one way (I'll come back later to discuss further if needed).
EDIT: the first solution would not work; I will update it as soon as I have more information.

6 Comments

Yes, I have cases with consecutive missing values in the same line; you can see one in the last line of my data set above.
Stefano on 20 Jun 2014
Edited: Stefano on 20 Jun 2014
I would like to point out that I already have a cell that contains the data: t = regexp(html, '<PRE[^>]*>(.*?)</PRE>', 'tokens'); It would be ideal to start from there.
Cedric on 21 Jun 2014
Edited: Cedric on 21 Jun 2014
OK, I have a little more time now. Can any column be missing, or are there one or more that are always present? Also, what is the range (or the range of the width) of the numbers in each column? Finally, could you attach a sample so that I have the exact content, because the forum alters the content a bit?
Stefano on 21 Jun 2014
Edited: Stefano on 21 Jun 2014
Thank you for your interest. The first two columns are always present. I can't know the range of the numbers; I can only estimate it. I attached a .txt file with another situation. Since I have a data file, could I solve this by reading the .txt file and replacing the blank spaces there? That could be another way, but I have no idea how to do it.
Cedric on 21 Jun 2014
Edited: Cedric on 21 Jun 2014
OK, this is a table with a fixed column width of 7 characters, so you can process it as follows:
regexprep( content, ' {7}', ' NaN' )
where content is the token that is output by your first call to REGEXP. If you have more than 7 white spaces at the beginning of each line, e.g. because of HTML indentation, we can refine the pattern to exclude them. Just let me know.
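As a rough illustration of the idea above, the sketch below builds a small made-up table with 7-character, right-aligned columns, fills the blank fields with NaN, and converts the result to a numeric matrix. The sample values, the 4-column layout, the variable names, and the SSCANF step are assumptions added for illustration. The replacement string is padded to 7 characters here so that the column positions stay unchanged, which is also an assumption, since the forum may have collapsed the spacing in the posted pattern.

```matlab
% Hypothetical fixed-width sample: 4 right-aligned columns of 7 characters.
% A missing value is a field of 7 blanks.
blank = repmat(' ', 1, 7);
row1 = [sprintf('%7.1f', 12.0), sprintf('%7d', 29132), ...
        sprintf('%7.1f', -60.3), sprintf('%7.1f', -91.4)];
row2 = [sprintf('%7.1f', 10.0), sprintf('%7d', 30260), ...
        blank, sprintf('%7.1f', -89.7)];
row3 = [sprintf('%7.1f', 5.9), sprintf('%7d', 33746), blank, blank];
content = sprintf('%s\n', row1, row2, row3);

% Replace every run of exactly 7 blanks with a 7-character 'NaN' field.
% Matches start right after a number, i.e. on a column boundary, so the
% leading padding of the next value is left alone.
nanField = sprintf('%7s', 'NaN');
filled = regexprep(content, ' {7}', nanField);

% Assumed follow-up step: read all values and reshape to one row per line.
nCols = 4;
data = sscanf(filled, '%f', [nCols, Inf]).';
disp(data)
```

This relies on the first column always being present and on the lines having no extra leading indentation; otherwise the runs of blanks would no longer line up with the column boundaries.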
Ok, thank you! Good job! It's a perfect solution for my data. Answer accepted :)


Asked: 20 Jun 2014
Commented: 23 Jun 2014
