readtable(html file) producing extra empty columns

Question

0 votos

Original question: In another thread, similar question was asked for readtable(csv file). The answer was to set {'delimiter', ','}. Because htmlImportOptions does not have 'delimiter' property, that answer does not work for my problem. I found that {'EmptyColumnRule','skip'} is a solution. Unfortunately, it can't work together with htmlImportOptions, which is used to set up DataRows.

Update: name-value pair does have 'DataRows' option.

opt.ExtraColumnsRule = 'ignore' % readtable only the first column.

% either 
opt = htmlImportOptions;
opt.DataRows = 4; 
% opt.EmptyColumnRule = 'skip' % error, html opt doesn't have this property.
% update
opt.ExtraColumnsRule = 'ignore'; 
readtable(htmlfile, opt) % read in only the first column. The other non-extra columns are ignored.
% or
% orignial post: readtable(htmlfile, 'EmptyColumnRule', 'skip') % {'DataRows', 4} is an error
% update. this works
readtable(htmlfile, 'EmptyColumnRule', 'skip', 'DataRows', 4) 
% but not both
readtable(htmlfile, opt, 'EmptyColumnRule', 'skip') % error

I suppose I can read in the ExtraVar columns first and then delete the empty columns, just that I would rather readtable( ) handle it.

Thanks for any solutions!

6 comentarios
Mostrar 4 comentarios más antiguos Ocultar 4 comentarios más antiguos

Simon el 11 de Sept. de 2023

@dpb I will see if I can create a sample data.

dpb el 11 de Sept. de 2023

It would seem highly unlikely that simply uploading a few files with one or two records would reveal anything terribly damaging. :) Of course, it some rare instance it might be possible for industrial sabatoge to occur with only a handful of numbers or it may be company policy regardless of whether there's any real danger or not, or it could be a case such as in my former employment is part of a classified document which, by those rules makes anything in the document classified whether the specific pieces of data are sensitive or not and so can't release anything (despire our current and former leaders who seem to ignore such rules) without a signoff from a derivative classifier who likely won't declassify it for you just on general principle.

IOW, I'm just suggesting to really consider the actual content and whether it's really of need to not just use the data as is...of course, it should be relatively simple to just readcell, substitute the numeric values with rand of the same size and write back out...

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Follow Question

Answer 1

dpb el 10 de Sept. de 2023

Abrir en MATLAB Online

0 votos

Use 'SelectedVariableNames' with the variable(s) desired

I can't tell what you want, specifically, there's a comment to read only the first??? If that is so, then

opt = htmlImportOptions;
opt.DataRows = 4; 
opt.ExtraColumnsRule = 'ignore';
opt.SelectedVariables=opt.VariableNames(1);  % read only the first column
tData=readtable(htmlfile, opt);

5 comentarios
Mostrar 3 comentarios más antiguos Ocultar 3 comentarios más antiguos

Simon el 11 de Sept. de 2023

Editada: Simon el 11 de Sept. de 2023

@dpb "Use 'SelectedVariableNames' with the variable(s) desired"

I think your answer is in the right direction. I suspect the header rows, which I select to be used as variables, cause the problem. The headers of the first two columns are empty, and the other headers are, say X, Y, Z. So the extra variables are created by Matlab as Z_1, Z_2, ... etc.

But the problem is that I have many tables, whose headers may more or less be the same, but not alway be the same. For example, one table may have headers (empty), (empty), X, Y, Z., another table (empty), (empty), X, Y, W, Z. Matlab treats the first empty header as Var1, and the rest as 'extra' columns.

I will give 'SelectedVariableNames' a shot. Thank you.

@dpb I have just now tried it. not working. Matlab insist treats the first column, whose header is empty, as the only variable. Even SelectedVariableNames = 1:2 will produce out-of-range error. It's critical for me to have the headers of the other columns as variables. But I don't know what they might be beforehand. I have to unaccept your answer. Sorry!

dpb el 11 de Sept. de 2023

Editada: dpb el 11 de Sept. de 2023

You can't expect a single solution to be able to handle different file formats.

If you can't standardize the file format, then you'll have to have a catalog of what are possible formats you expect to be able to handle and then build a tool which can discern which is which from some characteristic within the file or have a way to know heuristically which is which a priori.

But, if the header line or data content is such that the import options detection heuristics don't return the expected number of variables/columns, then you'll have to use some other technique to identify what are columns/variables. But again, if we can't see examples to work with, we're also shooting blindly at an ill-defined problem.

The mind-(file)reading toolbox has yet to be released...

ADDENDUM:

If, as the original Q? content seems to suggest, if the "dead ahead" readtable function returns the expected variables plus some that may occasionally be extra/empty, the most expedient solution may well be to simply read the file then delete the extraneous columns instead of going through a lot of gyrations of parsing the changeable input file to handle various cases explicitly -- again, if you can't standardize the input form and can't know a priori from some other source the specifics of the particular file a priori.

In a particular case with some more-or-less free format Excel files, I simply read the variable names row first and parse it for the specific variables of interest. In that case, the variables will always exist; just sometimes there are some others that user may have thrown in ad hoc and sometimes there are some empty columns to parse out, but detectImportOptions will return a correct number of variables. If, in the case of an .html file that can't be assured, then it's going to be a lot more trouble, but I don't have to mess with such and without specific example(s) from which to work, as noted before, we're pretty-much hamstrung.

dpb el 12 de Sept. de 2023

As above I've never had to really mess with parsing HTML much, but it's not set up as a format for scanning by tools such as readtable so it's not at all surprising to me to find you're having difficulties.

While it won't be directly applicable to your case, I'll see if I can strip out the parsing stuff/modifications to the import object I described above into a short piece of example code just as idea generator.

If you can figure out a way to post some examples of what your files actually look like, it would still be the best way to see if somebody can build a better mouse trap.

Simon el 17 de Sept. de 2023

@dpb Thanks for offering the help. I couldn't find a similar sample file to upload here. htmls have all sorts of defects. You were right in your earlier comments not to rely on one function to parse them correctly. I finally used a simple combination of all and ismissing to remove the extra empty columns after readtable(). I greatly appreciate your feedbacks.

Iniciar sesión para comentar.

readtable(html file) producing extra empty columns

6 comentarios
Mostrar 4 comentarios más antiguos Ocultar 4 comentarios más antiguos

Respuestas (1)

5 comentarios
Mostrar 3 comentarios más antiguos Ocultar 3 comentarios más antiguos

Categorías

Productos

Versión

Etiquetas

Community Treasure Hunt

readtable(html file) producing extra empty columns

6 comentarios Mostrar 4 comentarios más antiguos Ocultar 4 comentarios más antiguos

Respuestas (1)

5 comentarios Mostrar 3 comentarios más antiguos Ocultar 3 comentarios más antiguos

Categorías

Productos

Versión

Etiquetas

Ver también

Community Treasure Hunt

6 comentarios
Mostrar 4 comentarios más antiguos Ocultar 4 comentarios más antiguos

5 comentarios
Mostrar 3 comentarios más antiguos Ocultar 3 comentarios más antiguos