readtable(html file) producing extra empty columns

Simon on 10 Sep 2023
Commented: Simon on 17 Sep 2023
Original question: In another thread, a similar question was asked about readtable on a CSV file. The answer there was to set 'Delimiter' to ','. Because htmlImportOptions has no 'Delimiter' property, that answer does not work for my problem. I found that the name-value pair 'EmptyColumnRule','skip' is a solution. Unfortunately, it cannot be combined with an htmlImportOptions object, which I use to set DataRows.
Update: the name-value syntax does accept a 'DataRows' option.
Also, setting opt.ExtraColumnsRule = 'ignore' makes readtable read only the first column.
% either
opt = htmlImportOptions;
opt.DataRows = 4;
% opt.EmptyColumnRule = 'skip' % error, html opt doesn't have this property.
% update
opt.ExtraColumnsRule = 'ignore';
readtable(htmlfile, opt) % read in only the first column. The other non-extra columns are ignored.
% or
% original post: readtable(htmlfile, 'EmptyColumnRule', 'skip') % {'DataRows', 4} is an error
% update. this works
readtable(htmlfile, 'EmptyColumnRule', 'skip', 'DataRows', 4)
% but not both
readtable(htmlfile, opt, 'EmptyColumnRule', 'skip') % error
I suppose I could read in the ExtraVar columns first and then delete the empty ones, but I would rather have readtable() handle it.
Thanks for any solutions!
6 Comments
Simon on 11 Sep 2023
@dpb I will see if I can create some sample data.
dpb on 11 Sep 2023
It seems highly unlikely that simply uploading a few files with one or two records would reveal anything terribly damaging. :) Of course, in some rare instance industrial sabotage might be possible with only a handful of numbers, or company policy may forbid it regardless of whether there is any real danger. Or it could be a case like my former employment, where the data are part of a classified document; by those rules anything in the document is classified whether or not the specific pieces of data are sensitive, so nothing can be released (despite our current and former leaders who seem to ignore such rules) without a signoff from a derivative classifier, who likely won't declassify it for you just on general principle.
IOW, I'm just suggesting that you really consider the actual content and whether there is a genuine need not to use the data as is. Of course, it should be relatively simple to just readcell, substitute the numeric values with rand of the same size, and write the result back out.
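As a rough sketch of that substitution step only: the cell array C below is a hypothetical stand-in for whatever readcell would return from the real file, and the file read/write calls are left out because the actual file layout is unknown.

```matlab
% Hypothetical stand-in for the output of readcell on the real file
C = {'Name', 'Value'; 'a', 1.5; 'b', 2.5};

% Logical mask of the cells that hold numeric values
isNum = cellfun(@isnumeric, C);

% Replace every numeric cell with a random number; text cells are untouched
C(isNum) = num2cell(rand(nnz(isNum), 1));
```

The layout and the text entries survive, so the scrambled copy should still trigger the same parsing behavior as the original.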


Answers (1)

dpb on 10 Sep 2023
Use 'SelectedVariableNames' with the variable(s) desired
I can't tell specifically what you want; there's a comment about reading only the first column. If that is the goal, then
opt = htmlImportOptions;
opt.DataRows = 4;
opt.ExtraColumnsRule = 'ignore';
opt.SelectedVariableNames = opt.VariableNames(1); % read only the first column
tData=readtable(htmlfile, opt);
5 Comments
dpb on 12 Sep 2023
As noted above, I've never had to do much HTML parsing, but HTML isn't designed as a format for scanning by tools such as readtable, so it's not at all surprising to me that you're having difficulties.
While it won't be directly applicable to your case, I'll see if I can strip out the parsing stuff/modifications to the import object I described above into a short piece of example code, just as an idea generator.
If you can figure out a way to post some examples of what your files actually look like, it would still be the best way to see if somebody can build a better mouse trap.
Simon on 17 Sep 2023
@dpb Thanks for offering to help. I couldn't find a similar sample file to upload here. HTML files have all sorts of defects. You were right in your earlier comments not to rely on one function to parse them correctly. I finally used a simple combination of all and ismissing to remove the extra empty columns after readtable(). I greatly appreciate your feedback.
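For anyone landing here later, a minimal sketch of what that all/ismissing cleanup might look like. The table T below is a made-up stand-in for the result of readtable(htmlfile, ...), and it assumes the empty columns come back as all-NaN numbers or all-empty char vectors; the actual code used was not posted.

```matlab
% Hypothetical stand-in for the table returned by readtable(htmlfile, ...)
T = table([1;2;3], [NaN;NaN;NaN], {'a';'b';'c'}, {'';'';''}, ...
    'VariableNames', {'Var1','Var2','Var3','Var4'});

% ismissing returns a logical matrix the same size as T;
% all(..., 1) flags the columns in which every entry is missing
emptyCols = all(ismissing(T), 1);

% Delete the all-missing columns, keeping everything else
T(:, emptyCols) = [];
```

Columns that are only partly empty are kept, since all requires every row of the column to be missing.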


Release

R2023a
