apply string math to everything in a table

I have a table where one variable is a char array.
I am trying to scrub excess information out of that array, e.g.
' "h t t p s : / / c a . s t y l e . y a h o o . c o m / 5 - a p p s " '
which I want to reduce to
' "c a . s t y l e . y a h o o . c o m " '
In general I know I can do that with erase to remove the "h t t p s : / / " and then use strfind to find the next '/' and make a new variable that contains the string up to that position.
But strfind and erase don't seem to happily comply with the table format. Nor can I figure out how to apply strfind to the whole table (as opposed to writing a for loop to step through it).
Is there some way to make these functions work with tables?

Accepted Answer

Walter Roberson on 13 May 2020
scrubbed = regexprep(YourTable.VariableName, {'^([^/]*/){2}\s*', '\s*/.*$'}, {'',''}, 'once', 'lineanchors', 'dotexceptnewline');
This code does not remove the spaces within the url. Doing that would certainly be possible:
scrubbed = regexprep(YourTable.VariableName, {'^([^/]*/){2}\s*', '\s*/.*$', '\s+'}, {'','',''}, 'lineanchors', 'dotexceptnewline');
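For anyone who wants to try this without the original data, here is a minimal sketch. The table T, the variable names Url and Host, and the second sample URL are made up for illustration; only the regexprep call itself is taken from the answer above.

```matlab
% Hypothetical sample data standing in for YourTable.VariableName
urls = {'https://ca.style.yahoo.com/5-apps'; 'http://example.org/a/b'};
T = table(urls, 'VariableNames', {'Url'});

% Same two-step scrub as in the answer:
%   step 1 removes everything up to and including the second '/'
%   step 2 removes the next '/' and everything after it
T.Host = regexprep(T.Url, {'^([^/]*/){2}\s*', '\s*/.*$'}, {'', ''}, ...
    'once', 'lineanchors', 'dotexceptnewline');

disp(T)
```

Because T.Url comes back as a cell array of character vectors, regexprep processes every row in one call, so no for loop is needed.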

10 Comments

Budding MATLAB Jockey on 14 May 2020
Edited: Budding MATLAB Jockey on 14 May 2020
Whoa. I seem to have missed a whole section of matlab. This does work though!
If I also wanted to scrub out the www. when it shows up (not always there), how would I add that in? (I don't need to remove the spaces.) I can just do it as a separate step. My guess is simply
scrubbed = regexprep(YourTable.VariableName, {'^www.'}, {''}, 'once', 'lineanchors', 'dotexceptnewline');
Which does nothing T_T
Then so I can catch up to you mentally:
The cell array {'^([^/]*/){2}\s*', '\s*/.*$'} is looking for:
first element: the start until the 2nd '/' of a '//' with anything between it and the start, then after 2 characters is a whitespace with anything after it. I don't get the {2}\s part, why add that?
Without it wouldn't it find the area up to the '//'
second: white space, anything, '/', with at least something after it, until the end. Again I don't get the white space additions :|
Then it replaces that with nothingness {'',''}
scrubbed = regexprep(YourTable.VariableName, {'^([^/]*/){2}\s*', '\s*/.*$', '^(w\s*){3}\.\s*'}, {'','',''}, 'once', 'lineanchors', 'dotexceptnewline');
[^/] is any one character that is not a /
[^/]* is any number (including zero) of (any one character that is not a /) . In other words, match up to but excluding the next /
[^/]*/ is any number (including zero) of (any one character that is not a /), followed by a / . In other words, match up to and including the next /
([^/]*/) groups that pattern together. The ( and ) are not characters to match in the input, they are creating a group.
([^/]*/){2} says to repeat the group exactly twice. In other words, match non-slash followed by slash twice. This is slightly weak in that it would, for example, match abc/def/ when you only want to match abc// . Instead, [^/]+/\s*/ would be a better pattern. And you could get fancier to insist on the : being there.
The \s* after the ([^/]*/){2} says to match any number of whitespace. In terms of your example, that makes the difference between matching 'h t t p s : / / ' or 'h t t p s : / /' -- that is, do you want the output to be 'c a . s t y l e . y a h o o . c o m' or do you want the output to be ' c a . s t y l e . y a h o o . c o m' with the space left in before the c ?
\s*/.*$ is whitespace, slash, then any number of any characters (except newlines because of the options we used) to end of the line. In your example, the \s* makes the difference between the output being 'c a . s t y l e . y a h o o . c o m ' with trailing space, or 'c a . s t y l e . y a h o o . c o m' without trailing space.
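To make the "slightly weak" remark concrete, here is a small comparison sketch. The variable names and the sample strings are made up for illustration, and the strict pattern is just the suggested [^/]+/\s*/ with a leading ^ and a trailing \s* added, so treat it as one possible reading of that suggestion rather than a definitive pattern.

```matlab
% Hypothetical inputs: a normal URL, a spaced-out URL, and a string
% that has two slashes but no '//' separator
samples = {'https://x.com/p'; 'h t t p s : / / c a . c o m'; 'abc/def/ghi'};

loose  = '^([^/]*/){2}\s*';   % pattern from the answer
strict = '^[^/]+/\s*/\s*';    % assumed anchored form of the stricter suggestion

looseOut  = regexprep(samples, loose,  '', 'once');
strictOut = regexprep(samples, strict, '', 'once');

disp([samples, looseOut, strictOut])
```

The loose pattern turns 'abc/def/ghi' into 'ghi', while the strict one leaves it alone because the two slashes are not adjacent (apart from optional whitespace). Both give the same result on the two URL-like inputs.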
One point about using regexprep() with a cell array of patterns to match, is that the first replacement is done, and the second is applied to the results of the first, and the third is applied to the results of the second, and so on. For example,
regexprep('aaQb', {'[A-Z]', 'aaa' }, {'a', '%'})
first takes the input 'aaQb' and applies (replace '[A-Z]' with 'a') giving 'aaab' . Then it applies the rule (replace 'aaa' with '%') giving '%b' . Notice that the pattern 'aaa' did not occur in the original input. Do not think of the replacements as going on "in parallel": they need to be understood as a sequence of steps.
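The same sequencing explains why the '^www.' attempt did nothing while the three-pattern version works. A minimal sketch, assuming made-up variable names (raw, attempt, pats, scrubbed) and sample strings modelled on the ones in this thread:

```matlab
% Hypothetical samples: with www, with www and embedded spaces, without www
raw = {'https://www.pp.com/'; ...
       'h t t p s : / / w w w . p p . c o m /'; ...
       'https://ca.style.yahoo.com/5-apps'};

% Applied to the raw text, '^www.' never matches: every string still
% starts with the protocol, not with 'www'
attempt = regexprep(raw, '^www.', '', 'once', 'lineanchors');

% As the third step of a sequence, ^ refers to the start of the text that
% is left after the first two replacements, so the www prefix is found
pats = {'^([^/]*/){2}\s*', '\s*/.*$', '^(w\s*){3}\.\s*'};
scrubbed = regexprep(raw, pats, {'', '', ''}, ...
    'once', 'lineanchors', 'dotexceptnewline');

disp(scrubbed)
```

Note that the dot is escaped as \. in the third pattern; an unescaped . would match any character, not just a literal period.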
Budding MATLAB Jockey on 14 May 2020
Edited: Budding MATLAB Jockey on 14 May 2020
ok I think I'm caught up. This feels hard but I get it.
A big hold back I had was the whitespace. Now I understand your comments about white space as I realize there is whitespace in my table! I thought matlab just transferred a funny font when I copied and pasted it into this form.
There is no white space when I open it in excel so that
  1. confused me
  2. is a new problem i have to deal with :|
I just use this command to read the table in from the attached file. It's called a csv, but is tab delimited... I don't know why, I have tonnes of these files (I do get warnings about the file being UTF16 and I have no idea what that means...)
temp = readtable(target,'delimiter','tab');
Would you happen to be able to upgrade to R2020a? R2020a starts being able to handle UTF encoded files for readtable.
fmt = ['%f%f%f%f%f%s%s%f%f',repmat('%s',1,10), '%f%f%s%f'];
fid = fopen(target, 'rt', 'n', 'utf16-le'); %ignore warning about UTF16-LE not being supported
data = textscan(fid,fmt,'delimiter','\t','headerlines',1);
fclose(fid)
urls = data{6};
The urls will not have that embedded whitespace.
You will get a warning about UTF16-LE not being supported; you can ignore that. If you want to turn off that warning, its ID is 'MATLAB:iofun:UnsupportedEncoding'.
If it really bugs you that the warning is there, then there are ways around it.
... but switching to R2020a or later makes the problems go away.
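One plausible explanation for the "whitespace" that shows up in MATLAB but not in Excel (an assumption on my part, it is not confirmed anywhere in this thread) is that the UTF-16 file is being read as if each byte were one character. The sketch below illustrates that effect on a single string; the variable names are made up for illustration.

```matlab
url = 'https://www.pp.com/';

% Encode as UTF-16LE: every ASCII character becomes two bytes,
% the character code followed by a zero byte
bytes = unicode2native(url, 'UTF-16LE');

% Treating each byte as its own character leaves a char(0) after
% every real character, which can look like spacing when displayed
misread = char(bytes);

% Decoding with the right encoding recovers the original text
decoded = native2unicode(bytes, 'UTF-16LE');

fprintf('%d characters became %d bytes\n', numel(url), numel(bytes));
disp(isequal(decoded, url))
```

If that is what is happening, decoding with the correct encoding at read time removes the problem at its source, and the \s* parts of the patterns become harmless extras.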
Budding MATLAB Jockey on 14 May 2020
Edited: Budding MATLAB Jockey on 14 May 2020
Ack. The rest of the people here want to keep all the code on 2019b so I can't change to 2020.
ok this sort of works but...
This is the real file opened in notebook which apparently has quotes around everything and makes that code not work.
# Total Backlinks Domain Rating URL Rating (desc) Referring Domains Referring Page URL Referring Page Title Internal Links Count External Links Count Link URL TextPre Link Anchor TextPost Type Backlink Status First Seen Last Check Day Lost Language Traffic Keywords Js rendered Linked Domains
"1" "1" "48" "21" "19" "https://www.pp.com/" "51 Best ...
whereas when I saved that test file it didn't have the quotes (my bad very sorry...in excel it looks the same :( )
# Total Backlinks Domain Rating URL Rating (desc) Referring Domains Referring Page URL Referring Page Title Internal Links Count External Links Count Link URL TextPre Link Anchor TextPost Type Backlink Status First Seen Last Check Day Lost Language Traffic Keywords Js rendered Linked Domains
1 1 48 21 19 https://www.pp.com/ blogs
I can save everything as a string, but then I have a million strings :(
Also, will that regexprep stuff still work with a giant cell array instead of a table? I also need to figure out how to save the output without writetable() :(
Walter Roberson on 14 May 2020
Well, provide a sample of the actual file and I will see what I can do.
ok sure, I guess there is nothing secret in here.
target = 'uploadable.csv';
opts = detectImportOptions(target, 'encoding', 'utf16le');
t = readtable(target, opts);
Before R2020a you will get warnings about the encoding not being supported, and also a warning about a byte order mark.
warning('off', 'MATLAB:iofun:UnsupportedEncoding')
will get rid of the message about unsupported encoding.
Or you could use
target = 'uploadable.csv';
fmt = ['"%f" "%f" "%f" "%f" "%f" %q %q "%f" "%f" ',repmat('%q ',1,10), '"%f" "%f" %q "%f"'];
fid = fopen(target, 'rt', 'n', 'utf16-le'); %ignore warning about UTF16-LE not being supported
data = textscan(fid, fmt, 'delimiter', '\t', 'headerlines', 1);
fclose(fid)
This will give you a single warning about the encoding not being supported.
If the warning about encoding really bugs you then,
target = 'uploadable.csv';
fmt = ['"%f" "%f" "%f" "%f" "%f" %q %q "%f" "%f" ',repmat('%q ',1,10), '"%f" "%f" %q "%f"'];
fid = fopen(target, 'r');
bytes = fread(fid, [1 inf], '*uint8');
fclose(fid)
s = native2unicode(bytes, 'utf16le');
data = textscan(s, fmt, 'delimiter', '\t', 'headerlines', 1);
.. provided that the files do not occupy more than about 1/3 of your available memory.
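On the earlier question of whether the regexprep approach still works on the cell arrays that textscan returns, and how to save the result: a minimal sketch, assuming the URL column is data{6} as in the first textscan example. The stand-in data, the output variable names, and the temporary output file are all made up for illustration.

```matlab
% Hypothetical stand-in for data{6} from textscan
urls = {'https://www.pp.com/'; 'https://ca.style.yahoo.com/5-apps'};

% regexprep works on a cell array of character vectors exactly as it
% does on a table variable
hosts = regexprep(urls, ...
    {'^([^/]*/){2}\s*', '\s*/.*$', '^(w\s*){3}\.\s*'}, {'', '', ''}, ...
    'once', 'lineanchors', 'dotexceptnewline');

% Columns can be put back into a table afterwards, so writetable is
% still available for saving
out = table(urls, hosts, 'VariableNames', {'ReferringPageURL', 'Host'});
outFile = [tempname '.txt'];               % temporary file, for illustration
writetable(out, outFile, 'Delimiter', '\t');
```

The other columns of data could be added to the table call in the same way if the full file needs to be written back out.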
For reference everything worked great. You are my hero!
