apply string math to everything in a table

Question


I have a table where one variable is a char array.

I am trying to scrub excess information out of that array

e.g. ' "h t t p s : / / c a . s t y l e . y a h o o . c o m / 5 - a p p s " '

I will reduce to ' "c a . s t y l e . y a h o o . c o m " '

In general I know I can do that with erase to remove the "h t t p s : / / " and then use strfind to find the next '/' and make a new variable that contains the string up to that position.

But strfind and erase don't seem to happily comply with the table format. Nor can I figure out how to apply strfind to the whole table (as opposed to writing a for loop to step through it).

Is there some way to make these functions work with tables?


Answer 1

Walter Roberson on 13 May 2020

scrubbed = regexprep(YourTable.VariableName, {'^([^/]*/){2}\s*', '\s*/.*$'}, {'',''}, 'once', 'lineanchors', 'dotexceptnewline');

This code does not remove the spaces within the URL. Doing that is certainly possible by adding a third pattern and dropping the 'once' option:

scrubbed = regexprep(YourTable.VariableName, {'^([^/]*/){2}\s*', '\s*/.*$', '\s+'}, {'','',''}, 'lineanchors', 'dotexceptnewline');
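As a minimal sketch of how the result can be stored back in the table, assuming the column is a cell array of character vectors (the table T and the variable names URL and Host are made up for illustration and are not from the original question):

```matlab
% Hypothetical table; T, URL and Host are illustrative names only.
T = table({'https://ca.style.yahoo.com/5-apps'; 'https://www.pp.com/'}, ...
          'VariableNames', {'URL'});

% regexprep accepts a cell array of character vectors, so the whole
% column is processed in one call and no explicit loop is needed.
T.Host = regexprep(T.URL, {'^([^/]*/){2}\s*', '\s*/.*$'}, {'', ''}, ...
                   'once', 'lineanchors', 'dotexceptnewline');

disp(T.Host)
```

Indexing with T.URL hands regexprep the underlying cell array rather than a table, which is likely why this works where calling the function on the table itself does not.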


Walter Roberson on 14 May 2020

[^/] is any one character that is not a /

[^/]* is any number (including zero) of (any one character that is not a /) . In other words, match up to but excluding the next /

[^/]*/ is any number (including zero) of (any one character that is not a /), followed by a / . In other words, match up to and including the next /

([^/]*/) groups that pattern together. The ( and ) are not characters to match in the input, they are creating a group.

([^/]*/){2} says to repeat the group exactly twice. In other words, match non-slash followed by slash twice. This is slightly weak in that it would, for example, match abc/def/ when you only want to match abc// . Instead, [^/]+/\s*/ would be a better pattern. And you could get fancier to insist on the : being there.
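The difference between the two patterns can be seen in a small sketch. The leading ^ and trailing \s* on the stricter pattern, and the withColon variant, are additions for illustration rather than something spelled out above:

```matlab
weak      = '^([^/]*/){2}\s*';       % pattern used in the answer
strict    = '^[^/]+/\s*/\s*';        % the two slashes must be adjacent (whitespace allowed)
withColon = '^[^/:]+:\s*/\s*/\s*';   % one possible way to also require the colon (assumption)

a = regexprep('abc/def/ghi', weak, '', 'once');            % strips 'abc/def/'
b = regexprep('abc/def/ghi', strict, '', 'once');          % no match, input returned unchanged
c = regexprep('https://www.pp.com/', strict, '', 'once');  % strips 'https://'
d = regexprep('h t t p s : / / c a', withColon, '', 'once');

disp({a, b, c, d})
```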

The \s* after the ([^/]*/){2} says to match any number of whitespace. In terms of your example, that makes the difference between matching 'h t t p s : / / ' or 'h t t p s : / /' -- that is, do you want the output to be 'c a . s t y l e . y a h o o . c o m' or do you want the output to be ' c a . s t y l e . y a h o o . c o m' with the space left in before the c ?

\s*/.*$ is whitespace, slash, then any number of any characters (except newlines because of the options we used) to end of the line. In your example, the \s* makes the difference between the output being 'c a . s t y l e . y a h o o . c o m ' with trailing space, or 'c a . s t y l e . y a h o o . c o m' without trailing space.

One point about using regexprep() with a cell array of patterns to match, is that the first replacement is done, and the second is applied to the results of the first, and the third is applied to the results of the second, and so on. For example,

regexprep('aaQb', {'[A-Z]', 'aaa'}, {'a', '%'})

first takes the input 'aaQb' and applies (replace '[A-Z]' with 'a') giving 'aaab' . Then it applies the rule (replace 'aaa' with '%') giving '%b' . Notice that the pattern 'aaa' did not occur in the original input. Do not think of the replacements as happening "in parallel": they need to be understood as a sequence of steps.
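The sequential behaviour can be checked by running the two replacements by hand and comparing with the single call. This is only a sketch, and the variable names are arbitrary:

```matlab
in = 'aaQb';

% Apply the two rules one after the other, by hand.
step1 = regexprep(in, '[A-Z]', 'a');     % 'aaab'
step2 = regexprep(step1, 'aaa', '%');    % '%b'

% The cell-array form chains the same steps in the same order.
combined = regexprep(in, {'[A-Z]', 'aaa'}, {'a', '%'});

% Swapping the order changes the result: 'aaa' is not yet present
% when it is tried first, so only the second rule has any effect.
swapped = regexprep(in, {'aaa', '[A-Z]'}, {'%', 'a'});

disp({step2, combined, swapped})
```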

Budding MATLAB Jockey on 14 May 2020

Edited: Budding MATLAB Jockey on 14 May 2020

Ack. The rest of the people here want to keep all the code on 2019b so I can't change to 2020.

ok this sort of works but...

This is the real file opened in notebook which apparently has quotes around everything and makes that code not work.

#	Total Backlinks	Domain Rating	URL Rating (desc)	Referring Domains	Referring Page URL	Referring Page Title	Internal Links Count	External Links Count	Link URL	TextPre	Link Anchor	TextPost	Type	Backlink Status	First Seen	Last Check	Day Lost	Language	Traffic	Keywords	Js rendered	Linked Domains
"1"	"1"	"48"	"21"	"19"	"https://www.pp.com/"	"51 Best ...

whereas when I saved that test file it didn't have the quotes (my bad, very sorry... in Excel it looks the same :( )

#	Total Backlinks	Domain Rating	URL Rating (desc)	Referring Domains	Referring Page URL	Referring Page Title	Internal Links Count	External Links Count	Link URL	TextPre	Link Anchor	TextPost	Type	Backlink Status	First Seen	Last Check	Day Lost	Language	Traffic	Keywords	Js rendered	Linked Domains
1	1	48	21	19	https://www.pp.com/	blogs

I can save everything as a string, but then I have a million strings :(

Also, will that regexprep stuff still work with a giant cell array instead of a table? I also need to figure out how to save the output without writetable() :(

Walter Roberson on 15 May 2020

target = 'uploadable.csv';
opts = detectImportOptions(target, 'encoding', 'utf16le');
t = readtable(target, opts);

Before R2020a you will get warnings about the encoding not being supported, and also a warning about a byte order mark.

warning('off', 'MATLAB:iofun:UnsupportedEncoding')

will get rid of the message about unsupported encoding.
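If the warning should only be silenced while the file is being read, one option is to save the previous warning state and restore it afterwards. This is a sketch, not part of the original reply; prev and restore are illustrative names:

```matlab
id = 'MATLAB:iofun:UnsupportedEncoding';   % identifier quoted in the reply above

prev = warning('off', id);                 % returns the state this identifier had before
restore = onCleanup(@() warning(prev));    % restores that state when 'restore' is cleared

% ... call detectImportOptions / readtable here ...

clear restore                              % triggers the cleanup and restores the old state
```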

Or you could use

target = 'uploadable.csv';
fmt = ['"%f" "%f" "%f" "%f" "%f" %q %q "%f" "%f" ',repmat('%q ',1,10), '"%f" "%f" %q "%f"'];
fid = fopen(target, 'rt', 'n', 'utf16-le');   %ignore warning about UTF16-LE not being supported
data = textscan(fid, fmt, 'delimiter', '\t', 'headerlines', 1);
fclose(fid);

This will give you a single warning about the encoding not being supported.

If the warning about encoding really bugs you then,

target = 'uploadable.csv';
fmt = ['"%f" "%f" "%f" "%f" "%f" %q %q "%f" "%f" ',repmat('%q ',1,10), '"%f" "%f" %q "%f"'];
fid = fopen(target, 'r');
bytes = fread(fid, [1 inf], '*uint8');
fclose(fid);
s = native2unicode(bytes, 'utf16le');
data = textscan(s, fmt, 'delimiter', '\t', 'headerlines', 1);

This approach works provided that the file does not occupy more than about 1/3 of your available memory.
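On the question of whether regexprep still works on a cell array instead of a table: textscan returns one cell per column, and a text column is itself a cell array of character vectors, which regexprep accepts directly. A minimal sketch, using a hand-built two-column stand-in for the textscan output (the values and the names Id and ReferringPageURL are assumptions for illustration):

```matlab
% Stand-in for textscan output: one cell per column (hypothetical subset).
data = {[1; 2], {'https://www.pp.com/'; 'https://ca.style.yahoo.com/5-apps'}};

% Scrub the text column in place; the same call works as for a table variable.
data{2} = regexprep(data{2}, {'^([^/]*/){2}\s*', '\s*/.*$'}, {'', ''}, ...
                    'once', 'lineanchors', 'dotexceptnewline');

% If a table is wanted afterwards, the columns can be collected into one.
T = table(data{:}, 'VariableNames', {'Id', 'ReferringPageURL'});

disp(T)
```

With the full 23-column format shown above, the Referring Page URL column would be data{6}, since it is the sixth field in fmt.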

Budding MATLAB Jockey on 21 May 2020

For reference everything worked great. You are my hero!
