Count the number of times a word begins with "co" in a text using Text Analytics Toolbox


Angelavtc on 21 Apr 2022

https://es.mathworks.com/matlabcentral/answers/1701665-count-the-number-of-times-a-word-begins-with-co-in-a-text-using-text-analytics-toolbox

Commented: Angelavtc on 25 Apr 2022

Accepted Answer: Jonas

Factiva_sample_headlines_1.pdf

Dear community,

I have a PDF with news headlines, and I need to count the number of words in each title, as well as the number of times words starting with "co" and the word "price" appear in each title. I do not have much experience with the Text Analytics Toolbox in MATLAB. As far as I can see, "tokenizedDocument" already gives you the total number of words (or tokens) per headline, and "context" counts a specific word. However, I do not know how to ask MATLAB to look for words starting with "co". Also, how do I get this information displayed in a table?

I leave my pdf and my code.

I really appreciate any help you can provide!

filename = "Factiva_sample_headlines_1.pdf";
str = extractFileText(filename);
textData = split(str,[newline newline]); % split the text into separate news items at blank lines
textData = textData(~contains(textData,"Page")); % drop entries that contain page numbering (cellfun does not accept a string array)
cleanedDocuments = tokenizedDocument(textData); % create an array of tokenized documents
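
For readers who want to stay within the Text Analytics Toolbox, the following is a minimal sketch of one possible approach, not a confirmed solution from this thread. It assumes that `tokenDetails` returns a table with `Token` and `DocumentNumber` variables, that case should be ignored, and that only the exact token "price" counts. The two headlines are hypothetical, invented for illustration. Note that `tokenizedDocument` treats punctuation as separate tokens, so `doclength` can be larger than a plain word count.

```matlab
% Hypothetical headlines, invented for illustration (not taken from the PDF).
textData = ["Copper prices continue to climb"; "Oil price steady"];

documents = tokenizedDocument(textData);   % one tokenized document per headline
numTokens = doclength(documents);          % number of tokens per headline
numTokens = numTokens(:);                  % force a column vector for the table

details = tokenDetails(documents);         % one table row per token
isCo    = startsWith(details.Token,"co",'IgnoreCase',true);  % assumption: ignore case
isPrice = strcmpi(details.Token,"price");                     % assumption: exact token only

% Sum the logical flags per document number.
numDocs  = numel(documents);
numCo    = accumarray(details.DocumentNumber,double(isCo),[numDocs 1]);
numPrice = accumarray(details.DocumentNumber,double(isPrice),[numDocs 1]);

results = table(textData,numTokens,numCo,numPrice)
```

Summing per `DocumentNumber` keeps the counts aligned with the rows of `textData`, which is what makes the final `table` call straightforward.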


Stephen23 on 21 Apr 2022

Just like any text analytics, the devil is in the detail.

It is worth noting right at the start that the PDF format is not intended for data exchange; it is actually a language that arranges graphical objects for consistent visual display. How that data is displayed can be quite different from how it is stored in the file or how it looks when extracted.

In this case, I cannot find any one simple pattern that uniquely identifies all of the headlines (blue text in the PDF):

Most headlines are contained on one line, but some are two lines.
Most headlines have two leading newlines, but so does the start of the 2nd page (which is not a headline).
Most headlines have a following line with the publisher and date... except for the last headline.
Most headlines start with alphabetic characters, but some start with quotation marks.

Using a better file format for data exchange would probably make this task easier.

Also note the question is incomplete/badly formed:

Do you want to match the case or ignore the case?
Do you want to match only "price" as the entire word, or also as part of a word? E.g. "prices".
What is a "word"? Is "time-share" one or two words? Is "(JCPOA)" a word? What about "45%" ?
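
To show how much the answers to these questions change the result, here is a minimal sketch in base MATLAB that makes one set of choices explicit. It assumes that case is ignored, that "price" must match as a whole word (so "prices" does not count), and that a "word" is any run of non-whitespace characters. The headlines are hypothetical, invented for illustration, and the variable names are not from the original post.

```matlab
% Hypothetical headlines, invented for illustration (not taken from the PDF).
headlines = [
    "Copper prices continue to climb as costs rise"
    "Oil price steady; Congress debates time-share rules"
    "Markets quiet ahead of holiday"];

nHeadlines = numel(headlines);
numWords = zeros(nHeadlines,1);
numCo    = zeros(nHeadlines,1);
numPrice = zeros(nHeadlines,1);

for k = 1:nHeadlines
    h = lower(headlines(k));                              % assumption: ignore case
    numWords(k) = numel(regexp(h,'\S+','match'));         % word = run of non-whitespace
    numCo(k)    = numel(regexp(h,'\<co\w*','match'));     % words that start with "co"
    numPrice(k) = numel(regexp(h,'\<price\>','match'));   % whole word "price" only
end

% One row per headline, one column per count.
results = table(headlines,numWords,numCo,numPrice)
```

With these choices "time-share" counts as one word and "prices" is not counted as "price". Changing the pattern to `'\<price\w*'` would include plurals.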

Angelavtc on 22 Apr 2022

Oh la la, it seems more complex than expected :( perhaps I should move to another software 😭. In any case, thank you very much @Stephen!

Angelavtc on 23 Apr 2022

@Stephen Sorry for the inconvenience again, but I have managed to convert the file to HTML format (https://drive.google.com/file/d/1Z5bW98_gWohr2appS8zKgxCC1_mlLpzc/view?usp=sharing). Now the problem is that when I use:

filename = "Factiva_1.html";
str = extractFileText(filename);

I only get one article loaded. Any idea how to make MATLAB read all of them and classify them by title, date and body?

Thank you very much!


Accepted Answer

Jonas on 21 Apr 2022

Edited: Jonas on 21 Apr 2022

Are you searching for something like this example, applied to your textData?

a={'cotrol', 'alcotro','conect','trial','co'};
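cellfun(@(in) strncmp(in,'co',2),a) % compare only the first two characters; unlike in(1:2), this does not error on words shorter than two characters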
ans =
  1×5 logical array
   1   0   1   0   1

You can sum that array to get the total number of words starting with "co".
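
As a minimal sketch of that summing step, the same check can also be written without `cellfun` by using `startsWith`, which accepts a cell array of character vectors directly. The extra one-letter word and the capitalized variant are added here only to illustrate edge cases and are not part of the original example.

```matlab
% Example words; 'c' and 'Cost' are added here to illustrate edge cases.
words = {'cotrol','alcotro','conect','trial','co','c','Cost'};

% Logical mask: true where a word begins with "co" (case-sensitive).
startsWithCo = startsWith(words,'co');

% Total number of words starting with "co".
numCo = sum(startsWithCo);

% The same count, ignoring case, so that 'Cost' is included as well.
numCoAnyCase = sum(startsWith(words,'co','IgnoreCase',true));

fprintf('case-sensitive: %d, ignoring case: %d\n',numCo,numCoAnyCase);
```

Words shorter than two characters simply give `false` here, so no length check is needed.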


Jonas on 25 Apr 2022

You could add another complicated arrayfun or cellfun instead of the strfind pattern:

cellfun(@(in) strfind(str,in),{" se" " sh"},'UniformOutput',false)

After that I would switch to a for statement for better readability:

str = ["se an example of a short sentence short shark" "a second short sentence" " another thing without matches"];
str = append(" ",str); % prepend a space so that a match at the very start of a text is also found
startingPattern = [" se" " sh"];
% for each search term you get one cell, which itself contains one cell per text
whereIsWhat = arrayfun(@(in) strfind(str,in),startingPattern,'UniformOutput',false);
% the counts can be stored as a matrix, because every search term is
% looked up in the same number of texts
% columns: search terms (left to right); rows: texts (top to bottom)
howOftenIsThisTermnInText = zeros(numel(str),numel(startingPattern));
for searchTermNumber=1:numel(startingPattern)
    howOftenIsThisTermnInText(:,searchTermNumber)=cellfun(@numel,whereIsWhat{searchTermNumber});
end
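
A shorter variant of the same idea, offered as a sketch rather than as the method used in this thread, replaces `strfind` plus `numel` with the built-in `count` function, which returns the number of occurrences of a pattern in each element of a string array. The variable names `counts` and `countsTable` are illustrative. The leading-space trick is kept, so a "word" is still assumed to be anything preceded by a space.

```matlab
str = ["se an example of a short sentence short shark" ...
       "a second short sentence" ...
       " another thing without matches"];
str = " " + str;                    % prepend a space so a match at the start is found
startingPattern = [" se" " sh"];

% Rows: texts (top to bottom). Columns: search terms (left to right).
counts = zeros(numel(str),numel(startingPattern));
for k = 1:numel(startingPattern)
    % count returns one value per text for the k-th search term
    counts(:,k) = count(str,startingPattern(k));
end

% Optional: show the counts as a table with one column per search term.
countsTable = array2table(counts,'VariableNames',{'se','sh'})
```

The loop over search terms is kept because passing several patterns to `count` at once returns the combined total instead of one count per pattern.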

Walter Roberson on 25 Apr 2022
