Text Extraction and retrieval

Question

John on 24 Oct 2017

Direct link to this question:

https://es.mathworks.com/matlabcentral/answers/363069-text-extraction-and-retrieval

Edited: shilpa patil on 23 Sep 2019

 <P ID=1>
A LITTLE BLACK BIRD.
</P>
 <P ID=2>
Story about a bird, 
(1811)
</P>
 <P ID=3>
Part 1.
</P>

As I am new to text extraction, I need help with:

Writing code to count the delimiters (</P>)
Removing all punctuation
Breaking the text into individual documents at each delimiter, where ID=1 refers to document 1, ID=2 refers to document 2, etc.

Answer 1

Akira Agata on 25 Oct 2017

Direct link to this answer:

https://es.mathworks.com/matlabcentral/answers/363069-text-extraction-and-retrieval#answer_287567

Just tried to make a script to do that. Here is the result (assuming the maximum ID = 10).

% Read your text file
fid = fopen('yourText.txt');
C = textscan(fid,'%s','TextType','string','Delimiter','\n','EndOfLine','\r\n');
C = C{1};
fclose(fid);
% 1. Count the delimiters '</P>'
idx = strfind(C,'</P>');
n = nnz(cellfun(@(x) ~isempty(x), idx));
% 2. Remove all punctuation and brackets (result kept in C2; C is unchanged)
C2 = regexprep(C,'[.,!?:;()]','');
% 3. Break the text into individual documents at each delimiter
idx2 = find(strcmp(C,'</P>'));
for kk = 1:10
  str = ['<P ID=',num2str(kk),'>'];
  idx_s = find(strcmp(C,str));
  if ~isempty(idx_s)
      idx_e = idx2(find(idx2>idx_s,1));
      fileName = ['document',num2str(kk),'.txt'];
      fid = fopen(fileName,'w');
      fprintf(fid,'%s\r\n',C(idx_s:idx_e));
      fclose(fid);
  end    
end

Akira Agata on 30 Oct 2017

Edited: Akira Agata on 30 Oct 2017

Thanks for your reply. I've written a script that handles items 1 to 3, as follows. I hope it helps.

Regarding your last question ("count the number of documents each word appear in"), I think you can do that by combining the following script with my previous one.

% Read your text file
fid = fopen('yourText.txt');
C = textscan(fid,'%s','TextType','string','Delimiter','\n','EndOfLine','\r\n');
C = C{1};
fclose(fid);
C = regexprep(C,'<[\w \=\/]+>',''); % Remove tags
C = regexprep(C,'[.,!?:;()]','');   % Remove punctuation and brackets
C = regexprep(C,'[0-9]+','');       % Remove numbers
C = lower(C);                       % Convert to lower case
% Extract every word
words = regexp(C,'[a-z\-]+','match');
words = [words{:}];
% (1) Count total number of words
numOfWords = numel(words); % --> 9
% (2) Count the total number of distinct words
numOfDistWords = numel(unique(words)); % --> 7
% (3) Find the number of times each word is used in the original text
wordList = unique(words);
wordCount = arrayfun(@(x) nnz(strcmp(x,words)), wordList);
% Show the result
figure
bar(wordCount)
xticklabels(wordList)

John on 7 Nov 2017

Thanks. I am stuck running the counter.

for kk = 1:n
  str = ['<p id=',num2str(kk),'>'];
  idx_s = find(strcmp(C,str));
  if ~isempty(idx_s)
      idx_e = idx2(find(idx2>idx_s,1));
      Doc=C(idx_s:idx_e); %May need to remove tags later
      Doc = regexp(Doc,'[a-z0-9\-]+','match');
      Doc = [Doc{:}];
      Unique_Doc_count = arrayfun(@(x) nnz(strcmp(x,Doc)), Unique);
      Unique_Doc_freq=[Unique;Unique_Doc_count];
  end    
end

I want to check whether each element of the string array 'Unique' exists in 'Doc'. 'Unique_Doc_count' gives me the number of occurrences, but I need only 1 or 0 (exists or does not exist). The aim is to loop 'kk' over multiple documents and find the number of documents that contain each word in 'Unique': not how many times the word occurs, but how many documents it appears in.
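One way to get a 1/0 presence flag instead of a count is ismember. The sketch below is only an illustration on made-up, in-memory data: the names docs, vocab and docFreq are hypothetical, and the documents are assumed to be already lower-cased and stripped of tags and punctuation. In the loop above, the analogous change would presumably be to replace the arrayfun count with ismember(Unique,Doc) and add the result to a running total.

```matlab
% Hypothetical data: each cell holds the words of one document
% (already lower-cased, with tags and punctuation removed).
docs = { {'a','little','black','bird'}, ...
         {'story','about','a','bird'}, ...
         {'part'} };

% Vocabulary over all documents (plays the role of 'Unique' above).
allWords = [docs{:}];
vocab    = unique(allWords);

% docFreq(j) = number of documents that contain vocab{j}.
docFreq = zeros(1, numel(vocab));
for kk = 1:numel(docs)
    present = ismember(vocab, docs{kk});  % logical 1/0 per word, not a count
    docFreq = docFreq + present;          % accumulate over documents
end

% Show each word with its document frequency.
for j = 1:numel(vocab)
    fprintf('%-8s %d\n', vocab{j}, docFreq(j));
end
```

Because ismember returns a logical vector, a word that occurs several times in one document still contributes only 1 to docFreq.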

Answer 2

Cedric on 26 Oct 2017

Direct link to this answer:

https://es.mathworks.com/matlabcentral/answers/363069-text-extraction-and-retrieval#answer_287987

Here is another approach based on pattern matching:

 >> data = regexp(fileread('data.txt'), '(?<=<P[^>]+>\s*)[\w ]+', 'match' )
 data =
  1×3 cell array
    {'A LITTLE BLACK BIRD'}    {'Story about a bird'}    {'Part 1'}

If you don't need the IDs (e.g. if they always run from 1 to the number of P tags), you are done.

If you need the IDs, you can get both IDs and content as follows:

 >> data = regexp(fileread('data.txt'), '<P ID=(\d+)>\s*([\w ]+)', 'tokens' ) ;
    data = vertcat( data{:} ) ;
    ids  = str2double( data(:,1) )
    data = data(:,2)
 ids =
     1
     2
     3
 data =
  3×1 cell array
    {'A LITTLE BLACK BIRD'}
    {'Story about a bird' }
    {'Part 1'             }
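Note that the character class [\w ]+ stops at the first punctuation mark or line break, so the "(1811)" line of the second paragraph is not captured. If the whole paragraph body is wanted, a lazy match up to the closing tag is one option. This is a minimal sketch on an in-memory copy of the sample text, not part of the original answer; the variable names str, tok and body are illustrative.

```matlab
% Sample text from the question, held in memory instead of read from a file.
str = sprintf(['<P ID=1>\nA LITTLE BLACK BIRD.\n</P>\n', ...
               '<P ID=2>\nStory about a bird,\n(1811)\n</P>\n', ...
               '<P ID=3>\nPart 1.\n</P>\n']);

% Capture the ID and everything up to the matching </P>.
% In MATLAB regular expressions '.' also matches newlines by default,
% so the lazy (.*?) can span several lines.
tok = regexp(str, '<P ID=(\d+)>\s*(.*?)\s*</P>', 'tokens');
tok = vertcat(tok{:});          % n-by-2 cell array: {id, body}

ids  = str2double(tok(:,1));    % numeric IDs as a column vector
body = tok(:,2);                % paragraph bodies, punctuation still included
```

Punctuation can then be stripped from body with regexprep, as in the first answer.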

John on 7 Nov 2017

Thanks. I am stuck running a counter.

for kk = 1:n
  str = ['<p id=',num2str(kk),'>'];
  idx_s = find(strcmp(C,str));
  if ~isempty(idx_s)
      idx_e = idx2(find(idx2>idx_s,1));
      Doc=C(idx_s:idx_e); %May need to remove tags later
      Doc = regexp(Doc,'[a-z0-9\-]+','match');
      Doc = [Doc{:}];
      Unique_Doc_count = arrayfun(@(x) nnz(strcmp(x,Doc)), Unique);
      Unique_Doc_freq=[Unique;Unique_Doc_count];
  end    
end

I want to check whether each element of the string array 'Unique' exists in 'Doc'. 'Unique_Doc_count' gives me the number of occurrences, but I need only 1 or 0 (exists or does not exist). The aim is to loop 'kk' over multiple documents and find the number of documents that contain each word in 'Unique': not how many times the word occurs, but how many documents it appears in.

Cedric on 9 Nov 2017

Edited: Cedric on 9 Nov 2017

If you have a count per document, finding the number of documents a keyword is in is easy:

 counts = [7, 0 ,3] ;
 hasKey = counts > 0 ;        % [1,0,1]
 nDocs  = sum( hasKey ) ;     % 2
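The same idea extends to several keywords at once if the counts are stored as a matrix. The following is a small sketch with made-up numbers; the layout (rows are documents, columns are keywords) and the name nKeys are assumptions for illustration.

```matlab
% Hypothetical counts: rows are documents, columns are keywords.
counts = [7 0 3;
          0 0 1;
          2 5 0];

hasKey = counts > 0;        % logical matrix: does document i contain keyword j?
nDocs  = sum(hasKey, 1);    % documents containing each keyword -> [2 1 2]
nKeys  = sum(hasKey, 2);    % distinct keywords in each document -> [2; 1; 2]
```

Summing along dimension 1 collapses the documents, which gives the per-keyword document frequency asked about in the comment.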

Answer 3

Christopher Creutzig on 2 Nov 2017

Direct link to this answer:

https://es.mathworks.com/matlabcentral/answers/363069-text-extraction-and-retrieval#answer_289028

Edited: Christopher Creutzig on 2 Nov 2017

It's probably easiest to split the text with string functions and then count the resulting pieces:

str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = [];                % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning

Then, numel(paras) will give you the number of </P> delimiters.

If you do not have extractFileText, calling string(fileread('file.txt')) should work just fine, too.
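For releases without the string-array functions, roughly the same split can be done with character vectors and cell arrays. This is a sketch under that assumption, run on an in-memory copy of the sample text rather than a file; nDelims is an illustrative name, and the punctuation set is borrowed from the first answer.

```matlab
% Sample text from the question, held in memory instead of read from a file.
str = sprintf(['<P ID=1>\nA LITTLE BLACK BIRD.\n</P>\n', ...
               '<P ID=2>\nStory about a bird,\n(1811)\n</P>\n', ...
               '<P ID=3>\nPart 1.\n</P>\n']);

nDelims = numel(strfind(str, '</P>'));           % number of closing tags

paras = strsplit(str, '</P>');                   % cell array of pieces
paras = paras(1:nDelims);                        % drop the trailing remainder
paras = regexprep(paras, '^\s*<P ID=\d+>', '');  % drop the opening tag
paras = regexprep(paras, '[.,!?:;()]', '');      % remove punctuation and brackets
paras = strtrim(paras);                          % trim surrounding whitespace
```

Each cell of paras then holds one document; the second one keeps its internal line break between "Story about a bird" and "1811".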

In one of the comments, you indicated you also need to count the frequency of words in documents. That is what bagOfWords is for:

tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
bag = 
bagOfWords with 13 words and 3 documents:
      a   little   black   bird   .   …
      1        1       1      1   1
      1        0       0      1   0
      …
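The bag-of-words model also gives the number of documents each word appears in, which is what the follow-up comments ask for. A minimal sketch, assuming Text Analytics Toolbox is installed and using a hand-cleaned version of the sample paragraphs; docFreq and T are illustrative names.

```matlab
% Requires Text Analytics Toolbox.
% Hand-cleaned sample paragraphs (lower case, no tags or punctuation).
paras = ["a little black bird"; "story about a bird"; "part"];

tdoc = tokenizedDocument(paras);
bag  = bagOfWords(tdoc);

% bag.Counts is a sparse documents-by-words matrix of occurrence counts.
% Thresholding at zero turns counts into presence flags; summing over
% documents gives the number of documents that contain each word.
docFreq = full(sum(bag.Counts > 0, 1));

% Pair each vocabulary word with its document frequency.
T = table(bag.Vocabulary', docFreq', 'VariableNames', {'Word','NumDocs'});
disp(T)
```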

shilpa patil on 23 Sep 2019

Edited: shilpa patil on 23 Sep 2019

How can the above code be rewritten for a document image instead of a text file?
