topkngrams

Most frequent n-grams

Description

tbl = topkngrams(bag) returns a table listing the five most frequently seen n-grams in the bag-of-n-grams model bag. The function, by default, is case sensitive.

tbl = topkngrams(bag,k) lists the k most frequently seen n-grams in the bag-of-n-grams model bag. The function, by default, is case sensitive.

tbl = topkngrams(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Create a table of the most frequent bigrams of a bag-of-n-grams model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model.

bag = bagOfNgrams(documents)
bag = 
  bagOfNgrams with properties:

          Counts: [154×8799 double]
      Vocabulary: [1×3092 string]
          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154

Find the top 5 bigrams.

tbl = topkngrams(bag)
tbl=5×3 table
         Ngram          Count    NgramLength
    ________________    _____    ___________

    "thou"    "art"      34           2     
    "mine"    "eye"      15           2     
    "thy"     "self"     14           2     
    "thou"    "dost"     13           2     
    "mine"    "own"      13           2     

Find the top 10 bigrams.

tbl = topkngrams(bag,10)
tbl=10×3 table
          Ngram          Count    NgramLength
    _________________    _____    ___________

    "thou"    "art"       34           2     
    "mine"    "eye"       15           2     
    "thy"     "self"      14           2     
    "thou"    "dost"      13           2     
    "mine"    "own"       13           2     
    "thy"     "sweet"     12           2     
    "thy"     "love"      11           2     
    "dost"    "thou"      10           2     
    "thou"    "wilt"      10           2     
    "love"    "thee"       9           2     

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model. To count n-grams of length 2 and 3 (bigrams and trigrams), specify 'NgramLengths' to be the vector [2 3].

bag = bagOfNgrams(documents,'NgramLengths',[2 3])
bag = 
  bagOfNgrams with properties:

          Counts: [154×18022 double]
      Vocabulary: [1×3092 string]
          Ngrams: [18022×3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

View the 10 most common n-grams of length 2 (bigrams).

topkngrams(bag,10,'NgramLengths',2)
ans=10×3 table
             Ngram             Count    NgramLength
    _______________________    _____    ___________

    "thou"    "art"      ""     34           2     
    "mine"    "eye"      ""     15           2     
    "thy"     "self"     ""     14           2     
    "thou"    "dost"     ""     13           2     
    "mine"    "own"      ""     13           2     
    "thy"     "sweet"    ""     12           2     
    "thy"     "love"     ""     11           2     
    "dost"    "thou"     ""     10           2     
    "thou"    "wilt"     ""     10           2     
    "love"    "thee"     ""      9           2     

View the 10 most common n-grams of length 3 (trigrams).

topkngrams(bag,10,'NgramLengths',3)
ans=10×3 table
               Ngram                Count    NgramLength
    ____________________________    _____    ___________

    "thy"     "sweet"    "self"       4           3     
    "why"     "dost"     "thou"       4           3     
    "thy"     "self"     "thy"        3           3     
    "thou"    "thy"      "self"       3           3     
    "mine"    "eye"      "heart"      3           3     
    "thou"    "shalt"    "find"       3           3     
    "fair"    "kind"     "true"       3           3     
    "thou"    "art"      "fair"       2           3     
    "love"    "thy"      "self"       2           3     
    "thy"     "self"     "thou"       2           3     

Input Arguments

bag: Input bag-of-n-grams model, specified as a bagOfNgrams object.

k: Number of n-grams to return, specified as a positive integer or Inf.

If k is Inf, then the function returns all n-grams, sorted in descending order of frequency.

Example: 20
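
As a minimal sketch of the k = Inf case, the following uses a small, made-up pair of documents (not part of the shipped example data) and assumes the default bigram model. With k set to Inf, the returned table should have one row per n-gram in the model, ordered by count.

```matlab
% Two short, hypothetical documents used only for illustration.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% Default bag-of-n-grams model (bigrams).
bag = bagOfNgrams(documents);

% Request every n-gram in the model rather than a fixed number.
tbl = topkngrams(bag,Inf);

% Compare the number of rows returned with the size of the model.
numReturned = height(tbl);
numInModel = bag.NumNgrams;
```

Here numReturned and numInModel are illustrative variable names; comparing them is a quick way to confirm that no n-grams were dropped.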

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'NgramLengths',[2 3] specifies to return the top bigrams and trigrams.

N-gram lengths, specified as the comma-separated pair consisting of 'NgramLengths' and a positive integer or a vector of positive integers.

If you specify NgramLengths, then the function returns n-grams of these lengths only. If you do not specify NgramLengths, then the function returns the top n-grams regardless of length.

Example: [1 2 3]
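
The difference between specifying and omitting 'NgramLengths' can be sketched on a model that counts both unigrams and bigrams. The single document below is a made-up illustration, not the sonnets data.

```matlab
% One short, hypothetical document used only for illustration.
documents = tokenizedDocument("thou art fair and thou art kind");

% Count both unigrams and bigrams.
bag = bagOfNgrams(documents,'NgramLengths',[1 2]);

% Without 'NgramLengths', n-grams of every length in the model compete.
tblAll = topkngrams(bag,Inf);

% With 'NgramLengths', only n-grams of the requested length are returned.
tblUni = topkngrams(bag,Inf,'NgramLengths',1);
```

In this sketch tblAll mixes rows with NgramLength 1 and 2, while tblUni contains unigrams only.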

Option to ignore case, specified as the comma-separated pair consisting of 'IgnoreCase' and one of the following:

  • false – treat n-grams differing only by case as separate n-grams.

  • true – treat n-grams differing only by case as the same n-gram and merge counts.
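
A small sketch of the effect of 'IgnoreCase', assuming two made-up documents whose first word differs only by case. The documents and variable names are illustrative only.

```matlab
% Hypothetical documents: "Thou art" and "thou art" differ only by case.
documents = tokenizedDocument([
    "Thou art fair"
    "thou art kind"]);
bag = bagOfNgrams(documents);

% Case sensitive (default): the two spellings are separate bigrams.
tblSensitive = topkngrams(bag,Inf);

% Case insensitive: the two spellings are merged and their counts added.
tblIgnore = topkngrams(bag,Inf,'IgnoreCase',true);
```

With these inputs, tblIgnore should have one fewer row than tblSensitive, and the merged bigram should carry the combined count.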

Option to force the output to be returned as a cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.

Data Types: logical
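
The following sketch contrasts the two output forms for a scalar model, using a made-up one-line document. Forcing cell output can be convenient when the same downstream code must handle both scalar and array inputs.

```matlab
% One short, hypothetical document used only for illustration.
documents = tokenizedDocument("thou art fair and thou art kind");
bag = bagOfNgrams(documents);

% Default: a scalar model returns a table.
tbl = topkngrams(bag,3);

% Forced: the same result wrapped in a cell array.
C = topkngrams(bag,3,'ForceCellOutput',true);
firstTable = C{1};
```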

Output Arguments

tbl: Top n-grams, returned as a table or a cell array of tables. The function sorts the n-grams in descending order of frequency.

The table has the following columns:

Ngram          N-gram, specified as a string vector.
Count          Number of times the n-gram appears in the bag-of-n-grams model.
NgramLength    Length of the n-gram.
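
As a sketch of how the columns might be used, the following joins the words of each n-gram into a single string and reads the counts. The document is made up for illustration. The call to strip is included on the assumption that shorter n-grams are padded with empty strings, as in the mixed-length example output shown earlier.

```matlab
% One short, hypothetical document used only for illustration.
documents = tokenizedDocument("thou art fair and thou art kind");
bag = bagOfNgrams(documents);
tbl = topkngrams(bag,3);

% Join the words in each row of the Ngram column into one phrase.
% strip removes trailing spaces left by any "" padding.
phrases = strip(join(tbl.Ngram," ",2));

% Counts are stored as a numeric column.
counts = tbl.Count;
```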

If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of tables. Each element in the cell array is a table containing the top n-grams of the corresponding element of bag.

Version History

Introduced in R2018a