Add lemma forms of tokens to documents


Use addLemmaDetails to add lemma forms to documents.

The function supports English, Japanese, and Korean text.


updatedDocuments = addLemmaDetails(documents) adds lemma details to documents and updates the token details. To get the lemma details from updatedDocuments, use tokenDetails.


updatedDocuments = addLemmaDetails(documents,'DiscardKnownValues',true) discards previously computed details and recomputes them.
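
As a minimal sketch of the second syntax, the following recomputes lemma details for documents that already contain them. The example sentence is illustrative and is not taken from a specific data set.

```matlab
% Tokenize a short English sentence (illustrative text).
documents = tokenizedDocument("The dogs ran after the cat.");

% First pass: compute and store lemma details.
documents = addLemmaDetails(documents);

% Second pass: discard the stored lemma details and recompute them.
documents = addLemmaDetails(documents,'DiscardKnownValues',true);

% Retrieve the token details, including the Lemma variable.
tdetails = tokenDetails(documents);
```

Without the 'DiscardKnownValues' option, the second call would be expected to leave the previously computed details in place.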


Call addLemmaDetails before using the lower, upper, and normalizeWords functions, because addLemmaDetails uses information that these functions remove.
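
The recommended ordering can be sketched as follows. This is only an illustration of the call order, using an example sentence chosen here; it makes no claim about how the later functions treat the stored details.

```matlab
% Tokenize the original, unmodified text (illustrative sentence).
documents = tokenizedDocument("The dogs ran after the cat.");

% Add lemma details first, while the original casing is still available.
documents = addLemmaDetails(documents);

% Only afterwards apply case conversion.
documents = lower(documents);
```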


Examples

Create a tokenized document array.

str = [ ...
    "The dogs ran after the cat."
    "I am building a house."];
documents = tokenizedDocument(str);

Add lemma details to the documents using addLemmaDetails. This function lemmatizes the text and adds the lemma form of each token to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addLemmaDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
     Token     DocumentNumber    LineNumber       Type        Language     Lemma 
    _______    ______________    __________    ___________    ________    _______

    "The"            1               1         letters           en       "the"  
    "dogs"           1               1         letters           en       "dog"  
    "ran"            1               1         letters           en       "run"  
    "after"          1               1         letters           en       "after"
    "the"            1               1         letters           en       "the"  
    "cat"            1               1         letters           en       "cat"  
    "."              1               1         punctuation       en       "."    
    "I"              2               1         letters           en       "i"    
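
Because tokenDetails returns a table, the Lemma variable can be indexed like any other table variable. The following is a minimal sketch that rebuilds the same documents so that it runs on its own; the variable names changed and lemmasDoc1 are chosen here for illustration.

```matlab
% Recreate the documents from the example and add lemma details.
str = [ ...
    "The dogs ran after the cat."
    "I am building a house."];
documents = addLemmaDetails(tokenizedDocument(str));
tdetails = tokenDetails(documents);

% Keep only the rows whose lemma differs from the original token.
changed = tdetails(tdetails.Token ~= tdetails.Lemma, {'Token','Lemma'});

% Collect the lemmas of the first document as a string array.
lemmasDoc1 = tdetails.Lemma(tdetails.DocumentNumber == 1);
```

Here changed lists only the tokens that lemmatization altered, such as "dogs" and "ran" in the output above, and lemmasDoc1 holds one lemma per token of the first document.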

Input Arguments


documents: Input documents, specified as a tokenizedDocument array.

Output Arguments


updatedDocuments: Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

Version History

Introduced in R2018b