Preprocess Text Data

Preprocess and clean up text data for analysis

Since R2023a

Description

The Preprocess Text Data Live Editor task helps prepare text data for analysis.

You can use the task to control these processing steps:

HTML clean up
Tokenization
Adding token details
Word normalization
Changing and removing words

The Preprocess Text Data Live Editor task generates code that performs the selected preprocessing steps, which you can use to create a preprocessing function for your workflows.

Preprocess Text Data Task in Live Editor

Open the Task

To add the Preprocess Text Data task to a live script in the MATLAB® Editor:

On the Live Editor tab, select Task > Preprocess Text Data.
In a code block in the live script, type a relevant keyword, such as preprocess, clean, or text. Select Preprocess Text Data from the suggested command completions.

Examples

Create Simple Preprocessing Function

This example shows how to use the Preprocess Text Data Live Editor task to create a function that cleans and preprocesses text data for analysis.

First, load the factory reports data. The data contains textual descriptions of factory failure events.

tbl = readtable("factoryReports.csv")

A table with variables "Description", "Category", and "Urgency". The "Description" variable contains textual descriptions such as "Items are occasionally getting stuck in the scanner spools". The "Category" variable contains categorical labels such as "Mechanical Failure", and the "Urgency" variable contains categorical labels such as "Medium".

Open the Preprocess Text Data Live Editor task. To open the task, begin typing the task name and select Preprocess Text Data from the suggested command completions. Alternatively, on the Live Editor tab, select Task > Preprocess Text Data.

Drop down list showing suggested command completions. The only suggestion in the list is for the Preprocess Text Data task, and is selected.

Preprocess the text using these options:

Select tbl as the input data and select the table variable Description.
Tokenize the text using automatic language detection.
To improve lemmatization, add part-of-speech tags to the token details.
Normalize the words using lemmatization.
Remove words with fewer than 3 characters or more than 14 characters.
Remove stop words.
Erase punctuation.
Display the preprocessed text in a word cloud.

Preprocess Text Data task with fields corresponding to preprocessing options highlighted with numbered red rectangles. The image highlights these options in order: "Data", "Language", "Add part-of-speech tags", "Normalize words", "Minimum word length", "Maximum word length", "Remove stop words", "Erase punctuation", and "Show word cloud".

A word cloud showing words in different font sizes. Larger font sizes indicate more frequent words in the data. The word cloud highlights words like "assembler" and "mixer". Words like "the" and "in" do not appear in the word cloud.

The Preprocess Text Data Live Editor task generates code in your live script. The generated code reflects the options that you select and includes code to generate the display. To see the generated code, click Show code at the bottom of the task parameter area. The task expands to display the generated code.

MATLAB code generated by Preprocess Text Data task

By default, the generated code uses preprocessedText as the name of the output variable returned to the MATLAB workspace. To specify a different output variable name, enter a new name in the summary line at the top of the task.

To reuse the same steps in your code, create a function that takes the text data as input and returns the preprocessed text data. You can include the function at the end of a script or as a separate file. The preprocessTextData function, listed at the end of the example, uses the code generated by the Preprocess Text Data Live Editor task.

To use the function, specify the table as input to the preprocessTextData function.

documents = preprocessTextData(tbl);

Preprocessing Function

The preprocessTextData function uses the code generated by the Preprocess Text Data Live Editor task. The function takes as input the table tbl and returns the preprocessed text preprocessedText. The function performs these steps:

Extract the text data from the Description variable of the input table.
Tokenize the text using tokenizedDocument.
Add part-of-speech details using addPartOfSpeechDetails.
Lemmatize the words using normalizeWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.
Remove stop words (such as "and", "of", and "the") using removeStopWords.
Erase punctuation using erasePunctuation.

function preprocessedText = preprocessTextData(tbl)

%% Preprocess Text
preprocessedText = tbl.Description;

% Tokenize
preprocessedText = tokenizedDocument(preprocessedText);

% Add token details
preprocessedText = addPartOfSpeechDetails(preprocessedText);

% Change and remove words
preprocessedText = normalizeWords(preprocessedText,Style="lemma");
preprocessedText = removeShortWords(preprocessedText,2);
preprocessedText = removeLongWords(preprocessedText,15);
preprocessedText = removeStopWords(preprocessedText,IgnoreCase=false);
preprocessedText = erasePunctuation(preprocessedText);

end

For an example showing a more detailed workflow, see Preprocess Text Data in Live Editor. For next steps in text analytics, you can try creating a classification model or analyzing the data using topic models. For examples, see Create Simple Text Model for Classification and Analyze Text Data Using Topic Models.

Parameters

Select Data

`Data` — Text to preprocess
workspace variable

Text to preprocess, specified as a MATLAB workspace variable. The variable must be a table, string array, or character vector to appear in the list.

If you select a table, then specify the table variable containing the text data in the second drop-down box that appears.

Clean Up HTML

`Extract HTML text` — Extract text data from HTML tags
`off` (default) | `on`

Extract text data from HTML tags.

The generated code uses extractHTMLText.

`Remove HTML tags` — Remove HTML tags
`off` (default) | `on`

Remove HTML tags.

The generated code uses eraseTags.

`Decode HTML entities` — Convert HTML and XML entities into characters
`off` (default) | `on`

Convert HTML and XML entities into characters. For example, convert "&amp;" to "&".

The generated code uses decodeHTMLEntities.
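
As a rough illustration of the three HTML clean-up options, the following sketch calls the corresponding functions directly. The HTML string is a made-up example, and the order of the calls is an assumption rather than the exact code that the task generates.

```matlab
% Hypothetical HTML input, used only for illustration.
code = "<html><body><p>Pump &amp; valve <b>failure</b> reported.</p></body></html>";

% Extract HTML text: parse the HTML and return its text content.
textOnly = extractHTMLText(code);

% Remove HTML tags: erase the tags but keep the text between them.
noTags = eraseTags(code);

% Decode HTML entities: convert entities such as "&amp;" into characters.
decoded = decodeHTMLEntities(noTags);
```

In this sketch, decoded no longer contains tags or the "&amp;" entity, which is usually what you want before tokenization.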

Tokenize

`Language` — Text language
`Automatic` (default) | `English` | `German` | `Japanese` | `Korean`

Text language, specified as one of these options:

Automatic: Automatic language detection
English: English language
German: German language
Japanese: Japanese language
Korean: Korean language

The generated code uses tokenizedDocument.

`Split` — Text splitting mode
`None` (default) | `Sentences` | `Paragraphs`

Text splitting mode, specified as one of these options:

None

Do not split input.

Sentences

Split input into sentences. This option supports scalar input only.

The generated code uses splitSentences.

Paragraphs

Split input into paragraphs. This option supports scalar input only.

The generated code uses splitParagraphs.
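
The following sketch shows one way the Split and Language options might combine in code. It assumes that splitting happens on the raw string before tokenization; the input string is hypothetical, and the generated code may differ in detail.

```matlab
% Hypothetical scalar input. The Sentences option supports scalar input only.
str = "The pump is leaking. The mixer trips the fuses.";

% Split: turn the scalar string into one string per sentence.
sentences = splitSentences(str);

% Tokenize: create one document per sentence and set the language explicitly.
% Omit the Language argument to rely on automatic language detection.
documents = tokenizedDocument(sentences,Language="en");
```

To split on paragraphs instead, replace splitSentences with splitParagraphs.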

Add Token Details

`Add sentence numbers` — Option to add sentence numbers
`off` (default) | `on`

Option to add sentence numbers to tokens.

The generated code uses addSentenceDetails.

`Add part-of-speech tags` — Option to add part-of-speech tags
`on` (default) | `off`

Option to add part-of-speech tags to tokens.

The generated code uses addPartOfSpeechDetails.

`Detect named entities` — Option to detect named entities
`off` (default) | `on`

Option to detect named entities in tokens.

The generated code uses addEntityDetails.

`Parse dependencies` — Option to parse dependencies
`off` (default) | `on`

Option to parse dependencies in tokens. This option requires the Text Analytics Toolbox™ Model for Udify data support package.

The generated code uses addDependencyDetails.
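
To see what the token detail options add, you can inspect the table returned by tokenDetails. This minimal sketch uses a made-up input and covers only sentence numbers and part-of-speech tags. It leaves out addEntityDetails, and also addDependencyDetails because that function needs the support package.

```matlab
% Hypothetical input text with two sentences.
documents = tokenizedDocument("The assembler pistons are jammed. The mixer tripped the fuses.");

% Add sentence numbers and part-of-speech tags to the token details.
documents = addSentenceDetails(documents);
documents = addPartOfSpeechDetails(documents);

% View the details. The table gains SentenceNumber and PartOfSpeech variables.
tdetails = tokenDetails(documents);
head(tdetails)
```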

Token Edit and Removal

`Word normalization` — Word normalization
`Lemma` (default) | `Stem` | `None`

Word normalization, specified as one of these options:

None: Do not normalize words.
Lemma: Normalize words using lemmatization. This option outputs text in lowercase.
Stem: Normalize words using stemming.

The generated code uses normalizeWords.
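
The difference between the Lemma and Stem options is easiest to see side by side. The sketch below uses a hypothetical sentence and adds part-of-speech details first, as the example on this page does, to improve lemmatization.

```matlab
% Hypothetical input text.
documents = tokenizedDocument("The scanners were getting stuck");

% Part-of-speech tags help the lemmatizer choose the right dictionary form.
documents = addPartOfSpeechDetails(documents);

% Lemma: reduce words to dictionary forms. The output is lowercase.
lemmas = normalizeWords(documents,Style="lemma");

% Stem: strip word endings using a stemming algorithm.
stems = normalizeWords(documents,Style="stem");
```

Stems are not always valid words, so lemmatization tends to be the better choice when you plan to display the results, for example in a word cloud.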

`Case normalization` — Case normalization
`None` (default) | `Uppercase` | `Lowercase`

Case normalization, specified as one of these options:

None

Do not normalize case.

Note

The Lemma option of Word normalization converts text to lowercase.

Lowercase

Convert text to lowercase.

The generated code uses lower.

Uppercase

Convert text to uppercase.

The generated code uses upper.
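
A minimal sketch of the two case options, using a hypothetical input:

```matlab
% Hypothetical input text with mixed case.
documents = tokenizedDocument("Mixer Tripped The Fuses");

% Lowercase option.
lowerDocuments = lower(documents);

% Uppercase option.
upperDocuments = upper(documents);
```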

`Minimum word length` — Minimum word length
`3` (default) | positive integer | `off`

Minimum word length, specified as one of these options:

off: Do not remove short words.
positive integer: Remove words with fewer than the specified number of characters.

The generated code uses removeShortWords.

`Maximum word length` — Maximum word length
`14` (default) | positive integer | `off`

Maximum word length, specified as one of these options:

off: Do not remove long words.
positive integer: Remove words with more than the specified number of characters.

The generated code uses removeLongWords.
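
The task values are the shortest and longest word lengths to keep, whereas removeShortWords and removeLongWords take the length at which removal starts. This is why the example on this page passes 2 and 15 for the default values 3 and 14. The sketch below makes that offset explicit. The variable names minLength and maxLength and the input text are assumptions for illustration.

```matlab
% Task settings (defaults): keep words with 3 to 14 characters.
minLength = 3;
maxLength = 14;

% Hypothetical input text.
documents = tokenizedDocument("an ox saw the electroencephalograph");

% removeShortWords removes words with the specified number of characters or fewer.
documents = removeShortWords(documents,minLength - 1);

% removeLongWords removes words with the specified number of characters or more.
documents = removeLongWords(documents,maxLength + 1);
```

Only "saw" and "the" remain, because "an" and "ox" are too short and "electroencephalograph" is too long.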

`Remove stop words` — Option to remove stop words
`on` (default) | `off`

Option to remove stop words.

The generated code uses removeStopWords.

`Erase punctuation` — Option to erase punctuation
`on` (default) | `off`

Option to erase punctuation.

The generated code uses erasePunctuation.
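
As a sketch, the two options above correspond to calls like these. The input sentence is hypothetical, and the call to removeStopWords uses the function defaults rather than the IgnoreCase setting that the task passes in its generated code.

```matlab
% Hypothetical input text.
documents = tokenizedDocument("The pump is leaking, and the mixer is stuck!");

% Remove stop words such as "the", "is", and "and".
documents = removeStopWords(documents);

% Erase punctuation. Tokens that consist only of punctuation are removed.
documents = erasePunctuation(documents);
```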

`Replace words` — Source and target words for replacement
pairs of source and target strings

Source and target words for replacement, specified as pairs of source and target strings. To specify multiword phrases (n-grams), use whitespace-separated words.

The generated code uses replaceWords and replaceNgrams.

`Remove words` — Words to remove
string

Words to remove, specified as strings. To specify multiword phrases (n-grams), use whitespace-separated words.

The generated code uses removeWords and removeNgrams.

`Remove empty documents` — Option to remove empty documents
`off` (default) | `on`

Option to remove empty documents.

The generated code uses removeEmptyDocuments.

`Ignore case` — Option to ignore case
`off` (default) | `on`

Option to ignore case in word change and removal options.
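
The word replacement, word removal, and empty document options can be sketched with single-word inputs as follows. The documents and word lists are made up, and the sketch leaves out replaceNgrams and removeNgrams, which the task uses for multiword phrases.

```matlab
% Two hypothetical documents.
documents = tokenizedDocument(["the mixer has tripped" "OK"]);

% Replace words: map the source word "mixer" to the target word "blender".
documents = replaceWords(documents,"mixer","blender");

% Remove words: with IgnoreCase=true, "ok" also matches "OK".
documents = removeWords(documents,["the" "ok"],IgnoreCase=true);

% Remove empty documents: the second document has no words left.
% idx contains the indices of the removed documents.
[documents,idx] = removeEmptyDocuments(documents);
```

If the text has matching labels, for example in another table variable, you can use idx to remove the corresponding labels so that they stay aligned with the remaining documents.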

Display Results

`Show tokenized text` — Option to show tokenized text
`off` (default) | `on`

Option to show tokenized text.

`Show token details` — Option to show token details
`off` (default) | `on`

Option to show token details.

The generated code uses tokenDetails.

`Show word cloud` — Option to show word cloud
`off` (default) | `on`

Option to show word cloud.

The generated code uses wordcloud.
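
A minimal sketch of the word cloud display, assuming a small set of hypothetical documents:

```matlab
% Hypothetical preprocessed documents.
documents = tokenizedDocument([ ...
    "mixer tripped fuses"
    "assembler pistons jammed"
    "mixer makes loud noise"]);

% Show word cloud: more frequent words appear in larger font sizes.
figure
wc = wordcloud(documents);
title("Preprocessed Text")
```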

Tips

By default, the Preprocess Text Data task does not automatically run when you modify the task parameters. To have the task run automatically after any change, select the Autorun checkbox at the top-right of the task. If your data set is large, do not enable this option.

Version History

Introduced in R2023a
