textanalytics.unicode.nfkc

Unicode compatibility composed normalized form (NFKC)

Since R2022b

    Description

    newStr = textanalytics.unicode.nfkc(str) normalizes the string str to the Unicode compatibility composed normalized form (NFKC).

    Examples

    Strings that look identical can have different underlying representations. The Unicode compatibility composed normalized form (NFKC) ensures that equivalent strings have a unique binary representation.

    Consider the string "eﬃcient", where the ligature character "ﬃ" (U+FB03) is a single code unit written as the escape sequence "\xFB03". The string has length 7.

    str = compose("e\xFB03") + "cient"
    str = 
    "eﬃcient"
    
    strlength(str)
    ans = 7
    

    Normalize the string using the textanalytics.unicode.nfkc function.

    newStr = textanalytics.unicode.nfkc(str)
    newStr = 
    "efficient"
    

    View the length of the normalized string. The normalized representation has two more code units than the original because the function replaces the single "ﬃ" ligature character with the three-character string "ffi".

    strlength(newStr)
    ans = 9
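
    To see where the two extra code units come from, it can help to list the numeric code units of each string. The following is a minimal sketch that rebuilds the example strings so that it runs on its own. The conversion through char and double is a general MATLAB idiom, not part of the nfkc function, and the variable names unitsBefore and unitsAfter are illustrative.

```matlab
% Rebuild the example strings so this snippet is self-contained.
str = compose("e\xFB03") + "cient";
newStr = textanalytics.unicode.nfkc(str);

% Convert each string to a character vector, then to numeric code units.
unitsBefore = double(char(str));     % includes 64259 (hex FB03), the ligature
unitsAfter  = double(char(newStr));  % includes 102 102 105 ("ffi") instead

% Count how often the ligature code unit occurs before and after normalization.
countBefore = nnz(unitsBefore == 64259)
countAfter  = nnz(unitsAfter == 64259)
```

    Given the lengths shown above, countBefore should be 1 and countAfter should be 0.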
    

    Extract the second to fourth code units of the normalized string.

    extractBetween(newStr,2,4)
    ans = 
    "ffi"
    

    Check whether the strings str and newStr are equal using the == operator. The operator returns 0 because the strings have different underlying representations.

    tf = str == newStr
    tf = logical
       0
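
    The comparison returns 0 because only one of the two operands is normalized. A minimal sketch of a comparison that ignores such representation differences, assuming both operands are passed through textanalytics.unicode.nfkc first (the variable names are illustrative):

```matlab
% Two spellings of the same word: one with the ligature, one with plain letters.
strLigature = compose("e\xFB03") + "cient";
strPlain = "efficient";

% Comparing the raw strings fails because the code units differ.
tfRaw = strLigature == strPlain

% Normalizing both sides first compares the NFKC representations instead.
tfNormalized = textanalytics.unicode.nfkc(strLigature) == textanalytics.unicode.nfkc(strPlain)
```

    Because the plain spelling is already in NFKC, tfNormalized should be 1 while tfRaw stays 0.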
    
    

    Input Arguments

    str: Input text, specified as a string array, character vector, or cell array of character vectors.

    Example: ["An example of a short sentence."; "A second short sentence."]

    Data Types: string | char | cell

    Output Arguments

    newStr: Output text, returned as a string array, character vector, or cell array of character vectors. str and newStr have the same data type.
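
    As a rough illustration of the input and output types described above, the following sketch passes the same text in each supported form and checks the class of the result. The variable names are hypothetical, and the expected classes follow from the statement that str and newStr have the same data type.

```matlab
% The same text stored as three of the supported input types.
strIn  = compose("e\xFB03") + "cient";  % string scalar
chrIn  = char(strIn);                   % character vector
cellIn = {chrIn};                       % cell array of character vectors

% Normalize each input and report the class of the result.
classFromString = class(textanalytics.unicode.nfkc(strIn))
classFromChar   = class(textanalytics.unicode.nfkc(chrIn))
classFromCell   = class(textanalytics.unicode.nfkc(cellIn))
```

    If the output type mirrors the input type as documented, the three results should be 'string', 'char', and 'cell'.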

    Algorithms

    Unicode Normalization Forms

    For more information about Unicode normalization forms, see Unicode Standard Annex #15: Unicode Normalization Forms [1].
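
    The two parts of the name, compatibility and composed, can be seen separately in a small sketch. This is an illustration based on the behavior defined in Unicode Standard Annex #15, not a description of how the function is implemented, and the variable names are hypothetical.

```matlab
% Compatibility mapping: the single ligature U+FB03 becomes the letters "ffi".
ligature = string(char(hex2dec("FB03")));
lenLigatureBefore = strlength(ligature)
lenLigatureAfter  = strlength(textanalytics.unicode.nfkc(ligature))

% Composition: "e" followed by the combining acute accent U+0301
% becomes the single precomposed character U+00E9.
accented = "e" + string(char(hex2dec("0301")));
lenAccentedBefore = strlength(accented)
lenAccentedAfter  = strlength(textanalytics.unicode.nfkc(accented))
```

    Under NFKC, the ligature length should grow from 1 to 3, and the accented sequence should shrink from 2 to 1.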

    References

    [1] Whistler, Ken, ed. "Unicode Standard Annex #15: Unicode Normalization Forms." Unicode Technical Reports, August 27, 2021. https://unicode.org/reports/tr15/.

    Version History

    Introduced in R2022b