Efficient way to standardize large amounts of text
André Kucharzewski on 19 Oct 2021
I have a table with around 1 million rows. One column contains different types of strings.
Mixed with letters and numbers. Like:
There are around 120 different text formats, which repeat. Most of them can be converted to a standard format like aa_11. Any string that cannot be converted should get a standard undef format.
Any suggestions on how I can handle such a large dataset without a for loop over 1 million rows that checks each cell?
Thanks in advance :)
Duncan Po on 19 Oct 2021
You may be able to use patterns. For example, if the standard format is letters followed by an underscore followed by numbers, you can detect it like this:
>> x = ["abc_123", "cdf_123", "123_cdf", "123 (abc)"]; % create an example string array
>> matches(x, lettersPattern + "_" + digitsPattern) % check if the strings match the standard pattern
1×4 logical array
1 1 0 0
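Building on the logical mask above, here is a rough sketch of how the conversion itself could be vectorized: one mask per known format, applied to the whole column at once, with "undef" as the default. The sample strings, the variable names (x, out, isStd, ...) and the three non-standard formats handled here are assumptions for illustration only; the real data has around 120 formats, each of which would need its own pattern and rewrite rule. It assumes R2020b or later for the pattern functions.

```matlab
% Hypothetical sample data; in practice this would be the table column
x = ["abc_123"; "cdf-123"; "123_cdf"; "123 (abc)"; "??"];

letters = lettersPattern;
digits  = digitsPattern;

% Start with the fallback value for every row
out = repmat("undef", size(x));

% Format 1: already standard (letters_digits), copy unchanged
isStd = matches(x, letters + "_" + digits);
out(isStd) = x(isStd);

% Format 2: letters-digits, swap the hyphen for an underscore
isHyphen = matches(x, letters + "-" + digits);
out(isHyphen) = replace(x(isHyphen), "-", "_");

% Format 3: digits_letters, reorder the two parts
isSwapped = matches(x, digits + "_" + letters);
out(isSwapped) = regexprep(x(isSwapped), '^(\d+)_(.+)$', '$2_$1');

% Format 4: digits (letters), reorder and drop the parentheses
isParen = matches(x, digits + " (" + letters + ")");
out(isParen) = regexprep(x(isParen), '^(\d+) \((.+)\)$', '$2_$1');
```

Each step works on the full string array, so the only loop would be over the formats, not over the 1 million rows. If the column is stored as a cell array of character vectors, converting it with string(...) first should make the same code applicable.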