# unique is giving the same expression twice

Wesso on 29 Jan 2021
Edited: dpb on 29 Jan 2021
Hi,
(data is attached)
[Country,~,ix] = unique(A);
tally = accumarray(ix, 1);
Q2= table(Country, tally);
Q2 contains the same expression twice among the unique values: 'Audit and assurance, and tax services'. What could be the reason, and how can I overcome it? Is it a bug?
dpb on 29 Jan 2021
This undoubtedly is the same issue I pointed out before at https://www.mathworks.com/matlabcentral/answers/730643-replacing-999-in-a-table-to-nan-regardless-of-the-type-of-the-column?s_tid=srchtitle#comment_1294958 where the encoding is different. Thus the strings visually appear the same, but one contains a double-byte character and the other doesn't.
Here are the specifics showing what was there for the particular set of values I looked at; undoubtedly you'll find the same thing here if you look carefully...
>> sort(categories(Final.org04b))
ans =
46×1 cell array
{'-999' }
{'-9999' }
...
{'I don't know' }
{'I don’t know' }
...
>> tmp=ans(42:43)
tmp =
2×1 cell array
{'I don't know'}
{'I don’t know'}
>> strcmp(tmp(1),tmp(2))
ans =
logical
0
>> [double(tmp{1});double(tmp{2})]
ans =
73 32 100 111 110 39 116 32 107 110 111 119
73 32 100 111 110 8217 116 32 107 110 111 119
>>
NB: the second string contains the extended character 8217 (U+2019, the right single quotation mark) instead of ASCII 39 for the single quote.
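Rather than hunting for such entries by eye, one option is to flag every string that holds a character code above 127. The following is only a minimal sketch with made-up data; the names `cats` and `hasNonAscii` are illustrative and not part of the original workspace.

```matlab
% Illustrative data: the second entry uses char(8217) instead of the ASCII apostrophe (39)
cats = {'I don''t know'; ['I don' char(8217) 't know']; 'Other'};

% True for every entry containing at least one code outside 7-bit ASCII
hasNonAscii = cellfun(@(s) any(double(s) > 127), cats);

% List the offending entries together with the non-ASCII codes they contain
for k = find(hasNonAscii).'
    codes = unique(double(cats{k}(double(cats{k}) > 127)));
    fprintf('%s : %s\n', cats{k}, mat2str(codes));
end
```

For a categorical variable, `cats` would presumably be taken from `categories(...)` as in the listing above.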

dpb on 29 Jan 2021
Edited: dpb on 29 Jan 2021
I didn't notice the data attached for this case -- the same exercise as above shows:
>> sort(categories(A))
ans =
29×1 cell array
{'Agriculture and fishing' }
{'Audit and assurance, and tax services' }
{'Audit and assurance, and tax services' }
{'Banking and capital markets' }
{'Civil Societies/NGOs' }
{'Civil society/NGOs' }
{'Construction' }
{'Consulting services' }
{'Electronics' }
{'Energy, utilities and resources' }
{'Financial services' }
{'Food Services' }
{'Government and public services' }
{'Health and healthcare services' }
{'Hospitality' }
{'IT and telecommunications' }
{'Manufacturing' }
{'Mining and Quarrying' }
{'Oil and gas' }
{'Other' }
{'Petrochemicals' }
{'Real Estate' }
{'Tourism' }
{'Transportation and logistics' }
{'org03' }
>> tmp=ans(2:3)
tmp =
2×1 cell array
{'Audit and assurance, and tax services'}
{'Audit and assurance, and tax services'}
>>
There's an extended character (code 160, the non-breaking space) in the second where there's an ordinary space (code 32) in the first:
>> find(tmp{1}~=tmp{2})
ans =
25
>> [double(tmp{1}(25));double(tmp{2}(25))]
ans =
32
160
>>
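One possible way around this particular duplicate is to replace the non-breaking space with an ordinary space before calling unique. This is a sketch under assumptions: the small array `A` below is a made-up stand-in for the attached data, and `txt`, `s1`, and `s2` are illustrative names.

```matlab
% Illustrative stand-in for A: entries 1 and 2 differ only by char(160) vs. a space
s1 = 'Audit and assurance, and tax services';
s2 = s1;
s2(25) = char(160);
A = categorical({s1; s2; 'Other'; s1});

% Work on the text: replace the non-breaking space, then trim stray whitespace
txt = strtrim(strrep(cellstr(A), char(160), ' '));

% Same tally as in the question, now computed on the cleaned text
[Country, ~, ix] = unique(txt);
tally = accumarray(ix, 1);
Q2 = table(Country, tally);
```

With the cleaned text, the two visually identical entries should collapse into a single row of `Q2`.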
Besides that, there are other anomalous entries as well, just as were pointed out in the other categorical array in the previous question:
...
{'Civil Societies/NGOs' }
{'Civil society/NGOs' }
...
These need to be cleaned up, or one will never be able to match all elements of what are obviously intended to be the same categories but are not.
The data need a thorough cleaning before being ready for prime time.
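For near-duplicates that differ in spelling rather than encoding, the categories could be combined by hand with mergecats once you have decided which ones mean the same thing. A minimal sketch with made-up data; which spelling to keep is an assumption here, not something the data dictate.

```matlab
% Illustrative data with two spellings of what is presumably one category
A = categorical({'Civil Societies/NGOs'; 'Civil society/NGOs'; 'Other'});

% Fold both spellings into one category (the retained name is an arbitrary choice)
A = mergecats(A, {'Civil Societies/NGOs', 'Civil society/NGOs'}, 'Civil society/NGOs');

% Check the result: the remaining categories and how many elements fall in each
cats = categories(A);
counts = countcats(A);
```

Each such merge is a judgment call about the data, so it is worth reviewing the full `categories` listing before and after.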