use unstack to reshape table with dummy variable (edited: alternative crosstab method)

5 visualizaciones (últimos 30 días)
After I sort table according to a group variable, I like to put the group value as the 'header' column and put its members in the other columns in the same row.
Data for the sake of demostration:
C = {'q1', 'q1', 'q2', 'q3', 'q3', 'q3';
'apple', 'appl', 'banana', 'orange', 'orang', 'orange'}';
T = cell2table(C, 'VariableNames', {'code', 'fruit'});
[GroupsID, Groups] = findgroups(T.code);
unique_groupID = unique(GroupsID);
gT = table('Size', [10,4], 'VariableTypes', {'string', 'string', 'string', 'string'});
Method 1. (edited.) brutal for-loop that I don't like, and its result needs more processing to remove redundancy in each row.
for k=1:size(unique_groupID)
% extract group elements from 'fruit'
tmp = T.fruit(GroupsID==unique_groupID(k));
l = size(tmp',2);
gT(k,1) = {Groups(k)};
gT(k,2:l+1) = tmp';
end
rmmissing(gT, "MinNumMissing", 3)
ans = 3×4 table
Var1 Var2 Var3 Var4 ____ ________ _________ _________ "q1" "apple" "appl" <missing> "q2" "banana" <missing> <missing> "q3" "orange" "orang" "orange"
Method 2 using unstack
I created a dummy variable for this method in order to use unstack( ). The code is shorter but doesn't give me the result I want.
D = {'dm1', 'dm2', 'dm3', 'dm4', 'dm5', 'dm6';
'q1', 'q1', 'q2', 'q3', 'q3', 'q3';
'apple', 'appl', 'banana', 'orange', 'orang', 'orange'}';
T = cell2table(D, 'VariableNames', {'dummy', 'code', 'fruit'});
unstack(T, "fruit", "dummy")
ans = 3×7 table
code dm1 dm2 dm3 dm4 dm5 dm6 ______ __________ __________ __________ __________ __________ __________ {'q1'} {'apple' } {'appl' } {0×0 char} {0×0 char} {0×0 char} {0×0 char} {'q2'} {0×0 char} {0×0 char} {'banana'} {0×0 char} {0×0 char} {0×0 char} {'q3'} {0×0 char} {0×0 char} {0×0 char} {'orange'} {'orang' } {'orange'}
Edited. Method 3 using crosstab. This method works nicely, but I wish I don't have to use a for-loop. The result from this method is exactly what I want.
[tb,~,~,lbs] = crosstab(T.code, T.fruit);
For-loop to create the intended table:
m = size(tb,1);
header = lbs(1:m,1);
fruits = lbs(:,2);
gT = table('Size',[6,4], 'VariableTypes', {'string','string','string','string'});
for i=1:m
tmp = fruits(tb(i,:)>0)';
l = size(tmp,2);
gT(i,"Var1") = header(i);
gT(i, 2:l+1) = tmp;
end
rmmissing(gT, "MinNumMissing",4)
ans = 3×4 table
Var1 Var2 Var3 Var4 ____ ________ _________ _________ "q1" "apple" "appl" <missing> "q2" "banana" <missing> <missing> "q3" "orange" "orang" <missing>
Edited. After I post the above code, I thought about that Method 3 may be made leaner.
ix = find(tb>0);
[rows,cols]=ind2sub([3,4],ix);
% then for-loop through rows and cols to populate the final table.
% I still can't avoid for-loop.
  2 comentarios
Simon
Simon el 1 de Abr. de 2023
Thanks for your quick response. The result from Solution 1 needs more processing to remove redundancy in each row. Solution 3 has the correct result.
My real data have more than hundreds of thousands of rows. Its first column stores 'account codes', the second column, 'account definition', and the third column is financial numerical values. For example, '1XXXXX' is the code, and 'Assets' is the account definition. These two should have a perfect 1-1 relationship. However, owing to human factor, the actual entry for 'account definition' for a given account code might slightly vary. For example, 'Assets' may be keyed in as 'Assest', or 'Aset'.
The code's downstream step is for me to visually check if there is anything odd about the 'account codes'-'account definition'. There are only around a hundred of unique 'account codes', which is more managable to human inspection than the original super-tall table.

Iniciar sesión para comentar.

Respuesta aceptada

Stephen23
Stephen23 el 1 de Abr. de 2023
Editada: Stephen23 el 27 de Abr. de 2023
Using UNSTACK is quite a neat solution because it will automatically pad different-length data to the same number of columns, adding "missing" values as required. This is otherwise fiddly to replicate. But to use UNSTACK, we need to add a variable that tells UNSTACK which columns to move the data into:
C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
code fruit ______ __________ {'q1'} {'apple' } {'q1'} {'appl' } {'q2'} {'banana'} {'q3'} {'orange'} {'q3'} {'orang' } {'q3'} {'orange'}
U = unique(T,'rows');
G = findgroups(U.code); % note1
F = @(n)(1:nnz(n==G)).'; % note1
U.count = cell2mat(arrayfun(F,unique(G),'uni',0)) % note1
U = 5×3 table
code fruit count ______ __________ _____ {'q1'} {'apple' } 1 {'q1'} {'appl' } 2 {'q2'} {'banana'} 1 {'q3'} {'orange'} 1 {'q3'} {'orang' } 2
U = unstack(U,"fruit","count", "VariableNamingRule","modify")
U = 3×3 table
code x1 x2 ______ __________ __________ {'q1'} {'apple' } {'appl' } {'q2'} {'banana'} {0×0 char} {'q3'} {'orange'} {'orang' }
note1: This just gives a unique index to each element of a group. Astonishingly there does not seem to be an easy inbuilt way to achieve this... does anyone have any tips?: e.g. [1,1,1,2,2,1] -> [1,2,3,1,2,4] .
EDIT: I found a neater way:
G = findgroups(U.code);
U.count = grouptransform(ones(size(G)),G,@cumsum);
  11 comentarios
Stephen23
Stephen23 el 27 de Abr. de 2023
Editada: Stephen23 el 27 de Abr. de 2023
I thought of another approach based on GROUPTRANSFORM:
As mentioned in my answer, the desired transformation is [1,1,1,2,2,1] -> [1,2,3,1,2,4] .
G = [1;1,;1;2;2;1]; % must be column vector
Y = grouptransform(ones(size(G)),G,@cumsum)
Y = 6×1
1 2 3 1 2 4
Nice, it seems to work as we want. However in this case G luckily consists of integers 1..N. In all other cases we need to use e.g. FINDGROUPS first. Lets try it with the fake data that I used in my answer:
C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
code fruit ______ __________ {'q1'} {'apple' } {'q1'} {'appl' } {'q2'} {'banana'} {'q3'} {'orange'} {'q3'} {'orang' } {'q3'} {'orange'}
U = unique(T,'rows');
G = findgroups(U.code);
U.count = grouptransform(ones(size(G)),G,@cumsum)
U = 5×3 table
code fruit count ______ __________ _____ {'q1'} {'appl' } 1 {'q1'} {'apple' } 2 {'q2'} {'banana'} 1 {'q3'} {'orang' } 1 {'q3'} {'orange'} 2
U = unstack(U,"fruit","count", "VariableNamingRule","modify")
U = 3×3 table
code x1 x2 ______ __________ __________ {'q1'} {'appl' } {'apple' } {'q2'} {'banana'} {0×0 char} {'q3'} {'orang' } {'orange'}

Iniciar sesión para comentar.

Más respuestas (1)

Peter Perkins
Peter Perkins el 5 de Abr. de 2023
I'm having trouble understanding the desireed output, but others have created what is essentially a crosstabulation of counts, so, new in R2023a
>> T = cell2table(C, 'VariableNames', {'code', 'fruit'});
>> pivot(T,Rows="code",Columns="fruit")
ans =
3×7 table
code appl apple banana orang orange oranges
______ ____ _____ ______ _____ ______ _______
{'q1'} 1 2 0 0 0 0
{'q2'} 0 0 1 0 0 0
{'q3'} 0 0 0 1 2 1
As cyclist points out, there are a bunch of empty bins, so the original "tidy" format may be more useful. To me, this looks like a case of "fruit ought to be categorical, and you ought to apply mergecats to clean up those typos/different spellings".
  7 comentarios
Simon
Simon el 13 de Mayo de 2023
| > EDIT: I found a neater way:
|> G = findgroups(U.code);
|> U.count = grouptransform(ones(size(G)),G,@cumsum);
Sorry about being late response. I was overwhelmed by things. This is indeed a very neat solution. I used to think grouptransform( ) is quite limited in its function. But when it is put to work over a dummy/extra variable, it could be quite versatile in problem solving.

Iniciar sesión para comentar.

Categorías

Más información sobre Loops and Conditional Statements en Help Center y File Exchange.

Productos


Versión

R2023a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by