Splitting a matrix according to their labels

Question

NotA_Programmer on 10 May 2022

Direct link to this question

https://es.mathworks.com/matlabcentral/answers/1715735-splitting-a-matrix-according-to-there-labels

Commented: Jon on 11 May 2022

I have a matrix (1900 x 4 double) whose fourth column contains the labels 3, 2 and 1. I want to split this data in a 20:80 ratio into A and B, where A contains 20% of the rows of each label (3, 2 and 1) and B contains 80% of each label, i.e. 80% of label 3, 80% of label 2 and 80% of label 1. Please help: how can this be achieved?

6 comments (4 older comments hidden)

NotA_Programmer on 10 May 2022

@Dyuman Joshi

Randomly.

But,

A (containing 20% of data rows) should contain [20% from label 3 rows + 20% from label 2 rows + 20% from label 1 rows].

B (containing 80% of data rows) should contain [80% from label 3 rows + 80% from label 2 rows + 80% from label 1 rows].

dpb on 10 May 2022

Add splitapply, or rowfun if using a table, to the above...
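
For reference, here is a minimal sketch of what a splitapply-based split could look like. The earlier comment this reply builds on is not shown here, so the random example data and the names data, rowIdx, pick and inA are assumptions for illustration, not dpb's actual code.

```matlab
% Sketch only: stratified pick of 20% of the rows of each label with splitapply
rng(0);                                     % fixed seed so the example repeats
data   = [rand(1900,3), randi(3,[1900,1])]; % made-up data, labels in column 4
G      = findgroups(data(:,4));             % group number for every row
rowIdx = (1:size(data,1)).';                % row numbers, split by group below
% for each group, draw 20% of its row numbers at random (returned in a cell)
pick   = splitapply(@(r) {r(randperm(numel(r), round(0.2*numel(r))))}, rowIdx, G);
inA    = false(size(data,1),1);
inA(vertcat(pick{:})) = true;               % mark the rows chosen for A
A = data(inA,:);                            % about 20% of each label
B = data(~inA,:);                           % the remaining rows
```

The function handle returns a 1x1 cell per group because splitapply expects one output value per group; the cells are then stacked with vertcat.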

Answer 1 (accepted answer)

Jon on 10 May 2022

Direct link to this answer

https://es.mathworks.com/matlabcentral/answers/1715735-splitting-a-matrix-according-to-there-labels#answer_961075

Edited: Jon on 10 May 2022

This is one way to do it

% make an example data file with last column having either a "label" of 1,
% 2, or 3
data = [rand(1900,3),randi(3,[1900,1])];
% loop through labels making training and validation data sets
Aparts = cell(3,1);
Bparts = cell(3,1);
for k = 1:3
    % get the indices of the rows with kth label
    idx = find(data(:,4)==k);
    numWithLabel = numel(idx);
    idxrand = idx(randperm(numWithLabel)); % randomize the selection
    % randomly put (within rounding) 80% in A (training), 20% in B (validation);
    % for the 20:80 split asked for in the question, swap the roles of A and B
    numTrain = round(0.8*numWithLabel);
    Aparts{k} = data(idxrand(1:numTrain),:);
    Bparts{k} = data(idxrand(numTrain+1:end),:); % the rest go to validation
end
% put all of the parts in one matrix of doubles 
A = cell2mat(Aparts);
B = cell2mat(Bparts);
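
Since the question asks for A to hold the 20% share, here is a hedged variant of the same idea with the fraction as a parameter. The names fracA, inA and shareA are made up for this sketch, and the data is random example data as above.

```matlab
% Sketch: stratified split where A gets fracA of each label (names are assumptions)
rng(0);                                   % fixed seed so the example repeats
data  = [rand(1900,3), randi(3,[1900,1])];
fracA = 0.2;                              % share of each label that goes to A
inA   = false(size(data,1),1);            % true for rows assigned to A
labels = unique(data(:,4)).';             % row vector, so the loop visits one label at a time
for k = labels
    idx = find(data(:,4)==k);             % rows carrying label k
    idx = idx(randperm(numel(idx)));      % shuffle them
    nA  = round(fracA*numel(idx));        % how many rows of this label go to A
    inA(idx(1:nA)) = true;
end
A = data(inA,:);
B = data(~inA,:);
% per-label share that ended up in A (each should be close to fracA)
shareA = arrayfun(@(k) sum(A(:,4)==k)/sum(data(:,4)==k), labels);
disp(shareA)
```

Unlike the cell2mat version, the logical mask keeps the rows of A and B in their original order rather than grouped by label.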

13 comments (11 older comments hidden)

NotA_Programmer on 10 May 2022

Hi Jon, thanks for your response.

My data has since changed to Combined_Data (7886 x 8 double). The last column, i.e. column 8, contains the labels 3, 2, 1.

I tried the code below, but it does not give the desired result. Can you please check it?

Desired output:

matrix A(xyz x 8 double)

matrix B(uvw x 8 double)

A (containing 20% of data rows) should contain [20% from label 3 rows + 20% from label 2 rows + 20% from label 1 rows].

B (containing 80% of data rows) should contain [80% from label 3 rows + 80% from label 2 rows + 80% from label 1 rows].

code:

filename = 'C.xlsx';
Combined_Data = xlsread(filename);
% loop through labels making training and validation data sets
Aparts = cell(7,1);
Bparts = cell(7,1);
for k = 1:7
    idx = find(Combined_Data(:,8)==k);
    numWithLabel = numel(idx);
    idxrand = idx(randperm(numWithLabel)); % randomize the selection
    % randomly put (within rounding) 80% in training, 20% in validation
    numTrain = round(0.8*numWithLabel);
    Aparts{k} = Combined_Data(idxrand(1:numTrain),:);
    Bparts{k} = Combined_Data(idxrand(numTrain+1:end),:); % the rest go to validation
end
% put all of the parts in one matrix of doubles
A = cell2mat(Aparts);
B = cell2mat(Bparts);
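
One way to see what this code produces is to tabulate the per-label row counts in A and B. The sketch below repeats the same logic (limited to labels 1 to 3) on random stand-in data, since C.xlsx is not available here; countsA, countsB and shareA are made-up names.

```matlab
% Sketch: check the per-label shares produced by the posted split logic
rng(2);                                              % fixed seed so the example repeats
Combined_Data = [rand(7886,7), randi(3,[7886,1])];   % random stand-in for C.xlsx (assumption)
Aparts = cell(3,1);
Bparts = cell(3,1);
for k = 1:3
    idx = find(Combined_Data(:,8)==k);
    idx = idx(randperm(numel(idx)));
    n80 = round(0.8*numel(idx));
    Aparts{k} = Combined_Data(idx(1:n80),:);         % as posted: the 80% share lands in A
    Bparts{k} = Combined_Data(idx(n80+1:end),:);     % and the 20% share in B
end
A = cell2mat(Aparts);
B = cell2mat(Bparts);
countsA = accumarray(A(:,8),1,[3 1]);                % rows of each label in A
countsB = accumarray(B(:,8),1,[3 1]);                % rows of each label in B
shareA  = countsA./(countsA+countsB);                % fraction of each label in A
disp([countsA, countsB, shareA])
```

With round(0.8*numWithLabel) rows going to A, shareA comes out near 0.8 for every label, which is the reverse of the desired output; swapping the two assignments (or using 0.2) would give A the 20% share.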

Jon on 11 May 2022

Edited: Jon on 11 May 2022

In case it is of interest, here is a simpler way to do the splitting without using any loops. I also made it more general: you can use any list of labels you want, they don't have to be consecutive, and there can be an arbitrary number of them.

% parameters
numPoints = 1900; % only needed to generate example data
labelColNo = 8; % column number of labels
labels = [1,2,3]; % possible labels
% make an example data file with last column having random label values
% from set of possible labels
numLabels = numel(labels);
labelColumn = labels(randi(numLabels,numPoints,1));
data = [rand(numPoints,labelColNo-1),labelColumn(:)];
% randomize (shuffle) the rows
data = data(randperm(numPoints),:);
% build a numPoints-by-numLabels logical matrix with one column per label,
% whose (i,j) entry is true if the ith row of data carries the jth label
isLabel = data(:,labelColNo)==labels;
% count entries and normalize to get cumulative fractions;
% record the cumulative fraction at each occurrence of a label
counts = cumsum(isLabel);
f = counts./counts(end,:).*isLabel; % sets values to zero where label doesn't occur
% mark all the rows with cumulative fraction up to 0.8 as being in the training set
isTraining = any(f>0 & f<=0.8,2);
A = data(isTraining,:);
B = data(~isTraining,:);
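
To make the cumulative-fraction step easier to follow, here is a tiny worked example using the same expressions on ten made-up labels (the label values are an assumption chosen for illustration; the shuffle is skipped so the result is easy to check by hand).

```matlab
% Tiny worked example of the cumulative-fraction trick (made-up labels)
labelColumn = [1;2;1;1;2;1;1;2;2;1];       % six rows of label 1, four of label 2
labels = [1,2];
isLabel = labelColumn==labels;              % 10x2 logical, one column per label
counts = cumsum(isLabel);                   % running count of each label
f = counts./counts(end,:).*isLabel;         % running fraction, zero where off-label
isTraining = any(f>0 & f<=0.8,2);           % keep rows whose fraction is at most 0.8
% columns: label, running fraction for that label, training flag
disp([labelColumn, max(f,[],2), isTraining])
```

Label 1 reaches the fractions 1/6, 2/6, ..., 6/6, so its first four rows are kept; label 2 reaches 1/4, ..., 4/4, so its first three are kept. In effect the number kept per label is 0.8 times the label count, rounded down.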

dpb on 11 May 2022

Edited: dpb on 11 May 2022

Oh, if you want categorical labels, then use categorical variables -- that's what they're for...

labels=randi(3,10,1);       % dummy dataset for show...
labels=categorical(labels,[1:3],{'Good','Average','Bad'},'ordinal',1);  % convert to categorical
labels = 
  10×1 categorical array
     Bad 
     Good 
     Bad 
     Average 
     Average 
     Bad 
     Bad 
     Good 
     Bad 
     Good 
>> 

Plots are aware of categorical variables so you get the labels automagically; you may have to use

>> categories(labels)
ans =
  3×1 cell array
    {'Good'   }
    {'Average'}
    {'Bad'    }
>> 

or string or cellstr occasionally to get a string representation if you need it specifically.

But, manipulating table data as categorical instead of as string is far easier and more efficient besides.

While I showed as a standalone new variable called labels, what you really want to do is convert the actual variable to categorical and use it instead of the original...then the labels come along for free.
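
As a rough sketch of that last point, the label column can be converted in place inside a table. The table T, its variable names and the small made-up data set are assumptions for illustration; the category names are the ones from the example above.

```matlab
% Sketch: keep the labels as a categorical variable inside a table
data = [rand(12,3), repmat((1:3).',4,1)];          % made-up data, labels 1..3 in column 4
T = array2table(data,'VariableNames',{'x1','x2','x3','Label'});  % hypothetical names
% convert the actual label variable, so the names travel with the data
T.Label = categorical(T.Label,1:3,{'Good','Average','Bad'},'Ordinal',true);
n = countcats(T.Label);          % rows per category, in category order
badRows = T(T.Label=='Bad',:);   % select rows by name instead of by number
disp(table(categories(T.Label),n,'VariableNames',{'Category','Count'}))
```

Because the variable is ordinal, relational tests such as T.Label>='Average' also work.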

Jon on 11 May 2022

@dpb Thanks. I realize I need to get more familiar with categorical variables. From your example, and I think another one I saw recently, I see that they provide some powerful capabilities.

Answer 2

dpb on 10 May 2022

Direct link to this answer

https://es.mathworks.com/matlabcentral/answers/1715735-splitting-a-matrix-according-to-there-labels#answer_961110

Edited: dpb on 10 May 2022

[ix,idx]=findgroups(X(:,4));        % get grouping variable on fourth column X
for i=idx.'                         % for each group ID (must be numeric as here)
  I=find(ix==i);                    % the indices into X for the group
  N=numel(I);                       % how many in this group
  I=I(randperm(N));                 % rearrange randomly the elements of index vector
  nA=floor(0.8*N);                  % how many to pick for A (maybe round() instead???)
  iA{i}=I(1:nA);                    % the randomized selection for A
  iB{i}=I(nA+1:end);                % rest for B
end
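
The loop leaves the row indices in the cell arrays iA and iB. A minimal sketch of one way to finish the job and build A and B from them follows; the example matrix X is made up, and looping over group numbers g instead of label values is a variation for this sketch, not part of the answer.

```matlab
% Sketch: same grouping idea, then assemble A and B from the index lists
rng(1);                                  % fixed seed so the example repeats
X = [rand(1900,3), randi(3,1900,1)];     % made-up data, labels in column 4
[ix,idx] = findgroups(X(:,4));           % ix: group number per row, idx: label values
nG = numel(idx);
iA = cell(nG,1);
iB = cell(nG,1);
for g = 1:nG                             % group numbers work for any label values
    I  = find(ix==g);                    % the indices into X for the group
    I  = I(randperm(numel(I)));          % shuffle the index vector
    nA = floor(0.8*numel(I));            % how many to pick for A
    iA{g} = I(1:nA);
    iB{g} = I(nA+1:end);
end
A = X(vertcat(iA{:}),:);                 % stack the index lists and pull the rows
B = X(vertcat(iB{:}),:);
```

For the 20:80 split in the question, 0.8 would become 0.2 (or A and B would be swapped).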

5 comments (3 older comments hidden)

dpb on 10 May 2022

Edited: dpb on 10 May 2022

You've got a missing ".'" transpose operator on the for loop iterator -- it must be a row vector; passing a column vector will result in the problem that all three indices are passed at once. I could have made the code more robust by writing

for i=idx(:).'

instead, in which (:) forces a column vector and ".'" turns it into a row.
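
A quick way to see the difference is to count loop passes for a column and for a row. This is only a small demonstration; the counters nCol and nRow are made-up names.

```matlab
% for iterates over the COLUMNS of its argument
idx = [1;2;3];                 % column vector, like the second output of findgroups
nCol = 0;
for i = idx                    % one column, so the body runs once with i = [1;2;3]
    nCol = nCol + 1;
end
nRow = 0;
for i = idx(:).'               % three columns, so the body runs once per element
    nRow = nRow + 1;
end
fprintf('column: %d pass, row: %d passes\n', nCol, nRow);
```

The column version makes a single pass and the row version makes three.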

However, I see I missed an important step in the cleanup from the anonymous function version -- the line

I=randperm(N);

needs to be

I=I(randperm(N));

to rearrange the subset indices to the grouped variables; the randperm(N) call simply generates the right length of vector subscripts in a random order; still need the actual subscripts from the matching operation of finding the ones in the given group.
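
To illustrate the difference between the two lines with made-up numbers (the values in I are hypothetical row numbers of one group):

```matlab
I = [4;9;17;23];          % made-up row numbers of one group within X
N = numel(I);
p = randperm(N);          % a shuffle of 1..N, i.e. positions within I, not rows of X
shuffled = I(p);          % the same row numbers of X, in random order
disp(p)
disp(shuffled.')
```

Using p directly as row subscripts would always pick from rows 1 to 4 of X, whichever group was being processed.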

With those corrections, it should work as is...cleanest would be to copy and paste the actual code instead of retyping; then you also get indenting and comments and all... :)

I did make the above correction in the Answer code...sorry I missed that first time; glad there was another issue that you reposted so had the chance to see it! :)

NotA_Programmer on 10 May 2022

-cleanest would be to copy and paste the actual code instead of retyping; then you also get indenting and comments and all... :)

Yeah, I should have done it that way.

Thanks @dpb for your help!
