
I have a huge data set that I need for training (32000*2500). This seems to be too much for my classifier. So I decided to do some reading on dimensionality reduction and specifically into PCA.

From my understanding, PCA takes the current data and replots them on another (x,y) domain/scale. These new coordinates don't mean anything, but the data are rearranged to give one axis maximum variation. Once I have these new coordinates, I can drop the ones having minimum variation.

Now I am trying to implement this in MATLAB and am having trouble with the output provided. MATLAB always considers rows as observations and columns as variables. So my input to the pca function would be my matrix of size (32000*2500). This would return the PCA coefficients in an output matrix of size 2500*2500.

The help for pca states:

Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance.

In this output, which dimension holds the observations of my data? I mean, if I have to give this to the classifier, will the rows of coeff represent my data's observations, or is it now the columns of coeff?

And how do I remove the coefficients having the least variation, and thus effectively reduce the dimension of my data?

the cyclist
on 27 Feb 2016

Edited: the cyclist
on 25 Oct 2018

Here is some code I wrote to help myself understand the MATLAB syntax for PCA.

rng 'default'

M = 7; % Number of observations

N = 5; % Number of variables observed

% Made-up data

X = rand(M,N);

% De-mean (MATLAB will de-mean inside of PCA, but I want the de-meaned values later)

X = X - mean(X); % Use X = bsxfun(@minus,X,mean(X)) if you have an older version of MATLAB

% Do the PCA

[coeff,score,latent,~,explained] = pca(X);

% Calculate eigenvalues and eigenvectors of the covariance matrix

covarianceMatrix = cov(X);

[V,D] = eig(covarianceMatrix);

% "coeff" are the principal component vectors. These are the eigenvectors of the covariance matrix, although eig may return them in a different column order and with flipped signs. Compare ...

coeff

V

% Multiply the original data by the principal component vectors to get the projections of the original data on the

% principal component vector space. This is also the output "score". Compare ...

dataInPrincipalComponentSpace = X*coeff

score

% The columns of X*coeff are orthogonal to each other. This is shown with ...

corrcoef(dataInPrincipalComponentSpace)

% The variances of these vectors are the eigenvalues of the covariance matrix, and are also the output "latent". Compare

% these three outputs

var(dataInPrincipalComponentSpace)'

latent

sort(diag(D),'descend')

The first figure on the wikipedia page for PCA is really helpful in understanding what is going on. There is variation along the original (x,y) axes. The superimposed arrows show the principal axes. The long arrow is the axis that has the most variation; the short arrow captures the rest of the variation.

Before thinking about dimension reduction, the first step is to redefine a coordinate system (x',y'), such that x' is along the first principal component, and y' along the second component (and so on, if there are more variables).

In my code above, those new variables are dataInPrincipalComponentSpace. As in the original data, each row is an observation, and each column is a dimension.

These data are just like your original data, except it is as if you measured them in a different coordinate system -- the principal axes.

Now you can think about dimension reduction. Take a look at the variable explained. It tells you how much of the variation is captured by each column of dataInPrincipalComponentSpace. Here is where you have to make a judgement call. How much of the total variation are you willing to ignore? One guideline is that if you plot explained, there will often be an "elbow" in the plot, where each additional variable explains very little additional variation. Keep only the components that add a lot more explanatory power, and ignore the rest.

In my code, notice that the first 3 components together explain 87% of the variation; suppose you decide that that's good enough. Then, for your later analysis, you would only keep those 3 dimensions -- the first three columns of dataInPrincipalComponentSpace. You will have 7 observations in 3 dimensions (variables) instead of 5.
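As a rough sketch of that last step, the number of components can be picked from a cumulative-variance cutoff and the extra columns dropped. The 85% threshold and the names threshold, k, and Xreduced are my own choices for illustration, not anything prescribed by pca.

```matlab
rng default
X = rand(7,5);                          % same made-up data shape as above
[coeff,score,~,~,explained] = pca(X);   % pca de-means X internally

threshold = 85;                         % percent of variance to keep (arbitrary choice)
k = find(cumsum(explained) >= threshold, 1);   % fewest components that reach the threshold

Xreduced = score(:,1:k);                % all observations, but only the first k dimensions
size(Xreduced)
```

Xreduced keeps one row per observation, so it can be handed to a classifier in place of the original matrix.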

I hope that helps!

the cyclist
on 17 Apr 2019

I'd like to add one clarification to my comment above. While there is no implied ordering of the vectors V, since they are simply eigenvectors, the vectors in coeff are ordered in descending order of component variance (as stated in the documentation).

You can see that

var(dataInPrincipalComponentSpace)

has descending values.
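If you want V to line up with coeff, one possible approach is to sort the eigenvalues yourself and reorder the columns of V accordingly. Eigenvectors are only defined up to sign, so this sketch compares absolute values; the names order, Vsorted, and maxDifference are just for illustration.

```matlab
rng default
X = rand(7,5);
X = X - mean(X);                        % needs implicit expansion (R2016b or later)
coeff = pca(X);
[V,D] = eig(cov(X));

[~,order] = sort(diag(D),'descend');    % indices of the eigenvalues, largest first
Vsorted = V(:,order);                   % eigenvectors in descending order of variance

% Columns may still differ by a factor of -1, so compare magnitudes only
maxDifference = max(abs(abs(Vsorted(:)) - abs(coeff(:))))
```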

the cyclist
on 13 Sep 2019

Run my code above, and then

figure

bar(latent)

will give a bar chart of the variances, in descending order.

Alternatively,

figure

bar(explained)

will plot the percentage of variance explained by each component. Note that

100*latent./sum(latent) == explained

to within floating-point error.
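Because of that floating-point error, a literal == test may return false for some elements. A tolerance-based check is safer; the tolerance of 1e-10 below is my own arbitrary choice.

```matlab
rng default
X = rand(7,5);
[~,~,latent,~,explained] = pca(X);

percentFromLatent = 100*latent./sum(latent);             % variances rescaled to percentages
maxDifference = max(abs(percentFromLatent - explained))  % should be tiny
agree = maxDifference < 1e-10                            % true if the two agree within tolerance
```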


naghmeh moradpoor
on 1 Jul 2017

Dear Cyclist,

I used your code and was able to find all the principal components for my dataset. Thank you! On my dataset, PC1, PC2 and PC3 explained more than 90% of the variance. I would like to know how to find which variables from my dataset are related to PC1, PC2 and PC3.

Could you please help me with this? Regards, Ngh

Abdul Haleem Butt
on 3 Nov 2017


Sahil Bajaj
on 12 Feb 2019

Dear Cyclist,

Thanks a lot for your helpful explanation. I used your code and was able to find 4 principal components explaining 97% of the variance for my dataset, which initially had 14 variables in total. I was wondering how to find which variables from my dataset are related to PC1, PC2, PC3 and PC4, so that I can ignore the others and know which parameters I should use for further analysis.

Thanks !

Sahil

Yaser Khojah
on 18 Apr 2019

Is there an answer for this question?

Which variables from my dataset are related to PC1, PC2, PC3 and PC4?

Here is the explanation of each output. They all relate to the PCs, and nothing relates to the original data?

- coeff: Each column contains coefficients for one principal component, and the columns are in descending order of component variance.
- score: Rows of score correspond to observations, and columns correspond to components.
- explained: the percentage of the total variance explained by each principal component
- latent: Principal component variances, that is the eigenvalues of the covariance matrix of X, returned as a column vector.

I have used your code and I see that coeff and V do not match in order?

coeff =

-0.5173 0.7366 -0.1131 0.4106 0.0919
0.6256 0.1345 0.1202 0.6628 -0.3699
-0.3033 -0.6208 -0.1037 0.6252 0.3479
0.4829 0.1901 -0.5536 -0.0308 0.6506
0.1262 0.1334 0.8097 0.0179 0.5571

V =

0.0919 0.4106 -0.1131 -0.7366 -0.5173
-0.3699 0.6628 0.1202 -0.1345 0.6256
0.3479 0.6252 -0.1037 0.6208 -0.3033
0.6506 -0.0308 -0.5536 -0.1901 0.4829
0.5571 0.0179 0.8097 -0.1334 0.1262

However, (dataInPrincipalComponentSpace and score) and (var(dataInPrincipalComponentSpace)' and latent) do match. Does that mean the first row in latent is related to the first column in the original data? I think any new user is confused about how to relate these outputs to the original data's variables. Can you please explain? Thank you.

the cyclist
on 19 Apr 2019

Your first question

Recall that the original data is a matrix with M observations of N variables. There will also be N principal components. The relationship between the original (de-meaned) data and the nth PC is

nth PC = X*coeff(:,n) % This is pseudocode, not valid MATLAB syntax.

For example, PC1 is given by

PC1 = X*coeff(:,1)
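To see which original variables feed into a given component, you can look at the entries of the corresponding column of coeff: each entry is the weight of one original variable. The sketch below ranks the variables by the absolute size of their weight in PC1. The names loadingsPC1 and ranking are mine. Keep in mind that every component is in general a mix of all the variables, and the weights are only comparable across variables if the variables are on similar scales.

```matlab
rng default
X = rand(7,5);
X = X - mean(X);                        % de-mean, as in the code above
[coeff,score] = pca(X);

loadingsPC1 = coeff(:,1);               % one weight per original variable
[~,ranking] = sort(abs(loadingsPC1),'descend');   % variable indices, largest |weight| first

PC1 = X*loadingsPC1;                    % linear combination of the original variables
maxDifference = max(abs(PC1 - score(:,1)))   % PC1 matches the first column of score
```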

You can recover the original data from the principal components by

dataInPrincipalComponentSpace * coeff'
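Note that this product gives back the de-meaned data; to get the raw values you also need to add the column means back. Here is a small sketch of both the exact reconstruction and an approximate one that keeps only the first k components. The choice k = 3 and the variable names are assumptions for illustration.

```matlab
rng default
Xraw = rand(7,5);
mu = mean(Xraw);                        % column means that pca removes internally
[coeff,score] = pca(Xraw);

% Exact reconstruction: all components, plus the means
Xrecovered = score*coeff' + mu;         % needs implicit expansion (R2016b or later)
fullError = max(abs(Xrecovered(:) - Xraw(:)))

% Approximate reconstruction from the first k components only
k = 3;
Xapprox = score(:,1:k)*coeff(:,1:k)' + mu;
approxError = norm(Xapprox - Xraw,'fro')
```

fullError should be at the level of rounding error, while approxError reflects the variation in the components that were dropped.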

Your second question

The first row of latent is not related to the first column of the original data. It is related to the first principal component (which you can see is a linear combination of the original data).


As Has
on 18 Sep 2019

I still do not understand.

I need an answer to my question: how many eigenvectors do I have to use,

based on these figures?

the cyclist
on 19 Sep 2019

It is not a simple answer. The first value of the explained variable is about 30. That means that the first principal component explains about 30% of the total variance of all your variables. The next value of explained is 14. So, together, the first two components explain about 44% of the total variation. Is that enough? It depends on what you are trying to do. It is difficult to give generic advice on this point.

You can plot the values of explained or latent, to see how the explained variance is captured as you add each additional component. See, for example, the wikipedia article on scree plots.
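For instance, a cumulative version of that plot makes it easy to read off how many components are needed to reach a given share of the variance. This is only a sketch on made-up data; the variable name cumulativeExplained is mine.

```matlab
rng default
X = rand(7,5);                          % made-up data; substitute your own matrix
[~,~,~,~,explained] = pca(X);

cumulativeExplained = cumsum(explained);   % running total of percent variance explained

figure
plot(1:numel(explained), cumulativeExplained, '-o')
xlabel('Number of components')
ylabel('Cumulative variance explained (%)')
ylim([0 100])
```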

As Has
on 19 Sep 2019

If we say that the first two components, which explain about 44%, are enough for me, what does this mean for latent and coeff? How can this lead me to the number of eigenvectors?

Thanks for taking the time to reply. I appreciate it.

the cyclist
on 20 Sep 2019

It means that the first two columns of coeff are the eigenvectors you want to use.
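In code, that could look something like the sketch below. It also shows one way to apply the same two eigenvectors to new observations, by centering them with the mean of the original data (the sixth output of pca). Xtrain, Xnew, and W are made-up names and data for illustration.

```matlab
rng default
Xtrain = rand(7,5);                     % made-up data used to compute the PCA
Xnew = rand(3,5);                       % hypothetical new observations, same 5 variables

[coeff,score,~,~,~,mu] = pca(Xtrain);   % mu holds the column means of Xtrain
k = 2;
W = coeff(:,1:k);                       % the first two eigenvectors

trainReduced = score(:,1:k);            % same as (Xtrain - mu)*W
newReduced = (Xnew - mu)*W;             % new data projected onto the same two axes
```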
