Identifying Significant Features and Classifying Protein Profiles

Open Script

This example shows how to classify mass spectrometry data and use some statistical tools to look for potential disease markers and proteomic pattern diagnostics.

Introduction

Serum proteomic pattern diagnostics can be used to differentiate samples from patients with and without disease. Profile patterns are generated using surface-enhanced laser desorption and ionization (SELDI) protein mass spectrometry. This technology has the potential to improve clinical diagnostics tests for cancer pathologies. The goal is to select a reduced set of measurements or "features" that can be used to distinguish between cancer and control patients. These features will be ion intensity levels at specific mass/charge values.

Preprocess Data

The ovarian cancer data set in this example is from the FDA-NCI Clinical Proteomics Program Databank. The data set was generated using the WCX2 protein array. The data set includes 95 controls and 121 ovarian cancers. For a detailed description of this data set, see [1] and [4].

This example assumes that you already have the preprocessed data OvarianCancerQAQCdataset.mat. However, if you do not have the data file, you can recreate by following the steps in the example Batch Processing of Spectra Using Sequential and Parallel Computing.

Alternatively, you can run the provided script msseqprocessing.m.

The preprocessing steps from the script and example listed above are intended to illustrate a representative set of possible pre-processing procedures. Using different steps or parameters may lead to different and possibly improved results of this example.

Load Data

Once you have the preprocessed data, you can load it into MATLAB.

load OvarianCancerQAQCdataset
whos

  Name          Size                Bytes  Class     Attributes

  MZ        15000x1                120000  double              
  Y         15000x216            25920000  double              
  grp         216x1                 25056  cell

There are three variables: MZ, Y, grp. MZ is the mass/charge vector, Y is the intensity values for all 216 patients (control and cancer), and grp holds the index information as to which of these samples represent cancer patients and which ones represent normal patients.

Initialize some variables that will be used through out the example.

N = numel(grp);                         % Number of samples
Cidx = strcmp('Cancer',grp);            % Logical index vector for Cancer samples
Nidx = strcmp('Normal',grp);            % Logical index vector for Normal samples
Cvec = find(Cidx);                      % Index vector for Cancer samples
Nvec = find(Nidx);                      % Index vector for Normal samples
xAxisLabel = 'Mass/Charge (M/Z)';       % x label for plots
yAxisLabel = 'Ion Intensity';           % y label for plots

Visualizing Some of the Samples

You can plot some data sets into a figure window to visually compare profiles from the two groups; in this example five spectrograms from cancer patients (blue) and five from control patients (green) are displayed.

figure; hold on;
hC = plot(MZ,Y(:,Cvec(1:5)),'b');
hN = plot(MZ,Y(:,Nvec(1:5)),'g');
xlabel(xAxisLabel); ylabel(yAxisLabel);
axis([2000 12000 -5 60])
legend([hN(1),hC(1)],{'Control','Ovarian Cancer'})
title('Multiple Sample Spectrograms')

Zooming in on the region from 8500 to 8700 M/Z shows some peaks that might be useful for classifying the data.

axis([8450,8700,-1,7])

Another way to visualize the whole data set is to look at the group average signal for the control and cancer samples. You can plot the group average and the envelopes of each group.

mean_N = mean(Y(:,Nidx),2);  % group average for control samples
max_N = max(Y(:,Nidx),[],2); % top envelopes of the control samples
min_N = min(Y(:,Nidx),[],2); % bottom envelopes of the control samples
mean_C = mean(Y(:,Cidx),2);  % group average for cancer samples
max_C = max(Y(:,Cidx),[],2); % top envelopes of the control samples
min_C = min(Y(:,Cidx),[],2); % bottom envelopes of the control samples

figure; hold on;
hC = plot(MZ,mean_C,'b');
hN = plot(MZ,mean_N,'g');
gC = plot(MZ,[max_C min_C],'b--');
gN = plot(MZ,[max_N min_N],'g--');
xlabel(xAxisLabel); ylabel(yAxisLabel);
axis([8450,8700,-1,7])
legend([hN,hC,gN(1),gC(1)],{'Control Group Avg.','Ovarian Cancer Group Avg',...
                            'Control Envelope','Ovarian Cancer Envelope'},...
                            'Location', 'NorthWest')
title('Group Average and Group Envelopes')

Observe that apparently there is no single feature that can discriminate both groups perfectly.

Ranking Key Features

A simple approach for finding significant features is to assume that each M/Z value is independent and compute a two-way t-test. rankfeatures returns an index to the most significant M/Z values, for instance 100 indices ranked by the absolute value of the test statistic. This feature selection method is also known as a filtering method, where the learning algorithm is not involved on how the features are selected.

[feat,stat] = rankfeatures(Y,grp,'CRITERION','ttest','NUMBER',100);

The first output of rankfeatures can be used to extract the M/Z values of the significant features.

sig_Masses = MZ(feat);
sig_Masses(1:7)' %display the first seven

ans =

   1.0e+03 *

    8.1009    8.1016    8.1024    8.1001    8.1032    7.7366    7.7359

The second output of rankfeatures is a vector with the absolute value of the test statistic. You can plot it over the spectra using yyaxis.

figure;

yyaxis left
plot(MZ, [mean_N mean_C]);
ylim([-1,20])
xlim([7950,8300])
title('Significant M/Z Values')
xlabel(xAxisLabel);
ylabel(yAxisLabel);

yyaxis right
plot(MZ,stat);
ylim([-1,22])
ylabel('Test Statistic');

legend({'Control Group Avg.','Ovarian Cancer Group Avg.', 'Test Statistics'})

Notice that there are significant regions at high M/Z values but low intensity (~8100 Da.). Other approaches to measure class separability are available in rankfeatures, such as entropy based, Bhattacharyya, or the area under the empirical receiver operating characteristic (ROC) curve.

Blind Classification Using Linear Discriminant Analysis (LDA)

Now that you have identified some significant features, you can use this information to classify the cancer and normal samples. Due to the small number of samples, you can run a cross-validation using the 20% holdout to have a better estimation of the classifier performance. cvpartition allows you to set the training and test indices for different types of system evaluation methods, such as hold-out, K-fold and Leave-M-Out.

per_eval = 0.20;          % training size for cross-validation
rng('default');             % initialize random generator to the same state
                          % used to generate the published example
cv  = cvpartition(grp,'holdout',per_eval)

cv = 

Hold-out cross validation partition
   NumObservations: 216
       NumTestSets: 1
         TrainSize: 173
          TestSize: 43
          IsCustom: 0

Observe that features are selected only from the training subset and the validation is performed with the test subset. classperf allows you to keep track of multiple validations.

cp_lda1 = classperf(grp); % initializes the CP object
for k=1:10 % run cross-validation 10 times
    cv = repartition(cv);
    feat = rankfeatures(Y(:,training(cv)),grp(training(cv)),'NUMBER',100);
    c = classify(Y(feat,test(cv))',Y(feat,training(cv))',grp(training(cv)));
    classperf(cp_lda1,c,test(cv)); % updates the CP object with current validation
end

After the loop you can assess the performance of the overall blind classification using any of the properties in the CP object, such as the error rate, sensitivity, specificity, and others.

cp_lda1

cp_lda1 = 

  classperformance with properties:

                  ClassLabels: {2×1 cell}
                  GroundTruth: [216×1 double]
         NumberOfObservations: 216
            ValidationCounter: 10
           SampleDistribution: [216×1 double]
            ErrorDistribution: [216×1 double]
    SampleDistributionByClass: [2×1 double]
     ErrorDistributionByClass: [2×1 double]
               CountingMatrix: [3×2 double]
                  CorrectRate: 0.8488
                    ErrorRate: 0.1512
              LastCorrectRate: 0.8837
                LastErrorRate: 0.1163
             InconclusiveRate: 0
               ClassifiedRate: 1
                  Sensitivity: 0.8208
                  Specificity: 0.8842
      PositivePredictiveValue: 0.8995
      NegativePredictiveValue: 0.7962
           PositiveLikelihood: 7.0890
           NegativeLikelihood: 0.2026
                   Prevalence: 0.5581
              DiagnosticTable: [2×2 double]
                        Label: ''
                  Description: ''
               ControlClasses: 2
                TargetClasses: 1

This naive approach for feature selection can be improved by eliminating some features based on the regional information. For example, 'NWEIGHT' in rankfeatures outweighs the test statistic of neighboring M/Z features such that other significant M/Z values can be incorporated into the subset of selected features

cp_lda2 = classperf(grp); % initializes the CP object
for k=1:10 % run cross-validation 10 times
    cv = repartition(cv);
    feat = rankfeatures(Y(:,training(cv)),grp(training(cv)),'NUMBER',100,'NWEIGHT',5);
    c = classify(Y(feat,test(cv))',Y(feat,training(cv))',grp(training(cv)));
    classperf(cp_lda2,c,test(cv)); % updates the CP object with current validation
end
cp_lda2.CorrectRate % average correct classification rate

ans =

    0.9023

PCA/LDA Reduction of the Data Dimensionality

Lilien et al. presented in [2] an algorithm to reduce the data dimensionality that uses principal component analysis (PCA), then LDA is used to classify the groups. In this example 2000 of the most significant features in the M/Z space are mapped to the 150 principal components

cp_pcalda = classperf(grp); % initializes the CP object
for k=1:10 % run cross-validation 10 times
    cv = repartition(cv);
    % select the 2000 most significant features.
    feat = rankfeatures(Y(:,training(cv)),grp(training(cv)),'NUMBER',2000);
    % PCA to reduce dimensionality
    P = pca(Y(feat,training(cv))');
    % Project into PCA space
    x = Y(feat,:)' * P(:,1:150);
    % Use LDA
    c = classify(x(test(cv),:),x(training(cv),:),grp(training(cv)));
    classperf(cp_pcalda,c,test(cv));
end
cp_pcalda.CorrectRate % average correct classification rate

ans =

    0.9814

Randomized Search for Subset Feature Selection

Feature selection can also be reinforced by classification, this approach is usually referred to as a wrapper selection method. Randomized search for feature selection generates random subsets of features and assesses their quality independently with the learning algorithm. Later, it selects a pool of the most frequent good features. Li et al. in [3] apply this concept to the analysis of protein expression patterns. The randfeatures function allows you to search a subset of features using LDA or a k-nearest neighbor classifier over randomized subsets of features.

Note: the following example is computationally intensive, so it has been disabled from the example. Also, for better results you should increase the pool size and the stringency of the classifier from the default values in randfeatures. Type help randfeatures for more information.

if 0  % <== change to 1 to enable. This may use extensive time to complete.
   cv = repartition(cv);
   [feat,fCount] = randfeatures(Y(:,training(cv)),grp(training(cv)),...
                         'CLASSIFIER','da','PerformanceThreshold',0.90);
else
   load randFeatCancerDetect
end

Assess the Quality of the Selected Features with the Evaluation Set

The first output from randfeatures is an ordered list of indices of MZ values. The first item occurs most frequently in the subsets where good classification was achieved. The second output is the actual counts of the number of times each value was selected. You can use hist to look at this distribution.

figure;
hist(fCount,max(fCount)+1);

You will see that most values appear at most once in a selected subset. Zooming in gives a better idea of the details for the more frequently selected values.

axis([0 80 0 100])

Only a few values were selected more than 10 times. You can visualize these by using a stem plot to show the most frequently selected features.

figure; hold on;
sigFeats = fCount;
sigFeats(sigFeats<=10) = 0;
plot(MZ,[mean_N mean_C]);
stem(MZ(sigFeats>0),sigFeats(sigFeats>0),'r');
axis([2000,12000,-1,80])
legend({'Control Group Avg.','Ovarian Cancer Group Avg.','Significant Features'}, ...
    'Location', 'NorthWest')
xlabel(xAxisLabel); ylabel(yAxisLabel);

These features appear to clump together in several groups. You can investigate further how many of the features are significant by running the following experiment. The most frequently selected feature is used to classify the data, then the two most frequently selected features are used and so on until all the features that were selected more than 10 times are used. You can then see if adding more features improves the classifier.

nSig = sum(fCount>10);
cp_rndfeat = zeros(20,nSig);
for i = 1:nSig
    for j = 1:20
        cv = repartition(cv);
        P = pca(Y(feat(1:i),training(cv))');
        x = Y(feat(1:i),:)' * P;
        c = classify(x(test(cv),:),x(training(cv),:),grp(training(cv)));
        cp = classperf(grp,c,test(cv));
        cp_rndfeat(j,i) = cp.CorrectRate; % average correct classification rate
    end
end
figure
plot(1:nSig, [max(cp_rndfeat);mean(cp_rndfeat)]);
legend({'Best CorrectRate','Mean CorrectRate'},'Location', 'SouthEast')

From this graph you can see that for as few as three features it is sometimes possible to get perfect classification. You will also notice that the maximum of the mean correct rate occurs for a small number of features and then gradually decreases.

[bestAverageCR, bestNumFeatures] = max(mean(cp_rndfeat));

You can now visualize the features that give the best average classification. You can see that these actually correspond to only three peaks in the data.

figure; hold on;
sigFeats = fCount;
sigFeats(sigFeats<=10) = 0;
ax_handle = plot(MZ,[mean_N mean_C]);
stem(MZ(feat(1:bestNumFeatures)),sigFeats(feat(1:bestNumFeatures)),'r');
axis([7650,8850,-1,80])
legend({'Control Group Avg.','Ovarian Cancer Group Avg.','Significant Features'})
xlabel(xAxisLabel); ylabel(yAxisLabel);

Alternative Statistical Learning Algorithms

There are many classification tools in MATLAB® that you can also use to analyze proteomic data. Among them are support vector machines (fitcsvm), k-nearest neighbors (fitcknn), neural networks (Deep Learning Toolbox™), and classification trees (fitctree). For feature selection, you can also use sequential subset feature selection (sequentialfs) or optimize the randomized search methods by using a genetic algorithm (Global Optimization Toolbox). For example, see Genetic Algorithm Search for Features in Mass Spectrometry Data.

References

[1] Conrads, T P, V A Fusaro, S Ross, D Johann, V Rajapakse, B A Hitt, S M Steinberg, et al. “High-Resolution Serum Proteomic Features for Ovarian Cancer Detection.” Endocrine-Related Cancer, June 2004, 163–78.

[2] Lilien, Ryan H., Hany Farid, and Bruce R. Donald. “Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum.” Journal of Computational Biology 10, no. 6 (December 2003): 925–46.

[3] Li, L., D. M. Umbach, P. Terry, and J. A. Taylor. “Application of the GA/KNN Method to SELDI Proteomics Data.” Bioinformatics 20, no. 10 (July 1, 2004): 1638–40.

[4] Petricoin, Emanuel F, Ali M Ardekani, Ben A Hitt, Peter J Levine, Vincent A Fusaro, Seth M Steinberg, Gordon B Mills, et al. “Use of Proteomic Patterns in Serum to Identify Ovarian Cancer.” The Lancet 359, no. 9306 (February 2002): 572–77.