cvpartition

Class: cvpartition

Create cross-validation partition for data

Syntax

c = cvpartition(n,'KFold',k)
c = cvpartition(n,'HoldOut',p)
c = cvpartition(group,'KFold',k)
c = cvpartition(group,'KFold',k,'Stratify',stratifyOption)
c = cvpartition(group,'HoldOut',p)
c = cvpartition(group,'HoldOut',p,'Stratify',stratifyOption)
c = cvpartition(n,'LeaveOut')
c = cvpartition(n,'resubstitution')

Description

c = cvpartition(n,'KFold',k) constructs an object c of the cvpartition class defining a random nonstratified partition for k-fold cross-validation on n observations. The partition divides the observations into k disjoint subsamples (or folds), chosen randomly but with roughly equal size. The default value of k is 10.
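
As a minimal sketch of how such a partition might be used, the following code creates a 5-fold partition and retrieves the logical index vectors for the first fold with the training and test methods. The values n = 100 and k = 5 and the call to rng are assumptions chosen only for illustration.

```matlab
rng(1);                         % fix the seed so the partition is repeatable (illustrative choice)
n = 100;                        % number of observations (assumed for this sketch)
c = cvpartition(n,'KFold',5);   % random nonstratified 5-fold partition

trainIdx = training(c,1);       % n-by-1 logical vector, true for training observations of fold 1
testIdx  = test(c,1);           % n-by-1 logical vector, true for test observations of fold 1

% Within a fold, every observation belongs to exactly one of the two sets.
assert(all(trainIdx ~= testIdx))

% Across the k folds, each observation appears in a test set exactly once.
timesTested = zeros(n,1);
for i = 1:c.NumTestSets
    timesTested = timesTested + test(c,i);
end
assert(all(timesTested == 1))
```

The two assertions check the "disjoint subsamples" property described above: the folds do not overlap, and together they cover all n observations.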

c = cvpartition(n,'HoldOut',p) creates a random nonstratified partition for holdout validation on n observations. This partition divides the observations into a training set and a test (or holdout) set. The parameter p must be a scalar. When 0 < p < 1, cvpartition randomly selects approximately p*n observations for the test set. When p is an integer, cvpartition randomly selects p observations for the test set. The default value of p is 1/10.
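
The two interpretations of p can be compared side by side. This is a sketch under assumed values (n = 200, p = 0.25, and p = 30), not output taken from this page.

```matlab
rng(1);                                  % illustrative seed
n = 200;                                 % number of observations (assumed)

cFrac  = cvpartition(n,'HoldOut',0.25);  % p is a fraction: about 0.25*n observations held out
cCount = cvpartition(n,'HoldOut',30);    % p is an integer: 30 observations held out

disp(cFrac.TestSize)                     % approximately 0.25*n
disp(cCount.TestSize)                    % 30
disp(cCount.TrainSize)                   % n - 30 = 170
```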

c = cvpartition(group,'KFold',k) creates a random partition for a stratified k-fold cross-validation. group is a numeric vector, categorical array, character array, string array, or cell array of character vectors indicating the class of each observation. Each subsample has roughly equal size and roughly the same class proportions as in group.

When you supply group as the first input argument to cvpartition, the function creates cross-validation partitions that do not include rows of observations corresponding to missing values in group.

c = cvpartition(group,'KFold',k,'Stratify',stratifyOption) returns an object c defining a random partition for k-fold cross-validation. When you supply group as the first input argument to cvpartition, then the function implements stratification by default. If you also specify 'Stratify',false, then the function creates nonstratified random partitions.

You can specify 'Stratify',true only if the first input argument to cvpartition is group.

c = cvpartition(group,'HoldOut',p) randomly partitions observations into a training set and a holdout (or test) set with stratification, using the class information in group. Both the training and test sets have roughly the same class proportions as in group.

c = cvpartition(group,'HoldOut',p,'Stratify',stratifyOption) returns an object c defining a random partition into a training set and a holdout (or test) set. When you supply group as the first input argument to cvpartition, then the function implements stratification by default. If you also specify 'Stratify',false, then the function creates nonstratified random partitions.

c = cvpartition(n,'LeaveOut') creates a random partition for leave-one-out cross-validation on n observations. Leave-one-out is a special case of 'KFold' in which the number of folds equals the number of observations.
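
A short check of the relationship to 'KFold', assuming a small illustrative value n = 8: the partition should contain n test sets, each holding a single observation.

```matlab
n = 8;                            % number of observations (assumed for this sketch)
c = cvpartition(n,'LeaveOut');    % leave-one-out partition

assert(c.NumTestSets == n)        % one fold per observation
assert(all(c.TestSize == 1))      % each test set holds one observation
assert(all(c.TrainSize == n-1))   % the remaining n-1 observations form the training set
```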

c = cvpartition(n,'resubstitution') creates an object c that does not partition the data. Both the training set and the test set contain all of the original n observations.
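
The following sketch, with an assumed n = 6, verifies that the training and test index vectors of a resubstitution partition both select every observation.

```matlab
n = 6;                                 % number of observations (assumed for this sketch)
c = cvpartition(n,'resubstitution');   % no actual partitioning of the data

trainIdx = training(c);                % logical index vector for the training set
testIdx  = test(c);                    % logical index vector for the test set

assert(all(trainIdx) && all(testIdx))  % both sets contain all n observations
```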

Examples

Use stratified 10-fold cross-validation to compute misclassification rate.

Load Fisher's iris data set.

load fisheriris;
y = species;
X = meas;

Create a random partition for a stratified 10-fold cross-validation.

c = cvpartition(y,'KFold',10);

Create a function that computes the number of misclassified test samples.

fun = @(xTrain,yTrain,xTest,yTest)(sum(~strcmp(yTest,...
    classify(xTest,xTrain,yTrain)))); 

Return the estimated misclassification rate using cross-validation.

rate = sum(crossval(fun,X,y,'partition',c))...
           /sum(c.TestSize)
rate = 0.0200
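
For comparison, the same kind of estimate can be computed with an explicit loop over the folds by using the training and test methods of the partition. This is only a sketch of an alternative to crossval: the variable names trIdx, teIdx, predicted, and numMisclassified and the call to rng are introduced here for illustration, and the resulting rate depends on the random partition.

```matlab
load fisheriris
y = species;
X = meas;

rng(1);                                    % illustrative seed for a repeatable partition
c = cvpartition(y,'KFold',10);             % stratified 10-fold partition

numMisclassified = zeros(c.NumTestSets,1); % misclassified test samples per fold
for i = 1:c.NumTestSets
    trIdx = training(c,i);                 % logical index of training rows for fold i
    teIdx = test(c,i);                     % logical index of test rows for fold i
    % Classify the test rows with a discriminant trained on the training rows.
    predicted = classify(X(teIdx,:),X(trIdx,:),y(trIdx));
    numMisclassified(i) = sum(~strcmp(predicted,y(teIdx)));
end

rate = sum(numMisclassified)/sum(c.TestSize)
```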

Find the proportion of each class in a 5-fold nonstratified partition of the fisheriris data.

Load Fisher's iris data set.

load fisheriris;

Find the number of instances of each class in the data.

[C,~,idx] = unique(species);
C % Unique classes
C = 3x1 cell array
    {'setosa'    }
    {'versicolor'}
    {'virginica' }

n = accumarray(idx(:),1) % Number of instances for each class in species
n = 3×1

    50
    50
    50

The three classes occur in equal proportion.

Create a random, nonstratified 5-fold partition.

cv = cvpartition(species,'KFold',5,'Stratify',false) 
cv = 
K-fold cross validation partition
   NumObservations: 150
       NumTestSets: 5
         TrainSize: 120  120  120  120  120
          TestSize: 30  30  30  30  30

Show that the three classes do not necessarily occur in equal proportion in each fold of the data set.

for i = 1:cv.NumTestSets
    disp(['Fold ',num2str(i)])
    testClasses = species(cv.test(i));
    [C,~,idx] = unique(testClasses);
    C % Unique classes
    nCount = accumarray(idx(:),1) % Number of instances for each class in a fold
end
Fold 1
C = 3x1 cell array
    {'setosa'    }
    {'versicolor'}
    {'virginica' }

nCount = 3×1

     8
    13
     9

Fold 2
C = 3x1 cell array
    {'setosa'    }
    {'versicolor'}
    {'virginica' }

nCount = 3×1

    10
    11
     9

Fold 3
C = 3x1 cell array
    {'setosa'    }
    {'versicolor'}
    {'virginica' }

nCount = 3×1

    10
     8
    12

Fold 4
C = 3x1 cell array
    {'setosa'    }
    {'versicolor'}
    {'virginica' }

nCount = 3×1

    12
     8
    10

Fold 5
C = 3x1 cell array
    {'setosa'    }
    {'versicolor'}
    {'virginica' }

nCount = 3×1

    10
    10
    10

Because cv is a random nonstratified partition of the fisheriris data, the class proportions in each of the five folds are not guaranteed to equal the class proportions in species. That is, the classes do not necessarily occur equally in each fold, as they do in species. Here, only fold 5 happens to contain 10 instances of each class. Because cvpartition selects the folds at random, your number of instances for each class in a fold can vary from those shown.

Compare the number of instances for each class in a nonstratified holdout set with a stratified holdout set of a tall array.

Create a numeric vector of two classes, where class 1 and class 2 occur in the ratio 1:10.

group = [ones(20,1);2*ones(200,1)]
group = 220×1

     1
     1
     1
     1
     1
     1
     1
     1
     1
     1
      ⋮

Create a tall array from group.

tgroup = tall(group)
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.

tgroup =

  220x1 tall double column vector

     1
     1
     1
     1
     1
     1
     1
     1
     :
     :

Holdout is the only cvpartition option that is supported for tall arrays. Create a random, nonstratified holdout partition.

CV0 = cvpartition(tgroup,'Holdout',1/4,'Stratify',false)  
CV0 = 
Hold-out cross validation partition
   NumObservations: [1x1 tall]
       NumTestSets: 1
         TrainSize: [1x1 tall]
          TestSize: [1x1 tall]

Return the result of CV0.test to memory by using the gather function.

testIdx0 = gather(CV0.test);
Evaluating tall expression using the Parallel Pool 'local':
Evaluation completed in 0.29 sec

Find the number of times each class occurs in the holdout set.

accumarray(group(testIdx0),1) % Number of instances for each class in the holdout set
ans = 2×1

     3
    52

cvpartition produces randomness in the results, so your number of instances for each class can vary from those shown.

Because CV0 is a nonstratified partition, class 1 and class 2 in the holdout set are not guaranteed to occur in the same ratio that they do in tgroup. However, because of the inherent randomness in cvpartition, you can sometimes obtain a holdout set for which the classes occur in the same ratio that they do in tgroup even though you specify 'Stratify',false. A similar result can be illustrated for the training set.

Return the result of CV0.training to memory.

trainIdx0 = gather(CV0.training);
Evaluating tall expression using the Parallel Pool 'local':
Evaluation completed in 0.075 sec

Find the number of times each class occurs in the training set.

accumarray(group(trainIdx0),1) % Number of instances for each class in the training set
ans = 2×1

    17
   148

The classes in the nonstratified training set are not guaranteed to occur in the same ratio that they do in tgroup.

Create a random, stratified holdout partition.

CV1 = cvpartition(tgroup,'Holdout',1/4)  
CV1 = 
Hold-out cross validation partition
   NumObservations: [1x1 tall]
       NumTestSets: 1
         TrainSize: [1x1 tall]
          TestSize: [1x1 tall]

Return the result of CV1.test to memory.

testIdx1 = gather(CV1.test);
Evaluating tall expression using the Parallel Pool 'local':
Evaluation completed in 0.091 sec

Find the number of times each class occurs in the holdout set.

accumarray(group(testIdx1),1) % Number of instances for each class in the holdout set
ans = 2×1

     5
    50

In the case of the stratified holdout partition, the class ratio in the holdout set and the class ratio in tgroup are the same (1:10).
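
The same comparison can be sketched for an in-memory vector, without tall arrays or a parallel pool. The variable names cvNonstrat, cvStrat, countsNonstrat, and countsStrat and the call to rng are assumptions made for this illustration. The counts depend on the random partition, so no output is shown.

```matlab
rng(1);                                   % illustrative seed
group = [ones(20,1);2*ones(200,1)];       % class 1 and class 2 in the ratio 1:10

cvNonstrat = cvpartition(group,'HoldOut',1/4,'Stratify',false);  % nonstratified
cvStrat    = cvpartition(group,'HoldOut',1/4);                   % stratified by default

% Class counts in each holdout set (row 1: class 1, row 2: class 2)
countsNonstrat = accumarray(group(test(cvNonstrat)),1)
countsStrat    = accumarray(group(test(cvStrat)),1)
```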

Algorithms

  • If you supply group as the first input argument to cvpartition, the function creates cross-validation partitions that do not include rows of observations corresponding to missing values in group.

  • When you supply group as the first input argument to cvpartition, then the function implements stratification by default. You can specify 'Stratify',false to create nonstratified random partitions.

  • You can specify 'Stratify',true only if the first input argument to cvpartition is group.

Introduced in R2008a