subset

Create new ensemble datastore from subset of existing ensemble datastore

Since R2021a

Description

The subset function allows you to extract a representative ensemble data set from a large ensemble datastore.

Use subset when your source data is too large to process and extract features from easily, or when you want to import a manageable portion of your data into Diagnostic Feature Designer and experiment with it.

subset provides the following options that you can combine for creating the reduced data set:

  • By index — Specify an index vector to extract the specific ensemble members you want.

  • By number of members in class or ensemble — Specify the number of members to select from each condition class or from the entire ensemble. You can also specify the number of members based on the size of the smallest or largest class. This option allows you to not only reduce the size of the ensemble, but to balance the classes in the ensemble for more effective model development.

  • By order — Specify the order in which members are selected, such as from the start of the original data or randomly.

  • By holdout — Partition selected data into training and test ensembles.

Specify Subset by Index

sens = subset(ens,idx) creates a new ensemble datastore sens from a subset of the existing ensemble datastore ens by extracting the ensemble members that correspond to the indices in idx.

Use this syntax when you want to perform ensemble operations on a specific ensemble member or group of ensemble members. For example, you can use this syntax to:

  • Extract only ensemble members with a specific fault condition.

  • Extract a single ensemble member with specific characteristics to isolate and explore member behavior.

Specify which members you want to extract using the index vector idx. You can then operate on your extracted ensemble using the same techniques that you use for any data ensemble.
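
As a minimal sketch of how such an index vector might be built, the following base MATLAB code derives both a logical and a numeric index vector from a label vector. The variable faultCode and its values are hypothetical and stand in for whatever condition information you hold for each member; the code does not call subset itself.

```matlab
% Hypothetical condition labels, one per ensemble member (illustrative only).
faultCode = [0; 0; 1; 3; 1; 0; 3; 1];

% Logical index vector: one element per member, true for members to keep.
idxLogical = (faultCode == 1);

% Equivalent numeric index vector of positive integers.
idxNumeric = find(idxLogical);

% Index vector for a single member of interest.
idxSingle = 4;

% Given an ensemble ens whose member order matches faultCode, any of these
% vectors could then be passed as idx, for example:
%   sens = subset(ens,idxNumeric);
disp(idxNumeric')
```

The logical form must have one element per ensemble member, while the numeric form lists only the members to keep.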

Specify Subset by Class

sens = subset(ens,ConditionVariable=cvName,NumMembers=numMembers) creates a subset that contains numMembers members in each class.

sens = subset(ens,ConditionVariable=cvName,ImbalancedClass="smallest") creates a balanced subset by reducing the size of all classes to the size of the smallest class.
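
The balancing idea can be sketched in base MATLAB as follows. This is an illustration of the concept under stated assumptions, not the implementation that subset uses: the labels vector is hypothetical, and the sketch keeps the first members of each class.

```matlab
% Hypothetical class labels, one per member (illustrative only).
labels = [0 0 0 0 1 1 2 2 2]';

% Count the members in each class.
[classes,~,classIdx] = unique(labels);
counts = accumarray(classIdx,1);
target = min(counts);            % size of the smallest class

% Keep the first `target` members of every class.
keep = false(size(labels));
for k = 1:numel(classes)
    members = find(classIdx == k);
    keep(members(1:target)) = true;
end

balancedIdx = find(keep);                        % indices of retained members
balancedCounts = accumarray(classIdx(keep),1);   % members per class afterward
```

After balancing, every class holds the same number of members as the smallest original class.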

sens = subset(ens,ConditionVariable=cvName,ImbalancedClass="largest",SampleSize=sampleSize) reduces the size of only the largest class. sampleSize specifies the reduced size of the largest class either as a fraction or as a number of members. This syntax is particularly useful when you have much more data representing healthy equipment than faulty equipment.

sens = subset(___,SelectionOrder=selectionOrder) specifies which members subset retains when reducing ensemble size. You can use this syntax with any of the input-argument combinations in the Specify Subset by Class syntax group.

Specify Subset for Unlabeled Data

sens = subset(ens,NumMembers=numMembers) extracts a subset that contains numMembers members. Use this syntax when the data contains no labels, or classes, that can be used as condition values, or when you want to operate on the ensemble as a whole without considering class distribution.

sens = subset(ens,NumMembers=numMembers,SelectionOrder=selectionOrder) specifies which members subset retains when reducing the size of the ensemble.

Return Subset and Remainder Ensembles and Indices

[sens,sensidx,remens,remidx] = subset(___) also returns the indices sensidx that extract the subset from ens, as well as the remainder ensemble remens and the remainder ensemble indices remidx. Use this syntax with any of the input-argument combinations in the preceding syntaxes.
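
The relationship between the subset indices and the remainder indices can be illustrated with plain index arithmetic. The values of n and sensidx below are hypothetical, and the sketch only mirrors the described relationship rather than calling subset.

```matlab
n = 8;                               % hypothetical number of members in ens
sensidx = [2; 5; 6];                 % hypothetical subset indices

% The remainder consists of the members not selected for the subset.
remidx = setdiff((1:n)',sensidx);

% Together the two index vectors cover every member exactly once.
allidx = sort([sensidx; remidx]);
overlap = intersect(sensidx,remidx);
```

Because the two vectors are complementary, you can recover either one from the other and the total member count.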

Partition Ensemble into Training and Test Sets

[trainsub,trainidx,testsub,testidx] = subset(ens,Holdout=holdout) uses the function cvpartition to create a random partition for holdout validation. holdout can be a positive integer or a fraction in the range (0,1). When holdout is an integer, cvpartition randomly selects holdout observations for the test set. When holdout is a value in the range (0,1), cvpartition randomly selects approximately holdout*n observations for the test set, where n is the number of members in ens.

This syntax returns the training set and indices in trainsub and trainidx, respectively, and the test set and indices in testsub and testidx.

[trainsub,trainidx,testsub,testidx] = subset(ens,Holdout=holdout,ConditionVariable=cvName) creates the training and test sets for each class in cvName.

[trainsub,trainidx,testsub,testidx] = subset(ens,Holdout=holdout,ConditionVariable=cvName,Stratify=stratifyFlag) sets the cvpartition stratification flag to the logical value in stratifyFlag.
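
To show what a class-aware holdout split means, the following base MATLAB sketch holds out the same fraction of members from every class. It is a conceptual illustration only: the labels vector is hypothetical, the rounding rule is an assumption, and the code does not reproduce the algorithm in cvpartition.

```matlab
% Hypothetical class labels: 40, 20, and 20 members in three classes.
labels = [zeros(40,1); ones(20,1); 2*ones(20,1)];
holdout = 0.1;                       % fraction of each class held out

[~,~,classIdx] = unique(labels);
testMask = false(size(labels));
for k = 1:max(classIdx)
    members = find(classIdx == k);
    % Assumed rounding rule for the per-class test count.
    numTest = round(holdout*numel(members));
    pick = randperm(numel(members),numTest);
    testMask(members(pick)) = true;
end

testidx  = find(testMask);           % held-out members
trainidx = find(~testMask);          % remaining members
testCounts = accumarray(classIdx(testMask),1);
```

Each class contributes to the test set in proportion to its size, so the class distribution of both partitions resembles that of the full data set.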

Examples

Extract Specific Member from Ensemble Datastore

Extract the ensemble member that you identify from an ensemble datastore and use a single read command to obtain the contents.

For this example, use the following code to create a simulationEnsembleDatastore object using data previously generated by running a Simulink® model at various fault values (see generateSimulationEnsemble). The ensemble includes simulation data for five different values of a model parameter, ToothFaultGain. Because of the volume of data, the unzip operation takes a few minutes.

unzip simEnsData.zip
ens = simulationEnsembleDatastore(pwd,'logsout')
ens = 
  simulationEnsembleDatastore with properties:

           DataVariables: [5x1 string]
    IndependentVariables: [0x0 string]
      ConditionVariables: [0x0 string]
       SelectedVariables: [5x1 string]
                ReadSize: 1
              NumMembers: 5
          LastMemberRead: [0x0 string]
                   Files: [5x1 string]

ens_nm = ens.NumMembers
ens_nm = 5

The ensemble contains five files.

Extract the fourth ensemble member into a new, single-member ensemble sens.

idx = 4;
sens = subset(ens,idx);
sens_nm = sens.NumMembers
sens_nm = 1

sens contains one member. View the filename to confirm the member index.

sens.Files
ans = 
"/tmp/Bdoc24a_2528353_1095265/tpcde5841d/predmaint-ex43507974/TransmissionCasingSimplified_log_4.mat"

Reset sens to the first member and read the contents.

reset(sens)
m4 = read(sens)
m4=1×5 table
    PMSignalLogName           SimulationInput                   SimulationMetadata                   Tacho                Vibration     
    _______________    ______________________________    _________________________________    ___________________    ___________________

      {'logsout'}      {1x1 Simulink.SimulationInput}    {1x1 Simulink.SimulationMetadata}    {20213x1 timetable}    {20213x1 timetable}

m4 contains the data for the extracted member.

Create Subset of Ensemble Datastore

Create a simulation ensemble datastore from a subset of an existing simulation ensemble datastore.

Create a simulationEnsembleDatastore object using data previously generated by running a Simulink® model at various fault values.

unzip simEnsData.zip
ens = simulationEnsembleDatastore(pwd,'logsout');
ens_nm = ens.NumMembers
ens_nm = 5

The ensemble contains five files. View the filenames.

ens.Files
ans = 5x1 string
    "/tmp/Bdoc24a_2528353_1076626/tp7a95b0de/predmaint-ex46856662/TransmissionCasingSimplified_log_1.mat"
    "/tmp/Bdoc24a_2528353_1076626/tp7a95b0de/predmaint-ex46856662/TransmissionCasingSimplified_log_2.mat"
    "/tmp/Bdoc24a_2528353_1076626/tp7a95b0de/predmaint-ex46856662/TransmissionCasingSimplified_log_3.mat"
    "/tmp/Bdoc24a_2528353_1076626/tp7a95b0de/predmaint-ex46856662/TransmissionCasingSimplified_log_4.mat"
    "/tmp/Bdoc24a_2528353_1076626/tp7a95b0de/predmaint-ex46856662/TransmissionCasingSimplified_log_5.mat"

Extract the first, third, and fifth files into a new ensemble.

idx = [1 3 5];
sens = subset(ens,idx);
sens_nm = sens.NumMembers
sens_nm = 3

The new ensemble contains three members. View the filenames.

sens.Files
ans = 3x1 string
    "/tmp/Bdoc24a_2528353_1076626/tp7a95b0de/predmaint-ex46856662/TransmissionCasingSimplified_log_1.mat"
    "/tmp/Bdoc24a_2528353_1076626/tp7a95b0de/predmaint-ex46856662/TransmissionCasingSimplified_log_3.mat"
    "/tmp/Bdoc24a_2528353_1076626/tp7a95b0de/predmaint-ex46856662/TransmissionCasingSimplified_log_5.mat"

The new ensemble contains the three files that you indexed.

Load the data pEnsemble, which is a workspace ensemble that contains pump diagnostic information.

load pumpWEnsemble pEnsemble

Examine the contents of pEnsemble.

pEnsemble
pEnsemble = 
  workspaceEnsemble with properties:

           DataVariables: [2x1 string]
    IndependentVariables: [0x0 string]
      ConditionVariables: "faultCode"
       SelectedVariables: [3x1 string]
                ReadSize: 1
              NumMembers: 240
          LastMemberRead: "Member 240"

pEnsemble contains 240 members.

Use summary to determine the number of fault classes.

pdist = summary(pEnsemble,ConditionVariable="faultCode");
sd = size(pdist)
sd = 1×2

     8     2

The ensemble has eight fault classes, for an average of 30 members per class. Reduce the ensemble to half its size by limiting each class to 15 members.

subEns_numclass = subset(pEnsemble,ConditionVariable="faultCode",NumMembers=15)
subEns_numclass = 
  workspaceEnsemble with properties:

           DataVariables: [2x1 string]
    IndependentVariables: [0x0 string]
      ConditionVariables: "faultCode"
       SelectedVariables: [3x1 string]
                ReadSize: 1
              NumMembers: 120
          LastMemberRead: [0x0 string]

The ensemble now has half the members (120), distributed over eight equally sized classes that each contain 15 members.

Load the data pEnsemble, which is a workspace ensemble that contains pump diagnostic information.

load pumpWEnsemble pEnsemble

Examine the contents of pEnsemble.

pEnsemble
pEnsemble = 
  workspaceEnsemble with properties:

           DataVariables: [2x1 string]
    IndependentVariables: [0x0 string]
      ConditionVariables: "faultCode"
       SelectedVariables: [3x1 string]
                ReadSize: 1
              NumMembers: 240
          LastMemberRead: "Member 240"

pEnsemble contains 240 members with two data variables and the condition variable faultCode.

To illustrate the distribution of condition values, or classes, use summary to create a histogram.

summary(pEnsemble, ConditionVariable="faultCode")

The histogram of the full ensemble shows the number of members associated with each faultCode value. A faultCode value of 0 indicates a fully healthy member. Other faultCode values correspond to fault combinations. The ensemble has eight classes.

Create a subset of pEnsemble that is balanced, that is, contains the same number of members for each class. Set the name-value argument ImbalancedClass to "smallest" to specify that all classes contain the same number of members as the smallest class, which the histogram shows contains 17 members. The subset should therefore contain 17 members for each class, for a total of 136 members.

subEns_smallest = subset(pEnsemble,ConditionVariable="faultCode",ImbalancedClass="smallest");
totalmembers = subEns_smallest.NumMembers
totalmembers = 136

Examine the histogram of the subset ensemble.

summary(subEns_smallest,ConditionVariable="faultCode")

The histogram of the subset ensemble shows equally sized classes.

Load the data pEnsemble and plot a histogram of its class distribution.

load pumpWEnsemble pEnsemble
summary(pEnsemble,ConditionVariable="faultCode")

The largest class has the label 0 and contains 42 members. Reduce the size of this class to 30 members.

subEns_largest = subset(pEnsemble,ConditionVariable="faultCode",...
    ImbalancedClass="largest",SampleSize=30);
summary(subEns_largest,ConditionVariable="faultCode")

The largest class in the original ensemble now has 30 members.

Load the data pEnsemble.

load pumpWEnsemble pEnsemble

Determine the number of members in pEnsemble.

numM = pEnsemble.NumMembers
numM = 240

The ensemble has 240 members. Reduce the number of members to 100. Return both the subset ensemble and remainder ensemble and their indices.

[sens100,sensidx,rens100,rensidx] = subset(pEnsemble,NumMembers=100);

View the selection order.

sfirst10 = sensidx(1:10)'
sfirst10 = 1×10

     1     2     3     4     5     6     7     8     9    10

The members are selected from the beginning of the ensemble.

View the indices of the remainder.

rfirst10 = rensidx(1:10)'
rfirst10 = 1×10

   101   102   103   104   105   106   107   108   109   110

The remainder ensemble selections begin at member 101.

Use SelectionOrder to choose the members randomly.

[sens100r,sensidxr] = subset(pEnsemble,NumMembers=100,SelectionOrder="random");
first10r = sensidxr(1:10)'
first10r = 1×10

     3     6     7    11    16    22    28    30    32    33

The members are selected randomly, but the returned indices are sorted in ascending order.

Use SelectionOrder to use the last members.

[sens100l,sensidxl,rens100,rensidxl] = subset(pEnsemble,NumMembers=100,SelectionOrder="last");
slast10l = sensidxl(1:10)'
slast10l = 1×10

   141   142   143   144   145   146   147   148   149   150

View the last index of the remainder ensemble.

rlast10l = rensidxl(end)
rlast10l = 140

The remainder ensemble contains the first 140 members.

Load the data pEnsemble.

load pumpWEnsemble pEnsemble

Partition pEnsemble into a training subset and a test subset, using a Holdout value of 0.1.

[trainsub,trainidx,testsub,testidx] = subset(pEnsemble, Holdout=0.1);

Compare ensemble sizes.

numfull = pEnsemble.NumMembers
numfull = 240
numtrain = trainsub.NumMembers
numtrain = 216
numtest = testsub.NumMembers
numtest = 24

The test ensemble contains 10% of the original ensemble.
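
The partition sizes above follow from simple arithmetic, sketched here in base MATLAB. The rounding rule is an assumption (cvpartition may round differently for values that do not divide evenly), and randperm stands in for the random selection that cvpartition performs.

```matlab
n = 240;                       % number of members in pEnsemble
holdout = 0.1;                 % fraction of members reserved for testing

% Convert a fractional holdout to a member count (assumed rounding rule).
if holdout < 1
    numTest = round(holdout*n);
else
    numTest = holdout;         % an integer holdout is already a member count
end
numTrain = n - numTest;

% Randomly choose the test members; everything else is training data.
testidx  = sort(randperm(n,numTest))';
trainidx = setdiff((1:n)',testidx);
```

For this ensemble the sketch yields 24 test members and 216 training members, matching the sizes reported by subset.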

Load the data pEnsemble.

load pumpWEnsemble pEnsemble

Plot a histogram of the members of the ensemble by the class "faultCode".

summary(pEnsemble,ConditionVariable="faultCode")

Partition pEnsemble into training and test data sets using a Holdout value of 0.1.

[trainsub,trainidx,testsub,testidx] = subset(pEnsemble,Holdout=0.1,...
    ConditionVariable="faultCode");

Plot the histograms of the training and test partitions.

summary(trainsub,ConditionVariable="faultCode")

The class distribution in the training histogram is roughly proportional to that of the full ensemble.

summary(testsub,ConditionVariable="faultCode")

The histogram displays the same general pattern as the previous histograms, given the smaller class sizes.

Input Arguments

Source ensemble datastore from which to extract members, specified as a fileEnsembleDatastore, simulationEnsembleDatastore, or workspaceEnsemble object. For an example of extracting a member from an ensemble datastore, see Extract Specific Member from Ensemble Datastore.

Indices of source ensemble members to extract, specified as a numeric vector, an integer vector, or a logical vector. The number of elements in the vector must not exceed the number of members in ens. For numeric or integer vectors, all indices must be positive. For logical vectors, the number of elements must be equal to the number of ensemble members in ens. For an example of creating and using an index vector, see Create Subset of Ensemble Datastore.

Condition variable that determines the division of the ensemble into classes, specified as a string that must be a value in the ConditionVariables property of ens. To specify cvName, use the form ConditionVariable=cvName.

Specify ConditionVariable whenever you extract a subset that is determined by class-related criteria.

Number of members to include in the subset ensemble or class, specified as a positive integer that does not exceed the total number of members in the ensemble or class. To specify numMembers, use the form NumMembers=numMembers.

When you extract a subset that depends on class, that is, you set cvName, numMembers represents the target number of members in the appropriate class or classes.

When you extract a subset that ignores class, numMembers represents the target number of members in the entire ensemble.

Target size of the largest class when you extract a subset using ImbalancedClass="largest", specified as either a fraction represented by a scalar in the range (0,1) or a positive integer that is less than the size of the largest class in ens.

When you extract a subset that depends on class, that is, you set cvName, numMembers represents the target number of members in each appropriate class or classes.

Criterion for selecting subset members, specified as one of the following strings:

  • "first" — Select members from the beginning of the ensemble.

  • "last" — Select members from the end of the ensemble.

  • "random" — Select members randomly.
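
The three selection criteria can be pictured as index vectors. The following base MATLAB sketch uses the ensemble size and subset size from the pEnsemble example; it illustrates the index patterns that the example output shows and is not the code that subset runs.

```matlab
n = 240;                               % ensemble size from the example
k = 100;                               % requested subset size

firstIdx  = (1:k)';                    % "first": members 1 through k
lastIdx   = (n-k+1:n)';                % "last": members n-k+1 through n
randomIdx = sort(randperm(n,k))';      % "random": k distinct members, sorted
```

With these values, the "first" selection ends at member 100 and the "last" selection starts at member 141, consistent with the example output.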

Holdout partition size, specified as either a fraction represented by a scalar in the range (0,1) or a positive integer. When holdout is an integer, cvpartition randomly selects holdout observations for the test set. When holdout is a value in the range (0,1), cvpartition randomly selects approximately holdout*n observations for the test set, where n is the number of members in ens.

Stratification flag for cvpartition, specified as true or false. You can specify stratifyFlag only when you specify holdout and condition variable cvName.

Output Arguments

Extracted ensemble datastore, returned as the same type of object as ens.

Member selection indices for sens, returned as a vector of positive integers.

Remainder ensemble datastore, returned as the same type of object as ens. remens contains the members of ens that are not selected for sens.

Member selection indices for remens, returned as a vector of positive integers. Together, sensidx and remidx index all the members in ens.

Training set, returned as the same type of object as ens.

Member selection indices for trainsub, returned as a vector of positive integers.

Test set, returned as the same type of object as ens. testsub contains the members of ens that are not selected for trainsub.

Member selection indices for testsub, returned as a vector of positive integers.

Version History

Introduced in R2021a
