You can use various algorithms for ensemble learning. Some of the algorithms apply only to classification ensembles, and others apply only to regression ensembles. You can use the algorithms to create ensembles by specifying a value for the `'Method'` name-value pair argument of `fitcensemble` or `fitrensemble`. This topic provides descriptions of the various algorithms. Note that usage of some algorithms, such as `LPBoost`, `TotalBoost`, and `RobustBoost`, requires an Optimization Toolbox™ license.

`AdaBoostM1` is a very popular boosting algorithm for binary classification. The algorithm trains learners sequentially. For every learner with index *t*, `AdaBoostM1` computes the weighted classification error

$${\epsilon}_{t}={\displaystyle \sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}\mathbb{I}\left({y}_{n}\ne {h}_{t}\left({x}_{n}\right)\right),$$

where

- *x _{n}* is a vector of predictor values for observation *n*.
- *y _{n}* is the true class label.
- *h _{t}* is the prediction of learner (hypothesis) with index *t*.
- $\mathbb{I}$ is the indicator function.
- ${d}_{n}^{\left(t\right)}$ is the weight of observation *n* at step *t*.

`AdaBoostM1` then increases weights for observations misclassified by learner *t* and reduces weights for observations correctly classified by learner *t*. The next learner *t* + 1 is then trained on the data with updated weights ${d}_{n}^{\left(t+1\right)}$.

After training finishes, `AdaBoostM1` computes predictions for new data using

$f\left(x\right)={\displaystyle \sum}_{t=1}^{T}{\alpha}_{t}{h}_{t}\left(x\right),$

where

${\alpha}_{t}=\frac{1}{2}\mathrm{log}\frac{1-{\epsilon}_{t}}{{\epsilon}_{t}}$

are weights of the weak hypotheses in the ensemble.
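The quantities above can be traced in a few lines of Python. This is a minimal sketch rather than the toolbox implementation: the function names are hypothetical, labels and predictions are assumed to lie in {–1,+1}, and the exponential reweighting rule is the common textbook form, which this page describes only qualitatively.

```python
import math

def adaboost_m1_step(d, y, h):
    """One AdaBoostM1 round for labels and predictions in {-1, +1}.

    d: observation weights that sum to 1; y: true labels; h: learner predictions.
    Returns (epsilon, alpha, updated weights).
    """
    # Weighted classification error: total weight of misclassified observations.
    eps = sum(dn for dn, yn, hn in zip(d, y, h) if yn != hn)
    if not 0.0 < eps < 0.5:
        raise ValueError("this sketch needs a weak learner with 0 < epsilon < 0.5")
    # Weight of this weak hypothesis in the final ensemble.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Assumed textbook update: misclassified weights go up, correct ones go down.
    raw = [dn * math.exp(-alpha * yn * hn) for dn, yn, hn in zip(d, y, h)]
    total = sum(raw)
    return eps, alpha, [r / total for r in raw]

def ensemble_score(alphas, predictions):
    """f(x) = sum over t of alpha_t * h_t(x), for a single observation."""
    return sum(a * p for a, p in zip(alphas, predictions))
```

With four equally weighted observations and one mistake, the error is 0.25 and, under the assumed update, the misclassified observation carries half of the total weight in the next round.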

Training by `AdaBoostM1` can be viewed as stagewise minimization of the exponential loss

${\displaystyle \sum}_{n=1}^{N}{w}_{n}\mathrm{exp}\left(-{y}_{n}f\left({x}_{n}\right)\right),$

where

- *y _{n}* ∊ {–1,+1} is the true class label.
- *w _{n}* are observation weights normalized to add up to 1.
- *f*(*x _{n}*) ∊ (–∞,+∞) is the predicted classification score.

The observation weights *w _{n}* are the original observation weights you passed to `fitcensemble`.

The second output from the `predict` method of an `AdaBoostM1` classification ensemble is an *N*-by-2 matrix of classification scores for the two classes and *N* observations. The second column in this matrix is always equal to minus the first column. The `predict` method returns two scores to be consistent with multiclass models, even though the second column is redundant.

Most often `AdaBoostM1` is used with decision stumps (default) or shallow trees. If boosted stumps give poor performance, try setting the minimal parent node size to one quarter of the training data.

By default, the learning rate for boosting algorithms is `1`. If you set the learning rate to a lower number, the ensemble learns at a slower rate, but can converge to a better solution. `0.1` is a popular choice for the learning rate. Learning at a rate less than `1` is often called “shrinkage”.

For examples using `AdaBoostM1`, see Tune RobustBoost.

For references related to `AdaBoostM1`, see Freund and Schapire [20], Schapire et al. [40], Friedman, Hastie, and Tibshirani [22], and Friedman [21].

`AdaBoostM2` is an extension of `AdaBoostM1` for multiple classes. Instead of weighted classification error, `AdaBoostM2` uses weighted pseudo-loss for *N* observations and *K* classes

${\epsilon}_{t}=\frac{1}{2}{\displaystyle \sum}_{n=1}^{N}{\displaystyle \sum}_{k\ne {y}_{n}}{d}_{n,k}^{\left(t\right)}\left(1-{h}_{t}\left({x}_{n},{y}_{n}\right)+{h}_{t}\left({x}_{n},k\right)\right),$

where

- *h _{t}*(*x _{n}*,*k*) is the confidence of prediction by learner at step *t* into class *k* ranging from 0 (not at all confident) to 1 (highly confident).
- ${d}_{n,k}^{\left(t\right)}$ are observation weights at step *t* for class *k*.
- *y _{n}* is the true class label taking one of the *K* values.
- The second sum is over all classes other than the true class *y _{n}*.
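As a rough illustration of the pseudo-loss formula, the following Python sketch evaluates it for one learner. The function name and the nested-list data layout are assumptions made for this example, not part of the toolbox.

```python
def pseudo_loss(d, y, h):
    """Weighted pseudo-loss for one learner.

    d[n][k]: weight of observation n for class k (entries with k == y[n] are ignored).
    y[n]: true class index in range(K).
    h[n][k]: learner confidence in [0, 1] that observation n belongs to class k.
    """
    total = 0.0
    for n, true_class in enumerate(y):
        for k in range(len(h[n])):
            if k == true_class:
                continue  # the inner sum runs over the false classes only
            total += d[n][k] * (1.0 - h[n][true_class] + h[n][k])
    return 0.5 * total
```

When the weights over the false classes add up to 1, a learner that is fully confident in the true class gives a pseudo-loss of 0, and a learner that spreads its confidence evenly over all classes gives 0.5.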

Pseudo-loss is harder to interpret than classification error, but the idea is the same. Pseudo-loss can be used as a measure of the classification accuracy from any learner in an ensemble. Pseudo-loss typically exhibits the same behavior as a weighted classification error for `AdaBoostM1`: the first few learners in a boosted ensemble give low pseudo-loss values. After the first few training steps, the ensemble begins to learn at a slower pace, and the pseudo-loss value approaches 0.5 from below.

For examples using `AdaBoostM2`, see Train Classification Ensemble.

For references related to `AdaBoostM2`, see Freund and Schapire [20].

*Bagging*, which stands for “bootstrap aggregation,” is a type of ensemble learning. To bag a weak learner such as a decision tree on a dataset, generate many bootstrap replicas of this dataset and grow decision trees on these replicas. Obtain each bootstrap replica by randomly selecting `N` observations out of `N` with replacement, where `N` is the dataset size. To find the predicted response of a trained ensemble, take an average over predictions from individual trees.
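The two ingredients of bagging, drawing a bootstrap replica and averaging tree predictions, can be sketched as follows. This is an illustrative Python fragment with hypothetical function names; it leaves out tree growing entirely.

```python
import random

def bootstrap_replica(n_obs, rng):
    """Indices of one bootstrap replica: n_obs draws out of n_obs with replacement."""
    return [rng.randrange(n_obs) for _ in range(n_obs)]

def bagged_prediction(tree_predictions):
    """Average the predictions of the individual trees for one observation."""
    return sum(tree_predictions) / len(tree_predictions)
```

Each tree would be grown on the observations selected by its own call to `bootstrap_replica`, so some observations appear several times in a replica and others not at all.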

Bagged decision trees were introduced in MATLAB® R2009a as `TreeBagger`. The `fitcensemble` and `fitrensemble` functions let you bag in a manner consistent with boosting. An ensemble of bagged trees, either `ClassificationBaggedEnsemble` or `RegressionBaggedEnsemble`, returned by `fitcensemble` or `fitrensemble`, respectively, offers almost the same functionality as `TreeBagger`. Discrepancies between `TreeBagger` and the new framework are described in detail in TreeBagger Features Not in fitcensemble or fitrensemble.

Bagging works by training learners on resampled versions of the data. This resampling is
usually done by bootstrapping observations, that is, selecting *N* out of
*N* observations with replacement for every new learner. In addition,
every tree in the ensemble can randomly select predictors for decision splits—a
technique known to improve the accuracy of bagged trees.

By default, the minimal leaf sizes for bagged trees are set to `1` for classification and `5` for regression. Trees grown with the default leaf size are usually very deep. These settings are close to optimal for the predictive power of an ensemble. Often you can grow trees with larger leaves without losing predictive power. Doing so reduces training and prediction time, as well as memory usage for the trained ensemble.

Another important parameter is the number of predictors selected at random for every decision split. This random selection is made for every split, and every deep tree involves many splits. By default, this parameter is set to the square root of the number of predictors for classification, and one third of the number of predictors for regression.

Several features of bagged decision trees make them a unique algorithm. Drawing `N` out of `N` observations with replacement omits on
average 37% of observations for each decision tree. These are “out-of-bag”
observations. You can use them to estimate the predictive power and feature importance. For
each observation, you can estimate the out-of-bag prediction by averaging over predictions
from all trees in the ensemble for which this observation is out of bag. You can then
compare the computed prediction against the observed response for this observation. By
comparing the out-of-bag predicted responses against the observed responses for all
observations used for training, you can estimate the average out-of-bag error. This
out-of-bag average is an unbiased estimator of the true ensemble error. You can also obtain
out-of-bag estimates of feature importance by randomly permuting out-of-bag data across one
variable or column at a time and estimating the increase in the out-of-bag error due to this
permutation. The larger the increase, the more important the feature. Thus, you need not
supply test data for bagged ensembles because you obtain reliable estimates of the
predictive power and feature importance in the process of training, which is an attractive
feature of bagging.
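The 37% figure follows from the sampling scheme: a given observation is missed by one draw with probability 1 – 1/*N*, so it is missed by all *N* draws with probability (1 – 1/*N*)^*N*, which tends to 1/*e* ≈ 0.368 as *N* grows. A short Python check, with a hypothetical function name:

```python
import math

def out_of_bag_fraction(n_obs):
    """Probability that a given observation is absent from one bootstrap replica."""
    return (1.0 - 1.0 / n_obs) ** n_obs
```

For moderately large `n_obs` the value is already close to `math.exp(-1)`.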

Another attractive feature of bagged decision trees is the proximity matrix. Every time two observations land on the same leaf of a tree, their proximity increases by 1. For normalization, sum these proximities over all trees in the ensemble and divide by the number of trees. The resulting matrix is symmetric with diagonal elements equal to 1 and off-diagonal elements ranging from 0 to 1. You can use this matrix for finding outlier observations and discovering clusters in the data through multidimensional scaling.
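The proximity computation described above can be sketched in Python as follows. The input layout (`leaf_ids[t][i]` holding the leaf of tree *t* reached by observation *i*) and the function name are assumptions made for this example.

```python
def proximity_matrix(leaf_ids):
    """Proximity matrix from leaf assignments.

    leaf_ids[t][i] is the leaf of tree t that observation i lands on.
    Entry (i, j) is the fraction of trees in which i and j share a leaf.
    """
    n_trees = len(leaf_ids)
    n_obs = len(leaf_ids[0])
    prox = [[0.0] * n_obs for _ in range(n_obs)]
    for tree in leaf_ids:
        for i in range(n_obs):
            for j in range(n_obs):
                if tree[i] == tree[j]:
                    prox[i][j] += 1.0  # same leaf: proximity increases by 1
    # Normalize by the number of trees so that entries lie in [0, 1].
    return [[v / n_trees for v in row] for row in prox]
```

The double loop makes the quadratic cost in the number of observations explicit, both in time and in the size of the result.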

For references related to bagging, see Breiman [8], [9], and [10].

`fitcensemble` and `fitrensemble` produce bagged ensembles that have most, but not all, of the functionality of `TreeBagger` objects. Additionally, some functionalities have different names in the new bagged ensembles.

**TreeBagger Features Not in fitcensemble or fitrensemble**

Feature | TreeBagger Property | TreeBagger Method |
---|---|---|
Computation of proximity matrix | `Proximity` | `fillprox`, `mdsprox` |
Computation of outliers | `OutlierMeasure` | N/A |
Out-of-bag estimates of predictor importance | `OOBPermutedPredictorDeltaMeanMargin` and `OOBPermutedPredictorCountRaiseMargin` | N/A |
Merging two ensembles trained separately | N/A | `append` |
Parallel computation for creating ensemble | Set the `UseParallel` name-value pair to `true` | N/A |

When you estimate the proximity matrix and outliers of a `TreeBagger` model using `fillprox`, MATLAB must fit an *n*-by-*n* matrix in memory, where *n* is the number of observations. Therefore, if *n* is moderate to large, then you should avoid estimating the proximity matrix and outliers.

**Differing Names Between TreeBagger and Bagged Ensembles**

Feature | TreeBagger | Bagged Ensembles |
---|---|---|
Split criterion contributions for each predictor | `DeltaCriterionDecisionSplit` property | First output of `predictorImportance` (classification) or `predictorImportance` (regression) |
Predictor associations | `SurrogateAssociation` property | Second output of `predictorImportance` (classification) or `predictorImportance` (regression) |
Out-of-bag estimates of predictor importance | `OOBPermutedPredictorDeltaError` property | Output of `oobPermutedPredictorImportance` (classification) or `oobPermutedPredictorImportance` (regression) |
Error (misclassification probability or mean-squared error) | `error` and `oobError` methods | `loss` and `oobLoss` methods (classification); `loss` and `oobLoss` methods (regression) |
Train additional trees and add to ensemble | `growTrees` method | `resume` method (classification); `resume` method (regression) |
Mean classification margin per tree | `meanMargin` and `oobMeanMargin` methods | `edge` and `oobEdge` methods (classification) |

In addition, two important changes were made to training and prediction for bagged classification ensembles:

- If you pass a misclassification cost matrix to `TreeBagger`, it passes this matrix along to the trees. If you pass a misclassification cost matrix to `fitcensemble`, it uses this matrix to adjust the class prior probabilities. `fitcensemble` then passes the adjusted prior probabilities and the default cost matrix to the trees. The default cost matrix is `ones(K)-eye(K)` for `K` classes.
- Unlike the `loss` and `edge` methods in the new framework, the `TreeBagger` `error` and `meanMargin` methods do not normalize input observation weights of the prior probabilities in the respective class.

`GentleBoost` (also known as Gentle AdaBoost) combines features of `AdaBoostM1` and `LogitBoost`. Like `AdaBoostM1`, `GentleBoost` minimizes the exponential loss. But its numeric optimization is set up differently. Like `LogitBoost`, every weak learner fits a regression model to response values *y _{n}* ∊ {–1,+1}. This makes `GentleBoost` another good candidate for binary classification of data with multilevel categorical predictors.

`fitcensemble` computes and stores the mean-squared error in the `FitInfo` property of the ensemble object. The mean-squared error is

${\displaystyle \sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\tilde{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$

where

- ${d}_{n}^{\left(t\right)}$ are observation weights at step *t* (the weights add up to 1).
- *h _{t}*(*x _{n}*) are predictions of the regression model *h _{t}* fitted to response values *y _{n}*.

As the strength of individual learners weakens, the weighted mean-squared error approaches 1.
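A small Python sketch shows why the limit is 1. If a learner carries no signal, its best regression fit to balanced responses in {–1,+1} is a prediction near 0, so every squared residual is close to 1 and the weights add up to 1. The function name is hypothetical.

```python
def weighted_mse(d, y, h):
    """Weighted mean-squared error: sum of d_n * (y_n - h(x_n))**2, weights summing to 1."""
    return sum(dn * (yn - hn) ** 2 for dn, yn, hn in zip(d, y, h))
```

A learner that reproduces the responses exactly gives 0, and a learner that always predicts 0 gives 1.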

For examples using `GentleBoost`, see Train Ensemble With Unequal Classification Costs and Classification with Many Categorical Levels.

For references related to `GentleBoost`, see Friedman, Hastie, and Tibshirani [22].

`LogitBoost` is another popular algorithm for binary classification. `LogitBoost` works similarly to `AdaBoostM1`, except it minimizes binomial deviance

${\displaystyle \sum}_{n=1}^{N}{w}_{n}\mathrm{log}\left(1+\mathrm{exp}\left(-2{y}_{n}f\left({x}_{n}\right)\right)\right),$

where

- *y _{n}* ∊ {–1,+1} is the true class label.
- *w _{n}* are observation weights normalized to add up to 1.
- *f*(*x _{n}*) ∊ (–∞,+∞) is the predicted classification score.
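To see how this loss treats a single observation compared with the exponential loss used by `AdaBoostM1`, the per-observation terms can be evaluated side by side. This Python sketch uses hypothetical function names and takes the product *y _{n}f*(*x _{n}*) as its input.

```python
import math

def exponential_loss_term(margin):
    """Per-observation exponential loss for margin = y * f(x)."""
    return math.exp(-margin)

def binomial_deviance_term(margin):
    """Per-observation binomial deviance for margin = y * f(x)."""
    return math.log(1.0 + math.exp(-2.0 * margin))
```

For a strongly negative product such as –3, the exponential term is about 20 while the deviance term is about 6, because the deviance grows only linearly for badly misclassified observations.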

Binomial deviance assigns less weight to badly misclassified observations (observations with large negative values of *y _{n}f*(*x _{n}*)). `LogitBoost` can give better average accuracy than `AdaBoostM1` for data with poorly separable classes.

Learner *t* in a `LogitBoost` ensemble fits a regression model to response values

${\tilde{y}}_{n}=\frac{{y}_{n}^{*}-{p}_{t}\left({x}_{n}\right)}{{p}_{t}\left({x}_{n}\right)\left(1-{p}_{t}\left({x}_{n}\right)\right)},$

where

- ${y}_{n}^{*}$ ∊ {0,+1} are relabeled classes (0 instead of –1).
- *p _{t}*(*x _{n}*) is the current ensemble estimate of the probability for observation *x _{n}* to be of class 1.
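The response values that each learner fits can be computed directly from the formula above. A minimal Python sketch follows; the function name is hypothetical and the probability estimate is assumed to lie strictly between 0 and 1.

```python
def logitboost_working_response(y, p):
    """Response value fitted by the next learner, for one observation.

    y: class label in {-1, +1}; p: current estimate of P(class 1), strictly in (0, 1).
    """
    y_star = 1.0 if y == 1 else 0.0  # relabel -1 as 0
    return (y_star - p) / (p * (1.0 - p))
```

The value grows without bound as `p` approaches 0 or 1 for an observation of the opposite class, which is why these responses have no fixed range.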

Fitting a regression model at each boosting step turns into a great computational advantage for data with multilevel categorical predictors. Take a categorical predictor with *L* levels. To find the optimal decision split on such a predictor, a classification tree needs to consider 2^{L–1} – 1 splits. A regression tree needs to consider only *L* – 1 splits, so the processing time can be much shorter. `LogitBoost` is recommended for categorical predictors with many levels.
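The gap between the two split counts widens quickly with the number of levels. A short Python sketch of the two formulas, with hypothetical function names:

```python
def classification_tree_splits(levels):
    """Candidate splits a classification tree considers: 2**(L - 1) - 1."""
    return 2 ** (levels - 1) - 1

def regression_tree_splits(levels):
    """Candidate splits a regression tree considers: L - 1."""
    return levels - 1
```

For a predictor with 20 levels, the counts are 524287 and 19.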

`fitcensemble` computes and stores the mean-squared error in the `FitInfo` property of the ensemble object. The mean-squared error is

${\displaystyle \sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\tilde{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$

where

- ${d}_{n}^{\left(t\right)}$ are observation weights at step *t* (the weights add up to 1).
- *h _{t}*(*x _{n}*) are predictions of the regression model *h _{t}* fitted to response values ${\tilde{y}}_{n}$.

Values ${\tilde{y}}_{n}$ can range from –∞ to +∞, so the mean-squared error does not have well-defined bounds.

For examples using `LogitBoost`, see Classification with Many Categorical Levels.

For references related to `LogitBoost`, see Friedman, Hastie, and Tibshirani [22].

`LPBoost` (linear programming boost), like `TotalBoost`, performs multiclass classification by attempting to maximize the minimal *margin* in the training set. This attempt uses optimization algorithms, namely linear programming for `LPBoost`. So you need an Optimization Toolbox license to use `LPBoost` or `TotalBoost`.

The margin of a classification is the difference between the predicted soft
classification *score* for the true class, and the largest score for
the false classes. For trees, the *score* of a classification of a leaf
node is the posterior probability of the classification at that node. The posterior
probability of the classification at a node is the number of training sequences that lead to
that node with the classification, divided by the number of training sequences that lead to
that node. For more information, see Definitions in `margin`.
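The margin and the leaf score defined above can be written out in a few lines of Python. The function names and the list-based inputs are assumptions made for this sketch.

```python
def classification_margin(scores, true_class):
    """Score of the true class minus the largest score among the false classes."""
    best_false = max(s for k, s in enumerate(scores) if k != true_class)
    return scores[true_class] - best_false

def leaf_posterior(class_counts, k):
    """Posterior probability of class k at a leaf: count of class k over the total count."""
    return class_counts[k] / sum(class_counts)
```

A positive margin means the true class has the highest score, and a negative margin means the observation is misclassified.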

Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [41] establish this inequality on the probability of obtaining a negative margin:

$${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}(N/V)}{{\theta}^{2}}+\mathrm{log}(1/\delta )}\right).$$

Here *m* is the margin, *θ* is any positive number,
*V* is the Vapnik-Chervonenkis dimension of the classifier space,
*N* is the size of the training set, and *δ* is a small
positive number. The inequality holds with probability 1–*δ* over many
i.i.d. training and test sets. This inequality says: To obtain a low generalization error,
minimize the number of observations below margin *θ* in the training
set.

`LPBoost` iteratively maximizes the minimal margin through a sequence of linear programming problems. Equivalently, by duality, `LPBoost` minimizes the maximal *edge*, where edge is the weighted mean margin (see Definitions). At each iteration, there are more constraints in the problem. So, for large problems, the optimization problem becomes increasingly constrained, and slow to solve.

`LPBoost` typically creates ensembles with many learners having weights that are orders of magnitude smaller than those of other learners. Therefore, to better enable you to remove the unimportant ensemble members, the `compact` method reorders the members of an `LPBoost` ensemble from largest weight to smallest. You can then easily remove the least important members of the ensemble using the `removeLearners` method.

For examples using `LPBoost`, see LPBoost and TotalBoost for Small Ensembles.

For references related to `LPBoost`, see Warmuth, Liao, and Ratsch [44].

`LSBoost` (least squares boosting) fits regression ensembles. At every
step, the ensemble fits a new learner to the difference between the observed response and
the aggregated prediction of all learners grown previously. The ensemble fits to minimize
mean-squared error.

You can use `LSBoost` with shrinkage by passing in the `LearnRate` parameter. By default this parameter is set to `1`, and the ensemble learns at the maximal speed. If you set `LearnRate` to a value from `0` to `1`, the ensemble fits every new learner to *y _{n}* – *ηf*(*x _{n}*), where

- *y _{n}* is the observed response.
- *f*(*x _{n}*) is the aggregated prediction from all weak learners grown so far for observation *x _{n}*.
- *η* is the learning rate.
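The responses that the next learner is fitted to can be sketched in Python as follows. The function name is hypothetical, and the sketch covers only the target computation described here, not how the toolbox grows or aggregates the learners.

```python
def lsboost_targets(y, f, learn_rate=1.0):
    """Responses for the next learner: y_n - eta * f(x_n) for every observation."""
    return [yn - learn_rate * fn for yn, fn in zip(y, f)]
```

With the default learning rate of 1 the targets are the plain residuals. A smaller rate leaves more of the observed response for later learners to explain.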

For examples using `LSBoost`, see Train Regression Ensemble and Regularize a Regression Ensemble.

For references related to `LSBoost`, see Hastie, Tibshirani, and Friedman [24], Chapters 7 (Model Assessment and Selection) and 15 (Random Forests, see also [9]).

Boosting algorithms such as `AdaBoostM1` and `LogitBoost` increase weights for misclassified observations at every boosting step. These weights can become very large. If this happens, the boosting algorithm sometimes concentrates on a few misclassified observations and neglects the majority of training data. Consequently the average classification accuracy suffers.

In this situation, you can try using `RobustBoost`. This algorithm does not assign almost the entire data weight to badly misclassified observations. It can produce better average classification accuracy. You need an Optimization Toolbox license to use `RobustBoost`.

Unlike `AdaBoostM1` and `LogitBoost`, `RobustBoost` does not minimize a specific loss function. Instead, it maximizes the number of observations with the classification margin above a certain threshold.

`RobustBoost` trains based on time evolution. The algorithm starts at *t* = 0. At every step, `RobustBoost` solves an optimization problem to find a positive step in time Δ*t* and a corresponding positive change in the average margin for training data Δ*m*. `RobustBoost` stops training and exits if at least one of these three conditions is true:

- Time *t* reaches 1.
- `RobustBoost` cannot find a solution to the optimization problem with positive updates Δ*t* and Δ*m*.
- `RobustBoost` grows as many learners as you requested.

Results from `RobustBoost` can be usable regardless of which termination condition stopped the training. Estimate the classification accuracy by cross validation or by using an independent test set.

To get better classification accuracy from `RobustBoost`, you can adjust three parameters in `fitcensemble`: `RobustErrorGoal`, `RobustMaxMargin`, and `RobustMarginSigma`. Start by varying values for `RobustErrorGoal` from 0 to 1. The maximal allowed value for `RobustErrorGoal` depends on the two other parameters. If you pass a value that is too high, `fitcensemble` produces an error message showing the allowed range for `RobustErrorGoal`.

For examples using `RobustBoost`, see Tune RobustBoost.

For references related to `RobustBoost`, see Freund [19].

`RUSBoost` is especially effective at classifying imbalanced data, meaning some class in the training data has many fewer members than another. RUS stands for Random Under Sampling. The algorithm takes *N*, the number of members in the class with the fewest members in the training data, as the basic unit for sampling. Classes with more members are undersampled by taking only *N* observations of every class. In other words, if there are *K* classes, then, for each weak learner in the ensemble, `RUSBoost` takes a subset of the data with *N* observations from each of the *K* classes. The boosting procedure follows the procedure in AdaBoostM2 for reweighting and constructing the ensemble.

When you construct a `RUSBoost` ensemble, there is an optional name-value pair called `RatioToSmallest`. Give a vector of *K* values, each value representing the multiple of *N* to sample for the associated class. For example, if the smallest class has *N* = 100 members, then `RatioToSmallest` = `[2,3,4]` means each weak learner has 200 members in class 1, 300 in class 2, and 400 in class 3. If `RatioToSmallest` leads to a value that is larger than the number of members in a particular class, then `RUSBoost` samples the members with replacement. Otherwise, `RUSBoost` samples the members without replacement.
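The sampling rule can be summarized in a short Python sketch that returns, for each class, how many members a weak learner receives and whether the draw needs replacement. The function name, the argument names, and the class sizes in the usage note are made up for illustration.

```python
def rusboost_sample_plan(class_counts, ratio_to_smallest=None):
    """Per-class sample size and whether sampling is with replacement.

    class_counts: number of members of each class in the training data.
    ratio_to_smallest: multiple of the smallest class size to draw per class
    (assumed here to be 1 for every class when omitted).
    """
    smallest = min(class_counts)
    if ratio_to_smallest is None:
        ratio_to_smallest = [1] * len(class_counts)
    plan = []
    for count, ratio in zip(class_counts, ratio_to_smallest):
        size = ratio * smallest
        # Replacement is needed only when the class has fewer members than requested.
        plan.append((size, size > count))
    return plan
```

For class sizes `[100, 250, 1000]` and ratios `[2, 3, 4]`, the sketch requests 200, 300, and 400 members, with replacement for the first two classes and without replacement for the third.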

For examples using `RUSBoost`, see Classification with Imbalanced Data.

For references related to `RUSBoost`, see Seiffert et al. [43].

Use random subspace ensembles (`Subspace`) to improve the accuracy of discriminant analysis (`ClassificationDiscriminant`) or *k*-nearest neighbor (`ClassificationKNN`) classifiers. `Subspace` ensembles also have the advantage of using less memory than ensembles with all predictors, and can handle missing values (`NaN`s).

The basic random subspace algorithm uses these parameters.

- *m* is the number of dimensions (variables) to sample in each learner. Set *m* using the `NPredToSample` name-value pair.
- *d* is the number of dimensions in the data, which is the number of columns (predictors) in the data matrix `X`.
- *n* is the number of learners in the ensemble. Set *n* using the `NLearn` input.

The basic random subspace algorithm performs the following steps:

1. Choose without replacement a random set of *m* predictors from the *d* possible values.
2. Train a weak learner using just the *m* chosen predictors.
3. Repeat steps 1 and 2 until there are *n* weak learners.
4. Predict by taking an average of the `score` prediction of the weak learners, and classify the category with the highest average `score`.

You can choose to create a weak learner for every possible set of *m* predictors from the *d* dimensions. To do so, set *n*, the number of learners, to `'AllPredictorCombinations'`. In this case, there are `nchoosek(size(X,2),NPredToSample)` weak learners in the ensemble.
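The predictor selection, the score averaging, and the size of the exhaustive ensemble can be sketched in Python. All function names are hypothetical, training of the weak learners is left out, and the uniform sampling here is the basic algorithm only.

```python
import math
import random

def choose_subspace(d, m, rng):
    """Step 1: pick m distinct predictor indices out of d, without replacement."""
    return sorted(rng.sample(range(d), m))

def classify_by_average_score(scores_per_learner):
    """Step 4: average the per-class scores over learners and pick the top class."""
    n_learners = len(scores_per_learner)
    n_classes = len(scores_per_learner[0])
    avg = [sum(s[k] for s in scores_per_learner) / n_learners for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: avg[k])

def all_combinations_count(d, m):
    """Ensemble size when one learner is trained for every set of m out of d predictors."""
    return math.comb(d, m)
```

With 10 predictors and 3 sampled per learner, the exhaustive ensemble already holds 120 learners, so this option is practical only for small *d* or small *m*.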

`fitcensemble` downweights predictors after choosing them for a
learner, so subsequent learners have a lower chance of using a predictor that was previously
used. This weighting tends to make predictors more evenly distributed among learners than in
uniform weighting.

For examples using `Subspace`, see Random Subspace Classification.

For references related to random subspace ensembles, see Ho [26].

`TotalBoost`, like linear programming boost (`LPBoost`), performs multiclass classification by attempting to maximize the minimal *margin* in the training set. This attempt uses optimization algorithms, namely quadratic programming for `TotalBoost`. So you need an Optimization Toolbox license to use `LPBoost` or `TotalBoost`.

The margin of a classification is the difference between the predicted soft
classification *score* for the true class, and the largest score for
the false classes. For trees, the *score* of a classification of a leaf
node is the posterior probability of the classification at that node. The posterior
probability of the classification at a node is the number of training sequences that lead to
that node with the classification, divided by the number of training sequences that lead to
that node. For more information, see Definitions in `margin`.

Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [41] establish this inequality on the probability of obtaining a negative margin:

$${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}(N/V)}{{\theta}^{2}}+\mathrm{log}(1/\delta )}\right).$$

Here *m* is the margin, *θ* is any positive number,
*V* is the Vapnik-Chervonenkis dimension of the classifier space,
*N* is the size of the training set, and *δ* is a small
positive number. The inequality holds with probability 1–*δ* over many
i.i.d. training and test sets. This inequality says: To obtain a low generalization error,
minimize the number of observations below margin *θ* in the training
set.

`TotalBoost` minimizes a proxy of the Kullback-Leibler divergence
between the current weight distribution and the initial weight distribution, subject to the
constraint that the *edge* (the weighted margin) is below a certain
value. The proxy is a quadratic expansion of the divergence:

$$D(W,{W}_{0})={\displaystyle \sum _{n=1}^{N}\mathrm{log}\frac{W(n)}{{W}_{0}(n)}}\approx {\displaystyle \sum _{n=1}^{N}\left(1+\frac{W(n)}{{W}_{0}(n)}\right)\Delta +\frac{1}{2W(n)}{\Delta}^{2}},$$

where Δ is the difference between *W*(*n*), the
weights at the current and next iteration, and *W*_{0},
the initial weight distribution, which is uniform. This optimization formulation keeps
weights from becoming zero. At each iteration, there are more constraints in the problem.
So, for large problems, the optimization problem becomes increasingly constrained, and slow
to solve.

`TotalBoost` typically creates ensembles with many learners having weights that are orders of magnitude smaller than those of other learners. Therefore, to better enable you to remove the unimportant ensemble members, the `compact` method reorders the members of a `TotalBoost` ensemble from largest weight to smallest. You can then easily remove the least important members of the ensemble using the `removeLearners` method.

For examples using `TotalBoost`, see LPBoost and TotalBoost for Small Ensembles.

For references related to `TotalBoost`, see Warmuth, Liao, and Ratsch [44].

`ClassificationBaggedEnsemble` | `ClassificationDiscriminant` | `ClassificationEnsemble` | `ClassificationKNN` | `ClassificationPartitionedEnsemble` | `CompactClassificationEnsemble` | `CompactRegressionEnsemble` | `RegressionBaggedEnsemble` | `RegressionEnsemble` | `RegressionPartitionedEnsemble` | `TreeBagger` | `fitcensemble` | `fitrensemble`