templateNaiveBayes

Naive Bayes classifier template

Description

t = templateNaiveBayes() returns a naive Bayes template suitable for training error-correcting output code (ECOC) multiclass models.

If you specify a default template, then the software uses default values for all input arguments during training.

Specify t as a learner in fitcecoc.

t = templateNaiveBayes(Name,Value) returns a template with additional options specified by one or more name-value pair arguments. All properties of t are empty, except those you specify using Name,Value pair arguments.

For example, you can specify distributions for the predictors.

If you display t in the Command Window, then all options appear empty ([]), except those that you specify using name-value pair arguments. During training, the software uses default values for empty options.

Examples

Use templateNaiveBayes to specify a default naive Bayes template.

t = templateNaiveBayes()
t = 
Fit template for classification NaiveBayes.

    DistributionNames: [1x0 double]
               Kernel: []
              Support: []
                Width: []
      StandardizeData: []
              Version: 1
               Method: 'NaiveBayes'
                 Type: 'classification'

All properties of the template object are empty except for Method and Type. When you pass t to the training function, the software fills in the empty properties with their respective default values. For example, the software fills the DistributionNames property with a 1-by-D cell array of character vectors containing 'normal' in each cell, where D is the number of predictors. For details on other default values, see fitcnb.

t is a plan for a naive Bayes learner, and no computation occurs when you specify it. You can pass t to fitcecoc to specify naive Bayes binary learners for ECOC multiclass learning.

Create a nondefault naive Bayes template for use in fitcecoc.

Load Fisher's iris data set.

load fisheriris

Create a template for naive Bayes binary classifiers, and specify kernel distributions for all predictors.

t = templateNaiveBayes('DistributionNames','kernel')
t = 
Fit template for classification NaiveBayes.

    DistributionNames: 'kernel'
               Kernel: []
              Support: []
                Width: []
      StandardizeData: []
              Version: 1
               Method: 'NaiveBayes'
                 Type: 'classification'

All properties of the template object are empty except for DistributionNames, Method, and Type. When you pass t to the training function, the software fills in the empty properties with their respective default values.

Specify t as a binary learner for an ECOC multiclass model.

Mdl = fitcecoc(meas,species,'Learners',t);

By default, the software trains Mdl using the one-versus-one coding design.

Display the in-sample (resubstitution) misclassification error.

L = resubLoss(Mdl,'LossFun','classiferror')
L = 0.0333
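
The resubstitution loss can be optimistic. As a quick sketch, you can also estimate the generalization error with 10-fold cross-validation (crossval and kfoldLoss support ECOC models):

CVMdl = crossval(Mdl);                          % 10-fold cross-validation by default
genError = kfoldLoss(CVMdl,'LossFun','classiferror');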

Input Arguments

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'DistributionNames','mn' specifies to treat all predictors as token counts for a multinomial model.

DistributionNames

Data distributions fitcnb uses to model the data, specified as the comma-separated pair consisting of 'DistributionNames' and a character vector or string scalar, a string array, or a cell array of character vectors with values from this table.

Value      Description
'kernel'   Kernel smoothing density estimate.
'mn'       Multinomial distribution. If you specify 'mn', then all features are
           components of a multinomial distribution. Therefore, you cannot include
           'mn' as an element of a string array or a cell array of character
           vectors. For details, see Algorithms.
'mvmn'     Multivariate multinomial distribution. For details, see Algorithms.
'normal'   Normal (Gaussian) distribution.

If you specify a character vector or string scalar, then the software models all the features using that distribution. If you specify a 1-by-P string array or cell array of character vectors, then the software models feature j using the distribution in element j of the array.

By default, the software sets all predictors specified as categorical predictors (using the CategoricalPredictors name-value pair argument) to 'mvmn'. Otherwise, the default distribution is 'normal'.

You must specify that at least one predictor has distribution 'kernel' to additionally specify Kernel, Standardize, Support, or Width.

Example: 'DistributionNames','mn'

Example: 'DistributionNames',{'kernel','normal','kernel'}
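
For instance, a minimal sketch for a hypothetical data set with three predictors, modeling the second predictor as normal and the others with kernel densities:

t = templateNaiveBayes('DistributionNames',{'kernel','normal','kernel'});
Mdl = fitcecoc(X,Y,'Learners',t);   % X and Y are hypothetical training data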

Kernel

Kernel smoother type, specified as the comma-separated pair consisting of 'Kernel' and a character vector or string scalar, a string array, or a cell array of character vectors.

This table summarizes the available kernel smoother types. Let $I\{u\}$ denote the indicator function.

Value            Kernel          Formula
'box'            Box (uniform)   $f(x) = 0.5\,I\{|x| \le 1\}$
'epanechnikov'   Epanechnikov    $f(x) = 0.75(1 - x^2)\,I\{|x| \le 1\}$
'normal'         Gaussian        $f(x) = \frac{1}{\sqrt{2\pi}}\exp(-0.5x^2)$
'triangle'       Triangular      $f(x) = (1 - |x|)\,I\{|x| \le 1\}$

If you specify a 1-by-P string array or cell array, with each element of the array containing any value in the table, then the software trains the classifier using the kernel smoother type in element j for feature j in X. The software ignores elements of Kernel not corresponding to a predictor whose distribution is 'kernel'.

You must specify that at least one predictor has distribution 'kernel' to additionally specify Kernel, Standardize, Support, or Width.

Example: 'Kernel',{'epanechnikov','normal'}
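
As a sketch, pair Kernel with a kernel distribution, because the value applies only to kernel-distributed predictors:

t = templateNaiveBayes('DistributionNames','kernel','Kernel','epanechnikov');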

Standardize

Since R2023b

Flag to standardize the kernel-distributed predictors, specified as a numeric or logical 0 (false) or 1 (true). This argument is valid only when the DistributionNames value contains at least one kernel distribution ("kernel").

If you set Standardize to true, then the software centers and scales each kernel-distributed predictor variable by the corresponding column mean and standard deviation. The software does not standardize predictors with nonkernel distributions, such as categorical predictors.

Example: "Standardize",true

Data Types: single | double | logical
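
For example, a minimal sketch that models all predictors with kernel densities and standardizes them before fitting:

t = templateNaiveBayes('DistributionNames','kernel','Standardize',true);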

Support

Kernel smoothing density support, specified as the comma-separated pair consisting of 'Support' and 'positive', 'unbounded', a string array, a cell array, or a numeric row vector. The software applies the kernel smoothing density to the specified region.

This table summarizes the available options for setting the kernel smoothing density region.

Value                       Description
1-by-2 numeric row vector   For example, [L,U], where L and U are the finite lower
                            and upper bounds, respectively, for the density support.
'positive'                  The density support is all positive real values.
'unbounded'                 The density support is all real values.

If you specify a 1-by-P string array or cell array, with each element containing any value in the table, then the software trains the classifier using the kernel support in element j for feature j in X. The software ignores elements of Support that do not correspond to a predictor whose distribution is 'kernel'.

You must specify that at least one predictor has distribution 'kernel' to additionally specify Kernel, Standardize, Support, or Width.

Example: 'Support',{[-10,20],'unbounded'}

Data Types: char | string | cell | double
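
As a sketch, restricting all kernel densities to positive support, which is useful for strictly positive predictors:

t = templateNaiveBayes('DistributionNames','kernel','Support','positive');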

Width

Kernel smoothing window width, specified as the comma-separated pair consisting of 'Width' and a matrix of numeric values, a numeric column vector, a numeric row vector, or a scalar.

Suppose there are K class levels and P predictors. This table summarizes the available options for setting the kernel smoothing window width.

Value                             Description
K-by-P matrix of numeric values   Element (k,j) specifies the width for predictor j in class k.
K-by-1 numeric column vector      Element k specifies the width for all predictors in class k.
1-by-P numeric row vector         Element j specifies the width in all class levels for predictor j.
Scalar                            Specifies the bandwidth for all features in all classes.

By default, the software automatically selects a width for each combination of predictor and class, using a value that is optimal for a Gaussian distribution. If you specify Width and it contains NaNs, then the software selects widths for the elements containing NaNs.

You must specify that at least one predictor has distribution 'kernel' to additionally specify Kernel, Standardize, Support, or Width.

Example: 'Width',[NaN NaN]

Data Types: double | struct
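
For instance, a sketch for two kernel-distributed predictors that fixes the width of the second and lets the software select the first (NaN requests automatic selection):

t = templateNaiveBayes('DistributionNames','kernel','Width',[NaN 0.5]);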

Output Arguments

t

Naive Bayes classification template suitable for training error-correcting output code (ECOC) multiclass models, returned as a template object. Pass t to fitcecoc to specify how to create the naive Bayes classifier for the ECOC model.

If you display t in the Command Window, then all unspecified options appear empty ([]). However, the software replaces empty options with their corresponding default values during training.

More About

Bag-of-Tokens Model

In the bag-of-tokens model, the value of predictor j is the nonnegative number of occurrences of token j in the observation. The number of categories (bins) in the multinomial model is the number of distinct tokens (number of predictors).
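
For instance, a minimal sketch of a bag-of-tokens predictor matrix (the counts and labels are hypothetical):

% Each row is an observation; X(i,j) counts occurrences of token j.
X = [2 0 1 0
     0 3 0 1
     1 1 2 0];
Y = ["a";"b";"a"];   % hypothetical class labels
t = templateNaiveBayes('DistributionNames','mn');
Mdl = fitcecoc(X,Y,'Learners',t);   % hypothetical training call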

Naive Bayes

Naive Bayes is a classification algorithm that applies density estimation to the data.

The algorithm leverages Bayes' theorem and (naively) assumes that the predictors are conditionally independent given the class. Although the assumption is usually violated in practice, naive Bayes classifiers tend to yield posterior distributions that are robust to biased class density estimates, particularly in regions where the posterior is near 0.5, the decision boundary [1].

Naive Bayes classifiers assign observations to the most probable class (in other words, the maximum a posteriori decision rule). Explicitly, the algorithm takes these steps:

  1. Estimate the densities of the predictors within each class.

  2. Model posterior probabilities according to Bayes' rule. That is, for all k = 1,...,K,

    $$\hat{P}(Y = k \mid X_1,\ldots,X_P) = \frac{\pi(Y = k)\prod_{j=1}^{P} P(X_j \mid Y = k)}{\sum_{k'=1}^{K} \pi(Y = k')\prod_{j=1}^{P} P(X_j \mid Y = k')},$$

    where:

    • $Y$ is the random variable corresponding to the class index of an observation.

    • $X_1,\ldots,X_P$ are the random predictors of an observation.

    • $\pi(Y = k)$ is the prior probability that a class index is $k$.

  3. Classify an observation by estimating the posterior probability for each class, and then assign the observation to the class yielding the maximum posterior probability.

If the predictors compose a multinomial distribution, then the posterior probability satisfies $\hat{P}(Y = k \mid X_1,\ldots,X_P) \propto \pi(Y = k)\,P_{mn}(X_1,\ldots,X_P \mid Y = k)$, where $P_{mn}(X_1,\ldots,X_P \mid Y = k)$ is the probability mass function of a multinomial distribution.
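
To make the decision rule concrete, here is a minimal MATLAB sketch with made-up priors and conditional densities for K = 2 classes and P = 2 predictors (all numbers are hypothetical):

prior = [0.5 0.5];                        % pi(Y = k) for the two classes
condDensity = [0.2 0.7; 0.6 0.1];         % P(X_j | Y = k): row k, column j
unnorm = prior .* prod(condDensity,2)';   % numerators of Bayes' rule
posterior = unnorm / sum(unnorm);         % normalize over the classes
[~,khat] = max(posterior);                % MAP class index (khat = 1 here)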

Algorithms

  • If predictor variable j has a conditional normal distribution (see the DistributionNames name-value argument), the software fits the distribution to the data by computing the class-specific weighted mean and the unbiased estimate of the weighted standard deviation. For each class k:

    • The weighted mean of predictor j is

      $$\bar{x}_{j|k} = \frac{\sum_{i:\,y_i = k} w_i x_{ij}}{\sum_{i:\,y_i = k} w_i},$$

      where wi is the weight for observation i. The software normalizes weights within a class such that they sum to the prior probability for that class.

    • The unbiased estimator of the weighted standard deviation of predictor j is

      $$s_{j|k} = \left[\frac{\sum_{i:\,y_i = k} w_i (x_{ij} - \bar{x}_{j|k})^2}{z_{1|k} - \dfrac{z_{2|k}}{z_{1|k}}}\right]^{1/2},$$

      where $z_{1|k}$ is the sum of the weights within class k and $z_{2|k}$ is the sum of the squared weights within class k. (A MATLAB sketch of these estimators appears after this list.)

  • If all predictor variables compose a conditional multinomial distribution (you specify 'DistributionNames','mn'), the software fits the distribution using the bag-of-tokens model. The software stores the probability that token j appears in class k in the property DistributionParameters{k,j}. Using additive smoothing [2], the estimated probability is

    $$P(\text{token } j \mid \text{class } k) = \frac{1 + c_{j|k}}{P + c_k},$$

    where:

    • $c_{j|k} = n_k \dfrac{\sum_{i:\,y_i = k} x_{ij} w_i}{\sum_{i:\,y_i = k} w_i}$, which is the weighted number of occurrences of token j in class k.

    • nk is the number of observations in class k.

    • wi is the weight for observation i. The software normalizes weights within a class such that they sum to the prior probability for that class.

    • $c_k = \sum_{j=1}^{P} c_{j|k}$, which is the total weighted number of occurrences of all tokens in class k.

  • If predictor variable j has a conditional multivariate multinomial distribution:

    1. The software collects a list of the unique levels, stores the sorted list in CategoricalLevels, and considers each level a bin. Each predictor/class combination is a separate, independent multinomial random variable.

    2. For each class k, the software counts instances of each categorical level using the list stored in CategoricalLevels{j}.

    3. The software stores the probability that predictor j, in class k, has level L in the property DistributionParameters{k,j}, for all levels in CategoricalLevels{j}. Using additive smoothing [2], the estimated probability is

      $$P(\text{predictor } j = L \mid \text{class } k) = \frac{1 + m_{j|k}(L)}{m_j + m_k},$$

      where:

      • $m_{j|k}(L) = n_k \dfrac{\sum_{i:\,y_i = k} I\{x_{ij} = L\} w_i}{\sum_{i:\,y_i = k} w_i}$, which is the weighted number of observations for which predictor j equals L in class k.

      • nk is the number of observations in class k.

      • $I\{x_{ij} = L\} = 1$ if $x_{ij} = L$, and 0 otherwise.

      • wi is the weight for observation i. The software normalizes weights within a class such that they sum to the prior probability for that class.

      • mj is the number of distinct levels in predictor j.

      • mk is the weighted number of observations in class k.
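
The following minimal MATLAB sketch illustrates these estimators with hypothetical inputs; it restates the formulas above for one class k and is not the toolbox implementation. The vectors x and w, the count matrix Xk, and the categorical values xjk are all made up for illustration.

% Normal: weighted mean and unbiased weighted standard deviation
x = [1.2; 0.7; 1.9];                 % hypothetical class-k values of predictor j
w = [0.2; 0.5; 0.3];                 % hypothetical normalized observation weights
xbar = sum(w.*x)/sum(w);             % weighted mean
z1 = sum(w);                         % sum of weights within the class
z2 = sum(w.^2);                      % sum of squared weights within the class
s = sqrt(sum(w.*(x - xbar).^2)/(z1 - z2/z1));

% Multinomial ('mn'): additively smoothed token probabilities
Xk = [2 0 1; 0 3 1];                 % hypothetical class-k token counts (nk-by-P)
wk = [0.5; 0.5];                     % hypothetical weights for these rows
nk = size(Xk,1); P = size(Xk,2);
cjk = nk*(wk'*Xk)/sum(wk);           % weighted token counts c_{j|k}, 1-by-P
ck = sum(cjk);                       % total weighted count c_k
Ptoken = (1 + cjk)/(P + ck);         % P(token j | class k)

% Multivariate multinomial ('mvmn'): smoothed level probabilities
xjk = ["low";"high";"low"];          % hypothetical class-k categorical values
levels = unique(xjk);                % sorted unique levels (the bins)
wk2 = [1;1;1]/3;                     % hypothetical uniform weights
nk2 = numel(xjk); mj = numel(levels);
mk = nk2;                            % weighted count; equals nk2 with uniform weights
mjkL = nk2*sum(wk2.*(xjk == levels'),1)/sum(wk2);   % weighted counts per level
Plevel = (1 + mjkL)/(mj + mk);       % P(predictor j = L | class k)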

References

[1] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Second Edition. NY: Springer, 2008.

[2] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. NY: Cambridge University Press, 2008.

Version History

Introduced in R2014b
