incrementalLearner
Syntax
IncrementalForest = incrementalLearner(forest)
IncrementalForest = incrementalLearner(forest,Name=Value)
Description
IncrementalForest = incrementalLearner(forest) returns a robust random cut forest (RRCF) model IncrementalForest for anomaly detection, initialized using the parameters provided in the RRCF model forest. Because its property values reflect the knowledge gained from forest, IncrementalForest can detect anomalies given new observations, and it is warm, meaning that the incremental fit function can return scores and detect anomalies.
IncrementalForest = incrementalLearner(forest,Name=Value) specifies additional options using one or more name-value arguments. For example, ScoreWarmupPeriod=500 specifies to process 500 observations before score computation and anomaly detection.
Examples
Perform Incremental RRCF Anomaly Detection with Categorical Predictor Data
Train an incremental robust random cut forest (RRCF) model and perform anomaly detection on a data set with categorical predictors.
Load Data
Load census1994.mat. The data set consists of demographic data from the US Census Bureau.
load census1994.mat
incrementalRobustRandomCutForest does not use observations with missing values. Remove missing values in the data to reduce memory consumption and speed up training. Keep only the first 1000 observations in the training data set and the first 2000 observations in the test data set.
adultdata = rmmissing(adultdata);
adulttest = rmmissing(adulttest);
Xtrain = adultdata(1:1000,:);
Xstream = adulttest(1:2000,:);
Train RRCF Model
Fit an RRCF model to the training data. Specify an anomaly contamination fraction of 0.001.
rng(0,"twister"); % For reproducibility
TTforest = rrcforest(Xtrain,ContaminationFraction=0.001);
details(TTforest)

  RobustRandomCutForest with properties:

        CollusiveDisplacement: 'maximal'
                  NumLearners: 100
    NumObservationsPerLearner: 256
                           Mu: []
                        Sigma: []
        CategoricalPredictors: [2 4 6 7 8 9 10 14 15]
        ContaminationFraction: 1.0000e-03
               ScoreThreshold: 55.5745
               PredictorNames: {'age' 'workClass' 'fnlwgt' 'education' 'education_num' 'marital_status' 'occupation' 'relationship' 'race' 'sex' 'capital_gain' 'capital_loss' 'hours_per_week' 'native_country' 'salary'}

TTforest is a RobustRandomCutForest model object representing a traditionally trained RRCF model. The software identifies nine variables in the data as categorical predictors because they contain string arrays.
Convert Trained Model
Convert the traditionally trained RRCF model to an RRCF model for incremental learning.
Incrementalforest = incrementalLearner(TTforest);
Incrementalforest is an incrementalRobustRandomCutForest model object that is ready for incremental learning and anomaly detection.
Fit Incremental Model and Detect Anomalies
Perform incremental learning on the Xstream data by using the fit function. To simulate a data stream, fit the model in chunks of 100 observations at a time. At each iteration:
Process 100 observations.
Overwrite the previous incremental model with a new one fitted to the incoming observations.
Store medianscore, the median score value of the data chunk, to see how it evolves during incremental learning.
Store threshold, the score threshold value for anomalies, to see how it evolves during incremental learning.
Store numAnom, the number of detected anomalies in the chunk, to see how it evolves during incremental learning.
n = numel(Xstream(:,1));
numObsPerChunk = 100;
nchunk = floor(n/numObsPerChunk);
medianscore = zeros(nchunk,1);
numAnom = zeros(nchunk,1);
threshold = zeros(nchunk,1);

% Incremental fitting
for j = 1:nchunk
    ibegin = min(n,numObsPerChunk*(j-1) + 1);
    iend = min(n,numObsPerChunk*j);
    idx = ibegin:iend;
    [Incrementalforest,tf,scores] = fit(Incrementalforest,Xstream(idx,:));
    medianscore(j) = median(scores);
    numAnom(j) = sum(tf);
    threshold(j) = Incrementalforest.ScoreThreshold;
end
Analyze Incremental Model During Training
To see how the median score, score threshold, and number of detected anomalies per chunk evolve during training, plot them on separate tiles.
tiledlayout(3,1);
nexttile
plot(medianscore)
ylabel("Median Score")
xlabel("Iteration")
xlim([0 nchunk])
nexttile
plot(threshold)
ylabel("Score Threshold")
xlabel("Iteration")
xlim([0 nchunk])
nexttile
plot(numAnom,"+")
ylabel("Anomalies")
xlabel("Iteration")
xlim([0 nchunk])
ylim([0 max(numAnom)+0.2])
totalanomalies=sum(numAnom)
totalanomalies = 1
anomfrac= totalanomalies/n
anomfrac = 5.0000e-04
fit updates the model and returns the observation scores and the indices of observations with scores above the score threshold value as anomalies. A high score value indicates an anomaly, and a low value indicates a normal observation. The median score fluctuates between approximately 230 and 270. The score threshold rises from a value of 260 after the first iteration and steadily approaches 285 after 12 iterations. The software detected 1 anomaly in the Xstream data, yielding a total contamination fraction of 0.0005.
Incrementally Train RRCF Model on Shingled Data
Train a robust random cut forest (RRCF) model on a simulated, noisy, periodic shingled time series containing no anomalies by using rrcforest. Convert the trained model to an incremental learner object, and then incrementally fit the time series and detect anomalies.
Create Simulated Data Stream
Create a simulated data stream of observations representing a noisy sinusoid signal.
rng(0,"twister"); % For reproducibility
period = 100;
n = 2001+period;
sigma = 0.04;
a = linspace(1,n,n)';
b = sin(2*pi*(a-1)/period)+sigma*randn(n,1);
Introduce an anomalous region into the data stream. Plot the data stream portion that contains the anomalous region, and circle the anomalous data points.
c = 2*(sin(2*pi*(a-35)/period)+sigma*randn(n,1));
b(1150:1170) = c(1150:1170);
scatter(a,b,".")
xlim([900,1200])
xlabel("Observation")
hold on
scatter(a(1150:1170),b(1150:1170),"r")
hold off
Convert the single-featured data set b into a multi-featured data set by shingling [1] with a shingle size equal to the period of the signal. The $i$th shingled observation is a vector of $k$ features with values $b_i$, $b_{i+1}$, ..., $b_{i+k-1}$, where $k$ is the shingle size.
X = [];
shingleSize = period;
for i = 1:n-shingleSize
    X = [X;b(i:i+shingleSize-1)'];
end
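The shingling step can also be written without growing X inside a loop. The following is a minimal sketch on a toy series, not the code used to produce the results on this page; the names numShingles and idx and the toy values of b and shingleSize are assumptions made for illustration. It keeps the same row count as the loop above (the number of observations minus the shingle size).

```matlab
% Toy single-feature series and shingle size (hypothetical values)
b = (1:10)';
shingleSize = 4;

% Same number of rows as the loop 1:n-shingleSize
numShingles = numel(b) - shingleSize;

% Row i of idx holds the indices i, i+1, ..., i+shingleSize-1
idx = (1:numShingles)' + (0:shingleSize-1);

% Indexing a vector with a matrix returns a result shaped like the index matrix
X = b(idx);
```

Preallocating the index matrix this way avoids repeated reallocation of X, which matters mainly for long series.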
Train Model and Perform Incremental Anomaly Detection
Fit a robust random cut forest model to the first 1000 shingled observations, specifying a contamination fraction of 0. Convert the model to an incrementalRobustRandomCutForest model object. Specify to keep the 100 most recent observations relevant for anomaly detection.
Mdl = rrcforest(X(1:1000,:),ContaminationFraction=0);
IncrementalMdl = incrementalLearner(Mdl,NumObservationsToKeep=100);
To simulate a data stream, process the full shingled data set in chunks of 100 observations at a time. At each iteration:
Process 100 observations.
Calculate scores and detect anomalies using the isanomaly function.
Store anomIdx, the indices of shingled observations marked as anomalies.
If the chunk contains fewer than three anomalies, fit and update the previous incremental model.
n = numel(X(:,1));
numObsPerChunk = 100;
nchunk = floor(n/numObsPerChunk);
anomIdx = [];
allscores = [];

% Incremental fitting
rng("default"); % For reproducibility
for j = 1:nchunk
    ibegin = min(n,numObsPerChunk*(j-1) + 1);
    iend = min(n,numObsPerChunk*j);
    idx = ibegin:iend;
    [isanom,scores] = isanomaly(IncrementalMdl,X(idx,:));
    allscores = [allscores;scores];
    anomIdx = [anomIdx;find(isanom)+ibegin-1];
    if (sum(isanom) < 3)
        IncrementalMdl = fit(IncrementalMdl,X(idx,:));
    end
end
Analyze Incremental Model During Training
At each iteration, the software calculates a score value for each observation in the data chunk. A low score value indicates a normal observation, and a high score value indicates an anomaly. Plot the anomaly score for the observations in the vicinity of the anomaly. Circle the scores of shingles that the software returns as anomalous.
figure
scatter(a(1:2000),allscores,".")
hold on
scatter(a(anomIdx),allscores(anomIdx),20,"or")
xlim([900,1200])
xlabel("Shingle")
ylabel("Score")
hold off
Because the introduced anomalous region begins at observation 1150, and the shingle size is 100, shingle 1051 is the first to show a high anomaly score. Some shingles between 1050 and 1170 have scores lying just below the anomaly score threshold, due to the noise in the sinusoidal signal. The shingle size affects the performance of the model by defining how many subsequent consecutive data points in the original time series the software uses to calculate the anomaly score for each shingle.
Plot the unshingled data and highlight the introduced anomalous region. Circle the observation number of the first element in each shingle that the software returns as anomalous.
figure
xlim([900,1200])
ylim([-1.5 2])
rectangle(Position=[1150 -1.5 20 3.5],FaceColor=[0.9 0.9 0.9], ...
    EdgeColor=[0.9 0.9 0.9])
hold on
scatter(a,b,".")
scatter(a(anomIdx),b(anomIdx),20,"or")
xlabel("Observation")
hold off
Input Arguments
forest
— Traditionally trained RRCF model for anomaly detection
RobustRandomCutForest model object
Traditionally trained RRCF model for anomaly detection, specified as a RobustRandomCutForest model object returned by rrcforest.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: incrementalLearner(forest,ObservationRemoval="timedecaying",ScoreWarmupPeriod=500) sets the observation removal method to "timedecaying" and specifies to process 500 observations before the incremental fit function returns scores and detects anomalies.
NumObservationsToKeep
— Number of most recent observations relevant for anomaly detection
forest.NumObservationsPerLearner (default) | nonnegative integer
Number of the most recent observations relevant for anomaly detection, specified as a nonnegative integer.
Example: NumObservationsToKeep=250
Data Types: single | double
ObservationRemoval
— Observation removal method
"oldest"
(default) | "timedecaying" | "random"
Observation removal method, specified as "oldest", "timedecaying", or "random". When the robust random cut trees reach their capacity, the software removes old observations to accommodate the most recent data.
Value | Description |
---|---|
"oldest" | Oldest observations are removed first. |
"timedecaying" | Observations are removed randomly in a weighted fashion. Older observations have a higher probability of being removed first. |
"random" | Observations are removed in random order. |
Data Types: string | char
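To make the three removal policies concrete, here is a toy sketch that picks one observation to drop from a small buffer under each policy. It is an illustration only and does not reproduce the toolbox internals: the buffer, the ages vector, and in particular the weighting used for the time-decaying case (probability proportional to age) are assumptions.

```matlab
rng(0,"twister");            % For reproducibility of the toy draws
ages = (5:-1:1)';            % Hypothetical buffer: element 1 is the oldest (age 5)
n = numel(ages);

% "oldest": always drop the observation with the largest age
[~,dropOldest] = max(ages);

% "random": every observation is equally likely to be dropped
dropRandom = randi(n);

% "timedecaying": random drop, older observations more likely
% (assumed weighting: probability proportional to age)
w = ages/sum(ages);
edges = [0; cumsum(w)];
edges(end) = 1;                      % Guard against round-off in the last edge
dropDecay = discretize(rand,edges);  % Index of the bin that contains the draw
```

In this sketch the oldest element is removed with probability 5/15 under the weighted policy, compared with 1/5 under uniform random removal.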
Options
— Options for computing in parallel and setting random streams
structure
Options for computing in parallel and setting random streams, specified as a structure. Create the Options structure using statset. This table lists the option fields and their values.
Field Name | Value | Default |
---|---|---|
UseParallel | Set this value to true to run computations in parallel. | false |
UseSubstreams | Set this value to true to run computations in a reproducible manner. To compute reproducibly, set Streams to a type that allows substreams: "mlfg6331_64" or "mrg32k3a". | false |
Streams | Specify this value as a RandStream object or cell array of such objects. Use a single object except when the UseParallel value is true and the UseSubstreams value is false. In that case, use a cell array that has the same size as the parallel pool. | If you do not specify Streams, then incrementalLearner uses the default stream or streams. |
Note
You need Parallel Computing Toolbox™ to run computations in parallel.
Example: Options=statset(UseParallel=true,UseSubstreams=true,Streams=RandStream("mlfg6331_64"))
Data Types: struct
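As a hedged sketch of how the Options structure might be passed, the following trains a small forest on synthetic data and requests reproducible parallel computation during conversion. The matrix X and its size are assumptions made for illustration, and the code assumes that Parallel Computing Toolbox is installed.

```matlab
rng(0,"twister");        % For reproducibility of the toy data
X = randn(500,3);        % Hypothetical training data, not from this page
forest = rrcforest(X);   % Traditionally trained RRCF model

% One stream of a type that supports substreams
s = RandStream("mlfg6331_64");
opts = statset(UseParallel=true,UseSubstreams=true,Streams=s);

% Convert to an incremental model using the parallel options
IncrementalForest = incrementalLearner(forest,Options=opts);
```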
ScoreWarmupPeriod
— Warm-up period before score computation and anomaly detection
0 (default) | nonnegative integer
Warm-up period before score computation and anomaly detection, specified as a nonnegative integer. This option specifies the number of observations used by the incremental fit function to train the model and estimate the score threshold.
Note
When processing observations during the score warm-up period, the software ignores observations that contain missing values for all predictors.
Example: ScoreWarmupPeriod=200
Data Types: single | double
ScoreWindowSize
— Running window size used to estimate score threshold
1000 (default) | positive integer
Running window size used to estimate the score threshold (ScoreThreshold), specified as a positive integer. The default ScoreWindowSize value is 1000.
If ScoreWindowSize is greater than the number of observations in the training data, the software determines ScoreThreshold by subsampling from the training data. Otherwise, ScoreThreshold is set to forest.ScoreThreshold.
Example: ScoreWindowSize=100
Data Types: single | double
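The following sketch shows one way to check the threshold rule described above. The training data X is synthetic and its size (500 observations) is an assumption chosen so that the window size of 100 does not exceed the training set size. Under the stated rule, the two displayed values are expected to match. The code only displays them so that you can compare.

```matlab
rng(0,"twister");        % For reproducibility of the toy data
X = randn(500,3);        % Hypothetical training data (500 observations)
forest = rrcforest(X,ContaminationFraction=0.01);

% Window size (100) is not greater than the training set size (500)
IncrementalForest = incrementalLearner(forest,ScoreWindowSize=100);

% Display the traditional and incremental thresholds side by side
disp([forest.ScoreThreshold IncrementalForest.ScoreThreshold])
```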
Output Arguments
IncrementalForest
— RRCF model for incremental anomaly detection
incrementalRobustRandomCutForest model object
RRCF model for incremental anomaly detection, returned as an incrementalRobustRandomCutForest model object.
To initialize IncrementalForest for incremental anomaly detection, incrementalLearner passes the values of the following properties of forest to the corresponding properties of IncrementalForest.
Property | Description |
---|---|
CategoricalPredictors | Categorical predictor indices, a vector of positive integers |
ContaminationFraction | Fraction of anomalies in the training data, a numeric scalar in the range [0,1] |
Mu | Predictor means of the training data, a numeric vector |
NumLearners | Number of robust random cut trees, a positive integer scalar |
NumObservationsPerLearner | Number of observations for each robust random cut tree, a nonnegative integer |
PredictorNames | Predictor variable names, a cell array of character vectors |
ScoreThreshold | Threshold score for anomalies in the training data, a numeric scalar in the range [0,Inf). If ScoreWindowSize is greater than the number of observations used to train forest, then incrementalLearner approximates ScoreThreshold by subsampling from the training data. Otherwise, incrementalLearner passes forest.ScoreThreshold to IncrementalForest.ScoreThreshold. |
Sigma | Predictor standard deviations of the training data, a numeric vector |
More About
Incremental Learning for Anomaly Detection
Incremental learning, or online learning, is a branch of machine learning concerned with processing incoming data from a data stream, possibly given little to no knowledge of the distribution of the predictor variables, aspects of the prediction or objective function (including tuning parameter values), or whether the observations contain anomalies. Incremental learning differs from traditional machine learning, where enough data is available to fit to a model, perform cross-validation to tune hyperparameters, and infer the predictor distribution.
Anomaly detection is used to identify unexpected events and departures from normal behavior. In situations where the full data set is not immediately available, or new data is arriving, you can use incremental learning for anomaly detection to incrementally train a model so it adjusts to the characteristics of the incoming data.
Given incoming observations, an incremental learning model for anomaly detection does the following:
Computes anomaly scores
Updates the anomaly score threshold
Detects data points above the score threshold as anomalies
Fits the model to the incoming observations
For more information, see Incremental Anomaly Detection with MATLAB.
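The four steps above can be sketched for a single incoming chunk as follows. This is a minimal illustration on synthetic data, not a recommended workflow: Xtrain, Xnew, and the planted outlier in the last row are assumptions, and whether that row is flagged depends on the fitted model.

```matlab
rng(0,"twister");                 % For reproducibility of the toy data
Xtrain = randn(500,2);            % Hypothetical training data
forest = rrcforest(Xtrain,ContaminationFraction=0.01);
Mdl = incrementalLearner(forest); % Warm incremental model

% One incoming chunk of 100 observations; the last row is a planted outlier
Xnew = [randn(99,2); 10 10];

% A single call to fit scores the chunk, flags observations above the
% score threshold, and updates the model with the chunk
[Mdl,tf,scores] = fit(Mdl,Xnew);

% Inspect the updated threshold and the number of flagged observations
disp(Mdl.ScoreThreshold)
disp(sum(tf))
```

To score a chunk without updating the model, the isanomaly function can be called instead of fit, as in the shingled data example on this page.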
References
[1] Guha, Sudipto, N. Mishra, G. Roy, and O. Schrijvers. "Robust Random Cut Forest Based Anomaly Detection on Streams," Proceedings of The 33rd International Conference on Machine Learning 48 (June 2016): 2712–21.
[2] Bartos, Matthew D., A. Mullapudi, and S. C. Troutman. "rrcf: Implementation of the Robust Random Cut Forest Algorithm for Anomaly Detection on Streams." Journal of Open Source Software 4, no. 35 (2019): 1336.
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To run in parallel, specify the Options name-value argument in the call to this function and set the UseParallel field of the options structure to true using statset:
Options=statset(UseParallel=true)
For more information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).
Version History
Introduced in R2023b