Multiple Object Tracking

Tracking is the process of locating a moving object or multiple objects over time in a video stream. Unlike object detection, which is the process of locating an object of interest in a single frame, tracking associates detections of an object across multiple frames.

Tracking multiple objects requires detection, prediction, and data association.

Detection — Detect objects of interest in a video frame.
Prediction — Predict the object locations in the next frame.
Data association — Use the predicted locations to associate detections across frames to form tracks.

Detection

Selecting the right approach for detecting objects of interest depends on what you want to track and whether the camera is stationary.

Detect Objects Using Stationary Camera

To detect objects in motion with a stationary camera, you can perform background subtraction using the vision.ForegroundDetector System object™. The background subtraction approach works efficiently, but requires the camera to be stationary.
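The idea behind background subtraction can be sketched outside the toolbox. The following Python example is a minimal illustration that assumes a simple running-average background model and a fixed intensity threshold. The function names, the `learning_rate` and `threshold` parameters, and the tiny frames are hypothetical. `vision.ForegroundDetector` itself is based on Gaussian mixture models, which this sketch does not implement.

```python
def update_background(background, frame, learning_rate=0.05):
    """Blend the new frame into a running-average background model."""
    return [
        [(1 - learning_rate) * b + learning_rate * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30):
    """Mark pixels whose intensity differs from the background by more than threshold."""
    return [
        [abs(f - b) > threshold for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Tiny 3-by-4 grayscale frames: an empty scene, then one bright moving pixel.
empty_scene = [[10.0] * 4 for _ in range(3)]
frame_with_object = [row[:] for row in empty_scene]
frame_with_object[1][2] = 200.0

background = empty_scene
mask = foreground_mask(background, frame_with_object)
background = update_background(background, frame_with_object)
```

Because the model compares each frame against a per-pixel background estimate, any camera motion changes every pixel at once, which is why this family of methods requires a stationary camera.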

Detect Objects Using Moving Camera

To detect objects in motion with a moving camera, you can use a sliding-window detection approach. This approach typically works more slowly than the background subtraction approach. To detect and track a specific category of object, use the System objects or functions described in this table.

Select a Detection Algorithm

Type of Object to Track	Camera	Functionality
Anything that moves	Stationary	`vision.ForegroundDetector`
Faces, eyes, nose, mouth, upper body	Stationary, Moving	`vision.CascadeObjectDetector`
Pedestrians	Stationary, Moving	`vision.PeopleDetector`, `yolov4ObjectDetector`, `yolov3ObjectDetector`, `yolov2ObjectDetector`. You can filter the YOLO-based detector results to keep only the `"people"` class. For more information, see the Detect People Using YOLO v4 Object Detector example.
Custom object category	Stationary, Moving	See Choose an Object Detector for a list of object detectors and their respective benefits.

Prediction

To track an object over time, you must predict its location in the next frame. The simplest method of prediction assumes that the object remains near its last known location. In other words, the previous detection serves as the next prediction. This method is especially effective at high frame rates. However, using this prediction method can fail when objects do not move at constant speeds, or when the frame rate is low relative to the speed of the object in motion.

A more sophisticated method of prediction is to use the previously observed motion of the object. The Kalman filter (vision.KalmanFilter) predicts the next location of an object by assuming that it moves according to a motion model, such as constant velocity or constant acceleration. The Kalman filter also takes into account process noise and measurement noise. Process noise is the deviation of the actual motion of the object from the motion model. Measurement noise is the detection error.
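To make the predict and correct cycle concrete, here is a minimal Python sketch of a one-dimensional constant-velocity Kalman filter. It is an illustration only, not the vision.KalmanFilter implementation. The class name, the parameter names, and the assumption that process noise is added equally to the position and velocity variances are all hypothetical choices made for this example.

```python
class ConstantVelocityKalman1D:
    """Tracks state [position, velocity] from position-only measurements."""

    def __init__(self, position, estimate_error=100.0,
                 process_noise=1.0, measurement_noise=10.0):
        self.pos = float(position)
        self.vel = 0.0
        # State covariance P, stored as a 2-by-2 nested list.
        self.p = [[estimate_error, 0.0], [0.0, estimate_error]]
        self.q = process_noise       # assumed: Q = q * identity
        self.r = measurement_noise   # variance of the detection error

    def predict(self, dt=1.0):
        """Advance the state with the motion model F = [[1, dt], [0, 1]]."""
        self.pos += self.vel * dt
        (p00, p01), (p10, p11) = self.p
        # P = F * P * F' + Q
        self.p = [
            [p00 + dt * (p01 + p10) + dt * dt * p11 + self.q, p01 + dt * p11],
            [p10 + dt * p11, p11 + self.q],
        ]
        return self.pos

    def correct(self, measured_position):
        """Blend the prediction with a measurement, using H = [1, 0]."""
        (p00, p01), (p10, p11) = self.p
        s = p00 + self.r                 # innovation variance
        k0, k1 = p00 / s, p10 / s        # Kalman gain
        residual = measured_position - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P = (I - K * H) * P
        self.p = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]
        return self.pos


# An object that moves 5 units per frame, observed without noise.
kf = ConstantVelocityKalman1D(position=0.0)
for z in range(5, 50, 5):
    kf.predict()
    kf.correct(z)
predicted_next = kf.predict()  # should land close to the true next position, 50
```

The filter starts with zero velocity and a large initial estimate error, so the first few corrections rely mostly on the measurements. As the velocity estimate settles, the prediction for the next frame moves ahead of the last detection instead of staying on top of it.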

To more easily configure a Kalman filter, use the configureKalmanFilter function. This function sets up the filter for tracking a physical object moving with constant velocity or constant acceleration within a Cartesian coordinate system. It assumes that the noise statistics are the same along all dimensions. To configure a Kalman filter with differing assumptions, you must construct the vision.KalmanFilter object directly.

The Kalman filter assumes that the motion and measurement models are linear, and that the uncertainty in each model follows a Gaussian distribution. When these assumptions do not hold, when the object maneuvers, or when the measurements are incomplete, you must use another tracking filter. The Sensor Fusion and Tracking Toolbox™ provides additional tracking filters. For more details, see Introduction to Estimation Filters (Sensor Fusion and Tracking Toolbox).

Data Association

Data association is the process of associating detections corresponding to the same physical object across frames. The temporal history of a particular object consists of multiple detections, called a track. A track representation can include the entire history of the previous locations of the object. Alternatively, it can consist of only the last known location and current velocity of the object.

Detection to Track Cost Functions

To match a detection to a track, you must establish criteria for evaluating the matches. You can establish these criteria by defining a cost function. The higher the cost of matching a detection to a track, the less likely that the detection belongs to the track. A simple cost function can be defined in terms of the degree of overlap between the bounding boxes of the predicted and detected objects, with greater overlap giving a lower cost. The Tracking Pedestrians from a Moving Car example implements this type of cost function by using the bboxOverlapRatio function. You can implement a more sophisticated cost function, such as one that accounts for the uncertainty of the prediction, by using the distance function of the vision.KalmanFilter object. You can also implement a custom cost function that incorporates information about the size and appearance of the object.
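An overlap-based cost can be sketched as follows. This Python example is a stand-alone illustration rather than the bboxOverlapRatio function. It assumes boxes in `[x, y, width, height]` form and defines the cost as one minus the intersection-over-union ratio, which is one common choice. The function names are hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def overlap_cost_matrix(predicted_boxes, detected_boxes):
    """Rows correspond to predicted track boxes, columns to detections."""
    return [
        [1.0 - overlap_ratio(p, d) for d in detected_boxes]
        for p in predicted_boxes
    ]


predicted = [[0, 0, 10, 10]]
detected = [[0, 0, 10, 10], [5, 0, 10, 10], [100, 100, 10, 10]]
costs = overlap_cost_matrix(predicted, detected)
# costs[0] holds 0 for the identical box, 2/3 for the half-shifted box,
# and 1 for the box that does not overlap at all.
```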

Elimination of Unlikely Matches

Gating is a method of eliminating highly unlikely matches from consideration, for example by imposing a threshold on the cost function. A detection cannot be matched to a track if the cost of the match exceeds the threshold. When the cost is the distance between the prediction and the detection, this threshold effectively defines a circular gating region around each prediction, within which a detection must fall to be considered a match. An alternative gating technique is to make the gating region large enough to include the k-nearest neighbors of the prediction.
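Threshold gating can be illustrated with a short Python sketch. It assumes a Euclidean distance cost between predicted and detected centroids and marks gated-out pairs with an infinite cost. The function name and the `gate_radius` parameter are hypothetical.

```python
import math


def gated_distance_costs(predictions, detections, gate_radius):
    """Distance cost matrix in which matches outside the gate cost infinity."""
    costs = []
    for px, py in predictions:
        row = []
        for dx, dy in detections:
            distance = math.hypot(dx - px, dy - py)
            # A detection outside the circular gate cannot match this track.
            row.append(distance if distance <= gate_radius else math.inf)
        costs.append(row)
    return costs


gated = gated_distance_costs(
    predictions=[(0.0, 0.0)],
    detections=[(3.0, 4.0), (30.0, 40.0)],
    gate_radius=10.0,
)
# The first detection is 5 units away and stays in the gate.
# The second is 50 units away and is excluded.
```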

Assign Detections to Track

Data association reduces to a minimum weight bipartite matching problem, a well-studied problem in graph theory. A bipartite graph represents tracks and detections as vertices. It also represents the cost of matching a detection and a track as a weighted edge between the corresponding vertices.

The assignDetectionsToTracks function implements the Munkres variant of the Hungarian bipartite matching algorithm. Its input is the cost matrix, where the rows correspond to tracks and the columns correspond to detections. Each entry contains the cost of assigning a particular detection to a particular track. You can implement gating by setting the cost of impossible matches to infinity.
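The shape of the assignment problem can be shown with a small Python sketch. This example deliberately uses an exhaustive search instead of the Munkres algorithm, so it is practical only for a handful of tracks and detections. It assumes that every unassigned track and every unassigned detection adds a fixed penalty to the total cost. The function name and the `cost_of_non_assignment` parameter are hypothetical and are not the assignDetectionsToTracks interface.

```python
import math


def assign_detections_brute_force(cost_matrix, cost_of_non_assignment):
    """Return (pairs, unassigned_tracks, unassigned_detections) of minimum total cost."""
    n_tracks = len(cost_matrix)
    n_dets = len(cost_matrix[0]) if cost_matrix else 0
    best = {"cost": math.inf, "pairs": []}

    def search(track, used, pairs, cost):
        if track == n_tracks:
            # Every detection left unmatched also pays the penalty.
            total = cost + cost_of_non_assignment * (n_dets - len(used))
            if total < best["cost"]:
                best["cost"], best["pairs"] = total, list(pairs)
            return
        # Option 1: leave this track unassigned.
        search(track + 1, used, pairs, cost + cost_of_non_assignment)
        # Option 2: match it to any free detection with a finite cost.
        for det in range(n_dets):
            c = cost_matrix[track][det]
            if det not in used and math.isfinite(c):
                search(track + 1, used | {det}, pairs + [(track, det)], cost + c)

    search(0, frozenset(), [], 0.0)
    pairs = best["pairs"]
    matched_tracks = {t for t, _ in pairs}
    matched_dets = {d for _, d in pairs}
    unassigned_tracks = [t for t in range(n_tracks) if t not in matched_tracks]
    unassigned_dets = [d for d in range(n_dets) if d not in matched_dets]
    return pairs, unassigned_tracks, unassigned_dets


# Two tracks (rows), three detections (columns). The third detection is gated out.
cost_matrix = [
    [1.0, 9.0, math.inf],
    [8.0, 2.0, math.inf],
]
pairs, lost_tracks, new_detections = assign_detections_brute_force(cost_matrix, 5.0)
```

In this example, the search pairs each track with its cheaper detection and reports the third detection as unassigned, because its infinite cost rules out any match.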

Track Management

Data association must account for the fact that new objects can appear in the field of view and that tracked objects can leave it. As such, for any given frame, you might need to create some new tracks or discard some existing tracks. The assignDetectionsToTracks function returns the indices of unassigned tracks and unassigned detections in addition to the matched pairs.

One way of handling unmatched detections is to create a new track from each of them. Alternatively, you can create new tracks from only those unmatched detections greater than a certain size, or from detections that have certain locations or appearances. For example, if the scene has a single entry point, such as a doorway, then you can allow only unmatched detections located near the entry point to begin new tracks, and discard all other unmatched detections as noise.

You can handle unmatched tracks by deleting any track that remains unmatched for a certain number of frames. Alternatively, you can delete an unmatched track when its last known location is near an exit point.
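The track creation and deletion rules described in this section can be sketched in Python as follows. The track representation (a dictionary with `id`, `location`, and `invisible_frames` fields), the function name, and the `max_invisible_frames` parameter are assumptions made for this illustration. The sketch applies the simplest policies: every unmatched detection starts a track, and a track is deleted after it goes unmatched for too many consecutive frames.

```python
def update_tracks(tracks, detections, pairs, unassigned_tracks,
                  unassigned_detections, next_id, max_invisible_frames=3):
    """Apply one frame of assignment results to the list of tracks."""
    # Matched tracks take the new location and reset their invisibility count.
    for track_idx, det_idx in pairs:
        tracks[track_idx]["location"] = detections[det_idx]
        tracks[track_idx]["invisible_frames"] = 0

    # Unmatched tracks age by one frame.
    for track_idx in unassigned_tracks:
        tracks[track_idx]["invisible_frames"] += 1

    # Delete tracks that have been unmatched for too long.
    survivors = [t for t in tracks if t["invisible_frames"] <= max_invisible_frames]

    # Start a new track from each unmatched detection.
    for det_idx in unassigned_detections:
        survivors.append(
            {"id": next_id, "location": detections[det_idx], "invisible_frames": 0}
        )
        next_id += 1
    return survivors, next_id


tracks = [
    {"id": 1, "location": (0, 0), "invisible_frames": 0},
    {"id": 2, "location": (50, 50), "invisible_frames": 3},
]
detections = [(1, 1), (90, 90)]
tracks, next_id = update_tracks(
    tracks, detections,
    pairs=[(0, 0)], unassigned_tracks=[1], unassigned_detections=[1],
    next_id=3,
)
# Track 1 moves to (1, 1), track 2 is deleted, and a new track 3 starts at (90, 90).
```

An entry-point or exit-point policy would fit into the same structure by adding a location check before a track is created or deleted.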