Implement Visual SLAM in MATLAB

Visual simultaneous localization and mapping (vSLAM) refers to the process of calculating the position and orientation of a camera, with respect to its surroundings, while simultaneously mapping the environment. The process uses only visual inputs from the camera. Applications for visual SLAM include augmented reality, robotics, and autonomous driving. For a general description of why SLAM matters and how it works for different applications, see What is SLAM?

Visual SLAM algorithms are broadly classified into two categories, depending on how they estimate the camera motion. The indirect, feature-based method uses feature points of images to minimize the reprojection error. The direct method uses the overall brightness of images to minimize the photometric error. Computer Vision Toolbox™ provides functions for each step of the feature-based visual SLAM workflow, and also provides the monovslam object, which includes the full workflow. The workflow described in this overview, along with its corresponding functions, consists of map initialization, tracking, local mapping, loop detection, and drift correction.

Note

The workflow described in this overview applies to images taken by a pinhole camera. To use the visual SLAM workflow with images taken by a fisheye camera, convert the fisheye camera into a virtual pinhole camera using the undistortFisheyeImage function.

Terms Used in Visual SLAM

Visual SLAM literature uses these common terms:

Key Frames — A subset of video frames that contain cues for localization and tracking. Two consecutive key frames usually indicate a large visual change caused by a camera movement.
Map Points — A list of 3-D world points that represent the map of the environment reconstructed from the key frames.
Covisibility Graph — A graph with key frames as nodes. Two key frames are connected by an edge if they share common map points. The weight of an edge is the number of shared map points.
Recognition Database — A database that stores the visual word-to-image mapping based on the input bag of features. Determine whether a place has been visited in the past by searching the database for an image that is visually similar to the query image.
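
As a minimal sketch of the covisibility graph definition above, this code builds the graph from a made-up visibility matrix using only base MATLAB. The matrix isVisible and its values are hypothetical and serve only to illustrate how edge weights count shared map points.

```matlab
% Hypothetical visibility matrix: rows are key frames, columns are map points.
% isVisible(k,p) is true when key frame k observes map point p.
isVisible = logical([ ...
    1 1 1 1 0 0 0 0;    % key frame 1
    0 1 1 1 1 1 0 0;    % key frame 2
    0 0 0 1 1 1 1 0;    % key frame 3
    0 0 0 0 0 0 1 1]);  % key frame 4

% Entry (i,j) counts the map points shared by key frames i and j.
numShared = double(isVisible) * double(isVisible)';
numShared = numShared - diag(diag(numShared));   % ignore self-connections

% Build an undirected weighted graph. Zero entries produce no edge.
covisGraph = graph(numShared);

% Keep only strongly connected pairs, here those sharing at least 3 points.
strongEdges = covisGraph.Edges(covisGraph.Edges.Weight >= 3, :);
disp(strongEdges)
```

In this example, key frames 1 and 4 share no map points, so the graph has no edge between them.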

Typical Feature-based Visual SLAM Workflow

To construct a feature-based visual SLAM pipeline on a sequence of images, follow these steps:

Initialize Map — Initialize the map of 3-D points from two image frames. Compute the 3-D points and relative camera pose by using triangulation based on 2-D feature correspondences.
Track Features — For each new frame, estimate the camera pose by matching features in the current frame to features in the last key frame.
Create Local Map — If you identify the current frame as a key frame, create a new 3-D map of points. Use bundle adjustment to refine the camera pose and 3-D points.
Detect Loops — Detect loops for each key frame by comparing the current frame to all previous key frames using the bag-of-features approach.
Correct Drift — Optimize the pose graph to correct the drift in the camera poses of all the key frames.

The figure illustrates a typical feature-based visual SLAM workflow. It also shows the points at which data is stored or retrieved from objects that manage the data.

Flow chart diagram showing map initialization, tracking, local mapping, loop detection, and drift correction.

Key Frame and Map Data Management

Use the view set, point set, and transformation objects to manage key frames and map data.

Use the imageviewset object to manage data associated with the odometry and mapping process. The object contains data as a set of views and pairwise connections between views. The object can also be used to build and update a pose graph.
- Each view consists of the absolute camera pose and the feature points extracted from the image. Each view, with its unique identifier (view ID), within the view set forms a node of the pose graph.
- Each connection stores information that links one view to another view. The connection includes the indices of matched features between the views, the relative transformation between the connected views, and the uncertainty in computing the measurement. Each connection forms an edge in the pose graph.
- Use a rigidtform3d object input with imageviewset to store the absolute camera poses and relative camera poses of odometry edges. Use a simtform3d object input with imageviewset to store the relative camera poses of loop-closure edges.
Use the worldpointset object to store correspondences between 3-D map points and 2-D image points across camera views.
- The WorldPoints property of worldpointset stores the 3-D locations of map points.
- The Correspondence property of worldpointset stores the view IDs of the key frames that observe the map points.
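
The following sketch shows one way these objects might fit together for two key frames. All feature locations, poses, and map points are made-up values, and the code assumes a release in which rigidtform3d is available.

```matlab
% Two hypothetical key frames with made-up feature locations, in pixels.
points1 = [100 120; 200 150; 310 240];
points2 = [ 90 118; 188 149; 295 238];

% Absolute camera poses: view 1 at the origin, view 2 shifted along x.
pose1 = rigidtform3d;                     % identity pose
pose2 = rigidtform3d(eye(3), [1 0 0]);    % rotation matrix and translation

% Store the two views in the view set.
vSet = imageviewset;
vSet = addView(vSet, 1, pose1, Points=points1);
vSet = addView(vSet, 2, pose2, Points=points2);

% Store the odometry edge: pose of view 2 relative to view 1.
relPose = rigidtform3d(invert(pose1).A * pose2.A);
matches = [1 1; 2 2; 3 3];                % matched feature index pairs
vSet = addConnection(vSet, 1, 2, relPose, Matches=matches);

% Store three made-up map points and link them to the observing features.
xyzPoints = [0 0 5; 1 0.5 6; 2 1 7];
wpSet = worldpointset;
[wpSet, pointIdx] = addWorldPoints(wpSet, xyzPoints);
wpSet = addCorrespondences(wpSet, 1, pointIdx, matches(:,1));
wpSet = addCorrespondences(wpSet, 2, pointIdx, matches(:,2));

% Query which map points view 2 observes, and through which features.
[idxInView2, featIdxInView2] = findWorldPointsInView(wpSet, 2);
```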

Map Initialization

To initialize mapping, you must match features between two images, estimate the relative camera pose, and triangulate initial 3-D world points. This workflow commonly uses Speeded-Up Robust Features (SURF) and Oriented FAST and Rotated BRIEF (ORB) point features. The map initialization workflow consists of detecting, extracting, and matching features, then estimating the relative camera pose, finding the 3-D locations of the matched features, and refining the initial map. Finally, store the resulting key frames and map points in an image view set and a world point set, respectively.

Workflow	Function	Description
1. Detect	`detectSURFFeatures`	Detect SURF features and return a `SURFPoints` object.
	`detectORBFeatures`	Detect ORB features and return an `ORBPoints` object.
	`detectSIFTFeatures`	Detect SIFT features and return a `SIFTPoints` object.
2. Extract	`extractFeatures`	Extract feature vectors and their corresponding locations in a binary or intensity image.
3. Match	`matchFeatures`	Obtain the indices of the matching features between two feature sets.
4. Estimate relative camera pose from matched feature points	`estgeotform2d`	Compute a homography from matching point pairs.
	`estimateFundamentalMatrix`	Estimate the fundamental matrix from matching point pairs.
	`estrelpose`	Compute the relative camera poses, represented as a `rigidtform3d` object, based on a homography or a fundamental matrix. The location can only be computed up to scale, so the distance between two cameras is set to `1`.
5. Find 3-D locations of the matched feature points	`triangulate`	Find the 3-D locations of matching pairs of undistorted image points.
6. Refine initial map	`bundleAdjustment`	Refine 3-D map points and camera poses to minimize reprojection errors.
7. Manage data for initial map and key frames	`addView`	Add the two views formed by the feature points and their absolute poses to the `imageviewset` object.
	`addConnection`	Add the odometry edge defined by the connection between successive key views, formed by the relative pose transformation between the cameras, to the `imageviewset` object.
	`addWorldPoints`	Add the initial map points to the `worldpointset` object.
	`addCorrespondences`	Add the 3-D to 2-D projection correspondences between the key frames and the map points to the `worldpointset` object.
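
Steps 4 and 5 of the table can be sketched on synthetic data. This example is only an illustration under stated assumptions: the camera intrinsics, scene points, and camera poses are made up, noise-free projections stand in for the detect, extract, and match steps, and the deterministic normalized eight-point method replaces the robust estimators you would typically use on real matches. The baseline is chosen to have unit length so that the up-to-scale reconstruction can be compared directly with the synthetic ground truth.

```matlab
% Hypothetical pinhole camera: focal length, principal point, image size.
intrinsics = cameraIntrinsics([800 800], [320 240], [480 640]);

% Synthetic scene: 50 random 3-D points in front of both cameras.
rng(0);
xyzTrue = [4*rand(50,2) - 2, 6 + 4*rand(50,1)];

% Camera 1 defines the world frame. Camera 2 is rotated 5 degrees about
% the y-axis and translated by a unit-length baseline along x.
pose1 = rigidtform3d;
pose2 = rigidtform3d([0 5 0], [1 0 0]);   % Euler angles (degrees), translation

% Noise-free projections stand in for matched feature points.
pts1 = world2img(xyzTrue, pose2extr(pose1), intrinsics);
pts2 = world2img(xyzTrue, pose2extr(pose2), intrinsics);

% Step 4: estimate the fundamental matrix, then the relative pose.
% The recovered translation has unit length.
F = estimateFundamentalMatrix(pts1, pts2, Method="Norm8Point");
relPose = estrelpose(F, intrinsics, pts1, pts2);

% Step 5: triangulate with camera 1 at the origin.
camMatrix1 = cameraProjection(intrinsics, pose2extr(pose1));
camMatrix2 = cameraProjection(intrinsics, pose2extr(relPose));
[xyzEst, reprojErr] = triangulate(pts1, pts2, camMatrix1, camMatrix2);
```

On real image pairs, the matched points contain outliers, so a robust method and an inlier check on the triangulated points are usually needed before storing the map.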

Tracking

The tracking workflow uses every frame to determine when to insert a new key frame. Use these steps and functions for the tracking workflow.

Workflow	Function	Description
Match extracted features	`matchFeatures`	Match extracted features from the current frame with features in the last key frame that have known 3-D locations.
Estimate camera pose	`estworldpose`	Estimate the current camera pose.
Project map points	`world2img`	Project the map points observed by the last key frame into the current frame.
Search for feature correspondences	`matchFeaturesInRadius`	Search for feature correspondences within spatial constraints.
Refine camera pose	`bundleAdjustmentMotion`	Refine the camera pose with 3-D to 2-D correspondence by performing a motion-only bundle adjustment.
Identify local map points	`findWorldPointsInView` `findWorldPointsInTracks`	Identify points in the view and points that correspond to point tracks.
Search for more feature correspondences	`matchFeaturesInRadius`	Search for more feature correspondences in the current frame, which contains projected local map points.
Refine camera pose	`bundleAdjustmentMotion`	Refine the camera pose with 3-D to 2-D correspondence by performing a motion-only bundle adjustment.
Store new key frame	`addView` `addConnection`	If you determine that the current frame is a new key frame, add it and its connections to covisible key frames to the `imageviewset`.
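
As a rough sketch of the pose estimation and projection steps, this code estimates a camera pose from synthetic 3-D to 2-D correspondences and then projects the map points back into the frame. The intrinsics, map points, and true pose are made-up values, and the noise-free projections stand in for features matched against the last key frame.

```matlab
% Hypothetical camera and synthetic map points with known 3-D locations.
intrinsics = cameraIntrinsics([800 800], [320 240], [480 640]);
rng(0);
xyzPoints = [4*rand(40,2) - 2, 6 + 4*rand(40,1)];

% Made-up true pose of the current frame.
truePose = rigidtform3d([0 3 0], [0.2 0 0.1]);

% 2-D observations of the map points in the current frame.
imagePoints = world2img(xyzPoints, pose2extr(truePose), intrinsics);

% Estimate the current camera pose from the 3-D to 2-D correspondences.
[estPose, inlierIdx] = estworldpose(imagePoints, xyzPoints, intrinsics);

% Project the map points into the current frame using the estimated pose.
% In a tracking loop, these projections serve as search centers for
% matchFeaturesInRadius.
projected = world2img(xyzPoints, pose2extr(estPose), intrinsics);
reprojErr = vecnorm(projected - imagePoints, 2, 2);
```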

Feature matching is critical in the tracking workflow. Use the matchFeaturesInRadius function to return more putative matches when an estimation of the positions of matched feature points is available. The two match feature functions used in the workflow are:

matchFeatures — Returns the indices of the matching features in the two input feature sets.
matchFeaturesInRadius — Returns the indices of the matching features, which satisfy spatial constraints, in the two input feature sets.

To get a greater number of matched feature pairs, increase the values of the MatchThreshold and MaxRatio name-value arguments of the matchFeatures and matchFeaturesInRadius functions. You can discard the outlier pairs after performing bundle adjustment in the local mapping step.
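
The effect of these arguments can be sketched with synthetic descriptors. The feature vectors, point locations, noise levels, and the 5-pixel search radius in this example are all made-up values chosen for illustration.

```matlab
% Hypothetical descriptors: features2 are noisy copies of features1.
rng(0);
numFeatures = 200;
features1 = rand(numFeatures, 64);
features2 = features1 + 0.15*randn(numFeatures, 64);

% Strict settings compared with looser settings that admit more pairs.
strictPairs = matchFeatures(features1, features2, ...
    MatchThreshold=1, MaxRatio=0.6);
loosePairs = matchFeatures(features1, features2, ...
    MatchThreshold=50, MaxRatio=0.9);

% Made-up locations of features2, and predicted locations of features1
% that lie close to them, as if obtained by projecting map points.
points2 = [640*rand(numFeatures,1), 480*rand(numFeatures,1)];
centerPoints = points2 + randn(numFeatures, 2);

% Match only against candidates within 5 pixels of each predicted location.
radius = 5;
radiusPairs = matchFeaturesInRadius(features1, features2, points2, ...
    centerPoints, radius, MatchThreshold=50, MaxRatio=0.9);

fprintf("strict: %d, loose: %d, in radius: %d\n", ...
    size(strictPairs,1), size(loosePairs,1), size(radiusPairs,1));
```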

Local Mapping

Perform local mapping for every key frame. Follow these steps to create new map points.

Workflow	Function	Description
Connect key frames	`connectedViews`	Find the covisible key frames of the current key frame.
Search for matches in connected key frames	`matchFeatures`	For each unmatched feature point in the current key frame, use the `matchFeatures` function to search for a match with other unmatched points in the covisible key frames.
Compute location for new matches	`triangulate`	Compute the 3-D locations of the matched feature points.
Store new map points	`addWorldPoints`	Add the new map points to the `worldpointset` object.
Store 3-D to 2-D correspondences	`addCorrespondences`	Add new 3-D to 2-D correspondences to the `worldpointset` object.
Update odometry connection	`updateConnection`	Update the connection between the current key frame and its covisible frames with more feature matches.
Store representative view of 3-D points	`updateRepresentativeView`	Update representative view ID and corresponding feature index.
Store distance limits and viewing direction of 3-D points	`updateLimitsAndDirection`	Update distance limits and mean viewing direction.
Refine pose	`bundleAdjustment`	Refine the pose of the current key frame, the poses of covisible key frames, and all the map points observed in these key frames. For improved performance, include only strongly connected covisible key frames in the refinement process. Use the `MinNumMatches` name-value argument of the `connectedViews` function to select strongly connected covisible key frames.
Remove outliers	`removeWorldPoints`	Remove outlier map points with large reprojection errors from the `worldpointset` object. The associated 3-D to 2-D correspondences are removed automatically.

This table compares the camera poses, map points, and number of cameras for each of the bundle adjustment functions used in 3-D reconstruction.

Function	Camera Poses	Map Points	Number of Cameras
`bundleAdjustment`	Optimized	Optimized	Multiple
`bundleAdjustmentMotion`	Optimized	Fixed	One
`bundleAdjustmentStructure`	Fixed	Optimized	Multiple
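
The motion-only case in the table, with one camera pose optimized and the map points held fixed, can be sketched on synthetic data. The intrinsics, map points, and poses are made-up values. The tolerances are tightened from their defaults here only because the synthetic data is noise-free.

```matlab
% Hypothetical camera, fixed map points, and made-up true camera pose.
intrinsics = cameraIntrinsics([800 800], [320 240], [480 640]);
rng(0);
xyzPoints = [4*rand(40,2) - 2, 6 + 4*rand(40,1)];
truePose = rigidtform3d([0 3 0], [0.2 0 0.1]);

% Noise-free observations of the map points from the true pose.
imagePoints = world2img(xyzPoints, pose2extr(truePose), intrinsics);

% Perturbed initial pose, standing in for a coarse tracking estimate.
initPose = rigidtform3d([0.5 2 0.5], [0.25 0.05 0.05]);

% Refine only the camera pose. The 3-D points are not modified.
refinedPose = bundleAdjustmentMotion(xyzPoints, imagePoints, initPose, ...
    intrinsics, AbsoluteTolerance=1e-9, RelativeTolerance=1e-12);

% Compare the translation error before and after refinement.
errBefore = norm(initPose.Translation - truePose.Translation);
errAfter = norm(refinedPose.Translation - truePose.Translation);
```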

Loop Detection

Due to an accumulation of errors, using visual odometry alone can lead to drift. These errors can result in severe inaccuracies over long distances. Using graph-based SLAM helps to correct the drift. To do this, detect loop closures by finding a previously visited location. A common approach is to use this bag-of-features workflow:

Workflow	Function	Description
Construct bag of visual words	`bagOfFeatures`	Construct a bag of visual words for place recognition.
Create recognition database	`indexImages`	Create a recognition database, `invertedImageIndex`, to map visual words to images.
Identify loop closure candidates	`retrieveImages`	Search for images that are similar to the current key frame. Identify consecutive images as loop closure candidates if they are similar to the current frame. Otherwise, add the current key frame to the recognition database.
Compute relative camera pose for loop closure candidates	`estgeotform3d`	Compute the relative camera pose between the candidate key frame and the current key frame, for each loop closure candidate.
Close loop	`addConnection`	Close the loop by adding a loop closure edge with the relative camera pose to the `imageviewset` object.

Drift Correction

The imageviewset object internally updates the pose graph as views and connections are added. To minimize drift, perform pose graph optimization by using the optimizePoses function, once sufficient loop closures are added. The optimizePoses function returns an imageviewset object with the optimized absolute pose transformations for each view.

You can use the createPoseGraph function to return the pose graph as a MATLAB® digraph object. You can use graph algorithms in MATLAB to inspect, view, or modify the pose graph. Use the optimizePoseGraph function from Navigation Toolbox™ to optimize the modified pose graph, and then use the updateView function to update the camera poses in the view set.
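
As a simplified sketch of drift correction, this code builds a four-view square trajectory whose stored absolute poses contain made-up drift, adds three odometry edges and one loop closure edge, and then optimizes the poses. For illustration, the relative pose measurements are taken from the synthetic ground truth, which is an assumption that does not hold for real data. The placeholder points and matches exist only so that each view and connection carries feature data.

```matlab
% Synthetic ground truth: a unit square, turning 90 degrees at each corner.
angles = [0 0 0; 0 0 90; 0 0 180; 0 0 270];   % Euler angles, in degrees
trueLoc = [0 0 0; 1 0 0; 1 1 0; 0 1 0];

% Made-up drift that grows along the trajectory.
drift = [0 0 0; 0.05 0.02 0; 0.12 0.06 0; 0.20 0.10 0];

% Placeholder feature data (hypothetical).
dummyPoints = [10 10; 20 20; 30 30];
dummyMatches = [1 1; 2 2; 3 3];

% Add the views with drifted absolute poses.
vSet = imageviewset;
for k = 1:4
    driftedPose = rigidtform3d(angles(k,:), trueLoc(k,:) + drift(k,:));
    vSet = addView(vSet, k, driftedPose, Points=dummyPoints);
end

% Odometry edges 1-2, 2-3, 3-4, and a loop closure edge between 1 and 4.
edgeList = [1 2; 2 3; 3 4; 1 4];
for e = 1:size(edgeList, 1)
    i = edgeList(e,1);
    j = edgeList(e,2);
    poseI = rigidtform3d(angles(i,:), trueLoc(i,:));
    poseJ = rigidtform3d(angles(j,:), trueLoc(j,:));
    relPose = rigidtform3d(invert(poseI).A * poseJ.A);
    vSet = addConnection(vSet, i, j, relPose, Matches=dummyMatches);
end

% Inspect the pose graph, then optimize the absolute poses.
G = createPoseGraph(vSet);
vSetOptim = optimizePoses(vSet);

% Collect the optimized camera locations.
optimLoc = zeros(4, 3);
for k = 1:4
    optimLoc(k,:) = vSetOptim.Views.AbsolutePose(k).Translation;
end
```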

Visualization

To develop the visual SLAM system, you can use the following visualization functions.

Function	Description
`imshow`	Display an image
`showMatchedFeatures`	Display matched feature points in two images
`plot`	Plot image view set views and connections
`plotCamera`	Plot a camera in 3-D coordinates
`pcshow`	Plot 3-D point cloud
`pcplayer`	Visualize streaming 3-D point cloud data

References

[1] Hartley, Richard, and Andrew Zisserman. Multiple View Geometry in Computer Vision. 2nd ed. Cambridge: Cambridge University Press, 2003.

[2] Fraundorfer, Friedrich, and Davide Scaramuzza. “Visual Odometry: Part II: Matching, Robustness, Optimization, and Applications.” IEEE Robotics & Automation Magazine 19, no. 2 (June 2012): 78–90. https://doi.org/10.1109/MRA.2012.2182810.

[3] Mur-Artal, Raul, J. M. M. Montiel, and Juan D. Tardos. “ORB-SLAM: A Versatile and Accurate Monocular SLAM System.” IEEE Transactions on Robotics 31, no. 5 (October 2015): 1147–63. https://doi.org/10.1109/TRO.2015.2463671.

[4] Kümmerle, Rainer, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. “G²o: A General Framework for Graph Optimization.” In 2011 IEEE International Conference on Robotics and Automation (ICRA 2011), Shanghai, 9–13 May 2011, 3607–13. New York: Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICRA.2011.5979949.