Visual SLAM Overview

Visual simultaneous localization and mapping (vSLAM) refers to the process of calculating the position and orientation of a camera, with respect to its surroundings, while simultaneously mapping the environment. The process uses only visual inputs from the camera. Applications for visual SLAM include augmented reality, robotics, and autonomous driving.

Visual SLAM algorithms are broadly classified into two categories, depending on how they estimate the camera motion. The indirect, feature-based method uses feature points of images to minimize the reprojection error. The direct method uses the overall brightness of images to minimize the photometric error. Computer Vision Toolbox™ provides functions for performing feature-based visual SLAM. The workflow consists of map initialization, tracking, local mapping, loop detection, and drift correction.
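
As a rough illustration of the two error terms named above, they can be written as follows. The notation is an assumption made for this sketch and does not come from the toolbox documentation: (R, t) maps a 3-D point into the current camera frame, \pi is the pinhole projection (including the camera intrinsics), \mathbf{X}_i is a map point with observed image location \mathbf{x}_i, and \pi^{-1}(p, d_p) back-projects pixel p of the reference image using its depth d_p.

```latex
% Indirect (feature-based) method: reprojection error over matched features
E_{\text{reproj}}(R, t) = \sum_{i} \left\| \mathbf{x}_i - \pi\!\left(R\,\mathbf{X}_i + t\right) \right\|^{2}

% Direct method: photometric error over the pixels \Omega of the reference image
E_{\text{photo}}(R, t) = \sum_{p \in \Omega} \left( I_{\text{cur}}\!\left(\pi\!\left(R\,\pi^{-1}(p, d_p) + t\right)\right) - I_{\text{ref}}(p) \right)^{2}
```

The first sum compares point locations in pixels, while the second compares image intensities, which is why the direct method does not need feature detection or matching.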

Note

The workflow described in this overview applies to images taken by a pinhole camera. To use the visual SLAM workflow with images taken by a fisheye camera, convert the fisheye camera into a virtual pinhole camera using the undistortFisheyeImage function.

Terms Used in Visual SLAM

Visual SLAM literature uses these common terms:

  • Key Frames — A subset of video frames that contain cues for localization and tracking. Two consecutive key frames usually indicate a large visual change caused by a camera movement.

  • Map Points — A list of 3-D world points that represent the map of the environment reconstructed from the key frames.

  • Covisibility Graph — A graph of key frames as nodes. Two key frames are connected by an edge if they share common map points. The weight of an edge is the number of shared map points.

  • Recognition Database — A database that stores the visual word-to-image mapping based on the input bag of features. Determine whether a place has been visited in the past by searching the database for an image that is visually similar to the query image.
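
The covisibility graph definition above can be sketched in a few lines. This is a language-neutral illustration in Python, not toolbox code: the function names, the `observations` dictionary, and the `min_shared` parameter are all hypothetical.

```python
from itertools import combinations


def build_covisibility_graph(observations):
    """Build edges between key frames that share map points.

    observations maps a key frame ID to the set of map point IDs it observes.
    Returns a dict {(id_a, id_b): weight} with id_a < id_b, where weight is
    the number of shared map points. Pairs with no shared points get no edge.
    """
    graph = {}
    for id_a, id_b in combinations(sorted(observations), 2):
        shared = len(observations[id_a] & observations[id_b])
        if shared > 0:
            graph[(id_a, id_b)] = shared
    return graph


def covisible_key_frames(graph, key_frame, min_shared=1):
    """Return sorted IDs of key frames sharing at least min_shared points."""
    neighbors = []
    for (id_a, id_b), weight in graph.items():
        if weight >= min_shared and key_frame in (id_a, id_b):
            neighbors.append(id_b if id_a == key_frame else id_a)
    return sorted(neighbors)


# Hypothetical observations: key frame ID -> observed map point IDs.
observations = {
    1: {10, 11, 12, 13},
    2: {11, 12, 13, 14},
    3: {13, 14, 15},
    4: {20, 21},
}
graph = build_covisibility_graph(observations)
all_neighbors = covisible_key_frames(graph, 1)
strong_neighbors = covisible_key_frames(graph, 1, min_shared=2)
```

Raising `min_shared` keeps only strongly connected key frames, which is the same idea as restricting refinement to key frames with many shared map points.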

Typical Feature-based Visual SLAM Workflow

To construct a feature-based visual SLAM pipeline on a sequence of images, follow these steps:

  1. Initialize Map — Initialize the map of 3-D points from two image frames. Compute the 3-D points and relative camera pose by using triangulation based on 2-D feature correspondences.

  2. Track Features — For each new frame, estimate the camera pose by matching features in the current frame to features in the last key frame.

  3. Create Local Map — If you identify the current frame as a key frame, create a new 3-D map of points. Use bundle adjustment to refine the camera pose and 3-D points.

  4. Detect Loops — Detect loops for each key frame by comparing the current frame to all previous key frames using the bag-of-features approach.

  5. Correct Drift — Optimize the pose graph to correct the drift in the camera poses of all the key frames.

The figure illustrates a typical feature-based visual SLAM workflow. It also shows the points at which data is stored or retrieved from objects that manage the data.

Flow chart diagram showing map initialization, tracking, local mapping, loop detection, and drift correction.

Key Frame and Map Data Management

Use the view set, point set, and transformation objects to manage key frames and map data.

  • Use the imageviewset object to manage data associated with the odometry and mapping process. The object contains data as a set of views and pairwise connections between views. The object can also be used to build and update a pose graph.

    • Each view consists of the absolute camera pose and the feature points extracted from the image. Each view, with its unique identifier (view ID), within the view set forms a node of the pose graph.

    • Each connection stores information that links one view to another view. The connection includes the indices of matched features between the views, the relative transformation between the connected views, and the uncertainty in computing the measurement. Each connection forms an edge in the pose graph.

    • Use a rigid3d object input with imageviewset to store the absolute camera poses and relative camera poses of odometry edges. Use an affine3d object input with imageviewset to store the relative camera poses of loop-closure edges.

  • Use the worldpointset object to store correspondences between 3-D map points and 2-D image points across camera views.

    • The WorldPoints property of worldpointset stores the 3-D locations of map points.

    • The Correspondences property of worldpointset stores the view IDs of the key frames that observe the map points.

Map Initialization

To initialize mapping, you must match features between two images, estimate the relative camera pose, and triangulate initial 3-D world points. This workflow commonly uses Speeded-Up Robust Features (SURF) and Oriented FAST and Rotated BRIEF (ORB) point features. The map initialization workflow consists of detecting, extracting, and matching features, then estimating the relative camera pose, finding the 3-D locations of the matched features, and refining the initial map. Finally, store the resulting key frames and map points in an image view set and a world point set, respectively.

| Workflow | Function | Description |
| --- | --- | --- |
| Detect | detectSURFFeatures | Detect SURF features and return a SURFPoints object. |
| | detectORBFeatures | Detect ORB features and return an ORBPoints object. |
| Extract | extractFeatures | Extract feature vectors and their corresponding locations in a binary or intensity image. |
| Match | matchFeatures | Obtain the indices of the matching features between two feature sets. |
| Estimate relative camera pose from matched feature points | estimateGeometricTransform2D | Compute a homography from matching point pairs. |
| | estimateFundamentalMatrix | Estimate the fundamental matrix from matching point pairs. |
| | relativeCameraPose | Compute the relative camera poses, represented as a rigid3d object, based on a homography or a fundamental matrix. The location can only be computed up to scale, so the distance between two cameras is set to 1. |
| Find 3-D locations of the matched feature points | triangulate | Find the 3-D locations of matching pairs of undistorted image points. |
| Refine initial map | bundleAdjustment | Refine 3-D map points and camera poses that minimize reprojection errors. |
| Manage data for initial map and key frames | addView | Add the two views formed by the feature points and their absolute poses to the imageviewset object. |
| | addConnection | Add the odometry edge defined by the connection between successive key views, formed by the relative pose transformation between the cameras, to the imageviewset object. |
| | addWorldPoints | Add the initial map points to the worldpointset object. |
| | addCorrespondences | Add the 3-D to 2-D projection correspondences between the key frames and the map points to the worldpointset object. |
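
To make the triangulation step and the scale ambiguity concrete, here is a minimal sketch in Python. It uses the midpoint method (the point halfway between the closest points of two viewing rays), chosen here for brevity; it is not a statement about how the triangulate function is implemented. The function names and sample values are hypothetical, and both cameras are assumed to share the same orientation so that ray directions can be compared directly.

```python
def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(center1, ray1, center2, ray2):
    """Return the midpoint of the closest points on two viewing rays.

    Each ray starts at a camera center and points along the direction in
    which that camera observed the feature.
    """
    offset = [p - q for p, q in zip(center1, center2)]
    a, b, c = dot(ray1, ray1), dot(ray1, ray2), dot(ray2, ray2)
    d, e = dot(ray1, offset), dot(ray2, offset)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    on_ray1 = [p + s * r for p, r in zip(center1, ray1)]
    on_ray2 = [p + t * r for p, r in zip(center2, ray2)]
    return [(u + v) / 2 for u, v in zip(on_ray1, on_ray2)]


# Viewing directions of one feature as seen from two cameras (hypothetical).
ray1 = [0.5, 0.2, 4.0]
ray2 = [-0.5, 0.2, 4.0]

# Baseline fixed to 1, as in the map initialization step.
point_unit = triangulate_midpoint([0.0, 0.0, 0.0], ray1, [1.0, 0.0, 0.0], ray2)

# Same image measurements, baseline assumed to be 2: the map scales with it.
point_double = triangulate_midpoint([0.0, 0.0, 0.0], ray1, [2.0, 0.0, 0.0], ray2)
```

The same pair of rays is consistent with any baseline length, and the reconstructed point simply scales with it. This is why a monocular initialization has to fix the distance between the two cameras to an arbitrary value such as 1.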

Tracking

The tracking workflow uses every frame to determine when to insert a new key frame. Use these steps and functions for the tracking workflow.

| Workflow | Function | Description |
| --- | --- | --- |
| Match extracted features | matchFeatures | Match extracted features from the current frame with features in the last key frame that have known 3-D locations. |
| Estimate camera pose | estimateWorldCameraPose | Estimate the current camera pose. |
| Project map points | worldToImage | Project the map points observed by the last key frame into the current frame. |
| Search for feature correspondences | matchFeaturesInRadius | Search for feature correspondences within spatial constraints. |
| Refine camera pose | bundleAdjustmentMotion | Refine the camera pose with 3-D to 2-D correspondence by performing a motion-only bundle adjustment. |
| Identify local map points | findWorldPointsInView, findWorldPointsInTracks | Identify points in the view and points that correspond to point tracks. |
| Search for more feature correspondences | matchFeaturesInRadius | Search for more feature correspondences in the current frame, which contains projected local map points. |
| Refine camera pose | bundleAdjustmentMotion | Refine the camera pose with 3-D to 2-D correspondence by performing a motion-only bundle adjustment. |
| Store new key frame | addView, addConnection | If you determine that the current frame is a new key frame, add it and its connections to covisible key frames to the imageviewset. |

Feature matching is critical in the tracking workflow. Use the matchFeaturesInRadius function to return more putative matches when an estimate of the positions of the matched feature points is available. The two feature matching functions used in the workflow are:

  • matchFeatures — Returns the indices of the matching features in the two input feature sets.

  • matchFeaturesInRadius — Returns the indices of the matching features, which satisfy spatial constraints, in the two input feature sets.

To get a greater number of matched feature pairs, increase the values for the MatchThreshold and MaxRatio name-value arguments of the matchFeatures and matchFeaturesInRadius functions. The outlier pairs can be discarded after performing bundle adjustment in the local mapping step.
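
The effect of a looser ratio and of a spatial constraint can be sketched as follows. This is an illustrative Python stand-in, not the toolbox implementation: the function names, the `max_ratio` and `radius` parameters, and the sample descriptors are hypothetical, and the sketch models only a nearest-neighbor ratio test (no match threshold and no uniqueness check).

```python
import math


def best_match(descriptor, candidates, max_ratio):
    """Return the index of the nearest candidate that passes the ratio test.

    candidates is a list of (index, descriptor) pairs. A match is rejected
    as ambiguous when the nearest distance exceeds max_ratio times the
    second-nearest distance. Returns None when there is no valid match.
    """
    ranked = sorted((math.dist(descriptor, other), j) for j, other in candidates)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][0] > max_ratio * ranked[1][0]:
        return None
    return ranked[0][1]


def match_features(desc_a, desc_b, max_ratio):
    """Match every descriptor in desc_a against all descriptors in desc_b."""
    candidates = list(enumerate(desc_b))
    pairs = []
    for i, descriptor in enumerate(desc_a):
        j = best_match(descriptor, candidates, max_ratio)
        if j is not None:
            pairs.append((i, j))
    return pairs


def match_features_in_radius(desc_a, desc_b, points_b, predicted_a, radius, max_ratio):
    """Match only against features located within radius of a predicted position."""
    pairs = []
    for i, descriptor in enumerate(desc_a):
        nearby = [(j, desc_b[j]) for j, point in enumerate(points_b)
                  if math.dist(point, predicted_a[i]) <= radius]
        j = best_match(descriptor, nearby, max_ratio)
        if j is not None:
            pairs.append((i, j))
    return pairs


# Hypothetical 2-element descriptors; feature 2 of set A is ambiguous.
desc_a = [(0.0, 0.0), (1.0, 1.0), (0.22, 0.0)]
desc_b = [(0.1, 0.0), (0.3, 0.0), (1.0, 1.1)]
points_b = [(10.0, 10.0), (200.0, 50.0), (60.0, 60.0)]      # image locations of set B
predicted_a = [(11.0, 10.0), (61.0, 59.0), (198.0, 52.0)]   # predicted locations for set A

strict = match_features(desc_a, desc_b, max_ratio=0.6)
loose = match_features(desc_a, desc_b, max_ratio=0.8)
constrained = match_features_in_radius(
    desc_a, desc_b, points_b, predicted_a, radius=5.0, max_ratio=0.6)
```

With the stricter ratio, the ambiguous feature is dropped. Loosening the ratio recovers it at the risk of admitting outliers, while the radius constraint recovers it by removing the competing candidate from consideration.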

Local Mapping

Perform local mapping for every key frame. Follow these steps to create new map points.

| Workflow | Function | Description |
| --- | --- | --- |
| Connect key frames | connectedViews | Find the covisible key frames of the current key frame. |
| Search for matches in connected key frames | matchFeatures | For each unmatched feature point in the current key frame, use the matchFeatures function to search for a match with other unmatched points in the covisible key frames. |
| Compute location for new matches | triangulate | Compute the 3-D locations of the matched feature points. |
| Store new map points | addWorldPoints | Add the new map points to the worldpointset object. |
| Store 3-D to 2-D correspondences | addCorrespondences | Add new 3-D to 2-D correspondences to the worldpointset object. |
| Update odometry connection | updateConnection | Update the connection between the current key frame and its covisible frames with more feature matches. |
| Refine pose | bundleAdjustment | Refine the pose of the current key frame, the poses of covisible key frames, and all the map points observed in these key frames. For improved performance, only include strongly connected, covisible key frames in the refinement process. Use the minNumMatches argument of the connectedViews function to select strongly-connected covisible key frames. |
| Remove outliers | removeWorldPoints | Remove outlier map points with large reprojection errors from the worldpointset object. The associated 3-D to 2-D correspondences are removed automatically. |
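
The outlier removal step can be illustrated with a short Python sketch. It assumes a simple pinhole model with a single focal length and map points already expressed in the camera frame; the function names, the intrinsics, the sample points, and the 2-pixel threshold are hypothetical and are not taken from the toolbox.

```python
import math


def project(point, focal, principal):
    """Project a 3-D point in camera coordinates to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + principal[0], focal * y / z + principal[1])


def reprojection_error(point, observed, focal, principal):
    """Pixel distance between the projected point and its observation."""
    u, v = project(point, focal, principal)
    return math.hypot(u - observed[0], v - observed[1])


def remove_outliers(map_points, observations, focal, principal, max_error):
    """Keep only map points whose reprojection error is at most max_error."""
    return {
        point_id: point
        for point_id, point in map_points.items()
        if reprojection_error(point, observations[point_id], focal, principal) <= max_error
    }


# Hypothetical intrinsics, map points (camera frame), and observed pixels.
focal, principal = 500.0, (320.0, 240.0)
map_points = {1: (0.0, 0.0, 5.0), 2: (1.0, 0.0, 5.0), 3: (0.0, 1.0, 4.0)}
observations = {1: (320.5, 240.0), 2: (421.0, 241.0), 3: (340.0, 365.0)}

inliers = remove_outliers(map_points, observations, focal, principal, max_error=2.0)
```

Point 3 projects 20 pixels away from where it was observed, so it is dropped, while the other two points stay within the threshold.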

The following table compares the bundle adjustment functions used in 3-D reconstruction: whether each function optimizes or fixes the camera poses and the map points, and how many cameras it handles.

| Function | Camera Poses | Map Points | Number of Cameras |
| --- | --- | --- | --- |
| bundleAdjustment | Optimized | Optimized | Multiple |
| bundleAdjustmentMotion | Optimized | Fixed | One |
| bundleAdjustmentStructure | Fixed | Optimized | Multiple |

Loop Detection

Due to an accumulation of errors, using visual odometry alone can lead to drift. These errors can result in severe inaccuracies over long distances. Using graph-based SLAM helps to correct the drift. To do this, detect loop closures by finding a previously visited location. A common approach is to use this bag-of-features workflow:

| Workflow | Function | Description |
| --- | --- | --- |
| Construct bag of visual words | bagOfFeatures | Construct a bag of visual words for place recognition. |
| Create recognition database | indexImages | Create a recognition database, invertedImageIndex, to map visual words to images. |
| Identify loop closure candidates | retrieveImages | Search for images that are similar to the current key frame. Identify consecutive images as loop closure candidates if they are similar to the current frame. Otherwise, add the current key frame to the recognition database. |
| Compute relative camera pose for loop closure candidates | estimateGeometricTransform3D | Compute the relative camera pose between the candidate key frame and the current key frame, for each loop closure candidate. |
| Close loop | addConnection | Close the loop by adding a loop closure edge with the relative camera pose to the imageviewset object. |
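
The idea behind the recognition database can be sketched in Python as an inverted index from visual words to images. This is a conceptual illustration only: the class, its methods, the cosine-similarity scoring, and the 0.8 score threshold are hypothetical and do not describe how invertedImageIndex or retrieveImages work internally. The sketch also omits the check that candidates are consecutive images.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(hist_a, hist_b):
    """Cosine similarity between two visual word histograms (Counters)."""
    dot = sum(count * hist_b[word] for word, count in hist_a.items())
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class RecognitionDatabase:
    """Toy inverted index: visual word -> IDs of images containing it."""

    def __init__(self):
        self.inverted_index = defaultdict(set)
        self.histograms = {}

    def add_image(self, image_id, words):
        histogram = Counter(words)
        self.histograms[image_id] = histogram
        for word in histogram:
            self.inverted_index[word].add(image_id)

    def retrieve(self, words, min_score):
        """Return (image_id, score) pairs, best first, with score >= min_score."""
        query = Counter(words)
        candidates = set()
        for word in query:
            candidates |= self.inverted_index.get(word, set())
        scored = [(cosine_similarity(query, self.histograms[i]), i) for i in candidates]
        scored = [(score, i) for score, i in scored if score >= min_score]
        return [(i, score) for score, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


def detect_loop(database, key_frame_id, words, min_score=0.8):
    """Return loop closure candidates; index the key frame if none are found."""
    candidates = database.retrieve(words, min_score)
    if not candidates:
        database.add_image(key_frame_id, words)
    return candidates


database = RecognitionDatabase()
database.add_image(1, [3, 3, 7, 9])
database.add_image(2, [1, 2, 2, 4])
database.add_image(3, [3, 7, 7, 9])

revisit = detect_loop(database, 4, [3, 3, 7, 9])   # looks like key frames 1 and 3
new_place = detect_loop(database, 5, [30, 31])     # shares no words with the database
```

Only images that share at least one visual word with the query are scored, which is what makes an inverted index efficient when the database holds many key frames.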

Drift Correction

The imageviewset object internally updates the pose graph as views and connections are added. To minimize drift, perform pose graph optimization by using the optimizePoses function, once sufficient loop closures are added. The optimizePoses function returns an imageviewset object with the optimized absolute pose transformations for each view.

You can use the createPoseGraph function to return the pose graph as a MATLAB® digraph object. You can use graph algorithms in MATLAB to inspect, view, or modify the pose graph. Use the optimizePoseGraph (Navigation Toolbox) function from Navigation Toolbox™ to optimize the modified pose graph, and then use the updateView function to update the camera poses in the view set.
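
To show how a loop closure edge lets pose graph optimization spread accumulated drift over the trajectory, here is a deliberately simplified Python sketch. It works on 1-D positions with equally weighted edges and solves the least-squares problem with Gauss-Seidel sweeps. The function name and sample measurements are hypothetical, and this is not how optimizePoses is implemented; the real problem involves 3-D poses.

```python
def optimize_pose_graph_1d(odometry, loop_closure, sweeps=200):
    """Least-squares positions for a 1-D pose graph with one loop closure.

    odometry[i] is the measured displacement from pose i to pose i + 1.
    loop_closure is the measured displacement from pose 0 to the last pose.
    Pose 0 is held fixed at 0. All edges are weighted equally.
    """
    n = len(odometry)
    # Initial guess: integrate the odometry (this is where drift accumulates).
    poses = [0.0]
    for step in odometry:
        poses.append(poses[-1] + step)
    # Gauss-Seidel: move each pose to the average of what its two edges predict.
    for _ in range(sweeps):
        for i in range(1, n + 1):
            from_previous = poses[i - 1] + odometry[i - 1]
            if i < n:
                from_other = poses[i + 1] - odometry[i]
            else:
                from_other = poses[0] + loop_closure
            poses[i] = 0.5 * (from_previous + from_other)
    return poses


# The camera returns to its start, but integrated odometry ends at 0.5.
odometry = [1.0, 1.0, -1.0, -0.5]
drifted_end = sum(odometry)
optimized = optimize_pose_graph_1d(odometry, loop_closure=0.0)
```

In this example the 0.5 discrepancy is shared equally among the four odometry edges and the loop closure edge, so each edge absorbs a correction of 0.1 and the final pose moves from 0.5 to 0.1.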

Visualization

To develop the visual SLAM system, you can use the following visualization functions.

| Function | Description |
| --- | --- |
| imshow | Display an image |
| showMatchedFeatures | Display matched feature points in two images |
| plot | Plot image view set views and connections |
| plotCamera | Plot a camera in 3-D coordinates |
| pcshow | Plot 3-D point cloud |
| pcplayer | Visualize streaming 3-D point cloud data |

