
Structure-from-Motion (SfM): A Tutorial

Resources

For this tutorial, I collected information from multiple resources; the main ones are cited in the references below.

References

@inproceedings{schoenberger2016sfm,
 author     = {Sch\"{o}nberger, Johannes Lutz and Frahm, Jan-Michael},
 title      = {Structure-from-Motion Revisited},
 booktitle  = {Conference on Computer Vision and Pattern Recognition (CVPR)},
 year       = {2016},
}
@inproceedings{wang2024vggsfm,
 title      = {VGGSfM: Visual Geometry Grounded Deep Structure From Motion},
 author     = {Wang, Jianyuan and Karaev, Nikita and Rupprecht, Christian and Novotny, David},
 booktitle  = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
 pages      = {21686--21697},
 year       = {2024}
}

What does SfM have to do with NeRF?

[mermaid diagram]

[mermaid diagram]

NeRF needs camera poses to cast rays into the scene and render the 3D representation for computing the training loss. Without SfM, NeRF would be shooting rays into the wild. End-to-end training from 2D images all the way to 3D reconstruction would be a nice thing to have.
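
To make that concrete, here is a minimal sketch (my own illustration, not tied to any particular NeRF codebase, assuming an OpenCV-style pinhole camera with +z looking forward) of how rays are generated from the intrinsics $K$ and the camera-to-world pose that SfM provides:

    import numpy as np

    def generate_rays(K, c2w, height, width):
        """K: 3x3 intrinsics, c2w: 4x4 camera-to-world pose (what SfM estimates)."""
        j, i = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
        # Pixel grid -> camera-space ray directions (pinhole model, +z forward)
        dirs_cam = np.stack(
            [(i - K[0, 2]) / K[0, 0], (j - K[1, 2]) / K[1, 1], np.ones_like(i, dtype=float)],
            axis=-1,
        )
        rays_d = dirs_cam @ c2w[:3, :3].T                   # rotate directions into world space
        rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)  # every ray starts at the camera center
        return rays_o, rays_d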

Two-View Geometry

[Figure from (Hartley and Zisserman, 2004)]


Epipolar Geometry

[Figure from (Hartley and Zisserman, 2004)]

In simple words: the ray that passes through $x$ is projected onto the second view as the epipolar line $l'$. This is very important because it limits the search for the match of $x$ (the image of the 3D point $X$) in the second view to the line $l'$.


[Figure from (Hartley and Zisserman, 2004)]


The Fundamental Matrix $F$

[Figure from (Hartley and Zisserman, 2004)]


Epipolar geometry describes the relation between two views of a moving camera through the essential matrix $E$ (calibrated) or the fundamental matrix $F$ (uncalibrated) (Schönberger and Frahm, 2016).
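
For corresponding points $x \leftrightarrow x'$, the fundamental matrix satisfies $x'^{\top} F x = 0$ and maps a point to its epipolar line via $l' = F x$. As a quick illustration of estimating $F$ from matches, here is a sketch using OpenCV's RANSAC solver on synthetic correspondences (illustration only; COLMAP uses its own 7-point/8-point solvers inside LO-RANSAC):

    import cv2
    import numpy as np

    # Synthetic two-view setup: 3D points seen by two cameras related by a translation
    rng = np.random.default_rng(0)
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(100, 3))
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])               # first camera  [I | 0]
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0], [0]])])   # second camera [I | t]

    def project(P, X):
        x = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
        return (x[:, :2] / x[:, 2:]).astype(np.float32)

    pts1, pts2 = project(P1, X), project(P2, X)

    # RANSAC-based fundamental matrix estimation from the 2D-2D matches
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)

    # Epipolar line in view 2 for the first point of view 1, and the constraint value
    l2 = F @ np.array([*pts1[0], 1.0])
    print("x'^T F x =", np.array([*pts2[0], 1.0]) @ l2)   # approximately zero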


The Essential Matrix $E$

[Figure from (Hartley and Zisserman, 2004)]


Remarks about Two-View Geometry

COLMAP

[Figure from (Schönberger and Frahm, 2016)]

Initialization

Choosing the initial pair of images

These notes were taken from incremental_mapper_impl.h.

COLMAP implements incremental SfM using three files: incremental_pipeline.cc, incremental_mapper.cc, and incremental_mapper_impl.cc.

The function InitializeReconstruction inside colmap/src/colmap/controllers/incremental_pipeline.cc initializes the reconstruction by calling FindInitialImagePair in colmap/src/colmap/sfm/incremental_mapper.cc.

[mermaid diagram]

A large number of correspondences makes it easy to pair the image with a second one, and having camera calibration priors allows COLMAP to use the essential matrix $E$ to estimate the camera poses.
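
A minimal sketch of that selection idea (a deliberate simplification with made-up fields, not COLMAP's actual scoring code): prefer images that have calibration priors and many correspondences.

    def choose_first_image(images):
        """images: list of dicts with 'num_correspondences' and 'has_prior_focal_length'."""
        return max(
            images,
            key=lambda im: (im["has_prior_focal_length"], im["num_correspondences"]),
        )

    candidates = [
        {"name": "a.jpg", "num_correspondences": 120, "has_prior_focal_length": False},
        {"name": "b.jpg", "num_correspondences": 95,  "has_prior_focal_length": True},
    ]
    print(choose_first_image(candidates)["name"])   # picks "b.jpg": calibration prior wins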

Estimating two-view geometry between the initial pair

In addition to incremental_pipeline.cc, incremental_mapper.cc, and incremental_mapper_impl.cc, we have two more files that estimate the two-view geometry between the initial pair of images:

[mermaid diagram]

[Figure from (Hartley and Zisserman, 2004)]

[Figure from (Hartley and Zisserman, 2004)]

Image Registration

New images can be registered to the current model by solving the Perspective-n-Point (PnP) problem using feature correspondences to triangulated points in already registered images (2D-3D correspondences). The PnP problem involves estimating the pose $P_c$ and, for uncalibrated cameras, its intrinsic parameters. The set $\mathcal{P}$ is thus extended by the pose $P_c$ of the newly registered image (Schönberger and Frahm, 2016).

[mermaid diagram]
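
As a rough sketch of the PnP step (illustration only, using OpenCV's RANSAC PnP solver rather than COLMAP's estimators), 2D-3D correspondences are enough to recover the pose of a new image:

    import cv2
    import numpy as np

    # Synthetic 2D-3D correspondences: scene points already in the model and
    # their projections into the new image (generated with a known pose here)
    points_3d = np.random.rand(50, 3) * 4 + [0, 0, 5]
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    rvec_gt, tvec_gt = np.array([0.1, -0.2, 0.05]), np.array([0.2, 0.0, 0.1])
    points_2d, _ = cv2.projectPoints(points_3d, rvec_gt, tvec_gt, K, None)

    # RANSAC PnP: recover the pose of the new image from the 2D-3D correspondences
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32),
        points_2d.astype(np.float32),
        K, distCoeffs=None, reprojectionError=4.0,
    )
    R, _ = cv2.Rodrigues(rvec)   # world-to-camera rotation of the registered image
    print("recovered pose:", rvec.ravel().round(3), tvec.ravel().round(3))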

Triangulation

A newly registered image must observe existing scene points. In addition, it may also increase scene coverage by extending the set of points $\mathcal{X}$ through triangulation. A new scene point $X_k$ can be triangulated and added to $\mathcal{X}$ as soon as at least one more image, also covering the new scene part but from a different viewpoint, is registered (Schönberger and Frahm, 2016).

[mermaid diagram]
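
A minimal sketch of two-view triangulation (illustration only; COLMAP's triangulator adds multi-view handling, triangulation-angle and cheirality checks): given the projection matrices of two registered images and a matched observation, the DLT solution comes from cv2.triangulatePoints.

    import cv2
    import numpy as np

    # Projection matrices K [R | t] of two registered images
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

    # Matched observations of the same scene point in the two images
    x1 = np.array([[330.0], [250.0]])
    x2 = np.array([[310.0], [250.0]])

    X_h = cv2.triangulatePoints(P1, P2, x1, x2)      # 4x1 homogeneous point (DLT)
    X = (X_h[:3] / X_h[3]).ravel()
    print("triangulated point:", X, "depth:", X[2])  # cheirality check: depth > 0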

Bundle Adjustment

Without further refinement, SfM usually drifts quickly to a non-recoverable state. Bundle adjustment is the joint non-linear refinement of camera parameters $P_c$ and point parameters $X_k$ that minimizes the reprojection error:

$E = \sum_j \rho_j \Big( \left\lVert \pi (P_c, X_k) - x_j \right\rVert^{2}_{2} \Big)$

[mermaid diagram]
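
A compact sketch of that objective, assuming a simple pinhole projection and a Cauchy robust loss for $\rho_j$ (this mirrors the equation above rather than COLMAP's Ceres-based solver; something like scipy.optimize.least_squares with loss="cauchy" could minimize it over the camera and point parameters):

    import numpy as np

    def project(K, R, t, X):
        """Pinhole projection pi(P_c, X_k) of 3D points X (N x 3) into pixels."""
        x_cam = X @ R.T + t               # world -> camera
        x_img = x_cam @ K.T               # camera -> homogeneous pixel coordinates
        return x_img[:, :2] / x_img[:, 2:]

    def reprojection_error(K, R, t, X, x_obs, scale=1.0):
        """E = sum_j rho(||pi(P, X_j) - x_j||^2) with a Cauchy robust loss rho."""
        r2 = np.sum((project(K, R, t, X) - x_obs) ** 2, axis=1)
        rho = scale**2 * np.log1p(r2 / scale**2)
        return rho.sum()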

VGGSfM

From appendix A of (Wang et al., 2024), the training process involves multiple stages:

[mermaid diagram]

[mermaid diagram]


[Figure from (Wang et al., 2024)]


Tracker



[Figures: cost volume pyramid and cost volume, source: (Yang et al., 2020)]


The tracking process:

  1. runner.py calls track_predictor.py in predict_tracks or predict_tracks_in_chunks (to avoid memory issues).

    runner.py line 1315

       fine_pred_track, _, pred_vis, pred_score = track_predictor(
           images_feed,
           split_points,
           fmaps=fmaps_feed,
           fine_tracking=fine_tracking,
       )
    
  2. track_predictor.py calls base_track_predictor.py twice: once for the coarse_predictor and once for the fine_predictor.

    track_predictor.py line 91

       # Coarse prediction
       coarse_pred_track_lists, pred_vis = self.coarse_predictor(
           query_points=query_points,
           fmaps=fmaps,
           iters=coarse_iters,
           down_ratio=self.coarse_down_ratio,
       )
       coarse_pred_track = coarse_pred_track_lists[-1]
    
  3. base_track_predictor.py takes query points and their feature maps as inputs and returns 2D positions and visibility:
    1. input
        """
        query_points: B x N x 2, the number of batches, tracks, and xy
        fmaps: B x S x C x HH x WW, the number of batches, frames, and feature dimension.
                note HH and WW is the size of feature maps instead of original images
        """
      
    2. Inside an iterative refinement loop, it samples descriptors from all $N_I$ frames, starting from the positions of the query points in the reference frame $I_i$. It does that by calling CorrBlock from blocks.py (see the sketch after this list).
      # Compute the correlation (check the implementation of CorrBlock)
      if self.efficient_corr:
          fcorrs = fcorr_fn.sample(coords, track_feats)
      else:
          fcorr_fn.corr(track_feats)
          fcorrs = fcorr_fn.sample(coords)  # B, S, N, corrdim
      
    3. It passes query_points $\{ \hat{y}_1^1, \cdots, \hat{y}_1^{N_T} \}$, correlations $V \in \mathbb{R}^{N_T \times N_I \times C}$, and track_feats $\{ m_1^1, \cdots, m_1^{N_T} \}$ to a transformer named EfficientUpdateFormer in blocks.py.
       # Concatenate them as the input for the transformers
       transformer_input = torch.cat(
           [flows_emb, fcorrs_, track_feats_], dim=2
       )
      
    4. output
        if return_feat:
            return coord_preds, vis_e, track_feats, query_track_feat
        else:
            return coord_preds, vis_e
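
Here is a stripped-down sketch of the local-correlation idea behind CorrBlock (my own simplification, not the repository code): each track's feature vector is compared against descriptors sampled in a small window around the track's current position in a frame's feature map, and the resulting correlation vector is part of what gets fed to the transformer.

    import torch
    import torch.nn.functional as F

    def local_correlation(fmap, track_feats, coords, radius=3):
        """
        fmap:        C x H x W   feature map of one frame
        track_feats: N x C       one feature vector per track
        coords:      N x 2       current (x, y) position of each track, in feature-map pixels
        returns:     N x (2*radius+1)^2 correlation values per track
        """
        C, H, W = fmap.shape
        r = radius
        # Offsets of the local window around each track position
        dy, dx = torch.meshgrid(
            torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing="ij"
        )
        offsets = torch.stack([dx, dy], dim=-1).reshape(-1, 2).float()
        sample_xy = coords[:, None, :] + offsets[None]        # N x (2r+1)^2 x 2
        # Normalize sample positions to [-1, 1] for grid_sample
        grid = sample_xy.clone()
        grid[..., 0] = 2 * sample_xy[..., 0] / (W - 1) - 1
        grid[..., 1] = 2 * sample_xy[..., 1] / (H - 1) - 1
        patches = F.grid_sample(fmap[None], grid[None], align_corners=True)
        patches = patches[0].permute(1, 2, 0)                 # N x (2r+1)^2 x C
        # Dot product between each track feature and every sampled descriptor
        return torch.einsum("nkc,nc->nk", patches, track_feats)

    # Example: a 64-channel feature map and 5 tracks
    corr = local_correlation(torch.randn(64, 32, 32), torch.randn(5, 64), torch.rand(5, 2) * 31)
    print(corr.shape)   # torch.Size([5, 49])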
      

Camera Initializer

In the (Wang et al., 2024) paper, they mentioned that the camera initializer was designed as follows:

[Figure from (Wang et al., 2024)]


But at the time of writing, the camera initializer in the code does not use track features, in order to make it faster. This was mentioned in the issues.

So here’s how the camera initializer is implemented in the code:

  1. runner.py calls camera_predictor.py using self.camera_predictor, which passes image features to a transformer and refines the camera poses iteratively:

    camera_predictor.py line 187

         for iter_num in range(iters):
             pred_pose_enc = pred_pose_enc.detach()
    
             # Embed the camera parameters and add to rgb_feat
             pose_embed = self.embed_pose(pred_pose_enc)
             rgb_feat = rgb_feat + pose_embed
    
             # Run trunk transformers on rgb_feat
             rgb_feat = self.trunk(rgb_feat)
    
             # Predict the delta feat and pose encoding at each iteration
             delta = self.pose_branch(rgb_feat)
             delta_pred_pose_enc = delta[..., : self.target_dim]
             delta_feat = delta[..., self.target_dim :]
    
             rgb_feat = self.ffeat_updater(self.norm(delta_feat)) + rgb_feat
    
             pred_pose_enc = pred_pose_enc + delta_pred_pose_enc
    
             # Residual connection
             rgb_feat = (rgb_feat + rgb_feat_init) / 2
    
  2. runner.py calls estimate_preliminary.py using estimate_preliminary_cameras_poselib or estimate_preliminary_cameras. The difference is explained in the comments below:

    runner.py line 474

         # Estimate preliminary_cameras by recovering fundamental/essential/homography matrix from 2D matches
         # By default, we use fundamental matrix estimation with 7p/8p+LORANSAC
         # All the operations are batched and differentiable (if necessary)
         # except when you enable use_poselib to save GPU memory
         _, preliminary_dict = estimate_preliminary_cameras_fn(
             pred_track,
             pred_vis,
             width,
             height,
             tracks_score=pred_score,
             max_error=self.cfg.fmat_thres,
             loopresidual=True,
         )
    

    estimate_preliminary.py performs three main tasks (see the sketch after this list):

    1. Estimate the fundamental matrix in batch, fmat: (B*(S-1))x3x3, where S is the number of frames. fundamental.py estimates the fundamental matrix with the 7-point/8-point algorithm + LORANSAC and returns the candidate with the highest inlier count.
    2. Estimate kmat1, kmat2: (B*(S-1))x3x3, where the focal length is set to max(width, height) and the principal point to (width//2, height//2).
    3. Get the essential matrix from the fundamental and camera matrices.
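
A small numpy sketch of tasks 2 and 3 above (my own illustration, not the repository code), using the same intrinsics defaults:

    import numpy as np

    def default_intrinsics(width, height):
        # Task 2: focal length = max(width, height), principal point at the image center
        f = max(width, height)
        return np.array([[f, 0, width // 2],
                         [0, f, height // 2],
                         [0, 0, 1]], dtype=np.float64)

    def essential_from_fundamental(F, K1, K2):
        # Task 3: E = K2^T F K1 for the two views' intrinsics K1, K2
        return K2.T @ F @ K1

    K1 = K2 = default_intrinsics(1024, 768)
    F = np.eye(3)                     # placeholder for the fundamental matrix from task 1
    E = essential_from_fundamental(F, K1, K2)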

Triangulator

In the (Wang et al., 2024) paper, they mentioned that they used a transformer for the triangulator:

[Figure from (Wang et al., 2024)]


But at the time of writing, the learnable parameters were removed from the triangulator in the code to simplify inference. This was mentioned in the issues.

So here’s how the triangulator is implemented in the code:

  1. runner.py calls triangulator.py using self.triangulator, which uses RANSAC with the Direct Linear Transform (DLT) to triangulate points and then runs bundle adjustment.

    triangulator.py line 122

          # For initialization
          # we first triangulate a point cloud for each pair of query-reference images,
          # i.e., we have S-1 3D point clouds
          # points_3d_pair: S-1 x N x 3
          (points_3d_pair, cheirality_mask_pair, triangle_value_pair) = (
              triangulate_by_pair(extrinsics[None], tracks_normalized[None])
          )
    
          # Check which point cloud can provide sufficient inliers
          # that pass the triangulation angle and cheirality check
          # Pick the highest inlier_geo_vis one as the initial point cloud
          inlier_total, valid_tri_angle_thres = find_best_initial_pair(
              inlier_geo_vis,
              cheirality_mask_pair,
              triangle_value_pair,
              init_tri_angle_thres,
          )
    
  2. triangulator.py calls utils/triangulator.py using the init_BA function, which uses pycolmap for BA:

    utils/triangulator.py line 122

          # Convert PyTorch tensors to the format of Pycolmap
          # Prepare for the Bundle Adjustment Optimization
          # NOTE although we use pycolmap for BA here, but any BA library should be able to achieve the same result
          reconstruction = batch_matrix_to_pycolmap(
              toBA_points3D,
              toBA_extrinsics,
              toBA_intrinsics,
              toBA_tracks,
              toBA_masks,
              image_size,
              extra_params=toBA_extra_params,
              shared_camera=shared_camera,
              camera_type=camera_type,
          )
         
          # Prepare BA options
          ba_options = prepare_ba_options()
         
          # Conduct BA
          pycolmap.bundle_adjustment(reconstruction, ba_options)