
Structure-from-Motion (SfM): A Tutorial

Resources

For this tutorial, I collected information from multiple resources; the main ones are cited in the references below.

References

@inproceedings{schoenberger2016sfm,
 author     = {Sch\"{o}nberger, Johannes Lutz and Frahm, Jan-Michael},
 title      = {Structure-from-Motion Revisited},
 booktitle  = {Conference on Computer Vision and Pattern Recognition (CVPR)},
 year       = {2016},
}
@inproceedings{wang2024vggsfm,
 title      = {VGGSfM: Visual Geometry Grounded Deep Structure From Motion},
 author     = {Wang, Jianyuan and Karaev, Nikita and Rupprecht, Christian and Novotny, David},
 booktitle  = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
 pages      = {21686--21697},
 year       = {2024}
}

What does SfM have to do with NeRF?

[mermaid diagram]

[mermaid diagram]

NeRF needs camera poses to cast rays into the scene and render the 3D representation for computing the training loss. Without SfM, NeRF would be shooting rays into the wild. End-to-end training from 2D images all the way to 3D reconstruction would be a nice thing to have.
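
To make that concrete, here is a minimal sketch (my own illustration, not tied to any particular NeRF codebase, assuming an OpenCV-style pinhole camera with +z looking forward) of how rays are generated from the intrinsics $K$ and the camera-to-world pose that SfM provides:

    import numpy as np

    def generate_rays(K, c2w, height, width):
        """K: 3x3 intrinsics, c2w: 4x4 camera-to-world pose (what SfM estimates)."""
        j, i = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
        # Pixel grid -> camera-space ray directions (pinhole model, +z forward)
        dirs_cam = np.stack(
            [(i - K[0, 2]) / K[0, 0], (j - K[1, 2]) / K[1, 1], np.ones_like(i, dtype=float)],
            axis=-1,
        )
        rays_d = dirs_cam @ c2w[:3, :3].T                   # rotate directions into world space
        rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)  # every ray starts at the camera center
        return rays_o, rays_d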

Two-View Geometry

[Figure from (Hartley and Zisserman, 2004)]


Epipolar Geometry

[Figure from (Hartley and Zisserman, 2004)]

In simple words: the ray that passes through $x$ is projected onto the second view as the epipolar line $l'$. This is very important because it limits the search for the match of $x$ (the image of the 3D point $X$) in the second view to the line $l'$.


[Figure from (Hartley and Zisserman, 2004)]


The Fundamental Matrix $F$

[Figure from (Hartley and Zisserman, 2004)]


Epipolar geometry describes the relation between two views of a moving camera through the essential matrix $E$ (calibrated) or the fundamental matrix $F$ (uncalibrated) (Schönberger and Frahm, 2016).
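
For corresponding points $x \leftrightarrow x'$, the fundamental matrix satisfies $x'^{\top} F x = 0$ and maps a point to its epipolar line via $l' = F x$. As a quick illustration of estimating $F$ from matches, here is a sketch using OpenCV's RANSAC solver on synthetic correspondences (illustration only; COLMAP uses its own 7-point/8-point solvers inside LO-RANSAC):

    import cv2
    import numpy as np

    # Synthetic two-view setup: 3D points seen by two cameras related by a translation
    rng = np.random.default_rng(0)
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(100, 3))
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])               # first camera  [I | 0]
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0], [0]])])   # second camera [I | t]

    def project(P, X):
        x = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
        return (x[:, :2] / x[:, 2:]).astype(np.float32)

    pts1, pts2 = project(P1, X), project(P2, X)

    # RANSAC-based fundamental matrix estimation from the 2D-2D matches
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)

    # Epipolar line in view 2 for the first point of view 1, and the constraint value
    l2 = F @ np.array([*pts1[0], 1.0])
    print("x'^T F x =", np.array([*pts2[0], 1.0]) @ l2)   # approximately zero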


The Essential Matrix $E$

[Figure from (Hartley and Zisserman, 2004)]


Remarks about Two-View Geometry

COLMAP

[Figure from (Schönberger and Frahm, 2016)]

Initialization

Choosing the initial pair of images

These notes were taken from incremental_mapper_impl.h.

COLMAP implements incremental SfM using three files: incremental_pipeline.cc, incremental_mapper.cc, and incremental_mapper_impl.cc.

The function InitializeReconstruction inside colmap/src/colmap/controllers/incremental_pipeline.cc initializes the reconstruction by calling FindInitialImagePair in colmap/src/colmap/sfm/incremental_mapper.cc.

[mermaid diagram]

A large number of correspondences makes it easy to pair the image with a second one, and having camera calibration priors allows COLMAP to use the essential matrix $E$ to estimate the camera poses.
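
A minimal sketch of that selection idea (a deliberate simplification with made-up fields, not COLMAP's actual scoring code): prefer images that have calibration priors and many correspondences.

    def choose_first_image(images):
        """images: list of dicts with 'num_correspondences' and 'has_prior_focal_length'."""
        return max(
            images,
            key=lambda im: (im["has_prior_focal_length"], im["num_correspondences"]),
        )

    candidates = [
        {"name": "a.jpg", "num_correspondences": 120, "has_prior_focal_length": False},
        {"name": "b.jpg", "num_correspondences": 95,  "has_prior_focal_length": True},
    ]
    print(choose_first_image(candidates)["name"])   # picks "b.jpg": calibration prior wins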

Estimating two-view geometry between the initial pair

In addition to incremental_pipeline.cc, incremental_mapper.cc, and incremental_mapper_impl.cc, we have two more files that estimate the two-view geometry between the initial pair of images:

[mermaid diagram]

[Figure from (Hartley and Zisserman, 2004)]

[Figure from (Hartley and Zisserman, 2004)]

Image Registration

New images can be registered to the current model by solving the Perspective-n-Point (PnP) problem using feature correspondences to triangulated points in already registered images (2D-3D correspondences). The PnP problem involves estimating the pose $P_c$ and, for uncalibrated cameras, its intrinsic parameters. The set $\mathcal{P}$ is thus extended by the pose $P_c$ of the newly registered image (Schönberger and Frahm, 2016).

[mermaid diagram]
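
As a rough sketch of the PnP step (illustration only, using OpenCV's RANSAC PnP solver rather than COLMAP's estimators), 2D-3D correspondences are enough to recover the pose of a new image:

    import cv2
    import numpy as np

    # Synthetic 2D-3D correspondences: scene points already in the model and
    # their projections into the new image (generated with a known pose here)
    points_3d = np.random.rand(50, 3) * 4 + [0, 0, 5]
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    rvec_gt, tvec_gt = np.array([0.1, -0.2, 0.05]), np.array([0.2, 0.0, 0.1])
    points_2d, _ = cv2.projectPoints(points_3d, rvec_gt, tvec_gt, K, None)

    # RANSAC PnP: recover the pose of the new image from the 2D-3D correspondences
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32),
        points_2d.astype(np.float32),
        K, distCoeffs=None, reprojectionError=4.0,
    )
    R, _ = cv2.Rodrigues(rvec)   # world-to-camera rotation of the registered image
    print("recovered pose:", rvec.ravel().round(3), tvec.ravel().round(3))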

Triangulation

A newly registered image must observe existing scene points. In addition, it may also increase scene coverage by extending the set of points $\mathcal{X}$ through triangulation. A new scene point $X_k$ can be triangulated and added to $\mathcal{X}$ as soon as at least one more image, also covering the new scene part but from a different viewpoint, is registered (Schönberger and Frahm, 2016).

[mermaid diagram]
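
A minimal sketch of two-view triangulation (illustration only; COLMAP's triangulator adds multi-view handling, triangulation-angle and cheirality checks): given the projection matrices of two registered images and a matched observation, the DLT solution comes from cv2.triangulatePoints.

    import cv2
    import numpy as np

    # Projection matrices K [R | t] of two registered images
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

    # Matched observations of the same scene point in the two images
    x1 = np.array([[330.0], [250.0]])
    x2 = np.array([[310.0], [250.0]])

    X_h = cv2.triangulatePoints(P1, P2, x1, x2)      # 4x1 homogeneous point (DLT)
    X = (X_h[:3] / X_h[3]).ravel()
    print("triangulated point:", X, "depth:", X[2])  # cheirality check: depth > 0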

Bundle Adjustment

Without further refinement, SfM usually drifts quickly to a non-recoverable state. Bundle adjustment is the joint non-linear refinement of camera parameters $P_c$ and point parameters $X_k$ that minimizes the reprojection error:

$E = \sum_j \rho_j \Big( \left\lVert \pi (P_c, X_k) - x_j \right\rVert^{2}_{2} \Big)$

[mermaid diagram]
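
A compact sketch of that objective, assuming a simple pinhole projection and a Cauchy robust loss for $\rho_j$ (this mirrors the equation above rather than COLMAP's Ceres-based solver; something like scipy.optimize.least_squares with loss="cauchy" could minimize it over the camera and point parameters):

    import numpy as np

    def project(K, R, t, X):
        """Pinhole projection pi(P_c, X_k) of 3D points X (N x 3) into pixels."""
        x_cam = X @ R.T + t               # world -> camera
        x_img = x_cam @ K.T               # camera -> homogeneous pixel coordinates
        return x_img[:, :2] / x_img[:, 2:]

    def reprojection_error(K, R, t, X, x_obs, scale=1.0):
        """E = sum_j rho(||pi(P, X_j) - x_j||^2) with a Cauchy robust loss rho."""
        r2 = np.sum((project(K, R, t, X) - x_obs) ** 2, axis=1)
        rho = scale**2 * np.log1p(r2 / scale**2)
        return rho.sum()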

VGGSfM

From appendix A of (Wang et al., 2024), the training process involves multiple stages:

[mermaid diagram]

[mermaid diagram]


[Figure from (Wang et al., 2024)]


Tracker



[Figures: cost volume pyramid and cost volume, source: (Yang et al., 2020)]


The tracking process:

  1. runner.py calls track_predictor.py in predict_tracks or predict_tracks_in_chunks (to avoid memory issues).

    runner.py line 1315

       fine_pred_track, _, pred_vis, pred_score = track_predictor(
           images_feed,
           split_points,
           fmaps=fmaps_feed,
           fine_tracking=fine_tracking,
       )
    
  2. track_predictor.py calls base_track_predictor.py twice: once for the coarse_predictor and once for the fine_predictor.

    track_predictor.py line 91

       # Coarse prediction
       coarse_pred_track_lists, pred_vis = self.coarse_predictor(
           query_points=query_points,
           fmaps=fmaps,
           iters=coarse_iters,
           down_ratio=self.coarse_down_ratio,
       )
       coarse_pred_track = coarse_pred_track_lists[-1]
    
  3. base_track_predictor.py takes query points and their feature maps as inputs and returns 2D positions and visibility:
    1. input
        """
        query_points: B x N x 2, the number of batches, tracks, and xy
        fmaps: B x S x C x HH x WW, the number of batches, frames, and feature dimension.
                note HH and WW is the size of feature maps instead of original images
        """
      
    2. Inside an iterative refinement loop, it samples descriptors from all $N_I$ frames, starting from the positions of the query points in the reference frame $I_i$. It does that by calling CorrBlock from blocks.py (see the sketch after this list).
      # Compute the correlation (check the implementation of CorrBlock)
      if self.efficient_corr:
          fcorrs = fcorr_fn.sample(coords, track_feats)
      else:
          fcorr_fn.corr(track_feats)
          fcorrs = fcorr_fn.sample(coords)  # B, S, N, corrdim
      
    3. It passes query_points $\{ \hat{y}_1^1, \cdots, \hat{y}_1^{N_T} \}$, correlations $V \in \mathbb{R}^{N_T \times N_I \times C}$, and track_feats $\{ m_1^1, \cdots, m_1^{N_T} \}$ to a transformer named EfficientUpdateFormer in blocks.py.
       # Concatenate them as the input for the transformers
       transformer_input = torch.cat(
           [flows_emb, fcorrs_, track_feats_], dim=2
       )
      
    4. output
        if return_feat:
            return coord_preds, vis_e, track_feats, query_track_feat
        else:
            return coord_preds, vis_e
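
Here is a stripped-down sketch of the local-correlation idea behind CorrBlock (my own simplification, not the repository code): each track's feature vector is compared against descriptors sampled in a small window around the track's current position in a frame's feature map, and the resulting correlation vector is part of what gets fed to the transformer.

    import torch
    import torch.nn.functional as F

    def local_correlation(fmap, track_feats, coords, radius=3):
        """
        fmap:        C x H x W   feature map of one frame
        track_feats: N x C       one feature vector per track
        coords:      N x 2       current (x, y) position of each track, in feature-map pixels
        returns:     N x (2*radius+1)^2 correlation values per track
        """
        C, H, W = fmap.shape
        r = radius
        # Offsets of the local window around each track position
        dy, dx = torch.meshgrid(
            torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing="ij"
        )
        offsets = torch.stack([dx, dy], dim=-1).reshape(-1, 2).float()
        sample_xy = coords[:, None, :] + offsets[None]        # N x (2r+1)^2 x 2
        # Normalize sample positions to [-1, 1] for grid_sample
        grid = sample_xy.clone()
        grid[..., 0] = 2 * sample_xy[..., 0] / (W - 1) - 1
        grid[..., 1] = 2 * sample_xy[..., 1] / (H - 1) - 1
        patches = F.grid_sample(fmap[None], grid[None], align_corners=True)
        patches = patches[0].permute(1, 2, 0)                 # N x (2r+1)^2 x C
        # Dot product between each track feature and every sampled descriptor
        return torch.einsum("nkc,nc->nk", patches, track_feats)

    # Example: a 64-channel feature map and 5 tracks
    corr = local_correlation(torch.randn(64, 32, 32), torch.randn(5, 64), torch.rand(5, 2) * 31)
    print(corr.shape)   # torch.Size([5, 49])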
      

Camera Initializer

In the (Wang et al., 2024) paper, they mentioned that the camera initializer was designed as follows:

[Figure from (Wang et al., 2024)]


But at the time of writing, the camera initializer in the code does not use track features, in order to make it faster. This was mentioned in the issues.

So here’s how the camera initializer is implemented in the code:

  1. runner.py calls camera_predictor.py using self.camera_predictor, which passes image features to a transformer and refines the camera poses iteratively:

    camera_predictor.py line 187

         for iter_num in range(iters):
             pred_pose_enc = pred_pose_enc.detach()
    
             # Embed the camera parameters and add to rgb_feat
             pose_embed = self.embed_pose(pred_pose_enc)
             rgb_feat = rgb_feat + pose_embed
    
             # Run trunk transformers on rgb_feat
             rgb_feat = self.trunk(rgb_feat)
    
             # Predict the delta feat and pose encoding at each iteration
             delta = self.pose_branch(rgb_feat)
             delta_pred_pose_enc = delta[..., : self.target_dim]
             delta_feat = delta[..., self.target_dim :]
    
             rgb_feat = self.ffeat_updater(self.norm(delta_feat)) + rgb_feat
    
             pred_pose_enc = pred_pose_enc + delta_pred_pose_enc
    
             # Residual connection
             rgb_feat = (rgb_feat + rgb_feat_init) / 2
    
  2. runner.py calls estimate_preliminary.py using estimate_preliminary_cameras_poselib or estimate_preliminary_cameras. The difference is explained in the comments below:

    runner.py line 474

         # Estimate preliminary_cameras by recovering fundamental/essential/homography matrix from 2D matches
         # By default, we use fundamental matrix estimation with 7p/8p+LORANSAC
         # All the operations are batched and differentiable (if necessary)
         # except when you enable use_poselib to save GPU memory
         _, preliminary_dict = estimate_preliminary_cameras_fn(
             pred_track,
             pred_vis,
             width,
             height,
             tracks_score=pred_score,
             max_error=self.cfg.fmat_thres,
             loopresidual=True,
         )
    

    estimate_preliminary.py performs three main tasks (see the sketch after this list):

    1. Estimate the fundamental matrix in batch, fmat: (B*(S-1))x3x3, where S is the number of frames. fundamental.py estimates the fundamental matrix with the 7-point/8-point algorithm + LORANSAC and returns the candidate with the highest inlier count.
    2. Estimate kmat1, kmat2: (B*(S-1))x3x3, where the focal length is set to max(width, height) and the principal point to (width//2, height//2).
    3. Get the essential matrix from the fundamental and camera matrices.
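
A small numpy sketch of tasks 2 and 3 above (my own illustration, not the repository code), using the same intrinsics defaults:

    import numpy as np

    def default_intrinsics(width, height):
        # Task 2: focal length = max(width, height), principal point at the image center
        f = max(width, height)
        return np.array([[f, 0, width // 2],
                         [0, f, height // 2],
                         [0, 0, 1]], dtype=np.float64)

    def essential_from_fundamental(F, K1, K2):
        # Task 3: E = K2^T F K1 for the two views' intrinsics K1, K2
        return K2.T @ F @ K1

    K1 = K2 = default_intrinsics(1024, 768)
    F = np.eye(3)                     # placeholder for the fundamental matrix from task 1
    E = essential_from_fundamental(F, K1, K2)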

Triangulator

In the (Wang et al., 2024) paper, they mentioned that they used a transformer for the triangulator:

[Figure from (Wang et al., 2024)]


But at the time of writing, the learnable parameters were removed from the triangulator in the code to simplify inference. This was mentioned in the issues.

So here’s how the triangulator is implemented in the code:

  1. runner.py calls triangulator.py using self.triangulator, which uses RANSAC with the Direct Linear Transform (DLT) to triangulate points and then runs bundle adjustment.

    triangulator.py line 122

          # For initialization
          # we first triangulate a point cloud for each pair of query-reference images,
          # i.e., we have S-1 3D point clouds
          # points_3d_pair: S-1 x N x 3
          (points_3d_pair, cheirality_mask_pair, triangle_value_pair) = (
              triangulate_by_pair(extrinsics[None], tracks_normalized[None])
          )
    
          # Check which point cloud can provide sufficient inliers
          # that pass the triangulation angle and cheirality check
          # Pick the highest inlier_geo_vis one as the initial point cloud
          inlier_total, valid_tri_angle_thres = find_best_initial_pair(
              inlier_geo_vis,
              cheirality_mask_pair,
              triangle_value_pair,
              init_tri_angle_thres,
          )
    
  2. triangulator.py calls utils/triangulator.py using the init_BA function, which uses pycolmap for BA:

    utils/triangulator.py line 122

          # Convert PyTorch tensors to the format of Pycolmap
          # Prepare for the Bundle Adjustment Optimization
          # NOTE although we use pycolmap for BA here, but any BA library should be able to achieve the same result
          reconstruction = batch_matrix_to_pycolmap(
              toBA_points3D,
              toBA_extrinsics,
              toBA_intrinsics,
              toBA_tracks,
              toBA_masks,
              image_size,
              extra_params=toBA_extra_params,
              shared_camera=shared_camera,
              camera_type=camera_type,
          )
         
          # Prepare BA options
          ba_options = prepare_ba_options()
         
          # Conduct BA
          pycolmap.bundle_adjustment(reconstruction, ba_options)