For this tutorial, I collected information from multiple resources:
@inproceedings{schoenberger2016sfm,
author = {Sch\"{o}nberger, Johannes Lutz and Frahm, Jan-Michael},
title = {Structure-from-Motion Revisited},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016},
}
@inproceedings{wang2024vggsfm,
title = {VGGSfM: Visual Geometry Grounded Deep Structure From Motion},
author = {Wang, Jianyuan and Karaev, Nikita and Rupprecht, Christian and Novotny, David},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages = {21686--21697},
year = {2024}
}
NeRF needs camera poses to cast rays into the scene and render it in 3D to compute the training loss. Without SfM, NeRF would be shooting rays into the wild. End-to-end training from 2D images all the way to a 3D reconstruction would be a nice thing to have.
source: (Hartley and Zisserman, 2004)
In simple words: the ray that passes through $x$ is projected onto the second view as the epipolar line $l'$. This is very important, as it limits the search for the image of $X$ in the second view to $l'$.
source: (Hartley and Zisserman, 2004)
Epipolar geometry describes this relation between two views of a moving camera through the essential matrix $E$ (calibrated) or the fundamental matrix $F$ (uncalibrated). source: (Schonberger and Frahm, 2016)
source: (Hartley and Zisserman, 2004)
source: (Schonberger and Frahm, 2016)
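To make this concrete, here is a minimal NumPy sketch of the epipolar constraint (the fundamental matrix $F$ and the matched points are assumed to be given): for a correspondence $x \leftrightarrow x'$, the epipolar line in the second view is $l' = Fx$, and a true match satisfies $x'^\top F x = 0$.
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F @ x in the second view for a point x = (u, v) in the first view."""
    x_h = np.array([x[0], x[1], 1.0])   # homogeneous coordinates
    return F @ x_h                      # line coefficients (a, b, c): a*u' + b*v' + c = 0

def epipolar_error(F, x, x_prime):
    """Algebraic epipolar error x'^T F x; close to zero for a true correspondence."""
    x_h = np.array([x[0], x[1], 1.0])
    xp_h = np.array([x_prime[0], x_prime[1], 1.0])
    return float(xp_h @ F @ x_h)

# Hypothetical usage, with F and a candidate match (x, x_prime) assumed given:
# l_prime = epipolar_line(F, x)        # search for the match of x only along this line
# err = epipolar_error(F, x, x_prime)  # should be near zero for a true match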
These notes were taken from incremental_mapper_impl.h.
COLMAP implements incremental SfM using three files:
- colmap/src/colmap/controllers/incremental_pipeline.cc
- colmap/src/colmap/sfm/incremental_mapper.cc
- colmap/src/colmap/sfm/incremental_mapper_impl.cc
The function InitializeReconstruction inside colmap/src/colmap/controllers/incremental_pipeline.cc initializes the reconstruction by calling FindInitialImagePair in colmap/src/colmap/sfm/incremental_mapper.cc.
FindFirstInitialImage: finds a first image by sorting all images in a way that prioritizes images with a large number of correspondences and with camera calibration priors.
FindSecondInitialImage: orders images in a list that places at the top the images with a large number of correspondences to the first image and with camera calibration priors.
A large number of correspondences makes an image easy to pair with a second image, and camera calibration priors allow COLMAP to use the essential matrix $E$ to estimate the camera poses.
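As a rough illustration of this prioritization (not COLMAP's actual code), one can think of it as sorting by a score that rewards calibration priors and many correspondences; the dictionary fields below are hypothetical.
# Hypothetical sketch of the ranking idea used to pick the initial images.
# `images` is assumed to be a list of dicts with the fields shown below.
def rank_initial_candidates(images):
    return sorted(
        images,
        key=lambda im: (im["has_calibration_prior"], im["num_correspondences"]),
        reverse=True,  # calibrated images with many correspondences come first
    )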
In addition to incremental_pipeline.cc, incremental_mapper.cc, and incremental_mapper_impl.cc we have two more files to estimate two view geometry between the initial pair of images:
- colmap/src/colmap/estimators/two_view_geometry.cc
- colmap/src/colmap/geometry/essential_matrix.cc
- colmap/src/colmap/geometry/pose.cc
EstimateCalibratedTwoViewGeometry: estimates two-view geometry from a calibrated image pair.
EstimateTwoViewGeometryPose: estimates the relative pose for two-view geometry.
// Try to recover relative pose for calibrated and uncalibrated
// configurations. In the uncalibrated case, this most likely leads to a
// ill-defined reconstruction, but sometimes it succeeds anyways after e.g.
// subsequent bundle-adjustment etc.
PoseFromEssentialMatrix: recovers the most probable pose from the given essential matrix.
// Decompose an essential matrix into the possible rotations and translations.
//
// The first pose is assumed to be P = [I | 0] and the set of four other
// possible second poses are defined as: {[R1 | t], [R2 | t],
// [R1 | -t], [R2 | -t]}
//
// @param E 3x3 essential matrix.
// @param R1 First possible 3x3 rotation matrix.
// @param R2 Second possible 3x3 rotation matrix.
// @param t 3x1 possible translation vector (also -t possible).
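The decomposition itself is standard: with $E = U \,\mathrm{diag}(1,1,0)\, V^\top$ and $W$ the 90° rotation about the $z$-axis, the two candidate rotations are $U W V^\top$ and $U W^\top V^\top$, and $t$ is the last column of $U$. A minimal NumPy sketch (not COLMAP's implementation):
import numpy as np

def decompose_essential_matrix(E):
    """Return the two candidate rotations R1, R2 and the translation t (up to sign) from E."""
    U, _, Vt = np.linalg.svd(E)
    # Enforce proper rotations (det = +1)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]  # translation is recovered only up to scale and sign
    return R1, R2, t

# The four candidate second poses are [R1 | t], [R2 | t], [R1 | -t], [R2 | -t];
# the cheirality check below selects the one with points in front of both cameras.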
CheckCheirality
// Perform cheirality constraint test, i.e., determine which of the triangulated
// correspondences lie in front of both cameras.
//
// @param cam2_from_cam1 Relative camera transformation.
// @param points1 First set of corresponding points.
// @param points2 Second set of corresponding points.
// @param points3D Points that lie in front of both cameras.
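A minimal sketch of the cheirality test (assuming the relative pose [R | t] maps camera-1 coordinates to camera-2 coordinates, and that the points are given in normalized camera coordinates):
import numpy as np

def triangulate_point(R, t, x1, x2):
    """Linear (DLT) triangulation of one correspondence in normalized coordinates."""
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera: [I | 0]
    P2 = np.hstack([R, t.reshape(3, 1)])           # second camera: [R | t]
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def check_cheirality(R, t, points1, points2):
    """Keep only triangulated points with positive depth in both cameras."""
    points3D = []
    for x1, x2 in zip(points1, points2):
        X = triangulate_point(R, t, x1, x2)
        depth1 = X[2]              # depth in the first camera
        depth2 = (R @ X + t)[2]    # depth in the second camera
        if depth1 > 0 and depth2 > 0:
            points3D.append(X)
    return points3D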
source: (Hartley and Zisserman, 2004)
New images can be registered to the current model by solving the Perspective-n-Point (PnP) problem using feature correspondences to triangulated points in already registered images (2D-3D correspondences). The PnP problem involves estimating the pose $P_c$ and, for uncalibrated cameras, its intrinsic parameters. The set $𝒫$ is thus extended by the pose $P_c$ of the newly registered image (Schönberger and Frahm, 2016).
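For a calibrated camera, this 2D-3D registration step can be sketched with OpenCV's RANSAC PnP solver (a stand-in for illustration, not what COLMAP uses internally); points3D and points2D are assumed to be corresponding arrays and K the intrinsics:
import cv2
import numpy as np

def register_image(points3D, points2D, K):
    """Estimate the pose of a new image from 2D-3D correspondences (PnP + RANSAC)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points3D, dtype=np.float64),
        np.asarray(points2D, dtype=np.float64),
        K, None,                 # no lens distortion assumed
        reprojectionError=4.0,   # pixel threshold for counting inliers
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec, inliers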
FindNextImages: sorts images in a way that prioritizes images with a sufficient number of visible points.
RegisterNextImage
A newly registered image must observe existing scene points. In addition, it may also increase scene coverage by extending the set of points $𝒳$ through triangulation. A new scene point $X_k$ can be triangulated and added to $𝒳$ as soon as at least one more image, also covering the new scene part but from a different viewpoint, is registered (Schönberger and Frahm, 2016).
Without further refinement, SfM usually drifts quickly to a non-recoverable state. Bundle adjustment is the joint non-linear refinement of camera parameters $P_c$ and point parameters $X_k$ that minimizes the reprojection error:
$E = \sum_j \rho_j \Big( \left\lVert \pi (P_c, X_k) - x_j \right\rVert^{2}_{2} \Big)$
- $\pi$: a function that projects scene points to image space
- $\rho_j$: the Cauchy function as the robust loss function to potentially down-weight outliers
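A minimal sketch of this objective using SciPy's robust least squares (a simplified stand-in for Ceres; a single shared pinhole intrinsics matrix and angle-axis rotations are assumed):
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(R, t, K, X):
    """Pinhole projection pi(P_c, X_k) of a 3D point X into image coordinates."""
    Xc = R @ X + t
    x = K @ (Xc / Xc[2])
    return x[:2]

def reprojection_residuals(params, K, observations, n_cams, n_pts):
    """Stack the reprojection errors pi(P_c, X_k) - x_j over all observations.
    params packs one angle-axis rotation + translation (6 values) per camera,
    followed by the 3D points; observations is a list of (cam_idx, pt_idx, xy)."""
    cams = params[: n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for cam_idx, pt_idx, xy in observations:
        R = Rotation.from_rotvec(cams[cam_idx, :3]).as_matrix()
        t = cams[cam_idx, 3:]
        res.extend(project(R, t, K, pts[pt_idx]) - xy)
    return np.asarray(res)

# loss="cauchy" plays the role of rho_j, down-weighting outlier observations:
# result = least_squares(reprojection_residuals, x0, loss="cauchy",
#                        args=(K, observations, n_cams, n_pts))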
IterativeLocalRefinement: iteratively calls AdjustLocalBundle.
AdjustLocalBundle
// Adjust locally connected images and points of a reference image. In
// addition, refine the provided 3D points. Only images connected to the
// reference image are optimized. If the provided 3D points are not locally
// connected to the reference image, their observing images are set as
// constant in the adjustment.
IterativeGlobalRefinement: iteratively calls AdjustGlobalBundle.
AdjustGlobalBundle: global bundle adjustment using the Ceres Solver, which is usually used to solve non-linear least squares problems.
From (Wang et al., 2024) appendix A, the training process involves multiple stages:
source: (Wang et al., 2024)
source: (Yang et al., 2020)
runner.py calls track_predictor.py in predict_tracks or predict_tracks_in_chunks (to avoid memory issues).
runner.py line 1315
fine_pred_track, _, pred_vis, pred_score = track_predictor(
    images_feed,
    split_points,
    fmaps=fmaps_feed,
    fine_tracking=fine_tracking,
)
track_predictor.py calls base_track_predictor.py twice: once for the coarse_predictor and once for the fine_predictor.
track_predictor.py line 91
# Coarse prediction
coarse_pred_track_lists, pred_vis = self.coarse_predictor(
    query_points=query_points,
    fmaps=fmaps,
    iters=coarse_iters,
    down_ratio=self.coarse_down_ratio,
)
coarse_pred_track = coarse_pred_track_lists[-1]
base_track_predictor.py takes query points and their feature maps as inputs and returns 2D positions and visibility:
"""
query_points: B x N x 2, the number of batches, tracks, and xy
fmaps: B x S x C x HH x WW, the number of batches, frames, and feature dimension.
note HH and WW is the size of feature maps instead of original images
"""
CorrBlock from blocks.py
# Compute the correlation (check the implementation of CorrBlock)
if self.efficient_corr:
    fcorrs = fcorr_fn.sample(coords, track_feats)
else:
    fcorr_fn.corr(track_feats)
    fcorrs = fcorr_fn.sample(coords)  # B, S, N, corrdim
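Conceptually, the correlation measures how similar each track's feature vector is to the feature-map vectors sampled around the track's current position. A hedged sketch of that idea (hypothetical shapes, not the actual CorrBlock API):
import torch

def local_correlation(track_feats, fmap_patches):
    """track_feats:  B x N x C     one feature vector per track
    fmap_patches: B x N x P x C    P feature vectors sampled around each track's current position
    returns:      B x N x P        dot-product similarity with each sampled location"""
    return torch.einsum("bnc,bnpc->bnp", track_feats, fmap_patches)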
Then it passes query_points $\{ \hat{y}_1^1, \cdots, \hat{y}_1^{N_T} \}$, correlations $V \in ℝ^{N_T \times N_I \times C}$, and track_feats $\{ m_1^1, \cdots, m_1^{N_T} \}$ to a transformer named EfficientUpdateFormer in blocks.py.
# Concatenate them as the input for the transformers
transformer_input = torch.cat(
    [flows_emb, fcorrs_, track_feats_], dim=2
)
if return_feat:
    return coord_preds, vis_e, track_feats, query_track_feat
else:
    return coord_preds, vis_e
In the (Wang et al., 2024) paper, they mention that the camera initializer was designed as follows:
source: (Wang et al., 2024)
But at the time of writing, the camera initializer in the code does not use track features, in order to make it faster. This was mentioned in the issues.
So here’s how the camera initializer is implemented in the code:
runner.py calls camera_predictor.py using self.camera_predictor, which passes image features to a transformer and refines the camera poses iteratively:
camera_predictor.py line 187
for iter_num in range(iters):
    pred_pose_enc = pred_pose_enc.detach()
    # Embed the camera parameters and add to rgb_feat
    pose_embed = self.embed_pose(pred_pose_enc)
    rgb_feat = rgb_feat + pose_embed
    # Run trunk transformers on rgb_feat
    rgb_feat = self.trunk(rgb_feat)
    # Predict the delta feat and pose encoding at each iteration
    delta = self.pose_branch(rgb_feat)
    delta_pred_pose_enc = delta[..., : self.target_dim]
    delta_feat = delta[..., self.target_dim :]
    rgb_feat = self.ffeat_updater(self.norm(delta_feat)) + rgb_feat
    pred_pose_enc = pred_pose_enc + delta_pred_pose_enc
    # Residual connection
    rgb_feat = (rgb_feat + rgb_feat_init) / 2
runner.py calls estimate_preliminary.py using estimate_preliminary_cameras_poselib or estimate_preliminary_cameras. The difference is explained in the comments:
runner.py line 474
# Estimate preliminary_cameras by recovering fundamental/essential/homography matrix from 2D matches
# By default, we use fundamental matrix estimation with 7p/8p+LORANSAC
# All the operations are batched and differentiable (if necessary)
# except when you enable use_poselib to save GPU memory
_, preliminary_dict = estimate_preliminary_cameras_fn(
    pred_track,
    pred_vis,
    width,
    height,
    tracks_score=pred_score,
    max_error=self.cfg.fmat_thres,
    loopresidual=True,
)
estimate_preliminary.py performs three main tasks:
- fmat: (B*(S-1))x3x3, where S is the number of frames. fundamental.py estimates the fundamental matrix with the 7pt/8pt algorithm + LO-RANSAC and returns the candidate with the highest inlier count.
- kmat1, kmat2: (B*(S-1))x3x3, where the focal length is set to max(width, height) and the principal point is set to (width//2, height//2).
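In other words, when no calibration is available the intrinsics start from a simple default guess. A minimal sketch of that initialization for a single camera (the batched (B*(S-1))x3x3 shape is omitted here):
import numpy as np

def default_intrinsics(width, height):
    """Default K: focal length max(width, height), principal point at the image center."""
    f = float(max(width, height))
    return np.array([
        [f,   0.0, width // 2],
        [0.0, f,   height // 2],
        [0.0, 0.0, 1.0],
    ])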
In the (Wang et al., 2024) paper, they mention that they used a transformer for the triangulator:
source: (Wang et al., 2024)
But at the time of writing, the learnable parameters were removed from the triangulator in the code to simplify inference. This was mentioned in the issues.
So here’s how the triangulator is implemented in the code:
runner.py calls triangulator.py using self.triangulator, which uses RANSAC and the Direct Linear Transform (DLT) to triangulate points and bundle adjust.
triangulator.py line 122
# For initialization
# we first triangulate a point cloud for each pair of query-reference images,
# i.e., we have S-1 3D point clouds
# points_3d_pair: S-1 x N x 3
(points_3d_pair, cheirality_mask_pair, triangle_value_pair) = (
    triangulate_by_pair(extrinsics[None], tracks_normalized[None])
)
# Check which point cloud can provide sufficient inliers
# that pass the triangulation angle and cheirality check
# Pick the highest inlier_geo_vis one as the initial point cloud
inlier_total, valid_tri_angle_thres = find_best_initial_pair(
    inlier_geo_vis,
    cheirality_mask_pair,
    triangle_value_pair,
    init_tri_angle_thres,
)
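The triangulation angle mentioned here is the angle between the two viewing rays of a 3D point; a minimal NumPy sketch (camera centers and the point are assumed to be in world coordinates):
import numpy as np

def triangulation_angle(center1, center2, point3D):
    """Angle (in degrees) between the rays from the two camera centers to the 3D point."""
    ray1 = point3D - center1
    ray2 = point3D - center2
    cos_angle = np.dot(ray1, ray2) / (np.linalg.norm(ray1) * np.linalg.norm(ray2))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Points with a tiny angle (nearly parallel rays) are unreliable and are filtered out
# by the angle threshold used in find_best_initial_pair above.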
triangulator.py calls utils/triangulator.py using the init_BA function. It uses pycolmap for BA:
utils/triangulator.py line 122
# Convert PyTorch tensors to the format of Pycolmap
# Prepare for the Bundle Adjustment Optimization
# NOTE although we use pycolmap for BA here, but any BA library should be able to achieve the same result
reconstruction = batch_matrix_to_pycolmap(
    toBA_points3D,
    toBA_extrinsics,
    toBA_intrinsics,
    toBA_tracks,
    toBA_masks,
    image_size,
    extra_params=toBA_extra_params,
    shared_camera=shared_camera,
    camera_type=camera_type,
)
# Prepare BA options
ba_options = prepare_ba_options()
# Conduct BA
pycolmap.bundle_adjustment(reconstruction, ba_options)