Structure from motion (Tomasi and Kanade)
- Input: a set of point tracks
- Output: 3D location of each point (shape) and camera parameters (motion)
Orthographic SFM: Setup
- $I_1, I_2, \ldots, I_f$: a collection of images (video frames) depicting a rigid scene
- Orthographic projection (no scale)
- $p$ point tracks in those $f$ frames
- Unknown 3D locations: $P_j = (X_j, Y_j, Z_j)^T \in \mathbb{R}^3$, $j = 1, \ldots, p$
- Projected locations: denote by $(x_{ij}, y_{ij})^T$ the location of $P_j$ at frame $i$; then
$$x_{ij} = \mathbf{r}_i^T P_j + c_i, \qquad y_{ij} = \mathbf{s}_i^T P_j + d_i$$
- $\mathbf{r}_i^T, \mathbf{s}_i^T$ are the two top rows of a rotation matrix
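
The projection model above can be sketched numerically as follows. This is only an illustration under assumed values: the helper names `random_rotation` and `project_orthographic`, the five random scene points, and the offsets 0.3 and -1.2 are hypothetical and do not come from the slides.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation: QR of a Gaussian matrix, with det forced to +1."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]      # flipping one column keeps orthogonality, fixes the sign
    return q

def project_orthographic(R, c, d, P):
    """Project the 3 x p points P using rotation R and image offsets (c, d)."""
    r, s = R[0], R[1]           # the two top rows of the rotation matrix
    x = r @ P + c               # x_ij = r_i^T P_j + c_i
    y = s @ P + d               # y_ij = s_i^T P_j + d_i
    return x, y

rng = np.random.default_rng(0)
P = rng.standard_normal((3, 5))     # p = 5 hypothetical scene points
R = random_rotation(rng)
x, y = project_orthographic(R, 0.3, -1.2, P)
```

The third row of `R` never enters the projection, which is why depth along the viewing direction cannot be observed from a single orthographic frame.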
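Orthographic SFM: Objective
- Find $\mathbf{r}_i, \mathbf{s}_i \in \mathbb{R}^3$ and $c_i, d_i \in \mathbb{R}$ that minimize
$$\sum_{i=1}^{f} \sum_{j=1}^{p} \left[ \left( (\mathbf{r}_i^T P_j + c_i) - x_{ij} \right)^2 + \left( (\mathbf{s}_i^T P_j + d_i) - y_{ij} \right)^2 \right]$$
- Subject to $\|\mathbf{r}_i\| = \|\mathbf{s}_i\| = 1$ and $\mathbf{r}_i^T \mathbf{s}_i = 0$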
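Eliminate translation
- We can eliminate translation by expressing each point relative to the centroid of all $p$ points
- Assume without loss of generality that the centroid of $P_1, \ldots, P_p$ coincides with the origin $\mathbf{0} \in \mathbb{R}^3$
- Averaging the projection equations over $j$ then gives $\bar{x}_i = c_i$ and $\bar{y}_i = d_i$, where $(\bar{x}_i, \bar{y}_i)$ denotes the centroid of the image points $(x_{ij}, y_{ij})$ in frame $i$
- Translate each image point by setting
$$x_{ij} \leftarrow x_{ij} - \bar{x}_i, \qquad y_{ij} \leftarrow y_{ij} - \bar{y}_i$$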
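
A small numerical check of the centering argument, as a sketch under assumed values: the camera rows `r`, `s` and the offsets `c`, `d` below are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.standard_normal((3, 6))
P -= P.mean(axis=1, keepdims=True)      # put the scene centroid at the origin

r = np.array([1.0, 0.0, 0.0])           # hypothetical orthonormal camera rows
s = np.array([0.0, 1.0, 0.0])
c, d = 4.0, -2.5                        # hypothetical image translation

x = r @ P + c                           # tracks of the 6 points in one frame
y = s @ P + d

x_centered = x - x.mean()               # subtract the image centroid
y_centered = y - y.mean()
```

Because the scene centroid is at the origin, the image centroid equals `(c, d)`, so the centered coordinates depend only on `r` and `s`.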
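Objective (w/o translation)
- Find $\mathbf{r}_i, \mathbf{s}_i \in \mathbb{R}^3$ that minimize
$$\sum_{i=1}^{f} \sum_{j=1}^{p} \left[ \left( \mathbf{r}_i^T P_j - x_{ij} \right)^2 + \left( \mathbf{s}_i^T P_j - y_{ij} \right)^2 \right]$$
  where $x_{ij}, y_{ij}$ now denote the centered image coordinates
- Subject to $\|\mathbf{r}_i\| = \|\mathbf{s}_i\| = 1$ and $\mathbf{r}_i^T \mathbf{s}_i = 0$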
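Measurement matrix
$$M = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ x_{f1} & x_{f2} & \cdots & x_{fp} \\ y_{11} & y_{12} & \cdots & y_{1p} \\ \vdots & \vdots & & \vdots \\ y_{f1} & y_{f2} & \cdots & y_{fp} \end{bmatrix}_{2f \times p}$$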
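Transformation and shape matrices
$$T = \begin{bmatrix} \mathbf{r}_1^T \\ \vdots \\ \mathbf{r}_f^T \\ \mathbf{s}_1^T \\ \vdots \\ \mathbf{s}_f^T \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ \vdots & \vdots & \vdots \\ r_{f1} & r_{f2} & r_{f3} \\ s_{11} & s_{12} & s_{13} \\ \vdots & \vdots & \vdots \\ s_{f1} & s_{f2} & s_{f3} \end{bmatrix}_{2f \times 3} \qquad S = \begin{bmatrix} X_1 & X_2 & \cdots & X_p \\ Y_1 & Y_2 & \cdots & Y_p \\ Z_1 & Z_2 & \cdots & Z_p \end{bmatrix}_{3 \times p}$$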
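Objective: matrix notation
- Find $T$ and $S$ that minimize $\| M - TS \|_F$
- Subject to $\|\mathbf{r}_i\| = \|\mathbf{s}_i\| = 1$ and $\mathbf{r}_i^T \mathbf{s}_i = 0$
- $M$ is $2f \times p$, $T$ is $2f \times 3$, $S$ is $3 \times p$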
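$M = TS + \text{Noise}$
$$\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ x_{f1} & x_{f2} & \cdots & x_{fp} \\ y_{11} & y_{12} & \cdots & y_{1p} \\ \vdots & \vdots & & \vdots \\ y_{f1} & y_{f2} & \cdots & y_{fp} \end{bmatrix}_{2f \times p} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ \vdots & \vdots & \vdots \\ r_{f1} & r_{f2} & r_{f3} \\ s_{11} & s_{12} & s_{13} \\ \vdots & \vdots & \vdots \\ s_{f1} & s_{f2} & s_{f3} \end{bmatrix}_{2f \times 3} \begin{bmatrix} X_1 & \cdots & X_p \\ Y_1 & \cdots & Y_p \\ Z_1 & \cdots & Z_p \end{bmatrix}_{3 \times p} + \text{Noise}$$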
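
The product structure can be checked on synthetic data. The sketch below assumes 4 frames, 10 points and a noise level of 0.01, all hypothetical; it stacks the camera rows into $T$, forms $M = TS$, and looks at the rank with and without noise.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation via QR, with det forced to +1."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(2)
f, p = 4, 10                                  # hypothetical numbers of frames and points
S = rng.standard_normal((3, p))               # shape matrix, 3 x p
S -= S.mean(axis=1, keepdims=True)            # centroid at the origin

rotations = [random_rotation(rng) for _ in range(f)]
# First all r_i^T rows, then all s_i^T rows, as in the definition of T.
T = np.vstack([R[0] for R in rotations] + [R[1] for R in rotations])   # 2f x 3

M = T @ S                                     # noise-free measurement matrix, 2f x p
rank = np.linalg.matrix_rank(M)

noisy = M + 0.01 * rng.standard_normal(M.shape)
sing = np.linalg.svd(noisy, compute_uv=False) # noise leaks into sigma_4, sigma_5, ...
```

Without noise the rank is exactly 3; with noise the matrix is full rank, but the singular values beyond the third stay small.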
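TK-Factorization
- $M = TS + \text{Noise}$
- Step 1: find a rank-3 approximation to $M$ using the SVD, $M = U \Sigma V^T$, where
  - $U$ is $2f \times 2f$, $U^T U = I$
  - $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots)$ has size $2f \times p$, with $\sigma_1 \ge \sigma_2 \ge \ldots \ge 0$
  - $V$ is $p \times p$, $V^T V = I$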
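TK-Factorization
- $\hat{M} = U \Sigma_3 V^T$, where $\Sigma_3 = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3, 0, 0, \ldots)$
- Note: this is a relaxation; only noise components outside the 3D space are annihilated
- Step 2: factorization
$$\hat{T} = U \sqrt{\Sigma_3}, \qquad \hat{S} = \sqrt{\Sigma_3}\, V^T$$
  where $\sqrt{\Sigma_3}$ is understood as the $3 \times 3$ block $\mathrm{diag}(\sqrt{\sigma_1}, \sqrt{\sigma_2}, \sqrt{\sigma_3})$ acting on the first three columns of $U$ and of $V$, so that $\hat{T}$ is $2f \times 3$, $\hat{S}$ is $3 \times p$, and $\hat{M} = \hat{T} \hat{S}$
- Ambiguity: $\hat{M} = (\hat{T} A)(A^{-1} \hat{S})$ for any non-singular $3 \times 3$ matrix $A$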
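
Steps 1 and 2 can be sketched with NumPy's SVD. The function name `factorize_rank3`, the synthetic rank-3 matrix, and the particular matrix `A` are assumptions made for illustration only.

```python
import numpy as np

def factorize_rank3(M):
    """Return (T_hat, S_hat, M_hat), where M_hat = T_hat @ S_hat is the
    best rank-3 approximation of M in the Frobenius norm."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(sigma[:3])
    T_hat = U[:, :3] * root             # scales columns: U_3 sqrt(Sigma_3), 2f x 3
    S_hat = root[:, None] * Vt[:3]      # scales rows: sqrt(Sigma_3) V_3^T, 3 x p
    return T_hat, S_hat, T_hat @ S_hat

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 12))   # exact rank 3
M_noisy = M + 0.01 * rng.standard_normal(M.shape)
T_hat, S_hat, M_hat = factorize_rank3(M_noisy)

# The factorization is not unique: any invertible 3x3 matrix A gives the same product.
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 3.0]])
M_alt = (T_hat @ A) @ (np.linalg.inv(A) @ S_hat)
```

Splitting the singular values evenly between the two factors is a convention; the ambiguity $A$ absorbs any other split.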
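TK-Factorization
- Step 3: resolve the ambiguity using the constraints $\|\mathbf{r}_i\| = \|\mathbf{s}_i\| = 1$ and $\mathbf{r}_i^T \mathbf{s}_i = 0$
- Let $R_i = \begin{bmatrix} \mathbf{r}_i^T \\ \mathbf{s}_i^T \end{bmatrix}_{2 \times 3}$; note that $R_i R_i^T = I$
- Let $\hat{T}_i = \begin{bmatrix} \hat{\mathbf{r}}_i^T \\ \hat{\mathbf{s}}_i^T \end{bmatrix}_{2 \times 3}$ be the corresponding rows in $\hat{T}$; then $R_i = \hat{T}_i A$
- Find a $3 \times 3$ symmetric matrix $A A^T$ such that
$$\hat{T}_i A A^T \hat{T}_i^T = R_i R_i^T = I$$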
TK-Factorization
- $\hat{T}_i A A^T \hat{T}_i^T = R_i R_i^T = I$
- The equation is linear in $A A^T$
- There are $3f$ equations in 6 unknowns
- Find $A$ by eigen-decomposition: $A A^T = W \Delta W^T$, so that $A = W \sqrt{\Delta}$
- The solution is obtained up to a rotation ambiguity: $\hat{T}_i (AB)(B^T A^T) \hat{T}_i^T = I$ for any $B$ such that $B B^T = I$
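
Step 3 can be sketched as a linear least-squares problem in the six distinct entries of $Q = A A^T$. The helper names (`quad_row`, `solve_metric_upgrade`), the known distortion `A0`, and the synthetic cameras are all hypothetical; clipping negative eigenvalues is an added safeguard for noisy data, not something stated on the slides.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation via QR, with det forced to +1."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def quad_row(a, b):
    """Coefficients of a^T Q b in the unknowns (q11, q12, q13, q22, q23, q33)."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def solve_metric_upgrade(T_hat):
    """Estimate A (up to an orthogonal factor) from the 2f x 3 matrix T_hat."""
    f = T_hat.shape[0] // 2
    rows, rhs = [], []
    for i in range(f):
        r_hat, s_hat = T_hat[i], T_hat[f + i]
        rows += [quad_row(r_hat, r_hat),    # r_hat^T Q r_hat = 1
                 quad_row(s_hat, s_hat),    # s_hat^T Q s_hat = 1
                 quad_row(r_hat, s_hat)]    # r_hat^T Q s_hat = 0
        rhs += [1.0, 1.0, 0.0]
    q, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    delta, W = np.linalg.eigh(Q)            # Q = W diag(delta) W^T
    delta = np.clip(delta, 0.0, None)       # safeguard: noise can make Q indefinite
    return W * np.sqrt(delta)               # A = W sqrt(Delta)

rng = np.random.default_rng(4)
f = 5
rotations = [random_rotation(rng) for _ in range(f)]
T_true = np.vstack([R[0] for R in rotations] + [R[1] for R in rotations])
A0 = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 3.0]])
T_hat = T_true @ np.linalg.inv(A0)          # distorted factor, as produced by Step 2

A = solve_metric_upgrade(T_hat)
T_rec = T_hat @ A                           # rows are orthonormal pairs again
```

The recovered `A` equals `A0` only up to an orthogonal factor, which is the rotation and reflection ambiguity; `A @ A.T` is determined uniquely.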
TK-Factorization: Summary
- Eliminate translation, construct $M$
- Apply the SVD to $M$ to get the rank-3 matrix $\hat{M}$ and factorize $\hat{M} = \hat{T} \hat{S}$ (a $3 \times 3$ ambiguity $A$ remains)
- Resolve the ambiguity: estimate $A A^T$ from orthonormality and factorize to obtain $A$
- The solution is obtained up to rotation and reflection
Incomplete tracks
- Tracks are often incomplete: factorization with missing data
- Rank is difficult to enforce
- Surrogate: minimize the nuclear norm, the sum of singular values $\sigma_1 + \sigma_2 + \sigma_3 + \ldots$
- The nuclear norm is convex; minimization often achieves low rank
- Accurate reconstruction usually requires accounting for perspective distortion
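
To make the surrogate concrete, the sketch below computes the nuclear norm and applies singular value soft-thresholding, the proximal step used by many nuclear-norm solvers. The slides do not name a solver, so this building block is a substitution for illustration; the threshold `tau=0.2`, the function names, and the test matrix are hypothetical. Handling missing entries would additionally need an observation mask, which is not shown.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

def shrink_singular_values(M, tau):
    """Soft-threshold the singular values of M by tau."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(sigma - tau, 0.0)) @ Vt

rng = np.random.default_rng(5)
low_rank = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 12))
noisy = low_rank + 0.01 * rng.standard_normal(low_rank.shape)

shrunk = shrink_singular_values(noisy, tau=0.2)   # small singular values become zero
```

Shrinking every singular value by the same amount zeroes the small ones first, which is how lowering the nuclear norm tends to lower the rank.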
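Perspective projection
- A point $P = (X, Y, Z)$ is projected to
$$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z}$$
- A point rotated by $R$ and translated by $\mathbf{t}$ projects to
$$x = \frac{f(\mathbf{r}_1^T P + t_x)}{\mathbf{r}_3^T P + t_z}, \qquad y = \frac{f(\mathbf{r}_2^T P + t_y)}{\mathbf{r}_3^T P + t_z}$$
- $\mathbf{r}_i^T$ denotes the rows of $R$
- We call $C = K[R, \mathbf{t}]_{3 \times 4}$ a camera matrix: $K$ calibration matrix, $R$ camera orientation, $\mathbf{t}$ camera location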
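
The two forms of the projection can be compared directly. In the sketch below the calibration matrix is assumed to be $K = \mathrm{diag}(f, f, 1)$, which the slides do not state explicitly; the function names and numeric values are hypothetical.

```python
import numpy as np

def project_perspective(P, R, t, focal):
    """Project 3 x p points P: rotate by R, translate by t, divide by depth."""
    Q = R @ P + t[:, None]              # rows are r_k^T P + t_k for k = 1, 2, 3
    return focal * Q[0] / Q[2], focal * Q[1] / Q[2]

def camera_matrix(K, R, t):
    """C = K [R, t], a 3 x 4 matrix acting on homogeneous points."""
    return K @ np.hstack([R, t[:, None]])

focal = 2.0
R = np.eye(3)
t = np.zeros(3)
P = np.array([[1.0], [2.0], [4.0]])     # one point, (X, Y, Z) = (1, 2, 4)

x, y = project_perspective(P, R, t, focal)

K = np.diag([focal, focal, 1.0])        # assumed calibration matrix
h = camera_matrix(K, R, t) @ np.vstack([P, np.ones((1, 1))])
x_h, y_h = h[0] / h[2], h[1] / h[2]     # dehomogenize
```

Here `x` is 0.5 and `y` is 1.0, and the camera-matrix route gives the same values.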
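Bundle adjustment
- Given $p$ points in $f$ frames, $(x_{ij}, y_{ij})$, find camera matrices $C_i$ and positions $P_j$ ($j = 1, \ldots, p$) that minimize
$$\sum_{i=1}^{f} \sum_{j=1}^{p} \left[ \left( \frac{f(\mathbf{r}_{i1}^T P_j + t_{ix})}{\mathbf{r}_{i3}^T P_j + t_{iz}} - x_{ij} \right)^2 + \left( \frac{f(\mathbf{r}_{i2}^T P_j + t_{iy})}{\mathbf{r}_{i3}^T P_j + t_{iz}} - y_{ij} \right)^2 \right]$$
  where $\mathbf{r}_{ik}^T$ is the $k$-th row of $R_i$, $\mathbf{t}_i = (t_{ix}, t_{iy}, t_{iz})^T$, and $f$ inside the fractions is the focal length (the upper limit of the outer sum is the number of frames)
- Alternate optimization:
  - Given $R_i$ and $\mathbf{t}_i$, solve for $P_j$
  - Given $P_j$, solve for $R_i$ and $\mathbf{t}_i$
- A very good initial guess is required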
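
The cost and one half of the alternation can be sketched as follows. The "solve for $P_j$" step is done here by linear triangulation, which minimizes an algebraic error rather than the reprojection error itself; that choice, the three cameras, the focal length 1.5, and all names are assumptions for illustration.

```python
import numpy as np

def rot_y(angle):
    """Rotation about the y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(P, R, t, focal):
    """Perspective projection of a single 3D point P."""
    Q = R @ P + t
    return focal * Q[:2] / Q[2]

def reprojection_error(points, cams, obs, focal):
    """Bundle adjustment cost; obs[i, j] is the observed (x, y) of point j in frame i."""
    total = 0.0
    for i, (R, t) in enumerate(cams):
        for j, P in enumerate(points):
            total += np.sum((project(P, R, t, focal) - obs[i, j]) ** 2)
    return total

def triangulate(cams, obs_j, focal):
    """Given all cameras, solve linearly for one point from its observations.
    Uses x (r3.P + tz) = f (r1.P + tx), and the same for y, as equations in P."""
    rows, rhs = [], []
    for (R, t), (x, y) in zip(cams, obs_j):
        rows.append(x * R[2] - focal * R[0])
        rhs.append(focal * t[0] - x * t[2])
        rows.append(y * R[2] - focal * R[1])
        rhs.append(focal * t[1] - y * t[2])
    P, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return P

focal = 1.5
cams = [(rot_y(a), np.array([tx, 0.0, 0.0]))
        for a, tx in [(-0.2, 0.5), (0.0, 0.0), (0.2, -0.5)]]
rng = np.random.default_rng(7)
points = 0.5 * rng.standard_normal((6, 3)) + np.array([0.0, 0.0, 5.0])
obs = np.array([[project(P, R, t, focal) for P in points] for R, t in cams])

start = points + 0.3 * rng.standard_normal(points.shape)   # a poor guess for P_j
refined = np.array([triangulate(cams, obs[:, j], focal) for j in range(len(points))])
```

With noise-free observations and known cameras, the linear step recovers the points exactly; in a full alternation it would be followed by the complementary step that updates $R_i$ and $\mathbf{t}_i$ with the points held fixed.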
Bundler (photo-tourism) (Snavely et al.)
Bundler (photo-tourism)
- Given images, identify feature points and describe them with SIFT descriptors
- Match SIFT descriptors; accept each match $p_i \leftrightarrow p_j$ whose score is at least twice that of any other match $p_i \leftrightarrow p_k$
- For every pair of images with sufficiently many matches, use RANSAC to recover essential matrices
- Starting with two images and adding one image at a time: use the essential matrix to recover depth and apply bundle adjustment
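
The acceptance rule for matches can be sketched in a few lines. The sketch assumes non-negative similarity scores where higher is better; the function name `ratio_test_matches` and the score table are hypothetical, and this is not Bundler's actual code. If descriptor distances are used instead, the inequality is applied the other way round (the best distance must be sufficiently smaller than the second best).

```python
def ratio_test_matches(scores, ratio=2.0):
    """scores[i][k] is the similarity of feature i in one image to feature k
    in the other. Keep (i, k_best) only if the best score is at least
    `ratio` times every other score in row i."""
    matches = []
    for i, row in enumerate(scores):
        if not row:
            continue
        order = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
        best = order[0]
        runner_up = row[order[1]] if len(row) > 1 else 0.0
        if row[best] >= ratio * runner_up:
            matches.append((i, best))
    return matches

scores = [
    [0.9, 0.3, 0.1],     # clear winner: accepted
    [0.5, 0.4, 0.45],    # ambiguous: rejected
    [0.2, 0.8, 0.1],     # clear winner: accepted
]
matches = ratio_test_matches(scores)   # [(0, 0), (2, 1)]
```

Rejecting ambiguous matches early reduces the number of outliers that RANSAC has to cope with in the next step.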
Simultaneous solutions
- $E_{ij}$: essential matrix between $I_i$ and $I_j$, $i, j = 1, \ldots, f$
- $E_{ij} = [\mathbf{t}_{ij}]_\times R_{ij}$ (on a subset of image pairs)
- Objective: recover camera orientation $R_i$ and location $\mathbf{t}_i$ relative to a global coordinate system
$$\min_{\{R_i\}} \| R_{ij} - R_i R_j^T \|_F \quad \text{over the available pairs } (i, j)$$
- This can be solved in various ways, for example
$$\min_{\{R_i\}} \| R_{ij} R_j - R_i \|_F$$
  which has a least-squares solution if we ignore the orthonormality constraints for $R_i$
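
One way to carry out the relaxed problem is sketched below: stack the equations $R_{ij} R_j - R_i = 0$ into a homogeneous linear system, take its three smallest right singular vectors, and project each $3 \times 3$ block back onto a rotation. Using the SVD null space with a unit-norm constraint and fixing camera 0 as the reference are choices made here for illustration, not details from the slides; all names and the synthetic data are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation via QR, with det forced to +1."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def project_to_rotation(B):
    """Closest rotation to B in the Frobenius norm, via SVD."""
    U, _, Vt = np.linalg.svd(B)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

def global_rotations(n, relative):
    """relative maps (i, j) to R_ij, approximately R_i R_j^T.
    Returns the rotations expressed relative to camera 0."""
    A = np.zeros((3 * len(relative), 3 * n))
    for row, ((i, j), R_ij) in enumerate(relative.items()):
        A[3 * row:3 * row + 3, 3 * j:3 * j + 3] = R_ij          # + R_ij R_j
        A[3 * row:3 * row + 3, 3 * i:3 * i + 3] = -np.eye(3)    # - R_i
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-3:].T                       # 3n x 3 basis of the (near) null space
    blocks = [X[3 * i:3 * i + 3] for i in range(n)]
    ref_inv = np.linalg.inv(blocks[0])  # removes the unknown 3x3 mixing of the basis
    return [project_to_rotation(B @ ref_inv) for B in blocks]

rng = np.random.default_rng(8)
n = 4
truth = [random_rotation(rng) for _ in range(n)]
pairs = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]      # a connected subset of pairs
relative = {(i, j): truth[i] @ truth[j].T for i, j in pairs}

estimated = global_rotations(n, relative)
```

The result is defined only up to a common rotation of all cameras, which is why the first camera comes out as the identity.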
Essential in global coordinates
- Corresponding points $p$ and $q$ satisfy the relation
$$p^T R_i^T \left( [\mathbf{t}_i]_\times - [\mathbf{t}_j]_\times \right) R_j \, q = 0$$
- This generalizes the formula for the essential matrix (plug in $R_i = I$, $\mathbf{t}_i = \mathbf{0}$)
- Once the camera orientations $R_i$ are known, we can solve for the camera locations
- The solution suffers from shrinkage problems
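
The relation can be verified numerically. The sketch assumes a camera-to-world convention, in which a world point satisfies $P = \lambda R_i p + \mathbf{t}_i$ for some depth $\lambda$, with $p$ and $q$ given as calibrated ray directions; the slides do not spell out the convention, and the poses and points below are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix: skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def rot_y(angle):
    """Rotation about the y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def ray(P, R, t):
    """Unit viewing direction of world point P in a camera with pose (R, t)."""
    v = R.T @ (P - t)
    return v / np.linalg.norm(v)

def epipolar_residual(p, q, R_i, t_i, R_j, t_j):
    return p @ R_i.T @ (skew(t_i) - skew(t_j)) @ R_j @ q

R_i, t_i = np.eye(3), np.zeros(3)
R_j, t_j = rot_y(0.3), np.array([1.0, 0.0, 0.0])

P = np.array([0.5, 0.2, 4.0])           # a world point seen by both cameras
other = np.array([0.5, 1.0, 4.0])       # a different world point

good = epipolar_residual(ray(P, R_i, t_i), ray(P, R_j, t_j), R_i, t_i, R_j, t_j)
bad = epipolar_residual(ray(P, R_i, t_i), ray(other, R_j, t_j), R_i, t_i, R_j, t_j)
```

For a true correspondence the residual vanishes up to rounding, while pairing the ray with a different point gives a clearly nonzero value.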
Reconstruction example