Global Alignment and Structure from Motion


Global Alignment and Structure from Motion CSE576, Spring 2009 Sameer Agarwal

Overview
- Global refinement for image stitching
- Camera calibration
- Pose estimation and triangulation
- Structure from motion

Readings
- Chapter 3, Noah Snavely's thesis
Supplementary readings:
- Hartley & Zisserman, Multiple View Geometry, Appendices 5 and 6
- Brown & Lowe, Recognizing Panoramas, ICCV 2003

Problem: Drift
Add another copy of the first image at the end; this gives a constraint: yn = y1. There are a bunch of ways to solve this problem:
- add a displacement of (y1 - yn)/(n - 1) to each image after the first
- compute a global warp: y' = y + ax
- run a big optimization problem incorporating this constraint: the best solution, but more complicated, known as "bundle adjustment"

Global optimization
Minimize a global energy function.
What are the variables? The translation tj = (xj, yj) for each image.
What is the objective function? We have a set of matched features pi,j = (ui,j, vi,j); for each point match (pi,j, pi,j+1): pi,j+1 - pi,j = tj+1 - tj

Global optimization
Stack one equation per point match:
p1,2 - p1,1 = t2 - t1
p1,3 - p1,2 = t3 - t2
p2,3 - p2,2 = t3 - t2
...
v4,1 - v4,4 = y1 - y4
and minimize the total squared error, with weights wij = 1 if feature i is visible in images j and j+1, and 0 otherwise.

Global optimization
The stacked equations form a linear system A x = b: A is the 2m x 2n constraint matrix (m matches, n images), x is the 2n x 1 vector of unknown translations, and b is the 2m x 1 vector of measured point displacements.

Global optimization
A x = b defines a least squares problem: minimize ||Ax - b||^2. Solution: the normal equations give x = (A^T A)^-1 A^T b. Problem: there is no unique solution for x (det(A^T A) = 0): we can add a global offset to a solution and get the same error.
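
To make this concrete, here is a minimal numpy sketch of the translation solve. The matches and image count are made up for illustration, and the gauge is fixed by pinning the first image at the origin (as discussed on the next slide).

```python
import numpy as np

# Hypothetical matches: a feature seen in images j and j+1 with measured
# displacement d = p[i,j+1] - p[i,j]; each gives t[j+1] - t[j] = d.
# Each entry is (j, dx, dy); all values are made up for illustration.
matches = [(0, 10.0, 1.0), (1, 9.5, -0.5), (2, 10.2, 0.3), (0, 10.1, 0.9)]
n = 4  # number of images

A = np.zeros((2 * len(matches), 2 * n))
b = np.zeros(2 * len(matches))
for row, (j, dx, dy) in enumerate(matches):
    A[2*row, 2*(j+1)] = 1.0;      A[2*row, 2*j] = -1.0;      b[2*row] = dx
    A[2*row+1, 2*(j+1)+1] = 1.0;  A[2*row+1, 2*j+1] = -1.0;  b[2*row+1] = dy

# Fix the gauge: drop the two columns for t1, i.e. pin image 0 at (0, 0).
t_rest, *_ = np.linalg.lstsq(A[:, 2:], b, rcond=None)
t = np.concatenate([[0.0, 0.0], t_rest]).reshape(n, 2)
print(t)  # per-image translations (tx, ty), image 0 at the origin
```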

Ambiguity in global location
Each of these solutions, e.g. offsetting everything by (0,0), (-100,-100), or (200,-200), has the same error. This is called the gauge ambiguity. Solution: fix the position of one image (e.g., make the origin of the 1st image (0,0)).

Solving for rotations
For a rotational panorama, a feature (u11, v11) in image I1 with focal length f corresponds to the 3D ray p11 = (u11, v11, f); matched rays in images I1 and I2 must satisfy R1 p11 = R2 p22.

Solving for rotations
Minimize the squared error between matched, rotated rays: Σ ||R1 pi1 - R2 pi2||^2.

Parameterizing rotations
How do we parameterize R and ΔR?
- Euler angles: bad idea
- quaternions: 4-vectors on the unit sphere
- axis-angle representation (Rodrigues formula)
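
As a concrete aside, here is a small numpy sketch of the axis-angle (Rodrigues) parameterization; the function name and test values are illustrative, not from the slides.

```python
import numpy as np

def rodrigues(axis, angle):
    """Axis-angle to rotation matrix via the Rodrigues formula:
    R = I + sin(t) [n]x + (1 - cos(t)) [n]x^2, with n a unit axis."""
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)
    K = np.array([[0, -n[2], n[1]],
                  [n[2], 0, -n[0]],
                  [-n[1], n[0], 0]])  # cross-product (skew-symmetric) matrix [n]x
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

R = rodrigues([0, 0, 1], np.pi / 2)  # 90 degree rotation about the z-axis
print(np.round(R, 3))                # rotates the x-axis onto the y-axis
```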

Nonlinear Least Squares

Global alignment
Least-squares solution of min Σ ||Rj pij - Rk pik||^2, or Rj pij - Rk pik = 0.
Use the linearized update (I + [ωj]×) Rj pij - (I + [ωk]×) Rk pik = 0, i.e. [qij]× ωj - [qik]× ωk = qij - qik, where qij = Rj pij.
Estimate the least-squares solution over the {ωj}, then iterate a few times (updating the {Rj}).
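
A hypothetical implementation of this iteration might look as follows; the helper names and the (j, k, p_ij, p_ik) match layout are assumptions for illustration.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]x so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def exp_so3(w):
    """Axis-angle vector w -> rotation matrix (Rodrigues formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def align_rotations(R, matches, n_iters=5):
    """Stack [q_ij]x w_j - [q_ik]x w_k = q_ij - q_ik over all matches,
    solve for the {w_j} by least squares, apply the update, and repeat.
    matches: list of (j, k, p_ij, p_ik) ray pairs."""
    n = len(R)
    for _ in range(n_iters):
        A = np.zeros((3 * len(matches), 3 * n))
        b = np.zeros(3 * len(matches))
        for r, (j, k, pj, pk) in enumerate(matches):
            qj, qk = R[j] @ pj, R[k] @ pk
            A[3*r:3*r+3, 3*j:3*j+3] = skew(qj)
            A[3*r:3*r+3, 3*k:3*k+3] = -skew(qk)
            b[3*r:3*r+3] = qj - qk
        # lstsq returns the minimum-norm solution, which absorbs the
        # gauge (global rotation) freedom of the linearized system.
        w, *_ = np.linalg.lstsq(A, b, rcond=None)
        R = [exp_so3(w[3*i:3*i+3]) @ R[i] for i in range(n)]
    return R
```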

Camera Calibration

Camera calibration
Determine camera parameters from known 3D points or calibration object(s):
- internal or intrinsic parameters, such as focal length, optical center, aspect ratio: what kind of camera?
- external or extrinsic (pose) parameters: where is the camera?
How can we do this?

Camera calibration – approaches
Possible approaches:
- linear regression (least squares)
- non-linear optimization
- vanishing points
- multiple planar patterns
- panoramas (rotational motion)

Image formation equations
A point (Xc, Yc, Zc) in camera coordinates projects to the image as uc = f Xc / Zc, vc = f Yc / Zc, where f is the focal length.

Calibration matrix
Is this form of K good enough? Consider:
- non-square pixels (digital video)
- skew
- radial distortion

Camera matrix
Fold the intrinsic calibration matrix K and the extrinsic pose parameters (R, t) together into a camera matrix: M = K [R | t] (put a 1 in the lower right-hand corner for 11 degrees of freedom).
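
For example, a minimal numpy sketch of building M = K [R | t] and projecting a homogeneous 3D point (all numeric values are illustrative):

```python
import numpy as np

# Intrinsics K: focal length f and principal point (cx, cy); made-up values.
f, cx, cy = 500.0, 320.0, 240.0
K = np.array([[f, 0, cx],
              [0, f, cy],
              [0, 0, 1.0]])

R = np.eye(3)                   # camera rotation (world -> camera)
t = np.array([0.0, 0.0, 5.0])   # camera translation

M = K @ np.hstack([R, t[:, None]])    # 3x4 camera matrix M = K [R | t]

X = np.array([1.0, -0.5, 2.0, 1.0])   # homogeneous 3D point
x = M @ X
u, v = x[0] / x[2], x[1] / x[2]       # perspective divide -> pixel coordinates
print(u, v)
```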

Camera matrix calibration Directly estimate 11 unknowns in the M matrix using known 3D points (Xi,Yi,Zi) and measured feature positions (ui,vi)

Camera matrix calibration
Linear regression: bring the denominator over and solve the set of (over-determined) linear equations. How? Least squares (pseudo-inverse). Is this good enough?
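
One common way to carry this out is the direct linear transform (DLT): rather than a pseudo-inverse on 11 unknowns, the sketch below solves the equivalent homogeneous system for all 12 entries of M by SVD. The function name and the synthetic check are illustrative.

```python
import numpy as np

def calibrate_dlt(X, x):
    """Estimate the 3x4 camera matrix M from known 3D points X (n x 3) and
    measured pixels x (n x 2). Each point gives two equations,
    u*(m3.Xh) = m1.Xh and v*(m3.Xh) = m2.Xh; solve the homogeneous system
    by SVD (up to scale). Needs n >= 6 points in general position."""
    n = X.shape[0]
    Xh = np.hstack([X, np.ones((n, 1))])
    A = np.zeros((2 * n, 12))
    for i in range(n):
        u, v = x[i]
        A[2*i, 0:4] = Xh[i];   A[2*i, 8:12] = -u * Xh[i]
        A[2*i+1, 4:8] = Xh[i]; A[2*i+1, 8:12] = -v * Xh[i]
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)  # smallest right singular vector

# Synthetic check: project points with a known M, then recover it up to scale.
M_true = np.array([[500., 0., 320., 100.],
                   [0., 500., 240., 50.],
                   [0., 0., 1., 5.]])
X = np.random.rand(8, 3) * 2
xh = (M_true @ np.hstack([X, np.ones((8, 1))]).T).T
x = xh[:, :2] / xh[:, 2:3]
M_est = calibrate_dlt(X, x)
print(M_est / M_est[2, 3] * M_true[2, 3])  # matches M_true up to scale
```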

Levenberg-Marquardt
Iterative non-linear least squares [Press '92]: linearize the measurement equations and substitute into the log-likelihood equation, giving a quadratic cost function in Δm.

Levenberg-Marquardt
Iterative non-linear least squares [Press '92]. Solve for the minimum of the quadratic: A Δm = b, where A = J^T J approximates the Hessian and b = -J^T e is built from the error residuals e (J is the Jacobian of the residuals).

Levenberg-Marquardt
What if it doesn't converge?
- multiply the diagonal by (1 + λ), increasing λ until it does
- halve the step size Δm (my favorite)
- use line search
- other ideas?
Uncertainty analysis: covariance Σ = A^-1.
Is maximum likelihood the best idea? How do we start in the vicinity of the global minimum?
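
A minimal generic LM loop, assuming user-supplied residual and Jacobian functions; the toy exponential fit at the end is purely illustrative.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, m0, n_iters=50, lam=1e-3):
    """Minimal LM sketch for minimizing ||r(m)||^2.
    residual(m) -> e (k-vector), jacobian(m) -> J (k x p).
    Solve (J^T J + lam * diag(J^T J)) dm = -J^T e; accept steps that
    reduce the cost (shrink lam), otherwise increase lam and retry."""
    m = np.asarray(m0, dtype=float)
    cost = np.sum(residual(m) ** 2)
    for _ in range(n_iters):
        e, J = residual(m), jacobian(m)
        A = J.T @ J
        b = -J.T @ e
        dm = np.linalg.solve(A + lam * np.diag(np.diag(A)), b)
        new_cost = np.sum(residual(m + dm) ** 2)
        if new_cost < cost:          # good step: accept, relax damping
            m, cost, lam = m + dm, new_cost, lam * 0.5
        else:                        # bad step: increase damping, retry
            lam *= 10.0
    return m

# Toy usage: fit y = a * exp(b * x) to noisy synthetic data.
x = np.linspace(0, 1, 20)
y = 2.0 * np.exp(1.5 * x) + 0.01 * np.random.randn(20)
r = lambda m: m[0] * np.exp(m[1] * x) - y
J = lambda m: np.stack([np.exp(m[1] * x), m[0] * x * np.exp(m[1] * x)], axis=1)
print(levenberg_marquardt(r, J, [1.0, 1.0]))  # should approach (2.0, 1.5)
```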

Camera matrix calibration
Advantages:
- very simple to formulate and solve
- can recover K [R | t] from M using RQ decomposition [Golub & Van Loan 96]
Disadvantages:
- doesn't compute internal parameters
- can give garbage results
- more unknowns than true degrees of freedom
- need a separate camera matrix for each new view

Separate intrinsics / extrinsics
New feature measurement equations; use non-linear minimization. This is a standard technique in photogrammetry, computer vision, and computer graphics:
- [Tsai 87]: also estimates κ1 (freeware @ CMU) http://www.cs.cmu.edu/afs/cs/project/cil/ftp/html/v-source.html
- [Bogart 91]: View Correlation

Intrinsic/extrinsic calibration
Advantages:
- can solve for more than one camera pose at a time
- potentially fewer degrees of freedom
Disadvantages:
- more complex update rules
- need a good initialization (e.g., recover K [R | t] from M)

Multi-plane calibration
Use several images of a planar target held at unknown orientations [Zhang 99]:
- compute the plane homographies Hk
- solve for K^-T K^-1 from the Hk's
- 1 plane if only f is unknown
- 2 planes if (f, uc, vc) are unknown
- 3+ planes for the full K
Code is available from Zhang and OpenCV.
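
Assuming OpenCV's implementation of this multi-plane (Zhang-style) calibration, a typical usage sketch looks like this; the calib_*.jpg filenames and the 9x6 checkerboard are hypothetical.

```python
import cv2
import numpy as np
import glob

# Planar target: a checkerboard with 9x6 inner corners (unit square size).
pattern = (9, 6)
obj = np.zeros((pattern[0] * pattern[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points, size = [], [], None
for path in glob.glob("calib_*.jpg"):   # hypothetical image filenames
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(obj)
        img_points.append(corners)
        size = gray.shape[::-1]

# Calibration from the plane homographies (handled internally by OpenCV).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, size, None, None)
print("reprojection RMS:", rms, "\nK =\n", K)
```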

Rotational motion
Use pure rotation (of a large scene) to estimate f:
- estimate f from pairwise homographies
- re-estimate f from the 360° "gap"
- optimize over all {K, Rj} parameters
[Stein 95; Hartley 97; Shum & Szeliski 00; Kang & Weiss 99]
This is the most accurate way to get f, short of surveying distant points. (Figure: example panoramas with f = 510 and f = 468.)

Pose estimation and triangulation

Pose estimation
Once the internal camera parameters are known, we can compute the camera pose [Tsai 87] [Bogart 91]. Application: superimposing 3D graphics onto video. How do we initialize (R, t)?

Pose estimation
Previous initialization techniques:
- vanishing points [Caprile 90]
- planar pattern [Zhang 99]
Other possibilities:
- Through-the-Lens Camera Control [Gleicher 92]: differential update
- 3+ point "linear methods": [DeMenthon 95] [Quan 99] [Ameller 00]

Pose estimation
Use inter-point distance constraints [Quan 99] [Ameller 00]: solve a set of polynomial equations in the squared depths xi^2 of the points along their viewing rays, then recover R, t using Procrustes analysis.

Triangulation Problem: Given some points in correspondence across two or more images (taken from calibrated cameras), {(uj,vj)}, compute the 3D location X

Triangulation
Method I: intersect viewing rays in 3D; minimize Σj ||X - (Cj + sj Vj)||^2, where:
- X is the unknown 3D point
- Cj is the optical center of camera j
- Vj is the viewing ray for pixel (uj, vj)
- sj is the unknown distance along Vj
Advantage: geometrically intuitive.

Triangulation
Method II: solve linear equations in X. Advantage: very simple.
Method III: non-linear minimization. Advantage: most accurate (minimizes image-plane error).
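
A sketch of Method II in numpy, assuming known 3x4 camera matrices; the toy two-camera check at the end is illustrative.

```python
import numpy as np

def triangulate_linear(Ms, uvs):
    """Linear triangulation. For each camera matrix M (3x4) and measurement
    (u, v), u*(m3.X) = m1.X and v*(m3.X) = m2.X give two homogeneous
    equations in X; stack them over all views and take the smallest right
    singular vector."""
    A = []
    for M, (u, v) in zip(Ms, uvs):
        A.append(u * M[2] - M[0])
        A.append(v * M[2] - M[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Two toy cameras looking down +z, one translated along x (illustrative).
K = np.diag([500.0, 500.0, 1.0])
M1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
M2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X = np.array([0.2, 0.1, 4.0, 1.0])
uv = lambda M: (M @ X)[:2] / (M @ X)[2]
print(triangulate_linear([M1, M2], [uv(M1), uv(M2)]))  # ~ (0.2, 0.1, 4.0)
```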

Structure from Motion

Structure from motion
Given many points in correspondence across several images, {(uij, vij)}, simultaneously compute the 3D location xi of each point and the camera (or motion) parameters (K, Rj, tj). Two main variants: calibrated and uncalibrated (sometimes associated with Euclidean and projective reconstructions, respectively).

Orthographic SFM [Tomasi & Kanade, IJCV 92]
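
The heart of the Tomasi-Kanade method is a rank-3 factorization of the registered measurement matrix; here is a minimal numpy sketch that omits the metric upgrade (the orthonormality constraints on the motion rows).

```python
import numpy as np

def factorize_orthographic(W):
    """Tomasi-Kanade sketch: W is the 2F x P measurement matrix (for each
    of the F frames, one row of u-coordinates and one row of v-coordinates
    of the P tracked points). Registering W (subtracting each row's mean)
    makes it rank 3 under orthography: W = M S."""
    W = W - W.mean(axis=1, keepdims=True)   # register to the point centroid
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])           # 2F x 3 motion (camera) matrix
    S = np.sqrt(s[:3])[:, None] * Vt[:3]    # 3 x P shape matrix
    return M, S   # recovered up to a 3x3 ambiguity; metric upgrade omitted
```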

Results Look at paper figures…

Structure from motion
How many points do we need to match?
- 2 frames: (R, t) has 5 dof; 5 + 3n point locations ≤ 4n point measurements ⇒ n ≥ 5
- k frames: 6(k - 1) - 1 + 3n ≤ 2kn
- We always want to use many more.

Extensions
- Paraperspective [Poelman & Kanade, PAMI 97]
- Sequential factorization [Morita & Kanade, PAMI 97]
- Factorization under perspective [Christy & Horaud, PAMI 96] [Sturm & Triggs, ECCV 96]
- Factorization with uncertainty [Anandan & Irani, IJCV 2002]

Bundle Adjustment
What makes this non-linear minimization hard?
- many more parameters: potentially slow
- poorer conditioning (high correlation)
- potentially lots of outliers
- gauge (coordinate) freedom

Lots of parameters: sparsity
Only a few entries in the Jacobian are non-zero: each reprojection residual depends on just one camera and one point.
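
One way to exploit this in practice is to hand a sparsity pattern to a sparse-aware solver; the sketch below builds such a pattern for scipy.optimize.least_squares (the 9-parameters-per-camera layout is an assumption for illustration).

```python
import numpy as np
from scipy.sparse import lil_matrix

def ba_sparsity(n_cams, n_pts, cam_idx, pt_idx):
    """Jacobian sparsity pattern for bundle adjustment: each 2D observation's
    two residuals depend only on one camera's parameters (here 9: rotation,
    translation, intrinsics) and one point's 3 coordinates. cam_idx and
    pt_idx are integer arrays, one entry per observation. Pass the result to
    scipy.optimize.least_squares via its jac_sparsity argument."""
    m = 2 * len(cam_idx)                    # two residuals per observation
    n = 9 * n_cams + 3 * n_pts
    S = lil_matrix((m, n), dtype=int)
    obs = np.arange(len(cam_idx))
    for k in range(9):                      # camera parameter columns
        S[2*obs, 9*cam_idx + k] = 1
        S[2*obs + 1, 9*cam_idx + k] = 1
    for k in range(3):                      # point coordinate columns
        S[2*obs, 9*n_cams + 3*pt_idx + k] = 1
        S[2*obs + 1, 9*n_cams + 3*pt_idx + k] = 1
    return S

# e.g.: 2 cameras, 2 points, 3 observations
print(ba_sparsity(2, 2, np.array([0, 0, 1]), np.array([0, 1, 1])).toarray())
```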

Robust error models
Outlier rejection: use a robust penalty applied to each set of joint measurements; for extremely bad data, use random sampling [RANSAC, Fischler & Bolles, CACM '81].

Structure from motion
(Figure: reconstruction shown from the side and from the top.)
Input: images with points in correspondence, pi,j = (ui,j, vi,j)
Output:
- structure: a 3D location xi for each point pi
- motion: camera parameters Rj, tj
Objective function: minimize the reprojection error.

SfM objective function
Given a point x and rotation and translation R, t, minimize the sum of squared reprojection errors: Σi Σj wij ||P(xi, Rj, tj) - (uij, vij)||^2, where P(xi, Rj, tj) is the predicted image location of point i in image j and (uij, vij) is its observed image location.
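
A direct numpy transcription of this objective might look as follows; the argument layout (per-image R, t lists, an n x m x 2 observation array, and an n x m visibility mask w) is an assumption for illustration.

```python
import numpy as np

def reprojection_error(K, R_list, t_list, X, obs, w):
    """Sum of squared reprojection errors, as in the objective above.
    X: n x 3 points; obs[i, j]: observed (u, v) of point i in image j;
    w[i, j]: 1 if point i is visible in image j, else 0."""
    err = 0.0
    for j, (R, t) in enumerate(zip(R_list, t_list)):
        x = (K @ (X @ R.T + t).T).T        # project all points into image j
        uv = x[:, :2] / x[:, 2:3]          # perspective divide -> predicted (u, v)
        err += np.sum(w[:, j] * np.sum((uv - obs[:, j]) ** 2, axis=1))
    return err
```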

Scene reconstruction
Pipeline: feature detection → pairwise feature matching → correspondence estimation → incremental structure from motion.
Again, the goal of the reconstruction procedure is to automatically estimate the relative positions and orientations, as well as the focal length or zoom, of each of the photographs in a connected component of the scene. The process consists of four steps: detecting features in each of the input photos; matching features between each pair of photos; grouping the matches into correspondences across multiple photos; and using the correspondences to estimate the geometry of the scene and the cameras in a technique known as structure from motion. I'll now describe each of these steps in more detail. [Structure-from-motion techniques for unordered image collections have also been developed by others, such as Brown and Lowe and Schaffalitzky and Zisserman, and the basic pipeline of our system closely follows that of Brown and Lowe. To our knowledge, our system is the first to have been demonstrated on large collections of photos from the Internet.]

Feature detection
Detect features using SIFT [Lowe, IJCV 2004]. The core of the reconstruction process is the ability to reliably match feature points between images. Fortunately, over the last few years feature detection and matching techniques have improved dramatically. Many different techniques exist; we use SIFT, the scale-invariant feature transform, developed by David Lowe. SIFT is designed to be invariant to image scale and rotation, as well as to affine changes in image intensity. Here is an image from the Trevi Fountain, and here are the features SIFT detects in the image, shown as square patches. The squares are scaled and rotated to reflect the scale and orientation of the features.

Feature detection Detect features using SIFT [Lowe, IJCV 2004] So, we begin by detecting SIFT features in each photo. [We detect SIFT features in each photo, resulting in a set of feature locations and 128-byte feature descriptors…]


Feature matching Match features between each pair of images Next, we match features across each pair of photos using approximate nearest neighbor matching.

Feature matching
Refine matching using RANSAC [Fischler & Bolles 1981] to estimate fundamental matrices between pairs. The matches are refined by using RANSAC, a robust model-fitting technique, to estimate a fundamental matrix between each pair of matching images and keeping only the matches consistent with that fundamental matrix. We then link connected components of pairwise feature matches together to form correspondences across multiple images. Once we have correspondences, we run structure from motion to recover the camera and scene geometry. Structure from motion solves the following problem:
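
Using OpenCV, the matching and RANSAC refinement steps described here can be sketched as follows (photo1.jpg and photo2.jpg are hypothetical filenames; the 0.7 ratio-test threshold is a common choice, not from the talk):

```python
import cv2
import numpy as np

img1 = cv2.imread("photo1.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical files
img2 = cv2.imread("photo2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Approximate nearest-neighbor matching with Lowe's ratio test.
flann = cv2.FlannBasedMatcher()
good = [m for m, n in flann.knnMatch(des1, des2, k=2)
        if m.distance < 0.7 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# RANSAC-refined fundamental matrix; keep only the consistent inliers.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
inliers1, inliers2 = pts1[mask.ravel() == 1], pts2[mask.ravel() == 1]
print(len(good), "matches,", len(inliers1), "inliers")
```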

Reconstruction
1. Choose two or three views to seed the reconstruction.
2. Add 3D points via triangulation.
3. Add cameras using pose estimation.
4. Run bundle adjustment.
5. Go to step 2.

Two-view structure from motion
A simpler case: motion can be considered independently of structure. Let's first consider the case where K is known. Each image point (ui,j, vi,j, 1) can be multiplied by K^-1 to form a 3D ray; we call this the calibrated case.

Notes on two-view geometry
(Figure: two image planes with matching points p and p', the epipolar plane, and the two epipolar lines.)
How can we express the epipolar constraint? Answer: there is a 3x3 matrix E such that p'^T E p = 0. E is called the essential matrix.

Properties of the essential matrix
- p'^T E p = 0
- Ep is the epipolar line associated with p
- e and e' are called epipoles: Ee = 0 and E^T e' = 0
- E can be solved for with 5 point matches; see Nistér, An efficient solution to the five-point relative pose problem, PAMI 2004.

The Fundamental matrix
If K is not known, then we use a related matrix called the fundamental matrix, F; this is the uncalibrated case. F can be solved for linearly with eight points, or non-linearly with six or seven points.
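
For reference, here is a numpy sketch of the normalized eight-point algorithm for F; this is a standard textbook method, not code from the talk.

```python
import numpy as np

def eight_point(pts1, pts2):
    """Normalized eight-point algorithm sketch: build the linear system
    x2^T F x1 = 0 from >= 8 correspondences (n x 2 arrays), solve by SVD,
    and enforce rank 2. Hartley normalization is included for stability."""
    def normalize(p):
        c = p.mean(axis=0)
        s = np.sqrt(2) / np.mean(np.linalg.norm(p - c, axis=1))
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
        ph = np.hstack([p, np.ones((len(p), 1))])
        return (T @ ph.T).T, T
    x1, T1 = normalize(pts1)
    x2, T2 = normalize(pts2)
    # Each correspondence contributes one row: outer(x2, x1) flattened.
    A = np.stack([np.outer(b, a).ravel() for a, b in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, s, Vt = np.linalg.svd(F)
    F = U @ np.diag([s[0], s[1], 0]) @ Vt   # enforce the rank-2 constraint
    return T2.T @ F @ T1                    # undo the normalization
```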

Photo Tourism overview
Pipeline: input photographs → scene reconstruction → (relative camera positions and orientations, point cloud, sparse correspondence) → photo explorer.
Our system takes as input an unordered set of photos, either from an Internet search or from a large personal collection. We assume the photos are largely from the same static scene. The first step of our system is to apply computer vision techniques to reconstruct the geometry of the scene. The output of this procedure is the relative positions and orientations of the cameras used to take a connected set of the photographs, as well as a point cloud representing the geometry of the scene and a sparse set of correspondences between the photos. This information is then loaded into our interactive photo explorer tool. [Note: change to Trevi for consistency]