Structure-from-Motion Determining the 3-D structure of the world, and/or the motion of a camera using a sequence of images taken by a moving camera. –Equivalently, we can think of the world as moving and the camera as fixed. Like stereo, but the position of the camera isn’t known (and it’s more natural to use many images with little motion between them, not just two with a lot of motion). –We may or may not assume we know the parameters of the camera, such as its focal length.

Structure-from-Motion As with stereo, we can divide the problem: –Correspondence. –Reconstruction. Again, we’ll talk about reconstruction first. –So for the next few classes we assume that each image contains some points, and we know which points match which.


Reconstruction A lot harder than with stereo. Start with a simpler case: scaled orthographic projection (weak perspective). –Recall: in this model we drop the z coordinate and scale all x and y coordinates by the same amount.

Announcements Quiz on April 22 in class (a week from Thursday). –Practice problems handed out Thursday. –Reviews at office hours, possibly in the seminar room across the hall from my office. Questions about PS6?

First: Represent motion We’ll talk about a fixed camera, and a moving object. Key point: the points (a 4×n matrix P, in homogeneous coordinates), some matrix (a 2×4 matrix S), and the image (a 2×n matrix I). Then: I = S P.

Remember what this means. We are representing moving a set of points, projecting them into the image, and scaling them. Matrix multiplication: take the inner product between each row of S and each point. The first row of S produces X coordinates, while the second row produces Y. Projection occurs because S has no third row. Translation occurs with tx and ty. Scaling can be encoded with a scale factor in S. The rest of S must allow the object to rotate.

Examples: S = [s, 0, 0, 0; 0, s, 0, 0]; This is just projection, with scaling by s. S = [s, 0, 0, s*tx; 0, s, 0, s*ty]; This is translation by (tx,ty,something), projection, and scaling.
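As a concrete check of these two forms of S, here is a minimal numpy sketch; the point coordinates and the values of s, tx, ty are made up for illustration:

```python
import numpy as np

# Homogeneous 3-D points as columns of a 4 x n matrix P.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])  # last row is the homogeneous 1

s, tx, ty = 2.0, 3.0, 4.0

# S = [s, 0, 0, 0; 0, s, 0, 0]: just projection, with scaling by s.
S_proj = np.array([[s, 0.0, 0.0, 0.0],
                   [0.0, s, 0.0, 0.0]])

# S = [s, 0, 0, s*tx; 0, s, 0, s*ty]: translation by (tx, ty, something),
# projection, and scaling.
S_trans = np.array([[s, 0.0, 0.0, s * tx],
                    [0.0, s, 0.0, s * ty]])

I_proj = S_proj @ P    # 2 x n image coordinates
I_trans = S_trans @ P  # same, shifted by (s*tx, s*ty)
```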

Structure-from-Motion S encodes: –Projection: S has only two rows. –Scaling, since S can have a scale factor. –Translation, by tx/s and ty/s. –Rotation:

Rotation Represents a 3D rotation of the points in P.

First, look at 2D rotation (easier) Matrix R acts on points by rotating them: R = [cos θ, sin θ; –sin θ, cos θ]. Also, R R^T = Identity. R^T is also a rotation matrix, rotating in the opposite direction to R.

Why does multiplying points by R rotate them? Think of the rows of R as a new coordinate system. Taking the inner product of each point with these rows expresses that point in that coordinate system. This means the rows of R must be orthonormal vectors (orthogonal unit vectors). Think of what happens to the points (1,0) and (0,1). They go to (cos θ, –sin θ) and (sin θ, cos θ). They remain orthonormal, and rotate clockwise by θ. Any other point (a,b) can be thought of as a(1,0) + b(0,1). R(a(1,0) + b(0,1)) = aR(1,0) + bR(0,1). So it’s in the same position relative to the rotated coordinates that it was in before rotation relative to the x, y coordinates. That is, it’s rotated.
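The claims about R can be verified numerically; a small sketch with an arbitrary angle, using the clockwise sign convention from the discussion above:

```python
import numpy as np

theta = np.pi / 6  # any angle

# Rows of R form the new coordinate system; with this sign convention
# points rotate clockwise by theta.
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

# R R^T is the identity, so R^T undoes R.
check_identity = R @ R.T

# (1,0) and (0,1) stay orthonormal after rotation.
e1 = R @ np.array([1.0, 0.0])   # (cos theta, -sin theta)
e2 = R @ np.array([0.0, 1.0])   # (sin theta, cos theta)
```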

Simple 3D Rotation Rotation about z axis. Rotates x,y coordinates. Leaves z coordinates fixed.

Full 3D Rotation Any rotation can be expressed as combination of three rotations about three axes. Rows (and columns) of R are orthonormal vectors. R has determinant 1 (not -1).

Intuitively, it makes sense that 3D rotations can be expressed as 3 separate rotations about fixed axes. Rotations have 3 degrees of freedom; two describe an axis of rotation, and one the amount. Rotations preserve the length of a vector, and the angle between two vectors. Therefore, (1,0,0), (0,1,0), (0,0,1) must be orthonormal after rotation. After rotation, they are the three columns of R. So these columns must be orthonormal vectors for R to be a rotation. Similarly, if they are orthonormal vectors (with determinant 1) R will have the effect of rotating (1,0,0), (0,1,0), (0,0,1). Same reasoning as 2D tells us all other points rotate too. Note if R has determinant -1, then R is a rotation plus a reflection.
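A sketch of composing three axis rotations and checking the stated properties (orthonormal rows and columns, determinant 1, lengths preserved); the helpers below use the standard counterclockwise convention, and the angles and test vector are arbitrary:

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# Any rotation can be expressed as a combination of rotations
# about the three axes.
R = rot_z(0.3) @ rot_y(-1.1) @ rot_x(2.0)

# Rotations preserve the length of a vector.
v = np.array([1.0, 2.0, 3.0])
len_before = np.linalg.norm(v)
len_after = np.linalg.norm(R @ v)
```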

Questions?

Putting it Together Scale, projection, 3D translation, and 3D rotation combine into the single 2×4 matrix S: the first three columns are s times the top two rows of R, and the last column is (s·t_x, s·t_y). We can just write s·t_x as t_x and s·t_y as t_y.
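The combined 2×4 matrix can be assembled directly; a minimal sketch (the helper name and the demo values are hypothetical):

```python
import numpy as np

def weak_perspective_matrix(R, t, s):
    """Assemble the 2 x 4 motion matrix: rotate by R (3x3), translate
    by t (3-vector), project by dropping z, then scale by s. The image
    of a homogeneous point p is s * (R p + t), keeping only x and y."""
    S = np.zeros((2, 4))
    S[:, :3] = s * R[:2, :]   # s times the top two rows of the rotation
    S[:, 3] = s * t[:2]       # s*t_x and s*t_y; we may simply call these t_x, t_y
    return S

# Hypothetical demo: identity rotation, translation (3, 4, 5), scale 2.
S_demo = weak_perspective_matrix(np.eye(3), np.array([3.0, 4.0, 5.0]), 2.0)
img = S_demo @ np.array([1.0, 1.0, 0.0, 1.0])   # image of the point (1, 1, 0)
```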

Affine Structure from Motion

Affine Structure-from-Motion: Two Frames (1)

Affine Structure-from-Motion: Two Frames (2) To make things easy, suppose the first four points are the origin and the three unit points: p1 = (0,0,0), p2 = (1,0,0), p3 = (0,1,0), p4 = (0,0,1).

Affine Structure-from-Motion: Two Frames (3) Looking at the first four points, we get:

Affine Structure-from-Motion: Two Frames (4) We can solve for the motion by inverting the matrix of points. Or, explicitly, we see that the first column on the left (the images of the first point) gives the translations. After solving for these, we can solve for each column of the s components of the motion using the images of each point in turn.
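A sketch of this step on synthetic data: given the images of the four easy points under a made-up "true" motion, inverting the matrix of points recovers that motion, and the first column gives the translations:

```python
import numpy as np

# The four "easy" points -- the origin plus the three unit points --
# as columns, in homogeneous coordinates.
P4 = np.array([[0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0],
               [1.0, 1.0, 1.0, 1.0]])

# A made-up "true" motion, used only to synthesize the images.
S_true = np.array([[0.8, -0.6, 0.0, 2.0],
                   [0.6, 0.8, 0.0, -1.0]])
I4 = S_true @ P4                      # 2 x 4: images of the four points

# Solve for the motion by inverting the matrix of points.
S_recovered = I4 @ np.linalg.inv(P4)

# The images of the first point (the origin) are just the translations.
translation = I4[:, 0]
```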

Affine Structure-from-Motion: Two Frames (5) Once we know the motion, we can use the images of another point to solve for the structure. We have four linear equations, with three unknowns.
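A sketch of solving those four linear equations in three unknowns by least squares; the two motion matrices and the point below are hypothetical:

```python
import numpy as np

# Motion for two frames (2 x 4 each), assumed already known
# from the previous step; the values are made up.
S1 = np.array([[1.0, 0.0, 0.0, 0.5],
               [0.0, 1.0, 0.0, -0.5]])
S2 = np.array([[0.7071, 0.0, -0.7071, 0.0],
               [0.0, 1.0, 0.0, 1.0]])

p_true = np.array([2.0, -1.0, 3.0, 1.0])          # homogeneous 3-D point
obs = np.concatenate([S1 @ p_true, S2 @ p_true])  # four image measurements

# Four linear equations in the three unknowns (x, y, z): stack the two
# motion matrices and move the translation columns to the right-hand side.
A = np.vstack([S1[:, :3], S2[:, :3]])
b = obs - np.concatenate([S1[:, 3], S2[:, 3]])
p = np.linalg.lstsq(A, b, rcond=None)[0]
```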

Affine Structure-from-Motion: Two Frames (6) Suppose we just know where the k’th point is in image 1. Then, we can use the first two equations to write x k and y k as linear in z k. The final two equations lead to two linear equations in the missing values and z k. If we eliminate z k we get one linear equation in the missing values. This means the unknown point lies on a known line. That is, we recover the epipolar constraint. Furthermore, these lines are all parallel.

Affine Structure-from-Motion: Two Frames (7) But, what if the first four points aren’t so simple? Then we define A so that A maps the origin and the three unit points to the actual first four points. This is always possible as long as the points aren’t coplanar.

Affine Structure-from-Motion: Two Frames (8) Then, given: I = S P. We have: I = S (A A⁻¹) P. And: I = (S A)(A⁻¹ P).

Affine Structure-from-Motion: Two Frames (9) Given: I = (S A)(A⁻¹ P). Then we just pretend that S A is our motion (and A⁻¹ P is our structure), and solve as before.

Affine Structure-from-Motion: Two Frames (10) This means that we can never determine the exact 3D structure of the scene. We can only determine it up to some transformation, A. Since if a structure P and motion S explain the points, with I = S P, so does another of the form S A and A⁻¹ P, since (S A)(A⁻¹ P) = S P.

Affine Structure-from-Motion: Two Frames (11) Note that A has the form: a 3×3 linear transformation in its upper-left block, a translation vector in its last column, and (0, 0, 0, 1) as its bottom row. A corresponds to translation of the points, plus a linear transformation.

For example, there is clearly a translational ambiguity in recovering the points. We can’t tell the difference between two point sets that are identical up to a translation when we only see them after they undergo an unknown translation. Similarly, there’s clearly a rotational ambiguity. The rest of the ambiguity is a stretching in an unknown direction.

Let’s take an explicit example of this. Suppose we have a cube with vertices at (0,0,0), (10,0,0), (0,0,10), ….. Suppose we transform this by rotating it by 45 degrees about the y axis, then projecting. Writing sq2 for √2/2, the resulting 2×4 matrix is [sq2, 0, –sq2, 0; 0, 1, 0, 0]. Now suppose instead we had a rectanguloid with corners at (10,0,0), (11,0,0), (10,0,10), …. We can transform this rectanguloid into the first cube by transforming it as: [10, 0, 0, –100; 0, 1, 0, 0; 0, 0, 1, 0; 0, 0, 0, 1]. Then we can apply [sq2, 0, –sq2, 0; 0, 1, 0, 0] to the resulting cube, to generate the same image. Or, we could have combined these two transformations into one: [sq2, 0, –sq2, 0; 0, 1, 0, 0] * [10, 0, 0, –100; 0, 1, 0, 0; 0, 0, 1, 0; 0, 0, 0, 1] = [10·sq2, 0, –sq2, –100·sq2; 0, 1, 0, 0].
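The cube/rectanguloid ambiguity can be verified numerically; a sketch spelling out the matrices (with sq2 = √2/2, and the rectanguloid-to-cube map derived from the stated corners):

```python
import numpy as np

sq2 = np.sqrt(2) / 2

# Three cube corners (homogeneous columns) and the 45-degree rotation
# about y followed by projection.
cube = np.array([[0.0, 10.0, 0.0],
                 [0.0, 0.0, 0.0],
                 [0.0, 0.0, 10.0],
                 [1.0, 1.0, 1.0]])
S = np.array([[sq2, 0.0, -sq2, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

# The corresponding rectanguloid corners, and the affine map A that
# carries the rectanguloid onto the cube (x' = 10x - 100).
rect = np.array([[10.0, 11.0, 10.0],
                 [0.0, 0.0, 0.0],
                 [0.0, 0.0, 10.0],
                 [1.0, 1.0, 1.0]])
A = np.array([[10.0, 0.0, 0.0, -100.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

# Same image either way: S applied to the cube, or the combined
# motion (S A) applied to the rectanguloid.
img_cube = S @ cube
img_rect = (S @ A) @ rect
```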

Affine Structure-from-Motion: A lot of frames (1) Stack all the frames into one equation: I = S P, where I (2m×n) stacks the image measurements from m frames, S (2m×4) stacks the per-frame motion matrices, and P (4×n) holds the points.

First Step: Solve for Translation (1) This is trivial, because we can pick a simple origin. –The world origin is arbitrary. –Example: We can assume the first point is at the origin. Rotation then doesn’t affect that point. All its motion is translation. –Better to pick the center of mass as origin. Average of all points. This also averages out the noise.

Specifically, we can never tell where the world points were to begin with. Adding one to every x coordinate in P and then subtracting 1 from every tx is undetectable. So, without loss of generality, we can assume that sum(P(k,:)) = 0 for k from 1 to 3, i.e., sum(x1 … xn) = 0, sum(y1 … yn) = 0, sum(z1 … zn) = 0. Rotation doesn’t move the origin, which is now the center of mass. Neither does scaled orthographic projection. So, only translation moves it. Explicitly, we assume sum(p) = (0,0,0)^T. Then: sum(s*R(p)) = s*R(sum(p)) = s*R(0,0,0)^T = (0,0,0)^T. (^T means transpose.)

More explicitly, suppose sum(p) = (0,0,0,n)^T (summing the homogeneous coordinates of all n points). Then, sum(R*P) = R*(sum(P)) = R*(0,0,0,n)^T = (0,0,0,n)^T. Sum(T*R*P) = T*(0,0,0,n)^T = (n·tx, n·ty, n·tz, n)^T. (Or just look at the 2×4 projection matrix.) If we subtract tx or ty from every row, then the residual is [s11, s12, s13; s21, s22, s23]*P. That is, I = (s part of the matrix)*P + (t part of the matrix).

First Step: Solve for Translation (2)

First Step: Solve for Translation (3) As if by magic, there’s no translation.
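The cancellation can be checked numerically: subtracting each row's mean from the measurements removes the translation exactly, leaving only the s-part acting on the centered points. The motion, translation, and points below are randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

P = rng.normal(size=(3, 10))           # 10 world points (columns)
S = np.array([[0.8, -0.6, 0.0],
              [0.6, 0.8, 0.0]])        # the s-part of the motion
t = np.array([[5.0], [-2.0]])          # image translation (tx, ty)

I = S @ P + t                          # measured image points

# Subtract each row's mean (the image centroid): the translation
# disappears; the residual is the s-part times the centered points.
I_centered = I - I.mean(axis=1, keepdims=True)
P_centered = P - P.mean(axis=1, keepdims=True)
```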

Rank Theorem After removing translation, I = S P has rank 3. This means there are 3 vectors such that every row of I is a linear combination of these vectors. These vectors are the rows of P.

Solve for S SVD is made to do this. D is diagonal with non-increasing values. U and V have orthonormal rows. Ignoring values that get set to 0, we have U(:,1:3) for S, and D(1:3,1:3)*V(1:3,:) for P.
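A sketch of the factorization step on a synthetic rank-3 measurement matrix (the sizes and entries are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic rank-3 measurement matrix: m frames of n points,
# already centered so there is no translation.
m, n = 5, 12
S = rng.normal(size=(2 * m, 3))        # stacked s-parts of the motion
P = rng.normal(size=(3, n))            # structure
I = S @ P                              # 2m x n, rank 3

U, D, Vt = np.linalg.svd(I)

# D is non-increasing; everything past the third value is
# (numerically) zero. Keep the top three.
S_hat = U[:, :3]                       # plays the role of S
P_hat = np.diag(D[:3]) @ Vt[:3, :]     # plays the role of P
```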

Linear Ambiguity (as before) I = U(:,1:3) * D(1:3,1:3) * V(1:3,:) = (U(:,1:3) * A) * (inv(A) * D(1:3,1:3) * V(1:3,:))

Noise With noise, I has full rank. The best solution is to estimate a matrix that’s as near to I as possible while having rank 3. Our current method (keeping only the top three singular values of the SVD) does this.

Weak Perspective Motion Rows 2k and 2k+1 of S should be orthogonal (they come from the same frame’s rotation). All rows should be unit vectors. (Push all scale into P.) I = (U(:,1:3)*A) * (inv(A)*D(1:3,1:3)*V(1:3,:)). Choose A so (U(:,1:3) * A) satisfies these conditions.
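A sketch of choosing A, using the standard weak-perspective metric constraints (paired rows orthogonal and of equal length, with the overall scale fixed by the first row). The constraints are linear in the symmetric matrix Q = A Aᵀ, so we solve for Q and factor it; the synthetic data at the end is made up:

```python
import numpy as np

def metric_upgrade(M):
    """Find a 3x3 A so that the paired rows of M @ A are orthogonal
    and of equal length, fixing scale by making the first row a unit
    vector. Solves linearly for Q = A A^T, then factors by Cholesky."""
    def coeffs(a, b):
        # a^T Q b as a linear function of the 6 free entries of
        # symmetric Q, ordered [Q00, Q01, Q02, Q11, Q12, Q22].
        return np.array([a[0]*b[0],
                         a[0]*b[1] + a[1]*b[0],
                         a[0]*b[2] + a[2]*b[0],
                         a[1]*b[1],
                         a[1]*b[2] + a[2]*b[1],
                         a[2]*b[2]])
    rows, rhs = [], []
    for k in range(M.shape[0] // 2):
        i, j = M[2*k], M[2*k + 1]
        rows.append(coeffs(i, i) - coeffs(j, j)); rhs.append(0.0)  # |i| = |j|
        rows.append(coeffs(i, j)); rhs.append(0.0)                 # i . j = 0
    rows.append(coeffs(M[0], M[0])); rhs.append(1.0)               # fix scale
    q = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    return np.linalg.cholesky(Q)

# Demo on synthetic data: 5 frames of true rotation rows, distorted by
# an unknown invertible G (the affine ambiguity).
rng = np.random.default_rng(2)
true_rows = np.vstack([np.linalg.qr(rng.normal(size=(3, 3)))[0][:2]
                       for _ in range(5)])
G = rng.normal(size=(3, 3))
M = true_rows @ G                 # affine motion; metric structure lost
C = M @ metric_upgrade(M)         # metric motion recovered (up to a rotation)
```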

Related problems we won’t cover Missing data. Points with different, known noise. Multiple moving objects.

Final Messages Structure-from-motion for points can be reduced to linear algebra. Epipolar constraint reemerges. SVD important. Rank Theorem says the images a scene produces aren’t complicated (also important for recognition).