SIFT Guest Lecture by Jiwon Kim http://www.cs.washington.edu/homes/jwkim/
SIFT Features and Their Applications
Autostitch Demo
Autostitch Fully automatic panorama generation Input: set of images Output: panorama(s) Uses SIFT (Scale-Invariant Feature Transform) to find/align images
1. Solve for homography
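(Not part of the original slides.) A minimal sketch of this step with OpenCV: detect SIFT features in two overlapping images, match them, and solve for the homography with RANSAC. The filenames img1.jpg and img2.jpg are hypothetical.

    import cv2
    import numpy as np

    # Detect SIFT features in two overlapping images (filenames are placeholders)
    img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Match descriptors and keep matches passing Lowe's ratio test
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.8 * n.distance]

    # Solve for the homography robustly with RANSAC
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)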
2. Find connected sets of images
3. Solve for camera parameters New images are initialised with the rotation and focal length of the best-matching image
4. Blending the panorama Multi-band blending [Burt & Adelson 1983]: blend frequency bands over wavelength range λ
2-band Blending Low frequency (λ > 2 pixels) High frequency (λ < 2 pixels)
Linear Blending
2-band Blending
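(Not part of the original slides.) A minimal sketch of the 2-band idea, not the full Burt & Adelson pyramid: split each aligned image into a low and a high band at a ~2 pixel crossover, blend the low band smoothly, and switch the high band sharply at the seam. It assumes single-channel float32 images and a binary seam mask.

    import cv2
    import numpy as np

    def two_band_blend(img1, img2, mask, sigma=2.0):
        """Blend two aligned float32 images; mask is 1 where img1 should win.
        Low frequencies are feathered over a wide region; high frequencies
        switch sharply at the seam to avoid ghosting."""
        low1 = cv2.GaussianBlur(img1, (0, 0), sigma)
        low2 = cv2.GaussianBlur(img2, (0, 0), sigma)
        high1, high2 = img1 - low1, img2 - low2
        # Smooth weights for the low band, hard seam for the high band
        w = cv2.GaussianBlur(mask.astype(np.float32), (0, 0), 8 * sigma)
        low = w * low1 + (1.0 - w) * low2
        high = np.where(mask > 0.5, high1, high2)
        return low + high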
So, what is SIFT? Scale-Invariant Feature Transform, developed by David Lowe at UBC Scale/rotation invariant Currently the best-known feature descriptor Many real-world applications: object recognition, panorama stitching, robot localization, video indexing, …
Example: object recognition
SIFT properties Locality: features are local, so robust to occlusion and clutter Distinctiveness: individual features can be matched to a large database of objects Quantity: many features can be generated for even small objects Efficiency: close to real-time performance
SIFT algorithm overview Feature detection: detect points that can be repeatably selected under location/scale change Feature description: assign an orientation to each detected feature point, then construct a descriptor for the image patch around it Feature matching: match descriptors between images via (approximate) nearest-neighbor search
1. Feature detection Detect points stable under location/scale change Build continuous space (x, y, scale) Approximated by multi-scale Difference-of-Gaussian pyramid Select maxima/minima in (x, y, scale)
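(Not part of the original slides.) A simplified single-octave sketch of the DoG stack; full SIFT repeats this over downsampled octaves and then scans each pixel against its 26 neighbors in the 3x3x3 (x, y, scale) neighborhood to find extrema.

    import cv2
    import numpy as np

    def dog_stack(img, num_levels=5, sigma0=1.6):
        """Build one octave of the Difference-of-Gaussian scale space."""
        gray = img.astype(np.float32)
        k = 2.0 ** (1.0 / (num_levels - 2))  # scale step between levels
        gaussians = [cv2.GaussianBlur(gray, (0, 0), sigma0 * k ** i)
                     for i in range(num_levels)]
        # Differences of adjacent Gaussian levels approximate the Laplacian
        return [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]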
1. Feature detection
1. Feature detection Localize extrema by fitting a quadratic Sub-pixel/sub-scale interpolation using Taylor expansion Take derivative and set to zero
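Written out (from Lowe 2004): the DoG function D is expanded around the sample point, with x = (x, y, σ)ᵀ the offset from that point; setting the derivative to zero gives the sub-pixel/sub-scale location of the extremum:

    D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{\top} \mathbf{x}
                  + \frac{1}{2}\, \mathbf{x}^{\top} \frac{\partial^2 D}{\partial \mathbf{x}^2}\, \mathbf{x},
    \qquad
    \hat{\mathbf{x}} = -\left( \frac{\partial^2 D}{\partial \mathbf{x}^2} \right)^{-1}
                       \frac{\partial D}{\partial \mathbf{x}}

If any component of x̂ exceeds 0.5, the extremum lies closer to a neighboring sample point and the fit is repeated there.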
1. Feature detection Discard low-contrast and edge points Low contrast: discard keypoints with |D(x̂)| < threshold Edge points: high contrast in one direction, low in the other; bound the ratio of principal curvatures using the 2x2 Hessian matrix, as shown below
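The edge test, written out (from Lowe 2004): the principal curvatures are the eigenvalues of the 2x2 Hessian of D at the keypoint, but their ratio can be bounded without computing eigenvalues explicitly:

    H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix},
    \qquad
    \frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(r+1)^2}{r}

Keypoints failing the test for r = 10 are discarded; the low-contrast threshold is |D(x̂)| < 0.03 for image values in [0, 1].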
1. Feature detection Example (a) 233x189 image (b) 832 DoG extrema (c) 729 left after peak value threshold (d) 536 left after testing ratio of principal curvatures
2. Feature description Assign orientation to keypoints Create histogram of local gradient directions computed at selected scale Assign canonical orientation at peak of smoothed histogram
2. Feature description Construct SIFT descriptor Create array of orientation histograms 8 orientations x 4x4 histogram array = 128 dimensions
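(Not part of the original slides.) A quick OpenCV check of the descriptor layout, assuming a hypothetical input image test.jpg:

    import cv2

    img = cv2.imread("test.jpg", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    # One row per keypoint: 8 orientations x 4x4 histogram array = 128 floats
    print(descriptors.shape)  # (num_keypoints, 128)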
2. Feature description Advantage over simple correlation Gradients less sensitive to illumination change Gradients may shift: robust to deformation, viewpoint change
Performance: stability to noise Match features after random change in image scale & orientation, with differing levels of image noise Find nearest neighbor in database of 30,000 features
Performance: stability to affine change Match features after random change in image scale & orientation, with 2% image noise, and affine distortion Find nearest neighbor in database of 30,000 features
Performance: distinctiveness Vary size of database of features, with 30 degree affine change, 2% image noise Measure % correct for single nearest neighbor match
3. Feature matching For each feature in A, find nearest neighbor in B
3. Feature matching Nearest-neighbor search is too slow for a large database of 128-dimensional data Approximate nearest-neighbor search: Best-bin-first [Beis et al. 97]: a modification of the k-d tree algorithm Use a heap data structure to visit bins in order of their distance from the query point Result: can give a speedup by a factor of 1000 while finding the correct nearest neighbor 95% of the time
3. Feature matching Reject false matches Compare the distance of the nearest neighbor to that of the second-nearest neighbor: if the two are nearly equal, the feature is common rather than distinctive, and the match is discarded A ratio threshold of 0.8 provides excellent separation (see the sketch below)
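(Not part of the original slides.) A sketch combining the two ideas above: FLANN's randomized k-d trees for approximate nearest-neighbor search (in the spirit of best-bin-first) plus the 0.8 ratio test. The images query.jpg and database.jpg are hypothetical.

    import cv2

    sift = cv2.SIFT_create()
    _, des1 = sift.detectAndCompute(cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE), None)
    _, des2 = sift.detectAndCompute(cv2.imread("database.jpg", cv2.IMREAD_GRAYSCALE), None)

    # Approximate NN via randomized k-d trees; exact search is too slow in 128-D
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),  # 1 = KDTREE index
                                  dict(checks=50))             # bins to inspect
    matches = flann.knnMatch(des1, des2, k=2)

    # Ratio test: if the 2nd-nearest neighbor is almost as close as the
    # nearest, the feature is not distinctive and the match is rejected
    good = [m for m, n in matches if m.distance < 0.8 * n.distance]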
3. Feature matching Now, given feature matches… Find an object in the scene Solve for homography (panorama) …
3. Feature matching Example: 3D object recognition
3. Feature matching 3D object recognition Assume an affine transform: look for clusters of size >= 3 Looking for 3 correct matches out of 3000 that agree on the same object and pose: too many outliers for RANSAC or LMS Use the Hough transform instead: each match votes for a hypothesis of object ID/pose Voting into multiple bins with large bin sizes tolerates the error introduced by the similarity approximation, as sketched below
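(Not part of the original slides.) A schematic of the voting step, simplified from Lowe's scheme (location bins and voting into multiple adjacent bins are omitted): each match predicts an object and a coarse pose from the relative orientation and scale of the matched keypoints, and bins with at least 3 consistent votes become pose hypotheses. The match tuples are a hypothetical representation.

    import collections
    import math

    def hough_pose_votes(matches, bin_deg=30.0, bin_log_scale=1.0):
        """matches: iterable of (object_id, d_theta_deg, scale_ratio) tuples."""
        votes = collections.Counter()
        for obj, d_theta, s in matches:
            o_bin = int(d_theta // bin_deg)                    # orientation bin
            s_bin = int(round(math.log2(s) / bin_log_scale))   # scale bin
            votes[(obj, o_bin, s_bin)] += 1
        # Clusters of >= 3 consistent matches become object/pose hypotheses
        return [key for key, n in votes.items() if n >= 3]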
3. Feature matching 3D object recognition: solve for pose Affine transform of [x, y] to [u, v]:

    \begin{bmatrix} u \\ v \end{bmatrix} =
    \begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix}
    \begin{bmatrix} x \\ y \end{bmatrix} +
    \begin{bmatrix} t_x \\ t_y \end{bmatrix}

Rewrite to solve for the transform parameters, stacking two rows per match:

    \begin{bmatrix} x & y & 0 & 0 & 1 & 0 \\ 0 & 0 & x & y & 0 & 1 \\ & & & \vdots & & \end{bmatrix}
    \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix} =
    \begin{bmatrix} u \\ v \\ \vdots \end{bmatrix}
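(Not part of the original slides.) A minimal numpy sketch of this least-squares step, with hypothetical point lists:

    import numpy as np

    def solve_affine(model_pts, image_pts):
        """Least-squares affine fit from matched model (x, y) to image (u, v)
        points; requires at least 3 matches (6 equations, 6 unknowns)."""
        A, b = [], []
        for (x, y), (u, v) in zip(model_pts, image_pts):
            A.append([x, y, 0, 0, 1, 0])
            A.append([0, 0, x, y, 0, 1])
            b.extend([u, v])
        params, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float),
                                     rcond=None)
        m1, m2, m3, m4, tx, ty = params
        return np.array([[m1, m2], [m3, m4]]), np.array([tx, ty])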
3. Feature matching 3D object recognition: verify model Discard outliers from the pose solution in the previous step Perform a top-down check for additional features Evaluate the probability that the match is correct: use a Bayesian model with the probability that the features would arise by chance if the object were not present Takes into account object size in the image, textured regions, model feature count in the database, and accuracy of fit [Lowe 01]
Planar recognition Training images
Planar recognition Reliably recognized at a rotation of 60° away from the camera Affine fit approximates perspective projection Only 3 points are needed for recognition
3D object recognition Training images
3D object recognition Only 3 keys (matched features) are needed for recognition, so extra keys provide robustness The affine model is no longer as accurate for true 3D objects
Recognition under occlusion
Illumination invariance
Applications of SIFT Object recognition Panoramic image stitching Robot localization Video indexing … The Office of the Past Document tracking and recognition
Location recognition
Robot Localization
Map continuously built over time
Locations of map features in 3D
Sony Aibo SIFT usage: Recognize charging station Communicate with visual cards Teach object recognition
The Office of the Past Paper everywhere
Unify physical and electronic desktops A video camera observes the physical desktop; the system recognizes paper documents in the video through tracking, recognition, and linking to the electronic desktop
Unify physical and electronic desktops Applications: find lost documents, browse a remote desktop, find the electronic version of a paper document, history-based queries
Example input video
Demo – Remote desktop
System overview Here is an overview of our system. In the setup, a video camera is mounted above the desk, looking straight down to record the desktop.
System overview Given the video of the physical desktop, and images of the corresponding electronic documents extracted from PDFs, the system tracks and recognizes the paper documents by matching between the two, and produces an internal graphical representation that encodes the evolution of the stack structure over time. We call each of these graphs a "scene graph".
System overview Then, when the user issues a query, such as "Where is my W-2 form?", the system answers it by consulting the scene graphs.
Assumptions Documents: a corresponding electronic copy exists; no duplicates of the same document We make a number of assumptions to simplify the tracking & recognition problem. First, we assume that each paper document has a corresponding electronic copy on the computer, and that there are no duplicate copies of the same document; in other words, each document is unique and distinct.
Assumptions Motion: 3 event types (move/entry/exit); one document moves at a time; only the topmost document of a stack can move A number of other assumptions constrain the motion of the documents. Although these assumptions limit the system's ability to handle more realistic situations, they were carefully chosen to make the problem tractable while still allowing interesting applications, as we will demonstrate later in the talk.
Non-assumptions Desk need not be initially empty Also note that there are certain assumptions we don’t make. For instance, we don’t require the desk to be initially empty. The desk is allowed to start with unknown papers on it, and our system automatically discovers the documents as observations accumulate over time.
Non-assumptions Desk need not be initially empty Stacks may overlap Also, the paper stacks are allowed to overlap with each other, forming a complex graph structure, rather than cleanly separated stacks.
Algorithm overview Given the input sequence of frames, 4 steps are repeated for each event: Event detection (compare before/after frames) Event interpretation ("A document moved from (x1,y1) to (x2,y2)") Document recognition (match against File1.pdf, File2.pdf, File3.pdf, ... using SIFT) Scene graph update (update the scene graph according to the event) Now, I'll explain each step of the algorithm.
Document tracking example Here's an example of a move event, where the top-left document in the before frame moves to the right in the after frame. To classify the event, we first extract image features in both images and match them between the two. Features that have no match (shown in green) are discarded. Next, we cluster the matching pairs of features according to their relative transformation: red features moved under the same transform, while blue ones stayed where they were. If the red cluster contains sufficiently many features, the event is considered a move, and the motion (x, y, θ) is obtained from the transformation of the red cluster, as sketched below. Otherwise it's a non-move and is subjected to further classification.
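(Not part of the talk.) A rough sketch of the move-classification idea, substituting OpenCV's robust affine estimation for the clustering described above; before.jpg and after.jpg are hypothetical frames.

    import cv2
    import numpy as np

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(cv2.imread("before.jpg", cv2.IMREAD_GRAYSCALE), None)
    kp2, des2 = sift.detectAndCompute(cv2.imread("after.jpg", cv2.IMREAD_GRAYSCALE), None)

    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.8 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # Features that stayed put belong to the desk and untouched documents;
    # fit a rigid 2D transform (x, y, theta) to the ones that moved
    moved = np.linalg.norm(pts1 - pts2, axis=1) > 5.0
    if moved.sum() >= 10:  # enough coherent motion: classify as a move event
        M, inliers = cv2.estimateAffinePartial2D(pts1[moved], pts2[moved])
        theta = np.degrees(np.arctan2(M[1, 0], M[0, 0]))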
Document Recognition Match against the PDF image database We match features in the region identified as the document against a database of PDF page images stored on the computer (File1.pdf, File2.pdf, ..., File6.pdf), also using SIFT features; a sketch follows.
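(Not part of the talk.) A minimal sketch of this lookup under stated assumptions: pre-compute SIFT descriptors for a rendered image of each PDF page, then return the page with the most ratio-test matches to the document region. The filenames and the page-rendering step are placeholders.

    import cv2

    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()

    # Descriptors for each rendered PDF page (rendering to PNG not shown)
    database = {}
    for name in ["File1.pdf", "File2.pdf", "File3.pdf"]:
        page = cv2.imread(name + ".png", cv2.IMREAD_GRAYSCALE)
        database[name] = sift.detectAndCompute(page, None)[1]

    def recognize(region):
        """Return the database page with the most ratio-test matches."""
        _, des = sift.detectAndCompute(region, None)
        def score(des_db):
            pairs = matcher.knnMatch(des, des_db, k=2)
            return sum(1 for m, n in pairs if m.distance < 0.8 * n.distance)
        return max(database, key=lambda name: score(database[name]))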
Document Recognition Performance analysis Tested 20 pages against a database of 162 pages, mostly from computer science research papers; all were correctly differentiated and recognized We also varied the document image resolution, plotting recognition rate against the length of the document's longer side in pixels: documents must be at least ~200x300 pixels for a 90% recognition rate This resolution is not high enough for recognizing text with techniques such as OCR, but is good enough for reliable recognition of individual documents
Results Input video: ~40 minutes at 1024x768 and 15 fps, containing 22 documents and 49 events Running time: the video was processed offline, after recording was over; with no performance optimization, processing the entire sequence took a few hours
Demo – Paper tracking Let me show a demo of the query interface to our system, using the same input sequence I demoed at the beginning of the talk. The right window is the visualization panel showing the current state of the desktop. The left window shows a list of thumbnails of the documents found by the system. The user can browse this list and click on the thumbnail of the document of interest to query its location in the stack. The visualization expands the stack that contains the selected document and highlights the document. The user can open the PDF file of the selected document as well. The interface also supports a couple of alternative ways to specify a document. The user can locate a document by doing a keyword search for the title or the author. Here I'm looking for the document that contains the string "digitaldesk" in its title. The system tells me the paper is in this stack. The user can also sort the thumbnails in various ways. For example, the documents can be sorted in decreasing order of the last time the user accessed each document. The oldest document at the end of this list lies at the bottom of this stack; the second oldest document no longer exists on the desk; and the next oldest document is at the bottom of this stack, and so forth. On the other hand, the most recent document at the beginning of this list is on top of this stack; the next most recent document is on top of this stack, and so forth.
Photo sorting example Here’s an example of using our system for sorting digital photographs. Sorting a large number of digital photographs using the computer interface is usually a fairly tedious task.
Photo sorting example In contrast, it is very easy to sort printed photographs into physical stacks. So we printed out digital photographs on sheets of paper, and recorded the user sorting them into physical stacks on the desk. Here we sort the photographs from two source stacks, one shown on the bottom right of the video, and the other outside the camera view in the user's hand, into three target stacks based on the content of the pictures.
Demo – Photo sorting After processing this video with our system, we can click on each of the three stacks in the query interface, and assign it to an appropriate folder on the computer. Then our system automatically organizes the corresponding digital photographs into the designated folder, and pops up the folder in thumbnail view. I should point out that one clear drawback is the overhead of first having to print out the photographs on paper. However, we think that this can be useful for people who are not familiar with computer interfaces.
Future work Enhance realism: handle more realistic desktops; real-time performance More applications: support other document tasks (e.g., attach reminders, cluster documents); go beyond documents to other 3D desktop objects, such as books and CDs
Summary SIFT is: A scale/rotation-invariant local feature Highly distinctive Robust to occlusion, illumination change, and 3D viewpoint change Efficient (close to real-time performance) Suitable for many useful applications
References David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2), 2004, pp. 91-110. Matthew Brown and David G. Lowe, "Recognising panoramas," International Conference on Computer Vision (ICCV 2003), Nice, France, October 2003, pp. 1218-1225. Jiwon Kim, Steven M. Seitz, and Maneesh Agrawala, "Video-Based Document Tracking: Unifying Your Physical and Electronic Desktops," ACM Symposium on User Interface Software and Technology (UIST 2004), pp. 99-107.