SIFT Guest Lecture by Jiwon Kim http://www.cs.washington.edu/homes/jwkim/

SIFT Features and Their Applications

Autostitch Demo

Autostitch: fully automatic panorama generation. Input: a set of images. Output: panorama(s). Uses SIFT (Scale-Invariant Feature Transform) to find and align images.

1. Solve for homography: match SIFT features between each pair of images, then use RANSAC to select geometrically consistent matches and estimate the homography.
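As a concrete illustration of this step, here is a minimal Python/OpenCV sketch that matches SIFT features between two overlapping images and fits a homography with RANSAC. The function name, ratio threshold, and reprojection tolerance are my choices for illustration, not Autostitch's actual code.

```python
import cv2
import numpy as np

def pairwise_homography(img1, img2, ratio=0.8):
    """Match SIFT features between two images and fit a homography with RANSAC."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Lowe's ratio test on the two nearest neighbors
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in pairs if m.distance < ratio * n.distance]

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC rejects outlier matches while estimating the 3x3 homography
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H, inliers
```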

2. Find connected sets of images: verify the pairwise image matches and find connected components of the image-match graph; each component becomes one panorama.

3. Solve for camera parameters (bundle adjustment): new images are initialised with the rotation and focal length of the best matching image.

4. Blending the panorama (Burt & Adelson 1983): blend frequency bands over a range proportional to the wavelength λ.

2-band Blending: low frequency (λ > 2 pixels), high frequency (λ < 2 pixels).

Linear Blending

2-band Blending
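A rough numpy/OpenCV sketch of the idea, assuming two already-aligned images and a per-pixel weight map; the 2-pixel frequency split is approximated with a Gaussian blur, and the function name and parameter choices are illustrative rather than the paper's exact implementation.

```python
import cv2
import numpy as np

def two_band_blend(im1, im2, w):
    """Blend two aligned images; w in [0,1] is im1's per-pixel weight.
    Low frequencies blend smoothly; high frequencies take a hard seam."""
    im1 = im1.astype(np.float32)
    im2 = im2.astype(np.float32)
    if im1.ndim == 3:            # broadcast the weight map over color channels
        w = w[..., None]
    w = w.astype(np.float32)

    # Split each image into low (lambda > ~2 px) and high frequency bands
    low1 = cv2.GaussianBlur(im1, (0, 0), sigmaX=2)
    low2 = cv2.GaussianBlur(im2, (0, 0), sigmaX=2)
    high1, high2 = im1 - low1, im2 - low2

    hard = (w > 0.5).astype(np.float32)   # max-weight (binary) seam
    return (w * low1 + (1 - w) * low2) + (hard * high1 + (1 - hard) * high2)
```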

So, what is SIFT? Scale-Invariant Feature Transform, by David Lowe at UBC. Scale/rotation invariant; currently the best-known feature descriptor. Many real-world applications: object recognition, panorama stitching, robot localization, video indexing, …

Example: object recognition

SIFT properties. Locality: features are local, so robust to occlusion and clutter. Distinctiveness: individual features can be matched to a large database of objects. Quantity: many features can be generated for even small objects. Efficiency: close to real-time performance.

SIFT algorithm overview. 1) Feature detection: detect points that can be repeatably selected under location/scale change. 2) Feature description: assign an orientation to each detected feature point, and construct a descriptor for the image patch around it. 3) Feature matching.

1. Feature detection: detect points stable under location/scale change. Build a continuous (x, y, scale) space, approximated by a multi-scale Difference-of-Gaussian pyramid, and select maxima/minima in (x, y, scale).
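To make the pyramid construction concrete, here is an illustrative Python sketch of a single octave: blur at increasing sigma, difference adjacent levels, and keep points that are extrema over their 3x3x3 neighborhood in (x, y, scale). The function name, sigma schedule, and brute-force scan are assumptions for clarity, not Lowe's optimized implementation.

```python
import cv2
import numpy as np

def dog_extrema(img, num_scales=5, sigma0=1.6):
    """Return (x, y, scale-index) of local extrema in a single-octave DoG stack."""
    img = img.astype(np.float32) / 255.0
    k = 2 ** (1.0 / (num_scales - 2))
    blurred = [cv2.GaussianBlur(img, (0, 0), sigma0 * k**i)
               for i in range(num_scales)]
    # Difference-of-Gaussian: subtract adjacent blur levels
    dog = np.stack([blurred[i + 1] - blurred[i]
                    for i in range(num_scales - 1)])  # shape (S, H, W)

    keypoints = []
    S, H, W = dog.shape
    for s in range(1, S - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                patch = dog[s-1:s+2, y-1:y+2, x-1:x+2]  # 3x3x3 neighborhood
                v = dog[s, y, x]
                if v == patch.max() or v == patch.min():
                    keypoints.append((x, y, s))
    return keypoints
```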

1. Feature detection: localize extrema by fitting a quadratic. Sub-pixel/sub-scale interpolation using a Taylor expansion; take the derivative and set it to zero.
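In Lowe's formulation, the DoG function $D$ is expanded around the sample point, with $\mathbf{x} = (x, y, \sigma)^T$ the offset:

```latex
D(\mathbf{x}) \approx D + \frac{\partial D}{\partial \mathbf{x}}^{T} \mathbf{x}
  + \frac{1}{2}\, \mathbf{x}^{T} \frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\, \mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left( \frac{\partial^{2} D}{\partial \mathbf{x}^{2}} \right)^{-1}
  \frac{\partial D}{\partial \mathbf{x}}
```

Setting the derivative of the expansion to zero gives the sub-pixel/sub-scale offset $\hat{\mathbf{x}}$; if the offset exceeds 0.5 in any dimension, the extremum lies closer to a neighboring sample point, and the interpolation is repeated there.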

1. Feature detection: discard low-contrast and edge points. Low contrast: discard keypoints with |D(x̂)| below a threshold (0.03 in the paper, for image values in [0,1]). Edge points have high contrast in one direction and low contrast in the other, so compute the principal curvatures from the eigenvalues of the 2x2 Hessian matrix and limit their ratio.
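Concretely (following Lowe 2004), with the 2x2 Hessian of $D$ at the keypoint, a keypoint is kept only if

```latex
H = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix},
\qquad
\frac{\operatorname{Tr}(H)^{2}}{\operatorname{Det}(H)} < \frac{(r+1)^{2}}{r}
```

The paper uses $r = 10$; this test bounds the ratio of principal curvatures without computing the eigenvalues explicitly.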

1. Feature detection example: (a) 233x189 image; (b) 832 DoG extrema; (c) 729 left after the peak value (contrast) threshold; (d) 536 left after testing the ratio of principal curvatures.

2. Feature description: assign an orientation to each keypoint. Create a histogram of local gradient directions computed at the selected scale, and assign the canonical orientation at the peak of the smoothed histogram.
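A simplified numpy sketch of orientation assignment; the function name is mine, and it omits the Gaussian weighting of gradient magnitudes and the extra keypoints Lowe creates for secondary peaks within 80% of the maximum, so treat it as illustrative only.

```python
import numpy as np

def dominant_orientation(patch, num_bins=36):
    """Assign a canonical orientation from a 36-bin gradient histogram.
    `patch` is a grayscale float array around the keypoint at its scale."""
    dy, dx = np.gradient(patch)
    mag = np.hypot(dx, dy)
    ang = np.degrees(np.arctan2(dy, dx)) % 360.0

    hist, _ = np.histogram(ang, bins=num_bins, range=(0, 360), weights=mag)
    # Smooth the (circular) histogram, then take the peak as the orientation
    kernel = np.array([1, 4, 6, 4, 1], dtype=float)
    kernel /= kernel.sum()
    padded = np.r_[hist[-2:], hist, hist[:2]]
    hist = np.convolve(padded, kernel, mode='same')[2:-2]
    return (np.argmax(hist) + 0.5) * (360.0 / num_bins)  # bin center, degrees
```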

2. Feature description: construct the SIFT descriptor. Create an array of orientation histograms: 8 orientations x a 4x4 histogram array = 128 dimensions.
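And a compact sketch of the descriptor layout itself, assuming a 16x16 patch already rotated to the keypoint's canonical orientation and sampled at its scale; the trilinear interpolation and Gaussian window of the real descriptor are omitted here, so this is a structural illustration rather than a faithful reimplementation.

```python
import numpy as np

def sift_descriptor(patch):
    """Build a 4x4x8 = 128-D descriptor from a rotated 16x16 patch."""
    assert patch.shape == (16, 16)
    dy, dx = np.gradient(patch.astype(float))
    mag = np.hypot(dx, dy)
    ang = np.degrees(np.arctan2(dy, dx)) % 360.0

    desc = []
    for by in range(4):                      # 4x4 grid of 4x4-pixel cells
        for bx in range(4):
            sl = (slice(4*by, 4*by + 4), slice(4*bx, 4*bx + 4))
            hist, _ = np.histogram(ang[sl], bins=8, range=(0, 360),
                                   weights=mag[sl])   # 8 orientation bins
            desc.extend(hist)
    desc = np.array(desc)

    # Normalize, clamp at 0.2, renormalize (Lowe 2004):
    # reduces sensitivity to illumination change
    desc /= max(np.linalg.norm(desc), 1e-7)
    desc = np.minimum(desc, 0.2)
    desc /= max(np.linalg.norm(desc), 1e-7)
    return desc   # 128 dimensions
```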

2. Feature description: advantages over simple correlation. Gradients are less sensitive to illumination change, and small shifts in gradient positions do not change the descriptor much, making it robust to deformation and viewpoint change.

Performance: stability to noise Match features after random change in image scale & orientation, with differing levels of image noise Find nearest neighbor in database of 30,000 features

Performance: stability to affine change Match features after random change in image scale & orientation, with 2% image noise, and affine distortion Find nearest neighbor in database of 30,000 features

Performance: distinctiveness Vary size of database of features, with 30 degree affine change, 2% image noise Measure % correct for single nearest neighbor match

3. Feature matching For each feature in A, find nearest neighbor in B

3. Feature matching: exact nearest-neighbor search is too slow for a large database of 128-dimensional data. Approximate nearest-neighbor search: best-bin-first [Beis & Lowe 97], a modification of the k-d tree algorithm that uses a heap data structure to visit bins in order of their distance from the query point. Result: can give a speedup by a factor of 1000 while finding the true nearest neighbor 95% of the time.

3. Feature matching: reject false matches. Compare the distance of the nearest neighbor to that of the second-nearest neighbor; matches on common, non-distinctive features are unreliable and should be discarded. A ratio threshold of 0.8 provides excellent separation.
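A minimal sketch of matching with the ratio test, using scipy's exact k-d tree in place of best-bin-first (which scipy does not provide); the function name and return format are my choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_features(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in A to its nearest neighbor in B,
    rejecting ambiguous matches with the nearest/second-nearest ratio test."""
    tree = cKDTree(desc_b)          # exact k-d tree; BBF would approximate this
    dist, idx = tree.query(desc_a, k=2)
    keep = dist[:, 0] < ratio * dist[:, 1]
    return [(i, idx[i, 0]) for i in np.flatnonzero(keep)]
```

Note that in 128 dimensions an exact k-d tree degrades toward linear scan, which is exactly why Lowe resorts to approximate best-bin-first search.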

3. Feature matching Now, given feature matches… Find an object in the scene Solve for homography (panorama) …

3. Feature matching Example: 3D object recognition

3. Feature matching: 3D object recognition. Assume an affine transform and look for clusters of at least 3 agreeing matches. Finding 3 consistent matches out of 3000 that agree on the same object and pose leaves too many outliers for RANSAC or least-median-of-squares, so use the Hough transform: each match votes for a hypothesis of object ID and pose, and voting for multiple bins with a large bin size allows for error due to the similarity approximation.
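A schematic Python sketch of the voting step. The `Match` record and its fields (`object_id`, `d_theta`, `scale_ratio`, `x`, `y`, `model_size`) are hypothetical; the bin sizes (30° for orientation, a factor of 2 for scale, 0.25 of the projected model size for location) follow the paper, but the single-bin voting here simplifies Lowe's scheme of voting for the 2 nearest bins in each dimension.

```python
import numpy as np
from collections import defaultdict

def hough_vote(matches):
    """Each match votes for a coarse (object, pose) bin; clusters of >= 3
    consistent votes become candidate object/pose hypotheses."""
    votes = defaultdict(list)
    for m in matches:  # m: hypothetical match record, see lead-in
        ori_bin = int(m.d_theta // 30) % 12              # 30-degree bins
        scale_bin = int(round(np.log2(m.scale_ratio)))   # factor-of-2 bins
        loc_bin = (int(m.x // (0.25 * m.model_size)),    # 0.25 x model size
                   int(m.y // (0.25 * m.model_size)))
        votes[(m.object_id, ori_bin, scale_bin, loc_bin)].append(m)
    return [c for c in votes.values() if len(c) >= 3]
```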

3. Feature matching: 3D object recognition, solving for pose. The affine transform of [x, y] to [u, v] can be rewritten to solve for the transform parameters:
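Written out (following Lowe 2004), the affine map and its rearrangement into a linear system over the six parameters are:

```latex
\begin{pmatrix} u \\ v \end{pmatrix}
= \begin{pmatrix} m_{1} & m_{2} \\ m_{3} & m_{4} \end{pmatrix}
  \begin{pmatrix} x \\ y \end{pmatrix}
+ \begin{pmatrix} t_{x} \\ t_{y} \end{pmatrix}
\qquad\Longrightarrow\qquad
\begin{pmatrix}
x & y & 0 & 0 & 1 & 0 \\
0 & 0 & x & y & 0 & 1 \\
  &   &   & \vdots &   &
\end{pmatrix}
\begin{pmatrix} m_{1} \\ m_{2} \\ m_{3} \\ m_{4} \\ t_{x} \\ t_{y} \end{pmatrix}
= \begin{pmatrix} u \\ v \\ \vdots \end{pmatrix}
```

With at least 3 matches, the stacked system $A\mathbf{p} = \mathbf{b}$ is solved by least squares: $\hat{\mathbf{p}} = (A^{T}A)^{-1}A^{T}\mathbf{b}$.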

3. Feature matching: 3D object recognition, model verification. Discard outliers from the pose solution of the previous step, perform a top-down check for additional features, and evaluate the probability that the match is correct. A Bayesian model estimates the probability that the features would arise by chance if the object were not present, taking into account the object's size in the image, textured regions, the number of model features in the database, and the accuracy of fit [Lowe 01].

Planar recognition Training images

Planar recognition: reliably recognized at a rotation of 60° away from the camera; the affine fit approximates the perspective projection; only 3 points are needed for recognition.

3D object recognition Training images

3D object recognition Only 3 keys are needed for recognition, so extra keys provide robustness Affine model is no longer as accurate

Recognition under occlusion

Illumination invariance

Applications of SIFT: object recognition, panoramic image stitching, robot localization, video indexing, … and The Office of the Past: document tracking and recognition.

Location recognition

Robot Localization

Map continuously built over time

Locations of map features in 3D

Sony Aibo, SIFT usage: recognize the charging station, communicate with visual cards, teach object recognition.

The Office of the Past Paper everywhere

Unify physical and electronic desktops: a video camera recognizes video of paper documents on the physical desktop, via tracking, recognition, and linking to the electronic desktop.

Unify physical and electronic desktops, applications: find lost documents, browse a remote desktop, find the electronic version of a paper document, history-based queries.

Example input video

Demo – Remote desktop

System overview: In the setup, a video camera is mounted above the desk, looking straight down to record the desktop. Given the video of the physical desk, and images of the corresponding electronic documents extracted from PDFs, the system tracks and recognizes the paper documents by matching between the two, and produces an internal representation that encodes the evolution of the stack structure over time; we call each of these graphs a "scene graph". Then, when the user issues a query, such as "Where is my W-2 form?", the system answers it by consulting the scene graphs.

Assumptions (documents): a corresponding electronic copy exists, and there are no duplicates of the same document. We make a number of assumptions to simplify the tracking & recognition problem. First, we assume that each paper document has a corresponding electronic copy on the computer, and that there are no duplicate copies of the same document; in other words, each document is unique and distinct from the others.

Assumptions (motion): 3 event types (move/entry/exit); one document moves at a time; only the topmost document of a stack can move. A number of other assumptions constrain the motion of the documents. Although these assumptions limit the system's ability to handle more realistic situations, they were carefully chosen to make the problem tractable while still allowing interesting applications, as we demonstrate later in the talk.

Non-assumptions Desk need not be initially empty Also note that there are certain assumptions we don’t make. For instance, we don’t require the desk to be initially empty. The desk is allowed to start with unknown papers on it, and our system automatically discovers the documents as observations accumulate over time.

Non-assumptions: the desk need not be initially empty, and stacks may overlap. The paper stacks are allowed to overlap with each other, forming a complex graph structure rather than cleanly separated stacks.

Algorithm overview: Here is a step-by-step overview of the tracking & recognition algorithm. Given the input frames, the system processes each event in four steps: (1) event detection, comparing the before and after frames; (2) event interpretation, e.g. "A document moved from (x1,y1) to (x2,y2)"; (3) document recognition against the electronic files (File1.pdf, File2.pdf, File3.pdf), using SIFT; and (4) scene graph update. These 4 steps are repeated for each event in the input sequence. Now, I'll explain each step of the algorithm.

Document tracking example Here’s an example of a move event, before after

Document tracking example ..where this top-left document before after

Document tracking example ..moves to the right. before after

Document tracking example To classify the event, we first extract image features in both images. before after

Document tracking example ..we match them between the two images. before after

Document tracking example We identify features that have no match, shown in green before after

Document tracking example ..and discard them before after

Document tracking example Next we cluster matching pairs of features according to their relative transformation Red features moved under the same xform, while blue ones stayed where they are before after

Document tracking example We look at the red cluster, and if it contains sufficiently many features, the event is considered a move. Otherwise it’s a non-move and subjected to further classification. before after

Document tracking example Motion: (x,y,θ) If it’s a move, we obtain the motion from the transformation of red cluster before after
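A hedged sketch of how such an event classifier might look in Python/OpenCV: split matches into stationary and moved clusters, then fit a rigid transform to the moved cluster. The function name, thresholds, and the use of cv2.estimateAffinePartial2D are my substitutions; the paper's own clustering may differ.

```python
import cv2
import numpy as np

def classify_event(pts_before, pts_after, min_moved=20):
    """Given matched feature locations (Nx2 float32) in the before/after
    frames, decide whether a document moved and recover (dx, dy, theta)."""
    still = np.linalg.norm(pts_before - pts_after, axis=1) < 2.0  # "blue" cluster
    moved_b, moved_a = pts_before[~still], pts_after[~still]
    if len(moved_b) < min_moved:
        return None  # not a move event; check entry/exit instead

    # Fit a similarity transform to the "red" cluster with RANSAC
    M, inliers = cv2.estimateAffinePartial2D(moved_b, moved_a,
                                             method=cv2.RANSAC)
    if M is None:
        return None
    theta = np.degrees(np.arctan2(M[1, 0], M[0, 0]))
    return M[0, 2], M[1, 2], theta   # translation + rotation
```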

Document Recognition: match against the PDF image database (File1.pdf, File2.pdf, …, File6.pdf). We match features in the region identified as the document against a database of PDF page images stored on the computer, also using SIFT features.
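A sketch of the database lookup under the same ratio-test machinery as before; the per-document vote counting, function name, and data layout here are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def recognize_document(query_desc, database, ratio=0.8):
    """database maps a PDF page name to a cKDTree built over that page's
    SIFT descriptors; return the page collecting the most ratio-test matches."""
    best_doc, best_votes = None, 0
    for name, tree in database.items():
        dist, _ = tree.query(query_desc, k=2)
        votes = int(np.sum(dist[:, 0] < ratio * dist[:, 1]))
        if votes > best_votes:
            best_doc, best_votes = name, votes
    return best_doc, best_votes
```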

Document Recognition, performance analysis: We tested our recognition method on 20 pages against a database of 162 pages, mostly from computer science research papers, and the method correctly differentiated and recognized all of them. We also tested performance at varying document image resolutions, plotting recognition rate against document resolution (the length of the longer side of the document in pixels). To achieve a recognition rate of 90%, documents must be at least about 200x300 pixels. Note that this resolution is not high enough for recognizing text with techniques such as OCR, but it is still good enough for reliable recognition of individual documents.

Results: The input video was recorded over a period of ~40 minutes at 1024x768 resolution and 15 frames per second, and contained 22 documents on the desk with 49 events. The video was processed offline, that is, after the recording was over; with no performance optimization, processing the entire sequence took a few hours.

Demo – Paper tracking: Let me show a demo of the query interface to our system, using the same input sequence I demoed at the beginning of the talk. The right window is the visualization panel showing the current state of the desktop; the left window shows a list of thumbnails of the documents found by the system. The user can browse this list and click on the thumbnail of a document of interest to query its location in the stack; the visualization expands the stack that contains the selected document and highlights it. The user can also open the PDF file of the selected document.

The interface supports a couple of alternative ways to specify a document. The user can locate a document by doing a keyword search on the title or the author; here I'm looking for the document that contains the string "digitaldesk" in its title, and the system tells me the paper is in this stack. The user can also sort the thumbnails in various ways, for example, in decreasing order of the time the user last accessed each document. The oldest document at the end of this list lies at the bottom of this stack; the second oldest no longer exists on the desk; the next oldest is at the bottom of this stack, and so forth. Conversely, the most recent document at the beginning of this list is on top of this stack; the next most recent is on top of this stack, and so forth.

Photo sorting example Here’s an example of using our system for sorting digital photographs. Sorting a large number of digital photographs using the computer interface is usually a fairly tedious task.

Photo sorting example In contrast, it is very easy to sort printed photographs into physical stacks. So we printed out digital photographs on sheets of paper, and recorded the user sorting them into physical stacks on the desk. Here we sort the photographs from two source stacks, one shown on the bottom right of the video, and the other outside the camera view in the user's hand, into three target stacks based on the content of the pictures.

Demo – Photo sorting: After processing this video with our system, we can click on each of the three stacks in the query interface and assign it to an appropriate folder on the computer. Our system then automatically organizes the corresponding digital photographs into the designated folder and pops up the folder in thumbnail view. One clear drawback is the overhead of first having to print the photographs on paper; however, we think this can still be useful for people who are not familiar with computer interfaces.

Future work: Enhance realism: handle more realistic desktops, real-time performance. More applications: support other document tasks (e.g., attach reminders, cluster documents), and go beyond documents to other 3D desktop objects, such as books and CDs.

Summary: SIFT is a scale/rotation-invariant local feature that is highly distinctive, robust to occlusion, illumination change, and 3D viewpoint change, efficient (real-time performance), and suitable for many useful applications.

References:
Distinctive image features from scale-invariant keypoints. David G. Lowe, International Journal of Computer Vision, 60, 2 (2004), pp. 91-110.
Recognising panoramas. Matthew Brown and David G. Lowe, International Conference on Computer Vision (ICCV 2003), Nice, France (October 2003), pp. 1218-1225.
Video-Based Document Tracking: Unifying Your Physical and Electronic Desktops. Jiwon Kim, Steven M. Seitz and Maneesh Agrawala, ACM Symposium on User Interface Software and Technology (UIST 2004), pp. 99-107.