Keypoint-based Recognition and Object Search


Keypoint-based Recognition and Object Search (03/08/11). Computer Vision CS 543 / ECE 549, University of Illinois. Derek Hoiem

Notices: I'm having trouble connecting to the web server, so I can't post lecture slides right now. HW 2 is due Thursday. Guest lecture Thursday by Ali Farhadi.

General Process of Object Recognition: Specify Object Model → Generate Hypotheses → Score Hypotheses → Resolution

General Process of Object Recognition. Example: Keypoint-based Instance Recognition (covered last class): Specify Object Model → Generate Hypotheses → Score Hypotheses → Resolution. [Figure: matched keypoints A1-A3 and B1-B3 in two images.]

Overview of Keypoint Matching (K. Grauman, B. Leibe): 1. Find a set of distinctive keypoints. 2. Define a region around each keypoint. 3. Extract and normalize the region content. 4. Compute a local descriptor from the normalized region (e.g., color). 5. Match local descriptors.
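
As a concrete illustration of steps 1-5, here is a short, hedged sketch using OpenCV's SIFT implementation (it assumes OpenCV >= 4.4; the image file names are placeholders):

```python
import cv2

# Steps 1-4: detect distinctive keypoints and compute local descriptors (SIFT).
# "query.jpg" and "stored.jpg" are placeholder file names.
sift = cv2.SIFT_create()
img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("stored.jpg", cv2.IMREAD_GRAYSCALE)
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

# Step 5: match local descriptors with a brute-force matcher
# (the ratio test for pruning ambiguous matches is described a few slides later).
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(desc1, desc2, k=2)
print(len(matches), "candidate matches")
```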

General Process of Object Recognition. Example: Keypoint-based Instance Recognition (this class): Specify Object Model (affine-variant point locations) → Generate Hypotheses (affine parameters) → Score Hypotheses (# inliers) → Resolution (choose the hypothesis with the maximum score above a threshold). [Figure: matched keypoints A1-A3 and B1-B3.]

Keypoint-based Instance Recognition. Given two images and their keypoints (position, scale, orientation, descriptor), the goal is to verify whether the matched keypoints belong to a consistent configuration. Local features, e.g., SIFT. (Slide credit: David Lowe)

Finding the objects (overview): 1. Match interest points from the input image to the database image. 2. Matched points vote for the rough position/orientation/scale of the object. 3. Find triplets of position/orientation/scale that have at least three votes. 4. Compute the affine registration and matches using iterative least squares with an outlier check. 5. Report the object if there are at least T matched points.

Matching Keypoints. We want to match keypoints between the query image and a stored image containing the object. Given a descriptor x0, find its two nearest neighbors x1 and x2, with distances d1 and d2. Accept x1 as a match for x0 if d1/d2 < 0.8. In Lowe's study, this ratio test gets rid of about 90% of false matches while discarding only about 5% of true matches.
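
A minimal NumPy sketch of this nearest-neighbor ratio test; the brute-force search and the toy descriptor arrays are assumptions, while the 0.8 threshold follows the slide:

```python
import numpy as np

def ratio_test_matches(query_desc, db_desc, ratio=0.8):
    """Match each query descriptor to its nearest database descriptor,
    keeping the match only if the nearest/second-nearest distance ratio
    is below `ratio` (Lowe's ratio test)."""
    matches = []
    for i, x0 in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - x0, axis=1)  # distances to all stored descriptors
        nn1, nn2 = np.argsort(dists)[:2]              # two nearest neighbors
        if dists[nn1] / dists[nn2] < ratio:           # ambiguous matches are discarded
            matches.append((i, nn1))
    return matches

# Toy usage with random 128-D SIFT-like descriptors.
query_desc = np.random.rand(50, 128)
db_desc = np.random.rand(500, 128)
print(len(ratio_test_matches(query_desc, db_desc)), "matches kept")
```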

Affine Object Model Accounts for 3D rotation of a surface under orthographic projection

Affine Object Model: accounts for 3D rotation of a surface under orthographic projection. The model maps a point (x, y) to (x', y') = (a x + b y + t_x, c x + d y + t_y), where the 2x2 matrix [a b; c d] captures scaling/skew/rotation and (t_x, t_y) is the translation. What is the minimum number of matched points that we need? With six unknowns and two equations per correspondence, three matched points suffice.
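
To make the parameter count concrete, here is a hedged least-squares sketch for recovering the six affine parameters from matched points (the function name and data layout are illustrative); each correspondence gives two equations, so three non-collinear matches are the minimum:

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares fit of x' = A x + t from matched 2D points.
    src_pts, dst_pts: (N, 2) arrays with N >= 3 correspondences."""
    N = src_pts.shape[0]
    M = np.zeros((2 * N, 6))
    b = dst_pts.reshape(-1)
    for i, (x, y) in enumerate(src_pts):
        M[2 * i]     = [x, y, 0, 0, 1, 0]  # equation for x' = a*x + b*y + tx
        M[2 * i + 1] = [0, 0, x, y, 0, 1]  # equation for y' = c*x + d*y + ty
    p, *_ = np.linalg.lstsq(M, b, rcond=None)
    A = p[:4].reshape(2, 2)                # scaling / skew / rotation part
    t = p[4:]                              # translation part
    return A, t
```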

Finding the objects (in detail):
1. Match interest points from the input image to the database image.
2. Get location/scale/orientation using Hough voting: in training, each point has a known position/scale/orientation with respect to the whole object; matched points vote for the position, scale, and orientation of the entire object; bins for x, y, scale, and orientation are wide (0.25 of the object length in position, a factor of 2 in scale, 30 degrees in orientation); each match votes for the two closest bin centers in each dimension (16 votes total).
3. Geometric verification: for each bin with at least 3 keypoints, iterate between a least-squares fit and checking for inliers and outliers.
4. Report the object if there are more than T inliers (T is typically 3, or can be chosen to match some probabilistic threshold).
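
A simplified sketch of the Hough-voting step, under stated assumptions: the bin widths follow the slide, but orientation wrap-around is ignored, the "two closest bin centers" rule is approximated by neighboring integer bins, and all names are illustrative:

```python
import numpy as np
from collections import defaultdict
from itertools import product

def hough_vote(predicted_poses, pos_bin, ori_bin=30.0):
    """Coarse Hough voting over object pose (x, y, log2-scale, orientation).
    `predicted_poses` holds one (x, y, scale, ori_deg) tuple per matched
    keypoint; `pos_bin` would be 0.25 of the object length. Each match votes
    for two neighboring bins per dimension (2**4 = 16 votes)."""
    votes = defaultdict(list)
    for i, (x, y, s, ori) in enumerate(predicted_poses):
        coords = (x / pos_bin, y / pos_bin, np.log2(s), ori / ori_bin)
        options = [(int(np.floor(c)), int(np.floor(c)) + 1) for c in coords]
        for bin_idx in product(*options):
            votes[bin_idx].append(i)
    # Keep only bins with at least 3 votes for geometric verification.
    return {b: idx for b, idx in votes.items() if len(idx) >= 3}
```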

Examples of recognized objects

View Interpolation. Training: given images of different viewpoints, cluster similar viewpoints using feature matches and link features in adjacent views. Recognition: feature matches may be spread over several training viewpoints, so use the known links to "transfer votes" to other viewpoints. [Lowe01] (Slide credit: David Lowe)

Applications: Sony Aibo (Evolution Robotics). SIFT usage: recognize the docking station, communicate with visual cards. Other uses: place recognition, loop closure in SLAM. (Slide credits: K. Grauman, B. Leibe, David Lowe)

Location Recognition Training [Lowe04] Slide credit: David Lowe

How to quickly find images in a large database that match a given image region?

Simple idea: see how many keypoints are close to keypoints in each other image (lots of matches suggests the same object; few or no matches suggests otherwise). But this will be really, really slow!

Fast visual search “Video Google”, Sivic and Zisserman, ICCV 2003 “Scalable Recognition with a Vocabulary Tree”, Nister and Stewenius, CVPR 2006.

110,000,000 Images in 5.8 Seconds (Slide credit: Nister). Very recently, we have scaled the system even further. The system we just demoed searches a 50 thousand image index in the RAM of this laptop at real-time rates. We have built and programmed a desktop system at home in which we currently have 12 hard drives. We bypass the operating system and read and write the disk surfaces directly. Last week this system clocked in at 110 million images in 5.8 seconds, so roughly 20 million images per second per machine. The images consist of around 10 days of four TV channels, so more than a month of TV. Soon we won't have to watch TV, because we'll have machines doing it for us.

To continue the analogy, if we printed all these images on paper and stacked them, the pile would stack as high as Mount Everest. Another way to put this in perspective: Google image search not too long ago claimed to index 2 billion images, although based on meta-data, while we do it based on image content. So, with about 20 desktop systems like the one I just showed, it seems that it may be possible to build a web-scale content-based image search engine, and we are sort of hoping that this paper will fuel the race for the first such search engine. So, that is some motivation. Let me now move to the contribution of the paper. As you can guess by now, it is about scalability of recognition and retrieval. (Slide credit: Nister)

Key Ideas. Visual words: cluster descriptors (e.g., with k-means). Inverted document file: quick lookup of images given keypoints. tf-idf (term frequency-inverse document frequency) weighting: the weight of word i in document d is t_i = (n_id / n_d) * log(N / n_i), where n_id is the number of times the word appears in the document, n_d is the number of words in the document, n_i is the number of documents that contain the word, and N is the number of documents.
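
A minimal sketch of this weighting applied to a matrix of visual-word counts (the matrix layout and function name are assumptions):

```python
import numpy as np

def tfidf_weights(word_counts):
    """word_counts: (num_docs, vocab_size) matrix of visual-word counts.
    Returns tf-idf weighted vectors t_i = (n_id / n_d) * log(N / n_i)."""
    N = word_counts.shape[0]                        # number of documents (images)
    n_d = word_counts.sum(axis=1, keepdims=True)    # words per document
    n_i = (word_counts > 0).sum(axis=0)             # documents containing each word
    tf = word_counts / np.maximum(n_d, 1)
    idf = np.log(N / np.maximum(n_i, 1))
    return tf * idf
```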

Recognition with a K-tree (vocabulary tree). The following slides are by David Nister (CVPR 2006).

I will now try to describe the approach with an animation. The first, offline phase, is to train the vocabulary tree. We use Maximally Stable Extremal Regions by Matas et al. and then we extract SIFT descriptors from the MSER regions. Regarding the feature extraction, we are not really claiming anything other than down-in-the-trenches nose-to-the-grindstone hard work to get good implementations of the state-of-the-art, and since we are all obsessed with novelty I will not spend any time talking about the feature extraction. We believe that the vocabulary tree approach will work well on any of your favorite descriptors.

All the descriptor vectors are thrown into a common space. This is done for many images, the goal being to acquire enough statistics on how natural images distribute points in the descriptor space.

We then run k-means on the descriptor space. In this setting, k defines what we call the branch factor of the tree, which indicates how fast the tree branches. In this illustration, k is three. We then run k-means again, recursively on each of the resulting quantization cells. This defines the vocabulary tree, which is essentially a hierarchical set of cluster centers and their corresponding Voronoi regions. We typically use a branch factor of 10 and six levels, resulting in a million leaf nodes. We lovingly call this the Mega-Voc.
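
A compact sketch of this hierarchical k-means construction, using scikit-learn's KMeans; the library choice and the dictionary-based node representation are assumptions, while the branch factor and depth defaults follow the narration:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_voc_tree(descriptors, branch=10, depth=6):
    """Recursively cluster descriptors into a vocabulary tree.
    Each node stores its cluster centers and one child per center."""
    node = {"centers": None, "children": []}
    if depth == 0 or len(descriptors) < branch:
        return node                                  # leaf node
    km = KMeans(n_clusters=branch, n_init=4).fit(descriptors)
    node["centers"] = km.cluster_centers_
    for k in range(branch):
        subset = descriptors[km.labels_ == k]
        node["children"].append(build_voc_tree(subset, branch, depth - 1))
    return node
```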

We now have the vocabulary tree, and the online phase can begin. In order to add an image to the database, we perform feature extraction. Each descriptor vector is then dropped down from the root of the tree and quantized very efficiently into a path down the tree, encoded by a single integer. Each node in the vocabulary tree has an associated inverted file index. Indices back to the new image are then added to the relevant inverted files. This is a very efficient operation that is carried out whenever we want to add an image.
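
A hedged sketch of the two online operations just described, reusing the node layout assumed in the tree sketch above: quantize a descriptor into a path down the tree, then add the image's index to the inverted file of every node along that path (all names are illustrative):

```python
import numpy as np
from collections import defaultdict

inverted_files = defaultdict(list)   # node (path prefix) -> list of image ids

def quantize(node, desc, path=()):
    """Drop a descriptor down the vocabulary tree; the path of visited
    child indices encodes the visual word."""
    if node["centers"] is None or not node["children"]:
        return path
    k = int(np.argmin(np.linalg.norm(node["centers"] - desc, axis=1)))
    return quantize(node["children"][k], desc, path + (k,))

def add_image(tree, image_id, descriptors):
    """Add one database image: index it under every node along each path."""
    for desc in descriptors:
        path = quantize(tree, desc)
        for d in range(1, len(path) + 1):
            inverted_files[path[:d]].append(image_id)
```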

When we then wish to query on an input image, we quantize the descriptor vectors of the input image in a similar way, and accumulate scores for the images in the database with so-called term frequency-inverse document frequency (tf-idf). This is effectively an entropy weighting of the information. The winner is the image in the database with the most information in common with the input image.
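
Putting the pieces together, a simplified query-scoring sketch under the same assumed data structures; it weights each shared word by its inverse document frequency and counts a word once per database image, which is a simplification of the full tf-idf scoring:

```python
import numpy as np
from collections import Counter

def query(tree, descriptors, num_db_images):
    """Score database images by shared visual words, weighted by inverse
    document frequency (each shared word counted once per database image)."""
    scores = Counter()
    for desc in descriptors:
        word = quantize(tree, desc)                   # leaf-level visual word
        docs = set(inverted_files.get(word, []))
        if not docs:
            continue
        idf = np.log(num_db_images / len(docs))       # rarer words count more
        for image_id in docs:
            scores[image_id] += idf
    return scores.most_common()                       # best matches first
```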

Performance One of the reasons we managed to take the system this far is that we have worked with rigorous performance testing against a database with ground truth. We now have a benchmark database with over 10000 images grouped in sets of four images of the same object. We query on one of the images in the group and see how many of the other images in the group make it to the top of the query. In the paper, we have tested this up to a million images, by embedding the ground truth database into a database encompassing all the frames of 10 feature length movies.

More words is better: a larger vocabulary improves both retrieval and speed. We have used the performance testing to explore various settings for the algorithm. The most important finding is that the number of leaf nodes in the vocabulary tree is very important. In principle, there is a trade-off, because we wish to have discriminative quantizations, which implies many leaf nodes, while we also wish to have repeatable quantizations, which implies few leaf nodes. However, it seems that in a large practical range, the bigger the better when it comes to the vocabulary. This probably depends on the descriptors used, but is not entirely surprising, since we are trying to divide up a 128-dimensional space into discriminative bins. This is a very useful finding, because it means that we are in a win-win situation: a larger vocabulary improves retrieval performance. It turns out that it also improves the retrieval speed, because our inverted file index becomes much sparser, so we exploit the true power of inverted files better with a larger vocabulary. We also use a hierarchical scoring scheme, which has the nice property that the performance will only level out, not actually go down.

Higher branch factor works better (but slower) We also find that performance improves with the branch factor. This improvement is not dramatic, but it is interesting to note that very low branch factors are somewhat weak, and a branch factor of two results in partitioning the space with planes.

Sampling strategies: sparse (at interest points), dense (uniform), random, or multiple complementary interest operators. To find specific, textured objects, sparse sampling from interest points is often more reliable, and multiple complementary interest operators offer more image coverage. For object categorization, dense sampling offers better coverage. [See Nowak, Jurie & Triggs, ECCV 2006] (Slide credit: K. Grauman, B. Leibe; image credits: F.-F. Li, E. Nowak, J. Sivic)

Can we be more accurate? So far, we treat each image as containing a "bag of words", with no spatial information. Which matches better? [Figure: two candidate strings of visual words compared against a query string.]

Can we be more accurate? So far, we treat each image as containing a "bag of words", with no spatial information. But real objects have consistent geometry.

Final key idea: geometric verification. Goal: given a set of possible keypoint matches, figure out which ones are geometrically consistent. How can we do this?

Final key idea: geometric verification. RANSAC for an affine transform. Repeat N times: randomly choose 3 matching pairs, estimate the affine transformation, then predict the remaining points and count the "inliers".
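
A minimal RANSAC sketch for the affine case; it reuses the hypothetical fit_affine helper from the earlier affine-model sketch, and the iteration count and inlier threshold are illustrative choices rather than values from the lecture:

```python
import numpy as np

def ransac_affine(src_pts, dst_pts, n_iters=1000, inlier_thresh=3.0):
    """Robustly fit an affine transform: sample 3 matches, fit, count inliers."""
    best_inliers = np.zeros(len(src_pts), dtype=bool)
    for _ in range(n_iters):
        sample = np.random.choice(len(src_pts), 3, replace=False)
        A, t = fit_affine(src_pts[sample], dst_pts[sample])
        pred = src_pts @ A.T + t                    # predict all matched points
        err = np.linalg.norm(pred - dst_pts, axis=1)
        inliers = err < inlier_thresh               # reprojection error in pixels
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    A, t = fit_affine(src_pts[best_inliers], dst_pts[best_inliers])  # refit on inliers
    return A, t, best_inliers
```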

Application: Large-Scale Retrieval. Query results on a 5K-image dataset (a demo is available for 100K images). (Slide credit: K. Grauman, B. Leibe) [Philbin CVPR'07]

Example Applications: mobile tourist guide with self-localization, object/building recognition, and photo/video augmentation (example: Aachen Cathedral). (Slide credit: B. Leibe) [Quack, Leibe, Van Gool, CIVR'08]

Application: Image Auto-Annotation. [Figure: recognized landmarks, e.g., Moulin Rouge, Old Town Square (Prague), Tour Montparnasse, Colosseum, Viktualienmarkt, Maypole; left: Wikipedia image, right: closest match from Flickr.] (Slide credit: K. Grauman, B. Leibe) [Quack CIVR'08]

Video Google System (Sivic & Zisserman, ICCV 2003): 1. Collect all visual words within the query region. 2. Use the inverted file index to find relevant frames. 3. Compare word counts. 4. Spatial verification. Retrieved frames are returned. Demo online at http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html (Slide credit: K. Grauman, B. Leibe)

Things to remember. Object instance recognition: find keypoints and compute descriptors, match descriptors, vote for / fit affine parameters, and return the object if the number of inliers exceeds T. Keys to efficiency: visual words and the inverted document file, used for many applications including web-scale search.