CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
Indexing and Retrieval
James Hill, Ozcan Ilikhan, Mark Lenz {jshill4, ilikhan,

Presentation Outline
1. Introduction
2. Common methods used in the papers
   - SIFT descriptor
   - k-means clustering
   - TF-IDF weighting
3. Video Google
4. Scalable Recognition with a Vocabulary Tree
5. City-Scale Location Recognition

Introduction
Find identical objects in multiple images.
Difficulties with changes in:
- Scale
- Orientation
- Viewpoint
- Lighting
Search time and storage space.

Indexing and Retrieval: Common Solutions
- Invariant features (e.g., SIFT)
- kd-trees
- Best Bin First

SIFT (Scale-Invariant Feature Transform): Key Steps
1. Compute the difference of Gaussians in scale space.
2. Take maxima and minima as feature points.
3. Remove low-contrast points and non-robust edge points.
4. Assign each point an orientation.
5. Create a descriptor from the windowed region.
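
As an aside, the whole pipeline above is bundled inside OpenCV; a minimal sketch (assuming opencv-python 4.4+, where SIFT lives in the main module, and a placeholder filename):

```python
import cv2

# Load a frame in grayscale (the filename is a placeholder).
img = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)

# Steps 1-5 above are performed inside OpenCV's SIFT implementation.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint carries a location, scale, and orientation;
# each descriptor is a 128-dimensional vector from the windowed region.
print(len(keypoints), descriptors.shape)
```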

SIFT (Scale-Invariant Feature Transform): Key Benefits
- Feature points are invariant to scale and translation.
- Orientations provide invariance to rotation.
- Distinctive descriptors are partially invariant to changes in illumination and viewpoint.
- Robust to background clutter and occlusion.

k-means clustering: Motivation
We want a method for finding the centers of the different clusters in a set of data.

k-means clustering
(Figures: clustering illustrated on example data.)

k-means clustering
How do we find these means? We need to perform a minimization on the within-cluster sum of squared distances:

$\min_{\mu_1,\dots,\mu_k} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

where $S_i$ is the set of points assigned to center $\mu_i$.
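
A minimal NumPy sketch of the standard Lloyd iteration that performs this minimization (all names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```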

k-means clustering
How do we extend this? With hierarchical k-means clustering!

k-means clustering
(Figures: hierarchical k-means clustering illustrated.)

k-means clustering
Now that we can cluster our data, how can we use this to quickly find the closest vector in our data for a given test vector?

k-means clustering
We will build a vocabulary tree using this clustering method. Each vector in our data (including the means) is considered a "word" in our vocabulary, and the tree is built from the means of our data, as sketched below.
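
A sketch of that construction, reusing the kmeans function above (the Node layout is an assumption for illustration):

```python
import numpy as np

class Node:
    def __init__(self, center):
        self.center = center   # the mean ("word") this node represents
        self.children = []     # up to k child nodes; empty at the leaves

def build_vocab_tree(X, k, depth):
    """Recursively split the descriptors into k clusters per level."""
    X = np.asarray(X, dtype=float)
    node = Node(X.mean(axis=0))
    if depth == 0 or len(X) <= k:
        return node
    centers, labels = kmeans(X, k)   # from the earlier sketch
    for j in range(k):
        part = X[labels == j]
        if len(part) > 0:
            node.children.append(build_vocab_tree(part, k, depth - 1))
    return node
```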

k-means clustering
(Figures: building the vocabulary tree from the cluster means.)

TF-IDF
Term frequency–inverse document frequency (tf–idf) is a statistical measure of how important a word is to a document in a collection or corpus. It is a standard weight in information retrieval and text mining.

TF-IDF
- $n_{id}$: the number of occurrences of word $i$ in document $d$
- $n_d$: the total number of words in document $d$
- $N_i$: the number of documents containing word $i$
- $N$: the total number of documents in the whole database
These combine into the weight $t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}$.

TF-IDF
weight = word frequency × inverse document frequency
Each document is represented by a vector of these weights; the vectors are then organized into an inverted file.
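
Putting the two factors together as defined above (a sketch; the numbers in the example are made up):

```python
import math

def tfidf(n_id, n_d, N_i, N):
    """Weight of word i in document d: (n_id / n_d) * log(N / N_i)."""
    return (n_id / n_d) * math.log(N / N_i)

# A word occurring 3 times in a 600-word document,
# and appearing in 10 of 10,000 documents overall:
print(tfidf(3, 600, 10, 10_000))   # 0.005 * log(1000) ≈ 0.035
```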

Video Google
"A Text Retrieval Approach to Object Matching in Videos"
Josef Sivic and Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford, United Kingdom
Proceedings of the International Conference on Computer Vision (2003)

Video Google
Journal version: "Efficient Visual Search of Videos Cast as Text Retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 31, Number 4, 2009.
Fundamental idea of the paper: retrieve the key frames and shots of a video containing a particular object with the ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.

Video Google
Recall text retrieval (preprocessing):
1. Parse documents into words.
2. Stem the words: "walk" = {"walk", "walking", "walks", ...}.
3. Use a stop list to reject very common words, such as "the" and "an".
4. Represent each document by a vector whose components are the frequencies of occurrence of the words it contains.
5. Store the vectors in an inverted file.
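
A toy sketch of that pipeline (the suffix-stripping stemmer and the stop list are deliberately naive stand-ins for real ones):

```python
from collections import Counter, defaultdict

STOP = {"the", "an", "a", "of", "and"}        # tiny illustrative stop list

def stem(word):
    # Crude suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """docs: {doc_id: text}. Returns per-document word counts and an inverted file."""
    vectors, inverted = {}, defaultdict(set)
    for doc_id, text in docs.items():
        words = [stem(w) for w in text.lower().split() if w not in STOP]
        vectors[doc_id] = Counter(words)           # step 4: frequency vector
        for w in vectors[doc_id]:
            inverted[w].add(doc_id)                # step 5: inverted file
    return vectors, inverted
```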

Video Google
Can we treat video the same way? What and where are the words of a video?

Video Google
The Video Google algorithm, (a) pre-processing (off-line):
1. Detect affine covariant regions in each key frame of the video.
2. Reject unstable regions.
3. Build the visual vocabulary.
4. Remove stop-listed words.
5. Compute weighted document frequencies.
6. Build the index (inverted file).

Video Google
Building a Visual Vocabulary
Step 1. Compute viewpoint-invariant regions:
- Shape Adapted (SA) regions: centered on corner-like features.
- Maximally Stable (MS) regions: correspond to blobs of high contrast with respect to their surroundings, such as a dark window on a gray wall.
Each region is represented by a 128-dimensional vector using the SIFT descriptor.
A 720 x 576 pixel video frame yields ≈ 1200 regions.

Video Google
Step 2. Reject unstable regions: any region that does not survive for more than 3 frames is rejected. This "stability check" significantly reduces the number of regions, to about 600 per frame.

Video Google
Step 3. Build the visual vocabulary: use k-means clustering to vector-quantize the descriptors into clusters, using the Mahalanobis distance:

$d(x_1, x_2) = \sqrt{(x_1 - x_2)^{\top} \Sigma^{-1} (x_1 - x_2)}$
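
For reference, the distance computation in NumPy (Σ would be the descriptor covariance, estimated offline):

```python
import numpy as np

def mahalanobis(x1, x2, cov):
    """sqrt((x1 - x2)^T Σ^{-1} (x1 - x2)); cov is the covariance matrix Σ."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```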

Video Google
Step 4. Remove stop-listed visual words: the most frequent visual words, which occur in almost all images (such as specular highlights, which appear in many frames), are rejected.

Video Google
Step 5. Compute the tf-idf weighted document-frequency vector (variations of tf-idf may be used).
Step 6. Build the inverted-file indexing structure.

Video Google
The Video Google algorithm, (b) run-time (on-line):
1. Determine the set of visual words within the query region.
2. Retrieve key frames based on visual-word frequencies.
3. Re-rank the top key frames using spatial consistency.

Video Google
Spatial consistency: matched covariant regions in the retrieved frames should have a spatial arrangement similar to that of the outlined region in the query image.
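
One way to turn that idea into a score is neighborhood voting: a match earns votes when its neighbors in the query frame also match to neighbors in the retrieved frame. A rough sketch (the paper votes over a fixed number of nearest regions; k here is illustrative):

```python
import numpy as np

def spatial_consistency_score(q_pts, r_pts, k=5):
    """q_pts[i] and r_pts[i] are the (x, y) positions of the i-th tentative
    match in the query and retrieved frames. Each match scores one vote per
    neighbor index it shares across the two frames."""
    q, r = np.asarray(q_pts, dtype=float), np.asarray(r_pts, dtype=float)
    score = 0
    for i in range(len(q)):
        q_nn = np.argsort(((q - q[i]) ** 2).sum(axis=1))[1 : k + 1]
        r_nn = np.argsort(((r - r[i]) ** 2).sum(axis=1))[1 : k + 1]
        score += len(set(q_nn.tolist()) & set(r_nn.tolist()))
    return score
```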

Video Google
How it works: query region and its close-up.

Video Google
How it works: original matches based on visual words.

Video Google
How it works: matches after using the stop list.

Video Google
How it works: final set of matches after filtering on spatial consistency.

Video Google
Real-time demo.

CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
Scalable Recognition with a Vocabulary Tree
James Hill, Ozcan Ilikhan, Mark Lenz {jshill4, ilikhan,

The Paper
"Scalable Recognition with a Vocabulary Tree"
David Nister and Henrik Stewenius
Center for Visualization and Virtual Environments, Department of Computer Science, University of Kentucky
Appeared in the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006).

What are we trying to do?
Provide an indexing scheme that:
- scales to large image databases (1 million images), and
- retrieves images in an acceptable amount of time.

Inspiration
Sivic and Zisserman (what you just saw):
- used k-means to partition the descriptors of several pictures;
- used tf-idf to score an image and find a close match.

What's new?
- The idea of a vocabulary tree.
- Using a larger vocabulary tree speeds things up and improves match quality.
- Many more training images can be used (35,000 vs. 400).
- New images can be inserted into the database quickly (0.2 s vs. 10 s).

How do we do it?
Follow these three steps:
1. Build the vocabulary tree using the image descriptors.
2. Generate a score for a given query image.
3. Find the images in the database that best match that score.

Recap: the Vocabulary Tree
1. For each image in our database, we calculate a set of feature-point descriptors.
2. Each of these descriptors is a vector in a 128-dimensional space.
3. Consider each of these vectors to be a "word" in the vocabulary of our database.

Recap: the Vocabulary Tree
Build the vocabulary tree using hierarchical k-means clustering.

Recap: the Vocabulary Tree
(Figures: hierarchical partitioning of the descriptor space and the resulting tree.)

What's it good for?
Now that we have a vocabulary tree, each descriptor generates a path down the tree, which can be stored in a single integer for scoring. At each level of the tree, the descriptor is compared to each of the k children using a dot product, and the closest child is the path that is followed.
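
A sketch of that descent, reusing the Node class from the earlier sketch (the paper stores the resulting path as a single integer; a list of child indices is used here for clarity):

```python
import numpy as np

def descend(root, descriptor):
    """Follow the child with the largest dot product at each level;
    the returned child-index path identifies the leaf word."""
    path, node = [], root
    while node.children:
        sims = [float(np.dot(descriptor, c.center)) for c in node.children]
        best = int(np.argmax(sims))
        path.append(best)
        node = node.children[best]
    return path
```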

Scoring
We have a bunch of paths through the tree; how do we compare the query image to a database image? At each node $i$, we define a weight $w_i$. The paper suggests two methods:
- a constant weighting scheme, or
- an entropy weighting scheme such as $w_i = \ln \frac{N}{N_i}$

Scoring (continued)
where $N$ is the number of images in the database and $N_i$ is the number of images in the database with at least one descriptor-vector path through node $i$.

Scoring (continued)
This scoring mechanism results in a tf-idf scheme: a query image and a database image score higher the more (and the more distinctive) nodes their descriptor paths share.

Scoring (continued)
To compare two scores, we use the normalized difference between the query vector $q$ and the database vector $d$:

$s(q, d) = \left\lVert \frac{q}{\lVert q \rVert} - \frac{d}{\lVert d \rVert} \right\rVert$
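
In code (a sketch; p=1 selects the L1 norm the paper recommends):

```python
import numpy as np

def normalized_difference(q, d, p=1):
    """|| q/||q|| - d/||d|| || for tf-idf vectors q (query) and d (database)."""
    qn = q / np.linalg.norm(q, ord=p)
    dn = d / np.linalg.norm(d, ord=p)
    return float(np.linalg.norm(qn - dn, ord=p))
```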

Scoring (continued)
The authors found that the most important factors for quality were:
- a large vocabulary tree;
- stronger weights toward the leaves of the tree;
- using the L1 norm in the previous equation.

Scoring Implementation
Scoring is implemented using inverted files:
- At each node, create an inverted file.
- Each file contains a list of the images in which the current node appears.
- The inverted file of an inner node is simply the concatenation of its children's inverted files.
- Database image scores are pre-computed and pre-normalized.
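
A sketch of the accumulation (the min-based overlap term is one standard way to expand the L1 normalized difference over inverted files; the data layout here is an assumption):

```python
from collections import defaultdict

def accumulate_scores(query_vec, inverted_files):
    """query_vec: {node_id: normalized tf-idf entry of the query}.
    inverted_files: {node_id: {image_id: normalized db entry}}.
    For L1-normalized vectors, ||q - d||_1 = 2 - 2 * sum_i min(q_i, d_i),
    so accumulating min() over shared nodes ranks the images correctly."""
    overlap = defaultdict(float)
    for node_id, q_i in query_vec.items():
        for image_id, d_i in inverted_files.get(node_id, {}).items():
            overlap[image_id] += min(q_i, d_i)
    # Smaller distance = better match.
    return {img: 2.0 - 2.0 * s for img, s in overlap.items()}
```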

Testing
The method was tested using a database of CD album covers; photos of CD covers were then used as query images and run against the database. It was also tested on 6,376 images in groups of 4, querying each image in the hope that the other 3 images would produce the top scores. The authors tested databases with image counts as high as 1 million (the highest at the time of writing).

Conclusions
The main conclusions of the paper:
- Using a larger vocabulary tree makes things better.
- The L1 norm in the normalized difference of the scores produces better results than the L2 norm.
- The method scales up to 1 million images and still runs in near real time.

CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
City-Scale Location Recognition
James Hill, Ozcan Ilikhan, Mark Lenz {jshill4, ilikhan,

City-Scale Location Recognition
Estimate location by matching features from a large set of images.

City-Scale Location Recognition
A city-wide database of photos labeled with location.

Image Features
SIFT features are invariant to:
- Translation
- Scale
- Orientation
- Illumination (partially)

Difficulties Matching Features
Storage space: 30,000 images ≈ 100,000,000 SIFT features ≈ 12 GB.
Search time: kd-trees with Best Bin First require keeping all the descriptors available for search.

Method
1. Cluster features into visual words.
2. Build a vocabulary tree from the clusters.
3. Search the tree to score matches.
4. Report the location of the image with the top score.

Method
- Build trees with informative features.
- Create trees of varying branching factor.
- Vary the number of comparisons during search.

Vocabulary Tree
- A visual word = a region of an object.
- We just need the distance between a query feature and each node.
- Only leaf nodes are words.

Informative Features
- Cluster small subsets of the data into visual words.
- Compute the information gain of the features.
- Select the most informative features to build the tree.

Information Gain
An informative feature is:
- found in all images of a location, and
- not found in any image of another location.
Information gain is a measure of how much new information reduces uncertainty.
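
The slide's criterion is the standard information-gain formula; written out (a hedged reconstruction, with $L$ the location label and $F$ the event that the feature is observed):

```latex
IG(L; F) = H(L) - H(L \mid F)
         = H(L) - \sum_{f \in \{0,1\}} P(F = f)\, H(L \mid F = f)
```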

Building the Tree
- Hierarchical k-means is used to cluster the features.
- Nodes are the centroids.
- Leaves are the visual words.

Branching Factor
- Vary the number of nodes compared, to increase search accuracy.
- For a fixed vocabulary size $M$, the branching factor $k$ and depth $L$ satisfy $k^L \approx M$. For example, $k = 10$ and $L = 6$ give a vocabulary of about one million words.

Greedy N-Best Paths
- An approximate nearest-neighbor search, similar to Best Bin First.
- A generalization of the vocabulary-tree search.
- Searches multiple branches at each level.

Greedy N-Best Paths
Requires $k + kN(L - 1)$ comparisons per descriptor: $k$ at the first level, then $kN$ at each of the remaining $L - 1$ levels.
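
A sketch of GNP over the Node tree from the earlier sketches, assuming a balanced tree so all leaves sit at the same depth; the comparison count matches the formula above (k at the root, then at most kN per remaining level):

```python
import numpy as np

def greedy_n_best_paths(root, descriptor, N=10):
    """Like the single-path descent, but keep the N closest nodes at each
    level; returns the candidate leaf words."""
    frontier = [root]
    while frontier[0].children:          # balanced tree: all or none are leaves
        children = [c for node in frontier for c in node.children]
        children.sort(key=lambda c: float(np.dot(descriptor, c.center)),
                      reverse=True)      # larger dot product = closer
        frontier = children[:N]
    return frontier
```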

Matching
Votes for image $d$ = $C_d$, computed in time linear in the number of features.

Results
- 30,000 images covering 20 km.
- 278 GPS-labeled query images.
- Performance = the percentage of query images localized within 10 m of ground truth.

Results
Informative features vs. uniform.

Results
Greedy N-Best Paths vs. Best Bin First.

Results
Top n matches.

Conclusion
- The vocabulary-tree structure affects performance in recognition tasks.
- The structure becomes more critical as database size increases.
- The number of comparisons drives performance, not the branching factor.

CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
Q & A, Discussion
Monday, November 29, 2010

CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
Acknowledgements
Many thanks to Prof. Andrew Zisserman and Dr. Josef Sivic for providing us with extra materials for the presentation.