Video Google: Text Retrieval Approach to Object Matching in Videos

Presentation transcript:

Video Google: Text Retrieval Approach to Object Matching in Videos Authors: Josef Sivic and Andrew Zisserman University of Oxford ICCV 2003

Motivation Retrieve key frames and shots of a video containing a particular object with the ease, speed and accuracy with which Google retrieves web pages containing particular words Investigate whether the text retrieval approach is applicable to object recognition Visual analogy of a word: a vector-quantized descriptor vector (a "visual word")

Benefits Matches are pre-computed, so at run time the frames and shots containing a particular object can be retrieved with no delay Any object (or conjunction of objects) occurring in a video can be retrieved, even though there was no explicit interest in that object when the descriptors were built

Text Retrieval Approach Documents are parsed into words Words are represented by their stems A stop list rejects common words The remaining words are assigned unique identifiers Each document is represented by a vector of weighted word frequencies The vectors are organized in inverted files Retrieval returns the documents whose vectors are closest (in angle) to the query
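A minimal sketch of this pipeline (my own illustration, not the authors' code): tf-idf weighted frequency vectors stored sparsely, an inverted file, and ranking by the cosine of the angle between query and document vectors. The corpus layout and function names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """docs: {doc_id: list of word ids (already stemmed, stop words removed)}."""
    tf = {d: Counter(words) for d, words in docs.items()}
    df = Counter()                                  # document frequency per word
    for counts in tf.values():
        df.update(counts.keys())
    N = len(docs)
    idf = {w: math.log(N / df[w]) for w in df}
    vecs, inverted = {}, defaultdict(list)          # sparse tf-idf vectors + inverted file
    for d, counts in tf.items():
        nd = sum(counts.values())
        vec = {w: (c / nd) * idf[w] for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vecs[d] = {w: v / norm for w, v in vec.items()}
        for w in vec:
            inverted[w].append(d)                   # posting list: word -> documents
    return vecs, inverted, idf

def query(words, vecs, inverted, idf):
    """Rank documents by the dot product of unit tf-idf vectors (cosine similarity)."""
    counts, nq = Counter(words), (len(words) or 1)
    qvec = {w: (c / nq) * idf.get(w, 0.0) for w, c in counts.items()}
    qnorm = math.sqrt(sum(v * v for v in qvec.values())) or 1.0
    scores = defaultdict(float)
    for w, qv in qvec.items():                      # only touch postings of query words
        for d in inverted.get(w, ()):
            scores[d] += (qv / qnorm) * vecs[d].get(w, 0.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```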

Viewpoint invariant description Two types of viewpoint covariant regions are computed for each frame: Shape Adapted (SA) [Mikolajczyk & Schmid] and Maximally Stable (MS) [Matas et al.] They detect different image areas and provide complementary representations of the frame Descriptors are computed at twice the originally detected region size to be more discriminating

Shape Adapted Regions: the Harris-Affine Operator Elliptical shape adaptation about an interest point Iteratively determine the ellipse center, scale and shape The scale is determined by the local extremum (across scale) of the Laplacian The shape is determined by maximizing intensity-gradient isotropy over the elliptical region Centered on corner-like features
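For intuition, here is a heavily simplified sketch (assumptions throughout; not the Mikolajczyk & Schmid implementation) of the core shape-adaptation update: estimate the second-moment matrix around the point and transform the ellipse so that the local gradient distribution becomes isotropic. Re-warping of the neighborhood, Laplacian scale selection and re-localization of the point are omitted, and the sigmas are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def second_moment_matrix(patch, sigma_d=1.0, sigma_i=2.0):
    """2x2 structure tensor at the center of a grayscale patch."""
    Iy = gaussian_filter(patch, sigma_d, order=(1, 0))   # derivative along y
    Ix = gaussian_filter(patch, sigma_d, order=(0, 1))   # derivative along x
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    Mxx = gaussian_filter(Ix * Ix, sigma_i)[cy, cx]
    Mxy = gaussian_filter(Ix * Iy, sigma_i)[cy, cx]
    Myy = gaussian_filter(Iy * Iy, sigma_i)[cy, cx]
    return np.array([[Mxx, Mxy], [Mxy, Myy]])

def adapt_shape(U, M):
    """One adaptation step: U <- M^(-1/2) U, normalized to unit determinant."""
    w, V = np.linalg.eigh(M)
    M_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    U = M_inv_sqrt @ U
    return U / np.sqrt(np.linalg.det(U))               # keep the ellipse area fixed

def isotropy(M):
    """Eigenvalue ratio of M; the iteration stops when this is close to 1."""
    w = np.linalg.eigvalsh(M)
    return w.min() / w.max()
```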

Examples of Harris-Affine Operator

Maximally Stable Regions Use an intensity-watershed image segmentation Select areas that are approximately stationary as the intensity threshold is varied These correspond to blobs of high contrast with respect to their surroundings
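OpenCV ships an MSER detector, so a hedged example of finding such regions in a key frame might look like this ("frame.png" is a hypothetical file and default parameters are used; the authors' settings may differ):

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)      # each region is an array of pixel coordinates

# fit an ellipse to each region so it can be described like the SA regions
ellipses = [cv2.fitEllipse(r.reshape(-1, 1, 2)) for r in regions if len(r) >= 5]
```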

Examples of Maximally Stable Regions

Feature Descriptor Each elliptical affine-invariant region is represented by a 128-dimensional vector using the SIFT descriptor
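A rough illustration (assumes OpenCV >= 4.4 with SIFT available). In the paper each elliptical region is first affine-normalized to a circle; that step is omitted here, and a keypoint with a hand-picked position and size stands in for a detected region:

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # hypothetical frame
sift = cv2.SIFT_create()

# size set to roughly twice the detected region, as the slides describe
kp = [cv2.KeyPoint(100.0, 80.0, 24.0)]
kp, desc = sift.compute(gray, kp)                       # desc: N x 128 float32
```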

Noise Removal Information is aggregated over a sequence of frames Regions detected in each frame are tracked using a simple constant-velocity dynamical model and correlation Regions not surviving more than 3 frames are rejected The estimated descriptor for a region is computed by averaging its descriptors throughout the track
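A minimal sketch of this aggregation step, under my own simplifying assumptions (greedy nearest-neighbor linking, no correlation check, illustrative thresholds): propagate each region with a constant-velocity model, link it to the closest detection in the next frame, drop short tracks, and average the descriptors along each surviving track.

```python
import numpy as np

def track_and_average(detections, max_dist=10.0, min_len=3):
    """detections: per-frame list of (positions Nx2, descriptors Nx128)."""
    tracks = []                                    # each: {"pos", "vel", "descs"}
    for pos, desc in detections:
        for t in tracks:
            t["pred"] = t["pos"] + t["vel"]        # constant-velocity prediction
        used = set()
        for t in tracks:
            if len(pos) == 0:
                continue
            d = np.linalg.norm(pos - t["pred"], axis=1)
            j = int(np.argmin(d))
            if d[j] < max_dist and j not in used:
                t["vel"] = pos[j] - t["pos"]
                t["pos"] = pos[j]
                t["descs"].append(desc[j])
                used.add(j)
        for j in range(len(pos)):                  # unmatched detections start new tracks
            if j not in used:
                tracks.append({"pos": pos[j], "vel": np.zeros(2), "descs": [desc[j]]})
    # reject short tracks, average the descriptors of the survivors
    return [np.mean(t["descs"], axis=0) for t in tracks if len(t["descs"]) >= min_len]
```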

Noise Removal Tracking region over 70 frames

Visual Vocabulary Goal: vector quantize the descriptors into clusters (visual words) When a new frame is observed, each descriptor in the frame is assigned to its nearest cluster, which immediately generates matches against all other frames

Visual Vocabulary Implementation: K-means clustering Regions are tracked through contiguous frames and an average descriptor is computed per track The 10% of tracks with the highest variance are eliminated, leaving about 1000 regions per frame A subset of 48 shots (~10%) is selected for clustering Distance function: Mahalanobis 6000 SA clusters and 10000 MS clusters
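A hedged sketch of building and using such a vocabulary. K-means under the Mahalanobis distance is implemented here by whitening the descriptors and then running ordinary Euclidean K-means, which is equivalent; the regularization term and the example cluster counts are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descs, k):
    """descs: M x 128 training descriptors (e.g. averaged track descriptors)."""
    mean = descs.mean(axis=0)
    cov = np.cov(descs, rowvar=False) + 1e-6 * np.eye(descs.shape[1])
    W = np.linalg.cholesky(np.linalg.inv(cov)).T        # whitening transform
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit((descs - mean) @ W.T)
    return km, mean, W

def assign_words(descs, km, mean, W):
    """Quantize new descriptors to their nearest cluster center (visual word)."""
    return km.predict((descs - mean) @ W.T)

# e.g. two separate vocabularies, as in the slides:
# sa_vocab = build_vocabulary(sa_descs, 6000)
# ms_vocab = build_vocabulary(ms_descs, 10000)
```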

Visual Vocabulary

Experiments - Setup Goal: match scene locations within a closed world of shots Data: 164 frames from 48 shots taken at 19 different 3D locations; 4-9 frames from each location

Experiments - Retrieval The entire frame is the query Each of the 164 frames is used as the query region in turn Correct retrieval: the other frames showing the same location Retrieval performance: the average normalized rank of relevant images, Rank = (sum over i of Ri − Nrel(Nrel + 1)/2) / (N · Nrel), where Nrel = # of relevant images for the query image, N = size of the image set, and Ri = rank of the ith relevant image The normalized rank lies between 0 and 1. Intuitively, it is 0 if all relevant images are returned ahead of any others, and about 0.5 for a random ordering.
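A small helper (a sketch, not the authors' evaluation code) computing the average normalized rank defined above for a single query:

```python
def avg_normalized_rank(ranking, relevant):
    """ranking: doc ids in retrieved order (best first); relevant: set of relevant ids."""
    N, n_rel = len(ranking), len(relevant)
    ranks = [i + 1 for i, d in enumerate(ranking) if d in relevant]   # 1-based ranks
    return (sum(ranks) - n_rel * (n_rel + 1) / 2) / (N * n_rel)

# perfect retrieval -> 0.0, random ordering -> about 0.5 on average
```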

Experiments - Results Zero is good!

Experiments - Results Precision = # relevant frames retrieved / total # of frames retrieved Recall = # relevant frames retrieved / total # of relevant frames

Stop List The most frequent 5% and the least frequent 10% of visual words are stopped (removed from the vocabulary)
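A short sketch of how such a stop list could be built from corpus-wide visual-word frequencies (the function name and thresholds are mine, taken from the percentages on the slide; the exact frequency measure in the paper may differ):

```python
from collections import Counter

def build_stop_list(all_words, top_frac=0.05, bottom_frac=0.10):
    """all_words: flat list of visual-word ids over the whole corpus."""
    ordered = [w for w, _ in Counter(all_words).most_common()]   # most frequent first
    n = len(ordered)
    return set(ordered[:int(n * top_frac)]) | set(ordered[n - int(n * bottom_frac):])
```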

Spatial Consistency Matched regions in the retrieved frames should have a spatial arrangement similar to that of the outlined region in the query Frames are first retrieved using the weighted frequency vector and then re-ranked based on spatial consistency
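A hedged re-ranking sketch: each putative match votes for every other match that falls inside a search radius in both the query and the retrieved frame. This is a simplification of the paper's neighborhood-voting scheme, and the radius and scoring are illustrative.

```python
import numpy as np

def spatial_consistency_score(q_pts, r_pts, radius=50.0):
    """q_pts, r_pts: N x 2 positions of N putatively matched regions,
    with row i of q_pts matched to row i of r_pts."""
    score = 0
    for i in range(len(q_pts)):
        dq = np.linalg.norm(q_pts - q_pts[i], axis=1)
        dr = np.linalg.norm(r_pts - r_pts[i], axis=1)
        score += int(((dq < radius) & (dr < radius)).sum()) - 1   # exclude the match itself
    return score

# frames are shortlisted by the weighted frequency vector, then re-ranked by
# this score (higher = more consistent spatial arrangement of matches)
```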

More Results

Demo http://www.robots.ox.ac.uk/~vgg/research/vgoogle/how/method/method_a.html http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html