Video Google: Text Retrieval Approach to Object Matching in Videos. Authors: Josef Sivic and Andrew Zisserman, University of Oxford. ICCV 2003.


Motivation
- Retrieve the key frames and shots of a video that contain a particular object with the ease, speed and accuracy with which Google retrieves web pages containing particular words
- Investigate whether a text retrieval approach is applicable to object recognition
- Visual analogy of a word: obtained by vector quantizing the descriptor vectors

Benefits
- Matches are pre-computed, so at run time the frames and shots containing a particular object can be retrieved with no delay
- Any object (or conjunction of objects) occurring in a video can be retrieved, even though there was no explicit interest in that object when the descriptors were built

Text Retrieval Approach
- Documents are parsed into words
- Words are represented by their stems
- A stop list rejects very common words
- Remaining words are assigned a unique identifier
- Each document is represented by a vector of weighted word frequencies
- Vectors are organized in inverted files
- Retrieval returns the documents whose vectors are closest (in angle) to the query vector
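The text retrieval steps listed on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes the common tf-idf weighting (the slide only says "weighted frequency"), and the function names `build_index`, `cosine` and `search` are made up for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs maps doc_id -> list of word ids (already stemmed and stop-listed)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    # Inverse document frequency: words that occur in few documents weigh more.
    idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}

    vectors = {}
    inverted = defaultdict(set)  # inverted file: word id -> documents containing it
    for doc_id, words in docs.items():
        counts = Counter(words)
        vectors[doc_id] = {w: (c / len(words)) * idf[w] for w, c in counts.items()}
        for w in counts:
            inverted[w].add(doc_id)
    return vectors, inverted, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query_words, vectors, inverted, idf):
    """Return doc ids ranked by decreasing cosine similarity to the query."""
    counts = Counter(w for w in query_words if w in idf)
    total = sum(counts.values())
    if total == 0:
        return []
    query = {w: (c / total) * idf[w] for w, c in counts.items()}
    # Only documents that share at least one word with the query are scored.
    candidates = set().union(*(inverted[w] for w in query))
    scored = [(cosine(query, vectors[d]), d) for d in candidates]
    return [d for _, d in sorted(scored, key=lambda t: (-t[0], t[1]))]


docs = {"a": [1, 1, 2], "b": [2, 3], "c": [3, 3, 4]}
vectors, inverted, idf = build_index(docs)
ranking = search([1, 2], vectors, inverted, idf)
```

The inverted file is what keeps the lookup cheap: document "c" shares no word with the query, so it is never scored.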

Viewpoint invariant description
- Two types of viewpoint covariant regions are computed for each frame:
  - Shape Adapted (SA): Mikolajczyk & Schmid
  - Maximally Stable (MSER): Matas et al.
- The two detectors respond to different image areas
- They provide complementary representations of a frame
- Regions are computed at twice the originally detected region size to be more discriminating

Shape Adapted Regions: the Harris-Affine Operator
- Elliptical shape adaptation about an interest point
- Iteratively determine the ellipse center, scale and shape
- Scale is determined by the local extremum (across scale) of the Laplacian
- Shape is determined by maximizing intensity gradient isotropy over the elliptical region
- Regions are centered on corner-like features

Examples of Harris-Affine Operator

Maximally Stable Regions
- Use intensity watershed image segmentation
- Select areas that are approximately stationary as the intensity threshold is varied
- Correspond to blobs of high contrast with respect to their surroundings

Examples of Maximally Stable Regions

Feature Descriptor
- Each elliptical affine invariant region is represented by a 128-dimensional vector using the SIFT descriptor

Noise Removal
- Information is aggregated over a sequence of frames
- Regions detected in each frame are tracked using a simple constant velocity dynamical model and correlation
- Regions that do not survive for more than 3 frames are rejected
- The descriptor for a region is estimated by averaging the descriptors throughout its track
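The rejection and averaging steps can be illustrated as follows. This is only a sketch: it assumes the tracking has already been done and that each track is available as a list of per-frame descriptors. The names `average_descriptor` and `stable_track_descriptors` are hypothetical.

```python
def average_descriptor(track):
    """Component-wise mean of the descriptors collected along one track."""
    n = len(track)
    dim = len(track[0])
    return [sum(d[i] for d in track) / n for i in range(dim)]


def stable_track_descriptors(tracks, min_frames=3):
    """Keep tracks that survive for more than min_frames frames.

    tracks maps a track id to the list of descriptors observed for that
    region, one per frame. Each surviving track is reduced to one averaged
    descriptor.
    """
    return {
        track_id: average_descriptor(descriptors)
        for track_id, descriptors in tracks.items()
        if len(descriptors) > min_frames
    }


tracks = {
    "short": [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]],              # 3 frames: rejected
    "long": [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [2.0, 4.0]],   # 4 frames: kept
}
stable = stable_track_descriptors(tracks)
```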

Noise Removal: tracking a region over 70 frames

Visual Vocabulary
- Goal: vector quantize the descriptors into clusters (visual words)
- When a new frame is observed, each descriptor in that frame is assigned to its nearest cluster, which immediately generates matches against all frames
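A small sketch of this assignment step is given below. It assumes the cluster centers are already known and uses squared Euclidean distance for brevity. The helper names and the toy two-dimensional descriptors are illustrative only.

```python
def nearest_word(descriptor, centers):
    """Index of the cluster center closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = None, float("inf")
    for index, center in enumerate(centers):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def frames_sharing_words(query_descriptors, centers, frame_words):
    """Count, per stored frame, how many visual words it shares with the query.

    frame_words maps frame id -> set of visual word ids found in that frame.
    Frames with no word in common are left out.
    """
    query_words = {nearest_word(d, centers) for d in query_descriptors}
    shared = {}
    for frame_id, words in frame_words.items():
        common = query_words & words
        if common:
            shared[frame_id] = len(common)
    return shared


centers = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
frame_words = {"f1": {0, 2}, "f2": {1}, "f3": {2}}
shared = frames_sharing_words([[1.0, 1.0], [9.0, 9.0]], centers, frame_words)
```

Because every stored frame has been quantized in advance, matching a new descriptor reduces to looking up which frames contain the same word id.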

Visual Vocabulary
- Implementation: K-Means clustering
- Regions are tracked through contiguous frames and an average descriptor is computed for each track
- The 10% of tracks with the highest variance are eliminated, leaving about 1000 regions per frame
- A subset of 48 shots (~10%) is selected for clustering
- Distance function: Mahalanobis
- 6000 SA clusters and 10,000 MS clusters
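To make the clustering step concrete, here is a compact K-Means sketch with a Mahalanobis-style distance. Two simplifications are assumptions of this example and not taken from the slides: the covariance is treated as diagonal (per-dimension variances only), and the centers are initialized from the first k points so that the run is deterministic.

```python
def diagonal_variances(points):
    """Per-dimension variance of the data, floored to avoid division by zero."""
    n, dim = len(points), len(points[0])
    means = [sum(p[i] for p in points) / n for i in range(dim)]
    return [
        max(sum((p[i] - means[i]) ** 2 for p in points) / n, 1e-12)
        for i in range(dim)
    ]


def mahalanobis_sq(x, y, variances):
    """Squared Mahalanobis distance under a diagonal covariance assumption."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances))


def assign(points, centers, variances):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: mahalanobis_sq(p, centers[j], variances))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations; returns (centers, assignment)."""
    variances = diagonal_variances(points)
    centers = [list(p) for p in points[:k]]  # deterministic init (assumption)
    assignment = assign(points, centers, variances)
    for _ in range(iterations):
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        new_assignment = assign(points, centers, variances)
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
    return centers, assignment


points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
centers, assignment = kmeans(points, k=2)
```

Dividing by the variances means that dimensions with a large spread do not dominate the distance, which is the point of using Mahalanobis over Euclidean distance here.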

Visual Vocabulary

Experiments - Setup
- Goal: match scene locations within a closed world of shots
- Data: 164 frames from 48 shots taken at 19 different 3D locations; 4-9 frames from each location

Experiments - Retrieval
- The entire frame is the query
- Each of the 164 frames is used as the query region in turn
- Correct retrieval: other frames which show the same location
- Retrieval performance: average normalized rank of relevant images

$$\widetilde{Rank} = \frac{1}{N N_{rel}} \left( \sum_{i=1}^{N_{rel}} R_i - \frac{N_{rel}(N_{rel}+1)}{2} \right)$$

- N_rel = number of relevant images for the query image
- N = size of the image set
- R_i = rank of the i-th relevant image
- The rank lies between 0 and 1. Intuitively, it is 0 if all relevant images are returned ahead of any others, and about 0.5 for random retrieval.
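The measure can be checked with a few lines of Python. The sketch below assumes the usual form of the average normalized rank, (sum of R_i - N_rel(N_rel + 1)/2) / (N * N_rel), with ranks counted from 1. The function name is illustrative.

```python
def normalized_rank(relevant_ranks, n_images):
    """Average normalized rank for one query.

    relevant_ranks: 1-based ranks R_i at which the relevant images were returned.
    n_images: N, the size of the image set.
    """
    n_rel = len(relevant_ranks)
    best_possible = n_rel * (n_rel + 1) / 2  # sum of ranks 1..N_rel
    return (sum(relevant_ranks) - best_possible) / (n_images * n_rel)


perfect = normalized_rank([1, 2, 3], n_images=100)     # relevant images first
worst = normalized_rank([98, 99, 100], n_images=100)   # relevant images last
```

Subtracting the best possible rank sum is what makes a perfect retrieval score exactly 0, whatever the number of relevant images.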

Experiments - Results: a lower rank is better (zero is good!)

Experiments - Results
- Precision = # relevant images / total # of frames retrieved
- Recall = # correctly retrieved frames / # relevant frames
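As a quick worked example of these two definitions, the following sketch computes both values for a single retrieved list. The frame ids are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one retrieved list against the relevant set."""
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for frame in retrieved if frame in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four frames retrieved, two of them relevant; three relevant frames exist.
precision, recall = precision_recall(["a", "b", "c", "d"], {"a", "c", "e"})
```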

Stop List
- The most frequent 5% and the least frequent 10% of visual words are stopped
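One way to build such a stop list is sketched below. It assumes that the percentages refer to the vocabulary ranked by occurrence count. The function name and the integer percentage parameters are choices made for this example.

```python
from collections import Counter


def build_stop_list(word_occurrences, top_percent=5, bottom_percent=10):
    """Return the set of word ids to suppress.

    word_occurrences: iterable of word ids, one entry per occurrence in the corpus.
    The most frequent top_percent and least frequent bottom_percent are stopped.
    """
    counts = Counter(word_occurrences)
    # Most frequent first; ties are broken by word id to keep the result stable.
    ranked = [w for w, _ in sorted(counts.items(), key=lambda t: (-t[1], t[0]))]
    n_top = len(ranked) * top_percent // 100
    n_bottom = len(ranked) * bottom_percent // 100
    stopped = set(ranked[:n_top])
    if n_bottom:
        stopped |= set(ranked[-n_bottom:])
    return stopped


# Toy corpus with 20 words: word w occurs w + 1 times.
occurrences = [w for w in range(20) for _ in range(w + 1)]
stop_list = build_stop_list(occurrences)
```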

Spatial Consistency
- Matched regions in the retrieved frames should have a similar spatial arrangement to those in the outlined region of the query
- Retrieve frames using the weighted frequency vector, then re-rank them based on spatial consistency
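The re-ranking idea can be illustrated with a simplified neighbourhood vote. This sketch is an assumption-laden stand-in, not the method used in the paper: it considers only the matched regions, takes the k nearest matches around each match in the query and in the retrieved frame, and counts one vote for every neighbour that appears in both neighbourhoods. The names and the value of `k` are illustrative.

```python
def k_nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding itself."""
    px, py = points[index]
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: (points[j][0] - px) ** 2 + (points[j][1] - py) ** 2)
    return set(others[:k])


def spatial_votes(matches, k=15):
    """Total support votes for one retrieved frame.

    matches: list of ((qx, qy), (fx, fy)) pairs giving the position of each
    matched region in the query and in the retrieved frame.
    """
    query_pts = [q for q, _ in matches]
    frame_pts = [f for _, f in matches]
    votes = 0
    for i in range(len(matches)):
        # A neighbouring match supports match i if it is nearby in both images.
        support = k_nearest(i, query_pts, k) & k_nearest(i, frame_pts, k)
        votes += len(support)
    return votes


def rerank(candidates, k=15):
    """Order frame ids by decreasing spatial-consistency votes."""
    return sorted(candidates, key=lambda f: (-spatial_votes(candidates[f], k), f))


query = [(0, 0), (1, 0), (10, 0), (11, 0)]
shifted = [(x + 5, y) for x, y in query]            # same layout, translated
scrambled = [(0, 0), (10, 0), (1, 0), (11, 0)]      # layout not preserved
candidates = {
    "scrambled": list(zip(query, scrambled)),
    "consistent": list(zip(query, shifted)),
}
order = rerank(candidates, k=1)
```

Both candidate frames contain the same four matched words, so a frequency vector alone cannot separate them; only the frame that preserves the layout collects votes.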

More Results

Related Web Pages
- h/vgoogle/how/method/method_a.html
- h/vgoogle/index.html