Special Topic on Image Retrieval

Measure Image Similarity by Local Feature Matching
Matching criteria (MATLAB demo):
- Distance criterion: a test feature from one image matches a comparison image if the normalized L2-distance to its nearest neighbor in that image is less than ϵ.
- Distance ratio criterion: a test feature matches if the ratio between the distances to its nearest and second-nearest neighbors in the comparison image is less than 0.80.
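As a sketch of the two criteria in NumPy (function and variable names are ours; the slides fix the ratio at 0.80 but leave ϵ open, so eps=0.5 below is only a placeholder):

```python
import numpy as np

def match_feature(q, db_desc, eps=0.5, ratio=0.8):
    """Evaluate both matching criteria for one query descriptor q against a
    comparison image's descriptors db_desc (n x d array)."""
    dists = np.linalg.norm(db_desc - q, axis=1)   # normalized L2-distances
    order = np.argsort(dists)
    nn, nn2 = dists[order[0]], dists[order[1]]
    distance_ok = nn < eps                        # distance criterion
    ratio_ok = nn / max(nn2, 1e-12) < ratio       # distance ratio criterion
    return order[0], distance_ok, ratio_ok
```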

SIFT Matching by Threshold
Figure: the distribution of identified true matches and false matches as a function of the L2-distance threshold.

Coefficient distributions of the top 20 dimensions in SIFT after PCA

Direct matching: the complexity issue
Assume each image is described by m = 1,000 descriptors of dimension d = 128. For a database of N = 1 million images, that is N × m = 1 billion descriptors to index. Holding the database in RAM takes 128 GB at 1 byte per dimension. Exhaustive search costs m² × N × d elementary operations, i.e., more than 10^14, which is computationally intractable. The quadratic term m² severely impacts efficiency.

Bag of Visual Words
Text words in Information Retrieval (IR): represent a document as a bag of words, trading completeness for compactness and descriptiveness. Two examples:
- "Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected." (Hubel, Wiesel) -> sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image.
- "China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with an 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan." -> China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value.

Bag of Visual Words
Could images be represented in the same way, as a bag of "visual words"?

CBIR based on BoVW
On-line: query image -> feature extraction -> vector quantization -> index lookup -> retrieval results.
Off-line: database images -> feature extraction -> codebook training -> vector quantization -> image index.

Bag-of-visual-words
The BoV representation was first introduced for texture classification [Malik '99] and popularized by the "Video Google" paper (Sivic and Zisserman, ICCV 2003), which mimics a text retrieval system for image/video retrieval, achieving high retrieval efficiency and excellent recognition performance.
Key idea: map the n local descriptors describing an image to a single vector. The resulting vectors are sparse, so comparison is efficient, and the representation inherits the invariance of the local descriptors.
Problem: how do we generate the visual words?

Bag-of-visual words
The goal is to "put the images into words", namely visual words. Input local descriptors are continuous, so we need to define what a "visual word" is. This is done by a quantizer q: each local descriptor is assigned to its nearest neighbor among a set of centroids, typically learned by k-means, called the "visual dictionary" of size k.
Quantization is lossy: we cannot get back to the original descriptor. But it is much more compact: typically 2-4 bytes per descriptor.

Popular Quantization Schemes: K-means, K-d tree, LSH, Product quantization, Scalar quantization, CSH (Cascaded Scalable Hashing).

K-means Clustering
Given a dataset X = {x_1, ..., x_N}, the goal is to partition it into K clusters with centers {μ_1, ..., μ_K}.
Formulation: minimize J = Σ_n Σ_k r_nk ||x_n − μ_k||², where r_nk ∈ {0, 1} is the assignment of data point x_n to cluster k.
Solution: fix {μ_k} and solve for the assignments r_nk (each point goes to its nearest center); then fix r_nk and solve for μ_k (the mean of the assigned points). Iterate the two steps until convergence.

General Steps of the K-means Algorithm
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to step 3.
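A minimal NumPy implementation of steps 1-5 under the usual Euclidean-distance assumption (all names are ours):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Steps 1-5 above: random init, assign, re-estimate, stop when stable."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 2
    assign = None
    for _ in range(max_iters):
        # step 3: membership = nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                            # step 5: no change
        assign = new_assign
        for j in range(k):                                   # step 4
            members = X[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assign
```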

Figure: illustration of the K-means algorithm on the re-scaled Old Faithful data set, with a plot of the cost function. The algorithm has converged after the third iteration.

Large Vocabularies with a Learned Quantizer
Feature matching by quantization:
- Hierarchical k-means [Nister '06]
- Approximate k-means [Philbin '07], based on approximate nearest neighbor search with parallel randomized k-d trees

Bag of Visual Words with Vocabulary Tree
Interest point detection and local feature extraction, followed by hierarchical k-means clustering of the feature space; descending the resulting tree maps a feature to its final visual word.
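A toy vocabulary-tree sketch, with SciPy's kmeans2 standing in for the clustering step (class and method names are ours; the seed keyword needs a recent SciPy):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

class VocabTreeNode:
    """Hierarchical k-means: split the data into b children per level."""
    def __init__(self, X, b=10, depth=3):
        self.b, self.centers, self.children = b, None, None
        if depth == 0 or len(X) <= b:
            return                                 # leaf = one visual word
        self.centers, labels = kmeans2(X, b, minit='points', seed=0)
        self.children = [VocabTreeNode(X[labels == j], b, depth - 1)
                         for j in range(b)]

    def word(self, f, path=0):
        """Descend to a leaf; the path taken encodes the visual word id."""
        if self.centers is None:
            return path
        j = int(np.argmin(np.linalg.norm(self.centers - f, axis=1)))
        return self.children[j].word(f, path * self.b + j)
```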

Bag of Visual Words
Summarize the entire image by its distribution (histogram) of visual word occurrences: a histogram of frequencies over the visual words codebook.

Bag of Visual Words with Vocabulary Tree
Histogram entries are weighted by TF (term frequency) and IDF (inverse document frequency):
t_i = (n_id / n_d) · log(N / n_i)
where n_id is the number of occurrences of word i in image d, n_d the total number of words in image d, n_i the number of occurrences of word i in the whole database, and N the number of images in the whole database.
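In code, using exactly the slide's quantities (the function name is ours):

```python
import numpy as np

def tfidf(word_ids, n_i, N):
    """t_i = (n_id / n_d) * log(N / n_i) for one image d.

    word_ids: visual word ids of the image's features;
    n_i: per-word occurrence counts over the whole database (vocab-sized array);
    N: number of images in the whole database."""
    n_d = len(word_ids)
    n_id = np.bincount(word_ids, minlength=len(n_i))
    return (n_id / n_d) * np.log(N / np.maximum(n_i, 1))
```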

Bag of Visual Words
Step 1: interest point detection and local feature extraction.

Bag of Visual Words
Step 2: features are clustered to quantize the space into a discrete number of visual words.

Bag of Visual Words
Step 3: given a new image, the nearest visual word is identified for each of its features.

Bag of Visual Words
Step 4: a bag-of-visual-words histogram can be used to summarize the entire image.

Image Indexing and Comparison based on the Visual Word Histogram Representation
- Image indexing with visual words: represent the sparse image-visual-word matrix, storing only the non-empty entries (an inverted file, as sketched below).
- Image distance measurement on the resulting histograms.
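A minimal inverted-file sketch for the sparse image-visual-word matrix (our own structure; a real system would add TF-IDF weighting and normalization):

```python
from collections import Counter, defaultdict

index = defaultdict(list)          # visual word -> [(image_id, count), ...]

def add_image(image_id, word_ids):
    for w, c in Counter(word_ids).items():     # store only non-empty entries
        index[w].append((image_id, c))

def query(word_ids):
    """Score database images by walking only the query words' posting lists."""
    scores = defaultdict(int)
    for w, cq in Counter(word_ids).items():
        for image_id, c in index.get(w, []):
            scores[image_id] += cq * c         # unnormalized histogram dot product
    return sorted(scores.items(), key=lambda kv: -kv[1])
```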

Popular Quantization Schemes: K-means, K-d tree, LSH, Product quantization, Scalar quantization, CSH.

KD-Tree (k-dimensional tree)
Divide the feature space with a binary tree. Each non-leaf node corresponds to a cutting plane, and the feature points within each rectilinear cell are similar to each other.

KD-Tree
A binary tree in which each node denotes a rectilinear region. Each non-leaf node corresponds to a cutting plane, and the directions of the cutting planes alternate with depth. (Example figure.)
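For illustration, SciPy's cKDTree builds exactly such a tree over descriptors (the data here is a random stand-in):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
db = rng.random((10000, 128))            # stand-in database descriptors
tree = cKDTree(db)                        # axis-aligned cuts alternate with depth

q = rng.random(128)
dists, idxs = tree.query(q, k=2)          # nearest and second-nearest neighbors
print(idxs[0], dists[0] / dists[1])       # NN index and the distance ratio
```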

Popular Quantization Schemes: K-means, K-d tree, LSH, Product quantization, Scalar quantization, CSH.

Locality-Sensitive Hashing [Indyk and Motwani 1998] [Datar et al. 2004]
Index each vector by a compact binary code produced by random hash functions. The hash-code collision probability of ε-approximate neighbors is proportional to their original similarity. Parameters: l hash tables, with K hash bits per table. (Courtesy: Shih-Fu Chang, 2012.)
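A random-hyperplane LSH sketch (one common sign-based hash family; the slide does not fix the family, so this is illustrative and the names are ours):

```python
import numpy as np

class RandomHyperplaneLSH:
    """l hash tables with K random-hyperplane sign bits each; collision
    probability grows with the cosine similarity of the original vectors."""
    def __init__(self, dim, l=4, K=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((l, K, dim))
        self.tables = [dict() for _ in range(l)]

    def _keys(self, x):
        bits = (self.planes @ x) > 0                  # (l, K) sign bits
        return [tuple(row) for row in bits]

    def add(self, item_id, x):
        for table, key in zip(self.tables, self._keys(x)):
            table.setdefault(key, []).append(item_id)

    def query(self, x):
        out = set()                                    # union over the l tables
        for table, key in zip(self.tables, self._keys(x)):
            out.update(table.get(key, []))
        return out
```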

Popular Quantization Schemes: K-means, K-d tree, LSH, Product quantization, Scalar quantization, CSH.

Product Quantization (TPAMI '12)
Basic idea: partition the feature vector into m sub-vectors and quantize the sub-vectors with m distinct quantizers.
Advantages: low quantization complexity, yet an extremely fine quantization of the feature space.
Vector distances are approximated through distance lookup tables (a K × K table of entries d_11, d_12, ..., d_1K, ..., d_K1, ..., d_KK per sub-quantizer).

Product Quantization
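Complementing the illustration, a compact sketch of product quantization with asymmetric distance computation via per-sub-vector lookup tables (training via SciPy k-means; all names are ours, and d must be divisible by m):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def train_pq(X, m=8, K=256):
    """m sub-quantizers, each a K-word codebook over one slice of the vector."""
    d = X.shape[1] // m
    return [kmeans2(X[:, i*d:(i+1)*d], K, minit='points', seed=0)[0]
            for i in range(m)]

def encode(codebooks, x):
    """Replace each sub-vector by the index of its nearest sub-codeword."""
    d = len(x) // len(codebooks)
    return [int(np.argmin(np.linalg.norm(cb - x[i*d:(i+1)*d], axis=1)))
            for i, cb in enumerate(codebooks)]

def adc(codebooks, q, codes):
    """Approximate squared distances from query q to many encoded vectors:
    build one K-entry distance table per sub-quantizer, then sum lookups."""
    d = len(q) // len(codebooks)
    tables = [np.sum((cb - q[i*d:(i+1)*d])**2, axis=1)
              for i, cb in enumerate(codebooks)]
    return [sum(t[c] for t, c in zip(tables, code)) for code in codes]
```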

Popular Quantization Schemes: K-means, K-d tree, LSH, Product quantization, Scalar quantization, CSH.

Scalar Quantization
SIFT distance distributions of true matches and false matches from an experimental study. From Fig. 1 we observe that, as the distance threshold increases, the number of identified true matches first steadily increases and then levels off. The number of identified false matches, on the other hand, first stays relatively stable and then increases ever more sharply once the threshold exceeds 0.5. One conclusion drawn from Fig. 1 is that we can select a general threshold to separate true matches from false matches by trading off including more true matches against admitting more false matches.

A Novel Approach: Scalar Quantization
Basic idea: map a SIFT feature to a binary signature (a bit vector). The map function is independent of the image collection, and the binary signature keeps the discriminative power of the SIFT feature.
Scalar vs. vector quantization: compared with traditional vector quantization, scalar quantization is simple, fast, and data-independent.
We made a statistical study on 0.4 trillion pairs of SIFT features; the figure plots the average Euclidean (L2) distance at each Hamming distance.

Binary SIFT Signature
General idea: transform a SIFT descriptor, e.g., (0, 25, 8, 2, ..., 14, 5, 2)^T, through a transformation f(x) into a binary signature, e.g., (0, 1, 0, 0, ..., 1, 0, 0)^T. The transformation is expected to be distance-preserving.
A binary signature has two advantages: it is compact to store in memory and efficient for comparison computing.
How to select f(x)? Preferred properties: simple and efficient in implementation; unsupervised, to avoid overfitting to any training dataset; and feature distances well preserved in the Hamming space.

Scalar Quantization (MM '12)
Each dimension is encoded with one bit. Given a feature vector f = (f_1, ..., f_d), quantize it to a bit vector b = (b_1, ..., b_d) with
b_i = 1 if f_i > t, and b_i = 0 otherwise,
where t is the median value of the vector f. The binarization is thus achieved by thresholding at the median.
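The one-bit scheme in a few lines of NumPy (function name is ours):

```python
import numpy as np

def binary_sift_1bit(f):
    """b_i = 1 iff f_i exceeds the median of the descriptor itself."""
    f = np.asarray(f, dtype=float)
    return (f > np.median(f)).astype(np.uint8)
```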

Experimental Observation
Statistical study on 408 billion (about 0.4 trillion) SIFT feature pairs. Figure panels: (a) a typical SIFT descriptor with bins sorted by magnitude; (b) descriptor pair frequency vs. Hamming distance; (c) average L2-distance vs. Hamming distance; (d) average standard deviation vs. Hamming distance.

An Example of Matched Features
Observation: matched descriptors share common patterns in the magnitudes of the 128 bins; the pairwise differences between most bins are similar and stable. Implication: the differences between bin magnitudes and a predefined threshold are stable for most bins.
In a real example of a local descriptor match across two images, the two SIFT descriptors have similar magnitudes in the corresponding bins, with small variations, before quantization; after scalar quantization they differ in only six bins. With a proper threshold, whether the local match is true or false can be decided simply by the exclusive-OR (XOR) operation between the quantized bit vectors.

Distribution of the SIFT Median Value
Distribution over 100 million SIFT features. As shown in Fig. 4, the median value of most SIFT descriptors is relatively small, around 9, while the maximum magnitude in some bins can still exceed 140. This may incur quantization loss, since bins with magnitude above the median are not distinguished from one another. To address this, the same scalar quantization strategy can be applied again to the bins whose magnitude lies above the median. In principle the operation could be performed recursively, but that would cost additional storage. In our implementation we perform scalar quantization twice: first on all 128 elements, and second on the elements with magnitude above the median value.

Generalized Scalar Quantization
Scalar quantization can be generalized to encode each dimension with multiple bits, e.g., 2 bits, yielding per-bin codes (1, 1), (1, 0), or (0, 0) depending on which thresholds the bin exceeds. This trades memory against accuracy. (Figure: a typical SIFT descriptor with bin magnitudes sorted in descending order, with the two thresholds f1 and f2 marked.)

Scalar Quantization
Each dimension is encoded with one bit in the basic scheme; in practice, considering memory and accuracy, we quantize each dimension with 2 bits:
(b_i, b'_i) = (1, 1) if f_i > t2; (1, 0) if t1 < f_i ≤ t2; (0, 0) if f_i ≤ t1,
where t1 is the median of f and t2 is the median of the elements above t1 (found after sorting f in descending order).
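A sketch of the two-pass, 2-bit scheme, with threshold names following the reconstruction above:

```python
import numpy as np

def binary_sift_2bit(f):
    """First bits: f_i > median(f). Second bits: f_i > median of the entries
    above the first threshold. Per-bin codes are (0,0), (1,0), (1,1); a 128-D
    SIFT descriptor yields a 256-bit signature."""
    f = np.asarray(f, dtype=float)
    t1 = np.median(f)
    above = f[f > t1]
    t2 = np.median(above) if above.size else t1
    return np.concatenate([(f > t1), (f > t2)]).astype(np.uint8)
```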

Visual Matching by Scalar Quantization
Given SIFT feature f^(1) from query image Iq and f^(2) from database image Id, perform scalar quantization f^(1) -> b^(1) and f^(2) -> b^(2); then f^(1) matches f^(2) if the Hamming distance d(b^(1), b^(2)) is below a threshold. Real example: 256-bit SIFT binary signatures with threshold = 24 bits. With scalar quantization by Eq. (2), the comparison of SIFT descriptors in L2-distance is captured by the Hamming distance of the corresponding 256-bit vectors.
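Matching then reduces to an XOR plus a popcount (the packing helper is ours; 24 is the threshold quoted on the slide):

```python
import numpy as np

def pack_bits(bits):
    """Pack a 0/1 array (e.g., a 256-bit signature) into one Python integer."""
    return int.from_bytes(np.packbits(bits).tobytes(), "big")

def is_match(sig_q, sig_d, threshold=24):
    """Hamming distance via XOR + popcount on packed signatures."""
    return bin(sig_q ^ sig_d).count("1") < threshold
```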

Binary SIFT Signature
Given a SIFT descriptor, transform it to a bit vector in which each dimension is encoded with k bits, k ≤ log2(d). Example with d = 8: first select the median as the threshold and binarize each bin against it, obtaining an 8-bit vector. Then select the second threshold as the median of the elements with magnitude above the first threshold, obtaining another 8-bit vector. Similarly obtain a third set of 8 bits. Concatenating these 24 bits gives the SIFT signature.
(Speaker note, translated: going from one dimension to multiple bits admits many schemes, and the optimal one is unknown; but the simplest one can be chosen, and experiments show it is also very effective. Backup: explain why the median is the best threshold.)

Outline
Motivation; Our Approach: Scalar Quantization, Index Structure, Code Word Expansion; Experiments; Conclusions; Demo.
Since our target is large-scale image search, we need to explore how to adapt the scalar quantization result to the classic inverted file structure for scalable search.

Indexing with Scalar Quantization
Use inverted file indexing: take the top t bits of the SIFT binary signature as the "code word" and store the remaining bits in the entry of the indexed feature.
Toy example (Figure 4): the scalar quantization result of an indexed feature is the 8-bit vector 1001 0101. The first three bits (100) denote its code word ID, and the remaining 5 bits (10101) are stored in the inverted file list.
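Splitting a packed signature into code word and residual bits (defaults follow the 256 = 32 + 224 layout used later; the toy example uses 3 + 5):

```python
def split_signature(sig, total_bits=256, word_bits=32):
    """Top word_bits become the inverted-file code word; the remaining bits
    are stored in the posting list entry for later Hamming verification."""
    rest = total_bits - word_bits
    return sig >> rest, sig & ((1 << rest) - 1)

# Toy example from the slide: 8-bit signature 1001 0101, 3-bit code word.
assert split_signature(0b10010101, total_bits=8, word_bits=3) == (0b100, 0b10101)
```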

Number of Unique Code Words
With 32-bit code words there are 2^32 possible code words, ideally.

Stop Frequent Code Words
Fig. 8(a) shows the distribution of code word occurrences on a one-million-image database. Of the 46.5 million code words, only the top few thousand have very high frequency; those code words are prevalent in many images and their distinctive power is weak. As suggested by [2], we apply a stop list to ignore code words that occur frequently across the database images. Experiments reveal that a proper stop list does not noticeably affect search accuracy, but it avoids checking many code word lists and yields an efficiency gain. (Figure: frequency of code words among one million images (a) before and (b) after application of the stop list.)

Code Word Expansion
Scalar quantization errors can flip bits inside the code word (e.g., 1001 0101 vs. 1011 0101). If those flipped bits are ignored, many candidate features will be missed, degrading recall. Solution: expand each code word to include the flipped variants, enumerating all possible nearest neighbors within a predefined Hamming distance.

Code Word Expansion: Quantization Error Reduction
As shown in the toy example in the figure, the code word of a new query feature is the bit vector 100 (CW4, in pink). To identify all candidate features, its possible nearest neighbors within Hamming distance d = 1 are obtained by flipping one bit in turn, generating three additional code words (in green): CW0 (000), CW5 (101), and CW6 (110). These are the nearest neighbors of CW4 in Hamming space. Besides CW4's list, the indexed feature lists of the three expanded code words are also considered as candidate true matches, and all features in these expanded lists are further compared on their remaining bit codes. (The slide also illustrates 2-bit flipping.)
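Enumerating the expanded code words within a Hamming radius (names are ours):

```python
from itertools import combinations

def expand_code_word(word, n_bits=32, radius=1):
    """Return `word` plus every code word within the given Hamming distance."""
    out = [word]
    for r in range(1, radius + 1):
        for pos in combinations(range(n_bits), r):
            w = word
            for p in pos:
                w ^= 1 << p                       # flip one selected bit
            out.append(w)
    return out

# Toy example: flipping one bit of 100 yields 101, 110, 000, as in the figure.
assert sorted(expand_code_word(0b100, n_bits=3)) == [0b000, 0b100, 0b101, 0b110]
```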

Analysis on Recall of Valid Features
Each 256-bit binary signature is split into 32 bits for the code word and 224 bits stored in the index list. The recall of valid features is the ratio of features retrieved as candidates (via the expanded code words) to all candidate features.

Popular Quantization Schemes K-means K-d tree LSH Product quantization Scalar quantization Cascaded Scalable Hashing

Cascaded Scalable Hashing (TMM '13)
Pipeline: a SIFT descriptor, e.g., (12, 0, 1, 3, 45, 76, ..., 9, 21, 3, 1, 1, 0), is first projected by PCA. The top principal components PC 1, PC 2, ..., PC c are hashed in a cascade, each hashing stage filtering out more than 80% of the false positives while keeping high recall. The following components PC (c+1) to PC (c+e) are used for binary signature generation, and binary signature verification keeps high precision.

SCH: Problem Formulation
Nearest neighbor (NN) by the distance criterion: ||f^(1) − f^(2)||_2 < ϵ. Relax NN to an approximate nearest neighbor defined dimension-wise: |f_i^(1) − f_i^(2)| < t_i in each of the top dimensions. How do we determine the threshold t_i in each dimension? Let p_i(x) be the probability density function in dimension i, and let r_i be the relative recall rate in dimension i, defined below.

SCH: Problem Formulation (II)
Relative recall rate in dimension i: r_i, the probability (computed from p_i(x)) that a true match survives the threshold t_i in dimension i. Total recall rate by cascading c dimensions: R = r_1 · r_2 · ... · r_c. To ensure high total recall, impose the constraint r_i ≥ r on the recall rate of each dimension; the total recall then satisfies R ≥ r^c.

Hashing in a Single Dimension
Feature matching criteria:
- Distance criterion: a test feature from one image matches a comparison image if the normalized L2-distance to its nearest neighbor in that image is less than ϵ.
- Distance ratio criterion: a test feature matches if the ratio between the distances to its nearest and second-nearest neighbors in the comparison image is less than 0.80.

Hashing in a Single Dimension
Apply scalar quantization/hashing in each of the top dimensions, then cascade the hashing across the c dimensions. The SCH result can be further represented as a single scalar key, so each indexed feature is hashed to only one key. For online query hashing, each query feature also probes the neighboring bucket in every dimension, so it is hashed to at most 2^c keys.
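A plausible sketch of the cascaded key construction, assuming each dimension i is bucketed with width 2·t_i and that a query probes the nearer adjacent bucket per dimension (the paper's exact construction may differ; all names are ours):

```python
import numpy as np

def sch_key(f, thresholds):
    """Hash an indexed feature: one bucket per cascaded dimension, one key."""
    return tuple(int(np.floor(f[i] / (2 * t))) for i, t in enumerate(thresholds))

def sch_query_keys(f, thresholds):
    """A query also probes the neighboring bucket in every dimension,
    giving at most 2^c keys for c cascaded dimensions."""
    keys = [()]
    for i, t in enumerate(thresholds):
        x = f[i] / (2 * t)
        b = int(np.floor(x))
        nb = b + 1 if x - b >= 0.5 else b - 1     # nearer adjacent bucket
        keys = [k + (v,) for k in keys for v in (b, nb)]
    return keys
```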

Binary Signature Generation
For the bins of the PCA-transformed SIFT (PSIFT) after the top c dimensions, generate a binary signature; feature matching is then verified by checking the Hamming distance between binary signatures.