Special Topic on Image Retrieval
Measure Image Similarity by Local Feature Matching
Matching criteria (MATLAB demo):
Distance criterion: given a test feature from one image, the normalized L2-distance to its nearest neighbor in the comparison image is less than ϵ.
Distance ratio criterion: given a test feature from one image, the ratio between its distances to the nearest and second-nearest neighbors in the comparison image is less than 0.80.
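The two matching criteria can be sketched in a few lines of Python. This is only an illustrative sketch, not the MATLAB demo referred to on the slide: the function names, the choice to L2-normalize both descriptors before comparing, and the handling of degenerate cases (fewer than two candidates, zero distances) are assumptions made here.

```python
import math

def normalize(v):
    """Scale a descriptor to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def sorted_distances(query, candidates):
    """Return (distance, index) pairs to all candidates, nearest first."""
    q = normalize(query)
    pairs = []
    for i, c in enumerate(candidates):
        cn = normalize(c)
        pairs.append((math.sqrt(sum((a - b) ** 2 for a, b in zip(q, cn))), i))
    return sorted(pairs)

def match_by_distance(query, candidates, eps):
    """Distance criterion: accept the nearest neighbor if its distance is below eps."""
    pairs = sorted_distances(query, candidates)
    if pairs and pairs[0][0] < eps:
        return pairs[0][1]
    return None

def match_by_ratio(query, candidates, ratio=0.80):
    """Distance ratio criterion: nearest / second-nearest distance below ratio."""
    pairs = sorted_distances(query, candidates)
    if len(pairs) < 2 or pairs[1][0] == 0:
        return None  # assumption: the ratio is undefined here, so report no match
    return pairs[0][1] if pairs[0][0] / pairs[1][0] < ratio else None
```

The ratio criterion rejects a query whose two best candidates are almost equally close, which is the ambiguous case that a fixed ϵ cannot detect.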
SIFT Matching by Threshold The distribution of identified true matches and false matches based on L2-distance thresholds.
Coefficient distributions of the top 20 dimensions in SIFT after PCA
Direct matching: the complexity issue
Assume each image is described by m = 1000 descriptors (dimension d = 128)
N × m = 1 billion descriptors to index
Database representation in RAM: 128 GB with 1 byte per dimension
Search: $m^2 \times N \times d$ elementary operations, i.e., $> 10^{14}$ → computationally not tractable
The quadratic term $m^2$ severely impacts the efficiency
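The figures above can be checked with a short calculation. The sketch below assumes N = 1 million images, which is the value implied by N × m = 1 billion with m = 1000; the function name is hypothetical.

```python
def direct_matching_cost(num_images, m=1000, d=128, bytes_per_dim=1):
    """Back-of-the-envelope cost of exhaustive descriptor matching."""
    num_descriptors = num_images * m               # N * m descriptors in the database
    ram_bytes = num_descriptors * d * bytes_per_dim
    ops_per_query = m * num_descriptors * d        # m query descriptors vs. all: m^2 * N * d
    return num_descriptors, ram_bytes, ops_per_query

n_desc, ram, ops = direct_matching_cost(num_images=1_000_000)
print(n_desc, ram / 1e9, ops)
```

With these inputs the database holds 10^9 descriptors, needs 128 GB, and one query image costs about 1.28 × 10^14 elementary operations.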
Bag of Visual Words 4/22/2017
Text words in Information Retrieval (IR): compactness, descriptiveness. Bag-of-Words model → retrieve.
Example document: "Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected."
Bag of words: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel
Example document: "China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with a 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan."
Bag of words: China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, trade, value
Bag of Visual Words
Could images be represented as a bag of visual words? Image → bag of ‘visual words’?
CBIR based on BoVW
Off-line: Database → Feature Extraction → Vector Quantization (with Codebook Training) → Image Index
On-line: Query → Feature Extraction → Vector Quantization → Index Lookup → Retrieval Results
Bag-of-visual-words: the BOV representation
First introduced for texture classification [Malik’99]
“Video-Google paper”: Sivic and Zisserman, ICCV’2003. Mimic a text retrieval system for image/video retrieval
High retrieval efficiency and excellent recognition performance
Key idea: n local descriptors describing the image → 1 vector; sparse vectors → efficient comparison; inherits the invariance of the local descriptors
Problem: how to generate the visual words?
Bag-of-visual words
The goal: “put the images into words”, namely visual words
Input local descriptors are continuous, so we need to define what a “visual word” is
Done by a quantizer q; q is typically a k-means quantizer, and its set of k centroids is called a “visual dictionary”, of size k
A local descriptor is assigned to its nearest neighbor in the dictionary
Quantization is lossy: we cannot get back to the original descriptor
But much more compact: typically 2-4 bytes/descriptor
Popular Quantization Schemes K-means K-d tree LSH Product quantization Scalar quantization CSH
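K-means Clustering
Given a dataset $\{x_1, \dots, x_N\}$
Goal: partition the dataset into K clusters, denoted by their centers $\{\mu_1, \dots, \mu_K\}$
Formulation: $\min_{r, \mu} J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2$
$r_{nk} \in \{0, 1\}$: assignment of data point $x_n$ to cluster $k$
Solution: fix $\mu_k$ and solve for $r_{nk}$ (assign each point to its nearest center); fix $r_{nk}$ and solve for $\mu_k$: $\mu_k = \sum_n r_{nk} x_n / \sum_n r_{nk}$
Iterate the above two steps until convergence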
General Steps of K-means Algorithm 1. Decide on a value for k. 2. Initialize the k cluster centers (randomly, if necessary). 3. Decide the class memberships of the N objects by assigning them to the nearest cluster center. 4. Re-estimate the k cluster centers, by assuming the memberships found above are correct. 5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
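The five steps above can be sketched as a plain-Python routine. This is a minimal illustration rather than a production implementation: the seeded random initialization, the `max_iter` safety cap, and the choice to keep an empty cluster's old center are assumptions added here.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 2: initialize the k centers with k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign every object to its nearest cluster center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        # Step 5: stop when no object changed membership.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels
```

For a visual vocabulary, `points` would be the local descriptors sampled from the database and the returned centers would serve as the visual words.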
Illustration of the K-means algorithm using the re-scaled Old Faithful data set. Plot of the cost function: the algorithm has converged after the third iteration.
Large Vocabularies with Learned Quantizer Feature matching by quantization: Hierarchical k-means [Nister 06] Approximate k-means [Philbin 07] Based on approximate nearest neighbor search With parallel k-tree
Bag of Visual Words with Vocabulary Tree
Interest point detection and local feature extraction → hierarchical K-means clustering of the feature space into a tree → descend the tree to get the final visual word
Bag of Visual Words
Summarize the entire image based on its distribution (histogram) of visual word occurrences.
[Figure: visual word histograms; frequency vs. visual words in the codebook]
Bag of Visual Words with Vocabulary Tree
Visual word histogram weighting: TF (term frequency) and IDF (inverse document frequency)
n_id: number of occurrences of word i in image d
n_d: total number of words in image d
n_i: number of occurrences of term i in the whole database
N: number of images in the whole database
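With the quantities defined above, the TF-IDF weight of word i in image d is commonly written as t_i = (n_id / n_d) · log(N / n_i), the form used in the Video Google paper. The sketch below applies that formula; the function name and the sparse dictionary output are illustrative assumptions. Note that n_i is counted as total occurrences, following the definition on the slide, whereas many implementations count the number of images containing the word instead.

```python
import math
from collections import Counter

def tfidf_histograms(images):
    """images: list of images, each a list of visual-word ids.
    Returns one sparse {word_id: weight} dictionary per image."""
    N = len(images)
    # n_i: occurrences of word i in the whole database (slide definition).
    n_i = Counter(w for words in images for w in words)
    vectors = []
    for words in images:
        n_d = len(words)          # total number of words in this image
        counts = Counter(words)   # n_id for every word present in the image
        vectors.append({w: (n_id / n_d) * math.log(N / n_i[w])
                        for w, n_id in counts.items()})
    return vectors
```

Words that occur about as often as there are images receive a weight near zero, so frequent, non-distinctive visual words contribute little to the image comparison.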
Bag of Visual Words: Interest point detection and local feature extraction
Bag of Visual Words: Features are clustered to quantize the space into a discrete number of visual words.
Bag of Visual Words: Given a new image, the nearest visual word is identified for each of its features.
Bag of Visual Words: A bag-of-visual-words histogram can be used to summarize the entire image.
Image Indexing and Comparison based on Visual Word Histogram Representation
Image indexing with visual words: represent the collection as a sparse image-by-visual-word matrix and store only the non-empty items
Image distance measurement
KD-Tree (K-dimensional tree) Divide the feature space with binary tree Each non-leaf node corresponds to a cutting plane Feature points in each rectilinear space are similar to each other
KD-Tree Binary tree with each node denoting a rectilinear region Each non-leaf node corresponds to a cutting plane The directions of the cutting planes alternate with depth An Example
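A minimal sketch of the construction described above: each non-leaf node stores a cutting plane, and the cutting direction alternates with depth. Splitting at the median coordinate and the dictionary-based node layout are assumptions made for this illustration, and points that share the split coordinate are not treated specially.

```python
def build_kdtree(points, depth=0, leaf_size=1):
    """Recursively split the points with axis-aligned cutting planes."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": list(points)}
    axis = depth % len(points[0])            # cutting direction alternates with depth
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "leaf": False,
        "axis": axis,
        "split": pts[mid][axis],             # position of the cutting plane
        "left": build_kdtree(pts[:mid], depth + 1, leaf_size),
        "right": build_kdtree(pts[mid:], depth + 1, leaf_size),
    }

def find_leaf(node, query):
    """Descend to the rectilinear cell that contains the query point."""
    while not node["leaf"]:
        side = "left" if query[node["axis"]] < node["split"] else "right"
        node = node[side]
    return node["points"]

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
```

Used as a quantizer, the leaf reached by `find_leaf` plays the role of the visual word for the query descriptor.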
Locality-Sensitive Hashing [Indyk and Motwani 1998] [Datar et al. 2004]
Index by compact code (e.g., 101), generated by random hash functions
Collision probability of ε-approximate neighbors: the hash code collision probability is proportional to the original similarity
l: number of hash tables; K: hash bits per table
Courtesy: Shih-Fu Chang, 2012
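The roles of l and K can be illustrated with a small sketch. It uses random-hyperplane (sign) hashing, which is a swapped-in LSH family chosen for brevity and is not the p-stable scheme of Datar et al. cited above; the class and parameter names are hypothetical.

```python
import random
from collections import defaultdict

class SignLSH:
    """l hash tables, each keyed by K sign bits of random projections."""

    def __init__(self, dim, bits_per_table=8, num_tables=4, seed=0):
        rng = random.Random(seed)
        # One set of K random hyperplanes per table.
        self.planes = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(bits_per_table)]
                       for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, t, v):
        # K-bit compact code: the side of each hyperplane the vector falls on.
        return tuple(1 if sum(w * x for w, x in zip(plane, v)) >= 0 else 0
                     for plane in self.planes[t])

    def add(self, item_id, v):
        for t, table in enumerate(self.tables):
            table[self._key(t, v)].append(item_id)

    def query(self, v):
        """Union of the buckets the query collides with across all tables."""
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, v), ()))
        return candidates
```

Increasing K makes each bucket more selective (fewer false candidates), while increasing l gives a true neighbor more chances to collide with the query in at least one table.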
Product Quantization (TPAMI’12)
Basic idea: partition the feature vector into m sub-vectors; quantize the sub-vectors with m distinct quantizers
Advantage: low complexity in quantization; extremely fine quantization of the feature space
Vector distance approximation: look up a K × K distance table (d11, d12, …, d1K; …; dK1, …, dKK) for each sub-quantizer
Product Quantization
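The basic idea and the distance table can be sketched as follows. The sub-codebooks here are tiny hand-written placeholders (in practice they would be learned, for example by k-means on each sub-space), the function names are hypothetical, and the vector length is assumed to be divisible by m.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(v, m):
    """Partition a vector into m equal-length sub-vectors."""
    step = len(v) // m
    return [v[j * step:(j + 1) * step] for j in range(m)]

def pq_encode(v, codebooks):
    """Quantize each sub-vector with its own quantizer; return m centroid ids."""
    subs = split(v, len(codebooks))
    return [min(range(len(cb)), key=lambda a: sq_dist(sub, cb[a]))
            for sub, cb in zip(subs, codebooks)]

def build_tables(codebooks):
    """One K x K table of centroid-to-centroid distances per sub-quantizer."""
    return [[[sq_dist(ca, cb) for cb in book] for ca in book]
            for book in codebooks]

def pq_distance(code_a, code_b, tables):
    """Approximate squared distance between two encoded vectors by table lookup."""
    return sum(t[a][b] for t, a, b in zip(tables, code_a, code_b))

# Hypothetical toy setting: d = 4, m = 2 sub-quantizers, K = 2 centroids each.
codebooks = [[[0, 0], [1, 1]], [[0, 0], [2, 2]]]
tables = build_tables(codebooks)
code_x = pq_encode([0.1, 0.0, 1.9, 2.2], codebooks)
code_y = pq_encode([0.9, 1.2, 0.1, -0.1], codebooks)
```

With m sub-quantizers of K centroids each, the product codebook implicitly contains K^m cells while only m · K centroids are stored, which is where the fine quantization at low cost comes from.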
Scalar Quantization: SIFT distance distribution of true matches and false matches from an experimental study
From Fig. 1, we can observe that as the distance threshold increases, the number of identified true matches first increases steadily and then becomes stable. On the other hand, the number of identified false matches first stays relatively stable and then increases more and more sharply once the threshold exceeds 0.5. One conclusion drawn from Fig. 1 is that we can select a general threshold to distinguish true matches from false matches by trading off including more true matches against excluding more false matches.
Novel Approach: Scalar Quantization
Basic idea, scalar vs. vector quantization: simple, fast, data-independent
Map a SIFT feature to a binary signature (bits); the map function is independent of the image collection; the binary signature keeps the discriminative power of the SIFT feature
We propose a novel quantization approach, scalar quantization, for SIFT features. Compared with traditional vector quantization, scalar quantization is simple and fast. Remarkably, it is data-independent. The basic idea is to map a SIFT feature to a binary signature that keeps the discriminative power of the SIFT feature. We have made a statistical study on 0.4 trillion pairs of SIFT features; the figure shows the average Euclidean (L2) distance for each Hamming distance.
Binary SIFT Signature
General idea: a distance-preserving transformation f(x) from the SIFT descriptor (0, 25, 8, 2, . . ., 14, 5, 2)^T to the binary signature (0, 1, 0, 0, . . ., 1, 0, 0)^T
Given a SIFT descriptor vector, we transform it to a binary signature. The transformation is expected to be distance-preserving. A binary signature has two advantages: it is compact to store in memory, and it is efficient to compare.
How to select f(x)? Preferred properties: first, it should be simple and efficient to implement; second, it should be unsupervised, to avoid overfitting to any training dataset; third, feature distances should be well preserved in the Hamming space.
Scalar Quantization (MM’12)
Each dimension is encoded with one bit
Given a feature vector $f = (f_1, f_2, \dots, f_d)^T$, quantize it to a bit vector $b = (b_1, b_2, \dots, b_d)^T$:
$b_i = 1$ if $f_i > \hat{f}$, and $b_i = 0$ otherwise, for $i = 1, \dots, d$
$\hat{f}$: the median value of vector $f$
The basic formulation of our scalar quantization is as follows. Each dimension of the SIFT descriptor is encoded with one bit. The binarization is achieved by thresholding, and the threshold is selected as the median value.
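The one-bit formulation can be written in a couple of lines. This sketch assumes the strict comparison "greater than the median" for setting a bit; the function name is hypothetical.

```python
import statistics

def scalar_quantize(f):
    """Encode each dimension with one bit by thresholding at the vector's median."""
    threshold = statistics.median(f)
    return [1 if x > threshold else 0 for x in f]

# Toy 8-dimensional descriptor (a real SIFT descriptor has 128 dimensions).
bits = scalar_quantize([0, 25, 8, 2, 14, 5, 2, 30])
```

Because the threshold is computed from the descriptor itself, no training data or codebook is needed, which is what makes the scheme data-independent.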
Experimental Observation
Statistical study on 408 billion (about 0.4 trillion) SIFT feature pairs.
[Figure: (a) a typical SIFT descriptor with bins sorted by magnitude in each dimension; (b) descriptor pair frequency vs. Hamming distance; (c) the average L2-distance vs. Hamming distance; (d) the average standard deviation vs. Hamming distance.]
An Example of Matched Features
Observation: matched descriptors share some common patterns in magnitude over the 128 bins, e.g., the pair-wise differences between most bins are similar and stable.
Implication: the differences between the bin magnitudes and a predefined threshold are stable for most bins.
Here is a real example of a local descriptor match across two images with scalar quantization. From the figure on the right, it can be observed that, before quantization, the two SIFT descriptors have similar magnitudes in corresponding bins, with some small variations. After scalar quantization, they differ from each other in six bins. With a proper threshold, whether the local match is true or false can be determined simply by the exclusive-OR (XOR) operation between the quantized bit vectors.
Distribution of SIFT Median Value (distribution on 100 million SIFT features)
As shown in Fig. 4, the median value of most SIFT descriptors is relatively small, around 9, but the maximum magnitude in some bins can still reach more than 140. This may incur quantization loss, since bins with magnitude above the median are not well distinguished. To address this issue, the same scalar quantization strategy can be applied again to the bins with magnitude above the median. Intuitively, this operation can be performed recursively; however, each additional level adds storage cost. In our implementation, we perform the scalar quantization only twice: first on all 128 elements, and then on the elements with magnitude above the median value.
Scalar Quantization: Generalized Scalar Quantization
Encode each dimension with multiple bits, e.g., 2 bits: (1, 1), (1, 0), (0, 0), separated by the thresholds f1 and f2
Trade-off between memory and accuracy
[Figure: a typical SIFT descriptor with bin magnitude sorted in descending order]
We can generalize scalar quantization to encode each dimension of the SIFT descriptor with multiple bits. As shown in Fig. 3, the median value of most SIFT descriptors is relatively small, around 10, but the maximum magnitude in some bins can still reach more than 140. This may incur quantization loss, since bins with magnitude above the median are not well distinguished. To address this issue, the same scalar quantization strategy can be applied again to the bins with magnitude above the median. Intuitively, this operation can be performed recursively; however, each additional level adds storage cost. In our implementation, we perform the scalar quantization only twice: first on all 128 elements, and then on the elements with magnitude above the median value.
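Scalar Quantization
Each dimension is encoded with one bit; in practice, considering memory and accuracy, we quantize each dimension with 2 bits:
$(b_i, b_{i+128}) = (1, 1)$ if $f_i > \hat{f}_2$; $(1, 0)$ if $\hat{f}_1 < f_i \le \hat{f}_2$; $(0, 0)$ if $f_i \le \hat{f}_1$
$\hat{f}_1 = (g_{64} + g_{65}) / 2$, $\hat{f}_2 = (g_{32} + g_{33}) / 2$
where $g = (g_1, \dots, g_{128})$ is sorted in descending order from $f$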
Visual Matching by Scalar Quantization
Given SIFT $f^{(1)}$ from image $I_q$ and $f^{(2)}$ from image $I_d$
Perform scalar quantization: $f^{(1)} \to b^{(1)}$; $f^{(2)} \to b^{(2)}$
$f^{(1)}$ matches $f^{(2)}$ if the Hamming distance $d(b^{(1)}, b^{(2)}) <$ Threshold
Real example: 256-bit SIFT binary signature, Threshold = 24 bits
With scalar quantization by Eq. (2), the comparison of SIFT descriptors in L2-distance is captured by the Hamming distance of the corresponding 256-bit vectors.
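The matching rule reduces to an XOR followed by a bit count. The sketch below packs each bit vector into a Python integer; the function names are hypothetical, and the default threshold of 24 is the value quoted in the real example above.

```python
def pack(bits):
    """Pack a list of 0/1 values into one integer, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def hamming(sig_a, sig_b):
    """Hamming distance: XOR the signatures and count the set bits."""
    return bin(sig_a ^ sig_b).count("1")

def is_match(sig_a, sig_b, threshold=24):
    """Two features match if their signatures differ in fewer than threshold bits."""
    return hamming(sig_a, sig_b) < threshold

a = pack([1, 0, 0, 1, 0, 1, 0, 1])
b = pack([1, 0, 1, 1, 0, 1, 0, 1])
print(hamming(a, b))
```

On real 256-bit signatures the same XOR-and-count is typically done on a handful of machine words, which is why the comparison is so much cheaper than a 128-dimensional L2-distance.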
Binary SIFT Signature
Given a SIFT descriptor, transform it to a bit vector
Each dimension is encoded with k bits, $k \le \log_2 d$
Example: d = 8
(Note: there are many possible schemes for mapping one dimension to multiple bits, and the optimal one is unknown. We can, however, pick the simplest one, which experiments show is also very effective. Backup: explain why the median is the best choice.)
We propose to generate the binary signature as follows. Given a SIFT descriptor, we transform it to a bit vector. Each dimension of the SIFT descriptor is encoded with k bits. For instance, let d = 8. We first select the median as the threshold and binarize each bin to one bit, obtaining an 8-bit vector. Next, we select the second threshold as the median of those elements with magnitude above the first threshold, and obtain another 8-bit vector. Similarly, we can obtain a third 8-bit vector. We concatenate these 24 bits as the SIFT signature.
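The recursive procedure in the d = 8 example can be sketched as below. The function name is hypothetical, and two details are assumptions of this sketch: bits are set with a strict "greater than" test, and an all-zero level is emitted if no elements remain above the previous threshold.

```python
import statistics

def binary_signature(f, k=2):
    """Concatenate k bit vectors; level j thresholds at the median of the
    elements that exceeded the threshold of level j - 1."""
    bits = []
    pool = list(f)  # elements still above the previous threshold
    for _ in range(k):
        if not pool:
            bits.extend([0] * len(f))
            continue
        threshold = statistics.median(pool)
        bits.extend(1 if x > threshold else 0 for x in f)
        pool = [x for x in pool if x > threshold]
    return bits

# d = 8, k = 3: three 8-bit vectors concatenated into a 24-bit signature.
signature = binary_signature([3, 9, 1, 7, 12, 5, 20, 2], k=3)
```

With d = 128 and k = 2 the same routine yields the 256-bit signature used for matching.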
Outline
Motivation
Our Approach: Scalar Quantization, Index structure, Code Word Expansion
Experiments
Conclusions
Demo
Since our target is large-scale image search, we need to explore how to adapt the scalar quantization result to the classic inverted file structure for scalable image search.
Indexing with Scalar Quantization
Use inverted file indexing: take the top t bits of the SIFT binary signature as the “code word”, and store the remaining bits in the entry of the indexed feature
Figure 4. A toy example of an image feature indexed with the inverted file structure. The scalar quantization result of the indexed feature is an 8-bit vector (1001 0101). The first three bits denote its code word ID (100), and the remaining 5 bits (10101) are stored in the inverted file list.
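The toy example can be reproduced with a dictionary-based inverted file. This is a sketch under stated assumptions: the function names, the `(image_id, remaining_bits)` entry layout, and the strict `threshold` test on the remaining bits are illustrative choices, and t is assumed to be at least 1.

```python
from collections import defaultdict

def split_signature(bits, t):
    """Top t bits become the code word id; the remaining bits go into the entry."""
    code_word = int("".join(str(b) for b in bits[:t]), 2)
    return code_word, tuple(bits[t:])

def add_feature(index, image_id, bits, t):
    code_word, rest = split_signature(bits, t)
    index[code_word].append((image_id, rest))

def lookup(index, bits, t, threshold):
    """Scan the list of the query's code word and verify on the remaining bits."""
    code_word, rest = split_signature(bits, t)
    hits = []
    for image_id, stored in index.get(code_word, []):
        dist = sum(a != b for a, b in zip(rest, stored))
        if dist < threshold:
            hits.append((image_id, dist))
    return hits

index = defaultdict(list)
add_feature(index, "img_A", [1, 0, 0, 1, 0, 1, 0, 1], t=3)   # code word 100
```

A query whose top t bits differ from the indexed feature in even one position lands in a different list and finds nothing, which is the recall problem that code word expansion addresses.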
Unique “Code Word” Number: with a 32-bit code word, there are ideally $2^{32}$ code words in total
Stop Frequent Code Word
Fig. 8(a) shows the distribution of code word occurrence on a one-million-image database. It can be observed that, of the 46.5 million code words, only the top few thousand have very high frequency. Those code words are prevalent in many images, and their distinctive power is weak. As suggested by [2], we apply a stop-list to ignore the code words that occur frequently in the database images. Experiments reveal that a proper stop-list may not affect the search accuracy, while it avoids checking many code word lists and thus improves efficiency.
Figure: Frequency of code words among one million images (a) before, and (b) after, application of a stop list.
Code Word Expansion
Scalar quantization error: flipped bits exist in the code word, e.g., 1 0 0 1 0 1 0 1 vs. 1 0 1 1 0 1 0 1
If those flipped bits are ignored, many candidate features will be missed, which degrades recall
Solution: expand the code word to include flipped code words, i.e., enumerate all possible nearest neighbors within a predefined Hamming distance
Code Word Expansion: Quantization Error Reduction (2-bit flipping)
As shown in the toy example in the figure, the code word of a new query feature is the bit vector 100, i.e., CW4 (in pink). To identify all candidate features, its possible nearest neighbors (e.g., at Hamming distance d = 1) are obtained by flipping one bit in turn, which generates three additional code words (in green): CW0 (000), CW5 (101) and CW6 (110). These code words are the nearest neighbors of CW4 in the Hamming space. Then, besides CW4, the indexed feature lists of these three expanded code words are also considered as candidate true matches, and all features in these expanded lists are further compared on their remaining bits.
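The enumeration of neighboring code words can be sketched with bit flips over all position subsets. The function name and the integer representation of code words are assumptions made for this illustration.

```python
from itertools import combinations

def expand_code_word(code_word, t, max_dist=1):
    """All t-bit code words within Hamming distance max_dist of code_word,
    starting with the code word itself."""
    expanded = [code_word]
    for r in range(1, max_dist + 1):
        for positions in combinations(range(t), r):
            flipped = code_word
            for p in positions:
                flipped ^= 1 << p   # flip bit p
            expanded.append(flipped)
    return expanded

# Toy example from the text: CW4 = 100 expands to CW0, CW5 and CW6.
print(sorted(expand_code_word(0b100, t=3, max_dist=1)))
```

The number of lists to visit grows quickly with the flipping radius: for a 32-bit code word, distance 1 gives 1 + 32 = 33 code words and distance 2 gives 1 + 32 + 496 = 529.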
Analysis on Recall of Valid Features
256-bit signature 01001…001010…1010111…110101…110001…101110: 32 bits for the code word, 224 bits stored in the index list
Retrieved features as candidates:
All candidate features:
Popular Quantization Schemes K-means K-d tree LSH Product quantization Scalar quantization Cascaded Scalable Hashing
Cascaded Scalable Hashing (TMM’13)
SIFT: (12, 0, 1, 3, 45, 76, ……, 9, 21, 3, 1, 1, 0) → PCA
Keep high recall: principal components PC 1, PC 2, …, PC c are hashed in cascade, one hashing stage per component, each stage annotated with “>80% false positive”
Keep high precision: principal components PC (c+1 ~ c+e) → binary signature generation → binary signature verification
SCH: Problem Formulation Nearest Neighbor (NN) by distance criterion Relax NN to approximate nearest neighbor: How to determine the threshold ti in each dimension? pi(x): probability density function in dimension i ri: relative recall rate in dimension i, as
SCH: Problem Formulation (II) Relative recall rate in dimension i: Total recall rate by cascading c dimensions: To ensure high total recall, impose a constraint on the recall rate of each dimension: Then total recall:
Hashing in Single Dimension Feature matching criteria: Distance criterion: Given a test feature from one image, the normalized L2-distance from the nearest neighbor of a comparison image is less than ϵ. Distance ratio criterion: Given a test feature from one image, the distance ratio between the nearest and second nearest neighbors of a comparison image is less than 0.80.
Hashing in Single Dimension
Scalar quantization/hashing in each dimension:
Cascaded hashing across c dimensions:
The SCH result can be further represented as a scalar key
Each indexed feature is hashed to only one key
Online query hashing: each query feature is hashed to at most $2^c$ keys
Binary Signature Generation: for those bins in PSIFT after the top c dimensions. Feature matching verification: check the Hamming distance between binary signatures