
1 1 Large Scale Similarity Learning and Indexing Part II: Learning to Hash for Large Scale Search Fei Wang and Jun Wang IBM TJ Watson Research Center

2 2 Outline  Background  Approximate nearest neighbor search  Tree and hashing for data indexing  Locality sensitive hashing  Learning to Hash:  Unsupervised hashing  Supervised hashing  Semi-supervised hashing (pointwise/pairwise/listwise)  Large Scale Active Learning with Hashing  Hyperplane hashing  Fast query selection with hashing  Summary and Discussion

3 3 Motivation  Similarity based search has been popular in many applications –Image/video search and retrieval: finding most similar images/videos –Audio search: find similar songs –Product search: find shoes with similar style but different color –Patient search: find patients with similar diagnostic status  Two key components: –Similarity/distance measure –Indexing scheme Whittlesearch (Kovashka et al. 2013)

4 4 Nearest Neighbor Search (NNS)  Given a set of points X in a metric space and a query point q, find the point in X closest to q  ε-nearest neighbor and k-nearest neighbor search are common variants  Nearest neighbor search is a fundamental problem in many topics, including computational geometry, information retrieval, machine learning, data mining, computer vision, and so on  Time complexity: linear in the size of the data  Also need to load the entire dataset into memory Example: 1 billion images with 10K-dim BOW features; linear scan takes ~15 hrs; storage for such a dataset is ~40 TB
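
To make the linear-scan cost concrete, here is a minimal NumPy sketch of exact nearest-neighbor search by brute force; the array sizes are toy values, not the billion-image setting from the slide.

```python
import numpy as np

def linear_scan_nn(X, q, k=1):
    """Brute-force k-NN: O(n * d) per query, and the whole dataset
    must fit in memory -- the bottleneck this tutorial addresses."""
    dists = np.linalg.norm(X - q, axis=1)      # Euclidean distance to every point
    return np.argsort(dists)[:k]               # indices of the k closest points

# Toy usage (hypothetical sizes; the slide's example is 1 billion 10K-dim vectors)
X = np.random.randn(10000, 128)
q = np.random.randn(128)
print(linear_scan_nn(X, q, k=5))
```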

5 5 Approximate Nearest Neighbor  Instead of finding the exact nearest neighbors, return approximate nearest neighbors (Indyk, 03)  ANNs are reasonably good for many applications  Retrieving ANNs can be much faster (with sublinear complexity)  Trees and hashing are two popular indexing schemes for fast ANN search

6 6 Tree-Based ANN Search  Recursively partition the data: divide and conquer  Search complexity is O(log n) (worst case can be O(n))  Inefficient for high-dimensional data  Requires significant memory cost (illustration: a KD-tree partition of the space and the corresponding tree)
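
As a hedged illustration of tree-based indexing (not the slide's own code), the sketch below uses scikit-learn's KDTree on low-dimensional toy data, where such trees work well.

```python
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.randn(10000, 8)        # toy low-dimensional data: trees work well here
tree = KDTree(X, leaf_size=40)       # recursive axis-aligned partitioning of the data

q = np.random.randn(1, 8)            # a single query point
dist, ind = tree.query(q, k=5)       # roughly O(log n) per query in low dimensions
print(ind, dist)
```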

7 7 Various Tree-Based Methods  Different ways to construct the tree structure: KD-tree, Ball-tree, PCA-tree (splits along the 1st/2nd principal components), Random Projection-tree

8 8 Hashing-Based ANN Search  Repeatedly partition the data  Each item in the database is represented as a hash code  Significantly reduced storage cost: for 1 billion images, 40 TB of features -> 8 GB of hash codes  Search complexity: constant time or sublinear (linear scan over 1 billion images: 15 hrs -> 13 sec) (Illustration: items x1…x5 mapped by hash functions h1…hk to binary codes such as 01101 and 10101)
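
The storage figures follow from simple arithmetic; the snippet below reproduces them, assuming float32 features and 64-bit codes (the exact code length is not stated on the slide).

```python
# Back-of-the-envelope numbers behind the slide (assuming float32 features
# and 64-bit hash codes).
n = 1_000_000_000                     # 1 billion images
feature_bytes = 10_000 * 4            # 10K-dim BOW, 4 bytes per float32
code_bytes = 64 // 8                  # 64-bit binary code

print(n * feature_bytes / 1e12, "TB of raw features")     # ~40 TB
print(n * code_bytes / 1e9, "GB of hash codes")           # 8 GB
```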

9 9 Hashing: Training Step  Design models (hash functions) for computing hash codes, e.g., a general linear-projection-based hash function family: h_k(x) = sign(w_k^T x + b_k)  Estimate the model parameters (the projections w_k and the thresholds b_k)
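
A minimal sketch of one member of this linear-projection family, assuming random Gaussian projections with mean thresholds; learning-based methods fit W and b differently, e.g., from labels or data statistics.

```python
import numpy as np

def train_linear_hash(X, n_bits, seed=0):
    """One simple instance of the linear-projection hash family
    h_k(x) = sign(w_k^T x + b_k): random Gaussian projections with
    data-mean thresholds (LSH-style; learned methods fit W differently)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_bits))    # projection directions w_k
    b = -(X.mean(axis=0) @ W)               # thresholds b_k (here: the data mean)
    return W, b

def hash_codes(X, W, b):
    return (X @ W + b > 0).astype(np.uint8)   # sign(...) mapped to {0, 1}
```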

10 10 Hashing: Indexing Step  Compute the hash codes for each database item  Organize all the codes in hash tables for inverse look-up: each hash bucket stores the ids of the database items whose codes fall into it
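
A hedged sketch of the inverse look-up structure, using a plain Python dictionary keyed by the integer value of each code; real systems use more compact, multi-table layouts.

```python
from collections import defaultdict

def build_hash_table(codes):
    """Inverse look-up: map each K-bit code (as an int key) to the list of
    database item ids stored in that bucket."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        key = int("".join(map(str, code)), 2)   # pack the bit vector into an int
        table[key].append(item_id)
    return table

# codes = hash_codes(X, W, b)   # from the previous sketch
# table = build_hash_table(codes)
```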

11 11 Hashing: Query Step (Hash Lookup)  Compute the hash code for the query point  Return the points within a small Hamming radius r of the query code  The number of hash codes within Hamming radius r of a K-bit code is sum_{i=0}^{r} C(K, i)
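
A sketch of radius-r hash lookup against the table built above: probe every bucket obtained by flipping at most r bits of the query code; the number of probed buckets matches the binomial sum on the slide.

```python
from itertools import combinations
from math import comb

def lookup(table, query_code_int, n_bits, radius=2):
    """Probe every bucket whose code is within the given Hamming radius of
    the query code; the number of probed buckets is sum_{i<=r} C(K, i)."""
    results = []
    for r in range(radius + 1):
        for flip in combinations(range(n_bits), r):
            key = query_code_int
            for bit in flip:
                key ^= 1 << bit               # flip r bits of the query code
            results.extend(table.get(key, []))
    return results

n_bits, radius = 32, 2
print(sum(comb(n_bits, i) for i in range(radius + 1)))   # 1 + 32 + 496 = 529 buckets
```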

12 12 Hashing: Query Step (Hamming Ranking)  Hamming distance: the number of differing bits between two hash codes  Rank the database items by their Hamming distance to the query's hash code to generate the ranked list
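
A minimal Hamming-ranking sketch over 0/1 code arrays; production systems pack bits into integers and use XOR plus popcount, but the resulting ranking is the same.

```python
import numpy as np

def hamming_rank(query_code, codes):
    """Rank all database items by Hamming distance (number of differing bits)
    to the query's code."""
    dists = np.count_nonzero(codes != query_code, axis=1)   # per-item Hamming distance
    order = np.argsort(dists, kind="stable")                # ranked list of item ids
    return order, dists[order]
```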

13 13 A Conceptual Diagram for a Hashing-Based Image Search System (Diagram blocks: Image Database, Hash Function Design, Indexing and Search, Similarity Search & Retrieval, Reranking/Refinement, Visual Search Applications) Designing compact yet accurate hashing codes is a critical component in making the search effective

14 14 Locality Sensitive Hashing (LSH)  Random hash functions map database items and the query to binary codes (e.g., 101), so that similar items fall into the same bucket with high probability (Illustration: database items and a query hashed bit-by-bit to 0/1 by a random hash function)

15 15  A single hash bit splits the data with a random hyperplane; a hash table concatenates bits up to the code length  Collision probability of two points x and y: Pr[h(x) = h(y)] = 1 - θ(x, y)/π, where θ(x, y) is the angle between them  High dot product (small angle): unlikely to be split; lower dot product (larger angle): likely to be split
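
The sketch below checks this collision probability empirically for random-hyperplane LSH (the sign of a Gaussian projection); the per-bit collision rate should approach 1 - θ/π.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 10000
x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)          # a point correlated with x

W = rng.standard_normal((n_bits, d))          # one random hyperplane per bit
hx, hy = (W @ x > 0), (W @ y > 0)

theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print("empirical collision rate:", (hx == hy).mean())
print("1 - theta/pi            :", 1 - theta / np.pi)
```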

16 16 Outline  Background  Approximate nearest neighbor search  Tree and hashing for data indexing  Locality sensitive hashing  Learning to Hash:  Unsupervised hashing  Supervised hashing  Semi-supervised hashing (pointwise/pairwise/listwise)  Large Scale Active Learning with Hashing  Hyperplane hashing  Fast query selection with hashing  Summary and Discussion

17 17 Overview: Learning-Based Hashing Techniques  Unsupervised: only use the property of unlabeled data (data-dependent)  Spectral hashing (SH, Weiss et al. NIPS 2008)  Kernelized methods (Kulis et al. ICCV 2009)  Graph hashing (Liu et al. ICML 2011)  Isotropic hashing (Kong et al. NIPS 2012)  Angular quantization hashing (Gong et al. NIPS 2012)  Supervised: use labeled data (task dependent)  Deep learning based (Torralba, CVPR 2008)  Binary reconstructive embedding (Kulis et al. NIPS 2009)  Supervised kernel method (Liu et al. CVPR 2012)  Minimal Loss Hashing (Norouzi & Fleet ICML 2011)  Semi-Supervised: use both labeled and unlabeled data  Metric learning based (Jain et al. CVPR 2008)  Semi-supervised hashing (Wang et al. CVPR 2010, PAMI 2012)  Sequential hashing (Wang et al. ICML 2011)

18 18 Overview: Advanced (Other) Hashing Techniques  Triplet and Listwise Hashing  Hamming metric learning based hashing (Norouzi et al. NIPS 2012)  Order preserving hashing (Wang et al. ACM MM 2013)  Column generation hashing (Li et al. ICML 2013)  Ranking supervised hashing (Wang et al. ICCV 2013)  Hyperplane Hashing and Active Learning  Angle & embedding hyperplane hashing (Jain et al. NIPS 2010)  Bilinear hashing (Liu et al. ICML 2012)  Fast pairwise query selection (Qian et al. ICDM 2013)  Hashing for complex data sources  Heterogeneous hashing (Ou et al. KDD 2013)  Structured hashing (Ye et al. ICCV 2013)  Multiple feature hashing (Song et al. ACM MM 2011)  Composite hashing (Zhang et al. ACM SIGIR 2011)  Submodular hashing (Cao et al. ACM MM 2012)

19 19 Unsupervised: PCA Hashing  Partition the data along the PCA directions  Projections with high variance are more reliable  Low-variance projections are very noisy
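
A hedged sketch of PCA hashing: project the centered data onto the top principal directions (highest variance first) and threshold at zero.

```python
import numpy as np

def pca_hash(X, n_bits):
    """PCA hashing sketch: project onto the top principal directions and
    threshold at zero (the data mean). High-variance directions come first;
    the remaining low-variance directions are the noisy, unreliable ones."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_bits].T                         # top-n_bits PCA directions
    return (Xc @ W > 0).astype(np.uint8), W
```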

20 20 Unsupervised: Spectral Hashing (Weiss et al. 2008)  Partition the data along the PCA directions  Essentially a balanced minimum-cut problem (NP-hard even for a single-bit partition)  Approximation through spectral relaxation (with a uniform-distribution assumption), subject to balancing and orthogonality constraints

21 21 Unsupervised: Spectral Hashing  Illustration of spectral hashing  Main steps:  1) extract projections by performing PCA on the data,  2) projection selection (prefer projections with large spread and small spatial frequency),  3) generate hash codes by thresholding a sinusoidal function
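
A deliberately simplified sketch of these three steps, assuming data roughly uniform along each PCA direction and keeping only the lowest-frequency sinusoid per direction; the paper instead ranks candidate eigenfunctions by their analytical eigenvalues before thresholding.

```python
import numpy as np

def spectral_hash_sketch(X, n_bits):
    # 1) Extract projections via PCA
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:n_bits].T                        # top directions = large spread

    # 2) Projection "selection": here we simply keep the lowest spatial
    #    frequency (mode 1) for each of the top n_bits directions.
    lo, hi = P.min(axis=0), P.max(axis=0)

    # 3) Generate bits by thresholding a sinusoidal eigenfunction
    phase = np.pi * (P - lo) / (hi - lo + 1e-12)  # phase in [0, pi] per direction
    return (np.sin(np.pi / 2 + phase) > 0).astype(np.uint8)
```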

22 22 Unsupervised: Graph Hashing (Liu et al. 2011)  A graph is capable of capturing complex nonlinear structure  The same objective as spectral hashing

23 23 Unsupervised: Graph Hashing  The same objective as spectral hashing, but a different solution: full graph construction and eigen-decomposition are not scalable

24 24 Unsupervised: Angular Quantization (Gong et al. 2012)  Data-independent angular quantization: the binary code of a data point is the nearest binary vertex of the hypercube  Data-dependent angular quantization

25 25 From Unsupervised to Supervised Hashing  Existing hashing methods mostly rely on random or principal projections  Not compact  Insufficient accuracy  Simple metrics and features are usually not enough to express semantic similarity – the semantic gap  Goal: learn effective binary hash functions by incorporating supervised information Five categories of objects from Caltech 101, 40 images per category

26 26 Binary Reconstructive Embedding (Kulis & Darrell 2009)  Kernelized hashing function  Euclidean distance in the input space and the binary (Hamming) distance between codes  Objective: minimize the difference between these two distance measures  Can be a supervised method if a semantic distance/similarity is used

27 27 RBMs Based Binary Coding (Torralba et al. 2008)  Restricted Boltzmann Machine (RBM): an energy-based model with weight and offset parameters, trained to maximize the expected log probability of the data (Hinton & Salakhutdinov, 2006)  Stacking RBMs into multiple layers – a deep network (512-512-256-N)  The training process has two stages: unsupervised pre-training and supervised fine-tuning

28 28 Supervised Hashing with Kernels (Liu et al. 2012)  Pairwise similarity  Code inner product approximates pairwise similarity
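
As a sketch of the code-inner-product idea (my reading of Liu et al. 2012, not their exact formulation): with codes in {-1, +1}^r, the scaled inner product b_i·b_j / r lies in [-1, 1] and is fit to the pairwise label S_ij in {-1, +1}.

```python
import numpy as np

def ksh_objective(B, S):
    """Code-inner-product objective sketch: B is n x r with entries in {-1, +1},
    S is the n x n pairwise label matrix with entries in {-1, +1}.
    (1/r) * B B^T should approximate S."""
    r = B.shape[1]
    return np.linalg.norm(B @ B.T / r - S, ord="fro") ** 2
```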

29 29 Metric Learning Based Hashing (Jain et al. 2008)  Given a learned metric M = G^T G, the generalized distance is d_M(x, y) = (x - y)^T M (x - y)  Generalized similarity measure: sim_M(x, y) = x^T M y  Parameterized hash function: h(x) = sign(r^T G x), with r a random Gaussian vector  Collision probability: the random-hyperplane LSH guarantee, with the angle measured under the learned metric

30 30 Semi-Supervised Hashing (Wang et al. 2010)  Different ways to preserve pairwise relationships: neighbor pairs should get the same bit, non-neighbor pairs different bits  Besides minimizing the empirical loss on labeled data, maximize the entropy (variance) of each bit over all data – the maximum entropy principle

31 31 Hamming Metric Learning (Norouzi et al. 2012)  Supervised information as triplets (query, similar item, dissimilar item)  Triplet ranking loss in the Hamming space  Objective: minimize the regularized ranking loss  Optimization: stochastic gradient descent
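
A minimal sketch of a triplet ranking loss in the Hamming space; note that the method optimizes a continuous upper bound with SGD, since the Hamming distance itself is not differentiable.

```python
import numpy as np

def triplet_ranking_loss(h_q, h_pos, h_neg, margin=1):
    """Triplet ranking loss sketch: the similar item should be at least
    `margin` bits closer to the query than the dissimilar item."""
    d_pos = np.count_nonzero(h_q != h_pos)   # Hamming distance to the similar item
    d_neg = np.count_nonzero(h_q != h_neg)   # Hamming distance to the dissimilar item
    return max(0, d_pos - d_neg + margin)
```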

32 32 Column Generation Hashing (Li et al. 2013)  Learn hash functions and the weights of the hash bits (i.e., a weighted Hamming distance)  Large-margin formulation  Column generation to iteratively learn hash bits and update the bit weights

33 33 Ranking Supervised Hashing (Wang et al. 2013)  Preserve ranking list in the Hamming space  Triplet matrix representing ranking order  Ranking consistency

34 34 Outline  Background  Approximate nearest neighbor search  Tree and hashing for data indexing  Locality sensitive hashing  Learning to Hash:  Unsupervised hashing  Supervised hashing  Semi-supervised hashing (pointwise/pairwise/listwise)  Large Scale Active Learning with Hashing  Hyperplane hashing  Fast query selection with hashing  Summary and Discussion

35 35 Point-to-Point NN vs. Point-to-Hyperplane NN  Hyperplane hashing aims at finding the points nearest to a hyperplane  An efficient way to select queries in the active learning paradigm  The points nearest to the hyperplane are the most uncertain ones

36 36 Hyperplane Hashing  Objective: find the data points with the shortest point-to-hyperplane distance
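
For reference, the sketch below computes the point-to-hyperplane distance |w^T x + b| / ||w|| exhaustively and returns the most uncertain points; hyperplane hashing is designed to approximate exactly this selection in sublinear time.

```python
import numpy as np

def most_uncertain(X, w, b=0.0, k=5):
    """Exhaustive baseline for hyperplane search: points with the smallest
    point-to-hyperplane distance are the most uncertain for a linear classifier."""
    dists = np.abs(X @ w + b) / np.linalg.norm(w)
    return np.argsort(dists)[:k]
```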

37 37 Angle and Embedding Based Hyperplane Hashing (Jain et al. 2010)  Angle-based hyperplane hashing: the collision probability depends on the angle between a database point and the hyperplane's normal vector  Embedding-based hyperplane hashing: the distance in the embedded space is proportional to the distance in the original space Figures from http://vision.cs.utexas.edu/projects/activehash/

38 38 Bilinear Hyperplane Hashing (Liu et al. 2012)  Bilinear hash functions: h(z) = sgn(u^T z · z^T v), built from two random projection vectors u and v
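
A sketch of one bilinear hash bit under that form; the convention that the hyperplane query's normal vector is hashed with the opposite sign is my reading of the paper and is marked as an assumption in the comments.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_hash_bit(z, u, v):
    """One bilinear hash bit h(z) = sgn(u^T z * z^T v) with random u, v."""
    return np.sign((u @ z) * (z @ v))

d = 32
u, v = rng.standard_normal(d), rng.standard_normal(d)
x = rng.standard_normal(d)           # a database point
w = rng.standard_normal(d)           # a hyperplane query, given by its normal vector

print(bilinear_hash_bit(x, u, v))    # database points hashed as-is
print(-bilinear_hash_bit(w, u, v))   # assumption: the query normal gets a flipped sign
```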

39 39 Analysis and Comparison  Collision probability of bilinear hyperplane hashing  Compare with angle-based and embedding-based hyperplane hashing

40 40 Active Learning for Big Data  Active learning aims at reducing the annotation cost by connecting humans and prediction models  The key idea of active learning is to iteratively identify the points that are most ambiguous to the current prediction model  This requires exhaustive testing over all the data samples  At least linear complexity, and not feasible for big-data applications Example and figure from “Active Learning Literature Survey” by Burr Settles

41 41 Active Learning with Hashing  The conceptual diagram  Two key components  Index unlabeled data into hash tables;  Compute the hash code of the current classifier and treat it as a query Figures from http://vision.cs.utexas.edu/projects/activehash/

42 42 Empirical Study  20 Newsgroups data (18,846 documents, 20 classes)  Starting with 5 randomly labeled documents per class  Perform 300 iterations of active learning with different query selection strategies

43 43 Active Learning with Pairwise Queries  Typical applications can be found in pairwise-comparison-based ranking (Jamieson & Nowak 2011, Wauthier et al. 2013)  In an active learning setting, the system sends the annotator a pair of points and receives the relevance comparison result as supervision  Exhaustively selecting optimal sample pairs has quadratic complexity (Figures from Qian et al. 2013 and Jamieson & Nowak 2011)

44 44 Fast Pairwise Query Selection with Hashing  Key motivations:  The selected query pairs should have high relevance  The order between the paired points should be uncertain  Two-step selection strategy:  Relevance selection (point-to-point hashing)  Uncertainty selection (point-to-hyperplane hashing)
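
A hedged sketch of the two-step idea. Step 1 (relevance) is assumed to have already produced candidate neighbors of an anchor item via point-to-point hashing; step 2 (uncertainty) is shown here as an exact margin check over those candidates, whereas the actual method replaces this check with point-to-hyperplane hashing.

```python
import numpy as np

def select_pair(X, w, anchor_id, candidate_ids):
    """Pick a query pair: the anchor item plus the candidate whose ordering
    relative to the anchor is most uncertain under the current model w
    (smallest distance to the decision hyperplane)."""
    margins = np.abs(X[candidate_ids] @ w) / np.linalg.norm(w)
    most_uncertain = candidate_ids[int(np.argmin(margins))]
    return anchor_id, most_uncertain
```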

45 45 Outline  Background  Approximate nearest neighbor search  Tree and hashing for data indexing  Locality sensitive hashing  Learning to Hash:  Unsupervised hashing  Supervised hashing  Semi-supervised hashing (pointwise/pairwise/listwise)  Large Scale Active Learning with Hashing  Hyperplane hashing  Fast query selection with hashing  Summary and Discussion

46 46 Summary and Trend in Metric Learning

47 47 Summary and Trend in Learning to Hash  From data-independent to data-dependent  From task-independent to task-dependent  From simple supervision to complex supervision (pointwise -> pairwise -> triplet/listwise)  From linear methods to kernel-based methods  From homogeneous data to heterogeneous data  From simple data to structured data  From point-to-point methods to point-to-hyperplane methods  From model-driven to application-driven  From single hash table to multiple tables

48 48 References

