Cartesian k-means
Mohammad Norouzi, David Fleet

Cartesian k-means
Mohammad Norouzi, David Fleet

Vector quantization is a key component of many vision and machine learning algorithms:
- Codebook learning for recognition
- Approximate nearest neighbor search
- Feature compression for very large datasets

The basic idea is to break the feature space into a bunch of regions and represent all the data points in each region by a cluster center.

The k-means algorithm selects cluster centers that minimize quantization error, that is, the Euclidean distance between each data point and its nearest center.

We need many clusters. Problem: search time, storage cost.

For many vision tasks we need an accurate representation of the data points, so we need to keep the quantization error small. This is done by increasing the number of quantization regions and the number of cluster centers. But this is difficult with k-means, because search time and storage cost increase linearly with the number of centers. Hierarchical k-means and approximate k-means give faster search, but storage cost remains a fundamental barrier, especially in the high-dimensional case. This paper presents a family of compositional quantization models for which search time and storage cost are sublinear in the number of centers. This allows us to manage millions or billions of cluster centers.

The basic idea is to use a compositional parametrization of clusters, as in JPEG, part-based models, and product quantization. Suppose we break each image into two parts. This is equivalent to projecting the data points onto two subspaces.

(subspace 1, subspace 2) One subspace corresponds to the left half of the images and one to the right half. Then suppose we run the k-means algorithm to learn a set of centers for each subspace separately.

(subspace 1, subspace 2) The good news is that if we have 5 centers on the left and 5 centers on the right, we have 5 squared ways to represent a full image. In other words, the quantization regions in the original space are the Cartesian product of the subspace regions. This is exactly what product quantization (PQ) does, a state-of-the-art quantization technique.


Compositional representation
- m subspaces, h regions per subspace
- k = h^m centers
- O(mh) parameters

More generally, if we project the inputs onto m subspaces and break each subspace into h regions, we obtain h^m centers in the original space, while the number of parameters is only of order mh. The number of parameters is therefore sublinear in the number of centers. A key question is which subspaces to use.
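To make the parameter counting concrete, here is a minimal NumPy sketch (illustrative only; the variable names, sizes, and random sub-codebooks are my own, not from the paper) that stores m sub-codebooks of h sub-centers each and enumerates the h^m implicit full-space centers they define.

```python
import numpy as np
from itertools import product

p, m, h = 8, 2, 5                      # dimension, subspaces, regions per subspace
rng = np.random.default_rng(0)

# One sub-codebook per subspace: h sub-centers of dimension p/m each.
subcodebooks = [rng.standard_normal((h, p // m)) for _ in range(m)]

# Stored parameters: m sub-codebooks of h sub-centers, i.e. O(m*h) sub-centers.
n_params = sum(c.size for c in subcodebooks)

# Implicit centers: every way of picking one sub-center per subspace,
# i.e. the Cartesian product of the subspace quantizations -> h**m centers.
centers = np.array([np.concatenate(choice)
                    for choice in product(*subcodebooks)])

print(n_params)        # 2 * 5 * 4 = 40 stored numbers
print(centers.shape)   # (25, 8): h**m = 5**2 implicit centers
```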

Which subspaces? Learning

"How do we choose the subspaces to minimize quantization error?" There are many ways to select subspaces. Because each subspace is quantized independently, we want almost no statistical dependence between the subspaces. In this work we propose a learning algorithm to find the optimal subspaces.

k-means

k cluster centers: C = [c_1, ..., c_k]

$$\ell_{\text{k-means}}(C) = \sum_{\mathbf{x} \in \mathcal{D}} \; \min_{\mathbf{b} \in \mathcal{H}_1^k} \|\mathbf{x} - C\mathbf{b}\|^2,$$

where b is a one-of-k encoding, $\mathcal{H}_1^k \equiv \{\mathbf{b} \in \{0,1\}^k : \|\mathbf{b}\| = 1\}$.

Turning to the formulation, we start from the k-means algorithm. Assume we have k cluster centers that form the columns of a center matrix C. The indicator vector b is a one-of-k encoding: a k-dimensional vector with k-1 zeros and a single one, so exactly one column of C is selected, namely the center nearest to the input x. During training we iteratively optimize the matrix C and the indicator variables b until we converge to a local minimum of the k-means objective.
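For reference, one Lloyd iteration for this objective can be sketched in a few lines of NumPy (my own illustrative code, not the authors' implementation; it assumes C stores centers as columns, as on the slide).

```python
import numpy as np

def kmeans_step(X, C):
    """One Lloyd iteration for  sum_x min_b ||x - C b||^2,
    where b is a one-of-k indicator selecting a column of C.

    X: (n, p) data,  C: (p, k) centers stored as columns."""
    # Assignment step: pick the nearest column of C for every point.
    d2 = ((X[:, :, None] - C[None, :, :]) ** 2).sum(axis=1)   # (n, k)
    assign = d2.argmin(axis=1)
    # Update step: each center becomes the mean of its assigned points.
    C_new = C.copy()
    for j in range(C.shape[1]):
        pts = X[assign == j]
        if len(pts):
            C_new[:, j] = pts.mean(axis=0)
    return C_new, assign
```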

Orthogonal k-means

m center basis vectors: C = [c_1, ..., c_m]

$$\ell_{\text{ok-means}}(C) = \sum_{\mathbf{x} \in \mathcal{D}} \; \min_{\mathbf{b} \in \mathcal{B}^m} \|\mathbf{x} - C\mathbf{b}\|^2,$$

where b is an arbitrary m-bit encoding, $\mathcal{B}^m \equiv \{-1, +1\}^m$, so the number of centers is k = 2^m.

Our first compositional model lets the indicator vector b be an arbitrary m-bit binary code. Previously b could take on only k possible values, because it had a single nonzero element; now b can take on 2^m possible values, which creates 2^m possible centers. However, finding the nearest center for a data point x then requires solving a hard binary least-squares problem for b.

Orthogonal k-means

Additional constraints: c_i ⊥ c_j for all i ≠ j. The least-squares estimate of b given x is then

$$\hat{\mathbf{b}} = \mathrm{sgn}(C^T \mathbf{x}).$$

We make the center-assignment problem tractable by making the columns of C orthogonal. Then the optimal b is just the sign of C^T x. This yields a model called orthogonal k-means (ok-means), with an iterative learning algorithm for C and an efficient center-assignment step. Note that the binary codes b can be viewed as the vertices of an m-dimensional hypercube, transformed by the matrix C to fit the data points x.
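A minimal sketch of the ok-means assignment step under this orthogonality constraint (assuming the data has already been centered and C has been learned; the function names are mine):

```python
import numpy as np

def okmeans_encode(X, C):
    """Assign ok-means codes: with mutually orthogonal columns in C,
    the least-squares binary code is simply b = sgn(C^T x).

    X: (n, p) centered data,  C: (p, m) basis with orthogonal columns."""
    B = np.sign(X @ C)            # (n, m) codes in {-1, +1}
    B[B == 0] = 1                 # break ties arbitrarily
    return B

def okmeans_reconstruct(B, C):
    """Map codes back to their cluster centers: x_hat = C b
    (hypercube vertices transformed by C)."""
    return B @ C.T                # (n, p)
```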

Iterative Quantization (ITQ) [Gong & Lazebnik, CVPR'11]

$$\min_{\mathbf{b} \in \{-1, +1\}^m} \|\mathbf{x} - \boldsymbol{\mu} - C\mathbf{b}\|^2$$

To get a better sense of orthogonal k-means, consider a 2-dimensional example. The red crosses show the data points and the black circles show the cluster centers. When C is the identity matrix there is no transformation of the binary codes, and the quantization regions are the four quadrants. By learning an appropriate transformation of the binary codes, which involves a rotation, translation, and non-uniform scaling, ok-means achieves a much smaller quantization error. (Panels: C = identity vs. learned C.)

We ran nearest-neighbor search experiments to see how important the extra scaling is. We quantized a dataset of 1M SIFT features into 64-bit binary codes, and compared real-valued SIFT queries against the vector-quantized dataset points. The x axis shows the number of retrieved items (on a log scale), and the y axis shows the recall rate for the Euclidean nearest neighbor. We get about a 10% recall improvement over ITQ for a wide range of K, at no extra cost. One take-home message is that if you are using ITQ for hashing, you could replace it with ok-means for a potential improvement.
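For reference, the recall@R metric used in these plots can be computed with a small NumPy helper (illustrative; names are mine):

```python
import numpy as np

def recall_at_R(retrieved, true_nn, R):
    """Fraction of queries whose exact Euclidean nearest neighbor appears
    among the first R retrieved items.

    retrieved: (n_queries, max_R) ranked database indices per query
    true_nn:   (n_queries,) index of the exact Euclidean NN per query"""
    hits = (retrieved[:, :R] == true_nn[:, None]).any(axis=1)
    return hits.mean()
```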

Product Quantization [Jegou, Douze, Schmid, PAMI'11]

We also compared with PQ and, surprisingly, found that PQ performs much better than ok-means and ITQ. We think the reasons are twofold: first, in both ok-means and ITQ the subspaces are one-dimensional, whereas PQ uses multi-dimensional subspaces; in this experiment PQ uses eight 16-dimensional subspaces. Moreover, ...

Cartesian k-means

$$\mathbf{x} \approx \begin{bmatrix} C^{(1)} & C^{(2)} \end{bmatrix} \begin{bmatrix} \mathbf{b}^{(1)} \\ \mathbf{b}^{(2)} \end{bmatrix}, \qquad C^{(1)} \perp C^{(2)}$$

Inspired by PQ, we propose our second compositional model, Cartesian k-means, which combines ideas from ok-means and PQ. For this talk, consider just two subspaces. The center matrix is divided into two sub-center matrices C^(1) and C^(2), and the indicator variable b is divided into b^(1) and b^(2). In orthogonal k-means, where the subspaces are one-dimensional, C^(1) and C^(2) are single columns and b^(1), b^(2) are one bit each. Here, C^(1) and C^(2) each have h columns, and b^(1), b^(2) are one-of-h encodings; we pick one sub-center from each set to generate a center in the original space. The centers are thus the Cartesian product of the subspace quantizations, so the total number of centers is k = h^2. As in ok-means, to make the model computationally efficient we assume the subspaces are mutually orthogonal; this gives ck-means in the case of two subspaces.

Cartesian k-means (two subspaces)

$$\mathbf{x} \approx \begin{bmatrix} C^{(1)} & C^{(2)} \end{bmatrix} \begin{bmatrix} \mathbf{b}^{(1)} \\ \mathbf{b}^{(2)} \end{bmatrix}, \qquad \mathbf{b}^{(1)}, \mathbf{b}^{(2)} \in \mathcal{H}_1^h, \qquad C^{(1)} \perp C^{(2)}$$

Each sub-center matrix has h columns and each of b^(1), b^(2) is a one-of-h encoding, so the number of centers is k = h^2, while storage cost and search time are $O(\sqrt{k})$.
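Here is a minimal sketch of how a point could be encoded under this two-subspace model, assuming the factorization C^(j) = R^(j) D^(j) introduced on the learning slides below (the names and shapes are my own illustrative choices):

```python
import numpy as np

def ckmeans_encode(X, R, D_list):
    """Encode points under a two-subspace Cartesian k-means model.

    X:      (n, p) data
    R:      (p, p) orthogonal matrix, split column-wise as [R1 | R2]
    D_list: [D1, D2], each (p//2, h) sub-centers in its rotated subspace

    Returns integer codes (n, 2); the implied center index is
    code[:, 0] * h + code[:, 1], one of k = h**2 centers."""
    Z = X @ R                                   # rows are [R1^T x, R2^T x]
    codes, start = [], 0
    for D in D_list:                            # independent one-of-h assignment per subspace
        s = D.shape[0]
        Zj = Z[:, start:start + s]              # (n, p//2) projected coordinates
        d2 = ((Zj[:, :, None] - D[None, :, :]) ** 2).sum(axis=1)   # (n, h)
        codes.append(d2.argmin(axis=1))
        start += s
    return np.stack(codes, axis=1)
```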

Learning Cartesian k-means

$$\sum_{\mathbf{x} \in \mathcal{D}} \; \min_{\mathbf{b}^{(1)}, \mathbf{b}^{(2)}} \left\| \mathbf{x} - \begin{bmatrix} C^{(1)} & C^{(2)} \end{bmatrix} \begin{bmatrix} \mathbf{b}^{(1)} \\ \mathbf{b}^{(2)} \end{bmatrix} \right\|^2, \qquad C^{(1)} \perp C^{(2)}, \quad \mathbf{b}^{(1)}, \mathbf{b}^{(2)} \in \mathcal{H}_1^h,$$

where each $C^{(j)}$ is $p \times h$. Writing each sub-center matrix as $C^{(j)} = R^{(j)} D^{(j)}$, with $R^{(j)}$ of size $p \times \tfrac{p}{2}$ ($R^{(1)} \perp R^{(2)}$) and $D^{(j)}$ of size $\tfrac{p}{2} \times h$, the objective becomes

$$\sum_{\mathbf{x} \in \mathcal{D}} \; \min_{\mathbf{b}^{(1)}, \mathbf{b}^{(2)}} \left\| \mathbf{x} - \begin{bmatrix} R^{(1)} & R^{(2)} \end{bmatrix} \begin{bmatrix} D^{(1)} & 0 \\ 0 & D^{(2)} \end{bmatrix} \begin{bmatrix} \mathbf{b}^{(1)} \\ \mathbf{b}^{(2)} \end{bmatrix} \right\|^2 = \sum_{\mathbf{x} \in \mathcal{D}} \; \min_{\mathbf{b}^{(1)}, \mathbf{b}^{(2)}} \left\| \begin{bmatrix} R^{(1)T} \mathbf{x} \\ R^{(2)T} \mathbf{x} \end{bmatrix} - \begin{bmatrix} D^{(1)} \mathbf{b}^{(1)} \\ D^{(2)} \mathbf{b}^{(2)} \end{bmatrix} \right\|^2.$$

Training alternates among three updates:
- Update D^(1) and b^(1) by one step of k-means in the projected space R^(1)T x.
- Update D^(2) and b^(2) by one step of k-means in the projected space R^(2)T x.
- Update R = [R^(1) R^(2)] by SVD, solving an orthogonal Procrustes problem.
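The rotation update is a standard orthogonal Procrustes problem; assuming a full square rotation, a minimal NumPy sketch of this step might look like (naming is mine):

```python
import numpy as np

def update_rotation(X, Y):
    """Find the orthogonal R minimizing sum_i ||x_i - R y_i||^2, where
    each y_i stacks the sub-reconstructions D^(j) b^(j) over subspaces.

    X: (n, p) data,  Y: (n, p) current reconstructions in the rotated frame."""
    U, _, Vt = np.linalg.svd(X.T @ Y)   # p x p SVD: this is the O(p^3) step
    return U @ Vt                       # optimal orthogonal matrix R = U V^T
```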

Cartesian k-means (general case)

$$\mathbf{x} \approx \begin{bmatrix} C^{(1)} & \cdots & C^{(m)} \end{bmatrix} \begin{bmatrix} \mathbf{b}^{(1)} \\ \vdots \\ \mathbf{b}^{(m)} \end{bmatrix}, \qquad C^{(i)} \perp C^{(j)} \ \ \forall\, i \neq j, \qquad \mathbf{b}^{(1)}, \ldots, \mathbf{b}^{(m)} \in \mathcal{H}_1^h$$

With m subspaces and a one-of-h encoding per subspace, the number of centers is k = h^m, while storage cost and search time are $O(m\,k^{1/m})$.

Cartesian k-means: m subspaces, h regions per subspace. With m = 1 we recover k-means, and with h = 2 we recover ok-means, where k = 2^m. So we have a knob that controls the degree of compositionality and model compactness.

Back to our nearest-neighbor experiments, now with quantization into 2^64 regions, we assess the importance of learning the subspaces on SIFT vectors. Observe that ck-means slightly improves upon PQ, with about a 4% improvement in recall@10, but this is only using 1M data points.

~10%

We went a step further and ran another experiment with 1B data points. We obtained a more significant improvement over PQ in this case (roughly 10%), as quantization becomes more important when we have more data points. Again, ok-means improves upon ITQ.

~25%

Another experiment, on GIST descriptors, gave a much larger improvement (roughly 25%), presumably because of the higher dimensionality of GIST vectors, which makes the selection of subspaces trickier. Notably, we used 1000-dimensional GIST descriptors versus 128-dimensional SIFT. It is also clear that ok-means does not improve upon ITQ here, presumably because of the way the GIST descriptors are normalized. Note that for all of the nearest-neighbor search experiments ... so the search time is exactly the same cost.

Codebook learning (CIFAR-10): accuracy
- k-means (k = 1600): 77.9%
- ck-means (k = 40^2): 78.2%
- PQ (k = 40^2): 75.9%
- k-means (k = 4000): 79.6%
- ck-means (k = 64^2): 79.7%
- PQ (k = 64^2): ...

So far we have shown experiments on quantization for nearest-neighbor search, which requires very many quantization regions, so k-means is not applicable. We also ran a completely different set of experiments on codebook learning, where the number of clusters is much smaller. The CIFAR-10 dataset is a classification benchmark of tiny images with 10 classes. As a baseline we used the method of Coates and Ng to learn a codebook of image patches with 1600 and 4000 codewords, and ran a linear SVM on bag-of-words histograms aggregated over a spatial pyramid with four quadrants. We then changed only the codebook-learning part, using ck-means and PQ to learn the vocabulary; in these experiments we used only two subspaces for both PQ and ck-means. Interestingly, we find that ck-means performs on par with, or slightly better than, k-means, even though it has a much faster cluster-assignment step.

Examples of Cartesian codewords learned for 10×10 patches

Quantized 32×32 images (1024 bits)

32×32 images

Run-time complexity

Inference (quantizing a point): a big rotation of size p×p can be expensive. We apply PCA to reduce the dimensionality to s as pre-processing, and optimize a p×s projection within the model.

Learning: the most expensive part of each training iteration is the SVD used to estimate R, which is O(p^3). This can be done faster if we have a p×s rotation.
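As a concrete example of the PCA pre-processing step mentioned above (a sketch under my own naming, not the authors' code):

```python
import numpy as np

def pca_projection(X, s):
    """PCA pre-processing: return a p x s projection onto the top-s
    principal directions, so the model only needs a p x s (rather than
    p x p) rotation.  X: (n, p) data."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data = principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:s].T                     # (p, s) projection matrix
```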

Summary: ITQ, PQ, ok-means, ck-means

In this talk we presented a compositional extension of the k-means algorithm that allows efficient clustering of data points into millions or billions of clusters. We also showed that medium-sized vocabularies can be learned for recognition using Cartesian k-means without loss of accuracy.

Thank you for your attention!

$$\mathbf{x} \approx \begin{bmatrix} R^{(1)} & \cdots & R^{(m)} \end{bmatrix} \begin{bmatrix} D^{(1)} & & \\ & \ddots & \\ & & D^{(m)} \end{bmatrix} \begin{bmatrix} \mathbf{b}^{(1)} \\ \vdots \\ \mathbf{b}^{(m)} \end{bmatrix}$$

I'd be happy to take questions.

When we have millions of high-dimensional points, exhaustive nearest-neighbor search against the original vectors costs O(np).

The good news is that the distance from a query to any ok-means center decomposes over the m code dimensions, so it can be computed with a query-specific lookup table.

Query-specific table (2 × m entries):
  bit = +1:  (d_1^+)^2  (d_2^+)^2  ...  (d_m^+)^2
  bit = -1:  (d_1^-)^2  (d_2^-)^2  ...  (d_m^-)^2

With this table, ranking n codes costs O(nm + mp) ≤ O(np).
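Here is a sketch of how such a table-based (asymmetric) distance computation could be implemented for ok-means codes; the per-bit table entries follow from the orthogonality of C's columns, and the function names are my own illustrative choices:

```python
import numpy as np

def okmeans_query_table(q, C, mu):
    """Build the 2 x m query-specific table of squared per-bit distances.

    q: (p,) query,  C: (p, m) with mutually orthogonal columns,  mu: (p,) data mean.
    Row 0 holds (d_i^+)^2 for bit +1, row 1 holds (d_i^-)^2 for bit -1."""
    v = q - mu
    s = np.linalg.norm(C, axis=0)          # column norms ||c_i||
    a = (C.T @ v) / s                      # coordinate of v along each unit column
    return np.stack([(a - s) ** 2,         # bit = +1
                     (a + s) ** 2])        # bit = -1

def rank_by_asymmetric_distance(q, B, C, mu):
    """Rank n database codes B (n, m) in {-1, +1} for query q.
    Cost: O(m p) for the table plus O(n m) for the lookups."""
    table = okmeans_query_table(q, C, mu)
    rows = (B < 0).astype(int)             # 0 -> bit +1, 1 -> bit -1
    cols = np.arange(B.shape[1])
    dist = table[rows, cols].sum(axis=1)   # distances up to a query-dependent constant
    return np.argsort(dist)
```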
