1 Cartesian k-means
Mohammad Norouzi, David Fleet. Vector quantization is a key component of many vision and machine learning algorithms: codebook learning for recognition, approximate nearest neighbor search, and feature compression for very large datasets.

2 The basic idea is to break the feature space into a set of regions and to represent all the data points in each region by a single cluster center.

3 The k-means algorithm selects cluster centers that minimize quantization error, that is, the Euclidean distance between each data point and its nearest center.

5 We need many clusters. Problem: search time and storage cost
For many vision tasks we need an accurate representation of the data points, so we need to keep the quantization error small. This is done by increasing the number of quantization regions and hence the number of cluster centers. But this is difficult with k-means, because search time and storage cost grow linearly with the number of centers. Hierarchical k-means and approximate k-means give faster search, but storage cost remains a fundamental barrier, especially in the high-dimensional case. This paper presents a family of compositional quantization models for which search time and storage cost are sublinear in the number of centers, which lets us manage millions or billions of cluster centers.

6 The basic idea is to use a compositional parametrization of the clusters, as in JPEG, part-based models, and product quantization. Suppose we break each image into

7 two parts. This is equivalent to projecting the data points onto two subspaces.

8 (subspace 1) (subspace 2)
One for the left half and one for the right half of the images. Then suppose we run the k-means algorithm to learn a set of centers for each subspace separately.

9 (subspace 1) (subspace 2)
The good news is that if we have 5 centers on the left and 5 centers on the right, we have 5 squared ways to represent a full image. In other words, the quantization regions in the original space are the Cartesian product of the subspace regions. This is exactly what product quantization (PQ), a state-of-the-art quantization technique, does.

14 Compositional representation
𝑚 subspaces, ℎ regions per subspace

15 Compositional representation
𝑚 subspaces, ℎ regions per subspace, 𝑘 = ℎ^𝑚 centers

16 Compositional representation
𝑚 subspaces, ℎ regions per subspace, 𝑘 = ℎ^𝑚 centers, 𝑂(𝑚ℎ) parameters
More generally, if we project the inputs onto 𝑚 subspaces and break each subspace into ℎ regions, we obtain ℎ^𝑚 centers in the original space, while the number of parameters is only on the order of 𝑚ℎ. The number of parameters is therefore sublinear in the number of centers. A key question is:
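As a concrete illustration of this trade-off (numbers chosen arbitrarily for illustration, not taken from the slides), a minimal sketch in Python:

m, h = 8, 256
k = h ** m        # effective number of centers: 256^8 = 2^64, about 1.8e19
params = m * h    # sub-centers actually stored: 2048

A few thousand stored sub-centers thus index an astronomically large set of centers.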

17 Which subspaces?
"How do we choose the subspaces to minimize quantization error?" As you can imagine, there are many ways to select the subspaces. Because each subspace is quantized independently, we want almost no statistical dependence between the subspaces.

18 Which subspaces? Learning
In this work we propose a learning algorithm to find the optimal subspaces.

19 k-means
𝑘 cluster centers: 𝐶 = [𝒄_1, …, 𝒄_𝑘]
Turning to the formulation, we start from the k-means algorithm. Let's assume we have 𝑘 cluster centers that form the columns of a center matrix 𝐶.

20 k-means
ℓ_k-means(𝐶) = Σ_{𝒙∈𝒟} min_{𝒃 ∈ ℋ_1^k} ‖𝒙 − 𝐶𝒃‖²
𝑘 cluster centers: 𝐶 = [𝒄_1, …, 𝒄_𝑘]; 𝒃 is a one-of-𝑘 encoding, ℋ_1^k ≡ { 𝒃 ∈ {0,1}^k : ‖𝒃‖_1 = 1 }
Here 𝒃 is an indicator vector with a one-of-𝑘 encoding, i.e., a 𝑘-dimensional vector with 𝑘−1 zeros and a single one. The nonzero element of 𝒃 selects the nearest center to an input 𝒙 from the matrix 𝐶. During training we iteratively optimize the center matrix 𝐶 and the indicator variables 𝒃 until we converge to a local minimum of the k-means objective.
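A minimal sketch of the assignment step implied by this objective, assuming numpy and illustrative variable names (not the authors' code):

import numpy as np

def kmeans_assign(X, C):
    # X: n x p data points; C: p x k matrix whose columns are the cluster centers
    d2 = ((X[:, None, :] - C.T[None, :, :]) ** 2).sum(-1)   # n x k squared distances
    idx = d2.argmin(axis=1)                                  # index of the nearest center
    B = np.eye(C.shape[1])[idx]                              # rows are one-of-k indicator vectors b
    err = d2[np.arange(len(X)), idx].sum()                   # quantization error for this assignment
    return B, err

Alternating this assignment with recomputing each column of C as the mean of its assigned points gives the usual k-means iterations.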

22 Orthogonal k-means
ℓ_ok-means(𝐶) = Σ_{𝒙∈𝒟} min_{𝒃 ∈ ℬ^m} ‖𝒙 − 𝐶𝒃‖²
𝑚 center basis vectors: 𝐶 = [𝒄_1, …, 𝒄_𝑚]; 𝒃 is an arbitrary 𝑚-bit encoding, ℬ^m ≡ {−1, +1}^m
Our first compositional model lets the indicator vector 𝒃 be an arbitrary 𝑚-bit binary code. Previously 𝒃 could take on only 𝑘 possible values, because it had a single nonzero element; now it can take on 2^𝑚 possible values, which creates 2^𝑚 possible centers. However, finding the nearest center for a data point 𝒙 now involves a hard binary least-squares problem to estimate 𝒃.

23 Orthogonal k-means
ℓ_ok-means(𝐶) = Σ_{𝒙∈𝒟} min_{𝒃 ∈ ℬ^m} ‖𝒙 − 𝐶𝒃‖²
𝑚 center basis vectors: 𝐶 = [𝒄_1, …, 𝒄_𝑚]; 𝒃 is an arbitrary 𝑚-bit encoding; #centers: 𝑘 = 2^𝑚

24 Orthogonal k-means
ℓ_ok-means(𝐶) = Σ_{𝒙∈𝒟} min_{𝒃 ∈ ℬ^m} ‖𝒙 − 𝐶𝒃‖²
𝑚 center basis vectors: 𝐶 = [𝒄_1, …, 𝒄_𝑚]; additional constraints: ∀ 𝑖≠𝑗, 𝒄_𝑖 ⊥ 𝒄_𝑗; the least-squares estimate of 𝒃 given 𝒙 is 𝒃̂ = sgn(𝐶ᵀ𝒙)
We make the center assignment problem tractable by making the columns of 𝐶 orthogonal. Then the optimal 𝒃 is just the sign of 𝐶ᵀ𝒙. This yields a model called orthogonal k-means (ok-means), with an iterative learning algorithm for 𝐶 and an efficient center assignment step. You can think of the binary codes 𝒃 as the vertices of an 𝑚-dimensional hypercube that are transformed by the matrix 𝐶 to fit the data points 𝒙.
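A minimal sketch of this closed-form assignment, assuming a matrix C with mutually orthogonal columns and an optional offset mu (illustrative, not the authors' code):

import numpy as np

def okmeans_assign(X, C, mu=0.0):
    # C: p x m with mutually orthogonal columns; codes live in {-1, +1}^m
    B = np.sign((X - mu) @ C)
    B[B == 0] = 1          # arbitrary tie-breaking for points exactly on a boundary (assumption)
    return B               # n x m binary codes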

25 Iterative Quantization [Gong & Lazebnik, CVPR'11]
min_{𝒃 ∈ {−1,+1}^m} ‖𝒙 − 𝝁 − 𝐶𝒃‖²
𝐶 = identity vs. learned 𝐶
To get a better sense of orthogonal k-means, let's look at a 2-dimensional example. The red crosses show data points and the black circles show cluster centers. When 𝐶 is the identity matrix there is no transformation of the binary codes, and the quantization regions are simply the four quadrants. By learning an appropriate transformation of the binary codes, which involves a rotation, translation, and non-uniform scaling, ok-means achieves a much smaller quantization error.

26 We ran nearest-neighbor search experiments to see how important the extra scaling is.
For these experiments we quantized a dataset of 1M SIFT features into 64-bit binary codes, and we compared real-valued SIFT queries against the vector-quantized dataset points. The x-axis shows the number of retrieved items (on a log scale) and the y-axis shows recall rates for the Euclidean nearest neighbor. We get about a 10% recall improvement over ITQ for a wide range of K with no extra cost. One take-home message: if you are using ITQ for hashing, you could replace it with ok-means for a potential improvement.

27 We also compared with PQ, and surprisingly we found that PQ performs much better than ok-means and ITQ. We think the reasons are two-fold: first, in both ok-means and ITQ the subspaces are one-dimensional, whereas PQ uses multi-dimensional subspaces; for example, here PQ uses 8 subspaces of 16 dimensions each.

28 Product Quantization [Jegou, Douze, Schmid, PAMI'11]
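A minimal sketch of PQ encoding under the usual layout where the vector is split into m contiguous sub-vectors, each quantized by its own h-entry codebook (illustrative, not the authors' code):

import numpy as np

def pq_encode(x, codebooks):
    # codebooks: list of m arrays, each h x (p/m); x: length-p vector
    codes, start = [], 0
    for D in codebooks:
        sub = x[start:start + D.shape[1]]
        codes.append(((D - sub) ** 2).sum(axis=1).argmin())  # nearest sub-center in this subspace
        start += D.shape[1]
    return np.array(codes)                                   # m indices, each in [0, h)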

29 Cartesian k-means
𝒙 ≈ [𝐶^(1) 𝐶^(2)] [𝒃^(1); 𝒃^(2)], with 𝐶^(1) ⊥ 𝐶^(2)
Inspired by PQ, we propose our second compositional model, Cartesian k-means (ck-means), which combines ideas from ok-means and PQ. For this talk, consider just two subspaces. The center matrix is broken into two sub-center matrices 𝐶^(1) and 𝐶^(2), and the indicator variable 𝒃 is broken into 𝒃^(1) and 𝒃^(2). In orthogonal k-means, where the subspaces are one-dimensional, 𝐶^(1) and 𝐶^(2) are single columns and 𝒃^(1), 𝒃^(2) are one bit each. Here 𝐶^(1) and 𝐶^(2) have ℎ columns each, and 𝒃^(1), 𝒃^(2) are one-of-ℎ encodings. We pick one sub-center from each set to generate a center in the original space, so the centers are the Cartesian product of the subspace quantizations and the total number of centers is 𝑘 = ℎ². As in ok-means, to keep the model computationally efficient we assume the subspaces are mutually orthogonal.

30 Cartesian k-means
𝒙 ≈ [𝐶^(1) 𝐶^(2)] [𝒃^(1); 𝒃^(2)], 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h (one-of-ℎ encodings), 𝐶^(1) ⊥ 𝐶^(2)
#centers: 𝑘 = ℎ²

31 Cartesian k-means
𝒙 ≈ [𝐶^(1) 𝐶^(2)] [𝒃^(1); 𝒃^(2)], 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h, 𝐶^(1) ⊥ 𝐶^(2)
#centers: 𝑘 = ℎ²; storage cost: 𝑂(√𝑘); search time: 𝑂(√𝑘)
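A minimal sketch of the assignment step for two subspaces; because 𝐶^(1) ⊥ 𝐶^(2), the cross term vanishes and each one-of-ℎ code can be chosen independently (illustrative, not the authors' code):

import numpy as np

def ckmeans_assign(x, C1, C2):
    # C1, C2: p x h sub-center matrices whose column spaces are mutually orthogonal
    def nearest(C):
        # minimize ||C[:, j]||^2 - 2 x.C[:, j] over j, i.e., pick the nearest sub-center
        return ((C ** 2).sum(axis=0) - 2 * (x @ C)).argmin()
    return nearest(C1), nearest(C2)   # indices of the two selected sub-centers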

32 Learning Cartesian k-means
Σ_{𝒙∈𝒟} min_{𝒃^(1), 𝒃^(2)} ‖ 𝒙 − [𝐶^(1) 𝐶^(2)] [𝒃^(1); 𝒃^(2)] ‖², where each 𝐶^(i) is 𝑝×ℎ, 𝐶^(1) ⊥ 𝐶^(2), and 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h

33 Learning Cartesian k-means
Writing 𝐶^(i) = 𝑅^(i) 𝐷^(i): Σ_{𝒙∈𝒟} min_{𝒃^(1), 𝒃^(2)} ‖ 𝒙 − [𝑅^(1) 𝑅^(2)] [𝐷^(1) 0; 0 𝐷^(2)] [𝒃^(1); 𝒃^(2)] ‖², where each 𝐷^(i) is (𝑝/2)×ℎ, each 𝑅^(i) is 𝑝×(𝑝/2), 𝑅^(1) ⊥ 𝑅^(2), and 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h

34 Learning Cartesian k-means
Σ_{𝒙∈𝒟} min_{𝒃^(1), 𝒃^(2)} ‖ 𝒙 − [𝑅^(1) 𝑅^(2)] [𝐷^(1) 0; 0 𝐷^(2)] [𝒃^(1); 𝒃^(2)] ‖², with 𝑅^(1) ⊥ 𝑅^(2) and 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h

35 Learning Cartesian k-means
Since 𝑅 = [𝑅^(1) 𝑅^(2)] is orthogonal, this equals Σ_{𝒙∈𝒟} min_{𝒃^(1), 𝒃^(2)} ‖ [𝑅^(1)ᵀ; 𝑅^(2)ᵀ] 𝒙 − [𝐷^(1) 0; 0 𝐷^(2)] [𝒃^(1); 𝒃^(2)] ‖², with 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h

36 Learning Cartesian k-means
Σ_{𝒙∈𝒟} min_{𝒃^(1), 𝒃^(2)} ‖ [𝑅^(1)ᵀ𝒙; 𝑅^(2)ᵀ𝒙] − [𝐷^(1) 0; 0 𝐷^(2)] [𝒃^(1); 𝒃^(2)] ‖², with 𝑅^(1) ⊥ 𝑅^(2) and 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h

37 Learning Cartesian k-means
Σ_{𝒙∈𝒟} min_{𝒃^(1), 𝒃^(2)} ‖ [𝑅^(1)ᵀ𝒙; 𝑅^(2)ᵀ𝒙] − [𝐷^(1)𝒃^(1); 𝐷^(2)𝒃^(2)] ‖², with 𝑅^(1) ⊥ 𝑅^(2) and 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h
Update 𝐷^(1) and 𝒃^(1) by one step of k-means on 𝑅^(1)ᵀ𝒙

38 Learning Cartesian k-means
Σ_{𝒙∈𝒟} min_{𝒃^(1), 𝒃^(2)} ‖ [𝑅^(1)ᵀ𝒙; 𝑅^(2)ᵀ𝒙] − [𝐷^(1)𝒃^(1); 𝐷^(2)𝒃^(2)] ‖², with 𝑅^(1) ⊥ 𝑅^(2) and 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h
Update 𝐷^(2) and 𝒃^(2) by one step of k-means on 𝑅^(2)ᵀ𝒙

39 Learning Cartesian k-means
Σ_{𝒙∈𝒟} min_{𝒃^(1), 𝒃^(2)} ‖ 𝒙 − [𝑅^(1) 𝑅^(2)] [𝐷^(1)𝒃^(1); 𝐷^(2)𝒃^(2)] ‖², with 𝑅^(1) ⊥ 𝑅^(2) and 𝒃^(1), 𝒃^(2) ∈ ℋ_1^h
Update 𝑅 by SVD, solving an orthogonal Procrustes problem
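A minimal sketch of this alternating optimization for two subspaces, written from the update rules on these slides (my illustrative implementation, not the authors' code; it assumes even p, random initialization, and one k-means step per subspace per iteration):

import numpy as np

def ckmeans_train(X, h, iters=20):
    n, p = X.shape
    R = np.linalg.qr(np.random.randn(p, p))[0]       # orthogonal p x p, two p/2-column blocks
    D = [np.random.randn(p // 2, h), np.random.randn(p // 2, h)]
    for _ in range(iters):
        Y = X @ R                                    # rows hold [R1^T x ; R2^T x]
        halves = [Y[:, :p // 2], Y[:, p // 2:]]
        B = []
        for s in range(2):                           # one k-means step per subspace
            d2 = ((halves[s][:, None, :] - D[s].T[None, :, :]) ** 2).sum(-1)
            idx = d2.argmin(axis=1)
            B.append(idx)
            for j in range(h):                       # move sub-centers to the mean of their points
                if (idx == j).any():
                    D[s][:, j] = halves[s][idx == j].mean(axis=0)
        Q = np.concatenate([D[0][:, B[0]].T, D[1][:, B[1]].T], axis=1)   # quantized points, n x p
        U, _, Vt = np.linalg.svd(X.T @ Q)            # orthogonal Procrustes: min_R ||X R - Q||_F
        R = U @ Vt
    return R, D, B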

40 Cartesian k-means
𝒙 ≈ [𝐶^(1) … 𝐶^(𝑚)] [𝒃^(1); ⋮; 𝒃^(𝑚)], each 𝒃^(i) ∈ ℋ_1^h (one-of-ℎ), ∀ 𝑖≠𝑗: 𝐶^(𝑖) ⊥ 𝐶^(𝑗)
#centers: 𝑘 = ℎ^𝑚; storage cost: 𝑂(𝑚 𝑘^{1/𝑚}); search time: 𝑂(𝑚 𝑘^{1/𝑚})
More generally, with 𝑚 subspaces we have 𝑚 sets of sub-centers and pick one sub-center from each set to form a center, so the number of centers is ℎ^𝑚 while storage and search time remain sublinear in 𝑘.

41 Cartesian k-means: 𝑚 subspaces, ℎ regions per subspace
Special cases: ok-means (ℎ = 2, 𝑘 = 2^𝑚) and k-means (𝑚 = 1, 𝑘 = ℎ)
So we have a knob that controls the degree of compositionality and model compactness.

42 Now back to our nearest-neighbor search experiments with quantization into 2^64 regions, where we assess the importance of learning the subspaces on SIFT vectors. Observe that ck-means

43 slightly improves upon PQ, about a 4% improvement in recall@10, but this is only using 1M data points.

44 We went a step further and ran another experiment with 1B data points. We got a more significant improvement over PQ in this case, as quantization becomes more important when we have more data points. Again, ok-means improves upon ITQ.

45 ~10% improvement over PQ with 1B data points.

46 Another experiment, on GIST descriptors, gave us a much larger improvement, presumably because of the higher dimensionality of GIST vectors, which makes the selection of subspaces trickier. Notably, we used 1000-dimensional GIST descriptors vs. 128-dimensional SIFT. It is also clear that ok-means does not improve upon ITQ here, presumably because of the way GIST descriptors are normalized. Note that for all of the nearest-neighbor search experiments the methods are compared at exactly the same search cost.

47 ~25% improvement in the GIST experiment.

48 Codebook learning (CIFAR-10)
k-means (𝑘=1600): 77.9% accuracy; k-means (𝑘=4000): 79.6% accuracy
So far we have shown experiments on quantization for nearest-neighbor search, which requires very many quantization regions, so k-means is not applicable. We also ran a completely different set of experiments on codebook learning, where the number of clusters is much smaller. CIFAR-10 is a classification benchmark of tiny images with 10 classes. As a baseline we used the method of Coates and Ng to learn a codebook of image patches with 1600 and 4000 codewords, and we ran a linear SVM on bag-of-words histograms aggregated over a spatial pyramid with four quadrants.

49 Codebook learning (CIFAR-10)
k-means (𝑘=1600): 77.9%; ck-means (𝑘=40²): 78.2%; k-means (𝑘=4000): 79.6%; ck-means (𝑘=64²): 79.7%
Now we change only the codebook learning part and use ck-means and PQ to learn the vocabulary. In these experiments we used only two subspaces for both PQ and ck-means. Interestingly, ck-means performs on par with, or slightly better than, k-means, even though it has a much faster cluster assignment step.

50 Codebook learning (CIFAR-10)
k-means (𝑘=1600) 77.9% | ck-means (𝑘=40²) 78.2% | PQ (𝑘=40²) 75.9%
k-means (𝑘=4000) 79.6% | ck-means (𝑘=64²) 79.7% | PQ (𝑘=64²)

51 Examples of Cartesian codewords learned for 10x10 patches

52 Quantized 𝟑𝟐×𝟑𝟐 images (𝟏𝟎𝟐𝟒 bits)

53 𝟑𝟐×𝟑𝟐 images

54 Run-time complexity
Inference (quantizing a point): a full rotation of size 𝑝×𝑝 can be expensive. We can use PCA to reduce the dimensionality to 𝑠 as pre-processing, and optimize a 𝑝×𝑠 projection within the model.
Learning: the most expensive part of each training iteration is the SVD used to estimate 𝑅, which is 𝑂(𝑝³). This can be done faster if we use a 𝑝×𝑠 rotation.

55 Summary: ITQ, PQ, ok-means, ck-means
In this talk we presented a compositional extension of the k-means algorithm that allows data points to be clustered efficiently into millions or billions of clusters. We also showed that medium-size vocabularies for recognition can be learned with Cartesian k-means without loss of accuracy.

56 Thank you for your attention!
[Model diagram: 𝒙, rotations 𝑅^(1)…𝑅^(𝑚), sub-codebooks 𝐷^(1)…𝐷^(𝑚), codes 𝒃^(1)…𝒃^(𝑚)] I'd be happy to take questions.

57 𝑂(𝑛𝑝) when we have millions of high-dimensional points

60 The good news is that the squared distance to a center decomposes over bits, so the per-bit terms can be precomputed for each query.

64 Per-bit table, bit 1: (𝑑_1^+)², (𝑑_1^−)²

65 Per-bit table, bits 1 and 2: (𝑑_1^+)², (𝑑_2^+)²; (𝑑_1^−)², (𝑑_2^−)²

66 Query-specific table
bit 1 … bit 𝑚: (𝑑_1^+)², (𝑑_2^+)², …, (𝑑_𝑚^+)² and (𝑑_1^−)², (𝑑_2^−)², …, (𝑑_𝑚^−)²
Search cost: 𝑂(𝑛𝑚 + 𝑚𝑝) ≤ 𝑂(𝑛𝑝)
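A minimal sketch of this query-specific table for m-bit ok-means codes; the table costs O(mp) to build, after which each of the n database codes is scored in O(m) (an illustrative reading of these slides, not the authors' code; distances are correct up to a query-dependent constant that does not affect ranking):

import numpy as np

def asymmetric_distances(q, C, mu, B):
    # q: length-p query; C: p x m with orthogonal columns; mu: offset; B: n x m codes in {-1, +1}
    proj = C.T @ (q - mu)                      # projection of the query onto each column, O(mp)
    norms = (C ** 2).sum(axis=0)               # squared norms of the columns
    d_plus = norms - 2 * proj                  # per-bit cost if the stored bit is +1
    d_minus = norms + 2 * proj                 # per-bit cost if the stored bit is -1
    table = np.where(B == 1, d_plus, d_minus)  # n x m lookups, O(nm)
    return table.sum(axis=1)                   # smaller is closer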

