Approximate Near Neighbors for General Symmetric Norms
Ilya Razenshteyn (MIT CSAIL)
Joint work with Alexandr Andoni (Columbia University), Aleksandar Nikolov (University of Toronto), and Erik Waingarten (Columbia University)
arXiv:1611.06222
Nearest Neighbor Search: motivation
Data → feature vector space + distance function
Data analysis → geometry / linear algebra / optimization
Similarity search → nearest neighbor search
An example: word embeddings
Word embeddings: high-dimensional vectors that capture semantic similarity between words (and more)
GloVe [Pennington, Socher, Manning 2014]: 400K words, 300 dimensions
Ten nearest neighbors for “NYU”? Yale, Harvard, graduate, faculty, undergraduate, Juilliard, university, undergraduates, Cornell, MIT
Approximate Near Neighbors (ANN)
Dataset: $n$ points in a metric space $X$ (denote the dataset by $P$)
Approximation $c > 1$, distance threshold $r > 0$
Query: $q \in X$ such that there is $p^* \in P$ with $d_X(q, p^*) \le r$
Want: $p \in P$ such that $d_X(q, p) \le cr$
Parameters: space, query time
[Figure: a query $q$ with balls of radius $r$ and $cr$ around it]
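To make the definition concrete, here is a minimal baseline sketch: a linear scan that answers a $(c, r)$-ANN query in time proportional to $n$. It is not the data structure from the talk; the names `linear_scan_ann` and `l1_distance`, and the choice of $l_1$ as the example metric, are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

Point = Sequence[float]


def l1_distance(a: Point, b: Point) -> float:
    """Manhattan distance, used here only as an example metric."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def linear_scan_ann(
    points: Sequence[Point],
    query: Point,
    r: float,
    c: float,
    dist: Callable[[Point, Point], float] = l1_distance,
) -> Optional[Point]:
    """Return some point within distance c*r of the query, or None.

    If any point lies within distance r of the query, a point is always
    returned, which is exactly the ANN guarantee.
    """
    for p in points:
        if dist(query, p) <= c * r:
            return p
    return None


dataset = [(0.0, 0.0), (5.0, 5.0), (1.0, 2.0)]
# (1.0, 2.0) is within r = 1 of the query, so some point within c*r = 2 must come back.
answer = linear_scan_ann(dataset, query=(1.0, 1.0), r=1.0, c=2.0)
far_answer = linear_scan_ann(dataset, query=(100.0, 100.0), r=1.0, c=2.0)
```

Note that the scan may return a point that is not the closest one: any point within distance $cr$ is an acceptable answer.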
FAQ
Q: Why approximation? A: The exact case is hard for the high-dimensional problem.
Q: What does “high-dimensional” mean? A: $d = \omega(\log n)$, where $d$ is the dimension of the metric.
Q: How is the dimension defined? A: A metric is typically defined on $\mathbb{R}^d$; alternatively, doubling dimension, etc.
The dependence on $d$ must be $2^{o(d)}$, ideally $d^{O(1)}$.
Focus of this talk: a metric on $\mathbb{R}^d$, where $\omega(\log n) \le d \le n^{o(1)}$.
Which distance function to use?
A distance function:
Must capture semantic similarity well
Must be algorithmically tractable
Word embeddings, etc.: cosine similarity
The goal: classify metrics according to the complexity of high-dimensional ANN
For theory: a poorly understood property of a metric
For practice: a universal algorithm for ANN
High-dimensional norms
An important case: $X$ is a normed space
$d_X(x_1, x_2) = \|x_1 - x_2\|$, where $\|\cdot\| : \mathbb{R}^d \to \mathbb{R}_+$ is such that:
$\|x\| = 0$ iff $x = 0$
$\|\alpha x\| = |\alpha| \cdot \|x\|$
$\|x_1 + x_2\| \le \|x_1\| + \|x_2\|$
Lots of tools (linear functional analysis)
[Andoni, Krauthgamer, R 2015] characterizes norms that allow efficient sketching (succinct summarization), which implies efficient ANN
Approximation $O(\sqrt{d})$ is easy (John's theorem)
Unit balls
A norm can be given by its unit ball $B_X = \{x \in \mathbb{R}^d : \|x\| \le 1\}$
Claim: $B_X$ is a symmetric convex body
Claim: any such body $K$ can be a unit ball: $\|x\|_K = \inf\{t > 0 : x/t \in K\}$
What property of a convex body makes ANN with respect to it tractable?
John's theorem: any symmetric convex body is close to an ellipsoid (gives approximation $\sqrt{d}$)
[Figure: a unit ball $B_X$ and a point $x$ with $\|x\| \approx 2$]
Our result
If $X$ is a symmetric normed space (the norm is invariant under permutations of coordinates and sign changes) and $d = n^{o(1)}$, can solve ANN with:
Approximation $(\log\log n)^{O(1)}$
Space $n^{1+o(1)}$
Query time $n^{o(1)}$
Slide annotation on the dimension: $d = 2^{o(\log n / \log\log n)}$
Examples
Usual $l_p$ norms: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
Top-$k$ norm: the sum of the $k$ largest absolute values of the coordinates; interpolates between $l_1$ ($k = d$) and $l_\infty$ ($k = 1$)
Orlicz norms: the unit ball is $\{x \in \mathbb{R}^d : \sum_i G(|x_i|) \le 1\}$, where $G(\cdot)$ is convex and non-negative, and $G(0) = 0$; gives $l_p$ norms for $G(t) = t^p$
$k$-support norm, box-$\Theta$ norm, $K$-functional (arise in probability and machine learning)
[Figure: plot of $G(t)$ against $t$]
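The two less familiar examples are easy to compute directly. The sketch below evaluates the top-$k$ norm by sorting and an Orlicz norm by bisection on the defining inequality; the function names and the bisection approach are illustrative assumptions, not something taken from the talk.

```python
from typing import Callable, Sequence


def top_k_norm(x: Sequence[float], k: int) -> float:
    """Sum of the k largest absolute values of the coordinates of x."""
    return sum(sorted((abs(v) for v in x), reverse=True)[:k])


def orlicz_norm(x: Sequence[float], G: Callable[[float], float],
                iters: int = 60) -> float:
    """Smallest t > 0 with sum_i G(|x_i| / t) <= 1, found by bisection.

    Assumes G is convex, non-negative, G(0) = 0 and G is not identically
    zero, so the sum is non-increasing in t.
    """
    if all(v == 0 for v in x):
        return 0.0

    def inside(t: float) -> bool:
        return sum(G(abs(v) / t) for v in x) <= 1.0

    hi = 1.0
    while not inside(hi):      # grow until x / hi is inside the unit ball
        hi *= 2.0
    lo = 0.0
    for _ in range(iters):     # invariant: x / hi is inside, x / lo is not
        mid = (lo + hi) / 2.0
        if inside(mid):
            hi = mid
        else:
            lo = mid
    return hi


x = [3.0, -1.0, 2.0]
top1 = top_k_norm(x, 1)        # k = 1 is the l_infinity norm
top2 = top_k_norm(x, 2)
top3 = top_k_norm(x, 3)        # k = d is the l_1 norm
l2_via_orlicz = orlicz_norm([3.0, 4.0], lambda t: t * t)  # G(t) = t^2 gives l_2
```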
Prior work: symmetric norms
[Blasiok, Braverman, Chestnut, Krauthgamer, Yang 2015]: classification of symmetric norms according to their streaming complexity
The complexity depends on how well the norm concentrates on the Euclidean ball
Unlike streaming, ANN is always tractable
Prior work: ANN
Mostly, the focus is on the $l_1$ (Hamming/Manhattan) and $l_2$ (Euclidean) norms
They work for many applications
They allow efficient algorithms based on hashing:
Locality-Sensitive Hashing [Indyk, Motwani 1998] [Andoni, Indyk 2006]
Data-dependent LSH [Andoni, Indyk, Nguyen, R 2014] [Andoni, R 2015]
[Andoni, Laarhoven, R, Waingarten 2017]: tight trade-off between space and query time for every $c > 1$
Few results for other norms ($l_\infty$, general $l_p$; will see later)
ANN for $l_\infty$
[Indyk 1998]: ANN for $d$-dimensional $l_\infty$ with:
Space $d \cdot n^{1+\varepsilon}$
Query time $O(d \log n)$
Approximation $O(\varepsilon^{-1} \cdot \log\log d)$
Main idea: recursive partitioning
A “small” ball with $\Omega(n)$ points: easy
No such ball: there is a “good” cut with respect to some coordinate
[Andoni, Croitoru, Patrascu 2008] [Kapralov, Panigrahy 2012]: approximation $O(\log\log d)$ is tight for decision trees!
Metric embeddings
A map $f : X \to Y$ is an embedding with distortion $C$ if for all $a, b \in X$:
$d_Y(f(a), f(b)) / C \le d_X(a, b) \le d_Y(f(a), f(b))$
Reductions for geometric problems: ANN with approximation $D$ for $Y$ implies ANN with approximation $CD$ for $X$
[Figure: $f$ maps points $a, b \in X$ to $f(a), f(b) \in Y$]
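The reduction follows from the two sides of the distortion inequality. A short derivation, assuming the data structure for $Y$ is queried with the rescaled threshold $Cr$ (the rescaling is the only ingredient not stated on the slide):

```latex
\begin{align*}
d_X(q, p^*) \le r
  &\;\Longrightarrow\; d_Y(f(q), f(p^*)) \le C \cdot d_X(q, p^*) \le C r, \\
\text{$D$-ANN in $Y$ with threshold $Cr$}
  &\;\Longrightarrow\; \text{returns } f(p) \text{ with } d_Y(f(q), f(p)) \le D \cdot C r, \\
d_X(q, p) \le d_Y(f(q), f(p))
  &\;\Longrightarrow\; d_X(q, p) \le C D \cdot r .
\end{align*}
```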
Embedding norms into $l_\infty$
For a normed space $X$ and $\varepsilon > 0$ there exists $f : X \to l_\infty^{d'}$ with $\|f(x)\|_\infty \in (1 \pm \varepsilon) \cdot \|x\|_X$
Proof idea: $\|x\|_X \approx \max_{y \in N} \langle x, y \rangle$; take all directions and discretize (more details later)
Can we combine it with ANN for $l_\infty$ and obtain ANN for any norm?
No! Discretization requires $d' = (1/\varepsilon)^{O(d)}$, and this is tight even for $l_2$
Approximation: $O(\log\log d') = O(\log d) \ll \sqrt{d}$
The strategy
What | Where | Dimension
Any norm | $l_\infty^{d'}$ | $d' = 2^{O(d)}$
Symmetric norm | $\bigoplus_{l_\infty} \bigoplus_{l_1}$ of top-$k_{ij}$ norms | $d' = d^{O(\log\log d)}$
Bypass non-embeddability into low-dimensional $l_\infty$ by allowing a more complicated host space, which is still tractable
$l_p$-direct sums of metric spaces
For metrics $M_1, M_2, \ldots, M_t$, define $\bigoplus_{l_p} M_i$ as follows:
The ground set is $M_1 \times M_2 \times \ldots \times M_t$
The distance is $d((x_1, x_2, \ldots, x_t), (y_1, y_2, \ldots, y_t)) = \|(d(x_1, y_1), d(x_2, y_2), \ldots, d(x_t, y_t))\|_p$
Example: $\bigoplus_{l_p} l_q$ (cascaded norms)
Our host space: $\bigoplus_{l_\infty} \bigoplus_{l_1} X_{ij}$, where $X_{ij}$ is $\mathbb{R}^d$ equipped with the top-$k_{ij}$ norm
The outer sum is of size $d^{O(\log\log d)}$
The inner sum is of size $d$
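The direct-sum distance composes naturally as a higher-order function. The following sketch builds a tiny instance of the host space (an $l_\infty$-sum of $l_1$-sums of top-$k$ metrics); the helper names and the toy sizes are assumptions chosen for illustration.

```python
from typing import Callable, Sequence

Metric = Callable[[object, object], float]


def lp_norm(v: Sequence[float], p: float) -> float:
    """The l_p norm of a vector; p = float('inf') gives the max norm."""
    if p == float("inf"):
        return max(abs(c) for c in v)
    return sum(abs(c) ** p for c in v) ** (1.0 / p)


def direct_sum(metrics: Sequence[Metric], p: float) -> Metric:
    """Metric on tuples: the l_p norm of the vector of component-wise distances."""
    def dist(xs, ys):
        return lp_norm([m(x, y) for m, x, y in zip(metrics, xs, ys)], p)
    return dist


def top_k_metric(k: int) -> Metric:
    """Distance induced by the top-k norm on R^d."""
    def dist(x, y):
        diffs = sorted((abs(a - b) for a, b in zip(x, y)), reverse=True)
        return sum(diffs[:k])
    return dist


# Inner l_1-sum of a top-1 and a top-2 metric on R^2; outer l_infinity-sum of two copies.
inner = direct_sum([top_k_metric(1), top_k_metric(2)], p=1)
host = direct_sum([inner, inner], p=float("inf"))

x = (((1, 2), (3, 4)), ((0, 0), (0, 0)))
y = (((0, 0), (0, 0)), ((1, 1), (1, 1)))
# First block: top-1 gives 2, top-2 gives 7, l_1-sum 9. Second block: 1 + 2 = 3.
distance = host(x, y)
```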
Two necessary steps
Embed a symmetric norm into $\bigoplus_{l_\infty} \bigoplus_{l_1} X_{ij}$
Solve ANN for $\bigoplus_{l_\infty} \bigoplus_{l_1} X_{ij}$
Prior work on ANN via product spaces: Frechet distance [Indyk 2002], edit distance [Indyk 2004], and Ulam distance [Andoni, Indyk, Krauthgamer 2009]
ANN for $\bigoplus_{l_\infty} \bigoplus_{l_1} X_{ij}$
[Indyk 2002], [Andoni 2009]: if there are data structures for $c$-ANN for $M_1, M_2, \ldots, M_t$, then for $\bigoplus_{l_p} M_i$ one can get $O(c \log\log n)$-ANN with almost the same time and space
A powerful generalization of ANN for $l_\infty$ [Indyk 1998]
Trivially implies ANN for general $l_p$
Thus, it is enough to handle ANN for $X_{ij}$ (top-$k$ norms)!
ANN for top-$k$ norms
Top-$k$ norms include $l_1$ and $l_\infty$; thus, we need a unified approach
Idea: embed a top-$k$ norm into $l_\infty^{d'}$ and use [Indyk 1998]
Approximation: distortion $\times\ O(\log\log d')$
Problem: $l_1$ requires $2^{\Omega(d)}$-dimensional $l_\infty$
Solution: use randomized embeddings
Embedding top-$k$ norm into $l_\infty$
The case $k = d$ (that is, $l_1$)
Embedding (uses min-stability of the exponential distribution):
Sample i.i.d. $u_1, u_2, \ldots, u_d \sim \mathrm{Exp}(1)$
Embed $f : (x_1, x_2, \ldots, x_d) \mapsto (x_1/u_1, x_2/u_2, \ldots, x_d/u_d)$
$\Pr[\|f(x)\|_\infty \le t] = \prod_i \Pr[|x_i|/u_i \le t] = \prod_i e^{-|x_i|/t} = e^{-\|x\|_1/t}$
Constant distortion w.h.p. (in reality: slightly different parameters)
General $k$: sample $u_i \sim \max\{1/k, \mathrm{Exp}(1)\}$
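The probability identity above can be checked empirically. The sketch below implements the embedding with the Python standard library and estimates $\Pr[\|f(x)\|_\infty \le \|x\|_1]$, which the formula predicts to be $e^{-1}$. The function names, the test vector, and the number of trials are assumptions for illustration; the parameters are the simplified ones from the slide, not the "slightly different" ones used in reality.

```python
import math
import random
from typing import List, Optional, Sequence


def exp_embed(x: Sequence[float], rng: random.Random,
              k: Optional[int] = None) -> List[float]:
    """Map x to (x_1/u_1, ..., x_d/u_d).

    u_i ~ Exp(1) when k is None (the l_1 case); otherwise u_i is truncated
    from below: u_i = max(1/k, Exp(1)).
    """
    out = []
    for xi in x:
        u = rng.expovariate(1.0)
        if k is not None:
            u = max(1.0 / k, u)
        while u == 0.0:            # essentially impossible; avoids division by zero
            u = rng.expovariate(1.0)
        out.append(xi / u)
    return out


def linf(v: Sequence[float]) -> float:
    return max(abs(c) for c in v)


rng = random.Random(0)
x = [3.0, -1.0, 0.5, 2.5]
l1 = sum(abs(c) for c in x)        # the l_1 norm of x is 7
trials = 20000
hits = sum(linf(exp_embed(x, rng)) <= l1 for _ in range(trials))
empirical = hits / trials          # estimate of Pr[ ||f(x)||_inf <= ||x||_1 ]
predicted = math.exp(-1.0)         # value given by the formula with t = ||x||_1
```

With truncation (`k` given), each coordinate can grow by at most a factor of $k$, which is what prevents a single large coordinate from dominating more than the top-$k$ norm allows.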
Detour: ANN for Orlicz norms
Reminder: for convex $G : \mathbb{R} \to \mathbb{R}_+$ with $G(0) = 0$, define a norm whose unit ball is $\{x : \sum_i G(|x_i|) \le 1\}$ (e.g., $G(t) = t^p$ gives $l_p$ norms)
Embedding into $l_\infty$ (as before, $O(1)$ distortion w.h.p.):
Sample i.i.d. $u_1, u_2, \ldots, u_d \sim \mathcal{D}$, where $\Pr_{X \sim \mathcal{D}}[X \le t] = 1 - e^{-G(t)}$
Embed $f : (x_1, x_2, \ldots, x_d) \mapsto (x_1/u_1, x_2/u_2, \ldots, x_d/u_d)$
A special case for $l_p$ norms appeared in [Andoni 2009]
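One way to sample from $\mathcal{D}$ is inverse-CDF sampling: if $E \sim \mathrm{Exp}(1)$ and $G$ is increasing on $[0, \infty)$, then $u = G^{-1}(E)$ satisfies $\Pr[u \le t] = \Pr[E \le G(t)] = 1 - e^{-G(t)}$. The sketch below uses this for $G(t) = t^2$ (the $l_2$ norm) and checks that $\Pr[\|f(x)\|_\infty \le \|x\|_2]$ is close to $e^{-1}$. The sampling method and all names are assumptions, not details taken from the talk.

```python
import math
import random
from typing import Callable, List, Sequence


def sample_u(G_inverse: Callable[[float], float], rng: random.Random) -> float:
    """Inverse-CDF sampling: u = G^{-1}(E) with E ~ Exp(1) has CDF 1 - exp(-G(t))."""
    return G_inverse(rng.expovariate(1.0))


def orlicz_embed(x: Sequence[float], G_inverse: Callable[[float], float],
                 rng: random.Random) -> List[float]:
    """Map x to (x_1/u_1, ..., x_d/u_d) with u_i drawn i.i.d. from D."""
    out = []
    for xi in x:
        u = 0.0
        while u == 0.0:            # resample in the essentially impossible case u = 0
            u = sample_u(G_inverse, rng)
        out.append(xi / u)
    return out


rng = random.Random(1)
x = [3.0, 4.0]                     # l_2 norm is 5
norm = 5.0
trials = 20000
hits = sum(
    max(abs(c) for c in orlicz_embed(x, math.sqrt, rng)) <= norm
    for _ in range(trials)
)
empirical = hits / trials
# Pr[max_i |x_i|/u_i <= t] = exp(-sum_i G(|x_i|/t)), which equals exp(-1) at t = ||x||.
predicted = math.exp(-1.0)
```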
Where are we?
Can solve ANN for $\bigoplus_{l_\infty} \bigoplus_{l_1} X_{ij}$, where $X_{ij}$ is $\mathbb{R}^d$ equipped with a top-$k_{ij}$ norm
What remains to be done?
Embed a $d$-dimensional symmetric norm into the ($d^{O(\log\log d)}$-dimensional) space $\bigoplus_{l_\infty} \bigoplus_{l_1} X_{ij}$
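Starting point: embedding any norm into $l_\infty$
For a normed space $X$ and $\varepsilon > 0$ there is a linear $f : X \to l_\infty^{d'}$ such that $\|f(x)\|_\infty \in (1 \pm \varepsilon) \cdot \|x\|_X$
The normed space $X^*$ dual to $X$: $\|y\|_{X^*} = \sup_{\|x\|_X \le 1} \langle x, y \rangle$
The dual of $l_p$ is $l_q$, where $1/p + 1/q = 1$ ($l_1$ vs. $l_\infty$, $l_2$ vs. $l_2$, etc.)
Claim: for every $x \in X$, $\|x\|_X \in (1 \pm \varepsilon) \cdot \max_{y \in N} |\langle x, y \rangle|$, where $N$ is an $\varepsilon$-net of $B_{X^*}$ (with respect to $X^*$)
This immediately gives an embedding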
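Proof
For every $y \in N$, $|\langle x, y \rangle| \le \|x\|_X \cdot \|y\|_{X^*} \le \|x\|_X$; thus, $\max_{y \in N} |\langle x, y \rangle| \le \|x\|_X$
There exists $y$ such that $\|y\|_{X^*} \le 1$ and $\langle x, y \rangle = \|x\|_X$ (non-trivial, requires the Hahn–Banach theorem)
Move $y$ to the closest $y' \in N$ and get $\langle x, y' \rangle \ge (1 - \varepsilon) \cdot \|x\|_X$
Thus, $\max_{y \in N} |\langle x, y \rangle| \ge (1 - \varepsilon) \cdot \|x\|_X$
Can take $|N| = (1/\varepsilon)^{O(d)}$ by the volume argument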
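A two-dimensional illustration of the net argument: for $l_2$ in the plane (which is self-dual), taking $m$ equally spaced unit vectors as the net gives a linear map into $l_\infty^m$ whose norm is within a factor $\cos(\pi/m)$ of $\|x\|_2$. This is a toy sketch under those assumptions; it uses only points on the unit circle rather than a net of the whole dual ball, which suffices because the maximum of a linear functional is attained on the boundary.

```python
import math
from typing import List, Sequence, Tuple


def circle_net(m: int) -> List[Tuple[float, float]]:
    """m equally spaced unit vectors; every direction is within angle pi/m of one of them."""
    return [(math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
            for j in range(m)]


def embed_into_linf(x: Sequence[float],
                    net: Sequence[Tuple[float, float]]) -> List[float]:
    """The linear map x -> (<x, y>) for y in the net."""
    return [x[0] * y[0] + x[1] * y[1] for y in net]


m = 64
net = circle_net(m)
x = (3.0, 4.0)
true_norm = math.hypot(*x)                       # Euclidean norm of x
approx = max(abs(c) for c in embed_into_linf(x, net))
lower_bound = math.cos(math.pi / m) * true_norm  # guaranteed by the net spacing
```

The number of net points needed for a fixed accuracy grows exponentially with the dimension, which is the obstacle the rest of the talk works around.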
Better embeddings for symmetric norms
Recap: can't embed even $l_2^d$ into $l_\infty^{d'}$ unless $d' = 2^{\Omega(d)}$
Instead, aim at embedding a symmetric norm into $\bigoplus_{l_\infty} \bigoplus_{l_1} X_{ij}$
High-level idea: the new space is more forgiving and allows us to consider an $\varepsilon$-net of $B_{X^*}$ up to symmetry
Show that there is an $\varepsilon$-net that is the result of applying symmetries to merely $d^{O(\log\log d)}$ vectors!
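Exploiting symmetry
For a vector $x$, $\pi \in S_d$, and $\sigma \in \{-1, 1\}^d$, let $x_{\pi,\sigma}$ be $x$ with coordinates permuted according to $\pi$ and signs flipped according to $\sigma$
Recap: $\|x_{\pi,\sigma}\|_X = \|x\|_X$
Suppose that $N'$ is an $\varepsilon$-net for $B_{X^*} \cap \mathcal{L}$, where $\mathcal{L} = \{y : y_1 \ge y_2 \ge \ldots \ge y_d \ge 0\}$
Then:
$\|x\|_X \in (1 \pm \varepsilon) \cdot \max_{y \in N', \pi, \sigma} \langle x, y_{\pi,\sigma} \rangle$
$= (1 \pm \varepsilon) \cdot \max_{y \in N'} \max_{\pi,\sigma} \langle x, y_{\pi,\sigma} \rangle$
$= (1 \pm \varepsilon) \cdot \max_{y \in N'} \langle x^{\downarrow}, y \rangle$, where $x^{\downarrow}$ denotes the absolute values of the coordinates of $x$ sorted in non-increasing order
$= (1 \pm \varepsilon) \cdot \max_{y \in N'} \sum_k (y_k - y_{k+1}) \cdot (\text{top-}k \text{ norm of } x)$, with the convention $y_{d+1} = 0$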
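The key identity, that the maximum over all permutations and sign flips equals a non-negative combination of top-$k$ norms, can be verified by brute force in small dimension. A minimal sketch with hypothetical helper names; it uses the convention $y_{d+1} = 0$ and integer vectors so that the comparison is exact.

```python
from itertools import permutations, product
from typing import Sequence


def top_k_norm(x: Sequence[float], k: int) -> float:
    """Sum of the k largest absolute values of the coordinates of x."""
    return sum(sorted((abs(v) for v in x), reverse=True)[:k])


def max_over_symmetries(x: Sequence[float], y: Sequence[float]) -> float:
    """Brute-force maximum of <x, y_{pi,sigma}> over all permutations and sign flips."""
    d = len(x)
    best = None
    for perm in permutations(range(d)):
        for signs in product((-1, 1), repeat=d):
            val = sum(x[i] * signs[i] * y[perm[i]] for i in range(d))
            best = val if best is None else max(best, val)
    return best


def top_k_combination(x: Sequence[float], y: Sequence[float]) -> float:
    """sum_k (y_k - y_{k+1}) * top_k(x) for y sorted non-increasing, with y_{d+1} = 0."""
    d = len(y)
    padded = list(y) + [0]
    return sum((padded[k - 1] - padded[k]) * top_k_norm(x, k)
               for k in range(1, d + 1))


x = [2, -5, 1, 3]
y = [4, 3, 3, 1]                       # sorted, non-negative: an element of the cone L
brute = max_over_symmetries(x, y)      # 4! * 2^4 = 384 candidates
combo = top_k_combination(x, y)        # (4-3)*5 + (3-3)*8 + (3-1)*10 + (1-0)*11
```

Because the coefficients $y_k - y_{k+1}$ are non-negative, each net vector contributes one $l_1$-sum of scaled top-$k$ norms, and the maximum over the net gives the outer $l_\infty$-sum.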
Small nets
What remains to be done: an $\varepsilon$-net for $B_{X^*} \cap \{y : y_1 \ge y_2 \ge \ldots \ge y_d \ge 0\}$ of size $d^{O_\varepsilon(\log\log d)}$
Will see a weaker bound of $d^{O_\varepsilon(\log d)}$, still non-trivial
The volume bound fails
Instead, a simple explicit construction
Small nets: continued
Want to approximate a vector $y \in B_{X^*}$ with $y_1 \ge y_2 \ge \ldots \ge y_d \ge 0$
Zero all $y_i$'s that are smaller than $y_1 / \mathrm{poly}(d)$
Round all coordinates to the nearest power of $1 + \varepsilon$
This leaves $O_\varepsilon(\log d)$ scales
Only the cardinality of each scale matters
$d^{O_\varepsilon(\log d)}$ vectors total
Can be improved to $d^{O_\varepsilon(\log\log d)}$ by one more trick
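The rounding steps above can be sketched as follows. The concrete cutoff $y_1 / d^2$ for "poly $d$", rounding to the nearest power in logarithmic terms, and the function name are assumptions made for illustration. The returned signature records how many coordinates land in each scale, which is all that matters for a sorted vector.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def round_to_net(y: Sequence[float], eps: float,
                 cutoff_power: int = 2) -> Tuple[List[float], Counter]:
    """Round a sorted vector y_1 >= ... >= y_d >= 0 to a net point.

    Returns the rounded vector and its signature: a map from the exponent j
    of the scale (1 + eps)^j to the number of coordinates rounded to it.
    """
    d = len(y)
    if d == 0 or y[0] == 0:
        return [0.0] * d, Counter()
    threshold = y[0] / d ** cutoff_power   # "poly d" is taken to be d^2 (assumption)
    rounded: List[float] = []
    signature: Counter = Counter()
    for v in y:
        if v < threshold:
            rounded.append(0.0)            # tiny coordinates are zeroed out
            continue
        j = round(math.log(v) / math.log(1.0 + eps))
        rounded.append((1.0 + eps) ** j)
        signature[j] += 1
    return rounded, signature


y_example = [1.0, 0.5, 0.26, 0.001]
eps_example = 0.1
rounded_example, signature_example = round_to_net(y_example, eps_example)
```

Since the surviving coordinates span a factor of at most $\mathrm{poly}(d)$, the exponents $j$ take $O_\varepsilon(\log d)$ distinct values, and a sorted net vector is determined by the counts in `signature_example`.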
Quick summary
Embed a symmetric norm into a $d^{O(\log\log d)}$-dimensional product space of top-$k$ norms
Use known techniques to reduce the ANN problem on the product space to ANN for the top-$k$ norm
Use truncated exponential random variables to embed the top-$k$ norm into $l_\infty$, and use a known ANN data structure there
Two immediate open questions
Improve the dependence on $d$ from $d^{O(\log\log d)}$ to $d^{O(1)}$
Better $\varepsilon$-net for $B_{X^*} \cap \{y : y_1 \ge y_2 \ge \ldots \ge y_d \ge 0\}$
Looks doable
Improve the approximation from $(\log\log n)^{O(1)}$ to $O(\log\log d)$
Going beyond $\log\log d$ is hard due to $l_\infty$
Need to bypass ANN for product spaces
Maybe a randomized embedding into low-dimensional $l_\infty$ for any symmetric norm?
General norms
There exists an embedding into $l_2$ with distortion $O(\sqrt{d})$
There is a universal $d^{O(\log\log d)}$-dimensional space that can host all $d$-dimensional symmetric norms
This is impossible for general norms, even for randomized embeddings: even distortion $d^{0.49}$ requires dimension $2^{d^{\Omega(1)}}$
Stronger hardness results?
Implied by: there is a family of spectral expanders that embed with distortion $O(1)$ into some $\log^{O(1)} n$-dimensional norm, where $n$ is the number of nodes [Naor 2016]
The main open question
Is there an efficient ANN algorithm for general high-dimensional norms with approximation $d^{o(1)}$?
There is hope…
Thanks!