Dimension Reduction in the Hamming Cube (and its Applications) Rafail Ostrovsky UCLA (joint works with Rabani; and Kushilevitz and Rabani)

2 PLAN  Problem Formulations  Communication complexity game  What really happened? (dimension reduction)  Solutions to 2 problems –ANN –k-clustering  What’s next?

3 Problem statements  Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion  Q: how do we do it for the Hamming cube?  (we show how to avoid the impossibility result of [Charikar-Sahai])
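
For reference, the Johnson-Lindenstrauss lemma in its standard form (added here for context; the slide states it only informally): for any 0 < ε < 1 and any n points x_1, ..., x_n in l_2 there is a map f : \ell_2 \to \mathbb{R}^k with k = O(\varepsilon^{-2}\log n) such that

\[
(1-\varepsilon)\,\lVert x_i - x_j\rVert_2 \;\le\; \lVert f(x_i) - f(x_j)\rVert_2 \;\le\; (1+\varepsilon)\,\lVert x_i - x_j\rVert_2 \qquad \text{for all } i,j.
\]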

4 Many different formulations of ANN  ANN – “approximate nearest neighbor search”  (many applications in computational geometry, biology/stringology, IR, other areas)  Here are different formulations:

5 Approximate Searching  Motivation: given a DB of “names” and a user with a “target” name, find whether any of the DB names are “close” to the target name, without doing a linear scan. (Example DB: Jon, Alice, Bob, Eve, Panconesi, Kate, Fred; query: A.Panconesi?)

6 Geometric formulation  Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in R^d), store all these points somehow

7 Data structure question  Given a new red point, find the closest blue point. Naive solution 1: store the blue points “as is” and, when given a red point, measure distances to all blue points. Q: can we do better?

8 Can we do better?  Easy in small dimensions (Voronoi diagrams)  “Curse of dimensionality” in high dimensions…  [KOR]: Can get a good “approximate” solution efficiently!

9 Hamming Cube Formulation for ANN  Given a DB of N blue n-bit strings, process them somehow. Given an n-bit red string, find its ANN in the hypercube {0,1}^n  Naïve solution 2: pre-compute all (exponentially many) answers (we want small data structures!)

10 Clustering problem that I’ll discuss in detail  k-clustering

11 An example of Clustering – find “centers”  Given N points in R^d

12 A clustering formulation  Find cluster “centers”

13 Clustering formulation  The “cost” is the sum of distances

14 Main technique  First, as a communication game  Second, interpreted as a dimension reduction

15 COMMUNICATION COMPLEXITY GAME  Given two players Alice and Bob,  Alice is secretly given string x  Bob is secretly given string y  they want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness  How can they do it? (say |x| = |y| = n)  Much easier: how do we check that x = y?
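
For the warm-up question (checking whether x = y), a standard random-parity fingerprint already works. Below is a minimal Python sketch (illustrative only; the function name and the use of a shared seed to model the common random string are my own choices, not from the talk):

import random

def equality_check(x, y, reps=40, seed=0):
    """Randomized fingerprint for x = y (x, y are equal-length lists of 0/1 bits).
    Alice and Bob pick the SAME random subset of coordinates (shared randomness,
    modeled here by a shared seed) and each sends the parity of their string on it.
    If x != y, the two parities disagree with probability exactly 1/2 per repetition,
    so after `reps` repetitions the error probability is 2**(-reps).
    Communication: `reps` bits in each direction."""
    rng = random.Random(seed)  # stands in for the common random string
    for _ in range(reps):
        subset = [i for i in range(len(x)) if rng.random() < 0.5]
        if sum(x[i] for i in subset) % 2 != sum(y[i] for i in subset) % 2:
            return False  # a disagreement proves x != y
    return True  # equal with probability >= 1 - 2**(-reps)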

16 Main lemma: an abstract game  How can Alice and Bob estimate the Hamming distance between X and Y with small CC?  We assume Alice and Bob share randomness. (Figure: Alice holds X = X_1 X_2 ... X_n; Bob holds Y = Y_1 Y_2 ... Y_n.)

17 A simpler question  To estimate the Hamming distance between X and Y (within a factor (1+ε)) with small CC, it suffices for Alice and Bob, for any L, to be able to distinguish between: – H(X,Y) <= L OR – H(X,Y) > (1+ε) L  Q: why doesn’t sampling coordinates work? (A small sample will likely miss all the differing positions when H(X,Y) is small.)

18 Alice and Bob pick the SAME random n-bit string R, where each bit of R is 1 independently with probability 1/(2L). Each player XORs together the bits of their own string at the positions selected by R, producing a single output bit (0/1) for X and for Y.

19 What is the difference in probabilities?  Compare the probability that the two output bits differ in the case H(X,Y) <= L against the case H(X,Y) > (1+ε) L.
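
A short calculation of this gap (added for clarity; the slide only poses the question). With p = 1/(2L) and h = H(X,Y), the two output bits differ exactly when R selects an odd number of the h coordinates where X and Y differ:

\[
\Pr[\text{bits differ}] \;=\; \frac{1 - (1 - 2p)^{h}}{2} \;=\; \frac{1 - (1 - 1/L)^{h}}{2}.
\]

This is at most roughly (1 - e^{-1})/2 when h <= L, and at least roughly (1 - e^{-(1+\varepsilon)})/2 when h >= (1+ε)L, so the two cases are separated by a gap of Θ(ε).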

20 How do we amplify?

21 How do we amplify? – Repeat with many independent R’s, all drawn from the same distribution!

22 A refined game with small communication  How can Alice and Bob distinguish – H(X,Y) <= L OR – H(X,Y) > (1+ε) L?  ALICE: for each R, XOR the subset of the X_i’s selected by R. BOB: for each R, XOR the same subset of the Y_i’s. Compare the outputs.  Pick 1/ε^2 · log N R’s with the correct distribution and compare the outputs of this linear transformation.
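
A minimal Python sketch of this distance-estimation game (illustrative only: the function names, the number of repetitions, and the decision threshold are my own choices; the threshold is simply the midpoint of the two limiting disagreement rates computed above, not the exact constant from the talk):

import math
import random

def hamming_sketch(x, L, seeds):
    """Output bits for string x (list of 0/1): for each shared seed, draw an n-bit
    vector R whose bits are 1 independently with probability 1/(2L) and output the
    XOR (parity) of the coordinates of x selected by R."""
    p = 1.0 / (2 * L)
    bits = []
    for s in seeds:
        rng = random.Random(s)  # shared seed -> Alice and Bob draw the identical R
        parity = 0
        for i in range(len(x)):
            if rng.random() < p:
                parity ^= x[i]
        bits.append(parity)
    return bits

def looks_close(x, y, L, eps=0.25, num_reps=500):
    """Decide 'H(x,y) <= L' versus 'H(x,y) > (1+eps)*L' by comparing the two sketches.
    The empirical disagreement rate concentrates around (1 - (1 - 1/L)**H(x,y)) / 2,
    which is smaller in the first case; we threshold at the midpoint of the two
    limiting rates (1 - e^{-1})/2 and (1 - e^{-(1+eps)})/2."""
    seeds = list(range(num_reps))
    a = hamming_sketch(x, L, seeds)
    b = hamming_sketch(y, L, seeds)
    disagree = sum(ai != bi for ai, bi in zip(a, b)) / num_reps
    threshold = (2 - math.exp(-1) - math.exp(-(1 + eps))) / 4
    return disagree < threshold

Taking num_reps on the order of log N / ε^2 and applying a Chernoff bound makes the answer correct for all N database strings simultaneously with high probability, which is the amplification the slide refers to.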

23 Dimension Reduction in the Hamming Cube [OR] For each L, we can pick O(log N) R’s and boost the probabilities! Key Property: we get an embedding from the large cube to a small cube that preserves ranges around L very well.

24 Dimension Reduction in the Hamming Cube [OR] For each L, we can pick O(log N) R’s and boost the probabilities! Key Property: we get an embedding from the large cube to a small cube that preserves ranges around L. Key idea in applications: we can build an inverse lookup table for the small cube!

25 Applications  Applications of the dimension reduction in the Hamming cube  For ANN in the Hamming cube and R^d  For k-clustering

26 Application to ANN in the Hamming Cube  For each possible L, build a “small cube” and project the original DB into it  Pre-compute an inverse table for each entry of the small cube.  Why is this efficient?  How do we answer any query?  How do we navigate between different L?

27 Putting it all together: user’s private approx. search from DB  Each projection is O(log N) R’s. The user picks many such projections for each L-range; that defines all the embeddings.  Now, the DB builds inverse lookup tables for each projection as new DB’s, one for each L.  The user can now “project” its query into the small cube and use binary search on L
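
A hedged Python sketch of this data structure (my own simplification for illustration: the parameters num_bits and the set of scales are arbitrary, the query inspects only the exact bucket of the projected query rather than nearby small-cube points, and the binary search over L is a simplified version of the “navigation” the slide alludes to):

import random
from collections import defaultdict

def make_projection(n, L, num_bits, seed):
    """One projection for distance scale L: num_bits random subsets of [n], each
    coordinate included independently with probability 1/(2L). (A simplified stand-in
    for the [OR] embedding; constants and error bounds are not tuned here.)"""
    rng = random.Random(seed)
    return [[i for i in range(n) if rng.random() < 1.0 / (2 * L)]
            for _ in range(num_bits)]

def project(x, projection):
    """Map an n-bit string x (sequence of 0/1) to a point of the small cube."""
    return tuple(sum(x[i] for i in subset) % 2 for subset in projection)

def build_tables(db, scales, num_bits=32, seed=0):
    """For each scale L, project the whole DB and build an inverse lookup table
    from small-cube points to the indices of DB strings that map there."""
    n = len(db[0])
    tables = {}
    for L in scales:
        proj = make_projection(n, L, num_bits, seed + L)
        table = defaultdict(list)
        for idx, point in enumerate(db):
            table[project(point, proj)].append(idx)
        tables[L] = (proj, table)
    return tables

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def query(q, db, tables, scales):
    """Binary search over the scales: find the smallest L whose bucket for the
    projected query is non-empty and return the closest DB string in that bucket.
    (Simplified: the real [KOR] structure also inspects nearby small-cube points
    and handles the navigation between scales more carefully.)"""
    lo, hi, best = 0, len(scales) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        proj, table = tables[scales[mid]]
        bucket = table.get(project(q, proj), [])
        if bucket:
            best = min(bucket, key=lambda i: hamming(q, db[i]))
            hi = mid - 1  # a smaller scale might still succeed
        else:
            lo = mid + 1  # need a larger scale
    return best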

28 MAIN THM [KOR]  Can build a poly-size data structure to do ANN for high-dimensional data in time polynomial in d and poly-logarithmic in N –For the Hamming cube –l_1 –l_2 –Square of the Euclidean distance  [IM] had a similar result, with a slightly weaker guarantee.

29 Dealing with R^d  Project to random lines, choose “cut” points…  Well, not exactly… we need “navigation”

30 Clustering  Huge number of applications (IR, data mining, analysis of statistical data, biology, automatic taxonomy formation, the web, topic-specific data collections, etc.)  Two independent issues: –Representation of data –Forming “clusters” (many incomparable methods)

31 Representation of data: examples  Latent semantic indexing yields points in R^d with l_2 distance (distance indicating similarity)  The min-wise permutation approach (Broder et al.) yields points in the Hamming metric  Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings  Recent news: [OR-05] showed that we can embed the edit-distance metric into l_1 with small distortion: distortion = exp(sqrt(log n · log log n))

32 Geometric Clustering: examples  Min-sum clustering in R^d: form clusters s.t. the sum of intra-cluster distances is minimized  k-clustering: pick k “centers” in the ambient space; the cost is the sum of distances from each data point to its closest center  Agglomerative clustering (form clusters below some distance threshold)  Q: which is better?

33 Methods are (in general) incomparable

34 Min-SUM

Clustering

36 A k-clustering problem: notation  N – number of points  d – dimension  k – number of centers

37 About k-clustering  When k is fixed, this is easy for small d  [Kleinberg, Papadimitriou, Raghavan]: NP-complete for k=2 for the cube  [Drineas, Frieze, Kannan, Vempala, Vinay]: NP-complete for R^d for the square of the Euclidean distance  When k is not fixed, this is facility location (Euclidean k-median)  For fixed d but growing k, a PTAS was given by [Arora, Raghavan, Rao] (using dynamic programming)  (this talk) [OR]: PTAS for fixed k, arbitrary d

38 Common tools in geometric PTAS  Dynamic programming  Sampling [Schulman, AS, DLVK]  [DFKVV] use SVD  Embeddings/dimension reduction seem useless because –Too many candidate centers –May introduce new centers

39 [OR] k-clustering result  A PTAS for fixed k –Hamming cube {0,1}^d –l_1^d –l_2^d (Euclidean distance) –Square of the Euclidean distance

40 Main ideas  For 2-clustering, finding a good partition is as good as solving the problem  Switch to the cube  Try partitions in the embedded low-dimensional data set  Given a partition, compute the centers and the cost in the original data set  Embedding/dimension reduction is used to reduce the number of candidate partitions

41 Stronger property of [OR] dimension reduction  Our random linear transformation preserves ranges!

42 THE ALGORITHM

43 The algorithm yet again  Guess the 2-center distance  Map to the small cube  Partition in the small cube  Measure the partition in the big cube  THM: gets within (1+ε) of optimal.  Disclaimer: a PTAS is (almost never) practical; this shows feasibility only, and more ideas are needed for a practical solution.
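
A rough Python sketch of these four steps for 2-clustering in the Hamming cube (illustrative only: project_fn stands for an unspecified [OR]-style projection at scale L, points are tuples of 0/1, and the exhaustive enumeration of partitions of the projected images is a brute-force stand-in for the more careful enumeration in the actual PTAS):

import itertools

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def best_center(cluster):
    """In the Hamming cube, the coordinate-wise majority is an optimal center."""
    n = len(cluster[0])
    return tuple(int(2 * sum(p[i] for p in cluster) >= len(cluster)) for i in range(n))

def cost(cluster_a, cluster_b):
    """Sum of distances of the points to the best center of their own cluster."""
    total = 0
    for cluster in (cluster_a, cluster_b):
        if cluster:
            c = best_center(cluster)
            total += sum(hamming(p, c) for p in cluster)
    return total

def two_cluster(db, scales, project_fn):
    """The four steps of the slide, brute-forced: for each guessed scale L, project
    the points into the small cube, try every 2-partition of the distinct projected
    images, lift each partition back to the original points, and keep the partition
    whose cost (measured in the big cube) is smallest."""
    best = (float('inf'), None)
    for L in scales:
        images = [project_fn(p, L) for p in db]
        distinct = sorted(set(images))
        for assignment in itertools.product((0, 1), repeat=len(distinct)):
            side = dict(zip(distinct, assignment))
            a = [p for p, im in zip(db, images) if side[im] == 0]
            b = [p for p, im in zip(db, images) if side[im] == 1]
            best = min(best, (cost(a, b), (a, b)), key=lambda t: t[0])
    return best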

44 Dealing with k>2  The apex of a tournament is a node of maximum out-degree  Fact: the apex has a path of length at most 2 to every node  Every point is assigned to the apex of its center “tournament”: –Guess all (k choose 2) center distances –Embed into (k choose 2) small cubes –Guess the center projections in the small cubes –For every point, for every pair of centers, a “tournament” decides which center is closer in the projection

45 Conclusions  Dimension reduction in the cube allows us to deal with a huge number of “incomparable” attributes.  Embeddings of other metrics into the cube allow fast ANN for those metrics  Real applications still require considerable additional ideas  Fun area to work in