Property Testing of Data Dimensionality
Robert Krauthgamer (ICSI and UC Berkeley)
Joint work with Ori Sasson (Hebrew U.)

Data dimensionality
The analysis of large volumes of complex data is required in many disciplines, and such data is frequently represented by vectors in a high-dimensional vector space.
– E.g., sequential biological data (genomes, proteins).
– A common method of representing data is feature extraction (vector representation in a feature space), e.g., image databases and text corpora (via latent semantic indexing).

The issue of dimension
High-dimensional data is difficult to work with.
– The complexity of many operations depends heavily (e.g., exponentially) on the dimension.
Real-life data often adheres to a low-dimensional structure,
– which allows the dimension to be reduced effectively.
– E.g., in ℝ²: [figure: points in the plane lying close to a line]
Dimensionality reduction: mapping into a low-dimensional space (while preserving most of the data "structure").
– Trades off accuracy for computational efficiency.

Dimensionality reduction methods
Linear structure:
– Singular value decomposition (SVD), i.e., low-rank matrix approximation.
– Practical variants: multidimensional scaling (MDS), principal component analysis (PCA).
Metric structure:
– Low-distortion embedding into low-dimensional ℓ_p (see the sketch below),
– of any Euclidean metric [Johnson-Lindenstrauss'84],
– of any metric [Bourgain'85, Linial-London-Rabinovich'95].
Other methods, e.g., combinatorial feature selection [Charikar-Guruswami-Kumar-Rajagopalan-Sahai'00].
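As a concrete illustration of the metric-structure approach, here is a minimal sketch of a Johnson-Lindenstrauss-style random projection (Python/numpy is an assumption of this sketch, as is the standard JL target dimension k = O(ε⁻² log n), which is not stated on this slide):

    import numpy as np

    def jl_project(X, k, rng=np.random.default_rng(0)):
        """Project the rows of X from R^m down to R^k with a random
        Gaussian map; for k = O(eps^-2 log n), all pairwise distances
        are preserved up to a 1 +/- eps factor with high probability."""
        m = X.shape[1]
        R = rng.standard_normal((m, k)) / np.sqrt(k)  # scaled random matrix
        return X @ R

    # Example: 100 points in R^1000 projected down to R^50.
    X = np.random.default_rng(1).standard_normal((100, 1000))
    Y = jl_project(X, 50)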

Property testing framework
Relaxed decision problems: determine whether
– the input has a property P, or
– the input is far from having the property P, i.e., it must be modified significantly in order to have the property.
Goal: obtain
– randomized algorithms (correct with probability ≥ 2/3),
– whose complexity is low (does not depend on the input size).
Trivial example: testing whether an input list contains only 0's or an ε-fraction of its entries are nonzero, with O(1/ε) queries (see the sketch below).
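A minimal sketch of this trivial tester (the oracle `entry` and the exact constant in the query count are illustrative assumptions):

    import random

    def test_all_zeros(entry, n, eps, rng=random.Random(0)):
        """One-sided tester: accept iff ~2/eps random entries are all zero.
        If an eps-fraction of entries is nonzero, all queries miss a witness
        with probability <= (1 - eps)^(2/eps) <= e^-2 < 1/3."""
        for _ in range(int(2 / eps) + 1):
            if entry(rng.randrange(n)) != 0:
                return False  # found a nonzero entry: reject (never wrong)
        return True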

Testing data dimensionality
Given a data set S, determine whether
– S has at most a (fixed) dimension d, or
– S is ε-far from having this property,
– i.e., at least an ε-fraction of the entries of (a representation of) S must be modified for S to have the property.
Technicalities:
– The interpretation of "dimension" (i.e., the type of structure).
– The representation of S.
– We assume it affects both the query mechanism and the farness measure.

Our results – Testing for linear structure
Algorithm for testing whether vectors v_1,…,v_n lie in a linear (or affine) subspace of dimension ≤ d.
– The algorithm queries O(d/ε) vectors.
– Holds for every vector space V.
Algorithm for testing whether an m×n matrix A has rank ≤ d (see the sketch below).
– The algorithm queries the entries of an O(d/ε) × O(d/ε) submatrix.
– Holds for matrices over any field F.
(Both algorithms have one-sided error.)
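A minimal sketch of the rank tester (Python/numpy; the oracle `entry`, the constant 8 in the sample size, and the floating-point rank computation are assumptions of the sketch, whereas the actual result holds over any field with exact arithmetic):

    import numpy as np

    def test_rank(entry, m, n, d, eps, rng=np.random.default_rng(0)):
        """One-sided tester for rank(A) <= d: sample an O(d/eps) x O(d/eps)
        submatrix and accept iff its rank is at most d."""
        k = int(np.ceil(8 * d / eps))                  # illustrative constant
        rows = rng.choice(m, size=min(k, m), replace=False)
        cols = rng.choice(n, size=min(k, n), replace=False)
        sub = np.array([[entry(i, j) for j in cols] for i in rows])
        return np.linalg.matrix_rank(sub) <= d         # rank <= d always passes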

Our results – Testing for metric structure
Testing whether v_1,…,v_n ∈ ℓ_2^m can be embedded into ℓ_2^d:
– Isometrically: achieved by querying O(d/ε) vectors (a corollary of the above).
– With distortion ≤ Δ: requires querying Ω((n/Δ)^{1/2}) vectors.
– With perturbation δ: requires Ω(min{n^{1/2}, m/log m}) queries.
Testing whether vectors v_1,…,v_n ∈ ℓ_1^m can be embedded isometrically into ℓ_1^d requires querying Ω(n^{1/4}) vectors.
(The lower bounds are for algorithms with two-sided error.)

Our results – Testing metrics and norms
Algorithm for testing whether an n×n matrix M is the distance matrix of a d-dimensional Euclidean metric.
– The algorithm queries the entries of an O(d/ε) × O(d/ε) submatrix.
– A slight improvement over the O((d log d)/ε) × O((d log d)/ε) submatrix of [Parnas-Ron'01].
Algorithm for testing whether a vector has ℓ_p norm ≤ λ.
– The algorithm queries O(ε⁻² log(1/ε)) entries (with two-sided error).
– Holds for any p and λ.
– Allows testing the Frobenius norm of a matrix (such as the difference between a matrix and its low-rank approximation); see the estimator sketched below.
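The slides do not spell out the norm tester, but a natural sampling-based estimator it could build on looks as follows (Python; the estimator is unbiased for ||v||_p^p, though its concentration, and hence the exact query count, depends on assumptions about the entries):

    import numpy as np

    def estimate_lp_norm(entry, m, p, k, rng=np.random.default_rng(0)):
        """Estimate ||v||_p for v in R^m from k uniformly sampled coordinates:
        (m/k) * sum of |v_i|^p over the sample is unbiased for ||v||_p^p."""
        idx = rng.integers(0, m, size=k)
        est_pth_power = (m / k) * sum(abs(entry(int(i))) ** p for i in idx)
        return est_pth_power ** (1.0 / p)

    # A tester would accept iff the estimate is at most lambda (up to slack
    # depending on eps); repeating and taking a median boosts confidence.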

Property testing origins
Introduced by [Rubinfeld-Sudan'96]:
– testing algebraic properties of functions.
Many PCPs involve testing of encodings,
– e.g., low-degree polynomials, the Hadamard code, the long code.
Testing of combinatorial properties initiated by [Goldreich-Goldwasser-Ron'98]:
– They focused on graph properties (e.g., coloring).
– Later works considered testing monotonicity of functions, satisfiability of formulas, regularity of languages, equality of distributions, clustering of Euclidean vectors, metric spaces, etc.

Related work
Property testing:
– Testing whether a distance matrix represents a tree metric, an ultrametric, or a low-dimensional Euclidean metric [Parnas-Ron'01].
– Testing properties of Euclidean vectors, e.g., clustering [Alon-Dar-Parnas-Ron'00] and convexity [Czumaj-Sohler-Ziegler'00].
– Testing various matrix properties, e.g., monotonicity [Newman-Fischer'01].
Fast low-rank approximation (by sampling):
– [Frieze-Kannan-Vempala'98, Achlioptas-McSherry'01]
– The farness measure considers the magnitude of the changes.
– The sampling depends on the input size (unless the input is "uniform").

Other related work
A finite-point criterion for ℓ_p^d-embeddability:
– Namely, the minimum f_p(d) such that (any) metric space embeds into ℓ_p^d iff every f_p(d) of its points do.
– For p = 2, [Menger'28] showed f_2(d) = d+3.
– For p = 1 and any d > 2, [Bandelt-Chepoi-Laurent'98] showed f_1(d) ≥ d²−1, but it is not known whether f_1(d) is finite.
Our results for ℓ_1 and ℓ_2 spaces establish somewhat similar bounds for a relaxed version of this question.

Algorithm for testing linear structure
Thm 1. Testing whether a set of vectors S lies in a subspace of dimension ≤ d can be achieved with O(d/ε) queries.
The algorithm (sketched in code below):
1. Query O(d/ε) vectors of S uniformly at random.
2. Accept if (and only if) the queried vectors lie in a linear (or affine) subspace of dimension ≤ d.
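A minimal sketch of this algorithm (Python/numpy; `sample_vector` is an assumed oracle, the constant 8d/ε matches Lemma 2 below, and numerical rank stands in for exact linear algebra):

    import numpy as np

    def test_dimension(sample_vector, n, d, eps, rng=np.random.default_rng(0)):
        """One-sided tester: accept iff O(d/eps) randomly queried vectors
        span a subspace of dimension <= d."""
        t = int(np.ceil(8 * d / eps))              # t* = 8d/eps, as in Lemma 2
        idx = rng.integers(0, n, size=t)
        M = np.stack([sample_vector(int(i)) for i in idx])
        # (For the affine case, subtract M[0] from the other rows first.)
        return np.linalg.matrix_rank(M) <= d       # dim <= d data always passes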

Proof of testing linear structure
Proof (correctness). The algorithm always accepts a data set S of dimension ≤ d.
Let S be ε-far from having dimension ≤ d. Consider sampling the O(d/ε) vectors one by one, and let X_t be the dimension of the subspace spanned by the first t sampled vectors.
Lemma 1. Pr[X_{t+1} = X_t + 1 | X_t ≤ d] ≥ ε.
Proof. Since S is ε-far from having dimension ≤ d, the subspace spanned by the first t sampled vectors contains less than a (1−ε)-fraction of the vectors of S (otherwise, modifying only the vectors outside it would give S dimension ≤ d). Hence the next sampled vector falls outside this subspace, increasing its dimension, with probability ≥ ε.

A technical lemma
Lemma 2. Let 0 ≤ X_0 ≤ X_1 ≤ X_2 ≤ … be random variables. If Pr[X_{t+1} = X_t + 1 | X_t ≤ d] ≥ ε for all t ≥ 0, then for t* = 8d/ε we have Pr[X_{t*} ≤ d] < 1/3.
Proof sketch. X_t dominates a binomial distribution as long as X_t ≤ d. Then E[X_{t*}] ≥ 8d, and by a Chernoff bound Pr[X_{t*} ≤ d] < 1/3 (simulated below).
So with probability ≥ 2/3 we have X_{t*} > d, and the algorithm rejects (for S that is ε-far from dimension ≤ d). This completes the proof of Thm 1.
– A similar approach allows testing whether a matrix has low rank, and testing distance matrices (a slight improvement over [Parnas-Ron'01]).
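A quick Monte Carlo check of Lemma 2 in the worst case, where each step increments X_t with probability exactly ε (Python; the parameters are illustrative):

    import random

    def lemma2_failure_rate(d=5, eps=0.1, trials=10_000, rng=random.Random(0)):
        """Empirical Pr[X_{t*} <= d] for t* = 8d/eps, when each step
        increments X_t with probability exactly eps while X_t <= d."""
        t_star = int(8 * d / eps)
        failures = 0
        for _ in range(trials):
            x = 0
            for _ in range(t_star):
                if x <= d and rng.random() < eps:
                    x += 1
            failures += (x <= d)
        return failures / trials

    print(lemma2_failure_rate())  # ~0.0, far below the 1/3 bound of Lemma 2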

Lower bound for ℓ_1
Thm 2. Testing whether n vectors in ℓ_1^m can be embedded isometrically into ℓ_1^d requires querying Ω(n^{1/4}) vectors.
Consider first algorithms with one-sided error, and suppose d=1, m=2. Consider the following point set S: [figure: a planar point set containing many "□" (square) quadruples]
S is 1/24-far from ℓ_1^d-embeddability, because no "□" can be embedded in the line (verified in the sketch below).
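A tiny exhaustive check (Python) that the four corners of a unit square, under ℓ_1 distances, admit no isometric embedding into the line; this point set is an assumed stand-in for the "□" quadruples of the missing figure:

    from itertools import product

    pts = [(0, 0), (1, 0), (0, 1), (1, 1)]           # corners of a unit square
    d = {(i, j): abs(pts[i][0] - pts[j][0]) + abs(pts[i][1] - pts[j][1])
         for i in range(4) for j in range(4)}        # their l_1 distances

    # WLOG x0 = 0; then |x0 - xi| = d[0, i] forces xi = +/- d[0, i], so the
    # 8 sign patterns below cover every candidate line embedding.
    feasible = any(
        all(abs(x[i] - x[j]) == d[(i, j)] for i in range(4) for j in range(i + 1, 4))
        for signs in product([1, -1], repeat=3)
        for x in [[0] + [s * d[(0, k)] for k, s in zip(range(1, 4), signs)]]
    )
    print(feasible)  # False: the square cannot be embedded in the line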

Lower bound for ℓ_1 with one-sided error
Assume there is an algorithm that queries t ≪ n^{1/2} points; WLOG it sees a "random" sample of S. With high probability, 1 − O(t²/n) = 1 − o(1):
– the sample contains no two points at distance O(1) from each other;
– then the sample is ℓ_1^d-embeddable (since there is a geodesic line going through all its points);
– and so the algorithm must accept S.
Contradiction (since S is 1/24-far).

Lower bound for ℓ_1 with two-sided error
We (randomly) create from S another data set S' such that:
– S' embeds in the line (WHP, 1 − o(1));
– the algorithm's view of S differs from its view of S' with probability o(1);
– so the probabilities of accepting S vs. S' differ by o(1) ≪ 1/3. Contradiction.
Here (to prove Thm 2):
– Create S' by choosing r ≪ n^{1/2} random points from S and duplicating each one n/r times.
– Then a sample of ≪ r^{1/2} points from S or from S' looks almost the same.
[figure: samples from S and S', captioned "These inputs look the same"]

Lower bound for ℓ_2 with perturbation
Thm 3. Testing whether n vectors in ℓ_2^m can be perturbed by δ to become ℓ_2^d-embeddable requires Ω(min{n^{1/2}, m/log m}) queries.
Let d=0 (i.e., testing whether the vectors lie in a ball of radius δ).
Consider a sphere of radius δ' = δ(1 + 1/(2n)) in ℓ_2^m.
– Let S' consist of n random vectors from this sphere.
– Let S consist of n/2 random vectors from the sphere together with their n/2 antipodal vectors (−v).

Lower bound for ℓ_2 with perturbation
WHP, the vectors of S' lie in a ball of radius δ (a YES instance):
– By concentration of measure, WHP they are nearly orthogonal; e.g., the distance between every two of them is roughly √2·δ'.
– In fact, WHP they are all at distance < δ from their "center of mass", as claimed (illustrated numerically below).
[figure: S' on the sphere, labeled "Concentration of measure", YES]
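A numerical illustration of this concentration phenomenon (Python/numpy; the values of n and m are illustrative, with m taken much larger than n³ so that the tiny 1/(2n) margin survives random fluctuations):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 10, 200_000
    delta = 1.0
    delta_p = delta * (1 + 1 / (2 * n))              # sphere radius delta'

    V = rng.standard_normal((n, m))                  # n random directions in R^m
    V = delta_p * V / np.linalg.norm(V, axis=1, keepdims=True)

    # Near-orthogonality: pairwise distances concentrate around sqrt(2)*delta'.
    G = V @ V.T
    D = np.sqrt(np.maximum(np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G, 0))
    print(D[np.triu_indices(n, 1)].round(3), np.sqrt(2) * delta_p)

    # All points lie within distance < delta of their center of mass.
    radius = np.linalg.norm(V - V.mean(axis=0), axis=1).max()
    print(radius, "<", delta)                        # e.g. ~0.996 < 1.0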

Lower bound for ℓ_2 with perturbation
S is 1/2-far from lying in a ball of radius δ (a NO instance),
– because the distance between antipodal vectors in S is 2δ' > 2δ.
Assume the algorithm queries ≪ n^{1/2} vectors:
– WHP its view of S and of S' is the same,
– so the probabilities of accepting S and S' should differ by o(1). Contradiction.
This proves Thm 3.
[figure: S with antipodal pairs, labeled "Antipodals", NO]

Lower bound for ℓ_2 with distortion
Thm 4. Testing whether n vectors in ℓ_2^m can be embedded into ℓ_2^d with distortion ≤ Δ requires Ω((n/Δ)^{1/2}) queries.
Let d=1 (embedding into a line with distortion Δ).
Consider a unit circle with 10Δ equally spaced points. Let S consist of the points from n/(10Δ) (far apart) parallel copies of this circle in ℝ³.

Lower bound for ℓ_2 with distortion
S is 1/(10Δ)-far from having an embedding with distortion ≤ Δ,
– since embedding each circle into the line requires distortion > Δ (a k-point cycle requires distortion Ω(k) to embed into the line, and here k = 10Δ).
[figure: a circle with 10Δ equally spaced points, NO]

Lower bound for one-sided error
Assume the algorithm queries ≪ (n/Δ)^{1/2} points of S:
– WLOG it sees a "random" sample of S.
– WHP, this sample contains at most one point from each circle,
– and then it can be embedded with distortion < Δ into the line (by mapping each point to its circle's center).
– So WHP the algorithm must accept S. Contradiction.
[figure: one sampled point per circle, YES]

Lower bound for two-sided error
We create S' by choosing one point from each circle of S and duplicating it 10Δ times.
– Then S' can be embedded with distortion < Δ into the line.
– WHP, the view of ≪ (n/Δ)^{1/2} points from S is the same as from S'.
– So the probabilities of accepting S and S' should differ by o(1).
This proves Thm 4.

Future research
Testing whether:
– a matrix's spectral norm ||A||_2 is small;
– a distance matrix represents a metric (triangle inequality);
– a distance matrix represents an ℓ_1^d metric;
– a distance matrix represents an approximate ℓ_2^d metric.
Testing with a farness measure that depends on magnitude,
– à la [Frieze-Kannan-Vempala'98, Achlioptas-McSherry'01].