Finding Correlations in Subquadratic Time
Gregory Valiant

[Figure: gene-expression data matrix with n columns, from Liu L et al., PNAS 2003;100]
Q1: Find correlated columns (in subquadratic time?)

State of the Art: Correlations / Closest Pair
Q1: Given n random boolean vectors, except for one pair that is r-correlated (coordinates equal with probability (1+r)/2), find the correlated pair.
    Brute force:                         O(n²)
    [Paturi, Rajasekaran, Reif ’89]:     n^(2-O(r))
    [Indyk, Motwani ’98] (LSH):          n^(2-O(r))
    [Dubiner ’08]:                       n^(2-O(r))
    [This work]:                         n^1.62 · poly(1/r)
Q2: Given n arbitrary boolean or Euclidean vectors, find a (1+ε)-approximate closest pair (for small ε).
    Brute force:                         O(n²)
    LSH [IM ’98]:                        O(n^(2-ε))
    LSH [AI ’06] (Euclidean):            O(n^(2-2ε))
    [This work]:                         n^(2-O(√ε))

Why do we care about a (1+ε)-approximate closest pair in the “random” setting?
1. It is the natural “hard” regime for LSH, and hence a natural place to start if we hope to develop non-hashing-based approaches to closest pair / nearest neighbor.
2. It is related to other theory problems: learning parities with noise, learning juntas, learning DNFs, solving instances of random SAT, etc.

[Figure: gene-expression data matrix, from Liu L et al., PNAS 2003;100]
Definition: a k-junta is a function with only k relevant variables.
Q3: Given a k-junta, identify the relevant variables. [Brute force: time ≈ (n choose k) ≈ n^k]
Q4: Learning parity with noise: given a k-junta that is a PARITY (XOR) function, identify the relevant variables.
    No noise: easy, a linear system over F_2.
    With noise: seems hard.
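
Since the talk repeatedly returns to Q4, a minimal sketch of the easy, noiseless case may help (all names here are mine): each example (x, y) gives one linear equation <s, x> = y over F_2 in the unknown indicator vector s of the relevant variables, so plain Gaussian elimination recovers s. With noise, each equation is wrong with some probability and this approach collapses, which is exactly why the noisy case seems hard.

    import numpy as np

    def parity_from_noiseless_examples(X, y):
        # Solve <s, x> = y (mod 2) for s by Gaussian elimination over F_2.
        # X: (m, n) 0/1 examples; y: (m,) 0/1 labels; returns one consistent s.
        A = np.column_stack([X, y]).astype(np.uint8) % 2
        m, n = X.shape
        row, pivots = 0, []
        for col in range(n):
            hit = next((r for r in range(row, m) if A[r, col]), None)
            if hit is None:
                continue
            A[[row, hit]] = A[[hit, row]]    # move a pivot row into place
            for r in range(m):               # clear this column everywhere else
                if r != row and A[r, col]:
                    A[r] ^= A[row]
            pivots.append(col)
            row += 1
        s = np.zeros(n, dtype=np.uint8)
        for r, col in enumerate(pivots):
            s[col] = A[r, n]                 # read off the solved variables
        return s

    # quick check on a planted parity over 3 of n = 30 variables
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 30))
    y = X[:, [2, 7, 19]].sum(axis=1) % 2
    print(sorted(np.flatnonzero(parity_from_noiseless_examples(X, y))))  # [2, 7, 19]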

State of the Art: Learning Juntas
Q3: Learning k-juntas from random examples.

                                          no noise          with noise*
    Brute force                           n^k               n^k
    [Mossel, O’Donnell, Servedio ’03]     n^(.7k)           n^k
    [Grigorescu, Reyzin, Vempala ’11]     > n^(.7k(k+1))    > n^(k(k+1))
    [This work]                           n^(.6k)           n^(.8k)

    * poly(1/(1-2·noise)) factors not shown

Finding Correlations
Q1: Given n random boolean vectors of dimension d, except for one pair that is r-correlated, find the correlated pair. [Each coordinate of the correlated pair is equal with probability (1+r)/2.]

Naïve Approach
Stack the vectors as the rows of an n × d matrix X and compute C = X·Xᵀ, so that C_ij = <x_i, x_j>; the correlated pair is the off-diagonal entry with the largest inner product.
Problem: C has size n².
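
A minimal numpy sketch of this baseline (the generator and function names are mine, for illustration): plant one r-correlated pair among otherwise uniform ±1 vectors, then read the pair off the largest off-diagonal entry of C = X·Xᵀ.

    import numpy as np

    def plant_instance(n, d, r, seed=0):
        # n uniform ±1 vectors of dimension d; vectors 0 and 1 agree in
        # each coordinate independently with probability (1+r)/2.
        rng = np.random.default_rng(seed)
        X = rng.choice([-1.0, 1.0], size=(n, d))
        agree = rng.random(d) < (1 + r) / 2
        X[1] = np.where(agree, X[0], -X[0])
        return X

    def naive_closest_pair(X):
        # O(n^2 d) baseline (or n^ω via fast matrix multiplication):
        # form all pairwise inner products and take the largest.
        C = X @ X.T
        np.fill_diagonal(C, -np.inf)       # ignore self inner products
        return np.unravel_index(np.argmax(C), C.shape)

    X = plant_instance(n=500, d=2000, r=0.2)
    print(naive_closest_pair(X))           # w.h.p. the planted pair: (0, 1) or (1, 0)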

Column Aggregation
Randomly partition the n vectors into n/α groups of size α, and replace each group by the sum of its vectors, so the n × d matrix becomes (n/α) × d. Aggregation dilutes the correlation from r to O(r/α), but the correlation still overwhelms the noise provided d ≫ α²/r².
Set α ≈ n^(1/3), and assume d is sufficiently large (> n^(2/3)/r²).
Runtime = max(n^((2/3)·ω), n^(5/3)) · poly(1/r) = n^1.66 · poly(1/r).
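
Continuing the sketch above (reusing plant_instance and numpy), a toy version of this step, under my own naming and two simplifying assumptions: α divides n, and the two planted vectors land in different groups (true with probability about 1 − α/n; the real algorithm repeats with fresh random partitions). Once the pair of implicated groups is found via the small (n/α) × (n/α) product, brute force over just those 2α vectors finishes the job.

    def find_pair_by_aggregation(X, alpha, seed=1):
        # Sum the vectors in n/alpha random groups, find the pair of groups
        # with the largest aggregate inner product, then brute-force only
        # within those two groups (2*alpha vectors instead of n).
        rng = np.random.default_rng(seed)
        n, d = X.shape
        groups = rng.permutation(n).reshape(n // alpha, alpha)
        Z = X[groups].sum(axis=1)              # (n/alpha, d) aggregate vectors
        C = Z @ Z.T
        np.fill_diagonal(C, -np.inf)
        gi, gj = np.unravel_index(np.argmax(C), C.shape)
        cand = np.concatenate([groups[gi], groups[gj]])
        sub = X[cand] @ X[cand].T
        np.fill_diagonal(sub, -np.inf)
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        return cand[i], cand[j]

    # needs d >> alpha^2 / r^2 so the signal beats the aggregate noise
    X = plant_instance(n=216, d=8192, r=0.4)
    print(find_pair_by_aggregation(X, alpha=6))   # w.h.p. (0, 1) or (1, 0)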

Vector Expansion
What if d < n^(2/3)? Answer: embed the vectors into ≈ n^(2/3) dimensions (via XORs / ‘tensoring’): each new coordinate is the XOR of a pair of original coordinates x_i, x_j (for ±1 vectors, the product x_i·x_j). This blows up the dimension, while a correlation of r between a pair of vectors becomes a correlation of ≈ r² between their expansions.
Runtime = O(n^1.6).
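
A sketch of the expansion (again with names of mine, continuing the code above), sampling m random coordinate k-tuples; for ±1 vectors the coordinate product is exactly the XOR, and the planted correlation transforms as r → r^k, which is why expansion is only a win when d is too small to aggregate directly.

    def xor_expand(X, m, k=2, seed=2):
        # Each new coordinate is the product (= XOR, for ±1 entries) of k
        # randomly chosen original coordinates; correlation r becomes ~ r^k.
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, X.shape[1], size=(m, k))
        return X[:, idx].prod(axis=2)          # shape (n, m)

    # expand short vectors up to ~n^(2/3) dimensions, then aggregate as before
    X = plant_instance(n=216, d=400, r=0.7)
    print(find_pair_by_aggregation(xor_expand(X, m=8192), alpha=6))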

Algorithm Summary
Vector Expansion takes the n × d matrix to an n × n^(2/3) matrix; Column Aggregation then compresses it to n^(2/3) × n^(2/3). Multiply the aggregated matrices and inspect the entries.

Algorithm Summary (continued)
With α ≈ n^(1/3), each entry of the final product is a sum of α² = n^(2/3) pairwise inner products; the distribution of these sums is centered at 0, and the entry containing the planted pair stands out. Viewed as a metric embedding: what embedding should we use?

Goal: design an embedding f: R^d → R^m such that, for all x, y:
    <f(x), f(y)> = d_f(<x, y>)
    <x, y> large ⇒ <f(x), f(y)> large
    <x, y> small ⇒ <f(x), f(y)> TINY
e.g. the k-XOR embedding has d_f(q) = q^k.
Which functions d_f are realizable?
Thm [Schoenberg 1942]: exactly those of the form d_f(q) = a_0 + a_1·q + a_2·q² + a_3·q³ + … with a_i ≥ 0.
…not good enough to beat n^(2-O(r)) in the adversarial setting.

Idea: two embeddings are better than one. Take f, g: R^d → R^m such that <f(x), g(y)> = d_{f,g}(<x, y>) for all x, y. Now d_{f,g} can be ANY polynomial! Which polynomial should we use?
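
To see why two sides buy arbitrary polynomials, here is a small self-contained check (all names mine, reusing numpy as np): put √|a_j| times the j-th tensor power of x on f's side, and the signed √|a_j| times the tensor power of y on g's side; then <f(x), g(y)> = Σ_j a_j·<x, y>^j, including negative coefficients, which Schoenberg's theorem forbids for a single embedding.

    def tensor_pow(v, j):
        # j-fold tensor power of v, flattened; <x^{⊗j}, y^{⊗j}> = <x, y>^j
        t = np.array([1.0])
        for _ in range(j):
            t = np.outer(t, v).ravel()
        return t

    def embed(v, coeffs, side):
        # f carries sqrt(|a_j|) on each tensor power; g additionally carries
        # the sign of a_j, so <f(x), g(y)> = sum_j a_j * <x, y>^j.
        parts = []
        for j, a in enumerate(coeffs):
            s = np.sqrt(abs(a)) * (np.sign(a) if side == 'g' else 1.0)
            parts.append(s * tensor_pow(v, j))
        return np.concatenate(parts)

    # check against P(q) = 1 - 3q + 2q^3 at q = <x, y>
    coeffs = [1.0, -3.0, 0.0, 2.0]
    rng = np.random.default_rng(3)
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lhs = np.dot(embed(x, coeffs, 'f'), embed(y, coeffs, 'g'))
    print(lhs, np.polyval(coeffs[::-1], np.dot(x, y)))  # the two agree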

Goal: embeddings f, g: R^d → R^m such that, for all x, y:
    <f(x), g(y)> = d_{f,g}(<x, y>)
    <x, y> small ⇒ d_{f,g}(<x, y>) TINY
    <x, y> large ⇒ d_{f,g}(<x, y>) large
For an r-approximate maximal inner product, “large” and “small” are separated by r.
Question: Fix a > 1; which degree-q polynomial maximizes P(a) subject to |P(x)| ≤ 1 for |x| ≤ 1? Chebyshev polynomials!
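
A quick numeric check of this extremal property (function name mine): the degree-q Chebyshev polynomial T_q stays within [-1, 1] on [-1, 1], yet at 1 + r it has already grown to roughly e^(q·√(2r))/2, so degree q ≈ 1/√r suffices to blow up “large” inner products while keeping “small” ones tiny.

    import numpy as np
    from numpy.polynomial import chebyshev as cheb

    def chebyshev_gap(q, r):
        coeffs = [0.0] * q + [1.0]            # T_q in the Chebyshev basis
        inside = np.abs(cheb.chebval(np.linspace(-1, 1, 1001), coeffs)).max()
        outside = cheb.chebval(1.0 + r, coeffs)
        return inside, outside

    print(chebyshev_gap(q=10, r=0.1))         # ≈ (1.0, 42.2)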

The Algorithm
Partition the vectors into two sets of n/2, and only consider inner products across the partition. Apply the Chebyshev embeddings: f to one side, g to the other. This finds a pair of vectors with inner product within r of optimal, in time n^(2-O(√r)).

Application: random k-SAT
(x_1 ∨ x_5 ∨ … ∨ x_6) ∧ … ∧ (x_3 ∨ x_4 ∨ … ∨ x_8), with k literals per clause over n distinct variables.
#clauses: the interesting regime is ≈ c·2^k·n clauses, near the threshold beyond which the formula is w.h.p. unsatisfiable.
Strong Exponential Time Hypothesis (SETH): for sufficiently large (constant) k, no algorithm can solve k-SAT in time 2^((1-ε)·n).
Folklore belief: the hardest instances of k-SAT are the random instances that lie in the “interesting regime”.

Application: random k-SAT
Closest pair: a distribution of n² pairwise inner products; we want to find the biggest in time ≪ n² (say, n^1.9).
Random k-SAT: a distribution of #clauses satisfied by each of the 2^n assignments; we want to find the best in time ≪ 2^n.
Reduction: partition the n variables into 2 sets. For each set, form a (#clauses) × 2^(n/2) matrix, with rows ↔ clauses and columns ↔ partial assignments; the inner product of a column of A with a column of B gives the #clauses satisfied by the combined assignment.
Note: A and B are skinny (2^(n/2) columns but far fewer rows), so we can really afford to embed (project up / amplify a LOT!).
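
A toy version of the reduction (all names mine, reusing numpy as np, and with one encoding choice flipped for convenience: I let the inner product count clauses UNsatisfied by the combined assignment, so the formula is satisfiable iff the minimum inner product is 0, and #satisfied = #clauses − inner product). Clauses use DIMACS-style signed literals over variables 1..n.

    from itertools import product

    def sat_matrices(clauses, n):
        # Columns are indexed by partial assignments to one half of the
        # variables; entry [c][s] is 1 iff no literal of clause c is
        # satisfied by partial assignment s. Then dot(A[:, s1], B[:, s2])
        # counts the clauses that neither half satisfies.
        half = n // 2
        def side(vars_):
            cols = []
            for bits in product([0, 1], repeat=len(vars_)):
                assign = dict(zip(vars_, bits))
                cols.append([int(not any(assign[abs(l)] == (l > 0)
                                         for l in c if abs(l) in assign))
                             for c in clauses])
            return np.array(cols).T            # (#clauses, 2^(n/2))
        return side(range(1, half + 1)), side(range(half + 1, n + 1))

    # (x1 ∨ ¬x3) ∧ (¬x1 ∨ x2 ∨ x4), n = 4
    A, B = sat_matrices([[1, -3], [-1, 2, 4]], n=4)
    G = A.T @ B                                # G[s1, s2] = #clauses unsatisfied
    print("satisfiable:", G.min() == 0)        # True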

Application: random k-SAT
Current status: applying the approach of the closest-pair algorithm (with a little modification) gives a better algorithm for random k-SAT than is known for worst-case k-SAT (pretty cool), though still not quite 2^((1-ε)·n)…
2-step plan to disprove the Strong Exponential Time Hypothesis:
1. Leverage the additional SAT structure of A, B to get time 2^((1-ε)·n) for random k-SAT.
2. Show that any additional structure in the SAT instance only helps (i.e. all we care about is the distribution of #clauses satisfied by a random assignment; the algorithm can’t be tricked).
Both parts seem a little tricky, but plausible…

Questions
What are the right runtimes (in the random or adversarial setting)?
Better embeddings? Lower bounds for this approach?
Non-hashing nearest neighbor search?
Can we avoid fast matrix multiplication? (Alternatives: e.g. Pagh’s FFT-based ‘sparse matrix multiplication’?)
Other applications of “Chebyshev” embeddings?
Other applications of the closest-pair approach (juntas, SAT, …)?