Correlation Immune Functions and Learning
Lisa Hellerstein, Polytechnic Institute of NYU, Brooklyn, NY
Includes joint work with Bernard Rosell (AT&T), Eric Bach and David Page (U. of Wisconsin), and Soumya Ray (Case Western)

2 Identifying relevant variables from random examples

x                        f(x)
(1,1,0,0,0,1,1,0,1,0)    1
(0,1,0,0,1,0,1,1,0,1)    1
(1,0,0,1,0,1,0,0,1,0)    0

3 Technicalities
Assume random examples drawn from the uniform distribution over {0,1}^n
Have access to a source of random examples

4 Detecting that a variable is relevant
Look for dependence between input variables and output
If x_i irrelevant: P[f=1|x_i=1] = P[f=1|x_i=0]
If x_i relevant: P[f=1|x_i=1] ≠ P[f=1|x_i=0] (true for the function f of the previous example)

5 Unfortunately… for some functions:
x_i relevant: P[f=1|x_i=1] = 1/2 = P[f=1|x_i=0]
x_i irrelevant: P[f=1|x_i=1] = 1/2 = P[f=1|x_i=0]
Finding a relevant variable is easy for some functions. Not so easy for others.

6 How to find the relevant variables
Suppose you know r (# of relevant vars)
Assume r << n (Think of r = log n)
Get m random examples, where m = poly(2^r, log n, 1/δ)
With probability > 1-δ, have enough info to determine which r variables are relevant
–All other sets of r variables can be ruled out

7
x1  x2  x3  x4  x5  x6  x7  x8  x9  x10   f
(1,  1,  0,  1,  1,  0,  1,  0,  1,  0)   1
(0,  1,  1,  1,  1,  0,  1,  1,  0,  0)   0
(1,  1,  1,  0,  0,  0,  0,  0,  0,  0)   1
(0,  0,  0,  1,  1,  0,  0,  0,  0,  0)   0
(1,  1,  1,  0,  0,  0,  1,  1,  0,  1)   0

8
x1  x2  x3  x4  x5  x6  x7  x8  x9  x10   f
(1,  1,  0,  1,  1,  0,  1,  0,  1,  0)   1
(0,  1,  1,  1,  1,  0,  1,  1,  0,  0)   0
(1,  1,  1,  0,  0,  0,  0,  0,  0,  0)   1
(0,  0,  0,  1,  1,  0,  0,  0,  0,  0)   0
(1,  1,  1,  0,  0,  0,  1,  1,  0,  1)   0

9
x1  x2  x3  x4  x5  x6  x7  x8  x9  x10   f
(1,  1,  0,  1,  1,  0,  1,  0,  1,  0)   1
(0,  1,  1,  1,  1,  0,  1,  1,  0,  0)   0
(1,  1,  1,  0,  0,  0,  0,  0,  0,  0)   1
(0,  0,  0,  1,  1,  0,  0,  0,  0,  0)   0
(1,  1,  1,  0,  0,  0,  1,  1,  0,  1)   0
x3, x5, x9 can’t be the relevant variables (examples 3 and 5 agree on x3, x5, x9 but have different labels)

10
x1  x2  x3  x4  x5  x6  x7  x8  x9  x10   f
(1,  1,  0,  1,  1,  0,  1,  0,  1,  0)   1
(0,  1,  1,  1,  1,  0,  1,  1,  0,  0)   0
(1,  1,  1,  0,  0,  0,  0,  0,  0,  0)   1
(0,  0,  0,  1,  1,  0,  0,  0,  0,  0)   0
(1,  1,  1,  0,  0,  0,  1,  1,  0,  1)   0
x1, x3, x10 ok (no two examples agree on x1, x3, x10 yet have different labels)

11 Naïve algorithm: Try all combinations of r variables. Time ≈ n^r
Mossel, O’Donnell, Servedio [STOC 2003]
–Algorithm that takes time ≈ n^(cr) where c ≈ 0.704
–Subroutine: Find a single relevant variable
Still open: Can this bound be improved?
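
A minimal Python sketch of the naïve n^r-time search (names are illustrative, not from the talk): an r-subset S survives exactly when no two sampled examples agree on S yet carry different labels, which is the "ruling out" used on the previous slides.

```python
from itertools import combinations, product

def consistent(sample, S):
    """No two examples may agree on the coordinates in S yet have
    different labels; otherwise no function of S explains the sample."""
    seen = {}
    for x, y in sample:
        key = tuple(x[i] for i in S)
        if seen.setdefault(key, y) != y:
            return False
    return True

def candidate_relevant_sets(sample, n, r):
    """Naive ~n^r-time search: keep every r-subset of the variables
    that is consistent with the labeled sample."""
    return [S for S in combinations(range(n), r) if consistent(sample, S)]

# Toy run: f = x0 XOR x2 on n = 4 variables, full truth table as the sample.
n, r = 4, 2
sample = [(x, x[0] ^ x[2]) for x in product([0, 1], repeat=n)]
print(candidate_relevant_sets(sample, n, r))  # [(0, 2)]
```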

12 If the output of f is dependent on x_i, can detect the dependence (whp) in time poly(n, 2^r) and identify x_i as relevant.
Problematic functions: every variable is independent of the output of f
P[f=1|x_i=0] = P[f=1|x_i=1] for all x_i
Equivalently, all degree-1 Fourier coefficients = 0
Functions with this property are said to be CORRELATION-IMMUNE

13 P[f=1|x_i=0] = P[f=1|x_i=1] for all x_i. Geometrically: e.g. n=2 [picture of {0,1}^2]

14 P[f=1|x_i=0] = P[f=1|x_i=1] for all x_i. Geometrically: Parity(x_1,x_2) [picture]

15 P[f=1|x_i=0] = P[f=1|x_i=1] for all x_i. Geometrically: [picture of the halves x_1=1 and x_1=0]

16 P[f=1|x_i=0] = P[f=1|x_i=1] for all x_i. [picture of the halves x_2=0 and x_2=1]

17 Other correlation-immune functions besides parity?
–f(x_1,…,x_n) = 1 iff x_1 = x_2 = … = x_n
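
Both examples can be verified by brute force for small n. A Python sketch (illustrative names, not from the talk): compare P[f=1|x_i=1] with P[f=1|x_i=0] for every variable by enumerating {0,1}^n.

```python
from itertools import product

def is_correlation_immune(f, n):
    """Uniform distribution over {0,1}^n: f is correlation immune iff
    P[f=1 | x_i=1] = P[f=1 | x_i=0] for every x_i. The two halves of the
    cube have equal size, so it suffices to compare counts of 1-outputs."""
    cube = list(product([0, 1], repeat=n))
    return all(
        sum(f(x) for x in cube if x[i] == 1) == sum(f(x) for x in cube if x[i] == 0)
        for i in range(n)
    )

parity = lambda x: x[0] ^ x[1] ^ x[2]
all_equal = lambda x: int(x[0] == x[1] == x[2])
conjunction = lambda x: x[0] & x[1] & x[2]
print(is_correlation_immune(parity, 3))       # True
print(is_correlation_immune(all_equal, 3))    # True
print(is_correlation_immune(conjunction, 3))  # False
```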

18 Other correlation-immune functions besides parity? –All reflexive functions

19 Other correlation-immune functions besides parity? –All reflexive functions –More…

20 Correlation-immune functions and decision tree learners
Decision tree learners in ML
–Popular machine learning approach (CART, C4.5)
–Given a set of examples of a Boolean function, build a decision tree
Heuristics for decision tree learning
–Greedy, top-down
–Differ in the way they choose which variable to put in a node
–Pick the variable having the highest “gain”
–P[f=1|x_i=1] = P[f=1|x_i=0] means 0 gain
Correlation-immune functions are problematic for decision tree learners
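
To see the failure concretely, here is a Python sketch of a simplified information-gain split criterion in the spirit of ID3/C4.5 (names are mine; computed exactly over the uniform distribution): for parity, every variable has gain exactly 0, so the greedy choice at the root is blind.

```python
from math import log2
from itertools import product

def entropy(q):
    """Binary entropy of a coin with bias q."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def gain(f, n, i):
    """Information gain of splitting on x_i under the uniform distribution:
    H(f) minus the average entropy of f on the halves x_i=0 and x_i=1."""
    cube = list(product([0, 1], repeat=n))
    labels1 = [f(x) for x in cube if x[i] == 1]
    labels0 = [f(x) for x in cube if x[i] == 0]
    h_root = entropy(sum(f(x) for x in cube) / len(cube))
    return h_root - 0.5 * entropy(sum(labels1) / len(labels1)) \
                  - 0.5 * entropy(sum(labels0) / len(labels0))

parity3 = lambda x: x[0] ^ x[1] ^ x[2]
print([gain(parity3, 3, i) for i in range(3)])  # [0.0, 0.0, 0.0]
```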

21 Lookahead
Skewing: An efficient alternative to lookahead for decision tree induction. IJCAI 2003 [Page, Ray]
Why skewing works: learning difficult Boolean functions with greedy tree learners. ICML 2005 [Rosell, Hellerstein, Ray, Page]

22 Story Part One

23 How many difficult functions? More than 2^(n-1) [table of counts: n vs. # fns]

24 How many different hard functions? More than 2^(n/2) [table of counts: n vs. # fns]
SOMEONE MUST HAVE STUDIED THESE FUNCTIONS BEFORE…

27 Story Part Two

28 I had lunch with Eric Bach

29 Roy, B. K. A Brief Outline of Research on Correlation Immune Functions. In Proceedings of the 7th Australian Conference on Information Security and Privacy (July 2002), L. M. Batten and J. Seberry, Eds. Lecture Notes in Computer Science, Springer-Verlag, London.

30 Correlation-immune functions
k-correlation immune function
–For every subset S of the input variables s.t. 1 ≤ |S| ≤ k, P[f | S] = P[f]
–[Xiao, Massey 1988] Equivalently, all Fourier coefficients of degree i are 0, for 1 ≤ i ≤ k
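
The Fourier characterization is easy to test by brute force for small n. A Python sketch (illustrative names; the ±1 Fourier convention is assumed):

```python
from itertools import product, combinations

def fourier_coeff(f, n, S):
    """Fourier coefficient of the +/-1 version of f at the set S:
    E_x[ (-1)^( f(x) + sum_{i in S} x_i ) ] under the uniform distribution."""
    return sum((-1) ** (f(x) + sum(x[i] for i in S))
               for x in product([0, 1], repeat=n)) / 2 ** n

def is_k_correlation_immune(f, n, k):
    """Xiao-Massey characterization: f is k-correlation immune iff every
    Fourier coefficient of degree 1..k vanishes."""
    return all(fourier_coeff(f, n, S) == 0
               for d in range(1, k + 1)
               for S in combinations(range(n), d))

parity3 = lambda x: x[0] ^ x[1] ^ x[2]
print(is_k_correlation_immune(parity3, 3, 2))  # True: 3-bit parity is 2-correlation immune
print(is_k_correlation_immune(parity3, 3, 3))  # False: its degree-3 coefficient is nonzero
```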

31 Siegenthaler’s Theorem If f is k-correlation immune, then the GF[2] polynomial for f has degree at most n-k.

32 Siegenthaler’s Theorem [1984] If f is k-correlation immune, then the GF[2] polynomial for f has degree at most n-k. The algorithm of Mossel, O’Donnell, Servedio [STOC 2003] is based on this theorem.
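
Both quantities in the theorem can be computed by brute force for small n. A Python sketch (illustrative names; the GF[2] polynomial is obtained by the standard binary Möbius transform of the truth table): on the two running examples the bound deg ≤ n-k holds with equality.

```python
from itertools import product, combinations

def ci_order(f, n):
    """Largest k such that f is k-correlation immune, i.e. all Fourier
    coefficients of degree 1..k vanish (k = 0 if degree 1 already fails)."""
    def coeff(S):
        return sum((-1) ** (f(x) + sum(x[i] for i in S))
                   for x in product([0, 1], repeat=n))
    k = 0
    for d in range(1, n + 1):
        if any(coeff(S) != 0 for S in combinations(range(n), d)):
            break
        k = d
    return k

def gf2_degree(f, n):
    """Degree of the GF[2] polynomial of f, via the binary Moebius transform
    of the truth table (bit i of the index m is the value of x_i)."""
    tt = [f(tuple((m >> i) & 1 for i in range(n))) & 1 for m in range(2 ** n)]
    for i in range(n):
        for m in range(2 ** n):
            if m & (1 << i):
                tt[m] ^= tt[m ^ (1 << i)]
    return max((bin(m).count("1") for m in range(2 ** n) if tt[m]), default=0)

for name, f in [("parity", lambda x: x[0] ^ x[1] ^ x[2]),
                ("all-equal", lambda x: int(x[0] == x[1] == x[2]))]:
    k = ci_order(f, 3)
    print(name, "k =", k, "GF(2) degree =", gf2_degree(f, 3), "n-k =", 3 - k)
# parity:    k = 2, degree = 1, n-k = 1
# all-equal: k = 1, degree = 2, n-k = 2
```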

33 End of Story

34 Non-uniform distributions
Correlation-immune functions are defined wrt the uniform distribution
What if the distribution is biased? e.g. each bit is 1 with probability 3/4

35 f(x_1,x_2) = parity(x_1,x_2), each bit 1 with probability 3/4

x       parity(x)   P[x]
(0,0)   0           1/16
(0,1)   1           3/16
(1,0)   1           3/16
(1,1)   0           9/16

P[f=1|x_1=1] ≠ P[f=1|x_1=0]

36 f(x_1,x_2) = parity(x_1,x_2), each bit 1 with probability 1/4

x       parity(x)   P[x]
(0,0)   0           9/16
(0,1)   1           3/16
(1,0)   1           3/16
(1,1)   0           1/16

P[f=1|x_1=1] ≠ P[f=1|x_1=0]
For added irrelevant variables, the two conditional probabilities would be equal

37 Correlation-immunity wrt p-biased distributions
Definitions
–f is correlation-immune wrt distribution D if P_D[f=1|x_i=1] = P_D[f=1|x_i=0] for all x_i
–p-biased distribution D_p: each bit set to 1 independently with probability p
–For all p-biased distributions D_p, P_Dp[f=1|x_i=1] = P_Dp[f=1|x_i=0] for all irrelevant x_i

38 Lemma: Let f(x_1,…,x_n) be a Boolean function with r relevant variables. Then f is correlation immune w.r.t. D_p for at most r-1 values of p.
Pf: Correlation immune wrt D_p means
P[f=1|x_i=1] – P[f=1|x_i=0] = 0 (*) for all x_i.
Consider a fixed f and x_i. Can write the LHS of (*) as a polynomial h(p).

39 e.g. f(x_1,x_2,x_3) = parity(x_1,x_2,x_3), p-biased distribution D_p
h(p) = P_Dp[f=1|x_1=1] - P_Dp[f=1|x_1=0] = ( p^2 + (1-p)^2 ) – ( p(1-p) + (1-p)p ) = (2p-1)^2
If we add an irrelevant variable, this polynomial doesn’t change
h(p) for arbitrary f and variable x_i has degree ≤ r-1, where r is the number of relevant variables
f is correlation-immune wrt at most r-1 values of p, unless h(p) is identically 0 for all x_i
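
A quick exact check of this example in Python (a sketch; the simplification h(p) = (2p-1)^2 follows by expanding the expression above):

```python
from itertools import product

def h(f, n, i, p):
    """h(p) = P_{Dp}[f=1 | x_i=1] - P_{Dp}[f=1 | x_i=0], computed exactly by
    summing the p-biased probabilities of the other n-1 coordinates."""
    def cond(bit):
        total = 0.0
        for x in product([0, 1], repeat=n):
            if x[i] != bit or not f(x):
                continue
            ones = sum(x[j] for j in range(n) if j != i)
            total += p ** ones * (1 - p) ** (n - 1 - ones)
        return total
    return cond(1) - cond(0)

parity3 = lambda x: x[0] ^ x[1] ^ x[2]
for p in (0.25, 0.5, 0.75):
    print(p, round(h(parity3, 3, 0, p), 10), round((2 * p - 1) ** 2, 10))
# h(p) agrees with (2p-1)^2; it vanishes only at p = 1/2, the uniform case.
```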

40 h(p) = P_Dp[f=1|x_i=1] - P_Dp[f=1|x_i=0], where
P_Dp[f=1|x_i=1] = Σ_{d=0}^{n-1} w_d p^d (1-p)^(n-1-d)
and w_d is the number of inputs x for which f(x)=1, x_i=1, and x contains exactly d additional 1’s,
i.e. w_d = number of positive assignments of f_{x_i←1} of Hamming weight d
Similar expression for P_Dp[f=1|x_i=0]

41 P_Dp[f=1|x_i=1] - P_Dp[f=1|x_i=0] = Σ_{d=0}^{n-1} (w_d - r_d) p^d (1-p)^(n-1-d)
where w_d = number of positive assignments of f_{x_i←1} of Hamming weight d
r_d = number of positive assignments of f_{x_i←0} of Hamming weight d
Not identically 0 iff w_d ≠ r_d for some d

42 Property of Boolean functions
Lemma: If f has at least one relevant variable, then for some relevant variable x_i and some d, w_d ≠ r_d,
where w_d = number of positive assignments of f_{x_i←1} of Hamming weight d
r_d = number of positive assignments of f_{x_i←0} of Hamming weight d
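
The lemma is easy to see on an example. A Python sketch computing the weight profiles w_d and r_d (illustrative names) for 3-variable parity:

```python
from itertools import product

def weight_profile(f, n, i, bit):
    """counts[d] = number of positive assignments of f_{x_i <- bit}
    having Hamming weight d over the remaining n-1 coordinates."""
    counts = [0] * n
    for x in product([0, 1], repeat=n):
        if x[i] == bit and f(x):
            counts[sum(x[j] for j in range(n) if j != i)] += 1
    return counts

parity3 = lambda x: x[0] ^ x[1] ^ x[2]
w = weight_profile(parity3, 3, 0, 1)   # w_d
r = weight_profile(parity3, 3, 0, 0)   # r_d
print(w, r)  # [1, 0, 1] vs [0, 2, 0]: they differ, so h(p) is not identically 0
```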

43 How much does it help to have access to examples from different distributions?

44 How much does it help to have access to examples from different distributions?
Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions [Hellerstein, Rosell, Bach, Ray, Page]

45 Even if f is not correlation-immune wrt D_p, may need a very large sample to detect a relevant variable
–if the value of p is very near a root of h(p)
Lemma: If h(p) is not identically 0, then for some value of p in the set { 1/(r+1), 2/(r+1), 3/(r+1), …, (r+1)/(r+1) }, |h(p)| ≥ 1/(r+1)^(r-1)

46 Algorithm to find a relevant variable
–Uses examples from distributions D_p, for p = 1/(r+1), 2/(r+1), 3/(r+1), …, (r+1)/(r+1)
–Sample size poly((r+1)^r, log n, log(1/δ))
[Essentially the same algorithm was found independently by Arpe and Mossel, using very different techniques]
Another algorithm to find a relevant variable
–Based on proving (roughly) that if we choose a random p, then h^2(p) is likely to be reasonably large. Uses the prime number theorem.
–Uses examples from poly(2^r, log(1/δ)) distributions D_p
–Sample size poly(2^r, log n, log(1/δ))
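
A minimal simulation of the first algorithm in Python (a sketch under stated assumptions: `draw_examples` is a stand-in oracle for the p-biased example source, and the sample size is a toy constant rather than the poly((r+1)^r, log n, log(1/δ)) bound from the talk):

```python
import random

def find_relevant_variable(draw_examples, n, r, m=20000):
    """For each grid point p = j/(r+1), draw m examples from D_p and estimate
    h_i(p) = P[f=1|x_i=1] - P[f=1|x_i=0] for every variable; return the
    variable with the largest estimate. draw_examples(p, m) is an assumed
    oracle returning a list of (x, f(x)) pairs."""
    best, best_var = 0.0, None
    for j in range(1, r + 2):
        p = j / (r + 1)
        sample = draw_examples(p, m)
        for i in range(n):
            y1 = [y for x, y in sample if x[i] == 1]
            y0 = [y for x, y in sample if x[i] == 0]
            if not y1 or not y0:
                continue  # degenerate split (e.g. p = 1)
            est = abs(sum(y1) / len(y1) - sum(y0) / len(y0))
            if est > best:
                best, best_var = est, i
    return best_var

# Toy oracle: f = parity(x0, x1, x2) hidden among n = 6 variables.
def draw_examples(p, m, n=6):
    sample = []
    for _ in range(m):
        x = tuple(int(random.random() < p) for _ in range(n))
        sample.append((x, x[0] ^ x[1] ^ x[2]))
    return sample

random.seed(0)
print(find_relevant_variable(draw_examples, n=6, r=3))  # typically one of 0, 1, 2
```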

47 Better algorithms?

48 Summary
–Finding relevant variables (junta-learning)
–Correlation-immune functions
–Learning from p-biased distributions

49 Moral of the Story
The Handbook of Integer Sequences can be useful in doing a literature search
Eating lunch with the right person can be much more useful