Restriction Access, Population Recovery & Partial Identification Avi Wigderson IAS, Princeton Joint with Zeev Dvir Anup Rao Amir Yehudayoff
Restriction Access, A new model of “Grey-box” access
Systems, Models, Observations From Input-Output (I 1,O 1 ), (I 2,O 2 ), (I 3,O 3 ), ….? Typically more!
Black-box access Successes & Limits Learning: PAC, membership, statistical…queries Decision trees, DNFs? Cryptography: semantic, CPA, CCA, … security Cold boot, microwave,… attacks? Optimization: Membership, separation,… oracles Strongly polynomial algorithms? Pseudorandomness: Hardness vs. Randomness Derandomizing specific algorithms? Complexity: 2 = NP NP What problems can we solve if P=NP?
The gray scale of access f: n m D: “device” computing f (from a family of devices) D x 1, f(x 1 ) x 2, f(x 2 ) x 3, f(x 3 ) …. D D How to model? Many specific ideas. Ours: general, clean Black Box Gray Box – natural starting point - natural intermediate pt Clear Box
Restriction Access (RA) f: n m D: “device” computing f Restriction: = (x,L), L [n], x n, L L live vars Observations: ( , D| ) D| (simplified after fixing) computes f| on L Black L = Gray Clear L = [n] (x,f(x)) ( , D| ) ( x,D) D f(x) x 1, x 2, *,* …. * ||
Example: Decision Tree x1x1 x4x4 x2x2 x3x3 x2x D = (x,L) L = {3,4} x = (1010) D| = x4x4 x3x3 0 10
Modeling choices (RA-PAC) Restriction: = (x,L), L [n], x n, unknown D Input x : friendly, adversarial, random Unknown distribution (as in PAC) Live vars L : friendly, adversarial, random - independent dist (as in random restrictions)
RA-PAC Results Probably, Approximately Correct (PAC) learning of D, from restrictions with each variable remains alive with prob Thm 1 [DRWY] : A poly(s, ) alg for RA-PAC learning size-s decision trees, for every >0 (reconstruction from pairs of live variables) Thm 2 [DRWY] : A poly(s, ) alg for RA-PAC learning size-s DNFs, for every >.365… (reduction to “Population Recovery Problem”) Positive - In contrast to PAC !!!
Population Recovery (learning a mixture of binomials)
Population Recovery Problem k species, n attributes, from , Vectors v 1, v 2, … v k n Distribution p 1, p 2, … p k , >0 Task: Recover all v i, p i (upto ) from samples p 1 1/ v 1 p 2 1/ v 2 p 3 1/ v 3 Red: Known Blue: Unknown n k
Population Recovery Problem k species, n attributes, from , , >0 v 1, v 2, … v k n p 1, p 2, … p k fraction in population Task: Recover all v i, p i (upto ) from samples Samplers: (1) u v i with prob. p i -Lossy Sampler: (2) u(j) ? with prob. 1- j [n] -Noisy Sampler: (2) u(j) flipped w.p. 1/2- j [n] 0110 ?1?0?1? p 1 1/ v 1 p 2 1/ v 2 p 3 1/ v 3
Skull Teeth Vertebrae Arms Ribs Legs Tail Loss – Paleontology Height (ft) Length (ft) Weight (lbs) True Data 26% 11% 13% 30% 20%
Skull Teeth Vertebrae Arms Ribs Legs Tail Loss – Paleontology Height (ft) Length (ft) Weight (lbs) From samples Dig #1 Dig #2 Dig #3 Dig #4 …… each finding common to many species! How do they do it?
Socialism Abortion Gay marriage Marijuana Male Rich North US Noise – Privacy True Data 2% % …… From samples Joe Jane …. Who flipped every correct answer with probability 49% Deniability? Recovery?
PRP - applications Recovering from loss & noise - Clustering / Learning / Data mining - Computational biology / Archeology / …… - Error correction - Database privacy - …… Numerous related papers & books
PRP - Results Facts: =0 obliterates all information. - No polytime algorithm for = o(1) Thm 3 [DRWY] A poly(k, n, ) algorithm, from lossy samples, for every >.365… Thm 4 [WY]: A poly(k log k, n, ) algorithm, from lossy and/or noisy samples, for every > 0 Kearns, Mansour, Ron, Rubinfeld, Schapire, Sellie exp(k) algorithm for this discrete version Moitra, Valiant exp(k) algorithm for Gaussian version (even when noise is unknown)
Proof of Thm 4 Reconstruct v i, p i From samples,,,…. Lemma 1: Can assume we know the v i ’s ! Proof: Exposing one column at a time. Lemma 2: Easy in exp(n) time ! Proof: Lossy - enough samples without “?” Noisy – linear algebra on sample probabilities. Idea: Make n=O(log k) [Dimension Reduction] p 1 1/ v 1 p 2 1/ v 2 p 3 1/ v 3 ?1?00??01100 n k
Partial IDs a new dimension-reduction technique
Dimension Reduction and small IDs Lemma: Can approximate p i in exp(| S i |) time ! Does one always have small IDs? p v 1 p v 2 p v 3 p v 4 p v 5 p v 6 p v 7 p v 8 p v 9 IDs S 1 = {1,2} S 2 = {8} S 3 = {1,5,6} n = 8 k = 9 u – random sample q i = Pr[u[S i ]=v i [S i ]]
Small IDs ? NO! However,… p v 1 p v 2 p v 3 p v 4 p v 5 p v 6 p v 7 p v 8 p v 9 IDs S 1 = {1} S 2 = {2} S 3 = {3} … S 8 = {8} S 9 = {1,2,…,8} n = 8 k = 9
Linear algebra & Partial IDs However, we can compute p 9 = 1- p 1 - p 2 -…- p p v 1 p v 2 p v 3 p v 4 p v 5 p v 6 p v 7 p v 8 p v 9 IDs S 1 = {1} S 2 = {2} S 3 = {3} … S 8 = {8} S 9 = n = 8 k = 9 P
Back substitution and Imposters Can use back substitution if no cycles ! Are there always acyclic small partial IDs? p v 1 p v 2 p v 3 p v 4 p v 5 p v 6 p v 7 p v 8 p v 9 PIDs S 1 = {1,2} S 2 = {8} S 3 = {1,5,6} u – random sample q i = Pr[u[S i ]=v i [S i ]] q 1 = q 2 = q 3 = q 4 - p 1 - p 2 = S 4 = {3} any subset
Acyclic small partial IDs exist Lemma: There is always an ID of length log k p v 1 p v 2 p v 3 p v 4 p v 5 p v 6 p v 7 p v 8 p v 9 PIDs S 8 = {1,5,6} n = 8 k = 9 Idea: Remove and iterate to find more PIDs Lemma: Acyclic (log k)-PIDs always exists!
Chains of small Partial IDs Compute: q i = Pr[u i = 1] = Σ j≤i p i from sample u Back substitution : p i = q i - Σ j<i p j Problem: Long chains! Error doubles each step, so is exponential in the chain length. Want: Short chains! p v 1 p v 2 p v 3 p v 4 p v 5 p v 6 PIDs S 1 = {1} S 2 = {2} S 3 = {3} … S 6 = {6} n = 8 k = 6
The PID (imposter) graph Given: V=(v 1, v 2, … v k ) n S=(S 1, S 2,…,S k ) [ n ] n Construct G (V;S) by connecting v j v i iff v i is an imposter of v j : v i [ S j ] = v j [ S j ] v v v v v 5 PIDs S 1 = {1} S 2 = {2} S 3 = {3} … S 5 = {5} width = max i |S i | depth = depth ( G ) Want: PIDs w/small width and depth for all V v i v j iff i > j
Constructing cheap PID graphs Theorem: For every V=(v 1, v 2, … v k ), v i n we can efficiently find PIDs S=(S 1,S 2,…,S k ), S i [ n ] of width and depth at most log k Algorithm: Initialize S i = for all i Invariant: |imposters( v i ;S i )| ≤ k /2 | Si | Repeat: (1) Make S i maximal if not, add minority coordinates to S i (2) Make chains monotone: v j v i then | S j |<| S i | (so G acyclic) if not, set S i to S j ( and apply (1) to S i ) v v v v v v 6
Analysis of the algorithm Theorem: For every V=(v 1, v 2, … v k ) n we can efficiently find PIDs S=(S 1,S 2,…,S k ) [ n ] n of width and depth at most log k Algorithm: Initialize S i = for all i Invariant: |imposters( v i ;S i )| ≤ k /2 | Si | Repeat: (1) Make S i maximal (2) Make chains monotone (v j v i then | S j |<| S i |) Analysis: - | S i | log k throughout for all i - i | S i | increases each step - Termination in k log k steps. - width log k and so depth log k
Conclusions - Restriction access: a new, general model of “gray box” access (largely unexplored!) - A general problem of population recovery - Efficient reconstruction from loss & noise - Partial IDs, a new dimension reduction technique for databases. Open: polynomial time algorithm in k ? (currently k log k, PIDs can’t beat k loglog k ) Open: Handle unknown errors ?