1  Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 13, June 22, 2005
http://www.ee.technion.ac.il/courses/049011
2 Data Stream Space Lower Bounds
3  Data Streams: Reminder
Data: a sequence of n items (n is very large).
A computer program streams through the data, using limited memory, and outputs an approximation of the desired function of the data.
How do we prove memory lower bounds for data stream algorithms?
4  Communication Complexity [Yao 79]
Alice holds input a, Bob holds input b; they exchange messages m_1, m_2, m_3, m_4, …, and a referee determines the output from the messages.
Π(a,b) = the "transcript" of protocol Π on (a,b): the sequence of messages exchanged
cost(Π) = Σ_i |m_i|
CC(g) = min over protocols Π that compute g of cost(Π)
5  1-Way Communication Complexity
Alice sends a single message m_1 to Bob, and Bob sends a single message m_2 to the referee.
Π(a,b) = the "transcript" of Π on (a,b): (m_1, m_2)
cost(Π) = |m_1| + |m_2|
CC^1(g) = min over 1-way protocols Π that compute g of cost(Π)
6  Data Stream Space Lower Bounds
f: A^n → B: a function
DS(f): data stream space complexity of f = minimum amount of memory used by any data stream algorithm computing f
g: A^{n/2} × A^{n/2} → B: g(x,y) = f(x∘y), where x∘y is the concatenation of x and y
CC^1(g): 1-way communication complexity of g
Proposition: DS(f) ≥ Ω(CC^1(g))
Corollary: in order to obtain a lower bound on DS(f), it suffices to prove a lower bound on CC^1(g).
7  Reduction: 1-Way CC to DS
Proof of reduction:
S = DS(f)
M: an S-space data stream algorithm for f
Lemma: there is a 1-way CC protocol Π for g whose cost is S + log(|B|).
Proof: Π works as follows:
- Alice runs M on x
- Alice sends the state of M to Bob (S bits)
- Bob continues the execution of M on y
- Bob announces the output of M (log |B| bits)
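To make the reduction concrete, here is a minimal runnable sketch in Python. The ExactDE class and the method names feed/get_state/set_state/output are illustrative stand-ins, not part of the lecture:

# Toy "stream algorithm" M: exact distinct-element counting (not small-space,
# but it exposes the interface the reduction needs).
class ExactDE:
    def __init__(self):
        self.seen = set()
    def feed(self, item):
        self.seen.add(item)
    def get_state(self):
        return self.seen            # the state Alice ships to Bob (S bits)
    def set_state(self, state):
        self.seen = state
    def output(self):
        return len(self.seen)

def one_way_protocol(M, x, y):
    # Alice streams her half x through M and sends M's state to Bob.
    for item in x:
        M.feed(item)
    state = M.get_state()
    # Bob resumes M from Alice's state, streams his half y, and announces the output.
    M.set_state(state)
    for item in y:
        M.feed(item)
    return M.output()               # log|B| bits; total cost = S + log|B|

print(one_way_protocol(ExactDE(), [1, 2, 3], [3, 4]))   # 4 = DE(x∘y) = g(x, y)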
8  Distinct Elements: Reminder
Input: a vector x ∈ {1,2,…,m}^n
DE(x) = number of distinct elements of x
Example: if x = (1,2,3,1,2,3), then DE(x) = 3
Space complexity of DE:
- Randomized approximate: O(log m) space
- Deterministic approximate: Ω(m) space
- Randomized exact: Ω(m) space
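The small-space randomized algorithm itself is not reproduced on the slide; as an illustration, here is a minimal k-minimum-values (KMV) style estimator in Python. The hash family, the parameter k, and the estimator (k−1)/v_k are standard choices assumed here, not taken from the lecture:

import heapq, random

def approx_distinct(stream, k=64, p=2_147_483_647):
    # Keep only the k smallest hash values seen so far: about k words of memory.
    a, b = random.randrange(1, p), random.randrange(p)
    h = lambda x: (((a * x + b) % p) + 1) / p      # pairwise-independent hash into (0, 1]
    heap = []                                      # max-heap (via negation) of the k smallest hashes
    kept = set()
    for x in stream:
        v = h(x)
        if v in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:                         # smaller than the current k-th smallest hash
            kept.discard(-heapq.heappushpop(heap, -v))
            kept.add(v)
    if len(heap) < k:
        return len(heap)                           # fewer than k distinct elements: exact answer
    return int((k - 1) / (-heap[0]))               # KMV estimate of the number of distinct elements

stream = [random.randrange(1, 10_000) for _ in range(100_000)]
print(len(set(stream)), approx_distinct(stream))   # exact value vs. estimate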
9  The Equality Function
EQ: X × X → {0,1}, EQ(x,y) = 1 iff x = y
Theorem: CC^1(EQ) ≥ Ω(log |X|)
Proof:
Π: any 1-way protocol for EQ
Π_A(x): Alice's message on input x
Π_B(m,y): Bob's message when receiving message m from Alice and holding input y
Π(x,y) = (Π_A(x), Π_B(Π_A(x), y)): the transcript on (x,y)
10  Equality Lower Bound
Proof (cont.):
Suppose |Π_A(x)| < log|X| for all x ∈ X.
Then the number of distinct messages of Alice is < 2^(log|X|) = |X|.
By the pigeonhole principle, there exist two inputs x, x' ∈ X s.t. Π_A(x) = Π_A(x').
Therefore, Π(x,x) = Π(x',x).
But EQ(x,x) ≠ EQ(x',x). Contradiction.
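A toy illustration of the pigeonhole step, with a hypothetical message function that uses fewer than log|X| bits (all names here are illustrative):

from itertools import product

X = list(range(16))                 # |X| = 16, so any correct protocol needs log|X| = 4 bits
def alice_msg(x):
    return x % 8                    # a hypothetical 3-bit message: fewer than |X| possibilities

# Pigeonhole: some two distinct inputs share Alice's message ...
x, xp = next((x, xp) for x, xp in product(X, X) if x != xp and alice_msg(x) == alice_msg(xp))
# ... so Bob sees the same transcript on (x, x) and (x', x) and must answer identically,
# yet EQ(x, x) = 1 while EQ(x', x) = 0.
print(x, xp, alice_msg(x) == alice_msg(xp))        # e.g. 0 8 True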
11  Combinatorial Designs
A family of subsets T_1, …, T_k of a universe U s.t.:
1. For each i, |T_i| = |U|/4
2. For each i ≠ j, |T_i ∩ T_j| ≤ |U|/8
Fact: there exist designs of size k = 2^Ω(|U|).
(Obtained from constant-rate, constant relative minimum distance binary error-correcting codes.)
12  Reduction from EQ to DE
U = {1,2,…,m}
X = {T_1, …, T_k}: a design of size k = 2^Ω(m)
EQ: X × X → {0,1}
Note:
- If x = y, then DE(x∘y) = |x| = m/4
- If x ≠ y, then DE(x∘y) = |x ∪ y| ≥ 2·(m/4) − m/8 = 3m/8
Therefore: a deterministic data stream algorithm that approximates DE well enough to separate these two cases with space S ⟹ a 1-way protocol for EQ with cost S + O(log m)
Conclusion: DS(DE) ≥ Ω(m)
13  Randomized Communication Complexity
Alice & Bob are allowed to use random coin tosses.
For any inputs x,y, the referee needs to find g(x,y) w.p. 1 − ε.
RCC(g) = minimum cost of a randomized protocol that computes g with error ¼.
RCC^1(g) = minimum cost of a randomized 1-way protocol that computes g with error ¼.
What is RCC^1(EQ)?
RDS(f) = minimum amount of memory used by any randomized data stream algorithm computing f with error ¼.
Lemma: RDS(f) ≥ Ω(RCC^1(g))
14  Set Disjointness
U = {1,2,…,m}
X = 2^U = {0,1}^m
Disj: X × X → {0,1}
Disj(x,y) = 1 iff x ∩ y ≠ ∅
Equivalently: Disj(x,y) = OR_{i=1..m} (x_i AND y_i)
Theorem [Kalyanasundaram-Schnitger 88]: RCC(Disj) ≥ Ω(m)
15  Reduction from Disj to DE
U = {1,2,…,m}
X = 2^U
Disj: X × X → {0,1}
Note: DE(x∘y) = |x ∪ y|
Hence:
- If x ∩ y = ∅, then DE(x∘y) = |x| + |y|
- If x ∩ y ≠ ∅, then DE(x∘y) < |x| + |y|
Therefore: a randomized data stream algorithm that computes DE exactly with space S ⟹ a 1-way randomized protocol for Disj with cost S + O(log m)
Conclusion: RDS(DE) ≥ Ω(m)
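A small worked example of the decision rule Bob applies; exact set operations stand in for the exact output of the data stream algorithm:

def de(stream):
    return len(set(stream))                        # exact number of distinct elements

def disjoint_via_de(x, y):
    # x ∩ y = ∅  iff  DE(x∘y) = |x| + |y|  (otherwise DE(x∘y) < |x| + |y|)
    return de(list(x) + list(y)) == len(x) + len(y)

print(disjoint_via_de({1, 2, 3}, {4, 5}))          # True:  DE = 5 = |x| + |y|
print(disjoint_via_de({1, 2, 3}, {3, 4}))          # False: DE = 4 < |x| + |y|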
16  Information Theory Primer
X: a random variable on U
H(X) = entropy of X = amount of "uncertainty" in X (in bits)
- Ex: if X is uniform, then H(X) = log(|U|)
- Ex: if X is constant, then H(X) = 0
H(X | Y) = conditional entropy of X given Y = amount of uncertainty left in X after knowing Y
- Ex: H(X | X) = 0
- Ex: if X,Y are independent, then H(X | Y) = H(X)
I(X ; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = mutual information between X and Y
(Venn-diagram picture: H(X,Y) splits into H(X|Y), I(X;Y), and H(Y|X).)
17  Information Theory Primer (cont.)
Sub-additivity of entropy: H(X,Y) ≤ H(X) + H(Y), with equality iff X,Y are independent.
Conditional mutual information: I(X ; Y | Z) = H(X | Z) − H(X | Y,Z)
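A small numeric check of these definitions on a toy joint distribution (the distribution is an example chosen here, not from the lecture):

from math import log2

p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}                # joint distribution of (X, Y)
px = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}   # marginal of X
py = {y: sum(v for (_, b), v in p.items() if b == y) for y in (0, 1)}   # marginal of Y

H = lambda d: -sum(v * log2(v) for v in d.values() if v > 0)            # entropy in bits

Hx, Hy, Hxy = H(px), H(py), H(p)
Ixy = Hx + Hy - Hxy              # I(X;Y) = H(X) + H(Y) - H(X,Y)
Hx_given_y = Hx - Ixy            # H(X|Y) = H(X) - I(X;Y)

print(Hx, Hy, Hxy, Hx_given_y, Ixy)
# Sub-additivity is visible here: H(X,Y) <= H(X) + H(Y), and the gap is exactly I(X;Y).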
18  Information Complexity
g: A × B → C: a function
Π: a communication protocol that computes g
μ: a distribution on A × B
(X,Y): random variable with distribution μ
Information cost of Π: icost_μ(Π) = I(X,Y ; Π(X,Y))
Information complexity of g: IC_μ(g) = min over Π that compute g of icost_μ(Π)
Lemma: for any μ, RCC(g) ≥ IC_μ(g)
19  Direct Sum for Information Complexity
We want to:
- Find a distribution μ on {0,1}^m × {0,1}^m
- Show that IC_μ(Disj) ≥ Ω(m)
Recall that Disj(x,y) = OR_{i=1..m} (x_i AND y_i)
We will prove a "direct sum" theorem for IC_μ(Disj):
- Disj is a "sum" of m independent copies of AND
- Hence, the information complexity of Disj is m times the information complexity of AND
We will define a distribution ν on {0,1} × {0,1} and then define μ = ν^m.
"Theorem" [Direct Sum]: IC_μ(Disj) ≥ m · IC_ν(AND)
It would then suffice to prove an Ω(1) lower bound on IC_ν(AND).
20  Conditional Information Complexity
We cannot prove the direct sum directly for information complexity.
Recall:
- μ: a distribution on {0,1}^m × {0,1}^m
- (X,Y): random variable with distribution μ
- (X,Y) is product if X,Y are independent
Z: some auxiliary random variable on a domain S
(X,Y) is product conditioned on Z if, for any z ∈ S, X and Y are independent conditioned on the event {Z = z}.
Conditional information complexity of g given Z: CIC_μ(g | Z) = min over Π that compute g of I(X,Y ; Π(X,Y) | Z)
Lemma: for any μ and Z, IC_μ(g) ≥ CIC_μ(g | Z)
21  Input Distributions for Set Disjointness
ν: a distribution on pairs in {0,1} × {0,1}
(U,V): random variable with distribution ν
(U,V) is generated as follows:
- Choose a uniform bit D
- If D = 0, choose U uniformly from {0,1} and set V = 0
- If D = 1, choose V uniformly from {0,1} and set U = 0
We define μ = ν^m.
Note:
- ν is not product
- Conditioned on Z = D^m = (D_1,…,D_m), μ is product
- For every (x,y) in the support of μ, Disj(x,y) = 0.
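A sketch of sampling one coordinate from ν and a full input pair from μ = ν^m, following the description above (function names are illustrative):

import random

def sample_nu():
    # One coordinate: a uniform bit D decides which side may be 1; the other side is 0.
    d = random.getrandbits(1)
    if d == 0:
        return random.getrandbits(1), 0, d         # (U, V, D)
    return 0, random.getrandbits(1), d

def sample_mu(m):
    coords = [sample_nu() for _ in range(m)]       # mu = nu^m: m independent coordinates
    x = [u for u, _, _ in coords]
    y = [v for _, v, _ in coords]
    z = [d for _, _, d in coords]                  # Z = D^m
    return x, y, z

x, y, z = sample_mu(8)
# No coordinate has x_i = y_i = 1, so Disj(x, y) = 0 on the entire support of mu:
print(all(not (xi and yi) for xi, yi in zip(x, y)))   # True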
22  Direct Sum for IC [Bar-Yossef, Jayram, Kumar, Sivakumar 02]
Theorem: CIC_μ(Disj | D^m) ≥ m · CIC_ν(AND | D)
Proof outline:
- Decomposition step: I(X,Y ; Π(X,Y) | D^m) ≥ Σ_i I(X_i,Y_i ; Π(X,Y) | D^m)
- Reduction step: I(X_i,Y_i ; Π(X,Y) | D^m) ≥ CIC_ν(AND | D)
23  Decomposition Step
I(X,Y ; Π(X,Y) | D^m)
  = H(X,Y | D^m) − H(X,Y | Π(X,Y), D^m)
  ≥ Σ_i H(X_i,Y_i | D^m) − Σ_i H(X_i,Y_i | Π(X,Y), D^m)
    (by independence of (X_1,Y_1),…,(X_m,Y_m) given D^m, and by sub-additivity of entropy)
  = Σ_i I(X_i,Y_i ; Π(X,Y) | D^m)
24  Reduction Step
Want to show: I(X_i,Y_i ; Π(X,Y) | D^m) ≥ CIC_ν(AND | D)
I(X_i,Y_i ; Π(X,Y) | D^m) = Σ_{d_-i} Pr(D_-i = d_-i) · I(X_i,Y_i ; Π(X,Y) | D_i, D_-i = d_-i)
A protocol for computing AND(x_i, y_i):
- For all j ≠ i, Alice and Bob select X_j and Y_j independently (without communication), using d_j
- Alice and Bob run the protocol Π on X = (X_1,…,X_{i−1}, x_i, X_{i+1},…,X_m) and Y = (Y_1,…,Y_{i−1}, y_i, Y_{i+1},…,Y_m)
- Note that Disj(X,Y) = AND(x_i, y_i)
The conditional information cost of this protocol is I(X_i,Y_i ; Π(X,Y) | D_i, D_-i = d_-i), which is therefore ≥ CIC_ν(AND | D).
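A sketch of this embedding in Python; run_protocol is a hypothetical stand-in for the black-box protocol Π (here it simply returns the true value of Disj):

import random

def run_protocol(X, Y):
    # Stand-in for the two-party protocol Pi for Disj: just return the true answer.
    return int(any(xj and yj for xj, yj in zip(X, Y)))

def and_protocol(xi, yi, i, d, m):
    # Protocol for AND(x_i, y_i): Alice and Bob fill every coordinate j != i on their own,
    # using d_j to decide which of them draws a private random bit (the other sets 0),
    # then run Pi on the assembled m-bit inputs.
    X, Y = [0] * m, [0] * m
    for j in range(m):
        if j == i:
            continue
        if d[j] == 0:
            X[j] = random.getrandbits(1)           # Alice's private coin; Y[j] stays 0
        else:
            Y[j] = random.getrandbits(1)           # Bob's private coin; X[j] stays 0
    X[i], Y[i] = xi, yi
    # Coordinate i is the only place X and Y can intersect, so Disj(X, Y) = AND(x_i, y_i).
    return run_protocol(X, Y)

m, i = 8, 3
d = [random.getrandbits(1) for _ in range(m)]      # the conditioning values d_{-i} (entry i unused)
print(and_protocol(1, 1, i, d, m), and_protocol(1, 0, i, d, m))   # 1 0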
25 End of Lecture 13