1
Algorithms for Distributed Functional Monitoring
Ke Yi, HKUST
Joint work with Graham Cormode (AT&T Labs) and S. Muthukrishnan (Google Inc.)
2
The Story Begins with...
3
The Model
Alice observes A(t) by time t
Bob observes B(t) by time t
A(t), B(t): multisets
Carole tries to compute f(A(t) ∪ B(t)) for all t
All parties have infinite computing power
Goal is to minimize communication
[Figure: Alice's and Bob's streams (e.g., 1421345 and 235212) arriving over time t]
4
The Model
k sites
Continuous Communication Model / Distributed Streaming Model
[Figure: k sites, each observing its own stream (e.g., 1421345, 235212, 231313, 253322)]
5
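Combination of Two Models
Communication model (“One-shot Model”)
Streaming model
Combined: Continuous Communication Model / Distributed Streaming Model
[Figure: a static two-party input for the communication model and a single stream (e.g., 14213) for the streaming model]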
6
Other Models [Gibbons and Tirthapura, 2001]
Carole tries to compute f(A ∪ B) in the end
All parties make one pass using small memory
Small communication
[Figure: two streams (e.g., 1421345 and 235212) arriving over time t]
7
Applied Motivation: Distributed Monitoring
Large-scale querying/monitoring: inherently distributed!
  Streams physically distributed across remote sites
  E.g., stream of UDP packets through routers
Challenge is “holistic” querying/monitoring
  Queries over the union of distributed streams Q(S_1 ∪ S_2 ∪ …)
  Streaming data is spread throughout the network
[Figure: Network Operations Center (NOC) query site issuing Q(S_1 ∪ S_2 ∪ …) over the bit streams at sites S_1, …, S_6]
Slide from the tutorial “Streaming in a connected world: Querying and tracking distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis]
8
Applied Motivation: Distributed Monitoring
Traditional approach: “pull” based
  Query all nodes once in a while
  Expensive communication, most of it wasted
  Inaccurate
Current trend: moving towards a “push” based approach
  The remote sites alert the coordinator when something interesting happens
[Figure: Network Operations Center (NOC) query site issuing Q(S_1 ∪ S_2 ∪ …) over the bit streams at sites S_1, …, S_6]
9
Theoretical Questions
Upper bounds: worst-case communication bounds for a given f?
Lower bounds: is there a gap in the communication complexity between the one-shot model and the continuous model?
10
The Frequency Moments
Assume integer domain [n] = {1, …, n}
Item i appears m_i times
The p-th frequency moment: F_p = Σ_{i=1}^{n} m_i^p
F_1 is the cardinality of A
F_0 is the number of unique items in A (define 0^0 = 0)
F_2 is Gini’s index of homogeneity in statistics, and the self-join size in databases
Extensively studied since [Alon, Matias, and Szegedy, 1999]
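As a small illustration of the definition, the following Python sketch computes F_p directly from a multiset. The function name `frequency_moment` and the sample multiset are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def frequency_moment(items, p):
    """Return F_p = sum over items i of m_i^p, with the convention 0^0 = 0."""
    counts = Counter(items)      # m_i for every item i that actually appears
    if p == 0:
        return len(counts)       # absent items contribute 0^0 = 0
    return sum(m ** p for m in counts.values())

A = [1, 4, 2, 1, 3, 4, 5]        # illustrative multiset
for p in (0, 1, 2):
    print(p, frequency_moment(A, p))
```

Only items that appear are stored in the counter, so the 0^0 = 0 convention is handled by never visiting absent items.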
11
Approximate Monitoring
Must trigger alarm when F_p > τ
Cannot trigger alarm when F_p < (1 − ε)τ
Why approximate:
  Exact monitoring is expensive and unnecessary
Why monitoring:
  Most applications only need monitoring
  Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so at most an O(1/ε) factor away
[Figure: F_p plotted against time, with the levels τ and (1 − ε)τ and the alarm marked]
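The reduction from tracking to monitoring can be illustrated with a minimal Python sketch. It assumes the laziest admissible monitors, which raise the alarm for a threshold τ only once F_p > τ; the helper names `threshold_ladder` and `track` are hypothetical.

```python
def threshold_ladder(f_max, eps):
    """Thresholds 1, (1+eps), (1+eps)^2, ... up to the first one >= f_max."""
    taus = [1.0]
    while taus[-1] < f_max:
        taus.append(taus[-1] * (1 + eps))
    return taus

def track(value, taus):
    """Largest threshold whose (lazy) alarm has fired, i.e. the largest tau < value."""
    fired = [tau for tau in taus if value > tau]
    return max(fired)

eps = 0.1
taus = threshold_ladder(10_000, eps)
for value in (2, 57, 4321, 10_000):
    est = track(value, taus)
    # The next threshold up has not fired, so est < value <= est * (1 + eps).
    print(value, est, len(taus))
```

The ladder has roughly (1/ε) · ln(f_max) thresholds. With monitors that are allowed to fire anywhere in the (1 − ε)τ to τ window, the bracket would presumably widen to (1 − ε) · est ≤ value ≤ (1 + ε) · est.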
12
Prior Work
Several papers in the database literature
  Mostly heuristic based
  Bad worst-case bounds, no lower bounds

Moment   Prior work                        This talk
F_1      O(k/ε · log(τ/k)) [SIGMOD’06]     O(k log(1/ε))
F_0      Õ(k^2/ε^3) [ICDE’06]              Õ(k/ε^2)
F_2      Õ(k^2/ε^4) [VLDB’05]              Õ(k^2/ε + k^{3/2}/ε^3)

Õ() suppresses polylog factors
13
Continuous vs One-Shot
If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X + k) bits
14
Our Results
Good news: all continuous bounds (except F_2) are close to their one-shot counterparts
Bad news: all continuous bounds (except F_2) are close to their one-shot counterparts
15
Talk Outline
Introduction
Deterministic F_1 algorithm: O(k log(1/ε))
Randomized F_1 algorithm: O(1/ε^2 · log(1/δ))
Randomized F_0 algorithm: Õ(k/ε^2)
Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
Conclusions
16
Deterministic F_1 Algorithm
The first round:
  Per-site threshold: τ/(2k)
  Terminates round after receiving k signals
  τ/(2k) · k = τ/2 < F_1 < τ
[Figure: sites signalling the coordinator]
17
Deterministic F_1 Algorithm
The second round:
  Per-site threshold: τ/(4k)
[Figure: sites signalling the coordinator]
18
Deterministic F_1 Algorithm
The second round:
  Per-site threshold: τ/(4k)
  Terminates round after receiving k signals
  3τ/4 < F_1 < τ
[Figure: sites signalling the coordinator]
19
Deterministic F_1 Algorithm
Each round communicates O(k) bits
Continue until Δ = ετ: O(log(1/ε)) rounds
After the last round, we have (1 − ε)τ < F_1 < τ
Total communication: O(k log(1/ε))
  Lower bound: Ω(k log(1/(εk)))
One-shot: O(k log(1/ε))
  Lower bound: Ω(k log(1/(εk)))
[Figure: sites signalling the coordinator, with remaining slack Δ = ετ]
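The round structure can be simulated with a minimal Python sketch. It is a simplified variant, not necessarily the exact protocol of the talk: it assumes items arrive one at a time, that the coordinator polls exact counts at the end of each round, and that the next per-site threshold is the remaining slack divided by 2k. The name `monitor_f1` and the accounting of 3k messages per round are illustrative assumptions.

```python
def monitor_f1(stream, k, tau, eps):
    """Simulate multi-round monitoring of F_1 (the total number of items).

    `stream` yields the index (0..k-1) of the site receiving each item.
    Returns (alarm_time, messages, rounds), or None if no alarm is raised.
    """
    residual = [0.0] * k            # items at each site not yet covered by a signal
    slack = float(tau)              # tau minus the coordinator's known F_1
    threshold = slack / (2 * k)     # first round: tau / (2k)
    signals = messages = 0
    rounds = 1
    for t, site in enumerate(stream, start=1):
        residual[site] += 1
        while residual[site] >= threshold:
            residual[site] -= threshold
            signals += 1
            messages += 1           # one signal to the coordinator
            if signals == k:        # round over: poll all sites, broadcast new threshold
                messages += 2 * k
                slack = tau - t     # polled counts sum to the true F_1 = t
                if slack <= eps * tau:
                    return t, messages, rounds   # alarm: F_1 >= (1 - eps) * tau
                residual = [0.0] * k
                threshold = slack / (2 * k)
                signals = 0
                rounds += 1
                break
    return None

round_robin = [i % 4 for i in range(2000)]
print(monitor_f1(round_robin, k=4, tau=1000, eps=0.1))
```

Each round consumes at least half of the remaining slack, so under these assumptions the slack drops below ετ after at most about log2(1/ε) rounds, each costing O(k) messages.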
20
Talk Outline
Introduction
Deterministic F_1 algorithm: O(k log(1/ε))
Randomized F_1 algorithm: O(1/ε^2 · log(1/δ))
Randomized F_0 algorithm: Õ(k/ε^2)
Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
Conclusions
21
F_0: # Distinct Items
Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
Consider the one-shot case first
  Use “sketches”: small-space streaming algorithms
  “Combine” the sketches from the k sites
FM sketch [Flajolet and Martin, 1985; Alon, Matias, and Szegedy, 1999]
22
FM Sketch
Take a pairwise independent random hash function h : {1, …, n} → {1, …, 2^d}, where 2^d > n
For each incoming element x, compute h(x)
  e.g., h(5) = 10101100010000
  Count how many trailing zeros
  Remember the maximum number of trailing zeros in any h(x)
Let Y be the maximum number of trailing zeros
  Can show E[2^Y] = # distinct elements
23
FM Sketch
So 2^Y is an unbiased estimator for # distinct elements
However, it has a large variance
  Some recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce a good estimator that is within relative error ε with probability 1 − δ
  Space increased to Õ(1/ε^2)
FM sketch has linearity
  Y_1 from A, Y_2 from B, then 2^{max{Y_1, Y_2}} estimates # distinct items in A ∪ B
A one-shot algorithm with communication Õ(k/ε^2)
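The basic sketch and its merge rule can be written as a minimal Python sketch. The hash ((a·x + b) mod P) mod 2^d is a common stand-in for a pairwise independent family and is an assumption here, as are the names `make_hash`, `trailing_zeros`, and `fm_sketch`; hash values range over 0, …, 2^d − 1 rather than 1, …, 2^d, and only a single low-accuracy copy of the sketch is kept.

```python
import random

P = (1 << 61) - 1   # a Mersenne prime used as the hash modulus (illustrative choice)

def make_hash(d, seed):
    """Return h(x) = ((a*x + b) mod P) mod 2^d for random a, b."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % (1 << d)

def trailing_zeros(v, d):
    """Number of trailing zero bits of v; a zero hash value counts as d zeros."""
    if v == 0:
        return d
    return (v & -v).bit_length() - 1

def fm_sketch(items, h, d):
    """Y = maximum number of trailing zeros of h(x) over all items (0 if empty)."""
    return max((trailing_zeros(h(x), d) for x in items), default=0)

d = 20
h = make_hash(d, seed=1)
A = list(range(0, 600))
B = list(range(400, 1000))
y_a, y_b = fm_sketch(A, h, d), fm_sketch(B, h, d)
y_union = max(y_a, y_b)            # merging two sketches is just a max
print(y_a, y_b, 2 ** y_union)
```

Because Y depends only on the set of hashed values, duplicates never change it, and the sketch of a union equals the maximum of the individual sketches as long as all sites share the same hash function.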
24
Continuously Monitoring F_0
FM sketch is monotone
  Y_i is non-decreasing, and Y_i < log n
Whenever Y_i increases, notify the coordinator
The coordinator can always have the up-to-date combined FM sketch
Total communication: Õ(k/ε^2)
Lower bound: Ω(k)
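The monotonicity argument can be checked with a small simulation. This is a sketch under stated assumptions: a single copy of the sketch per site (not the Õ(1/ε^2) copies needed for accuracy), the same illustrative hash ((a·x + b) mod P) mod 2^d, and hypothetical names `make_tz` and `monitor_f0`.

```python
import random

P = (1 << 61) - 1   # prime modulus for the hash (illustrative choice)

def make_tz(d, seed):
    """Return a function mapping x to the trailing-zero count of a d-bit hash of x."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    def tz(x):
        v = ((a * x + b) % P) % (1 << d)
        return d if v == 0 else (v & -v).bit_length() - 1
    return tz

def monitor_f0(stream, k, d, seed=0):
    """`stream` yields (site, item). Returns (messages, coordinator_Y, local_Ys)."""
    tz = make_tz(d, seed)
    local = [-1] * k        # Y_i at each site; -1 means no item seen yet
    coord = -1              # coordinator's combined Y = max over sites
    messages = 0
    for site, item in stream:
        y = tz(item)
        if y > local[site]:         # Y_i increased: notify the coordinator
            local[site] = y
            messages += 1
            coord = max(coord, y)
    return messages, coord, local

rng = random.Random(7)
k, d = 8, 20
stream = [(rng.randrange(k), rng.randrange(1, 10**6)) for _ in range(5000)]
messages, coord, local = monitor_f0(stream, k, d)
print(messages, coord, 2 ** coord)
```

Each Y_i can only increase, and takes at most d + 2 distinct values here, so a site sends at most d + 1 notifications no matter how long the stream is.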
25
Talk Outline
Introduction
Deterministic F_1 algorithm: O(k log(1/ε))
Randomized F_1 algorithm: O(1/ε^2 · log(1/δ))
Randomized F_0 algorithm: Õ(k/ε^2)
Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
Conclusions
26
F_2: The One-Shot Case
Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
Consider the one-shot case first
  Use “sketches”: small-space streaming algorithms
  “Combine” the sketches from the k sites
AMS sketch [Alon, Matias, and Szegedy, 1999]
27
AMS Sketch: “Tug-of-War”
Take a 4-wise independent random hash function h : {1, …, n} → {−1, +1}
Compute Y = ∑ h(x) over all x
Y^2 is an unbiased estimator for F_2
Use O(1/ε^2 · log(1/δ)) copies to guarantee a good estimator that is within relative error ε with probability 1 − δ
Linearity still holds!
  One-shot case can be solved with communication Õ(k/ε^2)
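A minimal Python sketch of the tug-of-war estimator follows. The sign hash takes the parity of a random degree-3 polynomial over a prime field, a common construction that only approximates a 4-wise independent ±1 family (the parity is very slightly biased); this choice, the median-of-means combination, and the names `make_sign_hashes`, `ams_sketch`, `merge`, and `estimate_f2` are assumptions for illustration.

```python
import random
import statistics

P = (1 << 61) - 1   # prime modulus for the polynomial hash (illustrative choice)

def make_sign_hashes(copies, seed):
    """One random degree-3 polynomial (4 coefficients) per sketch copy."""
    rng = random.Random(seed)
    return [[rng.randrange(P) for _ in range(4)] for _ in range(copies)]

def sign(coeffs, x):
    """Map x to +1 or -1 using the parity of the polynomial's value."""
    c0, c1, c2, c3 = coeffs
    v = (((c3 * x + c2) * x + c1) * x + c0) % P
    return 1 if v & 1 else -1

def ams_sketch(items, hashes):
    """Y = sum of h(x) over all items, one counter per copy."""
    return [sum(sign(c, x) for x in items) for c in hashes]

def merge(s1, s2):
    """Linearity: the sketch of the combined multiset is the counter-wise sum."""
    return [y1 + y2 for y1, y2 in zip(s1, s2)]

def estimate_f2(sketch, groups):
    """Median over groups of the mean of Y^2 within each group."""
    size = len(sketch) // groups
    means = [statistics.mean(y * y for y in sketch[g * size:(g + 1) * size])
             for g in range(groups)]
    return statistics.median(means)

hashes = make_sign_hashes(copies=60, seed=3)
A = [1, 4, 2, 1, 3, 4, 5]
B = [2, 3, 5, 2, 1, 2]
combined = merge(ams_sketch(A, hashes), ams_sketch(B, hashes))
print(estimate_f2(combined, groups=5))
```

Merging by addition is what makes the one-shot protocol work: each site ships its counters and the coordinator sums them. Here the union is a multiset union, so counts of the same item at different sites add up.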
28
However…
Y is not monotone!
Can’t afford to send all changes of the local sketch to the coordinator
29
F_2 Monitoring: Multi-Round Algorithm
Beginning of a round
[Figure: sketches of size Õ(1/ε^2) sent to the coordinator, which holds an estimate for F_2]
30
F_2 Monitoring: Multi-Round Algorithm
During a round
Each site sends a signal whenever the F_2 of its updates increases by t = (τ − F_2)^2 / (64 k^2 τ)
[Figure: sites signalling the coordinator, which holds an estimate for F_2]
31
F_2 Monitoring: Multi-Round Algorithm
End of a round: when k signals are received
old F_2 + (τ − old F_2) · ε/k < new F_2 < τ
# rounds: O(k/ε)
Total cost: Õ(k^2/ε^3)
[Figure: the coordinator holds an estimate for F_2]
32
F_2: Round / Sub-Round Algorithm
End of a sub-round: when k signals are received
  “Rough” sketch of size Õ(1)
  Combine sketches
  Maintain an upper bound of F_2
old F_2 + (τ − old F_2) · ε/k < new F_2 < τ
Total cost: Õ(k^2/ε + k^{3/2}/ε^3)
One-shot: Õ(k/ε^2)
Lower bound: Ω(k)
[Figure: the coordinator holds an estimate for F_2]
33
Open Problems
Still no clear separation between the one-shot model and the continuous model
  F_2 is an interesting case
Many other functions f
  Statistics: entropy, heavy hitters
  Geometric measures: diameter, width, …
Variations of the model
  One-way vs two-way communication
  Does having a broadcast channel help?
  Sliding windows?
“Continuous Communication Complexity”?
34
Thank you!