1 Randomization for Massive and Streaming Data Sets. Rajeev Motwani. CS Forum Annual Meeting, May 21, 2003.

2 Data Stream Management Systems
• Traditional DBMS – data stored in finite, persistent data sets
• Data streams – distributed, continuous, unbounded, rapid, time-varying, noisy, …
• Emerging DSMS – variety of modern applications: network monitoring and traffic engineering, telecom call records, network security, financial applications, sensor networks, manufacturing processes, web logs and clickstreams, massive data sets

3 DSMS – Big Picture
[architecture diagram: input streams feed the DSMS, which maintains a scratch store, an archive, and stored relations; registered queries produce streamed and stored results]

4 Algorithmic Issues
• Computational model: streaming data (or secondary memory), bounded main memory
• Techniques: new paradigms, negative results and approximation, randomization
• Complexity measures: memory, time per item (online, real-time), number of passes (linear scans over secondary memory)

5 Stream Model of Computation
[diagram: a bit stream arrives over increasing time; main memory holds only synopsis data structures]
• Memory: poly(1/ε, log N)
• Query/update time per item: poly(1/ε, log N)
• N: number of items so far, or window size; ε: error parameter

6 “Toy” Example – Network Monitoring
[diagram: network measurements and packet traces stream into the DSMS (scratch store, archive, lookup tables); registered monitoring queries produce intrusion warnings and online performance metrics]

7 Frequency Related Problems
Analytics on packet headers – IP addresses
[histogram of element frequencies over values 1..20]
• Find all elements with frequency > 0.1%
• Top-k most frequent elements
• What is the frequency of element 3?
• What is the total frequency of elements between 8 and 14?
• Find elements that occupy 0.1% of the tail
• Mean and variance? Median?
• How many elements have non-zero frequency?

8 Example 1 – Distinct Values
• Input sequence X = x_1, x_2, …, x_n, …
• Domain U = {0, 1, 2, …, u-1}
• Compute D(X), the number of distinct values in X
• Remarks: assume the stream size n is finite/known (in general, n is the window size); the domain could be arbitrary (e.g., text, tuples)

9 Naïve Approach
• Counter C(i) for each domain value i
• Initialize counters C(i) ← 0
• Scan X, incrementing the appropriate counters
• Problem: space is O(u), and possibly u >> n (e.g., counting distinct words in a web crawl), while memory size M << n

10 Negative Result
• Theorem: deterministic algorithms need M = Ω(n log u) bits
• Proof: information-theoretic arguments
• Note: leaves open randomization/approximation

11 Randomized Algorithm
• Hash the input stream into a hash table with t buckets using a random h: U → [1..t], keeping one entry per distinct value
• Analysis: a random h gives few collisions, so the average list size is O(n/t)
• Thus – Space: O(n), since we need t = Ω(n); Time: O(1) per item [expected]
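A minimal Python sketch of this hash-table scheme; the bucket count t, the salted built-in hash standing in for a random h: U → [1..t], and all names are illustrative assumptions rather than details from the slide.

```python
import random

class ExactDistinctCounter:
    """Exact distinct-value counting with a chained hash table (slide 11)."""

    def __init__(self, t):
        self.t = t                                  # number of buckets
        self.buckets = [[] for _ in range(t)]
        self.seed = random.randrange(1 << 30)       # salt: stand-in for a random h
        self.count = 0

    def _h(self, x):
        return hash((self.seed, x)) % self.t        # h: U -> [0..t-1]

    def add(self, x):
        chain = self.buckets[self._h(x)]
        if x not in chain:                          # expected chain length O(n/t)
            chain.append(x)
            self.count += 1

# Example: 7 distinct values in this small stream
c = ExactDistinctCounter(t=16)
for item in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]:
    c.add(item)
print(c.count)
```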

12 Improvement via Sampling?
• Sample-based estimation: take a random sample R (of size r) of the n values in X, compute D(R), and use the estimator E = D(R) × n/r
• Benefit – sublinear space
• Cost – estimation error is high
• Why? – low-frequency values are underrepresented in the sample

13 Negative Result for Sampling
• Consider an estimator E of D(X) that examines r items of X, possibly in an adaptive/randomized fashion
• Theorem: for any γ with e^{-r} < γ < 1, there is an input on which E has ratio error at least √((n−r)/(2r) · ln(1/γ)) with probability at least γ
• Remarks: r = n/10 ⇒ error about 75% with probability ½; leaves open randomization/approximation on full scans

14 Randomized Approximation
• Simplified problem – for fixed t, is D(X) >> t?
• Choose a hash function h: U → [1..t]; initialize the answer to NO; for each x_i, if h(x_i) = t, set the answer to YES
• Observe – only 1 bit of memory is needed!
• Theorem: if D(X) < t, P[output NO] > 0.25; if D(X) > 2t, P[output NO] < 0.14
[diagram: input stream → h: U → [1..t] → boolean flag YES/NO]
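A one-bit version of this test sketched in Python; the salted built-in hash reduced mod t is an assumption standing in for the ideal random h: U → [1..t].

```python
def threshold_test(stream, t, seed=0):
    """One-bit test from slide 14: output YES iff some item hashes to t.

    If D(X) < t it outputs NO with probability > 0.25;
    if D(X) > 2t it outputs NO with probability < 0.14.
    """
    answer = False                        # the single bit of state
    for x in stream:
        h = (hash((seed, x)) % t) + 1     # stand-in for a random h: U -> [1..t]
        if h == t:
            answer = True
    return answer
```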

15 Analysis
• Let Y be the set of distinct elements of X
• Output is NO ⇔ no element of Y hashes to t
• P[an element hashes to t] = 1/t
• Thus P[output NO] = (1 − 1/t)^|Y|
• Since |Y| = D(X):
  D(X) < t ⇒ P[output NO] > (1 − 1/t)^t > 0.25
  D(X) > 2t ⇒ P[output NO] < (1 − 1/t)^{2t} < 1/e² ≈ 0.14

16 Boosting Accuracy
• With 1 bit we can distinguish D(X) < t from D(X) > 2t
• Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
• Running O(log n) instances in parallel for t = 1, 2, 4, 8, …, n lets us estimate D(X) within a factor of 2
• The choice of multiplier 2 is arbitrary – using a factor (1+ε) grid reduces the error to ε
• Theorem: D(X) can be estimated within a factor (1±ε) with probability (1−δ) using space poly(1/ε, log n, log(1/δ))
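One way to put these pieces together, hedged as a rough sketch: run independent copies of the one-bit test for each t = 1, 2, 4, …, n in a single pass and report the largest t for which few copies answer NO. The copy count and the 0.20 cutoff (chosen between the 0.14 and 0.25 bounds above) are illustrative choices, not values from the slides.

```python
import random

def estimate_distinct(stream, n, copies=32):
    """Estimate D(X) within roughly a factor of 2 (slides 14-16)."""
    thresholds = []
    t = 1
    while t <= n:                         # t = 1, 2, 4, ..., n
        thresholds.append(t)
        t *= 2
    seeds = {t: [random.randrange(1 << 30) for _ in range(copies)]
             for t in thresholds}
    said_yes = {t: [False] * copies for t in thresholds}

    for x in stream:                      # one pass, O(copies * log n) bits of state
        for t in thresholds:
            for i, s in enumerate(seeds[t]):
                if (hash((s, x)) % t) + 1 == t:
                    said_yes[t][i] = True

    estimate = 1
    for t in thresholds:
        no_fraction = 1.0 - sum(said_yes[t]) / copies
        if no_fraction < 0.20:            # strong evidence that D(X) is not << t
            estimate = t
    return estimate
```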

17 Example 2 – Elephants-and-Ants
• Identify items whose current frequency exceeds a support threshold s = 0.1%. [Jacobson 2000, Estan-Verghese 2001]
[diagram: a stream of items]

18 Algorithm 1: Lossy Counting
• Step 1: divide the stream into 'windows'
• Window size W is a function of the support s – specified later…
[diagram: Window 1 | Window 2 | Window 3]

19 Lossy Counting in Action…
• Frequency counts start empty; within a window, increment a counter for each arriving element, creating the counter if needed
• At each window boundary, decrement all counters by 1 and drop counters that reach zero

20 Lossy Counting continued…
• At the next window boundary, again decrement all counters by 1
[diagram: counts from the next window are added on top of the surviving counters]

21 Error Analysis
• How much do we undercount?
• If the current stream length is N and the window size is W = 1/ε, then # windows = εN, so each frequency is undercounted by at most εN (one decrement per window boundary)
• Rule of thumb: set ε = 10% of the support s
• Example: given support frequency s = 1%, set error frequency ε = 0.1%

22 Putting it all together…
• Output: elements with counter values exceeding (s−ε)N
• Approximation guarantees: frequencies underestimated by at most εN; no false negatives; false positives have true frequency at least (s−ε)N
• How many counters do we need? Worst-case bound: (1/ε) log(εN) counters
• Implementation details…
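A compact Python sketch of Lossy Counting as described on slides 18-22, in the simplified form that decrements every counter at each window boundary; the class interface and variable names are assumptions for illustration.

```python
class LossyCounting:
    """Lossy Counting (slides 18-22): windows of size W = 1/epsilon.

    Each counter loses at most one count per window, so any frequency is
    underestimated by at most epsilon * N after N items.
    """

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.window = int(1 / epsilon)          # window size W = 1/epsilon
        self.counts = {}
        self.n = 0                              # items seen so far

    def add(self, x):
        self.counts[x] = self.counts.get(x, 0) + 1
        self.n += 1
        if self.n % self.window == 0:           # window boundary
            for key in list(self.counts):
                self.counts[key] -= 1           # decrement all counters by 1
                if self.counts[key] == 0:
                    del self.counts[key]        # drop empty counters

    def frequent(self, s):
        """Elements whose counter exceeds (s - epsilon) * N."""
        threshold = (s - self.epsilon) * self.n
        return [x for x, c in self.counts.items() if c > threshold]
```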

23 Algorithm 2: Sticky Sampling
• Create counters by sampling the stream
• Once an element has a counter, maintain its exact count thereafter
• What is the sampling rate?
[diagram: stream elements with their associated counters]

24 Sticky Sampling contd…
• For a finite stream of length N: sampling rate = (2/(εN)) · log(1/(sδ)), where δ = probability of failure
• Same rule of thumb: set ε = 10% of the support s
• Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
• Output: elements with counter values exceeding (s−ε)N
• Same error guarantees as Lossy Counting, but probabilistic
• Approximation guarantees (probabilistic): frequencies underestimated by at most εN; no false negatives; false positives have true frequency at least (s−ε)N

25 Number of counters?
• Finite stream of length N: sampling rate (2/(εN)) · log(1/(sδ))
• Infinite stream with unknown N: gradually adjust the sampling rate
• In either case, the expected number of counters is (2/ε) · log(1/(sδ)), independent of N
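A rough Python sketch of Sticky Sampling for the unknown-length case: sample elements into counters and count exactly thereafter, halving the sampling probability on a doubling schedule and thinning existing counters with coin flips when the rate changes. The schedule constants (t = (1/ε)·log(1/(sδ)), first 2t items at rate 1) follow the usual description of the algorithm and should be read as assumptions, not a transcription of the slides.

```python
import math
import random

class StickySampling:
    """Sticky Sampling (slides 23-25) for streams of unknown length."""

    def __init__(self, s, epsilon, delta):
        self.s, self.epsilon = s, epsilon
        self.t = (1 / epsilon) * math.log(1 / (s * delta))
        self.rate = 1                         # items are sampled with prob. 1/rate
        self.next_change = 2 * self.t         # first 2t items at rate 1
        self.counts = {}
        self.n = 0

    def add(self, x):
        self.n += 1
        if self.n > self.next_change:         # halve the sampling probability
            self.rate *= 2
            self.next_change *= 2
            for key in list(self.counts):     # thin counters: toss coins until heads
                while random.random() < 0.5:
                    self.counts[key] -= 1
                    if self.counts[key] == 0:
                        del self.counts[key]
                        break
        if x in self.counts:
            self.counts[x] += 1               # exact counting once sampled
        elif random.random() < 1 / self.rate:
            self.counts[x] = 1                # newly sampled element

    def frequent(self):
        """Elements whose counter exceeds (s - epsilon) * N."""
        threshold = (self.s - self.epsilon) * self.n
        return [x for x, c in self.counts.items() if c > threshold]
```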

26 Example 3 – Correlated Attributes
       C1 C2 C3 C4 C5
  R1    1  1  1  1  0
  R2    1  1  0  1  0
  R3    1  0  0  1  0
  R4    0  0  1  0  1
  R5    1  1  1  0  1
  R6    1  1  1  1  1
  R7    0  1  1  1  1
  R8    0  1  1  1  0
  …
• Input stream – items with boolean attributes
• Matrix – M(r,c) = 1 ⇔ row r has attribute c
• Identify – highly correlated column pairs

27 Correlation ⇒ Similarity
• View each column as the set of row indexes where it has 1's
• Set similarity (Jaccard measure): sim(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|
• Example: if C_i and C_j both have a 1 in 2 rows, and at least one of them has a 1 in 5 rows, then sim(C_i, C_j) = 2/5 = 0.4
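In code, with each column represented as the set of row indexes where it has a 1 (a small illustrative helper, not from the slides):

```python
def jaccard(col_i, col_j):
    """Jaccard similarity of two columns viewed as sets of row indexes."""
    if not col_i and not col_j:
        return 0.0
    return len(col_i & col_j) / len(col_i | col_j)

# Two columns sharing 2 of the 5 rows covered by either column:
print(jaccard({1, 2, 4}, {2, 4, 5, 6}))   # 2/5 = 0.4
```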

28 Identifying Similar Columns?
• Goal – find candidate pairs using small memory
• Signature idea: hash each column C_i to a small signature sig(C_i), so that the set of signatures fits in memory and sim(C_i, C_j) is approximated by sim(sig(C_i), sig(C_j))
• Naïve approach: sample P rows uniformly at random and define sig(C_i) as the P bits of C_i in the sample
• Problem: sparsity – the sample would get only 0's in most columns and miss the interesting parts of the columns

29 Key Observation
• For columns C_i, C_j, there are four types of rows:
       C_i  C_j
   A    1    1
   B    1    0
   C    0    1
   D    0    0
• Overload notation: A = # rows of type A (similarly B, C, D)
• Observation: sim(C_i, C_j) = A / (A + B + C)

30 Min Hashing
• Randomly permute the rows
• Hash h(C_i) = index of the first row (in the permuted order) with a 1 in column C_i
• Surprising property: P[h(C_i) = h(C_j)] = sim(C_i, C_j)
• Why? Both equal A/(A+B+C): look down columns C_i, C_j until the first non-type-D row; h(C_i) = h(C_j) iff that row is of type A

31 Min-Hash Signatures
• Pick k random row permutations
• Min-Hash signature sig(C) = the k indexes of the first rows with a 1 in column C, one per permutation
• Similarity of signatures: define sim(sig(C_i), sig(C_j)) = fraction of permutations where the Min-Hash values agree
• Lemma: E[sim(sig(C_i), sig(C_j))] = sim(C_i, C_j)

32 Example
       C1 C2 C3
  R1    1  0  1
  R2    0  1  1
  R3    1  0  0
  R4    1  0  1
  R5    0  1  0
Signatures (index of the first row with a 1, under each permutation):
                     S1 S2 S3
  Perm1 = (12345):    1  2  1
  Perm2 = (54321):    4  5  4
  Perm3 = (34512):    3  5  4
Similarities:         1-2   1-3   2-3
  Col-Col (Jaccard):  0.00  0.50  0.25
  Sig-Sig:            0.00  0.67  0.00

33 Implementation Trick
• Permuting the rows even once is prohibitive
• Row hashing: pick k hash functions h_k: {1,…,n} → {1,…,O(n)}; the ordering of rows under h_k acts as a random row permutation
• This gives a one-pass implementation
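A one-pass min-hash sketch along these lines in Python; the salted built-in hash family, the row-at-a-time interface, and all names are assumptions for illustration.

```python
import random

class MinHasher:
    """One-pass Min-Hash signatures via row hashing (slides 30-33).

    Instead of permuting rows, k salted hash functions impose k orderings;
    sig[c][i] holds the smallest hash seen among rows where column c has a 1.
    """

    def __init__(self, num_columns, k):
        self.k = k
        self.salts = [random.randrange(1 << 30) for _ in range(k)]
        self.sig = [[float("inf")] * k for _ in range(num_columns)]

    def add_row(self, row_id, row_bits):
        """row_bits[c] is 1 if this row has attribute c."""
        hashes = [hash((salt, row_id)) & 0x7FFFFFFF for salt in self.salts]
        for c, bit in enumerate(row_bits):
            if bit:
                for i in range(self.k):
                    if hashes[i] < self.sig[c][i]:
                        self.sig[c][i] = hashes[i]

    def similarity(self, ci, cj):
        """Fraction of hash functions whose min-hash values agree;
        in expectation this approximates the Jaccard similarity sim(C_i, C_j)."""
        agree = sum(1 for a, b in zip(self.sig[ci], self.sig[cj]) if a == b)
        return agree / self.k
```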

34 Comparing Signatures
• Signature matrix S: rows = hash functions, columns = data columns, entries = Min-Hash values
• Need – pairwise similarity of the signature columns
• Problem: MinHash fits the column signatures in memory, but comparing all signature pairs takes too much time
• Limiting the candidate pairs – Locality Sensitive Hashing

35 Summary
• New algorithmic paradigms are needed for streams and massive data sets
• Negative results abound – we need to approximate
• Power of randomization

36 Thank You!

37 References
Rajeev Motwani (http://theory.stanford.edu/~rajeev)
STREAM Project (http://www-db.stanford.edu/stream)
• STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering, 2003.
• Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.
• Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002.
• Manku-Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002.
• Babcock-Datar-Motwani-O'Callaghan. Maintaining Variance and K-Medians over Data Stream Windows. PODS 2003.
• Guha-Meyerson-Mishra-Motwani-O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.

38 References (contd)
• Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing 2002.
• Babcock-Datar-Motwani. Sampling From a Moving Window Over Streaming Data. SODA 2002.
• O'Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003.
• Guha-Mishra-Motwani-O'Callaghan. Clustering Data Streams. FOCS 2000.
• Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.
• Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.
• Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
• Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.

