1 Randomization for Massive and Streaming Data Sets
Rajeev Motwani
CS Forum Annual Meeting, May 21, 2003
2 Data Stream Management Systems
- Traditional DBMS: data stored in finite, persistent data sets
- Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy, ...
- Emerging DSMS: a variety of modern applications over massive data sets
  - Network monitoring and traffic engineering
  - Telecom call records
  - Network security
  - Financial applications
  - Sensor networks
  - Manufacturing processes
  - Web logs and clickstreams
3 DSMS – Big Picture
[Architecture diagram: input streams and registered queries enter the DSMS, which uses a scratch store, an archive, and stored relations, and produces both streamed and stored results.]
4 Algorithmic Issues
- Computational model
  - Streaming data (or, secondary memory)
  - Bounded main memory
- Techniques
  - New paradigms
  - Negative results and approximation
  - Randomization
- Complexity measures
  - Memory
  - Time per item (online, real-time)
  - Number of passes (linear scans in secondary memory)
5 Stream Model of Computation
[Diagram: a data stream flows past main memory, which holds only synopsis data structures.]
- Memory: poly(1/ε, log N)
- Query/update time: poly(1/ε, log N)
- N: number of items so far, or window size
- ε: error parameter
6 "Toy" Example – Network Monitoring
[Diagram: network measurements and packet traces stream into the DSMS (scratch store, archive, lookup tables); registered monitoring queries produce intrusion warnings and online performance metrics.]
7 Frequency Related Problems
Analytics on packet headers (IP addresses):
- Find all elements with frequency > 0.1%
- Top-k most frequent elements
- What is the frequency of element 3?
- What is the total frequency of elements between 8 and 14?
- Find elements that occupy 0.1% of the tail
- Mean + variance? Median?
- How many elements have non-zero frequency?
8 Example 1 – Distinct Values
- Input sequence X = x1, x2, ..., xn, ...
- Domain U = {0, 1, 2, ..., u-1}
- Compute D(X), the number of distinct values in X
Remarks:
- Assume the stream size n is finite/known (in general, n is the window size)
- The domain could be arbitrary (e.g., text, tuples)
9 Naïve Approach
- Keep a counter C(i) for each domain value i
- Initialize counters C(i) ← 0
- Scan X, incrementing the appropriate counters
Problem:
- Memory size M << n
- Space is O(u), and possibly u >> n (e.g., when counting distinct words in a web crawl)
10 Negative Result
Theorem: Deterministic algorithms need M = Ω(n log u) bits.
Proof: Information-theoretic arguments.
Note: This leaves open randomization and approximation.
11 Randomized Algorithm
- Feed the input stream into a hash table with hash function h: U → [1..t]
- Analysis: a random h gives few collisions and average list size O(n/t)
- Thus:
  - Space: O(n), since we need t = Ω(n)
  - Time: O(1) per item [expected]
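As a minimal illustration of this hash-table approach, here is a sketch in Python; it leans on the built-in set (a chained hash table) as a stand-in for the random hash function h, and the function name is ours, not the slides'.

```python
def count_distinct_exact(stream):
    """Exact D(X) via hashing: expected O(1) time per item, but O(n) space."""
    seen = set()            # hash table with enough buckets, i.e. t = Omega(n)
    for x in stream:
        seen.add(x)         # collisions are resolved inside the set
    return len(seen)

# Example: count_distinct_exact([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]) returns 7
```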
12 Improvement via Sampling?
- Sample-based estimation
  - Take a random sample R (of size r) of the n values in X
  - Compute D(R)
  - Estimator: E = D(R) × n/r
- Benefit: sublinear space
- Cost: estimation error is high
- Why? Low-frequency values are underrepresented in the sample
13 Negative Result for Sampling
- Consider any estimator E of D(X) that examines only r of the n items in X, possibly in an adaptive/randomized fashion
- Theorem: Any such E must, on some inputs, incur large relative error with constant probability
- Example: for r = n/10, the error is at least 75% with probability at least 1/2
- This leaves open randomization/approximation with full scans
14 Randomized Approximation
- Simplified problem: for a fixed t, is D(X) >> t?
- Choose a hash function h: U → [1..t]
- Initialize the answer to NO
- For each xi, if h(xi) = t, set the answer to YES
- Observe: only 1 bit of memory is needed!
Theorem:
- If D(X) < t, then P[output NO] > 0.25
- If D(X) > 2t, then P[output NO] < 0.14
[Diagram: input stream → h: U → [1..t] → Boolean flag → YES/NO]
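A sketch of this one-bit test, under the assumption that salting Python's built-in hash approximates a random h: U → [1..t]; the function name one_bit_test is illustrative.

```python
import random

def one_bit_test(stream, t, seed=0):
    """Answer YES iff some stream element hashes to bucket t.
    Beyond the hash function, only a single bit of state is kept."""
    salt = random.Random(seed).getrandbits(64)   # fixes one hash function h: U -> [1..t]
    flag = False                                 # the one bit of memory
    for x in stream:
        if (hash((salt, x)) % t) + 1 == t:       # bucket in [1..t]
            flag = True
    return "YES" if flag else "NO"

# If D(X) is well below t, the answer is usually NO; if D(X) >> t, usually YES.
```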
15 Analysis
- Let Y be the set of distinct elements of X
- Output is NO iff no element of Y hashes to t
- P[an element hashes to t] = 1/t
- Thus P[output NO] = (1 - 1/t)^|Y|
Since |Y| = D(X):
- D(X) < t ⇒ P[output NO] = (1 - 1/t)^D(X) > (1 - 1/t)^t > 0.25
- D(X) > 2t ⇒ P[output NO] < (1 - 1/t)^(2t) < 1/e^2 ≈ 0.14
16 Boosting Accuracy
- With 1 bit we can distinguish D(X) < t from D(X) > 2t
- Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
- Running O(log n) such groups in parallel, for t = 1, 2, 4, 8, ..., n, estimates D(X) within a factor of 2
- The choice of multiplier 2 is arbitrary; using a factor of (1+ε) reduces the error to ε
Theorem: D(X) can be estimated within a factor of (1 ± ε) with probability (1 - δ) using poly(1/ε, log(1/δ), log n) space.
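The amplification could be wired up as below; this is a sketch under our own choices (about 24·ln(1/δ) repetitions per threshold, and a 0.2 cut-off between the 0.25 and 0.14 probabilities above), and it re-reads the input once per threshold rather than running all instances in a single pass as a true streaming implementation would.

```python
import math
import random

def one_bit_flag(items, t, seed):
    """One instance of the slide-14 test: True iff some item hashes to bucket t."""
    salt = random.Random(seed).getrandbits(64)
    return any((hash((salt, x)) % t) + 1 == t for x in items)

def estimate_distinct(items, n, delta=0.05):
    """Estimate D(X) to within roughly a factor of 2, trying t = 1, 2, 4, ..., n."""
    items = list(items)
    reps = max(8, int(24 * math.log(1.0 / delta)))   # O(log 1/delta) instances per t
    estimate, t = 1, 1
    while t <= n:
        no_frac = sum(not one_bit_flag(items, t, seed=1_000_003 * t + r)
                      for r in range(reps)) / reps
        if no_frac < 0.2:      # likely when D(X) > 2t, unlikely when D(X) < t
            estimate = 2 * t
        t *= 2
    return estimate
```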
17 Example 2 – Elephants-and-Ants
- Identify items in the stream whose current frequency exceeds a support threshold s = 0.1%. [Jacobson 2000, Estan-Verghese 2001]
18 Algorithm 1: Lossy Counting
Step 1: Divide the stream into windows.
- The window size W is a function of the support s (specified later)
19 Lossy Counting in Action
- Counters start out empty; within a window, each arriving item increments its counter (creating it if needed)
- At each window boundary, decrement all counters by 1 and drop counters that reach zero
20 Lossy Counting continued
- At each window boundary, decrement all counters by 1
- The next window's counts are then added on top of the surviving counters
21 Error Analysis
How much do we undercount?
- If the current stream size is N and the window size is W = 1/ε, then the number of windows is εN
- Each counter is decremented at most once per window, so each frequency is undercounted by at most εN (the frequency error)
- Rule of thumb: set ε to 10% of the support s
- Example: given support frequency s = 1%, set error frequency ε = 0.1%
22 Lossy Counting: Putting It All Together
- Output: elements with counter values exceeding (s - ε)N
- Approximation guarantees:
  - Frequencies are underestimated by at most εN
  - No false negatives
  - False positives have true frequency at least (s - ε)N
- How many counters do we need? Worst-case bound: (1/ε) log(εN) counters
- Implementation details: a sketch of the full loop follows below
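A sketch of the simplified Lossy Counting loop described above (window size W = 1/ε, decrement every counter by 1 at each window boundary); the defaults and the function name are our choices.

```python
def lossy_counting(stream, support, epsilon=None):
    """Report items whose counter exceeds (support - epsilon) * N.
    Counts are undercounted by at most epsilon * N, so no true heavy hitter
    is missed; false positives have true frequency >= (support - epsilon) * N."""
    if epsilon is None:
        epsilon = support / 10.0                 # rule of thumb: epsilon = 10% of s
    window = max(1, int(round(1.0 / epsilon)))   # W = 1 / epsilon
    counters, n = {}, 0
    for x in stream:
        n += 1
        counters[x] = counters.get(x, 0) + 1
        if n % window == 0:                      # window boundary
            for key in list(counters):           # decrement all counters by 1
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    threshold = (support - epsilon) * n
    return {x: c for x, c in counters.items() if c > threshold}

# Usage (hypothetical stream of source IPs): lossy_counting(ip_stream, support=0.01)
```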
23 Algorithm 2: Sticky Sampling
- Create counters by sampling the stream
- Once an element has a counter, maintain its exact count thereafter
- What should the sampling rate be?
24 Sticky Sampling continued
- For a finite stream of length N, sampling rate = (2/εN) log(1/(sδ)), where δ is the probability of failure
- Same rule of thumb: set ε to 10% of the support s
- Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
- Output: elements with counter values exceeding (s - ε)N
- Approximation guarantees (probabilistic), the same as for Lossy Counting:
  - Frequencies are underestimated by at most εN
  - No false negatives
  - False positives have true frequency at least (s - ε)N
25 Number of Counters?
- Finite stream of length N: sampling rate is (2/εN) log(1/(sδ))
- Infinite stream with unknown N: gradually adjust the sampling rate
- In either case, the expected number of counters is (2/ε) log(1/(sδ)), independent of N
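A sketch of Sticky Sampling for the finite-stream case described above; the infinite-stream variant that gradually adjusts the sampling rate is omitted, and all names and defaults here are ours.

```python
import math
import random

def sticky_sampling(stream, n, support, epsilon=None, delta=1e-4, seed=0):
    """Create counters by sampling at rate (2/(epsilon*n)) * log(1/(support*delta));
    once an element is tracked, count it exactly."""
    if epsilon is None:
        epsilon = support / 10.0                 # rule of thumb: epsilon = 10% of s
    rate = min(1.0, (2.0 / (epsilon * n)) * math.log(1.0 / (support * delta)))
    rng = random.Random(seed)
    counters, seen = {}, 0
    for x in stream:
        seen += 1
        if x in counters:
            counters[x] += 1                     # exact counting once tracked
        elif rng.random() < rate:
            counters[x] = 1                      # start tracking with probability `rate`
    threshold = (support - epsilon) * seen
    return {x: c for x, c in counters.items() if c > threshold}
```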
26 Example 3 – Correlated Attributes
- Input stream: items with boolean attributes
- Matrix M, with M(r,c) = 1 iff row r has attribute c
- Goal: identify highly-correlated column pairs

      C1 C2 C3 C4 C5
  R1   1  1  1  1  0
  R2   1  1  0  1  0
  R3   1  0  0  1  0
  R4   0  0  1  0  1
  R5   1  1  1  0  1
  R6   1  1  1  1  1
  R7   0  1  1  1  1
  R8   0  1  1  1  0
  ...
27 Correlation ⇒ Similarity
- View each column as the set of row indexes where it has a 1
- Set similarity (Jaccard measure): sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
- Example: if Ci and Cj both have a 1 in 2 rows, and at least one of them has a 1 in 5 rows, then sim(Ci, Cj) = 2/5 = 0.4
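Viewing columns as sets of row indexes, the Jaccard measure is a one-liner; this snippet simply restates the definition above.

```python
def jaccard(col_i, col_j):
    """sim(Ci, Cj) = |Ci intersect Cj| / |Ci union Cj| for columns given as sets of row indexes."""
    ci, cj = set(col_i), set(col_j)
    union = ci | cj
    return len(ci & cj) / len(union) if union else 0.0

# e.g. jaccard({1, 3, 4}, {1, 2, 4}) == 0.5  (columns C1 and C3 of the later example)
```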
28 Identifying Similar Columns
- Goal: find candidate pairs in small memory
- Signature idea
  - Hash each column Ci to a small signature sig(Ci)
  - The set of signatures fits in memory
  - sim(Ci, Cj) is approximated by sim(sig(Ci), sig(Cj))
- Naïve approach: sample P rows uniformly at random and define sig(Ci) as the P bits of Ci in the sample
- Problem: sparsity; the sample would miss the interesting part of the columns and get only 0's
29 Key Observation
For columns Ci, Cj, there are four types of rows:

        Ci  Cj
   A     1   1
   B     1   0
   C     0   1
   D     0   0

Overloading notation, let A = number of rows of type A (similarly B, C, D).
Observation: sim(Ci, Cj) = A / (A + B + C)
30 Min Hashing
- Randomly permute the rows
- Hash h(Ci) = index of the first row (under the permutation) with a 1 in column Ci
- Surprising property: P[h(Ci) = h(Cj)] = sim(Ci, Cj)
- Why? Look down columns Ci, Cj until the first non-type-D row; h(Ci) = h(Cj) iff that row is of type A. Both probabilities equal A/(A+B+C).
31 Min-Hash Signatures
- Pick k random row permutations
- Min-hash signature sig(C) = the k indexes of the first rows with a 1 in column C
- Similarity of signatures: sim(sig(Ci), sig(Cj)) = fraction of permutations on which the min-hash values agree
- Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)
32 Example

      C1 C2 C3
  R1   1  0  1
  R2   0  1  1
  R3   1  0  0
  R4   1  0  1
  R5   0  1  0

  Signatures             S1 S2 S3
  Perm 1 = (1 2 3 4 5)    1  2  1
  Perm 2 = (5 4 3 2 1)    4  5  4
  Perm 3 = (3 4 5 1 2)    3  5  4

  Similarities    1-2   1-3   2-3
  Col-Col        0.00  0.50  0.25
  Sig-Sig        0.00  0.67  0.00
33 Implementation Trick
- Permuting the rows even once is prohibitive
- Row hashing: pick k hash functions hk: {1,...,n} → {1,...,O(n)}
- The ordering of rows under hk gives a (near-)random row permutation
- This enables the one-pass implementation sketched below
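A one-pass sketch of min-hash signatures via row hashing, as described above; the universal-hash parameters and the (row, columns-with-a-1) input format are our own assumptions, not part of the talk.

```python
import random

def minhash_signatures(rows, num_cols, k, seed=0):
    """rows yields (row_index, iterable of columns that have a 1 in that row).
    Uses k hash functions h(r) = (a*r + b) mod p as stand-ins for random permutations."""
    rng = random.Random(seed)
    p = (1 << 61) - 1                                  # a large prime
    hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    sig = [[float("inf")] * num_cols for _ in range(k)]
    for r, cols_with_one in rows:
        hr = [(a * r + b) % p for a, b in hashes]      # hash the row once per function
        for c in cols_with_one:
            for i in range(k):
                if hr[i] < sig[i][c]:
                    sig[i][c] = hr[i]                  # keep the minimum per column
    return sig

def signature_similarity(sig, ci, cj):
    """Fraction of hash functions on which columns ci and cj agree."""
    return sum(row[ci] == row[cj] for row in sig) / len(sig)
```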
34 Comparing Signatures
- Signature matrix S: rows = hash functions, columns = columns, entries = signature values
- Need: pairwise similarity of signature columns
- Problem: MinHash fits the column signatures in memory, but comparing all signature pairs takes too much time
- Limiting the candidate pairs: Locality Sensitive Hashing (one common scheme is sketched below)
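The slides only name Locality Sensitive Hashing; one common way to realize it over a signature matrix is the banding technique sketched below (our choice, not spelled out in the talk): columns whose signatures agree on an entire band of rows land in the same bucket and become candidate pairs, and only those candidates need their similarity computed exactly.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sig, bands):
    """sig: signature matrix (rows = hash functions, columns = original columns).
    Split the rows into `bands`; columns identical within any band become candidates."""
    k, num_cols = len(sig), len(sig[0])
    rows_per_band = k // bands
    candidates = set()
    for b in range(bands):
        lo, hi = b * rows_per_band, (b + 1) * rows_per_band
        buckets = defaultdict(list)
        for c in range(num_cols):
            buckets[tuple(sig[r][c] for r in range(lo, hi))].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates
```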
35 Summary
- New algorithmic paradigms are needed for streams and massive data sets
- Negative results abound, so we need to approximate
- The power of randomization
36 Thank You!
37 References
- Rajeev Motwani (http://theory.stanford.edu/~rajeev)
- STREAM Project (http://www-db.stanford.edu/stream)
- STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering, 2003.
- Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.
- Babcock, Babu, Datar, Motwani, Widom. Models and Issues in Data Stream Systems. PODS 2002.
- Manku, Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002.
- Babcock, Datar, Motwani, O'Callaghan. Maintaining Variance and k-Medians over Data Stream Windows. PODS 2003.
- Guha, Meyerson, Mishra, Motwani, O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.
38 References (continued)
- Datar, Gionis, Indyk, Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing, 2002.
- Babcock, Datar, Motwani. Sampling from a Moving Window over Streaming Data. SODA 2002.
- O'Callaghan, Guha, Mishra, Meyerson, Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003.
- Guha, Mishra, Motwani, O'Callaghan. Clustering Data Streams. FOCS 2000.
- Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.
- Charikar, Chaudhuri, Motwani, Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.
- Gionis, Indyk, Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
- Indyk, Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.