Randomization for Massive and Streaming Data Sets. Rajeev Motwani. CS Forum Annual Meeting, May 21, 2003.

Similar presentations
Optimal Space Lower Bounds for All Frequency Moments David Woodruff MIT
The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Near-Duplicates Detection
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Maintaining Variance and k-Medians over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan Stanford University.
CSCE 3400 Data Structures & Algorithm Analysis
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O'Callaghan, Stanford University ACM Symp. on Principles.
Lecture 11 oct 6 Goals: hashing hash functions chaining closed hashing application of hashing.
Mining Data Streams.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
COMP53311 Data Stream Prepared by Raymond Wong Presented by Raymond Wong
Heavy hitter computation over data stream
Large-scale matching CSE P 576 Larry Zitnick
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Hash Tables How well do hash tables support dynamic set operations? Implementations –Direct address –Hash functions Collision resolution methods –Universal.
Algorithms for massive data sets Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes)
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Ph.D. SeminarUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
CS 361A1 CS 361A (Advanced Data Structures and Algorithms) Lecture 15 (Nov 14, 2005) Hashing for Massive/Streaming Data Rajeev Motwani.
What's Hot and What's Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishnan Rutgers University ACM Principles of Database Systems.
Near Duplicate Detection
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
A survey on stream data mining
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
1 PODS 2002 Motivation. 2 PODS 2002 Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications.
Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.
Algorithms for massive data sets Lecture 1 (Feb 16, 2003) Yossi Matias and Ely Porat (partially based on various presentations & notes)
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Stanford University VLDB 2002.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
August 21, 2002VLDB Gurmeet Singh Manku Frequency Counts over Data Streams Frequency Counts over Data Streams Stanford University, USA.
Adaptive Query Processing in Data Stream Systems Paper written by Shivnath Babu Kamesh Munagala, Rajeev Motwani, Jennifer Widom stanfordstreamdatamanager.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Calculating frequency moments of Data Stream
REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Mining Data Streams (Part 1)
Algorithms for Big Data: Streaming and Sublinear Time Algorithms
CS276A Text Information Retrieval, Mining, and Exploitation
Frequency Counts over Data Streams
Near Duplicate Detection
Finding Frequent Items in Data Streams
Streaming & sampling.
Finding Similar Items: Locality Sensitive Hashing
Sublinear Algorithmic Tools 2
Qun Huang, Patrick P. C. Lee, Yungang Bao
Overview Massive data sets Streaming algorithms Regression
Approximate Frequency Counts over Data Streams
Lecture 2- Query Processing (continued)
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Range-Efficient Computation of F0 over Massive Data Streams
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Minwise Hashing and Efficient Search
(Learned) Frequency Estimation Algorithms
Presentation transcript:

1 Randomization for Massive and Streaming Data Sets. Rajeev Motwani. CS Forum Annual Meeting, May 21, 2003.

2 Data Stream Management Systems
- Traditional DBMS: data stored in finite, persistent data sets
- Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy, …
- Emerging DSMS serve a variety of modern applications: network monitoring and traffic engineering, telecom call records, network security, financial applications, sensor networks, manufacturing processes, web logs and clickstreams, and other massive data sets

3 DSMS – Big Picture. [Architecture diagram: input streams flow into the DSMS, whose scratch store serves registered queries; results are emitted as streamed or stored results, with an archive and stored relations alongside.]

4 Algorithmic Issues
- Computational model: streaming data (or secondary memory), bounded main memory
- Techniques: new paradigms, negative results and approximation, randomization
- Complexity measures: memory, time per item (online, real-time), number of passes (linear scans in secondary memory)

5 Stream Model of Computation. [Diagram: a data stream flows past main memory, which holds only synopsis data structures; time increases along the stream.]
- Memory: poly(1/ε, log N)
- Query/update time: poly(1/ε, log N)
- N: number of items so far, or window size; ε: error parameter

6 “Toy” Example – Network Monitoring. [Diagram: network measurements and packet traces stream into the DSMS; registered monitoring queries run against the scratch store and lookup tables, producing intrusion warnings and online performance metrics, with data archived.]

7 Frequency Related Problems. Analytics on packet headers (IP addresses):
- Find all elements with frequency > 0.1%
- Top-k most frequent elements
- What is the frequency of element 3?
- What is the total frequency of elements between 8 and 14?
- Find elements that occupy 0.1% of the tail
- Mean + variance? Median?
- How many elements have non-zero frequency?

8 Example 1 – Distinct Values
- Input sequence X = x_1, x_2, …, x_n, …
- Domain U = {0, 1, 2, …, u-1}
- Compute D(X), the number of distinct values in X
- Remarks: assume the stream size n is finite/known (in general, n is the window size); the domain could be arbitrary (e.g., text, tuples)

9 Naïve Approach
- Keep a counter C(i) for each domain value i
- Initialize counters C(i) ← 0
- Scan X, incrementing the appropriate counters
- Problem: memory size M << n, while the space used is O(u), and possibly u >> n (e.g., when counting distinct words in a web crawl); see the sketch below
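A minimal sketch of this counter-array approach (the function name is mine, not from the talk); it is exact but uses O(u) space, which is exactly the problem the slide points out:

```python
def count_distinct_naive(stream, u):
    """Naive approach: one counter C(i) per domain value, O(u) space."""
    C = [0] * u
    for x in stream:
        C[x] += 1                      # scan X, incrementing the right counter
    return sum(1 for c in C if c > 0)  # D(X) = number of non-zero counters

print(count_distinct_naive([3, 1, 4, 1, 5], u=10))  # prints 4
```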

10 Negative Result
Theorem: Deterministic algorithms need M = Ω(n log u) bits.
Proof: information-theoretic arguments.
Note: this leaves open randomization and approximation.

11 Randomized Algorithm
- Hash the input stream into a table using h: U → [1..t], storing each distinct value once in the chained hash table
- Analysis: a random h gives few collisions, hence average list size O(n/t)
- Thus space is O(n), since we need t = Ω(n), and time is O(1) per item (expected); a sketch follows
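As a sketch of this slide, Python's built-in set plays the role of the chained hash table h: U → [1..t] (names are mine, not from the talk): expected O(1) time per item, but still O(n) space.

```python
def count_distinct_exact(stream):
    """Exact distinct count via hashing: expected O(1) per item, O(n) space."""
    seen = set()           # hash table storing each distinct value once
    for x in stream:
        seen.add(x)        # with a random hash function, few collisions
    return len(seen)

print(count_distinct_exact([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]))  # prints 7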

12 Improvement via Sampling?
- Sample-based estimation: draw a random sample R (of size r) of the n values in X, compute D(R), and use the estimator E = D(R) · n/r
- Benefit: sublinear space
- Cost: the estimation error is high
- Why? Low-frequency values are underrepresented in the sample (see the sketch below)
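A sketch of the scale-up estimator E = D(R) · n/r, run on a skewed input that shows why low-frequency values make it inaccurate (the data and names here are illustrative, not from the talk):

```python
import random

def estimate_distinct_by_sampling(stream, r):
    """Scale-up estimator: E = D(R) * n / r for a uniform sample R of size r."""
    sample = random.sample(stream, r)
    return len(set(sample)) * len(stream) / r

# One heavy value plus 100 singletons: D(X) = 101, but a small sample
# mostly sees the heavy value, so the estimate swings wildly between runs.
data = [0] * 900 + list(range(1, 101))
print(estimate_distinct_by_sampling(data, 50))
```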

13 Negative Result for Sampling
- Consider any estimator E of D(X) examining r items of X, possibly in an adaptive/randomized fashion.
- Theorem: for any γ > 0, E must have ratio error at least √((n−r)/(2r) · ln(1/γ)) with probability at least γ [Charikar-Chaudhuri-Motwani-Narasayya, PODS 2000; the exact bound was lost in extraction].
- Remarks: with r = n/10, the error is roughly 75% with probability ½; this leaves open randomization/approximation on full scans.

14 Randomized Approximation
- Simplified problem: for a fixed t, is D(X) >> t?
- Choose a hash function h: U → [1..t]
- Initialize the answer to NO; for each x_i, if h(x_i) = t, set the answer to YES
- Observe: only 1 bit of memory (a Boolean flag) is needed!
- Theorem: if D(X) < t, then P[output NO] > 0.25; if D(X) > 2t, then P[output NO] < 0.14

15 Analysis
- Let Y be the set of distinct elements of X
- The output is NO iff no element of Y hashes to t
- P[an element hashes to t] = 1/t
- Thus P[output NO] = (1 − 1/t)^|Y|
- Since |Y| = D(X):
  D(X) < t ⇒ P[output NO] > (1 − 1/t)^t > 0.25
  D(X) > 2t ⇒ P[output NO] < (1 − 1/t)^(2t) < 1/e² ≈ 0.14

16 Boosting Accuracy
- With 1 bit we can distinguish D(X) < t from D(X) > 2t
- Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
- Running O(log n) instances in parallel, for t = 1, 2, 4, 8, …, n, estimates D(X) to within a factor of 2
- The choice of multiplier 2 is arbitrary: a factor of (1+ε) reduces the error to ε
- Theorem: D(X) can be estimated within a factor of (1±ε) with probability (1−δ) using small space; counting one flag bit per instance gives O((1/ε) · log n · log(1/δ)) bits (the slide's exact bound was lost in extraction). A combined sketch follows.
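A sketch combining slides 14-16 (all names are mine): a salted Python hash stands in for h: U → [1..t], a fixed number of repetitions plays the role of the O(log 1/δ) parallel instances, and t doubles until a majority answers NO. For clarity this sketch rescans the input for each test; a true streaming version runs all instances in parallel during a single pass.

```python
def one_bit_test(stream, t, salt):
    """YES iff some element hashes to bucket t under a salted h: U -> [1..t]."""
    return any(hash((x, salt)) % t == t - 1 for x in stream)

def estimate_distinct(stream, reps=16):
    """Estimate D(X) within roughly a factor of 2: double t until a majority
    of independent one-bit tests answers NO."""
    t, n = 1, len(stream)
    while t <= 2 * n:
        yes = sum(one_bit_test(stream, t, salt=(t, i)) for i in range(reps))
        if yes < reps / 2:    # majority NO suggests D(X) < 2t
            return t
        t *= 2
    return t

# Prints a power of 2 within roughly a factor of 2 of the true D(X) = 100.
print(estimate_distinct(list(range(100)) * 5))
```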

17 Example 2 – Elephants-and-Ants
- Identify items whose current frequency exceeds a support threshold s = 0.1% [Jacobson 2000, Estan-Verghese 2001]

18 Algorithm 1: Lossy Counting
Step 1: Divide the stream into windows (Window 1, Window 2, Window 3, …). The window size W is a function of the support s, specified later.

19 Lossy Counting in Action. [Animation: starting from an empty table, per-item counters grow within a window; at the window boundary, all counters are decremented by 1.]

20 Lossy Counting continued. [Animation: at the window boundary all counters are decremented by 1, zeroed counters are dropped, and counting proceeds into the next window.]

21 Error Analysis
How much do we undercount? If the current size of the stream is N and the window size is W = 1/ε, then the number of windows is εN. Each counter is decremented at most once per window boundary, so the frequency error is at most εN.
Rule of thumb: set ε to 10% of the support s. Example: given support frequency s = 1%, set the error frequency ε = 0.1%.

22 Putting it all together…
- Output: elements with counter values exceeding (s−ε)N
- Approximation guarantees: frequencies are underestimated by at most εN; no false negatives; false positives have true frequency at least (s−ε)N
- How many counters do we need? Worst-case bound: (1/ε) · log(εN) counters
- Implementation details are sketched below…
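A sketch of the windowed scheme exactly as these slides describe it (decrement every counter at each window boundary). The published Lossy Counting algorithm of Manku-Motwani (VLDB 2002) uses per-entry deltas instead, so treat this as illustrative:

```python
from collections import defaultdict

def lossy_counting(stream, s, eps):
    """Windowed frequent-items sketch: window size W = 1/eps; at each window
    boundary every counter is decremented and zeroed counters are dropped."""
    W = int(1 / eps)
    counts, n = defaultdict(int), 0
    for x in stream:
        counts[x] += 1
        n += 1
        if n % W == 0:                       # window boundary
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    # Counts are low by at most eps*n, so report items above (s - eps)*n.
    return {x: c for x, c in counts.items() if c >= (s - eps) * n}

heavy = lossy_counting([1, 2, 1, 3, 1, 4] * 500, s=0.2, eps=0.02)
print(heavy)   # item 1 (true frequency 50%) survives; the rare items do not
```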

23 Algorithm 2: Sticky Sampling
- Create counters by sampling the stream; maintain exact counts thereafter
- What should the sampling rate be?

24 Sticky Sampling continued
- For a finite stream of length N: sampling rate = (2/(εN)) · log(1/(sδ)), where δ is the probability of failure
- Same rule of thumb: set ε to 10% of the support s. Example: given support threshold s = 1%, set the error threshold ε = 0.1% and the failure probability δ = 0.01%
- Output: elements with counter values exceeding (s−ε)N
- Same error guarantees as Lossy Counting, but probabilistic: with probability at least 1−δ, frequencies are underestimated by at most εN, there are no false negatives, and false positives have true frequency at least (s−ε)N

25 Number of counters?
- Finite stream of length N: with sampling rate (2/(εN)) · log(1/(sδ)), the expected number of counters is independent of N
- Infinite stream with unknown N: gradually adjust the sampling rate
- In either case, the expected number of counters is (2/ε) · log(1/(sδ)); a sketch follows
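A sketch of Sticky Sampling under my reading of these slides and of Manku-Motwani (VLDB 2002): with t = (1/ε) · log(1/(sδ)), the first 2t items are sampled at rate 1, the next 2t at rate 1/2, the next 4t at rate 1/4, and so on; whenever the rate halves, each existing counter is decremented by a run of coin tosses.

```python
import math
import random

def sticky_sampling(stream, s, eps, delta):
    """Probabilistic frequent-items sketch: counters created by sampling, then
    counted exactly; expected #counters ~ (2/eps) * log(1/(s*delta))."""
    t = (1 / eps) * math.log(1 / (s * delta))
    counts, rate, n, boundary = {}, 1.0, 0, 2 * t
    for x in stream:
        n += 1
        if n > boundary:                     # halve the sampling rate
            rate, boundary = rate / 2, boundary * 2
            for key in list(counts):         # re-toss coins for old counters
                while counts[key] > 0 and random.random() < 0.5:
                    counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
        if x in counts:
            counts[x] += 1                   # exact counting once sampled
        elif random.random() < rate:
            counts[x] = 1                    # counter created by sampling
    return {x: c for x, c in counts.items() if c >= (s - eps) * n}
```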

26 Example 3 – Correlated Attributes
[Table: rows R1…R8 over boolean attribute columns C1…C5; the 0/1 entries were lost in extraction.]
- Input stream: items with boolean attributes
- Matrix view: M(r,c) = 1 iff row r has attribute c
- Goal: identify highly-correlated column pairs

27 Correlation ⇒ Similarity
- View each column as the set of row indexes where it has 1's
- Set similarity (Jaccard measure): sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
- Example (the slide's columns were lost in extraction): sim(Ci, Cj) = 2/5 = 0.4
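In code, with each column represented as its set of row indexes (the sets below are hypothetical, chosen to reproduce the 2/5 example):

```python
def jaccard(ci, cj):
    """Set similarity: |intersection| / |union| of the rows with a 1."""
    return len(ci & cj) / len(ci | cj)

Ci = {1, 2, 3, 4}       # hypothetical row-index sets
Cj = {2, 3, 5}
print(jaccard(Ci, Cj))  # 2/5 = 0.4
```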

28 Identifying Similar Columns?
- Goal: find candidate pairs in small memory
- Signature idea: hash each column Ci to a small signature sig(Ci), so that the set of signatures fits in memory and sim(Ci, Cj) is approximated by sim(sig(Ci), sig(Cj))
- Naïve approach: sample P rows uniformly at random and define sig(Ci) as the P bits of Ci in the sample
- Problem: sparsity. The sample would get only 0's in most columns and would miss the interesting parts of the columns

29 Key Observation
- For columns Ci, Cj, there are four types of rows:
  A: Ci = 1, Cj = 1
  B: Ci = 1, Cj = 0
  C: Ci = 0, Cj = 1
  D: Ci = 0, Cj = 0
- Overloading notation, let A (likewise B, C, D) denote the number of rows of that type
- Observation: sim(Ci, Cj) = A / (A + B + C)

30 Min Hashing
- Randomly permute the rows
- Hash h(Ci) = index of the first row with a 1 in column Ci
- Surprising property: P[h(Ci) = h(Cj)] = sim(Ci, Cj)
- Why? Both equal A/(A+B+C): look down columns Ci, Cj to the first non-type-D row; h(Ci) = h(Cj) iff that row is of type A

31 Min-Hash Signatures
- Pick k random row permutations
- Min-Hash signature sig(C) = the k indexes of the first rows with a 1 in column C
- Similarity of signatures: define sim(sig(Ci), sig(Cj)) = fraction of permutations where the Min-Hash values agree
- Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)

32 Example. [Worked example: a 5-row, 3-column 0/1 matrix (R1…R5 × C1…C3) with signatures S1, S2, S3 computed under Perm 1 = (12345), Perm 2 = (54321), Perm 3 = (34512); column-column similarities are compared against signature-signature similarities. The matrix entries were lost in extraction.]

33 Implementation Trick
- Permuting the rows even once is prohibitive
- Row hashing: pick k hash functions h_k: {1,…,n} → {1,…,O(n)}; the ordering under h_k gives a random row permutation
- This yields a one-pass implementation, sketched below
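A sketch of Min-Hash signatures using the row-hashing trick above (the helper names and the affine hash family are my choices, not from the talk): k random functions r → (a·r + b) mod p stand in for k row permutations, and the signature similarity estimates the Jaccard similarity.

```python
import random

def make_hashes(k, seed=0, p=2_147_483_647):
    """k random affine functions r -> (a*r + b) % p, proxies for permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, p), rng.randrange(p), p) for _ in range(k)]

def minhash_signature(column, hashes):
    """sig(C): for each hash, the minimum hash value over rows where C has a 1."""
    return [min((a * r + b) % p for r in column) for a, b, p in hashes]

def sig_similarity(si, sj):
    """Fraction of hashes whose Min-Hash values agree: estimates sim(Ci, Cj)."""
    return sum(u == v for u, v in zip(si, sj)) / len(si)

hashes = make_hashes(k=200)
Ci, Cj = {1, 2, 3, 4}, {2, 3, 5}        # same example columns as before
print(sig_similarity(minhash_signature(Ci, hashes),
                     minhash_signature(Cj, hashes)))  # close to 0.4
```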

34 Comparing Signatures
- Signature matrix S: rows = hash functions, columns = columns, entries = signatures
- Needed: pair-wise similarity of the signature columns
- Problem: MinHash fits the column signatures in memory, but comparing all signature pairs takes too much time
- Limiting the candidate pairs: Locality Sensitive Hashing (a banding sketch follows)
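The talk points to Locality Sensitive Hashing for this step; one standard realization (not necessarily the exact scheme of the talk) is banding: split each signature into bands of rows and make candidates of any two columns that agree on an entire band.

```python
from collections import defaultdict

def lsh_candidates(signatures, bands, rows_per_band):
    """Banding LSH sketch: columns whose signatures agree on any whole band
    become candidate pairs; only candidates get an exact similarity check."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for col, sig in signatures.items():
            key = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[key].append(col)
        for cols in buckets.values():
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add(tuple(sorted((cols[i], cols[j])))) 
    return candidates
```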

35 Summary
- New algorithmic paradigms are needed for streams and massive data sets
- Negative results abound: we need to approximate
- The power of randomization

36 Thank You!

37 References
- Rajeev Motwani's homepage; the STREAM Project website.
- STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering, 2003.
- Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.
- Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002.
- Manku-Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002.
- Babcock-Datar-Motwani-O'Callaghan. Maintaining Variance and k-Medians over Data Stream Windows. PODS 2003.
- Guha-Meyerson-Mishra-Motwani-O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.

38 References (contd)
- Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing, 2002.
- Babcock-Datar-Motwani. Sampling From a Moving Window over Streaming Data. SODA 2002.
- O'Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2002.
- Guha-Mishra-Motwani-O'Callaghan. Clustering Data Streams. FOCS 2000.
- Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.
- Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.
- Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
- Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.