CS 361A (Advanced Data Structures and Algorithms) – Lecture 15 (Nov 14, 2005): Hashing for Massive/Streaming Data. Rajeev Motwani.

Slide 1: CS 361A (Advanced Data Structures and Algorithms) – Lecture 15 (Nov 14, 2005) – Hashing for Massive/Streaming Data – Rajeev Motwani

Slide 2: Hashing for Massive/Streaming Data
- New Topic
  - Novel hashing techniques + randomized data structures
  - Motivated by massive/streaming data applications
- Game Plan
  - Probabilistic Counting: Flajolet-Martin & Frequency Moments
  - Min-Hashing
  - Locality-Sensitive Hashing
  - Bloom Filters
  - Consistent Hashing
  - P2P Hashing
  - ...

Slide 3: Massive Data Sets
- Examples
  - Web (40 billion pages, each 1-10 KB, possibly 100 TB of text)
  - Human Genome (3 billion base pairs)
  - Walmart Market-Basket Data (24 TB)
  - Sloan Digital Sky Survey (40 TB)
  - AT&T (300 million call records per day)
- Presentation?
  - Network Access (Web)
  - Data Warehouse (Walmart)
  - Secondary Store (Human Genome)
  - Streaming (Astronomy, AT&T)

Slide 4: Algorithmic Problems
- Examples
  - Statistics (median, variance, aggregates)
  - Patterns (clustering, associations, classification)
  - Query Responses (SQL, similarity)
  - Compression (storage, communication)
- Novelty?
  - Problem size – simplicity, near-linear time
  - Models – external memory, streaming
  - Scale of data – emergent behavior?

Slide 5: Algorithmic Issues
- Computational Model
  - Streaming data (or, secondary memory)
  - Bounded main memory
- Techniques
  - New paradigms needed
  - Negative results and approximation
  - Randomization
- Complexity Measures
  - Memory
  - Time per item (online, real-time)
  - Number of passes (linear scan in secondary memory)

Slide 6: Stream Model of Computation
- [Figure] A data stream flows into bounded main memory holding synopsis data structures; time increases along the stream
- Memory: poly(1/ε, log N)
- Query/Update time: poly(1/ε, log N)
- N: number of items so far, or the window size
- ε: error parameter

Slide 7: "Toy" Example – Network Monitoring
- [Figure] A DSMS with a scratch store, archive, and lookup tables ingests network measurements and packet traces, holds registered monitoring queries, and emits intrusion warnings and online performance metrics

Slide 8: Frequency-Related Problems
- Find all elements with frequency > 0.1%
- Top-k most frequent elements
- What is the frequency of element 3?
- What is the total frequency of elements between 8 and 14?
- Find elements that occupy 0.1% of the tail
- Mean + variance? Median?
- How many elements have non-zero frequency?
- Motivating setting: analytics on packet headers (IP addresses)

Slide 9: Example 1 – Distinct Values
- Problem
  - Sequence X = x_1, x_2, ..., x_n over a domain U of size m
  - Compute D(X), the number of distinct values in X
- Remarks
  - Assume stream size n is finite/known (e.g., n is the window size)
  - Domain could be arbitrary (e.g., text, tuples)
- Study the impact of different presentation models and different algorithmic models, and thereby understand the model definitions

Slide 10: Naïve Approach
- Counter C(i) for each domain value i
- Initialize counters C(i) ← 0
- Scan X, incrementing the appropriate counters
- Problem
  - Memory size M << n
  - Space O(m) – possibly m >> n (e.g., when counting distinct words in a web crawl)
  - In fact, Time O(m) too – but are there tricks to avoid the explicit initialization?
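The counter scan above can be sketched in a few lines (a minimal illustration; the function name and the dense counter array are mine, not the slides'):

```python
def naive_distinct(stream, m):
    # Naive approach: one counter C(i) per domain value i.
    # Space is O(m) for domain size m, which is exactly the
    # problem the rest of the lecture works around.
    C = [0] * m                        # initialize counters C(i) <- 0
    for x in stream:                   # scan X, incrementing counters
        C[x] += 1
    return sum(1 for c in C if c > 0)  # values with non-zero count
```

For a stream over domain [0..m-1] this is exact, but both the array and its initialization cost O(m), regardless of how small n is.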

Slide 11: Main Memory Approach – Algorithm MM
- Pick r = Θ(n) and a hash function h: U → [1..r]
- Initialize array A[1..r] and D ← 0
- For each input value x_i:
  - Check if x_i occurs in the list stored at A[h(x_i)]
  - If not, D ← D+1 and add x_i to the list at A[h(x_i)]
- Output D
- For a "random" h, there are few collisions and most list sizes are O(1)
- Thus: Space O(n), Time O(1) per item (expected)
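A sketch of Algorithm MM with chaining; the multiplicative hash and the choice r = 2n are illustrative stand-ins for the slides' "random" h and r = Θ(n):

```python
import random

def algorithm_mm(stream, n, seed=0):
    rng = random.Random(seed)
    r = 2 * n                          # r = Theta(n) buckets
    a = rng.randrange(1, 2**31)        # stand-in for a "random" h
    A = [[] for _ in range(r)]         # array A[1..r] of lists
    D = 0
    for x in stream:
        bucket = A[(a * hash(x)) % r]
        if x not in bucket:            # expected O(1) list scan
            D += 1
            bucket.append(x)
    return D
```

With a well-spread hash each list stays O(1) long in expectation, giving O(n) space and expected O(1) time per item, as the slide claims.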

Slide 12: Randomized Algorithms
- Las Vegas (the preceding algorithm): always produces the right answer; the running time is a random variable
- Monte Carlo (will see later): the running time is deterministic; may produce a wrong answer (with bounded probability)
- Atlantic City (sometimes also called M.C.): worst of both worlds

Slide 13: External Memory Model
- Required when input X doesn't fit in memory
- M words of memory
- Input size n >> M
- Data stored on disk
  - Disk block size B << M
  - Unit time to transfer a disk block to memory
- Memory operations are free

Slide 14: Justification?
- Block read/write?
  - Transfer rate ≈ 100 MB/sec (say)
  - Block size ≈ 100 KB (say)
  - Block transfer time << seek time
  - Thus – only count the number of seeks
- Linear scan is even better, as it avoids random seeks
- Free memory operations?
  - Processor speeds – multi-GHz
  - Disk seek time ≈ 0.01 sec

Slide 15: External Memory Algorithm?
- Question – why not just use Algorithm MM?
- Problem
  - Array A does not fit in memory
  - Each value needs a random portion of A
  - So each value involves a disk block read
  - Thus – Ω(n) disk block accesses
- "Linear time" in this model means O(n/B) block accesses

Slide 16: Algorithm EM
- Merge Sort
  - Partition into M/B groups
  - Sort each group (recursively)
  - Merge groups using n/B block accesses (need to hold 1 block from each group in memory)
- Sorting Time – O((n/B) log_{M/B}(n/B)) block accesses
- Compute D(X) – one more pass
- Total Time – O((n/B) log_{M/B}(n/B))
- EXERCISE – verify details/analysis
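Once the external sort has placed equal values adjacently, the extra pass for D(X) only compares neighbors. A sketch of that final pass (the in-memory function below elides the block-I/O layer):

```python
def distinct_from_sorted(sorted_stream):
    # One linear pass over sorted data: D(X) is the number of
    # positions where the value changes.
    D = 0
    prev = object()                    # sentinel unequal to any value
    for v in sorted_stream:
        if v != prev:
            D += 1
            prev = v
    return D
```

Reading the sorted runs block by block, this pass costs the O(n/B) accesses the slide charges it.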

Slide 17: Problem with Algorithm EM
- Need to sort and reorder blocks on disk
- Databases
  - Tuples have multiple attributes
  - Data might need to be ordered by attribute Y
  - Algorithm EM reorders by attribute X
- In any case, sorting is too expensive
- Alternate approach
  - Sample portions of the data
  - Use the sample to estimate the number of distinct values

Slide 18: Sampling-based Approaches
- Naïve sampling
  - Take a random sample R (of size r) of the n values in X
  - Compute D(R) and scale it up to an estimate of D(X)
- Note
  - Benefit – sublinear space
  - Cost – estimation error
  - Why? – low-frequency values are underrepresented in the sample
- Do less naïve approaches exist?
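A sketch of naïve sampling; the slides leave the estimator unspecified, so the scale-up estimator D(R) · n/r used here is one simple (and deliberately poor) choice of mine:

```python
import random

def naive_sampling_estimate(X, r, seed=0):
    rng = random.Random(seed)
    R = rng.sample(X, r)               # random sample of size r
    return len(set(R)) * len(X) / r    # scale D(R) up to estimate D(X)
```

On skewed data the sample misses low-frequency values almost entirely, which is exactly the weakness the next slides make precise.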

Slide 19: Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]
- Consider any estimator E of D(X) that examines r items of X, possibly in an adaptive/randomized fashion
- Theorem: any such E must, with constant probability, incur large relative error; the bound grows as r/n shrinks
- Remarks
  - For r = n/10: error 75% with probability 1/2
  - Leaves open randomization/approximation with full scans

Slide 20: Scenario Analysis
- Scenario A: all n values in X are identical (say V); D(X) = 1
- Scenario B: the distinct values in X are {V, W_1, ..., W_k}
  - V appears n-k times
  - Each W_i appears once
  - The W_i's are randomly distributed in the sequence
  - D(X) = k+1

Slide 21: Proof
- Little birdie – one of Scenarios A or B holds
- Suppose E examines elements X(1), X(2), ..., X(r) in that order
  - The choice of X(i) could be randomized and depend arbitrarily on the values of X(1), ..., X(i-1)
- Lemma: P[X(i) = V | X(1) = X(2) = ... = X(i-1) = V] ≥ 1 − k/(n−i+1)
- Why? The examined values give no information on whether Scenario A or B holds, and the W_i positions are randomly distributed among the unexamined items

Slide 22: Proof (continued)
- Define EV – the event {X(1) = X(2) = ... = X(r) = V}
- By the Lemma, P[EV] ≥ (1 − k/(n−r))^r ≥ 1 − rk/(n−r)
- The last inequality holds because (1 − x)^r ≥ 1 − rx for x ∈ [0, 1]

Slide 23: Proof (conclusion)
- Choose k ≈ (n−r)/(2r) to obtain P[EV] ≥ 1/2
- Thus: under Scenario A, EV always happens; under Scenario B, EV happens with constant probability
- Suppose E returns estimate Z when EV happens
  - Scenario A → D(X) = 1
  - Scenario B → D(X) = k+1
  - Since E cannot distinguish the two scenarios under EV, Z must have worst-case ratio error ≥ √(k+1)

Slide 24: Streaming Model
- Motivating scenarios
  - Data flowing directly from the generating source
  - An "infinite" stream that cannot be stored
  - Real-time requirements for analysis
  - Possibly on disk, streamed via linear scan
- Model
  - Stream – at each step, can request the next input value
  - Assume stream size n is finite/known (fixed later)
  - Memory size M << n
- VERIFY – the earlier algorithms are not applicable

Slide 25: Negative Result
Theorem: deterministic exact algorithms need M = Ω(n log m) bits.
Proof:
- Choose an input X ⊆ U of size n < m
- Denote by S the state of the algorithm A after reading X
- For any y ∈ U, we can check whether y ∈ X by feeding y to A as the next input: D(X) doesn't increase iff y ∈ X
- Information-theoretically, X can therefore be recovered from S
- Thus the states must encode all size-n subsets of U, requiring Ω(n log m) memory bits

Slide 26: Randomized Approximation
- The lower bound does not rule out randomized or approximate solutions
- Algorithm SM – for fixed t, is D(X) >> t?
  - Choose a hash function h: U → [1..t]
  - Initialize answer to NO
  - For each x_i, if h(x_i) = t, set answer to YES
- Theorem
  - If D(X) < t, P[SM outputs NO] > 0.25
  - If D(X) > 2t, P[SM outputs NO] < 1/e^2

Slide 27: Analysis
- Let Y be the set of distinct elements of X
- SM(X) = NO iff no element of Y hashes to t
- P[an element hashes to t] = 1/t
- Thus P[SM(X) = NO] = (1 − 1/t)^{|Y|}
- Since |Y| = D(X):
  - If D(X) < t, P[SM(X) = NO] > (1 − 1/t)^t ≥ 0.25 (for t ≥ 2)
  - If D(X) > 2t, P[SM(X) = NO] < (1 − 1/t)^{2t} < 1/e^2
- Observe – only 1 bit of memory is needed!
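Algorithm SM and its single bit of state can be sketched directly; the linear-congruential hash into [1..t] is an illustrative stand-in for the slides' h:

```python
import random

P = 2**31 - 1                          # a Mersenne prime for hashing

def algorithm_sm(stream, t, seed=0):
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    answer = False                     # the single bit of memory
    for x in stream:
        h = (a * x + b) % P % t + 1    # h(x) in [1..t]
        if h == t:
            answer = True              # some element hit bucket t
    return answer
```

The bit answers "is D(X) on the order of t or more?" with the one-sided error probabilities of the analysis above.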

Slide 28: Boosting Accuracy
- With 1 bit we can probabilistically distinguish D(X) < t from D(X) > 2t
- Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
- Running O(log n) instances in parallel for t = 1, 2, 4, 8, ..., n lets us estimate D(X) within a factor of 2
- The choice of factor 2 is arbitrary – using factor (1+ε) reduces the error to ε
- EXERCISE – verify that D(X) can be estimated within factor (1±ε) with probability (1−δ), and work out the space required
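The doubling trick can be sketched by running one SM bit per threshold t = 1, 2, 4, ... and reporting the largest t that still answers YES. This is a single-copy illustration; the parallel majority-vote copies that drive the error probability down to δ are omitted:

```python
import random

P = 2**31 - 1                          # a Mersenne prime for hashing

def estimate_distinct(stream, n, seed=0):
    rng = random.Random(seed)
    instances = []                     # one SM instance per threshold t
    t = 1
    while t <= n:
        a, b = rng.randrange(1, P), rng.randrange(P)
        instances.append([t, a, b, False])
        t *= 2
    for x in stream:
        for inst in instances:
            t, a, b = inst[0], inst[1], inst[2]
            if (a * x + b) % P % t + 1 == t:
                inst[3] = True         # this instance answers YES
    est = 1
    for t, _, _, bit in instances:
        if bit:
            est = t                    # largest t still answering YES
    return est
```

The estimate is correct up to a constant factor with constant probability; repeating with independent hashes and taking a majority tightens this.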

Slide 29: Sampling versus Counting
- Observe
  - A count is merely an abstraction – subsequent analytics are needed
  - Data tuples – X is merely one of many attributes
  - Databases – selection predicates, join results, ...
  - Networking – need to combine distributed streams
- Single-pass approaches
  - Good accuracy
  - But give only a count – cannot handle such extensions
- Sampling-based approaches
  - Keep actual data – can address the extensions
  - But face a strong negative result

Slide 30: Distinct Sampling for Streams [Gibbons 2001]
- Best of both worlds
  - Good accuracy
  - Maintains a "distinct sample" over the stream
  - Handles the distributed setting
- Basic idea
  - A hash assigns a random "priority" to each domain value
  - Track the highest-priority values seen
  - Keep a random sample of tuples for each such value
  - Achieves relative error ε with probability 1−δ

Slide 31: Hash Function
- Domain U = [0..m−1]
- Hashing
  - Pick random A, B from U, with A > 0
  - g(x) = Ax + B (mod m)
  - h(x) = number of leading 0s in the binary representation of g(x)
- Clearly – h(x) ∈ [0 .. log m]
- Fact – P[h(x) ≥ l] ≈ 2^{−l}
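The level hash can be sketched as follows, assuming m is a power of 2 so that levels are geometrically distributed (the factory-function shape is my packaging, not the slides'):

```python
import random

def make_level_hash(m, seed=0):
    # g(x) = Ax + B (mod m) with random A > 0 and B;
    # h(x) = number of leading 0s of g(x) written in log2(m) bits.
    rng = random.Random(seed)
    A = rng.randrange(1, m)
    B = rng.randrange(m)
    bits = m.bit_length() - 1          # log2(m) for m a power of 2
    def h(x):
        g = (A * x + B) % m
        return bits - g.bit_length()   # leading zeros of g
    return h
```

Higher levels are exponentially rarer, so h(x) behaves like the geometric "priority" that distinct sampling needs.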

Slide 32: Overall Idea
- The hash gives a random "level" for each domain value
- Compute the level of each stream element
- Invariant
  - Current level – cur_lev
  - Sample S – all distinct values scanned so far whose level is at least cur_lev
- Observe
  - A random hash gives a random sample of the distinct values
  - For each sampled value, we can also keep a sample of its tuples

Slide 33: Algorithm DS (Distinct Sample)
- Parameter – memory size M
- Initialize – cur_lev ← 0; S ← empty
- For each input x:
  - L ← h(x)
  - If L ≥ cur_lev, add x to S
  - If |S| > M:
    - Delete from S all values of level cur_lev
    - cur_lev ← cur_lev + 1
- Return |S| · 2^{cur_lev}
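A sketch of Algorithm DS; for self-containedness the level hash here is a fixed digest-based stand-in for the g(x) = Ax + B construction of the hash-function slide:

```python
import hashlib

def level_of(x):
    # Stand-in level hash: leading zeros of a 32-bit digest.
    d = int(hashlib.md5(str(x).encode()).hexdigest(), 16) & 0xFFFFFFFF
    return 32 - d.bit_length()

def algorithm_ds(stream, M):
    cur_lev, S = 0, {}                 # S maps value -> its level
    for x in stream:
        L = level_of(x)
        if L >= cur_lev:
            S[x] = L
        while len(S) > M:              # sample overflow:
            S = {v: l for v, l in S.items() if l > cur_lev}
            cur_lev += 1               # raise the admission bar
    return len(S) * 2**cur_lev         # estimate of D(X)
```

Each overflow halves the expected sample size, so S always fits in M while remaining a uniform sample of the distinct values at the current level.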

Slide 34: Analysis
- Invariant – S contains exactly those distinct values x seen so far with h(x) ≥ cur_lev
- By construction, each distinct value lands in S with probability ≈ 2^{−cur_lev}
- Thus E[|S| · 2^{cur_lev}] = D(X), so the output is an unbiased estimate
- EXERCISE – verify the deviation bound

Slide 35: References
- M. Charikar, S. Chaudhuri, R. Motwani, V. Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.
- P. Flajolet, G. N. Martin. Probabilistic Counting Algorithms for Data Base Applications. JCSS 1985.
- N. Alon, Y. Matias, M. Szegedy. The Space Complexity of Approximating the Frequency Moments. STOC 1996.
- P. Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Value Queries and Event Reports. VLDB 2001.