Algorithms for massive data sets Lecture 1 (Feb 16, 2003) Yossi Matias and Ely Porat (partially based on various presentations & notes)

Massive Data Sets
Examples
–Web (2-4 billion pages, each 1-10 KB, possibly 20 TB of text)
–Human Genome (3 billion base pairs)
–Walmart market-basket data (24 TB)
–Sloan Digital Sky Survey (40 TB)
–AT&T (300 million call records per day)
Presentation?
–Network access (Web)
–Data warehouse (Walmart)
–Secondary store (Human Genome)
–Streaming (astronomy, AT&T)

Algorithmic Problems?
Examples
–Statistics (median, variance, aggregates)
–Patterns (clustering, associations, classification)
–Query responses (SQL, similarity)
–Compression (storage, communication)
Novelty?
–Problem size – simplicity, near-linear time
–Models – external memory, streaming
–Scale of data – emergent behavior?

Today – Illustrate Models
Computational models
–Main memory
–External memory
–Streaming data
Algorithmic models
–Deterministic
–Randomized (Las Vegas versus Monte Carlo)
–Approximation
Measures
–Time/Passes
–Space

Example – Distinct Values
Problem
–Sequence X = x1, x2, …, xn
–Domain U = {1, 2, …, m}
–Determine the number of distinct values in X, say D(X)
–Note – the domain could be arbitrary (e.g., text, tuples)
Study the impact of …
–different presentation models
–different algorithmic models
… and thereby understand model definition

Naïve Approach
–Counter C(i) for each domain value i
–Initialize counters C(i) ← 0
–Scan X, incrementing the appropriate counters
–Works in any model
Problems?
–Space O(m), and m could be much larger than n (e.g., when counting distinct words in a web crawl)
–In fact, Time O(m) as well, just for initialization – though there are tricks to avoid it
(A small sketch of the naïve approach follows.)
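A minimal sketch of the naïve approach, assuming the domain is the integers 0..m-1; the array of counters makes the O(m) space cost explicit (the function name is mine, not the slide's):

def naive_distinct_count(stream, m):
    """Count distinct values in `stream`, whose values lie in 0..m-1.

    Space is O(m) regardless of how many distinct values actually appear.
    """
    counts = [0] * m          # one counter per domain value: O(m) space (and init time)
    distinct = 0
    for x in stream:
        if counts[x] == 0:    # first occurrence of this value
            distinct += 1
        counts[x] += 1
    return distinct

# Example: 4 distinct values drawn from a domain of size 1000
print(naive_distinct_count([3, 7, 3, 42, 999, 7], 1000))  # -> 4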

Main Memory Approach
Algorithm MM
–Pick r = θ(n) and a hash function h: U → [1..r]
–Initialize array A[1..r] of empty lists and D = 0
–For each input value x: check if x occurs in the list stored at A[h(x)]; if not, D ← D+1 and add x to the list at A[h(x)]
–Output D
For a "random" h, there are few collisions and most list sizes are O(1)
Thus
–Expected Time O(n) [in the worst case]
–Space O(n)
(A short sketch of Algorithm MM follows.)
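A minimal sketch of Algorithm MM, assuming the input fits in main memory; Python's built-in hashing stands in for the "random" hash function h:

def mm_distinct_count(values, r=None):
    """Algorithm MM: chain-hash every value and count first occurrences.

    Expected time O(n) and space O(n) for n input values.
    """
    values = list(values)
    r = r or max(1, len(values))        # r = Theta(n) buckets
    buckets = [[] for _ in range(r)]    # array A[1..r] of lists
    D = 0
    for x in values:
        chain = buckets[hash(x) % r]    # the list stored at A[h(x)]
        if x not in chain:              # x not seen before
            chain.append(x)
            D += 1
    return D

print(mm_distinct_count(["to", "be", "or", "not", "to", "be"]))  # -> 4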

Randomized Algorithms
Las Vegas (the preceding algorithm)
–always produces the right answer
–running time is a random variable
Monte Carlo (will see later)
–running time is deterministic
–may produce a wrong answer (with bounded probability)
Atlantic City (sometimes also called M.C.)
–worst of both worlds

External Memory Model
–Required when the input X doesn't fit in memory
–M words of memory
–Input size n >> M
–Data stored on disk
–Disk block size B << M
–Unit time to transfer a disk block to memory
–Memory operations are free

Justification?
Block read/write?
–Transfer rate ≈ 100 MB/sec (say)
–Block size ≈ 100 KB (say)
–Block transfer time << seek time
–Thus – only count the number of seeks
Linear scan
–even better, as it avoids random seeks
Free memory operations?
–Processor speeds – multi-GHz
–Disk seek time ≈ 0.01 sec

External Memory Algorithm?
Question – why not just use Algorithm MM?
Problem
–Array A does not fit in memory
–For each value, we need a random portion of A
–Each value involves a disk block read
–Thus – Ω(n) disk block accesses
Linear time – O(n/B) in this model

Algorithm EM
Merge sort
–Partition into M/B groups
–Sort each group (recursively)
–Merge the groups using n/B block accesses (need to hold 1 block from each group in memory)
Sorting time – O((n/B) log_{M/B}(n/B)) block accesses
Compute D(X) – one more pass over the sorted data, counting positions where the value changes
Total time – O((n/B) log_{M/B}(n/B))
EXERCISE – verify details/analysis
(A small sketch of the final counting pass follows.)
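Once the data is sorted, D(X) can be computed in a single streaming pass by counting the positions where the value changes; a minimal sketch (the external sort itself is assumed to have already happened, e.g. via the on-disk merge sort above):

def distinct_in_sorted(sorted_values):
    """One pass over sorted data: D(X) = number of positions where the value changes."""
    distinct = 0
    prev = object()                 # sentinel that compares unequal to every real value
    for x in sorted_values:
        if x != prev:               # start of a new run of equal values
            distinct += 1
            prev = x
    return distinct

# Example: pretend this iterator is a block-by-block scan of the externally sorted file
print(distinct_in_sorted(iter([1, 1, 2, 5, 5, 5, 9])))  # -> 4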

Problem with Algorithm EM
Need to sort and reorder blocks on disk
Databases
–Tuples with multiple attributes
–Data might need to be ordered by attribute Y
–Algorithm EM reorders by attribute X
In any case, sorting is too expensive
Alternate approach
–Sample portions of the data
–Use the sample to estimate the number of distinct values

Sampling-based Approaches
Benefit – sublinear space
Cost – estimation error
Naïve approach
–Random sample R (of size r) of the n values in X
–Compute D(R)
–Use D(R) as the basis for an estimate of D(X)
–Error – low-frequency values are underrepresented
Do less naïve approaches exist?

Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]
Theorem: Let E be an estimator for D(X) examining r < n values in X, possibly in an adaptive and randomized order. Then, for any γ > e^{-r}, E must have relative (ratio) error at least √(((n−r)/(2r)) · ln(1/γ)) with probability at least γ.
Example
–Say, r = n/5 and γ = 1/2
–Then the error is at least √(2 ln 2) ≈ 1.2, i.e., roughly 20%, with probability 1/2

Scenario Analysis
Scenario A:
–all values in X are identical (say V)
–D(X) = 1
Scenario B:
–the distinct values in X are {V, W1, …, Wk}
–V appears n-k times
–each Wi appears once
–the positions of the Wi's are random
–D(X) = k+1

Proof
Little birdie – the input is one of Scenarios A or B only
Suppose
–E examines elements X(1), X(2), …, X(r) in that order
–the choice of X(i) could be randomized and depend arbitrarily on the values of X(1), …, X(i-1)
Lemma: P[ X(i)=V | X(1)=X(2)=…=X(i-1)=V ] ≥ 1 − k/(n−i+1) ≥ 1 − k/(n−r)
Why?
–No information on whether we are in Scenario A or B
–The Wi values are randomly distributed among the remaining n−i+1 positions

Proof (continued)
Define EV – the event {X(1)=X(2)=…=X(r)=V}
Then P[EV] = Π_{i=1..r} P[X(i)=V | X(1)=…=X(i-1)=V] ≥ (1 − k/(n−r))^r ≥ e^{−2kr/(n−r)}
The last inequality holds because 1 − x ≥ e^{−2x} for 0 ≤ x ≤ 1/2 (assuming k ≤ (n−r)/2)

Proof (conclusion)
Choose k = ((n−r)/(2r)) · ln(1/γ) to obtain e^{−2kr/(n−r)} = γ
Thus:
–Scenario A ⇒ P[EV] = 1
–Scenario B ⇒ P[EV] ≥ γ
Suppose
–E returns estimate Z when EV happens (EV looks identical in both scenarios, so Z is the same)
–Scenario A ⇒ D(X)=1
–Scenario B ⇒ D(X)=k+1
–Z must have worst-case ratio error > √(k+1) ≥ √(((n−r)/(2r)) · ln(1/γ)) in one of the scenarios, with probability at least γ
(The calculation is written out below.)
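The whole calculation, written out; this reconstructs the standard argument behind the cited result, with the constants as stated in the theorem above:

\Pr[EV] \;=\; \prod_{i=1}^{r} \Pr\big[X(i)=V \,\big|\, X(1)=\dots=X(i-1)=V\big]
        \;\ge\; \Big(1-\tfrac{k}{n-r}\Big)^{r}
        \;\ge\; e^{-2kr/(n-r)} .

\text{Setting } k=\frac{n-r}{2r}\ln\frac{1}{\gamma} \text{ gives } \Pr[EV]\ge\gamma.
\text{On } EV \text{ the estimator outputs the same } Z \text{ in both scenarios, so}
\max\Big(\frac{Z}{1},\frac{k+1}{Z}\Big) \;\ge\; \sqrt{k+1}
      \;\ge\; \sqrt{\frac{n-r}{2r}\ln\frac{1}{\gamma}} .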

Algorithms for massive data sets Lecture 2 (Feb 23, 2003) Yossi Matias (partially based on various presentations & notes)

Streaming Model
Motivating scenarios
–Data flowing directly from the generating source
–"Infinite" stream cannot be stored
–Real-time requirements for analysis
–Possibly from disk, streamed via linear scan
Model
–Stream – at each step we can request the next input value
–Assume stream size n is finite/known (fix later)
–Memory size M << n
VERIFY – earlier algorithms are not applicable

Negative Result
Theorem: Deterministic algorithms need M = Ω(n log m) bits
Proof:
–Assume a deterministic algorithm A uses o(n log m) bits
–Choose an input X ⊆ U of size n < m
–Denote by S the state of A after reading X
–We can check whether any y ∈ U belongs to X by feeding y to A as the next input: D(X) doesn't increase iff y ∈ X
–Information-theoretically, X can therefore be recovered from S
–Thus the states must encode all possible choices of X, requiring Ω(n log m) memory bits (see the counting step below)
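The information-theoretic step, spelled out (assuming, as is implicit on the slide, that m is polynomially larger than n, so that log(m/n) = Θ(log m)):

\#\{\text{distinct inputs } X\} \;=\; \binom{m}{n}
\quad\Longrightarrow\quad
\#\{\text{memory bits}\} \;\ge\; \log_2 \binom{m}{n}
\;\ge\; n \log_2\frac{m}{n} \;=\; \Omega(n \log m).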

We've seen so far
–The MM algorithm is ineffective for massive data sets
–The EM algorithm is expensive
–Sampling is ineffective
–There is a lower bound for deterministic algorithms in the streaming model
–The lower bound does not rule out randomized or approximate solutions

Randomized Approximation (based on [Flajolet-Martin 1983, Alon-Matias-Szegedy 1996])
Theorem: For every c > 2 there exists an algorithm that, given a sequence A of n members of U={1,2,…,u}, computes a number d' using O(log u) memory bits, such that the probability that max(d'/d, d/d') > c is at most 2/c.
Algorithm sketch:
–A bit vector BV represents the set
–Let b be the smallest integer s.t. 2^b > u; let F = GF(2^b); pick r, s at random from F
–For each a in A, compute h(a) = r·a + s, a value of the form 101****10….0, and let k be the number of trailing zeros; set the k'th bit of BV (Pr[h(a) has exactly k trailing zeros] ≈ 2^{-(k+1)})
–The estimate is 2^{max bit set}
(A simplified sketch in code follows.)
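A simplified sketch of this estimator, under loose assumptions: instead of a true GF(2^b) hash it uses an affine hash modulo a power of two, which is enough to illustrate the trailing-zeros / max-bit idea (the function names are mine, not the slide's):

import random

def trailing_zeros(x, b):
    """Number of trailing zero bits of x, treating 0 as having b of them."""
    if x == 0:
        return b
    k = 0
    while x & 1 == 0:
        x >>= 1
        k += 1
    return k

def fm_estimate(stream, u):
    """Flajolet-Martin style estimate of the number of distinct values in 0..u-1."""
    b = u.bit_length()                   # smallest b with 2^b > u
    r = random.randrange(1, 2 ** b) | 1  # odd multiplier so the map is a bijection mod 2^b
    s = random.randrange(2 ** b)
    max_bit = -1
    for a in stream:
        h = (r * a + s) % (2 ** b)
        max_bit = max(max_bit, trailing_zeros(h, b))  # only the highest set bit matters here
    return 2 ** max_bit if max_bit >= 0 else 0

data = [random.randrange(1000) for _ in range(10000)]
print(len(set(data)), fm_estimate(data, 1000))  # exact count vs. rough estimate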

Randomized Approximation (2) (based on [Indyk-Motwani 1998])
Algorithm SM – for a fixed t, is D(X) >> t?
–Choose a hash function h: U → [1..t]
–Initialize the answer to NO
–For each x in X, if h(x) = t, set the answer to YES
Theorem:
–If D(X) < t, then P[SM outputs NO] > 0.25
–If D(X) > 2t, then P[SM outputs NO] < 1/e^2

Analysis
Let Y be the set of distinct elements of X
SM(X) = NO iff no element of Y hashes to t
P[an element hashes to t] = 1/t
Thus P[SM(X) = NO] = (1 − 1/t)^{|Y|}
Since |Y| = D(X):
–If D(X) < t, then P[SM(X) = NO] ≥ (1 − 1/t)^{t−1} ≥ 1/e > 0.25
–If D(X) > 2t, then P[SM(X) = NO] ≤ (1 − 1/t)^{2t} ≤ 1/e^2
Observe – only 1 bit of memory is needed!
(A minimal sketch of SM follows.)
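A minimal sketch of the one-bit test, assuming a random hash into [1..t]; here a salted built-in hash stands in for it, and the function name is mine:

import random

def sm_test(stream, t, seed=None):
    """Algorithm SM: returns YES (True) if some element hashes to bucket t.

    Tends to answer NO when D(X) < t and YES when D(X) > 2t; the state is a single bit.
    """
    salt = seed if seed is not None else random.randrange(2 ** 32)
    answer = False                           # the one bit of memory
    for x in stream:
        if (hash((salt, x)) % t) + 1 == t:   # h(x) in [1..t]; did we hit bucket t?
            answer = True
    return answer

data = list(range(100))                      # 100 distinct values
print(sm_test(data, t=8), sm_test(data, t=1000))  # likely True, likely False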

Boosting Accuracy
With 1 bit we can probabilistically distinguish D(X) < t from D(X) > 2t
Running O(log 1/δ) instances in parallel (and combining their answers) reduces the error probability to any δ > 0
Running O(log n) such tests in parallel, for t = 1, 2, 4, 8, …, n, lets us estimate D(X) within a factor of 2
The choice of factor 2 is arbitrary – using a factor of (1+ε) reduces the error to ε
EXERCISE – verify that D(X) can be estimated within a factor of (1±ε) with probability (1−δ) using space that grows only with 1/ε, log(1/δ), and log n
(A sketch of the factor-2 estimator built from SM follows.)
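A sketch of the factor-2 estimator, under the assumptions above: it runs several independent copies of the SM test for each power-of-two threshold t and keeps the largest t that still looks "large". It reuses sm_test from the previous sketch; the 21 copies and the 0.75 vote threshold are illustrative choices of mine, not values from the slide:

import random

def estimate_distinct(stream, n, copies=21):
    """Estimate D(X) to within a small constant factor by combining SM tests."""
    stream = list(stream)       # a true streaming version would run all tests in one pass
    estimate = 1
    t = 1
    while t <= n:
        votes = sum(sm_test(stream, t, seed=hash((t, c))) for c in range(copies))
        if votes >= 0.75 * copies:   # strong evidence that D(X) is not much smaller than t
            estimate = t
        t *= 2
    return estimate

data = [random.randrange(300) for _ in range(5000)]
print(len(set(data)), estimate_distinct(data, n=len(data)))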

Sampling versus Counting
Observe
–A count is merely an abstraction – subsequent analytics are needed
–Data tuples – X is merely one of many attributes
–Databases – selection predicates, join results, …
–Networking – need to combine distributed streams
Single-pass approaches
–Good accuracy
–But give only a count – cannot handle these extensions
Sampling-based approaches
–Keep actual data – can address the extensions
–Strong negative result

Distinct Sampling for Streams [Gibbons 2001]
Best of both worlds
–Good accuracy
–Maintains a "distinct sample" over the stream
–Handles the distributed setting
Basic idea
–Hash – assigns a random "priority" to each domain value
–Track the highest-priority values seen
–Keep a random sample of tuples for each such value
–Relative error ε with probability 1−δ

Hash Function
Domain U = [0..m-1]
Hashing
–Random A, B from U, with A > 0
–g(x) = Ax + B (mod m)
–h(x) = number of leading 0s in the binary representation of g(x)
Clearly – h(x) takes values in [0..log m]
Fact – P[h(x) ≥ l] = 1/2^l
(A small sketch of this hash follows.)
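A small sketch of this priority hash, assuming m is a power of two so the leading-zero count is taken over log2(m) bits (the helper name is mine):

import random

def make_priority_hash(m):
    """Return h(x) = number of leading zero bits of g(x) = (A*x + B) mod m.

    Assumes m is a power of two; the count is over bits = log2(m) positions.
    """
    bits = m.bit_length() - 1
    A = random.randrange(1, m)        # random A > 0
    B = random.randrange(m)           # random B
    def h(x):
        g = (A * x + B) % m
        return bits - g.bit_length()  # leading zeros in a `bits`-wide representation
    return h

h = make_priority_hash(2 ** 20)
print(h(12345), h(98765))             # random "levels"; P[h(x) >= l] ~ 2^-l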

Overall Idea
Hash ⇒ a random "level" for each domain value
Compute the level of each stream element
Invariant
–Current level – cur_lev
–Sample S – all distinct values scanned so far whose level is at least cur_lev
Observe
–Random hash ⇒ random sample of the distinct values
–For each sampled value ⇒ we can keep a sample of its tuples

Algorithm DS (Distinct Sample)
Parameters – memory size M
Initialize – cur_lev ← 0; S ← empty
For each input x
–L ← h(x)
–If L ≥ cur_lev then add x to S
–If |S| > M then delete from S all values of level cur_lev and set cur_lev ← cur_lev + 1
Return |S| · 2^{cur_lev}
(A sketch of Algorithm DS follows.)
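A sketch of Algorithm DS using the priority hash above; storing each value's level alongside it makes the eviction step simple (this arrangement is my illustrative choice, not prescribed by the slide):

import random

def distinct_sample_estimate(stream, m, M):
    """Algorithm DS: estimate the number of distinct values using O(M) memory."""
    h = make_priority_hash(m)      # the priority hash sketched above
    cur_lev = 0
    S = {}                         # value -> its level (only levels >= cur_lev are kept)
    for x in stream:
        L = h(x)
        if L >= cur_lev:
            S[x] = L
        while len(S) > M:          # sample too big: evict the lowest level, raise cur_lev
            S = {v: l for v, l in S.items() if l > cur_lev}
            cur_lev += 1
    return len(S) * (2 ** cur_lev)

data = [random.randrange(50000) for _ in range(200000)]
print(len(set(data)), distinct_sample_estimate(data, m=2 ** 20, M=256))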

Analysis
Invariant – S contains all distinct values x seen so far such that h(x) ≥ cur_lev
By construction, P[h(x) ≥ cur_lev] = 1/2^{cur_lev}
Thus E[|S| · 2^{cur_lev}] = D(X)
EXERCISE – verify the deviation bound
(A short derivation follows.)
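The expectation calculation, written out for a fixed level l; it treats the hash values as independent, which is an idealization of the pairwise-independent hash actually used:

\mathbb{E}\big[\,|S|\,\big]
  \;=\; \sum_{x \in \mathrm{distinct}(X)} \Pr[h(x) \ge l]
  \;=\; \frac{D(X)}{2^{l}},
\qquad\text{so}\qquad
\mathbb{E}\big[\,|S| \cdot 2^{l}\,\big] \;=\; D(X).

\mathrm{Var}\big[\,|S|\,\big] \;\le\; \frac{D(X)}{2^{l}}
\;\;\Longrightarrow\;\;
\text{Chebyshev gives relative error } \varepsilon
\text{ with constant probability once } D(X)/2^{l} = \Omega(1/\varepsilon^{2}).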

Summary
–Algorithmic paradigms vary dramatically across models
–Negative results abound
–Power of randomization
–Need to approximate
References
–Towards Estimation Error Guarantees for Distinct Values. Charikar, Chaudhuri, Motwani, and Narasayya. PODS 2000.
–Probabilistic counting algorithms for data base applications. Flajolet and Martin. JCSS 31, 2, 1985.
–The space complexity of approximating the frequency moments. Alon, Matias, and Szegedy. STOC 1996.
–Distinct Sampling for Highly-Accurate Answers to Distinct Value Queries and Event Reports. Gibbons. VLDB 2001.