Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman.

Slides:



Advertisements
Similar presentations
Sampling From a Moving Window Over Streaming Data Brian Babcock * Mayur Datar Rajeev Motwani * Speaker Stanford University.
Advertisements

Mining Data Streams (Part 1)
Ariel Rosenfeld Network Traffic Engineering. Call Record Analysis. Sensor Data Analysis. Medical, Financial Monitoring. Etc,
1 A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch Rensselaer Polytechnic Institute Srikanta Tirthapura.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
Copyright © 2009 Curt Hill Self Organizing Lists Another form of searchable list.
Indian Statistical Institute Kolkata
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
COMP53311 Data Stream Prepared by Raymond Wong Presented by Raymond Wong
Extension of DGIM to More Complex Problems
Hash Table indexing and Secondary Storage Hashing.
1 Overview of Storage and Indexing Chapter 8 (part 1)
FALL 2004CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
CURE Algorithm Clustering Streams
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
A survey on stream data mining
Tirgul 6 B-Trees – Another kind of balanced trees Problem set 1 - some solutions.
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
1 Mining Data Streams The Stream Model Sliding Windows Counting 1’s.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Copyright © Cengage Learning. All rights reserved. CHAPTER 11 ANALYSIS OF ALGORITHM EFFICIENCY ANALYSIS OF ALGORITHM EFFICIENCY.
1 Overview of Storage and Indexing Chapter 8 1. Basics about file management 2. Introduction to indexing 3. First glimpse at indices and workloads.
1 Mining Data Streams The Stream Model Sliding Windows Counting 1’s.
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
- ABHRA DASGUPTA Solving Adhoc and Math related problems.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
PostgreSQL and relational databases As well as assignment 4…
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Database Management 9. course. Execution of queries.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
Memory Management Techniques
1 CS 430 Database Theory Winter 2005 Lecture 16: Inside a DBMS.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
1 Overview of Storage and Indexing Chapter 8 (part 1)
Storage and Indexing1 Overview of Storage and Indexing.
MA/CSSE 473 Day 28 Dynamic Programming Binomial Coefficients Warshall's algorithm Student questions?
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
Virtual Memory The memory space of a process is normally divided into blocks that are either pages or segments. Virtual memory management takes.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Mining of Massive Datasets Ch4. Mining Data Streams
FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
Mining of Massive Datasets Ch4. Mining Data Streams.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Sorting.
Mining Data Streams (Part 1)
What about streaming data?
Indexing Goals: Store large files Support multiple search keys
The Stream Model Sliding Windows Counting 1’s
The Stream Model Sliding Windows Counting 1’s
Web-Mining Agents Stream Mining
COMP9313: Big Data Management Lecturer: Xin Cao Course web site:
Mining Data Streams The Stream Model Sliding Windows Counting 1’s Exponentially Decaying Windows Usually, we mine data that sits somewhere in a database.
Streaming & sampling.
Hash-Based Indexes Chapter 11
Mining Data Streams (Part 1)
Chapter 15 QUERY EXECUTION.
COMP9313: Big Data Management Lecturer: Xin Cao Course web site:
Counting How Many Elements Computing “Moments”
Mining Data Streams Some of these slides are based on Stanford Mining Massive Data Sets Course slides at
Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani
Range-Efficient Computation of F0 over Massive Data Streams
Data Intensive and Cloud Computing Data Streams Lecture 10
Introduction to Stream Computing and Reservoir Sampling
CENG 351 Data Management and File Structures
Counting Bits.
Presentation transcript:

Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman

2  In a DBMS, input is under the control of the programming staff.  SQL INSERT commands or bulk loaders.  Stream management is important when the input rate is controlled externally.  Example: Google search queries.

3  Input tuples enter at a rapid rate, at one or more input ports.  The system cannot store the entire stream accessibly.  How do you make critical calculations about the stream using a limited amount of (primary or secondary) memory?

4 1. Ad-hoc queries: Normal queries asked one time about streams.  Example: What is the maximum value seen so far in stream S? 2. Standing queries: Queries that are, in principle, asked about the stream at all times.  Example: Report each new maximum value ever seen in stream S.

5 Limited Working Storage... 1, 5, 2, 7, 0, 9, 3... a, r, v, t, y, h, b... 0, 0, 1, 0, 1, 1, 0 time Streams Entering Output Archival Storage Processor Ad-Hoc Queries Standing Queries

6  Mining query streams.  Google wants to know what queries are more frequent today than yesterday.  Mining click streams.  Yahoo! wants to know which of its pages are getting an unusual number of hits in the past hour.  Often caused by annoyed users clicking on a broken page.  IP packets can be monitored at a switch.  Gather information for optimal routing.  Detect denial-of-service attacks.

7  A useful model of stream processing is that queries are about a window of length N – the N most recent elements received.  Alternative: elements received within a time interval T.  Interesting case: N is so large it cannot be stored in main memory.  Or, there are so many streams that windows for all do not fit in main memory.

8 q w e r t y u i o p a s d f g h j k l z x c v b n m Past Future

 Stream of integers, window of size N.  Standing query: what is the average of the integers in the window?  For the first N inputs, sum and count to get the average.  Afterward, when a new input i arrives, change the average by adding (i - j)/N, where j is the oldest integer in the window.  Good: O(1) time per input.  Bad: Requires the entire window in memory. 9

11  Problem: given a stream of 0’s and 1’s, be prepared to answer queries of the form “how many 1’s in the last k bits?” where k ≤ N.  Obvious solution: store the most recent N bits.  But answering the query will take O(k) time.  Very possibly too much time.  And the space requirements can be too great.  Especially if there are many streams to be managed in main memory at once, or N is huge.

 Count recent hits on URL’s belonging to a site.  Stream is a sequence of URL’s.  Window size N = 1 billion.  Think of the data as many streams – one for each URL.  Bit on the stream for URL x is 0 unless the actual stream has x. 12

13  Name refers to the inventors:  Datar, Gionis, Indyk, and Motwani.  Store only O(log 2 N) bits per stream (N = window size).  Gives approximate answer, never off by more than 50%.  Error factor can be reduced to any fraction > 0, with more complicated algorithm and proportionally more stored bits.

14  Each bit in the stream has a timestamp, starting 0, 1, …  Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log 2 N) bits.

15  A bucket is a segment of the window; it is represented by a record consisting of: 1.The timestamp of its end [O(log N) bits]. 2.The number of 1’s between its beginning and end.  Number of 1’s = size of the bucket.  Constraint on bucket sizes: number of 1’s must be a power of 2.  Thus, only O(log log N) bits are required for this count.

16  Either one or two buckets with the same power-of-2 number of 1’s.  Buckets do not overlap.  Buckets are sorted by size.  Older buckets are not smaller than newer buckets.  Buckets disappear when their end-time is > N time units in the past.

N 1 of size 2 2 of size 4 2 of size 8 At least 1 of size 16. Partially beyond window. 2 of size 1

18  When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time.  If the current bit is 0, no other changes are needed.

19  If the current bit is 1: 1.Create a new bucket of size 1, for just this bit.  End timestamp = current time. 2.If there are now three buckets of size 1, combine the oldest two into a bucket of size 2. 3.If there are now three buckets of size 2, combine the oldest two into a bucket of size 4. 4.And so on …

21  To estimate the number of 1’s in the most recent k < N bits: 1.Restrict your attention to only those buckets whose end time stamp is at most k bits in the past. 2.Sum the sizes of all these buckets but the oldest. 3.Add half the size of the oldest bucket.  Remember: we don’t know how many 1’s of the last bucket are still within the window.

22  Suppose the oldest bucket within range has size 2 i.  Then by assuming 2 i -1 of its 1’s are still within the window, we make an error of at most 2 i -1.  Since there is at least one bucket of each of the sizes less than 2 i, and at least 1 from the oldest bucket, the true sum is no less than 2 i.  Thus, error at most 50%.