Mining of Massive Datasets Ch4. Mining Data Streams


Outline
- What is a data stream
- The stream data model
- Examples of stream sources
- Stream queries: standing queries and ad-hoc queries
- Sampling data in a stream
- Obtaining a representative sample
- Varying the sample size
- Filtering streams
- Bloom filtering

What is a data stream
- Database: data is available when and if we want it.
- Data stream: data arrives in a stream.
  - A stream is composed of elements (tuples), for example (0, 7, k), (1, 5, n), (0, 3, d).
  - If an element is not processed immediately, it is lost forever.

The stream data model
- Streams need not have the same data rates or data types.
- Working storage: main memory or disk.
- Problem: we cannot store all the data from all the streams.

Examples of stream sources
- Sensor data
  - A temperature sensor in the ocean
  - Give the sensor a GPS unit and it can also report surface height
- Image data
  - Satellites
  - Surveillance cameras
- Internet and web traffic
  - Google: hundreds of millions of search queries per day
  - Yahoo!: billions of clicks per day

Stream queries
- Standing queries: permanently executing; they produce outputs at appropriate times.
- Ad-hoc queries: asked once about the current state of the stream.

Standing queries
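
As a minimal sketch of a standing query, the generator below keeps a single value of state and emits the running maximum each time a new element exceeds it. The name `running_max` and the choice of "maximum so far" as the query are illustrative assumptions, not taken from the slides.

```python
def running_max(stream):
    """Standing query: emit the maximum seen so far whenever it changes."""
    best = None
    for value in stream:
        # Only one value of state is kept, no matter how long the stream is.
        if best is None or value > best:
            best = value
            yield best


# Example: temperature readings arriving one at a time (made-up values).
readings = [3, 1, 4, 1, 5, 9, 2, 6]
outputs = list(running_max(readings))
```

The query runs for as long as the stream does and produces output only when its answer changes.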

Ad-hoc queries
- A question asked once about the current state of the stream.
- We do not store all streams, so we cannot answer arbitrary queries.
- Solution: store a sliding window, defined either by a number of elements or by a span of time.
[Figure: a sliding window over the stream of elements, between past and future]
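
A sliding window defined by a number of elements can be sketched with a bounded deque. The class name `SlidingWindow` and its `count` method are hypothetical helpers introduced for illustration; a time-based window would need timestamps and is not shown.

```python
from collections import deque


class SlidingWindow:
    """Keep only the most recent `size` elements of a stream."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest element automatically.
        self.items = deque(maxlen=size)

    def push(self, element):
        self.items.append(element)

    def count(self, predicate):
        """Ad-hoc query: how many elements in the window satisfy predicate?"""
        return sum(1 for element in self.items if predicate(element))


window = SlidingWindow(size=4)
for ch in "qwertyuiop":
    window.push(ch)
vowels_in_window = window.count(lambda c: c in "aeiou")
```

Ad-hoc queries are then answered from the window alone, so anything older than the window is out of reach.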

Example

Sampling data in a stream
- We cannot store all streams in main memory.
- Solution: settle for an approximate answer rather than an exact one.
- We keep a sample of the stream and ask queries about the sampled data.

Example
- Scenario: a search engine query stream.
- Stream of tuples: (user, query, time).
- We want to answer questions such as: how often did a user run the same query in a single day?
- We wish to store 1/10th of the query stream.
- Obvious solution: generate a random integer in [0...9] for each query; store the query if the integer is 0, otherwise discard it.
- This solution is wrong.

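- Suppose each user issues x queries once and d queries twice (a total of x + 2d queries).
- Correct answer for the fraction of the user's distinct queries that are duplicates: $d/(x + d)$.
- Proposed solution: we keep 10% of the queries.
  - In expectation, the sample contains x/10 of the singleton queries and 2d/10 occurrences of the duplicate queries.
  - But only d/100 of the duplicate queries appear twice in the sample: $d/100 = \frac{1}{10} \cdot \frac{1}{10} \cdot d$.
  - Of the d duplicates, 18d/100 appear exactly once: $18d/100 = \left(\frac{1}{10} \cdot \frac{9}{10} + \frac{9}{10} \cdot \frac{1}{10}\right) \cdot d$.
- So the sample-based answer is $\frac{d/100}{x/10 + d/100 + 18d/100} = \frac{d}{10x + 19d}$.
- $\frac{d}{x + d} \neq \frac{d}{10x + 19d}$.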
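
The bias can be checked with a small simulation. This is a sketch under the slide's assumptions (x singleton queries, d queries issued twice, each occurrence kept independently with probability 1/10); the function name, the seed, and the values of x and d are arbitrary choices for illustration.

```python
import random


def sampled_duplicate_fraction(x, d, rate=0.1, seed=0):
    """Fraction of sampled distinct queries that appear twice in the sample."""
    rng = random.Random(seed)
    # Singleton queries: each is kept with probability `rate`.
    singles = sum(1 for _ in range(x) if rng.random() < rate)
    once = twice = 0
    for _ in range(d):
        # Each of the two occurrences is sampled independently.
        kept = (rng.random() < rate) + (rng.random() < rate)
        if kept == 2:
            twice += 1
        elif kept == 1:
            once += 1
    return twice / (singles + once + twice)


x, d = 100_000, 100_000
true_fraction = d / (x + d)          # d / (x + d)
predicted = d / (10 * x + 19 * d)    # d / (10x + 19d)
estimate = sampled_duplicate_fraction(x, d)
```

With x = d the true fraction is 1/2, while the sampled estimate lands near the predicted 1/29, far below the truth.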

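Obtaining a Representative Sample
- Pick 1/10th of the users and take all of their searches.
- Option 1: store a list of users.
  - When a new user appears, generate a random integer between 0 and 9.
  - If it is 0, record the user with value "in"; otherwise record the user with value "out".
- Option 2: use a hash function, which avoids storing the list of users.
  - Hash each user name to one of ten buckets, 0 through 9.
  - Keep the query if the user hashes to bucket 0.
[Figure: ten hash buckets, 0 through 9]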
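
The hash-based variant can be sketched as follows. The use of SHA-256 as the hash function and the names `bucket` and `sample_by_user` are assumptions made for this example; any hash that spreads user names evenly over the buckets would do.

```python
import hashlib


def bucket(user, num_buckets=10):
    """Map a user name to a bucket in 0..num_buckets-1, deterministically."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def sample_by_user(stream, num_buckets=10):
    """Keep every (user, query, time) tuple whose user hashes to bucket 0."""
    for user, query, time in stream:
        if bucket(user, num_buckets) == 0:
            yield (user, query, time)


# Made-up stream: 1000 users, three queries each.
stream = [(f"user{u}", f"query{q}", q) for u in range(1000) for q in range(3)]
sample = list(sample_by_user(stream))
```

Because the decision depends only on the user, a sampled user contributes all of their queries, so duplicates within a user are preserved.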

The general sampling problem
- The stream consists of tuples with n components.
- A subset of the components is the key.
- If the key consists of more than one component, the hash function needs to combine their values into a single hash value.
- Example: in the stream of tuples (user, query, time), the key for the search-query sample is user.
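
A sketch of sampling on a multi-component key follows. Joining the key components with a separator before hashing, the SHA-256 hash, and the rule "hash to b buckets and keep buckets below a" for a sample of roughly a/b of the key values are all assumptions of this example; the slide only states that the components must be combined into one hash value.

```python
import hashlib


def key_hash(tup, key_indexes, num_buckets):
    """Combine the key components of a tuple into a single hash value."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    key = "\x1f".join(str(tup[i]) for i in key_indexes)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def in_sample(tup, key_indexes, a, b):
    """Accept the tuple if its key hashes to one of the first a of b buckets."""
    return key_hash(tup, key_indexes, b) < a


# Key = (user, query): the time component does not affect the decision.
t1 = ("alice", "weather", 1000)
t2 = ("alice", "weather", 2000)
same_decision = in_sample(t1, (0, 1), 3, 10) == in_sample(t2, (0, 1), 3, 10)
```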

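Varying the sample size
- The sample grows as more of the stream arrives, so a fixed fraction eventually exceeds the allotted space.
- The hash function h maps keys to the values 0, 1, 2, …, B-1.
- We maintain a threshold t.
- We sample the tuples whose key K satisfies h(K) ≤ t.
- If the sample exceeds the allotted space, we lower t to t-1 and remove the tuples whose key hashes to the old value of t.
[Figure: hash values 0 through 9 with threshold t = 4]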
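
The threshold scheme can be sketched as below. The class name `ThresholdSample`, the SHA-256 hash, and the choice to start with t = B-1 (accept everything) are assumptions of this example rather than details given in the slides.

```python
import hashlib


def h(key, num_values):
    """Hash a key to one of the values 0..num_values-1."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_values


class ThresholdSample:
    """Keep tuples whose key hashes to at most t; lower t when space runs out."""

    def __init__(self, num_values, capacity):
        self.num_values = num_values   # B in the slides
        self.capacity = capacity       # allotted space, in tuples
        self.t = num_values - 1        # assumed starting threshold
        self.entries = []              # list of (hash value, tuple)

    def add(self, key, tup):
        value = h(key, self.num_values)
        if value <= self.t:
            self.entries.append((value, tup))
        # Shrink until the sample fits: drop hash value t, then lower t.
        while len(self.entries) > self.capacity and self.t >= 0:
            self.entries = [(v, tp) for v, tp in self.entries if v != self.t]
            self.t -= 1


sample = ThresholdSample(num_values=100, capacity=200)
stream = [(f"user{u}", f"query{q}", q) for q in range(10) for u in range(500)]
for tup in stream:
    sample.add(tup[0], tup)
```

Since t only decreases, any key still in the sample was accepted for its whole history, so the sample holds all tuples of each retained key.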

Filtering streams
- We want to accept those tuples that meet a criterion.
- Accepted tuples are passed to another process, while the others are dropped.
- Example: email spam filtering.
  - Suppose we have a set S of one billion allowed email addresses.
  - Each email address is 20 bytes or more, so S needs at least 20 GB.
  - We have only 1 GB of available main memory, so S does not fit.

Bloom filtering
- Use the main memory as a bit array B: B = [0,1,0,0,1,0,1,…,0].
- 1 GB of main memory gives 8 billion bits.
- All bits are 0 in the beginning: B = [0,0,0,0,…,0].
- Hash each member s of S to a bit, and set that bit to 1: B[h(s)] = 1.
- Approximately 1/8th of the bits will be 1 (slightly fewer, since some members hash to the same bit).
- When an element a arrives, we hash its email address.
- If B[h(a)] == 1, we let the email through; if B[h(a)] == 0, we drop it.
- Approximately 1/8th of the spam email will get through.
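
A scaled-down sketch of the single-hash filter follows, using a `bytearray` as the bit array. The class name `OneHashFilter`, the SHA-256 hash, and the example sizes are assumptions for illustration; the slide's real sizes are 8 billion bits and one billion addresses.

```python
import hashlib


class OneHashFilter:
    """Bit-array filter with one hash function, as in the spam example."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _position(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        p = self._position(item)
        self.bits[p >> 3] |= 1 << (p & 7)           # B[h(s)] = 1

    def __contains__(self, item):
        p = self._position(item)
        return bool(self.bits[p >> 3] & (1 << (p & 7)))


# Same 8:1 ratio of bits to members as in the slides, at a smaller scale.
allowed = [f"user{i}@example.com" for i in range(1000)]
spam_filter = OneHashFilter(num_bits=8000)
for address in allowed:
    spam_filter.add(address)
```

Every allowed address passes (there are no false negatives), while an address outside S passes only if it happens to hash to a bit that is already 1.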

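Analysis of Bloom filtering
- Suppose we have x targets and y darts.
- The probability that a given dart will not hit a given target is $(x-1)/x$.
- The probability that none of the y darts will hit a given target is $\left(\frac{x-1}{x}\right)^{y}$.
- Rewrite $\left(\frac{x-1}{x}\right)^{y} = \left(1-\frac{1}{x}\right)^{x \cdot \frac{y}{x}}$.
- Because $(1-\epsilon)^{1/\epsilon} \approx 1/e$ for small $\epsilon$, we get $\left(1-\frac{1}{x}\right)^{x \cdot \frac{y}{x}} \approx e^{-y/x}$.
- A given target is therefore hit at least once with probability $1 - e^{-y/x}$. With bits as targets and hashed members of S as darts, this is the probability that a bit is 1, and so the probability of a false positive.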

Example
- Consider the spam email filter.
- 8 billion bits, so $x = 8 \times 10^{9}$ targets.
- 1 billion members of S, so $y = 10^{9}$ darts.
- The probability that a given bit is 1 is $1 - e^{-1/8} \approx 0.1175$.
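
The arithmetic can be checked numerically. This sketch compares the exact expression ((x-1)/x)^y with the approximation e^(-y/x); the variable names are illustrative, and `math.log1p` is used only to avoid rounding error when 1/x is tiny.

```python
import math

x = 8e9   # targets: bits in the array
y = 1e9   # darts: members of S

# Exact probability that a given bit stays 0: ((x - 1) / x) ** y.
p_zero_exact = math.exp(y * math.log1p(-1.0 / x))
# Approximation from the analysis: e^(-y/x).
p_zero_approx = math.exp(-y / x)
# Probability that a given bit is 1, i.e. the false positive rate.
p_one = 1.0 - p_zero_approx
```

At this scale the exact and approximate values agree to many decimal places, so the approximation costs nothing in practice.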

- Set S has m members.
- The array has n bits.
- There are k hash functions.
- The number of targets is x = n.
- The number of darts is y = km.
- The probability that a given bit is 1 is $1 - e^{-km/n}$.
- An element outside S gets through only if all k of its bits are 1, so the false positive probability is $\left(1 - e^{-km/n}\right)^{k}$.
- "Optimal" value of k: $\frac{n}{m} \ln 2$.
[Figure: false positive probability versus number of hash functions k]
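
The effect of k can be explored with the formulas directly. This sketch assumes the standard false positive estimate (1 - e^(-km/n))^k for k hash functions; the function names are hypothetical, and n and m are the values from the spam example.

```python
import math


def false_positive_rate(n_bits, m_members, k_hashes):
    """Estimated false positive probability: (1 - e^(-km/n))^k."""
    return (1.0 - math.exp(-k_hashes * m_members / n_bits)) ** k_hashes


def optimal_k(n_bits, m_members):
    """Real-valued k that minimizes the estimate: (n/m) ln 2."""
    return (n_bits / m_members) * math.log(2)


n, m = 8e9, 1e9
best_k = optimal_k(n, m)
rates = {k: false_positive_rate(n, m, k) for k in range(1, 11)}
best_integer_k = min(rates, key=rates.get)
```

In practice k must be an integer, so the real-valued optimum is rounded to a neighbouring integer, and more hash functions also mean more work per element.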