Frequency Counts over Data Streams. Gurmeet Singh Manku, Stanford University, USA. VLDB 2002, August 21, 2002.

The Problem... Identify all elements in a stream whose current frequency exceeds support threshold s = 0.1%.

A Related Problem... Identify all subsets of items in a stream whose current frequency exceeds s = 0.1%. (Frequent Itemsets / Association Rules)

Applications: Flow Identification at IP Routers [EV01], Iceberg Queries [FSGM+98], Iceberg Datacubes [BR99, HPDW01], Association Rules & Frequent Itemsets [AS94, SON95, Toi96, Hid99, HPY00, ...]

Presentation Outline: 1. Lossy Counting 2. Sticky Sampling 3. Algorithm for Frequent Itemsets

Algorithm 1: Lossy Counting. Step 1: Divide the stream into 'windows' (Window 1, Window 2, Window 3, ...). Is the window size a function of support s? Will fix later...

Lossy Counting in Action... Start with an empty set of counters; within a window, each arriving element increments its counter (creating one if absent). At the window boundary, decrement all counters by 1 and drop any counter that reaches zero.

Lossy Counting continued... At each window boundary, decrement all counters by 1, then continue with the next window.

Error Analysis. How much do we undercount? If the current size of the stream is N and the window size is 1/ε, then #windows = εN, so the frequency error is at most εN (one decrement per window). Rule of thumb: set ε = 10% of support s. Example: given support frequency s = 1%, set error frequency ε = 0.1%.
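The arithmetic on this slide can be checked directly (the variable names are illustrative):

```python
# Rule of thumb: epsilon = 10% of the support s. With s = 1% and a
# stream of N = 1 million elements:
s, epsilon, N = 0.01, 0.001, 1_000_000

window_size = round(1 / epsilon)   # each window holds 1/epsilon = 1000 elements
num_windows = round(epsilon * N)   # N / window_size = epsilon * N = 1000 windows

# Every window boundary decrements each counter once, so any element is
# undercounted by at most epsilon * N.
max_undercount = num_windows

print(window_size, num_windows, max_undercount)  # -> 1000 1000 1000
```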

How many counters do we need? Worst case: (1/ε) log(εN) counters [see paper for proof]. Output: elements with counter values exceeding sN – εN. Approximation guarantees: frequencies are underestimated by at most εN; no false negatives; false positives have true frequency at least sN – εN.

Enhancements... Frequency errors: for counter (X, c), the true frequency lies in [c, c+εN]. Trick: remember window-ids; for counter (X, c, w), the true frequency lies in [c, c+w–1]. If w = 1, no error! Batch processing: perform the decrements only after every k windows.
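The slides above fully specify the basic algorithm; here is a minimal Python sketch of Lossy Counting with the window-id trick (illustrative names, not the authors' implementation):

```python
def lossy_count(stream, epsilon):
    """Lossy Counting with per-counter window ids: each counter stores
    (count, d) where d is (window id at insertion) - 1, so the true
    frequency of the item lies in [count, count + d]."""
    width = round(1 / epsilon)             # window size = 1/epsilon
    counters = {}                          # item -> [count, d]
    n = 0
    for x in stream:
        n += 1
        window = (n + width - 1) // width  # current window id
        if x in counters:
            counters[x][0] += 1
        else:
            counters[x] = [1, window - 1]  # occurrences before this window may be lost
        if n % width == 0:                 # window boundary: prune small counters
            for item, (c, d) in list(counters.items()):
                if c + d <= window:
                    del counters[item]
    return counters, n

def frequent_items(counters, n, s, epsilon):
    """Output rule from the slides: report items whose counter value
    exceeds (s - epsilon) * n."""
    return {x for x, (c, d) in counters.items() if c > (s - epsilon) * n}
```

With s = 10% and ε = 1%, a heavy hitter survives every pruning round while rare items are discarded at the first window boundary after they appear.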

Algorithm 2: Sticky Sampling. Create counters by sampling the stream; maintain exact counts thereafter. At what rate should we sample?

Sticky Sampling contd... For a finite stream of length N: sampling rate = (2/Nε) log(1/(sδ)), where δ = probability of failure. Same rule of thumb: set ε = 10% of support s. Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%. Output: elements with counter values exceeding sN – εN. Approximation guarantees (probabilistic), the same as Lossy Counting's but probabilistic: frequencies underestimated by at most εN; no false negatives; false positives have true frequency at least sN – εN.

Sampling rate? For a finite stream of length N, sampling rate = (2/Nε) log(1/(sδ)). For an infinite stream with unknown N, gradually adjust the sampling rate (see paper for details). In either case, the expected number of counters = (2/ε) log(1/(sδ)), independent of N!
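A sketch of Sticky Sampling under these definitions (a simplified illustration, not the authors' implementation; the names are assumptions):

```python
import random
from math import log

def sticky_sampling(stream, s, epsilon, delta, rng=None):
    """Sticky Sampling sketch: the first 2t elements are sampled at rate 1,
    the next 2t at rate 1/2, then 1/4, ..., with
    t = (1/epsilon) * log(1/(s*delta)). Once an item has a counter it is
    counted exactly; when the rate halves, each existing counter is
    diminished by unbiased coin tosses."""
    rng = rng or random.Random(0)
    t = (1 / epsilon) * log(1 / (s * delta))
    counters = {}
    rate = 1                       # sampling probability is 1/rate
    n = 0
    for x in stream:
        n += 1
        if n > 2 * t * rate:       # crossed a boundary: halve the sampling rate
            rate *= 2
            for item in list(counters):
                # toss a coin until heads, decrementing the counter per tails
                while counters[item] > 0 and rng.random() < 0.5:
                    counters[item] -= 1
                if counters[item] == 0:
                    del counters[item]
        if x in counters:
            counters[x] += 1       # existing counters are maintained exactly
        elif rng.random() < 1.0 / rate:
            counters[x] = 1        # sampled: start a new counter
    return counters, n
```

While the stream is shorter than 2t elements the sampling rate is 1, so every element is counted exactly.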

[Chart: number of counters vs. log10 of stream length N, for support s = 1% and error ε = 0.1%. Sticky Sampling, expected: (2/ε) log(1/(sδ)), flat in N. Lossy Counting, worst case: (1/ε) log(εN), growing with log N.]

From elements to sets of elements …

Frequent Itemsets Problem... Identify all subsets of items in a stream whose current frequency exceeds s = 0.1%. Frequent Itemsets => Association Rules.

Three Modules: BUFFER, TRIE, SUBSET-GEN.

Module 1: TRIE. A compact representation of frequent itemsets (sets with frequency counts) in lexicographic order.

Module 2: BUFFER. Holds several windows of transactions in main memory (Window 1 through Window 6) in a compact representation: a sequence of ints, transactions sorted by item-id, with a bitmap marking transaction boundaries.

Module 3: SUBSET-GEN. Reads the BUFFER and produces frequency counts of subsets in lexicographic order.

Overall Algorithm... BUFFER -> SUBSET-GEN -> TRIE -> new TRIE. Problem: the number of subsets is exponential!

SUBSET-GEN Pruning Rules. A-priori pruning rule: if set S is infrequent, every superset of S is infrequent. Lossy Counting pruning rule: at each 'window boundary', decrement TRIE counters by 1. Actually, 'batch deletion': at each 'main memory buffer' boundary, decrement all TRIE counters by b. See paper for details...
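The a-priori pruning rule can be sketched as follows (an illustrative simplification of SUBSET-GEN, not the authors' implementation):

```python
from itertools import combinations

def subsets_with_apriori_pruning(transaction, frequent, max_size=3):
    """Emit subsets of a transaction in lexicographic order, skipping any
    candidate with an infrequent (k-1)-subset: if S is infrequent, every
    superset of S is pruned and never counted."""
    items = sorted(transaction)
    out = []
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            # singletons are always counted; larger candidates require all
            # of their immediate subsets to be in the frequent collection
            if k == 1 or all(sub in frequent
                             for sub in combinations(cand, k - 1)):
                out.append(cand)
    return out
```

For example, if ('c',) is not in the frequent collection, no pair or triple containing 'c' is ever generated.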

Bottlenecks... In the pipeline BUFFER -> SUBSET-GEN -> TRIE -> new TRIE: the tries consume main memory; SUBSET-GEN consumes CPU time.

Design Decisions for Performance. TRIE (main memory bottleneck): a compact linear array of (element, counter, level) triples in preorder traversal order, no pointers! Tries are kept on disk, so all of main memory is devoted to BUFFER; a pair of tries, old and new, is processed in chunks via mmap() and madvise(). SUBSET-GEN (CPU bottleneck): very fast implementation, see paper for details.

Experiments...
IBM synthetic dataset T10.I4.1000K: N = 1 million, avg transaction size = 10, input size = 49 MB.
IBM synthetic dataset T15.I6.1000K: N = 1 million, avg transaction size = 15, input size = 69 MB.
Frequent word pairs in 100K web documents: N = 100K, avg transaction size = 134, input size = 54 MB.
Frequent word pairs in 806K Reuters news reports: N = 806K, avg transaction size = 61, input size = 210 MB.

What do we study? For each dataset there are three independent variables: support threshold s, length of stream N, and BUFFER size B; the dependent variable is the time taken t. We set ε = 10% of support s, fix one independent variable, vary the other two, and measure the time taken.

[Charts: time in seconds vs. BUFFER size in MB, for IBM 1M transactions and Reuters 806K docs. Fixed: stream length N. Varying: BUFFER size B and support threshold s, one curve per support value up to s = 0.020.]

[Charts: time in seconds vs. length of stream in thousands, for IBM 1M transactions and Reuters 806K docs. Fixed: BUFFER size B. Varying: stream length N and support threshold s, one curve per support value down to s = 0.004.]

[Charts: time in seconds vs. support threshold s, for IBM 1M transactions and Reuters 806K docs. Fixed: stream length N. Varying: BUFFER size B (curves for B = 4, 16, 28, 40 MB) and support threshold s.]

Comparison with fast A-priori (A-priori implementation by Christian Borgelt). Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.

Support | A-priori (Time, Memory) | Ours, 4 MB buffer (Time, Memory) | Ours, 44 MB buffer (Time, Memory)
   ?    |     ? s, 82 MB          |  111 s, 12 MB                    |  27 s, 45 MB
   ?    |     ? s, 53 MB          |   94 s, 10 MB                    |  15 s, 45 MB
   ?    |     ? s, 48 MB          |   65 s,  7 MB                    |   8 s, 45 MB
   ?    |     ? s, 48 MB          |   46 s,  6 MB                    |   6 s, 45 MB
   ?    |     ? s, 48 MB          |   34 s,  5 MB                    |   4 s, 45 MB
   ?    |     ? s, 48 MB          |   26 s,  5 MB                    |   4 s, 45 MB

Comparison with Iceberg Queries. Query: identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents. The [FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory. Our single-pass algorithm: 4500 seconds with 26 MB memory. Our algorithm would be much faster if allowed multiple passes!

Lessons Learnt... Optimizing for #passes is wrong! Small support s => too many frequent itemsets! Time to redefine the problem itself? An interesting combination of Theory and Systems.

Work in Progress... Frequency Counts over Sliding Windows Multiple pass Algorithm for Frequent Itemsets Iceberg Datacubes

Summary. Lossy Counting: a practical algorithm for online frequency counting. The first single-pass algorithm for Association Rules with user-specified error guarantees. The basic algorithm is applicable to several problems.

Thank you!

But...

ε      | s     | SS worst | LC worst | SS Zipf | LC Zipf | SS Uniq | LC Uniq
0.1%   | 1.0%  | 27K      | 9K       | 6K      | 419     | 27K     | 1K
0.05%  | 0.5%  | 58K      | 17K      | 11K     | 709     | 58K     | 2K
0.01%  | 0.1%  | 322K     | 69K      | 37K     | 2K      | 322K    | 10K
0.005% | 0.05% | 672K     | 124K     | 62K     | 4K      | 672K    | 20K

LC: Lossy Counting. SS: Sticky Sampling. Zipf: Zipfian distribution. Uniq: unique elements. Sticky Sampling expected: (2/ε) log(1/(sδ)). Lossy Counting worst case: (1/ε) log(εN).