Adaptive Frequency Counting over Bursty Data Streams. Bill Lin, Wai-Shing Ho, Ben Kao and Chun-Kit Chui. From CIDM07.

Presentation transcript:


Outline. Introduction; Lossy Counting, Data Aging, Problem Definition; System Architecture and Feedback Mechanism; Algorithm; Experiments; Conclusion.

Introduction. Large volumes of data. Real-time processing. Bursty traffic. Data aging.

Lossy Counting. The Lossy Counting (LC) algorithm is the first one-pass algorithm for approximately counting the set of frequent itemsets over a stream of transactions. Given a user-specified error bound ε, LC processes incoming data bucket-by-bucket and updates a summary structure D. D contains a set of entries of the form $(X, \tilde{\sigma}(X), \delta(X))$, where X is an itemset, $\tilde{\sigma}(X)$ is an approximate support count of X, and $\delta(X)$ is the maximum possible error in $\tilde{\sigma}(X)$.

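Lossy Counting. (cont) The summary structure D has the following two properties: 1. For each itemset X, if X is not in D, then σ(X) < εN, where N is the total number of transactions processed. 2. For each entry $(X, \tilde{\sigma}(X), \delta(X))$ in D, $\tilde{\sigma}(X) \le \sigma(X) \le \tilde{\sigma}(X) + \delta(X)$. When bucket $T_i$ is processed, D is updated by the following rules, where $\sigma_i(X)$ is the support count of X in $T_i$: 1. If D does not contain an entry for X, we create an entry $(X, \sigma_i(X), \varepsilon N_1)$ if $\sigma_i(X) + \varepsilon N_1 > \varepsilon N$. 2. Otherwise, X has an entry in D and the approximate support count $\tilde{\sigma}(X)$ is incremented by $\sigma_i(X)$. 3. If the updated entry satisfies $\tilde{\sigma}(X) + \delta(X) \le \varepsilon N$, we delete the entry from D.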
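The update rules on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it counts single items (1-itemsets) instead of enumerating all itemsets, it takes N_1 to be the number of transactions seen before the current bucket and N the number seen after it, it prunes every entry at the end of the bucket, and the function name and data layout are hypothetical.

```python
from collections import Counter


def lossy_counting_update(D, bucket, n_before, eps):
    """Apply one bucket of transactions to the summary structure D.

    D maps an item to [approx_count, max_error] (hypothetical layout).
    bucket is a list of transactions, each an iterable of items.
    n_before is the number of transactions processed before this bucket
    (assumed to play the role of N_1). Returns the new total N.
    """
    n_after = n_before + len(bucket)

    # Support count of each item within this bucket only.
    bucket_counts = Counter(item for txn in bucket for item in set(txn))

    for item, count in bucket_counts.items():
        if item in D:
            # Rule 2: existing entry, increment the approximate count.
            D[item][0] += count
        elif count + eps * n_before > eps * n_after:
            # Rule 1: new entry; the item may have been missed in
            # earlier buckets, so its error is at most eps * n_before.
            D[item] = [count, eps * n_before]

    # Rule 3: drop entries whose count plus error cannot exceed eps * N.
    # Assumption: all entries are checked, not only those just updated.
    for item in list(D):
        count, max_error = D[item]
        if count + max_error <= eps * n_after:
            del D[item]

    return n_after
```

Calling the function once per bucket, and feeding the returned total back in as `n_before`, keeps the summary small while preserving the error bound stated in the properties above.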

Data Aging. In many data stream applications, recent data are more important than older ones. Let $\sigma_i(X)$ denote the support count of X in bucket $T_i$. A time-weighted support count of X is then obtained from the first k buckets of the stream.
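The exact weighting formula did not survive in this transcript. As an illustration only, the sketch below assumes a common exponential-decay form, in which the count from each bucket is multiplied by a decay factor `alpha` once for every newer bucket that follows it. Both the formula and the function name are assumptions, not taken from the slides.

```python
def time_weighted_support(bucket_counts, alpha):
    """Exponentially decayed support count over a sequence of buckets.

    bucket_counts[i] is the support count of one itemset in the
    (i+1)-th bucket, oldest first. Assumed form (not from the slides):
    sum over i of alpha ** (k - i) * count_i for buckets 1..k.
    """
    total = 0.0
    for count in bucket_counts:
        # Age everything seen so far by one bucket, then add the new count.
        total = alpha * total + count
    return total
```

The incremental form means only one number per itemset has to be stored, which fits the one-pass requirement of stream processing.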

Problem Definition. Given a data stream T with bursty transaction arrivals, a user-specified support threshold and a decay factor α, the problem is to report all itemsets X and to estimate their support counts such that (i) all frequent itemsets in T are reported, and (ii) a maximum possible error is reported for each estimated support count. Due to the large volume of transactions, we also require that transactions be loaded into memory and processed only once.

Architecture.

Architecture. (cont) The buffer manager monitors the occupancy of the buffer slots and submits statistical information to the speed regulator to control the mining speed. The mining module processes each bucket of buffered data and summarizes the mining result in its internal summary structure. The mining module submits feedback statistics to the speed regulator after each bucket is processed. The speed regulator receives statistical information from the buffer manager to estimate the data arrival rate. It also receives feedback information from the mining module to determine the mining speed. Based on this information, the speed regulator determines a target processing speed and sends a speed control signal to the mining module.

Feedback Mechanism. The core component of AFC is the feedback mechanism, which is implemented in the speed regulator. It estimates a target processing time, denoted by $p_i$, which is the amount of time within which the mining module should complete the processing of the next bucket of transactions ($T_i$).

Estimating Target Processing Time. The objective of the feedback mechanism is to keep the buffer occupancy at a fraction f of the Q buffer slots. Let $q_i$ denote the number of buffer slots that are occupied just before bucket $T_i$ is processed. For example, if $q_i$ is larger than the target buffer occupancy fQ, AFC should set $p_i$ to a small value so that transactions are mined at a higher rate, bringing the occupancy down towards fQ. To estimate the arrival rate, consider the period during which $T_{i-1}$ is processed: one bucket leaves the buffer while the occupancy changes from $q_{i-1}$ to $q_i$. Hence, the number of buckets that have arrived during this period is $1 + q_i - q_{i-1}$. Therefore, the bucket arrival rate is $1 + q_i - q_{i-1}$ divided by the length of this period, in buckets per unit time.

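Estimating Target Processing Time. (cont) Suppose the occupancy is to be restored to fQ within a restoration time $t_{\text{restore}}$. During this period of time, we have k bucket arrivals, and the occupancy changes from $q_i$ to fQ. Hence, the number of buckets processed is $q_i + k - fQ$. If we assume that buckets arrive at a constant rate during this period, $t_{\text{restore}}$ equals k divided by the bucket arrival rate. Since the system has to process $q_i + k - fQ$ buckets within this period of time, the target processing time of $T_i$ should be

$$p_i = \frac{t_{\text{restore}}}{q_i + k - fQ} \qquad (1)$$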
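The estimate described on these two slides can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function name, the `elapsed` parameter (the measured time between the two occupancy readings) and the choice to return infinity when there is no backlog or no observed arrivals are all assumptions made for the example.

```python
import math


def target_processing_time(q_prev, q_cur, elapsed, k, f, Q):
    """Estimate the target processing time for the next bucket.

    q_prev, q_cur: buffer occupancy just before the previous and the
    current bucket are processed. elapsed: time between those two
    readings (hypothetical parameter). k: number of bucket arrivals
    allowed before occupancy should be back at f * Q.
    """
    # One bucket left the buffer, so arrivals = 1 + change in occupancy.
    arrivals = 1 + q_cur - q_prev
    if arrivals <= 0 or elapsed <= 0:
        return math.inf  # assumption: no observed arrivals, no deadline

    arrival_rate = arrivals / elapsed      # buckets per unit time
    t_restore = k / arrival_rate           # time for k more arrivals

    to_process = q_cur + k - f * Q         # buckets to mine in t_restore
    if to_process <= 0:
        return math.inf  # assumption: occupancy already at or below target

    return t_restore / to_process
```

For instance, with `q_prev=4`, `q_cur=6`, `elapsed=2.0`, `k=3`, `f=0.5` and `Q=8`, three buckets arrived in two time units, the restoration time is 2.0, five buckets must be processed, and the target is 0.4 time units per bucket.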

Estimating Target Processing Time. (cont) This estimation is redone after each bucket is processed. A smaller value of k leads to a smaller restoration time ($t_{\text{restore}}$) and thus a higher processing speed is required at the mining module.

Speed Control. In AFC, the algorithm determines a suitable error threshold $\varepsilon_i$ for processing bucket $T_i$ so that the target processing time $p_i$ is achieved. Let $C_i = \sum_{X \in D_i} \sigma_i(X)$, that is, the sum of all the support counts in $T_i$ of the itemsets that are retained in the summary structure $D_i$ after the bucket $T_i$ is processed. We can discover the relationship between $\varepsilon_i$ and this sum $C_i$.

Speed Control. (cont) The larger $\varepsilon_i$ (i.e., the error threshold used when processing $T_i$) is, the fewer itemsets will be kept in $D_i$, leading to a smaller $C_i$. We now describe a two-step approach for determining $\varepsilon_i$. 1) Step 1. Estimate a target value of $C_i$ from $p_i$ ……(2)

Speed Control. (cont) 2) Step 2. Estimate $\varepsilon_i$ from $C_i$. AFC determines a target processing time ($p_i$) based on Equation (1). It then calculates a target value of $C_i$ by Equation (2). Using the $C_{i-1}$-$\varepsilon_{i-1}$ curve, AFC determines the value of $\varepsilon_i$ given the estimated value of $C_i$. This error threshold is then applied when the mining module processes bucket $T_i$.
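Step 2 amounts to reading an error threshold off a curve that relates the threshold to the retained support sum. The slides do not say how the curve is stored or queried, so the sketch below is only one plausible reading: it assumes the curve is available as sampled (epsilon, C) points with C non-increasing in epsilon, and it uses linear interpolation between neighbouring points. The function name and the clamping behaviour at the ends of the curve are hypothetical.

```python
def epsilon_for_target(curve, target_c):
    """Look up the error threshold that yields roughly target_c.

    curve: list of (epsilon, c) pairs sorted by increasing epsilon,
    with c non-increasing (a larger threshold keeps fewer itemsets).
    """
    if not curve:
        raise ValueError("curve must contain at least one point")

    # Clamp to the sampled range (assumption: no extrapolation).
    if target_c >= curve[0][1]:
        return curve[0][0]
    if target_c <= curve[-1][1]:
        return curve[-1][0]

    # Find the segment that brackets target_c and interpolate linearly.
    for (e_left, c_left), (e_right, c_right) in zip(curve, curve[1:]):
        if c_right <= target_c <= c_left:
            if c_left == c_right:
                return e_left
            fraction = (c_left - target_c) / (c_left - c_right)
            return e_left + fraction * (e_right - e_left)

    raise ValueError("curve is not non-increasing in c")
```

A lower target sum maps to a larger threshold, which matches the statement above that a larger error threshold leaves fewer itemsets in the summary structure.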

Algorithm. AFC provides the following accuracy guarantees: 1. All itemsets whose true time-weighted support counts exceed are returned. 2. No itemset whose true time-weighted support count is less than is returned. 3. If no buckets are dropped in the processing of the data stream, then the difference between the reported support count of an itemset X and the true time-weighted support count of X is at most

Algorithm. (cont)

Experiment.

Experiment. (cont)

Conclusion.