Published by Gillian Hunt. Modified over 9 years ago.
1
Adaptive Load Shedding for Mining Frequent Patterns from Data Streams
Xuan Hong Dang, Wee-Keong Ng, and Kok-Leong Ong (DaWaK 2006)
2008/3/19 Yi-Chun Chen
2
Outline
– Motivation
– Objective
– Definition
– Adaptive Load Shedding in Data Streams
– Performance Results
– Conclusion
3
Motivation
– Finding frequent itemsets plays an important role in analyzing data streams
– Existing approaches simply assume that the machine itself is fast enough to handle all incoming transactions without incurring unwanted latency
4
(Cont.)
– The arrival rate of a data stream usually exceeds the system capacity
– Algorithms that mine data streams must cope with system overload
5
Objective
– Given: a mining system with processing capacity C and a data stream DS with a high arrival rate
– Load(DS): the workload that DS places on the system
– If $Load(DS) > C$, load shedding is invoked
– Guarantee: the discovered set of patterns closely approximates the set of actual frequent itemsets
6
(Cont.)
– How can overload situations be detected?
– How much load should be shed?
– How can frequent patterns be approximated once load shedding is in effect?
7
Definition
– The occurrence count of an itemset X in DS, taken up to a given transaction
– MFIs: maximal frequent itemsets
8
Adaptive Load Shedding in Data Streams
– Overload Detection
– Load Shedding by Sampling Transactions
9
Overload Detection
To quickly estimate the system workload, we propose an approximate method based on MFIs
– Every frequent itemset is a subset of some MFI, so the MFIs cover all frequent itemsets
– The number of MFIs is smaller than the number of frequent itemsets
– The support of an MFI is always closest to the minimum support threshold
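The first two properties of MFIs can be illustrated with a small sketch. It assumes the frequent itemsets are already known and represented as Python `frozenset` objects; the function name `maximal_itemsets` and the toy data are hypothetical and not taken from the slides.

```python
from typing import Iterable, Set, FrozenSet

Itemset = FrozenSet[str]

def maximal_itemsets(frequent: Iterable[Itemset]) -> Set[Itemset]:
    """Return the frequent itemsets that have no frequent proper superset."""
    frequent = set(frequent)
    # `x < y` on frozensets tests "x is a proper subset of y".
    return {x for x in frequent if not any(x < y for y in frequent)}

# Toy example (hypothetical data): five frequent itemsets.
frequent = {
    frozenset("a"), frozenset("b"), frozenset("c"),
    frozenset("ab"), frozenset("ac"),
}
mfis = maximal_itemsets(frequent)

# Every frequent itemset is contained in at least one MFI.
covered = all(any(x <= m for m in mfis) for x in frequent)
```

In this toy case two MFIs stand in for five frequent itemsets, which is why counting MFIs is a cheaper workload proxy than counting all frequent itemsets.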
10
(Cont.)
Load coefficient:
– Let k be the number of MFIs in a transaction
– Each of these k MFIs is an itemset contained in the transaction
Suppose we measure the above statistics for n transactions over one time unit
– Let r be the current rate of the data stream
11
Load Shedding by Sampling Transactions
To estimate how much load to shed:
– Let P be a parameter expressing the fraction of transactions that should be discarded
– Suppose P < 1; we then use the Hoeffding bound to discard transactions and to approximate frequent patterns
12
(Cont.)
Hoeffding bound:
– Consider n sampled transactions and an itemset X
– Let r be the number of times that X occurs in these transactions
– sup(X) = p: the true support of X
– $\hat{p} = r/n$: the estimated support of X
– We want to satisfy the inequality $\Pr(|p - \hat{p}| \ge \epsilon) \le \delta$, so the required number of sampling transactions is at least $\frac{1}{2\epsilon^{2}} \ln\frac{2}{\delta}$
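As a rough illustration of the sample-size requirement, the sketch below uses the standard two-sided Hoeffding inequality, $\Pr(|p - \hat{p}| \ge \epsilon) \le 2e^{-2n\epsilon^{2}}$, and solves $2e^{-2n\epsilon^{2}} \le \delta$ for n. That this is exactly the form used on the slide is an assumption, and the function name `hoeffding_sample_size` is hypothetical.

```python
import math

def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest integer n with 2 * exp(-2 * n * epsilon**2) <= delta.

    epsilon: maximum tolerated support error (assumed to lie in (0, 1)).
    delta:   maximum tolerated failure probability (assumed to lie in (0, 1)).
    """
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must both lie strictly between 0 and 1")
    # Rearranging the bound gives n >= ln(2 / delta) / (2 * epsilon^2).
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

n_required = hoeffding_sample_size(epsilon=0.01, delta=0.1)
```

The bound grows with $1/\epsilon^{2}$, so halving the tolerated error roughly quadruples the number of transactions that must be sampled.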
13
(Cont.)
– Sample batch: each incoming transaction is chosen with probability P until enough transactions have been sampled
– Local patterns: the frequent itemsets found in one sample batch, which covers only part of the stream
– Global patterns: the frequent itemsets in the entire stream
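The sample-batch step can be sketched as independent coin flips over the incoming transactions. This is a minimal sketch, assuming P is treated as the per-transaction selection probability and that the target batch size is supplied by the caller; the names `sample_batch`, `batch_size`, and `rng` are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")

def sample_batch(stream: Iterable[T], p: float, batch_size: int,
                 rng: Optional[random.Random] = None) -> List[T]:
    """Keep each incoming transaction with probability p until the batch is full.

    Transactions that are not chosen are simply dropped (shed).
    The batch may be shorter than batch_size if the stream ends first.
    """
    rng = rng or random.Random()
    batch: List[T] = []
    if batch_size <= 0:
        return batch
    for transaction in stream:
        # rng.random() is uniform on [0, 1), so this test succeeds with probability p.
        if rng.random() < p:
            batch.append(transaction)
            if len(batch) >= batch_size:
                break
    return batch

# Usage with a seeded generator so the run is reproducible.
batch = sample_batch(range(1000), p=0.3, batch_size=10, rng=random.Random(42))
```

Because the function stops consuming the stream once the batch is full, the next batch can start from the following transaction, which matches the idea that each batch sees only part of the stream.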
14
(Cont.)
Due to the non-uniform distribution of the stream:
– False global patterns can arise
– Significant support threshold
– Maximum support error of each pattern
– Patterns fall into three classes: frequent, sub-frequent, and infrequent
– Significant patterns
15
(Cont.)
– The required number of sampling transactions is bounded from below
– For some parameter values this bound is too large
– We therefore assume that each itemset appears in more than 0.01% of the transactions; under the resulting condition, every itemset will be chosen
16
Performance Results
– Accuracy Measurements
– Adaptability
– Recall: true frequent patterns found / actual true frequent patterns
– Precision: true frequent patterns found / total frequent patterns found
– Synthetic: T5I3D1000K, T8I4D1000K with 10000 unique items
– Real-life: “BMS-POS” T6.5 D515597 with 1657 distinct items
– Parameters: one fixed, one selected
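The two accuracy measures can be computed directly from the set of patterns reported by the miner and the set of truly frequent patterns. The sketch below is illustrative only: the function name `recall_precision`, the convention of returning 1.0 for an empty denominator, and the toy itemsets are assumptions, not results from the slides.

```python
from typing import FrozenSet, Iterable, Tuple

Itemset = FrozenSet[str]

def recall_precision(found: Iterable[Itemset],
                     actual: Iterable[Itemset]) -> Tuple[float, float]:
    """Return (recall, precision) for a set of reported frequent patterns.

    recall    = true frequent patterns found / actual true frequent patterns
    precision = true frequent patterns found / total frequent patterns found
    """
    found_set, actual_set = set(found), set(actual)
    true_found = len(found_set & actual_set)
    # Assumed convention: an empty denominator yields 1.0 (nothing to miss or misreport).
    recall = true_found / len(actual_set) if actual_set else 1.0
    precision = true_found / len(found_set) if found_set else 1.0
    return recall, precision

# Toy example (hypothetical data).
found = [frozenset("a"), frozenset("ab"), frozenset("c")]
actual = [frozenset("a"), frozenset("ab"), frozenset("b"), frozenset("d")]
recall, precision = recall_precision(found, actual)
```

Here two of the four actual patterns are recovered and two of the three reported patterns are correct.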
19
Conclusion
– This work addresses the problem of finding frequent patterns in data streams when the mining system may not keep up with the arrival rate of the stream