1 Efficient Data Reduction Methods for Online Association Rule Discovery -NGDM’02 Herve Bronnimann, Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao, Peter Scheuermann Presented by: Ivy Tong 18 June, 2003
2 Outline of the Presentation
- Motivation
- FAST
- Epsilon Approximation
- Experimental Results
- Data Stream Reduction
- Conclusion
3 Motivation
- The volume of data in warehouses and on the Internet is growing faster than Moore’s Law.
- Scalability is a major concern: classical algorithms require one or more scans of the database.
- Algorithms need to adapt to streaming data: data elements arrive online, and only a limited amount of memory is available.
- One solution: execute the algorithm on a subset of the data.
4 Motivation
- Sampling methods
  - Advantage: can explicitly trade off accuracy and speed
  - Work best when tailored to the application
- Contributions of this paper
  - Sampling methods for count datasets
  - Application: association rule mining
5 Notations
- $D$: database of interest
- $S$: a simple random sample drawn without replacement from $D$
- $I$: the set of all items that appear in $D$
- $I(D)$: the collection of itemsets that appear in $D$
- $I(S)$: the collection of itemsets that appear in $S$
- For $k \ge 1$, $I_k(D)$ and $I_k(S)$ denote the collections of $k$-itemsets in $D$ and $S$
- $L(D)$ and $L(S)$: frequent itemsets in $D$ and $S$
- $L_k(D)$ and $L_k(S)$: collections of frequent $k$-itemsets in $D$ and $S$
- For an itemset $A \subseteq I$ and a set of transactions $T$, let $n(A;T)$ be the number of transactions in $T$ that contain $A$
- $|T|$: total number of transactions in $T$
- Support of $A$ in $D$: $f(A;D) = n(A;D)/|D|$
- Support of $A$ in $S$: $f(A;S) = n(A;S)/|S|$
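The support definitions above translate directly into code. The following is a minimal sketch; the function names `count` and `support` and the toy database are illustrative assumptions, not part of the presentation.

```python
from typing import FrozenSet, Iterable, Sequence

Transaction = FrozenSet[str]


def count(itemset: Iterable[str], transactions: Sequence[Transaction]) -> int:
    """n(A; T): number of transactions in T that contain every item of A."""
    a = frozenset(itemset)
    return sum(1 for t in transactions if a <= t)


def support(itemset: Iterable[str], transactions: Sequence[Transaction]) -> float:
    """f(A; T) = n(A; T) / |T|."""
    if not transactions:
        raise ValueError("support is undefined for an empty transaction set")
    return count(itemset, transactions) / len(transactions)


# Toy database D (hypothetical data, for illustration only).
D = [frozenset(t) for t in (["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"])]
print(support({"a"}, D))       # 3 of the 4 transactions contain "a"
print(support({"a", "b"}, D))  # 2 of the 4 contain both "a" and "b"
```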
6 Problem Definition
- Generate a smaller subset $S_0$ of a larger sample $S$ such that the supports of 1-itemsets in $S_0$ are close to those in $S$.
- $I_1(T)$ = set of all 1-itemsets in transaction set $T$
- $L_1(T)$ = set of frequent 1-itemsets in transaction set $T$
- $f(A;T)$ = support of itemset $A$ in transaction set $T$
7 FAST Algorithm
Finding Association rules from Sampled Transactions (SIGKDD’02)
Given a specified minimum support p and confidence c, FAST proceeds as follows:
1. Obtain a large simple random sample $S$ from $D$.
2. Compute $f(A;S)$ for each 1-itemset $A$.
3. Using the supports computed in Step 2, obtain a reduced sample $S_0$ from $S$ by trimming away outlier transactions.
4. Run a standard association-rule algorithm against $S_0$, with minimum support p and confidence c, to obtain the final set of association rules.
8 FAST-trim
- Removes the “outlier” transactions from the sample $S$ to obtain $S_0$.
- Outlier: a transaction whose removal from $S$ maximally reduces (or minimally increases) the difference between the supports of the 1-itemsets in $S$ and the corresponding supports in $D$.
- Since the supports of items in $D$ are unknown, they are estimated from $S$, as computed in Step 2.
- Distance function used:
9 FAST-trim
- Uses the input parameter k to explicitly trade off speed and accuracy (1 ≤ k ≤ |S|).
- Note: removal of the outlier t* causes the maximum decrease, or the minimum increase, in Dist(S0, S).

Trimming phase (S0 starts as S; n is the target sample size):
    while (|S0| > n) {
        divide S0 into disjoint groups of min(k, |S0|) transactions each;
        for each group G {
            compute f(A; S0) for each item A;
            set S0 = S0 − {t*}, where t* ∈ G and Dist(S0 − {t*}, S) = min over t ∈ G of Dist(S0 − {t}, S);
        }
    }
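The trimming phase can be sketched as follows. The slide's distance formula did not survive, so this sketch assumes a squared-difference distance over 1-itemset supports; that choice, the function names, and the extra stop check inside the group loop are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence

Transaction = FrozenSet[str]


def supports(transactions: Sequence[Transaction], items: Sequence[str]) -> Dict[str, float]:
    """1-itemset supports f(A; T) for every item A in `items`."""
    n = len(transactions)
    counts = Counter(i for t in transactions for i in t)
    return {i: (counts[i] / n if n else 0.0) for i in items}


def fast_trim(sample: Sequence[Transaction], n: int, k: int) -> List[Transaction]:
    """Trim `sample` (S) down to n transactions, one outlier per group."""
    if k < 1 or n < 0:
        raise ValueError("need k >= 1 and n >= 0")
    items = sorted({i for t in sample for i in t})
    target = supports(sample, items)   # f(A; S), the supports to preserve
    alive = list(range(len(sample)))   # indices of transactions still in S0

    def dist_without(idx: int) -> float:
        # Assumed distance: sum of squared support differences after removing idx.
        rest = [sample[j] for j in alive if j != idx]
        f0 = supports(rest, items)
        return sum((f0[i] - target[i]) ** 2 for i in items)

    while len(alive) > n:
        size = min(k, len(alive))
        groups = [alive[j:j + size] for j in range(0, len(alive), size)]
        for group in groups:
            if len(alive) <= n:        # stop as soon as the target size is reached
                break
            outlier = min(group, key=dist_without)
            alive.remove(outlier)
    return [sample[j] for j in alive]
```

A larger k means each removal is chosen from more candidates (better accuracy) at the price of more distance evaluations per removal, which is the speed/accuracy trade-off the slide mentions.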
10 FAST-grow
- Selects representative transactions from $S$ and adds them to the sample $S_0$, which is initially empty.

Growing phase (n is the target sample size):
    while (|S0| < n) {
        divide S − S0 into disjoint groups of min(k, |S − S0|) transactions each;
        for each group G {
            compute f(A; S0) for each item A;
            set S0 = S0 ∪ {t*}, where t* ∈ G and Dist(S0 ∪ {t*}, S) = min over t ∈ G of Dist(S0 ∪ {t}, S);
        }
    }
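A sketch of the growing phase, under stated assumptions: candidates are drawn from the transactions of S that are not yet in the sample, and the distance is a squared-difference distance over 1-itemset supports. Both choices, and all names, are illustrative rather than taken from the presentation.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence

Transaction = FrozenSet[str]


def supports(transactions: Sequence[Transaction], items: Sequence[str]) -> Dict[str, float]:
    """1-itemset supports f(A; T) for every item A in `items`."""
    n = len(transactions)
    counts = Counter(i for t in transactions for i in t)
    return {i: (counts[i] / n if n else 0.0) for i in items}


def fast_grow(sample: Sequence[Transaction], n: int, k: int) -> List[Transaction]:
    """Grow a sample S0 of n transactions out of `sample` (S)."""
    if k < 1 or not 0 <= n <= len(sample):
        raise ValueError("need k >= 1 and 0 <= n <= |S|")
    items = sorted({i for t in sample for i in t})
    target = supports(sample, items)        # f(A; S)
    chosen: List[int] = []                  # indices already in S0
    remaining = list(range(len(sample)))    # indices not yet in S0 (assumption)

    def dist_with(idx: int) -> float:
        # Assumed distance: sum of squared support differences after adding idx.
        grown = [sample[j] for j in chosen] + [sample[idx]]
        f0 = supports(grown, items)
        return sum((f0[i] - target[i]) ** 2 for i in items)

    while len(chosen) < n:
        size = min(k, len(remaining))
        groups = [remaining[j:j + size] for j in range(0, len(remaining), size)]
        for group in groups:
            if len(chosen) >= n:
                break
            best = min(group, key=dist_with)
            chosen.append(best)
            remaining.remove(best)
    return [sample[j] for j in chosen]
```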
11 Epsilon Approximation (EA)
- Similar to FAST: find a small subset whose 1-itemset supports are close to those in the entire database.
- The discrepancy of any subset $S_0$ of a superset $S$ (the distance between $S_0$ and $S$ with respect to the 1-itemset frequencies) is computed as the $L_\infty$ distance between the frequency vectors: $\mathrm{Dist}_\infty(S_0, S) = \max_{A \in I_1(S)} |f(A;S_0) - f(A;S)|$.
- Def: a sample $S_0$ of $S_1$ is an $\varepsilon$-approximation iff its discrepancy satisfies $\mathrm{Dist}_\infty(S_0, S_1) \le \varepsilon$.
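As a small illustration of the definition, the discrepancy can be computed as the largest absolute difference between 1-itemset supports. This sketch assumes the distance is the maximum (L-infinity) norm over the items of the superset; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Sequence

Transaction = FrozenSet[str]


def item_supports(transactions: Sequence[Transaction]) -> Dict[str, float]:
    """Support of every item that occurs in `transactions`."""
    n = len(transactions)
    counts = Counter(i for t in transactions for i in t)
    return {i: c / n for i, c in counts.items()}


def discrepancy(s0: Sequence[Transaction], s: Sequence[Transaction]) -> float:
    """Largest |f(A; S0) - f(A; S)| over the 1-itemsets A of S."""
    f0, f = item_supports(s0), item_supports(s)
    return max((abs(f0.get(a, 0.0) - fa) for a, fa in f.items()), default=0.0)


def is_eps_approximation(s0: Sequence[Transaction], s: Sequence[Transaction], eps: float) -> bool:
    """True when S0 is an eps-approximation of S under the definition above."""
    return discrepancy(s0, s) <= eps
```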
12 Epsilon Approximation (EA)
Halving Method
- Deterministically halves the data to get the sample $S_0$.
- Halving is applied repeatedly: $S_1 \Rightarrow S_2 \Rightarrow \dots \Rightarrow S_t \, (= S_0)$.
- Each halving step introduces a discrepancy of at most $\varepsilon(n_i, m)$, where $m$ = total no. of items in the database and $n_i$ = size of sub-sample $S_i$.
- Halving stops with the max $t$ such that $\sum_{k \le t} \varepsilon(n_k, m) \le \varepsilon$.
13 Epsilon Approximation (EA)
1. Color each transaction red (in sample) or blue (not in sample).
2. Keep a penalty for each item that reflects its red/blue balance:
   - the penalty is small if red and blue are approximately balanced;
   - the penalty shoots up exponentially when red dominates (item is over-sampled) or blue dominates (item is under-sampled).
3. Color transactions sequentially, keeping the penalty low: choose the color that gives the smaller penalty.
14 Epsilon Approximation (EA)
Penalty Computation
- Let $Q_i$ = penalty for item $A_i$; initially $Q_i = 2$.
- Suppose that we have colored the first $j$ transactions, where
  - $r_i = r_i(j)$ = no. of red transactions containing $A_i$
  - $b_i = b_i(j)$ = no. of blue transactions containing $A_i$
- A per-item parameter in $(0,1)$ influences how fast the penalty changes as a function of $|r_i - b_i|$.
- Error bound
15 Epsilon Approximation (EA)
How to color transaction j+1:
- Compute the global penalty twice: once assuming transaction j+1 is red, and once assuming it is blue.
- Choose the color for which the global penalty is smaller.
16 Epsilon Approximation (EA)
1. Initialization.
2. For each incoming transaction, compute the penalty of each item.
3. Compute the global penalty.
4. Decide to color the transaction red or blue.
5. Red transactions are added to the sample; blue transactions are forgotten.
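One round of halving can be sketched as below. The slides' penalty formula was lost, so this sketch assumes a per-item penalty of the form (1+δ)^r (1−δ)^b + (1−δ)^r (1+δ)^b, which equals 2 initially and grows exponentially with |r − b|, matching the properties listed in the slides. The single shared `delta`, the tie-breaking rule, and the function name are further assumptions.

```python
from typing import Dict, FrozenSet, List, Sequence

Transaction = FrozenSet[str]


def halve(transactions: Sequence[Transaction], delta: float = 0.1) -> List[Transaction]:
    """Color each transaction red (kept) or blue (dropped), greedily by penalty."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    over: Dict[str, float] = {}    # (1+delta)^r * (1-delta)^b, large if over-sampled
    under: Dict[str, float] = {}   # (1-delta)^r * (1+delta)^b, large if under-sampled
    red: List[Transaction] = []
    for t in transactions:
        # Only items in t change, so comparing their penalties compares global penalties.
        if_red = sum(over.get(i, 1.0) * (1 + delta) + under.get(i, 1.0) * (1 - delta) for i in t)
        if_blue = sum(over.get(i, 1.0) * (1 - delta) + under.get(i, 1.0) * (1 + delta) for i in t)
        keep = if_red <= if_blue   # ties go to red (arbitrary choice)
        up, down = (1 + delta, 1 - delta) if keep else (1 - delta, 1 + delta)
        for i in t:
            over[i] = over.get(i, 1.0) * up
            under[i] = under.get(i, 1.0) * down
        if keep:
            red.append(t)
    return red


# Eight identical transactions: the coloring alternates, so 4 of the 8 are kept.
print(len(halve([frozenset("a")] * 8)))
```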
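17 Epsilon Approximation (EA)
- The repeated halving method starts with $S$: apply one round of halving to get $S_1$, then another round of halving to $S_1$ to get $S_2$, etc.
- If $S_1$ is an $\varepsilon_1$-approximation of $S$ and $S_2$ is an $\varepsilon_2$-approximation of $S_1$, then $S_2$ is an $(\varepsilon_1 + \varepsilon_2)$-approximation of $S$.
- $S_t$ is an $\varepsilon_t$-approximation, with $\varepsilon_t = \sum_{k \le t} \varepsilon(n_k, m)$.
- Stop repeated halving at the max $t$ such that $\varepsilon_t \le \varepsilon$.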
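The additivity of the errors follows from the triangle inequality. As a short worked step, assuming the discrepancy is the maximum support difference over 1-itemsets, for every 1-itemset $A$:

```latex
\begin{align*}
|f(A;S_2) - f(A;S)|
  &\le |f(A;S_2) - f(A;S_1)| + |f(A;S_1) - f(A;S)| \\
  &\le \varepsilon_2 + \varepsilon_1 .
\end{align*}
```

Taking the maximum over all $A$ shows that $S_2$ is an $(\varepsilon_1 + \varepsilon_2)$-approximation of $S$; repeating the argument over $t$ rounds gives the sum of the per-round errors.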
18 Epsilon Approximation (EA)
- Repeated halving as described requires $t$ passes over the database.
- Observation: halving decides the color of each transaction sequentially.
- In a single pass, store the penalties of every halving round:
  - based on the penalties from the 1st halving, decide whether to color the transaction red;
  - if red, compute the penalty for the 2nd halving, and so on, until the transaction is colored blue or belongs to $S_t$, $t = \log n$.
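The single-pass idea can be sketched by keeping one penalty state per halving round and passing each transaction down the cascade until some round colors it blue. The penalty form (1+δ)^r (1−δ)^b + (1−δ)^r (1+δ)^b, the shared `delta`, and the names `Halver` and `single_pass_halving` are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]


class Halver:
    """Penalty state of one halving round (assumed penalty form, see lead-in)."""

    def __init__(self, delta: float = 0.1) -> None:
        self.delta = delta
        self.over: Dict[str, float] = {}    # grows when an item is over-sampled
        self.under: Dict[str, float] = {}   # grows when an item is under-sampled

    def offer(self, t: Transaction) -> bool:
        """Color t for this round; return True if it is colored red (kept)."""
        d = self.delta
        if_red = sum(self.over.get(i, 1.0) * (1 + d) + self.under.get(i, 1.0) * (1 - d) for i in t)
        if_blue = sum(self.over.get(i, 1.0) * (1 - d) + self.under.get(i, 1.0) * (1 + d) for i in t)
        keep = if_red <= if_blue
        up, down = (1 + d, 1 - d) if keep else (1 - d, 1 + d)
        for i in t:
            self.over[i] = self.over.get(i, 1.0) * up
            self.under[i] = self.under.get(i, 1.0) * down
        return keep


def single_pass_halving(stream: Iterable[Transaction], rounds: int, delta: float = 0.1) -> List[Transaction]:
    """Apply `rounds` halvings in one pass over the stream."""
    halvers = [Halver(delta) for _ in range(rounds)]
    sample: List[Transaction] = []
    for t in stream:
        # all() short-circuits: round k+1 only sees t if round k colored it red.
        if all(h.offer(t) for h in halvers):
            sample.append(t)
    return sample
```

Memory grows with the number of rounds times the number of items, since each round stores its own penalties.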
19 Experiments
- Synthetic data set from the IBM QUEST project
  - 100,000 transactions
  - 1,000 items
  - number of maximal potentially large itemsets = 2000
  - average transaction length: 10
  - average length of maximal large itemsets: 4
  - minimum support: 0.77%
  - length of the maximal large itemsets: 6
20 Experiments
- Apriori is used in all cases to compute the large itemsets; accuracy and execution time are measured.
- FAST
  - Two implementations: $\mathrm{Dist}_1$ and $\mathrm{Dist}_2$
  - Phase 1: sample size = 30%
  - Parameter k: 10
- EA
  - Run EA with a given $\varepsilon$ value, then use the obtained sample size to run FAST and SRS (simple random sampling)
  - Final sampling ratios: 0.76%, 1.51%, 3.02%, 6.04%, 12.4%, and 24.9%, dictated by the EA halvings
21 Experimental Results Accuracy vs Sampling Ratio
22 Experimental Results Time vs Sampling Ratio
23 Streaming Data Analysis
- Streaming databases grow continuously, rapidly, and without bound
- Example applications:
  - Stock tickers
  - Network traffic monitors
  - POS systems
  - Phone conversation wiretaps
- Challenges of analysis:
  - Timely response
  - Use of limited memory
24 Previous Work
- Algorithms that identify frequent singleton items over a data stream (VLDB’02)
  - Sticky Sampling
  - Lossy Counting
- Problems
  - They can accurately maintain statistics of items only over a stable data stream in which patterns change slowly
  - They fail for applications that require information about the entire stream but with emphasis on the most recent data
25 DSR: Data Stream Reduction
- EA-based algorithm to sample data streams
- Goal: generate a sample that carries information about the entire stream while favoring recent data
- Model: each element of the data stream is a transaction consisting of a set of items (0-1 problem)
- Suppose we want to generate an $N_s$-element sample $S_s$; $S_s$ puts more weight on recent data
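26 Data Stream Reduction
Representative sample of data stream
- The stream is divided into buckets numbered $1, \dots, m_s - 2, m_s - 1, m_s$, each holding $N_s/2$ transactions; bucket $m_s$ is the most recent.
- Total #transactions = $m_s \cdot N_s/2$
- To generate an $N_s$-element sample, bucket $k$ is halved $(m_s - k)$ times.
- The most recent buckets therefore contribute $N_s/2$, $N_s/4$, $N_s/8$, … transactions, so #transactions in $S \approx N_s$.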
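The sample-size claim can be checked with a few lines of arithmetic. This is a sketch under the assumption that every halving keeps exactly half of a bucket; the function name and the example values are hypothetical.

```python
from typing import List


def bucket_contributions(n_s: int, m_s: int) -> List[float]:
    """Transactions kept from buckets 1..m_s when bucket k is halved (m_s - k) times."""
    return [(n_s / 2) / 2 ** (m_s - k) for k in range(1, m_s + 1)]


# Hypothetical values: N_s = 1024, m_s = 4 buckets of 512 transactions each.
parts = bucket_contributions(1024, 4)
print(parts)       # oldest bucket first, newest bucket last
print(sum(parts))  # N_s * (1 - 2**-m_s), which approaches N_s as m_s grows
```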
27 Problem: frequent halving
- It is expensive to map transactions into conceptual buckets and recompute the representative subset of each bucket whenever a new transaction arrives.
- Goal: simulate the ideal scenario while avoiding frequent halving.
- Solution: use a working buffer that holds $N_s$ transactions, and compute a new representative sample by applying EA whenever the buffer is full.
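A sketch of the working-buffer idea follows. The halving step is pluggable; the default `keep_every_other` is only a stand-in so the example is self-contained and is not the EA halving. The class name and the choice to halve lazily, when a transaction arrives at a full buffer, are assumptions.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


def keep_every_other(buffer: List[T]) -> List[T]:
    """Stand-in halving step (NOT EA): keep every second element."""
    return buffer[::2]


class WorkingBuffer(Generic[T]):
    """Holds at most `capacity` (N_s) transactions; halves its contents when full."""

    def __init__(self, capacity: int, halve: Callable[[List[T]], List[T]] = keep_every_other) -> None:
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.capacity = capacity
        self.halve = halve
        self.items: List[T] = []

    def add(self, transaction: T) -> None:
        # Halve only when the buffer is full, not on every arrival.
        if len(self.items) >= self.capacity:
            self.items = list(self.halve(self.items))
        self.items.append(transaction)


buf: WorkingBuffer[int] = WorkingBuffer(8)
for x in range(16):
    buf.add(x)
print(buf.items)  # older elements have been through more halvings than recent ones
```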
28 Frequent Halving
[Figure: the working buffer of $N_s$ transactions alternating between Empty and Full. Each time the buffer fills, its contents are halved again, so its regions carry the labels 0 Halving, 1 Halving, 2 Halving, and 3 Halving.]
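29 Problem of Halving
- Problem: two users who look at the sample immediately before and immediately after a halving operation see data that varies substantially.
- Continuous DSR: the buffer is divided into $N_s/(2n_s)$ chunks of $2n_s$ transactions each, with $n_s \ll N_s$.
- When the next $n_s$ transactions arrive, the oldest chunk is halved first, which frees room for the $n_s$ new transactions.
[Figure: buffer with chunk boundaries at $2n_s$, $4n_s$, …, $N_s - 2n_s$, $N_s$ before the arrival, and at $n_s$, $3n_s$, $5n_s$, …, $N_s - n_s$, $N_s$ once the oldest chunk is halved and the new transactions are appended.]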
30 Discussions
- Advantages of DSR
  - DSR is more sensitive to recent changes in the stream
  - DSR generates a representative sample instead of collecting statistics such as counts, which gives more flexibility
  - Each halving operation is relatively cheap compared with the expensive frequent-itemset identification in the Lossy Counting-based approach
- Future work
  - Choice of discrepancy function (based on single-item frequencies)
  - How to evaluate the goodness of a representative subset
31 Conclusions
- FAST: two 2-stage sampling approaches, based on trimming outliers (FAST-trim) or selecting representative transactions (FAST-grow)
- Epsilon approximation: deterministic method for repeatedly halving data to obtain the final sample
- Can be used in conjunction with other non-sampling count-based mining algorithms
- Trades off processing speed and accuracy of results
- EA-based data stream reduction
32 References
- H. Bronnimann, B. Chen, M. Dash, P. Haas, Y. Qiao, and P. Scheuermann. Efficient Data-Reduction Methods for On-Line Association Rule Discovery, NGDM’02
- B. Chen, P. Haas, and P. Scheuermann. A New Two-Phase Sampling Based Algorithm for Discovering Association Rules, SIGKDD’02
- G. S. Manku and R. Motwani. Approximate Frequency Counts Over Data Streams, VLDB’02
33 Sticky Sampling……
- Use a fixed-size buffer and various sampling rates to estimate counts
- Sample the first 2t incoming items at rate r = 1 (select one item for every item seen)
- Sample the next 2t items at rate r = 2 (select one item for every two items seen)
- For the next 4t items, r = 4
- t is predefined based on the frequency threshold, the user-specified error, and the probability of failure
- Randomly select the same number of elements from an enlarging moving window that keeps doubling itself
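A rough sketch of this scheme, following the Manku and Motwani description as recalled here: the formula for t and the coin-toss adjustment applied when the rate doubles are not on the slide and should be read as assumptions, as should the function name and the seeded random generator.

```python
import math
import random
from typing import Dict, Hashable, Iterable


def sticky_sampling(stream: Iterable[Hashable], support: float, error: float,
                    failure_prob: float, seed: int = 0) -> Dict[Hashable, int]:
    """Approximate item counts with sampling rates 1, 2, 4, ... over growing windows."""
    rng = random.Random(seed)
    # Assumed window parameter: t = (1/error) * ln(1 / (support * failure_prob)).
    t = math.ceil(math.log(1.0 / (support * failure_prob)) / error)
    counts: Dict[Hashable, int] = {}
    rate, boundary = 1, 2 * t          # the first 2t items are sampled at rate 1
    for n, item in enumerate(stream, start=1):
        if n > boundary:               # the next window is as long as everything so far
            rate *= 2
            boundary *= 2
            # Assumed adjustment: lower each count by one per unsuccessful fair coin toss.
            for key in list(counts):
                while counts[key] > 0 and rng.random() < 0.5:
                    counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
        if item in counts:
            counts[item] += 1          # tracked items are always counted
        elif rng.random() < 1.0 / rate:
            counts[item] = 1           # new items enter with probability 1/rate
    return counts
```

While the stream is shorter than 2t items the rate is 1, so the counts are exact; beyond that they become estimates.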
34 Lossy Counting……
- Store the observed frequency and the estimated maximal frequency error for each frequent, or potentially frequent, item in a series of conceptual buckets
- Keep adding new items to the buckets and removing existing less frequent items from them
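A compact sketch of the bucket bookkeeping, following the Manku and Motwani description as recalled here: the bucket width of ceil(1/error) and the pruning rule f + Δ ≤ current bucket are not on the slide and should be treated as assumptions, as should the function name.

```python
import math
from typing import Dict, Hashable, Iterable, Tuple


def lossy_counting(stream: Iterable[Hashable], error: float) -> Tuple[Dict[Hashable, Tuple[int, int]], int]:
    """Return ({item: (observed frequency, max frequency error)}, stream length)."""
    width = math.ceil(1.0 / error)     # assumed width of one conceptual bucket
    entries: Dict[Hashable, list] = {}
    n = 0
    for n, item in enumerate(stream, start=1):
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # A new item may have been missed in every earlier bucket.
            entries[item] = [1, bucket - 1]
        if n % width == 0:             # bucket boundary: drop infrequent items
            for key in [k for k, (f, d) in entries.items() if f + d <= bucket]:
                del entries[key]
    return {k: (f, d) for k, (f, d) in entries.items()}, n


# After one full bucket of width 10, only the item seen five times survives.
print(lossy_counting(list("aaaaabcdef"), 0.1))
```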