NGDM’02
Efficient Data-Reduction Methods for On-line Association Rule Mining
H. Brönnimann (Polytechnic Univ.), B. Chen (Exelixis), M. Dash, Y. Qiao, P. Scheuermann (Northwestern University), P. Haas (IBM Almaden)
Motivation
- The volume of data in warehouses and on the Internet is growing faster than Moore's Law, so scalability is a major concern
- "Classical" algorithms require one or more scans of the database
- Need to adapt to streaming data: data elements arrive on-line and only a limited amount of memory is available
- One solution: execute the algorithm on a sample, i.e., a lossy compressed synopsis (sketch) of the data
Motivation (cont.)
- Advantage of sampling: accuracy and speed can be traded off explicitly
- Sampling methods work best when tailored to the application; here the application is association rule mining
- Our contribution: sampling methods for count datasets, where there is a base set of items and each data element is a vector of item counts
Outline of the Presentation
- Motivation
- FAST
- Epsilon Approximation
- Experimental Results
- Data Stream Reduction
- Conclusion
The Problem
- Generate a smaller subset S0 of a larger superset S such that the supports of the 1-itemsets in S0 are close to those in S
- The problem is NP-complete (reduction from the One-In-Three SAT problem)
- Notation:
  - I1(T) = set of all 1-itemsets in transaction set T
  - L1(T) = set of frequent 1-itemsets in transaction set T
  - f(A;T) = support of itemset A in transaction set T
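The slide does not spell out the distance Dist(S0, S) between a sample and its superset; a minimal Python sketch, assuming the common choice of summing absolute differences of 1-itemset supports (not necessarily the authors' exact metric):

```python
from collections import Counter

def supports(transactions):
    """f(A;T): support of each 1-itemset A in transaction set T."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def dist(sample, superset):
    """Assumed Dist(S0, S): sum of absolute support differences
    over all 1-itemsets of the superset."""
    f_sup = supports(superset)
    f_sam = supports(sample) if sample else {}
    return sum(abs(f_sam.get(a, 0.0) - f) for a, f in f_sup.items())
```

Transactions are modeled as sets of item identifiers; any hashable item type works.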
FAST-trim Outline
Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:
1. Obtain a large simple random sample S from D.
2. Compute f(A;S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.
FAST-trim Algorithm
Uses the input parameter k to explicitly trade off speed and accuracy.
Trimming phase:

    while (|S0| > n) {
        divide S0 into disjoint groups of min(k, |S0|) transactions each;
        for each group G {
            compute f(A;S0) for each item A;
            set S0 = S0 - {t*}, where t* in G satisfies
                Dist(S0 - {t*}, S) = min over t in G of Dist(S0 - {t}, S);
        }
    }

Note: removal of the outlier t* causes the maximum decrease (or minimum increase) in Dist(S0, S).
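The trimming phase above can be sketched in Python. Here `dist` is any sample-to-superset distance function, and the exact order in which groups are scanned is an assumption about details the slide leaves open:

```python
def fast_trim(S, n, k, dist):
    """FAST-trim sketch: repeatedly remove, within each group of up to k
    transactions, the outlier t* whose removal minimizes Dist(S0 - {t*}, S)."""
    S0 = list(S)
    while len(S0) > n:
        start = 0
        while start < len(S0) and len(S0) > n:
            group = range(start, min(start + k, len(S0)))
            # t* = transaction whose deletion yields the smallest distance to S
            best = min(group, key=lambda i: dist(S0[:i] + S0[i + 1:], S))
            del S0[best]
            start += k - 1  # this group shrank by one; move to the next group
    return S0
```

Small k means fewer distance evaluations per removed transaction (faster, less accurate); k = |S0| recovers a fully greedy global scan.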
FAST-grow Algorithm
Select representative transactions from S and add them to the sample S0, which is initially empty.
Growing phase:

    while (|S0| < n) {
        divide the unchosen transactions of S into disjoint groups of min(k, |S|) transactions each;
        for each group G {
            compute f(A;S0) for each item A;
            set S0 = S0 ∪ {t*}, where t* in G satisfies
                Dist(S0 ∪ {t*}, S) = min over t in G of Dist(S0 ∪ {t}, S);
        }
    }
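The growing phase is the mirror image of trimming; a sketch under the same assumptions (generic `dist`, greedy one-pick-per-group order):

```python
def fast_grow(S, n, k, dist):
    """FAST-grow sketch: start from an empty sample and greedily add, from each
    group of up to k unchosen transactions, the representative t* that
    minimizes Dist(S0 + [t*], S)."""
    S0, remaining = [], list(S)
    while len(S0) < n and remaining:
        groups = [remaining[i:i + k] for i in range(0, len(remaining), k)]
        leftovers = []
        for G in groups:
            if len(S0) >= n:          # target size reached mid-pass
                leftovers.extend(G)
                continue
            # t* = transaction whose addition yields the smallest distance to S
            best = min(range(len(G)), key=lambda i: dist(S0 + [G[i]], S))
            S0.append(G[best])
            leftovers.extend(G[:best] + G[best + 1:])
        remaining = leftovers
    return S0
```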
Epsilon Approximation (EA)
- Theory based on work in statistics on VC dimensions (Vapnik & Chervonenkis '71) shows that the frequencies of a collection of subsets can be estimated simultaneously whenever the VC dimension is finite
- Applications in computational geometry and learning theory
- Def: a sample S0 of S1 is an ε-approximation iff the discrepancy satisfies Dist(S0, S1) ≤ ε
Epsilon Approximation (EA): Halving Method
- Deterministically halves the data to obtain the sample S0
- Apply halving repeatedly (S1 => S2 => … => St (= S0))
- Each halving step introduces a discrepancy ε_i on the order of sqrt(ln(2m) / n_i), where m = total number of items in the database and n_i = size of sub-sample S_i
- Halving stops with the maximum t such that the accumulated discrepancy ε_1 + ε_2 + … + ε_t stays below the target ε
Epsilon Approximation (EA): How to Compute a Halving
Hyperbolic cosine method [Spencer]:
1. Color each transaction red (in the sample) or blue (not in the sample).
2. Maintain a penalty for each item: the penalty is small if red and blue are approximately balanced, and shoots up exponentially when red dominates (the item is over-sampled) or blue dominates (the item is under-sampled).
3. Color the transactions sequentially, keeping the penalty low.
Key property: on average the penalty does not increase, so one of the two colors never increases the global penalty.
Epsilon Approximation (EA): Penalty Computation
Let Q_i = penalty for item A_i; initially Q_i = 2. Suppose that we have colored the first j transactions; then

    Q_i = (1 + δ_i)^(r_i − b_i) + (1 − δ_i)^(r_i − b_i)

where
    r_i = r_i(j) = number of red transactions containing A_i
    b_i = b_i(j) = number of blue transactions containing A_i
    δ_i = parameter that influences how fast the penalty changes as a function of |r_i − b_i|
(With r_i = b_i = 0 this gives the initial value Q_i = 2; the first term blows up when red dominates, the second when blue dominates.)
Epsilon Approximation (EA): How to Color Transaction j+1
- Compute the global penalty Q = Σ_i Q_i in both cases:
  - Q_red = global penalty assuming transaction j+1 is red
  - Q_blue = global penalty assuming transaction j+1 is blue
- Choose the color for which the global penalty is smaller
- Since transactions are colored one at a time, EA is inherently an on-line method
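The coloring procedure of the last three slides can be sketched as follows. The penalty form and the single δ value are assumptions consistent with the initial value Q_i = 2; the returned red half is the sample:

```python
def halve(transactions, items, delta=0.1):
    """Hyperbolic-cosine halving sketch [Spencer]: color transactions
    sequentially red (kept) or blue (dropped), always picking the color
    with the smaller global penalty sum_i Q_i, where
    Q_i = (1+delta)**(r_i-b_i) + (1-delta)**(r_i-b_i)."""
    diff = {a: 0 for a in items}   # r_i - b_i for each item
    red = []
    for t in transactions:
        # global penalty if t is colored red (sign=+1) or blue (sign=-1)
        def penalty(sign):
            return sum((1 + delta) ** (diff[a] + (sign if a in t else 0))
                       + (1 - delta) ** (diff[a] + (sign if a in t else 0))
                       for a in items)
        if penalty(+1) <= penalty(-1):
            red.append(t)
            for a in t:
                diff[a] += 1
        else:
            for a in t:
                diff[a] -= 1
    return red
```

For each item the penalty is minimized at r_i = b_i, so the greedy choice keeps the red/blue counts of every item nearly balanced, which is exactly the low-discrepancy property the halving needs.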
Performance Evaluation
Synthetic data set from the IBM QUEST project [AS94]:
- 100,000 transactions
- 1,000 items
- number of maximal potentially large itemsets = 2,000
- average transaction length: 10
- average length of maximal large itemsets: 4
- length of the maximal large itemsets: 6
- minimum support: 0.77%
Final sampling ratios: 0.76%, 1.51%, 3.0%, … dictated by EA halvings.
Experimental Results
87% reduction in sample size, with accuracy: EA (99%), FAST_trim_D2 (97%), SRS (94.6%).
Experimental Results (cont.)
- FAST_grow_D2 is best for very small sampling ratios (< 2%)
- EA is best overall in accuracy
Data Stream Reduction (DSR)
- Goal: a representative sample of the data stream that assigns more weight to recent data while partially keeping track of old data
- The stream is divided into buckets 1, 2, …, m_S, with bucket m_S the most recent
- To generate an N_S-element sample, bucket k is halved (m_S − k) times, so the buckets contribute N_S/2, N_S/4, N_S/8, …, 1 transactions from newest to oldest
- Total number of transactions covered = m_S · N_S / 2
Data Stream Reduction: Practical Implementation
- A buffer of N_S transactions is maintained; its regions have been halved 0, 1, 2, 3, … times, with older regions halved more often
- To avoid frequent halving, we use one buffer at a time and compute a new representative sample, by applying EA, only when the buffer is full
Data Stream Reduction: Continuous DSR
- Problem: two users querying immediately before and after a halving operation see data that varies substantially
- Continuous DSR: the buffer is divided into chunks of n_s transactions (boundaries at n_s, 2n_s, 3n_s, …, N_s)
- As the next n_s transactions arrive, the oldest chunk is halved first, so the sample evolves gradually instead of changing all at once
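The bucket scheme of the DSR slides can be sketched as below; `halve` stands in for any halving routine (e.g. the EA hyperbolic-cosine halving), and the list-of-buckets interface is a simplification of the buffer mechanics described above:

```python
def dsr_sample(buckets, halve):
    """DSR sketch: bucket k (1 = oldest, m = newest) is halved (m - k) times,
    so recent buckets contribute geometrically more transactions to the
    representative sample."""
    m = len(buckets)
    sample = []
    for k, bucket in enumerate(buckets, start=1):
        sub = list(bucket)
        for _ in range(m - k):  # older buckets are halved more often
            sub = halve(sub)
        sample.extend(sub)
    return sample
```

With buckets of N_S/2 raw transactions each, the contributions form the geometric series N_S/2 + N_S/4 + …, summing to just under the N_S-element target.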
Conclusion
- Two-stage sampling approach based on trimming outliers or selecting representative transactions
- Epsilon approximation: a deterministic method that repeatedly halves the data to obtain the final sample; it can be used in conjunction with other non-sampling, count-based mining algorithms
- EA-based data stream reduction
- Future work: evaluating the goodness of a representative subset; using frequency information in the discrepancy function