NGDM’02
Efficient Data-Reduction Methods for On-line Association Rule Mining
H. Bronnimann (Polytechnic Univ, hbr@poly.edu); B. Chen (Exilixis, bchen@ece.nwu.edu); M. Dash, Y. Qiao, P. Scheuermann (Northwestern University, {manoranj,yiqiao,peters}@ece.nwu.edu); P. Haas (IBM Almaden, peterh@almaden.ibm.com)
Motivation
- The volume of data in warehouses and on the Internet is growing faster than Moore's Law, so scalability is a major concern: "classical" algorithms require one or more scans of the database.
- Algorithms need to adapt to streaming data: data elements arrive on-line, only a limited amount of memory is available, and only lossy compressed synopses (sketches) of the data can be kept.
- One solution: execute the algorithm on a sample.
Motivation (cont.)
- Sampling methods can explicitly trade off accuracy against speed, and work best when tailored to the application.
- Setting: a base set of items, where each data element is a vector of item counts.
- Application: association rule mining.
- Our contribution: sampling methods for count datasets.
Outline
- Motivation
- FAST
- Epsilon Approximation
- Experimental Results
- Data Stream Reduction
- Conclusion
The Problem
Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S. The problem is NP-complete, by reduction from the One-In-Three SAT problem.
Notation:
- I1(T) = set of all 1-itemsets in transaction set T
- L1(T) = set of frequent 1-itemsets in transaction set T
- f(A;T) = support of itemset A in transaction set T
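The notion of "supports in S0 close to those in S" can be made concrete with a discrepancy function. A minimal sketch in Python, using an L1 distance over 1-itemset supports (one plausible choice; the slides do not fix the exact form of Dist):

```python
from collections import Counter

def supports(transactions):
    """Per-item (1-itemset) support: fraction of transactions containing it."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    n = max(len(transactions), 1)
    return {item: c / n for item, c in counts.items()}

def dist(sample, full):
    """L1 discrepancy between the 1-itemset supports of two transaction sets."""
    fs, ff = supports(sample), supports(full)
    return sum(abs(fs.get(a, 0.0) - ff.get(a, 0.0)) for a in set(fs) | set(ff))

S = [{'a', 'b'}, {'a'}, {'b', 'c'}, {'a', 'c'}]
S0 = [{'a', 'b'}, {'b', 'c'}]
print(dist(S0, S))  # → 0.75
```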
FAST-trim Outline
Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:
1. Obtain a large simple random sample S from the database D.
2. Compute f(A;S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.
FAST-trim Algorithm
The trimming phase uses the input parameter k to explicitly trade off speed against accuracy:

    while (|S0| > n) {
        divide S0 into disjoint groups of min(k, |S0|) transactions each;
        for each group G {
            compute f(A;S0) for each item A;
            set S0 = S0 - {t*}, where t* in G satisfies
                Dist(S0 - {t*}, S) = min_{t in G} Dist(S0 - {t}, S);
        }
    }

Note: removal of the outlier t* causes the maximum decrease (or minimum increase) in Dist(S0, S).
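The trimming loop can be sketched in Python; the function name, the group layout, and the L1 form of `dist` are illustrative choices, not the authors' exact implementation:

```python
from collections import Counter

def item_supports(T):
    """Fraction of transactions in T containing each item."""
    counts = Counter()
    for t in T:
        counts.update(set(t))
    n = max(len(T), 1)
    return {a: c / n for a, c in counts.items()}

def dist(T1, T2):
    """L1 distance between 1-itemset supports (illustrative choice of Dist)."""
    f1, f2 = item_supports(T1), item_supports(T2)
    return sum(abs(f1.get(a, 0.0) - f2.get(a, 0.0)) for a in set(f1) | set(f2))

def fast_trim(S, n, k):
    """Shrink a copy of S down to n transactions: in groups of k, repeatedly
    remove the outlier t* whose removal minimizes Dist(S0 - {t}, S)."""
    S0 = list(S)
    while len(S0) > n:
        groups = [S0[i:i + k] for i in range(0, len(S0), k)]
        for G in groups:
            if len(S0) <= n:
                break
            # t* in G whose removal best reduces the discrepancy to S
            t_star = min(G, key=lambda t: dist([u for u in S0 if u is not t], S))
            S0 = [u for u in S0 if u is not t_star]
    return S0
```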
FAST-grow Algorithm
The growing phase selects representative transactions from S and adds them to the sample S0, which is initially empty:

    while (|S0| < n) {
        divide the remaining transactions of S into disjoint groups of
            at most k transactions each;
        for each group G {
            compute f(A;S0) for each item A;
            set S0 = S0 ∪ {t*}, where t* in G satisfies
                Dist(S0 ∪ {t*}, S) = min_{t in G} Dist(S0 ∪ {t}, S);
        }
    }
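The growing phase can be sketched the same way; again `fast_grow` and the L1 `dist` are illustrative, not the authors' exact code:

```python
from collections import Counter

def item_supports(T):
    """Fraction of transactions in T containing each item."""
    counts = Counter()
    for t in T:
        counts.update(set(t))
    n = max(len(T), 1)
    return {a: c / n for a, c in counts.items()}

def dist(T1, T2):
    """L1 distance between 1-itemset supports (illustrative choice of Dist)."""
    f1, f2 = item_supports(T1), item_supports(T2)
    return sum(abs(f1.get(a, 0.0) - f2.get(a, 0.0)) for a in set(f1) | set(f2))

def fast_grow(S, n, k):
    """Build S0 from scratch: in groups of k candidates, repeatedly add the
    representative t* that minimizes Dist(S0 + [t], S)."""
    S0, remaining = [], list(S)
    while len(S0) < n and remaining:
        groups = [remaining[i:i + k] for i in range(0, len(remaining), k)]
        remaining = []
        for G in groups:
            if len(S0) >= n:
                remaining.extend(G)
                continue
            t_star = min(G, key=lambda t: dist(S0 + [t], S))
            S0.append(t_star)
            remaining.extend(u for u in G if u is not t_star)
    return S0
```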
Epsilon Approximation (EA)
- Theory based on work in statistics on VC dimensions (Vapnik & Chervonenkis '71) shows that the frequencies of a collection of subsets can be estimated simultaneously, provided the VC dimension is finite.
- The technique has applications in computational geometry and learning theory.
- Definition: a sample S0 of S is an ε-approximation iff the discrepancy satisfies Dist(S0, S) ≤ ε.
Epsilon Approximation (EA): Halving Method
- Deterministically halve the data to obtain the sample S0.
- Apply halving repeatedly (S1 ⇒ S2 ⇒ … ⇒ St = S0).
- Each halving step introduces a discrepancy ε_i that depends on m, the total number of items in the database, and n_i, the size of sub-sample S_i.
- Halving stops with the maximum t such that ε_1 + ε_2 + … + ε_t ≤ ε.
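The stopping rule can be sketched as follows; since the slide omits the exact expression for ε_i, the form sqrt(ln(2m)/n_i) used here is an assumption for illustration only:

```python
import math

def num_halvings(n, m, eps):
    """Number of EA halvings that fit in the discrepancy budget eps.
    ASSUMPTION: each halving of a sub-sample of size n_i introduces
    eps_i = sqrt(ln(2m)/n_i); the slide leaves the exact expression open."""
    t, used, n_i = 0, 0.0, n
    while n_i > 1:
        eps_i = math.sqrt(math.log(2 * m) / n_i)
        if used + eps_i > eps:
            break
        used += eps_i
        n_i //= 2
        t += 1
    return t

print(num_halvings(100000, 1000, 0.05))  # → 3, i.e. a 12.5% final sample
```

Smaller sub-samples cost more discrepancy per halving, which is why the accumulated sum, not the count of halvings, is what the budget ε limits.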
Epsilon Approximation (EA): Computing a Halving
The hyperbolic cosine method [Spencer]:
1. Color each transaction red (in the sample) or blue (not in the sample).
2. Maintain a penalty for each item: the penalty stays small while the red and blue counts are approximately balanced, and shoots up exponentially when red dominates (the item is over-sampled) or blue dominates (the item is under-sampled).
3. Color the transactions sequentially, keeping the penalty low.
Key property: the penalty does not increase on average, so at least one of the two colors does not increase the global penalty.
Penalty Computation
Let Q_i be the penalty for item A_i, initialized to Q_i = 2. Suppose the first j transactions have been colored; in Spencer's standard form the penalty is

    Q_i = (1 + δ_i)^{r_i} (1 − δ_i)^{b_i} + (1 − δ_i)^{r_i} (1 + δ_i)^{b_i}

where
- r_i = r_i(j) = number of red transactions containing A_i
- b_i = b_i(j) = number of blue transactions containing A_i
- δ_i = a parameter that controls how fast the penalty changes as a function of |r_i − b_i|
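A small numeric check of the Spencer-style penalty assumed above (the function name and the value δ = 0.1 are illustrative):

```python
def penalty(r, b, delta):
    """Hyperbolic-cosine penalty for one item with r red (in-sample) and
    b blue (out-of-sample) occurrences; equals 2 before any coloring and
    grows rapidly as |r - b| grows."""
    return (1 + delta) ** r * (1 - delta) ** b + \
           (1 - delta) ** r * (1 + delta) ** b

print(penalty(0, 0, 0.1))   # → 2.0 (initial value)
print(penalty(5, 5, 0.1))   # balanced coloring: stays near 2
print(penalty(10, 0, 0.1))  # over-sampled item: penalty grows
```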
Coloring Transaction j+1
- Compute the global penalty Q = Σ_i Q_i under both hypotheses:
  - Q_red = global penalty assuming transaction j+1 is colored red
  - Q_blue = global penalty assuming transaction j+1 is colored blue
- Choose the color for which the global penalty is smaller.
- Since transactions are processed one at a time in sequence, EA is inherently an on-line method.
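Putting the penalty and the coloring rule together, a self-contained sketch of one halving pass (assuming the Spencer-style penalty form; names and δ are illustrative):

```python
def halve(transactions, items, delta=0.1):
    """Color transactions red (kept) or blue (dropped), always choosing the
    color that yields the smaller global penalty, and return the red half."""
    def pen(r, b):
        return (1 + delta) ** r * (1 - delta) ** b + \
               (1 - delta) ** r * (1 + delta) ** b

    red_count = {a: 0 for a in items}
    blue_count = {a: 0 for a in items}
    red = []
    for t in transactions:
        # global penalty if t is colored red vs. blue
        p_red = sum(pen(red_count[a] + (a in t), blue_count[a]) for a in items)
        p_blue = sum(pen(red_count[a], blue_count[a] + (a in t)) for a in items)
        if p_red <= p_blue:
            red.append(t)
            for a in t:
                red_count[a] += 1
        else:
            for a in t:
                blue_count[a] += 1
    return red
```

Because each transaction is colored once, in arrival order, this routine never needs to revisit earlier data, which is what makes EA usable on streams.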
Performance Evaluation
Synthetic data set from the IBM QUEST project [AS94]:
- 100,000 transactions
- 1,000 items
- number of maximal potentially large itemsets: 2,000
- average transaction length: 10
- average length of the maximal large itemsets: 4
- length of the maximal large itemsets: 6
- minimum support: 0.77%
Final sampling ratios (0.76%, 1.51%, 3.0%, …) are dictated by the EA halvings.
Experimental Results
At an 87% reduction in sample size, the accuracies are: EA 99%, FAST_trim_D2 97%, SRS 94.6%.
Experimental Results (cont.)
- FAST_grow_D2 is best at very small sampling ratios (below 2%).
- EA is best overall in accuracy.
Data Stream Reduction
Data Stream Reduction (DSR) maintains a representative sample of a data stream, assigning more weight to recent data while partially keeping track of old data.
- The stream is divided into m_S buckets, numbered 1 (oldest) to m_S (newest).
- To generate an N_S-element sample, bucket k is halved (m_S − k) times, so the buckets contribute N_S/2, N_S/4, N_S/8, …, 1 transactions from newest to oldest.
- Total number of transactions covered = m_S · N_S / 2.
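The bucket schedule can be checked with a small helper (the name is hypothetical, and integer division stands in for an actual EA halving):

```python
def bucket_sizes(N_s, m_s):
    """Contribution of each bucket to the N_s-element sample, newest first:
    bucket k is halved (m_s - k) times, so the newest bucket keeps N_s/2
    transactions, the next N_s/4, and so on down to the oldest."""
    return [N_s // 2 ** (m_s - k + 1) for k in range(m_s, 0, -1)]

print(bucket_sizes(64, 4))       # → [32, 16, 8, 4]
print(sum(bucket_sizes(64, 4)))  # → 60, close to N_s = 64
```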
Data Stream Reduction: Practical Implementation
To avoid frequent halving, a single buffer is used: transactions accumulate in the buffer, and a new representative sample is computed by applying EA only when the buffer is full. (The slide's figure shows the buffer contents after 0, 1, 2, and 3 halvings.)
Continuous DSR
Problem: two users querying immediately before and after a halving operation see data that varies substantially.
Continuous DSR: the buffer is divided into chunks of n_s transactions each. As the next n_s transactions arrive, the oldest chunk is halved first, so the sample evolves gradually instead of changing all at once.
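A minimal sketch of the "oldest chunk halved first" rule, with `halve` left abstract so any halving routine such as EA could be plugged in (the function name and the fitting loop are illustrative assumptions):

```python
def continuous_dsr(chunks, new_chunk, N_s, halve):
    """Append the newest chunk of transactions, then halve chunks
    oldest-first until the sample fits in N_s again; `halve` is any
    routine returning half of its input (e.g. an EA halving)."""
    chunks = chunks + [new_chunk]
    i = 0
    while sum(len(c) for c in chunks) > N_s and i < len(chunks):
        chunks[i] = halve(chunks[i])
        i += 1
    return chunks
```

Since only the oldest chunks shrink at each arrival, consecutive queries see samples that overlap heavily rather than a wholesale replacement.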
Conclusion
- A two-stage sampling approach based on trimming outliers or selecting representative transactions (FAST).
- Epsilon approximation: a deterministic method for repeatedly halving the data to obtain the final sample; it can be used in conjunction with other non-sampling count-based mining algorithms.
- EA-based data stream reduction.
- Ongoing work: how to evaluate the goodness of a representative subset, and how to use frequency information in the discrepancy function.