NGDM'02 presentation transcript: Efficient Data-Reduction Methods for On-line Association Rule Mining (H. Bronnimann, B. Chen, M. Dash, Y. Qiao, P. Scheuermann, P. Haas)


Slide 1: Efficient Data-Reduction Methods for On-line Association Rule Mining
H. Bronnimann (Polytechnic Univ., hbr@poly.edu), B. Chen (Exelixis, bchen@ece.nwu.edu), M. Dash, Y. Qiao, P. Scheuermann (Northwestern University, {manoranj,yiqiao,peters}@ece.nwu.edu), P. Haas (IBM Almaden, peterh@almaden.ibm.com)

Slide 2: Motivation
 Volume of data in warehouses and on the Internet is growing faster than Moore's Law; scalability is a major concern
 "Classical" algorithms require one or more scans of the database
 Need to adapt to streaming data
   Data elements arrive on-line
   Limited amount of memory
   Lossy compressed synopses (sketches) of data
 One solution: execute the algorithm on a sample

Slide 3: Motivation (cont.)
 Sampling methods
   Advantage: can explicitly trade off accuracy and speed
   Work best when tailored to the application
 Our contributions
   Sampling methods for count datasets: a base set of items, where each data element is a vector of item counts
   Application: association rule mining

Slide 4: Outline
 Motivation
 FAST
 Epsilon Approximation
 Experimental Results
 Data Stream Reduction
 Conclusion

Slide 5: The Problem
Generate a smaller subset S_0 of a larger superset S such that the supports of 1-itemsets in S_0 are close to those in S. The problem is NP-complete (by reduction from the One-In-Three SAT problem).
Notation:
  I_1(T) = set of all 1-itemsets in transaction set T
  L_1(T) = set of frequent 1-itemsets in transaction set T
  f(A;T) = support of itemset A in transaction set T
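As a concrete illustration of the notation, a minimal Python sketch. Two assumptions are mine, not the slides': transactions are represented as sets of item ids, and Dist (left abstract on the slides) is taken to be the L1 distance between 1-itemset support vectors.

```python
def supports(T, items):
    """f(A;T): fraction of transactions in T containing item A, per item A."""
    n = len(T)
    return {a: sum(a in t for t in T) / n for a in items}

def dist(sample, S, items):
    """Assumed Dist: L1 distance between 1-itemset support vectors."""
    fs, ff = supports(sample, items), supports(S, items)
    return sum(abs(fs[a] - ff[a]) for a in items)

S = [{1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
S0 = [{1, 2}, {1, 3}]
d = dist(S0, S, items=[1, 2, 3])  # 0.75 for this toy data
```

A good reduced sample S_0 is one that keeps this distance small while being much smaller than S.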

Slide 6: FAST-trim Outline
Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:
1. Obtain a large simple random sample S from D.
2. Compute f(A;S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S_0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S_0, with minimum support p and confidence c, to obtain the final set of association rules.

Slide 7: FAST-trim Algorithm
Uses input parameter k to explicitly trade off speed and accuracy.
Trimming phase:
while (|S_0| > n) {
    divide S_0 into disjoint groups of min(k, |S_0|) transactions each;
    for each group G {
        compute f(A;S_0) for each item A;
        set S_0 = S_0 - {t*}, where t* in G satisfies
            Dist(S_0 - {t*}, S) = min over t in G of Dist(S_0 - {t}, S);
    }
}
Note: removal of the outlier t* causes the maximum decrease or minimum increase in Dist(S_0, S).
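The trimming phase can be sketched as follows. Hedged assumptions: Dist is again taken as the L1 support distance, groups are formed by position rather than randomly, and all function names are mine.

```python
def supports(T, items):
    n = len(T)
    return {a: sum(a in t for t in T) / n for a in items}

def dist(sample, S, items):
    fs, ff = supports(sample, items), supports(S, items)
    return sum(abs(fs[a] - ff[a]) for a in items)

def _dist_without(S0, t, S, items):
    trial = list(S0)
    trial.remove(t)
    return dist(trial, S, items)

def fast_trim(S, n, k, items):
    """Greedily remove, from each group of <= k transactions, the t* whose
    removal minimizes Dist(S0 - {t*}, S), until only n transactions remain."""
    S0 = list(S)
    while len(S0) > n:
        g = min(k, len(S0))
        for group in [S0[i:i + g] for i in range(0, len(S0), g)]:
            if len(S0) <= n:
                break
            t_star = min(group, key=lambda t: _dist_without(S0, t, S, items))
            S0.remove(t_star)
    return S0

S_demo = [{1, 2}, {1, 2}, {1, 2}, {1, 2}, {9}]
trimmed = fast_trim(S_demo, n=4, k=5, items=[1, 2, 9])
```

Larger k means fewer, cheaper passes with coarser candidate comparisons, which is the speed/accuracy trade-off the slide mentions.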

Slide 8: FAST-grow Algorithm
Select representative transactions from S and add them to the sample S_0, which is initially empty.
Growing phase:
while (|S_0| < n) {
    divide S into disjoint groups of min(k, |S|) transactions each;
    for each group G {
        compute f(A;S_0) for each item A;
        set S_0 = S_0 ∪ {t*}, where t* in G satisfies
            Dist(S_0 ∪ {t*}, S) = min over t in G of Dist(S_0 ∪ {t}, S);
    }
}
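A matching sketch of the growing phase, under the same assumptions as before (L1 support distance for Dist, positional grouping, names mine):

```python
def supports(T, items):
    n = len(T)
    return {a: sum(a in t for t in T) / n for a in items}

def dist(sample, S, items):
    fs, ff = supports(sample, items), supports(S, items)
    return sum(abs(fs[a] - ff[a]) for a in items)

def fast_grow(S, n, k, items):
    """From each group of <= k candidates, add to S0 the t* minimizing
    Dist(S0 u {t*}, S); repeat over the remaining pool until |S0| = n."""
    S0, pool = [], list(S)
    while len(S0) < n and pool:
        g = min(k, len(pool))
        groups = [pool[i:i + g] for i in range(0, len(pool), g)]
        pool = []
        for group in groups:
            if len(S0) >= n:
                pool.extend(group)
                continue
            t_star = min(group, key=lambda t: dist(S0 + [t], S, items))
            S0.append(t_star)
            pool.extend(x for x in group if x is not t_star)
    return S0

S_demo = [{1}, {2}, {1}, {2}]
grown = fast_grow(S_demo, n=2, k=4, items=[1, 2])
```

On this toy data the grown sample reproduces the supports of S exactly, since one {1} and one {2} match the 50/50 item frequencies of the full set.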

Slide 9: Epsilon Approximation (EA)
 Theory based on work in statistics on VC dimensions (Vapnik & Chervonenkis '71) shows that the frequencies of a collection of subsets can be estimated simultaneously whenever the VC dimension is finite
 Applications in computational geometry and learning theory
 Definition: a sample S_0 of S_1 is an ε-approximation iff the discrepancy satisfies max over A of |f(A;S_0) - f(A;S_1)| ≤ ε

Slide 10: Epsilon Approximation (EA): Halving Method
 Deterministically halves the data to get the sample S_0
 Apply halving repeatedly (S_1 → S_2 → ... → S_t = S_0)
 Each halving step introduces a discrepancy ε_i, where m = total number of items in the database and n_i = size of sub-sample S_i
 Halving stops with the maximum t such that the accumulated discrepancy is still within the target ε
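The per-step discrepancy bound was a formula image on the original slide and is not recoverable exactly; the following is a hedged reconstruction with the standard discrepancy-theoretic shape (the constant c and the exact expression are assumptions):

```latex
% Halving chain: S_1 \supset S_2 \supset \dots \supset S_t = S_0,
% with n_i = n_{i-1}/2. Assumed per-step discrepancy bound:
\varepsilon_i \;=\; c\,\sqrt{\frac{\ln(2m)}{n_i}}
% By the triangle inequality, S_t is an \varepsilon-approximation of S_1 whenever
\sum_{i=1}^{t} \varepsilon_i \;\le\; \varepsilon
% and halving stops at the maximum t for which this inequality still holds.
```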

Slide 11: EA: How to Compute a Halving
Hyperbolic cosine method [Spencer]:
1. Color each transaction red (in sample) or blue (not in sample)
2. Maintain a penalty for each item:
   Penalty is small if red/blue are approximately balanced
   Penalty shoots up exponentially when red dominates (item is over-sampled) or blue dominates (item is under-sampled)
3. Color transactions sequentially, keeping the penalty low
Key property: the penalty does not increase on average, so one of the two colors never increases the global penalty.

Slide 12: EA: Penalty Computation
 Let Q_i = penalty for item A_i; initialize Q_i = 2
 Suppose we have colored the first j transactions, and let
   r_i = r_i(j) = number of red transactions containing A_i
   b_i = b_i(j) = number of blue transactions containing A_i
   δ_i = parameter that controls how fast the penalty changes as a function of |r_i - b_i|
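The penalty formula itself was an image on the original slide; the classical hyperbolic cosine penalty [Spencer], which matches all the stated properties (initial value 2, exponential growth in |r_i - b_i|, rate controlled by δ_i), is a plausible reconstruction:

```latex
Q_i(j) \;=\; (1+\delta_i)^{\,r_i - b_i} \;+\; (1+\delta_i)^{\,b_i - r_i}
% so Q_i(0) = 1 + 1 = 2, and Q_i grows exponentially
% as soon as one color dominates item A_i.
```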

Slide 13: EA: How to Color Transaction j+1
 Compute the global penalty assuming transaction j+1 is red, and the global penalty assuming it is blue
 Choose the color for which the global penalty is smaller
 EA is inherently an on-line method
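Putting slides 11 through 13 together, one halving step can be sketched as below. Hedged: the penalty form Q_i = (1+δ)^(r_i-b_i) + (1+δ)^(b_i-r_i) is an assumed reconstruction consistent with the stated properties, and all names are mine.

```python
def ea_halve(transactions, m, delta=0.1):
    """One deterministic halving step via the hyperbolic cosine method
    (sketch). Items are integers 0..m-1; transactions are sets of items."""
    diff = [0] * m              # r_i - b_i for each item i
    red = []                    # transactions colored red (kept in the sample)
    for t in transactions:
        def penalty(sign):
            # Global penalty if this transaction gets color `sign`
            # (+1 = red, -1 = blue); assumed form (1+d)^x + (1+d)^(-x).
            total = 0.0
            for i in range(m):
                x = diff[i] + (sign if i in t else 0)
                total += (1 + delta) ** x + (1 + delta) ** (-x)
            return total
        sign = 1 if penalty(1) <= penalty(-1) else -1
        for i in t:
            diff[i] += sign
        if sign == 1:
            red.append(t)
    return red

half = ea_halve([{0}, {0}, {1}, {1}], m=2)
```

Note that this sketch only balances per-item red/blue counts; a faithful EA implementation would additionally enforce that exactly half of the transactions end up red.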

Slide 14: Performance Evaluation
 Synthetic data set from the IBM QUEST project [AS94]:
   100,000 transactions; 1,000 items
   number of maximal potentially large itemsets = 2,000
   average transaction length: 10
   average length of maximal large itemsets: 4
   length of the maximal large itemsets: 6
 Minimum support: 0.77%
 Final sampling ratios: 0.76%, 1.51%, 3.0%, ..., dictated by EA halvings

Slide 15: Experimental Results
 At an 87% reduction in sample size, accuracy: EA 99%, FAST_trim_D2 97%, SRS 94.6%

Slide 16: Experimental Results (cont.)
 FAST_grow_D2 is best for very small sampling ratios (< 2%)
 EA is best overall in accuracy

Slide 17: Data Stream Reduction (DSR)
 Goal: a representative sample of the data stream
 Assign more weight to recent data while partially keeping track of old data
 The stream is divided into buckets 1 (oldest) through m_S (newest); to generate an N_S-element sample, bucket k is halved (m_S - k) times, so the buckets contribute N_S/2, N_S/4, N_S/8, ..., 1 transactions from newest to oldest
 Total number of transactions covered = m_S * N_S / 2
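The geometric weighting can be sketched numerically. One assumption beyond the slide: each bucket is taken to hold N_S/2 raw transactions, which is what makes the per-bucket contributions N_S/2, N_S/4, ..., 1 and the total coverage m_S * N_S / 2 consistent.

```python
def dsr_allocation(N_s, m_s):
    """Transactions contributed by each bucket to the N_s-element sample.
    Bucket k (1 = oldest, m_s = newest) is halved (m_s - k) times,
    assuming each bucket holds N_s/2 raw transactions."""
    return {k: (N_s // 2) // 2 ** (m_s - k) for k in range(1, m_s + 1)}

alloc = dsr_allocation(N_s=16, m_s=4)
# newest bucket contributes N_s/2 = 8, then 4, 2, 1 for older buckets
```

The contributions sum to N_S - N_S/2^m_S, just under the sample budget, while recent data dominates the sample.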

Slide 18: DSR: Practical Implementation
 To avoid frequent halving, we use one buffer and compute a new representative sample by applying EA when the buffer is full
 (Figure: a buffer of size N_S with sections halved 0, 1, and 2 times plus an empty section; once the buffer fills, the sections have been halved 1, 2, and 3 times.)

Slide 19: Continuous DSR
 Problem: two users querying immediately before and after a halving operation see data that varies substantially
 Continuous DSR: the buffer is divided into chunks of n_S transactions each; as each batch of n_S new transactions arrives, the oldest chunk is halved first
 (Figure: chunk boundaries of the buffer before and after the arrival of n_S new transactions.)

Slide 20: Conclusion
 Two-stage sampling approach based on trimming outliers or selecting representative transactions
 Epsilon approximation: a deterministic method for repeatedly halving the data to obtain the final sample
 Can be used in conjunction with other non-sampling count-based mining algorithms
 EA-based data stream reduction
 Ongoing work: evaluating the goodness of a representative subset; using frequency information in the discrepancy function

