1 Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules. S.D. Lee, David W. Cheung, Ben Kao, The University of Hong Kong. Data Mining and Knowledge Discovery, 1998. Presented by Shan Huang, April 5th, 2007, and Matthew Tretin, April 5, 2009.

2 What is the main idea?  Old algorithms: Apriori, FUP 2  New algorithm: DELI (design, pseudo code)  Experiments (comparisons) showing DELI is better

3 Definitions(1)  D = transaction set (transactions T 1, T 2, T 3, …)  I = full item set

4 Definitions(2)  For an itemset X, σ X = support count of X, i.e. the number of transactions in D that contain X. Example: σ X = 4 out of |D| = 5 transactions, so support = 4/5 = 80%.  Support threshold: s%
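As a quick illustration of these definitions (the slide's example itemset is an image in the original deck, so the data below is hypothetical):

```python
# Support of an itemset X in a transaction set D.
# Hypothetical example data chosen to reproduce the slide's numbers.
D = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"b", "c"},
    {"a", "b", "e"},
]
X = {"a", "b"}

# sigma_X: number of transactions containing every item of X
sigma_X = sum(1 for t in D if X <= t)
support = sigma_X / len(D)

print(sigma_X, support)  # 4 transactions out of 5 -> 80%
```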

5 Definitions(3)  k-itemset: an itemset containing k items.  Large itemset: an itemset whose support is at least the support threshold.

6 Old Algorithm 1: Apriori

7 C k = candidate set; L k = large set; apriori_gen() generates C k+1 from L k. (Example with s% = 40%.)

8 Pseudo code of Apriori
get C 1 ; k = 1;
until (C k is empty || L k is empty) do {
  Get L k from C k using the minimum support count;
  Use apriori_gen() to generate C k+1 from L k ;
  k++;
};
return union(all L k );
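The pseudo code above can be sketched in Python (a minimal, unoptimized illustration; apriori_gen() is inlined as the join-and-prune step, and min_sup_count plays the role of the minimum support count):

```python
from itertools import combinations

def apriori(D, min_sup_count):
    """Minimal Apriori sketch following the slide's pseudo code.
    D is a list of transaction sets; returns the union of all L_k."""
    items = {i for t in D for i in t}
    Ck = [frozenset([i]) for i in items]          # C1
    large = []
    while Ck:
        # L_k: candidates meeting the minimum support count
        Lk = [c for c in Ck
              if sum(1 for t in D if c <= t) >= min_sup_count]
        if not Lk:
            break
        large.extend(Lk)
        # apriori_gen(): join L_k with itself, keep (k+1)-sets
        # all of whose k-subsets are large (the pruning step)
        k = len(Lk[0])
        Lk_set = set(Lk)
        Ck = list({a | b for a in Lk for b in Lk
                   if len(a | b) == k + 1
                   and all(frozenset(s) in Lk_set
                           for s in combinations(a | b, k))})
    return large

# s% = 40% on 5 transactions -> minimum support count 2
result = apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"},
                  {"b", "c"}, {"a", "b", "e"}], 2)
print(result)
```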

9 Use Apriori in Maintenance  Simply apply the algorithm to the updated database again: Not efficient; Fails to reuse the results of previous mining; Very costly.

10 Old Algorithm 2: FUP 2  FUP 2 works similarly to Apriori, generating large itemsets iteratively;  For old large itemsets, it scans only the updated part of the database;  For the rest, it scans the whole database.

11 Notation
Δ - : set of deleted transactions
Δ + : set of added transactions
D: old database
D': updated database
D*: set of unchanged transactions
σ X : support count of itemset X (in D)
σ' X : new support count of itemset X (in D')
δ X - : support count of itemset X in Δ -
δ X + : support count of itemset X in Δ +

12 Pseudo code of FUP 2
get C 1 ; k = 1;
until (C k is empty || L k ' is empty) do {
  divide C k into two partitions: P k = C k ∩ L k and Q k = C k - P k ;
  For X in P k , calculate σ' X = σ X - δ X - + δ X + and get part 1 of L k ' ;
  For X in Q k , eliminate candidates with δ X + - δ X - < (|Δ + | - |Δ - |)s% ;
  For the remaining candidates X in Q k , scan D* to get part 2 of L k ' ;
  Use apriori_gen() to generate C k+1 from L k ' ;
  k++;
};
return union(all L k ' );
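One FUP 2 iteration can be sketched as follows (a simplified illustration with hypothetical data; the real algorithm batches its scans of Δ-, Δ+ and D* rather than counting per candidate):

```python
def fup2_step(Ck, Lk_old, old_counts, D_star, delta_minus, delta_plus, s):
    """One iteration of FUP 2's candidate handling (a sketch).
    old_counts[X] = sigma_X in the old database D; s is the support
    threshold as a fraction. Returns the new large k-itemsets L_k'."""
    count = lambda X, T: sum(1 for t in T if X <= t)
    new_size = len(D_star) + len(delta_plus)          # |D'|
    min_count = s * new_size
    Lk_new = []
    for X in Ck:
        d_minus = count(X, delta_minus)
        d_plus = count(X, delta_plus)
        if X in Lk_old:
            # P_k: old support is known, so only the deltas are scanned
            sigma_new = old_counts[X] - d_minus + d_plus
        else:
            # Q_k: prune candidates that cannot have become large,
            # then scan the unchanged part D* for the rest
            if d_plus - d_minus < s * (len(delta_plus) - len(delta_minus)):
                continue
            sigma_new = count(X, D_star) + d_plus
        if sigma_new >= min_count:
            Lk_new.append(X)
    return Lk_new
```

For example, an itemset that was not large in D can survive the pruning test and become large in D' once the added transactions are counted, while an old large itemset can drop out without D* ever being rescanned for it.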

13 An Example of FUP 2

14 DELI Algorithm  Difference Estimation for Large Itemsets  Key idea: when the update is small, examine only a sample of the database instead of scanning all of it.

15 Basic pseudo code of DELI
get C 1 ; k = 1;
until (C k is empty || L k ' is empty) do {
  divide C k into two partitions: P k = C k ∩ L k and Q k = C k - P k ;
  For X in P k , calculate σ' X = σ X - δ X - + δ X + and get part 1 of L k ' ;
  For X in Q k , eliminate candidates with δ X + - δ X - < (|Δ + | - |Δ - |)s% ;
  For the remaining candidates X in Q k , scan a sample subset of D* to get part 2 of L k ' ;
  Use apriori_gen() to generate C k+1 from L k ' ;
  k++;
};
return union(all L k ' );

16 Binomial Distribution  Assume 5% of the population is green-eyed.  You pick 500 people randomly with replacement.  The total number of green-eyed people you pick is a random variable X which follows a binomial distribution with n = 500 and p = 0.05.
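This example can be checked with a few lines of Python (standard library only):

```python
import math

n, p = 500, 0.05          # 500 people, 5% green-eyed

mean = n * p              # expected number of green-eyed people
var = n * p * (1 - p)     # binomial variance

def binom_pmf(k, n, p):
    """P(X = k) for a binomial(n, p) random variable."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

print(mean, var)                           # expected count 25.0, variance 23.75
print(round(binom_pmf(25, 500, 0.05), 4))  # probability of picking exactly 25
```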

17 Binomial Distribution [figure source: http://en.wikipedia.org/wiki/Image:Binomial_distribution_pmf.png]

18 Sampling(1)  Consider an arbitrary itemset X;  Randomly select m transactions from D with replacement;  T X = the number of sampled transactions that contain X;  T X is binomially distributed with n = m and p = σ X / |D|: Mean = np = (m / |D|) σ X ; Variance = np(1-p).

19 Sampling(2)  For large m, T X is approximately normally distributed with Mean = (m / |D|) σ X and Variance = mp(1 - p);  Define the estimator σ X ^ = (|D| / m) T X ;  σ X ^ is then approximately normally distributed with Mean = σ X and Variance = σ X (|D| - σ X )/m.
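A small simulation of this estimator (hypothetical data; random.choices draws with replacement, matching the sampling scheme on the previous slide):

```python
import random

random.seed(0)

# Hypothetical database: 1000 transactions, 300 of which contain X,
# so the true support count sigma_X is 300.
D = [{"a", "b"}] * 300 + [{"c"}] * 700
X = {"a", "b"}

m = 2000                                   # sample size
sample = random.choices(D, k=m)            # draw with replacement
T_X = sum(1 for t in sample if X <= t)     # occurrences of X in the sample

sigma_hat = len(D) / m * T_X               # estimated support count of X
print(sigma_hat)                           # close to the true value 300
```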

20 Confidence Interval [figure: normal curve centered at σ X with interval [a x , b x ] and tail area α/2 on each side]

21 Sampling(3)  We can obtain a 100(1-α)% confidence interval [a x , b x ] for σ X , where a x = σ X ^ - z α/2 sqrt(σ X ^(|D| - σ X ^)/m) and b x = σ X ^ + z α/2 sqrt(σ X ^(|D| - σ X ^)/m);  For α = 0.1, z α/2 = 1.645;  For α = 0.05, z α/2 = 1.960.

22 Sampling(4)  The width of this interval is b x - a x = 2 z α/2 sqrt(σ X ^(|D| - σ X ^)/m);  Since the candidates considered have support below s% (and s% < 50%), the widths of all confidence intervals are no more than 2 z α/2 |D| sqrt(s%(1-s%)/m);  Suppose we want the widths not to exceed (1/5) s% |D|.

23 Sampling(5)  If s = 2 and α = 0.05, then z α/2 = 1.96;  Solving the above inequality gives m ≥ 18823.84;  This value is independent of the size of the database D!
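The arithmetic can be reproduced as follows. The exact width requirement is an image in the original deck, so this sketch assumes the inequality 2 z α/2 sqrt(s(1-s)/m) ≤ s/5, which is the form that yields the slide's figure:

```python
def min_sample_size(s, z):
    """Smallest m satisfying the width requirement (a sketch: assumes the
    elided inequality is 2*z*sqrt(s*(1-s)/m) <= s/5, chosen because it
    reproduces the slide's number).  s = support threshold as a fraction,
    z = the normal critical value z_{alpha/2}."""
    # Rearranging: sqrt(m) >= 10*z*sqrt((1-s)/s)  =>  m >= 100*z^2*(1-s)/s
    return 100 * z**2 * (1 - s) / s

m = min_sample_size(0.02, 1.96)   # s = 2%, alpha = 0.05
print(m)                          # 18823.84 -- independent of |D|
```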

24 Sampling(6) [figure: possible positions of the estimate σ X ^ relative to the interval [a x , b x ] around σ X ]

25 Obtain the estimated set L k '
 L k » : large in D and D' ;
 L k > : not large in D, large in D' with a certain confidence;
 L k ≈ : not large in D, maybe large in D' ;
 L k ' : approximation of the new L k : L k ' = L k » ∪ L k > ∪ L k ≈

26 Criteria to perform a full update
 Degree of uncertainty: uncertainty factor u k = |L k ≈ | / |L k ' | ; u k - is a user-specified threshold; if u k ≥ u k - , DELI halts and FUP 2 is needed.
 Amount of changes (symmetric difference): η k = |L k - L k ' | (itemsets no longer large), ξ k = |L k (>) | + |L k (≈) | (newly large itemsets); d k - is a user-specified threshold; if d k ≥ d k - , DELI halts and FUP 2 is needed.
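A sketch of the halting check (hypothetical itemset labels; the exact formula combining η k and ξ k into d k is an image on the slide, so only the uncertainty factor u k is implemented here):

```python
def needs_full_update(Lk_old, Lk_new, Lk_approx, u_max):
    """DELI's uncertainty criterion (a sketch).  Lk_approx is the
    'maybe large' part of Lk_new; u_max is the user threshold u_k^-.
    Returns (halt?, u_k, eta_k)."""
    # Uncertainty factor: fraction of the estimate that is uncertain
    u_k = len(Lk_approx) / len(Lk_new) if Lk_new else 0.0
    # eta_k: itemsets that stopped being large (one side of the
    # symmetric difference; the combination into d_k is elided here)
    eta_k = len(Lk_old - Lk_new)
    return u_k >= u_max, u_k, eta_k

halt, u_k, eta_k = needs_full_update(
    {"AB", "BC", "CD"},        # old large itemsets (hypothetical labels)
    {"AB", "BC", "DE"},        # estimated new large itemsets
    {"DE"},                    # the uncertain ('maybe large') part
    u_max=0.05)
print(halt, u_k, eta_k)        # u_k = 1/3 >= 0.05, so a full FUP 2 run is needed
```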

27 Pseudo code of DELI
get C 1 ; k = 1;
until (C k is empty || L k ' is empty) do {
  divide C k into two partitions: P k = C k ∩ L k and Q k = C k - P k ;
  For X in P k , calculate σ' X = σ X - δ X - + δ X + and get part 1 of L k ' ;
  For X in Q k , eliminate candidates with δ X + - δ X - < (|Δ + | - |Δ - |)s% ;
  For the remaining candidates X in Q k , scan a sample subset of D* to get part 2 of L k ' ;
  Use apriori_gen() to generate C k+1 from L k ' ;
  If any criterion is met, then terminate and go to FUP 2 ;
  k++;
};
return union(all L k ' );

28 An Improvement  Store the support counts of all 1-itemsets;  Extra storage: O(|I|).

29 Experiment Preparation  Synthetic databases: generate D, Δ +, Δ - ;  1%-18% of the large itemsets are changed by the updates;  u k - = ∞ and d k - = ∞ (so DELI never halts early and its accuracy can be measured over complete runs).

30 Experimental Results (1): α = 0.05, z α/2 = 1.960, |Δ + | = |Δ - | = 5000, |D| = 100000, s% = 2%

31 Experimental Results (2): α = 0.05, z α/2 = 1.960, |Δ + | = |Δ - | = 5000, |D| = 100000, s% = 2%

32 Experimental Results (3): m = 20000, |Δ + | = |Δ - | = 5000, |D| = 100000, s% = 2%

33 Experimental Results (4): m = 20000, |Δ + | = |Δ - | = 5000, |D| = 100000, s% = 2%

34 Experimental Results (5): α = 0.05, z α/2 = 1.960, m = 20000, |D| = 100000, s% = 2%

35 Experimental Results (6): α = 0.05, z α/2 = 1.960, m = 20000, |D| = 100000, s% = 2%

36 Experimental Results (7): α = 0.05, z α/2 = 1.960, |Δ - | = 5000, m = 20000, |D| = 100000, s% = 2%

37 Experimental Results (8): α = 0.05, z α/2 = 1.960, |Δ - | = 5000, m = 20000, |D| = 100000, s% = 2%

38 Experimental Results (9): α = 0.05, z α/2 = 1.960, |Δ + | = 5000, m = 20000, |D| = 100000, s% = 2%

39 Experimental Results (10): α = 0.05, z α/2 = 1.960, |Δ + | = 5000, m = 20000, |D| = 100000, s% = 2%

40 Experimental Results (11): α = 0.05, z α/2 = 1.960, |Δ + | = |Δ - | = 5000, m = 20000, |D| = 100000

41 Experimental Results (12): α = 0.05, z α/2 = 1.960, |Δ + | = |Δ - | = 5000, m = 20000, |D| = 100000

42 Experimental Results (13): α = 0.05, z α/2 = 1.960, |Δ + | = |Δ - | = 5000, m = 20000, s% = 2%

43 Experimental Summary  The observed uncertainty factor stayed below 0.036, very low;  when |Δ - | < 10000, the observed change measure stayed below 0.1;  when |Δ - | = 20000, it stayed below 0.21;  Suggested thresholds: u - = 0.05, d - = 0.1.

44 Consecutive Runs  Keep every δ X + and δ X - ?

45 Conclusions  Real-world databases are updated constantly, so the knowledge extracted from them changes too.  The authors proposed the DELI algorithm to determine whether the change is significant, i.e. when the extracted association rules need to be updated.  The algorithm applies sampling techniques and statistical methods to efficiently estimate an approximation of the new large itemsets.

46 Criticism  Does this kind of sampling really save time?  Did the authors repeat the experiments a sufficient number of times?  Do the experiments have practical meaning?  Are there any theoretical mistakes?  Are there better measurements for the experiments?

47 Final Exam Questions  Q1: Compare and contrast FUP 2 and DELI: Both algorithms are used in association analysis; Goal: DELI decides when to update the association rules, while FUP 2 provides an efficient way of updating them; Technique: DELI scans a small portion (a sample) of the database and approximates the large itemsets, whereas FUP 2 scans the whole database and returns the large itemsets exactly; DELI saves machine resources and time.

48 Final Exam Questions  Q2: DELI = Difference Estimation for Large Itemsets.  Q3: Difference between Apriori and FUP 2 : Apriori scans the whole database to find association rules and does not reuse old mining results; FUP 2 scans only the updated part of the database for most itemsets and takes advantage of the old association analysis results.

49 Thank you! Now it is discussion time!

