Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Advisor: Dr. 廖述賢. Presenter: 朱佩慧, first-year PhD student, Graduate Institute of Management Sciences.
AGENDA: Introduction / Review & Overview / Design and Construction (FP-tree) / Mining Frequent Patterns (FP-growth) / DB Projection / Experiments / Discussion
Impact Factor: 2.42 (2007). Subject category "Computer Science, Artificial Intelligence": rank 11 of 93. Subject category "Computer Science, Information Systems": rank 7 of 92.
Frequent Pattern Mining
Problem: given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (itemsets) with support no less than ξ.

Input (minimum support ξ = 3):
TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

Output: all frequent patterns, namely f:4, c:4, a:3, b:3, m:3, p:3, cp:3, fm:3, cm:3, am:3, ca:3, fa:3, fc:3, cam:3, fam:3, fca:3, fcm:3, fcam:3

Problem: how to find all frequent patterns efficiently?
Content distribution (total: 35 pages): Abstract: 15 lines. Introduction: 2 pages (6%). FP-tree (39%): construction 5 pages (14%), use 8.5 pages (24%). FP-growth: 3.5 pages (10%). Experiments: 8 pages (23%). Discussion: 3 pages (9%). Conclusion: 1 page (3%).
[Chart: page distribution of the 35-page paper; the FP-tree sections account for 38.5%.]
Introduction. Review: Apriori-like methods. Overview: the FP-tree-based mining method and additional progress. Reference: Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, pp. 1-12.
Apriori Algorithm Review
The core of the Apriori algorithm: use the frequent (k-1)-itemsets (L_{k-1}) to generate candidate frequent k-itemsets C_k, then scan the database and count each pattern in C_k to get the frequent k-itemsets L_k.

E.g., for the example DB with ξ = 3:
C1: f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n
L1: f, c, a, b, m, p
C2: fa, fc, fm, fp, ac, am, ..., bp
L2: fa, fc, fm, ...

How many candidates? With n frequent 1-itemsets there are up to 2^n candidate itemsets: n = 10 gives about 1.02x10^3, n = 20 gives about 1.05x10^6, and n = 100 gives about 1.27x10^30.

Reference: Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items in large databases.
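The candidate-generate-and-test loop above can be sketched in a few lines of Python (a naive illustration of the idea, not the optimized Apriori with subset pruning; function and variable names are ours):

```python
def apriori(transactions, min_support):
    """Sketch of Apriori: generate candidate k-itemsets from the frequent
    (k-1)-itemsets, then scan the database to count each candidate."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets from a first database scan
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    Lk = {frozenset([i]) for i, c in counts.items() if c >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Candidate generation: join L(k-1) with itself
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Database scan: count each candidate by a subset test
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
      set("bcksp"), set("afcelpmn")]
result = apriori(db, 3)
```

On the slide's example DB with ξ = 3 this returns the 18 frequent patterns; the repeated subset-test scans over every candidate are exactly the cost FP-growth avoids.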
Ideas (Overview)
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent-pattern mining, avoiding costly repeated database scans.
Develop an efficient FP-tree-based frequent-pattern mining method (FP-growth): a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (sub-database tests only).
FP-tree: Design and Construction
Definition 1 (FP-tree). An FP-tree consists of (1) a root labeled "null", (2) a set of item-prefix subtrees as children of the root, and (3) a frequent-item header table. Each node in an item-prefix subtree has three fields: item-name, count, and node-link. Each entry in the frequent-item header table has two fields: item-name and node-link, pointing to the first node in the tree carrying that item.
Construct the FP-tree: 1. First scan: collect the set of frequent items. 2. Second scan: construct the FP-tree.
First Scan: collect the set of frequent items. Find the frequent items (L1), build the frequent-item header table, and sort each transaction's contents in decreasing frequency order (dropping infrequent items).

Frequent-item header table:
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3
Building the FP-tree from transactions sorted in decreasing frequency order yields a more compact tree.
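The first scan can be sketched as follows (a minimal illustration; names are ours). Note that f and c tie at count 4: we break ties alphabetically, while the slides put f first; any fixed order among ties is valid.

```python
from collections import Counter

transactions = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
                set("bcksp"), set("afcelpmn")]
min_support = 3

# First scan: count item frequencies and keep only the frequent items,
# sorted by decreasing frequency (ties broken alphabetically).
freq = Counter(item for t in transactions for item in t)
order = sorted((i for i in freq if freq[i] >= min_support),
               key=lambda i: (-freq[i], i))

def project(t):
    """A transaction's frequent items, in decreasing-frequency order."""
    return [i for i in order if i in t]

print(project(set("facdgimp")))
```

Projecting TID = 100 keeps only its six frequent items, ordered; this is exactly the path that will be inserted into the tree during the second scan.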
Second Scan: construct the FP-tree. (a) Create the root node, labeled "null". (b) For each transaction, order its frequent items according to the order in L; e.g., TID = 100: {f, a, c, d, g, i, m, p} becomes {f, c, a, m, p}. (c) Construct the FP-tree by inserting each frequency-ordered transaction into it. After TID = 100 the tree is the single path null, f:1, c:1, a:1, m:1, p:1.
Inserting TID = 200 ({a, b, c, f, l, m, o} becomes {f, c, a, b, m}): the transaction shares the prefix f, c, a with the existing path, so those counts rise to f:2, c:2, a:2, and a new branch b:1, m:1 is created under a:2.
Inserting TID = 300 ({b, f, h, j, o} becomes {f, b}): f's count rises to f:3, and a new child b:1 is created under f:3.
Inserting TID = 400 ({b, c, k, s, p} becomes {c, b, p}): no prefix is shared with existing paths, so a new path c:1, b:1, p:1 is created directly under the root.
Inserting TID = 500 ({a, f, c, e, l, p, m, n} becomes {f, c, a, m, p}): the whole transaction follows an existing path, so the counts along it become f:4, c:3, a:3, m:2, p:2. The FP-tree is now complete.
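The two-scan construction can be sketched in Python (a minimal illustration, not the paper's implementation; class and function names are ours). As before, the f/c frequency tie is broken alphabetically here, so the root's children differ from the figures, but the resulting tree is equally valid.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(transactions, min_support):
    # First scan: frequent items in decreasing-frequency order.
    freq = Counter(i for t in transactions for i in t)
    order = sorted((i for i in freq if freq[i] >= min_support),
                   key=lambda i: (-freq[i], i))
    root = Node(None, None)
    header = {i: [] for i in order}  # header table: item -> node-link chain
    # Second scan: insert each transaction's frequent items, in order,
    # sharing prefixes with existing paths where possible.
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            if item in node.children:
                node.children[item].count += 1
            else:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)  # extend the node-link chain
            node = node.children[item]
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
      set("bcksp"), set("afcelpmn")]
root, header = build_fptree(db, 3)
```

Summing counts along an item's node-link chain recovers its support, which is how the header table is used during mining.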
Efficiency Analysis. The FP-tree is much smaller than the size of the DB; each pattern base is smaller than the original FP-tree; each conditional FP-tree is smaller than its pattern base. The mining process therefore works on a set of usually much smaller pattern bases and conditional FP-trees: divide-and-conquer with a dramatic scale of shrinking.
Lemma 2.1. Given a transaction database DB and a support threshold ξ, the complete set of frequent-item projections of transactions in DB can be derived from DB's FP-tree; the FP-tree contains the complete information of DB.

Example: DB = {BACD, BAC, BAC, BAD, BA, B, B, B} gives the tree null, B:8, A:5, with A's children C:3 (which has child D:1) and D:1. For node a_k = A: its count is C_A = 5 and its children's counts sum to C'_A = 3 + 1 = 4, so exactly 5 - 4 = 1 transaction's frequent-item projection ends at A (namely BA), and support(B...A) = 5.
Lemma 2.2. Given a transaction database DB and a support threshold ξ, the number of nodes in the FP-tree is bounded by Σ_{T ∈ DB} |freq(T)|, and the height of the tree is bounded by max_{T ∈ DB} {|freq(T)|}. In the example, the 8 transactions {BACD, BAC, BAC, BAD, BA, B, B, B} need only 5 nodes in the FP-tree, and the height of the tree equals the length of the longest frequent-item projection, BACD (not counting the null root).
3. Mining Frequent Patterns Using the FP-tree
FP-Growth: start the processing from the end of list L. Step 1: construct the conditional pattern base. Step 2: construct the conditional FP-tree. Step 3: recursively mine the conditional FP-trees and grow the frequent patterns obtained so far.
Conditional FP-tree (suffix p). Following p's node-links yields the conditional pattern base {(fcam:2), (cb:1)}. The accumulated counts are f:2, c:3, a:2, b:1, m:2; only c reaches ξ = 3, so the conditional FP-tree is {(c:3)}|p and the frequent pattern generated is (cp:3).
Conditional FP-tree (suffix m). m's conditional pattern base is {(fca:2), (fcab:1)}; the counts are f:3, c:3, a:3, b:1, so the conditional FP-tree is the single path {(f:3, c:3, a:3)}|m. Mining it recursively yields {(f:3, c:3)}|am, {(f:3)}|cam, and {(f:3)}|cm, producing the frequent patterns (m:3), (am:3), (cm:3), (fm:3), (cam:3), (fam:3), (fcm:3), (fcam:3).
Conditional FP-tree (suffix b). b's conditional pattern base is {(fca:1), (f:1), (c:1)}; no item reaches ξ = 3, so the conditional FP-tree is empty ({}|b) and no longer pattern containing b is frequent.
Conditional FP-tree (suffix a). a's conditional pattern base is {(fc:3)}; the conditional FP-tree is {(f:3, c:3)}|a, and mining it recursively ({(f:3)}|ca) yields the frequent patterns (ca:3), (fca:3), (fa:3).
Conditional FP-tree (suffix c). c's conditional pattern base is {(f:3)}; the conditional FP-tree is {(f:3)}|c, yielding the frequent pattern (fc:3).
Conditional FP-tree (suffix f). f is at the top of list L, so its conditional pattern base is empty and the recursion stops; f itself contributes (f:4).
Frequent Patterns Found

Input (minimum support ξ = 3):
TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

Output: all frequent patterns, namely f:4, c:4, a:3, b:3, m:3, p:3, cp:3, fm:3, cm:3, am:3, ca:3, fa:3, fc:3, cam:3, fam:3, fca:3, fcm:3, fcam:3
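The whole FP-growth procedure above can be sketched end-to-end (a compact illustration, not the paper's implementation; names are ours, and frequency ties are broken alphabetically, so sibling order may differ from the figures):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build(pattern_base, min_support):
    """Build an FP-tree from a conditional pattern base: a dict mapping
    each prefix-path tuple to its count (the DB itself for the top level)."""
    freq = Counter()
    for path, n in pattern_base.items():
        for i in path:
            freq[i] += n
    order = sorted((i for i in freq if freq[i] >= min_support),
                   key=lambda i: (-freq[i], i))
    rank = {i: r for r, i in enumerate(order)}
    root, header = Node(None, None), defaultdict(list)
    for path, n in pattern_base.items():
        node = root
        for item in sorted((i for i in path if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += n
            node = child
    return header, order

def fp_growth(pattern_base, min_support, suffix=()):
    """For each item from the end of list L: emit a pattern, collect its
    conditional pattern base by walking node-links upward, and recurse."""
    header, order = build(pattern_base, min_support)
    out = []
    for item in reversed(order):
        support = sum(n.count for n in header[item])
        new_suffix = (item,) + suffix
        out.append((frozenset(new_suffix), support))
        cond = defaultdict(int)
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:  # collect the prefix path
                path.append(p.item)
                p = p.parent
            if path:
                cond[tuple(path)] += node.count
        out += fp_growth(dict(cond), min_support, new_suffix)
    return out

db = {tuple(t): 1 for t in
      ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]}
patterns = dict(fp_growth(db, 3))
```

On the example DB with ξ = 3 this produces the same 18 frequent patterns as the slides, without ever generating a candidate set.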
Single prefix-path FP-tree: when the FP-tree has a single prefix path P, P and the remaining multipath part Q can be mined separately and their results combined.
freq pattern set(P) = {(a:10), (b:8), (c:7), (ab:8), (ac:7), (bc:7), (abc:7)}
freq pattern set(Q) = {(d:4), (e:3), (f:3), (df:3)}
freq pattern set(P) x freq pattern set(Q) = {(ad:4), (ae:3), (af:3), (adf:3), (bd:4), (be:3), ..., (abcd:4), (abce:3), (abcf:3), (abcdf:3)}
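The cross-product combination can be verified with a short script (illustration only; variable names are ours; the counts are those on the slide):

```python
from itertools import combinations

# Single prefix path a:10 -> b:8 -> c:7 (counts from the slide).
prefix = [("a", 10), ("b", 8), ("c", 7)]

# freq pattern set(P): every non-empty combination of prefix items; its
# support is the count of its deepest (least frequent) item, since counts
# are non-increasing along the path.
pattern_set_P = {}
for r in range(1, len(prefix) + 1):
    for combo in combinations(prefix, r):
        items = frozenset(i for i, _ in combo)
        pattern_set_P[items] = min(c for _, c in combo)

# The multipath part yields freq pattern set(Q) as on the slide:
pattern_set_Q = {frozenset("d"): 4, frozenset("e"): 3,
                 frozenset("f"): 3, frozenset("df"): 3}

# Cross product: union of itemsets, support = min of the two supports.
cross = {p | q: min(sp, sq)
         for p, sp in pattern_set_P.items()
         for q, sq in pattern_set_Q.items()}
```

The 7 x 4 = 28 combined patterns include (abcd:4) and (abcdf:3), matching the slide's result without re-mining the joint tree.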
4. Database Projection: from main-memory-based FP mining to disk-based FP mining
Parallel Projection: in a single scan, each transaction is projected into the projected DB of every frequent item it contains (frequent items here: f:4, c:4, a:3, b:3, m:3, p:3).
Partition Projection: each transaction is projected only into the projected DB of its least frequent item; when an item's projected DB is mined, its transactions are projected onward to the next item's projected DB, so each transaction is written only once per scan.
5. Performance & Compactness
[Charts: experimental results on FP-tree compactness and FP-growth run time.]
6. Discussions: improving the performance and scalability of FP-growth
Performance Improvement
–Projected DBs: partition the DB into a set of projected DBs, then construct an FP-tree and mine it in each projected DB.
–Disk-resident FP-tree: store the FP-tree on hard disk using a B+-tree structure to reduce I/O cost.
–FP-tree materialization: an FP-tree constructed once with a low ξ can usually satisfy most of the mining queries.
–FP-tree incremental update: how to update an FP-tree when there are new data? ❖ Re-construct the FP-tree, ❖ or do not update the FP-tree.
Thank you for listening; your comments are welcome.
Conditional FP-trees for a second example. Header table: B:8, A:7, C:7, D:5, E:3.

Suffix  Conditional pattern base          Conditional FP-tree       Frequent patterns
B       {}                                {}|B                      {}
A       {B:5}                             {B:5}|A                   {BA:5}
C       {BA:3, B:3, A:1}                  {B:6, A:4}|C              {BC:6, AC:4}
D       {BAC:1, BA:1, BC:1, AC:1, A:1}    {B:3, A:4, C:3}|D         {BD:3, AD:4, CD:3}
E       {BC:1, ACD:1, AD:1}               {B:1, A:2, C:2, D:2}|E    {AE:2, CE:2, DE:2}
6. Discussion
Materialization and maintenance of the FP-tree: main-memory and disk-resident FP-trees; selecting a good minimum support threshold ξ; incremental update of an FP-tree.
Extensions of the frequent-pattern growth method in data mining: FP-tree mining with constraints.