Published by Melvin Edmund McDaniel. Modified over 9 years ago.
1
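Improving the Performance of the FP-growth Data Mining Algorithm on Huge Databases
CHEN-HUNG Lin, 2010.05.04
National University of Kaohsiung, Department of Computer Science and Information Engineering (Graduate Institute), master's thesis
Graduate student: 黃正男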
2
Contents
1. Introduction
2. Item Partition
3. Generation of Frequent Itemsets
4. Finding Cross-Group Frequent Itemsets
5. Conclusions
3
Introduction
Apriori may require repeated database scans and incur a high computational cost.
A frequent-pattern tree (FP-tree) built from a huge database may have more nodes than memory can hold.
4
Introduction
[Figure: an FP-tree over the items B, A, D held in memory; root with nodes A: 5, B: 9, D: 3, D: 4.]
5
Introduction
Domain items: (A, B, C, D, E)
TID   Items
01    A, B, C, D, E
02    B, C, D, E
03    A, C, D, E
04    A, B, C, D
{A, B, C, D, E} is divided into the independent groups {A, B, C} and {D, E}.
Independent groups: the itemsets that cross groups are infrequent, e.g. ABD, BCE, ...
6
Introduction
Domain items: (A, B, C, D, E)
TID   Items
01    A, B, C
02    C, D, E
03    A, C, D, E
04    A, B, C, D
[Figure: item graphs over A, B, C, D, E with edges labeled by 2-itemset counts, one graph per min_support value.]
min_support = 2: {A, B, C, D, E} = {A, B, C, D, E}
min_support = 3: {A, B, C, D, E} = {C, D}, {A}, {B}, {E}
7
Introduction
{A, B, C, D, E, F, G, H, I, J} is divided into the independent groups {A, B, C, D, E, F, G, H} and {I, J}.
A group whose item number exceeds the threshold (3) is divided into dependent groups: {A, B, C}, {D, E, F}, {G, H}.
An FP-tree is built and FP-Growth is run for each dependent group, giving FIT(A, B, C), FIT(D, E, F), FIT(G, H).
The FITs are merged to obtain all frequent itemsets.
How to divide a big group?
How to find all the missed frequent itemsets?
8
[Figure: overview of the approach. {A, B, C, D, E, F, G, H, I, J} → independent groups {A, B, C, D, E, F, G, H} and {I, J}; item number > threshold 3 → dependent groups {A, B, C}, {D, E, F}, {G, H}; FP-tree and FP-Growth per group → FIT(A, B, C), FIT(D, E, F), FIT(G, H); merge → all frequent itemsets.]
9
Item-partition algorithm
Flowchart steps shown: Start; Generate & Counts; Frequent 2-itemsets; Initially set; Merged
min_support = 5
Counts of 2-itemsets:
2-itemset count   2-itemset count   2-itemset count   2-itemset count   2-itemset count
AB        5       BC        7       CE        3       DH        0       FH        0
AC        6       BD        5       CF        8       DJ        2       FJ        4
AD        5       BE        3       CG        2       DK        3       FK        5
AE        2       BF        6       CH        0       EF        3       GH        0
AF        6       BG        0       CJ        4       EG        0       GJ        1
AG        2       BH        0       CK        4       EH        0       GK        1
AH        0       BJ        3       DE        2       EJ        1       HJ        1
AJ        3       BK        4       DF        7       EK        1       HK        1
AK        4       CD        7       DG        2       FG        2       JK        5
Merging by frequent 2-itemsets:
Frequent 2-itemset   Partition
-                    {A},{B},{C},{D},{E},{F},{G},{H},{J},{K}
AB                   {A,B},{C},{D},{E},{F},{G},{H},{J},{K}
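The "Generate & Counts" step can be sketched as a single pass that counts every pair of items co-occurring in a transaction. This is a minimal sketch, not the thesis implementation: the function name `count_2_itemsets` is made up for illustration, and the four transactions are the small example from slide 5 rather than the data behind the table above.

```python
from collections import Counter
from itertools import combinations

# Transactions from the slide-5 example (TID -> items); used here only as toy input.
transactions = {
    "01": {"A", "B", "C", "D", "E"},
    "02": {"B", "C", "D", "E"},
    "03": {"A", "C", "D", "E"},
    "04": {"A", "B", "C", "D"},
}

def count_2_itemsets(transactions):
    """Count, for every pair of items, the number of transactions containing both."""
    counts = Counter()
    for items in transactions.values():
        # sorted() gives each pair a canonical name such as "AB" (never "BA").
        for pair in combinations(sorted(items), 2):
            counts["".join(pair)] += 1
    return counts

pair_counts = count_2_itemsets(transactions)
for name in sorted(pair_counts):
    print(name, pair_counts[name])
```

Comparing each count with min_support then yields the frequent 2-itemsets used by the merge step.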
10
Item-partition algorithm
Flowchart steps: Start; Generate & Counts; Frequent 2-itemsets; Initially set; Merged; Check; Output & Exit; Refine-partition
min_support = 5
Frequent 2-itemset   Partition
-                    {A},{B},{C},{D},{E},{F},{G},{H},{J},{K}
AB                   {A, B},{C},{D},{E},{F},{G},{H},{J},{K}
AC                   {A, B, C},{D},{E},{F},{G},{H},{J},{K}
AD                   {A, B, C, D},{E},{F},{G},{H},{J},{K}
AF                   {A, B, C, D, F},{E},{G},{H},{J},{K}
BC                   {A, B, C, D, F},{E},{G},{H},{J},{K}
BD                   {A, B, C, D, F},{E},{G},{H},{J},{K}
BF                   {A, B, C, D, F},{E},{G},{H},{J},{K}
CD                   {A, B, C, D, F},{E},{G},{H},{J},{K}
CF                   {A, B, C, D, F},{E},{G},{H},{J},{K}
DF                   {A, B, C, D, F},{E},{G},{H},{J},{K}
FK                   {A, B, C, D, F, K},{E},{G},{H},{J}
JK                   {A, B, C, D, F, J, K},{E},{G},{H}
Result of merging: {A, B, C, D, F, J, K},{E},{G},{H}
Check with β = 3: a group with more than β items goes to Refine-partition; otherwise Output & Exit.
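The merge trace above amounts to finding connected components of the graph whose edges are the frequent 2-itemsets. The following is a minimal sketch using a union-find structure, which is an assumption about how one might implement it and not necessarily what the thesis does; `partition_items` is a made-up name, and the counts are the slide-9 table.

```python
# 2-itemset counts from slide 9, written as "<pair><count>" tokens.
RAW_COUNTS = """
AB5 BC7 CE3 DH0 FH0  AC6 BD5 CF8 DJ2 FJ4  AD5 BE3 CG2 DK3 FK5
AE2 BF6 CH0 EF3 GH0  AF6 BG0 CJ4 EG0 GJ1  AG2 BH0 CK4 EH0 GK1
AH0 BJ3 DE2 EJ1 HJ1  AJ3 BK4 DF7 EK1 HK1  AK4 CD7 DG2 FG2 JK5
"""
counts = {tok[:2]: int(tok[2:]) for tok in RAW_COUNTS.split()}

def partition_items(items, frequent_pairs):
    """Merge items linked by a frequent 2-itemset into the same group (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in frequent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    # Largest group first, then alphabetical, to match the layout on the slide.
    return sorted(groups.values(), key=lambda g: (-len(g), g))

min_support = 5
frequent = sorted(pair for pair, c in counts.items() if c >= min_support)
groups = partition_items("ABCDEFGHJK", frequent)
print(frequent)
print(groups)
```

With min_support = 5 this yields the same twelve frequent 2-itemsets and the same final partition as the trace.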
11
Flowchart steps: Start; set upper bound; set the score; set root node; Generate child nodes
upper bound = ∞
Group to refine: {A, B, C, D, F, J, K}
Scores of 2-itemsets (consistent with the counts on slide 9: 0 for a frequent 2-itemset, 1 otherwise):
2-itemset score   2-itemset score   2-itemset score
AB        0       BD        0       CK        1
AC        0       BF        0       DF        0
AD        0       BJ        1       DJ        1
AF        0       BK        1       DK        1
AJ        1       CD        0       FJ        1
AK        1       CF        0       FK        0
BC        0       CJ        1       JK        0
Root node: {A, B, C, D, F, J, K}, LB = 0
Child nodes (LB not yet calculated): {A,B,C}{D,F,J,K}; {A,B,D}{C,F,J,K}; {A,J,K}{B,C,D,F}; {A,B}{C,D,F,J,K}
{A, B, C, D, F, J, K} has 7 items. With β = 3, 7/3 = 2.333, so 3 groups are needed, with 2 or 3 items in each group.
12
Flowchart steps: Start; set upper bound; set the score; set root node; Generate child nodes; Calculate the lower bound
upper bound = ∞
Node: {A,B,C}{D,F,J,K}, LB = ?
Decided part {A,B,C}: {A,B} = 0, {A,C} = 0, {B,C} = 0; S_decide = 0 + 0 + 0 = 0
Undecided part {D,F,J,K}: {D,F} = 0, {D,J} = 1, {D,K} = 1, {F,J} = 1, {F,K} = 0, {J,K} = 0
Sorted by score: {D,F} = 0, {F,K} = 0, {J,K} = 0, {D,J} = 1, {D,K} = 1, {F,J} = 1; S_undecide = 0 + 0 + 0 = 0
Scores of 2-itemsets:
2-itemset score   2-itemset score   2-itemset score
AB        0       BD        0       CK        1
AC        0       BF        0       DF        0
AD        0       BJ        1       DJ        1
AF        0       BK        1       DK        1
AJ        1       CD        0       FJ        1
AK        1       CF        0       FK        0
BC        0       CJ        1       JK        0
Result: {A,B,C}{D,F,J,K}, LB = 0
13
Flowchart steps: Start; set upper bound; set the score; set root node; Generate child nodes; Calculate the lower bound; Stop node; choose; replace upper bound; End
upper bound = ∞
Root node: {A,B,C,D,F,J,K}, LB = 0
Nodes with two groups:
{A,B,C}{D,F,J,K}, LB = 0
{A,B,D}{C,F,J,K}, LB = 0
{A,J,K}{B,C,D,F}, LB = 2
{A,B}{C,D,F,J,K}, LB = 5
upper bound = 0
Nodes with three groups:
{A,B,C}{D,F}{J,K}, LB = 0
{A,B,C}{D,K}{F,J}, LB = 2
{A,B}{C,D,K}{F,J}, LB = 3
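The values attached to the three-group nodes can be reproduced by summing the scores of all 2-itemsets that fall inside the same group. The sketch below only evaluates complete partitions. It does not reconstruct the bounding rule for partially decided nodes, which the slides do not spell out. The names `pair_score`, `partition_score` and `group_count` are illustrative assumptions.

```python
from itertools import combinations
from math import ceil

# Frequent 2-itemsets at min_support = 5 (score 0); every other pair scores 1.
FREQUENT_PAIRS = {"AB", "AC", "AD", "AF", "BC", "BD", "BF", "CD", "CF", "DF", "FK", "JK"}

def pair_score(a, b):
    """0 if the 2-itemset is frequent, 1 otherwise (as in the score table)."""
    return 0 if "".join(sorted((a, b))) in FREQUENT_PAIRS else 1

def partition_score(partition):
    """Sum of the scores of all 2-itemsets lying inside one group."""
    return sum(pair_score(a, b)
               for group in partition
               for a, b in combinations(group, 2))

def group_count(n_items, beta):
    """Number of groups needed when no group may exceed beta items."""
    return ceil(n_items / beta)

# The three complete partitions shown on slide 13.
leaves = [["ABC", "DF", "JK"], ["ABC", "DK", "FJ"], ["AB", "CDK", "FJ"]]
scores = [partition_score(p) for p in leaves]
best = min(leaves, key=partition_score)
print(group_count(7, 3), scores, best)
```

Under these assumptions the partition {A,B,C}{D,F}{J,K} scores 0, so it keeps only frequent 2-itemsets together, which matches the final upper bound of 0.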
14
The proposed item-partition
Flowchart steps: Start; Generate & Counts; Frequent 2-itemsets; Initially set; Merged; Check; Output & Exit; Refine-partition
Before Refine-partition: {A, B, C, D, F, J, K},{E},{G},{H}
After Refine-partition: {A,B,C},{D,F},{J,K},{E},{G},{H}
15
[Figure: overview of the approach. {A, B, C, D, E, F, G, H, I, J} → independent groups {A, B, C, D, E, F, G, H} and {I, J}; item number > threshold 3 → dependent groups {A, B, C}, {D, E, F}, {G, H}; FP-tree and FP-Growth per group → FIT(A, B, C), FIT(D, E, F), FIT(G, H); merge → all frequent itemsets.]
16
Generation of Frequent Itemsets
STEP 1: Generate an initial MFPT with only the empty root node.
STEP 2: Set the initial count of each item in the given group G to 0.
Example: G = {A, B, D}; the MFPT consists of the root only.
17
Algorithm (cont.)
STEP 3: Read a transaction from the given data set D and delete the items that do not appear in G.
STEP 4: If an item in G appears in the transaction, increase its count by 1.
STEP 5: Repeat STEPs 3 and 4 until all the transactions are processed.
G = {A, B, D}
18
Algorithm (cont.)
STEP 6: Compare the item counts with min_support and remove the items that are not frequent.
STEP 7: Sort the items in G in descending order of their final counts.
STEP 8: Sequentially read a transaction T from the given data set D.
Sorted order = (B, D, A)
19
Algorithm (cont.)
STEP 9: Generate a tree path P from the transaction T, using only the frequent items in the sorted order from STEP 7. Merge P into the MFPT in a way similar to the FPT.
STEP 10: Increase the count of each node of P in the MFPT by 1, and add the transaction ID (TID) of T to the last node of P.
Sorted order = (B, D, A)
[Figure: a transaction containing A, D, B is reordered to B, D, A and inserted as the path root → B: 1 → D: 1 → A: 1 (TIDs = 01).]
20
Algorithm (cont.)
STEP 11: Repeat STEPs 8 to 10 until all transactions in D are processed.
Resulting MFPT:
root
  B: 9
    D: 7 (TIDs = 02, 04, 06, 10)
      A: 3 (TIDs = 01, 03, 08)
    A: 2 (TIDs = 05, 09)
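STEPs 1 to 11 can be sketched as two passes over the data: one to count and sort the group's items, one to insert each filtered transaction as a path. This is a minimal sketch under stated assumptions: the class `Node`, the function `build_mfpt`, min_support = 3, and the ten transactions (chosen so that the result matches the tree in the example) are all hypothetical and do not come from the slides.

```python
from collections import Counter

class Node:
    """One MFPT node: an item, its count, child nodes, and the TIDs ending here."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.tids = []
        self.children = {}

def build_mfpt(transactions, group, min_support):
    # STEPs 2-5: count the items of the group, ignoring all other items.
    counts = Counter()
    for items in transactions.values():
        counts.update(items & group)
    # STEPs 6-7: keep the frequent items, sorted by descending count.
    order = sorted((i for i in group if counts[i] >= min_support),
                   key=lambda i: (-counts[i], i))
    # STEPs 1 and 8-11: insert each transaction as a path from the root.
    root = Node(None)
    for tid, items in transactions.items():
        path = [i for i in order if i in items]
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
        if path:
            node.tids.append(tid)  # the TID is stored only at the last node of the path
    return root, order

# Hypothetical data set chosen to reproduce the example tree.
transactions = {
    "01": {"A", "B", "D"}, "02": {"B", "D"}, "03": {"A", "B", "D", "C"},
    "04": {"B", "D"}, "05": {"A", "B"}, "06": {"B", "D", "E"},
    "07": {"C"}, "08": {"A", "B", "D"}, "09": {"A", "B"}, "10": {"B", "D"},
}
root, order = build_mfpt(transactions, {"A", "B", "D"}, min_support=3)
b = root.children["B"]
print(order, b.count, b.children["D"].tids, b.children["D"].children["A"].tids)
```

Storing TIDs only at the last node of each path keeps every TID in exactly one place, which is what the example tree shows.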
21
The Enumeration Tree
Sorted order = (B, D, A)
The enumerated order: (B, BD, BA, BDA, D, DA, A)
MFPT:
root
  B: 9
    D: 7 (TIDs = 02, 04, 06, 10)
      A: 3 (TIDs = 01, 03, 08)
    A: 2 (TIDs = 05, 09)
TID lists:
{B}   (01, 02, 03, 04, 05, 06, 08, 09, 10)
{BD}  (01, 02, 03, 04, 06, 08, 10)
{BA}  (01, 03, 05, 08, 09)
{BDA} (01, 03, 08)
{D}   (01, 02, 03, 04, 06, 08, 10)
{DA}  (01, 03, 08)
{A}   (01, 03, 05, 08, 09)
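The enumerated order and the TID lists can be reproduced with a short sketch. Note the swapped-in technique: instead of walking the MFPT, it intersects per-item TID sets, which gives the same lists for this example but is not claimed to be the thesis procedure. The names `enumerate_itemsets` and `tid_list` are illustrative.

```python
from itertools import combinations

# Per-item TID sets read off the example ({B}, {D}, {A} lists).
ITEM_TIDS = {
    "B": {"01", "02", "03", "04", "05", "06", "08", "09", "10"},
    "D": {"01", "02", "03", "04", "06", "08", "10"},
    "A": {"01", "03", "05", "08", "09"},
}

def enumerate_itemsets(order):
    """Yield itemsets grouped by first item, shorter itemsets before longer ones."""
    for i, first in enumerate(order):
        rest = order[i + 1:]
        for size in range(len(rest) + 1):
            for combo in combinations(rest, size):
                yield (first,) + combo

def tid_list(itemset):
    """TIDs of the transactions containing every item of the itemset."""
    return sorted(set.intersection(*(ITEM_TIDS[item] for item in itemset)))

enumerated = ["".join(s) for s in enumerate_itemsets(["B", "D", "A"])]
lists = {"".join(s): tid_list(s) for s in enumerate_itemsets(["B", "D", "A"])}
print(enumerated)
for name in enumerated:
    print(name, lists[name])
```

Because {A} does not contain TID 02, any itemset that includes A, such as {DA}, cannot contain it either.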
22
Dependent groups: {A,B,D}, {C,E,F}, {G,H,I}. An FP-tree is built and FP-Growth is run for each group.
FIT(A,B,D): {BDA}(01:3, 03:3, 08:3); {BA}(01:2, 03:2, 05:2, 08:2, 09:2); …
FIT(C,E,F): {CEF}(01:3, 02:3, 03:2, 04:1); {CF}(01:1, 03:1); …
FIT(G,H,I): {GHI}(01:3, 02:3, 03:2, 04:1); {GI}(01:1, 03:1); …
23
[Figure: overview of the approach, with the merge step. {A, B, C, D, E, F, G, H, I, J} → independent groups {A, B, C, D, E, F, G, H} and {I, J}; item number > threshold 3 → dependent groups {A, B, C}, {D, E, F}, {G, H}; FP-tree and FP-Growth per group → FIT(A, B, C), FIT(D, E, F), FIT(G, H); merge → all frequent itemsets.]
24
Finding Cross-Group Frequent Itemsets
26
A(10, 20, 30, 50, 60, 80)
AB(10, 20, 30, 50)
ABC(10, 20, 30)
CFI 1 = ABC(10:3, 20:3, 30:3, 50:2, 60:1, 80:1)
D(10, 20, 30, 35, 70, 80)
DE(10, 20, 30, 35)
DEF(10, 20, 30)
CFI 1 = DEF(10:3, 20:3, 30:3, 35:2, 70:1, 80:1)
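The compact form above appears to record, for each TID, the length of the longest prefix of the itemset that the transaction still supports. The sketch below encodes and decodes a prefix chain under that reading. The function names `encode_chain` and `decode_prefix` are made up for illustration, and the interpretation of the number after the colon is an inference from the two examples.

```python
def encode_chain(chain):
    """chain: list of (itemset, tids), from the shortest prefix to the longest.

    Returns the longest itemset and a {tid: depth} map, where depth is the
    position in the chain of the longest prefix containing that TID.
    """
    depth = {}
    for level, (_, tids) in enumerate(chain, start=1):
        for tid in tids:
            depth[tid] = level  # longer prefixes overwrite shorter ones
    return chain[-1][0], dict(sorted(depth.items()))

def decode_prefix(depth, level):
    """Recover the TID list of the prefix at the given chain position."""
    return sorted(tid for tid, d in depth.items() if d >= level)

abc_chain = [("A", [10, 20, 30, 50, 60, 80]),
             ("AB", [10, 20, 30, 50]),
             ("ABC", [10, 20, 30])]
name, compact = encode_chain(abc_chain)
print(name, compact)
print(decode_prefix(compact, 2))
```

One entry of this form replaces the three separate TID lists, and each original list can still be recovered by filtering on the depth.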
27
Finding Cross-Group Frequent Itemsets
Inputs: A(10, 20, 30, 50, 60, 80); D(10, 20, 30, 35, 70, 80); DE(10, 20, 30, 35); DEF(10, 20, 30)
Results: AD(10, 20, 30, 80); ADE(10, 20, 30); ADEF(10, 20, 30)
Compact form: ADEF(10:3, 20:3, 30:3, 80:1)
28
AD(10, 20, 30, 80); ADE(10, 20, 30); ADEF(10, 20, 30); compact form: ADEF(10:3, 20:3, 30:3, 80:1)
ABD(10, 20, 30); ABDE(10, 20, 30); ABDEF(10, 20, 30); compact form: ABDEF(10:3, 20:3, 30:3)
ABCD(10, 20, 30); ABCDE(10, 20, 30); ABCDEF(10, 20, 30); compact form: ABCDEF(10:3, 20:3, 30:3)
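The cross-group TID lists in this example are what one obtains by intersecting a TID list from one group with a TID list from the other. The following minimal sketch shows that idea. The function `cross_join` and the min_support value of 3 are assumptions made for illustration; the slides do not state the threshold used here.

```python
def cross_join(fit_left, fit_right, min_support):
    """Join every itemset of one group with every itemset of another group.

    Each FIT is a dict {itemset: set of TIDs}. A joined itemset is kept only
    if the intersection of the two TID sets reaches min_support.
    """
    joined = {}
    for left, left_tids in fit_left.items():
        for right, right_tids in fit_right.items():
            tids = left_tids & right_tids
            if len(tids) >= min_support:
                joined[left + right] = tids
    return joined

fit_abc = {"A": {10, 20, 30, 50, 60, 80}, "AB": {10, 20, 30, 50}, "ABC": {10, 20, 30}}
fit_def = {"D": {10, 20, 30, 35, 70, 80}, "DE": {10, 20, 30, 35}, "DEF": {10, 20, 30}}

cross = cross_join(fit_abc, fit_def, min_support=3)  # threshold is hypothetical
for itemset, tids in cross.items():
    print(itemset, sorted(tids))
```

Only TID lists are touched, so no further database scan is needed to find the cross-group itemsets. Raising the hypothetical threshold to 4 would leave only AD.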
29
Finding Cross-Group Frequent Itemsets
{ABC}(01, 10); {AB}(01, 05, 10, 11); {A}(01, 05, 06, 10, 11)
×
{DEF}(01, 10); {DE}(01, 05, 10, 11); {D}(01, 05, 07, 10, 11)
Result: {ADE}(01:2, 05:2, 10:2, 11:2)
30
{ABC}(01, 10); {AB}(01, 05, 10, 11); {A}(01, 05, 06, 10, 11)
×
{DEF}(01, 10); {DE}(01, 05, 10, 11); {D}(01, 05, 07, 10, 11)
Result: {ABDE}(01:2, 05:2, 10:2, 11:2)
31
Finding Cross-Group Frequent Itemsets
{ABC}(01, 10); {AB}(01, 05, 10, 11); {A}(01, 05, 06, 10, 11)
×
{DEF}(01, 10); {DE}(01, 05, 10, 11); {D}(01, 05, 07, 10, 11)
32
{ADE}(01:2, 05:2, 10:2, 11:2) {ABDE}(01:2, 05:2, 10:2, 11:2)
34
FIT(A, B, C)
FIT(A, B, C) × FIT(D, E, F)
FIT(A, B, C) × FIT(D, E, F) × FIT(G, H, I)
FIT(A, B, C) × FIT(D, E, F) × FIT(G, H, I) × FIT(J, K)
FIT(A, B, C) × FIT(D, E, F) × FIT(J, K)
FIT(A, B, C) × FIT(G, H, I)
FIT(A, B, C) × FIT(G, H, I) × FIT(J, K)
FIT(A, B, C) × FIT(J, K)
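The order in which the FIT combinations are listed matches a depth-first walk that extends the current combination with each later group in turn. The sketch below only reproduces that enumeration order. The function name `group_combinations` is made up, and the actual joining and any pruning of infrequent combinations are left out.

```python
def group_combinations(groups):
    """Yield the combinations that start with the first group, depth first."""
    def extend(prefix, start):
        yield prefix
        for i in range(start, len(groups)):
            # Extend with a later group, then explore that branch fully.
            yield from extend(prefix + [groups[i]], i + 1)
    yield from extend([groups[0]], 1)

groups = ["FIT(A, B, C)", "FIT(D, E, F)", "FIT(G, H, I)", "FIT(J, K)"]
combos = list(group_combinations(groups))
for combo in combos:
    print(" × ".join(combo))
```

A depth-first order like this would let an intermediate join result be reused as the starting point for every longer combination below it.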
35
Conclusions
This work focuses on solving or easing the mining problems caused by memory limitations.
The proposed approach is divided into three phases:
Item Partition
Generation of Frequent Itemsets
Finding Cross-Group Frequent Itemsets
36
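Conclusions
Advantages:
The work can be distributed across multiple computers.
Huge databases can be mined even with limited resources.
Disadvantages:
The database cannot be shared; each computer needs its own copy.
During the data merge only a few computers can run, so the merge cannot be distributed.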