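Improving the Performance of the FP-growth Data Mining Algorithm on Huge Databases. CHEN-HUNG Lin, 2010.05.04. Department of Computer Science and Information Engineering (Graduate Institute), National University of Kaohsiung, master's thesis. Graduate student: 黃正男.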


Contents
1. Introduction
2. Item Partition
3. Generation of Frequent Itemsets
4. Finding Cross-Group Frequent Itemsets
5. Conclusions

Introduction
- Apriori: may require repeated database scans and incur a high computational cost.
- Frequent-Pattern tree (FP-tree): the nodes generated from a huge database may not all fit in memory.

Introduction
[Figure: an FP-tree held in memory, with an item list (B, A, D); node labels shown: root, A: 5, B: 9, D: 3, D: 4.]

Introduction
Domain items: (A, B, C, D, E)

TID   Items
01    A, B, C, D, E
02    B, C, D, E
03    A, C, D, E
04    A, B, C, D

{A, B, C, D, E} is split into the independent groups {A, B, C} and {D, E}.
- Independent group: the itemsets that cross groups are infrequent.
- E.g.: ABD, BCE, ...

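Introduction
Domain items: (A, B, C, D, E)

TID   Items
01    A, B, C
02    C, D, E
03    A, C, D, E
04    A, B, C, D

[Figure: a graph over items A, B, C, D, E whose edges carry 2-itemset counts; edge labels shown: 2, 2, 2, 1, 2, 3, 1, 2.]
- min_support = 2: {A, B, C, D, E} stays one group, {A, B, C, D, E}.
- min_support = 3: {A, B, C, D, E} splits into {C, D}, {A}, {B}, {E}.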

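Introduction
[Overview diagram]
1. {A, B, C, D, E, F, G, H, I, J} is split into the independent groups {A, B, C, D, E, F, G, H} and {I, J}.
2. A group whose item number exceeds the threshold (3) is divided into dependent groups: {A, B, C}, {D, E, F}, {G, H}.
3. For each dependent group, an FP-tree is built and FP-Growth is run, giving FIT(A, B, C), FIT(D, E, F), FIT(G, H).
4. The FITs are merged to obtain all frequent itemsets.
Open questions:
- How to divide a big group?
- How to find all the missed frequent itemsets?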


Item-partition algorithm
Flowchart steps: Start, Generate & Counts, Frequent 2-itemsets, Initially set, Merged
min_support = 5

Generate & Counts:
2-itemset count   2-itemset count   2-itemset count   2-itemset count   2-itemset count
AB        5       BC        7       CE        3       DH        0       FH        0
AC        6       BD        5       CF        8       DJ        2       FJ        4
AD        5       BE        3       CG        2       DK        3       FK        5
AE        2       BF        6       CH        0       EF        3       GH        0
AF        6       BG        0       CJ        4       EG        0       GJ        1
AG        2       BH        0       CK        4       EH        0       GK        1
AH        0       BJ        3       DE        2       EJ        1       HJ        1
AJ        3       BK        4       DF        7       EK        1       HK        1
AK        4       CD        7       DG        2       FG        2       JK        5

Initially set / Merged:
Frequent 2-itemset   Partition
-                    {A},{B},{C},{D},{E},{F},{G},{H},{J},{K}
AB                   {A,B},{C},{D},{E},{F},{G},{H},{J},{K}

Item-partition algorithm
Flowchart steps: Start, Generate & Counts, Frequent 2-itemsets, Initially set, Merged, Check, Output & Exit, Refine-partition
min_support = 5

Frequent 2-itemset   Partition
-                    {A},{B},{C},{D},{E},{F},{G},{H},{J},{K}
AB                   {A,B},{C},{D},{E},{F},{G},{H},{J},{K}
AC                   {A,B,C},{D},{E},{F},{G},{H},{J},{K}
AD                   {A,B,C,D},{E},{F},{G},{H},{J},{K}
AF                   {A,B,C,D,F},{E},{G},{H},{J},{K}
BC                   {A,B,C,D,F},{E},{G},{H},{J},{K}
BD                   {A,B,C,D,F},{E},{G},{H},{J},{K}
BF                   {A,B,C,D,F},{E},{G},{H},{J},{K}
CD                   {A,B,C,D,F},{E},{G},{H},{J},{K}
CF                   {A,B,C,D,F},{E},{G},{H},{J},{K}
DF                   {A,B,C,D,F},{E},{G},{H},{J},{K}
FK                   {A,B,C,D,F,K},{E},{G},{H},{J}
JK                   {A,B,C,D,F,J,K},{E},{G},{H}

Result after merging: {A,B,C,D,F,J,K},{E},{G},{H}
Check: with β = 3, the group {A,B,C,D,F,J,K} is too large, so it goes to Refine-partition.
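The merging table above can be reproduced with a small union-find sketch. This is only an illustration, not the thesis implementation: the function name `partition_items` and the use of union-find are assumptions, while the counts and min_support = 5 are copied from the slides.

```python
# 2-itemset counts copied from the slide's table (items A-H, J, K).
COUNTS = {
    "AB": 5, "AC": 6, "AD": 5, "AE": 2, "AF": 6, "AG": 2, "AH": 0, "AJ": 3, "AK": 4,
    "BC": 7, "BD": 5, "BE": 3, "BF": 6, "BG": 0, "BH": 0, "BJ": 3, "BK": 4,
    "CD": 7, "CE": 3, "CF": 8, "CG": 2, "CH": 0, "CJ": 4, "CK": 4,
    "DE": 2, "DF": 7, "DG": 2, "DH": 0, "DJ": 2, "DK": 3,
    "EF": 3, "EG": 0, "EH": 0, "EJ": 1, "EK": 1,
    "FG": 2, "FH": 0, "FJ": 4, "FK": 5,
    "GH": 0, "GJ": 1, "GK": 1,
    "HJ": 1, "HK": 1,
    "JK": 5,
}


def partition_items(counts, min_support):
    """Merge the groups of two items whenever their 2-itemset is frequent.

    Hypothetical helper: the slides only show the merge table, not code.
    """
    items = sorted({item for pair in counts for item in pair})
    parent = {item: item for item in items}  # every item starts in its own group

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, count in counts.items():
        if count >= min_support:
            a, b = find(pair[0]), find(pair[1])
            if a != b:
                parent[max(a, b)] = min(a, b)  # merge the two groups

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


groups = partition_items(COUNTS, min_support=5)
print(groups)
```

Each resulting group is independent in the slides' sense: no 2-itemset that crosses two groups reaches min_support.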

Refine-partition
Flowchart steps: Start, set upper bound, set the score, set root node, Generate child nodes
upper bound = ∞

Group sizes: {A,B,C,D,F,J,K} has 7 items and β = 3. 7/3 = 2.333, so 3 groups are needed, each with 2 or 3 items.

Set the score, for the 2-itemsets of {A, B, C, D, F, J, K}:
2-itemset score   2-itemset score   2-itemset score
AB        0       BD        0       CK        1
AC        0       BF        0       DF        0
AD        0       BJ        1       DJ        1
AF        0       BK        1       DK        1
AJ        1       CD        0       FJ        1
AK        1       CF        0       FK        0
BC        0       CJ        1       JK        0

Set root node: {A,B,C,D,F,J,K}, LB = 0
Generate child nodes (LB not yet calculated):
- {A,B,C}{D,F,J,K}
- {A,B,D}{C,F,J,K}
- {A,J,K}{B,C,D,F}
- {A,B}{C,D,F,J,K}

Refine-partition
Flowchart steps: Start, set upper bound, set the score, set root node, Generate child nodes, Calculate the lower bound
upper bound = ∞

Calculate the lower bound of node {A,B,C}{D,F,J,K}:
- Decided part {A,B,C}: {A,B} = 0, {A,C} = 0, {B,C} = 0, so S_decide = 0 + 0 + 0 = 0.
- Undecided part {D,F,J,K}: {D,F} = 0, {D,J} = 1, {D,K} = 1, {F,J} = 1, {F,K} = 0, {J,K} = 0. Sorted by score: {D,F} = 0, {F,K} = 0, {J,K} = 0, {D,J} = 1, {D,K} = 1, {F,J} = 1, so S_undecide = 0 + 0 + 0 = 0.
- {A,B,C}{D,F,J,K}: LB = 0

2-itemset score   2-itemset score   2-itemset score
AB        0       BD        0       CK        1
AC        0       BF        0       DF        0
AD        0       BJ        1       DJ        1
AF        0       BK        1       DK        1
AJ        1       CD        0       FJ        1
AK        1       CF        0       FK        0
BC        0       CJ        1       JK        0

Refine-partition
Flowchart steps: Start, set upper bound, set the score, set root node, Generate child nodes, Calculate the lower bound, Stop node, choose, replace upper bound, End

Root node:
- {A,B,C,D,F,J,K}: LB = 0
First-level nodes:
- {A,B,C}{D,F,J,K}: LB = 0
- {A,B,D}{C,F,J,K}: LB = 0
- {A,J,K}{B,C,D,F}: LB = 2
- {A,B}{C,D,F,J,K}: LB = 5
Second-level nodes:
- {A,B,C}{D,F}{J,K}: LB = 0
- {A,B,C}{D,K}{F,J}: LB = 2
- {A,B}{C,D,K}{F,J}: LB = 3
upper bound = 0

The proposed item-partition
Flowchart steps: Start, Generate & Counts, Frequent 2-itemsets, Initially set, Merged, Check, Output & Exit, Refine-partition
- Before Refine-partition: {A,B,C,D,F,J,K},{E},{G},{H}
- After Refine-partition: {A,B,C},{D,F},{J,K},{E},{G},{H}
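The refinement of {A,B,C,D,F,J,K} can be checked with a short sketch. Two assumptions are made here: the cost of a complete partition is taken to be the sum of the scores of all 2-itemsets that fall inside the same group (this reproduces the values 0, 2 and 3 shown for the complete partitions on the slides), and a plain exhaustive search stands in for the bound-based search of the slides. The names `score`, `cost` and `partitions` are hypothetical.

```python
from itertools import combinations, permutations
from math import ceil

ITEMS = "ABCDFJK"
BETA = 3
# 2-itemsets with score 0 on the slide; every other pair scores 1.
ZERO = {"AB", "AC", "AD", "AF", "BC", "BD", "BF", "CD", "CF", "DF", "FK", "JK"}


def score(a, b):
    return 0 if "".join(sorted(a + b)) in ZERO else 1


def cost(partition):
    # Assumed cost: total score of the pairs placed in the same group.
    return sum(score(a, b) for group in partition for a, b in combinations(group, 2))


def partitions(items, sizes):
    """Yield each split of `items` into consecutive groups of the given sizes.

    The first remaining item always goes into the next group, so no
    partition is produced twice for one ordering of the sizes.
    """
    if not sizes:
        yield []
        return
    first, rest = items[0], items[1:]
    for others in combinations(rest, sizes[0] - 1):
        group = first + "".join(others)
        remaining = "".join(i for i in rest if i not in others)
        for tail in partitions(remaining, sizes[1:]):
            yield [group] + tail


n_groups = ceil(len(ITEMS) / BETA)            # 7 / 3 = 2.333, so 3 groups
base, extra = divmod(len(ITEMS), n_groups)    # each group has 2 or 3 items
sizes = [base + 1] * extra + [base] * (n_groups - extra)

candidates = [p for s in set(permutations(sizes)) for p in partitions(ITEMS, s)]
best = min(cost(p) for p in candidates)
print(n_groups, sizes, best)
print(cost(["ABC", "DF", "JK"]), cost(["ABC", "DK", "FJ"]), cost(["AB", "CDK", "FJ"]))
```

Exhaustive search is only practical for small groups like this one; a bound-based search prunes candidates whose lower bound already reaches the current upper bound.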


Generation of Frequent Itemsets
G = {A, B, D}
- STEP 1: Generate an initial MFPT with only the empty root node.
- STEP 2: Set the initial count of each item in the given group to 0.
[Figure: a tree that contains only the root node.]

Algorithm (cont.)
G = {A, B, D}
- STEP 3: Read a transaction from the given data set D and delete the items that do not appear in G.
- STEP 4: If an item in G appears in the transaction, increase its count by 1.
- STEP 5: Repeat STEPs 3 and 4 until all the transactions are processed.

Algorithm (cont.)
- STEP 6: Compare the item counts with min_support and remove the items that are not frequent.
- STEP 7: Sort the items in G according to their final counts.
- STEP 8: Sequentially read a transaction T from the given data set D.
Sorted order = (B, D, A)

Algorithm (cont.)
- STEP 9: Generate a tree path P from the transaction T, using only the frequent items arranged in the sorted order from STEP 7. Merge P into the MFPT in a way similar to the FPT.
- STEP 10: Increase the count of each node of P in the MFPT by 1, and add the transaction ID (TID) of T to the last node of P.
Sorted order = (B, D, A)
[Figure: the path root, B: 1, D: 1, A: 1 (TIDs = 01) inserted into the MFPT.]

Algorithm (cont.)
- STEP 11: Repeat STEPs 8 to 10 until all transactions in D are processed.
[Figure: the resulting MFPT]
root
  B: 9
    D: 7 (TIDs = 02, 04, 06, 10)
      A: 3 (TIDs = 01, 03, 08)
    A: 2 (TIDs = 05, 09)
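STEPs 1 to 11 can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: the `Node` class and `build_mfpt` function are hypothetical names, the transactions are reconstructed from the TIDs shown in the final tree, transaction 07 and its item C are invented to show the deletion in STEP 3, and min_support = 2 is an assumed value because the slides do not give one for this example.

```python
from collections import Counter

# Transactions reconstructed from the TIDs on the slide's tree.
# "07": "C" is hypothetical; it holds no item of G and leaves no trace in the tree.
TRANSACTIONS = {
    "01": "ABD", "02": "BD", "03": "ABD", "04": "BD", "05": "AB",
    "06": "BD", "07": "C", "08": "ABD", "09": "AB", "10": "BD",
}


class Node:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.tids = []       # TIDs of the transactions whose path ends at this node
        self.children = {}


def build_mfpt(transactions, group, min_support):
    # STEPs 2-5: count every item of the group, ignoring items outside it.
    counts = Counter(i for items in transactions.values() for i in set(items) if i in group)
    # STEPs 6-7: drop infrequent items and sort the rest by descending count.
    order = sorted((i for i in counts if counts[i] >= min_support),
                   key=lambda i: (-counts[i], i))
    root = Node()            # STEP 1: a tree with only the empty root
    # STEPs 8-11: insert one path per transaction.
    for tid, items in transactions.items():
        path = [i for i in order if i in items]
        if not path:
            continue
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1  # STEP 10: increase the count along the path
        node.tids.append(tid)  # STEP 10: record the TID at the last node
    return root, order


def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}: {child.count} TIDs = {child.tids}")
        show(child, depth + 1)


root, order = build_mfpt(TRANSACTIONS, group={"A", "B", "D"}, min_support=2)
print(order)
show(root)
```

Keeping the TIDs at the last node of each path is what later allows TID lists to be read off the tree without rescanning the data set.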

The Enumeration Tree
Sorted order = (B, D, A)
The enumerated order: (B, BD, BA, BDA, D, DA, A)

[Figure: the MFPT]
root
  B: 9
    D: 7 (TIDs = 02, 04, 06, 10)
      A: 3 (TIDs = 01, 03, 08)
    A: 2 (TIDs = 05, 09)

Itemsets and their TID lists:
- {B} (01, 02, 03, 04, 05, 06, 08, 09, 10)
- {BD} (01, 02, 03, 04, 06, 08, 10)
- {BA} (01, 03, 05, 08, 09)
- {BDA} (01, 03, 08)
- {D} (01, 02, 03, 04, 06, 08, 10)
- {DA} (01, 03, 08)
- {A} (01, 03, 05, 08, 09)
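A small sketch of how the TID lists follow from the tree: the TID list of an itemset is the union of the TIDs stored at every end node whose path contains the itemset. The representation of the tree as `END_PATHS` and the helper names are assumptions made for this illustration; the generator reproduces the enumerated order of this example, and whether the thesis orders larger groups the same way is not stated on the slides.

```python
from itertools import combinations

ORDER = ["B", "D", "A"]
# Each path of the MFPT that ends a transaction, with the TIDs kept at its last node.
END_PATHS = {
    ("B", "D"): ["02", "04", "06", "10"],
    ("B", "D", "A"): ["01", "03", "08"],
    ("B", "A"): ["05", "09"],
}


def tid_list(itemset):
    """Union of the TIDs of all end paths that contain every item of `itemset`."""
    return sorted(t for path, tids in END_PATHS.items()
                  if set(itemset) <= set(path) for t in tids)


def enumerate_itemsets(order):
    """For each leading item, list its supersets by size, as in the slide's order."""
    for k, item in enumerate(order):
        later = order[k + 1:]
        for size in range(len(later) + 1):
            for extra in combinations(later, size):
                yield (item,) + extra


FIT = {"".join(s): tid_list(s) for s in enumerate_itemsets(ORDER)}
for name, tids in FIT.items():
    print(name, tids)
```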

Dependent groups: {A,B,D}, {C,E,F}, {G,H,I}. For each group, an FP-tree is built and FP-Growth is run.
- FIT(A,B,D): {BDA}(01:3, 03:3, 08:3), {BA}(01:2, 03:2, 05:2, 08:2, 09:2), ...
- FIT(C,E,F): {CEF}(01:3, 02:3, 03:2, 04:1), {CF}(01:1, 03:1), ...
- FIT(G,H,I): {GHI}(01:3, 02:3, 03:2, 04:1), {GI}(01:1, 03:1), ...


Finding Cross-Group Frequent Itemsets

Group {A, B, C}:
- A(10, 20, 30, 50, 60, 80)
- AB(10, 20, 30, 50)
- ABC(10, 20, 30)
- CFI 1 = ABC(10:3, 20:3, 30:3, 50:2, 60:1, 80:1)
Group {D, E, F}:
- D(10, 20, 30, 35, 70, 80)
- DE(10, 20, 30, 35)
- DEF(10, 20, 30)
- CFI 1 = DEF(10:3, 20:3, 30:3, 35:2, 70:1, 80:1)
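The TID:k notation above appears to record, for each transaction, the length of the longest prefix of the itemset that the transaction contains (for example 50:2 because transaction 50 contains AB but not ABC). That reading is inferred from the numbers on the slide. Under that assumption, a sketch of the encoding and its inverse, with hypothetical function names `compress` and `expand`:

```python
def compress(chain):
    """Encode nested TID lists as {tid: longest prefix length}.

    `chain` lists the TIDs of each prefix from shortest to longest,
    e.g. the TIDs of A, then AB, then ABC.
    """
    depth = {}
    for level, tids in enumerate(chain, start=1):
        for tid in tids:
            depth[tid] = level   # a longer prefix overwrites a shorter one
    return depth


def expand(depth, level):
    """Recover the TID list of the prefix of the given length."""
    return sorted(tid for tid, d in depth.items() if d >= level)


cfi_abc = compress([
    [10, 20, 30, 50, 60, 80],   # A
    [10, 20, 30, 50],           # AB
    [10, 20, 30],               # ABC
])
cfi_def = compress([
    [10, 20, 30, 35, 70, 80],   # D
    [10, 20, 30, 35],           # DE
    [10, 20, 30],               # DEF
])
print(cfi_abc)
print(cfi_def)
print(expand(cfi_abc, 2))       # TIDs of AB
```

One compact entry therefore stands for the whole chain of prefixes, which saves storing a separate TID list for A, AB and ABC.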

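Finding Cross-Group Frequent Itemsets
Intersecting the TID list of A with the TID lists of D, DE and DEF:
- A(10, 20, 30, 50, 60, 80) and D(10, 20, 30, 35, 70, 80) give AD(10, 20, 30, 80)
- AD(10, 20, 30, 80) and DE(10, 20, 30, 35) give ADE(10, 20, 30)
- ADE(10, 20, 30) and DEF(10, 20, 30) give ADEF(10, 20, 30)
Compact form: ADEF(10:3, 20:3, 30:3, 80:1)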

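Cross-group itemsets that start with A:
- AD(10, 20, 30, 80), ADE(10, 20, 30), ADEF(10, 20, 30)
- Compact form: ADEF(10:3, 20:3, 30:3, 80:1)
Cross-group itemsets that start with AB:
- ABD(10, 20, 30), ABDE(10, 20, 30), ABDEF(10, 20, 30)
- Compact form: ABDEF(10:3, 20:3, 30:3)
Cross-group itemsets that start with ABC:
- ABCD(10, 20, 30), ABCDE(10, 20, 30), ABCDEF(10, 20, 30)
- Compact form: ABCDEF(10:3, 20:3, 30:3)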
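The cross-group TID lists above are intersections of one prefix from each group. A minimal sketch, assuming the TID:k entries mean "longest prefix length contained in the transaction"; the names `prefix_tids` and `cross` are hypothetical, and no min_support filter is applied because the slides do not state one for this example.

```python
CFI_ABC = {10: 3, 20: 3, 30: 3, 50: 2, 60: 1, 80: 1}
CFI_DEF = {10: 3, 20: 3, 30: 3, 35: 2, 70: 1, 80: 1}


def prefix_tids(cfi, level):
    """TIDs of the transactions that contain the prefix of the given length."""
    return {tid for tid, depth in cfi.items() if depth >= level}


def cross(names1, cfi1, names2, cfi2):
    """Intersect every prefix of group 1 with every prefix of group 2."""
    result = {}
    for i in range(1, len(names1) + 1):
        for j in range(1, len(names2) + 1):
            itemset = names1[:i] + names2[:j]
            result[itemset] = sorted(prefix_tids(cfi1, i) & prefix_tids(cfi2, j))
    return result


result = cross("ABC", CFI_ABC, "DEF", CFI_DEF)
for itemset, tids in result.items():
    print(itemset, tids)
```

In a full run, itemsets whose TID list is shorter than min_support would be discarded at this point.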

Finding Cross-Group Frequent Itemsets
Group {A, B, C}:
- {ABC} (01, 10)
- {AB} (01, 05, 10, 11)
- {A} (01, 05, 06, 10, 11)
Group {D, E, F}:
- {DEF} (01, 10)
- {DE} (01, 05, 10, 11)
- {D} (01, 05, 07, 10, 11)
Crossing the two groups (×) gives:
- {ADE} (01:2, 05:2, 10:2, 11:2)
- {ABDE} (01:2, 05:2, 10:2, 11:2)

Order in which the FITs are crossed:
1. FIT(A, B, C)
2. FIT(A, B, C) × FIT(D, E, F)
3. FIT(A, B, C) × FIT(D, E, F) × FIT(G, H, I)
4. FIT(A, B, C) × FIT(D, E, F) × FIT(G, H, I) × FIT(J, K)
5. FIT(A, B, C) × FIT(D, E, F) × FIT(J, K)
6. FIT(A, B, C) × FIT(G, H, I)
7. FIT(A, B, C) × FIT(G, H, I) × FIT(J, K)
8. FIT(A, B, C) × FIT(J, K)
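The order above is a depth-first walk over the combinations of groups that start with FIT(A, B, C). A short sketch that reproduces it; the function name `combos_from` is hypothetical, and the slide only lists the combinations that begin with the first group.

```python
def combos_from(first, others):
    """Depth-first list of the group combinations that start with `first`."""
    def extend(prefix, rest):
        yield prefix
        for k, group in enumerate(rest):
            # Add one later group, then go deeper before trying the next one.
            yield from extend(prefix + [group], rest[k + 1:])
    return list(extend([first], others))


GROUPS = ["FIT(A, B, C)", "FIT(D, E, F)", "FIT(G, H, I)", "FIT(J, K)"]
combos = combos_from(GROUPS[0], GROUPS[1:])
for combo in combos:
    print(" × ".join(combo))
```

Going depth-first means each longer combination can reuse the intermediate result of its prefix, for example FIT(A, B, C) × FIT(D, E, F) when adding FIT(G, H, I).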

Conclusions
- This work focuses on solving or easing the mining problems caused by memory limitations.
- The proposed approach can be divided into three phases:
  - Item Partition
  - Generation of Frequent Itemsets
  - Finding Cross-Group Frequent Itemsets

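Conclusions
- Advantages:
  - The work can be distributed across multiple computers.
  - Huge databases can also be mined under limited resources.
- Disadvantages:
  - The database cannot be shared; each computer must hold its own copy.
  - During the merge of the data, only a few computers can run; the merge cannot be distributed.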
