Fast Mining Frequent Patterns with Secondary Memory Kawuu W. Lin, Sheng-Hao Chung, Sheng-Shiung Huang and Chun-Cheng Lin Department of Computer Science and Information Engineering National Kaohsiung University of Applied Sciences Kaohsiung, Taiwan
Outline: Introduction, Related Work, Proposed Method, Experimental Results, Conclusions
Introduction: Association Rule. [Figure labels: Data, Data Mining, Association Rule, Frequent Pattern, Hidden Information]
Introduction (cont.) Data mining: association rule algorithms for frequent patterns include Apriori and FP-growth.
Introduction (cont.) But huge
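Introduction (cont.) FP-tree example with at most 5 nodes allowed in memory: the tree cannot be completed, so mining fails. The FP-tree must then be deleted and mining restarted with another algorithm, which wastes a lot of time and discards the information already collected. [Figure: transaction table with columns TID and Sorted Items, and a partial FP-tree rooted at R; cell values not recoverable]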
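The failure case on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the transactions are made up (the slide's table values were lost), and `max_nodes` and `NodeBudgetExceeded` are hypothetical names standing in for a real memory limit.

```python
class FPNode:
    """One FP-tree node: an item label, a count, and child links."""
    def __init__(self, label, parent=None):
        self.label = label
        self.count = 0
        self.parent = parent
        self.children = {}


class NodeBudgetExceeded(MemoryError):
    """Raised when the tree would need more nodes than memory allows."""


def build_fp_tree(transactions, max_nodes):
    """Insert sorted transactions into an FP-tree, failing past max_nodes."""
    root = FPNode("R")
    n_nodes = 0
    for items in transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                if n_nodes >= max_nodes:
                    # A purely in-memory tree has nowhere to put this node.
                    raise NodeBudgetExceeded(f"needs more than {max_nodes} nodes")
                child = FPNode(item, node)
                node.children[item] = child
                n_nodes += 1
            child.count += 1
            node = child
    return root, n_nodes


# Hypothetical sorted transactions (not the values from the slide).
transactions = [[2, 3, 5], [2, 3, 5, 1], [2, 5], [3, 1]]
```

With these transactions the full tree needs 7 nodes, so `build_fp_tree(transactions, max_nodes=5)` raises and all work done so far is lost, which is the situation the slide describes.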
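Introduction (cont.) Our goal: once memory use reaches the maximum (95% of memory), continue building the FP-tree by placing further nodes on disk instead of failing. [Figure: transaction table with columns TID and Sorted Items, FP-tree rooted at R, and a disk; cell values not recoverable]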
Related Work
- FP-growth
- Database Projection algorithm (DP): based on the FP-growth framework; when confronted with insufficient memory, it applies database reduction and attempts FP-growth again.
- CARM algorithm (FD-Mine): to reduce the amount of data transmitted, FD-Mine uses a matrix to retain the necessary FP-tree node information (Label, Count, and Parent).
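To make the (Label, Count, Parent) matrix idea concrete, here is a small Python sketch that stores an FP-tree as rows of a table rather than as linked objects. The exact layout used by FD-Mine is not given on the slide, so the row format, the function names, and the sample transactions are all assumptions for illustration.

```python
def build_node_matrix(transactions):
    """Build an FP-tree as rows of [label, count, parent_row].

    Row 0 is the root "R" with parent -1 (an assumed convention).
    """
    rows = [["R", 0, -1]]
    child_index = {}  # (parent_row, label) -> row of that child
    for items in transactions:
        parent = 0
        for item in items:
            key = (parent, item)
            idx = child_index.get(key)
            if idx is None:
                idx = len(rows)
                rows.append([item, 0, parent])
                child_index[key] = idx
            rows[idx][1] += 1
            parent = idx
    return rows


def prefix_path(rows, idx):
    """Labels on the path from the root down to (excluding) row idx."""
    path = []
    idx = rows[idx][2]
    while idx > 0:  # stop at the root row
        path.append(rows[idx][0])
        idx = rows[idx][2]
    return path[::-1]


# Hypothetical sorted transactions.
rows = build_node_matrix([[2, 3, 5], [2, 3, 5, 1], [2, 5], [3, 1]])
```

Because every node only points to its parent by row number, the matrix can be shipped or stored as plain numbers, and the prefix paths that FP-growth needs are still recoverable by following the Parent column.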
Related Work (cont.) Database Projection (DP): based on FP-growth; uses “Build & Reduce” and “Repeat Testing”. [Figure labels: Original Database, FP-Tree, Fail, Database Projection, a Database, b Database, c Database, d Database]
Related Work (cont.) CARM (FD-Mine). [Figure labels: Network, Trusted Node, Original Database, Zip, Sub FP-Tree]
Proposed Method
There are five functions in the H-Mine algorithm:
- Memory warning mechanism
- Reserved node mapping disk mechanism
- Disk information structure quick search and tree-building
- Storing FP-tree nodes in the disk information structure
- Linking the Header Table in the disk information structure
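The memory warning mechanism can be illustrated with a short arithmetic sketch in Python. It assumes the warning fires when memory use reaches the reserved percentage of the limit (the experiments in this deck use a 500 MB limit and 95%); the function names and the fixed per-node size are hypothetical.

```python
MB = 1024 * 1024


def spill_threshold(limit_bytes, reserve_percent=95):
    """Bytes of memory that may be used before nodes go to disk."""
    # Integer arithmetic avoids floating-point rounding of the threshold.
    return limit_bytes * reserve_percent // 100


def memory_warning(used_bytes, limit_bytes, reserve_percent=95):
    """True once memory use has reached the reserved share of the limit."""
    return used_bytes >= spill_threshold(limit_bytes, reserve_percent)


def place_nodes(n_nodes, node_bytes, limit_bytes, reserve_percent=95):
    """Split n_nodes into (kept in memory, mapped to disk).

    node_bytes is an assumed fixed size per FP-tree node.
    """
    capacity = spill_threshold(limit_bytes, reserve_percent) // node_bytes
    in_memory = min(n_nodes, capacity)
    return in_memory, n_nodes - in_memory
```

For a 500 MB limit, `spill_threshold(500 * MB)` is 475 MB, so under these assumptions the first 475 MB worth of nodes stay in memory and every later node is written to the disk information structure.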
Example. [Figure: transaction table (TID, Sorted Items, 11 items in total) and FP-tree rooted at R, with memory at 95% use and further nodes placed on disk. nodeToSeek.data columns: Index, Count (Max: 98), Disk address (Childnode.data). Childnode.data columns: Index, Label, ChildNode. Cell values not recoverable]
Example (cont.) [Figure: FP-tree rooted at R built from the transaction table (TID, Sorted Items), and its on-disk form TreeNodeInDisk.data with columns index, label, count, parent. Cell values not fully recoverable]
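A file with index, label, count, and parent columns can be sketched as fixed-width binary records, so that a node's position in the file is simply its index times the record size. The Python sketch below is an assumption-laden illustration: the record layout (three 32-bit integers), the class name `DiskNodeFile`, and the use of label 0 and parent -1 for the root are not taken from the slides.

```python
import struct
import tempfile

# Assumed record layout: label, count, parent as little-endian int32.
RECORD = struct.Struct("<iii")


class DiskNodeFile:
    """FP-tree nodes stored as fixed-width records in a binary file."""

    def __init__(self, fileobj):
        self.f = fileobj
        self.size = 0  # number of records written so far

    def append(self, label, count, parent):
        """Write a new node and return its index."""
        self.f.seek(self.size * RECORD.size)
        self.f.write(RECORD.pack(label, count, parent))
        self.size += 1
        return self.size - 1

    def read(self, index):
        """Return (label, count, parent) for the node at index."""
        self.f.seek(index * RECORD.size)
        return RECORD.unpack(self.f.read(RECORD.size))

    def increment(self, index, by=1):
        """Update a node's count in place."""
        label, count, parent = self.read(index)
        self.f.seek(index * RECORD.size)
        self.f.write(RECORD.pack(label, count + by, parent))

    def path_to_root(self, index):
        """Labels from the node at index up to, but excluding, the root."""
        labels = []
        while True:
            label, _, parent = self.read(index)
            if parent == -1:  # reached the root record
                return labels
            labels.append(label)
            index = parent
```

A typical use is to open a `tempfile.TemporaryFile()`, append a root record with parent -1, and then append or increment nodes as transactions arrive. Every lookup costs one seek and one short read, which is why a compact fixed-width layout matters for a disk-resident tree.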
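Example (cont.) [Figure: Header Table for the FP-tree rooted at R, with columns index, item, count, next (nodeIndex); links that go to disk are stored in Next.data (Addr), together with an InDiskCount field. Cell values not recoverable]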
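The role of the header table and its next links can be shown with a small Python sketch. It uses in-memory lists in place of Next.data, and the row format (label, count, parent), the function names, and the sample rows are assumptions for illustration only.

```python
def build_header_table(rows):
    """Link all nodes that carry the same item into one chain.

    rows: sequence of (label, count, parent); the root has parent -1.
    Returns (header, next_link), where header[item] holds the total
    count plus the first and last node index of that item's chain, and
    next_link[i] is the next node with the same item (-1 at the end).
    """
    header = {}
    next_link = [-1] * len(rows)
    for idx, (label, count, parent) in enumerate(rows):
        if parent == -1:
            continue  # the root carries no item
        entry = header.get(label)
        if entry is None:
            header[label] = {"count": count, "head": idx, "tail": idx}
        else:
            next_link[entry["tail"]] = idx
            entry["tail"] = idx
            entry["count"] += count
    return header, next_link


def nodes_of(item, header, next_link):
    """Follow the chain and return every node index holding item."""
    idx = header[item]["head"] if item in header else -1
    found = []
    while idx != -1:
        found.append(idx)
        idx = next_link[idx]
    return found


# Hypothetical FP-tree rows: (label, count, parent).
rows = [("R", 0, -1), (2, 3, 0), (3, 2, 1), (5, 2, 2),
        (1, 1, 3), (5, 1, 1), (3, 1, 0), (1, 1, 6)]
header, next_link = build_header_table(rows)
```

Mining an item then starts from its header entry and walks the chain, so only the nodes for that item need to be fetched, whether they sit in memory or on disk.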
Experimental Results
Datasets: T20I10N10KD1000K.data (IBM Generator) and T40I10D100K.data (IBM Almaden Quest research group).
We compare the generation time of FP-growth, DP, and H-Mine for different minSup values, in order to observe the relationship between minSup and generation time.
Specification of each computing node: CPU i7-, RAM 1 GB, HDD 1 TB, OS Win 7.
Experimental Results (cont.) The experimental results showed that H-Mine performed better than the FP-growth and DP algorithms in terms of execution time.
Experimental Results (cont.) We limited the available memory to 500 MB and set the reserved memory space to 95%. H-Mine again performed better than the FP-growth and DP algorithms in terms of execution time.
Conclusions: The experiments show that when the dataset size increases rapidly, the execution time of the H-Mine algorithm increases, but the curve remains steady. For future work, we intend to further improve the efficiency of this algorithm by combining it with cloud computing technology across multiple nodes.
Thank you!