Published by Eugenia Hopkins. Modified over 8 years ago.
1
Improving Parallelism in Structural Data Mining
Min Cai, Istvan Jonyer, Marcin Paprzycki
Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, U.S.A.
2
Who am I?
Min Cai: cmin@cs.okstate.edu
Ph.D. student, Computer Science Department, Oklahoma State University
Research interests: parallel and distributed computing, data mining
3
Introduction
Data warehouses of increasing size
Data mining: a technique for discovering interesting properties in data
Structural data mining: data represented as a graph
Aim: substructure discovery, i.e. finding “interesting” and recurring subgraphs in a labeled graph
4
SUBDUE (1)
Discovers substructures using the minimum description length (MDL) principle
  Cook, D.J., Holder, L.B., Galal, G., Maglothin, R.: Approaches to Parallel Graph-Based Knowledge Discovery. Journal of Parallel and Distributed Computing, 61(3) (2001) 427-446
Data objects: graph vertices
Relationships: graph edges
Substructure: connected subgraph
NOTE: graph algorithms are notorious for long execution times
5
SUBDUE (2)
Algorithm: two basic steps
  1. Substructure discovery: apply the minimum description length (MDL) principle to find the “best” / “most important” structure in the graph (possibly stop here; this is the answer)
  2. Substructure replacement: replace the substructure found in the first step by a single vertex and repeat the process
Results: a single substructure or a hierarchy of substructures
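The replacement step can be sketched in a few lines of Python. This is only an illustration, not SUBDUE's own code: the graph layout (a dict of vertex labels plus a set of labeled edges), the function name `replace_instance`, and the label `SUB_1` are all assumptions made for the example.

```python
def replace_instance(labels, edges, instance, new_label="SUB_1"):
    """Collapse one occurrence of a substructure into a single vertex.

    labels:   dict mapping vertex id -> vertex label
    edges:    set of (u, v, edge_label) tuples
    instance: set of vertex ids that form the occurrence
    All names here are hypothetical; SUBDUE's real data structures differ.
    """
    new_id = max(labels) + 1
    # Keep every vertex outside the instance, then add the replacement vertex.
    new_labels = {v: lab for v, lab in labels.items() if v not in instance}
    new_labels[new_id] = new_label

    new_edges = set()
    for u, v, lab in edges:
        if u in instance and v in instance:
            continue  # internal edge: absorbed into the new vertex
        # Redirect any endpoint inside the instance to the new vertex.
        u2 = new_id if u in instance else u
        v2 = new_id if v in instance else v
        new_edges.add((u2, v2, lab))
    return new_labels, new_edges


# Toy chain C - N - O - H, with {C, N} treated as the discovered substructure.
labels = {0: "C", 1: "N", 2: "O", 3: "H"}
edges = {(0, 1, "bond"), (1, 2, "bond"), (2, 3, "bond")}
compressed_labels, compressed_edges = replace_instance(labels, edges, {0, 1})
```

Running discovery again on the compressed graph is what produces the hierarchy of substructures mentioned on the slide.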
6
Parallel SUBDUE
Data-parallel approach
The graph is divided into subgraphs, each sent to a separate processor
Each processor finds its best substructure and communicates it to the rest
The best overall substructure is found
The hierarchical process can be repeated
7
MPI-SUBDUE
Graph divided into subgraphs using METIS
Point-to-point communication (MPI_Send and MPI_Recv) used between processors
NOTE: the best structure in data set “7” may be dreadful when confronted with data set “18”
  Galal, G.M., Cook, D.J., Holder, L.B.: Improving Scalability in a Knowledge Discovery System by Exploiting Parallelism. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (1997) 171-174
8
NEW-MPI-SUBDUE
Improvements:
  use PARMETIS to divide the initial graph
  use global communication (MPI_Allgatherv)
  use binary summation
YES, these changes do not look like much
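The slides do not spell out what "binary summation" means. One common reading is a pairwise (tree) reduction, in which partial sums are combined in about log2(n) rounds instead of n - 1 sequential additions. The sketch below simulates that pattern in plain Python with no MPI involved; the function name `binary_tree_sum` is an assumption for illustration.

```python
def binary_tree_sum(values):
    """Simulate a pairwise tree reduction over per-processor values.

    Returns (total, rounds). In round k, slot i absorbs slot i + 2**k,
    so the total ends up in slot 0. Expects a non-empty sequence.
    """
    vals = list(values)
    n = len(vals)
    step, rounds = 1, 0
    while step < n:
        # All additions within one round are independent, so on a real
        # cluster they could run at the same time on different processors.
        for i in range(0, n - step, 2 * step):
            vals[i] += vals[i + step]
        step *= 2
        rounds += 1
    return vals[0], rounds


total, rounds = binary_tree_sum([3, 1, 4, 1, 5, 9, 2, 6])
print(total, rounds)  # 31 3
```

With 32 processors this scheme needs 5 rounds, compared with 31 additions if a single processor collects and sums everything.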
9
NEW-MPI-SUBDUE
Spawn P(0), P(1), P(2), ..., P(n)
Apply PARMETIS to partition G into n partitions
for all P(i) where 1 ≤ i ≤ n do
    discover the best substructure in partition
    broadcast best substructure to all other processors
    evaluate best substructure and broadcast results
    parallel-binary summation of results to find the best overall partition
P(0) finds the best overall structure
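The flow of one round can be imitated serially in Python. This is a toy sketch under heavy assumptions: each partition is just a list of pattern names, "best" means most frequent rather than best by MDL, and the all-gather and summation are ordinary list operations rather than MPI calls. The function names are hypothetical.

```python
from collections import Counter


def local_best(partition):
    """Most frequent pattern in one partition (toy stand-in for MDL search)."""
    return Counter(partition).most_common(1)[0][0]


def new_mpi_subdue_round(partitions):
    # Step 1: every partition proposes its own best candidate.
    candidates = [local_best(p) for p in partitions]
    # Step 2: stand-in for the all-gather; every partition sees all candidates.
    # Step 3: each partition scores every candidate on its own data.
    local_scores = [[p.count(c) for c in candidates] for p in partitions]
    # Step 4: stand-in for the summation; add score vectors across partitions.
    totals = [sum(column) for column in zip(*local_scores)]
    # Step 5: the candidate with the highest total is the overall best.
    best_index = max(range(len(candidates)), key=totals.__getitem__)
    return candidates[best_index], totals


partitions = [
    ["A", "A", "A", "B"],
    ["B", "B", "A"],
    ["B", "B", "B", "C", "C", "C", "C"],
]
best, totals = new_mpi_subdue_round(partitions)
print(best, totals)  # B [4, 6, 4]
```

In this toy data "C" has the highest local score of any candidate (4 in the third partition) but never occurs elsewhere, so the globally summed scores pick "B". That is the reason every candidate is evaluated on every partition before a winner is chosen.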
10
EXPERIMENTAL SETUP
Mutagenesis data from the OxUni datasets, collected in order to predict the mutagenicity of aromatic and heteroaromatic nitro compounds
  Graph 1: 2844 vertices and 2883 edges
  Graph 2: 2896 vertices and 2934 edges
  Graph 3: 22268 vertices and 22823 edges
16-node cluster (32 processors)
  two AMD Athlon MP 1800+ (1.6GHz) CPUs, 2 GB of DDR SDRAM
  full-backplane Gigabit Ethernet switch
  RedHat Linux 9.0, MPICH, Portland Group C compiler 5.0-2
11
EXPERIMENTAL RESULTS I
12
EXPERIMENTAL RESULTS II
13
EXPERIMENTAL RESULTS III
14
COMMENTS
Graphs 1 and 2, which were large in 2000, are small and “useless” today
Graph 3 gives a realistic performance picture: gains about 33%
Speedup over original SUBDUE: 268 on 32 processors for Graph 3
  this IS “cheating”, as some information may be lost due to graph partitioning
  but …
Graph partitioning and balancing matter
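To see why the speedup figure is called “cheating”, it helps to convert it to parallel efficiency, using the usual definition of speedup divided by processor count. The function name below is an assumption for illustration; the two numbers come from the slide.

```python
def parallel_efficiency(speedup, processors):
    """Efficiency = speedup / processors; 1.0 corresponds to ideal linear scaling."""
    return speedup / processors


efficiency = parallel_efficiency(268, 32)
print(efficiency)  # 8.375
```

An efficiency above 1 means the speedup is superlinear. That fits the slide's remark: each processor searches a much smaller partitioned graph, so the parallel version does less total work than the original, at the cost of possibly missing substructures that span partition boundaries.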
15
THE END THANK YOU!