
Slide 1: Optimizing DivKmeans for Multicore Architectures: a status report
Jiahu Deng and Beth Plale, Department of Computer Science, Indiana University
March 12, 2007, CICC quarterly meeting

Slide 2: Acknowledgements
David Wild, Rajarshi Guha, Digital Chemistry. Work funded in part by CICC and Microsoft.

Slide 3: Problem Statements
1. Clustering is an important method for organizing thousands of data items into meaningful groups. It is widely applied in chemistry, chemical informatics, biology, drug discovery, etc. However, for large datasets, clustering is slow even when it is parallelized and run on powerful computer clusters.
2. Multi-core architectures provide large degrees of parallelism. Taking advantage of this requires re-examining traditional parallelization approaches. We apply that examination to the DivKmeans clustering method.

Slide 4: Multi-core Architectures
Multi-core processors combine two or more independent processor cores in a single package.
[Diagram: an Intel Core 2 dual-core processor, with CPU-local Level 1 caches and a shared, on-die Level 2 cache.]

Slide 5: Clustering Algorithms
1. Hierarchical clustering: a series of partitioning steps generates a hierarchy of clusters. It includes two families: agglomerative methods, which work from the leaves upward, and divisive methods, which decompose from a root downward. (http://www.digitalchemistry.co.uk/prod_clustering.html)

Slide 6: Clustering Algorithms
2. Non-hierarchical clustering: clusters form around centroids, the number of which can be specified by the user. All clusters rank equally and there is no particular relationship between them. (http://www.digitalchemistry.co.uk/prod_clustering.html)

Slide 7: Divisive KMeans (DivKmeans) Clustering Algorithm
Kmeans method: K is the number of clusters, which can be specified by the user. Items are initially assigned to clusters at random. Kmeans clustering then proceeds by repeated application of a two-step process:
1. The mean vector (centroid) of all items in each cluster is computed.
2. Each item is reassigned to the cluster whose centroid is closest to it.
Features: the K-means algorithm is stochastic, so the results are subject to a random component. It works very well for well-defined clusters with a clear cluster center.
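
A minimal C sketch of the two-step loop described above. The array names (data, clusterid, cdata) mirror the Cluster 3.0-style excerpts on slides 21-22, but the sketch itself is only an illustration under those naming assumptions, not the library's implementation.

/* Illustrative sketch of the two-step kmeans iteration described above.
   Names follow the slide 21-22 excerpts; this is not the Cluster 3.0 code. */
#include <float.h>
#include <string.h>

void kmeans_sketch(int nrows, int ncols, int k, int iters,
                   double data[nrows][ncols],   /* items to cluster            */
                   int clusterid[nrows],        /* in: initial random labels   */
                   double cdata[k][ncols])      /* out: cluster centroids      */
{
    int counts[k];
    for (int it = 0; it < iters; it++) {
        /* Step 1: compute the mean vector of each cluster */
        memset(cdata, 0, sizeof(double) * k * ncols);
        memset(counts, 0, sizeof(counts));
        for (int r = 0; r < nrows; r++) {
            int c = clusterid[r];
            counts[c]++;
            for (int j = 0; j < ncols; j++) cdata[c][j] += data[r][j];
        }
        for (int c = 0; c < k; c++)
            if (counts[c] > 0)
                for (int j = 0; j < ncols; j++) cdata[c][j] /= counts[c];

        /* Step 2: reassign each item to the nearest centroid */
        for (int r = 0; r < nrows; r++) {
            double best = DBL_MAX; int bestc = clusterid[r];
            for (int c = 0; c < k; c++) {
                double d = 0.0;
                for (int j = 0; j < ncols; j++) {
                    double diff = data[r][j] - cdata[c][j];
                    d += diff * diff;
                }
                if (d < best) { best = d; bestc = c; }
            }
            clusterid[r] = bestc;
        }
    }
}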

Slide 8: Divisive KMeans (DivKmeans) Clustering Algorithm
Divisive KMeans: a hierarchical kmeans method. In the following discussion we take k = 2, i.e. each clustering step accepts one cluster as input and generates two partitioned clusters as outputs.
[Diagram: the original cluster is split by the Kmeans method into cluster1 and cluster2, each of which is split again by the Kmeans method, and so on.]
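
A short sketch of the divisive (bisecting) structure on this slide. The Cluster type and the kmeans_split2 helper are hypothetical names introduced only for illustration; the slide does not specify the actual interfaces.

/* Illustrative sketch: keep bisecting a cluster with kmeans (k = 2) until it
   cannot be split further. Cluster and kmeans_split2 are hypothetical. */
typedef struct { int *rows; int n; } Cluster;   /* row indices of one cluster */

/* Assumed helper: run one kmeans pass with k = 2 on the rows of c and
   hand back the two resulting sub-clusters. */
void kmeans_split2(const Cluster *c, Cluster **left, Cluster **right);

void divkmeans_sketch(Cluster *c)
{
    if (c->n <= 1) return;               /* single element: nothing to split  */
    Cluster *left, *right;
    kmeans_split2(c, &left, &right);     /* one kmeans(k = 2) step             */
    divkmeans_sketch(left);              /* recurse on each child cluster;     */
    divkmeans_sketch(right);             /* the parallel version assigns these to free nodes */
}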

Slide 9: Parallelization of the DivKmeans Algorithm for Multicore
Proceeding without the Digital Chemistry DivKmeans: once agreement was reached (Nov 2006), we could not get a version of the source code isolated that communicated through public rather than private interfaces.
Naive parallelization of DivKmeans:
- Chose to work with Cluster 3.0 from the Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo. The C clustering library is released under the Python License.
- Parallelized this Kmeans code with decomposition
- Gathered performance results on the naive parallelization
- Suggest multicore-sensitive parallelizations
- Early performance results of these parallelizations

Slide 10: Naive Parallelization of Cluster 3.0 Kmeans
- Treat each kmeans clustering process as a black box that takes one cluster as input and generates two clusters as output.
- When a new cluster with more than one element is generated, assign it to a free processor for further clustering.
- A master node maintains the status of each node (see the MPI sketch after slide 11).

Slide 11: Naive Parallelization of Cluster 3.0 Kmeans
[Diagram: a master node dispatches clusters to working nodes 1-3; the original cluster yields cluster1 (reassigned to node 1) and cluster2 (assigned to node 2), with later clusters assigned to node 3 or reassigned to node 2.]
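
A minimal MPI skeleton of the master/worker scheme on slides 10-11, using standard MPI calls (the deck's experiments use LAM/MPI, slide 13). The message tags, payload format, and queue policy are assumptions for illustration; this is not the project's actual code.

/* Skeleton of the naive master/worker parallelization (slides 10-11).
   Tags, payload format, and dispatch policy are illustrative assumptions. */
#include <mpi.h>

#define TAG_WORK  1   /* master -> worker: a cluster to split        */
#define TAG_DONE  2   /* worker -> master: the two result clusters   */
#define TAG_STOP  3   /* master -> worker: no more work, shut down   */

void master(int nworkers);   /* tracks free/busy nodes and dispatches clusters */
void worker(void);           /* runs one kmeans(k = 2) split per work message  */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        master(size - 1);    /* rank 0: maintain node status, assign new clusters */
    else
        worker();            /* ranks 1..size-1: receive a cluster, split it,
                                send the two children back to the master          */
    MPI_Finalize();
    return 0;
}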

Slide 12: Quality of the Cluster 3.0 Kmeans Naive Parallelization
Pros: we don't need to worry about the internal details of the DivKmeans method, and we can use the Kmeans functions of other libraries directly.
Cons: what about speedup and scalability? What is the parallelization overhead?

Slide 13: Profiling the Naive Parallelization
Platform: a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores. Linux RHEL WS release 4.
Algorithm: Cluster 3.0, parallelized and made divisive.
Dataset: PubChem datasets of 24,000 and 96,000 elements.
Additional libraries: LAM 7.1.2/MPI.

Slide 14: Speedup of the Naive Parallelization of Cluster 3.0
Speedup is defined by S_p = T_1 / T_p, where:
- p is the number of processors
- T_1 is the execution time of the sequential algorithm
- T_p is the execution time of the parallel algorithm with p processors
Conclusion: maximum benefit is reached at 17 nodes; speedup falls off significantly after only 5 nodes.
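
For illustration only (these numbers are hypothetical, not the measured results from the slide's plot): if the sequential run takes T_1 = 1,000 s and the 5-node run takes T_5 = 250 s, then S_5 = 1000 / 250 = 4, a parallel efficiency of S_5 / 5 = 0.8. If adding nodes only brings the 17-node time down to T_17 = 180 s, then S_17 = 1000 / 180 ≈ 5.6 and efficiency drops to about 0.33, which is the shape of curve the conclusion describes.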

Slide 15: CPU Utilization
Conclusion: Node 1 maxes out at 100% utilization, a likely limiter on overall performance.

Slide 16: Memory Utilization
Conclusion: nothing outstanding.

Slide 17: Process Behaviors
Observed with XMPI, a graphical user interface for running, debugging, and visualizing MPI programs.

Slide 18: Conclusions on the Naive Parallelization from Profiling
- Poor scalability beyond 5 nodes.
- Performance likely inhibited by the 100% utilization of Node 1.
Proposed solution: a multi-core solution using multiple threads on each node, with each thread running on one core. The following slides show how this solution explicitly addresses the two problems identified above.

Slide 19: Proposed Solution
Instead of treating each kmeans clustering process as a black box, each clustering process is decomposed into several threads.
[Diagram: the original cluster goes through some pre-processing, the main work is split across threads 1-4, the thread results are merged, further processing follows, and cluster1 and cluster2 are produced.]

Slide 20: Step 1: Identify Parts to Decompose (Parallelize)
Calling sequence of the kmeans clustering process: DivKmeans calls kmeans(), whose while loop contains do loops for Finding Centroids and Calculating Distance.
Profiling shows:
- About 93% of total execution time is spent in the kmeans() function.
- Inside kmeans(), almost all time is spent in Finding Centroids and Calculating Distance.
- Hence, parallelize these two.

Slide 21: Simplified Code of Finding Centroids

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}

// calculate mean values
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] /= total_number[i][j];
}

Slide 22: Parallelized Code of Finding Centroids

Before parallelization:

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}
// calculate mean values
...

After parallelization:

// sum up elements assigned to the current thread
for (k = nrows * index / n_thread; k < nrows * (index + 1) / n_thread; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++) {
        if (mask[k][j] != 0) {
            t_data[i][j] += data[k][j];
            t_mask[i][j]++;
        }
    }
}
// merge data
...
// calculate mean values
...
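
A minimal pthread sketch of how the per-thread loop above might be driven: each thread sums its slice into private t_data/t_mask buffers and the main thread merges them afterwards. The argument struct, buffer handling, and merge step are assumptions for illustration; the elided "// merge data" in the slide is not reproduced here.

/* Illustrative pthread driver for the per-thread summation shown above.
   The argument struct, buffer layout, and merge step are assumptions,
   not the project's actual code. */
#include <pthread.h>

#define N_THREAD 4

typedef struct {
    int index;                       /* which slice of rows this thread owns   */
    /* pointers to data, mask, clusterid, and this thread's private
       t_data / t_mask buffers would also live here                            */
} centroid_arg_t;

static void *find_centroids_worker(void *p)
{
    centroid_arg_t *arg = p;
    (void)arg;
    /* run the "after parallelization" loop from this slide over rows
       [nrows*index/n_thread, nrows*(index+1)/n_thread) using arg->index       */
    return NULL;
}

void find_centroids_parallel(void)
{
    pthread_t tid[N_THREAD];
    centroid_arg_t arg[N_THREAD];

    for (int t = 0; t < N_THREAD; t++) {   /* one thread per core              */
        arg[t].index = t;
        pthread_create(&tid[t], NULL, find_centroids_worker, &arg[t]);
    }
    for (int t = 0; t < N_THREAD; t++)     /* wait for all slices to finish    */
        pthread_join(tid[t], NULL);

    /* merge: add each thread's private t_data / t_mask into the shared
       cdata totals, then divide to obtain the mean values                     */
}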

Slide 23: Mapping the Algorithm onto Multi-core Architectures
Each thread uses one core.
[Diagram: the decomposition of slide 19 with threads 1-4 placed on cores 1-4; the original cluster is pre-processed, the per-core results are merged, further processing follows, and cluster1 and cluster2 are produced.]

Slide 24: Mapping the Algorithm onto Multi-core Architectures
How can we further benefit from multi-core architectures?
- Data locality
- Cache-aware algorithms
- Architecture-aware algorithms

Slide 25: Mapping the Algorithm onto Multi-core Architectures
Example 1: AMD Opteron. There is no cache sharing between the two cores in this architecture.
[Diagram of an AMD Opteron]

Slide 26: Mapping the Algorithm onto Multi-core Architectures
Example 2: Intel Core 2. Improve cache re-use: if two threads share common data, assign them to cores on the same die.
[Diagram of an Intel Core 2 dual-core processor]
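
One way to place two data-sharing threads on cores of the same die under Linux is CPU affinity; a minimal sketch using the GNU pthread_setaffinity_np extension. The assumption that cores 0 and 1 share a die is illustrative and depends on the machine's core numbering.

/* Sketch: pin two data-sharing threads to cores 0 and 1, assuming those two
   core IDs share a die on this machine (core numbering is platform-specific). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_core(pthread_t t, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                        /* allow exactly one core */
    return pthread_setaffinity_np(t, sizeof(set), &set);
}

/* usage: after creating thread_a and thread_b, which share common data:
       pin_to_core(thread_a, 0);
       pin_to_core(thread_b, 1);                                            */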

Slide 27: Mapping the Algorithm onto Multi-core Architectures
Example 3: Dell PowerEdge 6950, a NUMA (Non-Uniform Memory Access) machine. Improve data locality: keep data in local memory so that each thread uses local memory rather than remote memory as much as possible.
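
On Linux, the default first-touch policy places a memory page on the NUMA node of the thread that first writes it, so one way to keep each thread's slice local is to have that thread initialize its own slice. A minimal sketch; the slice layout mirrors the per-thread loop on slide 22, and the names are illustrative assumptions.

/* Sketch: rely on Linux's first-touch page placement so each thread's slice
   of the data matrix ends up in that thread's local NUMA memory.
   Names and slice layout are illustrative. */
#include <pthread.h>

typedef struct { double *data; long nrows, ncols, index, n_thread; } init_arg_t;

static void *first_touch_init(void *p)
{
    init_arg_t *a = p;
    long lo = a->nrows * a->index / a->n_thread;
    long hi = a->nrows * (a->index + 1) / a->n_thread;
    /* the first write to each page decides which node's memory backs it,
       so let the thread that will later process these rows touch them first */
    for (long k = lo; k < hi; k++)
        for (long j = 0; j < a->ncols; j++)
            a->data[k * a->ncols + j] = 0.0;
    return NULL;
}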

Slide 28: Early Results on a Multi-core Platform
Experiment environment:
- Platform: 3 nodes of a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores. Linux RHEL WS release 4.
- Libraries: LAM 7.1.2/MPI; Pthreads for Linux RHEL WS release 4.
Degree of parallelization: only the Finding Centroids code is parallelized for this early study. 4 threads are used for Finding Centroids on each node, and each thread runs on one core.

Slide 29: Results of Parallelizing Finding Centroids
Conclusion: modest improvement. DivKmeans runs about 12% faster after parallelization.

Slide 30: Parallelizing Finding Centroids with Different Numbers of Threads per Node
Total number of cores per node: 4.
Conclusion: there is little to no benefit from using more threads than the number of cores.

Slide 31: Optimizations for the Next Step
- Reduce the overhead of managing threads (e.g., use a thread pool instead of creating new threads for each call to Finding Centroids), as in the sketch below.
- Parallelize the Calculating Distance part, which consumes twice the time of Finding Centroids.
- More cores (4, 8, 32, ...) on a single computer are on the way; we should see further performance gains with more cores if the program scales well.
- The platform we used (AMD Opteron(TM)) doesn't support cache sharing between two cores on the same die. However, L2 and even L1 cache sharing among cores are becoming available.
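
A minimal sketch of the thread-pool idea in the first bullet: workers are created once and reused across calls, synchronizing with pthread barriers instead of being created and joined on every call to Finding Centroids. The control flow and names are illustrative assumptions, not the project's design.

/* Sketch: persistent worker threads reused across calls to Finding Centroids,
   synchronized with barriers instead of per-call pthread_create/join.
   Control flow and names are illustrative. */
#include <pthread.h>

#define N_THREAD 4

static pthread_barrier_t start_barrier, done_barrier;
static volatile int shutting_down = 0;

static void *pool_worker(void *p)
{
    long index = (long)p;
    (void)index;
    for (;;) {
        pthread_barrier_wait(&start_barrier);  /* wait for the next request     */
        if (shutting_down) break;
        /* run this thread's slice of Finding Centroids (slide 22) using index  */
        pthread_barrier_wait(&done_barrier);   /* signal that the slice is done */
    }
    return NULL;
}

void pool_init(void)                           /* called once at startup        */
{
    pthread_barrier_init(&start_barrier, NULL, N_THREAD + 1);
    pthread_barrier_init(&done_barrier, NULL, N_THREAD + 1);
    for (long t = 0; t < N_THREAD; t++) {
        pthread_t tid;
        pthread_create(&tid, NULL, pool_worker, (void *)t);
        pthread_detach(tid);
    }
}

void find_centroids_with_pool(void)            /* called once per kmeans pass   */
{
    pthread_barrier_wait(&start_barrier);      /* release the workers           */
    pthread_barrier_wait(&done_barrier);       /* wait for all slices           */
    /* merge the per-thread buffers as before                                   */
}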

Slide 32: The Multi-core Project in the Distributed Data Everywhere (DDE) Lab and the Extreme Lab
Multi-core processors represent a major evolution in today's computing technology. We are exploring the programming styles and challenges on multi-core platforms, and potential applications in both academic and commercial areas, including chemical informatics, XML parsing, data streaming, Web Services, etc.

Slide 33: References
1. Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo. http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
2. http://www.nsc.liu.se/rd/enacts/Smith/img1.htm
3. http://www.mhpcc.edu/training/workshop/parallel_intro/
4. http://www.digitalchemistry.co.uk/prod_clustering.html
5. Performance Benchmarking on the Dell PowerEdge 6950, David Morse, Dell Inc.

