Optimizing N-body Simulations for Multi-core Compute Clusters
Ammar Ahmad Awan (BIT-6)
Advisor: Dr. Aamir Shafi
Co-Advisor: Mr. Ali Sajjad
Members: Dr. Hafiz Farooq, Mr. Tahir Azim
Presentation Outline
- Introduction
- Design & Implementation
- Performance Evaluation
- Conclusions and Future Work
Introduction
A sea change in basic computer architecture, driven by:
- Power consumption
- Heat dissipation
Multiple energy-efficient processing cores are emerging in place of a single power-hungry core.
Moore's law will now be realized by increasing core count instead of increasing clock speed.
Impact on software applications:
- The focus shifts from Instruction-Level Parallelism (higher clock frequency) to Thread-Level Parallelism (increasing core count)
Huge impact on the High Performance Computing (HPC) community:
- 70% of the TOP500 supercomputers are based on multi-core processors
(Figures: multi-core hardware images; sources: Google Images, www.intel.com)
SMP vs Multi-core (figure contrasting a Symmetric Multi-Processor with a Multi-core Processor)
HPC and Multi-core
The Message Passing Interface (MPI) is the de facto standard for programming today's supercomputers.
- Alternatives include OpenMP (for SMP machines) and Unified Parallel C (UPC).
With existing approaches, MPI can be used on multi-core processors in two ways:
- One MPI process per core: we call this the "Pure MPI" approach
- OpenMP threads inside each MPI process: we call this the "MPI+threads" approach
We expect the "MPI+threads" approach to perform well because:
- Communication between threads costs less than communication between processes
- Threads are lightweight
We evaluated this hypothesis by comparing both approaches; a minimal sketch of the hybrid style follows.
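The sketch below illustrates the "MPI+threads" structure in C with MPI and OpenMP. It is a minimal illustration, not Gadget-2's actual code: one MPI process per node, with OpenMP threads fanning out across that node's cores. MPI_Init_thread with MPI_THREAD_FUNNELED requests thread support in which only the main thread makes MPI calls.

    /* Minimal "MPI+threads" skeleton (illustrative, not Gadget-2 code).
       Compile with e.g.: mpicc -fopenmp hybrid.c -o hybrid */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* FUNNELED: only the main thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Threads share the process address space, so intra-node
           parallelism needs no message passing at all. */
        #pragma omp parallel
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }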
Pure MPI vs "MPI+threads" Approach (figure)
Sample Application: N-body Simulations
To demonstrate the usefulness of the "MPI+threads" approach, we chose an N-body simulation code.
The N-body or "many-body" method simulates the evolution of a system of n bodies. It is widely used in:
- Astrophysics
- Molecular Dynamics
- Computational Biology
Summation Approach to Solving N-body Problems
The most compute-intensive part of any N-body method is the "force calculation" phase.
The cost of this calculation is O(n²).
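For concreteness, here is a minimal direct-summation sketch in C (the Body struct, G, and the softening term EPS2 are illustrative assumptions, not Gadget-2's data structures): every body accumulates the pull of every other body, which is where the O(n²) cost comes from.

    /* Illustrative O(n^2) direct summation of gravitational forces.
       The struct layout and softening constant are assumptions. */
    #include <math.h>

    #define G    6.674e-11  /* gravitational constant */
    #define EPS2 1e-6       /* softening term, avoids division by zero */

    typedef struct { double pos[3], mass, acc[3]; } Body;

    void compute_forces(Body *b, int n)
    {
        for (int i = 0; i < n; i++) {
            b[i].acc[0] = b[i].acc[1] = b[i].acc[2] = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = b[j].pos[0] - b[i].pos[0];
                double dy = b[j].pos[1] - b[i].pos[1];
                double dz = b[j].pos[2] - b[i].pos[2];
                double r2 = dx*dx + dy*dy + dz*dz + EPS2;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                /* a_i += G * m_j * (r_j - r_i) / |r_j - r_i|^3 */
                b[i].acc[0] += G * b[j].mass * dx * inv_r3;
                b[i].acc[1] += G * b[j].mass * dy * inv_r3;
                b[i].acc[2] += G * b[j].mass * dz * inv_r3;
            }
        }
    }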
Barnes-Hut Tree
The Barnes-Hut algorithm is divided into 3 steps:
1. Building the tree: O(n log n)
2. Computing cell centers of mass: O(n)
3. Computing forces: O(n log n)
Other popular methods are:
- Fast Multipole Method
- Particle-Mesh Method
- TreePM Method
- Symplectic Methods
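A sketch of step 3, the force walk (the Node layout and opening angle THETA are assumptions for illustration, not Gadget-2's internals): a cell that is small relative to its distance is treated as a single pseudo-particle at its center of mass, which is what cuts the cost from O(n²) to O(n log n).

    /* Illustrative Barnes-Hut force walk; Node layout and THETA are
       assumptions for this sketch. */
    #include <math.h>

    #define G     6.674e-11
    #define EPS2  1e-6   /* softening; also masks self-interaction here */
    #define THETA 0.5    /* opening-angle criterion */

    typedef struct Node {
        double com[3];          /* center of mass of the cell */
        double mass;            /* total mass in the cell     */
        double size;            /* side length of the cell    */
        struct Node *child[8];  /* NULL for empty octants     */
        int is_leaf;
    } Node;

    /* Accumulate acceleration at pos[] from the subtree rooted at n. */
    void tree_force(const Node *n, const double pos[3], double acc[3])
    {
        double dx = n->com[0] - pos[0];
        double dy = n->com[1] - pos[1];
        double dz = n->com[2] - pos[2];
        double r2 = dx*dx + dy*dy + dz*dz + EPS2;
        double r  = sqrt(r2);

        /* Distant (or leaf) cell: treat as one pseudo-particle. */
        if (n->is_leaf || n->size / r < THETA) {
            double f = G * n->mass / (r2 * r);
            acc[0] += f * dx; acc[1] += f * dy; acc[2] += f * dz;
        } else {
            /* Nearby cell: recurse into its children. */
            for (int k = 0; k < 8; k++)
                if (n->child[k])
                    tree_force(n->child[k], pos, acc);
        }
    }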
Sample Application: Gadget-2 Cosmological Simulation Code
Simulates a system of n bodies; implements the Barnes-Hut algorithm.
Written in the C language and parallelized with MPI.
As part of this project, we:
- Studied the Gadget-2 code and how it is used in production mode
- Modified the C code to use threads in the Barnes-Hut tree algorithm
- Added performance counters to the code for measuring cache utilization
Presentation Outline
- Introduction
- Design & Implementation
- Performance Evaluation
- Conclusions and Future Work
Gadget-2 Architecture (figure)
Code Analysis
In the original code, the force calculation sits inside the particle/export loop. In the modified code, the force loop is hoisted out of the export loop and parallelized across threads.

Original Code:

    for ( i = 0 to No. of particles && n = 0 to BufferSize ) {
        calculate_force( i );
        for ( j = 0 to No. of tasks ) {
            export_particles( j );
        }
    }

Modified Code:

    parallel for ( i = 0 to n ) {
        calculate_force( i );
    }
    for ( i = 0 to No. of particles && n = 0 to BufferSize ) {
        for ( j = 0 to No. of tasks ) {
            export_particles( j );
        }
    }
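In C with OpenMP, the modified structure could look like the following sketch (function signatures and parameter names are illustrative, mirroring the pseudocode above rather than Gadget-2's actual code):

    #include <omp.h>

    void calculate_force(int i);    /* assumed helper: force on particle i  */
    void export_particles(int j);   /* assumed helper: export to MPI task j */

    void force_phase(int n_particles, int n_tasks, int buffer_size)
    {
        /* Phase 1: thread-parallel force calculation. Each iteration
           writes only its own particle's data, so no locking is needed. */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n_particles; i++)
            calculate_force(i);

        /* Phase 2: serial export of particles to other MPI tasks,
           bounded by the communication buffer size. */
        for (int i = 0, n = 0; i < n_particles && n < buffer_size; i++, n++)
            for (int j = 0; j < n_tasks; j++)
                export_particles(j);
    }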
Presentation Outline
- Introduction
- Design & Implementation
- Performance Evaluation
- Conclusions and Future Work
Evaluation Testbed
Our cluster, Chenab, consists of nine nodes. Each node has:
- An Intel Xeon quad-core Kentsfield processor
  - 2.4 GHz with 1066 MHz FSB
  - 4 MB L2 cache per two cores
  - 32 KB L1 cache per core
- 2 GB main memory
Performance Evaluation
Performance evaluation is based on two main parameters:
- Execution time
  - Measured directly with MPI wall-clock timings
- Cache utilization
  - We patched the Linux kernel with the perfctr patch
  - We used the Performance API (PAPI) for hardware performance counting
  - Used PAPI_L2_TCM (total L2 cache misses) and PAPI_L2_TCA (total L2 cache accesses) to calculate the cache miss ratio; a sketch of the counter setup follows
Results are shown on the upcoming slides:
- Execution time for Colliding Galaxies
- Execution time for Cluster Formation
- Execution time for Custom Simulation
- Cache utilization for Cluster Formation
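A sketch of how such a measurement is typically wired up with PAPI's low-level event-set API (the wrapper function is illustrative; the actual instrumentation points inside Gadget-2 are not shown, and error handling is abbreviated):

    #include <papi.h>
    #include <stdio.h>

    /* Measure the L2 miss ratio across one call to work(). */
    void measure(void (*work)(void))
    {
        int evset = PAPI_NULL;
        long long v[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return;                           /* PAPI unavailable */
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_L2_TCM);   /* total L2 misses   */
        PAPI_add_event(evset, PAPI_L2_TCA);   /* total L2 accesses */

        PAPI_start(evset);
        work();                               /* e.g. the force-calculation phase */
        PAPI_stop(evset, v);

        printf("L2 miss ratio: %.4f\n", (double)v[0] / (double)v[1]);
    }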
Execution Time for Colliding Galaxies (figure)
Execution Time for Cluster Formation (figure)
Execution Time for Custom Simulation (figure)
Cache Utilization for Cluster Formation (figure)
Cache utilization was measured using hardware counters, via the kernel patch (perfctr) and the Performance API (PAPI).
Presentation Outline
- Introduction
- Design & Implementation
- Performance Evaluation
- Conclusions and Future Work
Conclusion
We optimized our sample application, Gadget-2:
- The "MPI+threads" approach performs better
- The optimized code offers scalable performance
We are witnessing dramatic changes in core designs for multi-core systems:
- Both heterogeneous and homogeneous designs
- Targeting a 1000-core processor will require scalable frameworks and tools for programming
Conclusion (continued)
Towards many-core computing:
- Multicore: 2x cores every 2 years, i.e. roughly 64 cores in 8 years
- Manycore: 8x to 16x multicore
Source: Dave Patterson, Overview of the Parallel Laboratory
Future Work
Scalable frameworks that provide programmer-friendly, high-level constructs are very important:
- PeakStream provides GPU and hybrid CPU+GPU programming
- Cilk++ augments the C++ compiler with three new keywords (cilk_for, cilk_sync, cilk_spawn)
- The Research Accelerator for Multiple Processors (RAMP) can be used to simulate a 1000-core processor
- Gadget-2 can be ported to GPUs using Nvidia's CUDA framework
- The 'xlc' compiler can be used to program the STI Cell processor
The Timeline (figure)
Barnes-Hut Tree (figure)