Presentation transcript: "Optimizing N-body Simulations for Multi-core Compute Clusters"

1 Optimizing N-body Simulations for Multi-core Compute Clusters
Ammar Ahmad Awan (BIT-6)
Advisor: Dr. Aamir Shafi | Co-Advisor: Mr. Ali Sajjad | Member: Dr. Hafiz Farooq | Member: Mr. Tahir Azim

2 Presentation Outline
– Introduction
– Design & Implementation
– Performance Evaluation
– Conclusions and Future Work

3 Introduction
– A sea change in basic computer architecture, driven by power consumption and heat dissipation limits
– Emergence of multiple energy-efficient processing cores in place of a single power-hungry core
– Moore's law will now be realized by increasing core count instead of clock speed
– Impact on software applications: the focus shifts from Instruction Level Parallelism (higher clock frequency) to Thread Level Parallelism (increasing core count)
– Huge impact on the High Performance Computing (HPC) community: 70% of the TOP500 supercomputers are based on multi-core processors

4 (image slide) Source: Google Images

5 (image slide) Source: www.intel.com

6 SMP vs Multi-core (diagram slide contrasting a Symmetric Multi-Processor with a Multi-core Processor)

7 HPC and Multi-core
– The Message Passing Interface (MPI) is the de facto standard for programming today's supercomputers; alternatives include OpenMP (for SMP machines) and Unified Parallel C (UPC)
– With existing approaches, MPI applications can be ported to multi-core processors in two ways:
– One MPI process per core, which we call the "Pure MPI" approach
– OpenMP threads inside each MPI process, which we call the "MPI+threads" approach
– We expect the "MPI+threads" approach to do well because communication between threads is cheaper than between processes, and threads are lightweight
– We evaluated this hypothesis by comparing both approaches; a sketch of the hybrid structure follows below
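To make the comparison concrete, here is a minimal C sketch of the "MPI+threads" structure, assuming a hypothetical per-core kernel do_work(); it illustrates the idea rather than Gadget-2's actual code:

    /* Minimal "MPI+threads" sketch: one MPI process per node,
     * OpenMP threads inside it. do_work() is a hypothetical
     * stand-in for the per-core computation. */
    #include <mpi.h>
    #include <omp.h>

    void do_work(int rank, int thread) { /* per-core computation */ }

    int main(int argc, char **argv) {
        int provided, rank;
        /* Request thread support so threads and MPI can coexist. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel               /* one thread per core */
        do_work(rank, omp_get_thread_num());

        MPI_Finalize();
        return 0;
    }

Under the "Pure MPI" approach, the same program would instead be launched with one rank per core and no OpenMP region.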

8 Pure MPI vs "MPI+threads" approach (diagram slide)

9 Sample Application: N-body Simulations
– To demonstrate the usefulness of our "MPI+threads" approach, we chose an N-body simulation code
– The N-body (or "many-body") method simulates the evolution of a system of n bodies
– It is widely used in astrophysics, molecular dynamics, and computational biology

10 Summation Approach to Solving N-body Problems
– The most compute-intensive part of any N-body method is the "force calculation" phase
– The cost of this calculation is O(n²); a direct-summation sketch follows below
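A minimal C sketch of the O(n²) phase, with an illustrative Body struct and softening parameter (production codes such as Gadget-2 use their own data layout):

    /* Direct-summation sketch of the force-calculation phase.
     * Body, G, and eps2 (softening) are illustrative choices. */
    #include <math.h>

    typedef struct { double x, y, z, mass, ax, ay, az; } Body;

    void compute_accelerations(Body *b, int n, double G, double eps2) {
        for (int i = 0; i < n; i++) {            /* O(n^2) pairs */
            b[i].ax = b[i].ay = b[i].az = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = b[j].x - b[i].x;
                double dy = b[j].y - b[i].y;
                double dz = b[j].z - b[i].z;
                double r2 = dx*dx + dy*dy + dz*dz + eps2; /* softened */
                double s  = G * b[j].mass / (r2 * sqrt(r2));
                b[i].ax += s * dx;
                b[i].ay += s * dy;
                b[i].az += s * dz;
            }
        }
    }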

11 Barnes-Hut Tree
The Barnes-Hut algorithm is divided into 3 steps:
1. Building the tree – O(n log n)
2. Computing cell centers of mass – O(n)
3. Computing forces – O(n log n)
Other popular methods: Fast Multipole Method, Particle Mesh Method, TreePM Method, Symplectic Methods
A sketch of the force walk (step 3) follows below.
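A hedged C sketch of the force walk using the standard opening-angle criterion; the Node struct, theta, and the acc() callback are illustrative, not Gadget-2's internals:

    /* Barnes-Hut walk sketch: treat a distant cell as one body
     * when it subtends a small angle (size/distance < theta).
     * Self-interaction checks are omitted for brevity. */
    #include <math.h>

    typedef struct Node {
        double cm[3];           /* center of mass of the cell    */
        double mass;            /* total mass in the cell        */
        double size;            /* side length of the cell       */
        struct Node *child[8];  /* octree children (NULL = none) */
        int is_leaf;
    } Node;

    void walk(const Node *t, const double pos[3], double theta,
              void (*acc)(const double d[3], double m)) {
        if (t == NULL || t->mass == 0.0) return;
        double d[3] = { t->cm[0] - pos[0],
                        t->cm[1] - pos[1],
                        t->cm[2] - pos[2] };
        double r = sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
        if (t->is_leaf || t->size < theta * r) {
            acc(d, t->mass);              /* far enough: one term */
        } else {
            for (int k = 0; k < 8; k++)   /* too close: open cell */
                walk(t->child[k], pos, theta, acc);
        }
    }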

12 Sample Application: Gadget-2 Cosmological Simulation Code
– Simulates a system of n bodies; implements the Barnes-Hut algorithm
– Written in C and parallelized with MPI
– As part of this project, we:
– Understood the Gadget-2 code and how it is used in production
– Modified the C code to use threads in the Barnes-Hut tree algorithm
– Added performance counters to the code for measuring cache utilization

13 Presentation Outline
– Introduction
– Design & Implementation
– Performance Evaluation
– Conclusions and Future Work

14 Gadget-2 Architecture (diagram slide)

15 Code Analysis

Original Code:

    for (i = 0 to No. of particles && n = 0 to BufferSize) {
        calculate_force(i);
        for (j = 0 to No. of tasks) {
            export_particles(j);
        }
    }

Modified Code:

    parallel for (i = 0 to n) {
        calculate_force(i);
    }
    for (i = 0 to No. of particles && n = 0 to BufferSize) {
        for (j = 0 to No. of tasks) {
            export_particles(j);
        }
    }
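A hedged OpenMP rendering of the modified structure, with calculate_force() and export_particles() as stand-ins for Gadget-2's actual routines:

    /* OpenMP sketch of the modified loop: the compute-heavy force
     * loop is threaded; the export loop, which touches shared MPI
     * buffers, is kept serial. Both callees are placeholders. */
    #include <omp.h>

    void calculate_force(int i);     /* placeholder declarations */
    void export_particles(int j);

    void force_phase(int n_local, int n_tasks) {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n_local; i++)
            calculate_force(i);      /* threaded across cores    */

        for (int j = 0; j < n_tasks; j++)
            export_particles(j);     /* MPI export, kept serial  */
    }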

16 Presentation Outline
– Introduction
– Design & Implementation
– Performance Evaluation
– Conclusions and Future Work

17 Evaluation Testbed
Our cluster, Chenab, consists of nine nodes. Each node has:
– An Intel Xeon quad-core Kentsfield processor
– 2.4 GHz with a 1066 MHz FSB
– 4 MB L2 cache per two cores
– 32 KB L1 cache per core
– 2 GB main memory

18 Performance Evaluation
Performance evaluation is based on two main parameters:
– Execution time: calculated directly from MPI wall-clock timings
– Cache utilization:
– We patched the Linux kernel with the perfctr patch
– We selected the Performance API (PAPI) for hardware performance counting
– We used PAPI_L2_TCM (total cache misses) and PAPI_L2_TCA (total cache accesses) to calculate the cache miss ratio; a minimal sketch follows below
Results are shown on the upcoming slides:
– Execution time for Colliding Galaxies
– Execution time for Cluster Formation
– Execution time for a Custom Simulation
– Cache utilization for Cluster Formation
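A minimal sketch of the counter setup using PAPI's classic high-level API from the perfctr era; error handling is reduced to the essentials:

    /* Measure the L2 cache miss ratio around a region of interest. */
    #include <papi.h>
    #include <stdio.h>

    int main(void) {
        int events[2] = { PAPI_L2_TCM, PAPI_L2_TCA };
        long long v[2];

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* ... run the force-calculation phase here ... */

        if (PAPI_stop_counters(v, 2) != PAPI_OK)
            return 1;

        printf("L2 miss ratio = %.3f\n", (double)v[0] / (double)v[1]);
        return 0;
    }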

19 Execution Time for Colliding Galaxies (graph slide)

20 Execution Time for Cluster Formation (graph slide)

21 Execution Time for Custom Simulation (graph slide)

22 Cache Utilization for Cluster Formation (graph slide)
Cache utilization was measured using the hardware counters exposed by the perfctr kernel patch and the Performance API (PAPI).

23 Presentation Outline
– Introduction
– Design & Implementation
– Performance Evaluation
– Conclusions and Future Work

24 Conclusion
– We optimized our sample application, Gadget-2:
– The "MPI+threads" approach performs better
– The optimized code offers scalable performance
– We are witnessing dramatic changes in core designs for multi-core systems:
– Both heterogeneous and homogeneous designs
– Targeting a 1000-core processor will require scalable frameworks and tools for programming

25 Conclusion
Towards many-core computing (source: Dave Patterson, Overview of the Parallel Laboratory):
– Multicore: 2x / 2 yrs ≈ 64 cores in 8 years
– Manycore: 8x to 16x the core count of multicore

26 Future Work
Scalable frameworks that provide programmer-friendly, high-level constructs are very important:
– PeakStream provides GPU and CPU+GPU hybrid programs
– Cilk++ augments the C++ compiler with three new keywords (cilk_for, cilk_sync, cilk_spawn); see the sketch below
– The Research Accelerator for Multi Processors (RAMP) can be used to simulate a 1000-core processor
– Gadget-2 can be ported to GPUs using Nvidia's CUDA framework
– The 'xlc' compiler can be used to program the STI Cell processor
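A small illustration of the three keywords in Cilk-style C; fib() is the stock textbook example, unrelated to Gadget-2, and the header name follows Intel Cilk Plus rather than the original Cilk++ release:

    /* Illustrative Cilk-style sketch of cilk_spawn, cilk_sync,
     * and cilk_for. */
    #include <cilk/cilk.h>

    long fib(int n) {
        if (n < 2) return n;
        long a = cilk_spawn fib(n - 1);  /* run asynchronously  */
        long b = fib(n - 2);
        cilk_sync;                       /* wait for the spawn  */
        return a + b;
    }

    void scale(double *v, int n, double s) {
        cilk_for (int i = 0; i < n; i++) /* parallel loop       */
            v[i] *= s;
    }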

28 The Timeline (figure slide)

32 Barnes-Hut Tree (figure slide)

