High Performance Computing Group
Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE™ Architecture

Presentation transcript:

Slide 1: Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE™ Architecture

A. Kumar¹, G. Senthilkumar¹, M. Krishna¹, N. Jayam¹, P. K. Baruah¹, R. Sarma¹, S. Kapoor², A. Srinivasan³
¹ Sri Sathya Sai University, Prashanthi Nilayam, India
² IBM, Austin, skapoor@us.ibm.com
³ Florida State University, asriniva@cs.fsu.edu

Goals:
1. Determine the feasibility of Intra-Cell MPI.
2. Evaluate the impact of different design choices on performance.

Slide 2: Cell Architecture

- A PowerPC core (PPE) with 8 co-processors (SPEs), each with a 256 KB local store.
- Shared 512 MB - 2 GB main memory; SPEs access it via DMA.
- Peak speeds of 204.8 Gflop/s in single precision and 14.64 Gflop/s in double precision for the SPEs.
- 204.8 GB/s EIB bandwidth, 25.6 GB/s memory bandwidth.
- Two Cell processors can be combined to form a Cell blade with global shared memory.

[Figures: DMA put times; memory-to-memory copy using the SPE local store vs. memcpy by the PPE.]
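
The DMA measurements above come down to an SPE moving data between its 256 KB local store and an effective address in main memory with tagged MFC transfers. Below is a minimal sketch of such a DMA put using the standard SPU MFC intrinsics from spu_mfcio.h; the buffer name, buffer size, and tag choice are illustrative and not taken from the slides.

    #include <spu_mfcio.h>
    #include <stdint.h>

    /* Illustrative local-store buffer; DMA addresses and sizes must be
     * 16-byte aligned, and 128-byte alignment gives the best bandwidth. */
    static volatile char ls_buf[16384] __attribute__((aligned(128)));

    /* Copy `size` bytes from the local store to main memory at effective
     * address `ea`, then wait for the transfer to complete. */
    void dma_put_to_main_memory(uint64_t ea, uint32_t size)
    {
        const uint32_t tag = 0;                /* DMA tag group (0..31)  */

        mfc_put(ls_buf, ea, size, tag, 0, 0);  /* enqueue the DMA put    */
        mfc_write_tag_mask(1 << tag);          /* select this tag group  */
        mfc_read_tag_status_all();             /* block until it is done */
    }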

Slide 3: Intra-Cell MPI Design Choices

Cell features:
- In-order execution, but DMAs can complete out of order.
- Over 100 simultaneous DMAs can be in flight.

Constraints:
- Unconventional, heterogeneous architecture.
- SPEs have limited functionality and can act directly only on their local stores.
- SPEs access main memory through DMA.
- Use of the PPE should be limited to get good performance.

MPI design choices:
- Application data in: (i) local store or (ii) main memory.
- MPI meta-data in: (i) local store or (ii) main memory.
- PPE involvement: (i) active or (ii) only during initialization and finalization.
- Point-to-point communication mode: (i) synchronous or (ii) buffered.
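
One of the design choices above, buffered versus synchronous point-to-point communication, is resolved in the final version by message size (see slide 4). The sketch below shows only that selection logic; the helper functions and their behaviour are placeholders standing in for MPICELL internals that the slides do not show.

    #include <stdio.h>
    #include <stddef.h>

    #define EAGER_THRESHOLD 2048   /* bytes; slide 4 reports a 2 KB switch point */

    /* Hypothetical helpers, not the actual MPICELL routines. */
    static void send_buffered(const void *buf, size_t len, int dest)
    {
        /* Buffered (eager) mode: copy the payload into an intermediate
         * buffer and return without waiting for the receiver. */
        printf("buffered send of %zu bytes to rank %d\n", len, dest);
        (void)buf;
    }

    static void send_synchronous(const void *buf, size_t len, int dest)
    {
        /* Synchronous mode: wait until the receiver is ready, then move
         * the data directly and avoid the extra copy. */
        printf("synchronous send of %zu bytes to rank %d\n", len, dest);
        (void)buf;
    }

    /* Select the communication mode by message size. */
    static void mpicell_send(const void *buf, size_t len, int dest)
    {
        if (len <= EAGER_THRESHOLD)
            send_buffered(buf, len, dest);
        else
            send_synchronous(buf, len, dest);
    }

    int main(void)
    {
        static char msg[4096];
        mpicell_send(msg, 64, 1);    /* small message: buffered    */
        mpicell_send(msg, 4096, 1);  /* large message: synchronous */
        return 0;
    }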

Slide 4: Blocking Point-to-Point Communication Performance

- Results are from a 3.2 GHz Cell blade at IBM Rochester.
- The final version uses buffered mode for small messages and synchronous mode for long messages; the threshold for switching to synchronous mode is set to 2 KB.
- In these figures, the default configuration is application data in main memory, MPI data in local store, no congestion, and limited PPE involvement.
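
The slides do not show the benchmark code itself; latency and throughput figures of this kind are typically produced by a ping-pong test like the standard MPI sketch below, where half of the average round-trip time gives the one-way latency and the message size divided by that time gives the bandwidth.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 1000;
        int size = (argc > 1) ? atoi(argv[1]) : 0;   /* 0 bytes measures pure latency */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(size > 0 ? size : 1);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t = (MPI_Wtime() - t0) / (2.0 * iters);   /* one-way time per message */
        if (rank == 0)
            printf("%d bytes: latency %.3f us, bandwidth %.3f GB/s\n",
                   size, t * 1e6, size / t / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }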

Slide 5: Comparison of MPICELL with MPI on Other Hardware

MPI/Platform        Latency (0 byte)   Maximum throughput
MPICELL             0.41 µs            6.01 GB/s
MPICELL Congested   N/A                4.48 GB/s
MPICELL Small       0.65 µs            23.12 GB/s
Nemesis/Xeon        ~1.0 µs            ~0.65 GB/s
Shm/Xeon            ~1.3 µs            ~0.5 GB/s
Open MPI/Xeon       ~2.8 µs            ~0.5 GB/s
Nemesis/Opteron     ~0.34 µs           ~1.5 GB/s
Open MPI/Opteron    ~0.6 µs            ~1.0 GB/s

Slide 6: Collective Communication Example – Broadcast

Broadcast on 16 SPEs (2 processors), with several algorithms:
- TREE: pipelined, tree-structured communication based on the local store (LS).
- TREEMM: tree-structured Send/Recv-type implementation.
- AG: each SPE is responsible for a different portion of the data.
- OTA: each SPE copies the data to its location.
- G: the root copies all the data.

Broadcast with a good choice of algorithm for each data size and SPE count; the maximum main memory bandwidth is also shown.
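
For reference, a broadcast of the TREEMM flavour, i.e. a binomial tree built on Send/Recv, can be sketched with the standard MPI interface as below. The actual MPICELL collectives operate on SPE local stores and DMA and are not reproduced here; this only illustrates the tree-structured communication pattern.

    #include <mpi.h>
    #include <stdio.h>

    /* Binomial-tree broadcast on top of Send/Recv: every rank first receives
     * from its parent, then forwards the data to its children. */
    static void tree_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
    {
        int rank, size, mask = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int rel = (rank - root + size) % size;   /* rank relative to the root */

        while (mask < size) {                    /* receive from the parent */
            if (rel & mask) {
                int parent = ((rel - mask) + root) % size;
                MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
                break;
            }
            mask <<= 1;
        }
        mask >>= 1;
        while (mask > 0) {                       /* send to the children */
            if (rel + mask < size) {
                int child = (rel + mask + root) % size;
                MPI_Send(buf, count, type, child, 0, comm);
            }
            mask >>= 1;
        }
    }

    int main(int argc, char **argv)
    {
        int rank, data[4] = {0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) data[0] = 42;
        tree_bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d received %d\n", rank, data[0]);
        MPI_Finalize();
        return 0;
    }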

Slide 7: Application Performance – Matrix-Vector Multiplication

- Used a 1-D decomposition (not very efficient).
- Achieved a peak double-precision throughput of 7.8 Gflop/s for matrices of size 1024.
- The collective used was from an older implementation on the Cell, built on top of Send/Recv using tree-structured communication.
- The Opteron results used LAM MPI.

[Figure: performance of double-precision matrix-vector multiplication.]
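
A 1-D (row-wise) decomposition of y = A·x gives each rank a contiguous block of rows; every rank computes its slice of the result, and the slices are then gathered everywhere. The sketch below expresses this with plain MPI and MPI_Allgather, assuming the matrix dimension is divisible by the number of ranks; it illustrates the decomposition only, not the Cell-specific kernel or the older tree-structured collective mentioned above.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024   /* matrix dimension used on the slide */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int rows = N / nprocs;                              /* rows owned by this rank */
        double *A = malloc((size_t)rows * N * sizeof *A);   /* local block of rows     */
        double *x = malloc(N * sizeof *x);                  /* full input vector       */
        double *y = malloc(N * sizeof *y);                  /* full result vector      */
        double *y_loc = malloc(rows * sizeof *y_loc);       /* local slice of y        */

        for (int i = 0; i < rows * N; i++) A[i] = 1.0;      /* placeholder data */
        for (int j = 0; j < N; j++)        x[j] = 1.0;

        for (int i = 0; i < rows; i++) {                    /* local matrix-vector product */
            double s = 0.0;
            for (int j = 0; j < N; j++)
                s += A[(size_t)i * N + j] * x[j];
            y_loc[i] = s;
        }

        /* Gather every rank's slice so that all ranks hold the full y. */
        MPI_Allgather(y_loc, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, MPI_COMM_WORLD);

        if (rank == 0)
            printf("y[0] = %.1f (expected %d)\n", y[0], N);

        free(A); free(x); free(y); free(y_loc);
        MPI_Finalize();
        return 0;
    }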

Slide 8: Conclusions and Future Work

Conclusions:
- The Cell processor has good potential for MPI applications:
  - The PPE should have a very limited role.
  - Very high bandwidths with application data in the local store.
  - High bandwidth and low latency even with application data in main memory.
- But the local store should be used effectively, with double buffering to hide latency.
- Main memory bandwidth is then the bottleneck.
- Good performance for collectives even with two Cell processors.

Current and future work:
- Implemented:
  - Collective communication operations optimized for contiguous data.
  - Blocking and non-blocking communication.
- Future work:
  - Optimize collectives for derived data types with non-contiguous data.
  - Optimize point-to-point communication on a blade with two processors.
  - More features, such as topologies, etc.
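
The double buffering mentioned under the conclusions is the usual Cell technique of overlapping DMA with computation: while the SPE works on one local-store buffer, the next chunk is already in flight under a different tag. A minimal sketch of the pattern follows, with an illustrative chunk size and an empty compute step; it is generic SPE code, not the MPICELL implementation.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 16384   /* bytes per DMA chunk; illustrative */

    /* Two local-store buffers and two tag groups let the SPE compute on one
     * chunk while the next one is being fetched. */
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    static void process(volatile char *data, uint32_t n) { (void)data; (void)n; /* compute here */ }

    void stream_from_main_memory(uint64_t ea, uint32_t nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);      /* prefetch the first chunk */

        for (uint32_t i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                      /* start the next transfer early */
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);             /* wait only for the current chunk */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);

            cur = nxt;
        }
    }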

