Message Passing On Tightly-Interconnected Multi-Core Processors. James Psota and Anant Agarwal, MIT CSAIL
Technology Scaling Enables Multi-Cores. Multi-cores offer a novel environment for parallel computing. [Figure: cluster vs. multi-core]
Traditional Communication On Multi-Processors. Shared Memory –Shared caches or memory –Remote DMA (RDMA) Interconnects –Ethernet TCP/IP –Myrinet –Scalable Coherent Interconnect (SCI) [Figures: AMD Dual-Core Opteron; Beowulf cluster]
On-Chip Networks Enable Fast Communication Some multi-cores offer… –tightly integrated on-chip networks –direct access to hardware resources (no OS layers) –fast interrupts MIT Raw Processor used for experimentation and validation
Parallel Programming is Hard Must orchestrate computation and communication Extra resources present both opportunity and challenge Trivial to deadlock Constraints on message sizes No operating system support
rMPI’s Approach Goals –robust, deadlock-free, scalable programming interface –easy to program through high-level routines Challenge –exploit hardware resources for efficient communication –don’t sacrifice performance
Outline Introduction Background Design Results Related Work
The Raw Multi-Core Processor 16 identical tiles –processing core –network routers 4 register-mapped on-chip networks Direct access to hardware resources Hardware fabricated in ASIC process [Figure: Raw Processor]
Raw’s General Dynamic Network Handles run-time events –interrupts, dynamic messages Network guarantees atomic, in-order messages Dimension-ordered wormhole routed Maximum message length: 31 words Blocking sends/receives Minimal network buffering
MPI: Portable Message Passing API Gives programmers high-level abstractions for parallel programming –send/receive, scatter/gather, reductions, etc. MPI is a standard, not an implementation –many implementations for many HW platforms –over 200 API functions MPI applications portable across MPI-compliant systems Can impose high overhead
MPI Semantics: Cooperative Communication Data exchanged cooperatively via explicit send and receive Receiving process’s memory only modified with its explicit participation Combines communication and synchronization [Figure: processes 0 and 1, each with a private address space, exchange data over a communication channel through matching send/receive pairs identified by tags; arriving messages raise an interrupt at the receiver]
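To make these semantics concrete, here is a minimal example against the standard MPI C API (ordinary MPI application code, not rMPI internals); the ranks, tag, and payload value are illustrative.

```c
/* Minimal example of the cooperative semantics above, written against the
 * standard MPI C API. Rank 0 sends one integer to rank 1; rank 1's memory
 * changes only when it posts the matching receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        /* Sender names the destination rank and a tag. */
        MPI_Send(&data, 1, MPI_INT, /*dest=*/1, /*tag=*/17, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receiver's memory is modified only with its explicit participation. */
        MPI_Recv(&data, 1, MPI_INT, /*src=*/0, /*tag=*/17, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}
```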
Outline Introduction Background Design Results Related Work
rMPI System Architecture
High-Level MPI Layer Argument checking (MPI semantics) Buffer prep Calls appropriate low-level functions LAM/MPI partially ported
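A self-contained sketch of what this layer does, assuming a hypothetical low-level entry point (the names highlevel_send and lowlevel_send are illustrative, not rMPI's actual internals): arguments are validated against MPI semantics, the buffer size is computed, and the call is delegated downward.

```c
/* Illustrative sketch (not rMPI source) of the high-level layer's job:
 * check MPI semantics on the arguments, then delegate to a hypothetical
 * low-level send provided by the point-to-point layer. */
#include <stdio.h>
#include <stddef.h>

/* Hypothetical point-to-point entry point; here just a stub. */
static int lowlevel_send(const void *buf, size_t bytes, int dest, int tag)
{
    printf("lowlevel_send: %zu bytes to rank %d, tag %d\n", bytes, dest, tag);
    return 0;
}

/* Simplified stand-in for the high-level MPI_Send path. */
static int highlevel_send(const void *buf, int count, size_t elem_size,
                          int dest, int tag, int comm_size)
{
    if (count < 0 || (buf == NULL && count > 0)) return -1;  /* bad buffer/count */
    if (dest < 0 || dest >= comm_size)           return -2;  /* bad rank         */
    if (tag < 0)                                 return -3;  /* bad tag          */
    return lowlevel_send(buf, (size_t)count * elem_size, dest, tag);
}

int main(void)
{
    int data[4] = { 1, 2, 3, 4 };
    return highlevel_send(data, 4, sizeof(int), /*dest=*/1, /*tag=*/17,
                          /*comm_size=*/16);
}
```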
Collective Communications Layer Algorithms for collective operations –Broadcast –Scatter/Gather –Reduce Invokes low-level functions
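As an illustration of layering collectives on point-to-point routines, here is a simple linear broadcast written against the standard MPI C API; rMPI's actual collective algorithms may differ, so this only shows the layering idea.

```c
/* Sketch of layering a collective on point-to-point routines: a simple
 * linear broadcast built from MPI_Send/MPI_Recv. */
#include <mpi.h>

static void linear_bcast(int *buf, int count, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        /* Root sends the buffer to every other rank, one message each. */
        for (int dst = 0; dst < size; dst++)
            if (dst != root)
                MPI_Send(buf, count, MPI_INT, dst, /*tag=*/0, comm);
    } else {
        MPI_Recv(buf, count, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int value = 0, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 7;                 /* root supplies the value */
    linear_bcast(&value, 1, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```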
Point-to-Point Layer Low-level send/receive routines Highly optimized interrupt-driven receive design Packetization and reassembly
Outline Introduction Background Design Results Related Work
rMPI Evaluation How much overhead does high-level interface impose? –compare against hand-coded GDN Does it scale? –with problem size and number of processors? –compare against hand-coded GDN –compare against commercial MPI implementation on cluster
End-to-End Latency Overhead vs. Hand-Coded (1) Experiment measures latency for: –sender: load message from memory –sender: break up and send message –receiver: receive message –receiver: store message to memory
End-to-End Latency Overhead vs. Hand-Coded (2) 1 word: 481% overhead; 1000 words: 33% overhead [Chart annotations: packet management complexity; overflows cache]
Performance Scaling: Jacobi [Charts: 16x16 input matrix; 2048x2048 input matrix]
Performance Scaling: Jacobi, 16 processors [Chart annotations: sequential version; cache capacity overflow]
Overhead: Jacobi, rMPI vs. Hand-Coded 16 tiles: 5% overhead [Chart annotations: many small messages; memory access synchronization]
Matrix Multiplication: rMPI vs. LAM/MPI [Chart annotation: many smaller messages; smaller message length has less effect on LAM]
Trapezoidal Integration: rMPI vs. LAM/MPI
Pi Estimation: rMPI vs. LAM/MPI
Related Work Low-latency communication networks –iWarp, Alewife, INMOS Multi-core processors –VIRAM, Wavescalar, TRIPS, POWER 4, Pentium D Alternatives to programming Raw –scalar operand network, CFlow, rawcc MPI implementations –OpenMPI, LAM/MPI, MPICH
Summary rMPI provides easy yet powerful programming model for multi-cores Scales better than commercial MPI implementation Low overhead over hand-coded applications
Thanks! For more information, see Master’s Thesis:
rMPI messages broken into packets Receiver buffers and demultiplexes packets from different sources Messages received upon interrupt, and buffered until user-level receive GDN messages have a max length of 31 words [Figure: two rMPI sender processes and one rMPI receiver process, with receipt triggering an interrupt; rMPI packet format for a 65-payload-word MPI message]
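A self-contained sketch of the packetization constraint: with the 31-word GDN limit and an assumed 3-word header (source rank, tag, payload words in this packet; the real rMPI header format may differ), a 65-word MPI payload splits into three packets.

```c
/* Illustrative packetization sketch. The GDN caps a message at 31 words,
 * so an MPI-level message must be split into packets. The 3-word header
 * used here is an assumed layout, not rMPI's actual packet format. */
#include <stdio.h>

#define GDN_MAX_WORDS 31
#define HEADER_WORDS  3
#define MAX_PAYLOAD   (GDN_MAX_WORDS - HEADER_WORDS)   /* 28 words */

/* Stand-in for a register-mapped GDN send of one packet. */
static void gdn_send_packet(int dest, const int *hdr, const int *payload, int n)
{
    (void)payload;
    printf("packet to %d: src=%d tag=%d payload_words=%d\n",
           dest, hdr[0], hdr[1], n);
}

/* Break an MPI-level message into GDN-sized packets. */
static void send_message(int src, int dest, int tag, const int *words, int len)
{
    for (int off = 0; off < len; off += MAX_PAYLOAD) {
        int n = (len - off < MAX_PAYLOAD) ? (len - off) : MAX_PAYLOAD;
        int hdr[HEADER_WORDS] = { src, tag, n };
        gdn_send_packet(dest, hdr, words + off, n);
    }
}

int main(void)
{
    int msg[65] = { 0 };
    /* A 65-word payload splits into three packets: 28 + 28 + 9 words. */
    send_message(/*src=*/0, /*dest=*/1, /*tag=*/17, msg, 65);
    return 0;
}
```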
rMPI: enabling MPI programs on Raw rMPI… –is compatible with current MPI software –gives programmers already familiar with MPI an easy interface to program Raw –gives programmers fine-grain control over their programs when automatic parallelization tools are not adequate –gives users a robust, deadlock-free, and high-performance programming model with which to program Raw ► easily write programs on Raw without overly sacrificing performance
Packet boundary bookkeeping Receiver must handle packet interleaving across multiple interrupt handler invocations
Receive-side packet management Global data structures accessed by interrupt handler and MPI Receive threads Data structure design minimizes pointer chasing for fast lookups No memcpy for receive- before-send case
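An illustrative sketch (not rMPI source) of the receive-before-send fast path: the interrupt handler consults a table of posted receives indexed by source rank; if a matching receive is already posted, payload words go straight from the network into the user's buffer, otherwise they are stashed in a per-source unexpected-message buffer for a later receive to drain. All names and sizes here are hypothetical.

```c
/* Sketch of receive-side packet management with a no-copy fast path. */
#include <stdio.h>

#define MAX_RANKS   16
#define UNEXP_WORDS 4096

struct posted_recv {
    int  active;     /* has the user posted a matching receive? */
    int  tag;
    int *user_buf;   /* destination buffer supplied by the user */
    int  filled;     /* payload words written so far */
};

struct unexpected_buf {
    int words[UNEXP_WORDS];
    int count;
};

static struct posted_recv    posted[MAX_RANKS];      /* indexed by source rank */
static struct unexpected_buf unexpected[MAX_RANKS];  /* indexed by source rank */

/* Fake network read so the sketch is self-contained; on Raw this would be
 * a register-mapped read from the GDN input FIFO. */
static int fifo_word = 0;
static int gdn_read_word(void) { return fifo_word++; }

/* Called from the interrupt handler for each arriving packet. */
static void handle_packet(int src, int tag, int payload_words)
{
    struct posted_recv *pr = &posted[src];
    int *dst;

    if (pr->active && pr->tag == tag) {
        /* Receive already posted: words flow straight from the network
         * into the user's buffer -- no intermediate copy. */
        dst = pr->user_buf + pr->filled;
        pr->filled += payload_words;
    } else {
        /* No matching receive yet: stash words until one is posted. */
        dst = unexpected[src].words + unexpected[src].count;
        unexpected[src].count += payload_words;
    }
    for (int i = 0; i < payload_words; i++)
        dst[i] = gdn_read_word();
}

int main(void)
{
    int buf[8];
    posted[2] = (struct posted_recv){ .active = 1, .tag = 17, .user_buf = buf };
    handle_packet(/*src=*/2, /*tag=*/17, /*payload_words=*/8);   /* fast path */
    handle_packet(/*src=*/3, /*tag=*/42, /*payload_words=*/4);   /* buffered  */
    printf("filled=%d unexpected_from_3=%d\n",
           posted[2].filled, unexpected[3].count);
    return 0;
}
```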
User-thread CFG for receiving
Interrupt handler CFG logic supports MPI semantics and packet construction
Future work: improving performance Comparison of rMPI to standard cluster running off-the-shelf MPI library Improve system performance –further minimize MPI overhead –spatially-aware collective communication algorithms –further Raw-specific optimizations Investigate new APIs better suited for TPAs
Future work: HW extensions Simple hardware tweaks may significantly improve performance –larger input/output FIFOs –simple switch logic/demultiplexing to handle packetization could drastically simplify software logic –larger header words (64 bit?) would allow for much larger (atomic) packets (also, current header only scales to 32 x 32 tile fabrics)
Conclusions MPI standard was designed for “standard” parallel machines, not for tiled architectures –MPI may no longer make sense for tiled designs Simple hardware could significantly reduce packet management overhead and increase rMPI performance