Message Passing On Tightly-Interconnected Multi-Core Processors James Psota and Anant Agarwal MIT CSAIL.

1 Message Passing On Tightly-Interconnected Multi-Core Processors James Psota and Anant Agarwal MIT CSAIL

2 Technology Scaling Enables Multi-Cores
Multi-cores offer a novel environment for parallel computing
(figure: cluster vs. multi-core)

3 Traditional Communication On Multi-Processors
Shared Memory
–Shared caches or memory
–Remote DMA (RDMA)
Interconnects
–Ethernet TCP/IP
–Myrinet
–Scalable Coherent Interconnect (SCI)
(figures: AMD Dual-Core Opteron, Beowulf Cluster)

4 On-Chip Networks Enable Fast Communication
Some multi-cores offer…
–tightly integrated on-chip networks
–direct access to hardware resources (no OS layers)
–fast interrupts
MIT Raw Processor used for experimentation and validation

5 Parallel Programming is Hard
Must orchestrate computation and communication
Extra resources present both opportunity and challenge
Trivial to deadlock
Constraints on message sizes
No operating system support

6 rMPI’s Approach
Goals
–robust, deadlock-free, scalable programming interface
–easy to program through high-level routines
Challenges
–exploit hardware resources for efficient communication
–don’t sacrifice performance

7 Outline Introduction Background Design Results Related Work

8 The Raw Multi-Core Processor
16 identical tiles
–processing core
–network routers
4 register-mapped on-chip networks
Direct access to hardware resources
Hardware fabricated in ASIC process
(figure: Raw processor)

9 Raw’s General Dynamic Network
Handles run-time events
–interrupts, dynamic messages
Network guarantees atomic, in-order messages
Dimension-ordered, wormhole routed
Maximum message length: 31 words
Blocking sends/receives
Minimal network buffering (sketch below)
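To make these constraints concrete, here is a minimal sketch of sending one message over a register-mapped, length-limited dynamic network. The port symbol GDN_OUTPUT_PORT and the helper name gdn_send are illustrative assumptions, not Raw's actual interface; only the 31-word limit and the blocking behavior come from the slide.

    #include <stdint.h>
    #include <assert.h>

    #define GDN_MAX_WORDS 31              /* hardware limit per GDN message (from the slide) */

    /* Hypothetical register-mapped output port: writing a word injects it into
     * the network. The symbol is an assumption, not Raw's real name. */
    extern volatile uint32_t GDN_OUTPUT_PORT;

    /* Send one GDN message: a routing header followed by up to 31 payload words.
     * If the output FIFO fills, the write stalls (blocking send, minimal buffering). */
    static void gdn_send(uint32_t header, const uint32_t *payload, unsigned nwords)
    {
        assert(nwords <= GDN_MAX_WORDS);
        GDN_OUTPUT_PORT = header;          /* dimension-ordered route to the destination tile */
        for (unsigned i = 0; i < nwords; i++)
            GDN_OUTPUT_PORT = payload[i];  /* words arrive atomically and in order */
    }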

10 MPI: Portable Message Passing API
Gives programmers high-level abstractions for parallel programming
–send/receive, scatter/gather, reductions, etc. (example below)
MPI is a standard, not an implementation
–many implementations for many HW platforms
–over 200 API functions
MPI applications portable across MPI-compliant systems
Can impose high overhead
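As a reminder of what these abstractions buy the programmer, the minimal example below uses only standard MPI calls (the per-rank work is made up), so the same source would compile against rMPI, LAM/MPI, or MPICH.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, local, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        local = rank + 1;                       /* stand-in for real per-process work */

        /* One high-level call replaces a whole pattern of sends and receives. */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %d\n", size, total);

        MPI_Finalize();
        return 0;
    }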

11 MPI Semantics: Cooperative Communication
Data exchanged cooperatively via explicit send and receive (example below)
Receiving process’s memory only modified with its explicit participation
Combines communication and synchronization
(figure: processes 0 and 1, each with a private address space, exchanging tagged messages over a communication channel via send(dest=1, tag=17), send(dest=1, tag=42), matching recv calls, and arrival interrupts)
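The tag matching in the figure corresponds directly to MPI_Send/MPI_Recv arguments. A hedged sketch (tags 17 and 42 are taken from the figure; the surrounding function is invented):

    #include <mpi.h>

    /* Both transfers complete only when the receiver explicitly posts a matching
     * receive -- the cooperative semantics on the slide. Matching is on
     * (source, tag), so each message lands in the buffer meant for it. */
    void exchange(int rank)
    {
        int a = 1, b = 2, x, y;

        if (rank == 0) {
            MPI_Send(&a, 1, MPI_INT, 1, 17, MPI_COMM_WORLD);
            MPI_Send(&b, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&y, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }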

12 Outline Introduction Background Design Results Related Work

13 rMPI System Architecture

14 High-Level MPI Layer
Argument checking (MPI semantics)
Buffer prep
Calls appropriate low-level functions (sketch below)
LAM/MPI partially ported
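A hedged sketch of how such an entry point might be layered; the internal function name rmpi_pt2pt_send and the exact checks are assumptions, not rMPI's actual code.

    #include <mpi.h>

    /* Point-to-point layer entry point; the name is an illustrative assumption. */
    extern int rmpi_pt2pt_send(const void *buf, int count, MPI_Datatype type,
                               int dest, int tag, MPI_Comm comm);

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        /* High-level layer: enforce MPI semantics before touching the network. */
        if (count < 0)                return MPI_ERR_COUNT;
        if (buf == NULL && count > 0) return MPI_ERR_BUFFER;
        if (tag < 0)                  return MPI_ERR_TAG;
        /* ...rank-range, datatype, and communicator checks elided... */

        /* Hand the prepared buffer to the point-to-point layer below. */
        return rmpi_pt2pt_send(buf, count, datatype, dest, tag, comm);
    }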

15 Collective Communications Layer
Algorithms for collective operations
–Broadcast (sketch below)
–Scatter/Gather
–Reduce
Invokes low-level functions
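For example, a broadcast can be built entirely from the layer below. The binomial-tree sketch here is a standard technique and only a guess at how rMPI structures it, with rmpi_send/rmpi_recv standing in for the low-level routines.

    /* Binomial-tree broadcast built on point-to-point sends. rmpi_send and
     * rmpi_recv are stand-ins for the low-level routines, not rMPI's real names. */
    extern void rmpi_send(int dest, void *buf, int nbytes, int tag);
    extern void rmpi_recv(int src,  void *buf, int nbytes, int tag);

    void bcast(void *buf, int nbytes, int root, int rank, int nprocs)
    {
        int rel = (rank - root + nprocs) % nprocs;   /* rank relative to the root */

        /* Every non-root rank first receives the data from its tree parent. */
        for (int mask = 1; mask < nprocs; mask <<= 1) {
            if (rel & mask) {
                rmpi_recv((rank - mask + nprocs) % nprocs, buf, nbytes, 0);
                break;
            }
        }
        /* Then forward to children: ranks at offsets below our lowest set bit. */
        for (int mask = 1; mask < nprocs; mask <<= 1) {
            if (rel & mask)
                break;
            if (rel + mask < nprocs)
                rmpi_send((rank + mask) % nprocs, buf, nbytes, 0);
        }
    }

With 16 tiles this delivers the data in four rounds rather than fifteen sequential sends from the root.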

16 Point-to-Point Layer
Low-level send/receive routines
Highly optimized interrupt-driven receive design
Packetization and reassembly (sketch below)
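A simplified sender-side sketch of the packetization step; the 3-word software header, its field layout, and the helper names are assumptions for illustration, with only the 31-word GDN cap taken from the slides.

    #include <stdint.h>
    #include <string.h>

    #define GDN_MAX_WORDS    31   /* GDN hardware limit per message             */
    #define PKT_HDR_WORDS     3   /* assumed header: source rank, tag, seq/last */
    #define PKT_MAX_PAYLOAD  (GDN_MAX_WORDS - PKT_HDR_WORDS)

    /* Raw network send, as sketched earlier; not Raw's real interface. */
    extern void gdn_send_words(int dest_tile, const uint32_t *words, unsigned n);

    /* Split an arbitrary-length MPI message into GDN-sized packets. */
    void pt2pt_send(int dest, int src, int tag, const uint32_t *payload, unsigned len)
    {
        uint32_t pkt[GDN_MAX_WORDS];
        unsigned seq = 0;

        while (len > 0 || seq == 0) {                 /* always send at least one packet   */
            unsigned chunk = len < PKT_MAX_PAYLOAD ? len : PKT_MAX_PAYLOAD;

            pkt[0] = (uint32_t)src;                   /* receiver demultiplexes on source  */
            pkt[1] = (uint32_t)tag;
            pkt[2] = (seq << 1) | (chunk == len);     /* sequence number + last-packet bit */
            memcpy(&pkt[PKT_HDR_WORDS], payload, chunk * sizeof(uint32_t));

            gdn_send_words(dest, pkt, PKT_HDR_WORDS + chunk);
            payload += chunk;
            len     -= chunk;
            seq++;
        }
    }

The receive side reassembles these packets under interrupt control; the backup slides sketch that data structure.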

17 Outline Introduction Background Design Results Related Work

18 rMPI Evaluation
How much overhead does the high-level interface impose?
–compare against hand-coded GDN
Does it scale?
–with problem size and number of processors
–compare against hand-coded GDN
–compare against a commercial MPI implementation on a cluster

19 End-to-End Latency Overhead vs. Hand-Coded (1)
Experiment measures latency for:
–sender: load message from memory
–sender: break up and send message
–receiver: receive message
–receiver: store message to memory

20 End-to-End Latency Overhead vs. Hand-Coded (2)
1-word message: 481% overhead
1000-word message: 33% overhead
(chart annotations: packet management complexity; overflows cache)

21 Performance Scaling: Jacobi
(charts: 16x16 input matrix; 2048x2048 input matrix)

22 Performance Scaling: Jacobi, 16 processors
(chart annotations: sequential version; cache capacity overflow)

23 Overhead: Jacobi, rMPI vs. Hand-Coded
16 tiles: 5% overhead
(chart annotations: many small messages; memory access synchronization)

24 Matrix Multiplication: rMPI vs. LAM/MPI
(chart annotation: many smaller messages; smaller message length has less effect on LAM)

25 Trapezoidal Integration: rMPI vs. LAM/MPI
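Trapezoidal integration is a standard MPI benchmark; the sketch below shows the usual structure (the integrand, interval, and trapezoid count are made up, not necessarily the paper's): each rank integrates its own sub-interval and a single reduction combines the partial sums, so message traffic is tiny relative to computation.

    #include <mpi.h>
    #include <stdio.h>

    static double f(double x) { return x * x; }       /* example integrand (assumption) */

    int main(int argc, char **argv)
    {
        const double a = 0.0, b = 1.0;                /* integration interval (assumption) */
        const long   n = 1L << 20;                    /* total number of trapezoids        */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double h   = (b - a) / n;
        long   per = n / size;
        long   lo  = rank * per;
        long   hi  = (rank == size - 1) ? n : lo + per;

        /* Each rank sums the trapezoids on its own sub-interval. */
        double local = 0.0;
        for (long i = lo; i < hi; i++)
            local += 0.5 * h * (f(a + i * h) + f(a + (i + 1) * h));

        /* One reduction gathers the partial sums on rank 0. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("integral of x^2 on [0,1] ~= %f\n", total);

        MPI_Finalize();
        return 0;
    }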

26 Pi Estimation: rMPI vs. LAM/MPI

27 Related Work
Low-latency communication networks
–iWarp, Alewife, INMOS
Multi-core processors
–VIRAM, Wavescalar, TRIPS, POWER4, Pentium D
Alternatives to programming Raw
–scalar operand network, CFlow, rawcc
MPI implementations
–OpenMPI, LAM/MPI, MPICH

28 Summary
rMPI provides an easy yet powerful programming model for multi-cores
Scales better than a commercial MPI implementation
Low overhead relative to hand-coded applications

29 Thanks! For more information, see Master’s Thesis: http://cag.lcs.mit.edu/~jim/publications/ms.pdf


31 rMPI messages broken into packets
Receiver buffers and demultiplexes packets from different sources
Messages received upon interrupt, and buffered until the user-level receive
GDN messages have a max length of 31 words
(figure: rMPI packet format for a 65-payload-word MPI message; two rMPI sender processes and one receiver process taking interrupts) (worked example below)
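The arithmetic behind the 65-word example, assuming a small per-packet software header (the 3-word header below is an assumption; the 31-word cap is the GDN's): the message spans three GDN packets, e.g. 28 + 28 + 9 payload words.

    #define GDN_MAX_WORDS     31                      /* GDN cap per message (from the slide) */
    #define PKT_HDR_WORDS      3                      /* assumed per-packet software header   */
    #define PKT_MAX_PAYLOAD   (GDN_MAX_WORDS - PKT_HDR_WORDS)

    /* ceil(payload / payload-per-packet): 65 words -> (65 + 27) / 28 = 3 packets */
    static inline unsigned packets_needed(unsigned payload_words)
    {
        return (payload_words + PKT_MAX_PAYLOAD - 1) / PKT_MAX_PAYLOAD;
    }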

32 rMPI: enabling MPI programs on Raw
rMPI…
–is compatible with current MPI software
–gives programmers already familiar with MPI an easy interface to program Raw
–gives programmers fine-grain control over their programs when automatic parallelization tools are not adequate
–gives users a robust, deadlock-free, and high-performance programming model with which to program Raw
► easily write programs on Raw without overly sacrificing performance

33 Packet boundary bookkeeping
Receiver must handle packet interleaving across multiple interrupt handler invocations

34 Receive-side packet management
Global data structures accessed by interrupt handler and MPI receive threads (sketch below)
Data structure design minimizes pointer chasing for fast lookups
No memcpy for the receive-before-send case
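A hedged sketch of the kind of bookkeeping this implies (all names and fields are assumptions): a table indexed by source rank whose entries track partially reassembled messages, appended to by the interrupt handler and drained by the MPI receive path; if the receive was posted before the packets arrive, the user's buffer is used directly, avoiding the extra memcpy.

    #include <stdint.h>

    #define MAX_TILES 16                   /* one slot per potential sender on 16-tile Raw */

    /* One in-flight (possibly partially reassembled) incoming message. */
    typedef struct pending_msg {
        int       tag;                     /* MPI tag carried in the packet header        */
        uint32_t *buf;                     /* reassembly buffer, or the user's buffer if
                                              the receive was posted first (no memcpy)    */
        unsigned  words_received;          /* payload words accumulated so far            */
        unsigned  expected_seq;            /* next packet sequence number expected        */
        int       complete;                /* set once the last-packet bit is seen        */
        struct pending_msg *next;          /* short per-source chain                      */
    } pending_msg_t;

    /* Indexed by source rank: the interrupt handler appends packets here; an
     * MPI_Recv for (src, tag) scans only the chain for that source. */
    static pending_msg_t *pending_from[MAX_TILES];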

35 User-thread CFG for receiving

36 Interrupt handler CFG
Logic supports MPI semantics and packet construction

37 Future work: improving performance
Comparison of rMPI to a standard cluster running an off-the-shelf MPI library
Improve system performance
–further minimize MPI overhead
–spatially-aware collective communication algorithms
–further Raw-specific optimizations
Investigate new APIs better suited for tiled processor architectures (TPAs)

38 Future work: HW extensions
Simple hardware tweaks may significantly improve performance
–larger input/output FIFOs
–simple switch logic/demultiplexing to handle packetization could drastically simplify software logic
–larger header words (64 bit?) would allow for much larger (atomic) packets (also, the current header only scales to 32 x 32 tile fabrics)

39 Conclusions
The MPI standard was designed for “standard” parallel machines, not for tiled architectures
–MPI may no longer make sense for tiled designs
Simple hardware could significantly reduce packet management overhead, increasing rMPI performance

