Message Passing On Tightly-Interconnected Multi-Core Processors James Psota and Anant Agarwal MIT CSAIL.

1 Message Passing On Tightly-Interconnected Multi-Core Processors James Psota and Anant Agarwal MIT CSAIL

2 Technology Scaling Enables Multi-Cores
Multi-cores offer a novel environment for parallel computing
(figure: cluster vs. multi-core)

3 Traditional Communication On Multi-Processors
Shared Memory
–Shared caches or memory
–Remote DMA (RDMA)
Interconnects
–Ethernet TCP/IP
–Myrinet
–Scalable Coherent Interconnect (SCI)
(figures: AMD Dual-Core Opteron, Beowulf Cluster)

4 On-Chip Networks Enable Fast Communication
Some multi-cores offer…
–tightly integrated on-chip networks
–direct access to hardware resources (no OS layers)
–fast interrupts
MIT Raw Processor used for experimentation and validation

5 Parallel Programming is Hard
Must orchestrate computation and communication
Extra resources present both opportunity and challenge
Trivial to deadlock
Constraints on message sizes
No operating system support

6 rMPI’s Approach
Goals
–robust, deadlock-free, scalable programming interface
–easy to program through high-level routines
Challenges
–exploit hardware resources for efficient communication
–don’t sacrifice performance

7 Outline Introduction Background Design Results Related Work

8 The Raw Multi-Core Processor
16 identical tiles
–processing core
–network routers
4 register-mapped on-chip networks
Direct access to hardware resources
Hardware fabricated in ASIC process
(figure: Raw processor)

9 Raw’s General Dynamic Network
Handles run-time events
–interrupts, dynamic messages
Network guarantees atomic, in-order messages
Dimension-ordered, wormhole routed
Maximum message length: 31 words
Blocking sends/receives
Minimal network buffering (sketch below)
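To make these constraints concrete, here is a minimal sketch of sending one message over a register-mapped, length-limited dynamic network. The port symbol GDN_OUTPUT_PORT and the helper name gdn_send are illustrative assumptions, not Raw's actual interface; only the 31-word limit and the blocking behavior come from the slide.

    #include <stdint.h>
    #include <assert.h>

    #define GDN_MAX_WORDS 31              /* hardware limit per GDN message (from the slide) */

    /* Hypothetical register-mapped output port: writing a word injects it into
     * the network. The symbol is an assumption, not Raw's real name. */
    extern volatile uint32_t GDN_OUTPUT_PORT;

    /* Send one GDN message: a routing header followed by up to 31 payload words.
     * If the output FIFO fills, the write stalls (blocking send, minimal buffering). */
    static void gdn_send(uint32_t header, const uint32_t *payload, unsigned nwords)
    {
        assert(nwords <= GDN_MAX_WORDS);
        GDN_OUTPUT_PORT = header;          /* dimension-ordered route to the destination tile */
        for (unsigned i = 0; i < nwords; i++)
            GDN_OUTPUT_PORT = payload[i];  /* words arrive atomically and in order */
    }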

10 MPI: Portable Message Passing API
Gives programmers high-level abstractions for parallel programming
–send/receive, scatter/gather, reductions, etc. (example below)
MPI is a standard, not an implementation
–many implementations for many HW platforms
–over 200 API functions
MPI applications portable across MPI-compliant systems
Can impose high overhead
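As a reminder of what these abstractions buy the programmer, the minimal example below uses only standard MPI calls (the per-rank work is made up), so the same source would compile against rMPI, LAM/MPI, or MPICH.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, local, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        local = rank + 1;                       /* stand-in for real per-process work */

        /* One high-level call replaces a whole pattern of sends and receives. */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %d\n", size, total);

        MPI_Finalize();
        return 0;
    }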

11 MPI Semantics: Cooperative Communication
Data exchanged cooperatively via explicit send and receive (example below)
Receiving process’s memory only modified with its explicit participation
Combines communication and synchronization
(figure: processes 0 and 1, each with a private address space, exchanging tagged messages over a communication channel via send(dest=1, tag=17), send(dest=1, tag=42), matching recv calls, and arrival interrupts)
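The tag matching in the figure corresponds directly to MPI_Send/MPI_Recv arguments. A hedged sketch (tags 17 and 42 are taken from the figure; the surrounding function is invented):

    #include <mpi.h>

    /* Both transfers complete only when the receiver explicitly posts a matching
     * receive -- the cooperative semantics on the slide. Matching is on
     * (source, tag), so each message lands in the buffer meant for it. */
    void exchange(int rank)
    {
        int a = 1, b = 2, x, y;

        if (rank == 0) {
            MPI_Send(&a, 1, MPI_INT, 1, 17, MPI_COMM_WORLD);
            MPI_Send(&b, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&y, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }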

12 Outline Introduction Background Design Results Related Work

13 rMPI System Architecture

14 High-Level MPI Layer
Argument checking (MPI semantics)
Buffer prep
Calls appropriate low-level functions (sketch below)
LAM/MPI partially ported
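A hedged sketch of how such an entry point might be layered; the internal function name rmpi_pt2pt_send and the exact checks are assumptions, not rMPI's actual code.

    #include <mpi.h>

    /* Point-to-point layer entry point; the name is an illustrative assumption. */
    extern int rmpi_pt2pt_send(const void *buf, int count, MPI_Datatype type,
                               int dest, int tag, MPI_Comm comm);

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        /* High-level layer: enforce MPI semantics before touching the network. */
        if (count < 0)                return MPI_ERR_COUNT;
        if (buf == NULL && count > 0) return MPI_ERR_BUFFER;
        if (tag < 0)                  return MPI_ERR_TAG;
        /* ...rank-range, datatype, and communicator checks elided... */

        /* Hand the prepared buffer to the point-to-point layer below. */
        return rmpi_pt2pt_send(buf, count, datatype, dest, tag, comm);
    }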

15 Collective Communications Layer
Algorithms for collective operations
–Broadcast (sketch below)
–Scatter/Gather
–Reduce
Invokes low-level functions
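For example, a broadcast can be built entirely from the layer below. The binomial-tree sketch here is a standard technique and only a guess at how rMPI structures it, with rmpi_send/rmpi_recv standing in for the low-level routines.

    /* Binomial-tree broadcast built on point-to-point sends. rmpi_send and
     * rmpi_recv are stand-ins for the low-level routines, not rMPI's real names. */
    extern void rmpi_send(int dest, void *buf, int nbytes, int tag);
    extern void rmpi_recv(int src,  void *buf, int nbytes, int tag);

    void bcast(void *buf, int nbytes, int root, int rank, int nprocs)
    {
        int rel = (rank - root + nprocs) % nprocs;   /* rank relative to the root */

        /* Every non-root rank first receives the data from its tree parent. */
        for (int mask = 1; mask < nprocs; mask <<= 1) {
            if (rel & mask) {
                rmpi_recv((rank - mask + nprocs) % nprocs, buf, nbytes, 0);
                break;
            }
        }
        /* Then forward to children: ranks at offsets below our lowest set bit. */
        for (int mask = 1; mask < nprocs; mask <<= 1) {
            if (rel & mask)
                break;
            if (rel + mask < nprocs)
                rmpi_send((rank + mask) % nprocs, buf, nbytes, 0);
        }
    }

With 16 tiles this delivers the data in four rounds rather than fifteen sequential sends from the root.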

16 Point-to-Point Layer
Low-level send/receive routines
Highly optimized interrupt-driven receive design
Packetization and reassembly (sketch below)
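A simplified sender-side sketch of the packetization step; the 3-word software header, its field layout, and the helper names are assumptions for illustration, with only the 31-word GDN cap taken from the slides.

    #include <stdint.h>
    #include <string.h>

    #define GDN_MAX_WORDS    31   /* GDN hardware limit per message             */
    #define PKT_HDR_WORDS     3   /* assumed header: source rank, tag, seq/last */
    #define PKT_MAX_PAYLOAD  (GDN_MAX_WORDS - PKT_HDR_WORDS)

    /* Raw network send, as sketched earlier; not Raw's real interface. */
    extern void gdn_send_words(int dest_tile, const uint32_t *words, unsigned n);

    /* Split an arbitrary-length MPI message into GDN-sized packets. */
    void pt2pt_send(int dest, int src, int tag, const uint32_t *payload, unsigned len)
    {
        uint32_t pkt[GDN_MAX_WORDS];
        unsigned seq = 0;

        while (len > 0 || seq == 0) {                 /* always send at least one packet   */
            unsigned chunk = len < PKT_MAX_PAYLOAD ? len : PKT_MAX_PAYLOAD;

            pkt[0] = (uint32_t)src;                   /* receiver demultiplexes on source  */
            pkt[1] = (uint32_t)tag;
            pkt[2] = (seq << 1) | (chunk == len);     /* sequence number + last-packet bit */
            memcpy(&pkt[PKT_HDR_WORDS], payload, chunk * sizeof(uint32_t));

            gdn_send_words(dest, pkt, PKT_HDR_WORDS + chunk);
            payload += chunk;
            len     -= chunk;
            seq++;
        }
    }

The receive side reassembles these packets under interrupt control; the backup slides sketch that data structure.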

17 Outline Introduction Background Design Results Related Work

18 rMPI Evaluation
How much overhead does the high-level interface impose?
–compare against hand-coded GDN
Does it scale?
–with problem size and number of processors
–compare against hand-coded GDN
–compare against a commercial MPI implementation on a cluster

19 End-to-End Latency Overhead vs. Hand-Coded (1)
Experiment measures latency for:
–sender: load message from memory
–sender: break up and send message
–receiver: receive message
–receiver: store message to memory

20 End-to-End Latency Overhead vs. Hand-Coded (2)
1-word message: 481% overhead
1000-word message: 33% overhead
(chart annotations: packet management complexity; overflows cache)

21 Performance Scaling: Jacobi
(charts: 16x16 input matrix; 2048x2048 input matrix)

22 Performance Scaling: Jacobi, 16 processors
(chart annotations: sequential version; cache capacity overflow)

23 Overhead: Jacobi, rMPI vs. Hand-Coded
16 tiles: 5% overhead
(chart annotations: many small messages; memory access synchronization)

24 Matrix Multiplication: rMPI vs. LAM/MPI
(chart annotation: many smaller messages; smaller message length has less effect on LAM)

25 Trapezoidal Integration: rMPI vs. LAM/MPI
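Trapezoidal integration is a standard MPI benchmark; the sketch below shows the usual structure (the integrand, interval, and trapezoid count are made up, not necessarily the paper's): each rank integrates its own sub-interval and a single reduction combines the partial sums, so message traffic is tiny relative to computation.

    #include <mpi.h>
    #include <stdio.h>

    static double f(double x) { return x * x; }       /* example integrand (assumption) */

    int main(int argc, char **argv)
    {
        const double a = 0.0, b = 1.0;                /* integration interval (assumption) */
        const long   n = 1L << 20;                    /* total number of trapezoids        */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double h   = (b - a) / n;
        long   per = n / size;
        long   lo  = rank * per;
        long   hi  = (rank == size - 1) ? n : lo + per;

        /* Each rank sums the trapezoids on its own sub-interval. */
        double local = 0.0;
        for (long i = lo; i < hi; i++)
            local += 0.5 * h * (f(a + i * h) + f(a + (i + 1) * h));

        /* One reduction gathers the partial sums on rank 0. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("integral of x^2 on [0,1] ~= %f\n", total);

        MPI_Finalize();
        return 0;
    }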

26 Pi Estimation: rMPI vs. LAM/MPI

27 Related Work
Low-latency communication networks
–iWarp, Alewife, INMOS
Multi-core processors
–VIRAM, Wavescalar, TRIPS, POWER4, Pentium D
Alternatives to programming Raw
–scalar operand network, CFlow, rawcc
MPI implementations
–OpenMPI, LAM/MPI, MPICH

28 Summary
rMPI provides an easy yet powerful programming model for multi-cores
Scales better than a commercial MPI implementation
Low overhead relative to hand-coded applications

29 Thanks! For more information, see Master’s Thesis: http://cag.lcs.mit.edu/~jim/publications/ms.pdf


31 rMPI messages broken into packets
Receiver buffers and demultiplexes packets from different sources
Messages received upon interrupt, and buffered until the user-level receive
GDN messages have a max length of 31 words
(figure: rMPI packet format for a 65-payload-word MPI message; two rMPI sender processes and one receiver process taking interrupts) (worked example below)
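The arithmetic behind the 65-word example, assuming a small per-packet software header (the 3-word header below is an assumption; the 31-word cap is the GDN's): the message spans three GDN packets, e.g. 28 + 28 + 9 payload words.

    #define GDN_MAX_WORDS     31                      /* GDN cap per message (from the slide) */
    #define PKT_HDR_WORDS      3                      /* assumed per-packet software header   */
    #define PKT_MAX_PAYLOAD   (GDN_MAX_WORDS - PKT_HDR_WORDS)

    /* ceil(payload / payload-per-packet): 65 words -> (65 + 27) / 28 = 3 packets */
    static inline unsigned packets_needed(unsigned payload_words)
    {
        return (payload_words + PKT_MAX_PAYLOAD - 1) / PKT_MAX_PAYLOAD;
    }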

32 rMPI: enabling MPI programs on Raw
rMPI…
–is compatible with current MPI software
–gives programmers already familiar with MPI an easy interface to program Raw
–gives programmers fine-grain control over their programs when automatic parallelization tools are not adequate
–gives users a robust, deadlock-free, and high-performance programming model with which to program Raw
► easily write programs on Raw without overly sacrificing performance

33 Packet boundary bookkeeping
Receiver must handle packet interleaving across multiple interrupt handler invocations

34 Receive-side packet management
Global data structures accessed by interrupt handler and MPI receive threads (sketch below)
Data structure design minimizes pointer chasing for fast lookups
No memcpy for the receive-before-send case
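A hedged sketch of the kind of bookkeeping this implies (all names and fields are assumptions): a table indexed by source rank whose entries track partially reassembled messages, appended to by the interrupt handler and drained by the MPI receive path; if the receive was posted before the packets arrive, the user's buffer is used directly, avoiding the extra memcpy.

    #include <stdint.h>

    #define MAX_TILES 16                   /* one slot per potential sender on 16-tile Raw */

    /* One in-flight (possibly partially reassembled) incoming message. */
    typedef struct pending_msg {
        int       tag;                     /* MPI tag carried in the packet header        */
        uint32_t *buf;                     /* reassembly buffer, or the user's buffer if
                                              the receive was posted first (no memcpy)    */
        unsigned  words_received;          /* payload words accumulated so far            */
        unsigned  expected_seq;            /* next packet sequence number expected        */
        int       complete;                /* set once the last-packet bit is seen        */
        struct pending_msg *next;          /* short per-source chain                      */
    } pending_msg_t;

    /* Indexed by source rank: the interrupt handler appends packets here; an
     * MPI_Recv for (src, tag) scans only the chain for that source. */
    static pending_msg_t *pending_from[MAX_TILES];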

35 User-thread CFG for receiving

36 Interrupt handler CFG
Logic supports MPI semantics and packet construction

37 Future work: improving performance
Comparison of rMPI to a standard cluster running an off-the-shelf MPI library
Improve system performance
–further minimize MPI overhead
–spatially-aware collective communication algorithms
–further Raw-specific optimizations
Investigate new APIs better suited for tiled processor architectures (TPAs)

38 Future work: HW extensions
Simple hardware tweaks may significantly improve performance
–larger input/output FIFOs
–simple switch logic/demultiplexing to handle packetization could drastically simplify software logic
–larger header words (64 bit?) would allow for much larger (atomic) packets (also, the current header only scales to 32 x 32 tile fabrics)

39 Conclusions
The MPI standard was designed for “standard” parallel machines, not for tiled architectures
–MPI may no longer make sense for tiled designs
Simple hardware could significantly reduce packet management overhead, increasing rMPI performance

