Message Passing On Tightly-Interconnected Multi-Core Processors. James Psota and Anant Agarwal, MIT CSAIL
Technology Scaling Enables Multi-Cores. Multi-cores offer a novel environment for parallel computing. [Figure: cluster vs. multi-core]
Traditional Communication On Multi-Processors. Shared Memory –Shared caches or memory –Remote DMA (RDMA) Interconnects –Ethernet TCP/IP –Myrinet –Scalable Coherent Interconnect (SCI) [Figures: AMD Dual-Core Opteron; Beowulf cluster]
On-Chip Networks Enable Fast Communication Some multi-cores offer… –tightly integrated on-chip networks –direct access to hardware resources (no OS layers) –fast interrupts MIT Raw Processor used for experimentation and validation
Parallel Programming is Hard Must orchestrate computation and communication Extra resources present both opportunity and challenge Trivial to deadlock Constraints on message sizes No operating system support
rMPI’s Approach Goals –robust, deadlock-free, scalable programming interface –easy to program through high-level routines Challenge –exploit hardware resources for efficient communication –don’t sacrifice performance
Outline Introduction Background Design Results Related Work
The Raw Multi-Core Processor 16 identical tiles –processing core –network routers 4 register-mapped on-chip networks Direct access to hardware resources Hardware fabricated in ASIC process [Figure: Raw Processor]
Raw’s General Dynamic Network Handles run-time events –interrupts, dynamic messages Network guarantees atomic, in-order messages Dimension-ordered wormhole routed Maximum message length: 31 words Blocking sends/receives Minimal network buffering
MPI: Portable Message Passing API Gives programmers high-level abstractions for parallel programming –send/receive, scatter/gather, reductions, etc. MPI is a standard, not an implementation –many implementations for many HW platforms –over 200 API functions MPI applications portable across MPI-compliant systems Can impose high overhead
MPI Semantics: Cooperative Communication Data exchanged cooperatively via explicit send and receive Receiving process’s memory only modified with its explicit participation Combines communication and synchronization [Figure: processes 0 and 1, each with a private address space, exchange data over a communication channel through matching send/receive pairs identified by tags; arriving messages raise an interrupt at the receiver]
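To make these semantics concrete, here is a minimal example against the standard MPI C API (ordinary MPI application code, not rMPI internals); the ranks, tag, and payload value are illustrative.

```c
/* Minimal example of the cooperative semantics above, written against the
 * standard MPI C API. Rank 0 sends one integer to rank 1; rank 1's memory
 * changes only when it posts the matching receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        /* Sender names the destination rank and a tag. */
        MPI_Send(&data, 1, MPI_INT, /*dest=*/1, /*tag=*/17, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receiver's memory is modified only with its explicit participation. */
        MPI_Recv(&data, 1, MPI_INT, /*src=*/0, /*tag=*/17, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}
```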
Outline Introduction Background Design Results Related Work
rMPI System Architecture
High-Level MPI Layer Argument checking (MPI semantics) Buffer prep Calls appropriate low-level functions LAM/MPI partially ported
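A self-contained sketch of what this layer does, assuming a hypothetical low-level entry point (the names highlevel_send and lowlevel_send are illustrative, not rMPI's actual internals): arguments are validated against MPI semantics, the buffer size is computed, and the call is delegated downward.

```c
/* Illustrative sketch (not rMPI source) of the high-level layer's job:
 * check MPI semantics on the arguments, then delegate to a hypothetical
 * low-level send provided by the point-to-point layer. */
#include <stdio.h>
#include <stddef.h>

/* Hypothetical point-to-point entry point; here just a stub. */
static int lowlevel_send(const void *buf, size_t bytes, int dest, int tag)
{
    printf("lowlevel_send: %zu bytes to rank %d, tag %d\n", bytes, dest, tag);
    return 0;
}

/* Simplified stand-in for the high-level MPI_Send path. */
static int highlevel_send(const void *buf, int count, size_t elem_size,
                          int dest, int tag, int comm_size)
{
    if (count < 0 || (buf == NULL && count > 0)) return -1;  /* bad buffer/count */
    if (dest < 0 || dest >= comm_size)           return -2;  /* bad rank         */
    if (tag < 0)                                 return -3;  /* bad tag          */
    return lowlevel_send(buf, (size_t)count * elem_size, dest, tag);
}

int main(void)
{
    int data[4] = { 1, 2, 3, 4 };
    return highlevel_send(data, 4, sizeof(int), /*dest=*/1, /*tag=*/17,
                          /*comm_size=*/16);
}
```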
Collective Communications Layer Algorithms for collective operations –Broadcast –Scatter/Gather –Reduce Invokes low-level functions
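As an illustration of layering collectives on point-to-point routines, here is a simple linear broadcast written against the standard MPI C API; rMPI's actual collective algorithms may differ, so this only shows the layering idea.

```c
/* Sketch of layering a collective on point-to-point routines: a simple
 * linear broadcast built from MPI_Send/MPI_Recv. */
#include <mpi.h>

static void linear_bcast(int *buf, int count, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        /* Root sends the buffer to every other rank, one message each. */
        for (int dst = 0; dst < size; dst++)
            if (dst != root)
                MPI_Send(buf, count, MPI_INT, dst, /*tag=*/0, comm);
    } else {
        MPI_Recv(buf, count, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int value = 0, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 7;                 /* root supplies the value */
    linear_bcast(&value, 1, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```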
Point-to-Point Layer Low-level send/receive routines Highly optimized interrupt-driven receive design Packetization and reassembly
Outline Introduction Background Design Results Related Work
rMPI Evaluation How much overhead does high-level interface impose? –compare against hand-coded GDN Does it scale? –with problem size and number of processors? –compare against hand-coded GDN –compare against commercial MPI implementation on cluster
End-to-End Latency Overhead vs. Hand-Coded (1) Experiment measures latency for: –sender: load message from memory –sender: break up and send message –receiver: receive message –receiver: store message to memory
End-to-End Latency Overhead vs. Hand-Coded (2) 1 word: 481% overhead; 1000 words: 33% overhead [Chart annotations: packet management complexity; overflows cache]
Performance Scaling: Jacobi [Charts: 16x16 input matrix; 2048x2048 input matrix]
Performance Scaling: Jacobi, 16 processors [Chart annotations: sequential version; cache capacity overflow]
Overhead: Jacobi, rMPI vs. Hand-Coded 16 tiles: 5% overhead [Chart annotations: many small messages; memory access synchronization]
Matrix Multiplication: rMPI vs. LAM/MPI [Chart annotation: many smaller messages; smaller message length has less effect on LAM]
Trapezoidal Integration: rMPI vs. LAM/MPI
Pi Estimation: rMPI vs. LAM/MPI
Related Work Low-latency communication networks –iWarp, Alewife, INMOS Multi-core processors –VIRAM, Wavescalar, TRIPS, POWER 4, Pentium D Alternatives to programming Raw –scalar operand network, CFlow, rawcc MPI implementations –OpenMPI, LAM/MPI, MPICH
Summary rMPI provides easy yet powerful programming model for multi-cores Scales better than commercial MPI implementation Low overhead over hand-coded applications
Thanks! For more information, see Master’s Thesis:
rMPI messages broken into packets Receiver buffers and demultiplexes packets from different sources Messages received upon interrupt, and buffered until user-level receive GDN messages have a max length of 31 words [Figure: two rMPI sender processes and one rMPI receiver process, with receipt triggering an interrupt; rMPI packet format for a 65-payload-word MPI message]
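A self-contained sketch of the packetization constraint: with the 31-word GDN limit and an assumed 3-word header (source rank, tag, payload words in this packet; the real rMPI header format may differ), a 65-word MPI payload splits into three packets.

```c
/* Illustrative packetization sketch. The GDN caps a message at 31 words,
 * so an MPI-level message must be split into packets. The 3-word header
 * used here is an assumed layout, not rMPI's actual packet format. */
#include <stdio.h>

#define GDN_MAX_WORDS 31
#define HEADER_WORDS  3
#define MAX_PAYLOAD   (GDN_MAX_WORDS - HEADER_WORDS)   /* 28 words */

/* Stand-in for a register-mapped GDN send of one packet. */
static void gdn_send_packet(int dest, const int *hdr, const int *payload, int n)
{
    (void)payload;
    printf("packet to %d: src=%d tag=%d payload_words=%d\n",
           dest, hdr[0], hdr[1], n);
}

/* Break an MPI-level message into GDN-sized packets. */
static void send_message(int src, int dest, int tag, const int *words, int len)
{
    for (int off = 0; off < len; off += MAX_PAYLOAD) {
        int n = (len - off < MAX_PAYLOAD) ? (len - off) : MAX_PAYLOAD;
        int hdr[HEADER_WORDS] = { src, tag, n };
        gdn_send_packet(dest, hdr, words + off, n);
    }
}

int main(void)
{
    int msg[65] = { 0 };
    /* A 65-word payload splits into three packets: 28 + 28 + 9 words. */
    send_message(/*src=*/0, /*dest=*/1, /*tag=*/17, msg, 65);
    return 0;
}
```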
rMPI: enabling MPI programs on Raw rMPI… –is compatible with current MPI software –gives programmers already familiar with MPI an easy interface to program Raw –gives programmers fine-grain control over their programs when automatic parallelization tools are not adequate –gives users a robust, deadlock-free, and high-performance programming model with which to program Raw ► easily write programs on Raw without overly sacrificing performance
Packet boundary bookkeeping Receiver must handle packet interleaving across multiple interrupt handler invocations
Receive-side packet management Global data structures accessed by interrupt handler and MPI Receive threads Data structure design minimizes pointer chasing for fast lookups No memcpy for receive- before-send case
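An illustrative sketch (not rMPI source) of the receive-before-send fast path: the interrupt handler consults a table of posted receives indexed by source rank; if a matching receive is already posted, payload words go straight from the network into the user's buffer, otherwise they are stashed in a per-source unexpected-message buffer for a later receive to drain. All names and sizes here are hypothetical.

```c
/* Sketch of receive-side packet management with a no-copy fast path. */
#include <stdio.h>

#define MAX_RANKS   16
#define UNEXP_WORDS 4096

struct posted_recv {
    int  active;     /* has the user posted a matching receive? */
    int  tag;
    int *user_buf;   /* destination buffer supplied by the user */
    int  filled;     /* payload words written so far */
};

struct unexpected_buf {
    int words[UNEXP_WORDS];
    int count;
};

static struct posted_recv    posted[MAX_RANKS];      /* indexed by source rank */
static struct unexpected_buf unexpected[MAX_RANKS];  /* indexed by source rank */

/* Fake network read so the sketch is self-contained; on Raw this would be
 * a register-mapped read from the GDN input FIFO. */
static int fifo_word = 0;
static int gdn_read_word(void) { return fifo_word++; }

/* Called from the interrupt handler for each arriving packet. */
static void handle_packet(int src, int tag, int payload_words)
{
    struct posted_recv *pr = &posted[src];
    int *dst;

    if (pr->active && pr->tag == tag) {
        /* Receive already posted: words flow straight from the network
         * into the user's buffer -- no intermediate copy. */
        dst = pr->user_buf + pr->filled;
        pr->filled += payload_words;
    } else {
        /* No matching receive yet: stash words until one is posted. */
        dst = unexpected[src].words + unexpected[src].count;
        unexpected[src].count += payload_words;
    }
    for (int i = 0; i < payload_words; i++)
        dst[i] = gdn_read_word();
}

int main(void)
{
    int buf[8];
    posted[2] = (struct posted_recv){ .active = 1, .tag = 17, .user_buf = buf };
    handle_packet(/*src=*/2, /*tag=*/17, /*payload_words=*/8);   /* fast path */
    handle_packet(/*src=*/3, /*tag=*/42, /*payload_words=*/4);   /* buffered  */
    printf("filled=%d unexpected_from_3=%d\n",
           posted[2].filled, unexpected[3].count);
    return 0;
}
```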
User-thread CFG for receiving
Interrupt handler CFG logic supports MPI semantics and packet construction
Future work: improving performance Comparison of rMPI to standard cluster running off-the-shelf MPI library Improve system performance –further minimize MPI overhead –spatially-aware collective communication algorithms –further Raw-specific optimizations Investigate new APIs better suited for TPAs
Future work: HW extensions Simple hardware tweaks may significantly improve performance –larger input/output FIFOs –simple switch logic/demultiplexing to handle packetization could drastically simplify software logic –larger header words (64 bit?) would allow for much larger (atomic) packets (also, current header only scales to 32 x 32 tile fabrics)
Conclusions MPI standard was designed for “standard” parallel machines, not for tiled architectures –MPI may no longer make sense for tiled designs Simple hardware could significantly reduce packet management overhead and increase rMPI performance