ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many-Core Architectures
P. Lai, P. Balaji, R. Thakur and D. K. Panda
Computer Science and Engg., Ohio State University; Math. and Computer Sci., Argonne National Laboratory

Presentation transcript:

ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many-Core Architectures
P. Lai, P. Balaji, R. Thakur and D. K. Panda
Computer Science and Engg., Ohio State University
Math. and Computer Sci., Argonne National Laboratory
Pavan Balaji, Argonne National Laboratory, ISC (06/23/2009)

Hardware Offload Engines
Specialized engines:
– Widely used in High-End Computing (HEC) systems for accelerating task processing
– Built for specific purposes: not easily extendable or programmable, and serving only a small niche of applications
Trends in HEC systems: increasing size and complexity
– Difficult for hardware to deal with this complexity, e.g.:
  - Fault tolerance (understanding application and system requirements is complex and too environment-specific)
  - Multi-path communication (the optimal solution is NP-complete)

General Purpose Processors
Multi- and many-core processors:
– Quad- and hex-core processors are commodity components today
– Intel Larrabee will have 16 cores; Intel Terascale will have 80 cores
– Simultaneous Multi-threading (SMT or Hyperthreading) is becoming common
Expected future:
– Each physical node will have a massive number of processing elements (Terascale on the chip)
(Figure: future multicore systems)

Benefits of Multi-core Architectures
Cost:
– Cheap: Moore's law will continue to drive costs down
– HEC is a small market; multi-cores != HEC
Flexibility:
– General purpose processing units
– A huge number of tools already exist to program and utilize them (e.g., debuggers, performance measuring tools)
– Extremely flexible and extendable

Multi-core vs. Hardware Accelerators
Will multi-core architectures eradicate hardware accelerators?
– Unlikely: hardware accelerators have their own benefits
– Hardware accelerators provide two advantages:
  - More processing power and better on-board memory bandwidth
  - Dedicated processing capabilities: they run compute kernels in a dedicated manner and do not deal with shared-mode processing like CPUs
– But more complex machines need dedicated processing for more things
– More powerful hardware offload techniques are possible, but with decreasing returns!

ProOnE: General Purpose Onload Engine
Hybrid hardware-software engines:
– Utilize hardware offload engines for the low-hanging fruit, where the return on investment is maximum
– Utilize software "onload" engines for more complex tasks; these can imitate some of the benefits of hardware offload engines, such as dedicated processing
Basic idea of ProOnE:
– Dedicate a small subset of processing elements on a multi-core architecture for "protocol onloading"
– Different capabilities are added to ProOnE as plugins
– The application interacts with ProOnE for dedicated processing
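To make "dedicate a small subset of processing elements" concrete, here is a minimal sketch, assuming a Linux node and the standard affinity API; the daemon name and its polling loop are placeholders of mine, not ProOnE source.

/* Minimal sketch (assumed, not ProOnE code): pin an onload daemon to one core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void poll_onload_requests(void)
{
    /* Placeholder: a real daemon would poll shared-memory queues for
     * application requests and progress them. */
}

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 7;   /* e.g., the last core of a dual quad-core node */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Dedicated processing: the daemon spins on its own work instead of
     * sharing the core with application processes. */
    for (;;)
        poll_onload_requests();
}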

ProOnE: Basic Overview
– ProOnE does not try to take over tasks performed well by hardware offload engines; it only supplements their capabilities
(Figure: ProOnE sits alongside software middleware such as MPI and can be used on top of basic network hardware, networks with basic acceleration, or networks with advanced acceleration)

Presentation Layout
– Introduction and Motivation
– ProOnE: A General Purpose Protocol Onload Engine
– Experimental Results and Analysis
– Conclusions and Future Work

ProOnE Infrastructure
Basic idea:
– Onload a part of the work to the ProOnE daemons
– Application processes and ProOnE daemons communicate through intra-node and inter-node communication
(Figure: application processes and ProOnE daemons on Node 1 and Node 2, communicating through per-node shared memory and across the network)
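A minimal sketch of the intra-node channel implied here, assuming POSIX shared memory; the segment name, size, and request string are purely illustrative, not ProOnE's actual layout.

/* Sketch (illustrative, not ProOnE source): map a per-node shared segment
 * that a ProOnE-style daemon and application processes both attach to. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/proone_demo_segment"   /* hypothetical name */
#define SHM_SIZE 4096

int main(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) != 0) { perror("ftruncate"); return 1; }

    void *region = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    /* An application process would enqueue requests here; the dedicated
     * daemon polls the same region from its own mapping. */
    strcpy((char *)region, "SEND request: rank=1 tag=0 len=1048576");

    munmap(region, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);   /* clean up the demo segment */
    return 0;
}

Both the daemon and the application processes map the same named segment, which is what the initialization steps two slides below set up.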

Design Components
Intra-node communication:
– Each ProOnE process allocates a shared memory segment for communication with the application processes
– Application processes use this shared memory to communicate requests, completions, signals, etc., to and from ProOnE
– Queues and hash functions are used to manage the shared memory (a simple request-queue sketch follows below)
Inter-node communication:
– Utilize any network features that are available
– Build logical all-to-all connectivity on the available communication infrastructure
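The sketch below shows one plausible shape for such a shared-memory request queue; the field names, slot count, and single-producer/single-consumer assumption are mine, not taken from the ProOnE implementation.

/* Assumed layout (not ProOnE's actual data structures): a fixed-size request
 * ring that application processes fill and the ProOnE daemon drains. */
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 64

typedef struct {
    int32_t  src_rank;     /* requesting MPI rank                         */
    int32_t  dst_rank;     /* destination rank                            */
    int32_t  tag;
    int32_t  comm_id;      /* communicator context id                     */
    uint64_t length;       /* message length in bytes                     */
    uint64_t buffer_addr;  /* user buffer address (for kernel-level copy) */
} proone_request_t;

typedef struct {
    volatile uint32_t head;               /* advanced by producers (applications) */
    volatile uint32_t tail;               /* advanced by the consumer (daemon)    */
    proone_request_t  slots[QUEUE_DEPTH]; /* the request ring itself              */
} proone_queue_t;

/* Daemon side: copy out the next pending request, if any. */
static int queue_poll(proone_queue_t *q, proone_request_t *out)
{
    if (q->tail == q->head)
        return 0;                          /* empty */
    *out = q->slots[q->tail % QUEUE_DEPTH];
    q->tail++;
    return 1;
}

int main(void)
{
    static proone_queue_t q;               /* in practice this lives in the shared segment */
    proone_request_t req = { 0, 1, 0, 0, 1 << 20, 0 };

    q.slots[q.head % QUEUE_DEPTH] = req;   /* application side: enqueue a SEND request */
    q.head++;

    proone_request_t out;
    if (queue_poll(&q, &out))              /* daemon side: drain it */
        printf("got request: dst=%d len=%llu\n",
               (int)out.dst_rank, (unsigned long long)out.length);
    return 0;
}

A real implementation would add locking or lock-free indices for multiple application processes, which is exactly the contention issue addressed a few slides later.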

Design Components (contd.)
Initialization infrastructure:
– ProOnE processes:
  - Create their own shared memory region
  - Attach to the shared memory created by other ProOnE processes
  - Connect to remote ProOnE processes
  - Listen for and accept connections from other processes (both ProOnE and application processes)
– Application processes:
  - Attach to the ProOnE shared memory segments
  - Connect to remote ProOnE processes

ProOnE Capabilities
Can "onload" tasks for any component in the application executable:
– The application itself: useful for progress management (master-worker models) and work-stealing based applications (e.g., mpiBLAST)
– Communication middleware: the MPI stack can utilize it to offload certain tasks that can take advantage of dedicated processing (having to assume that the CPU is shared with other processes makes many things inefficient)
– Network protocol stacks: e.g., Greg Regnier's TCP Onload Engine

Case Study: MPI Rendezvous Protocol
– Used for large messages, to avoid copies and/or large unexpected data
(Figure: the sender's MPI_Send issues an RTS; once the receiver's MPI_Recv is posted, the receiver replies with a CTS; the sender then transfers the DATA)
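As a runnable point of reference, the small MPI program below sends a 1 MB message, which most MPI implementations (MPICH2 included) handle through exactly this RTS/CTS rendezvous path rather than the eager path; the 1 MB size is an illustrative choice, since the eager/rendezvous threshold is implementation- and network-dependent.

/* Minimal MPI example: a 1 MB message, typically sent via rendezvous. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;                 /* 1 MB of char data */
    char *buf = malloc(N);

    if (rank == 0) {
        /* Sender: the library issues an RTS and waits for the CTS before
         * pushing the payload. */
        MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receiver: posting the receive is what allows the CTS reply. */
        MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}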

Issues with MPI Rendezvous
Communication progress in MPI Rendezvous:
– Delay in detecting control messages
– Hardware support for one-sided communication is useful, but not perfect; many cases are expensive to implement in hardware
– Result: bad computation/communication overlap!
(Figure: the sender calls MPI_Isend and computes, the receiver calls MPI_Irecv and computes; the RTS is not processed until the receiver reaches MPI_Wait, so the CTS is delayed, the data transfer happens inside the sender's MPI_Wait, and there is no overlap)
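To make the progress problem concrete, here is a small runnable program in the spirit of the slide: without asynchronous progress, the rendezvous handshake for the 1 MB MPI_Isend typically advances only when rank 0 re-enters the MPI library, so the MPI_Test calls during the compute phase act as a manual (and imperfect) progress poke. The message size, loop counts and compute kernel are illustrative assumptions.

/* Manual progress "poking" during computation (illustrative benchmark). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for application work done between communication calls. */
static double compute_chunk(void)
{
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++)
        acc += (double)i * 1e-9;
    return acc;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;              /* 1 MB: large enough for rendezvous */
    char *buf = malloc(N);
    MPI_Request req;
    double sink = 0.0;

    if (rank == 0) {
        MPI_Isend(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        for (int step = 0; step < 10; step++) {
            sink += compute_chunk();
            int flag;
            /* Without a dedicated progress engine, this Test is what lets the
             * library notice the CTS and start the data transfer. */
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("done (sink=%g)\n", sink);
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}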

MPI Rendezvous with ProOnE
– ProOnE is a dedicated processing engine, so there is no delay because the receiver is "busy" with something else
– Communication progress is similar to the original protocol; completion (CMPLT) notifications are included as appropriate
(Figure: the MPI sender and receiver post SEND/RECV requests to ProOnE 0 and ProOnE 1 through shared memory (Shmem 0, Shmem 1); the daemons read the requests, try to match a posted RECV against an incoming RTS, exchange RTS, CTS, DATA and CMPLT messages, and handle both the case where the receiver arrives earlier and the case where it arrives later)

Design Issues and Solutions
MPICH2's internal message matching uses a three-tuple:
– (src, comm, tag)
– Issue: out-of-order messages (even when the communication sub-system itself is ordered)
– Solution: add a sequence number to the matching tuple
Example: MPI rank 0 issues Send1 (src=0, tag=0, comm=0, len=1M) and Send2 (src=0, tag=0, comm=0, len=1K); MPI rank 1 posts Recv1 (len=1M) and Recv2 (len=1K). Because both sends carry the same (src, tag, comm), a request that reaches the matching engine out of order can be paired with the wrong receive; the sequence number restores the intended pairing.
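A compilable sketch of the extended matching tuple; the struct and helper are my illustration of the idea, not MPICH2 or ProOnE internals.

/* Extended matching tuple: (src, comm, tag) plus destination rank and a
 * sequence number, as described on the slides (structure assumed). */
#include <stdio.h>

typedef struct {
    int src;     /* sending rank             */
    int dst;     /* destination rank (added) */
    int comm;    /* communicator context id  */
    int tag;
    int seq;     /* sequence number (added)  */
} match_tuple_t;

static int tuples_match(match_tuple_t a, match_tuple_t b)
{
    return a.src == b.src && a.dst == b.dst &&
           a.comm == b.comm && a.tag == b.tag &&
           a.seq == b.seq;
}

int main(void)
{
    /* Two sends with identical (src, comm, tag): only the sequence number
     * distinguishes the 1 MB message from the 1 KB message. */
    match_tuple_t send1 = { 0, 1, 0, 0, 0 };   /* seq=0, len = 1M */
    match_tuple_t send2 = { 0, 1, 0, 0, 1 };   /* seq=1, len = 1K */
    match_tuple_t recv1 = { 0, 1, 0, 0, 0 };   /* expects the 1M message */

    printf("recv1 matches send1: %d\n", tuples_match(recv1, send1));  /* 1 */
    printf("recv1 matches send2: %d\n", tuples_match(recv1, send2));  /* 0 */
    return 0;
}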

Design Issues and Solutions (contd.)
ProOnE is "shared" between all processes on the node:
– It cannot otherwise distinguish requests destined for different ranks
– Solution: add a destination rank to the matching tuple
Shared memory lock contention:
– The shared memory is divided into three segments: (1) SEND/RECV requests, (2) RTS messages, and (3) CMPLT notifications
– Segments (1) and (2) use the same lock to avoid mismatches
Memory mapping:
– ProOnE needs access to application memory to avoid extra copies
– Kernel-assisted direct copy is used to achieve this (knem, LiMIC)
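One possible rendering of that locking discipline, assuming POSIX process-shared mutexes; the segment sizes and slot layout are invented for the sketch and the kernel-assisted copy (knem/LiMIC) is deliberately left out.

/* Assumed layout (not ProOnE's actual structures): three segments, with
 * segments (1) and (2) protected by one common process-shared lock so a
 * SEND/RECV request and its RTS cannot interleave inconsistently, and
 * CMPLT notifications guarded separately. */
#include <pthread.h>

#define N_SLOTS 64

typedef struct {
    pthread_mutex_t req_rts_lock;               /* shared by segments (1) and (2) */
    pthread_mutex_t cmplt_lock;                 /* segment (3) only               */
    char send_recv_requests[N_SLOTS][64];       /* (1) SEND/RECV requests         */
    char rts_messages[N_SLOTS][64];             /* (2) RTS messages               */
    char cmplt_notifications[N_SLOTS][64];      /* (3) CMPLT notifications        */
} proone_shmem_t;

/* Initialize the locks so they work across processes mapping the segment. */
static void proone_shmem_init(proone_shmem_t *shm)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shm->req_rts_lock, &attr);
    pthread_mutex_init(&shm->cmplt_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

int main(void)
{
    static proone_shmem_t shm;   /* in practice this lives in the mmap'd segment */
    proone_shmem_init(&shm);

    pthread_mutex_lock(&shm.req_rts_lock);
    /* ... enqueue a SEND/RECV request or an RTS under the shared lock ... */
    pthread_mutex_unlock(&shm.req_rts_lock);
    return 0;
}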

Presentation Layout
– Introduction and Motivation
– ProOnE: A General Purpose Protocol Onload Engine
– Experimental Results and Analysis
– Conclusions and Future Work

Experimental Setup
Experimental testbed:
– Dual quad-core Xeon processors
– 4 GB DDR SDRAM
– Linux kernel
– Nodes connected with Mellanox InfiniBand DDR adapters
Experimental design:
– One core on each node is used to run ProOnE
– Compare the performance of ProOnE-enabled MPICH2 ("ProOnE") with vanilla MPICH2 ("Original")

Overview of the MPICH2 Software Stack
High-performance and widely portable MPI:
– Supports MPI-1, MPI-2 and MPI-2.1
– Supports multiple network stacks (TCP, MX, GM)
– Commercial support by many vendors: IBM (integrated stack distributed by Argonne); Microsoft and Intel (in the process of integrating their stacks)
– Used by many derivative implementations: MVAPICH2, BG, Intel-MPI, MS-MPI, SC-MPI, Cray, Myricom
MPICH2 and its derivatives support many Top500 systems:
– Estimated at more than 90%
– Available with many software distributions
– Integrated with ROMIO MPI-IO and the MPE profiling library

Computation/Communication Overlap
Computation/communication ratio: W/T (W is the time spent in Compute(); T is the total time of the communication-plus-compute sequence)
– Sender overlap benchmark (the peer simply calls MPI_Recv(array)):
  MPI_Isend(array); Compute(); MPI_Wait();
– Receiver overlap benchmark (the peer simply calls MPI_Send(array)):
  MPI_Irecv(array); Compute(); MPI_Wait();
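A runnable sender-side version of this micro-benchmark: rank 0 measures W (time in the compute loop) and T (time for Isend + Compute + Wait) and reports W/T, where a ratio near 1 means the communication was almost fully hidden behind the computation. The message size and iteration count are illustrative choices, not the paper's exact parameters.

/* Sender-side overlap micro-benchmark (illustrative parameters). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void compute(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 1.000001 + 0.5;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;                       /* 1M doubles = 8 MB message */
    double *array = malloc(N * sizeof *array);
    double *work  = malloc(N * sizeof *work);
    for (int i = 0; i < N; i++) { array[i] = 1.0; work[i] = 1.0; }

    const int iters = 50;
    double w = 0.0, t = 0.0;

    for (int it = 0; it < iters; it++) {
        if (rank == 0) {
            MPI_Request req;
            double t0 = MPI_Wtime();
            MPI_Isend(array, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            double c0 = MPI_Wtime();
            compute(work, N);                    /* the W portion */
            w += MPI_Wtime() - c0;
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            t += MPI_Wtime() - t0;               /* the T portion */
        } else if (rank == 1) {
            MPI_Recv(array, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    if (rank == 0)
        printf("sender-side overlap ratio W/T = %.3f\n", w / t);

    free(array); free(work);
    MPI_Finalize();
    return 0;
}

The receiver-side variant simply swaps the roles: rank 0 posts MPI_Irecv, computes, and waits, while the peer calls MPI_Send.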

Sender-side Overlap (results figure)

Receiver-side Overlap (results figure)

Impact of Process Skew
Benchmark:
– Process 0, for many loops: Irecv(1MB, rank 1); Send(1MB, rank 1); Computation(); Wait()
– Process 1: Irecv(); then, for many loops: Computation(); Wait(); Irecv(1MB, rank 0); Send(1MB, rank 0)
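A hedged, runnable reconstruction of this skew benchmark (loop count, message size, timing, and the handling of the final receive are my assumptions): rank 1 delays its sends behind a computation phase, so without a dedicated progress engine rank 0's receives stall until rank 1 re-enters MPI.

/* Process-skew benchmark, reconstructed from the slide's pseudocode. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void computation(void)
{
    volatile double acc = 0.0;
    for (int i = 0; i < 5000000; i++)
        acc = acc + i * 1e-9;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;                      /* 1 MB messages */
    char *buf = malloc(N);
    const int iters = 20;
    MPI_Request req;
    double t0 = MPI_Wtime();

    if (rank == 0) {
        for (int i = 0; i < iters; i++) {
            MPI_Irecv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            computation();
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    } else if (rank == 1) {
        MPI_Irecv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < iters; i++) {
            computation();
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* skewed: waits only after computing */
            if (i + 1 < iters)
                MPI_Irecv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    printf("rank %d: total time %.3f s\n", rank, MPI_Wtime() - t0);
    free(buf);
    MPI_Finalize();
    return 0;
}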

Sandia Application Availability Benchmark (results figure)

Matrix Multiplication Performance (results figure)

2D Jacobi Sweep Performance (results figure)

Presentation Layout
– Introduction and Motivation
– ProOnE: A General Purpose Protocol Onload Engine
– Experimental Results and Analysis
– Conclusions and Future Work

Concluding Remarks and Future Work
– Proposed, designed and evaluated a general-purpose Protocol Onload Engine (ProOnE) that utilizes a small subset of the cores to onload complex tasks
– Presented a detailed design of the ProOnE infrastructure
– Onloaded the MPI rendezvous protocol as a case study; ProOnE-enabled MPI provides significant performance benefits for benchmarks as well as applications
Future work:
– Study performance and scalability on large-scale systems
– Onload other complex tasks, including application kernels

Acknowledgments
The Argonne part of the work was funded by:
– NSF Computing Processes and Artifacts (CPA)
– DOE ASCR
– DOE Parallel Programming Models
The Ohio State part of the work was funded by:
– NSF Computing Processes and Artifacts (CPA)
– DOE Parallel Programming Models

Thank You!
Contacts: {laipi, …}@cse.ohio-state.edu and {balaji, …}@mcs.anl.gov
Web Links: MPICH2:  NBCLab: