Optimizing Threaded MPI Execution on SMP Clusters
Hong Tang and Tao Yang
Department of Computer Science, University of California, Santa Barbara
June 20, 2001

Parallel Computation on SMP Clusters
- Massively Parallel Machines → SMP Clusters
- Commodity Components: Off-the-shelf Processors + Fast Network (Myrinet, Fast/Gigabit Ethernet)
- Parallel Programming Model for SMP Clusters
  - MPI: Portability, Performance, Legacy Programs
  - MPI+Variations: MPI+Multithreading, MPI+OpenMP
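As an aside, the hybrid model named in the last bullet can be made concrete with a minimal MPI + OpenMP sketch. This is not from the original slides: one MPI process per SMP node handles inter-node communication, while OpenMP threads use the processors within the node. MPI_Init_thread and MPI_THREAD_FUNNELED are standard MPI-2 features used here only for illustration.

```c
/* Hybrid MPI+OpenMP sketch: one MPI process per SMP node,
 * OpenMP threads for intra-node work. Illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (i + 1);        /* intra-node parallel work */

    double global_sum = 0.0;               /* inter-node reduction via MPI */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```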

Threaded MPI Execution
- MPI Paradigm: Separated Address Spaces for Different MPI Nodes
- Natural Solution: MPI Nodes → Processes
- What if we map MPI nodes to threads?
  - Faster synchronization among MPI nodes running on the same machine.
  - Demonstrated in previous work [PPoPP '99] for a single shared-memory machine (developed techniques to safely execute MPI programs using threads).
- Threaded MPI Execution on SMP Clusters
  - Intra-machine communication through shared memory
  - Inter-machine communication through the network
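The thread-per-MPI-node idea can be sketched as follows. The names mpi_node_main and NUM_LOCAL_NODES are hypothetical, not TMPI's actual API: each MPI "node" on a machine becomes a pthread in one process, so intra-machine communication can use shared memory directly rather than inter-process mechanisms.

```c
/* Sketch: run each MPI node on this machine as a thread in one process.
 * mpi_node_main and NUM_LOCAL_NODES are illustrative names only. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_LOCAL_NODES 4

static void *mpi_node_main(void *arg)
{
    int rank = (int)(long)arg;
    /* ... the user's MPI program body would run here, with 'rank' as its node id ... */
    printf("MPI node %d running as a thread\n", rank);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_LOCAL_NODES];
    for (long r = 0; r < NUM_LOCAL_NODES; r++)
        if (pthread_create(&tid[r], NULL, mpi_node_main, (void *)r) != 0) {
            perror("pthread_create");
            exit(1);
        }
    for (int r = 0; r < NUM_LOCAL_NODES; r++)
        pthread_join(tid[r], NULL);   /* all nodes share one address space */
    return 0;
}
```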

Threaded MPI Execution Benefits Inter-Machine Communication
- Common Intuition: Inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
- Our Findings: Using threads can significantly reduce the buffering and orchestration overhead for inter-machine communication.

Related Work
- MPI on Network Clusters
  - MPICH – a portable MPI implementation.
  - LAM/MPI – communication through a standalone RPI server.
- Collective Communication Optimization
  - SUN-MPI and MPI-StarT – modify the MPICH ADI layer; target SMP clusters.
  - MagPIe – targets SMP clusters connected through a WAN.
- Lower Communication Layer Optimization
  - MPI-FM and MPI-AM.
- Threaded Execution of Message Passing Programs
  - MPI-Lite, LPVM, TPVM.

Background: MPICH Design

MPICH Communication Structure
[Figure: MPICH communication structure, without shared memory and with shared memory; diagrams not included in the transcript.]

TMPI Communication Structure
[Figure: TMPI communication structure; diagram not included in the transcript.]

Comparison of TMPI and MPICH
- Drawbacks of MPICH w/ shared memory:
  - Intra-node communication limited by shared memory size.
  - Busy polling to check messages from either the daemon or a local peer.
  - Cannot do automatic resource clean-up.
- Drawbacks of MPICH w/o shared memory:
  - Big overhead for intra-node communication.
  - Too many daemon processes and open connections.
- Drawback of both MPICH systems:
  - Extra data copying for inter-machine communication.

TMPI Communication Design

Separation of Point-to-Point and Collective Communication Channels
- Observation: MPI point-to-point communication and collective communication have different semantics.
- Separate channels for point-to-point and collective communication:
  - Eliminates daemon intervention for collective communication.
  - Less effective for MPICH – no sharing of ports among processes.
- Point-to-point vs. collective semantics (see the sketch below):
  - Unknown source (MPI_ANY_SOURCE) vs. determined source (the ancestor in the spanning tree).
  - Out-of-order matching (message tags) vs. in-order delivery.
  - Asynchronous (non-blocking receive) vs. synchronous.
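To illustrate the semantic gap summarized above, here is a small standard-MPI sketch (not TMPI-specific): the point-to-point receive may match any sender and any tag and complete out of order, while the broadcast has a fixed root and needs no tag matching.

```c
/* Sketch contrasting MPI point-to-point and collective semantics. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        MPI_Status st;
        /* Point-to-point: source and tag are unknown until a message arrives,
         * so the runtime must be ready to match out-of-order messages. */
        for (int i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            printf("got %d from rank %d (tag %d)\n", value, st.MPI_SOURCE, st.MPI_TAG);
        }
    } else {
        value = rank * 10;
        MPI_Send(&value, 1, MPI_INT, 0, /*tag=*/rank, MPI_COMM_WORLD);
    }

    /* Collective: fixed root, every rank participates, delivery order is
     * determined by the communicator, so no tag matching is needed. */
    int bval = (rank == 0) ? 42 : 0;
    MPI_Bcast(&bval, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```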

Hierarchy-Aware Collective Communication
- Observation: two-level communication hierarchy.
  - Inside an SMP node: shared memory (low latency).
  - Between SMP nodes: network (much higher latency).
- Idea: build the communication spanning tree in two steps (sketched below):
  - First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
  - Second, all other MPI nodes connect to their local root node.
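The two-step spanning tree can be sketched with standard MPI communicator splitting. MPI_Comm_split_type with MPI_COMM_TYPE_SHARED is an MPI-3 call used here only to group ranks per machine; TMPI's internal, thread-based implementation is different, and the assumption that the data originates at global rank 0 is made only to keep the sketch short.

```c
/* Sketch of a hierarchy-aware broadcast: step 1 goes between machines
 * among one root MPI node per machine; step 2 fans out inside each machine. */
#include <mpi.h>
#include <stdio.h>

static void hier_bcast(int *buf, int count, MPI_Comm comm)
{
    int rank, local_rank;
    MPI_Comm node_comm, roots_comm;

    MPI_Comm_rank(comm, &rank);
    /* Group the ranks that share a machine (shared-memory domain). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* One representative (local rank 0) per machine joins the inter-node tree. */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, &roots_comm);

    if (local_rank == 0)                              /* step 1: between machines */
        MPI_Bcast(buf, count, MPI_INT, 0, roots_comm);
    MPI_Bcast(buf, count, MPI_INT, 0, node_comm);     /* step 2: within a machine */

    if (roots_comm != MPI_COMM_NULL) MPI_Comm_free(&roots_comm);
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 123;                       /* assumed root: global rank 0 */
    hier_bcast(&value, 1, MPI_COMM_WORLD);
    printf("rank %d got %d\n", rank, value);
    MPI_Finalize();
    return 0;
}
```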

Adaptive Buffer Management
- Question: How do we manage temporary buffering of message data when the remote receiver is not ready to accept the data?
- Choices:
  - Send the data with the request – eager push.
  - Send the request only and send the data when the receiver is ready – three-phase protocol.
- TMPI adapts between the two methods (see the sketch below).
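A minimal sketch of the eager-versus-three-phase decision follows. The threshold, packet types, and function names are hypothetical; TMPI's actual adaptive policy is not reproduced here, only the shape of the trade-off.

```c
/* Sketch: choosing between eager push and a three-phase (rendezvous)
 * protocol by message size. EAGER_LIMIT, send_packet(), and the packet
 * types are hypothetical names, not TMPI's real internals. */
#include <stdio.h>
#include <stddef.h>

#define EAGER_LIMIT 16384           /* assumed threshold in bytes */

enum pkt_type { PKT_EAGER, PKT_REQ, PKT_DATA };

/* Stub transport layer: a real system would write to a socket here. */
static void send_packet(int dest, enum pkt_type t, const void *p, size_t len)
{
    (void)p;
    printf("to machine %d: type=%d, payload=%zu bytes\n", dest, (int)t, len);
}

static void send_message(int dest, const void *buf, size_t len)
{
    if (len <= EAGER_LIMIT) {
        /* Eager push: data travels with the request; the receiver buffers
         * it temporarily if no matching receive has been posted yet. */
        send_packet(dest, PKT_EAGER, buf, len);
    } else {
        /* Three-phase protocol: request first, data only after the
         * receiver signals that a buffer is ready (reply omitted here). */
        send_packet(dest, PKT_REQ, NULL, 0);
        send_packet(dest, PKT_DATA, buf, len);
    }
}

int main(void)
{
    char small[256] = {0}, large[65536] = {0};
    send_message(1, small, sizeof small);   /* eager */
    send_message(1, large, sizeof large);   /* three-phase */
    return 0;
}
```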

Experimental Study
- Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
- Hardware setting: a cluster of 6 quad-Xeon 500 MHz SMPs, each with 1 GB main memory and 2 Fast Ethernet cards.
- Software setting:
  - OS: RedHat Linux 6.0, kernel with channel bonding enabled.
  - Process-based MPI system: MPICH 1.2.
  - Thread-based MPI system: TMPI (45 functions of the MPI 1.1 standard).

Inter-Cluster-Node Point-to-Point
- Ping-pong, TMPI vs. MPICH w/ shared memory.
[Figure: (a) round-trip time (μs) vs. message size (bytes) for short messages; (b) transfer rate (MB/s) vs. message size (KB) for long messages; curves for TMPI and MPICH.]

Intra-Cluster-Node Point-to-Point
- Ping-pong, TMPI vs. MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory).
[Figure: (a) round-trip time (μs) vs. message size (bytes) for short messages; (b) transfer rate (MB/s) vs. message size (KB) for long messages; curves for TMPI, MPICH1, and MPICH2.]

Collective Communication
- Reduce, Bcast, Allreduce.
- Latencies reported as TMPI / MPICH_SHM / MPICH_NOSHM, in μs.
- Three node distributions (4x1, 1x4, 4x4) and three root-node settings (same, rotate, combo).
[Table: latencies (μs) for Reduce, Bcast, and Allreduce under each configuration; several entries were lost in transcription and are marked with –.]
  4x1 – same: 9/121/–, –/137/–, –/175/627; rotate: 33/81/–, –/91/4238; combo: 25/102/–, –/32/966
  1x4 – same: 28/1999/–, –/1610/–, –/675/775; rotate: 146/1944/–, –/1774/1834; combo: 167/1977/–, –/409/392
  4x4 – same: 39/2532/–, –/2792/–, –/1412/19914; rotate: 161/1718/–, –/2204/8036; combo: 141/2242/–, –/489/2054
- Observations:
  1) MPICH w/o shared memory performs the worst.
  2) TMPI is 70+ times faster than MPICH w/ shared memory for MPI_Bcast and MPI_Reduce.
  3) For TMPI, the performance of the 4x4 cases is roughly the sum of that of the 4x1 cases and that of the 1x4 cases.

Macro-Benchmark Performance

Conclusions
- Great advantage of threaded MPI execution on SMP clusters:
  - Micro-benchmarks: 70+ times faster than MPICH.
  - Macro-benchmarks: 100% faster than MPICH.
- Optimization techniques:
  - Separate collective and point-to-point communication channels.
  - Adaptive buffer management.
  - Hierarchy-aware communication.

Background: Safe Execution of MPI Programs Using Threads
- Program transformation: eliminate global and static variables (called permanent variables).
- Thread-Specific Data (TSD):
  - Each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copies of the data variable.
- TSD-based transformation:
  - Each permanent variable declaration is replaced with a KEY declaration.
  - Each node associates its private copy of the permanent variable with the corresponding key.
  - Wherever a permanent variable is referenced, its global key is used to retrieve the per-thread copy of the variable.

Program Transformation – An Example
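The example figure from this slide is not included in the transcript. The following is a minimal before/after sketch of a TSD-based transformation using POSIX thread-specific data; the variable and function names are illustrative and this is not TMPI's generated code.

```c
/* Sketch of a TSD-based transformation (illustrative, not TMPI's output).
 *
 * Original code with a permanent (global) variable:
 *     int counter = 0;
 *     void bump(void) { counter++; }
 *
 * Transformed code: the global becomes a per-thread copy looked up
 * through a pthread key, so each MPI node (thread) sees its own 'counter'. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_key_t counter_key;              /* KEY replaces the global */
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void make_key(void) { pthread_key_create(&counter_key, free); }

/* Retrieve this thread's private copy, creating it on first use. */
static int *counter_ptr(void)
{
    pthread_once(&key_once, make_key);
    int *p = pthread_getspecific(counter_key);
    if (p == NULL) {
        p = calloc(1, sizeof *p);              /* initial value 0, as in the original */
        pthread_setspecific(counter_key, p);
    }
    return p;
}

static void bump(void) { (*counter_ptr())++; } /* was: counter++ */

static void *mpi_node(void *arg)
{
    for (int i = 0; i < 3; i++) bump();
    printf("thread %ld: counter = %d\n", (long)arg, *counter_ptr());
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, mpi_node, (void *)i);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;   /* each thread prints 3: the per-thread copies are independent */
}
```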