Evaluating the Performance Limitations of MPMD Communication

Chi-Chao Chang, Dept. of Computer Science, Cornell University
Grzegorz Czajkowski (Cornell), Thorsten von Eicken (Cornell), Carl Kesselman (ISI/USC)

Framework

Parallel computing on clusters of workstations:
- Hardware communication primitives are message-based.
- Programming models: SPMD and MPMD; SPMD is the predominant model.

Why use MPMD?
- Appropriate for a distributed, heterogeneous setting: metacomputing.
- Parallel software as "components".

Why use RPC?
- The right level of abstraction.
- Message passing requires the receiver to know when to expect incoming communication.

Systems with a similar philosophy: Nexus, Legion.

How do RPC-based MPMD systems perform on homogeneous MPPs?

Problem

MPMD systems are an order of magnitude slower than SPMD systems on homogeneous MPPs.
1. Implementation trade-off: existing MPMD systems focus on the general case at the expense of performance in the homogeneous case.
2. RPC is more complex when the SPMD assumption is dropped.

Approach

MRPC: an MPMD RPC system specialized for MPPs.
- Best base-line RPC performance at the expense of heterogeneity.
- Starts from a simple SPMD RPC layer: Active Messages.
- "Minimal" runtime system for MPMD.
- Integrated with an MPMD parallel language: CC++.
- No modifications to the front-end translator or back-end compiler.

Goal: introduce only the RPC runtime overheads that MPMD requires.

Evaluated against a highly tuned SPMD system: Split-C over Active Messages.

MRPC Implementation

- Library: RPC, marshalling of basic types, remote program execution.
- About 4K lines of C++ and 2K lines of C.
- Implemented on top of Active Messages (SC '96) using a "dispatcher" handler.
- Currently runs on the IBM SP2 (AIX 3.2.5).

Integration into CC++:
- Relies on CC++ global pointers for RPC binding.
- Borrows RPC stub generation from CC++.
- No modification to the front-end compiler.
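The library surface is small. Below is a minimal sketch of what such an endpoint interface could look like, using the operation names listed later on the CC++/MRPC slide (InitRPC, SendRPC, RecvRPC, ReplyRPC, Reset); the signatures and the GlobalPointer type are illustrative assumptions, not the actual MRPC headers.

```cpp
// Hypothetical sketch of an MRPC-style endpoint interface (not the real header).
// Operation names come from the CC++/MRPC slide; argument types are assumed.
#include <cstddef>
#include <string>

struct GlobalPointer {          // assumed stand-in for a CC++ global pointer
    int   node;                 // node that owns the object
    void *local_addr;           // address of the object on that node
};

class Endpoint {
public:
    // Caller side: bind to a remote entry point named `entry` on gp's node.
    void InitRPC(const GlobalPointer &gp, const std::string &entry);
    void SendRPC();             // ship marshalled arguments, start the call
    void Reset();               // make the endpoint reusable for the next call

    // Callee side: unpack an incoming request and send the reply.
    void RecvRPC(void *inbuf, std::size_t len);
    void ReplyRPC();

    // Stream-style marshalling/unmarshalling of basic types.
    Endpoint &operator<<(int v);
    Endpoint &operator<<(double v);
    Endpoint &operator>>(int &v);
    Endpoint &operator>>(double &v);
};
```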

Outline

- Design issues in MRPC
- MRPC and CC++
- Performance results

Method Name Resolution

The compiler cannot determine the existence or location of a remote procedure statically.
- SPMD: every node runs the same program image, so a procedure's address (&foo) is valid on all nodes.
- MPMD: program images differ, so a name ("foo") must be mapped to the address (&foo) that is valid on the callee's node.

MRPC: sender-side stub address caching.
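A minimal sketch of the mapping an MPMD node needs: a per-node table from entry-point names to local stub addresses, which the dispatcher consults when a request arrives carrying a name rather than an address. The table type, the stub signature, and the registration helper are illustrative assumptions, not MRPC's actual data structures.

```cpp
// Hypothetical per-node name-to-stub table for MPMD name resolution.
// In SPMD this table is unnecessary: &e_foo is the same on every node.
#include <cstddef>
#include <string>
#include <unordered_map>

using StubFn = void (*)(void *inbuf, std::size_t len);   // assumed stub signature

static std::unordered_map<std::string, StubFn> entry_table;

// Each program registers its remote entry points at startup.
void register_entry(const std::string &name, StubFn stub) {
    entry_table[name] = stub;
}

// Dispatcher path for a request that still carries a name (cold invocation):
// resolve the name to a local stub address, or fail if it does not exist.
StubFn resolve_entry(const std::string &name) {
    auto it = entry_table.find(name);
    return it == entry_table.end() ? nullptr : it->second;
}
```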

Stub Address Caching

Cold invocation: the caller sends the entry-point name ("e_foo") with the request; the callee's dispatcher resolves it to the stub address (&e_foo) and returns it, and the caller stores the address in a cache keyed by the global pointer (miss).
Hot invocation: the cache hits, so the caller sends the stub address directly and the dispatcher invokes e_foo without a name lookup.
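A sketch of the caller-side cache, assuming the resolve step is an extra exchange performed only on a miss; the cache key, the send helpers, and the address-exchange protocol are illustrative assumptions, not MRPC's actual code.

```cpp
// Hypothetical sender-side stub-address cache (sketch).
// Key: (destination node, entry-point name).  Value: remote stub address.
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Assumed helpers provided by the messaging layer.
std::uintptr_t remote_resolve(int node, const std::string &name);   // cold path: ask the dispatcher
void send_request(int node, std::uintptr_t stub_addr,
                  const void *args, std::size_t len);                // hot path: address known

static std::map<std::pair<int, std::string>, std::uintptr_t> stub_cache;

void call_remote(int node, const std::string &entry,
                 const void *args, std::size_t len) {
    auto key = std::make_pair(node, entry);
    auto it = stub_cache.find(key);
    if (it == stub_cache.end()) {
        // Cold invocation: resolve the name once, then remember the address.
        std::uintptr_t addr = remote_resolve(node, entry);
        it = stub_cache.emplace(key, addr).first;
    }
    // Hot invocation: the cached address is sent, skipping name resolution.
    send_request(node, it->second, args, len);
}
```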

Argument Marshalling

Arguments of an RPC can be arbitrary objects:
- They must be marshalled and unmarshalled by the RPC stubs.
- Marshalling is even more expensive in a heterogeneous setting.

versus Active Messages: up to four 4-byte arguments, plus arbitrary buffers (the programmer takes care of marshalling).

MRPC: efficient data-copying routines for the stubs.
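A minimal sketch of stream-style marshalling of basic types into a contiguous send buffer, in the spirit of the endpt << p / endpt >> arg operators shown on the CC++/MRPC slide; the buffer class and its fixed size are assumptions, and on a homogeneous MPP the sketch can skip byte-order and representation conversion entirely.

```cpp
// Hypothetical marshalling buffer (sketch): basic types are copied into a
// contiguous buffer by the caller stub and copied back out in the callee stub.
// On a homogeneous MPP no byte-swapping or type conversion is needed,
// so marshalling reduces to a memcpy per argument.
#include <cassert>
#include <cstddef>
#include <cstring>

class MarshalBuf {
public:
    template <typename T>
    MarshalBuf &operator<<(const T &v) {            // pack one basic-type argument
        assert(wpos_ + sizeof(T) <= sizeof(buf_));
        std::memcpy(buf_ + wpos_, &v, sizeof(T));
        wpos_ += sizeof(T);
        return *this;
    }
    template <typename T>
    MarshalBuf &operator>>(T &v) {                  // unpack in the same order
        assert(rpos_ + sizeof(T) <= wpos_);
        std::memcpy(&v, buf_ + rpos_, sizeof(T));
        rpos_ += sizeof(T);
        return *this;
    }
    const char *data() const { return buf_; }
    std::size_t size() const { return wpos_; }
private:
    char        buf_[1024] = {};    // assumed fixed-size staging area
    std::size_t wpos_ = 0, rpos_ = 0;
};
```

A caller stub would write arguments with buf << p << i and the callee stub would read them back in the same order, mirroring the generated CC++ stubs shown later.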

Data Transfer

The caller stub does not know about the receive buffer: there is no caller/callee synchronization.

versus Active Messages: the caller specifies the remote buffer address.

MRPC: efficient buffer management and persistent receive buffers.

Persistent Receive Buffers

Cold invocation: data is sent from the S-buf to a static, per-node buffer; the dispatcher copies it into a persistent R-buf, and &R-buf is stored in the caller's cache.
Hot invocation: data is sent from the S-buf directly to the persistent R-buf, avoiding the extra copy through the static buffer.
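A sketch of the two delivery paths, with hypothetical helper names; the point is that only the first (cold) invocation pays for the copy out of the shared static buffer, after which the cached &R-buf lets the sender target the persistent buffer directly.

```cpp
// Hypothetical sketch of cold vs. hot data delivery (not MRPC's code).
#include <cstddef>
#include <vector>

static char static_buf[64 * 1024];        // assumed static, per-node staging buffer

struct PersistentRBuf {                   // persistent receive buffer for one caller/entry
    std::vector<char> data;
};

// Cold invocation: the payload arrives in the static buffer; the dispatcher
// copies it into a freshly allocated persistent R-buf, whose address is then
// returned to the caller and cached alongside the stub address.
PersistentRBuf *deliver_cold(std::size_t len) {
    PersistentRBuf *rbuf = new PersistentRBuf;
    rbuf->data.assign(static_buf, static_buf + len);
    return rbuf;                          // &R-buf goes back into the caller's cache
}

// Hot invocation: the sender already holds &R-buf, so the message layer can
// deposit the payload straight into the persistent buffer; here we only
// expose the target address that would be handed to the transport.
char *hot_target(PersistentRBuf *rbuf, std::size_t len) {
    rbuf->data.resize(len);
    return rbuf->data.data();             // transport writes here, no extra copy
}
```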

Threads

Each RPC requires a new (logical) thread at the receiving end:
- No restrictions on the operations performed in remote procedures.
- The runtime system must be thread safe.

versus Split-C: a single thread of control per node.

MRPC: a custom, non-preemptive threads package.
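A minimal sketch of the non-preemptive discipline, assuming each incoming RPC is turned into a task that runs to completion; the real MRPC package also lets logical threads block and switch, which this sketch omits.

```cpp
// Hypothetical non-preemptive scheduler (sketch): each incoming RPC becomes a
// logical thread that runs without being preempted.  Because control changes
// hands only at well-defined points, the runtime's shared state does not need
// the fine-grained locking a preemptive package would require.
#include <deque>
#include <functional>
#include <utility>

class NonPreemptiveScheduler {
public:
    // Called by the dispatcher for every incoming RPC.
    void spawn(std::function<void()> rpc_body) {
        ready_.push_back(std::move(rpc_body));
    }
    // Run logical threads one at a time, each to completion.
    void run() {
        while (!ready_.empty()) {
            std::function<void()> t = std::move(ready_.front());
            ready_.pop_front();
            t();
        }
    }
private:
    std::deque<std::function<void()>> ready_;
};
```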

Message Reception

Message reception is not receiver-initiated:
- Software interrupts are very expensive.

versus:
- MPI: several different ways to receive a message (poll, post, etc.).
- SPMD: the user typically identifies communication phases into which cheap polling can be inserted easily.

MRPC: a polling thread.
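A sketch of the polling-thread idea; poll_network() and yield_to_others() are assumed helpers standing in for the Active Messages poll call and the threads package, not published APIs. The polling thread replaces both software interrupts and user-inserted polling.

```cpp
// Hypothetical polling thread (sketch).  The two extern functions are assumed
// helpers, not part of MRPC or Active Messages as published.
#include <atomic>

extern void poll_network();     // assumed: drains the NIC queue, runs message handlers
extern void yield_to_others();  // assumed: hands the CPU to ready logical threads

std::atomic<bool> stop_polling{false};

void polling_thread_body() {
    while (!stop_polling.load()) {
        poll_network();         // pick up incoming RPC requests and replies
        yield_to_others();      // non-preemptive: give spawned RPC threads a turn
    }
}
```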

CC++ over MRPC

CC++ caller:
    gpA->foo(p, i);

C++ caller stub (generated by the compiler):
    (endpt.InitRPC(gpA, "entry_foo"),
     endpt << p,
     endpt << i,
     endpt.SendRPC(),
     endpt >> retval,
     endpt.Reset());

CC++ callee:
    global class A { ... };
    double A::foo(int p, int i) { ... }

C++ callee stub (generated by the compiler):
    A::entry_foo(...) {
        ...
        endpt.RecvRPC(inbuf, ...);
        endpt >> arg1;
        endpt >> arg2;
        double retval = foo(arg1, arg2);
        endpt << retval;
        endpt.ReplyRPC();
        ...
    }

MRPC interface used by the stubs: InitRPC, SendRPC, RecvRPC, ReplyRPC, Reset.

Micro-benchmarks

Null RPC:
- AM: 55 μs
- CC++/MRPC: 87 μs
- Nexus/MPL: 240 μs (DCE: ~50 μs)

Global pointer read/write (8 bytes):
- Split-C/AM: 57 μs
- CC++/MRPC: 92 μs

Bulk read (160 bytes):
- Split-C/AM: 74 μs
- CC++/MRPC: 154 μs
- IBM MPI-F and MPL (AIX 3.2.5): 88 μs

Basic communication costs in CC++/MRPC are within 2x of Split-C/AM and other messaging layers.

Applications

- Three versions of EM3D, two versions of Water, LU, and FFT.
- CC++ versions are based on the original Split-C code.
- Runs taken on 4 and 8 processors of the IBM SP-2.

Water (application performance chart)

Discussion

CC++ applications perform within a factor of 2 to 6 of Split-C:
- An order of magnitude improvement over the previous implementation.

Method name resolution:
- Constant cost, almost negligible in the applications.

Threads:
- Account for ~25-50% of the gap, including synchronization (~15-35% of the gap, due to thread safety) and thread management (~10-15% of the gap, of which 75% is context switches).

Argument marshalling and data copy:
- A large fraction of the remaining gap (~50-75%).
- An opportunity for compiler-level optimizations.

Related Work

Lightweight RPC:
- LRPC: RPC specialization for the local case.

High-performance RPC on MPPs:
- Concert, pC++, ABCL.

Integrating threads with communication:
- Optimistic Active Messages.
- Nexus.

Compilation techniques:
- Specialized frame management and calling conventions, lazy threads, etc. (Taura, PLDI '97).

Conclusion

It is possible to implement an RPC-based MPMD system that is competitive with SPMD systems on homogeneous MPPs:
- Same order of magnitude performance.
- A trade-off between generality and performance.

Questions remaining:
- Scalability to larger numbers of nodes.
- Integration with a heterogeneous runtime infrastructure.

Slides:
MRPC, CC++ apps source code: