
1 Presented by Roberto Starc
Active Messages: a Mechanism for Integrated Communication and Computation – ISCA 1992. Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser. University of California, Berkeley, Computer Science Division – EECS. Presented by Roberto Starc.
Hello everyone, and welcome to my presentation. Let's get right to it.

2 Executive Summary
Problem – Communication between processors is slow, and speeding it up sacrifices cost/performance of the system
Goal – Reduce communication overhead and allow overlapping of communication with computation
Active Messages – Integrate communication and computation
Messages consist of the address of a user-level handler at the head, and the arguments to be passed as the body
The handler gets the message out of the network and into the ongoing computation as fast as possible
A simple mechanism close to hardware that can be used to implement existing parallel programming paradigms
Result – Near order-of-magnitude reduction in the per-byte and start-up cost of messages!
To give you a summary of the paper: our problem is that communication between processors is slow, and if we want to speed it up we have to sacrifice cost and performance of the system. Our goal is to solve that problem: we want to reduce communication overhead as much as possible and allow the overlapping of communication with computation, so we no longer have to sacrifice performance to make communication fast. We do this with Active Messages – the key idea is to integrate communication and computation (hence the title). An active message is very simple: it consists of the address of a user-level handler at the head and the arguments we pass to the handler as the body. The handler's job is to get the message out of the network and into the ongoing computation as fast as possible. This results in a simple, flexible mechanism close to existing hardware functionality that can be used to implement existing programming paradigms. And it turns out this works really well: we get a near order-of-magnitude reduction in the per-byte and start-up cost of messages – you'll see exactly what that means in a second.

3 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion
This is the outline – you should be familiar with the structure by now, so let's start with [---]

4 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion
The problem!

5 Problem
[Diagram: processing nodes A and B connected by a network; A sends a message, B receives it]
Communication between processors is slow, and speeding it up sacrifices cost/performance!
What's the problem? We have two processing nodes, A and B, connected by a network. We want to send a message from A, which is then received by B. Pretty simple. The problem is: this is slow, and if we want to speed it up, we sacrifice the cost and performance of the rest of the system!

6 Goal
[Diagram: the same two nodes; the send/receive should now be fast]
Reduce communication overhead!
Our goal is then obviously to solve the problem, meaning: make this fast! We want to do this by reducing communication overhead as much as possible and allowing overlapping of communication and computation. You're about to see why we try to solve the problem this way:

7 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion
When I give you some background.

8 Background – Algorithmic Communication Model
[Diagram: a program alternating between computation and communication phases]
T = T_compute + T_communicate
T_communicate = N_C (T_S + L_C · T_b)
For 90% of peak performance: T_compute ≈ 9 · T_communicate
To really understand this paper, we need the algorithmic communication model. In the model, any program consists of alternating computation and communication phases, so the time to run the program is simply the time the computation takes plus the time the communication takes. Communication consists of a start-up cost T_S, which covers setting up the connection and so on, and then scales linearly with the size of the message: you pay T_b, the cost per byte, L_C times, where L_C is the length of the message. And if we communicate more than once, we multiply by N_C, the number of communications. This gives the formula for T_communicate above. If we want to achieve high efficiency in this model, for example 90%, we have to make sure that we get nine times as much computation as communication.
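To make the model concrete, here is a minimal C sketch that evaluates T_communicate = N_C (T_S + L_C · T_b) and the 90%-efficiency condition. The parameter values are purely illustrative (of the same magnitude as the nCUBE/2 numbers later in the talk), not measurements from the paper:

```c
#include <stdio.h>

/* T_communicate = N_C * (T_S + L_C * T_b); all values are illustrative. */
double t_communicate(double n_c, double t_s, double l_c, double t_b) {
    return n_c * (t_s + l_c * t_b);
}

int main(void) {
    double t_comm = t_communicate(100, 30.0, 1024, 0.45); /* 100 msgs of 1 KB */
    double t_comp = 9.0 * t_comm;  /* 90% efficiency needs 9x as much compute */
    printf("T = %.0f us, efficiency = %.0f%%\n",
           t_comp + t_comm, 100.0 * t_comp / (t_comp + t_comm));
    return 0;
}
```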

9 Background – Algorithmic Communication Model (with overlap)
[Diagram: the computation and communication phases now overlap]
T = max(T_compute + N_C · T_S, N_C · L_C · T_b)   instead of   T = T_compute + T_communicate
To achieve high efficiency: T_compute ≫ N_C · T_S → T ≈ T_compute
Either the start-up costs dominate, or the time to send/receive the message does (for large messages).
The situation looks quite a bit different if we can overlap communication and computation. We again have a program, only this time the computation and the communication overlap. The time to run the program then becomes the maximum of the two terms above: either the start-up costs dominate, or the time to actually transmit the message, which is going to be the case for large messages. This is also always smaller than or equal to the time it takes to run the program without overlapping computation and communication.
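A short continuation of the sketch above, with the same illustrative numbers, comparing the non-overlapped time with the overlapped max() form:

```c
#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

int main(void) {
    /* Illustrative values: 100 messages of 1 KB, T_S = 30 us, T_b = 0.45 us. */
    double n_c = 100, t_s = 30.0, l_c = 1024, t_b = 0.45;
    double t_compute = 400000.0;                              /* us          */
    double t_serial  = t_compute + n_c * (t_s + l_c * t_b);   /* no overlap  */
    double t_overlap = max2(t_compute + n_c * t_s,            /* overlapped: */
                            n_c * l_c * t_b);                 /* max() form  */
    printf("no overlap: %.0f us, overlapped: %.0f us\n", t_serial, t_overlap);
    return 0;
}
```

With these numbers the overlapped time is determined by the compute term, so the per-byte cost disappears entirely behind the computation.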

10 Background – Shortcomings of Existing Solutions: send/receive
The simple approach: blocking 3-way send/receive
Problem: nodes cannot continue computation while waiting for messages!
[Diagram: both nodes sit idle during the 3-phase handshake]
So, why do we even need Active Messages? What's wrong with the solutions we already have? We have two nodes here. Node 1 wants to send a message to Node 2, so it has to stop its computation, send a request-to-send to Node 2, wait for a ready-to-receive, and only then can it send its data. The problem is that nodes cannot compute anything while waiting for messages! They just sit there doing nothing, which is obviously bad.
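As a rough sketch of why the nodes idle, here is the 3-phase handshake in C; net_send and net_recv are hypothetical blocking primitives invented for illustration, not a real message-layer API:

```c
enum { REQUEST_TO_SEND, READY_TO_RECEIVE, DATA };

/* Hypothetical blocking message-layer primitives (assumed, not real APIs). */
void net_send(int node, int tag, const void *buf, int len);
void net_recv(int node, int tag, void *buf, int len);

/* Blocking 3-way send: the sender idles in step 2, and the receiver idles
   until its program reaches the matching receive. */
void send_blocking(int dst, const void *buf, int len) {
    net_send(dst, REQUEST_TO_SEND, 0, 0);  /* 1: ask for permission        */
    net_recv(dst, READY_TO_RECEIVE, 0, 0); /* 2: sit idle until the reply  */
    net_send(dst, DATA, buf, len);         /* 3: finally transfer the data */
}
```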

11 Background – Shortcomings of Existing Solutions: send/receive
This can be improved by adding buffering at the message layer:
send appears instantaneous to the user
The message is buffered until it can be sent
It is then transmitted to the recipient, where it is again buffered until a matching receive can be executed
Buffer space for the entire volume of communication must be allocated!
[Diagram: nodes in a ring, each with a send buffer and a receive buffer]
We improve this by implementing buffering at the message layer, so each send and receive has a buffer associated with it. We have here a number of nodes in a ring topology, so each node has a left and a right neighbour. Data can now be exchanged while computing, by executing all sends before the computation phase and all receives afterwards (a sketch follows below). Sends now appear instantaneous to the user, because the message is buffered until it can be sent. Then we transmit the message to the recipient, where it is again buffered until we can execute a matching receive. Note that during the whole computation, the buffer space for the entire volume of communication must be allocated.
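A minimal sketch of one phase of that ring exchange, assuming a hypothetical buffered asynchronous API (async_send, async_recv and msg_wait are my names, not the paper's):

```c
typedef int handle_t;

/* Hypothetical buffered asynchronous primitives (assumed, not real APIs). */
handle_t async_send(int node, const void *buf, int len);
handle_t async_recv(int node, void *buf, int len);
void     msg_wait(handle_t h);

/* One phase of the ring exchange: sends are issued before the compute phase
   and completed after it, so the out/in buffers must stay allocated for the
   entire computation. */
void ring_step(int left, int right,
               const double *out, double *in, int len,
               void (*compute)(void)) {
    handle_t s = async_send(right, out, len); /* appears instantaneous  */
    handle_t r = async_recv(left,  in,  len); /* buffered by the layer  */
    compute();                                /* overlap happens here   */
    msg_wait(s);
    msg_wait(r);
}
```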

12 Background – Shortcomings of Existing Solutions
This allows for the overlap of communication and computation – but it's still slow. Why?
Buffer management – we have to make sure that enough space for the whole communication phase is available! This incurs a huge start-up cost.
[Chart (µs, logarithmic scale): start-up cost, per-byte cost, and floating-point-operation cost of asynchronous send/receive on several message-passing architectures]
So this asynchronous version of send/receive allows for the overlap of communication with computation. But it's still slow. Why? Let's look at this chart. It shows the cost of asynchronous send/receive on a number of message-passing architectures – the start-up cost, the per-byte cost, and, for comparison, the cost of a floating-point operation. Notice that the y-axis is logarithmic, so the start-up cost of a message is several orders of magnitude larger than the cost of executing a floating-point operation. This is due to buffer management – to make sure that enough space for the whole communication phase is available, we have to allocate and manage a lot of buffers! That is the reason for the giant red bar you see here.

13 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion
Now, how do Active Messages do it differently?

14 Novelty – Active Messages
Active Messages is a primitive, asynchronous communication mechanism
It aims to integrate communication into ongoing computation instead of separating the two, thereby reducing overhead
Not just a new parallel programming paradigm: it can be used to implement a wide variety of models simply and efficiently
It is close to hardware functionality: Active Messages work like interrupts, which are already supported!
What's so special about it? Send and receive, on the other hand, are not native to hardware – the hardware model is essentially one of launching messages into the network and causing a handler to be executed asynchronously on arrival.

15 Active Messages – Key Approach and Ideas
An Active Message: the address of a user-level handler, followed by the arguments
[Diagram: node A sends an Active Message through the network to node B; the message head points at a handler in B's memory, and arrival raises an interrupt that runs the handler]
So that's exactly how Active Messages work! Each message contains at its head the address of a user-level handler, which is executed upon message arrival with the message body as its argument. We again have two nodes, A and B, connected by some network, and we want to send a message from A to B, only this time it's an Active Message! On B's side, we have a handler sitting in memory. The head of the Active Message simply points to the handler. When the message arrives, it causes an interrupt and the handler executes. The job of the handler is then to get the message out of the network and into the ongoing computation.
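In code, the message format is tiny. A minimal sketch (the struct layout and names are my assumptions, not the paper's exact wire format):

```c
enum { AM_MAX_ARGS = 4 };

typedef void (*am_handler_t)(long args[]);

/* Head: the address of a user-level handler. Body: its arguments. */
struct active_message {
    am_handler_t handler;
    long         args[AM_MAX_ARGS];
};

/* What the receiving side conceptually does on arrival (in the interrupt
   or poll path): jump straight through the head of the message -- no
   buffering, no scheduling decision. */
void am_dispatch(struct active_message *m) {
    m->handler(m->args);
}
```

The point of the design is that am_dispatch is the entire receive path; there is no buffer management left to pay for.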

16 Active Messages – Mechanism (in more detail)
Active Messages are not buffered (except as required for network transport)
The handler executes immediately upon arrival of the message (like an interrupt!)
The network is viewed as a pipeline: the sender launches the message into the network and continues computation; the receiver gets notified or interrupted upon message arrival
The handler is specified by a user-level address, so traditional protection models apply
The handler does not block – otherwise deadlocks and network congestion can occur
To give you some more detail, a sketch of a well-behaved handler follows below.
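For instance, the reply handler of a remote read might look like this. This is a sketch with names of my own invention; the key property is that it only moves data and returns:

```c
/* Hypothetical reply to a remote read: the handler deposits the value into
   the ongoing computation and signals completion. It never blocks, never
   allocates, and never waits for another message. */
struct get_reply {
    double       *dst;    /* where the computation wants the value  */
    double        value;  /* payload carried in the message body    */
    volatile int *done;   /* completion flag the computation polls  */
};

void get_reply_handler(struct get_reply *r) {
    *r->dst  = r->value;  /* pull the data out of the network...        */
    *r->done = 1;         /* ...and hand it to the ongoing computation  */
}
```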

17 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion

18 Example – Split-C
Split-C provides split-phase remote memory operations in C:
PUT copies a local memory block into remote memory at an address specified by the sender
GET retrieves a block of remote memory and makes a local copy
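Both calls return immediately; completion is signalled later, for example by an Active Message handler setting a flag, as in the handler sketch above. A sketch of what the interface might look like – the parameter lists are my assumption, not the actual Split-C signatures:

```c
/* Split-phase remote memory operations (signatures are assumptions).
   The caller starts the transfer, keeps computing, and polls *done. */
void PUT(int node, void *remote_dst, const void *local_src,
         int nbytes, volatile int *done);
void GET(int node, const void *remote_src, void *local_dst,
         int nbytes, volatile int *done);
```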

19 Example – Matrix Multiplication with Split-C
[Diagram: R×N matrix A, N×M matrix B, and R×M matrix C, each partitioned into column blocks 1–4 across the processors]
To demonstrate this, we use Split-C to perform a matrix multiplication on multiple processors, four in this example. First, each matrix is partitioned into blocks of columns across the processors. For C = A × B, each processor GETs one column block of A after another and computes with it and its own columns of B into its own columns of C. To balance the communication pattern, each processor first computes with its own columns of A and then proceeds by getting the columns of the next processor. The key to high performance here is overlapping communication and computation. This is done by getting the column block for the next iteration while computing with the current one (see the sketch after this slide). Of course, it is then necessary to balance the two.
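Here is a sketch of one processor's loop under those assumptions. GET is as sketched above; A_addr_on, compute_with, MAX_BLOCK and all sizes are illustrative names of mine, not Split-C code:

```c
#include <string.h>

enum { MAX_BLOCK = 4096 };  /* illustrative block capacity, in doubles;
                               assumes block_bytes <= sizeof buf[0]      */

/* Assumed helpers, as sketched on the previous slide. */
extern void GET(int node, const void *remote_src, void *local_dst,
                int nbytes, volatile int *done);
extern const double *A_addr_on(int node);       /* remote block of A       */
extern void compute_with(const double *Ablock); /* C_own += Ablock * B_own */

void matmul(int me, int P, const double *A_own, int block_bytes) {
    static double buf[2][MAX_BLOCK];
    volatile int ready[2] = { 1, 0 };
    memcpy(buf[0], A_own, block_bytes);  /* round 0: our own columns of A */
    for (int i = 0; i < P; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < P) {                 /* prefetch the NEXT block now   */
            int peer = (me + i + 1) % P;
            ready[nxt] = 0;
            GET(peer, A_addr_on(peer), buf[nxt], block_bytes, &ready[nxt]);
        }
        while (!ready[cur])              /* with balanced phases this     */
            ;                            /* barely spins                  */
        compute_with(buf[cur]);
    }
}
```

The double-buffering is what creates the overlap: the GET for round i+1 is in flight while round i computes.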

20 Example – Matrix Multiplication with Split-C: Processor 1
[Animation frame: processor 1 – GET of column block 1 of A; results go into its columns of C via PUT]

21 Example – Matrix Multiplication with Split-C: Processor 1
[Animation frame: processor 1 – GET of column block 2 of A while computing with block 1]

22 Example – Matrix Multiplication with Split-C: Processor 1
[Animation frame: processor 1 – GET of column block 3 of A while computing with block 2]

23 Example – Matrix Multiplication with Split-C: Processor 1
[Animation frame: processor 1 – PUT of its finished columns of C]

24 Example – Matrix Multiplication with Split-C: Master
[Animation frame: column blocks 1–4 of C are PUT back to the master]

25 Example – Matrix Multiplication with Split-C
[Chart: % processor utilization vs. number of columns of A per processor]
Result: performance predicted by the model and measured on a 128-node nCUBE/2 as the number of columns of A per processor is varied from 1 to 32 – reaching about 90% processor utilization.

26 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion

27 Methodology
nCUBE/2 & CM-5: message-passing architectures
Each node consists of a simple CPU, DRAM, and a network interface
Highly interconnected network

28 Evaluation – Active Messages on the nCUBE/2
Sending one word of data: 21 instructions, 11µs
Receiving such a message: 34 instructions, 15µs
Reduces buffer management to the minimum required for actual data transport
Very close to the absolute minimal message layer

29 Evaluation – Active Messages on the nCUBE/2
Sending one word of data: 21 instructions, 11µs
Receiving such a message: 34 instructions, 15µs
Near order-of-magnitude reduction in start-up cost: TC = 30µs/msg, Tb = 0.45µs/byte
[Chart (µs/msg): message start-up cost of the existing send/receive vs. Active Messages – roughly a ×6.15 difference]

30 Evaluation – Active Messages on the CM-5
Sending a single-packet Active Message: 1.6µs
Blocking send/receive implemented on top of Active Messages: TC = 26µs, Tb = 0.12µs/byte
[Chart (µs/msg): the existing blocking send/receive costs roughly ×3.3 as much as send/receive built on Active Messages, and ×53.75 as much as a single-packet Active Message]
TC is an order of magnitude larger for send/receive due to the three-phase protocol that send/receive requires.

31 Executive Summary
Problem – Communication between processors is slow, and speeding it up sacrifices cost/performance of the system
Goal – Reduce communication overhead and allow overlapping of communication with computation
Active Messages – Integrate communication and computation
Messages consist of the address of a user-level handler at the head, and the arguments to be passed as the body
The handler gets the message out of the network and into the ongoing computation as fast as possible
A simple mechanism close to hardware that can be used to implement existing parallel programming paradigms
Result – Near order-of-magnitude reduction in the per-byte and start-up cost of messages!
To recap: our problem is that communication between processors is slow, and if we want to speed it up we have to sacrifice cost and performance of the system. Our goal is to reduce communication overhead as much as possible and allow the overlapping of communication with computation, so we no longer have to sacrifice performance to make communication fast. We do this with Active Messages – the key idea is to integrate communication and computation (hence the title). An active message consists of the address of a user-level handler at the head and the arguments we pass to the handler as the body. The handler's job is to get the message out of the network and into the ongoing computation as fast as possible. This results in a simple, flexible mechanism close to existing hardware functionality that can be used to implement existing programming paradigms. And it works really well: we get a near order-of-magnitude reduction in the per-byte and start-up cost of messages – and you have now seen exactly what that means.

32 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion

33 Strengths
Simple, novel mechanism that solves a very important problem
Flexible: can be implemented on existing systems and can be used to implement existing models
Close to hardware, which results in low overhead and makes it cheap to implement
Makes parallel programming faster and simpler! (though not necessarily easier)
Greatly improves performance
Well-written paper
The paper highlights several applications of Active Messages

34 Weaknesses
Restricted to the SPMD (Single Program Multiple Data) model
Handler code is restricted: it can't block and has to get the message out of the network as fast as possible
Performance evaluation is not presented well in the paper
The possible hardware support discussed in the paper is very speculative
It is restricted to SPMD because the address of the user-level handler has to be known in advance, so all the nodes essentially have to execute the same program. Another thing is that the code a handler can actually run is fairly restricted – as we have seen, a handler cannot block and has to execute fast, which for example means that the handler cannot acquire any locks. The possible hardware support is very speculative, likely due to lack of time and funding – trying out all the different possible kinds of hardware support for Active Messages simply isn't a feasible approach, and it is not the goal of the paper.

35 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion

36 Takeaways
Simple, flexible and effective
Still very relevant today
Wide range of possible improvements at the software and hardware level – a lot of work has already been done, but there is a lot more potential here!
Easy-to-read paper
Today: one of the three main types of distributed-memory programming, the others being data parallel and message passing
Has inspired many other papers
Currently implemented in some form or another in ?? / influence on today's and future systems

37 Beyond the Paper
Used in many MPI implementations at the low-level transport layer (e.g. GASNet)
If you want more detail, read Thorsten von Eicken's dissertation: "Active Messages: an Efficient Communication Architecture for Multiprocessors", Thorsten von Eicken, UC Berkeley, 1993
"Active Message Applications Programming Interface and Communication Subsystem Organization", David E. Culler, Alan M. Mainwaring, UC Berkeley Technical Report, 1996
"AM++: A Generalized Active Message Framework", T. Hoefler, J. J. Willcock, N. G. Edmonds, A. Lumsdaine, PACT 2010

38 Thoughts and Ideas – Discussion
Could be expanded to support other models like MPMD and many more applications:
"Active Message Applications Programming Interface and Communication Subsystem Organization", D. E. Culler, A. M. Mainwaring, UC Berkeley Technical Report, 1996
"AM++: A Generalized Active Message Framework", T. Hoefler, J. J. Willcock, N. G. Edmonds, A. Lumsdaine, PACT 2010
This could be even faster in combination with hardware support:
"Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages", M. Besta, T. Hoefler, HPDC 2015
Now, here are my thoughts and ideas. When I looked into this, however, I found that a lot of people with a lot more time and brainpower than me have been thinking about this since before I was born – literally.

39 Outline
Problem & Goal, Background, Active Messages: Novelty & Mechanism, Example, Methodology and Evaluation, Strengths & Weaknesses, Takeaways/Beyond the Paper, Questions & Discussion

40 Questions?

41 Discussion
Could we somehow make the handler run arbitrary code? "Optimistic Active Messages: A Mechanism for Scheduling Communication with Computation", D. A. Wallach, W. C. Hsieh, K. L. Johnson, M. F. Kaashoek, W. E. Weihl, EW SIGOPS 1994
How could we support Active Messages in hardware?
Is this it? What happens once we get to the minimal required message layer?

42 Open Discussion

43 Thanks for watching! And special thanks to Giray & Geraldo!

44 Backup Slides

45 Backup – Algorithmic Communication Model
Assumption: the program alternates between computation and communication
Communication requires time linear in the size of the message, plus a start-up cost
Time to run a program: T = T_compute + T_communicate, with T_communicate = N_C (T_S + L_C · T_b)
T_S: start-up cost, T_b: time per byte, L_C: message length, N_C: number of communications
To achieve high efficiency, the programmer must tailor the algorithm to achieve a high ratio of computation to communication (e.g., to achieve 90% of peak performance: T_compute ≥ 9 · T_communicate)
If communication is overlapped with computation: T = max(T_compute + N_C · T_S, N_C · L_C · T_b)
To achieve high efficiency: T_compute ≫ N_C · T_S

46 Backup – Performance Chart

47 Backup – Methodology: CM-5
Up to a few thousand nodes interconnected in a "hypertree"
Each node: 33 MHz SPARC RISC processor, local DRAM, network interface

48 Backup – Methodology: nCUBE/2
Up to a few thousand nodes interconnected in a binary hypercube network
CPU: 64-bit integer unit, IEEE floating-point unit, DRAM interface, network interface with 28 channels
Runs at 20 MHz
Routers support routing across a 13-dimensional hypercube

49 Backup – GET cost model

