AMLAPI: Active Messages over Low-level Application Programming Interface Simon Yau, Tyson Condie,
Background
– AM (Active Messages) is a low-level communication architecture for high-performance parallel computing.
– LAPI is IBM's version of AM; the two have very similar APIs.
– Programs written for an AM platform should therefore be able to run on LAPI.
– The AMLAPI layer emulates AM on top of LAPI.

Similarities
– Both are low-level, message-passing style architectures.
– Both use active messages: one node initiates an active message, and the receiving node executes a handler upon its reception.

Differences
– AM virtualizes the network interface with endpoints and bundles, allowing multiple threads per endpoint.
– AM requires handlers to execute in the context of the application program; LAPI handlers execute in the context of a polling thread.
– LAPI splits handlers into a header handler and a completion handler.
– LAPI uses counters for synchronization and guarantees execution of handlers; AM only guarantees that the network has accepted the data.

AM & LAPI Execution Model
[Diagram: sender/receiver timelines.]
– AM execution: the sender sends a message and keeps working; the receiver gets the message and executes the handler (possibly sending a reply).
– LAPI execution: the sender sends the message and keeps working; the receiver's polling thread gets it and executes the header handler; a completion ("footer") exchange then triggers the footer handler.

To Emulate AM on LAPI
– Emulate endpoints and bundles: maintain a list of endpoints per box; each endpoint is represented by its box id and its position in that list.
– Associate each endpoint bundle with a task queue.
– An AM send is performed with a LAPI call that schedules a task on the queue at the remote end.
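The endpoint bookkeeping above can be sketched as follows. This is a hypothetical illustration, not the actual AMLAPI source: all names (`ep_name`, `ep_register`, `MAX_EP`) are invented for the example. The key idea is that a remote sender can name any endpoint by (box id, list index) without sharing pointers across address spaces.

```c
#include <assert.h>

/* Illustrative sketch only -- names do not come from AMLAPI.
 * Each box keeps a flat list of its endpoints; an endpoint's global
 * name is the owning box id plus its index in that box's list. */
#define MAX_EP 64

typedef struct {
    int in_use;   /* slot occupied? */
    /* ... per-endpoint state: VM segment, handler table, ... */
} endpoint;

static endpoint ep_list[MAX_EP];   /* this box's endpoint list */

typedef struct {
    int box_id;    /* node that owns the endpoint */
    int ep_index;  /* position in that node's endpoint list */
} ep_name;

/* Create a local endpoint and return its globally usable name;
 * ep_index == -1 signals a full table. */
static ep_name ep_register(int my_box_id)
{
    for (int i = 0; i < MAX_EP; i++) {
        if (!ep_list[i].in_use) {
            ep_list[i].in_use = 1;
            return (ep_name){ my_box_id, i };
        }
    }
    return (ep_name){ my_box_id, -1 };
}
```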

Design
– Sending an AM: package a LAPI message and send it to the receiving node. At the receiving node, multiplex the message to the appropriate endpoint and put the associated function pointer, with its arguments, onto the task queue.
– Receiving an AM: when the user polls, check the task queue and execute a task from it. Execute only one task per poll, since the user thread should not spend too much time in handlers.
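The receive path above can be sketched as a simple task queue. This is an illustrative sketch, not the AMLAPI source: `tq_push` and `tq_poll_one` are invented names standing in for the LAPI completion handler's enqueue and the user thread's poll. The point shown is that a poll runs at most one task, so the user thread never lingers inside handlers.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch only -- names do not come from AMLAPI. */
typedef struct task {
    void (*handler)(void *);   /* AM handler to run at poll time */
    void *arg;                 /* marshalled handler argument */
    struct task *next;
} task;

typedef struct { task *head, *tail; } task_queue;

/* Called from the LAPI completion ("footer") handler: enqueue the
 * AM handler and its arguments on the endpoint bundle's queue. */
static void tq_push(task_queue *q, void (*h)(void *), void *arg)
{
    task *t = malloc(sizeof *t);
    t->handler = h;
    t->arg = arg;
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

/* Called from the user thread's poll: execute exactly one pending
 * task (if any) and report how many tasks ran (0 or 1). */
static int tq_poll_one(task_queue *q)
{
    task *t = q->head;
    if (!t) return 0;
    q->head = t->next;
    if (!q->head) q->tail = NULL;
    t->handler(t->arg);        /* run the AM handler inline */
    free(t);
    return 1;
}

/* Tiny demo handler used in the usage example. */
static int demo_sum = 0;
static void demo_handler(void *arg) { demo_sum += *(int *)arg; }
```

A caller that wants to drain everything simply loops `while (tq_poll_one(&q));`, but the one-task default keeps handler time bounded per poll, matching the design decision on the slide.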

Picture
[Diagram: sender/receiver timelines as in the execution-model slide, with the AM handler executed at poll time.]
1. The sender executes AM_Send.
2. The sender piggybacks information about the AM call and executes LAPI_Send.
3. The network ships the message to the receiver.
4. The receiver's network interface gets the request message, causing the polling thread to execute the header handler.
5. The header handler allocates buffer space into which the message is copied.
6. LAPI copies the data into the buffer and calls the footer handler.
7. The footer handler posts the AM handler, with its arguments and AM information, on the queue of the destination endpoint.
8. When the user application polls, it pulls the handler from the task queue and executes it.
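Step 2's "piggybacked" AM information can be pictured as a small header serialized in front of the payload. A hypothetical sketch, with invented names (`am_header`, `am_pack`) rather than the real AMLAPI layout: the description of the AM call (destination endpoint, handler id, arguments) travels with the data, so the receiver's header and footer handlers can rebuild the task without any extra round trip.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative wire format only -- not the AMLAPI source. */
typedef struct {
    int32_t dest_ep;      /* index of the destination endpoint */
    int32_t handler_id;   /* which AM handler to run at the receiver */
    int32_t nargs;        /* number of valid entries in args[] */
    int32_t args[4];      /* short-message arguments */
} am_header;

/* Serialize header + payload into one contiguous wire buffer;
 * returns the total number of bytes used. */
static size_t am_pack(char *buf, const am_header *h,
                      const void *payload, size_t len)
{
    memcpy(buf, h, sizeof *h);
    memcpy(buf + sizeof *h, payload, len);
    return sizeof *h + len;
}
```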

Evaluation Platform: IBM SP3
– Interconnect: advertised bandwidth = 350 MB/s; advertised latency ≈ 17 microseconds.
– SMP nodes: 8 Power3 processors per node; 4 GB of memory per node.
– Processor: superscalar, pipelined, 64-bit RISC; 8 instructions per clock at 375 MHz; 64 KB L1 cache, 8 MB L2 cache.
– OS: AIX with IBM Parallel Environment.

Micro Benchmarks
– AMLAPI round-trip latency: 473 us
– LAPI round-trip latency: 32 us

Explanation
– Copying data from the message buffer to an endpoint's VM segment accounts for the bulk of the overhead; context switching and packing the AM info account for the rest.
– Since each SP3 node is an SMP, the LAPI threads and the application thread run on different processors. Moving data away from the LAPI thread's processor requires invalidating the cache of the processor on which the LAPI thread runs.

Conclusion
– Using low-level glue-ware is a viable option for making programs portable, provided the communication layers match.
– Future work:
  – Macro benchmarks.
  – Improve short-message latency via the header handler.
  – "Zero copy" to endpoint VM: make the AM handler run in the LAPI context.