Caltech CS184 Spring2005 -- DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 16: May 6, 2005 Fast Messaging.

Slides:



Advertisements
Similar presentations
More on Processes Chapter 3. Process image _the physical representation of a process in the OS _an address space consisting of code, data and stack segments.
Advertisements

AMLAPI: Active Messages over Low-level Application Programming Interface Simon Yau, Tyson Condie,
Lecture Objectives: 1)Explain the limitations of flash memory. 2)Define wear leveling. 3)Define the term IO Transaction 4)Define the terms synchronous.
Chapter 7 Protocol Software On A Conventional Processor.
ECE 526 – Network Processing Systems Design Software-based Protocol Processing Chapter 7: D. E. Comer.
Architectural Support for OS March 29, 2000 Instructor: Gary Kimura Slides courtesy of Hank Levy.
Figure 2.8 Compiler phases Compiling. Figure 2.9 Object module Linking.
OS Fall ’ 02 Introduction Operating Systems Fall 2002.
Architectural Support for Operating Systems. Announcements Most office hours are finalized Assignments up every Wednesday, due next week CS 415 section.
3.5 Interprocess Communication Many operating systems provide mechanisms for interprocess communication (IPC) –Processes must communicate with one another.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
3.5 Interprocess Communication
OS Spring’03 Introduction Operating Systems Spring 2003.
1 Last Class: Introduction Operating system = interface between user & architecture Importance of OS OS history: Change is only constant User-level Applications.
CS533 Concepts of Operating Systems Class 3 Integrated Task and Stack Management.
1 Process Description and Control Chapter 3 = Why process? = What is a process? = How to represent processes? = How to control processes?
CS533 Concepts of OS Class 16 ExoKernel by Constantia Tryman.
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
I/O Tanenbaum, ch. 5 p. 329 – 427 Silberschatz, ch. 13 p
Paper Review Building a Robust Software-based Router Using Network Processors.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 6: April 19, 2001 Dataflow.
3/11/2002CSE Input/Output Input/Output Control Datapath Memory Processor Input Output Memory Input Output Network Control Datapath Processor.
LWIP TCP/IP Stack 김백규.
Hardware Definitions –Port: Point of connection –Bus: Interface Daisy Chain (A=>B=>…=>X) Shared Direct Device Access –Controller: Device Electronics –Registers:
LWIP TCP/IP Stack 김백규.
Three fundamental concepts in computer security: Reference Monitors: An access control concept that refers to an abstract machine that mediates all accesses.
Architecture Support for OS CSCI 444/544 Operating Systems Fall 2008.
CHAPTER 3 TOP LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION
Recall: Three I/O Methods Synchronous: Wait for I/O operation to complete. Asynchronous: Post I/O request and switch to other work. DMA (Direct Memory.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day14:
I/O Computer Organization II 1 Interconnecting Components Need interconnections between – CPU, memory, I/O controllers Bus: shared communication channel.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 17: May 14, 2003 Threaded Abstract Machine.
Processes and Process Control 1. Processes and Process Control 2. Definitions of a Process 3. Systems state vs. Process State 4. A 2 State Process Model.
1 Threads, SMP, and Microkernels Chapter Multithreading Operating system supports multiple threads of execution within a single process MS-DOS.
1 CSE451 Architectural Supports for Operating Systems Autumn 2002 Gary Kimura Lecture #2 October 2, 2002.
LRPC Firefly RPC, Lightweight RPC, Winsock Direct and VIA.
Chapter 13 – I/O Systems (Pgs ). Devices  Two conflicting properties A. Growing uniformity in interfaces (both h/w and s/w): e.g., USB, TWAIN.
1 Lecture 1: Computer System Structures We go over the aspects of computer architecture relevant to OS design  overview  input and output (I/O) organization.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 1: April 3, 2001 Overview and Message Passing.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 14: May 7, 2003 Fast Messaging.
Lecture 12 Page 1 CS 111 Online Using Devices and Their Drivers Practical use issues Achieving good performance in driver use.
Introduction Contain two or more CPU share common memory and peripherals. Provide greater system throughput. Multiple processor executing simultaneous.
Operating Systems Unit 2: – Process Context switch Interrupt Interprocess communication – Thread Thread models Operating Systems.
UDP: User Datagram Protocol Chapter 12. Introduction Multiple application programs can execute simultaneously on a given computer and can send and receive.
Low Overhead Real-Time Computing General Purpose OS’s can be highly unpredictable Linux response times seen in the 100’s of milliseconds Work around this.
Major OS Components CS 416: Operating Systems Design, Spring 2001 Department of Computer Science Rutgers University
Interrupts and Exception Handling. Execution We are quite aware of the Fetch, Execute process of the control unit of the CPU –Fetch and instruction as.
Lecture 4 Page 1 CS 111 Online Modularity and Memory Clearly, programs must have access to memory We need abstractions that give them the required access.
Embedded Real-Time Systems Processing interrupts Lecturer Department University.
CS 286 Computer Organization and Architecture
CS 258 Reading Assignment 4 Discussion Exploiting Two-Case Delivery for Fast Protected Messages Bill Kramer February 13, 2002 #
Computer System Overview
Main Memory Background Swapping Contiguous Allocation Paging
Outline Module 1 and 2 dealt with processes, scheduling and synchronization Next two modules will deal with memory and storage Processes require data to.
Lecture Topics: 11/1 General Operating System Concepts Processes
Architectural Support for OS
Basic Mechanisms How Bits Move.
Translation Buffers (TLB’s)
CSE 451: Operating Systems Autumn 2003 Lecture 2 Architectural Support for Operating Systems Hank Levy 596 Allen Center 1.
CSE 451: Operating Systems Autumn 2001 Lecture 2 Architectural Support for Operating Systems Brian Bershad 310 Sieg Hall 1.
Translation Buffers (TLB’s)
CSE 451: Operating Systems Winter 2003 Lecture 2 Architectural Support for Operating Systems Hank Levy 412 Sieg Hall 1.
Architectural Support for OS
Review What are the advantages/disadvantages of pages versus segments?
Prof. Onur Mutlu Carnegie Mellon University
Chapter 13: I/O Systems “The two main jobs of a computer are I/O and [CPU] processing. In many cases, the main job is I/O, and the [CPU] processing is.
Presentation transcript:

Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 16: May 6, 2005 Fast Messaging

Caltech CS184 Spring DeHon 2 Today Message Handling Requirements Active Messages Processor/Network Interface Integration

Caltech CS184 Spring DeHon 3 What does message handler have to do? Send –allocate buffer to compose outgoing message –figure out destination address routing? –Format header info –compose data for send –put out on network copy to privileged domain check permissions copy to network device checksums?

Caltech CS184 Spring DeHon 4 What does message handler have to do? Receive –queue result –copy buffer from queue to privileged memory –check message intact (checksum) –message arrived right place? –Reorder messages? –Filter out duplicate messages? –Figure out which process/task gets message –check privileges –allocate space for incoming data –copy data to buffer in task –hand off to task –decode message type –dispatch on message type –handle message

Caltech CS184 Spring DeHon Message Handling nCube/2 160  s (360ns/byte) CM-5 86  s (120ns/byte) (compare processors with 33ns cycle) –86  30  2600 cycles

Caltech CS184 Spring DeHon 6 Functions: Destination Addressing/Routing Prefigure and put in pointer/object –hoist out of inner loop of computation –just a lookup TLB-like translation table –provide hardware support for common case

Caltech CS184 Spring DeHon 7 Functions: Allocation Why? In general, –messages may be arbitrarily long many dynamic sized objects... –may not be expecting message ? Remote procedure invocation Remote memory request Hand off to OS –OS user/consumption asynchronous to user process (lifetime unclear)

Caltech CS184 Spring DeHon 8 Functions: Allocation Pre-allocate outside of messaging –when expecting a message –? Standard sizes for common cases? Avoid copying –use shared memory issues with synchronization –direct from user space

Caltech CS184 Spring DeHon 9 Functions: Ordering Network may reorder messages –multiple paths with different lengths, congestion Not all tasks require ordering –(may be ordered at higher level in computation) –dataflow firing example What requires ordering?

Caltech CS184 Spring DeHon 10 Functions: Idempotence Failed Acknowledgments –may lead to multiple delivery of same message Idempotent operations –bit-set, bit-unset, Non-idempotent operations –increment, exchange How make idempotent? –TCP example

Caltech CS184 Spring DeHon 11 Functions: Protection Don’t want messages from other processes/entities –give away information (get) –destroy state (put) –perform operation (transfer funds)

Caltech CS184 Spring DeHon 12 Functions: Protection How manage? –Treat network as IO OS mediates (can) trust message stamps on network –Give network to user messaging hardware tags with process id filter messages on process tags (can) trust message stamps because of hardware –Cryptographic packet encoding

Caltech CS184 Spring DeHon 13 Functions: Checksum Message could be corrupted in transit –likely with high-bit rate, long interconnect (multiple chips…multiple boxes…) Wrong bits –in address –in message id –in data Typically solve in hardware

Caltech CS184 Spring DeHon 14 What does message handler have to do? Send –allocate buffer to compose outgoing message –figure out destination address routing? –Format header info –compose data for send –put out on network copy to privileged domain check permissions copy to network device checksums Not all messages require Hardware support Avoid (don’t do it)

Caltech CS184 Spring DeHon 15 What does message handler have to do? Receive –queue result –copy buffer from queue to privilege memory –check message intact (checksum) –message arrived right place? –Reorder messages? –Filter out duplicate messages? –Figure out which process/task gets message –check privileges –allocate space for incoming data –copy data to buffer in task –hand off to task –decode message type –dispatch on message type –handle message Not all messages require Hardware support Avoid (don’t do it)

Caltech CS184 Spring DeHon 16 End-to-End Variant of the primitives argument Applications/tasks have different requirements/needs Attempt to provide in the network –mismatch –unnecessary Network should be minimal –let application do just what it needs

Caltech CS184 Spring DeHon 17 Active Messages Message contains PC of code to run –destination –message handler PC –data Receiver pickups PC and runs [similar to J-Machine, conv. CPU]

Caltech CS184 Spring DeHon 18 Active Message Dogma Integrate the data directly into the computation Short Runtime –get back to next message, allows to run directly Non-blocking No allocation Runs to completion...Make fast case common

Caltech CS184 Spring DeHon 19 User Level NI Access Avoids context switch Viable if hardware manages process filtering –Single process owns network? –Process tags used to filter into queues (compare process ID tags in virtually mapped cache…register renaming in SMT)

Caltech CS184 Spring DeHon 20 Hardware Support I Checksums Routing ID and route mapping Process ID stamping/checking Low-level formatting

Caltech CS184 Spring DeHon 21 What does AM handler do? Send –compose message destination receiving PC data –copy/queue to NI Receive –pickup PC –dispatch to PC –handler dequeues data into place in computation –[maybe more depending on application] idempotence ordering synchronization

Caltech CS184 Spring DeHon 22 Example: PUT Handler Message: –remote node id –put handler (PC) 0 –remote addr 1 –data length 2 –(flag addr) 3 –data 4 No allocation Idempotent Reciever: poll –r1 = packet_pres –beq r1 0 poll –r2=packet(0) // handler –branch r2 put_handler –r3=packet(1) // addr –r4=packet(2) // len –r6=packet+4 // first datum –r5=r6+r4 // end mdata –*r3=packet(r6) –r6++ –blt r6,r5 mdata –consume packet –goto poll

Caltech CS184 Spring DeHon 23 Example: GET Handler Message Request 1.remote node 2.get handler 3.local addr 4.data length 5.(flag addr) 6.local node 7.remote addr Message Reply can just be a PUT message –put into specified local address

Caltech CS184 Spring DeHon 24 Example: GET Handler get_handler –out_packet(0)=packet(6) –out_packet(1)=put_handler –out_packet(2)=packet(3) –out_packet(3)=packet(4) –r6=4 –r7=packet(7) –r5=packet(4) –r5=r5+4 mdata –packet(r6)=*r7 –r6++ –r7++ –blt r6,r5 mdata –consume packet –send message –goto poll Message Request 1.remote node 2.get handler 3.local addr 4.data length 5.(flag addr) 6.local node 7.remote addr

Caltech CS184 Spring DeHon 25 Example: DF Inlet synch Consider 3 input node (e.g. add3) –“inlet handler” for each incoming data –set presence bit on arrival –compute node when all present

Caltech CS184 Spring DeHon 26 Example: DF Inlet Synch inlet message –node –inlet_handler –frame base –data_addr –flag_addr –data_pos –data Inlet –move data to addr –set appropriate flag –if all flags set enable DF node computation ? Care not enable multiple times?

Caltech CS184 Spring DeHon 27 Interrupts vs. Polling What happens on message reception? Interrupts –cost context switch interrupt to kernel save state –force attention to the network guarantee get messages out of input queue in a timely fashion

Caltech CS184 Spring DeHon 28 Interrupts vs. Polling Polling –if getting many messages to same process –message handlers short / bounded time –may be fine to just poll between handlers –requires: user-level/fine-grained scheduling guarantee will get back to –avoid context switch cost

Caltech CS184 Spring DeHon 29 Interrupts vs. Polling Can be used together to minimize cost –poll network interface during batch handling of messages –interrupt to draw attention back to network if messages sit around too long –polling works for same process –interrupt if different process common case is work on same process for a while

Caltech CS184 Spring DeHon 30 AM vs. JM J-Machine handlers can fault/stall –touch futures… J-Machine fast context with small state –not get to exploit rich context/state AM exploits register locality by scheduling together larger block of data –processing related handlers together (same context) –more in two week (look at TAM)

Caltech CS184 Spring DeHon 31 Active Message Results CM5 (user-level messaging) –send 1.6  s [50 instructions] –receive/dispatch 1.7  s nCube/2 (OS must intervene) –send 11  s [21 instructions] –receive 15  s [34 instructions] Myrinet (GM) –6.5  s end-to-end GMs –1-2  s host processing time

Caltech CS184 Spring DeHon 32 Hardware Support II Roll presence tests into dispatch compose message data from registers common case –reply support –message types Integrate network interface as functional unit

Caltech CS184 Spring DeHon 33 Presence Dispatch Handler PC in common location Have hardware supply null handler PC when no messages current Poll: –read MsgPC into R1 –branch R1 Also use to handle cases and priorities –by modifying a few bits of dispatch address –e.g. queues full/empty

Caltech CS184 Spring DeHon 34 Compose from Registers Put together message in registers –reuse data from message to message –compute results directly into target –user register renaming and scoreboarding to continue immediately while data being queued

Caltech CS184 Spring DeHon 35 Common Case Msg/Replies Instructions to –fill in common data on replies node address, handler? –Indicate message type not have to copy

Caltech CS184 Spring DeHon 36 Example: GET handler Get_handler –R1=i0 // address from message register –R2=*R1 –o2=R2 // value into output data register –SEND -reply type=reply_mesg_id –NEXT

Caltech CS184 Spring DeHon 37 AM as Primitive Model Value of Active Messages –articulates a model of what primitive messaging needs to be –identify key components –then can optimize against how much hardware to support? What should be in hardware/software? What are common cases? –Should get special treatment?

Caltech CS184 Spring DeHon 38 Big Ideas Primitives/Mechanisms –End-to-end/common case don’t burden everything with features needed by only some tasks Abstract Model Separate essential/useful work from overhead

Caltech CS184 Spring DeHon 39 [MSB-1] Ideas Minimize Overhead –moving data around is not value added operation –minimize copying Overlap Compute and Communication –queue –don’t force send/receive rendevousz Get the OS out of the way of common operations