Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 15: May 4, 2005 Message Passing
Caltech CS184 Spring DeHon 2 Today Message Passing Model Examples Performance Issues Design for Multiprocessing –Engineering “Low cost” messaging
Caltech CS184 Spring DeHon 3 Message Passing Simple extension to Models –Compute Model –Programming Model –Architecture Low-level
Caltech CS184 Spring DeHon 4 Message Passing Model Collection of sequential processes Processes may communicate with each other (messages) –send –receive Each process runs sequentially –has own address space Abstraction is each process gets own processor (essentially multi-tasking model)
Caltech CS184 Spring DeHon 5 Programming for MP Have a sequential language –C, C++, Fortran, Lisp… Add primitives (system calls) –send –receive –spawn –(multitasking primitives with no shared memory)
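As a concrete (illustrative, not from the lecture) rendering of these primitives, here is a minimal ping in C using MPI, which appears later in the deck as a follow-on; the value, tag, and rank choices are made up for the sketch, and "spawn" is done at job launch:

#include <mpi.h>

/* Every rank runs this same sequential program in its own address space;
   "spawn" happens at launch time (e.g. mpirun -np 2 ./ping). */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double val = 3.14;
    if (rank == 0)
        MPI_Send(&val, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
    else if (rank == 1)
        MPI_Recv(&val, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* receive from rank 0 */
    MPI_Finalize();
    return 0;
}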
Caltech CS184 Spring DeHon 6 Architecture for MP Sequential Architecture for processing node –add network interfaces –(processes have own address space) Add network connecting processors …minimally sufficient...
Caltech CS184 Spring DeHon 7 MP Architecture Virtualization Processes virtualize nodes [0,1,infinity] –size independent/scalable Virtual connections between processes –placement independent communication
Caltech CS184 Spring DeHon 8 MP Example and Performance Issues
Caltech CS184 Spring DeHon 9 N-Body Problem Compute pairwise gravitational forces Integrate positions
Caltech CS184 Spring DeHon 10 Coding // params position, mass…. F=0 For i = 1 to N –send my params to p[body[i]] –get params from p[body[i]] –F+=force(my params, params) Update pos, velocity Repeat
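A hedged MPI sketch of the loop above, assuming one body per rank (so p[body[i]] is simply rank i), params packed as four doubles {x, y, z, mass}, and myrank/N set up via MPI_Comm_rank/MPI_Comm_size; accumulate_force is a hypothetical helper standing in for F += force(...):

double mine[4], theirs[4], F[3] = {0.0, 0.0, 0.0};
for (int i = 0; i < N; i++) {
    if (i == myrank) continue;                     /* skip self */
    /* send my params to body i's process and get its params back;
       MPI_Sendrecv sidesteps send/receive ordering and buffering worries */
    MPI_Sendrecv(mine,   4, MPI_DOUBLE, i, 0,
                 theirs, 4, MPI_DOUBLE, i, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    accumulate_force(F, mine, theirs);             /* F += force(my params, params) */
}
/* update position/velocity from F, then repeat for the next time step */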
Caltech CS184 Spring DeHon 11 Performance Body work ~= cN N processes Cycle work ~= cN^2 Ideal N_p-processor time: cN^2/N_p
Caltech CS184 Spring DeHon 12 Performance Sequential Body work: –read N values –compute N force updates –compute pos/velocity from F and params c=t(read value) + t(compute force) + t(write value)
Caltech CS184 Spring DeHon 13 Performance MP Body work: –send N messages –receive N messages –compute N force updates –compute pos/velocity from F and params c=t(send message) + t(receive message) + t(compute force)
Caltech CS184 Spring DeHon 14 Send/receive t(receive) –wait on message delivery –swap to kernel –copy data –return to process t(send) –similar t(send), t(receive) >> t(read value)
Caltech CS184 Spring DeHon 15 Sequential vs. MP T_seq = c_seq N^2 T_mp = c_mp N^2/N_p Speedup = T_seq/T_mp = c_seq N_p/c_mp Assuming no waiting: –c_seq/c_mp ~= (t(read)+t(write)) / (t(send)+t(rcv))
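To make that ratio concrete (numbers purely illustrative): if a local read+write costs a few cycles but a send+receive through the OS costs thousands of cycles, then c_seq/c_mp is on the order of 10^-3, and even N_p = 64 processors gives a "speedup" of roughly 64 x 10^-3 ~= 0.06 — slower than the sequential code.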
Caltech CS184 Spring DeHon 16 Waiting? Assume no network congestion: must wait L(net) time after a message is sent before it can be received If insufficient parallelism –latency dominates performance
Caltech CS184 Spring DeHon 17 Dertouzos Latency Bound Speedup Upper Bound –processes / Latency
Caltech CS184 Spring DeHon 18 Dertouzos Latency Bound Speedup Upper Bound –processes / Latency
Caltech CS184 Spring DeHon 19 Coding/Waiting For i = 1 to N –send my params to p[body[i]] –get params from p[body[i]] –F+=force(my params, params) How long does processor i wait for the first datum? –Parallelism profile? Is queuing optional?
Caltech CS184 Spring DeHon 20 More Parallelism For i = 1 to N –send my params to p[body[i]] For i = 1 to N –get params from p[body[i]] –F+=force(my params, params)
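In the MPI rendering used earlier (same assumptions: one body per rank, illustrative names), this restructuring corresponds to posting all the communication as non-blocking operations up front and waiting only when each datum is actually needed:

MPI_Request sreq[N], rreq[N];
double theirs[N][4];
for (int i = 0; i < N; i++) sreq[i] = rreq[i] = MPI_REQUEST_NULL;

for (int i = 0; i < N; i++) {                      /* post everything up front */
    if (i == myrank) continue;
    MPI_Isend(mine,      4, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &sreq[i]);
    MPI_Irecv(theirs[i], 4, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &rreq[i]);
}
for (int i = 0; i < N; i++) {                      /* consume as data arrives */
    if (i == myrank) continue;
    MPI_Wait(&rreq[i], MPI_STATUS_IGNORE);         /* block only for this datum */
    accumulate_force(F, mine, theirs[i]);
}
MPI_Waitall(N, sreq, MPI_STATUSES_IGNORE);         /* retire the sends */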
Caltech CS184 Spring DeHon 21 Dispatching Multiple processes on a node Who to run? –Can a receive block while waiting?
Caltech CS184 Spring DeHon 22 Dispatching Abstraction is each process gets own processor If a receive blocks (holds the processor) –it may prevent another process it depends on from running Consider 2-body problem on 1 node
Caltech CS184 Spring DeHon 23 Seitz Coding [Seitz/CACM’85: Fig. 5]
Caltech CS184 Spring DeHon 24 MP Issues
Caltech CS184 Spring DeHon 25 Expensive Communication Process to process communication goes through operating system –system call, process switch –exit processor, network, enter processor –system call, process switch Milliseconds? –Thousands to millions of cycles...
Caltech CS184 Spring DeHon 26 Why OS involved? Protection/Isolation –can this process send/receive with this other process? Translation –where does this message need to go? Scheduling –who can/should run now?
Caltech CS184 Spring DeHon 27 Issues Process Placement –locality –load balancing Cost for excessive parallelism –E.g. N-body on N_p < N processors? Message hygiene –ordering, single delivery, buffering Deadlock –user-introduced, system-introduced
Caltech CS184 Spring DeHon 28 Low-Level Model Places burden on user [too much] –decompose problem explicitly (sequential chunk size not abstract; a scalability weakness in the architecture) –guarantee correctness in face of non-determinism –placement/load-balancing in some systems Gives considerable explicit control
Caltech CS184 Spring DeHon 29 Low-Level Primitives Has the necessary primitives for multiprocessor cooperation Maybe an appropriate compiler target? –Architecture model, but not programming/compute model?
Caltech CS184 Spring DeHon 30 Problem 1 Messages take milliseconds –(1000s to 10^6s of cycles) Forces use of coarse-grained parallelism –Speedup = T_seq/T_mp = c_seq N_p/c_mp –c_seq/c_mp ~= t(comp) / (t(comm)+t(comp)) –driven to make t(comp) >> t(comm)
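As a rough worked example (illustrative numbers): with t(comm) ~ 10^5 cycles per message, keeping c_seq/c_mp above 0.9 requires t(comp) ~ 10^6 cycles of useful work per message — i.e., very coarse-grained chunks.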
Caltech CS184 Spring DeHon 31 Problem 2 Potential parallelism is costly –additional communication cost is borne even when sequentialized (same node) Process to process switch expensive Discourages exposing maximum parallelism –works against simple/scalable model
Caltech CS184 Spring DeHon 32 Bad Cost Model Challenge –give programmer a simple model of how to write good programs Here –exposing parallelism increases performance but has a cost –exposing too much decreases performance –hard for user to know which
Caltech CS184 Spring DeHon 33 Bad Model Poor User-level abstraction: user should not be picking granularity of exploited parallelism –this should be done by tools
Caltech CS184 Spring DeHon 34 Cosmic Cube Early 1980s Used commodity hardware –off-the-shelf solution –components not engineered for the parallel scenario Showed –could get benefit out of parallelism –exposed issues that need to be addressed to do it right –…why we need to do something different
Caltech CS184 Spring DeHon 35 Mechanisms…
Caltech CS184 Spring DeHon 36 Design for Parallelism To do it right –need to engineer for parallelism Optimize key common cases here Figuring out: – what goes in hardware vs. software
Caltech CS184 Spring DeHon 37 Vision: MDP/Mosaic Single-chip, commodity building block –[today, tile to step and repeat on die] –contains all computing components compute: sequential processor interconnect in space: net interface + network interconnect in time: memory Step-and-repeat competent µP –avoid diminishing returns trying to build monolithic processor
Caltech CS184 Spring DeHon 38 Message Driven Processor “Mechanism” Driven Processor? –Study mechanisms needed for a parallel processing node –address problems seen in using existing machines View as low-level (hardware) model –underlies range of compute models shared memory, dataflow, data parallel [Dally et al./IEEE Micro, April 1992]
Caltech CS184 Spring DeHon 39 Philosophy of MDP mechanisms=primitives –like RISC focus on primitives from which to build powerful operations common support not model specific –like RISC not language specific Hardware/software interface –what should hardware support/provide –vs. what should be composed in software
Caltech CS184 Spring DeHon 40 MP Primitives –SEND message –self-[hardware-]routed network –message dispatch –fast context switch –naming/translation support –synchronization
Caltech CS184 Spring DeHon 41 MDP Components [Dally et. al. IEEE Micro 4/92]
Caltech CS184 Spring DeHon 42 MDP Organization [Dally et. al. ICCD’92]
Caltech CS184 Spring DeHon 43 Message Send Ops –SEND, SEND2 –SENDE, SEND2E end the message To make the send “atomic”: –SEND{2} disables interrupts –SEND{2}E re-enables them
Caltech CS184 Spring DeHon 44 Message Send Sequence
SEND R0,0            ; first word is destination node address; priority 0
SEND2 R1,R2,0        ; opcode at receiver (translated to instr ptr); data
SEND2E R2,[3,A3],0   ; data and end message
Caltech CS184 Spring DeHon 45 MDP Messages Few cycles to inject Not doing translation here –have to map from process to processor before sending done by user code? Trust user code? –Delivered to an operation (address) on the other end receiver translates op to address no protection
Caltech CS184 Spring DeHon 46 Network 3D Mesh –wormhole –minimal buffering –dimension order routing hardware routed –orthogonal to node except enter/exit messages can back up –…all the way to sender
Caltech CS184 Spring DeHon 47 Context Switch Why context switch expensive? –Exchange state (save/restore) Registers PC, etc. TLB/cache...
Caltech CS184 Spring DeHon 48 Fast Context Switch General technique: –internal vs. external setup Machine Tool analogy Pattern: Double-buffering Modern: SMT … fast change among running contexts…
Caltech CS184 Spring DeHon 49 Fast Context Switch Provide separate sets of registers –trade space (more, larger register files) easier for MDP with small # of regs –for speed Don’t have to go through serialized load/store of state Probably also need to ensure the minimal/necessary handling code is in fast memory
Caltech CS184 Spring DeHon 50 MDP State
Caltech CS184 Spring DeHon 51 Message Dispatch Incoming message queued by priority If higher priority than the running task (and interrupts enabled), it starts running –few cycles to switch to (“create”) the new task Terminated with suspend instruction –removes message from input queue
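A rough software analogy of that dispatch rule (on the MDP it is done in hardware in a few cycles; the MsgQueue type and queue/handler helpers here are purely illustrative):

typedef struct { int handler_id; /* ... payload words ... */ } Message;

/* queues[1] = priority 1 (high), queues[0] = priority 0 (low); hypothetical helpers */
void dispatch(MsgQueue queues[2], int running_priority, int interrupts_enabled) {
    for (int p = 1; p > running_priority; p--) {   /* only preempt for strictly higher priority */
        if (interrupts_enabled && !queue_empty(&queues[p])) {
            Message *m = queue_head(&queues[p]);   /* message stays queued while handler runs */
            run_handler(m);                        /* handler code ends with "suspend" */
            queue_pop(&queues[p]);                 /* suspend removes message from input queue */
            return;
        }
    }
}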
Caltech CS184 Spring DeHon 52 Message Dispatch An idle MDP starts running a message after 3 cycles –set instruction pointer –create new message segment –A3 is message pointer
Caltech CS184 Spring DeHon 53 Message Handler: CALL
MOVE [1,A3],R0    ; get method ID
XLATE R0,A0       ; translate to address
LDIP INITIAL_IP   ; branch w/in seg
Caltech CS184 Spring DeHon 54 Translation XLATE –associative lookup –cache/TLB/mapping primitive ENTER –Used to place an entry in associative table –may evict entry PROBE
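An illustrative-only software model of what such an associative translation table provides (the MDP implements this as a hardware structure; the table size, hash, and eviction behavior here are made up):

#define XTABLE_SIZE 256
typedef struct { int key; int value; int valid; } XEntry;
static XEntry xtable[XTABLE_SIZE];

void xlate_enter(int key, int value) {                /* ENTER: insert, possibly evicting */
    XEntry *e = &xtable[(unsigned)key % XTABLE_SIZE];
    e->key = key; e->value = value; e->valid = 1;
}

int xlate(int key, int *value) {                      /* XLATE/PROBE: associative lookup */
    XEntry *e = &xtable[(unsigned)key % XTABLE_SIZE];
    if (e->valid && e->key == key) { *value = e->value; return 1; }
    return 0;                                         /* miss: falls back to software on a real MDP */
}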
Caltech CS184 Spring DeHon 55 Translation XLATE used to map global ids to local memory could be used to map processes to processors?
Caltech CS184 Spring DeHon 56 Synchronization Future tags on data –[we’ll talk about futures in the future]
Caltech CS184 Spring DeHon 57 Example Combining Tree –Each node in tree collects up results from its children –Combines results (e.g. add) –sends combined result to parent Used to collect results of distributed computation
Caltech CS184 Spring DeHon 58 Sample code: Combining Tree
COMBINE: MOVE [1,A3],COMB
         MOVE [2,A3],R1
         ADD R1,COMB.v,R1
         MOVE R1,COMB.v
         MOVE COMB.cnt,R2
         ADD R2,-1,R2
         MOVE R2,COMB.cnt
         BNZ R2,DONE
         MOVE HEADER,R0
         SEND2 COMB.pnode,R0
         SEND2E COMB.paddr,R1
DONE:    suspend
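In rough C terms, the handler above does something like the following (the field names v, cnt, pnode, and paddr come from the slide; the Comb type and send helper are illustrative):

typedef struct { double v; int cnt; int pnode; int paddr; } Comb;

/* invoked once per arriving child message carrying one contribution */
void combine(Comb *comb, double contribution) {
    comb->v += contribution;                   /* fold the child's result into the running value */
    if (--comb->cnt == 0)                      /* last child has arrived */
        send(comb->pnode, comb->paddr,         /* pass combined result up to the parent node */
             comb->v);
    /* handler ends (suspend): message removed from the input queue */
}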
Caltech CS184 Spring DeHon 59 MDP Area [table: component sizes for the µm CMOS implementation]
Caltech CS184 Spring DeHon 60 MDP Area Memory ~50% Processor ~33% Net ~10%
Caltech CS184 Spring DeHon 61 J-Machine
Caltech CS184 Spring DeHon 62 Performance Base communication: 1 µs node to node Empty ping: 3–7 µs round trip –depends on distance –43 cycles round trip for node pinging self MDP 12.5 MIPs –2 MIPs when fetching instructions from external memory
Caltech CS184 Spring DeHon 63 Performance Results Note: all results are relative to the MDP; this does not show the slowdown from moving to parallel code and the MDP. [Noakes, Wallach, Dally ISCA’93]
Caltech CS184 Spring DeHon 64 Time Decomposition [Noakes, Wallach, Dally ISCA’93]
Caltech CS184 Spring DeHon 65 Other Lessons “Mechanisms” important for uniprocessor performance are important here as well –hardware memory hierarchy management caching, TLB –floating point hardware –large register set
Caltech CS184 Spring DeHon 66 Observation Anything with a different programming model is hard to sell …especially if some component of your machine is worse than conventional alternatives –communication in Cosmic Cube –scalar (esp. FP) performance in J-Machine
Caltech CS184 Spring DeHon 67 Modern Design Doesn’t need completely custom ISA –(at least, MDP wasn’t benefiting from one) –needed: send, suspend Hardware-managed hierarchy –cache, TLB Similar hardware for process/processor mapping
Caltech CS184 Spring DeHon 68 Non-Lessons Balance –network overpowered for node –3× speed of external memory Network –dimension order routing –“efficiency” of wire utilization –[will return to …]
Caltech CS184 Spring DeHon 69 Follow ons... M-Machine (research) –Clustered (VLIW-like) node Cray T3D/T3E –Alphas for nodes, 3-cube packet net ASCI Red MPI Myrinet
Caltech CS184 Spring DeHon 70 Big Ideas MP has minimal primitives –appropriate low-level model –too raw/primitive for user model Communication essential component –can be expensive –doing well is necessary to get good performance (come out ahead) –watch OS cost...
Caltech CS184 Spring DeHon 71 Big Ideas Common Case Primitives Highly specialized instructions [hardware mechanisms?] are brittle Design pulls –simplify processor implementation –simplify coding
Caltech CS184 Spring DeHon 72 Big Ideas Compiler: fill in gap between user and hardware architecture –good idea, not being exploited here Need different/additional primitives for handling parallel cooperation efficiently –communication –cheap process virtualization