CS184c: Computer Architecture [Parallel and Multithreaded]

Slides:



Advertisements
Similar presentations
More on Processes Chapter 3. Process image _the physical representation of a process in the OS _an address space consisting of code, data and stack segments.
Advertisements

Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
Translation Buffers (TLB’s)
State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.
ECE669 L19: Processor Design April 8, 2004 ECE 669 Parallel Computer Architecture Lecture 19 Processor Design.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 6: April 19, 2001 Dataflow.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day3:
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 15: May 4, 2005 Message Passing.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 8: April 26, 2001 Simultaneous Multi-Threading (SMT)
IT253: Computer Organization
The Mach System Abraham Silberschatz, Peter Baer Galvin, Greg Gagne Presentation By: Agnimitra Roy.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 10: May 8, 2001 Synchronization.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 11: April 30, 2003 Multithreading.
Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 7: April 24, 2001 Threaded Abstract Machine (TAM) Simultaneous.
On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 1: April 3, 2001 Overview and Message Passing.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 14: May 7, 2003 Fast Messaging.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 9: May 3, 2001 Distributed Shared Memory.
Introduction to Operating Systems Concepts
Computer Organization and Architecture Lecture 1 : Introduction
CS161 – Design and Architecture of Computer
Translation Lookaside Buffer
CMSC 611: Advanced Computer Architecture
Computer System Overview
Chapter 1 Computer System Overview
ECE232: Hardware Organization and Design
CS161 – Design and Architecture of Computer
Visit for more Learning Resources
Session 3 Memory Management
A Real Problem What if you wanted to run a program that needs more memory than you have? September 11, 2018.
Simultaneous Multithreading
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
Chapter 2 Processes and Threads Today 2.1 Processes 2.2 Threads
Embedded Systems Design
Multi-Processing in High Performance Computer Architecture:
Architecture Background
Morgan Kaufmann Publishers
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
Chapter 10 The Stack.
Hyperthreading Technology
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
Lecture 4: MIPS Instruction Set
CS 258 Reading Assignment 4 Discussion Exploiting Two-Case Delivery for Fast Protected Messages Bill Kramer February 13, 2002 #
Computer System Overview
Lecture Topics: 11/1 General Operating System Concepts Processes
Architectural Support for OS
Translation Buffers (TLB’s)
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
CSE 451: Operating Systems Autumn 2003 Lecture 2 Architectural Support for Operating Systems Hank Levy 596 Allen Center 1.
Prof. Leonardo Mostarda University of Camerino
CSE 451: Operating Systems Autumn 2001 Lecture 2 Architectural Support for Operating Systems Brian Bershad 310 Sieg Hall 1.
Translation Buffers (TLB’s)
What is Computer Architecture?
CSC3050 – Computer Architecture
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
Chapter 1 Computer System Overview
What is Computer Architecture?
Operating Systems: A Modern Perspective, Chapter 6
Implementing Processes, Threads, and Resources
Threads vs. Processes Hank Levy 1.
CSE 451: Operating Systems Winter 2003 Lecture 2 Architectural Support for Operating Systems Hank Levy 412 Sieg Hall 1.
Computer System Overview
Architectural Support for OS
Translation Buffers (TLBs)
Review What are the advantages/disadvantages of pages versus segments?
Lecture 12 Input/Output (programmer view)
Prof. Onur Mutlu Carnegie Mellon University
Chapter 13: I/O Systems “The two main jobs of a computer are I/O and [CPU] processing. In many cases, the main job is I/O, and the [CPU] processing is.
Presentation transcript:

CS184c: Computer Architecture [Parallel and Multithreaded] Day 2: April 5, 2001 Message Passing Mechanisms CALTECH cs184c Spring2001 -- DeHon

Today Message Driven Processor Mechanisms for Multiprocessing Engineering “Low cost” messaging CALTECH cs184c Spring2001 -- DeHon

Problem 1 Messages take milliseconds (1000s of cycles) Forces use of course-grained parallelism Speedup = Tseq/Tmp = cseq  Np /cmp cseq /cmp ~= t(comp) / (t(comm)+ t(comp)) driven to make t(comp) >> t(comm) CALTECH cs184c Spring2001 -- DeHon

Problem 2 Potential parallelism is costly additional communication cost is born even when sequentialized (same node) Process to process switch expensive Discourages exposing maximum parallelism works against simple/scalable model CALTECH cs184c Spring2001 -- DeHon

Bad Cost Model Challenge Here give programmer a simple model of how to write good programs Here expose parallelism increases but has cost expose too much will decrease hard for user to know which CALTECH cs184c Spring2001 -- DeHon

Bad Model Poor User-level abstraction: user should not be picking granularity of exploited parallelism this should be done by tools CALTECH cs184c Spring2001 -- DeHon

Cosmic Cube Used commodity hardware Showed off the shelf solution components not engineered for parallel scenario Showed could get benefit out of parallelism exposed issues need to address to do it right …why need to do something different CALTECH cs184c Spring2001 -- DeHon

Design for Parallelism To do it right need to engineer for parallelism Optimize key common cases here Figuring out what goes in hardware vs. software CALTECH cs184c Spring2001 -- DeHon

Vision: MDP/Mosaic Single-chip, commodity building block [today, tile to step and repeat on die] contains all computing components compute: sequential processor interconnect in space: net interface + network interconnect in time: memory Step-and-repeat competent uP avoid diminishing returns trying to build monolithic processor CALTECH cs184c Spring2001 -- DeHon

Message Driven Processor “Mechanism” Driven Processor? Study mechanisms needed for a parallel processing node address problems saw in using existing View as low-level (hardware) model underlies range of compute models shared memory, dataflow, data parallel CALTECH cs184c Spring2001 -- DeHon

Philosophy of MDP mechanisms=primitives like RISC focus on primitives from which to build powerful operations common support not model specific like RISC not language specific Hardware/software interface what should hardware support/provide vs. what should be composed in software CALTECH cs184c Spring2001 -- DeHon

MP Primitives SEND message self [hardware] routed network message dispatch fast context switch naming/translation support synchronization CALTECH cs184c Spring2001 -- DeHon

MDP Components [Dally et. al. IEEE Micro 4/92] CALTECH cs184c Spring2001 -- DeHon

MDP Organization [Dally et. al. ICCD’92] CALTECH cs184c Spring2001 -- DeHon

Message Send Ops to make “atomic” SEND, SEND2 SENDE, SEND2E ends messages to make “atomic” SEND{2} disable interrupts SEND{2}E reenable CALTECH cs184c Spring2001 -- DeHon

Message Send Sequence Send R0,0 SEND2 R1,R2,0 SEND2E R2,[3,A3],0 first word is destination node address priority 0 SEND2 R1,R2,0 opcode at receiver (translated to instr ptr) data SEND2E R2,[3,A3],0 data and end message CALTECH cs184c Spring2001 -- DeHon

MDP Messages Few cycles to inject Not doing translation here have to map from process to processor before can send done by user code? Trust user code? Deliver to operation (address) on other end receiver translates op to address no protection CALTECH cs184c Spring2001 -- DeHon

Network 3D Mesh hardware routed messages can backup wormhole minimal buffering dimension order routing hardware routed orthogonal to node except enter/exit contrast transputer messages can backup …all the way to sender CALTECH cs184c Spring2001 -- DeHon

Context Switch Why context switch expensive? Exchange state (save/restore) Registers PC, etc. TLB/cache... CALTECH cs184c Spring2001 -- DeHon

Fast Context Switch General technique: Machine Tool analogy internal vs. external setup Machine Tool analogy Double-buffering CALTECH cs184c Spring2001 -- DeHon

Fast Context Switch Provide separate sets of Registers trade space (more, large registers) easier for MDP with small # of regs for speed Don’t have to go through serialized load/store Probably also have to assure minimal/necessary handling code in fast memory CALTECH cs184c Spring2001 -- DeHon

MDP State CALTECH cs184c Spring2001 -- DeHon

Message Dispatch Incoming message queued by priority If higher priority than running (and interrupts enabled), will start running few cycles to switch to “create” new task Terminated with suspend instruction removes message from input queue CALTECH cs184c Spring2001 -- DeHon

Message Dispatch Idle MPD start running message after 3 cycles set instruction pointer create new message segment A3 is message pointer CALTECH cs184c Spring2001 -- DeHon

Message Handler: CALL MOVE [1,A3],R0 ; get method ID XLATE R0,A0 ; translate to address LDIP INITIAL_IP ; branch w/in seg CALTECH cs184c Spring2001 -- DeHon

Translation XLATE ENTER PROBE associative lookup cache/TLB/mapping primitive ENTER place an entry in associative table may evict entry PROBE CALTECH cs184c Spring2001 -- DeHon

Translation XLATE used to map global ids to local memory could be used to map processes to processors? CALTECH cs184c Spring2001 -- DeHon

Synchronization Future tags on data [we’ll talk about futures later] CALTECH cs184c Spring2001 -- DeHon

Example Combining Tree Each node in tree collects up results from its children Combines results (e.g. add) sends combined result to parent Used to collect results of distributed computation CALTECH cs184c Spring2001 -- DeHon

Sample code: Combining Tree COMBINE: MOVE [1,A3],COMB MOVE [2,A3], R1 ADD R1,COMB.v,R1 MOVE R1,COMB.v MOVE COMB.cnt,R2 ADD R2,-1,R2 MOVE R2,COMB.cnt BNZ R2, DONE MOVE HEADER,R0 SEND2 COMB.pnode,R0 SEND2E COMB.paddr,R1 DONE: suspend CALTECH cs184c Spring2001 -- DeHon

MDP Area CALTECH cs184c Spring2001 -- DeHon

MDP Area Memory ~50% Processor ~33% Net ~10% CALTECH cs184c Spring2001 -- DeHon

J-Machine CALTECH cs184c Spring2001 -- DeHon

Performance Base communication: 1ms node to node Empty ping: 3-7ms round trip depends on distance 43 cycles round trip for node pinging self MDP 12.5 MIPs 2 MIPs when fetching instructions from external memory CALTECH cs184c Spring2001 -- DeHon

Performance Results Note: all relative to MDP; not show slowdown to parallel code and MDP. [Noakes, Wallach Dally ISCA’93] CALTECH cs184c Spring2001 -- DeHon

Time Decomposition [Noakes, Wallach Dally ISCA’93] CALTECH cs184c Spring2001 -- DeHon

Other Lessons “Mechanisms” important for uniprocessor performance important here as well hardware memory hierarchy management caching, TLB floating point hardware large register set CALTECH cs184c Spring2001 -- DeHon

Observation Anything with a different programming model is hard to sell …especially if some component of your machine is worse than conventional alternatives communication in Cosmic Cube scalar (esp. FP) performance in J-Machine CALTECH cs184c Spring2001 -- DeHon

Non-Lessons Balance Network network overpowered for node 3 speed of external memory Network dimension order routing “efficiency” of wire utilization [will return to in week 8] CALTECH cs184c Spring2001 -- DeHon

Follow ons... M-Machine (research) Cray T3D ASCII Red CALTECH cs184c Spring2001 -- DeHon

Modern Design Doesn’t need completely custom ISA (at least, MDP wasn’t benefiting from) needed: send, suspend Hardware managed hierarchy cache, TLB Similar hardware for process/processor mapping CALTECH cs184c Spring2001 -- DeHon

Big Ideas Common Case Primitives Grabbed from CS184b Day3! Big Ideas Common Case Primitives Highly specialized instructions [hardware mechanisms?] brittle Design pulls simplify processor implementation simplify coding CALTECH cs184c Spring2001 -- DeHon

Big Ideas Compiler: fill in gap between user and hardware architecture good idea, not being exploited here Need different/additional primitives for handling parallel cooperation efficiently communication cheap process virtualization CALTECH cs184c Spring2001 -- DeHon