Caltech CS184 Spring 2003 -- DeHon. CS184b: Computer Architecture (Abstractions and Optimizations). Day 17: May 14, 2003. Threaded Abstract Machine.

Today
– Fine-Grained Threading
– Split-Phase
– TAM

Fine-Grained Threading
We are familiar with multiple threads of control
– Multiple PCs
The differences lie in power / weight
– Cost to switch / associated state
– What each thread can do
Power
– Exposing parallelism
– Hiding latency

Fine-Grained Threading
A computational model with explicit parallelism and synchronization

Split-Phase Operations
Separate the request and response sides of an operation
– Idea: tolerate long-latency operations
Contrast with waiting on the response

Canonical Example: Memory Fetch
Conventional
– Perform read
– Stall waiting on reply
– Hold the processor resource while waiting
Optimizations
– Prefetch memory
– Then access it later
Goal: separate request and response

Split-Phase Memory
Send the memory fetch request
– Have the reply go to a different thread
The next thread is enabled on reply
Go off and run the rest of this thread (or other threads) between request and reply
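As a concrete illustration, the request/response split above can be sketched in a few lines of Python. This is a minimal sketch, not TAM itself: the names `split_phase_read`, `deliver_replies`, and the scheduler loop are illustrative assumptions.

```python
from collections import deque

ready = deque()          # threads enabled and waiting to run
in_flight = []           # (continuation, value) replies not yet delivered

def split_phase_read(memory, addr, continuation):
    """Request side: issue the fetch and name the thread the reply enables."""
    in_flight.append((continuation, memory[addr]))

def deliver_replies():
    """Response side: each arriving reply enables its continuation thread."""
    while in_flight:
        cont, value = in_flight.pop(0)
        ready.append(lambda c=cont, v=value: c(v))

def run_all():
    """Run every enabled thread; collect any values they produce."""
    out = []
    while ready:
        r = ready.popleft()()
        if r is not None:
            out.append(r)
    return out

memory = {0x10: 42}
split_phase_read(memory, 0x10, lambda v: v + 1)  # request, then move on
ready.append(lambda: None)    # "rest of this thread" runs in the meantime
deliver_replies()             # replies arrive, enabling the next thread
results = run_all()           # [43]
```

The point is that the requester never stalls: it names a continuation and keeps running, and the reply schedules that continuation later.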

Prefetch vs. Split-Phase
Prefetch in a sequential ISA
– Must guess the delay
– Can request before the need
– …but have to pick how many instructions to place between request and response
With split-phase
– Not scheduled until the response returns

Split-Phase Communication
Also applies to non-rendezvous communication
– Buffering
Overlaps computation with communication
Hides latency with parallelism

Threaded Abstract Machine

TAM
Parallel Assembly Language
– What primitives does a parallel processing node need?
Fine-Grained Threading
Hybrid Dataflow
Scheduling Hierarchy

Pure Dataflow
Every operation is dataflow enabled
Good
– Exposes maximum parallelism
– Tolerant of arbitrary delays
Bad
– Synchronization on every event is costly (more costly than straight-line code, in both space and time)
– Exposes non-useful parallelism

Hybrid Dataflow
Use straight-line / control flow
– When the successor is known
– When it is more efficient
Basic blocks (fine-grained threads)
– Think of them as coarser-grained dataflow objects
– Collect up inputs
– Run the basic block like a conventional RISC basic block (known non-blocking within the block)

TAM Fine-Grained Threading
Activation Frame – a block of memory associated with a procedure or loop body
Thread – a piece of straight-line code that does not block or branch
– single entry, single exit
– no long / variable-latency operations
– (a nanoThread? a handful of instructions)
Inlet – a lightweight thread for handling inputs

Analogies
Activation Frame ~ Stack Frame
– but heap allocated
Procedure Call ~ Frame Allocation
– Multiple allocations create parallelism
Thread ~ basic block
Start/fork ~ branch
– Multiple spawns create local parallelism
Switch ~ conditional branch

TL0 Model
Threads grouped into an activation frame
– Like basic blocks into a procedure
Activation Frame (like a stack frame)
– Variables
– Synchronization
– Thread stack (continuation vectors)
Heap Storage
– I-structures

Activation Frame

Recall the Active Message Philosophy
Get data into the computation
– No more copying / allocation
Run to completion
– Never block
…reflected in the TAM model:
– Definition of a thread as non-blocking
– Split-phase operation
– Inlets to integrate responses into the computation

Dataflow Inlet Synchronization
Consider a 3-input node (e.g. add3)
– an "inlet handler" for each incoming data item
– set a presence bit on arrival
– compute the node (add3) when all are present

Active Message DF Inlet Synchronization
Inlet message:
– node
– inlet_handler
– frame base
– data_addr
– flag_addr
– data_pos
– data
Inlet:
– move data to addr
– set the appropriate flag
– if all flags are set, enable the DF node computation

Example of Inlet Code
Add3.in:
  *data_addr = data
  *flag_addr &= ~(1 << data_pos)
  if (*flag_addr == 0)       // was initialized to 0x07
    perform_add3
  else
    next ← lcv.pop()
    goto next

Non-Idempotent Inlet Code
Add3.in:
  *data_addr = data
  (*flag_addr)--             // decrement synch. counter
  if (*flag_addr == 0)       // was initialized to 0x03
    perform_add3
  else
    next ← lcv.pop()
    goto next
// this counter version is what TAM assumes
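The two inlet variants above can be sketched as ordinary Python. The class and method names here are hypothetical (TL0 inlets are handler code, not objects); the sketch shows why the presence-bit version is idempotent while the counter version is not.

```python
class Add3Frame:
    """Sketch of one activation frame's synchronization for a 3-input add3."""
    def __init__(self):
        self.data = [None, None, None]
        self.flags = 0b111      # presence-bit scheme, initialized to 0x07
        self.count = 3          # counter scheme, initialized to 0x03
        self.result = None

    def inlet_bits(self, pos, value):
        """Idempotent inlet: clear this input's presence bit."""
        self.data[pos] = value
        self.flags &= ~(1 << pos)
        if self.flags == 0:     # all inputs present
            self.result = sum(self.data)   # perform_add3

    def inlet_counter(self, pos, value):
        """Non-idempotent inlet (what TAM assumes): decrement the counter."""
        self.data[pos] = value
        self.count -= 1
        if self.count == 0:
            self.result = sum(self.data)

f = Add3Frame()
f.inlet_bits(0, 1)
f.inlet_bits(1, 2)
f.inlet_bits(0, 1)   # duplicate delivery of input 0: harmless with bits
f.inlet_bits(2, 3)
# f.result is now 6; with inlet_counter, the duplicate would have
# decremented the counter twice and fired the node one input early.
```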

TL0 Ops
Start with RISC-like ALU ops
Add:
– FORK
– SWITCH
– STOP
– POST
– FALLOC
– FFREE
– SWAP

Scheduling Hierarchy
Intra-frame
– Related threads live in the same frame
– A frame runs on a single processor
– Schedule together, exploit locality (contiguous allocation of frame memory → cache, registers)
Inter-frame
– Only swap when the work in the current frame is exhausted

Intra-Frame Scheduling
Simple (local) stack of pending threads
– LCV = Local Continuation Vector
FORK places a new PC on the LCV stack
STOP pops the next PC off the LCV stack
The stack is initialized with code to exit the activation frame (SWAP)
– Including scheduling the next frame
– And saving live registers
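The FORK/STOP discipline above can be sketched as a stack of callables. The thread bodies and names here are illustrative, not TL0; the sketch only shows the scheduling mechanism.

```python
lcv = []            # local continuation vector: stack of pending threads
trace = []

def fork(thread):
    """FORK: place a new 'PC' on the LCV stack."""
    lcv.append(thread)

def swap():
    """Bottom of the stack: code that exits the activation frame."""
    trace.append("SWAP")

def thread_b():
    trace.append("B")

def thread_a():
    trace.append("A")
    fork(thread_b)  # enable a sibling thread in the same frame

fork(swap)          # stack initialized with the frame-exit code
fork(thread_a)
while lcv:          # STOP: pop the next PC off the LCV and run it
    lcv.pop()()
# trace is now ['A', 'B', 'SWAP']: local threads drain first,
# then the frame-exit (SWAP) code at the bottom of the stack runs.
```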

Activation Frame

POST
POST – synchronize a thread
– Decrement its synchronization counter
– Run it if the counter reaches zero

TL0/CM5 Intra-Frame Costs
Fork on a thread:
– Fall through: 0 instructions
– Unsynchronized branch: 3 instructions
– Successful synchronization: 4 instructions
– Unsuccessful synchronization: 8 instructions
– Push thread onto the LCV: 3–6 instructions
(LCV = Local Continuation Vector)

Fib Example
CBLOCK FIB.pc
FRAME_BODY RCV=3              % frame layout, RCV size is 3 threads
  islot0.i islot1.i islot2.i  % argument and two results
  pfslot1.pf pfslot2.pf       % frame pointers of recursive calls
  sslot0.s                    % synch variable for thread 6
  pfslot0.pf jslot0.j         % return frame pointer and inlet
REGISTER                      % registers used
  breg0.b ireg0.i             % boolean and integer temps
INLET 0                       % recv parent frame ptr, return inlet, argument
  RECEIVE pfslot0.pf jslot0.j islot0.i
  FINIT                       % initialize frame (RCV, my_fp, ...)
  SET_ENTER 7.t               % set enter-activation thread
  SET_LEAVE 8.t               % set leave-activation thread
  POST 0.t "default"
  STOP

Fib Cont.
INLET 1                       % receive frame pointer of first recursive call
  RECEIVE pfslot1.pf
  POST 3.t "default"
  STOP
INLET 2                       % receive result of first call
  RECEIVE islot1.i
  POST 5.t "default"
  STOP
INLET 3                       % receive frame pointer of second recursive call
  RECEIVE pfslot2.pf
  POST 4.t "default"
  STOP
INLET 4                       % receive result of second call
  RECEIVE islot2.i
  POST 5.t "default"
  STOP

Fib Threads
THREAD 0                      % compare argument against 2
  LT breg0.b = islot0.i 2.i
  SWITCH breg0.b 1.t 2.t
  STOP
THREAD 1                      % argument is <2
  MOVE ireg0.i = 1.i          % result for base case
  FORK 6.t
  STOP
THREAD 2                      % arg >=2, allocate frames for recursive calls
  MOVE sslot0.s = 2.s         % initialize synchronization counter
  FALLOC 1.j = FIB.pc "remote"  % spawn off on another processor
  FALLOC 3.j = FIB.pc "local"   % keep something to do locally
  STOP
THREAD 3                      % got FP of first call, send its arg
  SUB ireg0.i = islot0.i 1.i  % argument for first call
  SEND pfslot1.pf[0.i] <- fp.pf 2.j ireg0.i  % send it
  STOP

Fib Threads Cont.
THREAD 4                      % got FP of second call, send its arg
  SUB ireg0.i = islot0.i 2.i  % argument for second call
  SEND pfslot2.pf[0.i] <- fp.pf 4.j ireg0.i  % send it
  STOP
THREAD 5                      % got results from both calls (synchronize!)
  SYNC sslot0.s
  ADD ireg0.i = islot1.i islot2.i  % add results
  FORK 6.t
  STOP
THREAD 6                      % done!
  SEND pfslot0.pf[jslot0.j] <- ireg0.i  % send result to parent
  FFREE fp.pf "default"       % deallocate own frame
  SWAP "default"              % swap to next activation
  STOP
THREAD 7                      % enter-activation thread
  STOP                        % no registers to restore...
THREAD 8                      % leave-activation thread
  SWAP "default"              % swap to next activation
  STOP                        % no registers to save...

Fib Example
Work through (fib 3)

Multiprocessor Parallelism
Parallelism comes from frame allocations
A runtime policy decides where to allocate frames
– Maybe use work stealing? An idle processor goes to a nearby queue looking for frames to grab and run
– Would require some modification of the TAM model to work
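One way to realize the "nearby queue" idea above is a simple work-stealing sketch. The queue layout and stealing policy here are illustrative assumptions, not part of TAM.

```python
from collections import deque

def run(queues, me):
    """Run frames from my own queue; when idle, steal from another queue."""
    done = []
    while True:
        if queues[me]:
            done.append(queues[me].popleft())   # local work first
        else:
            victims = [q for q in queues if q]  # idle: look for a victim
            if not victims:
                return done                     # no work anywhere: finish
            done.append(victims[0].pop())       # steal from the far end

# Two processors' frame queues; processor 0 drains its own, then steals.
queues = [deque(["f0", "f1"]), deque(["f2", "f3", "f4"])]
stolen = run(queues, me=0)   # ['f0', 'f1', 'f4', 'f3', 'f2']
```

Stealing from the far end of the victim's queue is a common heuristic (it tends to grab larger, older units of work), but it is a design choice, not something the slides specify.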

Frame Scheduling
Inlets to non-active frames initiate the pending thread stack (RCV)
– RCV = Remote Continuation Vector
The first inlet may place the frame on the processor's runnable frame queue
The SWAP instruction picks the next frame and branches to its enter thread

CM5 Frame Scheduling Costs
Inlet posts on a non-running thread
– 10–15 instructions
Swap to the next frame
– 14 instructions
Average thread control cost: 7 cycles
– Constitutes 15–30% of TL0 instructions

Instruction Mix
[Culler et al., JPDC, July 1993]

Cycle Breakdown
[Culler et al., JPDC, July 1993]

Speedup Example
[Culler et al., JPDC, July 1993]

Thread Stats
Thread lengths: 3–17
Threads run per "quantum": 7–530
[Culler et al., JPDC, July 1993]

Admin/Reading
Fri. – no lecture (project meetings)
Mon. – GARP
– Two papers: the Computer paper is more polished (read all of it); the FCCM paper has more architecture details (skim the redundant material and peruse the details)
Wed. – SCORE
– Assume most of you already have the tutorial

Big Ideas
Balance
– the cost of synchronization
– against the benefit of parallelism
Hide latency with parallelism
Decompose into primitives
– Request vs. response … schedule them separately
Avoid constants
– Tolerate variable delays
– Don't hold a resource across an unknown-delay operation
Exploit structure/locality
– Communication
– Scheduling