CALTECH cs184c Spring2001 -- DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 6: April 19, 2001 Dataflow.

Slides:



Advertisements
Similar presentations
More on Processes Chapter 3. Process image _the physical representation of a process in the OS _an address space consisting of code, data and stack segments.
Advertisements

Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 4: April 7, 2003 Instruction Level Parallelism ILP.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
Instruction-Level Parallelism (ILP)
Computer Organization and Architecture
Latency Tolerance: what to do when it just won’t go away CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.
1 Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software.
Processes CSCI 444/544 Operating Systems Fall 2008.
Computer Organization and Architecture The CPU Structure.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day6:
Multithreading and Dataflow Architectures CPSC 321 Andreas Klappenecker.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
Chapter 17 Parallel Processing.
State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.
1 Lecture 18: Pipelining Today’s topics:  Hazards and instruction scheduling  Branch prediction  Out-of-order execution Reminder:  Assignment 7 will.
RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 8: February 11, 2009 Dataflow.
1 Chapter 4 Threads Threads: Resource ownership and execution.
ECE669 L19: Processor Design April 8, 2004 ECE 669 Parallel Computer Architecture Lecture 19 Processor Design.
Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.
CH12 CPU Structure and Function
Chapter 51 Threads Chapter 5. 2 Process Characteristics  Concept of Process has two facets.  A Process is: A Unit of resource ownership:  a virtual.
The Vector-Thread Architecture Ronny Krashinsky, Chris Batten, Krste Asanović Computer Architecture Group MIT Laboratory for Computer Science
Architecture Basics ECE 454 Computer Systems Programming
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 16: May 6, 2005 Fast Messaging.
High Performance Architectures Dataflow Part 3. 2 Dataflow Processors Recall from Basic Processor Pipelining: Hazards limit performance  Structural hazards.
Computer Systems Research at UNT 1 A Multithreaded Architecture Krishna Kavi (with minor modifcations)
Threads, Thread management & Resource Management.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day3:
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Threads and Processes.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 8: April 26, 2001 Simultaneous Multi-Threading (SMT)
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 15: May 9, 2003 Distributed Shared Memory.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 23: May 23, 2005 Dataflow.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day5:
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 17: May 14, 2003 Threaded Abstract Machine.
ECE 456 Computer Architecture Lecture #14 – CPU (III) Instruction Cycle & Pipelining Instructor: Dr. Honggang Wang Fall 2013.
CS333 Intro to Operating Systems Jonathan Walpole.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 10: May 8, 2001 Synchronization.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 11: April 30, 2003 Multithreading.
SIMULTANEOUS MULTITHREADING Ting Liu Liu Ren Hua Zhong.
Lecture 04: Instruction Set Principles Kai Bu
Spring 2003CSE P5481 Advanced Caching Techniques Approaches to improving memory system performance eliminate memory operations decrease the number of misses.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 7: April 24, 2001 Threaded Abstract Machine (TAM) Simultaneous.
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 22: May 20, 2005 Synchronization.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 11: May10, 2001 Data Parallel (SIMD, SPMD, Vector)
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 1: April 3, 2001 Overview and Message Passing.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 14: May 7, 2003 Fast Messaging.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 9: May 3, 2001 Distributed Shared Memory.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day8:
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day7:
Threads prepared and instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University 1July 2016Processes.
William Stallings Computer Organization and Architecture 8th Edition
CS184c: Computer Architecture [Parallel and Multithreaded]
ECE/CS 552: Pipelining to Superscalar
Computer Architecture
Latency Tolerance: what to do when it just won’t go away
The Vector-Thread Architecture
The University of Adelaide, School of Computer Science
ESE532: System-on-a-Chip Architecture
Prof. Onur Mutlu Carnegie Mellon University
Presentation transcript:

CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 6: April 19, 2001 Dataflow

CALTECH cs184c Spring DeHon Talks Ron Weiss –Cellular Computation and Communication using Engineered Genetic Regulatory Networks –[(Bio) Cellular Message Passing ] –Today 4pm –Beckman Institute Auditorium

CALTECH cs184c Spring DeHon Reading More dynamic schedule revision Multithreading next –HEP historical/intro Full-Empty Bits May skip –Tullsen… simultaneous multithread… Please read –Burns … Quantifying SMT Please read

CALTECH cs184c Spring DeHon Today Fine-Grained Threading Split-Phase Futures / Full-Empty Bits / I-structures TAM

CALTECH cs184c Spring DeHon Fine-Grained Threading Familiar with multiple threads of control –Multiple PCs Difference in power / weight –Costly to switch / associated state –What can do in each thread Power –Exposing parallelism –Hiding latency

CALTECH cs184c Spring DeHon TAM Fine-Grained Threading Activation Frame – block of memory associated with a procedure or loop body Thread – piece of straightline code that does not block or branch (single entry) Inlet – lightweight thread for handling inputs

CALTECH cs184c Spring DeHon Analogies Activation Frame ~ Stack Frame –Heap allocated Procedure Call ~ Frame Allocation –Multiple allocation creates parallelism Thread ~ basic block Start/fork ~ branch –Multiple spawn creates local parallelism Switch ~ conditional branch

CALTECH cs184c Spring DeHon Fine-grained Threading Computational model with explicit parallelism, synchronization

CALTECH cs184c Spring DeHon Split-Phase Operations Separate request and response side of operation –Idea: tolerate long latency operations Contrast with waiting on response

CALTECH cs184c Spring DeHon Canonical Example: Memory Fetch Conventional –Perform read –Stall waiting on reply Optimizations –Prefetch memory –Then access later Goal: separate request and response

CALTECH cs184c Spring DeHon Split-Phase Memory Send memory fetch request –Have reply to different thread Next thread enabled on reply Go off and run rest of this thread (other threads) between request and reply

CALTECH cs184c Spring DeHon Prefetch vs. Split-Phase Prefetch in sequential ISA –Must guess delay –Can request before need –…but have to pick how many instructions to place between request and response

CALTECH cs184c Spring DeHon Split-Phase Communication Also for non-rendezvous communication –Buffering Overlaps computation with communication Hide latency with parallelism

CALTECH cs184c Spring DeHon Key Element of DF Control Synchronization on Data Presence Construct: –Futures –Full-empty bits –I-structures

CALTECH cs184c Spring DeHon Future Future is a promise An indication that a value will be computed –And a handle for getting a handle on it Sometimes used as program construct

CALTECH cs184c Spring DeHon Future Future computation immediately returns a future Future is a handle/pointer to result (define (dot a b) –(cons (future (* (first a) (first b))) –(dot (rest a) (rest b))))

CALTECH cs184c Spring DeHon Fib (define (fib n) –(if (< n 2) 1 (+ (future (fib (- n 1))) – (future (fib (- n 2))))))

CALTECH cs184c Spring DeHon Strict/non-strict Strict operation requires the value of some variable –E.g. add, multiply Non-strict operation only needs the handle for a value –E.g. cons, procedure call (anything that just passes the handle off)

CALTECH cs184c Spring DeHon Futures Safe with functional routines –Create dataflow Can introduce non-determinacy with side-effecting routines –Not clear when operation completes

CALTECH cs184c Spring DeHon Future/Side-Effect hazard (define (decrement! a b) –(set! a (- a b))) (print (* (future (decrement! c d)) (future (decrement! d 2))))

CALTECH cs184c Spring DeHon Full/Empty bit Tag on data indicates data presence –E.g. tag in memory, RF –Like Scoreboard present bit When computation allocated, set to empty –E.g. operation issued into pipe, future call made When computation completes –Future computation completes –Data written into slot (register, memory) –Bit set to full On data access –Bit full, strict operation can get value –Bit empty, strict operation block on completion

CALTECH cs184c Spring DeHon I-Structure Array/object with full-empty bits on each field Allocated empty Fill in value as compute Strict access on empty –Queue requester in structure –Send value to requester when written and becomes full

CALTECH cs184c Spring DeHon I-Structure Allows efficient “functional” updates to aggregate structures Can pass around pointers to objects Preserve ordering/determinacy E.g. arrays

CALTECH cs184c Spring DeHon Threaded Abstract Machine

CALTECH cs184c Spring DeHon TAM Parallel Assembly Language Fine-Grained Threading Hybrid Dataflow Scheduling Hierarchy

CALTECH cs184c Spring DeHon Pure Dataflow Every operation is dataflow enabled Good –Exposes maximum parallelism –Tolerant to arbitrary delays Bad –Synchronization on event costly More costly than straightline code Space and time –Exposes non-useful parallelism

CALTECH cs184c Spring DeHon Old Example Task Has Parallelism MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 CS184b

CALTECH cs184c Spring DeHon Hybrid Dataflow Use straightline/control flow –When successor known –When more efficient Basic blocks (fine-grained threads) –Think of as coarser-grained DF objects –Collect up inputs –Run basic block like conv. RISC basic- block (known non-blocking)

CALTECH cs184c Spring DeHon Great Project Model and assess control flow versus dataflow General formulation –Whole problem could be run control or data flow –Pick when to use control vs. dataflow –Base on Cost of synchronization Model of delay predictability Minimize expected runtime

CALTECH cs184c Spring DeHon Stopped Here 4/19/01

CALTECH cs184c Spring DeHon TL0 Model Activition Frame (like stack frame) –Variables –Synchronization –Thread stack (continuation vectors) Heap Storage –I-structures

CALTECH cs184c Spring DeHon TL0 Ops RISC-like ALU Ops FORK SWITCH STOP POST FALLOC FFREE SWAP

CALTECH cs184c Spring DeHon Scheduling Hierarchy Intra-frame –Related threads in same frame –Frame runs on single processor –Schedule together, exploit locality (cache, maybe regs) Inter-frame –Only swap when exhaust work in current frame

CALTECH cs184c Spring DeHon Intra-Frame Scheduling Simple (local) stack of pending threads Fork places new PC on stack STOP pops next PC off stack Stack initialized with code to exit activation frame –Including schedule next frame –Save live registers

CALTECH cs184c Spring DeHon TL0/CM5 Intra-frame Fork on thread –Fall through 0 inst –Unsynch branch 3 inst –Successful synch 4 inst –Unsuccessful synch 8 inst Push thread onto LCV 3-6 inst

CALTECH cs184c Spring DeHon Fib Example [look at how this turns into TL0 code]

CALTECH cs184c Spring DeHon Multiprocessor Parallelism Comes from frame allocations Runtime policy where allocate frames –Maybe use work stealing?

CALTECH cs184c Spring DeHon Frame Scheduling Inlets to non-active frames initiate pending thread stack (RCV) First inlet may place frame on processor’s runable frame queue SWAP instruction picks next frame branches to its enter thread

CALTECH cs184c Spring DeHon CM5 Frame Scheduling Costs Inlet Posts on non-running thread –10-15 instructions Swap to next frame –14 instructions Average thread cost 7 cycles –Constitutes 15-30% TL0 instr

CALTECH cs184c Spring DeHon Instruction Mix [Culler et. Al. JPDC, July 1993]

CALTECH cs184c Spring DeHon Cycle Breakdown [Culler et. Al. JPDC, July 1993]

CALTECH cs184c Spring DeHon Speedup Example [Culler et. Al. JPDC, July 1993]

CALTECH cs184c Spring DeHon Thread Stats Thread lengths 3—17 Threads run per “quantum” 7—530 [Culler et. Al. JPDC, July 1993]

CALTECH cs184c Spring DeHon Great Project Develop optimized  Arch for TAM –Hardware support/architecture for single- cycle thread-switch/post

CALTECH cs184c Spring DeHon Big Ideas Primitives –Parallel Assembly Language –Threads for control –Synchronization (post, full-empty) Latency Hiding –Threads, split-phase operation Exploit Locality –Create locality Scheduling quanta