Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 23: May 23, 2005 Dataflow
Today Dataflow Model Dataflow Basics Examples Basic Architecture Requirements Fine-Grained Threading TAM (Threaded Abstract Machine) –Threaded assembly language
Functional What is a functional language? What is a functional routine? Functional –Like a mathematical function –Given same inputs, always returns same outputs –No state –No side effects
Functional Functional: F(x) = x * x; (define (f x) (* x x)) int f(int x) { return(x * x); }
Non-Functional Non-functional: (define counter 0) (define (next-number!) (set! counter (+ counter 1)) counter) static int counter=0; int increment() { return(++counter); }
Dataflow Model of computation Contrast with Control flow
Dataflow / Control Flow Dataflow Program is a graph of operators Operator consumes tokens and produces tokens All operators run concurrently Control flow Program is a sequence of operations Operator reads inputs and writes outputs into common store One operator runs at a time –Defines successor
Models Programming Model: functional with I-structures Compute Model: dataflow Execution Model: TAM
Token Data value with presence indication
Operator Takes in one or more inputs Computes on the inputs Produces a result Logically self-timed –“Fires” only when input set present –Signals availability of output
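The self-timed firing rule can be sketched in Java (a minimal illustration; names like `DataflowOp` and `tryFire` are invented here, not TAM primitives):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntBinaryOperator;

// Sketch of a two-input dataflow operator: tokens queue up per input,
// and the operator "fires" only when the full input set is present,
// emitting a result token that signals availability of its output.
public class DataflowOp {
    private final Deque<Integer> left = new ArrayDeque<>();
    private final Deque<Integer> right = new ArrayDeque<>();
    private final Deque<Integer> out = new ArrayDeque<>();
    private final IntBinaryOperator fn;

    public DataflowOp(IntBinaryOperator fn) { this.fn = fn; }

    public void putLeft(int v)  { left.add(v);  tryFire(); }
    public void putRight(int v) { right.add(v); tryFire(); }

    // Fire only when a token is present on every input (self-timed rule).
    private void tryFire() {
        while (!left.isEmpty() && !right.isEmpty()) {
            out.add(fn.applyAsInt(left.remove(), right.remove()));
        }
    }

    // Returns the next output token, or null if the operator has not fired.
    public Integer takeOutput() { return out.poll(); }
}
```

With only one input present the operator stays quiet; supplying the second input triggers the fire.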
Dataflow Graph Represents –computation sub-blocks –linkage Abstractly –controlled by data presence
Dataflow Graph Example
Straight-line Code Easily constructed into a DAG –Same DAG we saw before –No need to linearize
Dataflow Graph Real problem is a graph (Day 4)
Task Has Parallelism MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 (Day 4)
DF Exposes Freedom Exploit dynamic ordering of data arrival We saw that aggressive control-flow implementations had to work to exploit this –Scoreboarding –Out-of-order issue
Data Dependence Add Two Operators –Switch –Select
Switch
Select
Constructing If-Then-Else
Looping For (i=0;i<Limit;i++)
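The two routing operators compose into an if-then-else built purely from dataflow. A sketch in Java (class and method names are illustrative): SWITCH sends a data token down the true or false arm based on a control token, and SELECT merges whichever arm produced a token.

```java
import java.util.function.IntUnaryOperator;

// Sketch of the SWITCH and SELECT dataflow operators and their
// composition into an if-then-else: SWITCH routes a token to exactly
// one arm; SELECT forwards the token from whichever arm fired.
public class SwitchSelect {
    // Result of SWITCH: exactly one arm carries the token, the other is empty.
    static final class Routed {
        final Integer onTrue, onFalse;
        Routed(Integer t, Integer f) { onTrue = t; onFalse = f; }
    }

    static Routed switchOp(boolean ctl, int token) {
        return ctl ? new Routed(token, null) : new Routed(null, token);
    }

    static int selectOp(Integer fromTrue, Integer fromFalse) {
        return fromTrue != null ? fromTrue : fromFalse;
    }

    // if (x > 0) then thenF(x) else elseF(x), assembled from SWITCH/SELECT:
    // only the arm that received the token computes.
    static int ifThenElse(int x, IntUnaryOperator thenF, IntUnaryOperator elseF) {
        Routed r = switchOp(x > 0, x);
        Integer t = r.onTrue  == null ? null : thenF.applyAsInt(r.onTrue);
        Integer f = r.onFalse == null ? null : elseF.applyAsInt(r.onFalse);
        return selectOp(t, f);
    }
}
```

A loop is the same construction with the SELECT output fed back through the SWITCH, controlled by the loop predicate.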
Dataflow Graph Computation itself may construct / unfold parallelism –Loops –Procedure calls Semantics: create a new subgraph –Start as new thread –…procedures unfold as tree / dag –Not as a linear stack –…examples shortly…
Key Element of DF Control Synchronization on Data Presence Constructs: –Futures (language level) –I-structures (data structure) –Full/empty bits (implementation technique)
I-Structure Array/object with full/empty bits on each field Allocated empty Fill in value as computed Strict access on empty –Queue requester in structure –Send value to requester when written and it becomes full
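The I-structure discipline can be sketched in Java, with `CompletableFuture` standing in for the per-slot full/empty bit plus the deferred-requester queue (an implementation convenience here, not the actual hardware mechanism):

```java
import java.util.concurrent.CompletableFuture;

// Sketch of an I-structure: an array whose slots are allocated empty,
// written at most once, and whose reads on an empty slot are deferred
// until the write arrives (the reader is queued, then sent the value).
public class IStructure {
    private final CompletableFuture<Integer>[] slots;

    @SuppressWarnings("unchecked")
    public IStructure(int n) {
        slots = new CompletableFuture[n];
        for (int i = 0; i < n; i++) slots[i] = new CompletableFuture<>();
    }

    // Write fills the slot; a second write violates single assignment.
    public void write(int i, int v) {
        if (!slots[i].complete(v))
            throw new IllegalStateException("slot already full: " + i);
    }

    // Strict read: blocks (defers the requester) until the slot is full.
    public int read(int i) { return slots[i].join(); }
}
```

Because each slot is written exactly once, readers and the writer can run in any order and still see a deterministic result, which is what preserves ordering/determinacy for aggregate structures.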
I-Structure Allows efficient “functional” updates to aggregate structures Can pass around pointers to objects Preserves ordering/determinacy E.g. arrays
Future A future is a promise An indication that a value will be computed –And a handle for getting at it Sometimes used as a program construct
Future Future computation immediately returns a future The future is a handle/pointer to the result (define (vmult a b) (cons (future (* (first a) (first b))) (vmult (rest a) (rest b)))) [Version for C programmers on next slide]
DF V-Mult product in C/Java int [] vmult (int [] a, int [] b) { // consistency check on a.length, b.length int [] res = new int[a.length]; for (int i=0;i<res.length;i++) future res[i]=a[i]*b[i]; return(res); } // assume int [] is an I-Structure
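A runnable Java approximation of the slide's pseudocode, using `CompletableFuture` to play the role of both the `future` keyword and the I-structure result slots (an assumption for illustration, not the actual Id/TAM mechanism):

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the dataflow vector multiply: each element product is
// wrapped in a future, so all multiplies may proceed in parallel; the
// array of futures is the I-structure-like result a consumer reads from.
public class VMult {
    static CompletableFuture<Integer>[] vmult(int[] a, int[] b) {
        @SuppressWarnings("unchecked")
        CompletableFuture<Integer>[] res = new CompletableFuture[a.length];
        for (int i = 0; i < a.length; i++) {
            final int j = i;
            // Spawn the element computation; do not wait for it here.
            res[j] = CompletableFuture.supplyAsync(() -> a[j] * b[j]);
        }
        return res;   // returns immediately; consumers join() per element
    }
}
```

A consumer that reads `res[i]` before the multiply finishes simply blocks on that one slot, exactly the strict-read-on-empty behavior of the I-structure slides that follow.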
I-Structure V-Mult Example
Fib (define (fib n) (if (< n 2) 1 (+ (future (fib (- n 1))) (future (fib (- n 2)))))) int fib(int n) { if (n<2) return(1); else return((future)fib(n-1) + (future)fib(n-2)); }
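A runnable Java version of the slide's fib, again with `CompletableFuture` standing in for the future construct: each recursive call is spawned, so the call tree unfolds as a tree of parallel work rather than a linear stack, and the add synchronizes on both results.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of futures-based fib: both recursive calls are spawned as
// futures; join() on each is the touch that synchronizes on the value.
public class Fib {
    static int fib(int n) {
        if (n < 2) return 1;
        CompletableFuture<Integer> f1 = CompletableFuture.supplyAsync(() -> fib(n - 1));
        CompletableFuture<Integer> f2 = CompletableFuture.supplyAsync(() -> fib(n - 2));
        return f1.join() + f2.join();   // add fires when both inputs present
    }
}
```

This exposes far more parallelism than is useful for small n, which is exactly the overhead argument the hybrid-dataflow slides make later.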
Fibonacci Example
Futures Safe with functional routines –Create dataflow –In a functional language, can wrap futures around everything Don’t need explicit future construct Safe to put it anywhere –Anywhere compiler deems worthwhile Can introduce non-determinacy with side-effecting routines –Not clear when operation completes
Future/Side-Effect Hazard (define (decrement! a b) (set! a (- a b)) a) (print (* (future (decrement! c d)) (future (decrement! d e)))) int decrement(int *a, int *b) { *a = *a - *b; return(*a); } printf(“%d %d”, (future)decrement(&c,&d), (future)decrement(&d,&e));
Architecture Mechanisms? Thread spawn –Preferably lightweight Full/empty bits Pure functional dataflow –May exploit common namespace –No need for memory coherence: in pure functional code, values never change
Fine-Grained Threading
Fine-Grained Threading Familiar with multiple threads of control –Multiple PCs Threads differ in power / weight –Cost to switch / associated state –What can be done in each thread Power –Exposing parallelism –Hiding latency
Fine-Grained Threading Computational model with explicit parallelism, synchronization
Split-Phase Operations Separate request and response side of operation –Idea: tolerate long-latency operations Contrast with waiting on response
Canonical Example: Memory Fetch Conventional –Perform read –Stall waiting on reply –Hold processor resource waiting Optimizations –Prefetch memory –Then access later Goal: separate request and response
Split-Phase Memory Send memory fetch request –Have reply go to a different thread Next thread enabled on reply Go off and run rest of this thread (other threads) between request and reply
Prefetch vs. Split-Phase Prefetch in a sequential ISA –Must guess delay –Can request before need –…but have to pick how many instructions to place between request and response With split-phase –Not scheduled until return
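A split-phase fetch can be sketched in Java (all names here, including the simulated memory delay, are invented for illustration): the request side returns immediately, and the reply later runs a continuation, the "inlet", with the value, so the requester never stalls holding the processor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

// Sketch of split-phase memory: fetch() issues the request and returns
// at once; the (simulated) long-latency reply enables the inlet with the
// value, instead of the requester blocking on the response.
public class SplitPhase {
    private static final ScheduledExecutorService memory =
        Executors.newSingleThreadScheduledExecutor();

    // Request side: fire off the fetch, name the inlet to run on reply.
    static CompletableFuture<Void> fetch(int[] mem, int addr, IntConsumer inlet) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        memory.schedule(() -> {           // simulated long-latency memory
            inlet.accept(mem[addr]);      // reply enables the inlet thread
            done.complete(null);
        }, 10, TimeUnit.MILLISECONDS);
        return done;                      // caller keeps running other work
    }

    static void shutdown() { memory.shutdown(); }
}
```

Unlike prefetch, nothing here guesses the delay: the continuation is simply not scheduled until the reply returns.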
Split-Phase Communication Also for non-rendezvous communication –Buffering Overlaps computation with communication Hide latency with parallelism
Threaded Abstract Machine
TAM Parallel Assembly Language –What primitives does a parallel processing node need? Fine-Grained Threading Hybrid Dataflow Scheduling Hierarchy
Pure Dataflow Every operation is dataflow enabled Good –Exposes maximum parallelism –Tolerant to arbitrary delays Bad –Synchronization on every event is costly More costly than straightline code Space and time –Exposes non-useful parallelism
Hybrid Dataflow Use straightline/control flow –When successor known –When more efficient Basic blocks (fine-grained threads) –Think of as coarser-grained DF objects –Collect up inputs –Run basic block like a conventional RISC basic block (known non-blocking within block)
TAM Fine-Grained Threading Activation Frame – block of memory associated with a procedure or loop body Thread – piece of straightline code that does not block or branch –single entry, single exit –No long/variable-latency operations –(nanoThread? handful of instructions) Inlet – lightweight thread for handling inputs
Analogies Activation Frame ~ Stack Frame –Heap allocated Procedure Call ~ Frame Allocation –Multiple allocation creates parallelism –Recall Fib example Thread ~ basic block Start/fork ~ branch –Multiple spawn creates local parallelism Switch ~ conditional branch
TL0 Model Threads grouped into activation frame –Like basic blocks into a procedure Activation Frame (like stack frame) –Variables –Synchronization –Thread stack (continuation vectors) Heap Storage –I-structures
Activation Frame
Recall Active Message Philosophy Get data into computation –No more copying / allocation Run to completion –Never block …reflected in TAM model –Definition of thread as non-blocking –Split-phase operation –Inlets to integrate response into computation
Dataflow Inlet Synch Consider a 3-input node (e.g. add3) –“inlet handler” for each incoming data –set presence bit on arrival –compute node (add3) when all present
Active Message DF Inlet Synch inlet message –node –inlet_handler –frame base –data_addr –flag_addr –data_pos –data Inlet –move data to addr –set appropriate flag –if all flags set enable DF node computation
Example of Inlet Code Add3.in: *data_addr = data; *flag_addr &= ~(1<<data_pos); if (*flag_addr == 0) // was initialized to 0x07 perform_add3 else next = lcv.pop(); goto next
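The same presence-bitmask discipline in runnable Java (class and method names are illustrative): the frame's mask starts at 0b111, each arriving token stores its data and clears its bit, and the node computes only when the mask reaches zero.

```java
// Sketch of the Add3 inlet: one presence bit per input, initialized to
// 0b111; an inlet stores its token, clears its bit, and fires the node
// when every input has arrived (mask == 0).
public class Add3Frame {
    private final int[] data = new int[3];
    private int presence = 0b111;        // one bit per still-missing input
    private Integer result;              // null until the node fires

    // Inlet handler: one per incoming arc, identified by position 0..2.
    public void inlet(int pos, int value) {
        data[pos] = value;               // move data to its frame slot
        presence &= ~(1 << pos);         // mark this input present
        if (presence == 0)               // all flags cleared: perform add3
            result = data[0] + data[1] + data[2];
    }

    public Integer result() { return result; }
}
```

With two of three tokens delivered the node stays disabled; the third inlet enables the computation, matching the active-message inlet on the previous slide.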
TL0 Ops Start with RISC-like ALU Ops Add –FORK –SWITCH –STOP –POST –FALLOC –FFREE –SWAP
Scheduling Hierarchy Intra-frame –Related threads in same frame –Frame runs on single processor –Schedule together, exploit locality contiguous alloc of frame memory cache registers Inter-frame –Only swap when exhaust work in current frame
Intra-Frame Scheduling Simple (local) stack of pending threads –LCV = Local Continuation Vector FORK places new PC on LCV stack STOP pops next PC off LCV stack Stack initialized with code to exit activation frame (SWAP) –Including schedule next frame –Save live registers
Activation Frame
POST POST – synchronize a thread –Decrement synchronization counter –Run if it reaches zero
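FORK, STOP, and POST over the local continuation vector can be sketched together in Java (a toy model; `FrameScheduler` and its method names are invented, and real TL0 pushes PCs, not closures):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of intra-frame scheduling: the LCV is a stack of pending
// threads; FORK pushes, STOP pops and runs the next one, and POST
// decrements a thread's entry count, forking it only when the count
// reaches zero (all of its inputs have been synchronized).
public class FrameScheduler {
    private final Deque<Runnable> lcv = new ArrayDeque<>();
    private final Map<Runnable, Integer> entryCount = new HashMap<>();

    public void fork(Runnable thread) { lcv.push(thread); }   // FORK

    public void setEntryCount(Runnable thread, int n) { entryCount.put(thread, n); }

    // POST: decrement the synchronization counter; runnable at zero.
    public void post(Runnable thread) {
        int left = entryCount.merge(thread, -1, Integer::sum);
        if (left == 0) fork(thread);
    }

    // Drain the LCV: STOP = pop next thread and run it to completion.
    // When the stack empties, a real frame would SWAP to the next frame.
    public void run() {
        while (!lcv.isEmpty()) lcv.pop().run();
    }
}
```

A thread posted once with an entry count of two stays off the LCV; the second POST enables it, mirroring the synchronized-branch costs on the TL0/CM5 slide.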
TL0/CM5 Intra-frame Fork on thread –Fall through 0 inst –Unsynch branch 3 inst –Successful synch 4 inst –Unsuccessful synch 8 inst Push thread onto LCV 3-6 inst –Local Continuation Vector
Multiprocessor Parallelism Comes from frame allocations Runtime policy decides where to allocate frames –Maybe use work stealing? Idle processor goes to nearby queue looking for frames to grab and run Would require some modification of the TAM model to work
Frame Scheduling Inlets to non-active frames initiate pending thread stack (RCV) –RCV = Remote Continuation Vector First inlet may place frame on processor’s runnable frame queue SWAP instruction picks the next frame and branches to its enter thread
CM5 Frame Scheduling Costs Inlet posts on non-running thread –10-15 instructions Swap to next frame –14 instructions Average thread control cost 7 cycles –Constitutes 15-30% of TL0 instructions
Thread Stats Thread lengths 3-17 Threads run per “quantum” 7-530 [Culler et al., JPDC, July 1993]
Instruction Mix [Culler et al., JPDC, July 1993]
Correlation Suggests: need ~20+ instr/thread to amortize out control
Speedup Example [Culler et al., JPDC, July 1993]
Big Ideas Model Expose Parallelism –Can have model that admits parallelism –Can have dynamic (hardware) representation with parallelism exposed Tolerate latency with parallelism Primitives –Thread spawn –Synchronization: full/empty
Big Ideas Balance –Cost of synchronization –Benefit of parallelism Hide latency with parallelism Decompose into primitives –Request vs. response … schedule separately Avoid constants –Tolerate variable delays –Don’t hold on to resource across unknown-delay op Exploit structure/locality –Communication –Scheduling