Dataflow: Passing the Token. Arvind, Computer Science and Artificial Intelligence Lab, M.I.T. ISCA 2005, Madison, WI, June 6, 2005.


Inspiration: Jack Dennis. General-purpose parallel machines based on a dataflow graph model of computation. He inspired all the major players in dataflow during the seventies and eighties, including Kim Gostelow and others at UC Irvine.

Dataflow Machines
Static (mostly for signal processing): NEC NEDIP and IPP; Hughes, Hitachi, AT&T, Loral, TI, Sanyo; M.I.T. Engineering model; ...
Dynamic: Manchester ('81); M.I.T. TTDA, Monsoon ('88); M.I.T./Motorola Monsoon ('91) (8 PEs, 8 IS); ETL SIGMA-1 ('88) (128 PEs, 128 IS); ETL EM4 ('90) (80 PEs) and EM-X ('96) (80 PEs); Sandia EPS88, EPS-2; IBM Empire; ...
Related machines: Burton Smith's Denelcor HEP, Horizon, Tera.
(Photos: S. Sakai and Y. Kodama, EM4: single-chip dataflow micro, 80-PE multiprocessor, ETL, Japan; K. Hiraki and T. Shimada, Sigma-1: the largest dataflow machine, ETL, Japan; John Gurd; Greg Papadopoulos, Andy Boughton, Chris Joerg, Jack Costanza, Monsoon; shown at Supercomputing 91 and 96.)

Software Influences
– Parallel compilers: intermediate representations such as DFG and CDFG (SSA, ...); software pipelining (Keshav Pingali, G. Gao, Bob Rau, ...)
– Functional languages and their compilers
– Active Messages (David Culler)
– Compiling for FPGAs, ... (Wim Bohm, Seth Goldstein, ...)
– Synchronous dataflow: Lustre, Signal (Ed Lee, Berkeley)

This talk is mostly about MIT work
– Dataflow graphs: a clean model of parallel computation
– Static Dataflow Machines: not general-purpose enough
– Dynamic Dataflow Machines: as easy to build as a simple pipelined processor
– The software view: the memory model, I-structures
– Monsoon and its performance
– Musings

Dataflow Graphs
{x = a + b; y = b * 7 in (x - y) * (x + y)}
Values in dataflow graphs are represented as tokens; a token carries an instruction pointer, a port, and data (e.g., ip = 3, p = L). An operator executes when all its input tokens are present; copies of the result token are distributed to the destination operators. There is no separate control flow.
(Figure: the dataflow graph of the expression, built from +, *7, -, +, and *, producing x and y.)
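To make the firing rule concrete, here is a minimal sketch in Python of a token-driven interpreter for the expression above; the interpreter itself, the instruction numbering, and the L/R port names are illustrative assumptions, not material from the talk.

```python
# Sketch of the pure dataflow firing rule for
# {x = a + b; y = b * 7 in (x - y) * (x + y)}.
from collections import defaultdict

program = {
    #   operation,           arity, destinations [(ip, port), ...]
    1: (lambda l, r: l + r, 2, [(3, 'L'), (4, 'L')]),   # x = a + b
    2: (lambda l: l * 7,    1, [(3, 'R'), (4, 'R')]),   # y = b * 7
    3: (lambda l, r: l - r, 2, [(5, 'L')]),             # x - y
    4: (lambda l, r: l + r, 2, [(5, 'R')]),             # x + y
    5: (lambda l, r: l * r, 2, [('out', 'L')]),         # (x - y) * (x + y)
}

def run(a, b):
    tokens = [(1, 'L', a), (1, 'R', b), (2, 'L', b)]   # a token is <ip, port, data>
    waiting = defaultdict(dict)                         # ip -> {port: value}
    while tokens:
        ip, port, value = tokens.pop()
        if ip == 'out':
            return value
        waiting[ip][port] = value
        op, arity, dests = program[ip]
        if len(waiting[ip]) == arity:                   # fire: all input tokens present
            args = [waiting[ip][p] for p in ('L', 'R')[:arity]]
            del waiting[ip]
            result = op(*args)
            # copies of the result token go to every destination operator
            tokens.extend((dip, dport, result) for dip, dport in dests)

print(run(3, 2))   # ((3 + 2) - 2*7) * ((3 + 2) + 2*7) = -9 * 19 = -171
```

Note that there is no program counter: any enabled operator may fire next, which is exactly where the parallelism comes from.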

Dataflow Operators
A small set of dataflow operators can be used to define a general programming language: Fork, Primitive Ops (e.g., +), Switch, and Merge.
(Figure: the four operator schemas; Switch and Merge have T and F ports selected by a boolean control input.)
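A rough sketch of these firing rules in Python (my own illustration of the standard semantics, not code from the talk): the switch routes its data token according to the boolean control token, and the merge forwards the token from the side selected by the control token.

```python
def fork(token):
    # Fork: one input token, identical copies on each output arc.
    return token, token

def primitive_op(op, left, right):
    # Primitive op (e.g., +): fires only when both operand tokens are present.
    return op(left, right)

def switch(control, data):
    # Switch: the data token is routed to the T or F output arc.
    return ('T', data) if control else ('F', data)

def merge(control, t_token=None, f_token=None):
    # Merge: forwards the token on the input selected by the control token;
    # unlike the other operators it does not wait for both data inputs.
    return t_token if control else f_token

print(switch(True, 42))           # ('T', 42)
print(merge(False, f_token=7))    # 7
```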

Well-Behaved Schemas
A well-behaved schema is one-in-one-out and self-cleaning: one token in produces one token out, and no tokens are left behind inside the schema. Conditionals (a Switch on the predicate p routing into f or g, followed by a Merge) and loops are built as well-behaved schemas. Bounded loops are needed for resource management.
(Figure: before/after views of the conditional and loop schemas.)

Outline: Dataflow graphs ✓ (a clean model of parallel computation); Static Dataflow Machines ✓ (not general-purpose enough); Dynamic Dataflow Machines (as easy to build as a simple pipelined processor); The software view (the memory model: I-structures); Monsoon and its performance; Musings.

Static Dataflow Machine: Instruction Templates
Each arc in the graph has an operand slot in the program. An instruction template holds an opcode, two operand slots with presence bits, and up to two destinations. For the example graph:
1: +   destinations 3L, 4L
2: *7  destinations 3R, 4R
3: -   destination 5L
4: +   destination 5R
5: *   destination out
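A sketch of what one such instruction template might look like as a data structure, assuming the numbering above; the field names are mine, not from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Template:
    opcode: str
    destinations: list                              # [(instruction number, 'L'/'R'), ...]
    operands: dict = field(default_factory=dict)    # port -> stored value
    presence: set = field(default_factory=set)      # presence bits: ports that are full

    def deposit(self, port, value):
        # A token arriving on an arc is stored in that arc's single operand
        # slot; the presence bit marks the slot as full.
        assert port not in self.presence, "static model: one token per arc"
        self.operands[port] = value
        self.presence.add(port)

    def ready(self, arity):
        # The instruction can fire once all of its operand slots are full.
        return len(self.presence) == arity

t1 = Template('+', [(3, 'L'), (4, 'L')])    # template 1 of the example
t1.deposit('L', 3)
t1.deposit('R', 2)
print(t1.ready(2))                           # True: both presence bits set
```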

Static Dataflow Machine (Jack Dennis, 1973)
The processor stores instruction templates (op, dest1, dest2, and two operand slots, each with a presence bit and source) and cycles tokens through Receive, the functional units (FU), and Send. Many such processors can be connected together, and programs can be statically divided among the processors.

Static Dataflow: Problems/Limitations
Mismatch between the model and the implementation:
– The model requires unbounded FIFO token queues per arc, but the architecture provides storage for one token per arc.
– The architecture does not ensure FIFO order in the reuse of an operand slot.
– The merge operator has a unique firing rule.
The static model does not support:
– Function calls
– Data structures
There is no easy solution in the static framework; dynamic dataflow provided a framework for solutions.

Outline: Dataflow graphs ✓ (a clean model of parallel computation); Static Dataflow Machines ✓ (not general-purpose enough); Dynamic Dataflow Machines ✓ (as easy to build as a simple pipelined processor); The software view (the memory model: I-structures); Monsoon and its performance; Musings.

Dynamic Dataflow Architectures
Allocate instruction templates, i.e., a frame, dynamically to support each loop iteration and procedure call; termination detection is needed to deallocate frames. The code can be shared if we separate the code and the operand storage: a token then carries a frame pointer and an instruction pointer.
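The key mechanism is the tag. Here is a Python sketch of tagged tokens and frame-based operand matching; the names and the simple dictionary-of-frames representation are assumptions for illustration, not the actual hardware scheme.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    fp: int        # frame pointer: which activation's operand storage
    ip: int        # instruction pointer into the shared code
    port: str      # 'L' or 'R'
    value: object

frames = {}        # fp -> {operand slot (keyed by ip) -> waiting value}

def match(token):
    """Return the (left, right) operand pair if the partner token has already
    arrived; otherwise park this token's value in the frame and return None."""
    slots = frames.setdefault(token.fp, {})
    if token.ip in slots:                    # partner present: the instruction can fire
        waiting = slots.pop(token.ip)
        return (waiting, token.value) if token.port == 'R' else (token.value, waiting)
    slots[token.ip] = token.value            # first operand waits in the frame
    return None

print(match(Token(fp=7, ip=3, port='L', value=5)))    # None: left operand waits
print(match(Token(fp=7, ip=3, port='R', value=14)))   # (5, 14): ready to execute
```

Because the frame pointer is part of the tag, two iterations of the same loop (different fp, same ip) never confuse their operands, which is exactly what the static model could not guarantee.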

A Frame in Dynamic Dataflow
The program holds the instructions and their destinations (1: + to 3L, 4L; 2: *7 to 3R, 4R; 3: - to 5L; 4: + to 5R; 5: * to out); the frame holds the operand slots for one activation. Storage needs to be provided for only one operand per operator; the other arrives on the token that triggers the firing.
(Figure: the example graph, its program, and a frame.)

Monsoon Processor (Greg Papadopoulos)
Pipeline stages: Instruction Fetch from the code (instructions of the form op r d1,d2, selected by ip), Operand Fetch from the frames (addressed by fp + r), ALU, and Form Token, with a Token Queue and connections to the Network.
(Figure: the Monsoon processor pipeline.)

Temporary Registers & Threads (Robert Iannucci)
Registers evaporate when an instruction thread is broken; there are n sets of registers (n = pipeline depth). Registers are also used for exceptions and interrupts.
(Figure: the pipeline of the previous slide extended with register sets; instructions have the form op r S1,S2.)

Actual Monsoon Pipeline: Eight Stages
Instruction Fetch (instruction memory), Effective Address, Presence Bit Operation (presence bits), Frame Operation (frame memory; R, 2W), ALU (with registers), and Form Token, fed by the User and System token queues and connected to the Network.

Instructions Directly Control the Pipeline
The opcode specifies an operation for each pipeline stage (EA, WM, RegOp, ALU, FormToken). Instruction format: opcode r dest1 [dest2].
– EA (effective address): FP + r (frame relative); r (absolute); IP + r (code relative, not supported)
– WM (waiting-matching): Unary, Normal, Sticky, Exchange, Imperative; PBs x port → PBs x Frame op x ALU inhibit
– Register ops / ALU: V_L x V_R → V'_L x V'_R, CC
– Form token: V_L x V_R x Tag_1 x Tag_2 x CC → Token_1 x Token_2
Easy to implement; no hazard detection.

Procedure Linkage Operators
A get-frame operator allocates a frame for the callee and extract-tag obtains its tag; change-Tag 0, change-Tag 1, ..., change-Tag n (with a Fork) then retag the context and argument tokens a1 ... an so that they enter the graph for f in the new frame. Tokens on the caller's side are in frame 0; tokens inside f are in frame 1. This is like a standard call/return, except that the caller and callee can be active simultaneously.
(Figure: the procedure linkage graph for a call to f.)
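A sketch of the linkage idea in Python: get-frame allocates a fresh frame, and change-tag retags tokens so they enter the callee's code in that frame. The frame counter, the Token fields, and the entry point 100 are all made up for the example.

```python
from dataclasses import dataclass, replace
from itertools import count

@dataclass(frozen=True)
class Token:
    fp: int
    ip: int
    port: str
    value: object

_next_frame = count(1)

def get_frame():
    # Allocate a fresh frame; here a frame is just a fresh frame number.
    return next(_next_frame)

def change_tag(token, new_fp, new_ip):
    # Retag a token: same value, new (frame pointer, instruction pointer) tag.
    return replace(token, fp=new_fp, ip=new_ip)

# The caller, running in frame 0, sends arguments into f (entry point 100):
callee_fp = get_frame()
args = [Token(fp=0, ip=7, port='L', value=v) for v in (3, 2)]
call_tokens = [change_tag(t, callee_fp, 100 + i) for i, t in enumerate(args)]
print(call_tokens)   # tokens now tagged for the callee's frame; both frames may be active
```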

Outline: Dataflow graphs ✓ (a clean model of parallel computation); Static Dataflow Machines ✓ (not general-purpose enough); Dynamic Dataflow Machines ✓ (as easy to build as a simple pipelined processor); The software view ✓ (the memory model: I-structures); Monsoon and its performance; Musings.

Parallel Language Model
A tree of activation frames (e.g., f, g, h, and loop activations), each with active threads, together with a global heap of shared objects; execution is asynchronous and parallel at all levels.
(Figure: the activation tree and the global heap.)

The Id World
Id programs, with implicit parallelism, compile to dataflow graphs + I-structures + ..., which run on TTDA, Monsoon, *T, and *T-Voyager.

Id World People
Rishiyur Nikhil, Keshav Pingali, Vinod Kathail, David Culler, Ken Traub, Steve Heller, Richard Soley, Dinart Mores, Jamey Hicks, Alex Caro, Andy Shaw, Boon Ang, Shail Aditya, R. Paul Johnson, Paul Barth, Jan Maessen, Christine Flood, Jonathan Young, Derek Chiou, Arun Iyengar, Zena Ariola, Mike Bekerle, K. Ekanadham (IBM), Wim Bohm (Colorado), Joe Stoy (Oxford), ...

Data Structures in Dataflow
Data structures reside in a structure store; tokens carry pointers to them. I-structures are write-once, read-multiple-times: allocate, write, read, ..., read, deallocate. There is no problem if a reader arrives before the writer at the memory location.
(Figure: processors issuing I-fetch a and I-store a, v operations to the structure memory.)

I-Structure Storage: Split-Phase Operations & Presence Bits
An I-fetch carries the address a to be read and a forwarding address, the tag fp.ip of the instruction waiting for the value; the operation is split-phase, so the processor does not block. Presence bits on each location record whether it is empty, has deferred readers, or holds a value, and the storage must deal with multiple deferred reads queued on the same location. Other operations: fetch/store, take/put, clear.
(Figure: I-structure memory with presence bits, a written value v1, and deferred reads waiting for v2.)
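A sketch of I-structure cell semantics in Python: write-once locations with presence bits (EMPTY, DEFERRED, FULL) and split-phase reads, where the send callback stands in for returning a token to the reader's forwarding address. The class and its names are illustrative, not Monsoon code.

```python
class IStructureCell:
    def __init__(self):
        self.state = 'EMPTY'      # presence bits: EMPTY, DEFERRED, or FULL
        self.value = None
        self.deferred = []        # forwarding addresses of readers that arrived early

    def i_fetch(self, forwarding_address, send):
        if self.state == 'FULL':
            send(forwarding_address, self.value)   # value already written: answer now
        else:
            self.state = 'DEFERRED'                # reader arrived before the writer
            self.deferred.append(forwarding_address)

    def i_store(self, value, send):
        assert self.state != 'FULL', "I-structures are write-once"
        self.value, self.state = value, 'FULL'
        for addr in self.deferred:                 # satisfy all deferred reads
            send(addr, value)
        self.deferred.clear()

cell = IStructureCell()
send = lambda addr, v: print(f"token to {addr}: {v}")
cell.i_fetch(("fp", "ip"), send)    # deferred: no value yet, the reader does not block
cell.i_store(42, send)              # prints: token to ('fp', 'ip'): 42
```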

Outline: Dataflow graphs ✓ (a clean model of parallel computation); Static Dataflow Machines ✓ (not general-purpose enough); Dynamic Dataflow Machines ✓ (as easy to build as a simple pipelined processor); The software view ✓ (the memory model: I-structures); Monsoon and its performance ✓; Musings.

The Monsoon Project (an MIT-Motorola collaboration: Motorola Cambridge Research Center + MIT)
Research prototypes: 16 two-node systems (MIT, LANL, Motorola, Colorado, Oregon, McGill, USC, ...) and 2 sixteen-node systems (MIT, LANL), together with the Id World software. Hardware: the 64-bit Monsoon processor (10M tokens/sec), an I-structure store (4M 64-bit words, 100 MB/sec), a Unix box as front end, and a 16-node fat-tree network. (Tony Dahbura.)

Id Applications on Monsoon at MIT
Numerical: hydrodynamics (SIMPLE), Global Circulation Model (GCM), photon-neutron transport code (GAMTEB), N-body problem.
Symbolic: combinatorics (free tree matching, Paraffins), the Id-in-Id compiler.
System: I/O library, heap storage allocator on Monsoon.
Fun and games: Breakout, Life, Spreadsheet.

Id Run Time System (RTS) on Monsoon
Frame Manager (Derek Chiou): allocates frame memory on processors for procedure and loop activations.
Heap Manager (Arun Iyengar): allocates storage in I-structure memory or in processor memory for heap objects.

Single-Processor Monsoon Performance Evolution
One 64-bit processor (10 MHz) + 4M 64-bit I-structure store; times in hours:minutes:seconds, measured in Feb. 91, Aug. 91, Mar. 92, and Sep. 92:
– Matrix Multiply (500x500): 4:04, 3:58, 3:55, 1:46
– Wavefront (500x500, 144 iters.): 5:00, 5:00, 3:48
– Paraffins (n = 19): :50, :31, :02.4; (n = 22): :32
– GAMTEB-9C (40K particles): 17:20, 10:42, 5:36, 5:36; (1M particles): 7:13:20, 4:17:14, 2:36:00, 2:22:00
– SIMPLE (iterations): :19, :15, :10, :06; (1K iterations): 4:48:00, 1:19:49
Need a real machine to do this.

Monsoon Speed-Up Results (Boon Ang, Derek Chiou, Jamey Hicks; September 1992)
(Table: speed-up and critical path, in millions of cycles, for Matrix Multiply 500x500, Paraffins n=22, GAMTEB-2C 40K particles, and SIMPLE, measured on configurations from 1 PE upward.)
Could not have asked for more.

Base Performance? Id on Monsoon vs. C / F77 on an R3000
(Table: cycle counts, x 10^6, for Id on one Monsoon PE versus C / Fortran 77 on a MIPS R3000, for Matrix Multiply 500x500, Paraffins n=22, GAMTEB-9C 40K particles, and SIMPLE.)
R3000 cycles collected via Pixie. * Fortran 77, fully optimized; + MIPS C, O3. 64-bit floating point used in Matrix Multiply, GAMTEB, and SIMPLE. The MIPS codes won't run on a parallel machine without recompilation or recoding. An 8-way superscalar? Unlikely to give a 7-fold speedup.

The Monsoon Experience
Performance of implicitly parallel Id programs scaled effortlessly. Id programs on a single-processor Monsoon took 2 to 3 times as many cycles as Fortran/C on a modern workstation; this can certainly be improved. The effort to develop the invisible software (loaders, simulators, I/O libraries, ...) dominated the effort to develop the visible software (compilers, ...).

Outline: Dataflow graphs ✓ (a clean model of parallel computation); Static Dataflow Machines ✓ (not general-purpose enough); Dynamic Dataflow Machines ✓ (as easy to build as a simple pipelined processor); The software view ✓ (the memory model: I-structures); Monsoon and its performance ✓; Musings ✓.

What would we have done differently? (1)
Technically: very little.
– Simple, high-performance design that easily exploits fine-grain parallelism and tolerates latencies efficiently.
– Id preserves fine-grain parallelism, which is abundant.
– Robust compilation schemes; DFGs provide an easy compilation target.
Of course, there is room for improvement:
– Functionally several different types of memories (frames, queues, heap); all are not full at the same time.
– Software has no direct control over large parts of the memory, e.g., the token queue.
– Poor single-thread performance, and it hurts when single-thread latency is on a critical path.

What would we have done differently? (2)
Non-technical, but perhaps even more important:
– It is difficult enough to cause one revolution, but two? Wake up?
– Market forces cannot be ignored for too long; doing so may affect acceptance even by the research community.
– Should the machine have been built a few years earlier (in lieu of simulation and compiler work)? Perhaps it would have had more impact (had it worked).
– The follow-on project should have been about either (1) running conventional software on dataflow machines, or (2) making minimum modifications to commercial microprocessors. (We chose 2, but perhaps 1 would have been better.)

Imperative Programs and Multi-Cores
Deep pointer analysis is required to extract parallelism from sequential codes; otherwise, extreme speculation is the only solution. A multithreaded/dataflow model is needed to present the parallelism that is found to the underlying hardware. Exploiting fine-grain parallelism is necessary in many situations, e.g., producer-consumer parallelism.

Locality and Parallelism: Dual Problems?
Good performance requires exploiting both. The dataflow model gives you parallelism for free but requires analysis to get locality; C (mostly) provides locality for free but requires analysis to get parallelism. Tough problems are tough independent of representation.

Parting Thoughts
Dataflow research as conceived by most researchers achieved its goals: the model of computation is beautiful and will be resurrected whenever people want to exploit fine-grain parallelism. But the installed software base has a different model of computation, which poses different challenges for parallel computing. It may be possible to implement that model effectively on dataflow machines; we did not investigate this, but it is absolutely worth investigating further. Current efforts on more standard hardware are having plenty of problems of their own. It is still an open question what will work in the end.

Thank You! And thanks to R. S. Nikhil, Dan Rosenband, James Hoe, Derek Chiou, Larry Rudolph, Martin Rinard, and Keshav Pingali for helping with this talk.

DFGs vs. CDFGs
Both dataflow graphs and control/data flow graphs (CDFGs) had the goal of structured, well-formed, compositional, executable graphs. CDFG research (70s, 80s) approached this goal by starting with sequential control-flow graphs ("flowcharts") plus data-dependency arcs and gradually adding structure (e.g., φ-functions). Dataflow graphs approached the goal directly, by construction: schemata for basic blocks, conditionals, loops, and procedures. A CDFG is an intermediate representation for compilers and, unlike DFGs, not a language.