

Computer Architecture Dataflow Machines

Data Flow
Conventional programming models are control driven.
The instruction sequence is precisely specified.
The sequence specifies control: which instruction the CPU will execute next.
Execution rule: execute an instruction when its predecessor has completed.
  s1: r = a*b;
  s2: s = c*d;
  s3: y = r + s;
s2 executes when s1 is complete.
s3 executes when s2 is complete.

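Data Flow
Consider the calculation y = a*b + c*d.
Represent it by a graph:
  Nodes represent computations.
  Data flows along arcs.
Execution rule: execute an instruction when its data is available (the data-driven rule).
[Figure: dataflow graph. Inputs a and b feed one multiply node, c and d feed another; both products feed an add node that produces y.]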

Data Flow
Dataflow firing rule: an instruction fires (executes) when its data is available.
This exposes all possible parallelism:
  Either multiplication can fire as soon as its data arrives.
  The addition must wait.
Data dependence analysis!
Instruction issue units: fire (issue) each instruction when its operands (registers) have been written.
[Figure: the same dataflow graph for y = a*b + c*d.]
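The firing rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph encoding, the node names `r` and `s` (borrowed from the control-driven example), and the function `run_dataflow` are hypothetical and do not describe any of the machines in these slides. Nodes that become ready in the same step are the ones a dataflow machine could fire in parallel.

```python
import operator

# Hypothetical encoding of y = a*b + c*d: node name -> (operation, operand names).
GRAPH = {
    "r": (operator.mul, ("a", "b")),
    "s": (operator.mul, ("c", "d")),
    "y": (operator.add, ("r", "s")),
}


def run_dataflow(graph, inputs):
    """Fire every node whose operands are available; repeat until all have fired."""
    values = dict(inputs)
    pending = set(graph)
    steps = []  # nodes fired in each step (these could run in parallel)
    while pending:
        ready = sorted(
            node for node in pending
            if all(src in values for src in graph[node][1])
        )
        if not ready:
            raise ValueError("no node can fire: an operand never arrives")
        # Compute all ready nodes from the values present at the start of the step.
        results = {
            node: graph[node][0](*(values[src] for src in graph[node][1]))
            for node in ready
        }
        values.update(results)
        pending.difference_update(ready)
        steps.append(ready)
    return values, steps


values, steps = run_dataflow(GRAPH, {"a": 2, "b": 3, "c": 4, "d": 5})
print(steps)        # both multiplications fire together, the addition waits
print(values["y"])
```

Note that nothing in the sketch orders `r` before `s`: the only ordering comes from data dependence.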

Data Flow - Realisations
Several experimental machines were built:
  Machine        Architects / institution
  Manchester     Gurd & Watson
  Tagged Token   Arvind, MIT
  Sigma          ETL, Tsukuba
  EMC-4          ETL, Tsukuba
  Monsoon        Arvind, MIT
  EMX            ETL, Tsukuba
  RAPID          Osaka/Sharp/Mitsubishi (asynchronous!)
  Naiad          Tasmania
and some others.

Data Flow - Realisations Manchester

Data Flow - Program
Program word (matching store entry), with fields:
  Operation: +, -, *, / etc.
  Left and right operands
  Presence flags
  Destination address
  Destination left or right
When both presence flags are Y, this packet is despatched to a PE (any PE!).

Data Flow - Matching Store
Special-purpose memory with limited processing capability:
  Detects full slots.
  Despatches operation packets to any idle PE.
Entry fields: operation (+, -, *, / etc.), left and right operands, presence flags, destination address, destination left or right.

Data Flow - Processing Elements
  Receive operation packets.
  Generate the result.
  Form a result packet.
  Despatch it to the matching store.
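The matching store and processing element loop can be sketched as follows. This is a minimal model under stated assumptions, not the Manchester design: the class `MatchingStore`, the tuple layouts, the `"L"`/`"R"` side labels, and the use of a destination of `None` to mean "program output" are all hypothetical choices made for illustration.

```python
import operator
from collections import deque

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


class MatchingStore:
    """Hypothetical matching store: one slot per program word."""

    def __init__(self, program):
        # program: address -> (operation, destination address, destination side)
        self.slots = {
            addr: {"op": op, "dest": dest, "side": side, "operands": {}}
            for addr, (op, dest, side) in program.items()
        }

    def receive(self, addr, side, value):
        """Store one operand; return an operation packet once the slot is full."""
        slot = self.slots[addr]
        slot["operands"][side] = value  # storing the value sets its presence flag
        if "L" in slot["operands"] and "R" in slot["operands"]:
            left, right = slot["operands"]["L"], slot["operands"]["R"]
            slot["operands"] = {}  # clear both presence flags
            return (slot["op"], left, right, slot["dest"], slot["side"])
        return None


def processing_element(packet):
    """Execute an operation packet and form the result packet."""
    op, left, right, dest, side = packet
    return (dest, side, OPS[op](left, right))


def run(program, tokens):
    store = MatchingStore(program)
    queue = deque(tokens)  # result packets are fed back through this queue
    outputs = []
    while queue:
        addr, side, value = queue.popleft()
        packet = store.receive(addr, side, value)
        if packet is None:
            continue  # slot not yet full, keep waiting for the partner operand
        dest, dest_side, result = processing_element(packet)
        if dest is None:
            outputs.append(result)  # assumption: no destination means final output
        else:
            queue.append((dest, dest_side, result))
    return outputs


# y = a*b + c*d with a=2, b=3, c=4, d=5; operands arrive in an arbitrary order.
PROGRAM = {0: ("*", 2, "L"), 1: ("*", 2, "R"), 2: ("+", None, None)}
TOKENS = [(0, "L", 2), (1, "L", 4), (0, "R", 3), (1, "R", 5)]
result = run(PROGRAM, TOKENS)
print(result)
```

The sketch uses a single processing element for simplicity; because each operation packet is self-contained, any idle PE could execute it.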

Data Flow - EM4
Architects: Yamaguchi, Sakai, Kodama, Sato et al.
ElectroTechnical Laboratory, Tsukuba, Japan
PE (EM-Y):
  CMOS gate array, 80k gates, 1.0 µm
  f = 20 MHz
  ~1992

Data Flow - Monsoon
Architects: Papadopoulos, Culler et al.
MIT, Cambridge
PE:
  f = 10 MHz
  ~1990
I-Structure Processor

Data Flow - I-Structures
Memory with a presence bit:
  Tag each memory location with a bit indicating its validity.
  Valid bit set -> normal read (no wait).
  Data not yet written (valid bit not set) -> wait; read requests are queued.
Data-driven execution: operations proceed when data is available.
[Figure: memory locations, each with a valid bit and a data field.]
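A small software model of this behaviour, as a sketch only: the class `IStructure`, its callback-based `read`, and the write-once check are assumptions for illustration (I-structure locations are conventionally written once, which the slide does not state), not the Monsoon hardware interface.

```python
class IStructure:
    """Hypothetical model of memory with one valid bit per location."""

    def __init__(self, size):
        self.valid = [False] * size
        self.data = [None] * size
        self.waiting = [[] for _ in range(size)]  # queued read requests

    def read(self, index, consumer):
        """Deliver the value now if valid; otherwise queue the request."""
        if self.valid[index]:
            consumer(self.data[index])
        else:
            self.waiting[index].append(consumer)

    def write(self, index, value):
        """Store the value, set the valid bit, and release queued readers."""
        if self.valid[index]:
            # Assumption: locations are single-assignment.
            raise ValueError("location %d already written" % index)
        self.data[index] = value
        self.valid[index] = True
        queued, self.waiting[index] = self.waiting[index], []
        for consumer in queued:
            consumer(value)


log = []
mem = IStructure(4)
mem.read(1, log.append)   # data not yet written: the request is queued
early = list(log)
mem.write(1, 42)          # the write releases the queued read
mem.read(1, log.append)   # valid bit set: normal read, no wait
print(early, log)
```

The reader never polls: the consumer runs when the data arrives, which is the data-driven rule applied to memory.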

Data Flow - Monsoon Pipeline
  8-stage pipeline.
  A "presence bits" stage checks operand availability.
  Operates on a frame (coarse-grain) basis.

Data Flow - Summary
Fine-grain dataflow
  Suffered from comms network overload!
Coarse-grain dataflow
  Monsoon ... overtaken by commercial technology!!
A sad "fact of life"
  It's almost impossible to generate the funds for non-"mainstream" computer architecture research.
  Funding of n x 10^8 dollars is required.
  Non-mainstream = interesting!

Data Flow - Summary
As a software model ...
  Functional languages: dataflow in a different guise!
  Theoretically important.
  Practically? Inefficient (= slow!!) ... Ask your CS colleagues!
Cilk
  Based on C.
  Used on the CIIPS Myrmidons.
  Uses a dataflow model: threads become ready for execution when their data is generated.
  Message-passing efficiency without explicit data transfer and synchronisation!

Networks
Network topology (or shape)
  Vital to efficient parallel algorithms.
  Communication is the limiting factor!
Ideal: cross-bar
  Any-to-any.
  Non-blocking, except when two sources send to the same receiver.
  Realisable, but only for limited order (number of ports).

Networks - Cross-bars
Achilles
  8 x 8
  Full duplex: simultaneous input and output at each port.
  32-bit data-path.
  Target: 1 Gbyte/second total throughput, but we needed the 3-D arrangement to achieve the bandwidth at high order.

Networks - Cross-bars
Achilles
  Hardware almost trivial!
  Single FPGA on each level.
  Programmable: VHDL models.
  Several topologies, just by changing the software!

Networks - More than 8 PEs
Simple: use two 8x8 routers!
But ... the link between the two routers gets a lot of traffic!

Networks - Fat tree
Problem: high-traffic links between PEs can become a bottleneck.
Solution: fat-tree
  Links higher up the tree are "fatter".
  Sustainable bandwidth between all PEs is the same.

Networks - Performance Metrics
Metrics for comparing network topologies:
Diameter
  Maximum distance between any pair of nodes.
  Determines latency.
Bisection bandwidth
  Aggregate bandwidth over any "cut" which divides the network in half.
  Determines throughput.
Crossbar
  Diameter: 1. Every PE is directly connected to the router, so a single "hop" suffices.
  Bisection bandwidth: b bytes/sec, where b is the bandwidth of a single link.

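Networks - Performance Metrics
Metrics for comparing network topologies
To connect n PEs with m x m crossbars, with single-link bandwidth b bytes/s:
Simple: n = 14 (2 switches; with 8x8 switches, each gives one port to the inter-switch link and keeps 7 for PEs)
  Diameter: 3
  Bisection bandwidth: b
[Figure: two linked switches, with labels 1, 2, 3.]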

Networks - Performance Metrics
Fat-tree
  Diameter: $2 \log_m n$. The height is $\log_m n$, and the worst-case distance goes up and down the tree.
  Bisection bandwidth: $b\,n/2$ bytes/sec. Links are fatter higher up the tree.

Networks - Performance Metrics
Mesh
  Diameter: $2\sqrt{n} - 2$
  Bisection bandwidth: $b\sqrt{n}$ bytes/sec
  Order: 4
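The formulas on the metrics slides can be collected into a small calculator for comparing topologies. This is a sketch under stated assumptions: the function names are hypothetical, the fat-tree height is rounded up to a whole number of levels, and the mesh is assumed to be square (n a perfect square).

```python
import math


def crossbar_metrics(b):
    """Single crossbar: one hop, bisection bandwidth b (as stated on the slide)."""
    return 1, b


def fat_tree_metrics(n, m, b):
    """Fat-tree of m x m switches: diameter 2*log_m(n), bisection b*n/2."""
    height = 0
    capacity = 1
    while capacity < n:  # integer log_m(n), rounded up, avoiding float error
        capacity *= m
        height += 1
    return 2 * height, b * n / 2


def mesh_metrics(n, b):
    """Square mesh: diameter 2*sqrt(n) - 2, bisection b*sqrt(n)."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes n is a perfect square")
    return 2 * side - 2, b * side


# Compare 64 PEs with unit link bandwidth.
for name, (diameter, bisection) in [
    ("crossbar", crossbar_metrics(1.0)),
    ("fat-tree", fat_tree_metrics(64, 8, 1.0)),
    ("mesh", mesh_metrics(64, 1.0)),
]:
    print(name, diameter, bisection)
```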

Networks - Performance Metrics
Hypercube
  Hypercube of order m: link two order m-1 hypercubes with $2^{m-1}$ links.
  Number of PEs: $n = 2^m$
  Order: $\log_2 n = m$
[Figure: order 2 hypercube and order 3 hypercube.]

Networks - Hypercubes
Embedding property
  In an n-PE hypercube, we have hypercubes of size n/2, n/4, ...
  Number the PEs with binary numbers: 000, 001, 010, 011, 100, ...
  Joining two hypercubes adds one binary digit to the numbering.
  Each PE is connected to every PE whose index differs in only one bit.
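The one-bit-difference rule maps directly onto bit operations. The sketch below uses hypothetical helper names; the hop count as the number of differing bits is a standard consequence of the numbering rather than something stated on the slide.

```python
def neighbours(pe, order):
    """PEs whose index differs from `pe` in exactly one bit."""
    return [pe ^ (1 << bit) for bit in range(order)]


def hops(src, dst):
    """Minimum hops between two PEs: the number of differing index bits."""
    return bin(src ^ dst).count("1")


print(neighbours(0b000, 3))   # neighbours of PE 000 in an order 3 hypercube
print(hops(0b000, 0b111))     # opposite corners of the cube
```

Each hop corrects one differing bit, so the longest route in an order m hypercube takes m hops.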

Networks - Hypercubes
Embedding property: partitioning tasks
  Allocate tasks to sub-cubes.
  Sub-tasks are allocated to sub-cubes of that cube, etc.
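One way to picture the partitioning is to fix the leading bits of the PE index: each choice of leading bits selects one sub-cube. The function `subcubes` below is a hypothetical illustration of that idea, not an allocation scheme taken from the slides.

```python
def subcubes(order, depth):
    """Split an order-`order` hypercube into 2**depth sub-cubes.

    Each sub-cube holds the PEs that share the same top `depth` index bits,
    so it is itself a hypercube of order `order - depth`.
    """
    if not 0 <= depth <= order:
        raise ValueError("depth must be between 0 and order")
    low_bits = order - depth
    return [
        [(prefix << low_bits) | low for low in range(1 << low_bits)]
        for prefix in range(1 << depth)
    ]


halves = subcubes(3, 1)     # two order 2 sub-cubes for two tasks
quarters = subcubes(3, 2)   # four order 1 sub-cubes for their sub-tasks
print(halves)
print(quarters)
```

Because each quarter lies inside one half, a task's sub-tasks stay within the sub-cube allocated to that task.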