CA406 Computer Architecture Networks

Data Flow - Summary
- Fine-Grain Dataflow: suffered from comms network overload!
- Coarse-Grain Dataflow: Monsoon... overtaken by commercial technology!!
- A sad "fact-of-life": it's almost impossible to generate the funds for non-"mainstream" computer architecture research ($n × 10^8 required)
- Non-mainstream = interesting!

Data Flow - Summary
- As a software model...
  - Functional languages: dataflow in a different guise!
  - Theoretically important. Practically? Inefficient (= slow!!)... Ask your CS colleagues!
- Cilk - based on C
  - Used on CIIPS Myrmidons
  - Uses a dataflow model: threads become ready for execution when their data is generated
  - Message passing efficiency without explicit data transfer & synchronisation!
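
A rough sketch of this software-level dataflow idea, in Python rather than Cilk (the thread-pool mechanism here is purely illustrative, not how Cilk or the Myrmidons runtime actually work): a task fires only once the tasks producing its inputs have completed.

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor() as pool:
        a = pool.submit(lambda: 2 + 3)      # produces one input token
        b = pool.submit(lambda: 4 * 5)      # produces the other input token
        # the consumer task only becomes ready once both inputs exist
        c = pool.submit(lambda x, y: x + y, a.result(), b.result())
        print(c.result())                   # -> 25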

Networks
- Network Topology (or shape)
  - Vital to efficient parallel algorithms: communication is the limiting factor!
- Ideal: Cross-bar
  - Any-to-any
  - Non-blocking, except two sources to the same receiver
  - Realisable, but only for limited order (number of ports)
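
A minimal sketch of the "non-blocking except two sources to the same receiver" behaviour (illustrative Python, not a model of any particular switch): in one routing pass every request is granted unless its output port has already been claimed.

    def crossbar_route(requests):
        """requests: list of (src, dst) pairs for one routing cycle."""
        granted, blocked, taken = [], [], set()
        for src, dst in requests:
            if dst in taken:
                blocked.append((src, dst))   # two sources want the same receiver
            else:
                taken.add(dst)
                granted.append((src, dst))
        return granted, blocked

    # crossbar_route([(0, 5), (1, 5), (2, 3)]) grants (0, 5) and (2, 3)
    # but blocks (1, 5) - the only blocking case a crossbar has.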

Networks - Cross-bars
- Achilles
  - 8 x 8
  - Full duplex: simultaneous input and output at each port
  - 32-bit data-path
  - Target: 1 Gbyte/second total throughput
  - but we needed the 3-D arrangement to achieve the bandwidth and high order

Networks - Cross-bars
- Achilles
  - Hardware almost trivial! Single FPGA on each level
  - Programmable: VHDL models
  - Several topologies, just by changing the software!

Networks - More than 8 PEs
- Simple: use 2 8x8 routers!
- but... the link between the two routers gets a lot of traffic!

Networks - Fat tree
- Problem: high-traffic links between PEs can become a bottleneck
- Solution: Fat-tree
  - Links higher up the tree are "fatter"
  - Sustainable bandwidth between all PEs is the same
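
A small illustrative calculation (assuming a binary tree and an arbitrary per-leaf link bandwidth b, not the Achilles design): making a link k levels above the leaves carry b * 2^k keeps the bandwidth across any cut equal to n/2 * b.

    def fat_link_bandwidth(level, b=1.0):
        """Capacity of a link 'level' hops above the leaf links of a binary fat-tree."""
        return b * (2 ** level)

    # For 16 PEs and b = 1 unit per leaf link, each of the two links just
    # below the root carries fat_link_bandwidth(3) = 8 units, so the bisection
    # still passes n/2 * b = 8 units in aggregate.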

Networks - Performance Metrics
- Metrics for comparing network topologies
  - Diameter: maximum distance between any pair of nodes. Determines latency
  - Bisection Bandwidth: aggregate bandwidth over any "cut" which divides the network in half. Determines throughput
- Crossbar
  - Diameter: 1 (every PE is directly connected to the router, so a single "hop" suffices)
  - Bisection Bandwidth: b bytes/sec, where b is the bandwidth of a single link
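
As an illustrative sketch (an assumed Python helper, not part of the course material), the diameter metric can be computed directly from an adjacency description by breadth-first search; modelling a crossbar as a direct any-to-any connection between PEs reproduces the diameter of 1.

    from collections import deque

    def diameter(adj):
        """adj: dict node -> list of neighbours. Max shortest-path hop count."""
        def eccentricity(src):
            dist, q = {src: 0}, deque([src])
            while q:
                u = q.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        q.append(v)
            return max(dist.values())
        return max(eccentricity(node) for node in adj)

    # 4-PE crossbar, modelled as any-to-any links between the PEs:
    xbar = {i: [j for j in range(4) if j != i] for i in range(4)}
    print(diameter(xbar))   # -> 1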

Networks - Performance Metrics
- Metrics for comparing network topologies: connect n PEs with m x m crossbars, single link bandwidth b bytes/s
- Simple (2 switches): n = 14
  - Diameter: 3
  - Bisection Bandwidth: b

Networks - Performance Metrics
- Fat-tree
  - Diameter: 2 log_m n (height is log_m n; worst case distance is up and back down)
  - Bisection Bandwidth: b·n/2 bytes/sec (links are fatter higher up the tree)
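
A worked example with assumed numbers (n = 64 PEs built from m = 8 port switches, unit link bandwidth b), just to put values on the two formulas:

    import math

    n, m, b = 64, 8, 1.0
    height = math.log(n, m)          # log_m n = 2 levels of switches
    diam = 2 * height                # up to the root and back down = 4 hops
    bisection = b * n / 2            # fat upper links keep the full 32 * b
    print(diam, bisection)           # -> 4.0 32.0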

Networks - Performance Metrics
- Mesh
  - Diameter: 2√n - 2
  - Bisection Bandwidth: b·√n bytes/sec
  - Order: 4
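
Again with assumed illustrative numbers (a 4 x 4 square mesh, unit link bandwidth b):

    import math

    n, b = 16, 1.0
    side = math.isqrt(n)             # sqrt(n) = 4 PEs per side
    diam = 2 * side - 2              # opposite corners: 6 hops
    bisection = b * side             # a cut down the middle severs sqrt(n) links
    print(diam, bisection)           # -> 6 4.0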

Networks - Performance Metrics
- Hypercube
  - Hypercube of order m: link 2 order m-1 hypercubes with 2^(m-1) links
  - Number of PEs: n = 2^m
  - Order: log_2 n = m
- [Figures: Order 2 Hypercube, Order 3 Hypercube]
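
A small sketch of the recursive construction described above (an assumed helper, with nodes numbered 0 .. 2^m - 1): two order m-1 cubes are copied and joined by 2^(m-1) extra links.

    def hypercube_edges(m):
        """Links of an order-m hypercube, built from two order-(m-1) cubes."""
        if m == 0:
            return []
        half = 2 ** (m - 1)
        lower = hypercube_edges(m - 1)
        upper = [(u + half, v + half) for u, v in lower]
        bridge = [(i, i + half) for i in range(half)]   # the 2**(m-1) joining links
        return lower + upper + bridge

    # An order-3 hypercube has n = 8 PEs and 12 links:
    print(len(hypercube_edges(3)))   # -> 12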

Networks - Hypercubes
- Embedding property
  - In an n-PE hypercube, we have hypercubes of size n/2, n/4, ...
  - Number PEs with binary numbers 000, 001, 010, 011, 100, ...
  - Joining two hypercubes adds one binary digit to the numbering
  - Each PE is connected to every PE whose index differs in only one bit
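
The "differs in only one bit" rule makes neighbour computation a matter of flipping each bit of the PE index; a minimal sketch, assuming PE indices 0 .. 2^m - 1:

    def neighbours(pe, m):
        """The m PEs directly linked to 'pe' in an order-m hypercube."""
        return [pe ^ (1 << bit) for bit in range(m)]

    print(neighbours(0b010, 3))      # -> [3, 0, 6], i.e. 011, 000, 110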

Networks - Hypercubes
- Embedding property
  - Partitioning tasks: allocate to sub-cubes
  - Sub-tasks allocated to sub-cubes of that cube, etc.

Futures

VLIW - Very Long Instruction Word
- Instruction word: multiple operations (n RISC-style instructions)
- Architecture: fixed set of functional units
  - Each FU matched to a "slot" in the instruction
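
A purely hypothetical illustration of the slot idea (the slot names and operation strings are invented, not a real ISA): one long word carries one operation per functional unit, and an unfilled slot is an explicit no-op.

    VLIW_SLOTS = ("alu0", "alu1", "mem", "branch")   # assumed functional unit mix

    def make_word(**ops):
        """Pack RISC-style operations into their matching slots; None = NOP."""
        return tuple(ops.get(slot) for slot in VLIW_SLOTS)

    word = make_word(alu0="add r1,r2,r3", mem="ld r4,0(r5)")
    # -> ('add r1,r2,r3', None, 'ld r4,0(r5)', None): two useful ops, two NOPs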

VLIW - Very Long Instruction Word
- Compiler responsible for allocating instructions to words
  - Burden squarely on the compiler: needs to produce a near-optimal schedule
  - Inevitable: large number of empty slots! → lower code density
- Similar to superscalar, but instruction issue flexibility is missing
  - VLIW simpler → faster?
- Re-compilation needed: each new generation will have a different functional unit mix
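
A toy greedy scheduler (an assumed sketch, nothing like a real VLIW compiler pass) makes the empty-slot problem concrete: dependent operations are pushed into later words, and every unfilled slot in between is wasted code space.

    def schedule(ops, units=("alu0", "alu1", "mem", "branch")):
        """ops: list of (name, unit, list_of_dependencies). Returns VLIW words."""
        words, placed_in = [], {}
        for name, unit, deps in ops:
            w = max((placed_in[d] + 1 for d in deps), default=0)
            while True:
                while w >= len(words):
                    words.append({u: None for u in units})
                if words[w][unit] is None:          # free slot in this word
                    words[w][unit] = name
                    placed_in[name] = w
                    break
                w += 1                              # slot taken: try the next word
        return words

    # Two independent adds share one word; the load that depends on "a" must
    # wait for the next word, leaving 6 of the 8 slots as NOPs.
    print(schedule([("a", "alu0", []), ("b", "alu1", []), ("c", "mem", ["a"])]))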

Synchronous Logic Systems
- Clock distribution
  - Major problem for the chip architect
  - Clock skews < ps over the whole die (10% of cycle time)
  - Small changes → re-engineer the whole chip
  - Checking for data hazards & logic races

Synchronous Logic Systems
- Clock distribution
- Power consumption
  - Major: 30W+ per chip
  - CMOS logic consumes power only on a switch, but synchronous systems clock a lot of logic on every cycle
  - Clock is distributed to every subsystem, even if the logic of the subsystem is disabled!
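
A back-of-the-envelope sketch of why clocking everything matters, using the standard dynamic CMOS power relation P = a·C·V^2·f with entirely made-up numbers (none of these figures come from the slides):

    def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
        """Switching power of CMOS logic: P = a * C * V**2 * f."""
        return activity * capacitance_f * voltage_v ** 2 * frequency_hz

    # Ordinary logic toggles only part of the time (a ~ 0.2), but the clock
    # network switches every cycle (a = 1.0), so it dominates unless gated.
    print(dynamic_power(0.2, 2e-9, 1.2, 2e9))    # -> ~1.15 W of logic switching
    print(dynamic_power(1.0, 0.5e-9, 1.2, 2e9))  # clock tree alone: ~1.44 W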

Synchronous Logic Systems
- Clock distribution
- Power consumption
- Worst case propagation delay
  - Determines maximum clock speed: the clock edge must wait until all logic has settled
  - Temperature and process fabrication → even slower clocks
- Design is simpler
  - Logic designers have experience
  - Good tools
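
A one-line sketch of the "clock edge must wait for the slowest path" constraint, with an assumed margin factor for temperature and process variation (the numbers are illustrative):

    def max_clock_hz(worst_case_delay_s, margin=1.2):
        """The clock period must cover the worst-case settling time plus margin."""
        return 1.0 / (worst_case_delay_s * margin)

    print(max_clock_hz(2.0e-9))   # 2 ns critical path -> ~417 MHz, not 500 MHz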

Asynchronous Logic Systems
- Clock distribution: no longer a problem
  - Synchronisation bundled with data
- Circuits are composable
  - No global clock... → no need to re-engineer a whole chip to change one section!
  - Known correct circuits can be combined
- Power consumption
  - Circuits switch only when they're computing → potentially very low power consumption
  - May be the biggest attraction of asynch systems!

Asynchronous Logic Systems
- Clock distribution problem removed
- Circuits are composable
- Power consumption
- Average case propagation delay
  - Completion signal generated when the result is available
  - Independent of temperature and process fabrication
- Design is harder
  - Experience will remove this?

Laboratory 1.51
Practical Examinations will be held in this laboratory every afternoon from 1:50pm to 5:30pm next week, June 1 to June 5. The laboratory will be closed to everyone except those in CT105/CLP110 actually taking the exams during these times. Please consider the students taking the exams by not disturbing them in any way.