CS 258 Parallel Computer Architecture Lecture 5 Routing (Con’t)


CS 258 Parallel Computer Architecture Lecture 5 Routing (Con’t) February 11, 2008 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs258

Recall: Deadlock-free wormhole networks. Basic dimension-order routing techniques don’t work for unidirectional k-ary d-cubes, only for (bidirectional) k-ary d-arrays. Idea: add channels! Provide multiple “virtual channels” to break the dependence cycle; good for bandwidth too. No need to add links or crossbar ports, only buffer resources. This adds nodes to the channel dependence graph (CDG); does it remove edges?
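The dependence-cycle argument can be checked mechanically. The sketch below is an illustration rather than anything taken from the lecture: it builds the channel dependence graph for a unidirectional k-node ring and tests it for cycles, once with a single channel per link and once with two virtual channels where a packet moves to the second after crossing the wrap-around link. The function names `ring_cdg` and `has_cycle` and the particular channel-switching rule are assumptions made for this example.

```python
from collections import defaultdict

def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE

    def visit(u):
        color[u] = GRAY
        for v in graph.get(u, ()):
            if color[v] == GRAY:          # back edge: cycle found
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in list(graph))

def ring_cdg(k, use_virtual_channels):
    """Channel dependence graph for a unidirectional k-node ring.

    Channel (i, vc) is the link from node i to node (i + 1) % k on virtual
    channel vc. Packets start on vc 0. With virtual channels enabled
    (assumed rule), a packet switches to vc 1 after crossing the
    wrap-around link from node k-1 to node 0.
    """
    cdg = defaultdict(set)
    for src in range(k):
        for dst in range(k):
            if src == dst:
                continue
            node, vc, prev = src, 0, None
            while node != dst:
                chan = (node, vc)
                if prev is not None:
                    cdg[prev].add(chan)   # packet holds prev while waiting for chan
                prev = chan
                nxt = (node + 1) % k
                if use_virtual_channels and nxt == 0:
                    vc = 1                # crossed the wrap-around link
                node = nxt
    return cdg

if __name__ == "__main__":
    print("single channel, cycle:", has_cycle(ring_cdg(4, False)))
    print("two virtual channels, cycle:", has_cycle(ring_cdg(4, True)))
```

In this model the virtual channels add nodes to the graph, and the switching rule removes the one edge that closed the cycle.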

Recall: Use of virtual channels for adaptation. Want to route around hotspots/faults while avoiding deadlock. “An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes,” Linder and Harden, 1991: general technique for k-ary n-cubes. Requires $2^{n-1}$ virtual channels/lane!!! Alternative: planar adaptive routing (Chien and Kim, 1995). Divide dimensions into “planes”, i.e. in a 3-cube, use X-Y and Y-Z. Route planes adaptively in order: first X-Y, then Y-Z. Never go back to a plane once you have left it. Can’t leave a plane until the lowest coordinate has been routed. Use the Linder-Harden technique for a series of 2-dimensional planes. Now need only 3 × (number of planes) virtual channels. Alternative: two-phase routing. Provide a set of virtual channels that can be used arbitrarily for routing. When blocked, use unrelated virtual channels for dimension-order (deterministic) routing. Never progress from deterministic routing back to adaptive routing.

Breaking deadlock with virtual channels

Unidirectional k-ary n-cubes: n+1 virtual channels (one wrap-around per channel). Switch to a new “level” whenever wrapping around in any dimension. Any adaptive routing solution is possible as long as it doesn’t use more than n wrap-around channels. If more adaptivity is wanted, can add more levels (and more virtual channels).

Bidirectional k-ary n-cube: need $2^{n-1}$ virtual networks. Except for the lowest dimension, each involves only a single direction.

Switch Design

How do you build a crossbar?

Input-buffered switch: independent routing logic per input (FSM). Scheduler logic arbitrates each output: priority, FIFO, random. Head-of-line blocking problem.

Output Buffered Switch How would you build a shared pool?

Output scheduling: n independent arbitration problems? Static priority, random, round-robin. Simplifications due to routing algorithm? The general case is maximum bipartite matching.
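As an illustration of the "n independent arbitration problems" case, here is a minimal round-robin sketch. It assumes each input presents at most one request (the head packet of its FIFO), so each output can arbitrate on its own. The function name and data layout are hypothetical, not from the lecture.

```python
def round_robin_arbitrate(requests, pointers):
    """One arbitration round for an input-buffered crossbar.

    requests[i] is the output port wanted by the head packet of input i,
    or None if input i is idle. pointers[o] is the input that currently
    has highest priority at output o. Returns {output: granted_input} and
    advances each granting output's pointer just past its winner.
    """
    num_inputs = len(requests)
    grants = {}
    for out in range(len(pointers)):
        # Scan inputs starting from this output's priority pointer.
        for offset in range(num_inputs):
            inp = (pointers[out] + offset) % num_inputs
            if requests[inp] == out:
                grants[out] = inp
                pointers[out] = (inp + 1) % num_inputs
                break
    return grants

if __name__ == "__main__":
    pointers = [0, 0, 0, 0]
    requests = [0, 0, 1, None]   # inputs 0 and 1 both want output 0
    print(round_robin_arbitrate(requests, pointers))
    print(round_robin_arbitrate(requests, pointers))
```

If an input could request several outputs at once (for example with per-output queues), the outputs are no longer independent and the problem becomes the bipartite matching mentioned above.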

When are virtual channels allocated? Hardware-efficient design for crossbar: two separate processes, virtual channel allocation and switch/connection allocation. Virtual channel allocation: choose route and free output virtual channel. Switch allocation: for each incoming virtual channel, must negotiate switch on outgoing pin. In the ideal case (not highly loaded), would like to optimistically allocate a virtual channel.

Delay analysis of wormhole router: “A Delay Model and Speculative Architecture for Pipelined Routers,” Li-Shiuan Peh and William Dally. Canonical model for a virtual-channel router. Separate routing, virtual-channel allocation, and switch allocation.

Virtual Channel Analysis: identified various complex modules within the router. Identified a pipelining model. Speculative virtual channel allocation. Developed process-independent models. The result permits evaluating the number of pipeline stages.

Process-Independent Modeling: How might we evaluate complexity of logic? Ideally, have some measure that reflects algorithmic complexity, not technology-dependent computations. What is a good normalization? A single, minimum-sized inverter. Call the delay of this τ.

Logical Effort: Delay in a Logic Gate. Express delays in a process-independent unit. Delay has two components, d = f + p. Effort delay f = gh (a.k.a. stage effort) again has two components. g: logical effort; measures the relative ability of the gate to deliver current; g ≡ 1 for an inverter. h: electrical effort = Cout / Cin; the ratio of output to input capacitance, sometimes called fanout. p: parasitic delay; represents the delay of the gate driving no load; set by internal parasitic capacitance.

Delay Plots d = f + p = gh + p

Computing Logical Effort DEF: Logical effort is the ratio of the input capacitance of a gate to the input capacitance of an inverter delivering the same output current. Measure from delay vs. fanout plots Or estimate by counting transistor widths

Catalog of Gates: logical effort of common gates, by number of inputs

Gate type        1    2      3          4              n
Inverter         1
NAND                  4/3    5/3        6/3            (n+2)/3
NOR                   5/3    7/3        9/3            (2n+1)/3
Tristate / mux   2    2      2          2              2
XOR, XNOR             4, 4   6, 12, 6   8, 16, 16, 8

Catalog of Gates: parasitic delay of common gates, in multiples of pinv (≈ 1), by number of inputs

Gate type        1    2    3    4    n
Inverter         1
NAND                  2    3    4    n
NOR                   2    3    4    n
Tristate / mux   2    4    6    8    2n
XOR, XNOR             4    6    8
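The catalog entries plug directly into d = gh + p. The sketch below encodes the closed forms (n+2)/3 for NAND and (2n+1)/3 for NOR, and assumes the usual logical-effort estimates of p = 1 for an inverter and p = n for n-input NAND/NOR gates. The function names are illustrative.

```python
from fractions import Fraction

def logical_effort(gate, n=1):
    """Logical effort g from the closed-form catalog entries."""
    if gate == "inv":
        return Fraction(1)
    if gate == "nand":
        return Fraction(n + 2, 3)
    if gate == "nor":
        return Fraction(2 * n + 1, 3)
    raise ValueError(f"unknown gate type: {gate}")

def parasitic_delay(gate, n=1):
    """Parasitic delay p in multiples of p_inv = 1 (assumed: n for NAND/NOR)."""
    if gate == "inv":
        return Fraction(1)
    if gate in ("nand", "nor"):
        return Fraction(n)
    raise ValueError(f"unknown gate type: {gate}")

def stage_delay(gate, n, h):
    """Normalized stage delay d = g*h + p, in units of tau."""
    return logical_effort(gate, n) * h + parasitic_delay(gate, n)

if __name__ == "__main__":
    print("inverter, fanout 4:", stage_delay("inv", 1, 4))
    print("NAND2, fanout 3:   ", stage_delay("nand", 2, 3))
    print("NOR3, fanout 3:    ", stage_delay("nor", 3, 3))
```

Exact fractions are used so that results such as 4/3 and 7/3 are not blurred by floating-point rounding.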

Example: Ring Oscillator. Estimate the frequency of an N-stage ring oscillator. Logical effort: g = 1. Electrical effort: h = 1. Parasitic delay: p = 1. Stage delay: d = 2. Frequency: fosc = 1/(2*N*d) = 1/(4N), with d in units of τ. A 31-stage ring oscillator in a 0.6 μm process has a frequency of ~200 MHz.

Example: FO4 Inverter. Estimate the delay of a fanout-of-4 (FO4) inverter. Logical effort: g = 1. Electrical effort: h = 4. Parasitic delay: p = 1. Stage delay: d = 5. The FO4 delay is about 200 ps in a 0.6 μm process, 60 ps in a 180 nm process, and f/3 ns in an f μm process.
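The two examples can be tied together numerically. The sketch below assumes τ = 40 ps for the 0.6 μm process, a value inferred here from the quoted FO4 delay (200 ps / 5) rather than stated in the slides. With that assumption the 31-stage ring oscillator comes out near the quoted 200 MHz.

```python
TAU_PS = 40.0  # assumed: 200 ps FO4 delay divided by d = 5

def stage_delay(g, h, p):
    """Normalized stage delay d = g*h + p, in units of tau."""
    return g * h + p

def fo4_delay_ps(tau_ps=TAU_PS):
    """Absolute delay of a fanout-of-4 inverter (g = 1, h = 4, p = 1)."""
    return stage_delay(1, 4, 1) * tau_ps

def ring_oscillator_mhz(n_stages, tau_ps=TAU_PS):
    """Frequency of an N-stage inverter ring: f = 1 / (2 * N * d * tau)."""
    d = stage_delay(1, 1, 1)               # each inverter drives one inverter
    period_ps = 2 * n_stages * d * tau_ps  # a full cycle goes around the ring twice
    return 1e6 / period_ps                 # 1/ps equals 1e6 MHz

if __name__ == "__main__":
    print(f"FO4 delay: {fo4_delay_ps():.0f} ps")
    print(f"31-stage ring oscillator: {ring_oscillator_mhz(31):.1f} MHz")
```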

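Multistage Logic Networks: logical effort generalizes to multistage networks. Path logical effort: $G = \prod_i g_i$. Path electrical effort: $H = C_{\text{out-path}} / C_{\text{in-path}}$. Path effort: $F = \prod_i f_i = \prod_i g_i h_i$. Can we write F = GH?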

Paths that Branch: No! Consider paths that branch: G = 1, H = 90 / 5 = 18, GH = 18, h1 = (15 + 15) / 5 = 6, h2 = 90 / 15 = 6, F = g1 g2 h1 h2 = 36 = 2GH.

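Branching Effort: introduce branching effort, which accounts for branching between stages in a path: $b = (C_{\text{on-path}} + C_{\text{off-path}}) / C_{\text{on-path}}$ and $B = \prod_i b_i$. Now we compute the path effort F = GBH. Note: $\prod_i h_i = BH$.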
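The branching example can be reproduced in a few lines. The sketch assumes the standard definition of branching effort, b = (C_on-path + C_off-path) / C_on-path, and uses the capacitances of the two-inverter example (input 5, two branches of 15, load 90). Variable names are illustrative.

```python
from math import prod

def branching_effort(c_on_path, c_off_path):
    """b = (C_on + C_off) / C_on for one stage of the path."""
    return (c_on_path + c_off_path) / c_on_path

# Two-inverter path: input cap 5, first stage drives two gates of cap 15,
# the on-path second stage drives a load of 90.
g = [1, 1]
c_in, c_mid, c_load = 5, 15, 90

h = [(c_mid + c_mid) / c_in, c_load / c_mid]   # per-stage electrical efforts
G = prod(g)                                    # path logical effort
H = c_load / c_in                              # path electrical effort
B = branching_effort(c_mid, c_mid) * branching_effort(c_load, 0)

F_stagewise = prod(gi * hi for gi, hi in zip(g, h))
F_path = G * B * H

if __name__ == "__main__":
    print("G*H     =", G * H)
    print("product of stage efforts =", F_stagewise)
    print("G*B*H   =", F_path)
```

The stage-by-stage product and GBH agree, while GH alone undercounts by the branching factor of 2.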

Multistage Delays: path effort delay $D_F = \sum_i f_i$. Path parasitic delay $P = \sum_i p_i$. Path delay $D = \sum_i d_i = D_F + P$.

Designing Fast Circuits: delay is smallest when each stage bears the same effort, $\hat{f} = g_i h_i = F^{1/N}$. Thus the minimum delay of an N-stage path is $D = N F^{1/N} + P$. This is a key result of logical effort: it finds the fastest possible delay and doesn’t require calculating gate sizes.
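A short numeric check of the equal-effort result, as a sketch. It reuses the path effort F = 36 of the two-inverter branching example and assumes P = 2 (two inverters with p = 1 each). The function names are illustrative.

```python
def min_path_delay(F, P, N):
    """Minimum delay D = N * F**(1/N) + P and the per-stage effort f_hat."""
    f_hat = F ** (1.0 / N)
    return N * f_hat + P, f_hat

def path_delay(stage_efforts, P):
    """Delay of a path with arbitrary stage efforts: D = sum(f_i) + P."""
    return sum(stage_efforts) + P

if __name__ == "__main__":
    F, P, N = 36, 2, 2
    d_min, f_hat = min_path_delay(F, P, N)
    print(f"equal efforts f_hat = {f_hat:.2f}, D = {d_min:.2f}")
    # Same path effort (4 * 9 = 36) split unevenly gives a larger delay.
    print(f"uneven efforts (4, 9), D = {path_delay([4, 9], P):.2f}")
```

Any other split of the same path effort across the two stages gives a larger sum, which is why the minimum can be stated before any gate sizes are known.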

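Gate Sizes: how wide should the gates be for least delay? From $\hat{f} = g h = g\,C_{\text{out}} / C_{\text{in}}$ it follows that $C_{\text{in},i} = g_i\,C_{\text{out},i} / \hat{f}$. Working backward from the output, apply this capacitance transformation to find the input capacitance of each gate given the load it drives. Check the work by verifying that the input capacitance spec is met.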
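The backward pass can be sketched as follows. The path used here (NAND2 driving three copies of a NAND3, which drives two copies of a NOR2 into a load of 45, with an input capacitance spec of 8) is an illustrative example chosen for this sketch, not one taken from these slides. Its path effort is F = (100/27) × 6 × (45/8) = 125, so with N = 3 the best stage effort is 5.

```python
from fractions import Fraction

def size_path(stages, c_load, f_hat):
    """Work backward from the load to each stage's input capacitance.

    stages: list of (g, b) from input to output, where g is the stage's
    logical effort and b the branching effort at its output.
    Applies C_in = g * C_out / f_hat, with C_out = b * (next stage's C_in).
    Returns input capacitances ordered from first stage to last.
    """
    c_in = []
    c_next = Fraction(c_load)
    for g, b in reversed(stages):
        c_out = b * c_next           # total load seen by this stage
        c_next = g * c_out / f_hat   # capacitance transformation
        c_in.append(c_next)
    return list(reversed(c_in))

if __name__ == "__main__":
    stages = [
        (Fraction(4, 3), 3),  # NAND2 driving three branches
        (Fraction(5, 3), 2),  # NAND3 driving two branches
        (Fraction(5, 3), 1),  # NOR2 driving the load
    ]
    sizes = size_path(stages, c_load=45, f_hat=Fraction(5))
    print("input capacitances:", [str(c) for c in sizes])
```

The first entry of the result is the path's input capacitance, so comparing it against the spec (8 in this example) is the check described above.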

How does this relate to Router Model? Example of results possible: Evaluation of latency as function of VC-allocation algorithm complexity Develop VC-allocator module as circuit, compute logical effort

Summary. Deadlock-free if the channel dependence graph is acyclic: limit turns to eliminate dependences; add separate channel resources to break dependences; combination of topology, algorithm, and switch design. Switch design issues: input/output/pooled buffering, routing logic, selection logic. Logical Effort: technology-independent delay model, compared with an inverter; d = gh + p, where g is logical effort, h is electrical effort, and p is parasitic delay. “A Delay Model and Speculative Architecture for Pipelined Routers”: speculation on virtual-channel allocation improves low-conflict latency and throughput.