On-Chip Interconnect Trend and Design Optimization Chung-Kuan Cheng UC San Diego, La Jolla, CA.

Outline
- Global Interconnect Technologies: RC Trees and Transmission Lines
- Prefix Adder Synthesis: Modeling
- FPGA Interconnect Architecture: Modeling
- Interconnect Architecture: Non-Manhattan Wire Arrangement

Interconnect Technologies
- Introduction
- On-Chip Global Interconnection
- Global Wire Modeling
- Performance Comparison

Introduction: Performance Impact
Interconnect delay determines system performance [ITRS08]:
- 542 ps for a 1 mm minimum-pitch Cu global wire without repeaters at 45nm
- ~150 ps for 10 levels of FO4 delay at 45nm
[Ho2001] “Future of Wire”

Introduction: Power Dissipation
Interconnects consume a significant portion of power.
- 1-2 orders of magnitude larger than gates; half of the dynamic power is dissipated on the repeaters inserted to minimize latency [Zhang07]
- Wires consume 50% of total dynamic power in a 0.13um microprocessor [Magen04]; about 1/3 is burned on the global wires

Introduction: Technology Trend
On-chip interconnect scaling
- Dimensions shrink: wire resistance increases (-> RC delay) and capacitive coupling increases (-> delay, power, noise, etc.)
- Performance of global wires decreases with technology scaling

Scaling trend of per-unit-length (PUL) wire resistance and capacitance:

Wire category   Parameter       90nm     45nm     22nm
M1 wire         Rw (kohm/mm)    1.914    8.860    34.827
M1 wire         Cw (pF/mm)      0.183    0.157    0.129
Global wire     Rw (kohm/mm)    0.532    2.970    11.000
Global wire     Cw (pF/mm)      0.205    0.179    0.151

[Figure: copper resistivity versus wire width]
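To make the scaling trend concrete, the following minimal sketch multiplies the per-unit-length resistance and capacitance from the table for a 1 mm unrepeated wire. The function names are hypothetical, and the 0.38 factor is the usual textbook approximation for the 50% step delay of a distributed RC line, not a figure taken from the slides.

```python
# Per-unit-length values copied from the scaling table: (Rw in kohm/mm, Cw in pF/mm).
WIRES = {
    ("M1", "90nm"): (1.914, 0.183),
    ("M1", "45nm"): (8.860, 0.157),
    ("M1", "22nm"): (34.827, 0.129),
    ("Global", "90nm"): (0.532, 0.205),
    ("Global", "45nm"): (2.970, 0.179),
    ("Global", "22nm"): (11.000, 0.151),
}


def rc_product_ps(r_kohm_per_mm, c_pf_per_mm, length_mm=1.0):
    """Total R times total C of an unrepeated wire, in ps (kohm * pF = ns)."""
    return r_kohm_per_mm * c_pf_per_mm * length_mm ** 2 * 1000.0


def distributed_rc_delay_ps(r_kohm_per_mm, c_pf_per_mm, length_mm=1.0):
    """Approximate 50% delay of a distributed RC line (assumed 0.38 * R * C)."""
    return 0.38 * rc_product_ps(r_kohm_per_mm, c_pf_per_mm, length_mm)


if __name__ == "__main__":
    for (layer, node), (r, c) in WIRES.items():
        print(f"{layer:6s} {node}: RC = {rc_product_ps(r, c):8.1f} ps for 1 mm")
```

The RC product of the global wire grows from roughly 109 ps at 90nm to roughly 1660 ps at 22nm for the same 1 mm length, which is the quadratic-in-length penalty that repeaters or transmission lines are meant to avoid.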

Organization of On-Chip Global Interconnections

Multi-Dimensional Design Consideration
Preliminary analysis results assuming a 65nm CMOS process.
Application-oriented choice:
- Low latency: T-TL or UT-TL (single-ended T-lines)
- High throughput: R-RC
- Low power: PE-TL or UE-TL (differential T-lines)
- Low noise: PE-TL or UE-TL (differential T-lines)
- Low area/cost: R-RC
For each architecture, the more area its pentagon covers, the better the overall performance.

On-Chip Global Interconnect Schemes (1)
Repeated RC wires (R-RC); Un-Terminated and Terminated T-Line (UT-TL and T-TL)
R-RC structure
- Repeater size / length of segments
- Adopts previous design methodology [Zhang07]
UT-TL structure
- Full swing at wire end
- Tapered inverter chain as TX
T-TL structure
- Optimize eye height at wire end
- Non-tapered inverter chain as TX

On-Chip Global Interconnect Schemes (2)
Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)
- Driver side: tapered differential driver
- Receiver side: termination resistance, sense amplifier (SA) + inverter chain
- Passive equalizer: parallel RC network
- Design constraint: enough eye opening (50 mV) needed at the wire end

Effects of driver impedance and termination resistance on step response
- Larger driver impedance leads to a slower rising edge and a lower saturation voltage
- Larger termination resistance causes a sharper rising edge but a larger reflection
- An optimal R_load exists

Bit rate: 50 Gbps; R_s = 11.06 ohm, R_d = 350 ohm, C_d = 0.38 pF, R_L = 107.69 ohm
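The two trends above can be illustrated with first-order formulas: the far-end reflection coefficient of a line terminated in R_L, and the DC (saturation) level set by the resistive divider formed by driver, wire and termination. This is a sketch under simplifying assumptions (real-valued characteristic impedance, lumped wire resistance); the function names and the example values are hypothetical and not taken from the slides.

```python
def reflection_coefficient(r_load, z0):
    """Far-end reflection coefficient for a line of (assumed real) impedance z0."""
    return (r_load - z0) / (r_load + z0)


def saturation_voltage(vdd, r_driver, r_wire, r_load):
    """DC level at the wire end: divider of driver, wire and termination resistance."""
    return vdd * r_load / (r_driver + r_wire + r_load)


if __name__ == "__main__":
    z0 = 50.0  # assumed characteristic impedance in ohm, for illustration only
    for r_load in (50.0, 100.0, 200.0):
        gamma = reflection_coefficient(r_load, z0)
        print(f"R_L = {r_load:5.1f} ohm -> reflection coefficient {gamma:.2f}")
    for r_driver in (10.0, 40.0, 80.0):
        v = saturation_voltage(1.0, r_driver, 40.0, 100.0)
        print(f"R_driver = {r_driver:4.1f} ohm -> saturation voltage {v:.2f} V")
```

With the termination above the line impedance, raising R_L increases the reflection, while raising the driver resistance lowers the settled voltage, which matches the qualitative statements on the slide.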

Global Wire Modeling: Single-Ended and Differential On-Chip T-lines
- Determine the bit rate
- Find the smallest wire dimensions that satisfy the eye constraint
- Note that PE-TL needs narrower wires -> equalization helps to increase density
- Orthogonal layers are replaced by ground planes -> 2D capacitance extraction, accurate when loading density is high
- Top-layer thick wires are used -> dimensions are maintained as technology scales
- LC-mode behavior is dominant

Global Wire Modeling: RC Wires and T-lines
RC wire modeling
- Distributed Π model composed of wire resistance and capacitance
- Closed-form equations [Sim03] to calculate 2D wire capacitance
T-line 2D-R(f)L(f)C parameter extraction
- 2D-C extraction template
- 2D-R(f)L(f) extraction template
T-line modeling
- R(f)L(f)C tabular model -> transient simulation to estimate eye height
- Synthesized compact circuit model [Kopcsay02] -> study of signal integrity issues

Performance Analysis: Definitions
- Normalized delay (unit: ps/mm): propagation delay includes wire delay and gate delay
- Normalized energy per bit (unit: pJ/m): bit rate is assumed to be the inverse of propagation delay for RC wires
- Normalized throughput (unit: Gbps/um)

Performance Analysis: Latency
Variables: technology-defined parameters
- Supply voltage: Vdd (unit: V)
- Dielectric constant
- Min-sized inverter FO4 delay (unit: ps)
R-RC structure (min-d)
- [equation] is roughly constant
- FO4 delay scales with scaling factor S
- Increasing with technology scaling!
T-line structures
- Sum of wire delay and TX delay
- Wire delay: [equation]
- TX delay improves with FO4 delay
- Decreasing with technology scaling!

Performance Analysis: Energy per Bit
Same variables as defined before.
R-RC structure (min-d)
- Vdd reduces as technology scales
- [equation] reduces as technology scales
- Energy decreases with technology scaling!
T-line structures
- Sum of power consumed on wire and TX
- Power of T-line: [equation]
- Power of TX circuit: [equation]
- FO4 delay reduces exponentially
- Energy decreases with a larger slope!
Constant!

Performance Analysis: Throughput
Same variables as defined before.
R-RC structure (min-d)
- Assuming wire pitch [equation]
- FO4 delay reduces exponentially
- Throughput increases by 20% per generation!
T-line structures
- TX bandwidth: [equation]
- Neglect the minor change of wire pitch
- K_1 = 0 for UT-TL
- FO4 delay reduces exponentially
- Throughput increases by 43% per generation!
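As a rough illustration of what the two per-generation growth rates imply, the sketch below compounds them over the four generation steps from 90nm to 22nm. It assumes the rates stay constant across generations, which the slide states only as a trend; the function name is hypothetical.

```python
NODES = ["90nm", "65nm", "45nm", "32nm", "22nm"]


def compound_growth(rate_per_generation, generations):
    """Cumulative growth factor after a number of generations at a fixed rate."""
    return (1.0 + rate_per_generation) ** generations


if __name__ == "__main__":
    steps = len(NODES) - 1  # four generation steps from 90nm to 22nm
    print(f"R-RC   (20%/gen): x{compound_growth(0.20, steps):.2f}")
    print(f"T-line (43%/gen): x{compound_growth(0.43, steps):.2f}")
```

Under this assumption the T-line throughput grows about twice as much as the R-RC throughput over the same four steps.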

Design Framework for On-Chip T-line Schemes
The proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing the wire configuration and circuit structure. Different optimization routines (LP/ILP/SQP, etc.) can be adopted according to the problem formulation.

Experimental Settings
- Design objective: min-d
- Technology nodes: 90nm-22nm
- Five different global interconnection structures
- Wire length: 5mm
- Parameter extraction: 2D field solver CZ2D from the EIP tool suite of IBM; tabular model or synthesized model
- Transistor models: predictive transistor model from [Uemura06]; Synopsys level 3 MOSFET model tuned according to the ITRS roadmap
- Simulation: HSPICE 2005
- Modeling and optimization: linear or non-linear regression/SQP routine; MATLAB 2007

Performance Metric: Normalized Delay, Results and Comparison
Technology trends
- R-RC ↑
- T-line schemes ↓
T-line structures
- Outperform R-RC beyond 90nm
- Single-ended: lowest delay
At 22nm node
- R-RC: 55 ps/mm
- T-lines: 8 ps/mm (85% reduction)
- Speed of light: 5 ps/mm
Linear model
- < 6% average percent error

Performance Metric: Normalized Energy per Bit, Results and Comparison
Technology trends
- R-RC and T-lines ↓
- T-lines reduce more quickly
T-line structures
- Outperform R-RC beyond 45nm
- Differential: lowest energy
- Single-ended similar to R-RC; T-TL > UT-TL
At 22nm node
- R-RC: 100 pJ/m
- Single-ended: 60% reduction
- Differential: 96% reduction
Linear model
- < 12% average percent error
- Error for T-TL and PE-TL: R_L and passive equalizers

Performance Metric: Normalized Throughput, Results and Comparison
Technology trends
- R-RC and T-lines ↑
- T-lines increase more quickly
T-line structures
- Outperform R-RC beyond 32nm
- Differential better than single-ended
At 22nm node
- R-RC: 12 Gbps/um
- T-TL: 30% improvement
- UE-TL: 75% improvement
- PE-TL: ~2X of R-RC
Linear model
- < 7% average percent error

Signal Integrity: Single-Ended T-lines
Worst-case switching pattern for peak noise simulation
UT-TL structure
- 380 mV peak noise at 1 V supply voltage with 7 ps rise time
- SI could be a big issue as supply voltage drops
T-TL is less sensitive to noise
- At the same rise time, ~50% reduction of peak noise
- Peak noise ↓ as technology scales
[Figure panels: using the worst-case pattern; using single or multiple PRBS patterns]

Signal Integrity: Differential T-lines
Worst-case switching pattern for peak noise simulation
More reliable
- Termination resistance
- Common-mode noise reduction
Peak noise
- Within ~10 mV range
Eye heights
- UE-TL: eye reduces as bit rate ↑; harder to meet the constraint
- PE-TL: > 70 mV eye even at 22nm node; equalization does help!

Summary (cont'd)
Each table entry is score/value. A higher score is better for the given metric; the maximum score is 5. On the original slide the best structure in each column is marked in red.

Low-Latency Application (ps/mm)
Scheme   90nm   65nm   45nm   32nm   22nm
UT-TL    5/15   5/13   5/10   5/9    5/8
T-TL     5/15   5/13   5/10   5/9    5/8
UE-TL    1/37   3/25   3/16   3/12   5/8
PE-TL    1/37   3/25   3/16   3/12   5/8
R-RC     3/35, 1/42, 1/46, 1/55 (only four of five entries recoverable)

Low-Energy Application (pJ/m)
Scheme   90nm    65nm    45nm    32nm   22nm
UT-TL    3/140   3/110   3/70    3/50   2/40
T-TL     1/260   1/200   2/100   2/60   3/40
UE-TL    4/60    4/36    4/20    4/10   5/4
PE-TL    5/26    5/16    5/8     5/5    5/2
R-RC     2/150, 2/140, 1/130, 1/100 (only four of five entries recoverable)

High-Throughput Application (Gbps/um)
Scheme   90nm   65nm    45nm   32nm   22nm
R-RC     5/5    5/6     3/8    3/10   2/12
T-TL     1/3    2/3.4   2/6    2/9    3/16
UE-TL    3/3    3/5     4/9    4/13   4/21
PE-TL    4/4    4/5.3   5/9    5/15   5/24
UT-TL    2/3.3, 1/3.3 (only two of five entries recoverable)

Low-Noise Application (score only)
Scheme   90nm   65nm   45nm   32nm   22nm
R-RC     1      1      1      1      1
UT-TL    1      1      1      1      1
T-TL     3      3      3      3      3
UE-TL    5      5      4      4      4
PE-TL    4      4      5      5      5
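The percentages quoted on the results slides can be cross-checked against the 22nm values in the summary tables. The sketch below is only a consistency check with hypothetical helper names; the inputs are the rounded table values, so the T-TL throughput gain comes out near 33% rather than the quoted 30%.

```python
def percent_reduction(baseline, value):
    """Reduction of value relative to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


def percent_improvement(baseline, value):
    """Increase of value relative to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


if __name__ == "__main__":
    # 22nm values taken from the summary tables (R-RC is the baseline).
    print(f"Delay, T-lines vs R-RC:       {percent_reduction(55, 8):.0f}% lower")
    print(f"Energy, single-ended vs R-RC: {percent_reduction(100, 40):.0f}% lower")
    print(f"Energy, UE-TL vs R-RC:        {percent_reduction(100, 4):.0f}% lower")
    print(f"Throughput, T-TL vs R-RC:     {percent_improvement(12, 16):.0f}% higher")
    print(f"Throughput, UE-TL vs R-RC:    {percent_improvement(12, 21):.0f}% higher")
    print(f"Throughput, PE-TL vs R-RC:    {24 / 12:.1f}x")
```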

Summary of Global Interconnect
Five different global interconnections are compared in terms of latency, energy per bit, throughput, and signal integrity from 90nm to 22nm.
A simple linear model is provided to link
- Architecture-level performance metrics
- Technology-defined parameters
Some observations from the experimental results
- T-line structures have the potential to replace R-RC at future nodes
- Differential T-lines are better than single-ended ones: low power, high throughput, low noise
- Equalization could be utilized for on-chip global interconnection: higher throughput density and improved signal integrity, even with lower energy dissipation (passive equalization)

Prefix Adder Synthesis
- Motivation
- Prefix Adder Formulation: Area/Timing/Power Models; Mixed-Radix (2,3,4) Adders; ILP Formulation
- Experimental Results

Motivation: Prefix Adder
Increasing impact of physical design and concern for power.
- Logical levels, wire tracks, fanouts
- Area: physical placement, detailed routing
- Timing: gate cap, wire cap, gate sizing, buffer insertion, signal slope, input arrival time, output required time
- Power: static power, dynamic power, power gating, activity probability

Prefix Adder Formulation
- Input: two n-bit binary numbers and a one-bit carry-in
- Output: n-bit sum and a one-bit carry-out
- Prefix addition: carry generation and propagation

Prefix Addition: Formulation
- Pre-processing: [equations]
- Prefix computation: [equations]
- Post-processing: [equations]

Prefix Adder: Prefix Structure Graph
[Figure: prefix structure graph. Cell types: gp generator (pre-processing; signals a_i, b_i, gp_i), GP cell (prefix computation; GP[i, j], GP[j-1, k], GP[i, k]), sum generator (post-processing; p_i, G[i:0], s_i).]

Area Model
Distinguish physical placement from logical structure, but keep the bit-slice structure.
[Figure: logical view (bit position vs. logical level) and physical view (bit position vs. physical level) with compact placement, bit positions 1-8]
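The three stages can be sketched with the standard textbook generate/propagate recurrences: g_i = a_i AND b_i and p_i = a_i XOR b_i in pre-processing, the associative operator (G, P)[i:k] = (G[i:j] OR (P[i:j] AND G[j-1:k]), P[i:j] AND P[j-1:k]) in prefix computation, and s_i = p_i XOR c_i in post-processing. The sketch below uses a Kogge-Stone arrangement purely as an assumption for illustration; the slides synthesize the prefix structure with an ILP instead, and the function names are hypothetical.

```python
def gp_combine(hi, lo):
    """GP cell: merge (G, P) over [i:j] with (G, P) over [j-1:k] into [i:k]."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_add(a, b, n, carry_in=0):
    """Add two n-bit numbers; returns (n-bit sum, carry-out)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]

    # Pre-processing: per-bit generate and propagate signals.
    gp = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]
    p = [p_i for _, p_i in gp]
    # Fold the carry-in into bit 0 so that G[i:0] equals the carry out of bit i.
    gp[0] = (gp[0][0] | (gp[0][1] & carry_in), gp[0][1])

    # Prefix computation (Kogge-Stone arrangement, assumed for illustration).
    span = 1
    while span < n:
        nxt = list(gp)
        for i in range(span, n):
            nxt[i] = gp_combine(gp[i], gp[i - span])
        gp = nxt
        span *= 2

    # Post-processing: carries[i] is the carry into bit i.
    carries = [carry_in] + [g for g, _ in gp]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n]


if __name__ == "__main__":
    print(prefix_add(11, 6, 4))      # 4-bit addition with overflow
    print(prefix_add(200, 55, 8, 1))
```

Because `gp_combine` is associative, any prefix structure (serial, Sklansky, Kogge-Stone, or an ILP-synthesized graph) computes the same carries; the structures differ only in levels, fanout, and wiring.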

Timing Model
Cell delay calculation: delay = effort delay + intrinsic delay
- Effort delay = logical effort x electrical effort
- Electrical effort = Cout/Cin = (fanouts + wirelength) / size
- Logical effort and intrinsic delay are intrinsic properties of the cell

Power Model
Total power consumption: dynamic power + static power
- Static power: leakage current of the device; P_sta = [per-cell coefficient] x #cells
- Dynamic power: current switching capacitance; P_dyn = [switching probability] x C_load, with the switching probability set to j (j is the logical level*)
* Vanichayobon S., et al., "Power-speed Trade-off in Parallel Prefix Circuits"
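The cell delay model above follows the logical-effort form and can be written as a one-line function. This is a minimal sketch: the function name, the argument names, and the example numbers are hypothetical, and fanout and wirelength are assumed to be expressed in the same normalized capacitance units as the cell size.

```python
def cell_delay(logical_effort, fanouts, wirelength, size, intrinsic_delay):
    """Delay = logical effort * electrical effort + intrinsic delay."""
    electrical_effort = (fanouts + wirelength) / size  # Cout / Cin
    return logical_effort * electrical_effort + intrinsic_delay


if __name__ == "__main__":
    # Same load driven by a unit-size cell and by a cell twice as large.
    print(cell_delay(1.5, fanouts=3, wirelength=1, size=1, intrinsic_delay=2.0))
    print(cell_delay(1.5, fanouts=3, wirelength=1, size=2, intrinsic_delay=2.0))
```

Doubling the cell size halves the effort delay for a fixed load, which is why gate sizing appears as a timing lever in the formulation.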

Interval Adjacency Constraint
(column id, logic level)

Linearization for Interval Adjacency Constraint
Linearize; pseudo-linear; left interval bound equal to column index

ILP Formulation Overview
- Structure variables: GP cells, connections (wires), physical positions
- Capacitance variables: gate cap, vertical wire cap, horizontal wire cap
- Timing variables: input arrival time, output arrival time
- Power objective
Flow: ILP -> ILOG CPLEX -> optimal solution

Experiments: 16-bit Uniform Timing

Min-Power Radix-2 Adder (delay = 22, power = 45.5 FO4)
[Figure: 16-bit prefix graph]

Min-Power Radix-2&4 Adder (delay = 18, power = 29.75 FO4)
[Figure: 16-bit prefix graph with radix-2 and radix-4 cells]

Min-Power Mixed-Radix Adder (delay = 20, power = 28.0 FO4)
[Figure: 16-bit prefix graph with radix-2, radix-3, and radix-4 cells]

Experiments: 64-bit Hierarchical Structure (Mixed-Radix)
Handles high bit-width applications: 16x4 and 8x8

FPGA Global Routing Architecture
- Synthesis Flow
- Formulation
- Experimental Results

Synthesis Flow

Formulation

FPGA Global Routing Architecture

Energy Model: Wires
- 0.18um tech node, grid length = 0.5mm
- 4 types of wires: RC wires with different spacings, and transmission lines

Energy and Area Model: Switch Box
Switch area model
- F_s: number of switches connected to each wire entering a switch box
- f: total flow entering a switch box
- N_s: per-bit number of switches inside a switch box
Energy model
- P_u: energy of a single switch
- P_s: per-bit switch energy

Topology Generation
Candidate topologies are required for MCF interconnection synthesis
- MCF optimizes the flow distribution, but not the topology
A huge number of different topologies exists
- A row of 10 cells has 2^C(10, 2) = 2^45 different connections
- A 10 x 10 FPGA has (2^45)^20 = 2^900 different topologies!
Our assumptions
- Each row and column has the same connection
- Wire lengths are given (e.g. wire length = 1, 2, 4, 8, ...)
- A certain wire length repeats itself until the end of the chip
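The counting argument can be reproduced directly: each of the C(n, 2) cell pairs in a row is either connected or not, and an n x n array has n rows plus n columns. A minimal sketch with hypothetical function names:

```python
from math import comb


def row_connection_count(cells):
    """Number of ways to connect or not connect each pair of cells in one row."""
    return 2 ** comb(cells, 2)


def topology_count(n):
    """Topologies of an n x n array when all n rows and n columns are independent."""
    return row_connection_count(n) ** (2 * n)


if __name__ == "__main__":
    print(comb(10, 2))                       # pairs of cells in a row of 10
    print(row_connection_count(10) == 2 ** 45)
    print(topology_count(10) == 2 ** 900)
```

The assumptions listed on the slide (identical rows and columns, a small set of repeating wire lengths) are what shrink this space to a manageable candidate list.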

Representative Netlist Generation
Properties of the representative netlist
- Matches the size of the benchmark netlists
Geometry distribution function
- The probability of a given distance between two pins decreases exponentially as the distance increases
- k: distance between pins
- p: probability of distance-1 links
- P(k): probability of distance-k links

MCF Interconnection Synthesis
Integrate multiple wire styles into the MCF formulation.
Notation
- Wire style parameter: (P_e, A_e), P_e = P_w + P_s
- Area A_r: routing area in the vertical and horizontal dimensions
- d_j: communication demand for net j, d_j = 1
- Flow f(t): flow amount on a Steiner tree t

MCF Formulation: Energy Optimization
- Objective: minimize energy
- Routability constraint
- Routing area constraint
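The exact distribution function is not legible in the transcript. One form consistent with the description (exponential decay, with p the probability of distance-1 links) is a geometric distribution, P(k) = p(1 - p)^(k-1); the sketch below assumes that form purely for illustration, and the function names are hypothetical.

```python
import random


def link_probability(k, p):
    """Assumed geometric form: P(k) = p * (1 - p)^(k - 1), for k >= 1."""
    if k < 1:
        raise ValueError("distance k must be at least 1")
    return p * (1.0 - p) ** (k - 1)


def sample_distance(p, rng):
    """Draw a pin-to-pin distance from the assumed geometric distribution."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    p = 0.5
    print([round(link_probability(k, p), 4) for k in range(1, 6)])
    print([sample_distance(p, rng) for _ in range(10)])
```

On a finite chip the distribution would also have to be truncated at the maximum possible distance and renormalized, which this sketch leaves out.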

Experiment Settings
Seven of the MCNC benchmark circuits
- Technology mapped to 4-LUTs; each logic block contains 16 4-LUTs
- Size of 10x10 to 11x11 switch boxes, 500 ~ 1000 nets
Candidate topologies
- Available segment length = 1, 2, 4, 8
- Total number of candidate topologies: 93

Circuit     alu4    apex4   diffeq   dsip   ex5p    misex3   tseng
Size        11x11   10x10   11x11    -      10x10   11x11    10x10
# of nets   621     798     945      593    745     771      788

Energy Optimization: Optimized FPGA Routing Architectures
[Figure: four optimized routing architectures; wire types RC 1x, RC 2x, RC 4x, T-Line 10x]

Routing area (um)   Energy (x10^3 pJ)   Energy improvement
1500                6.46                -
2500                5.24                19%
3500                4.74                27%
4500                4.63                28%

Energy Optimization: Impact of Routing Area
Total energy of the 7 benchmarks with optimized FPGA routing architectures
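The three improvement percentages are consistent with measuring each energy against the first (6.46 x10^3 pJ) architecture. The short check below makes that arithmetic explicit; the helper name is hypothetical.

```python
def energy_improvement_percent(baseline, energy):
    """Energy saving relative to the baseline architecture, in percent."""
    return 100.0 * (baseline - energy) / baseline


if __name__ == "__main__":
    baseline = 6.46  # x10^3 pJ, first architecture on the slide
    for energy in (5.24, 4.74, 4.63):
        saving = energy_improvement_percent(baseline, energy)
        print(f"{energy:.2f} x10^3 pJ -> {saving:.1f}% improvement")
```

The savings flatten out between the last two architectures, so most of the benefit is already captured by the middle routing-area budgets.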

Interconnect Architecture
1. Wire Directions (M, Y, X, E)
2. Layout Region (M, D, Y, X)
3. Power, Ground, and Clock Distributions
4. Layer Assignment
5. Via Arrangement
Comparison
1. Wire Length
2. Throughput
3. Grid vs. No-grid

1. Wire Directions and Models
[Figure: 7 by 7 meshes with different interconnect architectures: (a) Y-architecture, (b) Manhattan-architecture, (c) X-architecture]

2. Layout Regions and Models
[Figure: Fig. 10, meshes with symmetrical structures: (a) a level 2 hexagonal mesh, (b) a level 2 octagonal mesh, (c) a level 2 diamond mesh]

Length of 2-pin nets to extend an area

Shape        Man.    Y-Arch   X-Arch   Euclidean
M: Diamond   1.250   1.118    1.066    1.016
E: Circle    1.273   1.103    1.055    1.000
E (worst)    1.414   1.155    1.082    1.000
Single-entry rows: Y: Hexagon 1.101; X: Octagon 1.055

Throughput: concurrent flow demand

Shape       Manhattan   Y-Arch   X-Arch*
M: Square   1.000       1.225    1.346
Partial rows: M (Bound) 1.241, 1.356; M: Diamond 1.195; Y: Hexagon 1.315; X: Octagon 1.420
*The ratio of 0-90 planes and 45-135 planes is not fixed.
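The "E: Circle" and "E (worst)" rows agree with a simple geometric argument, sketched here as an independent derivation rather than the author's stated method. If adjacent routing directions are separated by an angle theta (90 degrees for Manhattan, 60 for Y, 45 for X), a unit Euclidean segment at angle phi between two directions has routed length cos(phi - theta/2) / cos(theta/2). The maximum over phi is 1/cos(theta/2), and the average over phi is (2/theta) tan(theta/2). Function names are hypothetical.

```python
import math


def worst_case_ratio(angle_deg):
    """Largest routed-to-Euclidean length ratio: 1 / cos(theta / 2)."""
    theta = math.radians(angle_deg)
    return 1.0 / math.cos(theta / 2.0)


def average_ratio(angle_deg):
    """Ratio averaged over all segment orientations: (2 / theta) * tan(theta / 2)."""
    theta = math.radians(angle_deg)
    return (2.0 / theta) * math.tan(theta / 2.0)


if __name__ == "__main__":
    for name, angle in (("Manhattan", 90), ("Y-Arch", 60), ("X-Arch", 45)):
        print(f"{name:10s} average {average_ratio(angle):.3f}  "
              f"worst {worst_case_ratio(angle):.3f}")
```

Rounded to three decimals, these formulas give 1.273 / 1.103 / 1.055 for the average and 1.414 / 1.155 / 1.082 for the worst case, matching the two table rows.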

Flow congestion map for uniform 90-degree meshes

Congestion map of a square chip using X-architecture (12 by 12 and 13 by 13)

Congestion map of a square chip using Y-architecture (12 by 12 and 13 by 13)

Explanation for the Throughput Increase
Number of lines across the vertical center cut-line:
- d/D for 90-degree routing
- [equation] for 45-degree routing

Global Grids (Power/Ground Mesh) (http://www.xinitiative.org/img/062102forum.pdf)
[Figure panels: X-Architecture; Y-Architecture]

3. Clock Tree on Square Mesh
N-level clock tree:
- Path distance = 21% less than H-tree
- Total wire length = 9% less than H-tree, 3% less than X-tree
No self-overlapping between parallel wire segments

4. Layer Assignment
Different routing direction assignments
[Figure: assignments I, II, III, and IV of routing directions to Layer 1 through Layer 4]

Normalized throughput of mixed 45-degree and 90-degree mesh with different routing layer assignments

N   z(I)   z(II)   z(III)   z(IV)
5   1.02   0.83    -        1.01
6   0.97   0.73    0.74     0.97
7   0.94   0.71    -        0.93
8   0.90   0.69    -        0.90

Why Does Interleaving Manhattan and Diagonal Layers Improve Throughput?
The shortest path between two points on the plane is always a concatenation of a Manhattan line and a diagonal line.
Example: from (2,0) to (0,3), Manhattan-only wirelength = 5.0; with a diagonal segment, wirelength = 3.82

Observations
- Routing direction assignment strategies can affect the communication throughput
- Interleaving the Manhattan routing layers and diagonal routing layers can produce better throughput
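The two wirelengths in the example can be computed as follows. With 45-degree diagonals available, the shortest route covers the smaller of the two coordinate differences diagonally and the remainder along an axis. This is a minimal sketch with hypothetical function names.

```python
import math


def manhattan_length(dx, dy):
    """Shortest wirelength using only horizontal and vertical segments."""
    return abs(dx) + abs(dy)


def octilinear_length(dx, dy):
    """Shortest wirelength using one axis-parallel and one 45-degree segment."""
    dx, dy = abs(dx), abs(dy)
    diagonal = min(dx, dy)        # distance covered on both axes at once
    straight = max(dx, dy) - diagonal
    return straight + math.sqrt(2.0) * diagonal


if __name__ == "__main__":
    dx, dy = 0 - 2, 3 - 0  # from (2, 0) to (0, 3)
    print(manhattan_length(dx, dy))
    print(round(octilinear_length(dx, dy), 3))
```

For this pair of points the diagonal route is 1 + 2*sqrt(2), about 3.83, against 5 for the Manhattan route; the route changes layers once, which is why adjacent Manhattan and diagonal layers help.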

5. Via Arrangement: Banks and Tunnels
- Use tunnels to detour around vias
- Use banks of tunnels to maximize the throughput
- Use the bottom k layers to perform intra-cell routing
- Use the top n-k layers to distribute signals to the banks

Via-Oriented Interconnect Planning
[Figures: a tunnel; a bank of tunnels]
Bank of tunnels: full bandwidth with k+2 overhead
- #vias = kL
- Overhead = k+2 vertical tracks
- L: dimension of the bank

Tunnel of Y Arch.
Blocking 5 tracks on the layer of the 60-degree direction

Tunnels of Y Arch.

3.2 Via-Oriented Interconnect Planning
Bank of tunnels
- #vias = c_1 kL
- Overhead = k + c_2 tracks

Conclusion
- Global Interconnect Technologies: EM waves + devices
- Prefix Adder Synthesis: formulation + ILP
- FPGA Interconnect Architecture: formulation + LP
- Interconnect Architecture: lambda geometry + vias

Thank you! Q & A
