Published by George Jessie Freeman. Modified over 9 years ago.
1
On-Chip Interconnect Trend and Design Optimization
Chung-Kuan Cheng
UC San Diego, La Jolla, CA
2
Outline
Global Interconnect Technologies
– RC Trees and Transmission Lines
Prefix Adder Synthesis
– Modeling
FPGA Interconnect Architecture
– Modeling
Interconnect Architecture
– Non-Manhattan Wire Arrangement
3
Interconnect Technologies
Introduction
On-Chip Global Interconnection
Global Wire Modeling
Performance Comparison
4
Introduction – Performance Impact
Interconnect delay determines the system performance [ITRS08]
– 542ps for a 1mm minimum-pitch Cu global wire w/o repeater @ 45nm
– ~150ps for a 10-level FO4 delay @ 45nm
[Ho2001] “Future of Wire”
5
Introduction – Power Dissipation
Interconnects consume a significant portion of power
– 1-2 orders of magnitude larger than gates
  Half of the dynamic power is dissipated in repeaters inserted to minimize latency [Zhang07]
– Wires consume 50% of total dynamic power in a 0.13um microprocessor [Magen04]
  About 1/3 is burned on the global wires.
6
Introduction – Technology Trend
On-Chip Interconnect Scaling
– Dimension shrinks
  Wire resistance increases -> RC delay
  Increasing capacitive coupling -> delay, power, noise, etc.
– Performance of global wires decreases w/ technology scaling.

[Figure: Copper resistivity versus wire width]

Scaling trend of PUL wire resistance and capacitance
Wire Category   Parameter      90nm     45nm     22nm
M1 Wire         Rw (kohm/mm)   1.914    8.860    34.827
M1 Wire         Cw (pF/mm)     0.183    0.157    0.129
Global Wire     Rw (kohm/mm)   0.532    2.970    11.000
Global Wire     Cw (pF/mm)     0.205    0.179    0.151
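To make the scaling trend concrete, the sketch below plugs the global-wire Rw and Cw values listed above into the textbook 50% delay estimate for an unrepeated distributed RC line, t ≈ 0.38·R·C·L². This estimate is a swapped-in standard approximation, not the model used in the slides, and it ignores driver and load (an assumption of this sketch).

```python
# Per-unit-length global-wire parameters from the slide:
# node -> (Rw in kohm/mm, Cw in pF/mm).
GLOBAL_WIRE = {
    "90nm": (0.532, 0.205),
    "45nm": (2.970, 0.179),
    "22nm": (11.000, 0.151),
}


def unrepeated_delay_ps(rw_kohm_per_mm, cw_pf_per_mm, length_mm=1.0):
    """50% delay of a distributed RC line, t ~= 0.38 * R * C * L^2.

    kohm * pF gives ns, so the product is scaled by 1000 to obtain ps.
    Driver resistance and load capacitance are ignored (assumption).
    """
    rc_ns_per_mm2 = rw_kohm_per_mm * cw_pf_per_mm
    return 0.38 * rc_ns_per_mm2 * 1000.0 * length_mm ** 2


for node, (rw, cw) in GLOBAL_WIRE.items():
    print(node, round(unrepeated_delay_ps(rw, cw), 1), "ps for 1 mm")
```

Under these assumptions the 1mm delay grows at every node even though Cw falls slightly, because Rw rises much faster.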
7
Organization of On-Chip Global Interconnections
8
Multi-Dimensional Design Consideration
Preliminary analysis results assuming 65nm CMOS process.
Application-oriented choice
– Low Latency: T-TL or UT-TL -> Single-Ended T-lines
– High Throughput: R-RC
– Low Power: PE-TL or UE-TL -> Differential T-lines
– Low Noise: PE-TL or UE-TL -> Differential T-lines
– Low Area/Cost: R-RC
For each architecture, the more area the pentagon covers, the better the overall performance.
9
On-Chip Global Interconnect Schemes (1)
Repeated RC wires (R-RC), Un-Terminated and Terminated T-Line (UT-TL and T-TL)
R-RC structure
– Repeater size/Length of segments
– Adopt previous design methodology [Zhang07]
UT-TL structure
– Full swing at wire-end
– Tapered inverter chain as TX
T-TL structure
– Optimize eye-height at wire-end
– Non-Tapered inverter chain as TX
10
On-Chip Global Interconnect Schemes (2)
Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)
– Driver side: Tapered differential driver
– Receiver side: Termination resistance, Sense-Amplifier (SA) + inverter chain
– Passive equalizer: parallel RC network
– Design Constraint: enough eye-opening (50mV) needed at the wire-end
11
Effects of driver impedance and termination resistance on step response
– Larger driver impedance leads to a slower rising edge and a lower saturation voltage
– Larger termination resistance causes a sharper rising edge but a larger reflection
– Optimal R_load
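The two trends above can be illustrated with first-order transmission-line relations: the reflection coefficient at a resistive termination and the DC level set by the resistive divider of driver, wire and load. This is a minimal sketch, not the simulation behind the slide; the characteristic impedance `Z0` and series wire resistance `R_WIRE` are hypothetical values, and the reflection trend assumes a load at or above `Z0`.

```python
def reflection_coefficient(r_load_ohm, z0_ohm):
    """Voltage reflection coefficient at a resistive termination."""
    return (r_load_ohm - z0_ohm) / (r_load_ohm + z0_ohm)


def settled_voltage(vdd, r_driver_ohm, r_wire_ohm, r_load_ohm):
    """Far-end DC level: resistive divider of driver, wire and load."""
    return vdd * r_load_ohm / (r_driver_ohm + r_wire_ohm + r_load_ohm)


Z0 = 50.0      # hypothetical characteristic impedance (ohm)
R_WIRE = 20.0  # hypothetical total series wire resistance (ohm)

# Larger termination resistance (above Z0) -> larger reflection.
for r_load in (50.0, 100.0, 200.0):
    print("R_L =", r_load, "gamma =", round(reflection_coefficient(r_load, Z0), 3))

# Larger driver impedance -> lower saturation voltage at the wire end.
for r_drv in (10.0, 30.0, 60.0):
    print("R_s =", r_drv, "V_end =", round(settled_voltage(1.0, r_drv, R_WIRE, 100.0), 3))
```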
12
Bit rate: 50Gbps
R_s = 11.06ohm, R_d = 350ohm, C_d = 0.38pF, R_L = 107.69ohm
13
Global Wire Modeling – Single-Ended & Differential On-Chip T-lines
Determine the bit rate
Smallest wire dimensions that satisfy the eye constraint
– Notice PE-TL needs narrower wire -> Equalization helps to increase density.
Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
Top-layer thick wires used -> dimensions are maintained as technology scales.
LC-mode behavior dominant
14
Global Wire Modeling – RC wires and T-lines
RC wire modeling
– Distributed Π model composed of wire resistance and capacitance
– Closed-form equations [Sim03] to calculate 2D wire capacitance
T-line 2D-R(f)L(f)C parameter extraction
– 2D-C Extraction Template
– 2D-R(f)L(f) Extraction Template
T-line Modeling
– R(f)L(f)C Tabular model -> Transient simulation to estimate eye-height.
– Synthesized compact circuit model [Kopcsay02] -> Study signal integrity issue.
15
Performance Analysis – Definitions
Normalized delay (unit: ps/mm)
– Propagation delay includes wire delay and gate delay.
Normalized energy per bit (unit: pJ/m)
– Bit rate is assumed to be the inverse of propagation delay for RC wires
Normalized throughput (unit: Gbps/um)
16
Performance Analysis – Latency
Variables: technology-defined parameters
– Supply voltage: Vdd (unit: V)
– Dielectric constant
– Min-sized inverter FO4 delay (unit: ps)
R-RC structure (min-d)
– ... is roughly constant
– FO4 delay scales w/ scaling factor S
– Increasing w/ technology scaling!
T-line structures
– Sum of wire delay and TX delay
– Wire delay
– TX delay improved w/ FO4 delay
– Decreasing w/ technology scaling!
17
Performance Analysis – Energy per Bit
Same variables as defined before
R-RC structure (min-d)
– Vdd reduces as technology scales
– ... reduces as technology scales
– Energy decreases w/ technology scaling!
T-line structures
– Sum of power consumed on wire and TX
– Power of T-line
– Power of TX circuit
– FO4 delay reduces exponentially
– Energy decreases w/ larger slope!!
– Constant!
18
Performance Analysis – Throughput
Same variables as defined before
R-RC structure (min-d)
– Assuming wire pitch ...
– FO4 delay reduces exponentially
– Throughput increases by 20% per generation!
T-line structures
– TX bandwidth
– Neglect the minor change of wire pitch
– K_1 = 0 for UT-TL
– FO4 delay reduces exponentially
– Throughput increases by 43% per generation!!
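The two per-generation growth rates can be related to a single scaling factor. The sketch below assumes S = 0.7 per generation and that throughput goes as 1/S for T-lines and 1/sqrt(S) for R-RC. Both the value of S and the exponents are assumptions of this sketch, chosen because they reproduce the quoted 43% and 20%; the slide's own expressions are not recoverable from the transcript.

```python
S = 0.7  # assumed linear scaling factor per technology generation


def growth_percent(exponent, s=S):
    """Per-generation growth in percent if throughput scales as s**(-exponent)."""
    return (s ** (-exponent) - 1.0) * 100.0


rc_growth = growth_percent(0.5)  # hypothetical exponent for R-RC
tl_growth = growth_percent(1.0)  # hypothetical exponent for T-lines
print("R-RC:   +%.1f%% per generation" % rc_growth)
print("T-line: +%.1f%% per generation" % tl_growth)

# Compounding the quoted rates over the four steps 90 -> 65 -> 45 -> 32 -> 22nm.
GENERATIONS = 4
print("R-RC over 4 generations:   x%.2f" % (1.20 ** GENERATIONS))
print("T-line over 4 generations: x%.2f" % (1.43 ** GENERATIONS))
```

Compounded over four generations, the difference between 20% and 43% per step becomes roughly a factor of two in relative throughput.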
19
Design Framework for On-Chip T-line Schemes
The proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing the wire configuration and circuit structure.
Different optimization routines (LP, ILP, SQP, etc.) can be adopted according to the problem formulation.
20
Experimental Settings
Design objective: min-d
Technology nodes: 90nm-22nm
Five different global interconnection structures
Wire length: 5mm
Parameter extraction
– 2D field solver CZ2D from EIP tool suite of IBM
– Tabular model or synthesized model
Transistor models
– Predictive transistor model from [Uemura06]
– Synopsys level 3 MOSFET model tuned according to ITRS roadmap
Simulation
– HSPICE 2005
Modeling and Optimization
– Linear or non-linear regression/SQP routine
– MATLAB 2007
21
Performance Metric: Normalized Delay – Results and Comparison
Technology trends
– R-RC ↑
– T-line schemes ↓
T-line structures
– Outperform R-RC beyond 90nm
– Single-ended: lowest delay
At 22nm node
– R-RC: 55ps/mm
– T-lines: 8ps/mm (85% reduction)
– Speed of light: 5ps/mm
Linear model
– < 6% average percent error
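The 5ps/mm speed-of-light figure can be sanity-checked with the time of flight of a wave in a uniform dielectric, t = sqrt(eps_r)/c. In the sketch below the relative permittivity 2.25 is an assumption, picked only because it reproduces the 5ps/mm quoted above; the slide does not state the dielectric constant.

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in vacuum, mm per ps


def time_of_flight_ps_per_mm(eps_r):
    """Propagation delay per mm of a wave in a dielectric with permittivity eps_r."""
    return math.sqrt(eps_r) / C_MM_PER_PS


EPS_R = 2.25  # assumed relative permittivity (hypothetical)
bound = time_of_flight_ps_per_mm(EPS_R)
print("time-of-flight bound: %.2f ps/mm" % bound)

# Distance of the 22nm results from that bound.
for name, delay in (("R-RC", 55.0), ("T-lines", 8.0)):
    print("%s: %.1fx the time-of-flight bound" % (name, delay / bound))
```

Under this assumption the T-line result sits within a factor of about 1.6 of the bound, while R-RC is roughly an order of magnitude away.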
22
Performance Metric: Normalized Energy per Bit – Results and Comparison
Technology trends
– R-RC and T-lines ↓
– T-lines reduce more quickly
T-line structures
– Outperform R-RC beyond 45nm
– Differential: lowest energy
– Single-ended similar to R-RC; T-TL > UT-TL
At 22nm node
– R-RC: 100pJ/m
– Single-ended: 60% reduction
– Differential: 96% reduction
Linear model
– < 12% average percent error
– Error for T-TL and PE-TL: R_L and passive equalizers
23
Performance Metric: Normalized Throughput – Results and Comparison
Technology trends
– R-RC and T-lines ↑
– T-lines increase more quickly
T-line structures
– Outperform R-RC beyond 32nm
– Differential better than single-ended
At 22nm node
– R-RC: 12Gbps/um
– T-TL: 30% improvement
– UE-TL: 75% improvement
– PE-TL: ~2X of R-RC
Linear model
– < 7% average percent error
24
Signal Integrity – single-ended T-lines
Worst-case switching pattern for peak noise simulation
UT-TL structure
– 380mV peak noise at 1V supply voltage w/ 7ps rise time
– SI could be a big issue as supply voltage drops
T-TL less sensitive to noise
– At the same rise time, ~50% reduction of peak noise
– Peak noise ↓ as technology scales
Figure panels: using w.c. pattern; using single or multiple PRBS patterns
25
Signal Integrity – differential T-lines
More reliable
– Termination resistance
– Common-mode noise reduction
Peak noise
– Within ~10mV range
Eye-Heights
– UE-TL: Eye reduces as bit rate ↑. Harder to meet constraint.
– PE-TL: > 70mV eye even at 22nm node. Equalization does help!
Worst-case switching pattern for peak noise simulation
26
Summary (cont’)
Item in the table: score/value. Score: the higher, the better in terms of the given metric; max. score is 5. The best structure in each column is marked using red color.

Low-Latency Application (ps/mm)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     3/35, 1/42, 1/46, 1/55 (four entries listed)
UT-TL    5/15    5/13    5/10    5/9     5/8
T-TL     5/15    5/13    5/10    5/9     5/8
UE-TL    1/37    3/25    3/16    3/12    5/8
PE-TL    1/37    3/25    3/16    3/12    5/8

Low-Energy Application (pJ/m)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     2/150, 2/140, 1/130, 1/100 (four entries listed)
UT-TL    3/140   3/110   3/70    3/50    2/40
T-TL     1/260   1/200   2/100   2/60    3/40
UE-TL    4/60    4/36    4/20    4/10    5/4
PE-TL    5/26    5/16    5/8     5/5     5/2

High-Throughput Application (Gbps/um)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     5/5     5/6     3/8     3/10    2/12
UT-TL    2/3.3, 1/3.3 (two entries listed)
T-TL     1/3     2/3.4   2/6     2/9     3/16
UE-TL    3/3     3/5     4/9     4/13    4/21
PE-TL    4/4     4/5.3   5/9     5/15    5/24

Low-Noise Application (score only)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     1       1       1       1       1
UT-TL    1       1       1       1       1
T-TL     3       3       3       3       3
UE-TL    5       5       4       4       4
PE-TL    4       4       5       5       5
27
Summary of Global Interconnect
Compare five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
A simple linear model is provided to link
– Architecture-level performance metrics
– Technology-defined parameters
Some observations from experimental results
– T-line structures have the potential to replace R-RC at future nodes
– Differential T-lines are better than single-ended
  Low-power/High-throughput/Low-noise
– Equalization could be utilized for on-chip global interconnection
  Higher throughput density, improved signal integrity
  Even w/ lower energy dissipation (passive equalization)
28
Prefix Adder Synthesis
Motivation
Prefix Adder Formulation
– Area/Timing/Power Models
– Mixed-Radix (2,3,4) Adders
– ILP Formulation
Experimental Results
29
Motivation: Prefix Adder
Increasing impact of physical design and concern of power.
– Logical Levels, Wire Tracks, Fanouts
– Area: Physical placement, Detail routing
– Timing: Gate Cap, Wire Cap, Gate sizing, Buffer insertion, Signal slope, Input arrival time, Output required time
– Power: Static power, Dynamic power, Power gating, Activity Probability
30
Prefix Adder Formulation
Input: two n-bit binary numbers and a one-bit carry-in
Output: n-bit sum and a one-bit carry-out
Prefix Addition: Carry generation & propagation
31
Prefix Addition – Formulation
Pre-processing:
Prefix Computation:
Post-processing:
32
Prefix Adder – Prefix Structure Graph
[Figure: prefix structure graph. Signal labels: a_i, b_i, gp_i, p_i, G[i:0], s_i. Cells: gp generator, GP cell (GP[i, j] and GP[j-1, k] -> GP[i, k]), sum generator. Stages: Pre-processing, Prefix Computation, Post-processing.]
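The three stages named above can be sketched in a few lines. The sketch uses the standard generate/propagate definitions and a Kogge-Stone style prefix network, which is a swapped-in textbook structure and not one of the ILP-synthesized adders from this talk. The function names `gp_combine` and `prefix_add` are hypothetical.

```python
def gp_combine(hi, lo):
    """GP cell: merge GP[i, j] (hi) with GP[j-1, k] (lo) into GP[i, k]."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_add(a, b, n, carry_in=0):
    """Add two n-bit integers; returns (n-bit sum, carry-out)."""
    # Pre-processing: per-bit generate and propagate (gp generators).
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Fold the carry-in into bit 0 so that G[i:0] is the carry out of bit i.
    gp = list(zip(g, p))
    gp[0] = (g[0] | (p[0] & carry_in), p[0])

    # Prefix computation: Kogge-Stone network, span doubles each level.
    d = 1
    while d < n:
        gp = [gp_combine(gp[i], gp[i - d]) if i >= d else gp[i] for i in range(n)]
        d *= 2

    # Post-processing: s_i = p_i XOR carry into bit i (sum generators).
    carries = [carry_in] + [gp[i][0] for i in range(n - 1)]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, gp[n - 1][0]


print(prefix_add(11, 6, 4))  # 11 + 6 = 17 -> sum bits 1, carry-out 1
```

Any other prefix structure (radix-2, radix-4 or mixed) only changes which pairs of GP values are combined at each level; the pre- and post-processing stay the same.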
33
Area Model
Distinguish physical placement from logical structure, but keep the bit-slice structure.
[Figure: logical view (bit position vs. logical level) and physical view (bit position vs. physical level) for bit positions 1-8, with compact placement]
34
Timing Model
Cell delay calculation: Cell Delay = Effort Delay + Intrinsic Delay
– Effort Delay = Logical Effort × Electrical Effort
– Electrical Effort = Cout/Cin = (fanouts + wirelength) / size
– Logical Effort and Intrinsic Delay: intrinsic properties of the cell
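A minimal sketch of the cell delay calculation above, using the logical-effort form with electrical effort (fanouts + wirelength) / size. The numeric values for logical effort and intrinsic delay are hypothetical, and fanout and wirelength are treated as already expressed in the same capacitance units (an assumption of this sketch).

```python
def cell_delay(logical_effort, intrinsic_delay, fanouts, wirelength, size):
    """Cell delay = logical effort * electrical effort + intrinsic delay."""
    electrical_effort = (fanouts + wirelength) / size  # Cout / Cin
    effort_delay = logical_effort * electrical_effort
    return effort_delay + intrinsic_delay


# Hypothetical GP cell: logical effort 1.5, intrinsic delay 2.0.
G_CELL, P_CELL = 1.5, 2.0
for size in (1, 2, 4):
    d = cell_delay(G_CELL, P_CELL, fanouts=3, wirelength=2, size=size)
    print("size", size, "-> delay", d)
```

Upsizing a cell lowers its own effort delay, but in a full adder model it also raises the load seen by the previous level, which is why sizing is solved jointly with the structure.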
35
Power Model
Total power consumption: Dynamic power + Static power
Static power: leakage current of device
– P_sta = ... * #cells
Dynamic power: current switching capacitance
– P_dyn = ... C_load
– ... is the switching probability
– ... = j (j is the logical level*)
* Vanichayobon S. et al., “Power-speed Trade-off in Parallel Prefix Circuits”
36
Interval Adjacency Constraint
(column id, logic level)
37
Linearization for Interval Adjacency Constraint
– Linearize
– Pseudo Linear
– Left interval bound equal to column index
38
ILP Formulation Overview
Structure variables:
– GP cells
– Connections (wires)
– Physical positions
Capacitance variables:
– Gate cap
– Vertical wire cap
– Horizontal wire cap
Timing variables:
– Input arrival time
– Output arrival time
Power Objective -> ILP -> ILOG CPLEX -> Optimal Solution
39
Experiments – 16-bit Uniform Timing
40
Experiments – 16-bit Uniform Timing
41
Min-Power Radix-2 Adder (delay = 22, power = 45.5FO4)
[Figure: 16-bit prefix structure, bit positions 1-16]
42
Min-Power Radix-2&4 Adder (delay = 18, power = 29.75FO4)
[Figure: 16-bit prefix structure, bit positions 1-16. Legend: Radix-2 Cell, Radix-4 Cell]
43
Min-Power Mixed-Radix Adder (delay = 20, power = 28.0FO4)
[Figure: 16-bit prefix structure, bit positions 1-16. Legend: Radix-2 Cell, Radix-4 Cell, Radix-3 Cell]
44
Experiments – 64-bit Hierarchical Structure (Mixed-Radix)
Handle high bit-width applications
16x4 and 8x8
45
FPGA Global Routing Architecture
Synthesis Flow
Formulation
Experimental Results
46
Synthesis Flow
47
Formulation
48
FPGA Global Routing Architecture
49
Energy Model: Wires
0.18um tech node, grid length = 0.5mm
4 types of wires: RC wires with different spacings, and transmission lines
50
Energy and Area Model: Switch Box
Switch Area Model
– F_s: Number of switches connected to each wire entering a switch box
– f: Total flow entering a switch box
– N_s: Per-bit number of switches inside a switch box
Energy Model
– P_u: energy of a single switch
– P_s: Per-bit switch energy
Figure label: W
51
Topology Generation
Candidate topologies are required for MCF interconnection synthesis
– MCF optimizes flow distribution, but not topology
A huge number of different topologies exists
– A row of 10 cells has 2^C(10, 2) = 2^45 different connections
– A 10x10 FPGA has (2^45)^20 = 2^900 different topologies!
Our assumptions
– Each row and column has the same connection
– Wire lengths are given (e.g. wire length = 1, 2, 4, 8…)
– A certain wire length repeats itself till the end of the chip
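The counting argument above can be reproduced directly: every pair of cells in a row either is or is not connected, and a 10x10 array has 10 rows plus 10 columns. The helper name `topology_exponent` is hypothetical.

```python
import math


def topology_exponent(cells_per_row):
    """Return (pairs per row, log2 of the topology count) for a square array."""
    pairs = math.comb(cells_per_row, 2)   # possible links in one row or column
    rows_and_columns = 2 * cells_per_row  # 10 rows + 10 columns for 10x10
    return pairs, pairs * rows_and_columns


pairs, exponent = topology_exponent(10)
print("connections per row: 2^%d" % pairs)
print("topologies for the array: 2^%d" % exponent)
print("decimal digits in 2^%d: %d" % (exponent, len(str(2 ** exponent))))
```

The size of that space is what motivates the three restricting assumptions listed above.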
52
Representative Netlist Generation
Properties of Representative Netlist
– Matches the size of the benchmark netlists
Geometry Distribution Function
– The probability of the distance between two pins decreases exponentially as the distance increases
– k: distance between pins
– p: probability of distance-1 links
– P(k): probability of distance-k links
53
MCF Interconnection Synthesis
Integrate multiple wire styles into the MCF formulation
Notations
– Wire style parameter: (P_e, A_e), P_e = P_w + P_s
– Area A_r: Routing area on vertical and horizontal dimension
– d_j: Communication demand for net j, d_j = 1
– Flow f(t): flow amount on a Steiner tree t
54
MCF Formulation: Energy Optimization
Obj: Min Energy
Routability constr.
Routing Area constr.
55
Experiment Settings
Seven MCNC benchmark circuits
– Technology mapped to 4-LUTs, each logic block contains 16 4-LUTs
– Size of 10x10 to 11x11 switch boxes, 500 ~ 1000 nets
Candidate topologies
– Available segment length = 1, 2, 4, 8
– Total number of candidate topologies: 93

Circuit     alu4   apex4   diffeq   dsip   ex5p   misex3   tseng
# of nets   621    798     945      593    745    771      788
size (as listed): 11x11, 10x10, 11x11, 10x10, 11x11, 10x10
56
Energy Optimization: Optimized FPGA Routing Architectures
Routing Area   Energy           Energy Impv
1500 m         6.46 x10^3 pJ    (reference)
2500 m         5.24 x10^3 pJ    19%
3500 m         4.74 x10^3 pJ    27%
4500 m         4.63 x10^3 pJ    28%
Wire types in the figure legend: RC 1x, RC 2x, RC 4x, T-Line 10x
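The quoted improvements follow from the four energy values if the smallest routing area is taken as the reference (an assumption of this sketch that the arithmetic below supports). It also shows how quickly the gain from extra routing area flattens.

```python
# (routing area, energy in 10^3 pJ) as listed on the slide.
POINTS = [(1500, 6.46), (2500, 5.24), (3500, 4.74), (4500, 4.63)]


def improvement_percent(reference, value):
    """Relative energy reduction with respect to the reference, in percent."""
    return 100.0 * (reference - value) / reference


reference_energy = POINTS[0][1]
previous_energy = reference_energy
for area, energy in POINTS[1:]:
    total = improvement_percent(reference_energy, energy)
    step = improvement_percent(previous_energy, energy)
    print("area %d: %.1f%% vs. reference, %.1f%% vs. previous point" % (area, total, step))
    previous_energy = energy
```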
57
Energy Optimization: Impact of Routing Area
Total energy of the 7 benchmarks with optimized FPGA routing architectures
58
Interconnect Architecture
1. Wire Directions (M, Y, X, E)
2. Layout Region (M, D, Y, X)
3. Power Ground and Clock Distributions
4. Layer Assignment
5. Via Arrangement
Comparison
1. Wire Length
2. Throughput
3. Grid vs No-grid
59
1. Wire Directions and Models
7 by 7 meshes with different interconnect architectures
(a) A 7 by 7 mesh with Y-architecture
(b) A 7 by 7 mesh with Manhattan-architecture
(c) A 7 by 7 mesh with X-architecture
60
2. Layout Regions and Models
Fig. 10 Meshes with symmetrical structures
(a) A level 2 hexagonal mesh
(b) A level 2 octagonal mesh
(c) A level 2 diamond mesh
61
Length of 2 pin-nets to extend an area
Shape         Man.    Y-Arch   X-Arch   Euclidean
M: Diamond    1.250   1.118    1.066    1.016
Y: Hexagon    -       1.101    -        -
X: Octagon    -       -        1.055    -
E: Circle     1.273   1.103    1.055    1.000
E (worst)     1.414   1.155    1.082    1.000
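The E: Circle and E (worst) values can be reproduced from the angle θ between adjacent routing directions (90° for Manhattan, 60° for Y, 45° for X). A unit vector at angle φ between two adjacent directions needs a routed length of (sin φ + sin(θ − φ)) / sin θ, which peaks at 1/cos(θ/2) and averages to 2·tan(θ/2)/θ over uniformly distributed angles. This derivation is a sketch added here, not taken from the slides, and the function names are hypothetical.

```python
import math


def worst_case_ratio(theta):
    """Largest routed/Euclidean length ratio, reached at phi = theta / 2."""
    return 1.0 / math.cos(theta / 2.0)


def average_ratio(theta):
    """Routed/Euclidean length ratio averaged over uniformly random directions."""
    return 2.0 * math.tan(theta / 2.0) / theta


ARCHITECTURES = {
    "Manhattan": math.pi / 2,  # 2 routing directions, 90 degrees apart
    "Y-Arch": math.pi / 3,     # 3 routing directions, 60 degrees apart
    "X-Arch": math.pi / 4,     # 4 routing directions, 45 degrees apart
}

for name, theta in ARCHITECTURES.items():
    print("%-10s average %.3f  worst %.3f" % (name, average_ratio(theta), worst_case_ratio(theta)))
```

The averages match the E: Circle row and the maxima match the E (worst) row to three decimals.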
62
Throughput: concurrent flow demand
Shape         Manhattan   Y-Arch   X-Arch*
M: Square     1.000       1.225    1.346
M (Bound)     -           1.241    1.356
M: Diamond    1.195       -        -
Y: Hexagon    -           1.315    -
X: Octagon    -           -        1.420
*ratio of 0-90 planes and 45-135 planes is not fixed
63
Flow congestion map for uniform 90 Degree meshes
64
Congestion map of square chip using X-architecture (12 by 12 and 13 by 13)
65
Congestion map of square chip using Y-architecture (12 by 12 and 13 by 13)
66
Explanation for Throughput Increase
Number of lines across the vertical center cut-line:
– d/D for 90 degree routing
– ... for 45 degree routing
70
Global Grids (Power/Ground Mesh)
(http://www.xinitiative.org/img/062102forum.pdf)
[Figure panels: X-Architecture, Y-Architecture]
71
3. Clock Tree on Square Mesh
N-level clock tree:
– path distance = 21% less than H-tree
– total wire length = 9% less than H-tree, 3% less than X-tree
No self-overlapping between parallel wire segments
72
4. Layer Assignment
Assignments I, II, III, IV over Layer 1, Layer 2, Layer 3, Layer 4
[Figure: Different routing direction assignment]
73
Normalized throughput of mixed 45-degree and 90-degree mesh with different routing layer assignments
N    z(I)    z(II)   z(III)   z(IV)
5    1.02    0.83    -        1.01
6    0.97    0.73    0.74     0.97
7    0.94    0.71    -        0.93
8    0.90    0.69    -        0.90
74
Why Does Interleaving Manhattan Layers and Diagonal Layers Improve Throughput?
The shortest path between two points on the plane is always a concatenation of a Manhattan line and a diagonal line.
Points (2,0) and (0,3):
– Manhattan routing: Wirelength = 5.0
– Manhattan + diagonal routing: Wirelength = 3.82
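The two wirelengths can be checked numerically, assuming (2,0) and (0,3) are the endpoints of the net. The octilinear length is one diagonal run covering the smaller offset plus one Manhattan run covering the remainder; the function names are hypothetical.

```python
import math


def manhattan_length(p, q):
    """Wirelength using only horizontal and vertical segments."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def octilinear_length(p, q):
    """Wirelength using one 45-degree segment plus one Manhattan segment."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    diagonal = math.sqrt(2.0) * min(dx, dy)  # covers the smaller offset
    straight = max(dx, dy) - min(dx, dy)     # covers what is left
    return diagonal + straight


a, b = (2, 0), (0, 3)
print("Manhattan:  %.2f" % manhattan_length(a, b))
print("Octilinear: %.3f" % octilinear_length(a, b))
```

The octilinear value is 1 + 2·sqrt(2) ≈ 3.828, which the slide shows truncated to 3.82.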
75
Observations
Routing direction assignment strategies can affect the communication throughput.
Interleaving the Manhattan routing layers and diagonal routing layers can produce better throughput.
76
5. Via Arrangement: Banks and Tunnels
– Use tunnels to detour around vias
– Use banks of tunnels to maximize the throughput
– Use bottom k layers to perform intra-cell routing
– Use top n-k layers to distribute signals to the banks
77
Via-Oriented Interconnect Planning
78
Via-Oriented Interconnect Planning
[Figure label: tunnel]
79
Via-Oriented Interconnect Planning
Bank of tunnels
– Full bandwidth, k+2 overhead
– #vias = kL
– Overhead = k+2 vertical tracks
– L: dimension of the bank
80
Tunnel of Y Arch.
Blocking 5 tracks on the layer of the 60-degree direction
81
Tunnels of Y Arch.
82
3.2 Via-Oriented Interconnect Planning
Bank of tunnels
– #vias = c_1 kL
– Overhead = k + c_2 tracks
83
Conclusion
Global Interconnect Technologies
– EM waves + Devices
Prefix Adder Synthesis
– Formulation + ILP
FPGA Interconnect Architecture
– Formulation + LP
Interconnect Architecture
– Lambda Geometry + Vias
84
Thank you!
Q & A