Prediction of High-Performance On-Chip Global Interconnection Yulei Zhang 1, Xiang Hu 1, Alina Deutsch 2, A. Ege Engin 3 James F. Buckwalter 1, and Chung-Kuan.

Presentation transcript:

Prediction of High-Performance On-Chip Global Interconnection
Yulei Zhang (1), Xiang Hu (1), Alina Deutsch (2), A. Ege Engin (3), James F. Buckwalter (1), and Chung-Kuan Cheng (1)
(1) Dept. of ECE, UC San Diego, La Jolla, CA
(2) IBM T. J. Watson Research Center, Yorktown Heights, NY
(3) Dept. of ECE, San Diego State Univ., San Diego, CA

2 Outline
- Introduction
  - Technology trend
  - Current approaches
- On-Chip Global Interconnection
  - Overview: structures, tradeoffs
  - Interconnect schemes
  - Global wire modeling
  - Performance analysis
- Design Methodologies for T-line schemes
- Prediction of Performance Metrics
  - Experimental settings
  - Performance metrics comparison and scaling trend: latency, energy per bit, throughput, signal integrity
- Conclusion

3 Introduction – Performance Impact Interconnect delay determines the system performance [ITRS08]  542ps for 1mm minimum pitch Cu global wire w/o 45nm  ~150ps for 10 level FO4 45nm [Ho2001] “Future of Wire”

4 Introduction – Power Dissipation
- Interconnects consume a significant portion of power
  - 1-2 orders of magnitude larger than gates
  - Half of the dynamic power is dissipated on repeaters to minimize latency [Zhang07]
- Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
  - About 1/3 is burned on the global wires.

5 Introduction – Different Approaches and Our Contributions
Different approaches:
- Repeater Insertion Approach
  - Pros: High throughput density.
  - Cons: Overhead in terms of power consumption and wiring complexity.
- T-line Approach [Zhang09]
  - Pros: Low latency.
  - Cons: Low throughput density due to low bandwidth and large wire dimension.
- Equalized T-line Approach [Zhang08]
  - Pros: Low power, low noise, higher throughput than single-ended.
  - Cons: The area overhead brought by passive components.
We explore different global interconnection structures and compare their performance metrics across multiple technology nodes.
Contributions:
- A simple linear model
- A general design framework
- A complete prediction and comparison

6 Organization of On-Chip Global Interconnections

7 Multi-Dimensional Design Consideration
Preliminary analysis results assuming 65nm CMOS process.
Application-oriented choice:
- Low Latency: T-TL or UT-TL (single-ended T-lines)
- High Throughput: R-RC
- Low Power: PE-TL or UE-TL (differential T-lines)
- Low Noise: PE-TL or UE-TL (differential T-lines)
- Low Area/Cost: R-RC
For each architecture, the more area the pentagon covers, the better the overall performance.

8 On-Chip Global Interconnect Schemes (1)
Repeated RC wires (R-RC); Un-Terminated and Terminated T-Line (UT-TL and T-TL)
- R-RC structure
  - Repeater size/Length of segments
  - Adopt previous design methodology [Zhang07]
- UT-TL structure
  - Full swing at wire-end
  - Tapered inverter chain as TX
- T-TL structure
  - Optimize eye-height at wire-end
  - Non-tapered inverter chain as TX

9 On-Chip Global Interconnect Schemes (2)
Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)
- Driver side: Tapered differential driver
- Receiver side: Termination resistance, Sense-Amplifier (SA) + inverter chain
- Passive equalizer: parallel RC network
- Design constraint: enough eye-opening (50mV) needed at the wire-end

10 Global Wire Modeling – Single-Ended & Differential On-Chip T-lines
- Determine the bit rate
- Smallest wire dimensions that satisfy eye constraint
- Notice PE-TL needs narrower wire -> Equalization helps to increase density.
- Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
- Top-layer thick wires used -> dimensions are maintained as technology scales.
- LC-mode behavior dominant

11 Global Wire Modeling – RC wires and T-lines
RC wire modeling
- Distributed Π model composed of wire resistance and capacitance
- Closed-form equations [Sim03] to calculate 2D wire capacitance
T-line modeling (2D-R(f)L(f)C parameter extraction)
- R(f)L(f)C tabular model -> Transient simulation to estimate eye-height.
- Synthesized compact circuit model [Kopcsay02] -> Study signal integrity issue.
Figures: 2D-C Extraction Template; 2D-R(f)L(f) Extraction Template

12 Performance Analysis – Definitions
- Normalized delay (unit: ps/mm)
  - Propagation delay includes wire delay and gate delay.
- Normalized energy per bit (unit: pJ/m)
  - Bit rate is assumed to be the inverse of propagation delay for RC wires.
- Normalized throughput (unit: Gbps/um)
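The defining equations on this slide did not survive in the transcript, so the following is only a sketch of how the three normalized metrics could be computed from their stated units. The normalizations used here (delay per wire length, energy per bit per wire length, bit rate per wire pitch) are assumptions inferred from the units ps/mm, pJ/m and Gbps/um, and all function names are hypothetical.

```python
# Sketch of the three normalized metrics, inferred from their units.
# ASSUMPTION: the exact formulas on the slide are lost; these are plausible
# definitions consistent with ps/mm, pJ/m and Gbps/um, not the authors' equations.

def normalized_delay_ps_per_mm(wire_delay_ps, gate_delay_ps, length_mm):
    """Propagation delay (wire delay + gate delay) per unit wire length."""
    return (wire_delay_ps + gate_delay_ps) / length_mm


def rc_bit_rate_gbps(propagation_delay_ps):
    """Bit rate taken as the inverse of propagation delay (stated for RC wires).

    1 / (1 ps) = 1e12 bit/s = 1000 Gbps.
    """
    return 1000.0 / propagation_delay_ps


def normalized_energy_pj_per_m(power_mw, bit_rate_gbps, length_mm):
    """Energy per bit per unit length.

    mW / Gbps = 1e-3 W / 1e9 bit/s = 1e-12 J/bit = pJ/bit.
    """
    energy_per_bit_pj = power_mw / bit_rate_gbps
    return energy_per_bit_pj / (length_mm * 1e-3)


def normalized_throughput_gbps_per_um(bit_rate_gbps, wire_pitch_um):
    """Bit rate per unit wire pitch (throughput density)."""
    return bit_rate_gbps / wire_pitch_um


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the slides.
    print(normalized_delay_ps_per_mm(250.0, 25.0, 5.0))
    print(normalized_energy_pj_per_m(2.0, 4.0, 5.0))
    print(normalized_throughput_gbps_per_um(4.0, 0.5))
```

Keeping the unit conversions in comments makes it easy to check that each function really returns the unit named on the slide.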

13 Performance Analysis – Latency Variables: technology-defined parameters  Supply voltage: Vdd (unit: V)  Dielectric constant:  Min-sized inverter FO4 delay: (unit: ps) R-RC structure (min-d)  is roughly constant  FO4 delay scales w/ scaling factor S Increasing w/ technology scaling! T-line structures  Sum of wire delay and TX delay  Wire delay  TX delay improved w/ FO4 delay Decreasing w/ technology scaling!

14 Performance Analysis – Energy per Bit Same variables defined before R-RC structure (min-d)  Vdd reduces as technology scales  reduces as technology scales Energy decreases w/ technology scaling! T-line structures  Sum of power consumed on wire and TX.  Power of T-line  Power of TX circuit  FO4 delay reduces exponentially Energy decreases w/ larger slope!! Constant !

15 Performance Analysis – Throughput Same variables defined before R-RC structure (min-d)  Assuming wire pitch  FO4 delay reduces exponentially Throughput increases by 20% per generation! T-line structures  TX bandwidth  Neglect the minor change of wire pitch  K 1 = 0, for UT-TL  FO4 delay reduces exponentially Throughput increases by 43% per generation !!
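As a quick check on what the quoted per-generation growth rates imply, the sketch below compounds 20% and 43% over the four node transitions between 90nm and 22nm. It also shows that the two rates are numerically consistent with an FO4 scaling factor of about 0.7 per generation; that factor is an assumption made here for illustration, since the slide only says that FO4 delay reduces exponentially.

```python
# Compound the per-generation throughput growth quoted on the slide.
# ASSUMPTION: S = 0.7 is an illustrative FO4 scaling factor; the slide does
# not state the value it uses.

NODES_NM = [90, 65, 45, 32, 22]          # technology nodes used in the study
GENERATIONS = len(NODES_NM) - 1           # four transitions


def compound_gain(growth_per_generation, generations=GENERATIONS):
    """Total multiplicative gain after compounding a fixed growth rate."""
    return (1.0 + growth_per_generation) ** generations


rc_gain = compound_gain(0.20)             # R-RC: 20% per generation
tline_gain = compound_gain(0.43)          # T-lines: 43% per generation

S = 0.7                                   # hypothetical FO4 scaling factor
growth_if_inverse_fo4 = 1.0 / S - 1.0     # throughput ~ 1/FO4
growth_if_inverse_sqrt_fo4 = S ** -0.5 - 1.0  # throughput ~ 1/sqrt(FO4)

if __name__ == "__main__":
    print(f"R-RC gain 90nm->22nm:   {rc_gain:.2f}x")
    print(f"T-line gain 90nm->22nm: {tline_gain:.2f}x")
    print(f"1/S - 1       = {growth_if_inverse_fo4:.1%}")
    print(f"1/sqrt(S) - 1 = {growth_if_inverse_sqrt_fo4:.1%}")
```

Under these rates the T-line schemes gain roughly twice as much throughput as R-RC over the four generations.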

16 Design Framework for On-Chip T-line Schemes Proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing wire configuration and circuit structure. Different optimization routines (LP/ILP/SQP, etc) can be adopted according to the problem formulation.

17 Experimental Settings
- Design objective: min-d
- Technology nodes: 90nm-22nm
- Five different global interconnection structures
- Wire length: 5mm
- Parameter extraction
  - 2D field solver CZ2D from EIP tool suite of IBM
  - Tabular model or synthesized model
- Transistor models
  - Predictive transistor model from [Uemura06]
  - Synopsys level 3 MOSFET model tuned according to ITRS roadmap
- Simulation
  - HSPICE 2005
- Modeling and Optimization
  - Linear or non-linear regression/SQP routine
  - MATLAB 2007

18 Performance Metric: Normalized Delay – Results and Comparison
- Technology trends
  - R-RC ↑
  - T-line schemes ↓
- T-line structures
  - Outperform R-RC beyond 90nm
  - Single-ended: lowest delay
- At 22nm node
  - R-RC: 55ps/mm
  - T-lines: 8ps/mm (85% reduction)
  - Speed of light: 5ps/mm
- Linear model
  - < 6% average percent error
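The 22nm figures above can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the quoted reduction and the time-of-flight delay per millimetre, sqrt(eps_r)/c, for a wave in a uniform dielectric. The relative permittivity of 2.25 is an assumption chosen here because it reproduces roughly 5ps/mm; the slide does not state the dielectric constant behind its speed-of-light figure.

```python
import math

# Speed of light in vacuum, expressed in mm/ps (299,792,458 m/s).
C_MM_PER_PS = 0.299792458


def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


def time_of_flight_ps_per_mm(relative_permittivity):
    """Delay per mm of a wave travelling at c / sqrt(eps_r)."""
    return math.sqrt(relative_permittivity) / C_MM_PER_PS


if __name__ == "__main__":
    # Values quoted on the slide for the 22nm node.
    print(f"T-line vs R-RC: {percent_reduction(55.0, 8.0):.1f}% lower delay")
    # ASSUMPTION: eps_r = 2.25 is illustrative, not taken from the slides.
    print(f"Time of flight at eps_r=2.25: {time_of_flight_ps_per_mm(2.25):.2f} ps/mm")
    print(f"Time of flight in vacuum:     {time_of_flight_ps_per_mm(1.0):.2f} ps/mm")
```

Read this way, the 8ps/mm T-line result sits within a factor of two of the time-of-flight bound.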

19 Performance Metric: Normalized Energy per Bit – Results and Comparison
- Technology trends
  - R-RC and T-lines ↓
  - T-lines reduce more quickly
- T-line structures
  - Outperform R-RC beyond 45nm
  - Differential: lowest energy.
  - Single-ended: similar to R-RC. T-TL > UT-TL
- At 22nm node
  - R-RC: 100pJ/m
  - Single-ended: 60% reduction
  - Differential: 96% reduction
- Linear model
  - < 12% average percent error
  - Error for T-TL and PE-TL: R_L and passive equalizers.

20 Performance Metric: Normalized Throughput – Results and Comparison
- Technology trends
  - R-RC and T-lines ↑
  - T-lines increase more quickly
- T-line structures
  - Outperform R-RC beyond 32nm
  - Differential better than single-ended
- At 22nm node
  - R-RC: 12Gbps/um
  - T-TL: 30% improvement
  - UE-TL: 75% improvement
  - PE-TL: ~2X of R-RC
- Linear model
  - < 7% average percent error

21 Signal Integrity – single-ended T-lines
- UT-TL structure
  - 380mV peak noise at 1V supply voltage w/ 7ps rise time
  - SI could be a big issue as supply voltage drops
- T-TL less sensitive to noise
  - At the same rise time, ~50% reduction of peak noise
  - Peak noise ↓ as technology scales
Figures: worst-case switching pattern for peak noise simulation; using w.c. pattern; using single or multiple PRBS patterns

22 Signal Integrity – differential T-lines
- More reliable
  - Termination resistance
  - Common-mode noise reduction
- Peak noise
  - Within ~10mV range
- Eye-heights
  - UE-TL: eye reduces as bit rate ↑; harder to meet constraint.
  - PE-TL: > 70mV eye even at 22nm node; equalization does help!
Figure: worst-case switching pattern for peak noise simulation

23 Conclusion
- Compare five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
- A simple linear model provided to link
  - Architecture-level performance metrics
  - Technology-defined parameters
- Some observations from experimental results
  - T-line structures have potential to replace R-RC at future nodes
  - Differential T-lines are better than single-ended: low-power/high-throughput/low-noise
  - Equalization could be utilized for on-chip global interconnection: higher throughput density, improved signal integrity, even w/ lower energy dissipation (passive equalizations)

Thank you! Q & A

Back Up Slides

26 Introduction – Technology Trend
- On-Chip Interconnect Scaling
  - Dimension shrinks: wire resistance increases -> RC delay; increasing capacitive coupling -> delay, power, noise, etc.
  - Performance of global wires decreases w/ technology scaling.
Table: scaling trend of PUL wire resistance Rw (kohm/mm) and capacitance Cw (pF/mm) for M1 wires and global wires at the 90nm, 45nm and 22nm nodes.
Figure: copper resistivity versus wire width.

27 Design methodology: single-ended T-lines
- Single-ended; inverter chains
- 2D frequency-dependent tabular model
- SPICE simulation
- Inverter size, number of stages, Rload (if any)
- SPICE simulation to evaluate
- Optimization routine:
  1. Optimal cycle time
  2. Sweep for optimal inverter chain
- SPICE simulation to check in-plane crosstalk, etc.
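To illustrate what a sweep over inverter-chain configurations looks like, here is a minimal sketch that uses the textbook logical-effort delay model in place of SPICE. This is a swapped-in technique, not the flow used in the slides: the delay model, the parasitic delay of 1, the total effort of 64 and all function names are assumptions for illustration, and signal polarity (odd versus even stage count) is ignored.

```python
# Sweep the number of stages of a tapered inverter chain.
# ASSUMPTION: normalized logical-effort model, delay = N * (F**(1/N) + p),
# where F = C_load / C_in and p is the inverter parasitic delay. The slides
# evaluate candidates with SPICE simulation instead.

def chain_delay(total_effort, n_stages, parasitic=1.0):
    """Normalized delay of an n-stage chain with equal effort per stage."""
    stage_effort = total_effort ** (1.0 / n_stages)
    return n_stages * (stage_effort + parasitic)


def sweep_stages(total_effort, max_stages=10, parasitic=1.0):
    """Return (best stage count, its delay) over 1..max_stages."""
    best_n = min(
        range(1, max_stages + 1),
        key=lambda n: chain_delay(total_effort, n, parasitic),
    )
    return best_n, chain_delay(total_effort, best_n, parasitic)


def stage_sizes(total_effort, n_stages, first_size=1.0):
    """Geometrically tapered inverter sizes, relative to the first stage."""
    taper = total_effort ** (1.0 / n_stages)
    return [first_size * taper ** i for i in range(n_stages)]


if __name__ == "__main__":
    best_n, best_delay = sweep_stages(64.0)   # hypothetical C_load/C_in = 64
    print(best_n, round(best_delay, 2))
    print([round(s, 2) for s in stage_sizes(64.0, best_n)])
```

In a SPICE-based flow the call to `chain_delay` would be replaced by a simulation run for each candidate, while the sweep structure stays the same.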

28 Design methodology: differential T-lines
- Differential lines; SA-based TX
- 2D frequency-dependent tabular model
- Closed-form equation-based model
- Wire width; driver impedance; RC equalizer (if any); termination resistance
- Evaluation based on models
- Optimization routine:
  1. Binary search for wire width
  2. SQP for other var. optimization
- SPICE simulation to check in-plane crosstalk, etc.
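The first step of the optimization routine, a binary search for the wire width, can be sketched as follows. The sketch assumes that eye height grows monotonically with wire width, which is what makes bisection valid; the toy eye-height model and every name below are hypothetical stand-ins for the closed-form and tabular models mentioned on the slide. The 50mV threshold is the eye-opening constraint stated for the differential schemes.

```python
# Binary search for the smallest wire width that meets an eye-height constraint.
# ASSUMPTION: eye height increases monotonically with width; `toy_eye_height_mv`
# is a made-up model used only to exercise the search.

def min_wire_width(eye_height_mv, min_eye_mv, lo_um, hi_um, tol_um=1e-3):
    """Smallest width in [lo_um, hi_um] whose eye height is >= min_eye_mv."""
    if eye_height_mv(hi_um) < min_eye_mv:
        raise ValueError("constraint cannot be met within the width range")
    if eye_height_mv(lo_um) >= min_eye_mv:
        return lo_um
    # Invariant: lo_um violates the constraint, hi_um satisfies it.
    while hi_um - lo_um > tol_um:
        mid_um = 0.5 * (lo_um + hi_um)
        if eye_height_mv(mid_um) >= min_eye_mv:
            hi_um = mid_um
        else:
            lo_um = mid_um
    return hi_um


def toy_eye_height_mv(width_um):
    """Hypothetical monotone eye-height model (reaches 50 mV at 1.0 um)."""
    return 100.0 * width_um / (width_um + 1.0)


if __name__ == "__main__":
    width = min_wire_width(toy_eye_height_mv, min_eye_mv=50.0, lo_um=0.1, hi_um=5.0)
    print(f"smallest feasible width ~ {width:.3f} um")
```

Returning the upper bracket guarantees that the reported width actually satisfies the constraint, at the cost of being up to `tol_um` wider than the true minimum.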

29 Effects of driver impedance and termination resistance
- Lowering driver impedance improves eye
- Eye reduces as frequency goes up
- Optimal termination resistance.

30 Effects of driver impedance and termination resistance on step response
- Larger driver impedance leads to slower rise edge and lower saturation voltage
- Larger termination resistance causes sharper rise edge but with larger reflection
- Optimal R_load

31 Crosstalk effects
Three different PRBS input patterns, min-ddp solutions
- T-line Scheme A: Delay increased by 9.6%, Power increased by 37%
- T-line Scheme B: Delay increased by 2%, Power increased by 25.7%

32 Transceiver Design
- Sense amplifier (SA)
  - Double-tail latch-type voltage sense amp. [Schinkel 07]
  - Optimize sizing to minimize SA delay
- Inverter chain
  - Number of stages fixed to 6
  - Sizing of each inverter
  - R_S: output resistance of inverter chain
- Sweep the 1st inverter size to minimize the total transceiver delay for given [V_eye, R_S]
- Tech node: M1/M3: 45nm/45nm; M2/M4: 250nm/45nm; M5/M6: 180nm/45nm; M7/M8: 280nm/45nm; M9: 495nm/45nm; M10/M11: 200nm/45nm; M12: 1.58um/45nm

33 Transceiver Modeling
- Driver side
  - Voltage source V_s with output resistance R_s
  - V_s: full-swing pulse signal with rise time T_r = 0.1 T_c
  - R_s: output resistance of the last inverter in the chain.
- Receiver side
  - Extract look-up table for TX delay and power
  - Fit the table using non-linear closed-form formula
  - The relative error is within 2% for fitting models
Figures: transceiver delay map at 45nm node; histogram of fitting errors at 45nm node; transceiver power map at 45nm node

34 Bit-rate: 50Gbps; R_s = 11.06ohm, R_d = 350ohm, C_d = 0.38pF, R_L = 107.69ohm

35 Conclusion (cont’)
Item in the table: score/value. Score: the higher, the better in terms of the given metric; max. score is 5. The best structure in each column is marked using red color.

Low-Latency Application (ps/mm)
Scheme   90nm    65nm    45nm    32nm    22nm
UT-TL    5/15    5/13    5/10    5/9     5/8
T-TL     5/15    5/13    5/10    5/9     5/8
UE-TL    1/37    3/25    3/16    3/12    5/8
PE-TL    1/37    3/25    3/16    3/12    5/8
R-RC     3/35, 1/42, 1/46, 1/55 (only four of the five entries survive; column alignment unknown)

High-Throughput Application (Gbps/um)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     5/5     5/6     3/8     3/10    2/12
T-TL     1/3     2/3.4   2/6     2/9     3/16
UE-TL    3/3     3/5     4/9     4/13    4/21
PE-TL    4/4     4/5.3   5/9     5/15    5/24
UT-TL    2/3.3, 1/3.3 (only two of the five entries survive; column alignment unknown)

Low-Energy Application (pJ/m)
Scheme   90nm    65nm    45nm    32nm    22nm
UT-TL    3/140   3/110   3/70    3/50    2/40
T-TL     1/260   1/200   2/100   2/60    3/40
UE-TL    4/60    4/36    4/20    4/10    5/4
PE-TL    5/26    5/16    5/8     5/5     5/2
R-RC     2/150, 2/140, 1/130, 1/100 (only four of the five entries survive; column alignment unknown)

Low-Noise Application
Schemes R-RC, UT-TL, T-TL, UE-TL and PE-TL at 90nm, 65nm, 45nm, 32nm and 22nm; no entries survive.

36 Future Works
Explore novel global signaling schemes for high throughput and low energy dissipation.
- Design, optimize > 50Gbps on-chip interconnection schemes
- Architecture-level study to identify trade-offs
  - Wire configuration: dimension optimization, ground plane, etc.
  - Un-interrupted architectures: equalization implementation, TX/RX choice
  - Distributed architectures: active or passive compensation (RC equalizers, other networks, etc.)
- Novel high-speed transceiver circuitry design
- Develop analysis and optimization capability to aid co-design and co-optimization of wire and transceiver circuit
- Fabrication to verify analysis and demonstrate feasibility

37 Related Publications
1. [Repeated RC Wire] L. Zhang, H. Chen, B. Yao, K. Hamilton, and C.K. Cheng, “Repeated on-chip interconnect analysis and evaluation of delay, power and bandwidth metrics under different design goals,” IEEE International Symposium on Quality Electronic Design, 2007.
2. [Un-Terminated/Terminated T-Line] Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C.K. Cheng, “Design Methodology of High Performance On-Chip Global Interconnect Using Terminated Transmission-Line,” IEEE International Symposium on Quality Electronic Design, 2009.
3. [Passive-Equalized T-Line] Y. Zhang, L. Zhang, A. Tsuchiya, M. Hashimoto, and C.K. Cheng, “On-chip high performance signaling using passive compensation,” IEEE International Conference on Computer Design, 2008.
4. [Passive-Equalized T-Line] Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C.K. Cheng, “On-chip bus signaling using passive compensation,” IEEE Electrical Performance of Electronic Packaging, 2008.
5. [Passive-Equalized T-Line] L. Zhang, Y. Zhang, A. Tsuchiya, M. Hashimoto, E. Kuh, and C.K. Cheng, “High performance on-chip differential signaling using passive compensation for global communication,” Asia and South Pacific Design Automation Conference, 2009.
6. [Overview and Comparison] Y. Zhang, X. Hu, A. Deutsch, A. E. Engin, J. F. Buckwalter, and C.K. Cheng, “Prediction of High-Performance On-Chip Global Interconnection,” ACM workshop on System Level Interconnection Prediction, 2009.