SLIP-2008, Newcastle upon Tyne, UK, April 5, 2008. Parallel vs. Serial On-Chip Communication. Rostislav (Reuven) Dobkin, Arkadiy Morgenshtein, Avinoam Kolodny.

Presentation transcript:

SLIP-2008, Newcastle upon Tyne, UK, April 5, 2008. Parallel vs. Serial On-Chip Communication. Rostislav (Reuven) Dobkin, Arkadiy Morgenshtein, Avinoam Kolodny, Ran Ginosar. Technion – Israel Institute of Technology, Electrical Engineering Department, VLSI Systems Research Center

2/33 Presentation Outline Motivation – Parallel links limitations – Novel high-speed serial links Link Architectures – "Register-Pipelined" and "Wave-pipelined" parallel links – Single gate-delay serial link Comparative study: parallel vs. serial – Analytical models – Scalability – 65nm case study

3/33 Parallel link limitations – Constructed of multiple (N) wires and repeaters – Incur high leakage power – Occupy large chip area (routing difficulty) – Present a significant capacitive load – Buses often have low utilization and most of the time just leak (line drivers and repeaters)

4/33 Bit-Serial Interconnect – Fewer lines, fewer line drivers and fewer repeaters – Reduced leakage power – Reduced chip area – Better routability – Must run N times faster to match an N-wire parallel link!
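A rough illustration of the "N times faster" requirement: to carry the same bandwidth as an N-wire parallel link, a single serial wire has to shrink its bit time by a factor of N. The sketch below expresses times in units of d4 (the FO4 inverter delay used later in the talk). The function name and the sample numbers are illustrative assumptions, not values taken from this slide.

```python
def required_serial_bit_time(parallel_cycle_d4: float, n_wires: int) -> float:
    """Bit time (in d4 units) a single serial wire needs to match the
    bandwidth of n_wires parallel wires, each sending one bit per cycle."""
    if n_wires < 1:
        raise ValueError("n_wires must be at least 1")
    if parallel_cycle_d4 <= 0:
        raise ValueError("parallel_cycle_d4 must be positive")
    return parallel_cycle_d4 / n_wires


# Hypothetical example: 8 wires clocked at 8 d4 carry one bit per d4 in total,
# so the serial link would need a bit time of a single gate delay.
bit_time = required_serial_bit_time(8.0, 8)
print(f"required serial bit time: {bit_time} d4")
```

The same relation read the other way gives the number of parallel wires needed to keep up with a given serial link.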

5/33 Serial Link. Standard serial links are very slow. Hope lies in novel serial links – Data cycle of a few gate-delays (inverter FO4 delay). This work considers the fastest serial link – With single gate-delay data cycle (d4). Our target: to show that the novel serial link outperforms the parallel one for – Long ranges – Advanced technology nodes

6/33 Method Choose – Parallel link implementation representatives – Serial link implementation representatives Compare the parallel and serial link approaches in terms of: – Area – Power – Latency – Technology scaling

7/33 "Register-Pipelined" Parallel Link Fully synchronous Interconnect as combinational logic between registers Source synchronous or global clock High cost for high bit rates!

8/33 "Wave-Pipelined" Parallel Link – Bit rate is limited by the relative skew of the link wires

9/33 Crosstalk Mitigation and Power Reduction Shielding / Spacing Staggered repeaters Interleaved bi-directional lines Asynchronous signaling Data encoding Data pattern recognition with special worst-case handling This work analyzes the two extremes of shielding: – Unshielded wires (a) – Fully-shielded wires (b)

10/33 Single Gate-Delay Serial Link Transition signaling instead of sampling – Two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS) Acknowledge per word instead of per bit Wave-pipelining over channel Differential encoding (DS-DE, IEEE ) Low-latency synchronizers R. Dobkin, et al., High Rate Wave-Pipelined Asynchronous On-Chip Bit-Serial Data Link, ASYNC07
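The transition-signaling idea can be sketched in a few lines. In two-phase LEDR, each bit is sent as a pair of rail levels whose parity (the phase) alternates from one bit to the next, so exactly one of the two rails toggles per bit and the receiver detects a new bit from that single transition. The code below is a behavioral model only; the rail names, the initial state (0, 0), and the function names are assumptions made for illustration, not details of the circuit in the cited ASYNC07 paper.

```python
def ledr_encode(bits, initial=(0, 0)):
    """Encode bits as (data_rail, strobe_rail) levels, one pair per bit.

    The phase (data XOR strobe) flips on every bit, which forces exactly
    one rail to change level per transmitted bit.
    """
    data_rail, strobe_rail = initial
    phase = data_rail ^ strobe_rail
    symbols = []
    for bit in bits:
        phase ^= 1                      # alternate even/odd phase
        data_rail = bit                 # data rail carries the bit value
        strobe_rail = bit ^ phase       # strobe rail fixes the parity
        symbols.append((data_rail, strobe_rail))
    return symbols


def ledr_decode(symbols):
    """The data rail level is the bit value in either phase."""
    return [data for data, _ in symbols]


def transitions_per_symbol(symbols, initial=(0, 0)):
    """Count how many rails change level for each transmitted bit."""
    counts, prev = [], initial
    for sym in symbols:
        counts.append(int(sym[0] != prev[0]) + int(sym[1] != prev[1]))
        prev = sym
    return counts


bits = [1, 1, 0, 1, 0, 0]
symbols = ledr_encode(bits)
print(symbols)
print(transitions_per_symbol(symbols))
```

Because every bit produces one and only one transition, the same wire pair carries both data and timing, which is why no separate clock is sampled at the receiver.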

11/33 Analytical Models: Parallel and Serial Link Bit Rates – Please refer to the paper for details on the exact analytical models employed in the work

12/33 Parallel Link Bit Rate Limitations (1) A. Fastest available clock – Ring oscillator limitation: 8∙d4 – Fast processors: 11∙d4 (e.g. CELL) – Standard SoC/ASIC: …∙d4 B. Synchronization latency – May take several clocks in case of asynchronous clock relation C. Clock uncertainty – Extended critical path
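To make the clock limits above more tangible, the sketch below converts a clock cycle expressed in d4 units into a per-wire bit rate. The slide gives no absolute value for d4, so the 20 ps figure used here is purely a hypothetical placeholder, and the function name is likewise an assumption.

```python
D4_PS = 20.0  # hypothetical FO4 inverter delay in picoseconds (not from the slides)


def bit_rate_gbps(cycle_in_d4: float, d4_ps: float = D4_PS) -> float:
    """Per-wire bit rate in Gb/s for one bit per clock cycle."""
    if cycle_in_d4 <= 0 or d4_ps <= 0:
        raise ValueError("cycle and d4 must be positive")
    cycle_ps = cycle_in_d4 * d4_ps
    return 1000.0 / cycle_ps  # 1000 ps per ns, so 1/ns equals Gb/s


for label, cycle in [("ring oscillator limit", 8), ("fast processor", 11)]:
    print(f"{label}: {cycle} d4 per cycle, {bit_rate_gbps(cycle):.2f} Gb/s per wire")
```

Whatever the real d4 is for a given process, the ratio between the cases stays the same, which is why the talk states its limits in d4 rather than in picoseconds.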

13/33 Parallel Link Bit Rate Limitations (2) D. Delay uncertainty – The skew and jitter of the clock – Repeater delay variations – Wire delay variations (mostly metal thickness variations) – Via variations – Cross-coupling (crosstalk) – Geometry (outcome of routing congestion and multi-layer structure). Sources: N.S. Nagaraj, DAC 2005; L. Scheffer, SLIP 2006

14/33 Parallel Link Minimal Clock Cycle (1). Notations from W.P. Burleson, et al., Wave-Pipelining: A Tutorial and Research Survey, TVLSI, 1998. [Timing diagram: latest data clocking, earliest data clocking, clock uncertainty]

15/33 Process Variations Impact on Multi-Wire Delay Uncertainty. Variation types – Random variations (closely placed devices) – "Systematic" variations (location on the die). Relative skew (maximal minus minimal line delay) – Repeaters in the same stage are highly correlated – Random variations inside a repeater stage are averaged out thanks to large repeater sizing – Systematic inter-stage variations are averaged out along the link. As a result, the relative skew among the lines due to process variations is small

16/33 Cross-Coupling Impact on Multi-Wire Delay Uncertainty. Let's approximate the wire delay by: [equation; annotated terms: repeater delay (number of repeaters, nominal repeater delay, transistor variation) and wire delay (coupling factor, wire variation, nominal delay of a wire segment of length L)]. Worst-case skew Φ between two lines: [equation]
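The exact expressions from this slide are not available here, so the following is only a sketch of one plausible form built from the annotated terms: a line delay made of a repeater part and a wire part per segment, with the coupling factor scaling the wire part. The formula shape, the parameter names, and the numbers are all assumptions for illustration and should not be read as the authors' model. Applying the process variations with opposite signs on the two lines is also deliberately pessimistic.

```python
def line_delay(n_repeaters, t_rep, t_wire, transistor_var, wire_var, coupling):
    """Assumed delay model: each of n_repeaters stages adds a repeater delay
    and a wire-segment delay; coupling scales the wire-segment delay."""
    repeater_part = t_rep * (1.0 + transistor_var)
    wire_part = coupling * t_wire * (1.0 + wire_var)
    return n_repeaters * (repeater_part + wire_part)


def worst_case_skew(n_repeaters, t_rep, t_wire, transistor_var, wire_var,
                    coupling_min, coupling_max):
    """Skew between the slowest line (max coupling, positive variation)
    and the fastest line (min coupling, negative variation)."""
    slow = line_delay(n_repeaters, t_rep, t_wire,
                      +transistor_var, +wire_var, coupling_max)
    fast = line_delay(n_repeaters, t_rep, t_wire,
                      -transistor_var, -wire_var, coupling_min)
    return slow - fast


# Hypothetical numbers in arbitrary time units.
unshielded = worst_case_skew(10, 1.0, 2.0, 0.05, 0.05, 0.5, 1.5)
shielded = worst_case_skew(10, 1.0, 2.0, 0.05, 0.05, 1.0, 1.0)
print(f"unshielded skew: {unshielded:.2f}, shielded skew: {shielded:.2f}")
```

Under this assumed form the skew grows linearly with the number of repeater stages, and fixing the coupling factor (as shielding does) removes the largest contribution.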

17/33 Parallel Link Minimal Clock Cycle (2). Minimal clock cycle: [equation]. System clock limitation: [equation]. Register-pipelined link – Distance between successive pipeline stages is affected by delay uncertainty. 65nm example: the rate is bounded by the clock cycle rather than by the delay uncertainty

18/33 Serial Link Bit Rate. Skew due to in-die variation is neglected – much smaller than in the parallel link. Coupling factor is always known – LEDR encoding: there is only one transition per transmitted bit – The skew is not affected by cross-coupling, so the link delay is similar for all symbols. Bit rate: [equation]

19/33 Scalability. Number of repeaters (per millimeter) grows for more advanced technology nodes – Active area and leakage: minimal link length for serial link employment decreases with technology – Dynamic power: minimal link length for serial link employment decreases with technology – Interconnect area: serial link is always preferable. RLC region repeater insertion: Y.I. Ismail, et al., Repeater Insertion in RLC Lines for Minimum Propagation Delay, ISCAS99. [Plots: repeaters, power, area and leakage vs. technology node. The number of repeaters grows with technology node scaling; the range for serial link employment decreases with technology node scaling]

20/33 65nm Case Study

21/33 Goals and Set-up. Compare – Wave-pipelined (shielded/unshielded) vs. Serial – Register-pipelined (shielded/unshielded) vs. Serial. In terms of: – Area – Power – Latency – Length. All links deliver the same bandwidth – B_SER, the bandwidth of a single serial link

22/33 Parallel Link Width for Equivalent Throughput. Note impractical widths for: – Unshielded WP over 6mm – RP operating with clock cycle greater than 130∙d4. [Plot labels: Wave-Pipelined (WP) link width; Register-Pipelined (RP) link width; Maximal Width (128 Lines); Equivalent Width; Unshielded; Fully-Shielded (8∙d4 clock, N=8)]
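The equivalent-width idea can be sketched as a simple ratio. Assuming the serial link sends one bit per d4 (the single gate-delay data cycle stated earlier) and each parallel wire sends one bit per clock cycle, the parallel link needs roughly cycle/d4 wires to match B_SER. That one-bit-per-cycle assumption, the function names, and the neglect of the length-dependent skew that limits the wave-pipelined cycle are simplifications of this sketch, not the paper's model; the 128-line cap is the maximal width marked on the slide.

```python
import math

MAX_WIDTH = 128  # maximal practical width marked on the slide


def equivalent_width(parallel_cycle_d4: float, serial_cycle_d4: float = 1.0) -> int:
    """Number of parallel wires needed to match one serial link's bandwidth."""
    if parallel_cycle_d4 <= 0 or serial_cycle_d4 <= 0:
        raise ValueError("cycles must be positive")
    return math.ceil(parallel_cycle_d4 / serial_cycle_d4)


def is_practical(width: int, max_width: int = MAX_WIDTH) -> bool:
    """A width is treated as practical if it fits within max_width lines."""
    return width <= max_width


for cycle in (8, 10, 130):
    width = equivalent_width(cycle)
    print(f"clock {cycle} d4: {width} lines, practical: {is_practical(width)}")
```

Under these assumptions a register-pipelined link clocked at 130∙d4 would need 130 lines, just past the 128-line cap, which is in line with the slide's note about impractical widths.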

23/33 Wave-Pipelined Link vs. Serial Link: Active Area and Leakage Comparison Same Area / Leakage Parallel is better Serial is better Fully-Shielded Unshielded Unshielded: Impractical

24/33 Wave-Pipelined Link vs. Serial Link: Total Area Comparison (Incl. Interconnect) Serial is always better Fully-Shielded Unshielded Unshielded: Impractical

25/33 Register-Pipelined Link vs. Serial Link: Active Area and Leakage Comparison. Serial is always better. [Curves: Fully-Shielded, T=10∙d4; Fully-Shielded, T=130∙d4; Unshielded, T=10∙d4; Unshielded, T=130∙d4]

26/33 Register-Pipelined Link vs. Serial Link: Total Area Comparison (Incl. Interconnect). Serial is always better. [Curves: Fully-Shielded, T=10∙d4; Fully-Shielded, T=130∙d4; Unshielded, T=10∙d4; Unshielded, T=130∙d4]

27/33 Wave-Pipelined Link vs. Serial Link: Dynamic Power Comparison Impractical: Too wide parallel link Serial is better >3mm Fully-Shielded Unshielded Unshielded: Impractical Parallel is better

28/33 Wave-Pipelined Link vs. Serial Link: Total Power Comparison Fully-Shielded Unshielded Unshielded: Impractical 20% Utilization

29/33 Register-Pipelined Link vs. Serial Link: Dynamic Power Comparison. [Curves: Fully-Shielded, T=10∙d4; Fully-Shielded, T=130∙d4; Unshielded, T=10∙d4; Unshielded, T=130∙d4]

30/33 Register-Pipelined Link vs. Serial Link: Total Power Comparison, 20% Utilization. [Curves: Fully-Shielded, T=10∙d4; Fully-Shielded, T=130∙d4; Unshielded, T=10∙d4; Unshielded, T=130∙d4]
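One simple way to read the "total power at 20% utilization" comparisons is to weight dynamic power by the fraction of time the link is active and add leakage, which is dissipated all the time. The sketch below uses that assumed form (idle dynamic power taken as zero, for example under clock gating); the function name and all the numbers are hypothetical and are not taken from the plots.

```python
def total_power(dynamic_at_full_rate: float, leakage: float,
                utilization: float = 0.2) -> float:
    """Assumed total power: utilization-weighted dynamic power plus leakage."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return utilization * dynamic_at_full_rate + leakage


# Hypothetical example in arbitrary units: a wide parallel link with many
# leaking repeaters versus a serial link with higher dynamic power.
parallel = total_power(dynamic_at_full_rate=5.0, leakage=4.0)
serial = total_power(dynamic_at_full_rate=8.0, leakage=0.5)
print(f"parallel: {parallel:.2f}, serial: {serial:.2f}")
```

With numbers like these the serial link wins on total power even though its dynamic power is higher, because at low utilization the leakage of the many parallel drivers and repeaters dominates.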

31/33 Test Case Summary: minimal length above which the serial link is preferred (choose a serial link for links longer than the listed length)
Comparison: Register-pipelined vs. Serial | Wave-Pipelined vs. Serial
Shielding: Unshielded, Fully Shielded | Unshielded, Fully Shielded
Length of parallel link: unlimited | up to 6mm, unlimited
Clock cycle of parallel link: 130∙d4 (slow), 10∙d4 (fast), 130∙d4 (slow), 10∙d4 (fast) | 8∙d4
To minimize Area: Always
To minimize Power: 3mm, 1mm, 3mm, 4mm, 2mm
To minimize Latency: 9mm, 2mm, 12mm, 4mm | Never*, 2mm
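As a usage sketch, the latency row of the summary can be turned into a small lookup. The assignment of the six latency values to columns (register-pipelined unshielded slow/fast, register-pipelined fully shielded slow/fast, then wave-pipelined unshielded and fully shielded) is an assumption based on the order in which the values appear in the transcript, and "Never*" is modeled here simply as "no threshold". The dictionary keys and function name are hypothetical.

```python
# Minimal length (mm) above which the serial link has lower latency.
# Column mapping is assumed from the order of values in the transcript.
LATENCY_THRESHOLD_MM = {
    ("register-pipelined", "unshielded", "130d4"): 9.0,
    ("register-pipelined", "unshielded", "10d4"): 2.0,
    ("register-pipelined", "fully-shielded", "130d4"): 12.0,
    ("register-pipelined", "fully-shielded", "10d4"): 4.0,
    ("wave-pipelined", "unshielded", "8d4"): None,  # "Never*" in the table
    ("wave-pipelined", "fully-shielded", "8d4"): 2.0,
}


def prefer_serial_for_latency(link: str, shielding: str, clock: str,
                              length_mm: float) -> bool:
    """True if the link is longer than the tabulated latency threshold."""
    threshold = LATENCY_THRESHOLD_MM[(link, shielding, clock)]
    return threshold is not None and length_mm > threshold


print(prefer_serial_for_latency("register-pipelined", "unshielded", "10d4", 5.0))
print(prefer_serial_for_latency("wave-pipelined", "unshielded", "8d4", 5.0))
```

The area row needs no lookup, since the summary lists the serial link as always preferable there.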

32/33 Conclusions – Novel high-speed serial links outperform parallel links for long-range communication – The serial link becomes more attractive for shorter ranges in future technologies – Future large SoCs and NoCs should employ serial links to mitigate: area, routing congestion, power, and latency

Thank You!

Back-Up Slides

35/33 Area: Wave-pipelined parallel link; Register-pipelined parallel link; Serial link – Please refer to the paper for detailed expressions

36/33 Dynamic and Standby Power Wave-pipelined parallel link Register-pipelined parallel link Serial link Standby power Assuming clock gating for parallel link Normalized Capacitance

37/33 Area Wave-pipelined parallel link Register-pipelined parallel link Serial link Y.I. Ismail, et al., Repeater Insertion in RLC Lines for Minimum Propagation Delay, ISCAS99 RLC region repeater insertion: Driver size # lines Repeater size Shield size Pipe-Stage Flip-Flop size Repeaters & Drivers SERDES

38/33 Dynamic and Standby Power Wave-pipelined parallel link Register-pipelined parallel link Serial link Standby power Assuming clock gating for parallel link

39/33 Wave-Pipelined Link vs. Serial Link: Interconnect Area Comparison Serial is always better Fully-Shielded Unshielded Unshielded: Impractical

40/33 Register-Pipelined Link vs. Serial Link: Interconnect Area Comparison Serial is always better Fully-Shielded Unshielded

41/33 Wave-Pipelined Link vs. Serial Link: Latency Overhead Comparison Latency overhead – Relative to an inherent wire delay

42/33 Register-Pipelined Link vs. Serial Link: Latency Overhead Comparison Latency overhead – Relative to an inherent wire delay