Download presentation
Presentation is loading. Please wait.
1
Async. Seminar FOX Fast On-Chip Interconnect Reuven Dobkin Technion – Israel Institute of Technology Electrical Engineering Department – VLSI Lab April 18, 2010
2
Async. Seminar, Technion, April 18, 2010 2 Presentation Outline SoC Communication Challenges Serial Link Benefits Parallel vs. Serial Link Analysis Fast Asynchronous Serial Link Transmitter, Fast LEDR Encoder Receiver, Fast Toggle Circuit Channel Voltage Mode Async Signaling Current Mode Async Signaling FOX Test Chip Future Research Topics
3
Async. Seminar, Technion, April 18, 2010 3 System-on-Chip (SoC) Challenges Non-scalable global interconnect High power and latency Multiple number of modules Congested inter-modular communication Multiple clock and voltage domains Power-reduction dynamic techniques Complex inter-module timing relations Growing bandwidth requirements Unsupported by conventional buses
4
Async. Seminar, Technion, April 18, 2010 4 Constructed of multiple wires and repeaters Incur high leakage power (line drivers and repeaters) Often have low utilization (leakage again) Occupy large chip area (routing difficulty) Incur cross-coupling (delay matching & skew problems) Present a significant capacitive load N Parallel NoC Links and SoC Buses
5
Async. Seminar, Technion, April 18, 2010 5 Bit-Serial Interconnect Fewer lines, fewer line drivers and fewer repeaters Reduced leakage power Reduced chip area Better routability Reduced coupling Should work N times faster! F SER =N F PAR
6
Async. Seminar, Technion, April 18, 2010 6 Fast Serial Interconnect Synchronous Approach: Source-synchronous, with Clock Data Recovery (CDR) Problem: CDR power and area Asynchronous Approach: Normal methods are too slow We consider a faster one
7
Async. Seminar, Technion, April 18, 2010 7 Seeking Highest Speed No “spacers” (NRZ) One transition per bit Fast signaling Targeted Speed: One gate delay between bits For 65nm technology this results in 67 Gbps One Gate Delay = FO4 INV Delay
8
Async. Seminar, Technion, April 18, 2010 8 Serial Link – Top Structure Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS) Acknowledge per word instead of per bit Wave-pipelining over channel Differential encoding (DS-DE, IEEE1355-95) Low-latency synchronizers
9
Async. Seminar, Technion, April 18, 2010 9 Encoding –Two Phase NRZ LEDR Two Phase Non-Return-to-Zero Level Encoded Dual Rail “delta” encoding (one transition per bit) Uncoded (B) State bit (S) Phase bit (P) 00110 00010
10
Async. Seminar, Technion, April 18, 2010 10 Serial Link Applications P2P long-range global interconnect Long range NoC links Pin-limited on-chip module interfaces Presently chips are pin-limited, and that will migrate inside Cross-bar Simpler routing and congestion Communications inside many-core CMPs
11
Async. Seminar, Technion, April 18, 2010 11 NameData Cycle (# of FO4 Inv. Delays) WAFT7*(3 Gbps at 180nm) IIP3*(5 Gbps at 250nm) 3-Levels3*(1 Gbps at 1200nm) Pulsed CM 2(8 Gbps at 180nm) This Work 1(67 Gbps at 65nm-HP) Related Work
12
Async. Seminar Parallel vs. Serial Link Analysis
13
Async. Seminar, Technion, April 18, 2010 13 Parallel Link dissipates less power Serial Link dissipates less power Link Length [mm] Serial Link Benefits Analysis goal To show that novel serial link outperforms the parallel one for: Long ranges Advanced technology nodes Technology Node [nm] Parallel Link requires less area Serial Link requires less area Technology Node [nm]
14
Async. Seminar, Technion, April 18, 2010 14 Method Choose Parallel link implementation representatives Serial link implementation representatives Compare the parallel and serial link approaches in terms of: Area Power Latency Technology scaling
15
Async. Seminar, Technion, April 18, 2010 15 "Register-Pipelined" Parallel Link Fully synchronous Interconnect as combinational logic between registers Source synchronous or global clock High cost for high bit rates!
16
Async. Seminar, Technion, April 18, 2010 16 "Wave-Pipelined" Parallel Link Bit rate is limited by relative skew of the link wires
17
Async. Seminar, Technion, April 18, 2010 17 Crosstalk Mitigation and Power Reduction Shielding / Spacing Staggered repeaters Interleaved bi-directional lines Asynchronous signaling Data encoding Data pattern recognition with special worst-case handling This work analyzes the two extremes of shielding: Unshielded wires (a) Fully-shielded wires (b)
18
Async. Seminar, Technion, April 18, 2010 18 Single Gate-Delay Serial Link (reminder) Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS) Acknowledge per word instead of per bit Wave-pipelining over channel Differential encoding (DS-DE, IEEE1355-95) Low-latency synchronizers
19
Async. Seminar, Technion, April 18, 2010 19 Analytical Models Parallel and Serial Link Bit Rates Please refer to the paper below for details on the exact analytical models employed in the work: R. Dobkin, A. Morgenshtein, A. Kolodny, R. Ginosar, "Parallel vs. Serial On-Chip Communication," SLIP, pp. 43-50, 2008
20
Async. Seminar, Technion, April 18, 2010 20 Parallel Link Bit Rate Limitations (1) A.Fastest available clock Ring oscillator limitation: 8 d 4 Fast processors: 11 d 4 (e.g. CELL) Standard SoC/ASIC: 100-400 d 4 B.Synchronization Latency May take several clocks in case of asynchronous clock relation C.Clock uncertainty Extended critical path
21
Async. Seminar, Technion, April 18, 2010 21 Parallel Link Bit Rate Limitations (2) D.Delay Uncertainty The skew and jitter of the clock Repeater delay variations Wire delay variations mostly metal thickness variations Via variations Cross-Coupling (Crosstalk) Geometry Outcome of routing congestion and multi-layer structure N.S. Nagaraj DAC 2005 / L. Scheffer, SLIP 2006
22
Async. Seminar, Technion, April 18, 2010 22 Parallel Link Minimal Clock Cycle (1) Notations from W.P. Burleson, et al., Wave-Pipelining: A Tutorial and Research Survey, TVLSI, 1998 Latest data clocking Earliest data clocking Clock Uncertainty
23
Async. Seminar, Technion, April 18, 2010 23 Impact of Process Variations in Repeaters on Multi-Wire Delay Uncertainty Variation types Random variations closely placed devices "Systematic" variations location on the die Relative skew ( MAX – MIN ) Repeaters in the same stage are highly correlated Random variations are averaged out thanks to large repeater sizing Systematic inter-stage variations are averaged out along the link Random variations inside Repeater Stage are averaged out Relative skew among the lines due to variations in repeaters is small Multi-wire delay uncertainty is dominated by Cross-Coupling
24
Async. Seminar, Technion, April 18, 2010 24 Parallel Link Minimal Clock Cycle (2) Minimal clock cycle: System clock limitation: Register-pipelined link: Distance between successive pipeline stages is affected by Delay Uncertainty 65nm example The rate is bounded by clock cycle rather than the delay uncertainty The rate is bounded by clock cycle rather than by the delay uncertainty Worst case skew between two lines: Cross-coupling and wire variations
25
Async. Seminar, Technion, April 18, 2010 25 Serial Link Bit Rate Skew due to transistor variations is neglected much smaller than in parallel link Coupling factor is always known LEDR encoding: there is only one transition per each transmitted bit The skew is not affected by cross-coupling link delay is similar for all symbols Bit rate:
26
Async. Seminar, Technion, April 18, 2010 26 Scalability Number of repeaters (per millimeter) grows for more advanced technology nodes Active area and leakage: Minimal link length for serial link employment decreases with technology Dynamic power: Minimal link length for serial link employment decreases with technology Interconnect area: Serial link is always preferable Y.I. Ismail, et al., Repeater Insertion in RLC Lines for Minimum Propagation Delay, ISCAS99 Power Repeaters Number of repeaters grows with technology node scaling Area and Leakage Equal throughput Parallel and Serial links are assumed
27
Async. Seminar, Technion, April 18, 2010 27 65 nm Test Case Summary Register-pipelined vs. SerialWave-Pipeline vs. Serial UnshieldedFully Shielded UnshieldedFully ShieldedShielding unlimited up to 6mmunlimited Length of parallel link 130d 4 (slow) 10d 4 (fast) 130d 4 (slow) 10d 4 (fast) 8d 4 Clock cycle of parallel link choose a serial link for links longer than: To minimize the following: Always Area 3mm1mm3mm 4mm2 mmPower 9mm2mm12mm4mmNever*2 mmLatency Minimal length above which the serial link is preferred
28
Async. Seminar Single Gate Delay Serial Link Architecture
29
Async. Seminar, Technion, April 18, 2010 29 Single Gate-Delay Serial Link Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS) Acknowledge per word instead of per bit Wave-pipelining over channel Differential encoding (DS-DE, IEEE1355-95) Low-latency synchronizers
30
Async. Seminar, Technion, April 18, 2010 30 Transmitter – Fast SR Approach Transition Generator Targeted Speed: One gate delay between bits
31
Async. Seminar, Technion, April 18, 2010 31 Multi-Phase Transition Generation The circuit is based on multi-phase clock generator from M.J.E. Lee, "An Efficient I/O and Clock Recovery for TERABIT Integrated Circuits Design," PhD Thesis, Stanford Univ., 2001.
32
Async. Seminar, Technion, 18 April 2010 32 Fast Asynchronous Shift Register
33
Async. Seminar, Technion, April 18, 2010 33 Wave-pipelined Control Characteristics The highest speed (the single gate- delay cycle) relates to the pole of the Bode diagram This operating point results in signal degradation along the inverter chain Single Gate Delay Rate
34
Async. Seminar, Technion, April 18, 2010 34 Splitter Architecture The shift-register is partitioned into M shift-registers M slower operation in each shift-register Signal is no longer degraded Single gate-delay operation is localized to output (input) stage only
35
Async. Seminar, Technion, April 18, 2010 35 Splitter Architecture – Performance Splitter architecture employment Better resiliency to process variation thanks to localized high-speed operation Power Reduction:
36
36 Transmitter Splitter Architecture
37
Async. Seminar, Technion, April 18, 2010 37 Transmitter – SPICE Simulation (65nm node) Simulations done at
38
Async. Seminar, Technion, 18 April 2010 38 Receiver
39
39 Receiver Splitter Architecture
40
Async. Seminar, Technion, April 18, 2010 40 Toggle Circuit Straightforward implementation (fundamental asynchronous state machine) is too slow (supports only ~1.5 gate delay cycle) Novel toggle: Single gate delay operation support Internal and output latches 010101
41
Async. Seminar, Technion, April 18, 2010 41 Channel Four transmission lines (DS-DE) High metal layers utilization Metals 5-8 of 65nm process RLC modeled Careful layout Small crosstalk Small relative variations
42
Async. Seminar, Technion, April 18, 2010 42 Channel Issues The wires should be designed as transmission lines, enabling multiple traveling signals (a 1mm wire may carry at least two successive bits simultaneously) Wire Bandwidth (metal layer/width/frequency) Careful Layout Modeling (R, L, C) Current/Voltage Signaling Repeater Insertion / Differential Repeaters Shielding Termination Low-Swing (Speed & Power)
43
Async. Seminar, Technion, April 18, 2010 43 SS LEDR Interconnect Layout
44
Async. Seminar, Technion, April 18, 2010 44 SSPP LEDR Interconnect Layout
45
Async. Seminar, Technion, April 18, 2010 45 SSPPSP LEDR Interconnect Layout
46
Async. Seminar, Technion, April 18, 2010 46 Differential Channel Driver and Receiver “Voltage Mode” Current mode differential low-swing signaling Currents in opposite directions Controllable current return path P / S Voltage Sensing at the RX side
47
Async. Seminar, Technion, April 18, 2010 47 Channel Characteristic Impedance Based on data from BPTM. Drawn for constant R, L, C Z depends on F Voltage changes with F Fast changes voltage drifts The drifts bound the operating speed F Z S S
48
Async. Seminar, Technion, April 18, 2010 48 Channel Driver with Adaptive Control Compensates for Z changes Turned on for low frequencies Adaptive Control Inertial Delay
49
Async. Seminar, Technion, April 18, 2010 49 Adaptive Control – Simulation Example SPICE simulation setup: 65nm technology, 4mm range, 67Gbps data rate RLC modeled channel (using Raphael-like three-dimensional field solver) Adaptive control is turned on only for low frequencies
50
Async. Seminar, Technion, April 18, 2010 50 Channel Receiver Amplifier
51
Async. Seminar, Technion, April 18, 2010 51 Voltage Mode Drawbacks Voltage signal is degraded by high-capacitance of long interconnects Fast full-swing voltage transitions result in high dynamic currents High power Cross-talk noise Current signaling offers a way of avoiding high voltage swings Standard “Current-mode” techniques still measure voltage at the receiver side
52
Async. Seminar, Technion, April 18, 2010 52 Current Mode: Goals Full Current Mode: Send current from TX Measure current at RX Minimal voltage swing over channel Current to voltage conversion at RX Long range repeater-less channel Targeted data cycle 15 ps (67 Gbps) @ 65nm Low power
53
Async. Seminar, Technion, April 18, 2010 53 Current Mode: Full Current-Mode Circuit I RX I TX I TX > I RX I TX > I S V q_swing =I S R I RX +I S Output Stage FS
54
Async. Seminar, Technion, April 18, 2010 54 Current Mode: Fast Feedback
55
Async. Seminar, Technion, April 18, 2010 55 Current Mode: Adaptive Control OFF During Fast Operation ON During Slow Operation
56
Async. Seminar, Technion, April 18, 2010 56 Current Mode: SPICE Simulations 7 mm link (note almost twice longer than VM) Full RLC model for interconnect Interconnect: 2 signal wires + 2 shields. 2 m partitioned wires, 5 m spacing
57
Async. Seminar, Technion, April 18, 2010 57 Current Mode: Simulations (1) Input Data RLX Currents TX Currents
58
Async. Seminar, Technion, April 18, 2010 58 Current Mode: Simulations (2) Channel Voltage Swings Input: 150mV, Output: 50mV RX Currents: 1mA RX Output Voltage: 200mV
59
Async. Seminar, Technion, April 18, 2010 59 Current Mode: Power Total Output StageRX-TX 7 mm Link, Random Pattern, 67Gbps-max 20.010.39.7I AV dynamic (mA) 0.120.070.06I leakage (mA) 23.713.759.94I peak (mA) 24.012.311.7Dynamic Power (mW, Vdd=1.2V) 0.150.080.07Leakage Power (mW, Vdd=1.2V) 16.5011.93Peak Power (mW) 4.92.52.4Total Power with 20% Utilization (mW) 35-- Total Link Power with 20% Utilization (mW) Including SerDes and two Current-Model Links
60
Async. Seminar, Technion, April 18, 2010 60 Voltage Mode vs. Current Mode 16-bit word 67Gbps serial link Maximal repeater-less range: Voltage mode: 4mm Current mode: 7mm Dynamic power (w/o SerDes circuits, 100% util.): 4mm Voltage mode: 18mW 7mm Current mode: 24mW 33% increase in power for 75% increase in length
61
Async. Seminar, Technion, April 18, 2010 61 In-Die Variations Splitter architecture High-speed operation localized to input and output stages High-speed components design and verification Monte-Carlo simulations (>5 ) 26 PVT Corners Iterative design with legging and sizing for sensitive transistors Asynchronous structure Supports any slow down Minimal time separation between successive bits must be provided!
62
Async. Seminar, Technion, April 18, 2010 62 Serial Link Performance Limitations Skew between subsequent transitions Channel deviation (e.g. short pulses degradation) Systematic On-Die Variations (TX vs. RX)
63
Async. Seminar FOX Test Chip
64
Async. Seminar, Technion, April 18, 2010 64 Proof of concept of the serial link and its components on a real process Performance mapping: Operation mode (16-bit/64-bit words) Speed (ability to change the link speed in range 0.1-30 Gbps) Link length (0-7mm) Power Channel layout geometry Metal layer Noise rejection
65
Async. Seminar, Technion, April 18, 2010 65
66
Async. Seminar, Technion, April 18, 2010 66 IBM 65nm 10 LPe process 8 Metal layers Expecting 30 Gbps throughput over single link 1 / (30ps gate delay)
67
Async. Seminar, Technion, April 18, 2010 67 Digital Controler Testing circuits TX RX FOX chip structure Power grid
68
Async. Seminar, Technion, April 18, 2010 68 Conclusions Novel high-speed serial links outperform parallel links for long range communication The serial link is more attractive for shorter ranges in future technologies Future large SoCs and NoCs should employ serial links to mitigate: Area Routing Congestion Power Latency
69
Async. Seminar, Technion, April 18, 2010 69 Publications 1.R. Dobkin, R. Ginosar, C. Sotiriou, "Data Synchronization Issues in GALS SoCs," Proc. of ASYNC, pp. 170-179, 2004. 2.R. Dobkin, V. Vishnyakov, E. Friedman and R. Ginosar, "An Asynchronous Router for Multiple Service Levels Networks on Chip," Proc. of ASYNC, pp. 44-53, 2005. 3.R. Dobkin, R. Ginosar, A. Kolodny, "High-Speed Serial Interconnect for NoC," NoC Workshop, DATE, 2006. 4.R. Dobkin, R. Ginosar, A. Kolodny, "Fast Asynchronous Shift Register for Bit-Serial Communication," Proc. of ASYNC, pp. 117-126, 2006. 5.R. Dobkin, R. Ginosar and C. P. Sotiriou, "High Rate Data Synchronization in GALS SoCs," IEEE Transactions on VLSI, 14(10), pp. 1063-1074, 2006. 6.R. Dobkin, Y. Perelman, T. Liran, R. Ginosar, A. Kolodny, "High Rate Wave-Pipelined Asynchronous On-Chip Bit-Serial Data Link," Proc. of ASYNC, pp. 3-14, 2007. 7.R. Dobkin, R. Ginosar, I. Cidon, "QNoC Asynchronous Router with Dynamic Virtual Channel Allocation," First ACM/IEEE Int. Symp. on Networks on Chip (NOCS), p. 218, 2007. 8.R. Dobkin, R. Ginosar, A. Kolodny, "QNoC Asynchronous Router", VLSI Integration Journal, 2008, doi:10.1016/j.vlsi.2008.03.001 9.R. Dobkin, A. Morgenshtein, A. Kolodny, R. Ginosar, "Parallel vs. Serial On-Chip Communication," SLIP, pp. 43-50, 2008. 10.R. Dobkin, R. Ginosar, "Fast Universal Synchronizers," PATMOS, 2008. 11.R. Dobkin, T. Kapshitz, S. Flur, R. Ginosar, "Assertion Based Functional Verification of Multiple-Clock GALS Systems," VLSI-SOC, 2008. 12.R. Dobkin, R. Ginosar, "Two Phase Synchronization with Sub-cycle Latency," In press, VLSI Integration Journal, 2008.
70
Async. Seminar Future Research
71
Async. Seminar, Technion, April 18, 2010 71 Future SoC: Three-Level Communication
72
Async. Seminar, Technion, April 18, 2010 72 Future SoC (Zoom In)
73
Async. Seminar, Technion, April 18, 2010 73 Future Research (1) – Networking Design a 2-level NOC with power, latency and bandwidth parameters for each link Create algorithms or analysis that help decide for each route whether to send the message on the highway (part of the way) or only using the slow parallel links Consider a ring architecture, where each high speed message can either pass-thru or be "dropped" to slow NoC, similar to optical ring switches Analyze N-level NoC where each communication level has its own speed tag Combine / Compare to QNoC approach Is there any benefit in sending part of the packet through high- way and the other part through slow parallel links? Broadcast / Multicast / Backpressure (Tokens) through highway?
74
Async. Seminar, Technion, April 18, 2010 74 Future Research (2) – Architecture Design a 5-port router for highway links: Does each router need to parallelize each incoming word before it can be serialized again and sent on the next link? Maybe create a switch using X-latches? Consider wave-pipelining insides the router. Design parallel NoC link/NoC architecture with wave pipelining Rely on guaranteed latency credit mechanism Design upper level architecture for the serial link Word level acknowledge technique What should be done for NACK? Encoding for error correction? Retransmit? Out of order word (flit) handling? Design Low-Power N-bit high-speed shift-register Explore the linear power reduction of Splitter architecture
75
Async. Seminar, Technion, April 18, 2010 75 Future Research (3) – Circuit High-speed Current-Mode circuits New approach for lower supply voltage / longer links Improve Output Stage stability, resiliency to variations, power Analyze trade-off power/speed for CM-based links Design an efficient power down/up technique for CM Switch off the link when there is no data Turn on the link quickly when the data is ready Adopt the idea of CM for off-chip links Design a new circuit matching off-chip requirements Compare to known approaches Further investigation of asynchronous (variable rate) signaling Explore interconnect architectures for CM signaling Layout, shielding, return current path, noise
76
Async. Seminar, Technion, April 18, 2010 76 The End Thank you
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.