Presentation is loading. Please wait.

Presentation is loading. Please wait.

Async. Seminar FOX Fast On-Chip Interconnect Reuven Dobkin Technion – Israel Institute of Technology Electrical Engineering Department – VLSI Lab April.

Similar presentations


Presentation on theme: "Async. Seminar FOX Fast On-Chip Interconnect Reuven Dobkin Technion – Israel Institute of Technology Electrical Engineering Department – VLSI Lab April."— Presentation transcript:

1 Async. Seminar FOX Fast On-Chip Interconnect Reuven Dobkin Technion – Israel Institute of Technology Electrical Engineering Department – VLSI Lab April 18, 2010

2 Async. Seminar, Technion, April 18, 2010 2 Presentation Outline SoC Communication Challenges Serial Link Benefits Parallel vs. Serial Link Analysis Fast Asynchronous Serial Link Transmitter, Fast LEDR Encoder Receiver, Fast Toggle Circuit Channel Voltage Mode Async Signaling Current Mode Async Signaling FOX Test Chip Future Research Topics

3 Async. Seminar, Technion, April 18, 2010 3 System-on-Chip (SoC) Challenges Non-scalable global interconnect High power and latency Multiple number of modules Congested inter-modular communication Multiple clock and voltage domains Power-reduction dynamic techniques Complex inter-module timing relations Growing bandwidth requirements Unsupported by conventional buses

4 Async. Seminar, Technion, April 18, 2010 4 Constructed of multiple wires and repeaters Incur high leakage power (line drivers and repeaters) Often have low utilization (leakage again) Occupy large chip area (routing difficulty) Incur cross-coupling (delay matching & skew problems) Present a significant capacitive load N Parallel NoC Links and SoC Buses

5 Async. Seminar, Technion, April 18, 2010 5 Bit-Serial Interconnect Fewer lines, fewer line drivers and fewer repeaters Reduced leakage power Reduced chip area Better routability Reduced coupling Should work N times faster! F SER =N  F PAR

6 Async. Seminar, Technion, April 18, 2010 6 Fast Serial Interconnect Synchronous Approach: Source-synchronous, with Clock Data Recovery (CDR) Problem: CDR power and area Asynchronous Approach: Normal methods are too slow We consider a faster one

7 Async. Seminar, Technion, April 18, 2010 7 Seeking Highest Speed No “spacers” (NRZ) One transition per bit Fast signaling Targeted Speed: One gate delay between bits For 65nm technology this results in 67 Gbps One Gate Delay = FO4 INV Delay

8 Async. Seminar, Technion, April 18, 2010 8 Serial Link – Top Structure Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS) Acknowledge per word instead of per bit Wave-pipelining over channel Differential encoding (DS-DE, IEEE1355-95) Low-latency synchronizers

9 Async. Seminar, Technion, April 18, 2010 9 Encoding –Two Phase NRZ LEDR Two Phase Non-Return-to-Zero Level Encoded Dual Rail “delta” encoding (one transition per bit) Uncoded (B) State bit (S) Phase bit (P) 00110 00010

10 Async. Seminar, Technion, April 18, 2010 10 Serial Link Applications P2P long-range global interconnect Long range NoC links Pin-limited on-chip module interfaces Presently chips are pin-limited, and that will migrate inside Cross-bar Simpler routing and congestion Communications inside many-core CMPs

11 Async. Seminar, Technion, April 18, 2010 11 NameData Cycle (# of FO4 Inv. Delays) WAFT7*(3 Gbps at 180nm) IIP3*(5 Gbps at 250nm) 3-Levels3*(1 Gbps at 1200nm) Pulsed CM 2(8 Gbps at 180nm) This Work 1(67 Gbps at 65nm-HP) Related Work

12 Async. Seminar Parallel vs. Serial Link Analysis

13 Async. Seminar, Technion, April 18, 2010 13 Parallel Link dissipates less power Serial Link dissipates less power Link Length [mm] Serial Link Benefits Analysis goal To show that novel serial link outperforms the parallel one for: Long ranges Advanced technology nodes Technology Node [nm] Parallel Link requires less area Serial Link requires less area Technology Node [nm]

14 Async. Seminar, Technion, April 18, 2010 14 Method Choose Parallel link implementation representatives Serial link implementation representatives Compare the parallel and serial link approaches in terms of: Area Power Latency Technology scaling

15 Async. Seminar, Technion, April 18, 2010 15 "Register-Pipelined" Parallel Link Fully synchronous Interconnect as combinational logic between registers Source synchronous or global clock High cost for high bit rates!

16 Async. Seminar, Technion, April 18, 2010 16 "Wave-Pipelined" Parallel Link  Bit rate is limited by relative skew of the link wires

17 Async. Seminar, Technion, April 18, 2010 17 Crosstalk Mitigation and Power Reduction Shielding / Spacing Staggered repeaters Interleaved bi-directional lines Asynchronous signaling Data encoding Data pattern recognition with special worst-case handling This work analyzes the two extremes of shielding: Unshielded wires (a) Fully-shielded wires (b)

18 Async. Seminar, Technion, April 18, 2010 18 Single Gate-Delay Serial Link (reminder) Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS) Acknowledge per word instead of per bit Wave-pipelining over channel Differential encoding (DS-DE, IEEE1355-95) Low-latency synchronizers

19 Async. Seminar, Technion, April 18, 2010 19 Analytical Models Parallel and Serial Link Bit Rates  Please refer to the paper below for details on the exact analytical models employed in the work: R. Dobkin, A. Morgenshtein, A. Kolodny, R. Ginosar, "Parallel vs. Serial On-Chip Communication," SLIP, pp. 43-50, 2008

20 Async. Seminar, Technion, April 18, 2010 20 Parallel Link Bit Rate Limitations (1) A.Fastest available clock Ring oscillator limitation: 8  d 4 Fast processors: 11  d 4 (e.g. CELL) Standard SoC/ASIC: 100-400  d 4 B.Synchronization Latency May take several clocks in case of asynchronous clock relation C.Clock uncertainty Extended critical path

21 Async. Seminar, Technion, April 18, 2010 21 Parallel Link Bit Rate Limitations (2) D.Delay Uncertainty The skew and jitter of the clock Repeater delay variations Wire delay variations  mostly metal thickness variations Via variations Cross-Coupling (Crosstalk) Geometry  Outcome of routing congestion and multi-layer structure N.S. Nagaraj DAC 2005 / L. Scheffer, SLIP 2006

22 Async. Seminar, Technion, April 18, 2010 22 Parallel Link Minimal Clock Cycle (1) Notations from W.P. Burleson, et al., Wave-Pipelining: A Tutorial and Research Survey, TVLSI, 1998 Latest data clocking Earliest data clocking Clock Uncertainty

23 Async. Seminar, Technion, April 18, 2010 23 Impact of Process Variations in Repeaters on Multi-Wire Delay Uncertainty Variation types Random variations  closely placed devices "Systematic" variations  location on the die Relative skew (  MAX –  MIN ) Repeaters in the same stage are highly correlated Random variations are averaged out thanks to large repeater sizing Systematic inter-stage variations are averaged out along the link Random variations inside Repeater Stage are averaged out  Relative skew among the lines due to variations in repeaters is small  Multi-wire delay uncertainty is dominated by Cross-Coupling

24 Async. Seminar, Technion, April 18, 2010 24 Parallel Link Minimal Clock Cycle (2) Minimal clock cycle: System clock limitation: Register-pipelined link: Distance between successive pipeline stages is affected by Delay Uncertainty 65nm example The rate is bounded by clock cycle rather than the delay uncertainty The rate is bounded by clock cycle rather than by the delay uncertainty Worst case skew between two lines: Cross-coupling and wire variations

25 Async. Seminar, Technion, April 18, 2010 25 Serial Link Bit Rate Skew due to transistor variations is neglected much smaller than in parallel link Coupling factor is always known LEDR encoding: there is only one transition per each transmitted bit The skew is not affected by cross-coupling  link delay is similar for all symbols Bit rate:

26 Async. Seminar, Technion, April 18, 2010 26 Scalability Number of repeaters (per millimeter) grows for more advanced technology nodes Active area and leakage: Minimal link length for serial link employment decreases with technology Dynamic power: Minimal link length for serial link employment decreases with technology Interconnect area: Serial link is always preferable Y.I. Ismail, et al., Repeater Insertion in RLC Lines for Minimum Propagation Delay, ISCAS99 Power Repeaters Number of repeaters grows with technology node scaling Area and Leakage Equal throughput Parallel and Serial links are assumed

27 Async. Seminar, Technion, April 18, 2010 27 65 nm Test Case Summary Register-pipelined vs. SerialWave-Pipeline vs. Serial UnshieldedFully Shielded UnshieldedFully ShieldedShielding unlimited up to 6mmunlimited Length of parallel link 130d 4 (slow) 10d 4 (fast) 130d 4 (slow) 10d 4 (fast) 8d 4 Clock cycle of parallel link choose a serial link for links longer than: To minimize the following: Always Area 3mm1mm3mm 4mm2 mmPower 9mm2mm12mm4mmNever*2 mmLatency Minimal length above which the serial link is preferred

28 Async. Seminar Single Gate Delay Serial Link Architecture

29 Async. Seminar, Technion, April 18, 2010 29 Single Gate-Delay Serial Link Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS) Acknowledge per word instead of per bit Wave-pipelining over channel Differential encoding (DS-DE, IEEE1355-95) Low-latency synchronizers

30 Async. Seminar, Technion, April 18, 2010 30 Transmitter – Fast SR Approach Transition Generator Targeted Speed: One gate delay between bits

31 Async. Seminar, Technion, April 18, 2010 31 Multi-Phase Transition Generation The circuit is based on multi-phase clock generator from M.J.E. Lee, "An Efficient I/O and Clock Recovery for TERABIT Integrated Circuits Design," PhD Thesis, Stanford Univ., 2001.

32 Async. Seminar, Technion, 18 April 2010 32 Fast Asynchronous Shift Register

33 Async. Seminar, Technion, April 18, 2010 33 Wave-pipelined Control Characteristics The highest speed (the single gate- delay cycle) relates to the pole of the Bode diagram This operating point results in signal degradation along the inverter chain Single Gate Delay Rate

34 Async. Seminar, Technion, April 18, 2010 34 Splitter Architecture The shift-register is partitioned into M shift-registers  M slower operation in each shift-register Signal is no longer degraded Single gate-delay operation is localized to output (input) stage only

35 Async. Seminar, Technion, April 18, 2010 35 Splitter Architecture – Performance Splitter architecture employment Better resiliency to process variation thanks to localized high-speed operation Power Reduction:

36 36 Transmitter Splitter Architecture

37 Async. Seminar, Technion, April 18, 2010 37 Transmitter – SPICE Simulation (65nm node) Simulations done at

38 Async. Seminar, Technion, 18 April 2010 38 Receiver

39 39 Receiver Splitter Architecture

40 Async. Seminar, Technion, April 18, 2010 40 Toggle Circuit Straightforward implementation (fundamental asynchronous state machine) is too slow (supports only ~1.5 gate delay cycle) Novel toggle: Single gate delay operation support Internal and output latches 010101

41 Async. Seminar, Technion, April 18, 2010 41 Channel Four transmission lines (DS-DE) High metal layers utilization Metals 5-8 of 65nm process RLC modeled Careful layout Small crosstalk Small relative variations

42 Async. Seminar, Technion, April 18, 2010 42 Channel Issues The wires should be designed as transmission lines, enabling multiple traveling signals (a 1mm wire may carry at least two successive bits simultaneously) Wire Bandwidth (metal layer/width/frequency) Careful Layout Modeling (R, L, C) Current/Voltage Signaling Repeater Insertion / Differential Repeaters Shielding Termination Low-Swing (Speed & Power)

43 Async. Seminar, Technion, April 18, 2010 43 SS LEDR Interconnect Layout

44 Async. Seminar, Technion, April 18, 2010 44 SSPP LEDR Interconnect Layout

45 Async. Seminar, Technion, April 18, 2010 45 SSPPSP LEDR Interconnect Layout

46 Async. Seminar, Technion, April 18, 2010 46 Differential Channel Driver and Receiver “Voltage Mode” Current mode differential low-swing signaling Currents in opposite directions Controllable current return path P / S Voltage Sensing at the RX side

47 Async. Seminar, Technion, April 18, 2010 47 Channel Characteristic Impedance Based on data from BPTM. Drawn for constant R, L, C Z depends on F Voltage changes with F Fast changes  voltage drifts The drifts bound the operating speed F Z S S

48 Async. Seminar, Technion, April 18, 2010 48 Channel Driver with Adaptive Control Compensates for Z changes Turned on for low frequencies Adaptive Control Inertial Delay

49 Async. Seminar, Technion, April 18, 2010 49 Adaptive Control – Simulation Example SPICE simulation setup: 65nm technology, 4mm range, 67Gbps data rate RLC modeled channel (using Raphael-like three-dimensional field solver) Adaptive control is turned on only for low frequencies

50 Async. Seminar, Technion, April 18, 2010 50 Channel Receiver Amplifier

51 Async. Seminar, Technion, April 18, 2010 51 Voltage Mode Drawbacks Voltage signal is degraded by high-capacitance of long interconnects Fast full-swing voltage transitions result in high dynamic currents High power Cross-talk noise Current signaling offers a way of avoiding high voltage swings Standard “Current-mode” techniques still measure voltage at the receiver side

52 Async. Seminar, Technion, April 18, 2010 52 Current Mode: Goals Full Current Mode: Send current from TX Measure current at RX Minimal voltage swing over channel Current to voltage conversion at RX Long range repeater-less channel Targeted data cycle 15 ps (67 Gbps) @ 65nm Low power

53 Async. Seminar, Technion, April 18, 2010 53 Current Mode: Full Current-Mode Circuit I RX I TX I TX > I RX I TX > I S V q_swing =I S  R I RX +I S Output Stage FS

54 Async. Seminar, Technion, April 18, 2010 54 Current Mode: Fast Feedback

55 Async. Seminar, Technion, April 18, 2010 55 Current Mode: Adaptive Control OFF During Fast Operation ON During Slow Operation

56 Async. Seminar, Technion, April 18, 2010 56 Current Mode: SPICE Simulations 7 mm link (note almost twice longer than VM) Full RLC model for interconnect Interconnect: 2 signal wires + 2 shields. 2  m partitioned wires, 5  m spacing

57 Async. Seminar, Technion, April 18, 2010 57 Current Mode: Simulations (1) Input Data RLX Currents TX Currents

58 Async. Seminar, Technion, April 18, 2010 58 Current Mode: Simulations (2) Channel Voltage Swings Input: 150mV, Output: 50mV RX Currents: 1mA RX Output Voltage: 200mV

59 Async. Seminar, Technion, April 18, 2010 59 Current Mode: Power Total Output StageRX-TX 7 mm Link, Random Pattern, 67Gbps-max 20.010.39.7I AV dynamic (mA) 0.120.070.06I leakage (mA) 23.713.759.94I peak (mA) 24.012.311.7Dynamic Power (mW, Vdd=1.2V) 0.150.080.07Leakage Power (mW, Vdd=1.2V) 16.5011.93Peak Power (mW) 4.92.52.4Total Power with 20% Utilization (mW) 35-- Total Link Power with 20% Utilization (mW) Including SerDes and two Current-Model Links

60 Async. Seminar, Technion, April 18, 2010 60 Voltage Mode vs. Current Mode 16-bit word 67Gbps serial link Maximal repeater-less range: Voltage mode: 4mm Current mode: 7mm Dynamic power (w/o SerDes circuits, 100% util.): 4mm Voltage mode: 18mW 7mm Current mode: 24mW 33% increase in power for 75% increase in length

61 Async. Seminar, Technion, April 18, 2010 61 In-Die Variations Splitter architecture High-speed operation localized to input and output stages High-speed components design and verification Monte-Carlo simulations (>5  ) 26 PVT Corners Iterative design with legging and sizing for sensitive transistors Asynchronous structure Supports any slow down Minimal time separation between successive bits must be provided!

62 Async. Seminar, Technion, April 18, 2010 62 Serial Link Performance Limitations Skew between subsequent transitions Channel deviation (e.g. short pulses degradation) Systematic On-Die Variations (TX vs. RX)

63 Async. Seminar FOX Test Chip

64 Async. Seminar, Technion, April 18, 2010 64 Proof of concept of the serial link and its components on a real process Performance mapping: Operation mode (16-bit/64-bit words) Speed (ability to change the link speed in range 0.1-30 Gbps) Link length (0-7mm) Power Channel layout geometry Metal layer Noise rejection

65 Async. Seminar, Technion, April 18, 2010 65

66 Async. Seminar, Technion, April 18, 2010 66 IBM 65nm 10 LPe process 8 Metal layers Expecting 30 Gbps throughput over single link 1 / (30ps gate delay)

67 Async. Seminar, Technion, April 18, 2010 67 Digital Controler Testing circuits TX RX FOX chip structure Power grid

68 Async. Seminar, Technion, April 18, 2010 68 Conclusions Novel high-speed serial links outperform parallel links for long range communication The serial link is more attractive for shorter ranges in future technologies Future large SoCs and NoCs should employ serial links to mitigate: Area Routing Congestion Power Latency

69 Async. Seminar, Technion, April 18, 2010 69 Publications 1.R. Dobkin, R. Ginosar, C. Sotiriou, "Data Synchronization Issues in GALS SoCs," Proc. of ASYNC, pp. 170-179, 2004. 2.R. Dobkin, V. Vishnyakov, E. Friedman and R. Ginosar, "An Asynchronous Router for Multiple Service Levels Networks on Chip," Proc. of ASYNC, pp. 44-53, 2005. 3.R. Dobkin, R. Ginosar, A. Kolodny, "High-Speed Serial Interconnect for NoC," NoC Workshop, DATE, 2006. 4.R. Dobkin, R. Ginosar, A. Kolodny, "Fast Asynchronous Shift Register for Bit-Serial Communication," Proc. of ASYNC, pp. 117-126, 2006. 5.R. Dobkin, R. Ginosar and C. P. Sotiriou, "High Rate Data Synchronization in GALS SoCs," IEEE Transactions on VLSI, 14(10), pp. 1063-1074, 2006. 6.R. Dobkin, Y. Perelman, T. Liran, R. Ginosar, A. Kolodny, "High Rate Wave-Pipelined Asynchronous On-Chip Bit-Serial Data Link," Proc. of ASYNC, pp. 3-14, 2007. 7.R. Dobkin, R. Ginosar, I. Cidon, "QNoC Asynchronous Router with Dynamic Virtual Channel Allocation," First ACM/IEEE Int. Symp. on Networks on Chip (NOCS), p. 218, 2007. 8.R. Dobkin, R. Ginosar, A. Kolodny, "QNoC Asynchronous Router", VLSI Integration Journal, 2008, doi:10.1016/j.vlsi.2008.03.001 9.R. Dobkin, A. Morgenshtein, A. Kolodny, R. Ginosar, "Parallel vs. Serial On-Chip Communication," SLIP, pp. 43-50, 2008. 10.R. Dobkin, R. Ginosar, "Fast Universal Synchronizers," PATMOS, 2008. 11.R. Dobkin, T. Kapshitz, S. Flur, R. Ginosar, "Assertion Based Functional Verification of Multiple-Clock GALS Systems," VLSI-SOC, 2008. 12.R. Dobkin, R. Ginosar, "Two Phase Synchronization with Sub-cycle Latency," In press, VLSI Integration Journal, 2008.

70 Async. Seminar Future Research

71 Async. Seminar, Technion, April 18, 2010 71 Future SoC: Three-Level Communication

72 Async. Seminar, Technion, April 18, 2010 72 Future SoC (Zoom In)

73 Async. Seminar, Technion, April 18, 2010 73 Future Research (1) – Networking Design a 2-level NOC with power, latency and bandwidth parameters for each link Create algorithms or analysis that help decide for each route whether to send the message on the highway (part of the way) or only using the slow parallel links Consider a ring architecture, where each high speed message can either pass-thru or be "dropped" to slow NoC, similar to optical ring switches Analyze N-level NoC where each communication level has its own speed tag Combine / Compare to QNoC approach Is there any benefit in sending part of the packet through high- way and the other part through slow parallel links? Broadcast / Multicast / Backpressure (Tokens) through highway?

74 Async. Seminar, Technion, April 18, 2010 74 Future Research (2) – Architecture Design a 5-port router for highway links: Does each router need to parallelize each incoming word before it can be serialized again and sent on the next link? Maybe create a switch using X-latches? Consider wave-pipelining insides the router. Design parallel NoC link/NoC architecture with wave pipelining Rely on guaranteed latency credit mechanism Design upper level architecture for the serial link Word level acknowledge technique What should be done for NACK? Encoding for error correction? Retransmit? Out of order word (flit) handling? Design Low-Power N-bit high-speed shift-register Explore the linear power reduction of Splitter architecture

75 Async. Seminar, Technion, April 18, 2010 75 Future Research (3) – Circuit High-speed Current-Mode circuits New approach for lower supply voltage / longer links Improve Output Stage stability, resiliency to variations, power Analyze trade-off power/speed for CM-based links Design an efficient power down/up technique for CM Switch off the link when there is no data Turn on the link quickly when the data is ready Adopt the idea of CM for off-chip links Design a new circuit matching off-chip requirements Compare to known approaches Further investigation of asynchronous (variable rate) signaling Explore interconnect architectures for CM signaling Layout, shielding, return current path, noise

76 Async. Seminar, Technion, April 18, 2010 76 The End Thank you


Download ppt "Async. Seminar FOX Fast On-Chip Interconnect Reuven Dobkin Technion – Israel Institute of Technology Electrical Engineering Department – VLSI Lab April."

Similar presentations


Ads by Google