Exploiting Crosstalk to Speed up On-chip Buses
Chunjie Duan (Ericsson Wireless, Boulder), Sunil P Khatri (University of Colorado, Boulder)



Outline
Introduction
Classification of cross-talk types
The story so far: eliminating 3C and 4C sequences; eliminating 4C sequences
Eliminating 2C sequences
Eliminating 1C sequences
Experimental results
Conclusions

Introduction
Cross-talk trends verified with accurate 3-D capacitance extraction.
Delay variation of 2.47:1 (200 μm wires, 10X drivers, 0.1 μm technology).
[Figure: wire cross-section in a deep sub-micron process (dimensions s, t, w) with aggressor (a) and victim (v) wires, labelled with the capacitances C_I and C_L.]

Cross-talk vs Bus Data Pattern
When λ ~ 0.1 μm, r = C_I / C_L ~ 10 (metal 4).
Effective total capacitance depends on the bus data sequence: best case 0 x C_I, worst case 4 x C_I.
[Figure: example bus transitions labelled with C_total = 0·C_I, 2·C_I and 4·C_I.]

Classification of Cross-talk
Sequences are classified as 4·C, 3·C, 2·C, 1·C or 0·C. [Figure: one example transition per class.]
Forbidden patterns: 010 and 101.
The maximum bus data rate depends on the total capacitance seen by any bit.
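The classification can be sketched in a few lines. This is a minimal sketch under an assumed (but common) coupling model, not the authors' code: a switching wire sees |d_j - d_k| · C_I from each neighbour k, where d is +1 for a rising wire, -1 for a falling wire and 0 for a quiet wire. The function names `crosstalk_class` and `worst_class` are hypothetical, and edge wires are assumed to have a single neighbour.

```python
def crosstalk_class(prev, curr, j):
    """Return k such that wire j sees k*C_I of coupling during prev -> curr.

    prev and curr are equal-length strings of '0'/'1'. Returns None when
    wire j does not switch, since there is then no delay to classify.
    Assumed model: each neighbour contributes |d_j - d_k|.
    """
    delta = [int(c) - int(p) for p, c in zip(prev, curr)]  # +1 rise, -1 fall, 0 quiet
    if delta[j] == 0:
        return None
    neighbours = [k for k in (j - 1, j + 1) if 0 <= k < len(delta)]
    return sum(abs(delta[j] - delta[k]) for k in neighbours)


def worst_class(prev, curr):
    """Largest class seen by any switching wire (0 if nothing switches)."""
    classes = [crosstalk_class(prev, curr, j) for j in range(len(prev))]
    return max((c for c in classes if c is not None), default=0)


print(worst_class("010", "101"))  # both neighbours switch against the middle wire
print(worst_class("010", "100"))  # one neighbour opposite, one quiet
print(worst_class("000", "111"))  # all wires switch together
```

Under this model the two forbidden patterns are exactly the two ends of the worst-case transition 010 <-> 101.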

Previous work: Eliminating 3C & 4C Sequences
Simple approach: shielding. No 3C/4C sequences, but the bus width is doubled.
Theorem: if no forbidden patterns are allowed on the bus, no 3C or 4C sequences occur. Proof: see "Analysis and Avoidance of Cross-talk in Buses", Duan, Tirumala, Khatri (Hot Interconnects, August 2001).
So we simply encode the data on the bus to get rid of the forbidden patterns.
Recurrence equation for asymptotic bus overhead.
CODEC implementation to demonstrate practicality.
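The link between forbidden patterns and 3C/4C transitions can be checked by brute force on a small bus. The sketch below is illustrative only. It assumes the coupling model in which a switching wire sees |d_j - d_k| · C_I from each neighbour, and all function names are hypothetical. It enumerates every n-bit word that contains neither 010 nor 101, then reports the worst class over all pairs of such words.

```python
from itertools import product

FORBIDDEN = ("010", "101")


def is_fp_free(word):
    """True if the word contains no forbidden pattern."""
    return not any(p in word for p in FORBIDDEN)


def worst_class(prev, curr):
    """Largest coupling multiple seen by any switching wire (assumed model)."""
    delta = [int(c) - int(p) for p, c in zip(prev, curr)]
    worst = 0
    for j, d in enumerate(delta):
        if d == 0:
            continue  # quiet wires have no delay to classify
        nbrs = [k for k in (j - 1, j + 1) if 0 <= k < len(delta)]
        worst = max(worst, sum(abs(d - delta[k]) for k in nbrs))
    return worst


def fp_free_codewords(n):
    words = ("".join(bits) for bits in product("01", repeat=n))
    return [w for w in words if is_fp_free(w)]


codes = fp_free_codewords(6)
print(len(codes), max(worst_class(a, b) for a in codes for b in codes))

# Number of forbidden-pattern-free words for bus widths 1..8.
print([len(fp_free_codewords(m)) for m in range(1, 9)])
```

In this enumeration each count is the sum of the two before it, which is one way to arrive at a recurrence for the overhead. Whether it is the same recurrence as on the slide cannot be confirmed from the transcript.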

Eliminating 3C & 4C sequences
44% asymptotic overhead.
A look-up table is straightforward and can achieve the minimum overhead (44%), but is not practical.
Our implementation: 62.5% overhead (higher than the minimum), modular and straightforward.
Break the bus into 4-bit groups.
Encode each group independently (4 bits -> 5 bits).
Additional logic handles forbidden patterns across group boundaries.
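The 44% figure and the 4-bit to 5-bit group size can be reproduced with a short calculation. This is a sketch under one assumption that does not come from the slide: the number T(m) of m-bit words without 010 or 101 satisfies T(m) = T(m-1) + T(m-2) with T(1) = 2 and T(2) = 4. The function names are hypothetical.

```python
import math


def fp_free_count(m):
    """Number of m-bit words with no 010/101 (assumed Fibonacci-type recurrence)."""
    a, b = 2, 4  # T(1), T(2)
    if m == 1:
        return a
    for _ in range(m - 2):
        a, b = b, a + b
    return b


def wires_needed(n):
    """Smallest encoded width m that offers at least 2**n legal words."""
    m = n
    while fp_free_count(m) < 2 ** n:
        m += 1
    return m


for n in (4, 8, 16, 32, 64):
    m = wires_needed(n)
    print(n, m, f"{(m - n) / n:.1%}")

# The counts grow like the golden ratio, so each wire carries log2(phi) bits.
phi = (1 + math.sqrt(5)) / 2
print(f"asymptotic overhead: {1 / math.log2(phi) - 1:.1%}")
```

This counts only how many legal words exist. It says nothing about how hard the encoder is to build, which is the practicality concern raised for the look-up table approach.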

Previous Work: Eliminating 4C sequences
Less aggressive: eliminate 4C sequences only.
Less overhead (33%).
Simpler algorithm:
Divide the bus into 3-bit groups.
When a 4C sequence occurs, complement the group data.
Insert a group complement indicator.
Special handling for across-group 4C sequences (see paper for details).

CODEC Results
Compare waveforms with and without coding, using a random input sequence.
[Figure: block diagram with random sequence, encoder, driver, receiver, decoder and recovered sequence.]
Encoder/decoder delay ~250 ps (memoryless).
Max data rate more than 2X that of a scheme with no encoding.
Speedup is independent of the data pattern.

CODEC Results (2)
Bus length: 5 mm, 10 mm or 20 mm.
Driver strength: 30X, 60X and 120X of minimum.

Further Speedup Possible?
Can we exploit crosstalk to further speed up the bus? Two options: eliminate 2C sequences, or eliminate 1C sequences.
Simulation shows that eliminating 2C sequences results in a speedup of 2X to 4X over eliminating 3C/4C sequences.
Note that we seek memoryless CODEC-based techniques.
Let's look at eliminating 2C and 1C sequences next.

Eliminating 2C sequences
How to guarantee a 2C-free sequence? Find a vector clique such that any pair of elements in the clique exhibits only 1C transitions between them.
For an n-bit bus, we need a k-bit encoded bus (k > n) such that the new bus has a 2C-free clique of cardinality greater than or equal to 2^n.
The solution is memoryless (no need to remember the last transmitted word).
Fast and simple CODEC implementation.
We have an inductive method to construct 2C-free cliques.
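The clique condition is easy to test for a candidate set of codewords. The sketch below is an illustration, not the authors' method. It assumes that a switching wire sees |d_j - d_k| · C_I from each neighbour and that the two edge wires have a single neighbour. All function names are hypothetical. The exhaustive search is only practical for very small k, which is why an inductive construction is attractive.

```python
from itertools import combinations, product


def worst_class(prev, curr):
    """Largest coupling multiple seen by any switching wire (assumed model)."""
    delta = [int(c) - int(p) for p, c in zip(prev, curr)]
    worst = 0
    for j, d in enumerate(delta):
        if d == 0:
            continue
        nbrs = [k for k in (j - 1, j + 1) if 0 <= k < len(delta)]
        worst = max(worst, sum(abs(d - delta[k]) for k in nbrs))
    return worst


def is_2c_free_clique(vectors):
    """True if every pair of vectors is at most a 1C transition apart."""
    return all(worst_class(a, b) <= 1 for a, b in combinations(vectors, 2))


def largest_clique(k):
    """Exhaustive search over all subsets of k-bit words, largest first."""
    words = ["".join(bits) for bits in product("01", repeat=k)]
    for size in range(len(words), 0, -1):
        for subset in combinations(words, size):
            if is_2c_free_clique(subset):
                return list(subset)
    return []


for k in (2, 3, 4):
    clique = largest_clique(k)
    print(k, len(clique), clique)
```

Because any word of the clique may follow any other, the encoder never needs to know what was sent last, which is the memoryless property the slide asks for.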

Constructing 2C free Cliques
Inductive method: extends a known clique C_n = {v}. Let v = v_1 ... v_n.
First set C_{n+1} = {}, and C_{n+1} <= C_{n+1} U v.
Definition: the 0-extended subset of C_{n+1} is: [equation missing]
Definition: the 1-extended subset of C_{n+1} is: [equation missing]
Constructing the first subset: create a new vector and add it unless there exists a vector in S_1 such that: [conditions missing]
Constructing the second subset: similar to the first.
Finally: [selection equation missing]
Theorem: both sets of the previous step are 2C-free cliques. Proof: see paper.

Constructing 2C free Cliques (2)
Some observations about the construction:
(a) Vectors ending with 01 and 10 cannot co-exist in C_n.
(b) The first n bits of any vector of C_{n+1} are the same as some vector of C_n, and the last two bits are 00 or 11. In other words, C_{n+1} is at least as large as C_n.
(c) Because of (a), we know that vectors ending with 011 and 100 cannot be in the same clique C_{n+1}.
(d) So we can construct vectors of C_{n+1} ending in 001 or 110 by appending 1 to vectors ending with 00 or appending 0 to vectors ending with 11. However, we cannot have both.

Constructing 2C free Cliques (3)
Consider the construction of C_4 from C_3. [Figure: worked example.]
A quadratic number of tests is required as described above. We can do better.

Clique Extension Algorithm
Constructing C_{n+1} from C_n using the 0-extended subset (a similar algorithm applies when we use the 1-extended subset):
Append 0 to n-bit vectors ending with 0.
Append 1 to n-bit vectors ending with 1.
Since we use the 0-extended subset of C_{n+1}:
If there is no n-bit vector ending with 01, append 1 to vectors ending with 00.
If there is no n-bit vector ending with 11, append 1 to vectors ending with 10.
The new clique has no vectors ending with 10.
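One reading of the 0-extended steps is sketched below. It follows the wording of the slide, but the slide's formal definitions are missing from the transcript, so the details are an interpretation. The function names are hypothetical, and the checker assumes the coupling model in which a switching wire sees |d_j - d_k| · C_I from each neighbour. The starting clique `c3` is a hand-picked example, not one taken from the slides.

```python
from itertools import combinations


def worst_class(prev, curr):
    """Largest coupling multiple seen by any switching wire (assumed model)."""
    delta = [int(c) - int(p) for p, c in zip(prev, curr)]
    worst = 0
    for j, d in enumerate(delta):
        if d == 0:
            continue
        nbrs = [k for k in (j - 1, j + 1) if 0 <= k < len(delta)]
        worst = max(worst, sum(abs(d - delta[k]) for k in nbrs))
    return worst


def is_2c_free_clique(vectors):
    return all(worst_class(a, b) <= 1 for a, b in combinations(vectors, 2))


def extend_clique_0(clique):
    """Build an (n+1)-bit clique from an n-bit clique, 0-extended variant."""
    # Repeat the last bit: ...0 -> ...00 and ...1 -> ...11.
    new = {v + v[-1] for v in clique}
    # Extra vectors ...001 are safe only if nothing ends with 01.
    if not any(v.endswith("01") for v in clique):
        new |= {v + "1" for v in clique if v.endswith("00")}
    # Extra vectors ...101 are safe only if nothing ends with 11.
    if not any(v.endswith("11") for v in clique):
        new |= {v + "1" for v in clique if v.endswith("10")}
    return sorted(new)


c3 = ["000", "011", "110", "111"]  # hand-picked 3-bit 2C-free clique
c4 = extend_clique_0(c3)
print(c4, is_2c_free_clique(c4))
```

Each step only inspects vector endings, so it avoids the pairwise tests of the direct construction.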

Clique Extension Algorithm (2)
Simply perform both versions of the clique extension algorithm.
Select the result according to the rule: [equation missing]
Some values of clique sizes: [table of N versus clique size missing]
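Running both versions and keeping one result can be sketched as follows. Several things here are assumptions rather than the authors' definitions: the selection rule is taken to be "keep the larger set", the 1-extended version is taken to be the bitwise mirror of the 0-extended one, and the starting clique {00, 01, 11} is hand-picked. The function names are hypothetical. The sizes and overheads it prints belong to this sketch only and are not the values from the slide's table.

```python
def complement(v):
    """Swap 0s and 1s in a bit string."""
    return v.translate(str.maketrans("01", "10"))


def extend_0(clique):
    """0-extended step: repeat the last bit, then add ...1 where allowed."""
    new = {v + v[-1] for v in clique}
    if not any(v.endswith("01") for v in clique):
        new |= {v + "1" for v in clique if v.endswith("00")}
    if not any(v.endswith("11") for v in clique):
        new |= {v + "1" for v in clique if v.endswith("10")}
    return sorted(new)


def extend_1(clique):
    """1-extended step, obtained by symmetry from the 0-extended step."""
    mirrored = extend_0([complement(v) for v in clique])
    return sorted(complement(v) for v in mirrored)


def next_clique(clique):
    """Assumed selection rule: keep whichever extension is larger."""
    a, b = extend_0(clique), extend_1(clique)
    return a if len(a) >= len(b) else b


def clique_sizes(start, steps):
    sizes, clique = [len(start)], list(start)
    for _ in range(steps):
        clique = next_clique(clique)
        sizes.append(len(clique))
    return sizes


def wires_needed(n, start=("00", "01", "11")):
    """Smallest k (by this construction) whose clique holds 2**n words."""
    clique = list(start)
    while len(clique) < 2 ** n:
        clique = next_clique(clique)
    return len(clique[0])


print(clique_sizes(["00", "01", "11"], 4))
for n in (2, 4, 8):
    k = wires_needed(n)
    print(n, k, f"{(k - n) / n:.0%}")
```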

Area Overhead Trends
Asymptotic overhead is 146%.
Overhead is lower for smaller bus sizes, which suggests partitioning the bus into smaller sections.

1C free Configurations
1C-free sequences have the least delay (typically 50% of 2C-free sequences).
Just send any data bit multiple times (3, 5, ...).
No encoder/decoder needed (no extra CODEC delay).
Simulation shows it is the fastest compared to other techniques with similar area overhead:
3x (or 5x) separation between wires.
Widening the trace (3x): smaller R, bigger C.
[Figure: wire groups labelled A, B, C.]
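Sending each bit on several adjacent wires can be illustrated with the same coupling model used for classification. The sketch is idealised and rests on assumptions: a switching wire sees |d_j - d_k| · C_I from each neighbour, all copies of a bit switch at exactly the same time, and the receiver samples the middle wire of each group. The function names are hypothetical.

```python
def replicate(word, times=3):
    """Drive every data bit onto `times` adjacent wires."""
    return "".join(bit * times for bit in word)


def crosstalk_class(prev, curr, j):
    """Coupling multiple seen by wire j, or None if it does not switch."""
    delta = [int(c) - int(p) for p, c in zip(prev, curr)]
    if delta[j] == 0:
        return None
    nbrs = [k for k in (j - 1, j + 1) if 0 <= k < len(delta)]
    return sum(abs(delta[j] - delta[k]) for k in nbrs)


# Worst-case data transition for an unencoded 3-bit bus.
prev, curr = replicate("010"), replicate("101")
middle_wires = range(1, len(prev), 3)
outer_wires = [j for j in range(len(prev)) if j % 3 != 1]

print([crosstalk_class(prev, curr, j) for j in middle_wires])
print(max(crosstalk_class(prev, curr, j) for j in outer_wires))
```

The outer copies absorb the coupling to the neighbouring groups, so they behave as shields that switch together with the sampled wire.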

Bus configurations for 1C delay
We simulated the delay of several different bus configurations. Different configurations yield different delay and area trade-offs.
A: 3-wire group, fixed spacing within group, variable spacing between groups.
B: similar to A but with a ground shield between groups.
C: no shielding wires; vary wire sizes and spacing.
D: 5-wire group, fixed spacing within group, variable spacing between groups; largest overhead.

1C free Configurations
Circuit parameters are extracted using SPACE3D.
Bus simulations: CODEC was not modeled; Spice3f5, 0.1 μm BPTM model; transmission line with inter-wire coupling.
Quantify the actual delay of 1C-free bus vector sequences for the 4 configurations described.
20 mm wire, 30X driver (IDEAL 1C-free delay 153 ps, 3C-free delay 793 ps).

Delays for 1C free Configurations
Configuration C has a significantly larger delay than the others (3X), since it is essentially a 3C-free configuration (it has no shielding).
All other configurations show up to 2.5X speedup over a 3C-free bus.
For all configurations, the actual delays are larger than the IDEAL 0C delay.
This is caused by skew on the outer shielding wires: transitions of the dynamic shields of any wire are slightly misaligned.
Verified by intentionally skewing the delay on signals.

Conclusions
Inter-wire capacitance is increasingly significant for DSM VLSI bus delays.
We have developed an array of CODECs to trade off bus area overhead against delay: 4C free = 33%; 3C free = 62%; 2C free = 146% (asymptotic), up to 4X to 6X faster.
Inductive algorithm for 2C-free clique construction.
Simulated several 1C-free configurations for area overhead and delay (no CODECs).
1C-free techniques are not as fast as expected.

Thank You!