Guihai Yan, Yinhe Han, Xiaowei Li, and Hui Liu

BAT: Performance-Driven Crosstalk Mitigation Based on Bus-grouping Asynchronous Transmission Guihai Yan, Yinhe Han, Xiaowei Li, and Hui Liu Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences

Outline Introduction Proposed BAT Scheme Implementation of BAT Experimental Results Conclusions 2019/1/2

Introduction Technology improvement: lower voltage, higher frequency, higher transistor density, smaller feature size. Q: What are the implications for bus wires?

Introduction This kind of capacitance (inter-wire coupling) is dominant

Introduction Crosstalk [Figure: adjacent wires Y and Z, 0.25um, length = 100um, fan-out = 2] Speed: 1.8x slower

Crosstalk Factor [P.P. Sotiriadis et al, 2001] Crosstalk Insensitive vs. Crosstalk Sensitive

Introduction As technology advances, the impact of crosstalk gets worse! The wire aspect ratio gets larger. The bus gets wider. Bus transmission is increasingly likely to suffer severe crosstalk delay. Q: How can the effect of crosstalk delay on bus transmission be alleviated?

Conventional approaches Codec [B. Victor et al, 01] [P.P. Sotiriadis et al, 01] Pros: relatively low bandwidth overhead, though still at least 47%. Cons: the codec algorithm is hard to construct for large bus widths. Shield: passive shield, active shield [H.Kaul et al, 02] [R.Arunachalam et al, 03] Pros: high performance. Cons: area-hungry, usually 100% area overhead.

Several new approaches Delay-line bus [M. Ghoneima et al, 04] Pros: nearly zero bandwidth overhead. Cons: very complicated synchronization; lack of scalability. Variable cycle transmission [L. Li et al, 04] Pros: low area overhead; high performance for relatively narrow buses. Cons: due to the “Cask Effect” (the slowest wire sets the cycle count for the whole pattern), it is likely to fail for wide buses (64-bit width or more).

Variable cycle transmission (DYN) Consider a transition between two patterns. If the two patterns are {0 1 0 0 1 0 0 1} → {0 0 1 1 0 1 0 0}: Transition: {–, ↓, ↑, ↑, ↓, ↑, –, ↓} Delay vector: {1, 3, 2, 2, 4, 3, 1, 1}. If the two patterns are {0 1 0 1 1 0 0 1} → {0 0 1 1 0 1 0 0}: Transition: {–, ↓, ↑, –, ↓, ↑, –, ↓} Delay vector: {1, 3, 3, 1, 3, 3, 1, 1}. Q: What if the bus gets wider? The probability that a “4” appears somewhere in the delay vector gets higher, which makes DYN less efficient!
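The two delay vectors on this slide can be reproduced with a small sketch. The rule used here is an assumption inferred from the slide's numbers, not a formula stated in the deck: a quiet wire costs 1 cycle, and a switching wire costs the sum of |own transition - neighbour transition| over its existing neighbours, floored at 1. The function names are illustrative.

```python
def transitions(prev, curr):
    """Per-wire transition: +1 rising, -1 falling, 0 quiet."""
    return [c - p for p, c in zip(prev, curr)]


def delay_vector(prev, curr):
    """Per-wire delay in cycles under the assumed crosstalk rule."""
    d = transitions(prev, curr)
    delays = []
    for i, di in enumerate(d):
        if di == 0:
            delays.append(1)  # quiet wire: assumed minimum of 1 cycle
            continue
        cf = 0
        if i > 0:
            cf += abs(di - d[i - 1])  # coupling to left neighbour
        if i < len(d) - 1:
            cf += abs(di - d[i + 1])  # coupling to right neighbour
        delays.append(max(1, cf))     # assumed floor of 1 cycle
    return delays


def dyn_cycles(prev, curr):
    """DYN sends the whole pattern at the pace of its slowest wire."""
    return max(delay_vector(prev, curr))


first = delay_vector([0, 1, 0, 0, 1, 0, 0, 1], [0, 0, 1, 1, 0, 1, 0, 0])
second = delay_vector([0, 1, 0, 1, 1, 0, 0, 1], [0, 0, 1, 1, 0, 1, 0, 0])
print(first, max(first))    # [1, 3, 2, 2, 4, 3, 1, 1] 4
print(second, max(second))  # [1, 3, 3, 1, 3, 3, 1, 1] 3
```

Because the pattern cost is the maximum over all wires, a single "4" anywhere on the bus forces 4 cycles, which is why the chance of the worst case grows with bus width.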

Proposed BAT scheme We propose the BAT scheme, which extends the Variable Cycle Transmission (DYN) scheme. What is BAT?

BAT scheme Whole bus: all transitions are crosstalk sensitive. Grouped bus: not all sub-transitions are crosstalk sensitive, so the sub-buses can transmit asynchronously.

BAT How should the bus be grouped into sub-buses? Q: Which grouping is best? It depends on the crosstalk factor distribution (i.e., it is application-specific)!
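A minimal sketch of why grouping helps, reusing the first delay vector from the DYN slide. It assumes each sub-bus is paced only by its own slowest wire and ignores any effect the grouping lines have on wires at group boundaries; `group_cycles` and the index ranges are illustrative, not taken from the deck.

```python
def group_cycles(delays, groups):
    """Cycles needed by each sub-bus; groups are half-open (start, end) ranges."""
    return [max(delays[start:end]) for start, end in groups]


delays = [1, 3, 2, 2, 4, 3, 1, 1]  # delay vector from the DYN slide

whole = group_cycles(delays, [(0, 8)])                    # ungrouped (DYN)
equal = group_cycles(delays, [(0, 4), (4, 8)])            # two equal groups
unequal = group_cycles(delays, [(0, 4), (4, 6), (6, 8)])  # unequal groups

print(whole)    # [4]
print(equal)    # [3, 4]
print(unequal)  # [3, 4, 1]
```

Only the sub-bus that contains the worst-case wire pays 4 cycles; the others finish early and can move on to their next sub-pattern. Which split works best depends on where the high delays tend to fall.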

Crosstalk Factor Distribution Instruction bus vs. data bus. Grouping according to CF locality: unequal grouping vs. equal grouping

Implementation of BAT Grouping line Valid-indicating line

Differential Counter Cluster Synchronizing Mechanism C(i, j) is a bi-directional counter with range –L ~ +L (L: buffer length). ‘OF’ is short for ‘OverFlow’; ‘UF’ is short for ‘UnderFlow’; ‘+’ means logical OR. Hold the i-th sub-bus if and only if {OF(C(i, 1)) + OF(C(i, 2)) + · · · + OF(C(i, i−1)) + OF(C(i, i+1)) + · · · + OF(C(i, n))} is true. Hold the j-th sub-bus if and only if {UF(C(1, j)) + UF(C(2, j)) + · · · + UF(C(j−1, j)) + UF(C(j+1, j)) + · · · + UF(C(n, j))} is true.
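The hold conditions can be sketched in software as follows. Several points are assumptions, since the slide does not define them: C(i, j) is taken to be the number of patterns delivered by sub-bus i minus the number delivered by sub-bus j, one counter is kept per ordered pair, and overflow/underflow mean the counter has reached +L or –L. The class and method names are hypothetical.

```python
from itertools import permutations


class DifferentialCounterCluster:
    """Pairwise lead/lag counters between sub-buses (assumed semantics)."""

    def __init__(self, n_groups, buffer_len):
        self.n = n_groups
        self.L = buffer_len
        # One counter per ordered pair (i, j), i != j.
        self.c = {pair: 0 for pair in permutations(range(n_groups), 2)}

    def overflow(self, i, j):
        return self.c[(i, j)] >= self.L   # assumed: i leads j by L patterns

    def underflow(self, i, j):
        return self.c[(i, j)] <= -self.L  # assumed: i lags j by L patterns

    def hold(self, k):
        """Slide's rule: OR of OF(C(k, m)), or OR of UF(C(m, k)), over m != k."""
        others = [m for m in range(self.n) if m != k]
        return (any(self.overflow(k, m) for m in others)
                or any(self.underflow(m, k) for m in others))

    def record_transfer(self, k):
        """Sub-bus k delivers one pattern; refused while k is held."""
        if self.hold(k):
            raise ValueError(f"sub-bus {k} is held")
        for m in range(self.n):
            if m != k:
                self.c[(k, m)] += 1
                self.c[(m, k)] -= 1


dcc = DifferentialCounterCluster(n_groups=3, buffer_len=2)
dcc.record_transfer(0)
dcc.record_transfer(0)
print(dcc.hold(0), dcc.hold(1))  # True False: sub-bus 0 is 2 patterns ahead
dcc.record_transfer(1)
dcc.record_transfer(2)
print(dcc.hold(0))               # False: the others have caught up by one
```

With counters kept for every ordered pair, the OF and UF conditions describe the same event from two sides; a hardware cluster that stores only one counter per unordered pair would need both.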

DAS Scheme (merge the grouping lines and valid-indicating lines) Delay Line Active Shield Simultaneous switch Delayed switch

DAS Scheme Delay Active Shield Skew: T/2 Reuse the data-valid indicating line as the grouping line to reduce wire overhead

Experiment SimpleScalar 3.0, SPEC CPU2000 benchmarks. On-chip buses: instruction bus (instruction buffer to L1 I$) and data bus (datapath to L1 D$). Compared against: ORI (original conservative approach), 4 cycle/pattern; DYN (variable cycle transmission), 1~4 cycle/pattern; CDC (codec approaches), 2 cycle/pattern

Results /1 BAT applied to a 64-bit instruction bus. Compared with the ORI, DYN and Codec approaches, the average performance improvement of the BAT scheme with the 4-group configuration is 55.3%, 30.4% and 10.5%, respectively.

Results /2 BAT applied to 32-bit data bus Compared with DYN scheme, we still gain 12.5% performance improvement on average with 4-Group configuration.

Overhead Analysis /1 Wire routing overhead and avg. cycle/pattern (64-bit bus, 4-group configuration)
Approach  Normalized area  Avg. cycle/pattern
ORI       100              4
DYN       103              2.57
CDC       145              2
PSD       199
ASD       201              1
BAT       113              1.79
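As a cross-check, the improvements reported on the Results /1 slide appear to follow from the average cycle/pattern figures in this table, if improvement is read as the relative reduction in cycles per pattern. That reading is an assumption, and `improvement` is an illustrative helper.

```python
def improvement(baseline_cycles, new_cycles):
    """Relative reduction in cycles per pattern, in percent."""
    return 100.0 * (baseline_cycles - new_cycles) / baseline_cycles


BAT_CYCLES = 1.79  # avg. cycle/pattern for BAT, from the table

for name, cycles in [("ORI", 4.0), ("DYN", 2.57), ("Codec", 2.0)]:
    print(f"BAT vs {name}: {improvement(cycles, BAT_CYCLES):.2f}%")
```

The results agree with the reported 55.3%, 30.4% and 10.5% to within rounding. The normalized area of 113 for BAT likewise corresponds to the 13% routing overhead quoted in the conclusions.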

Overhead Analysis /2 How much buffer is sufficient to synchronize data reception? Buffer size vs. avg. cycle/pattern: a 4- to 8-word buffer is optimal!

Conclusions Proposed the Bus-grouping Asynchronous Transmission (BAT) scheme. Optimized BAT using the locality of the crosstalk factor (CF) distribution. Proposed the DCC synchronizing mechanism and the DAS scheme. Improved performance by 30+% and 10+% compared with the DYN and Codec approaches, respectively, at the cost of 13% routing overhead when applied to a 64-bit bus.

Thanks for your attention!