Architecture-Level Synthesis for Automatic Interconnect Pipelining


Jason Cong, Yiping Fan, Zhiru Zhang
VLSI CAD Lab, Computer Science Department, University of California, Los Angeles
{cong, fanyp, zhiruz}@cs.ucla.edu
Funded by GSRC, NSF, and Altera Corp.

Outline
Motivation
Our contributions
  RDR-Pipe micro-architecture: Regular Distributed Register micro-architecture with interconnect pipelining
  Synthesis flow and algorithms: MCAS-Pipe, automatic interconnect pipelining and sharing
Experimental results
Conclusions

Interconnect Bottleneck in Nanometer Designs
Challenge: single-cycle full-chip communication will no longer be possible, and multicycle communication is not supported by the current CAD toolset.
ITRS’01 0.07um technology: 5.63 GHz across-chip clock, 800 mm2 die (28.3mm x 28.3mm).
IPEM BIWS estimations: buffer size 100x, driver/receiver size 100x, semi-global layer (Tier 3).
A signal can travel up to 11.4mm in one cycle, so 5 clock cycles are needed from corner to corner.
[Figure: die map showing regions reachable in 1, 2, 3, 4, and 5 cycles, with distance marks at 11.4, 22.8, and 28.3 mm]
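The five-cycle figure can be checked with a short calculation. The sketch below is illustrative only: it assumes corner-to-corner wires follow Manhattan (horizontal plus vertical) routes, which the slide does not state, and the name `cycles_needed` is hypothetical.

```python
import math

DIE_SIDE_MM = 28.3          # die is 28.3mm x 28.3mm (from the slide)
REACH_PER_CYCLE_MM = 11.4   # distance a signal travels in one cycle (from the slide)

def cycles_needed(distance_mm, reach_mm=REACH_PER_CYCLE_MM):
    """Clock cycles a signal needs to cover distance_mm (at least one)."""
    return max(1, math.ceil(distance_mm / reach_mm))

# Assumption: corner-to-corner routing is Manhattan, so the wire is two die sides long.
corner_to_corner_mm = 2 * DIE_SIDE_MM
print(cycles_needed(corner_to_corner_mm))
```

Under the Manhattan assumption the wire is 56.6mm long, which is just under five times the 11.4mm single-cycle reach.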

Related Work
Retiming with placement or floorplanning
  Retiming + multilevel partitioning [Cong et al., ICCAD’00] and coarse placement [Cong et al., DAC’03]
  Retiming + floorplanning [Chong & Brayton, IWLS’01]
  Retiming + placement for FPGAs [Singh & Brown, FPGA’02]
Global wire pipelining in the Itanium(TM) processor [McInerney et al., ISPD’00]
Buffer and flip-flop insertion in RTL [Lu et al., DATE’02] [Cocchini, ICCAD’02]

Limitation during Logic/Physical Level to Explore Multicycle Communication
The minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST’94].
Example: a loop with 4 logic cells and 2 registers, cell delay = 1ns, interconnect delay = 1ns.
DR ratio = (Dlogic + Dint) / #Registers = (4 + 4) / 2 = 4ns, so clock period >= 4ns.
Interconnect pipelining by flip-flop insertion? It requires a considerable amount of manual rework on the original RTL descriptions.
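The DR-ratio bound can be written as a small calculation. This is a minimal sketch, assuming the example loop has four interconnect segments of 1ns each (which is what the (4 + 4) term implies); the function names are hypothetical.

```python
def dr_ratio(logic_delays_ns, interconnect_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total loop delay per register."""
    total_delay = sum(logic_delays_ns) + sum(interconnect_delays_ns)
    return total_delay / num_registers

def clock_period_lower_bound(loops):
    """The clock period cannot go below the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)

# The loop from the slide: 4 cells of 1ns, 2 registers.
# Assumption: 4 interconnect segments of 1ns each.
slide_loop = ([1.0] * 4, [1.0] * 4, 2)
print(clock_period_lower_bound([slide_loop]))  # 4.0
```

Because the bound depends only on the delay and register count inside each loop, retiming alone cannot push the clock period below it.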

Our Approach
Consideration of multicycle communication during architectural (or behavioral) synthesis [Cong et al., ISPD’03] [Cong et al., ICCAD’03]
Regular Distributed Register (RDR) micro-architecture
  Highly regular
  Direct support of multicycle on-chip communication
MCAS: Architectural Synthesis for Multi-cycle Communication
  Efficiently maps behavioral descriptions to the RDR micro-architecture
  Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
This work: extension of RDR and MCAS for interconnect pipelining


Regular Distributed Register Micro-Architecture
Distribute registers to each “island”.
Choose the island size such that local computation and communication in each island can be done in a single cycle.
Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication.
[Figure: RDR island structure. Labels: Island (Wi x Hi), Local Computational Cluster (LCC), ALU, MUL, MUX, FSM, Reg. file, Global Interconnect, 1 cycle, 2 cycles, ..., k cycles]

Wiring Overhead in RDR Designs
Example: ALU1 and MUL1 communicate over interconnects with a delay of 2 cycles, using sender registers and receiver registers r1 to r4.
Data transfers r1 -> r3 and r2 -> r4 are overlapped.
Two dedicated global wires are needed.
[Figure: schedule over cycles 1 to 5 showing the + and * operations on ALU1 and MUL1 and the two overlapping transfers]

Architectural Solution: RDR-Pipe
Keep the intra-island structures.
Add inter-island pipeline register stations (PRS) for global communications.
PRS performs autonomous store-and-forward.
Synchronous design.
No global control signal is needed for PRS.
[Figure: RDR-Pipe layout. Labels: islands 1 to 6, LCC, FSM, Reg. File, H channel, V channel, Pipeline Register Station (PRS)]

Reducing Wiring Overhead in RDR-Pipe
Example: 2-cycle communication between ALU1 and MUL1, with registers r1 to r4.
Data transfers are pipelined.
One wire with a pipeline register is enough.
[Figure: schedule over cycles 1 to 5 showing sender registers, receiver registers, and the pipeline register]
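The difference between dedicated and pipelined wires can be illustrated by counting how many transfers hold a wire at the same time. This is a rough sketch under stated assumptions: the issue cycles 2 and 3 are hypothetical (the slides only say the two transfers overlap), a non-pipelined wire is assumed busy for the full 2-cycle delay, and a pipelined wire is assumed busy for one cycle per transfer.

```python
def wires_needed(issue_cycles, occupancy):
    """Peak number of transfers occupying a wire in the same cycle.

    A transfer issued at cycle t occupies its wire during cycles
    t, t+1, ..., t+occupancy-1. Since the occupancies are intervals,
    the peak overlap equals the number of wires required.
    """
    if not issue_cycles:
        return 0
    first = min(issue_cycles)
    last = max(issue_cycles) + occupancy
    return max(
        sum(1 for t in issue_cycles if t <= cycle < t + occupancy)
        for cycle in range(first, last)
    )

DELAY_CYCLES = 2
issue_cycles = [2, 3]  # hypothetical issue cycles for two overlapping transfers

dedicated = wires_needed(issue_cycles, occupancy=DELAY_CYCLES)  # wire busy for the whole delay
pipelined = wires_needed(issue_cycles, occupancy=1)             # wire busy for one cycle only
print(dedicated, pipelined)  # 2 1
```

With pipelining, the first value has already moved into the pipeline register when the second is issued, so the two transfers no longer compete for the wire.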

Synthesis Flow: MCAS-Pipe
Flow steps
  SystemC / VHDL -> CDFG generation -> CDFG
  Resource allocation & functional unit binding -> ICG
  Scheduling-driven placement -> locations
  Placement-driven rescheduling & rebinding
  Global interconnect sharing
  Register and port binding
  Datapath & FSM generation -> RTL VHDL & floorplan constraints
Global interconnect sharing
  After scheduling and functional unit binding
  Before register and port binding
  Enables multiple data communications to share a physical link (a wire with pipeline registers)
Advantages over MCAS
  Expected to reduce global wiring demand
  No multicycle path constraint needed

Global Interconnect Sharing
Setting: islands A and B with D = 2; producers pe, pg and consumers ce, cg.
Conflicted data transfers: two physical links are needed to support the concurrent data transfers.
Compatible data transfers: now the two producer registers can be merged, since their lifetimes become compatible. Only one physical link is required to support the scheduled data transfers.
[Figure: schedules over cycles 1 to 7 for the conflicted and compatible cases, showing sender registers, receiver registers, and pipeline registers]

Global Pipelined Interconnect Minimization
Definitions
  Data link: a pipelined global interconnect
  Channel: the set of data links between two islands
  Width of a channel: the number of its data links
  Data transfer: movement of data from a producer to a consumer
Architectural assumption: channels cannot share interconnects.
Theorem: global pipelined interconnects are minimized if and only if the width of every channel is minimized.

Transfer Scheduling for a Single Channel
A decision problem formulation
  Given: a channel (A, B) containing m data links, and a data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e is associated with an arrival time T(pe)+1, a deadline T(ce)-D(A, B), and unit effective occupancy time
  Fact: in every time slot, at most one transfer can be issued on a data link
  Objective: find a feasible transfer schedule on these data links
Transfer scheduling is polynomially solvable
  A special real-time scheduling problem [J. Blazewicz, 1979]
  Binary search for the minimum feasible channel width m
  For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)
  Overall time complexity: O(n log^2 n)
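The binary search over channel widths with an EDF feasibility check can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes integer time slots, represents each transfer as an (arrival, deadline) pair, and uses hypothetical names and example data throughout.

```python
import heapq

def edf_feasible(transfers, m):
    """True if every (arrival, deadline) transfer can be issued on m >= 1 links.

    Each slot issues the (up to m) released transfers with the earliest deadlines.
    """
    pending = sorted(transfers)          # ordered by arrival time
    heap, i, slot = [], 0, 0             # heap holds deadlines of released transfers
    while i < len(pending) or heap:
        if not heap:
            slot = pending[i][0]         # jump to the next arrival
        while i < len(pending) and pending[i][0] <= slot:
            heapq.heappush(heap, pending[i][1])
            i += 1
        for _ in range(min(m, len(heap))):
            if heapq.heappop(heap) < slot:
                return False             # a deadline has already passed
        slot += 1
    return True

def min_channel_width(transfers):
    """Smallest m for which EDF finds a schedule, or None if none exists."""
    if not transfers:
        return 0
    lo, hi = 1, len(transfers)
    if not edf_feasible(transfers, hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_feasible(transfers, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical data: (T(pe), T(ce)) pairs on a channel with delay D = 2.
D = 2
schedule_times = [(1, 4), (1, 5), (2, 5), (2, 7)]
windows = [(t_pe + 1, t_ce - D) for t_pe, t_ce in schedule_times]
print(min_channel_width(windows))  # 2
```

Feasibility is monotone in m (extra links never hurt), which is what makes the binary search valid.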

EDF-Based Transfer Scheduling Example
Six transfers (1 to 6) to be scheduled on Data Link 1 and Data Link 2.
Ordered by Earliest-Deadline-First: successfully scheduled onto 2 data links.
Ordered by left edge: failed for 2 data links.
[Figure: time-slot assignments of transfers 1 to 6 on Data Link 1 and Data Link 2 under each ordering]
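The transfer windows in the slide's figure are not recoverable from the transcript, so the sketch below uses a smaller, hypothetical instance (three transfers, one data link) to show the same effect. It assumes "left edge" means ordering by arrival time; the function and transfer names are illustrative.

```python
def schedule(transfers, links, key):
    """Slot-by-slot scheduling of (name, arrival, deadline) transfers.

    `key` decides which pending transfers are issued first in each slot.
    Expects a non-empty list and links >= 1.
    Returns {name: slot}, or None if some transfer misses its deadline.
    """
    remaining = sorted(transfers, key=lambda tr: tr[1])
    pending, plan = [], {}
    slot = remaining[0][1]
    while remaining or pending:
        while remaining and remaining[0][1] <= slot:
            pending.append(remaining.pop(0))
        pending.sort(key=key)
        for name, _, deadline in pending[:links]:
            if deadline < slot:
                return None
            plan[name] = slot
        pending = pending[links:]
        slot += 1
    return plan

# Hypothetical transfers: (name, arrival, deadline)
transfers = [("x", 1, 4), ("y", 1, 4), ("z", 2, 2)]

by_left_edge = schedule(transfers, links=1, key=lambda tr: tr[1])  # earliest arrival first
by_edf = schedule(transfers, links=1, key=lambda tr: tr[2])        # earliest deadline first
print(by_left_edge)  # None
print(by_edf)        # {'x': 1, 'z': 2, 'y': 3}
```

Ordering by arrival issues y in slot 2 and leaves z past its deadline, while EDF lets the urgent transfer z go first and still fits y within its window.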


Experiment Settings
Common front end: C / VHDL -> CDFG generation -> functional unit allocation & binding (inputs: target clock period, uArch. spec.)
Conventional flow: conventional scheduling
MCAS flow: scheduling-driven placement -> placement-driven rebinding & rescheduling
MCAS-Pipe flow: MCAS flow followed by global interconnect sharing
Common back end: register and port binding -> datapath & control generation
Outputs: RTL VHDL files (for all flows); floorplan constraints (for MCAS and MCAS-Pipe); multicycle path constraints (for MCAS only)
Implementation: Altera QuartusII + Stratix

Experimental Results: Register and LE Usage
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. conventional flow: uses more registers and logic elements (LE)
MCAS-Pipe vs. MCAS: slightly more registers, and comparable logic element cost
Table columns: Designs, Node#, MCAS (Reg#, LE), CONV / MCAS (Reg#, LE), MCAS-Pipe / MCAS (Reg#, LE)
PR 46 31 1181 0.71 0.95 1.19
WANG 52 40 1435 0.63 0.81 1.20 0.85
LEE 53 29 988 0.76 0.96 1.00 0.84
MCM 98 57 2467 0.75 1.05
HONDA 101 41 2542 0.83 0.90 1.01
DIR 152 44 2260
Average - 0.74 0.93 1.09 0.98

Experimental Results: Performance
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. conventional flow: 36% reduction in clock period and 30% in total latency
MCAS-Pipe vs. MCAS: comparable design performance (4% better)
[Figure: charts of clock period and total latency]

Interconnect Structure of Altera’s Stratix
Local: LL, LO
Global: H4, H8, H24
Global: V4, V8, V16

Experimental Results: Wirelength
Wire types
  LL, LO: local wires
  H4, V4, H8, V8: short global wires
  V16, H24: long global wires
MCAS-Pipe vs. MCAS: 28.8% reduction in long global wires, 19.3% reduction in total wirelength

Conclusions
High-level automatic on-chip interconnect pipelining
RDR-Pipe: extension of the RDR micro-architecture
  Micro-architecture supporting interconnect pipelining
MCAS-Pipe: enhancement of the MCAS synthesis system
  Adds a novel global interconnect sharing algorithm to effectively reduce global wiring
Experimental results
  Matches or exceeds the RDR-based approach in performance
  Greatly reduces wiring demand

Thank you