Published by Tamsin Hoover. Modified over 9 years ago.
1
Architecture-Level Synthesis for Automatic Interconnect Pipelining
Jason Cong, Yiping Fan, Zhiru Zhang
VLSI CAD Lab, Computer Science Department, University of California, Los Angeles
{cong, fanyp,
Funded by GSRC, NSF, and Altera Corp.
2
Outline
Motivation
Our contributions
  RDR-Pipe micro-architecture: Regular Distributed Register micro-architecture with interconnect pipelining
  Synthesis flow and algorithms: MCAS-Pipe, automatic interconnect pipelining and sharing
Experimental results
Conclusions
3
Interconnect Bottleneck in Nanometer Designs
Challenge: single-cycle full-chip communication will no longer be possible
  Not supported by the current CAD toolset
ITRS’ um Tech
  5.63 GHz across-chip clock
  800 mm² (28.3 mm x 28.3 mm)
IPEM BIWS estimations
  Buffer size: 100x
  Driver/receiver size: 100x
  Semi-global layer (Tier 3)
  A signal can travel up to 11.4 mm in one cycle
  5 clock cycles are needed from corner to corner
[Figure: regions reachable in 1, 2, 3, 4, and 5 cycles, with axis marks at 11.4, 22.8, and 28.3 mm]
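The 5-cycle figure can be checked with a short calculation. The sketch below assumes Manhattan (horizontal plus vertical) routing from one corner of the 28.3 mm x 28.3 mm chip to the opposite corner; that routing assumption and the function name `cycles_needed` are illustrative and do not come from the slides.

```python
import math

def cycles_needed(distance_mm, reach_mm_per_cycle=11.4):
    """Clock cycles for a signal to cover distance_mm, given the per-cycle reach."""
    return math.ceil(distance_mm / reach_mm_per_cycle)

side_mm = 28.3
# Assumption: a corner-to-corner wire is routed in Manhattan fashion,
# so its length is one chip width plus one chip height.
corner_to_corner_mm = side_mm + side_mm

print(cycles_needed(corner_to_corner_mm))  # 5
```

Under this assumption the wire is 56.6 mm long, which is just under five times the 11.4 mm single-cycle reach.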
4
Related Work
Retiming with placement or floorplanning
  Retiming + multilevel partitioning [Cong et al., ICCAD’00] and coarse placement [Cong et al., DAC’03]
  Retiming + floorplanning [Chong & Brayton, IWLS’01]
  Retiming + placement for FPGAs [Singh & Brown, FPGA’02]
Global wire pipelining in the Itanium™ processor [McInerney et al., ISPD’00]
Buffer and flip-flop insertion in RTL [Lu et al., DATE’02] [Cocchini, ICCAD’02]
5
Limitation during Logic/Physical Level to Explore Multicycle Communication
The minimum clock period achievable by logic optimization is bounded below by the maximum delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST’94]
Example: a loop with 4 logic cells and 2 registers
  Cell delay = 1 ns
  Interconnect delay = 1 ns
  DR ratio = (Dlogic + Dint) / #Registers = (4 + 4) / 2 = 4 ns
  Clock period ≥ 4 ns
Interconnect pipelining by flip-flop insertion?
  Requires a considerable amount of manual rework on the original RTL descriptions
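The DR-ratio bound from this slide can be written out as a small calculation. This is a minimal sketch: the function names and the loop representation are hypothetical, and it assumes the loop contains four wire segments of 1 ns each, which is how the slide arrives at (4 + 4).

```python
def dr_ratio(logic_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total delay divided by register count."""
    return (sum(logic_delays_ns) + sum(wire_delays_ns)) / num_registers

def clock_period_lower_bound(loops):
    """The clock period cannot go below the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)

# The slide's loop: 4 cells at 1 ns, 4 wire segments at 1 ns (assumed), 2 registers.
slide_loop = ([1.0] * 4, [1.0] * 4, 2)

print(clock_period_lower_bound([slide_loop]))  # 4.0
```

Because retiming moves registers without changing how many sit on a loop, this bound stays at 4 ns however the two registers are placed.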
6
Our Approach
Consideration of multicycle communication during architectural (or behavioral) synthesis [Cong et al., ISPD’03] [Cong et al., ICCAD’03]
  Regular Distributed Register (RDR) micro-architecture
    Highly regular
    Direct support of multicycle on-chip communication
  MCAS: Architectural Synthesis for Multi-cycle Communication
    Efficiently maps behavioral descriptions to the RDR micro-architecture
    Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
This work
  Extension of RDR and MCAS for interconnect pipelining
7
Outline
Motivation
Our contributions
  RDR-Pipe micro-architecture: Regular Distributed Register micro-architecture with interconnect pipelining
  Synthesis flow and algorithms: MCAS-Pipe, automatic interconnect pipelining and sharing
Experimental results
Conclusions
8
Regular Distributed Register Micro-Architecture
[Figure: array of islands joined by global interconnect, with communication distances of 1 cycle, 2 cycles, ..., k cycles; each island (Wi x Hi) holds a Local Computational Cluster (LCC) with ALU, MUL, and MUX, a register file, and an FSM]
Distribute registers to each “island”
Choose the island size such that local computation and communication in each island can be done in a single cycle
Use register banks: the registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication
9
Wiring Overhead in RDR Designs
[Figure: schedule over cycles 1 to 5 for ALU1 (+) and MUL1 (*); sender registers r1, r2 and receiver registers r3, r4 are connected by interconnects with a delay of 2 cycles]
Data transfers r1→r3 and r2→r4 are overlapped
Two dedicated global wires are needed
10
Architectural Solution: RDR-Pipe
[Figure: islands (each with LCC, register file, and FSM) separated by H channels and V channels, with a Pipeline Register Station (PRS) between islands]
Keep the intra-island structures
Inter-island pipeline register stations (PRS) for global communication
PRS performs autonomous store-and-forward
  Synchronous design
  No global control signal needed for PRS
11
Reducing Wiring Overhead in RDR-Pipe
[Figure: schedule over cycles 1 to 5 for ALU1 (+) and MUL1 (*) with 2-cycle communication; sender registers r1, r2, receiver registers r3, r4, and a pipeline register on the wire]
Data transfers are pipelined
One wire with a pipeline register is enough
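The difference between the two wiring styles can be sketched as a count of simultaneously busy wires. The issue cycles below are hypothetical (the slides' exact schedule did not survive extraction), and the model is an assumption: an unpipelined wire is held for the full interconnect delay, while a pipelined wire accepts a new value every cycle.

```python
from collections import Counter

def wires_unpipelined(issue_cycles, delay):
    """Wires needed when each transfer holds its wire for `delay` consecutive cycles."""
    busy = Counter()
    for start in issue_cycles:
        for cycle in range(start, start + delay):
            busy[cycle] += 1
    return max(busy.values(), default=0)

def wires_pipelined(issue_cycles):
    """Wires needed when a wire with pipeline registers accepts one transfer per cycle."""
    return max(Counter(issue_cycles).values(), default=0)

# Hypothetical schedule: r1->r3 issued in cycle 1, r2->r4 issued in cycle 2, delay 2.
issue_cycles = [1, 2]
print(wires_unpipelined(issue_cycles, delay=2))  # 2
print(wires_pipelined(issue_cycles))             # 1
```

With back-to-back transfers and a 2-cycle delay, the unpipelined model needs two wires because both transfers are in flight during cycle 2, while the pipelined model needs one.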
12
Synthesis Flow: MCAS-Pipe System
MCAS-Pipe flow: C / VHDL → CDFG generation → CDFG → resource allocation & functional unit binding → ICG → scheduling-driven placement → locations → placement-driven rescheduling & rebinding → global interconnect sharing → register and port binding → datapath & FSM generation → RTL VHDL & floorplan constraints
Global interconnect sharing
  Performed after scheduling and functional unit binding, before register and port binding
  Enables multiple data communications to share a physical link (a wire with pipeline registers)
Advantages over MCAS
  Expected to reduce global wiring demand
  No multicycle path constraints needed
13
Global Interconnect Sharing
Conflicted data transfers
  [Figure: cycles 1 to 7; producers pe, pg in island A and consumers ce, cg in island B, D = 2; sender, pipeline, and receiver registers]
  Two physical links are needed to support the concurrent data transfers
Compatible data transfers
  [Figure: cycles 1 to 7; pe and pg share one register in island A, consumers ce, cg in island B, D = 2]
  Now the two producer registers can be merged, since their lifetimes become compatible
  Only one physical link is required to support the scheduled data transfers
14
Global Pipelined Interconnect Minimization
Definitions
  Data link: a pipelined global interconnect
  Channel: the set of data links between two islands
  Width of a channel: the number of its data links
  Data transfer: movement of data from a producer to a consumer
Architectural assumption
  Channels cannot share interconnects
Theorem
  Global pipelined interconnects are minimized if and only if the width of every channel is minimized
15
Transfer Scheduling for a Single Channel
A decision problem formulation
  Given:
    A channel (A, B) containing m data links
    A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e has an arrival time T(pe) + 1, a deadline T(ce) - D(A, B), and unit effective occupancy time
  Fact: in every time slot, at most one transfer can be issued on a data link
  Objective: find a feasible transfer schedule on these data links
Transfer scheduling is polynomial-time solvable
  A special real-time scheduling problem [J. Blazewicz, 1979]
  Binary search for the minimum feasible channel width m
  For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)
  Overall time complexity: O(n log² n)
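The binary search over channel widths with an EDF feasibility test can be sketched as follows. This is not the authors' implementation: the function names are hypothetical, time slots are assumed to be integers, and a transfer is assumed to be issuable in any slot from its arrival time through its deadline, inclusive.

```python
import heapq

def transfer_window(t_producer, t_consumer, delay):
    """Arrival T(pe) + 1 and deadline T(ce) - D(A, B), as defined on the slide."""
    return (t_producer + 1, t_consumer - delay)

def edf_feasible(windows, m):
    """True if all unit-time transfers fit on m data links using EDF."""
    pending = sorted(windows)          # by arrival time
    ready = []                         # min-heap of deadlines
    i, n, t = 0, len(pending), 0
    while i < n or ready:
        if not ready:
            t = pending[i][0]          # jump to the next arrival
        while i < n and pending[i][0] <= t:
            heapq.heappush(ready, pending[i][1])
            i += 1
        for _ in range(min(m, len(ready))):
            if heapq.heappop(ready) < t:
                return False           # a deadline has already passed
        t += 1
    return True

def min_channel_width(windows):
    """Smallest feasible m, or None if no width works (e.g. deadline < arrival)."""
    if not windows:
        return 0
    lo, hi = 1, len(windows)
    if not edf_feasible(windows, hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_feasible(windows, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Two transfers must go in slot 1, a third can wait until slot 3.
print(min_channel_width([(1, 1), (1, 1), (2, 3)]))  # 2
```

Feasibility is monotone in m (extra links never hurt), which is what makes the binary search valid; each feasibility test is dominated by the sort and heap operations.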
16
EDF-Based Transfer Scheduling Example
[Figure: six data transfers (1 to 6) assigned to time slots on Data Link 1 and Data Link 2 under two orderings]
Ordered by Earliest-Deadline-First: all six transfers are successfully scheduled onto 2 data links
Ordered by left edge: scheduling fails for 2 data links
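The contrast between the two orderings can be reproduced on a small made-up instance; the slide's own six transfers are not recoverable from the extracted figure, so the five transfers below are hypothetical. The sketch assumes unit-time transfers with inclusive (arrival, deadline) windows, and it interprets "left edge" as first-fit placement in arrival order, which is an assumption about the slide's baseline.

```python
import heapq

def schedule_edf(transfers, m):
    """Assign slots by earliest deadline; returns {name: slot} or None on failure."""
    order = sorted(transfers.items(), key=lambda kv: kv[1][0])
    ready, slots, i, t = [], {}, 0, 0
    while i < len(order) or ready:
        if not ready:
            t = order[i][1][0]                     # jump to the next arrival
        while i < len(order) and order[i][1][0] <= t:
            name, (_, deadline) = order[i]
            heapq.heappush(ready, (deadline, name))
            i += 1
        for _ in range(min(m, len(ready))):
            deadline, name = heapq.heappop(ready)
            if deadline < t:
                return None
            slots[name] = t
        t += 1
    return slots

def schedule_left_edge(transfers, m):
    """First-fit in arrival order; returns {name: slot} or None on failure."""
    load, slots = {}, {}
    for name, (arrival, deadline) in sorted(transfers.items(), key=lambda kv: kv[1][0]):
        t = arrival
        while load.get(t, 0) >= m:                 # skip slots whose links are all taken
            t += 1
        if t > deadline:
            return None
        load[t] = load.get(t, 0) + 1
        slots[name] = t
    return slots

# Hypothetical transfers: name -> (arrival slot, deadline slot)
transfers = {"a": (1, 3), "b": (1, 3), "x": (1, 3), "c": (2, 2), "d": (2, 2)}
print(schedule_edf(transfers, 2) is not None)   # True
print(schedule_left_edge(transfers, 2))         # None
```

Arrival-order placement spends a slot-2 link on the flexible transfer x, leaving only one link for the two urgent transfers c and d, while EDF defers x to slot 3.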
17
Outline
Motivation
Our contributions
  RDR-Pipe micro-architecture: Regular Distributed Register micro-architecture with interconnect pipelining
  Synthesis flow and algorithms: MCAS-Pipe, automatic interconnect pipelining and sharing
Experimental results
Conclusions
18
Experiment Settings
Inputs: C / VHDL, target clock period, uArch. spec.
All flows: CDFG generation, functional unit allocation & binding
Conventional flow: conventional scheduling
MCAS flow: scheduling-driven placement, placement-driven rebinding & rescheduling
MCAS-Pipe flow: the MCAS steps plus global interconnect sharing
All flows: register and port binding, datapath & control generation
Outputs
  RTL VHDL files (for all flows)
  Floorplan constraints (for MCAS and MCAS-Pipe)
  Multicycle path constraints (for MCAS only)
Implementation: Altera QuartusII + Stratix
19
Experimental Results: Register and LE Usage
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. conventional flow: uses more registers and logic elements (LE)
MCAS-Pipe vs. MCAS: slightly more registers, and comparable logic element cost
Columns: Design | Node# | MCAS Reg# | MCAS LE | CONV/MCAS Reg# | CONV/MCAS LE | MCAS-Pipe/MCAS Reg# | MCAS-Pipe/MCAS LE
PR | 46 | 31 | 1181 | 0.71 0.95 1.19
WANG | 52 | 40 | 1435 | 0.63 | 0.81 | 1.20 | 0.85
LEE | 53 | 29 | 988 | 0.76 | 0.96 | 1.00 | 0.84
MCM | 98 | 57 | 2467 | 0.75 1.05
HONDA | 101 | 41 | 2542 | 0.83 0.90 1.01
DIR | 152 | 44 | 2260 |
Average | - | - | - | 0.74 | 0.93 | 1.09 | 0.98
20
Experimental Results: Performance
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. conventional flow: 36% reduction in clock period and 30% reduction in total latency
MCAS-Pipe vs. MCAS: comparable design performance (4% better)
[Figures: clock period; total latency]
21
Interconnect Structure of Altera’s Stratix
[Figure: Stratix routing resources. Global (H): H4, H8, H24. Global (V): V4, V8, V16. Local: LL, LO]
22
Experimental Results: Wirelength
Wire types
  LL, LO: local wires
  H4, V4, H8, V8: short global wires
  V16, H24: long global wires
MCAS-Pipe vs. MCAS: 28.8% reduction in long global wires, 19.3% reduction in total wirelength
23
Conclusions
High-level automatic on-chip interconnect pipelining
  RDR-Pipe: extension of the RDR micro-architecture
    Micro-architecture supporting interconnect pipelining
  MCAS-Pipe: enhancement of the MCAS synthesis system
    Adds a novel global interconnect sharing algorithm to effectively reduce global wiring
Experimental results
  Matches or exceeds the RDR-based approach in performance
  Greatly reduces wiring demand
24
Thank you