Architecture-Level Synthesis for Automatic Interconnect Pipelining


1 Architecture-Level Synthesis for Automatic Interconnect Pipelining
Jason Cong, Yiping Fan, Zhiru Zhang
VLSI CAD Lab, Computer Science Department, University of California, Los Angeles
{cong, fanyp,
Funded by GSRC, NSF, and Altera Corp.

2 Outline
Motivation
Our contributions
  RDR-Pipe micro-architecture: Regular Distributed Register micro-architecture with interconnect pipelining
  Synthesis flow and algorithms: MCAS-Pipe, automatic interconnect pipelining and sharing
Experimental results
Conclusions

3 Interconnect Bottleneck in Nanometer Designs
Challenge: single-cycle full-chip communication will no longer be possible
  Not supported by the current CAD toolset
Example design point (ITRS’ um Tech)
  5.63 GHz across-chip clock
  800 mm2 die (28.3mm x 28.3mm)
IPEM BIWS estimations
  Buffer size: 100x
  Driver/receiver size: 100x
  Semi-global layer (Tier 3)
  A signal can travel up to 11.4mm in one cycle
  Need 5 clock cycles from corner to corner
[Figure: die map showing the regions reachable in 1, 2, 3, 4, and 5 cycles; axis marks at 11.4, 22.8, and 28.3]
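The 5-cycle figure above can be reproduced with a short calculation. This is a minimal sketch that assumes corner-to-corner wires follow Manhattan (horizontal plus vertical) routing, which the slide does not state explicitly; the function name `cycles_to_cross` is hypothetical.

```python
import math


def cycles_to_cross(distance_mm: float, reach_mm_per_cycle: float) -> int:
    """Clock cycles needed to cover a wire of the given length."""
    return math.ceil(distance_mm / reach_mm_per_cycle)


# Numbers taken from the slide.
die_side_mm = 28.3       # 28.3mm x 28.3mm die
reach_mm = 11.4          # distance a signal travels in one cycle

# Assumption: corner-to-corner routing is Manhattan, so the wire
# spans one full die width plus one full die height.
corner_to_corner_mm = 2 * die_side_mm

print(cycles_to_cross(corner_to_corner_mm, reach_mm))  # prints 5
```

Under a straight-line (Euclidean) diagonal the same calculation would give 4 cycles, so the Manhattan assumption is what matches the 5 cycles quoted on the slide.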

4 Related Work
Retiming with placement or floorplanning
  Retiming + multilevel partitioning [Cong et al, ICCAD’00] and coarse placement [Cong et al, DAC’03]
  Retiming + floorplanning [Chong & Brayton, IWLS’01]
  Retiming + placement for FPGAs [Singh & Brown, FPGA’02]
Global wire pipelining in the Itanium™ processor [McInerney et al. ISPD’00]
Buffer and flip-flop insertion in RTL [Lu et al. DATE’02] [Cocchini, ICCAD’02]

5 Limitation during Logic/Physical Level to Explore Multicycle Communication
The minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST’94]
Example: a loop with 4 logic cells and 2 registers
  Cell delay = 1ns
  Interconnect delay = 1ns
  DR ratio = (Dlogic + Dint) / #Registers = (4 + 4) / 2 = 4ns
  Clock period ≥ 4ns
Interconnect pipelining by flip-flop insertion?
  Requires a considerable amount of manual rework on the original RTL descriptions
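The DR-ratio bound can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it assumes the loop's four interconnect segments each contribute 1ns, which is how the slide's (4+4)/2 reads.

```python
def dr_ratio_ns(logic_delay_ns: float, interconnect_delay_ns: float,
                num_registers: int) -> float:
    """Delay-to-register ratio of one loop: total delay per register."""
    return (logic_delay_ns + interconnect_delay_ns) / num_registers


def clock_period_lower_bound_ns(loops) -> float:
    """The clock period cannot go below the largest DR ratio of any loop.

    `loops` is an iterable of (logic_ns, interconnect_ns, registers).
    """
    return max(dr_ratio_ns(*loop) for loop in loops)


# Slide example: 4 cells at 1ns each, 4ns of interconnect, 2 registers.
slide_loop = (4 * 1.0, 4 * 1.0, 2)
print(clock_period_lower_bound_ns([slide_loop]))  # prints 4.0
```

Retiming can move the two registers around the loop but cannot change their count, which is why the bound holds no matter how the logic is optimized.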

6 Our Approach
Consideration of multicycle communication during architectural (or behavioral) synthesis [Cong et al, ISPD’03] [Cong et al. ICCAD’03]
Regular Distributed Register (RDR) micro-architecture
  Highly regular
  Direct support of multicycle on-chip communication
MCAS: Architectural Synthesis for Multi-cycle Communication
  Efficiently maps the behavioral descriptions to the RDR uArch
  Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
This work
  Extension of RDR and MCAS for interconnect pipelining

7 Outline
Motivation
Our contributions
  RDR-Pipe micro-architecture: Regular Distributed Register micro-architecture with interconnect pipelining
  Synthesis flow and algorithms: MCAS-Pipe, automatic interconnect pipelining and sharing
Experimental results
Conclusions

8 Regular Distributed Register Micro-Architecture
Distribute registers to each “island”
Choose the island size such that local computation and communication in each island can be done in a single cycle
Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island
[Figure: grid of islands (island i has size Wi x Hi) connected by global interconnect labeled 1 cycle, 2 cycles, …, k cycles; each island contains a Local Computational Cluster (LCC) with ALU, MUL, and MUX, an FSM, and a register file]

9 Wiring Overhead in RDR Designs
Interconnects with a delay of 2 cycles
Data transfers r1→r3 and r2→r4 are overlapped
Two dedicated global wires are needed
[Figure: schedule over Cycle 1 to Cycle 5 for ALU1 (+) and MUL1 (*) with registers r1 to r4; legend: sender register, receiver register]

10 Architectural Solution: RDR-Pipe
Keep the intra-island structures
Inter-island pipeline register stations (PRS) for global communications
PRS performs autonomous store-and-forward
  Synchronous design
  No global control signal needed for PRS
[Figure: islands 1 to 6, each with an LCC, FSM, and register file, separated by horizontal (H) and vertical (V) channels with a Pipeline Register Station (PRS) at each channel crossing]

11 Reducing Wiring Overhead in RDR-Pipe
2-cycle communication between ALU1 (+) and MUL1 (*)
Data transfers are pipelined
One wire with a pipeline register is enough
[Figure: schedule over Cycle 1 to Cycle 5 with registers r1 to r4; legend: sender register, receiver register, pipeline register]
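The difference between the two wiring schemes comes down to how long each transfer occupies a wire. The sketch below is a simplified model, not the authors' tool: it assumes a non-pipelined RDR wire is held for the full interconnect delay, while a pipelined RDR-Pipe wire is held only for the single cycle in which the transfer is issued. The issue cycles (2 and 3) are hypothetical, chosen so that the two transfers overlap.

```python
def wires_needed(issue_cycles, occupancy_cycles):
    """Peak number of transfers holding a wire at the same time.

    Each transfer holds its wire during [issue, issue + occupancy).
    """
    events = []
    for t in issue_cycles:
        events.append((t, +1))                     # wire acquired
        events.append((t + occupancy_cycles, -1))  # wire released
    # Tuples sort so that a release at cycle t comes before an
    # acquire at cycle t, letting a freed wire be reused at once.
    events.sort()
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak


delay = 2            # 2-cycle interconnect, as on the slide
issues = [2, 3]      # hypothetical issue cycles for r1->r3 and r2->r4

print(wires_needed(issues, delay))  # RDR, wire held for 2 cycles: prints 2
print(wires_needed(issues, 1))      # RDR-Pipe, held for 1 cycle: prints 1
```

With a pipeline register in the middle of the wire, the second transfer enters the first segment while the first transfer is already in the second segment, so one physical wire serves both.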

12 Synthesis Flow: MCAS-Pipe System
MCAS-Pipe flow
  Input: C / VHDL
  CDFG generation (output: CDFG)
  Resource allocation & functional unit binding (output: ICG)
  Scheduling-driven placement (output: locations)
  Placement-driven rescheduling & rebinding
  Global interconnect sharing
  Register and port binding
  Datapath & FSM generation
  Output: RTL VHDL & floorplan constraints
Global interconnect sharing
  After scheduling and functional unit binding
  Before register and port binding
  Enables multiple data communications to share a physical link (a wire with pipeline registers)
Advantages over MCAS
  Expected to reduce global wiring demand
  No multicycle path constraint needed

13 Global Interconnect Sharing
Conflicted data transfers
  Producers pe and pg in island A, consumers ce and cg in island B, D = 2
  Two physical links are needed to support the concurrent data transfers
Compatible data transfers
  After rescheduling, the two producer registers (pe, pg) can be merged, since their lifetimes become compatible
  Only one physical link is required to support the scheduled data transfers
[Figure: schedules over Cycle 1 to Cycle 7 for both cases; legend: sender register, receiver register, pipeline register]

14 Global Pipelined Interconnect Minimization
Definitions
  Data links: pipelined global interconnects
  Channel: set of data links between two islands
  Width of a channel: number of its data links
  Data transfer: movement of data from a producer to a consumer
Architectural assumption
  Channels cannot share interconnects
Theorem
  Global pipelined interconnects are minimized if and only if the width of every channel is minimized

15 Transfer Scheduling for a Single Channel
A decision problem formulation
  Given: a channel (A, B) containing m data links
  A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e is associated with an arrival time T(pe)+1, a deadline T(ce)-D(A, B), and unit effective occupancy time
  Fact: for every time slot, at most one transfer can be issued on a data link
  Objective: find a feasible transfer schedule on these data links
Transfer scheduling is polynomially solvable
  A special real-time scheduling problem [J. Blazewicz, 1979]
  Binary search for the minimum feasible channel width m
  For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)
  Overall time complexity: O(n log² n)
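The binary search over channel widths with an EDF feasibility check can be sketched as follows. This is an illustrative reading of the slide, not the MCAS-Pipe implementation: the function names are hypothetical, time slots are assumed to be integers, and a transfer is assumed to be issuable in any slot t with arrival <= t <= deadline.

```python
import heapq


def transfer_window(t_producer, t_consumer, delay):
    """(arrival, deadline) of a transfer: T(pe)+1 and T(ce)-D(A, B)."""
    return (t_producer + 1, t_consumer - delay)


def edf_feasible(transfers, m):
    """True if all (arrival, deadline) transfers fit on m data links.

    Each link issues at most one transfer per time slot.
    """
    if m <= 0:
        return not transfers
    pending = sorted(transfers)          # ordered by arrival time
    ready = []                           # min-heap of deadlines
    i = 0
    t = pending[0][0] if pending else 0
    while i < len(pending) or ready:
        if not ready and pending[i][0] > t:
            t = pending[i][0]            # skip idle slots
        while i < len(pending) and pending[i][0] <= t:
            heapq.heappush(ready, pending[i][1])
            i += 1
        # Issue up to m transfers this slot, earliest deadline first.
        for _ in range(min(m, len(ready))):
            if heapq.heappop(ready) < t:
                return False             # a deadline was missed
        t += 1
    return True


def min_channel_width(transfers):
    """Smallest feasible m, or None if no width can meet the deadlines."""
    n = len(transfers)
    if not edf_feasible(transfers, n):
        return None
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_feasible(transfers, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical transfers as (T(pe), T(ce)) pairs, with D(A, B) = 2.
windows = [transfer_window(p, c, 2) for p, c in [(1, 5), (1, 5), (2, 5)]]
print(min_channel_width(windows))  # prints 2
```

Each feasibility check costs O(n log n) because of the sort and heap operations, and the binary search runs it O(log n) times, which matches the O(n log² n) bound quoted above.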

16 EDF-Based Transfer Scheduling Example
Ordered by Earliest-Deadline-First: transfers 1 to 6 are successfully scheduled onto 2 data links
Ordered by left edge: scheduling fails for 2 data links
[Figure: time-slot assignments of the six transfers on Data Link 1 and Data Link 2 under both orderings]

17 Outline
Motivation
Our contributions
  RDR-Pipe micro-architecture: Regular Distributed Register micro-architecture with interconnect pipelining
  Synthesis flow and algorithms: MCAS-Pipe, automatic interconnect pipelining and sharing
Experimental results
Conclusions

18 Experiment Settings
Inputs: C / VHDL, target clock period, uArch. spec.
Common front end: CDFG generation; functional unit allocation & binding
Conventional flow: conventional scheduling
MCAS flow: scheduling-driven placement; placement-driven rebinding & rescheduling
MCAS-Pipe flow: the MCAS steps followed by global interconnect sharing
Common back end: register and port binding; datapath & control generation
Outputs
  RTL VHDL files (for all flows)
  Floorplan constraints (for MCAS and MCAS-Pipe)
  Multicycle path constraints (for MCAS only)
Implementation: Altera QuartusII + Stratix

19 Experimental Results: Register and LE Usage
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. Conventional flow: uses more registers and logic elements (LE)
MCAS-Pipe vs. MCAS: slightly more registers, and comparable logic element cost

Design    Node#   MCAS Reg#   MCAS LE   CONV/MCAS Reg#   CONV/MCAS LE   MCAS-Pipe/MCAS Reg#   MCAS-Pipe/MCAS LE
PR        46      31          1181
WANG      52      40          1435      0.63             0.81           1.20                  0.85
LEE       53      29          988       0.76             0.96           1.00                  0.84
MCM       98      57          2467
HONDA     101     41          2542
DIR       152     44          2260
Average   -       -           -         0.74             0.93           1.09                  0.98

Ratio values that survive for the remaining designs, in slide order (column positions not recoverable): PR 0.71, 0.95, 1.19; MCM 0.75, 1.05; HONDA 0.83, 0.90, 1.01; DIR none.

20 Experimental Results: Performance
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. Conventional flow: 36% reduction in clock period and 30% reduction in total latency
MCAS-Pipe vs. MCAS: comparable design performance (4% better)
[Figure: two charts, clock period and total latency]

21 Interconnect Structure of Altera’s Stratix
Global (horizontal): H4, H8, H24
Global (vertical): V4, V8, V16
Local: LL, LO

22 Experimental Results: Wirelength
Wire types
  LL, LO: local wires
  H4, V4, H8, V8: short global wires
  V16, H24: long global wires
MCAS-Pipe vs. MCAS: 28.8% reduction in long global wires, 19.3% reduction in total wirelength

23 Conclusions
High-level automatic on-chip interconnect pipelining
RDR-Pipe: extension of the RDR micro-architecture
  Micro-architecture supporting interconnect pipelining
MCAS-Pipe: enhancement of the MCAS synthesis system
  Adds a novel global interconnect sharing algorithm to effectively reduce the global wiring
Experimental results
  Matches or exceeds the RDR-based approach in performance
  Greatly reduces wiring demand

24 Thank you

