Design Methodology for Semi Custom Processor Cores

Victor Zyuban Sameh Asaad Thomas Fox Anne-Marie Haen Daniel Littrell Jaime Moreno IBM T.J.Watson Research Center, Yorktown Heights, NY

500 MHz WC, 350 mW @ 1.5 V, 105 °C, 0.13 µm foundry technology
Introduction
We describe the methodology used to implement a DSP core whose requirements did not allow a typical soft-core or hard-core approach: 500 MHz worst case (WC), 1.5 V, 105 °C, 0.13 µm foundry technology.
Objectives:
- Exceed the performance and power characteristics of designs built with the standard ASIC flow (a typical ASIC runs at 300 MHz in this technology) without compromising the productivity and generality of that flow
- Enable integration of custom components
- Enable power reduction techniques not provided by the ASIC flow, such as power gating, reverse bias, and data retention
- Allow optimizations across design phases
- Quick turn-around time, reproducible results

Methodology Overview
Flow stages: ISA/uA, VHDL, Hiasynth/Booledozer, Scan & clock, PD
- ISA/uA
- VHDL: define hierarchy, clock gating, and latch grouping; instantiate custom components
- Hiasynth/Booledozer: define assertions and synthesis directives; logic synthesis; optimization with pseudo-latches
- Scan & clock: clock splitter insertion; scan insertion; hierarchical Verilog
- PD: port the design to Cadence; pre-placement / pre-routing; place & route; extract timing, clock skew, and scan order; power analysis
Feedback paths to earlier stages:
- To ISA/uA: modify Arch/uA (change latencies, redefine resource usage)
- To VHDL: re-arrange logic, re-group latches
- To synthesis: adjust synthesis constraints
- To scan & clock: rewire scan

Overview of main design techniques
- Hierarchical VHDL and synthesis + pre-placement of components
- Grouping of latches for clock splitters in VHDL + pre-placement of latches and clock splitters
- Enforcing bit ordering in the datapath (bit stack seeding)
- Instantiation of decoupling buffers in VHDL + pre-placement of decoupling buffers
- Pre-routing of the clock grid and the power-ground grid

Hierarchical Synthesis and Pre-placement of Components: methodology
- Every unit is broken up into components (a few thousand gates each)
- Components are synthesized independently
- The layout of the unit is organized into a set of overlapping boxes; the gates constituting each component are assigned to the appropriate box, leaving sufficient flexibility for the place and route tools
[Figure: functional units FU1 to FU5 and dust logic, shown in the VHDL entry and in the layout]
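The box assignment described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the `Box` class, the `assign_gates` helper, the coordinates, and the gate names are all hypothetical and do not come from the original tool flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pre-placement region, in arbitrary layout units."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # Strict inequalities: boxes that only share an edge do not overlap.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def assign_gates(gates_by_component, box_by_component):
    """Map every gate to the box of the component it belongs to."""
    assignment = {}
    for component, gates in gates_by_component.items():
        box = box_by_component[component]
        for gate in gates:
            assignment[gate] = box.name
    return assignment


# Hypothetical unit with two components; the boxes overlap by 10 units in x,
# which leaves the place and route tools some freedom near the boundary.
boxes = {
    "FU1": Box("box_FU1", 0, 0, 60, 40),
    "FU2": Box("box_FU2", 50, 0, 110, 40),
}
gates = {
    "FU1": ["FU1/nand_0", "FU1/inv_1"],
    "FU2": ["FU2/xor_0"],
}
placement = assign_gates(gates, boxes)
```

Each gate is only constrained to a region, not to an exact location, which matches the stated goal of leaving flexibility to the placer.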

Hierarchical Synthesis and Pre-placement of Components: benefits
- Different components get their best power/performance/area characteristics when synthesized with different directives
- Gates inside components are sized for smaller area; only gates constituting dust use high-power books
- Most of the wires are restricted to smaller areas and are therefore short
- Most of the gates in the design use low-power books
- Both area and power are saved
[Figure: layout showing slice 0, slice 1, slice 2, slice 3, control slice, pointer update unit (3), and bit reverse unit (0)]

Latch grouping: fine-grain clock gating
- CG-OR instances define gated latch groups
- Control granularity goes down to the latch group driven by the same splitter
- Performs early (L1 and L2) gating
- Similar to the ASIC Clock-OR methodology
[Figure, VHDL entry: the grid clock (cclk) and the gate signals Gate1 and Gate2 feed placeholder CG-ORs, whose outputs clk1 and clk2 clock single- or multi-bit behavioral latches]
[Figure, post-synthesis processing: local clock splitters replace the placeholder CG-ORs and produce clk1_B/clk1_C and clk2_B/clk2_C for single-bit LSSD L1/L2 latches]
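A behavioral sketch of OR-style gating may help make the idea concrete. It assumes the usual Clock-OR polarity, where a gate value of 1 holds the group clock high and so suppresses clock edges; the slides do not state the polarity, and the waveforms and function names below are hypothetical.

```python
def gated_clock(grid_clock, gate):
    """OR each grid-clock sample with the gate signal of one latch group."""
    return [c | g for c, g in zip(grid_clock, gate)]


def rising_edges(wave):
    """Count 0-to-1 transitions in a sampled waveform."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


# Four clock cycles, two samples per cycle (low phase, then high phase).
grid_clock = [0, 1, 0, 1, 0, 1, 0, 1]

# Group 1 is gated off during the second and third cycles; group 2 never is.
gate1 = [0, 0, 1, 1, 1, 1, 0, 0]
gate2 = [0, 0, 0, 0, 0, 0, 0, 0]

clk1 = gated_clock(grid_clock, gate1)
clk2 = gated_clock(grid_clock, gate2)
```

In this toy model the gated group sees two rising edges instead of four, so its latches and local clock wiring switch half as often over the window.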

Latch Grouping without clock gating
- Buffer instances with special names define clock/latch groups
- The designer controls latch grouping by inserting special placeholder buffers to form latch groups
- A post-synthesis script replaces the buffers with splitters and the behavioral latches with LSSD L1/L2 latches
[Figure, VHDL entry: the grid clock feeds placeholder buffers, whose outputs clk1 and clk2 clock single- or multi-bit behavioral latches]
[Figure, post-synthesis processing: local clock splitters replace the placeholder buffers and produce clk1_B/clk1_C and clk2_B/clk2_C for single-bit LSSD L1/L2 latches]
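The post-synthesis replacement step can be sketched on a toy netlist. Everything here is an assumption made for illustration: the dictionary-based netlist format, the `CLKGRP_` naming convention for placeholder buffers, and the cell type names are hypothetical and are not taken from the actual script.

```python
PLACEHOLDER_PREFIX = "CLKGRP_"  # hypothetical "special name" convention


def post_synthesis(netlist):
    """Return a new netlist with placeholders swapped for LSSD structures."""
    result = []
    for cell in netlist:
        if cell["type"] == "BUF" and cell["name"].startswith(PLACEHOLDER_PREFIX):
            # A specially named buffer marks a latch group; it becomes the
            # local clock splitter for that group.
            result.append({**cell, "type": "SPLITTER"})
        elif cell["type"] == "BEHAV_LATCH":
            # An N-bit behavioral latch becomes N single-bit LSSD latches
            # that keep the same clock source.
            for bit in range(cell["width"]):
                result.append({
                    "name": f'{cell["name"]}[{bit}]',
                    "type": "LSSD_L1L2",
                    "clock": cell["clock"],
                })
        else:
            # Ordinary cells, including ordinary buffers, pass through.
            result.append(cell)
    return result


netlist = [
    {"name": "CLKGRP_acc", "type": "BUF", "clock": "grid_clk"},
    {"name": "acc_reg", "type": "BEHAV_LATCH", "width": 4, "clock": "CLKGRP_acc"},
    {"name": "u_buf0", "type": "BUF", "clock": None},
]
processed = post_synthesis(netlist)
```

Because the grouping lives in instance names rather than in tool-specific constraints, the VHDL stays generic and the grouping survives synthesis as long as the placeholders are preserved.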

Pre-placement of latches and clock splitters
- Clock wires are short, resulting in power reduction
- The length of clock wires is under control, resulting in small clock skew, higher frequency, and faster convergence on timing
- Bit-precise placement of dataflow latches enforces bit ordering in the datapath, resulting in improved routability and savings in power and area
[Figure: clock distribution with pre-placed clock splitters and latches, compared with the clock without pre-placement]

Instantiation and pre-placement of decoupling buffers: methodology
- Used when a long wire or a non-critical block of logic needs to be decoupled from the critical path
- Decoupling buffers are instantiated in VHDL and preserved through the synthesis and post-synthesis steps
- Overlapping pre-placement boxes are created in the layout, and decoupling buffers are assigned to the appropriate boxes
[Figure: latches, decoupling buffers, and functional units FU1 and FU2, shown for VHDL entry (case 1), VHDL entry (case 2), and the layout]
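A first-order estimate shows why decoupling a long wire helps the critical path. The sketch below uses a lumped RC model (delay proportional to driver resistance times total load capacitance); the resistance and capacitance values are made-up illustrative numbers, not data from the design.

```python
def stage_delay(r_driver_ohm, c_load_farad):
    """Lumped RC delay estimate: driver resistance times total load."""
    return r_driver_ohm * c_load_farad


# Hypothetical values, chosen only to illustrate the effect.
R_DRIVER = 2_000.0     # ohms, output resistance of a low-power book
C_NEXT_GATE = 5e-15    # farads, input of the next gate on the critical path
C_LONG_WIRE = 100e-15  # farads, long non-critical wire on the same net
C_BUFFER_IN = 5e-15    # farads, input of a decoupling buffer

# Without decoupling, the critical driver charges the long wire directly.
delay_without = stage_delay(R_DRIVER, C_NEXT_GATE + C_LONG_WIRE)

# With decoupling, it sees only the buffer input; the buffer drives the wire.
delay_with = stage_delay(R_DRIVER, C_NEXT_GATE + C_BUFFER_IN)
```

With these numbers the critical stage drops from roughly 210 ps to roughly 20 ps, while the buffer alone carries the cost of driving the long wire off the critical path.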

Instantiation and pre-placement of decoupling buffers: benefits
- The power level of the decoupling buffers is precisely controlled, without impacting the gates constituting the components (FUs)
- Allows keeping the power level of most books inside the unit small, using high-power books only where they need to drive long wires or high fan-out (FO)
- Decoupling high-capacitance nodes from critical paths improves speed
[Figure: layout showing decoupling buffers, a 40-bit latch, and output wires]

Core assembly and timing methodology overview 1) generation of abstracts - unit level
- Input: unit layout
- Steps: Chipbench extraction of global wiring (pd file), then EinsTimer
- Outputs: unit layout abstract, unit timing abstract

Core assembly and timing methodology overview 2) final step - core level
- Inputs: top schematic, unit layout abstracts, unit timing abstracts, top floorplan
- Generate physical hierarchy
- Placement (Cadence Preview, SKILL scripts)
- Routing (CCAR), producing the top routed floorplan
- Chipbench extraction of global wiring
- EinsTimer

Placed and routed eLite core
[Figure: floorplan of the placed and routed core. Labeled blocks: custom instruction memory (32 kB); custom data memory (32 kB); custom vector register file (256 x 16 bit, 8 read / 4 write); VPU, AU, IU, DEC, BU, BIU, CR, VMU; vector control; reduct unit; X bus control; SD bus control; slices 0 to 3, each with 16-bit and 40-bit sections]

Conclusion
- Significant speed improvement compared with the standard ASIC flow (critical path reduced from 3 ns to 2 ns in some units)
- Significant area reduction (> 30%) due to dominant usage of low-power cells
- Significant power reduction (in the range of 50%)
- Careful pre-placement of clock splitters and clock gating circuitry allows more time for calculating the clock gating conditions: the available time increased from 0.1 ns to 0.6 ns for the 500 MHz WC design with highly efficient OR-style (early) clock gating, which makes it possible to clock gate 90% of eligible latches
- Generic VHDL: easy to maintain, port, and simulate
- Short time from VHDL to layout and fast turn-around time to close on timing, with consistent convergence: up to 3 VHDL-to-layout iterations per unit per day by 2 to 3 designers
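The timing figures quoted in the conclusion can be put in proportion with a little arithmetic. The sketch assumes that the critical path alone sets the cycle time (setup and skew margins are ignored), which is a simplification; the helper names are hypothetical.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def freq_mhz(period):
    """Clock frequency in MHz for a period in nanoseconds."""
    return 1000.0 / period


# Cycle time of the 500 MHz worst-case target.
cycle = period_ns(500)

# Frequency implied by the critical path before and after (3 ns, then 2 ns),
# assuming the critical path equals the cycle time.
f_before = freq_mhz(3.0)
f_after = freq_mhz(2.0)

# Share of the cycle available for computing clock gating conditions.
gating_share_before = 0.1 / cycle
gating_share_after = 0.6 / cycle
```

Under that assumption, the 3 ns critical path corresponds to about 333 MHz and the 2 ns path to 500 MHz, and the time available for gating conditions grows from 5% to 30% of the 2 ns cycle.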
