1 FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template
Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem H. Najaf-abadi, Eric Rotenberg
Center for Efficient, Scalable, and Reliable Computing
Department of Electrical & Computer Engineering
North Carolina State University

2 Single Core to Multiple Cores
Single generic core:
  Generic microarchitecture
  One-size-fits-all approach
  Sub-optimal performance for individual applications
  Power inefficient
Multiple diverse cores (core A, core B, core C, core D):
  Exciting opportunity to exploit application diversity
  Employ many microarchitecturally diverse core designs
  Higher performance on individual applications
  Power efficient
© Niket K. Choudhary, 38th Int'l Symp. on Computer Architecture, 2011

3 Application Diversity
Customize each core to an application, class of applications, or class of application behavior
Customization dimensions: structure sizes, superscalar width, pipeline depth
(Figure: the ILP characteristics of App. 1, App. 2, and App. 3 drive the design of Cores A, B, C, and D in a heterogeneous multi-core.)

4 Benefits of Employing Diverse Cores
Prior works have shown significant performance and power advantages:
  R. Kumar et al., Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction (MICRO 2003)
  R. Kumar et al., Core Architecture Optimization for Heterogeneous Chip Multiprocessors (PACT 2006)
  B. C. Lee et al., Efficiency Trends and Limits from Comprehensive Microarchitectural Adaptivity (ASPLOS 2008)
  M. A. Suleman et al., Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures (ASPLOS 2009)
  H. H. Najaf-abadi et al., Core-Selectability in Chip Multiprocessors (PACT 2009)

5 “Achilles’ Heel” of Employing Diverse Cores
Designing and verifying a core is expensive
Designing and verifying many different core types is prohibitively expensive and impractical
No prior research in heterogeneous multi-core has addressed this challenge
(Figure: Core A, Core B, Core C, Core D)

6 FabScalar
Automate the generation of superscalar processors
Our approach: frame superscalar processors in a canonical form
  All superscalar processors have the same set of canonical pipeline stages and interfaces among them, expressed by a Canonical Superscalar Template
  A Canonical Pipeline Stage Library (CPSL) provides many different designs for each canonical pipeline stage that differ in major superscalar dimensions
Automation is enabled by:
  Invariant interfaces among canonical pipeline stages
  Confinement of microarchitectural diversity within the canonical pipeline stages

7 Canonical Superscalar Template
(Figure: block diagram of the Canonical Superscalar Template.)

8 Canonical Pipeline Stage Library (CPSL)
Microarchitectural diversity is focused along key dimensions:
  Superscalar complexity
    Superscalar width, i.e., number of superscalar "ways"
    Sizes of stage-specific structures for extracting instruction-level parallelism (ILP)
  Sub-pipelining
    Pipeline depth of a canonical stage
  Stage-specific design choices
    E.g., different speculation alternatives, recovery alternatives, etc.
Different flavors of a canonical stage implementation fundamentally trade off the associated design cost against the performance that can be extracted.
Example (Issue stage):
  Width: 1-wide, 2-wide, 3-wide
  Depth: 1-deep, 2-deep, 3-deep
  Issue queue: iq-cam:8, iq-cam:32, iq-cam:64
  Issuing policy: oldest-first, critical-first
  Squash policy: complete or selective
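To make the idea of "flavors" of one canonical stage concrete, here is a minimal Python sketch that enumerates the Issue-stage combinations implied by the slide's example values. The option names (`width`, `iq_cam_entries`, and so on) and the assumption that every combination is valid are illustrative only; the actual CPSL may offer different choices or exclude some combinations.

```python
from itertools import product

# Illustrative Issue-stage options copied from the slide's example.
# Assumption: all dimensions combine freely; the real CPSL may differ.
ISSUE_STAGE_OPTIONS = {
    "width": [1, 2, 3],
    "depth": [1, 2, 3],
    "iq_cam_entries": [8, 32, 64],
    "issue_policy": ["oldest-first", "critical-first"],
    "squash_policy": ["complete", "selective"],
}


def enumerate_variants(options):
    """Return every combination of option values as a list of dicts."""
    names = list(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


variants = enumerate_variants(ISSUE_STAGE_OPTIONS)
print(len(variants))  # 3 * 3 * 3 * 2 * 2 = 108
print(variants[0])
```

Even this small example yields 108 Issue-stage variants, which suggests how quickly the number of whole-core combinations grows once every canonical stage has its own set of choices.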

9 Core Generator: App. 1
(Figure: a core configuration for App. 1 selects stage designs from the CPSL, e.g., Fetch, Rename, and Issue. The Core Generator places them in the Canonical Superscalar Template (Fetch, Decode, Rename, Dispatch, Issue, Register Read, Execute, Writeback, Retire) and outputs synthesizable RTL of the customized core.)

10 Core Generator: App. 2
(Figure: a core configuration for App. 2 selects stage designs from the CPSL, e.g., Fetch, Rename, and Issue. The Core Generator places them in the Canonical Superscalar Template (Fetch, Decode, Rename, Dispatch, Issue, Register Read, Execute, Writeback, Retire) and outputs synthesizable RTL of the customized core.)
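The composition step can be sketched in a few lines of Python. This is a sketch under stated assumptions, not FabScalar's actual generator: the library layout, the module naming scheme (`Fetch_w4_d2`), and the configuration format are all hypothetical. It only illustrates that, with a fixed stage order and invariant interfaces, generating a core reduces to picking one library design per canonical stage.

```python
# Canonical stage order as listed on the template slide.
CANONICAL_STAGES = [
    "Fetch", "Decode", "Rename", "Dispatch", "Issue",
    "RegisterRead", "Execute", "Writeback", "Retire",
]


def build_library(widths, depths):
    """Hypothetical CPSL: map (stage, width, depth) to an RTL module name."""
    return {
        (stage, w, d): f"{stage}_w{w}_d{d}"
        for stage in CANONICAL_STAGES
        for w in widths
        for d in depths
    }


def generate_core(config, library):
    """Pick one library design per canonical stage, in template order."""
    modules = []
    for stage in CANONICAL_STAGES:
        choice = config[stage]
        key = (stage, choice["width"], choice["depth"])
        if key not in library:
            raise KeyError(f"no CPSL design for {key}")
        modules.append(library[key])
    return modules


library = build_library(widths=[1, 2, 4], depths=[1, 2, 3])

# Hypothetical configuration: a 4-wide core with a 2-deep Fetch stage.
config = {stage: {"width": 4, "depth": 1} for stage in CANONICAL_STAGES}
config["Fetch"]["depth"] = 2

core = generate_core(config, library)
print(core)
```

Because the stage list and interfaces never change, a different application only changes `config`; the generator itself stays the same.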

11 Addressing Design-Effort Problem
FabScalar boosts designer productivity by generating RTL designs of whole cores
  Quality RTL is an essential starting point of the chip design cycle
  Starting point for design tuning, verification, and physical design
Highly-ported RAMs and CAMs are pervasive in superscalars
  FabMem: generates layouts of highly-ported RAMs and CAMs
  See paper for more about FabMem

12 Outline
Quality assessment of FabScalar-generated cores
  Functional and IPC validation
  Timing validation
  Suitability for standard ASIC and FPGA flows
Extensibility of CPSL
G21: a workload-agnostic heterogeneous multi-core
Future work and conclusion

13 Validation Results
Evaluate the quality of the register-transfer-level (RTL) designs produced by FabScalar along three fronts:
  Functional and IPC validation
  Timing validation
  Suitability for physical design

Task                     EDA tool(s) / library used
Functional verification  Cadence NC-Verilog, vers s006
Logic synthesis          Synopsys Design Compiler, vers. X SP3
Place & route            Cadence SoC Encounter, vers. 7.1
Standard cell library    FreePDK 45nm
SPICE model              BSIM4 Predictive Technology Model

14 Functional & IPC Validation
Unit testing on isolated canonical pipeline stages of different widths/depths
Generate multiple arbitrary cores and run CPU2000 benchmarks on them

Configurations of Core-1 through Core-12 (one table row per line, values in the order they appear):
  Fetch/Decode/Rename/Dispatch width: 4, 5, 6, 8, 2
  Issue/RR/Execute/WB/Retire width
  Function unit mix (simple, complex, branch, load/store): 1,1,1,1; 3,1,1,1; 2,1,1,1; 5,1,1,1
  Fetch queue: 16, 32, 64
  Active list (ROB): 128, 256, 512
  Physical register file (PRF): 96, 192
  Issue queue (IQ)
  Load queue / store queue (LQ/SQ): 32 / 32, 16 / 16
  Branch predictor: bimodal, bimodal with block-ahead, gshare
  Branch history table (BHT) (# entries): 64K
  Branch target buffer (BTB) (# entries): 4K
  Return address stack (RAS)
  Branch order buffer (BOB)
  Fetch depth: 3
  Rename depth
  Issue depth, total / wakeup-select loop: 2 / 2, 1 / 1, 3 / 2
  Register Read (and Writeback) depth: 1
  Fetch-to-execute pipeline depth: 10, 9, 14, 12, 15

We have done unit testing on isolated stage designs of different widths/depths, inserted them into the canonical structure, and observed that the top-level canonical interfaces have held up due to the natural decoupling of these interfaces from within-stage variations.

17 Functional & IPC Validation
RTL successfully simulates 100M-instruction SimPoints from different benchmarks
IPC from RTL closely tracks IPC from the C++ simulator
IPC differences among cores correlate with microarchitecture differences

18 Timing Validation
Compare cycle times and raw fetch-to-execute delays of FabScalar-generated cores with three different commercial processors:
  90nm POWER5
  180nm Alpha 21364
  65nm MIPS32 74K
All processors implement RISC ISAs
They represent extremes from highly custom designed to fully synthesized
Convert all delays to fanout-of-4 (FO4)
  Expressing delays in FO4 makes the comparison technology-agnostic.
Integer pipeline: fetch-to-execute
(Figure: POWER5 pipeline stages [B. Sinharoy et al., IBM Journal of R&D, 2005])
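The FO4 normalization amounts to dividing an absolute delay by the delay of one fanout-of-4 inverter in the same process. The short Python sketch below illustrates this; the function name and all numbers are hypothetical and are not taken from the paper or from any of the three processors.

```python
def to_fo4(delay_ps, fo4_ps):
    """Express a delay in units of one fanout-of-4 inverter delay."""
    if fo4_ps <= 0:
        raise ValueError("fo4_ps must be positive")
    return delay_ps / fo4_ps


# Hypothetical numbers, for illustration only.
# Design A: 1000 ps cycle time in a process where one FO4 delay is 40 ps.
# Design B: 500 ps cycle time in a process where one FO4 delay is 20 ps.
design_a = to_fo4(delay_ps=1000.0, fo4_ps=40.0)
design_b = to_fo4(delay_ps=500.0, fo4_ps=20.0)

print(design_a, design_b)  # 25.0 25.0
```

The two hypothetical designs differ by 2x in absolute cycle time but have the same logic depth per cycle once expressed in FO4, which is why FO4 is useful for comparing cores built in different technology nodes.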

19 Timing Validation
Columns: Power5, Alpha 21364, MIPS 74K (one table row per line, values in the order they appear)
  Fetch width: 8; 4
  Dispatch width: 5; 2
  Issue width: 6; 1
  Fetch queue: 24; 12
  Issue queue(s): Int+Ld/St: 36, FP: 24, Br.: 12, CR: 10; Int: 20, FP: 15; Int: 8, Agen: 8
  Physical register file(s): Int: 120, FP: 120; Int: 80, FP: 72; 64
  Load queue / store queue: 32 / 32; 8 / 8
  L1 I$ / L1 D$ (KB): 64 / 32; 64 / 64
  Fetch-to-execute pipeline depth
  Cycle time of commercial core: 23 FO4; 25 FO4; 33 FO4
  Cycle time of FabScalar core: 29 FO4; 37 FO4; 32 FO4
  Cycle time of deeper FabScalar core: 25 FO4 (depth=15); 26 FO4 (depth=11); N/A
  Raw fetch-to-execute delay of FabScalar core: 291 FO4; 188 FO4; 384 FO4
  Cycle time of FabScalar core with ideal latch-based design: 24 FO4
Annotations: 17% and 14% with ideal latch-based design
Expressing delays in FO4 makes the comparison technology-agnostic.

20 Physical Design Validation
Synthesized and placed-and-routed a 4-way superscalar processor
Also synthesized the same core to a Virtex-5 FPGA in a BEE3 system

ASIC flow
  Technology                        45nm
  Die area (excluding L1 caches)    2.6 mm²
  Clock frequency                   500MHz
  Timing-critical path              Next-PC logic

FPGA flow
                                    bzip     gzip     mcf      parser
  FPGA & Verilog retired instr.     10M      10M      10M      10M
  FPGA & Verilog cycles             1.13M    1.7M     0.84M    1.16M
  FPGA & Verilog IPC                0.89     0.59     1.20     0.86
  FPGA simulation time (s)          0.75     1.22     1.21     0.87
  Verilog simulation time (s)       4,018    5,536    2,870    3,748
  Simulation speedup                5,357    4,538    2,372    4,308
  FPGA speed (MHz)                  50       50       50       50
  FPGA effective speed (MHz)        15       14       7        13
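The simulation speedup row follows directly from the two simulation-time rows: it is the Verilog simulation time divided by the FPGA simulation time. The Python sketch below recomputes it from the slide's numbers; the variable names are ours.

```python
# Simulation times in seconds, copied from the slide.
verilog_seconds = {"bzip": 4018, "gzip": 5536, "mcf": 2870, "parser": 3748}
fpga_seconds = {"bzip": 0.75, "gzip": 1.22, "mcf": 1.21, "parser": 0.87}

# Speedup of FPGA execution over Verilog simulation, rounded to an integer.
speedup = {
    bench: round(verilog_seconds[bench] / fpga_seconds[bench])
    for bench in verilog_seconds
}

for bench, factor in speedup.items():
    print(f"{bench}: {factor}x")
```

The recomputed values (5357, 4538, 2372, 4308) agree with the speedups reported on the slide.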

21 Extensibility
Extensibility of CPSL is important for proliferating microarchitectural diversity
Two examples:
  LMP: Load Misspeculation Predictor (fix IPC bottleneck)
  DEAP: Decoupled Effective Ahead Pipelining for conditional branch predictors (fix cycle time)

Design  Canonical pipestages modified    Signals added to interface      Implementation effort
LMP     Dispatch, Execute (LSU), Retire  Dispatch/LSU, Retire/Dispatch   2 days, 1 author
DEAP    Fetch                            -                               14 days, 1 author

22 Providing Diverse Cores
With FabScalar, a chip with many different superscalar core types is conceivable
  The FabScalar framework provides a design space of almost 38,000 different cores
  Fit the most complex configuration for a given clock period and pipeline depth (to maximize single-thread performance)
Prior approaches
  Customize a core to a specific application
  Co-customize multiple cores to a specific multiprogrammed workload
Customizing cores to specific workloads has two drawbacks:
  Computationally intensive design-space exploration
  Not robust for general performance (or for an arbitrary workload)
How many cores to provide for optimizing single-thread performance?

23 A Workload-Agnostic Heterogeneous Multi-Core
G21: generic heterogeneous multi-core
  Not trained for specific workloads
  21 core types provide a wide range of microarchitectural diversity
  Maximizes single-thread performance for arbitrary instruction-level behavior
Three clock frequencies accommodate small (to capture nearby ILP), medium (to capture average ILP), and large (to capture far-flung ILP) structure sizes.
G21 Core Selection (table; axes annotated "larger structures" and "higher frequency"): cycle time (ns) by superscalar width: 2 or 3 0.5 0.6 0.7 4 or 5 0.8 6 or 7 0.9 8 1.0

24 Analysis of G21
Consider two multi-core designs
  Best-1: homogeneous multi-core with a single core type
    The best harmonic mean of BIPS across benchmarks
    I.e., a single core type customized to the workload as a whole
    Best-1 is trained for the workloads in hand.
  G21: proposed heterogeneous multi-core
    Assume ideal benchmark-to-core mapping for G21
Peak BIPS of a benchmark
  Highest BIPS possible for the benchmark, considering the entire design space
  I.e., customize a core to an individual benchmark
  Used only as an upper bound on performance
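The three quantities compared on this slide can be written down compactly. The Python sketch below assumes a table of BIPS values indexed by core type and benchmark; the core names, benchmark names, and numbers are hypothetical placeholders, not results from the paper.

```python
def harmonic_mean(values):
    """Harmonic mean of a sequence of positive numbers."""
    values = list(values)
    return len(values) / sum(1.0 / v for v in values)


def best_single_core(bips):
    """Best-1: core type with the highest harmonic-mean BIPS over all benchmarks."""
    return max(bips, key=lambda core: harmonic_mean(bips[core].values()))


def best_per_benchmark(bips, cores):
    """Ideal benchmark-to-core mapping, restricted to the given core types."""
    benchmarks = list(next(iter(bips.values())))
    return {b: max(bips[c][b] for c in cores) for b in benchmarks}


# Hypothetical BIPS table: bips[core][benchmark].
bips = {
    "core_x": {"app1": 2.0, "app2": 1.0},
    "core_y": {"app1": 1.5, "app2": 1.5},
    "core_z": {"app1": 1.0, "app2": 2.0},
}

best1 = best_single_core(bips)                     # single core for the whole workload
peak = best_per_benchmark(bips, bips.keys())       # entire design space (upper bound)
subset = best_per_benchmark(bips, ["core_x", "core_z"])  # a heterogeneous subset

# Sub-optimality of Best-1 per benchmark, as a fraction of peak BIPS.
shortfall = {b: 1.0 - bips[best1][b] / peak[b] for b in peak}

print(best1, peak, subset, shortfall)
```

In this toy table the balanced `core_y` wins on harmonic mean yet falls 25% short of peak on both benchmarks, while the two-core subset reaches peak on each. The example is constructed to show that outcome; it says nothing about the magnitudes measured for Best-1 and G21.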

25 Analysis of G21
(Figure: Performance of Best-1 and G21)
Best-1 configuration: cycle time 0.6 ns; fetch/issue width 3/4; fetch-to-execute depth 15; L1 I$, L1 D$, IQ, LQ/SQ, Phys. Reg. File: 64, 32, 32/32, 128
Worst sub-optimality (% of peak performance): Best-1 is within 10% of peak performance
Best-1 is trained for the workloads in hand.

26 Analysis of G21
(Figure: Performance of Best-1 and G21)
Best-1 configuration: cycle time 0.6 ns; fetch/issue width 3/4; fetch-to-execute depth 15; L1 I$, L1 D$, IQ, LQ/SQ, Phys. Reg. File: 64, 32, 32/32, 128
Severe sub-optimality (note: sub-optimality can become more severe for an unknown workload)
Best-1 is trained for the workloads in hand.

27 Analysis of G21
(Figure: Performance of Best-1 and G21)
Best-1 configuration: cycle time 0.6 ns; fetch/issue width 3/4; fetch-to-execute depth 15; L1 I$, L1 D$, IQ, LQ/SQ, Phys. Reg. File: 64, 32, 32/32, 128
Worst sub-optimality (% of peak performance): G21 is within 3% of peak performance
Best-1 is trained for the workloads in hand.

28 Analysis of G21
G21 highlights the merits of workload-agnostic design
  Low computational complexity
  Robust performance for unknown workloads
G21 outperforms Best-1 (even though Best-1 is customized to the workload)
  Diversity not only delivers better efficiency (BIPS/Watt) but also delivers better raw performance
  E.g., microarchitecture loops and instruction-level behavior
G21 is highly representative of the entire design space w.r.t. single-thread performance
  Can be used to distill fewer cores

29 Future Work
Correct-by-construction
  Application of formal verification and tools
Expanding CPSL
  Floating-point and multimedia instruction support
  More features
FabScalar selection of cores
  G21 is a preliminary study
  N-of-G21: factor in other metrics, e.g., power, area
FabFPGA
  Automate the mapping of FabScalar-generated cores to FPGAs
  Accelerate verification
  Design space exploration

30 Conclusion
Providing microarchitecturally diverse cores has significant benefits but requires multiple core designs
FabScalar addresses the practical issue of designing and verifying multiple cores
FabScalar is a novel toolset for automatically composing the RTL designs of arbitrary superscalar cores
The FabScalar toolset is available as open-source gateware

