Application-Specific Customization of Soft Processor Microarchitecture
  Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
  University of Toronto, Electrical and Computer Engineering
2 Processors and FPGA Systems
  [Figure: FPGA system containing a soft processor, memory interface, UART, Ethernet, and custom logic]
  Processors are the “heart” of FPGA systems
    Performs coordination and even computation
    Better processors => less hardware to design
  We seek improvement through customization
3 Enablers for customizing soft processors
  1. FPGA Reconfigurability
     No hardware cost for altering a design
  2. Applications differ in architectural requirements
     Can specialize architecture for each application
  3. A soft processor might be used to run either:
     a) A single application
     b) A single class of applications
     c) Many applications, but can be reconfigured
  We want to evaluate effectiveness of specialization
4 Research Goals
  1. Investigate “Application-tuning”
     Tune microarchitecture to favour an application
     Preserve general purpose functionality
  2. Investigate “Instruction-set Subsetting”
     Sacrifice general purpose functionality
     Eliminate hardware not required by application
  Investigate efficiency through real implementations
5 SPREE System (Soft Processor Rapid Exploration Environment)
  ■ Input: Processor description (ISA, datapath)
  ■ SPREE System
    1. Verify ISA against datapath
    2. Datapath instantiation
    3. Control generation
       ■ Multi-cycle/variable-cycle FUs
       ■ Multiplexer select signals
       ■ Interlocking
       ■ Branch handling
  ■ Output: Synthesizable Verilog (RTL)
6 Back-end Infrastructure
  RTL is evaluated along two paths:
    Modelsim RTL Simulator, running benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
      1. Cycle count
    Quartus II 5.0 CAD software, targeting Stratix 1S40C5
      2. Resource usage
      3. Clock frequency
      4. Power
  We can measure area/performance/energy accurately
7 Exploration of Architectural Customizations 1. Architectural-tuning 2. Instruction-set subsetting
8 What exactly are we tuning?
  We focus on core microarchitecture
    Hardware vs software multiplication
    Shifter implementation
    Pipelining
      Depth
      Organization
      Forwarding
  Not ISA (we use MIPS-I)
9 Comparison to Altera’s Nios II
  Has three variations:
    Nios II/e: unpipelined, no HW multiplier
    Nios II/s: 5-stage, with HW multiplier
    Nios II/f: 6-stage, dynamic branch prediction
  Caveats: not a completely fair comparison
    Very similar but tweaked ISA
    Nios II supports exceptions, OS, and caches
      We do not, and so save on the hardware costs
  We believe the comparison is meaningful
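10 SPREE vs Nios II
  [Figure: plot comparing SPREE and Nios II processors, with directions marked "smaller" and "faster"; annotated design point: 3-stage pipe, HW multiply, multiply-based shifter]
  Competitive while allowing more customization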
11 1. Architectural Tuning Experiment
  Axes explored:
    Hardware vs software multiplication
    Shifter implementation
    Pipelining
      Depth
      Organization
      Forwarding
  What is the best overall (general purpose) configuration?
  What are the best per-application (application-tuned) configurations?
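The two questions on this slide can be sketched in a few lines of Python. This is only an illustration, not the SPREE flow itself: the configuration names, benchmark names, and scores are made up, and using the geometric mean to pick the "best overall" configuration is an assumption (the slides do not say how the general purpose processor was chosen).

```python
from math import prod

# Hypothetical performance-per-area scores (higher is better); NOT measured data.
scores = {
    "cfg_a": {"bench1": 1.0, "bench2": 2.0, "bench3": 4.0},
    "cfg_b": {"bench1": 2.0, "bench2": 2.0, "bench3": 3.0},
    "cfg_c": {"bench1": 1.0, "bench2": 4.0, "bench3": 1.0},
}

def geomean(values):
    """Geometric mean of a collection of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))

def best_general_purpose(scores):
    """Single configuration with the highest geometric-mean score (assumed criterion)."""
    return max(scores, key=lambda cfg: geomean(scores[cfg].values()))

def best_per_application(scores):
    """For each benchmark, the configuration with the highest score on that benchmark."""
    benchmarks = next(iter(scores.values())).keys()
    return {b: max(scores, key=lambda cfg: scores[cfg][b]) for b in benchmarks}

def average_tuning_gain(scores):
    """Mean relative gain of application-tuned over general purpose, as a fraction."""
    general = best_general_purpose(scores)
    tuned = best_per_application(scores)
    gains = [scores[tuned[b]][b] / scores[general][b] for b in tuned]
    return sum(gains) / len(gains) - 1.0

general = best_general_purpose(scores)
tuned = best_per_application(scores)
gain = average_tuning_gain(scores)
```

With real measurements in place of the made-up table, `gain` would correspond to the kind of average improvement over general purpose that the tuning experiment reports.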
12 Performance per Area of All Processors
  Application-tuned: 14.1% average improvement over general purpose, 30% for some applications
13 2. Instruction-set Subsetting
  Reduce processor by reducing the ISA
    Eliminate unused parts of the ISA
    Can create application-specific processor
  SPREE automatically removes
    Unused connections
    Unused components
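As a rough illustration of the idea, subsetting can be modelled as dropping every datapath component that no instruction used by the application needs. The component names and the component-to-instruction mapping below are hypothetical, and SPREE's actual removal of unused connections and components is not described in enough detail on the slides to reproduce here.

```python
# Hypothetical mapping from datapath component to the MIPS-I instructions that need it.
COMPONENT_USERS = {
    "adder": {"addu", "subu", "lw", "sw", "beq"},
    "multiplier": {"mult", "multu"},
    "shifter": {"sll", "srl", "sra"},
    "data_memory_port": {"lw", "sw"},
}

def subset_datapath(component_users, used_instructions):
    """Split components into those kept (needed by at least one used
    instruction) and those removed (needed by none)."""
    kept = {
        name
        for name, users in component_users.items()
        if users & used_instructions
    }
    removed = set(component_users) - kept
    return kept, removed

# Example: an application that never multiplies or shifts.
used = {"addu", "lw", "sw", "beq"}
kept, removed = subset_datapath(COMPONENT_USERS, used)
```

In this toy example the multiplier and shifter are removed, which is the kind of hardware reduction the slide refers to.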
14 Instruction-set Usage of Benchmark Set Applications do not use complete ISA Strong potential for hardware reduction
15 Area Reduction from Instruction-set Subsetting
  [Figure axis: Fraction of Area]
  Area reduced by 60% in some applications, 25% on average
16 Combining Application Tuning and Instruction-set Subsetting 33.2% Efficiency Gain: Subsetting 16%, Combined 24.5%
17 Summary of Presented Architectural Conclusions
  Application tuning
    14% average efficiency gain
    Will only increase as we explore more architectures
  Instruction-set Subsetting
    Up to 60% area & energy savings
    16% average efficiency gain
  Combined Application tuning & Subsetting
    24.5% average efficiency gain
18 General Purpose vs App-tuned vs Nios II
  Choose best Nios II overall and per application
  SPREE customizations allow 17% better efficiency than Nios II
19 Future Work
  Consider other exciting architectural axes
    Branch prediction, aggressive forwarding
    ISA changes
    Datapaths (e.g. VLIW)
    Caches and memory hierarchy
  Compiler assistance
    Can improve tuning & subsetting
20 Metrics for Measurement
  Efficiency: Performance per area
  Performance: MIPS
  Area: Equivalent Stratix Logic Elements (LEs)
    Relative silicon areas used for RAMs/Multipliers
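A minimal sketch of how these metrics combine, assuming MIPS is derived from a benchmark's executed instruction count, its simulated cycle count, and the synthesized clock frequency. The function names and the example numbers are illustrative only, not values from the study.

```python
def mips(instruction_count, cycle_count, clock_mhz):
    """Millions of instructions per second.

    Execution time in microseconds is cycle_count / clock_mhz, and
    instructions per microsecond equals MIPS.
    """
    return instruction_count * clock_mhz / cycle_count

def performance_per_area(instruction_count, cycle_count, clock_mhz, equivalent_les):
    """Efficiency metric: MIPS divided by area in equivalent logic elements."""
    return mips(instruction_count, cycle_count, clock_mhz) / equivalent_les

# Illustrative numbers only: 1M instructions in 2M cycles at 80 MHz on 1000 LEs.
example_mips = mips(1_000_000, 2_000_000, 80.0)
example_eff = performance_per_area(1_000_000, 2_000_000, 80.0, 1000)
```

Under these assumed inputs the processor sustains 40 MIPS, or 0.04 MIPS per equivalent LE.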
21 Energy Impact of Subsetting Up to 60% energy savings and 25% on average
22 What exactly are we tuning?
  [Figure: processor design space: Instruction Set, ISA Extensions (Tensilica, Stretch), Memory Hierarchy, and Microarchitecture (Control, Pipeline, Datapath, FUs, Reg File)]
  Tuned: HW multiply FU, shifter type, pipelining (depth, organization, forwarding)
  Are we tuning enough?
24 Processors and FPGA Designs
  [Figure: FPGA containing a soft processor, custom logic, UART, Ethernet, and memory interface]
  Our goal is to explore customization of soft processors