Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Edward S. Rogers Sr. Department of Electrical and Computer Engineering
Processors and FPGA Systems Processors lie at the “heart” of FPGA systems UART Custom Logic Soft Processor Memory Interface Ethernet Performs coordination and even computation Better processors => less hardware to design We seek improvement through customization
Motivating Application-Specific Customizations of Soft Processors FPGA Configurability Can consider unlimited processor variants A soft processor might be used to run either: A single application A single class of applications Many applications, but can be reconfigured Applications differ in architectural requirements Can specialize architecture for each application Platform allows it (1), FPGA designs allow it (2), it can be beneficial (3) We want to evaluate effectiveness of specialization
Research Goals To investigate The potential for “Application-tuning” Tune processor microarchitecture to favour an application Preserve general purpose functionality “Instruction-set Subsetting” Sacrifice general purpose functionality Eliminate hardware not required by application Combination of both methods We still build a complete processor, but pick and choose the components and pipeline to use which is best for the application. Call this Application-tuning Measure efficiency gained through real implementations
SPREE System (Soft Processor Rapid Exploration Environment) Description ISA Datapath Input: Processor description Made of hand-coded components SPREE System Verify ISA against datapath SPREE Datapath Instantiation Control Generation Multi-cycle/variable-cycle FUs Multiplexer select signals Interlocking Branch handling Exports to synthesizable Verilog RTL RTL Output: Synthesizable Verilog
Back-End Infrastructure RTL Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc) Modelsim RTL Simulator Quartus II 4.2 CAD Software Stratix 1S40C5 Cycle Count 2. Resource Usage 3. Clock Frequency 4. Power We can measure area/performance/energy accurately
Comparison to Altera’s Nios II Has three variations: Nios II/e – unpipelined, no HW multiplier Nios II/s – 5-stage, with HW multiplier Nios II/f – 6-stage, dynamic branch prediction
Architectural Parameters Used in SPREE Multiplication Support Hardware FU or software routine Shifter implementation Flipflops, multiplier, or LUTs Pipelining Depth (2-7 stages) Organization Forwarding Ideally everything … What can we gain by customizing the actual core? We focus on core microarchitecture
SPREE vs Nios II faster smaller 3-stage pipe HW multiply Multiply-based shifter EXCITEMENT HERE – this graph should be burned in people’s memory faster smaller
Exploration of Soft Processor Architectural Customizations Architectural-tuning Instruction-set subsetting Combination (Arch-tuning + Subsetting)
1. Architectural Tuning Experiment Vary the same parameters Multiplication Support Shifter implementation Pipelining Determine Best overall (general purpose) processor Best per application (application-tuned) Metric: Performance per Area (MIPS/LE) Basically inverse of Area-Delay product Using SPREE, we’ve varied the following
Performance per Area of All Processors 32% 14.1%
2. Instruction-set Subsetting SPREE automatically removes Unused connections Unused components Reduce processor by reducing the ISA Can create application-specific processor Eliminate unused parts of the ISA
Instruction-set Usage of Benchmarks Applications do not use complete ISA Dynaminc instruction Strong potential for hardware reduction
Area Reduction from Subsetting 23% Fraction of Area Give an example of what this can do for a processor Similar seen for energy Area reduced by 60% in some , 23% on average Similar reductions for energy, small impact on performance
3. Combining Application Tuning and Instruction-set Subsetting Subsetting is effective on its own Can apply subsetting on top of tuning Compare different customization methods Tuning Subsetting Tuning + Subsetting
Combining Application Tuning and Instruction-set Subsetting 25% 16% 14% Tuning reduces the waste that subsetting eliminates
Summary of Presented Architectural Conclusions Application tuning 14% average efficiency gain Will increase with more architectural axes Instruction-set Subsetting Up to 60% area & energy savings 16% average efficiency gain Combined Tuning & Subsetting 25% average efficiency gain
Future Work Consider other promising architectural axes Branch prediction, aggressive forwarding ISA changes Datapaths (eg. VLIW) Caches and memory hierarchy Compiler assistance Can improve tuning & subsetting