The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005
FPGA vs ASIC Flows: A circuit design can be taken through an ASIC flow or an FPGA flow. The FPGA flow offers reduced cost for low volume, reduced time-to-market, and programmability that affords customization. Designers use FPGAs!
Processors and FPGAs: three ways to pair a processor with custom logic.
□ Option 1: Off-chip processor. Increased board area, cost, and latency.
□ Option 2: On-chip “hard” processor. Specialized part, lack of flexibility.
□ Option 3: On-chip “soft” processor. Can implement any number of processors and tune each one to meet design constraints.
Tuning Processors: Given an application and design constraints, hard processors range from $3, 4 MHz, 800 mW, 2-stage pipeline to $ GHz, 80 W, 31-stage pipeline.
Tuning Soft Processors: Given an application and design constraints, soft processors range from 500 LEs, 40 MHz, 2-stage pipeline to 1700 LEs, 160 MHz, 6-stage pipeline.
Automatically Tuning Soft Processors: Given an application and design constraints, choose your area, speed, power tradeoff.
Understanding Soft Processors: Tuning requires an understanding of the soft processor design space. We implement many processors and study the design space: each architecture description is synthesized into a processor, which is measured for area, performance, and power.
Don’t we already understand architecture? Not completely: we can evaluate area, power, and performance. Not accurately (rules of thumb): FPGA CAD tools are very accurate. Not in the FPGA domain: LUTs vs transistors, relative speed of RAM and multipliers.
Goals: 1. Develop measurement methodology. 2. Populate the design space. 3. Compare against industrial soft processor(s).
Measurement Methodology: We require a set of metrics: area, performance, and power. The circuit design (RTL) goes through the FPGA flow, which reports resource usage, clock frequency, and a power estimate.
Area: FPGA resources are logic elements (LEs: a LUT and a flip-flop), multipliers, and little, medium, and big RAM blocks. We measure physical area in equivalent LEs (e.g. a 9-bit multiplier is equivalent to 23 LEs in area).
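The equivalent-LE measure can be sketched as a weighted sum of resource counts. This is a minimal sketch, not the authors' tooling: the names `EQUIV_LE_WEIGHTS` and `equivalent_les` are hypothetical, and only the 9-bit multiplier weight (23 LEs) comes from the slide. Weights for the RAM blocks are not given there, so the caller has to supply them.

```python
# Equivalent-LE weights. Only the 9-bit multiplier figure (23 LEs) is taken
# from the slide; RAM block weights are not stated and must be supplied.
EQUIV_LE_WEIGHTS = {"le": 1.0, "mult9": 23.0}


def equivalent_les(usage, weights=EQUIV_LE_WEIGHTS):
    """Sum resource counts weighted by their equivalent-LE area."""
    missing = set(usage) - set(weights)
    if missing:
        raise KeyError(f"no equivalent-LE weight for: {sorted(missing)}")
    return sum(count * weights[name] for name, count in usage.items())


# Hypothetical design: 1000 LEs plus four 9-bit multipliers.
example_area = equivalent_les({"le": 1000, "mult9": 4})
```

Refusing to guess a weight for an unknown resource keeps the sketch from silently under-reporting area.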
Performance: Wall Clock Time = #Cycles * Clock Period. #Cycles comes from RTL simulation, averaged over 20 benchmarks; Clock Period comes from the CAD tool.
Source: Benchmarks
MiBench: bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia
Freescale: Dhrystone 2.1
Xirisc: bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc
RATEs: dct, gol
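The wall clock formula can be sketched as below. The function name and the cycle counts are hypothetical, chosen only to show that a design needing more cycles can still finish sooner if its clock is fast enough; they are not measurements from the talk.

```python
def wall_clock_seconds(cycles, clock_mhz):
    """Wall clock time = #cycles * clock period, with period = 1 / frequency."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    clock_period_s = 1.0 / (clock_mhz * 1e6)
    return cycles * clock_period_s


# Hypothetical cycle counts for the same program on two designs.
slow = wall_clock_seconds(cycles=2_000_000, clock_mhz=40)
fast = wall_clock_seconds(cycles=2_400_000, clock_mhz=160)
speedup = slow / fast
```

Here the second design takes 20% more cycles but runs at four times the clock rate, so cycle count alone would rank the two designs the wrong way round.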
Power: The CAD tool can estimate power from an assumed toggle ratio (derived experimentally). Total dynamic power excluding I/O (mW) ÷ clock frequency (MHz) = dynamic energy per cycle (nJ/cycle).
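The unit conversion behind the energy metric can be checked with a short sketch. The function names and the example numbers are hypothetical, and `total_energy_nj` is an added convenience (energy per cycle times cycle count) rather than a metric named on the slide.

```python
def energy_per_cycle_nj(dynamic_power_mw, clock_mhz):
    """mW / MHz = (1e-3 J/s) / (1e6 cycles/s) = 1e-9 J/cycle = nJ/cycle."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return dynamic_power_mw / clock_mhz


def total_energy_nj(dynamic_power_mw, clock_mhz, cycles):
    """Dynamic energy for a whole run: energy per cycle times cycle count."""
    return energy_per_cycle_nj(dynamic_power_mw, clock_mhz) * cycles


# Hypothetical figures: 100 mW of dynamic power at 50 MHz.
example_nj_per_cycle = energy_per_cycle_nj(100.0, 50.0)
example_run_nj = total_energy_nj(100.0, 50.0, cycles=1_000)
```

Dividing by frequency removes the clock rate from the comparison, so two designs can be compared on energy even when they run at different speeds.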
Metrics summary: We require the following information: 1. Resource usage (area, from the CAD tool). 2. Clock frequency (wall clock time, from the CAD tool). 3. Power estimate (energy/cycle, from the CAD tool). 4. Cycle count (wall clock time, from the RTL simulator).
RTL-based Design Space Exploration: Gives a complete and accurate understanding of the design space. The circuit design (RTL) runs the benchmarks in the RTL simulator to obtain (1) correctness and (2) cycle count, and goes through the CAD tool to obtain (3) area, (4) clock frequency, and (5) power.
Goals: 1. Develop measurement methodology. 2. Populate the design space. 3. Compare against industrial soft processor(s).
Microarchitectural Design Space Exploration: We need a fast route to RTL from an architectural idea. The circuit design (RTL) runs the benchmarks in the RTL simulator for (1) correctness and (2) cycle count, and goes through the CAD tool for (3) area, (4) clock frequency, and (5) power.
SPREE (Soft Processor Rapid Exploration Environment): The SPREE RTL generator produces the RTL. The RTL simulator runs the benchmarks on it for (1) correctness and (2) cycle count, and the CAD tool reports (3) area, (4) clock frequency, and (5) power.
Goals: 1. Develop measurement methodology. 2. Populate the design space (SPREE): rapidly, with interesting designs, and accurately (minimize overhead). 3. Compare against industrial soft processor(s).
Related Work: Parametrized cores: narrow design space, laborious changes to control. Architecture Description Languages (ADLs): too robust, inaccurate (simulator based, or behavioural RTL). PEAS-III/ASIPMeister [Itoh2000]: non-FPGA specific, ISA design focus.
SPREE RTL Generator Overview: The SPREE RTL generator takes an ISA description, a datapath description, and a component library, and produces efficiently synthesizable RTL. Interesting: allows for interesting architectures. Rapidly: simple descriptions. Accurately: efficient component implementations.
Some current limitations: No caches (use fast on-chip RAM). Simple in-order issue pipelines. No dynamic branch prediction. No OS or exceptions support. No ISA changes! These would need compiler generation to support, so we use a subset of MIPS-I.
Architecture Input: Component Library and Datapath Description. [Figure: the component library supplies Mul, Ifetch, Reg File, ALU, Write Back, and Data Mem; the datapath description connects these components into a datapath.]
Architecture Input: The SPREE RTL generator takes the ISA description, datapath description, and component library, and adds decode and control to the datapath. [Figure: datapath of Mul, IF, Reg file, ALU, Write Back, and Data Mem, with Decode added.] Control generation saves time and is non-critical.
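Architecture Input: ISA Description. The ISA is described with Generic Operations (GENOPs) such as FETCH, RFREAD, ADD, and RFWRITE. MIPS instructions are made of GENOPs. Example: MIPS ADD (add rd, rs, rt) is a FETCH feeding two RFREADs, whose results feed an ADD, whose result feeds an RFWRITE.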
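One way to picture an instruction built from GENOPs is as a small dataflow graph. The encoding below is hypothetical and is not SPREE's actual input format. The node names and helper functions are invented for illustration, and the assignment of the two RFREADs to rs and rt follows the MIPS meaning of add rd, rs, rt.

```python
from collections import Counter

# Hypothetical encoding (not SPREE's real format): each entry is
# (node_name, genop, input_node_names), listed in dataflow order.
MIPS_ADD = [
    ("fetch", "FETCH", []),
    ("read_rs", "RFREAD", ["fetch"]),
    ("read_rt", "RFREAD", ["fetch"]),
    ("sum", "ADD", ["read_rs", "read_rt"]),
    ("write_rd", "RFWRITE", ["sum"]),
]


def check_dataflow(instr):
    """Return True if every node only consumes nodes defined before it."""
    seen = set()
    for name, _genop, inputs in instr:
        if any(src not in seen for src in inputs):
            return False
        seen.add(name)
    return True


def genop_counts(instr):
    """Count how many times each GENOP appears in an instruction."""
    return Counter(genop for _name, genop, _inputs in instr)
```

A generator working from such a description could use the per-instruction GENOP counts to decide which datapath components each instruction exercises.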
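Complete Experimental Framework Using SPREE: The SPREE RTL generator takes the component library, the ISA description (FIXED), and the datapath description, and produces RTL. The RTL simulator runs the benchmarks on it for (1) correctness and (2) cycle count, and the CAD tool reports (3) area, (4) clock frequency, and (5) power.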
Goals: 1. Develop measurement methodology (area, performance, power). 2. Populate the design space (SPREE). 3. Compare against industrial soft processor(s).
Altera’s NiosII: Second-generation soft processor with three variations: NiosIIe (unpipelined, no hardware multiply), NiosIIs (5 stages, no branch prediction), and NiosIIf (6 stages, dynamic branch prediction). Caveats: NiosII supports exceptions, an OS, and caches, and its ISA is very similar but tweaked.
Design Space vs NiosII Variations
Summary: 1. We span the design space. 2. We remain competitive: we achieved a design 9% faster and 11% smaller than NiosIIs, so we don’t suffer from prohibitive overhead. Let’s explore some architecture!
Architectural Axes: 1. Hardware vs software multiplication. 2. Shifter implementation. 3. Pipeline: depth, organization, forwarding.
Hardware vs Software Multiplication: Hardware multiplication increases area and power consumption and speeds up execution, BUT not all applications care about speed, and not all applications use multiplication (significantly).
Cycle Count Speedup of Hardware Multiplication: We must understand its cost/benefit to decide when to use it.
Cost of Hardware Multiply: ~250 LEs (20%) and 35% more energy/cycle.
Shifter Implementations: Shifters (multiplexers) are big in FPGAs. Consider 3 implementations: serial shifter, LUT-based barrel shifter, and multiplier-based barrel shifter.
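The idea behind two of these shifters can be sketched in software. This assumes 32-bit operands and the usual arithmetic identity that a shift by n is a multiply by a power of two; the slides do not spell out the circuit, so the function names and details here are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def shift_left_serial(value, amount):
    """Shift one bit per iteration, as a serial shifter moves one bit per cycle."""
    result = value & MASK32
    for _ in range(amount & 31):
        result = (result << 1) & MASK32
    return result


def shift_left_multiplier(value, amount):
    """Left shift as a multiply by 2**amount, keeping the low 32 bits."""
    return ((value & MASK32) * (1 << (amount & 31))) & MASK32


def shift_right_logical_multiplier(value, amount):
    """Logical right shift: multiply by 2**(32 - amount), keep the high 32 bits."""
    amount &= 31
    if amount == 0:
        return value & MASK32
    product = (value & MASK32) * (1 << (32 - amount))
    return (product >> 32) & MASK32
```

The serial version takes a number of steps proportional to the shift amount, which is why it trades speed for area. The multiplier version finishes in one multiply and can reuse a multiplier block already on the FPGA.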
Impact of Shifter Implementation: Consistent across different pipe depths.
Shifter Implementation Tradeoffs: Averaged over all pipeline depths. Smallest: serial. Fastest: LUT-based barrel. Most energy efficient: serial. The multiplier-based shifter is a very nice sweet spot.
Pipelines - Depth: Study different pipeline depths over 3 shifters. Arrows = possible forwarding lines (not used). All use predict-not-taken branches.
Pipelining & clock frequency
Impact of Pipelining: Adds area, can increase speed (2 to 3 stage?).
FPGA Nuance: Synchronous RAMs. 2-stage pipeline [Figure: Mul, Ifetch, Regfile, ALU, Write Back, Data Mem]: stalls on all loads and on any operand fetches.
3-stage Pipeline [Figure: Mul, Ifetch, Regfile, ALU, Write Back, Data Mem]: Fewer stalls, increased frequency => big speedup (1.7x).
3, 4 and 5 stage pipelines: Increased area, small change in performance => deeper pipelines have potential for better speedups.
The 7-stage Pipeline: Where branch delay slots break down. The ideal case [Figure: pipeline holding BEQ, OR, JR, ADD, X, X]: never squash this stage …
Problem: Separation of Branch and Branch Delay Slot [Figure: pipeline holding BEQ, ADD, JR]: stalls on RAW hazard …
Problem: Separation of Branch and Branch Delay Slot [Figure: pipeline holding BEQ, ADD, JR, NOP, X]: must track and protect delay slots …
Multiple Delay Slots: Must detect separation of a branch from its delay slot OR prevent multiple delay slots by stalling a branch if a delay slot exists in the pipe. We did this one (+30 LEs, -15% clock frequency). [Figure: pipeline holding BEQ, OR, JR, ADD.] Can’t guard all delay slots. Better off eliminating delay slots: currently researching …
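The rule "stall a branch if a delay slot exists in the pipe" can be sketched as a simple predicate over the in-flight instructions. The `StageEntry` type and the `must_stall_branch` function are hypothetical names for illustration and do not reflect the actual control logic in the generated RTL.

```python
from dataclasses import dataclass


@dataclass
class StageEntry:
    opcode: str
    is_delay_slot: bool = False  # True when the instruction follows a branch


def must_stall_branch(incoming_is_branch, later_stages):
    """Stall an incoming branch while any delay slot is still in the pipe.

    later_stages holds the instructions already in flight; None marks a bubble.
    """
    return incoming_is_branch and any(
        entry is not None and entry.is_delay_slot for entry in later_stages
    )


# BEQ and its delay slot OR are in flight when JR arrives: JR must wait.
in_flight = [StageEntry("OR", is_delay_slot=True), StageEntry("BEQ"), None]
stall_jr = must_stall_branch(True, in_flight)
stall_add = must_stall_branch(False, in_flight)
```

With this rule at most one delay slot is ever in the pipe, so the rest of the control logic never has to protect more than one instruction from being squashed.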
Pipeline organization: Where stages are placed is important. Pipe stage placement can result in an all-around “win/loss” or present a tradeoff.
Forwarding: SPREE supports stage-to-stage forwarding. [Figure: datapath of Mul, Ifetch, Reg File, ALU, Write Back, and Data Mem, with forward lines for rs and rt.]
Effect of Forwarding: 20% speed increase.
An Aside: ISA Subsetting. Applications don’t generally use all instructions.
Processor reduction: We can strip away unused components and control. The generator supports instruction disabling and automatically strips away unused components, creating an application-specific processor. We do this for each benchmark. FPGAs are a good platform for this!
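The subsetting step can be sketched as a set computation: keep only the components needed by the instructions an application actually uses. The instruction-to-component table below is hypothetical and much smaller than a real MIPS-I subset, and `subset_processor` is an illustrative name, not part of SPREE.

```python
# Hypothetical table of which datapath components each instruction needs.
COMPONENTS_NEEDED = {
    "add": {"ifetch", "regfile", "alu", "writeback"},
    "lw": {"ifetch", "regfile", "alu", "data_mem", "writeback"},
    "mult": {"ifetch", "regfile", "mul"},
    "sll": {"ifetch", "regfile", "shifter", "writeback"},
}


def subset_processor(used_instructions, needs=COMPONENTS_NEEDED):
    """Return (kept, removed) component sets for the instructions in use."""
    unknown = set(used_instructions) - set(needs)
    if unknown:
        raise ValueError(f"unknown instructions: {sorted(unknown)}")
    kept = set().union(*(needs[instr] for instr in used_instructions))
    all_components = set().union(*needs.values())
    return kept, all_components - kept


# An application that only adds and loads needs no multiplier or shifter.
kept, removed = subset_processor({"add", "lw"})
```

Running this once per benchmark mirrors the per-benchmark subsetting described above, with each run giving a different application-specific component list.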
Area of a Subsetted Processor
Speed of a Subsetted Processor
Conclusion: Understanding architectural trade-offs => maximize efficiency. Developed SPREE and a measurement methodology. Performed a preliminary architectural study: quantified the cost of hardware multiplication, explored shift unit implementations, and explored pipelines (depth, organization, forwarding).