The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005.


1 The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005

2 FPGA vs ASIC Flows
[Diagram: Circuit Design feeding the ASIC Flow and the FPGA Flow]
- Reduced cost for low volume
- Reduced time-to-market
- Programmability affords customization
Designers use FPGAs!

3 Processors and FPGAs
[Diagram: Custom Logic and a Processor placed relative to the FPGA in three ways]
- Option 1: Off-chip processor. Increased board area, cost, and latency.
- Option 2: On-chip “hard” processor. Specialized part, lack of flexibility.
- Option 3: On-chip “soft” processor. Can implement any number of processors; tune each one to meet design constraints.

4 Tuning Processors
- Tuning Processors (application, design constraints): $3, 4 MHz, 800 mW, 2-stage pipeline versus $300, 3.8 GHz, 80 W, 31-stage pipeline
- Tuning Soft Processors (application, design constraints): 500 LEs, 40 MHz, 2-stage pipeline versus 1700 LEs, 160 MHz, 6-stage pipeline
- Automatically Tuning Soft Processors: your area, speed, power tradeoff

5 Understanding Soft Processors
- Tuning requires understanding of the soft processor design space
- We implement many processors and study the design space
[Diagram: Architecture Description -> Synthesized Processor -> Area, Performance, Power]

6 Don’t we already understand architecture?
- Not completely: we can evaluate area, power, performance
- Not accurately (rules of thumb): FPGA CAD tools are very accurate
- Not in the FPGA domain: LUTs vs transistors, relative speed of RAM & multipliers

7 Goals 1. Develop measurement methodology 2. Populate the design space 3. Compare against industrial soft processor(s)

8 Measurement Methodology
- Require a set of metrics: Area, Performance, Power
[Diagram: Circuit Design (RTL) -> FPGA Flow -> Resource Usage, Clock Frequency, Power estimate]

9 Area
- FPGA resources: Logic Elements (LEs: LUT & flip-flop), Multipliers, Big RAM, Medium RAM, Little RAM
- Measure physical area in Equivalent LEs (e.g. a 9-bit multiplier is equivalent to 23 LEs in area)
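
The equivalent-LE metric can be illustrated with a minimal Python sketch. Only the 9-bit multiplier cost (23 LEs) comes from the slide; the function name, the example resource counts, and the `ram_area_in_les` parameter are hypothetical, since the slides do not give RAM conversion factors.

```python
# Only this constant comes from the slide: a 9-bit multiplier ~ 23 LEs of area.
LES_PER_9BIT_MULTIPLIER = 23


def equivalent_les(logic_elements, multipliers_9bit, ram_area_in_les=0):
    """Return total area in equivalent LEs.

    ram_area_in_les is RAM area already converted to LEs by the caller;
    the per-RAM conversion factors are not given in the slides.
    """
    return (logic_elements
            + multipliers_9bit * LES_PER_9BIT_MULTIPLIER
            + ram_area_in_les)


# Hypothetical design: 1000 LEs plus four 9-bit multipliers.
print(equivalent_les(logic_elements=1000, multipliers_9bit=4))  # 1092
```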

10 Performance
- Wall Clock Time = #Cycles * Clock Period
- Clock period: from the CAD tool
- #Cycles: from RTL simulation, averaged over 20 benchmarks:

Source      Benchmark
MiBench     bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia
Freescale   Dhrystone 2.1
Xirisc      bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc
RATEs       dct, gol
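
As a rough illustration of the wall-clock formula, the sketch below combines a simulated cycle count with a CAD-reported clock frequency. The function names and the example numbers are hypothetical, and the arithmetic mean is an assumption: the slide does not say how the benchmarks are averaged.

```python
def wall_clock_time_us(cycle_count, clock_mhz):
    """Wall clock time = #cycles * clock period.

    With the clock in MHz the period is 1/clock_mhz microseconds,
    so the result is in microseconds.
    """
    return cycle_count / clock_mhz


def mean_wall_clock_time_us(cycle_counts, clock_mhz):
    """Arithmetic mean over benchmarks (the averaging method is an assumption)."""
    times = [wall_clock_time_us(c, clock_mhz) for c in cycle_counts]
    return sum(times) / len(times)


# Hypothetical benchmark: 1,000,000 cycles on a 50 MHz design.
print(wall_clock_time_us(1_000_000, 50))  # 20000.0
```

A deeper pipeline usually raises both the clock frequency and the cycle count, which is why both terms have to be measured before two designs can be compared.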

11 Power
- CAD tool can estimate power from an assumed toggle ratio (derived experimentally)
- Total Dynamic Power (mW) ÷ Clock Frequency (MHz) = Dynamic Energy excluding I/O per cycle (nJ/cycle)
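
The unit conversion works out because 1 mW / 1 MHz = 10^-3 W / 10^6 Hz = 10^-9 J = 1 nJ. A minimal Python sketch, with hypothetical function names and example figures:

```python
def dynamic_energy_per_cycle_nj(dynamic_power_mw, clock_mhz):
    """Dynamic energy per cycle in nJ: power (mW) divided by frequency (MHz)."""
    return dynamic_power_mw / clock_mhz


def dynamic_energy_nj(dynamic_power_mw, clock_mhz, cycle_count):
    """Total dynamic energy of a run: energy per cycle times cycle count."""
    return dynamic_energy_per_cycle_nj(dynamic_power_mw, clock_mhz) * cycle_count


# Hypothetical design: 80 mW of dynamic power at 40 MHz.
print(dynamic_energy_per_cycle_nj(80, 40))  # 2.0
```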

12 Metrics summary
- Require the following information:
1. Resource Usage (area; from the CAD tool)
2. Clock Frequency (wall clock time; from the CAD tool)
3. Power Estimate (energy/cycle; from the CAD tool)
4. Cycle Count (wall clock time; from the RTL simulator)

13 RTL-based Design Space Exploration
Complete and accurate understanding of design space
[Diagram: Circuit Design (RTL) and Benchmarks feed the RTL Simulator, giving 1. Correctness and 2. Cycle Count; Circuit Design (RTL) feeds the CAD Tool, giving 3. Area, 4. Clock Frequency and 5. Power]

14 Goals 1. Develop measurement methodology 2. Populate the design space 3. Compare against industrial soft processor(s)

15 Microarchitectural Design Space Exploration
Need fast route to RTL from architectural idea
[Diagram: Circuit Design (RTL) and Benchmarks feed the RTL Simulator, giving 1. Correctness and 2. Cycle Count; Circuit Design (RTL) feeds the CAD Tool, giving 3. Area, 4. Clock Frequency and 5. Power]

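16 SPREE (Soft Processor Rapid Exploration Environment)
[Diagram: the SPREE RTL Generator produces the RTL; the RTL and Benchmarks feed the RTL Simulator, giving 1. Correctness and 2. Cycle Count; the RTL feeds the CAD Tool, giving 3. Area, 4. Clock Frequency and 5. Power]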

17 Goals
1. Develop measurement methodology
2. Populate the design space (SPREE)
   1. Rapidly
   2. With interesting designs
   3. Accurately (minimize overhead)
3. Compare against industrial soft processor(s)

18 Related Work
- Parametrized cores: narrow design space, laborious changes to control
- Architecture Description Languages (ADLs): too robust, inaccurate (simulator based, or behavioural RTL)
- PEAS-III/ASIPMeister [Itoh2000]: non-FPGA specific, ISA design focus

19 SPREE RTL Generator Overview
[Diagram: ISA Description, Datapath Description and Component Library -> SPREE RTL Generator -> Efficiently Synthesizable RTL]
- Interesting: allows for interesting architectures
- Rapidly: simple descriptions
- Accurately: efficient component implementations

20 Some current limitations
- No caches (use fast on-chip RAM)
- Simple in-order issue pipelines
- No dynamic branch prediction
- No OS or exceptions support
- No ISA changes! Need compiler generation to support them; use a subset of MIPS-I

21 Architecture Input
[Diagram: Component Library containing Ifetch, Reg File, ALU, Mul, Data Mem, Write Back]

22 Architecture Input
[Diagram: Component Library (Ifetch, Reg File, ALU, Mul, Data Mem, Write Back) and a Datapath Description connecting Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]

23 Architecture Input
[Diagram: ISA Description, Datapath Description and Component Library feed the SPREE RTL Generator, which outputs the datapath (IF, Reg file, ALU, Mul, Data Mem, Write Back) together with Decode and control]
Control generation saves time and is non-critical

24 Architecture Input: ISA Description
- Generic Operations (GENOPs), e.g. FETCH, RFREAD, ADD, RFWRITE
- MIPS instructions are made of GENOPs
- Example: MIPS ADD (add rd, rs, rt) is built from FETCH, RFREAD, RFREAD, ADD, RFWRITE
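
One way to picture an instruction built from GENOPs is as a small dependency graph. The Python sketch below is illustrative only and is not SPREE's actual input format: the dictionary layout, the `_rs`/`_rt` suffixes and the helper names are all assumptions. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical encoding of MIPS "add rd, rs, rt": each GENOP maps to the
# set of GENOPs whose results it depends on.
ADD_INSTRUCTION = {
    "FETCH": set(),
    "RFREAD_rs": {"FETCH"},
    "RFREAD_rt": {"FETCH"},
    "ADD": {"RFREAD_rs", "RFREAD_rt"},
    "RFWRITE": {"ADD"},
}


def schedule(genop_graph):
    """Return one valid execution order that respects the dependencies."""
    return list(TopologicalSorter(genop_graph).static_order())


def required_genops(genop_graph):
    """Return the distinct GENOP kinds used, ignoring the operand suffix."""
    return {name.split("_")[0] for name in genop_graph}


print(schedule(ADD_INSTRUCTION))
print(sorted(required_genops(ADD_INSTRUCTION)))
```

Describing instructions this way keeps the ISA description independent of any particular datapath: a generator only needs to know which component implements each GENOP.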

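25 Complete Experimental Framework Using SPREE
[Diagram: ISA Description, Datapath Description and Component Library feed the SPREE RTL Generator (one input is labelled FIXED); the generated RTL and Benchmarks feed the RTL Simulator, giving 1. Correctness and 2. Cycle Count; the RTL feeds the CAD Tool, giving 3. Area, 4. Clock Frequency and 5. Power]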

26 Goals 1. Develop measurement methodology 2. Populate the design space 3. Compare against industrial soft processor(s) SPREE Area Performance Power

27 Altera’s NiosII
- Second generation soft processor
- Has three variations:
  - NiosIIe: unpipelined, no hardware multiply
  - NiosIIs: 5 stages, no branch prediction
  - NiosIIf: 6 stages, dynamic branch prediction
- Caveats: supports exceptions, OS, and caches; very similar but tweaked ISA

28 Design Space vs NiosII Variations

29 Summary
1. We span the design space
2. We remain competitive
- Achieved 9% faster and 11% smaller than NiosIIs
- => we don’t suffer from prohibitive overhead
Let’s explore some architecture!

30 Architectural Axes
1. Hardware vs Software Multiplication
2. Shifter implementation
3. Pipeline
   - Depth
   - Organization
   - Forwarding

31 Hardware vs Software Multiplication
- Hardware multiplication increases area & power consumption, and speeds up execution
- BUT not all applications care about speed, and not all applications use multiplication (significantly)

32 Cycle Count Speedup of Hardware Multiplication Must understand its cost/benefit to decide when to use

33 Cost of Hardware Multiply
- ~250 LEs (20%)
- 35% more Energy/cycle

34 Shifter Implementations
- Shifters (multiplexers) are big in FPGAs
- Consider 3 implementations: serial shifter, LUT-based barrel shifter, multiplier-based barrel shifter

35 Impact of Shifter Implementation
- Consistent across different pipe depths

36 Shifter Implementation Tradeoffs
- Averaged over all pipeline depths
- Smallest: Serial
- Fastest: LUT-based barrel
- Energy efficient: Serial
Multiplier-based is a very nice sweet spot

37 Pipelines - Depth
- Study different pipeline depths, over 3 shifters
- Arrows = possible forwarding lines (not used)
- All use predict-not-taken branches

38 Pipelining & clock frequency

39 Impact of Pipelining
- Adds area, can increase speed (2 to 3 stage?)

40 FPGA Nuance: Synchronous RAMs
[Diagram: 2-stage pipeline with Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]
- Stall on all loads, and any operand fetches

41 3-stage Pipeline
[Diagram: 3-stage pipeline with Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]
- Fewer stalls, increased frequency => big speedup (1.7x)

42 3, 4 and 5 stage pipelines
- Increased area, small change in performance
- => Deeper pipelines have potential for better speedups

43 The 7-stage Pipeline: Where Branch Delay Slots Break Down
- The ideal case:
[Diagram: BEQ, OR, JR, ADD in the pipeline; X marks squashed stages; label: "Never squash this stage"]

44 Problem: Separation of Branch and Branch Delay Slot
[Diagram: BEQ, ADD, JR in the pipeline; stalls on RAW hazard]

45 Problem: Separation of Branch and Branch Delay Slot
[Diagram: BEQ, ADD, JR, NOP in the pipeline; X marks a squashed stage]
- Must track and protect delay slots

46 Multiple Delay Slots
- Must detect separation of branch from delay slot
- OR prevent multiple delay slots: stall the branch if a delay slot exists in the pipe. We did this one (+30 LEs, -15% clock frequency)
[Diagram: BEQ, OR, JR, ADD]
- Can’t guard all delay slots. Better off eliminating delay slots (currently researching)
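
The "stall branch if a delay slot exists in the pipe" rule can be sketched as a simple interlock check. This is an illustrative Python model, not the generated RTL: the `Instr` record, its flags and `must_stall_branch` are hypothetical names, and the pipeline is modelled as a plain list of in-flight instructions.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    is_branch: bool = False
    is_delay_slot: bool = False


def must_stall_branch(incoming, in_flight):
    """Stall an incoming branch while any in-flight instruction is a delay slot.

    in_flight lists the contents of the later pipeline stages; None is a bubble.
    This keeps at most one delay slot in the pipe at a time.
    """
    return incoming.is_branch and any(
        instr is not None and instr.is_delay_slot for instr in in_flight
    )


# Sequence from the slide: BEQ, OR (delay slot of BEQ), JR, ADD.
pipe = [Instr("OR", is_delay_slot=True), Instr("BEQ", is_branch=True)]
print(must_stall_branch(Instr("JR", is_branch=True), pipe))  # True
```

The check is cheap in logic, but it sits on the path that decides whether the front of the pipeline advances, which is consistent with the clock frequency cost reported on the slide.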

47 Pipeline organization
- Where stages are placed is important
- Pipe stage placement can result in an all-around “win/loss”, or present a tradeoff

48 Forwarding
- SPREE supports stage-to-stage forwarding
[Diagram: Ifetch, Reg File, ALU, Mul, Data Mem, Write Back, with forward lines for rs and rt]
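
A forward line for one source operand amounts to a comparison plus a multiplexer. The Python sketch below models that selection; the function and parameter names are hypothetical, and it relies only on the MIPS convention that register 0 always reads as zero.

```python
def read_operand(reg, regfile, forward_dest, forward_value):
    """Select the operand value for source register `reg`.

    forward_dest is the destination register of the result still in a later
    stage (None if there is none); forward_value is that result. Register 0
    is never forwarded because it is hard-wired to zero in MIPS.
    """
    if forward_dest is not None and forward_dest == reg and reg != 0:
        return forward_value  # take the in-flight result
    return regfile[reg]       # otherwise use the register file copy


regfile = [0] * 32
regfile[5] = 7  # stale value; a newer result for r5 is still in the pipe
print(read_operand(5, regfile, forward_dest=5, forward_value=99))  # 99
print(read_operand(6, regfile, forward_dest=5, forward_value=99))  # 0
```

The same check is applied independently to rs and rt, matching the two forward lines named on the slide.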

49 Effect of Forwarding 20% speed increase

50 An Aside: ISA Subsetting
- Applications don’t generally use all instructions

51 Processor reduction
- Can strip away unused components/control
- Generator supports instruction disabling, and automatically strips away unused components
- Create an application-specific processor; do this for each benchmark
- FPGAs are a good platform for this!
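
The idea of stripping unused components can be sketched as a set computation: keep only the components needed by the instructions an application actually uses. The instruction-to-component table below is a hypothetical example in Python, not SPREE's real component library.

```python
# Hypothetical mapping from instruction to the datapath components it needs.
COMPONENTS_USED = {
    "add":  {"ifetch", "regfile", "alu"},
    "mult": {"ifetch", "regfile", "mul"},
    "sll":  {"ifetch", "regfile", "shifter"},
    "lw":   {"ifetch", "regfile", "alu", "data_mem"},
}

ALL_COMPONENTS = set().union(*COMPONENTS_USED.values())


def subset_processor(used_instructions):
    """Return (kept, removed) component sets for an application's instructions."""
    kept = set().union(*(COMPONENTS_USED[i] for i in used_instructions))
    return kept, ALL_COMPONENTS - kept


# An application that never multiplies or shifts loses both units.
kept, removed = subset_processor({"add", "lw"})
print(sorted(removed))  # ['mul', 'shifter']
```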

52 Area of a Subsetted Processor

53 Speed of a Subsetted Processor

54 Conclusion
- Understanding architectural trade-offs => maximize efficiency
- Developed SPREE & measurement methodology
- Performed preliminary architectural study:
  - Quantified cost of hardware multiplication
  - Explored shift unit implementations
  - Explored pipelines: depth, organization, forwarding

