The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005.


FPGA vs ASIC Flows
[Diagram: a circuit design taken through either the ASIC flow or the FPGA flow]
- Reduced cost for low volume
- Reduced time-to-market
- Programmability affords customization
Designers use FPGAs!

Processors and FPGAs
- Option 1: Off-chip processor. Increased board area, cost, and latency.
- Option 2: On-chip “hard” processor. Specialized part, lack of flexibility.
- Option 3: On-chip “soft” processor. Can implement any number of processors and tune each one to meet design constraints.

Tuning Processors
Application, design constraints:
- $3, 4 MHz, 800 mW, 2-stage pipeline
- $300, 3.8 GHz, 80 W, 31-stage pipeline

Tuning Soft Processors
Application, design constraints:
- 500 LEs, 40 MHz, 2-stage pipeline
- 1700 LEs, 160 MHz, 6-stage pipeline

Automatically Tuning Soft Processors
Application, design constraints, and your area, speed, power tradeoff:
- 500 LEs, 40 MHz, 2-stage pipeline
- 1700 LEs, 160 MHz, 6-stage pipeline

Understanding Soft Processors
- Tuning requires understanding of the soft processor design space
- We implement many processors and study the design space
[Diagram: Architecture Description -> Synthesized Processor -> Area, Performance, Power]

Don’t we already understand architecture?
- Not completely: we can evaluate area, power, and performance
- Not accurately (rules of thumb): FPGA CAD tools are very accurate
- Not in the FPGA domain: LUTs vs transistors, relative speed of RAM and multipliers

Goals 1. Develop measurement methodology 2. Populate the design space 3. Compare against industrial soft processor(s)

Measurement Methodology
- Require a set of metrics: Area, Performance, Power
[Diagram: Circuit Design (RTL) -> FPGA Flow -> Resource Usage, Clock Frequency, Power estimate]

Area
- FPGA resources: Logic Elements (LEs: LUT and flip-flop), multipliers, and RAM blocks (little, medium, big)
- Measure physical area in equivalent LEs (e.g., a 9-bit multiplier is equivalent to 23 LEs in area)
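The equivalent-LE idea can be sketched as a weighted sum over resource types. This is a minimal sketch, not the authors' tool: the function name and the usage numbers are hypothetical, and only the 23-LE figure for a 9-bit multiplier comes from the slide. The talk does not give the RAM block costs, so they are left out rather than guessed.

```python
def equivalent_les(usage, cost_per_unit):
    """Sum resource counts weighted by each resource's area in equivalent LEs."""
    missing = set(usage) - set(cost_per_unit)
    if missing:
        # Refuse to guess a cost that was never supplied.
        raise KeyError(f"no equivalent-LE cost for: {sorted(missing)}")
    return sum(count * cost_per_unit[name] for name, count in usage.items())


# Cost table: an LE counts as itself; 23 LEs per 9-bit multiplier is from the slide.
COSTS = {"le": 1, "mult9": 23}

# Hypothetical resource usage report for one processor.
usage = {"le": 1000, "mult9": 4}
print(equivalent_les(usage, COSTS))  # 1000 + 4 * 23 = 1092
```

Raising on an unknown resource, instead of treating it as zero area, keeps a missing RAM cost from silently making a design look smaller.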

Performance
- Wall Clock Time = #Cycles * Clock Period
- #Cycles: from RTL simulation, averaged over 20 benchmarks
- Clock Period: from the CAD tool

Source      Benchmark
MiBench     bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia
Freescale   Dhrystone 2.1
XiRisc      bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc
RATES       dct, gol
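As a worked example of the wall-clock formula, the sketch below combines a cycle count from simulation with a clock frequency from the CAD tool. The function name and the one-million-cycle count are hypothetical; the 40 MHz and 160 MHz figures are the two soft processor clock rates quoted earlier in the talk.

```python
def wall_clock_time_us(cycles, clock_mhz):
    """Wall clock time = #cycles * clock period.

    With the frequency in MHz, the clock period is 1 / clock_mhz microseconds,
    so the result is in microseconds.
    """
    return cycles / clock_mhz


# Hypothetical benchmark that takes one million cycles on either design.
print(wall_clock_time_us(1_000_000, 40))   # 25000.0 us at 40 MHz
print(wall_clock_time_us(1_000_000, 160))  # 6250.0 us at 160 MHz
```

In practice the cycle count also changes with the pipeline, which is why both terms have to be measured per design.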

Power
- The CAD tool can estimate power from an assumed toggle ratio (derived experimentally)
- Total Dynamic Power (mW) ÷ Clock Frequency (MHz) = Dynamic Energy per cycle, excluding I/O (nJ/cycle)

Metrics summary
Require the following information:
1. Resource Usage (area, from the CAD tool)
2. Clock Frequency (wall clock time, from the CAD tool)
3. Power Estimate (energy/cycle, from the CAD tool)
4. Cycle Count (wall clock time, from the RTL simulator)
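The unit bookkeeping in the energy formula works out because 1 mW / 1 MHz = 10^-3 W / 10^6 Hz = 10^-9 J. A minimal sketch with hypothetical function names and made-up input values:

```python
def energy_per_cycle_nj(dynamic_power_mw, clock_mhz):
    """Dynamic energy per cycle in nJ: mW divided by MHz gives nJ directly."""
    return dynamic_power_mw / clock_mhz


def energy_per_run_nj(dynamic_power_mw, clock_mhz, cycles):
    """Total dynamic energy for a run of `cycles` cycles, in nJ."""
    return energy_per_cycle_nj(dynamic_power_mw, clock_mhz) * cycles


# Hypothetical design: 80 mW of dynamic power at 40 MHz.
print(energy_per_cycle_nj(80, 40))       # 2.0 nJ/cycle
print(energy_per_run_nj(80, 40, 1000))   # 2000.0 nJ for a 1000-cycle run
```

Normalizing by frequency lets designs clocked at different rates be compared on energy instead of raw power.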

RTL-based Design Space Exploration
Complete and accurate understanding of the design space
[Diagram: Circuit Design (RTL) and Benchmarks feed the RTL Simulator, which gives 1. Correctness and 2. Cycle Count; Circuit Design (RTL) feeds the CAD Tool, which gives 3. Area, 4. Clock Frequency, 5. Power]

Goals 1. Develop measurement methodology 2. Populate the design space 3. Compare against industrial soft processor(s)

Microarchitectural Design Space Exploration
Need a fast route to RTL from an architectural idea
[Diagram: Circuit Design (RTL) and Benchmarks feed the RTL Simulator, which gives 1. Correctness and 2. Cycle Count; Circuit Design (RTL) feeds the CAD Tool, which gives 3. Area, 4. Clock Frequency, 5. Power]

SPREE (Soft Processor Rapid Exploration Environment)
[Diagram: the SPREE RTL Generator produces the RTL; the RTL Simulator with Benchmarks gives 1. Correctness and 2. Cycle Count; the CAD Tool gives 3. Area, 4. Clock Frequency, 5. Power]

Goals
1. Develop measurement methodology
2. Populate the design space (SPREE)
   a. Rapidly
   b. With interesting designs
   c. Accurately (minimize overhead)
3. Compare against industrial soft processor(s)

Related Work
- Parametrized cores: narrow design space, laborious changes to control
- Architecture Description Languages (ADLs): too robust, inaccurate (simulator-based, or behavioural RTL)
- PEAS-III/ASIPMeister [Itoh2000]: not FPGA-specific, ISA design focus

SPREE RTL Generator Overview
[Diagram: ISA Description, Datapath Description, and Component Library feed the SPREE RTL Generator, which emits efficiently synthesizable RTL]
- Interesting: allows for interesting architectures
- Rapidly: simple descriptions
- Accurately: efficient component implementations

Some current limitations
- No caches (use fast on-chip RAM)
- Simple in-order issue pipelines
- No dynamic branch prediction
- No OS or exceptions support
- No ISA changes! These need compiler generation to support; use a subset of MIPS-I

Architecture Input: Component Library
[Diagram: library components Ifetch, Reg File, ALU, Mul, Data Mem, Write Back]

Architecture Input: Datapath Description
[Diagram: components taken from the Component Library and connected as Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]

Architecture Input: SPREE RTL Generator
[Diagram: the ISA Description, Datapath Description, and Component Library feed the SPREE RTL Generator, which adds Decode and control to the datapath]
- Control generation saves time and is non-critical

Architecture Input: ISA Description
- Generic Operations (GENOPs), e.g. FETCH, RFREAD, ADD, RFWRITE
- MIPS instructions are made of GENOPs
- Example: MIPS ADD (add rd, rs, rt) is built from FETCH, RFREAD, RFREAD, ADD, RFWRITE
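One way to picture an instruction made of GENOPs is as a small dataflow graph. The sketch below is illustrative only and is not SPREE's actual input format: the node names and the edges between GENOPs are assumptions inferred from the ADD example on the slide (a fetch, two register reads for rs and rt, an add, and a register write to rd).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical encoding: node name -> (GENOP kind, nodes it depends on).
MIPS_ADD = {
    "fetch":    ("FETCH",   []),
    "read_rs":  ("RFREAD",  ["fetch"]),
    "read_rt":  ("RFREAD",  ["fetch"]),
    "add":      ("ADD",     ["read_rs", "read_rt"]),
    "write_rd": ("RFWRITE", ["add"]),
}


def genop_kinds(instr):
    """Set of GENOP kinds the instruction needs from the datapath."""
    return {kind for kind, _ in instr.values()}


def dependency_order(instr):
    """Node names ordered so every GENOP follows the GENOPs it depends on."""
    graph = {node: deps for node, (_, deps) in instr.items()}
    return list(TopologicalSorter(graph).static_order())


print(sorted(genop_kinds(MIPS_ADD)))
print(dependency_order(MIPS_ADD))
```

A generator could walk such a graph to check that the datapath description provides a component for every GENOP kind before emitting control logic.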

Complete Experimental Framework Using SPREE
[Diagram: ISA Description, Datapath Description, and Component Library feed the SPREE RTL Generator; the RTL Simulator with Benchmarks gives 1. Correctness and 2. Cycle Count; the CAD Tool gives 3. Area, 4. Clock Frequency, 5. Power; part of the input is marked FIXED]

Goals
1. Develop measurement methodology (Area, Performance, Power)
2. Populate the design space (SPREE)
3. Compare against industrial soft processor(s)

Altera’s NiosII
- Second-generation soft processor
- Has three variations:
  - NiosIIe: unpipelined, no hardware multiply
  - NiosIIs: 5 stages, no branch prediction
  - NiosIIf: 6 stages, dynamic branch prediction
- Caveats: supports exceptions, OS, and caches; very similar but tweaked ISA

Design Space vs NiosII Variations

Summary
1. We span the design space
2. We remain competitive
   - Achieved 9% faster and 11% smaller than NiosIIs
   - => don’t suffer from prohibitive overhead
Let’s explore some architecture!

Architectural Axes
1. Hardware vs software multiplication
2. Shifter implementation
3. Pipeline
   - Depth
   - Organization
   - Forwarding

Hardware vs Software Multiplication
- Hardware multiplication increases area and power consumption, and speeds up execution
- BUT not all applications care about speed, and not all applications use multiplication (significantly)

Cycle Count Speedup of Hardware Multiplication
Must understand its cost/benefit to decide when to use it

Cost of Hardware Multiply
- ~250 LEs (20%)
- 35% more energy/cycle

Shifter Implementations
- Shifters (multiplexers) are big in FPGAs
- Consider 3 implementations: serial shifter, LUT-based barrel shifter, multiplier-based barrel shifter
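To illustrate why a multiplier can stand in for a barrel shifter, the sketch below models a 32-bit left shift two ways: one bit per cycle (serial), and as a multiplication by a power of two. It is a behavioural model under stated assumptions, not the hardware from the talk: the function names are hypothetical, the cycle count for the serial version assumes one shift step per cycle, and right shifts are not shown.

```python
MASK32 = 0xFFFFFFFF  # keep results to 32 bits, as in a MIPS-I datapath


def shift_left_serial(value, amount):
    """Shift one bit position per step; returns (result, steps taken)."""
    steps = 0
    for _ in range(amount):
        value = (value << 1) & MASK32
        steps += 1
    return value, steps


def shift_left_multiplier(value, amount):
    """Left shift expressed as multiplication by 2**amount, truncated to 32 bits."""
    return (value * (1 << amount)) & MASK32


result, steps = shift_left_serial(0x0000FFFF, 8)
print(hex(result), steps)                        # 0xffff00 8
print(hex(shift_left_multiplier(0x0000FFFF, 8)))  # 0xffff00
```

The serial version needs almost no logic but its latency grows with the shift amount, while the multiplier version reuses a hard multiplier block if the design already has one.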

Impact of Shifter Implementation
- Consistent across different pipe depths

Shifter Implementation Tradeoffs
- Averaged over all pipeline depths
- Smallest: serial
- Fastest: LUT-based barrel
- Most energy efficient: serial
The multiplier-based shifter is a very nice sweet spot

Pipelines - Depth
- Study different pipeline depths, over 3 shifters
- Arrows = possible forwarding lines (not used)
- All use predict-not-taken branches

Pipelining & clock frequency

Impact of Pipelining
- Adds area, can increase speed (2 to 3 stage?)

FPGA Nuance: Synchronous RAMs
2-stage Pipeline
[Diagram: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]
- Stall on all loads, and any operand fetches

3-stage Pipeline
[Diagram: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]
- Fewer stalls, increased frequency => big speedup (1.7x)

3, 4 and 5 stage pipelines
- Increased area, small change in performance
=> Deeper pipelines have potential for better speedups

The 7-stage Pipeline: Where Branch Delay Slots Break Down
- The ideal case: [Pipeline diagram: BEQ, OR, JR, ADD, …; two stages marked X]
- Never squash this stage

Problem: Separation of Branch and Branch Delay Slot
[Pipeline diagram: BEQ, ADD, JR, …]
- Stalls on RAW hazard

Problem: Separation of Branch and Branch Delay Slot
[Pipeline diagram: BEQ, ADD, JR, NOP, …; one stage marked X]
- Must track and protect delay slots

Multiple Delay Slots
[Pipeline diagram: BEQ, OR, JR, ADD, …]
- Must detect separation of branch from delay slot
- OR prevent multiple delay slots
  - Stall branch if a delay slot exists in the pipe
  - We did this one (+30 LEs, -15% clock frequency)
- Can’t guard all delay slots
  - Better off eliminating delay slots (currently researching)

Pipeline organization
- Where stages are placed is important
- Pipe stage placement can result in an all-around “win/loss” or present a tradeoff

Forwarding
- SPREE supports stage-to-stage forwarding
[Diagram: Ifetch, Reg File, ALU, Mul, Data Mem, Write Back, with one forward line for rs and one for rt]
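The decision a forwarding line replaces can be sketched as a per-operand check for a read-after-write (RAW) hazard. This is a generic textbook-style model, not SPREE's generated control logic: the function name and return labels are hypothetical. It relies only on the MIPS convention that register 0 always reads as zero and so never needs forwarding.

```python
def operand_source(src_reg, producer_dest, forwarding_enabled):
    """Decide where one source operand (rs or rt) should come from.

    src_reg:        register number the younger instruction reads
    producer_dest:  register an older in-flight instruction will write,
                    or None if it writes no register
    """
    no_hazard = (
        producer_dest is None
        or src_reg != producer_dest
        or src_reg == 0  # MIPS $zero is hardwired to 0
    )
    if no_hazard:
        return "regfile"
    # RAW hazard: take the value from the forward line if there is one,
    # otherwise wait for the register file write.
    return "forward" if forwarding_enabled else "stall"


# add $5, $1, $2 followed by sub $6, $5, $3: rs hits the hazard, rt does not.
print(operand_source(5, 5, True))   # forward
print(operand_source(3, 5, True))   # regfile
print(operand_source(5, 5, False))  # stall
```

Calling the check once for rs and once for rt mirrors the two separate forward lines in the slide.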

Effect of Forwarding
20% speed increase

An Aside: ISA Subsetting
- Applications don’t generally use all instructions

Processor reduction
- Can strip away unused components/control
  - Generator supports instruction disabling
- Automatically strips away unused components
  - Create an application-specific processor
  - Do this for each benchmark
- FPGAs are a good platform for this!
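The subsetting step can be sketched as a set computation: keep every component needed by at least one instruction the application uses, and strip the rest. The instruction-to-component table below is a hypothetical illustration (component names borrowed from the datapath diagrams, plus an assumed "shifter"), not the mapping SPREE actually derives.

```python
# Hypothetical mapping from instruction to the datapath components it needs.
INSTR_COMPONENTS = {
    "add": {"ifetch", "regfile", "alu", "writeback"},
    "mul": {"ifetch", "regfile", "mul", "writeback"},
    "sll": {"ifetch", "regfile", "shifter", "writeback"},
    "lw":  {"ifetch", "regfile", "alu", "data_mem", "writeback"},
}


def subset_processor(used_instructions):
    """Return (kept, removed) component sets for the instructions actually used."""
    all_components = set().union(*INSTR_COMPONENTS.values())
    kept = set().union(*(INSTR_COMPONENTS[i] for i in used_instructions))
    return kept, all_components - kept


# An application that only adds and loads needs neither multiplier nor shifter.
kept, removed = subset_processor({"add", "lw"})
print(sorted(kept))
print(sorted(removed))
```

Running this once per benchmark gives one application-specific processor per benchmark, as the slide describes.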

Area of a Subsetted Processor

Speed of a Subsetted Processor

Conclusion
- Understanding architectural trade-offs => maximize efficiency
- Developed SPREE & measurement methodology
- Performed preliminary architectural study
  - Quantified cost of hardware multiplication
  - Explored shift unit implementations
  - Explored pipelines: depth, organization, forwarding
