The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005.


FPGA vs ASIC Flows
[Diagram: the same circuit design can go through an ASIC flow or an FPGA flow]
- Reduced cost for low volume
- Reduced time-to-market
- Programmability affords customization
Designers use FPGAs!

Processors and FPGAs
- Option 1: Off-chip processor. Increased board area, cost, and latency.
- Option 2: On-chip “hard” processor. Specialized part, lack of flexibility.
- Option 3: On-chip “soft” processor. Can implement any number of processors and tune each one to meet design constraints.
[Diagrams: custom logic and processor, with the processor off-chip or inside the FPGA]

Tuning Processors (application, design constraints)
- $3 4 MHz, 800 mW, 2-stage pipeline
- $ GHz, 80 W, 31-stage pipeline
Tuning Soft Processors (application, design constraints)
- 500 LEs, 40 MHz, 2-stage pipeline
- 1700 LEs, 160 MHz, 6-stage pipeline
Automatically Tuning Soft Processors: your area, speed, power tradeoff

Understanding Soft Processors
- Tuning requires understanding of the soft processor design space
- We implement many processors and study the design space
[Diagram: Architecture Description -> Synthesized Processor -> Area, Performance, Power]

Don’t we already understand architecture?
- Not completely: we can evaluate area, power, performance
- Not accurately (rules of thumb): FPGA CAD tools are very accurate
- Not in the FPGA domain: LUTs vs transistors, relative speed of RAM & multipliers

Goals 1. Develop measurement methodology 2. Populate the design space 3. Compare against industrial soft processor(s)

Measurement Methodology
- Require a set of metrics: Area, Performance, Power
[Diagram: Circuit Design (RTL) -> FPGA Flow -> Resource Usage, Clock Frequency, Power Estimate]

Area
- Resources: Logic Elements (LEs – LUT & flip-flop), Multipliers, Big RAM, Medium RAM, Little RAM
- Measure physical area in Equivalent LEs (e.g. a 9-bit multiplier is equivalent to 23 LEs in area)
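The equivalent-LE idea can be sketched as a weighted sum. Only the 9-bit multiplier figure (23 LEs) comes from the slide. The function name and the RAM cost table are hypothetical, and the caller has to supply RAM costs because the slide gives none.

```python
# Area cost of one 9-bit multiplier in equivalent LEs (figure quoted on the slide).
LES_PER_9BIT_MULTIPLIER = 23


def equivalent_les(logic_elements, multipliers_9bit=0, ram_blocks=None, ram_costs=None):
    """Return total area in equivalent LEs.

    ram_blocks maps a RAM type name to a block count; ram_costs maps the same
    names to an equivalent-LE cost. Both are hypothetical inputs: the slide
    gives no RAM conversion factors, so the caller must supply them.
    """
    ram_blocks = ram_blocks or {}
    ram_costs = ram_costs or {}
    total = logic_elements + multipliers_9bit * LES_PER_9BIT_MULTIPLIER
    for kind, count in ram_blocks.items():
        total += count * ram_costs[kind]  # raises KeyError if a cost is missing
    return total


# 1000 LEs plus four 9-bit multipliers: 1000 + 4 * 23
print(equivalent_les(1000, multipliers_9bit=4))  # 1092
```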

Performance
- Wall Clock Time = #Cycles × Clock Period
- Clock Period: from the CAD tool
- #Cycles: from RTL simulation, averaged over 20 benchmarks:
Source      Benchmark
MiBench     bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia
Freescale   Dhrystone 2.1
Xirisc      bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc
RATEs       dct, gol
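A minimal sketch of the wall-clock formula, assuming the clock is given in MHz so the result comes out in microseconds. The function name and the cycle counts in the example are made up for illustration. Only the 40 MHz and 160 MHz clock rates echo figures from the tuning slide.

```python
def wall_clock_time_us(cycles, clock_mhz):
    """Wall clock time in microseconds = cycles * clock period.

    With the clock in MHz, one period is 1 / clock_mhz microseconds.
    """
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_mhz


# Hypothetical cycle counts: a deeper pipeline may need more cycles
# (stalls, squashed instructions) yet still finish sooner at a higher clock.
shallow = wall_clock_time_us(1_000_000, 40)   # 2-stage design at 40 MHz
deep = wall_clock_time_us(1_400_000, 160)     # 6-stage design at 160 MHz
print(shallow, deep)
```

The example shows why cycle count and clock frequency have to be measured together: neither one alone ranks the two designs.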

Power
- CAD tool can estimate power from an assumed toggle ratio (derived experimentally)
- Total Dynamic Power (mW) ÷ Clock Frequency (MHz) = Dynamic Energy per cycle, excluding I/O (nJ/cycle)
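The unit conversion works out because 1 mW / 1 MHz = 1e-3 W / 1e6 Hz = 1e-9 J, i.e. 1 nJ per cycle. A small sketch follows. The function names and the example numbers are hypothetical, not measurements from the talk.

```python
def energy_per_cycle_nj(dynamic_power_mw, clock_mhz):
    """Dynamic energy per cycle in nJ: power (mW) divided by frequency (MHz)."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return dynamic_power_mw / clock_mhz


def total_energy_nj(dynamic_power_mw, clock_mhz, cycles):
    """Dynamic energy for a whole run: energy per cycle times cycle count."""
    return energy_per_cycle_nj(dynamic_power_mw, clock_mhz) * cycles


# Hypothetical design: 80 mW of dynamic power at 40 MHz.
print(energy_per_cycle_nj(80, 40))          # 2.0 nJ/cycle
print(total_energy_nj(80, 40, 1_000_000))   # 2000000.0 nJ for one million cycles
```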

Metrics summary
Require the following information:
1. Resource Usage (area – CAD Tool)
2. Clock Frequency (wall clock time – CAD Tool)
3. Power Estimate (energy/cycle – CAD Tool)
4. Cycle Count (wall clock time – RTL Simulator)

RTL-based Design Space Exploration
- Complete and accurate understanding of the design space
[Diagram: Circuit Design (RTL) and Benchmarks -> RTL Simulator -> 1. Correctness, 2. Cycle Count; Circuit Design (RTL) -> CAD Tool -> 3. Area, 4. Clock Frequency, 5. Power]

Goals 1. Develop measurement methodology 2. Populate the design space 3. Compare against industrial soft processor(s)

Microarchitectural Design Space Exploration
- Need a fast route to RTL from an architectural idea
[Diagram: Circuit Design (RTL) and Benchmarks -> RTL Simulator -> 1. Correctness, 2. Cycle Count; Circuit Design (RTL) -> CAD Tool -> 3. Area, 4. Clock Frequency, 5. Power]

SPREE (Soft Processor Rapid Exploration Environment)
[Diagram: SPREE RTL Generator -> RTL; RTL and Benchmarks -> RTL Simulator -> 1. Correctness, 2. Cycle Count; RTL -> CAD Tool -> 3. Area, 4. Clock Frequency, 5. Power]

Goals
1. Develop measurement methodology
2. Populate the design space (SPREE)
   1. Rapidly
   2. With interesting designs
   3. Accurately (minimize overhead)
3. Compare against industrial soft processor(s)

Related Work
- Parametrized Cores: narrow design space, laborious changes to control
- Architecture Description Languages (ADLs): too robust, inaccurate (simulator-based, or behavioural RTL)
- PEAS-III/ASIPMeister [Itoh2000]: not FPGA-specific, ISA design focus

SPREE RTL Generator Overview
[Diagram: ISA Description, Datapath Description, and Component Library -> SPREE RTL Generator -> Efficiently Synthesizable RTL]
- Interesting: allows for interesting architectures
- Rapidly: simple descriptions
- Accurately: efficient component implementations

Some current limitations
- No caches (use fast on-chip RAM)
- Simple in-order issue pipelines
- No dynamic branch prediction
- No OS or exceptions support
- No ISA changes! These would need compiler generation to support; use a subset of MIPS-I

Architecture Input: Component Library
[Diagram: library components Mul, Ifetch, Reg File, ALU, Write Back, Data Mem]

Architecture Input: Datapath Description
[Diagram: components from the Component Library (Mul, Ifetch, Reg File, ALU, Write Back, Data Mem) connected into a datapath]

Architecture Input
[Diagram: ISA Description, Datapath Description, and Component Library feed the SPREE RTL Generator, which adds Decode and Control to the datapath (Mul, IF, Reg File, ALU, Write Back, Data Mem)]
- Control generation saves time and is non-critical

Architecture Input: ISA Description
- Generic Operations (GENOPs), e.g. FETCH, RFREAD, ADD, RFWRITE
- MIPS instructions are made of GENOPs
- Example: MIPS ADD – add rd, rs, rt is built from FETCH, RFREAD, RFREAD, ADD, RFWRITE
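To make the GENOP idea concrete, the sketch below writes `add rd, rs, rt` as a small dataflow graph and derives one valid ordering. This is an illustration only: the dictionary format, the `_rs`/`_rt`/`_rd` suffixes, and the helper names are assumptions, not SPREE's actual input syntax, which the slides do not show.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical description of MIPS "add rd, rs, rt" as a GENOP dataflow graph.
# Each key is a GENOP instance; its value lists the GENOPs it depends on.
ADD_RD_RS_RT = {
    "FETCH": [],
    "RFREAD_rs": ["FETCH"],
    "RFREAD_rt": ["FETCH"],
    "ADD": ["RFREAD_rs", "RFREAD_rt"],
    "RFWRITE_rd": ["ADD"],
}


def genop_kinds(instruction):
    """Distinct GENOP types an instruction needs (operand suffix dropped)."""
    return sorted({name.split("_")[0] for name in instruction})


def schedule(instruction):
    """One ordering of the GENOPs in which every dependency comes first."""
    return list(TopologicalSorter(instruction).static_order())


print(genop_kinds(ADD_RD_RS_RT))  # ['ADD', 'FETCH', 'RFREAD', 'RFWRITE']
print(schedule(ADD_RD_RS_RT))     # FETCH first, RFWRITE_rd last
```

Describing instructions as dependency graphs rather than fixed sequences leaves a generator free to map the same instruction onto datapaths with different stage boundaries.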

Complete Experimental Framework Using SPREE
[Diagram: ISA Description, Datapath Description, and Component Library -> SPREE RTL Generator -> RTL; RTL and Benchmarks -> RTL Simulator -> 1. Correctness, 2. Cycle Count; RTL -> CAD Tool -> 3. Area, 4. Clock Frequency, 5. Power; label: FIXED]

Goals 1. Develop measurement methodology 2. Populate the design space 3. Compare against industrial soft processor(s) SPREE Area Performance Power

Altera’s NiosII
- Second-generation soft processor
- Has three variations:
  - NiosIIe: unpipelined, no hardware multiply
  - NiosIIs: 5 stages, no branch prediction
  - NiosIIf: 6 stages, dynamic branch prediction
- Caveats: supports exceptions, OS, and caches; very similar but tweaked ISA

Design Space vs NiosII Variations

Summary
1. We span the design space
2. Remain competitive
- Achieved 9% faster and 11% smaller than NiosIIs => don’t suffer from prohibitive overhead
Let’s explore some architecture!

Architectural Axes
1. Hardware vs Software Multiplication
2. Shifter implementation
3. Pipeline: depth, organization, forwarding

Hardware vs Software Multiplication
- Hardware multiplication increases area & power consumption and speeds up execution
- BUT: not all applications care about speed, and not all applications use multiplication (significantly)

Cycle Count Speedup of Hardware Multiplication Must understand its cost/benefit to decide when to use

Cost of Hardware Multiply
- ~250 LEs (20%)
- 35% more energy/cycle
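One way to read the 35% figure: total dynamic energy is energy per cycle times cycle count, so hardware multiplication saves energy only when it cuts the cycle count by more than the per-cycle increase costs. The sketch below assumes that simple model. It ignores static power and any clock frequency change, and the cycle counts in the example are hypothetical.

```python
# Relative increase in energy per cycle with hardware multiply (from the slide).
ENERGY_PER_CYCLE_INCREASE = 0.35


def relative_energy(cycles_sw_mul, cycles_hw_mul, increase=ENERGY_PER_CYCLE_INCREASE):
    """Energy with hardware multiply divided by energy with software multiply.

    Assumes energy = energy_per_cycle * cycles; a result below 1.0 means the
    hardware multiplier saves energy for this benchmark.
    """
    return (1 + increase) * cycles_hw_mul / cycles_sw_mul


def breakeven_cycle_reduction(increase=ENERGY_PER_CYCLE_INCREASE):
    """Fraction of cycles that must be removed for energy to break even."""
    return 1 - 1 / (1 + increase)


# Hypothetical multiply-heavy benchmark: cycle count halves with hardware multiply.
print(round(relative_energy(1000, 500), 3))
# Hypothetical benchmark that barely multiplies: only 5% fewer cycles.
print(round(relative_energy(1000, 950), 3))
print(round(breakeven_cycle_reduction(), 3))
```

Under this model the break-even point is a cycle-count reduction of about 26%, which is why the benefit depends on how heavily each application multiplies.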

Shifter Implementations
- Shifters (multiplexers) are big in FPGAs
- Consider 3 implementations: serial shifter, LUT-based barrel shifter, multiplier-based barrel shifter

Impact of Shifter Implementation
- Consistent across different pipe depths

Shifter Implementation Tradeoffs
- Averaged over all pipeline depths
- Smallest: Serial
- Fastest: LUT-based barrel
- Energy efficient: Serial
The multiplier-based shifter is a very nice sweet spot

Pipelines - Depth
- Study different pipeline depths, over 3 shifters
- Arrows = possible forwarding lines (not used)
- All use predict-not-taken branches

Pipelining & clock frequency

Impact of Pipelining
- Adds area, can increase speed (2 to 3 stage?)

FPGA Nuance: Synchronous RAMs (2-stage Pipeline)
[Diagram: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]
- Stall on all loads, and any operand fetches

3-stage Pipeline
[Diagram: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]
- Fewer stalls, increased frequency => big speedup (1.7x)

3, 4 and 5 stage pipelines
- Increased area, small change in performance => deeper pipelines have potential for better speedups

The 7-stage Pipeline: Where Branch Delay Slots Break Down
- The ideal case: never squash this stage …
[Pipeline diagram: BEQ, OR, JR, ADD; squashed slots marked X]

Problem: Separation of Branch and Branch Delay Slot
[Pipeline diagram: BEQ, ADD, JR]
- Stalls on RAW hazard …

Problem: Separation of Branch and Branch Delay Slot
[Pipeline diagram: BEQ, ADD, JR, NOP; squashed slot marked X]
- Must track and protect delay slots …

Multiple Delay Slots
- Must detect separation of branch from delay slot
- OR prevent multiple delay slots: stall branch if a delay slot exists in the pipe. We did this one (+30 LEs, -15% clock frequency)
[Pipeline diagram: BEQ, OR, JR, ADD]
- Can’t guard all delay slots: better off eliminating delay slots (currently researching …)

Pipeline organization
- Where stages are placed is important
- Pipe stage placement can result in an all-around “win/loss” or present a tradeoff

Forwarding
- SPREE supports stage-to-stage forwarding
[Diagram: datapath (Ifetch, Reg File, ALU, Mul, Data Mem, Write Back) with forward lines for rs and rt]
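As a toy illustration of what forwarding lines on rs and rt buy, the sketch below counts read-after-write (RAW) stalls between back-to-back instructions. It is a deliberately simplified model, not SPREE's stall logic: the `Instr` type, the one-cycle penalty, and the rule that only the immediately preceding instruction matters are all assumptions.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: destination register and source registers."""
    dest: Optional[int]
    srcs: Tuple[int, ...]


def count_raw_stalls(program: Sequence[Instr], forwarding: bool, penalty: int = 1) -> int:
    """Count stall cycles caused by RAW hazards between adjacent instructions.

    Without forwarding, an instruction that reads a register written by the
    previous instruction waits `penalty` cycles. With forwarding, the result
    is assumed to be bypassed directly and no stall is charged.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        # Register 0 is hardwired to zero in MIPS, so writes to it create no hazard.
        hazard = prev.dest not in (None, 0) and prev.dest in cur.srcs
        if hazard and not forwarding:
            stalls += penalty
    return stalls


program = [
    Instr(dest=1, srcs=(2, 3)),  # r1 = r2 + r3
    Instr(dest=4, srcs=(1, 5)),  # r4 = r1 + r5, reads r1 right after it is written
    Instr(dest=6, srcs=(7, 8)),  # independent
]
print(count_raw_stalls(program, forwarding=False))  # 1
print(count_raw_stalls(program, forwarding=True))   # 0
```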

Effect of Forwarding
- 20% speed increase

An Aside: ISA Subsetting
- Applications don’t generally use all instructions

Processor reduction
- Can strip away unused components/control
- Generator supports instruction disabling and automatically strips away unused components
- Create an application-specific processor; do this for each benchmark
- FPGAs are a good platform for this!
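The subsetting step can be sketched as a set computation: take the instructions an application actually uses, gather the components those instructions need, and drop the rest. The instruction-to-component table below is invented for illustration and does not reflect SPREE's real component library.

```python
# Hypothetical mapping from instruction to the datapath components it needs.
COMPONENTS_USED = {
    "add": {"ifetch", "regfile", "alu"},
    "mul": {"ifetch", "regfile", "multiplier"},
    "sll": {"ifetch", "regfile", "shifter"},
    "lw": {"ifetch", "regfile", "alu", "data_mem"},
}


def subset_processor(used_instructions):
    """Return (kept, removed) component sets for an application.

    kept is the union of components needed by the instructions the
    application uses; removed is everything else in the full processor.
    """
    kept = set()
    for instr in used_instructions:
        kept |= COMPONENTS_USED[instr]
    all_components = set().union(*COMPONENTS_USED.values())
    return kept, all_components - kept


# An application that only adds and loads needs neither multiplier nor shifter.
kept, removed = subset_processor({"add", "lw"})
print(sorted(kept))     # ['alu', 'data_mem', 'ifetch', 'regfile']
print(sorted(removed))  # ['multiplier', 'shifter']
```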

Area of a Subsetted Processor

Speed of a Subsetted Processor

Conclusion
- Understanding architectural trade-offs => maximize efficiency
- Developed SPREE & measurement methodology
- Performed preliminary architectural study: quantified cost of hardware multiplication; explored shift unit implementations; explored pipelines (depth, organization, forwarding)