Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24.

Slides:



Advertisements
Similar presentations
DSPs Vs General Purpose Microprocessors
Advertisements

Instruction Set Design
1 Pipelining Part 2 CS Data Hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands that differs from.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
1 Advanced Computer Architecture Limits to ILP Lecture 3.
Computer Organization and Architecture
Instruction-Level Parallelism (ILP)
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
Computer Organization and Architecture
Computer Organization and Architecture
CSC 4250 Computer Architectures December 8, 2006 Chapter 5. Memory Hierarchy.
Pipelining II Andreas Klappenecker CPSC321 Computer Architecture.
Processor Technology and Architecture
Term Project Overview Yong Wang. Introduction Goal –familiarize with the design and implementation of a simple pipelined RISC processor What to do –Build.
1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.
1 COMP541 Sequencing – III (Sequencing a Computer) Montek Singh April 9, 2007.
Chapter 4 Processor Technology and Architecture. Chapter goals Describe CPU instruction and execution cycles Explain how primitive CPU instructions are.
Chapter 13 Reduced Instruction Set Computers (RISC) Pipelining.
RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.
Chapter 12 CPU Structure and Function. Example Register Organizations.
Pipelined Processor II CPSC 321 Andreas Klappenecker.
Appendix A Pipelining: Basic and Intermediate Concepts
The Processor Data Path & Control Chapter 5 Part 1 - Introduction and Single Clock Cycle Design N. Guydosh 2/29/04.
(6.1) Central Processing Unit Architecture  Architecture overview  Machine organization – von Neumann  Speeding up CPU operations – multiple registers.
1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.
Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.
CPU Cache Prefetching Timing Evaluations of Hardware Implementation Ravikiran Channagire & Ramandeep Buttar ECE7995 : Presentation.
Efficient Multi-Ported Memories for FPGAs Eric LaForest Greg Steffan University of Toronto Computer Engineering Research Group February 22, 2010.
Basics and Architectures
Embedded Supercomputing in FPGAs
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
10/27: Lecture Topics Survey results Current Architectural Trends Operating Systems Intro –What is an OS? –Issues in operating systems.
1 Multi-ported Memories for FPGAs via XOR Eric LaForest, Ming Liu, Emma Rapati, and Greg Steffan ECE, University of Toronto.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
1 TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH- THROUGHPUT FPGA APPLICATIONS Aaron Severance University of British Columbia Advised by Guy Lemieux.
ISSS 2001, Montréal1 ISSS’01 S.Derrien, S.Rajopadhye, S.Sur-Kolay* IRISA France *ISI calcutta Combined Instruction and Loop Level Parallelism for Regular.
Pipeline Hazards. CS5513 Fall Pipeline Hazards Situations that prevent the next instructions in the instruction stream from executing during its.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CMPE 421 Parallel Computer Architecture
The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005.
Maximizing Speed and Density of Tiled FPGA Overlays via Partitioning Charles Eric LaForest J. Gregory Steffan University of Toronto ICFPT 2013.
B. Ramamurthy.  12 stage pipeline  At peak speed, the processor can request both an instruction and a data word on every clock.  We cannot afford pipeline.
Reconfigurable Computing - Pipelined Systems John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound, Western.
ALU (Continued) Computer Architecture (Fall 2006).
Winter 2002CSE Topic Branch Hazards in the Pipelined Processor.
© 2010 Altera Corporation - Public Lutiac – Small Soft Processors for Small Programs David Galloway and David Lewis November 18, 2010.
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
11 Pipelining Kosarev Nikolay MIPT Oct, Pipelining Implementation technique whereby multiple instructions are overlapped in execution Each pipeline.
High Performance Computing1 High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors * Jeremy R. Johnson *This lecture was.
Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Electrical.
1 Scaling Soft Processor Systems Martin Labrecque Peter Yiannacouras and Gregory Steffan University of Toronto FCCM 4/14/2008.
PipeliningPipelining Computer Architecture (Fall 2006)
Presenter: Darshika G. Perera Assistant Professor
Computer Organization CS224
Computer Architecture
Central Processing Unit Architecture
William Stallings Computer Organization and Architecture 8th Edition
Application-Specific Customization of Soft Processor Microarchitecture
CSC 4250 Computer Architectures
Cache Memory Presentation I
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
Instruction Set Principles
Cache - Optimization.
Lecture 4: Instruction Set Design/Pipelining
Application-Specific Customization of Soft Processor Microarchitecture
Presentation transcript:

Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24

Easier FPGA Programming We focus on overlay architectures –Nios, MicroBlaze, Vector Processors These inherited their architectures from ASICs –Easy to use with existing software tools –Performance penalty –ASIC architectures poor fit to FPGA hardware! ASIC ≠ FPGA –ASIC: transistors, poly, vias, metal layers –FPGA: LUTs, BRAMs, DSP Blocks, routing Fixed widths, depths, other discretizations FPGA-centric processor design? 2

Hardware (Stratix IV)Width (bits)Fmax (MHz) DSP Blocks36480 Block RAMs36550 ALUTs1800 Nios II/f32230 How do FPGAs Want to Compute? 3 What processor architecture best fits the underlying FPGA?

Research Goals 1.Assume threaded data parallelism 2.Run at maximum FPGA frequency 3.Have high performance 4.Never stall 5.Aim for simple, minimal ISA 6.Match architecture to underlying FPGA 4

Result: Octavo 10 stages, 8 threads, 550 MHz Family of designs –Word width (8 to 72 bits) –Memory depth (2 to 32k words) –Pipeline depth (8 to 16 stages) Snapshot of work-in-progress 5

Designing Octavo 6

High-Level View of Octavo 7 Unified registers and RAM

Octavo vs. Classic RISC 8 All memories unified (no loads/stores) How to pipeline Octavo?

Design For Speed: Self-Loop Characterization 9

Self-Loop Characterization Connect module outputs to inputs –Accounts for the FPGA interconnect Pipeline loop paths to absorb delays Pointed to other limits than raw delay –Minimum clock pulse widths DSP Blocks: 480 MHz BRAMs: 550 MHz We measured some surprising delays… 10

BRAM Self-Loop Characterization MHz (routing!) 656 MHz531 MHz710 MHz Must connect BRAMs using registers

Building Octavo: Memory 12

Building Octavo: Memory 13

Memory 14 Replicated “scratchpad” memories with I/O while still exceeding 550 MHz limit. Instruction ALU Result

Building Octavo: ALU 15

Building Octavo: ALU 16 Fully pipelined (4 stages) –Never stalls

Building Octavo: ALU 17 Multiplication –Uses DSP Blocks –Must overcome their 480 MHz limit…

Building Octavo: Multiplier One multiplier is wide enough but too slow Two multipliers working at half-speed –Send data to both multipliers in alternation MHz 600 MHz

Octavo: Putting It All Together 19

Octavo Pipeline –10 stages Actually 8 stages with one exception (more later) –No result forwarding or pipeline interlocks –Scalar, Single-Issue, In-Order, Multi-Threaded

Octavo 21 Instruction Memory –Indexed by current thread PC –Provides a 3-operand instruction –On-chip BRAMs only I

Octavo 22 A and B Memories –Receive operand addresses from instruction –Provide data operands to ALU and Controller Some addresses map to I/O ports –On-chip BRAMs only I A/B

Octavo 23 Pipeline Registers –Avoid an odd number of stages –Separate BRAMs for best speed Predicted by BRAM self-loop characterization Unusual but essential design constraint I A/B

Octavo 24 Controller –Receives opcode, source/destination operands –Decides branches –Provides current PC of next thread to I memory CTL0 CTL1 I A/B

Octavo 25 ALU –Receives opcode and data –Writes result to all memories ALU0 ALU1 ALU2 ALU3 CTL0 CTL1 I A/B

Octavo ALU0 ALU1 ALU2 ALU3 CTL0 CTL1 I A/B Longest mandatory loop: 8 stages –Along A/B memories and ALU –Fill with 8 threads to avoid stalls T6T7T2 T3 T4 T5 T0T1

Octavo 27 Special case longest loop: 10 stages –Along instruction memory and ALU –Does not affect most computations Adds a delay slot to subroutine and loop code ALU0 ALU1 ALU2 ALU3 CTL0 CTL1 I A/B

Results: Speed and Area 28

Experimental Framework Quartus 10.1 targeting Stratix IV (fastest) –Optimize and place for speed –Average speed over 10 placement runs Varied processor parameters: –Word width –Memory depth –Pipeline depth Measure Frequency, Area, and Density 29

Maximum Operating Frequency 30

Maximum Operating Frequency 31 Faster Wider BRAM hard limit Timing Slack!

Maximum Operating Frequency MHz 36 bits wide 230 MHz 32 bits wide 2.39x faster, but not a fair comparison

Maximum Operating Frequency 33 Multiplier CAD Anomaly! (38 to 54 bits width) Enough pipeline stages bury the inefficiency

Area Density 34

Area Density bits, 1024 words 72 bits, 4096 words “Sweet spot” 67% used (typical) 26% used

Designing Octavo: Lessons & Future Work 36

Lessons Soft-processors can hit BRAM Fmax –Octavo: 8 threads, 10 stages, 550 MHz Self-loop characterization for modules –Helps reason about their pipelining –Shows true operating envelopes on FPGA Octavo spans a large design space –Significant range of widths, depths, stages Consider FPGA-centric architecture! 37

Future Work 38