Vector Processing as a Soft-core CPU Accelerator Jason Yu, Guy Lemieux, Chris Eagleston {jasony, lemieux, University of British Columbia.

Presentation transcript:

Vector Processing as a Soft-core CPU Accelerator Jason Yu, Guy Lemieux, Chris Eagleston {jasony, lemieux, University of British Columbia Prepared for FPGA2008, Altera, and Xilinx February 26-28, 2008

2 Motivation
- FPGAs for embedded processing
  - High performance, computationally intensive
  - Growing use of embedded processor on FPGA
  - Nios/MicroBlaze too slow
- Faster performance
  - Faster Nios/MicroBlaze
  - Multiprocessor-on-FPGA
  - Custom hardware accelerator
  - Synthesized accelerator

3 Problems…
- Faster Nios/MicroBlaze not feasible
  - 2- or 4-way superscalar/VLIW register file maps inefficiently to FPGA
  - Superscalar: complex dependency checking
- Multiprocessor-on-FPGA complexity
  - Parallel programming and debugging
  - System design
  - Cache coherence, memory consistency
- Custom hardware accelerator cost
  - Need hardware engineer
  - Time-consuming to design and debug
  - 1 hardware accelerator per function

4 Possible Solutions…
- Automatically synthesized hardware accelerators
  - Change software -> regenerate & recompile RTL
  - Altera C2H
  - Xilinx CHiMPS
  - Mitrion Virtual Processor
  - CriticalBlue Cascade
- Soft vector processor
  - Change software -> same RTL, just recompile software
  - Purely software-based
  - Decouples hardware/software development teams

5 Advantages of Vector Processing
- Simple programming model
  - Short to long vector data parallelism
  - Regular, easy to accelerate
- Purely software-based
  - One hardware accelerator supports many applications
- Scalable performance and area

6 Contributions
- Configurable soft vector processor
  - Selectable performance/resource tradeoff
  - Area customization
- FPGA-specific enhancements
  - Partitioned register file
  - Vector reductions using MAC chain
  - Local vector datapath memory

Overview of Vector Processing

8 Acceleration with Vector Processing
- Organize data as long vectors
- Data-level parallelism
- Vector instruction execution
  - Multiple vector lanes (SIMD)
  - Repeated SIMD operation over length of vector
- Figure labels: source vector registers, destination vector register, vector lanes
- Scalar loop:
    for (i=0; i<NELEM; i++) a[i] = b[i] * c[i]
- Vector instruction:
    vmult a, b, c
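The lane-by-lane execution described on this slide can be sketched as a small software model. This is only an illustration, not the processor's implementation: the function name `vmult_model` and its cycle counter are hypothetical, and rounding up when VL is not a multiple of the lane count is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical software model of "vmult a, b, c".
 * Each outer iteration stands for one SIMD step in which up to `nlane`
 * elements are multiplied at once; the step is repeated until all `vl`
 * elements are done. Returns the number of SIMD steps taken.
 */
size_t vmult_model(int32_t *a, const int32_t *b, const int32_t *c,
                   size_t vl, size_t nlane)
{
    assert(nlane > 0);
    size_t steps = 0;

    for (size_t base = 0; base < vl; base += nlane) {
        /* In hardware these lanes would operate in parallel. */
        for (size_t lane = 0; lane < nlane && base + lane < vl; lane++) {
            a[base + lane] = b[base + lane] * c[base + lane];
        }
        steps++;
    }
    return steps;
}
```

With 8 elements and 4 lanes the model takes 2 steps, where the scalar loop would take 8 iterations.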

9 Compared to CPUs with SIMD Extensions
- Intel SSE2, PowerPC AltiVec, etc.
- Short, fixed-length vectors (e.g., 4)
- Single cycle per instruction
- Many data pack/unpack instructions
- Figure labels: source SIMD registers, destination SIMD register, SIMD unit

10 Hybrid Vector-SIMD
- Consider the code sequence
    for (i=0; i<NELEM; i++) {
        C[i] = A[i] + B[i];
        E[i] = C[i] * D[i];
    }
- Figure labels: traditional vector, hybrid vector-SIMD, SIMD, loop iteration

11 Hybrid Vector-SIMD vs Traditional Vector
- Code sequence
    for (i=0; i<NELEM; i++) {
        C[i] = A[i] + B[i];
        E[i] = C[i] * D[i];
    }
- Figure panels: traditional vector processing and hybrid vector-SIMD processing, each showing the computation of C and E

12 Vector ISA Features
- Vector length (VL) register
- Conditional execution
  - Vector flag registers
- Vector addressing modes
  - Unit stride
  - Constant stride
  - Indexed offset
- Figure: vector merge operation (source registers, flag register, destination register)
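The addressing modes and the flag-controlled merge listed on this slide can be illustrated with a small C model. The function names (`vload_stride`, `vload_indexed`, `vmerge`) are hypothetical and are not the processor's mnemonics. Strides and offsets are counted in elements here, and the merge polarity (a set flag selects the first source) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unit-stride (stride == 1) or constant-stride load of `vl` elements. */
void vload_stride(int32_t *dst, const int32_t *mem, size_t stride, size_t vl)
{
    assert(stride > 0);
    for (size_t i = 0; i < vl; i++) {
        dst[i] = mem[i * stride];
    }
}

/* Indexed-offset load: element i comes from mem[offset[i]]. */
void vload_indexed(int32_t *dst, const int32_t *mem,
                   const size_t *offset, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dst[i] = mem[offset[i]];
    }
}

/* Vector merge: the flag register picks a[i] or b[i] per element.
 * Assumed polarity: flag set -> a[i], flag clear -> b[i]. */
void vmerge(int32_t *dst, const int32_t *a, const int32_t *b,
            const uint8_t *flag, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dst[i] = flag[i] ? a[i] : b[i];
    }
}
```

In every function `vl` plays the role of the VL register: only the first `vl` elements are processed.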

13 Example: Simple 5x5 Median Filtering
- Pseudocode (bubble sort)
    Load the 25 pixel vectors P[0..24]
    For i=0 to 12 {
        minimum = P[i]
        For j=i to 24 {
            if (P[j] < minimum) {
                swap (minimum, P[j])
            }
        }
    }
- Slide “window” over after 1 median
- Repeated over entire image
- Figure labels: many windows, output pixel

14 Example: Simple 5x5 Median Filtering
- Pseudocode (bubble sort)
    Load the 25 pixel vectors P[0..24]
    For i=0 to 12 {
        minimum = P[i]
        For j=i to 24 {
            if (P[j] < minimum) {
                swap (minimum, P[j])
            }
        }
    }
- Bubble sort on vector registers
- Vector flag register to mask execution
- “VL” results at once!
- 25 rows -> 25 vector registers, “VL” pixels each
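The vectorized sort above can be modeled in plain C to show how the flag register masks the swap. This is a sketch under assumptions, not the authors' vector assembly: `median5x5_vec`, `NPIX`, and `MVL` are hypothetical names, pixels are assumed to be 8-bit, and each inner loop over `e` stands for one vector instruction acting on all VL elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPIX 25 /* pixels in one 5x5 window */
#define MVL  16 /* hypothetical maximum vector length of this model */

/*
 * p[k] models vector register k: element e of p[k] is pixel k of window e.
 * After the 13 passes (i = 0..12), median[e] holds the 13th-smallest of the
 * 25 pixels of window e, i.e. its median. The contents of p are modified.
 */
void median5x5_vec(uint8_t p[NPIX][MVL], size_t vl, uint8_t median[MVL])
{
    assert(vl <= MVL);
    uint8_t flag[MVL];

    for (size_t i = 0; i <= 12; i++) {
        for (size_t e = 0; e < vl; e++) {      /* minimum = P[i] */
            median[e] = p[i][e];
        }
        for (size_t j = i; j < NPIX; j++) {
            for (size_t e = 0; e < vl; e++) {  /* vector compare sets flags */
                flag[e] = p[j][e] < median[e];
            }
            for (size_t e = 0; e < vl; e++) {  /* swap only where flag is set */
                if (flag[e]) {
                    uint8_t tmp = median[e];
                    median[e] = p[j][e];
                    p[j][e] = tmp;
                }
            }
        }
    }
}
```

A single call produces `vl` medians, which is the "VL results at once" point made on the slide.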

Soft Vector Processor Architecture

16 Architecture diagram labels
- Nios II core
- Shared instruction memory (scalar / vector instructions)
- Shared scalar / vector memory interface
- Distributed vector register file
- Overlapped scalar / vector execution
- Configurable memory width
- Configurable number of lanes

17 Distributed vector register file (figure label: one vector register, e.g., v0)

18 Local vector datapath memory (figure labels: MAC chain, result to VLane 0)

19 Vector Sum Reduction with MAC
- Sum reduction
  - R = Σ A[i] * B[i]
  - R = Σ A[i] (using B[i] = 1)
  - Reduces VL elements in vector register to single number
- Two-instruction sequence:
  - vmac: multiply-accumulate to accumulators
  - vcczacc: compress copy and zero accumulators
- Side effect: can only reduce 18-bit inputs
- Figure label: accumulate chain
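The two-instruction reduction can be sketched as a simplified software model. The slide does not give the accumulator count, how elements map to accumulators, or how the copied values are finally combined. The constant `NACC`, the round-robin mapping, the signed interpretation of the 18-bit limit, and the final scalar summation in `sum_reduce` are therefore all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NACC 4 /* hypothetical number of MAC accumulators */

typedef struct {
    int64_t acc[NACC];
} mac_unit;

/* Assumption: the 18-bit input limit is modeled as signed 18-bit. */
static int fits_18bit(int32_t x)
{
    return x >= -131072 && x <= 131071;
}

/* Model of vmac: multiply element pairs and accumulate the products. */
void vmac(mac_unit *m, const int32_t *a, const int32_t *b, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        assert(fits_18bit(a[i]) && fits_18bit(b[i]));
        m->acc[i % NACC] += (int64_t)a[i] * b[i];
    }
}

/* Model of vcczacc: copy the accumulators out, then zero them. */
void vcczacc(mac_unit *m, int64_t dst[NACC])
{
    for (size_t k = 0; k < NACC; k++) {
        dst[k] = m->acc[k];
        m->acc[k] = 0;
    }
}

/* R = sum of A[i] * B[i]; pass a vector of ones as b to get sum of A[i]. */
int64_t sum_reduce(const int32_t *a, const int32_t *b, size_t vl)
{
    mac_unit m = {{0}};
    int64_t partial[NACC];
    int64_t r = 0;

    vmac(&m, a, b, vl);
    vcczacc(&m, partial);
    for (size_t k = 0; k < NACC; k++) {
        r += partial[k]; /* final combine done in scalar code in this sketch */
    }
    return r;
}
```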

20 Configurable Parameters
- Some configurable features
  - Number of vector lanes
  - Vector ALU width
  - Vector memory access granularity (8, 16, 32b)
  - Local memory size (or none)
- Strongly affect performance, area

21 Partial List of Configurable Parameters

Primary parameters
Parameter   | Description                                   | Typical      | Soft vector processors (V4 / V8 / V16)
NLane       | Number of vector lanes                        |              |
MVL         | Maximum vector length                         |              |
VPUW        | Processor data width (bits)                   | 8, 16, 32    | 32
MemMinWidth | Minimum accessible data width in memory       | 8, 16,       |

Parameters for optional features
Parameter   | Description                                   | Typical      | Soft vector processors (V4 / V8 / V16)
MultW       | Multiplier width (bits, 0 is off)             | 0, 8, 16, 32 | 16
MACL        | MAC chain length (0 is no MAC)                | 0, 1, 2, 4   | 1 / 2 / 0
LMemN       | Local memory number of words                  |              |
LMemShare   | Shared local memory address space within lane | On/Off       | Off

Performance Results

23 Benchmarking
- 3 sample application kernels
  - 5x5 median filter
  - Motion estimation (full search block matching)
  - 128-bit AES encryption (MiBench)
- C code, 3 versions
  - Nios II
  - Nios II with inline vector assembly
  - Nios II with C2H accelerator

24 Methodology and Assumptions
- Compile C code with nios2-gcc
- Run time = Instructions * cycles-per-instruction / Fmax
- Nios II
  - Instruction: 1 cycle
  - Memory load: 1 cycle
- Nios II with vectors
  - Vector instruction: (VL / NLane) cycles
  - Vector load: 2 * (VL / NLane) + 2 cycles
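The cycle model on this slide translates directly into a few helper functions. The function names are hypothetical. Rounding VL / NLane up when VL is not a multiple of NLane is an assumption, since the slide does not say how that case is counted.

```c
#include <assert.h>
#include <stdint.h>

/* VL / NLane, rounded up (the rounding is an assumption of this sketch). */
static uint64_t lane_passes(uint64_t vl, uint64_t nlane)
{
    assert(nlane > 0);
    return (vl + nlane - 1) / nlane;
}

/* Vector instruction: (VL / NLane) cycles. */
uint64_t vector_instr_cycles(uint64_t vl, uint64_t nlane)
{
    return lane_passes(vl, nlane);
}

/* Vector load: 2 * (VL / NLane) + 2 cycles. */
uint64_t vector_load_cycles(uint64_t vl, uint64_t nlane)
{
    return 2 * lane_passes(vl, nlane) + 2;
}

/* Run time in seconds = total cycles / Fmax (Fmax in Hz). */
double run_time_seconds(uint64_t total_cycles, double fmax_hz)
{
    assert(fmax_hz > 0.0);
    return (double)total_cycles / fmax_hz;
}
```

For example, with VL = 64 and NLane = 16 the model gives 4 cycles for a vector instruction and 10 cycles for a vector load, against 64 cycles each for the equivalent scalar instructions under the 1-cycle Nios II assumption.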

25 Altera C2H Compiler
- Nios II with C2H accelerator
  - Synthesizes HW accelerator from a C function
  - C memory reference = master port to that memory
  - Current limitations:
    - No automatic loop unrolling
    - Up to user to efficiently partition memory
- Figure labels: memory arbiter, Avalon fabric

26 C2H Methodology
- Compile application kernels with C2H compiler
  - Automatic pipelining and scheduling
  - Manually unroll loops
  - Manually “vectorize” C code
- Nios II with C2H accelerator
  - C2H compiler reports # of clock cycles
  - Includes memory arbitration overhead

27 C2H Example
- AES encryption round
  - Shift 4 32-bit words (by different amounts)
  - 4 table lookups
  - XOR results, XOR with key
- Acceleration steps
  1. Process multiple blocks in parallel (increase array sizes)
  2. Manually create 4 on-chip memories for 4 lookup tables
- Figure label: 32-bit word
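The round steps listed above (shift, four table lookups, XOR together, XOR with key) match the common table-lookup formulation of an AES round. A minimal sketch of one output word follows, assuming that formulation. The function name `aes_round_column` is hypothetical, and the four 256-entry tables are supplied by the caller rather than filled in here, so this is not the benchmark's actual code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One output word of a table-lookup AES round.
 * s0..s3 are the four 32-bit state words feeding this output word; each is
 * shifted by a different amount to select one byte, and that byte indexes
 * one of four lookup tables. The results are XORed together and then with
 * the round-key word. Keeping t0..t3 as separate arrays mirrors the idea of
 * placing each table in its own on-chip memory.
 */
uint32_t aes_round_column(const uint32_t t0[256], const uint32_t t1[256],
                          const uint32_t t2[256], const uint32_t t3[256],
                          uint32_t s0, uint32_t s1, uint32_t s2, uint32_t s3,
                          uint32_t round_key)
{
    assert(t0 && t1 && t2 && t3);
    return t0[s0 >> 24]
         ^ t1[(s1 >> 16) & 0xff]
         ^ t2[(s2 >> 8) & 0xff]
         ^ t3[s3 & 0xff]
         ^ round_key;
}
```

Processing several blocks in parallel, as in acceleration step 1, would amount to calling this once per word per block with independent state arrays.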

28 Synthesize system, place and route (figure)

29 Resource Utilization
- Biggest Stratix III = 7x more resources
- Note: These vector processors include a large local memory in each vector lane (an optional feature), hence the high M9K utilization. Removal would save 60% of M9K in V16.

30 Resource Utilization Estimates
Columns: ALM, DSP Elements, M9K, Fmax
Rows:
- Smallest Stratix III
- Nios II/s
- + C2H Median filtering: 82584*147
- + C2H Motion estimation: 977104*135
- + C2H AES encryption: 248086*119
- UTIIe
- V
- V
- V
* C2H results are obtained from compiling to Stratix II; uses M4K memories

31 Results: Clock Cycles

32 Speedup vs Resource Utilization Summary (figure; labels: Nios II/s, V16, V32; series: C2H, Vector; benchmarks: median filtering, AES encryption, motion estimation)

33 Summary of Effort
- C2H accelerators
  1. “Vectorize” code for C2H: 1 day
  2. Extra-effort optimization: 1 day
  3. Place-and-route waiting: 1 hour
  Each iteration = 1 day + P&R
- Vector soft processor
  1. Vector algorithm, write vector assembly: 2 days
  2. Revise vector algorithm: 0.5 day
  Each iteration = 0.5 day + SW compile only

34 Lessons from Vector Processor Design
- Register files
  - 2-read, 1-write memory very common for CPUs
  - Multiple write ports for wide-issue processing
- Wide, flexible vector memory interface very costly
  - Memory crossbars: several multi-bit multiplexers
  - ~1/3 the resources of soft vector processor (128b, byte access)
- Stratix III specific
  - DSP shift chain can no longer dynamically select input
  - MAC chain is useful
    - Would like 32-bit MAC chain

35 Current Progress
- Development toolchain integration
  - Packaged as SOPC Builder component
  - No built-in debug core
    - Uses real Nios II processor to download code onto system
  - Inline vector assembly in Nios II IDE
- Future work
  - Compiler
  - Floating-point

36 Conclusion
- Vector processing maps well to FPGA
  - Many small memories, DSP blocks
  - Simple programming model
- Soft vector processor
  - Purely software-based acceleration
    - No hardware design / RTL recompile needed, just program
    - One hardware accelerator supports many applications
  - Scalable performance and area
    - More vector lanes -> more performance for more area
    - Soft core parameters/features -> area customization

37 Conclusion
- FPGA-specific enhancements
  - Partitioned register file reduces resource utilization
  - MAC chain for efficient vector reduction
  - Local vector datapath memory
    - Table lookup operations
- Download the processor now!