Customizable Soft Vector Processors

Presentation transcript:

Customizable Soft Vector Processors
Peter Yiannacouras, PhD Candidate
Connections 2009

Soft Processors in FPGA Systems
Soft processor: software + compiler; development takes weeks; easier and configurable. Used in 25% of designs [source: Altera, 2009].
Custom HW: HDL + CAD; development takes months; faster, smaller, less power.
For soft processors to compete with custom HW: make FPGA technology more easily accessible, and optimize the soft processor to application properties.

Data Level Parallelism
Data level parallelism: the same operation applied to independent data. Commonly found in embedded systems. Exploit it using a vector processor (c=a+b).

    // C code
    for (i = 0; i < 16; i++)
        c[i] = a[i] + b[i];

    // Processor instructions (shown for element 1)
    load  r0,a[1]
    load  r1,b[1]
    add   r2,r0,r1
    store r2,c[1]

Independent operations:
    c[15]=a[15]+b[15]   c[14]=a[14]+b[14]   c[13]=a[13]+b[13]   c[12]=a[12]+b[12]
    c[11]=a[11]+b[11]   c[10]=a[10]+b[10]   c[9]=a[9]+b[9]      c[8]=a[8]+b[8]
    c[7]=a[7]+b[7]      c[6]=a[6]+b[6]      c[5]=a[5]+b[5]      c[4]=a[4]+b[4]
    c[3]=a[3]+b[3]      c[2]=a[2]+b[2]      c[1]=a[1]+b[1]      c[0]=a[0]+b[0]
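The independence claim on this slide can be checked with a small C sketch. The function names `add_ascending` and `add_descending` are hypothetical, and the sketch assumes that `c` does not overlap `a` or `b`. Because no element of `c` depends on any other, visiting the elements in either order gives the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise add, visiting elements from index 0 upward. */
void add_ascending(const int *a, const int *b, int *c, size_t n) {
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same operation, visiting elements from index n-1 downward.
 * Since each c[i] depends only on a[i] and b[i], the order is irrelevant. */
void add_descending(const int *a, const int *b, int *c, size_t n) {
    assert(a && b && c);
    for (size_t i = n; i-- > 0; )
        c[i] = a[i] + b[i];
}
```

The same property is what lets hardware compute all 16 sums at once instead of one after another.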

Vector Processing Primer (1 Vector Lane)

    // C code
    for (i = 0; i < 16; i++)
        c[i] = a[i] + b[i];

    // Vectorized code
    set    vl,16
    vload  vr0,a
    vload  vr1,b
    vadd   vr2,vr0,vr1
    vstore vr2,c

Each vector instruction holds many units of independent operations. For vadd:
    vr2[15]=vr0[15]+vr1[15]   vr2[14]=vr0[14]+vr1[14]   vr2[13]=vr0[13]+vr1[13]   vr2[12]=vr0[12]+vr1[12]
    vr2[11]=vr0[11]+vr1[11]   vr2[10]=vr0[10]+vr1[10]   vr2[9]=vr0[9]+vr1[9]      vr2[8]=vr0[8]+vr1[8]
    vr2[7]=vr0[7]+vr1[7]      vr2[6]=vr0[6]+vr1[6]      vr2[5]=vr0[5]+vr1[5]      vr2[4]=vr0[4]+vr1[4]
    vr2[3]=vr0[3]+vr1[3]      vr2[2]=vr0[2]+vr1[2]      vr2[1]=vr0[1]+vr1[1]      vr2[0]=vr0[0]+vr1[0]
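To make the vectorized sequence concrete, here is a minimal software emulation in C. It is only a sketch: the type `vreg_t`, the constant `MVL`, and the helper functions are hypothetical names chosen to mirror the slide's pseudo-instructions, not the instruction set of any real processor. Each helper loops over `vl` elements, which corresponds to executing on a single vector lane.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 16  /* assumed maximum vector length for this sketch */

/* Hypothetical vector register holding up to MVL elements. */
typedef struct { int elem[MVL]; } vreg_t;

static size_t vl = MVL;  /* current vector length, as set by "set vl,n" */

void set_vl(size_t n) { assert(n <= MVL); vl = n; }

/* vload: copy vl elements from memory into a vector register. */
void vload(vreg_t *vr, const int *mem) {
    for (size_t i = 0; i < vl; i++) vr->elem[i] = mem[i];
}

/* vadd: vl independent element-wise additions. */
void vadd(vreg_t *vd, const vreg_t *va, const vreg_t *vb) {
    for (size_t i = 0; i < vl; i++) vd->elem[i] = va->elem[i] + vb->elem[i];
}

/* vstore: copy vl elements from a vector register back to memory. */
void vstore(const vreg_t *vr, int *mem) {
    for (size_t i = 0; i < vl; i++) mem[i] = vr->elem[i];
}

/* The slide's five-instruction sequence for c = a + b over 16 elements. */
void vector_add16(const int *a, const int *b, int *c) {
    vreg_t vr0, vr1, vr2;
    set_vl(16);
    vload(&vr0, a);
    vload(&vr1, b);
    vadd(&vr2, &vr0, &vr1);
    vstore(&vr2, c);
}
```

Calling `vector_add16(a, b, c)` on three 16-element arrays produces the same result as the C loop on the slide.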

Vector Processing Primer (16 Vector Lanes)

    // C code
    for (i = 0; i < 16; i++)
        c[i] = a[i] + b[i];

    // Vectorized code
    set    vl,16
    vload  vr0,a
    vload  vr1,b
    vadd   vr2,vr0,vr1
    vstore vr2,c

Each vector instruction holds many units of independent operations. With 16 vector lanes, the 16 operations of vadd (vr2[i]=vr0[i]+vr1[i] for i = 0..15) execute in parallel: 16x speedup.
Implemented on an FPGA (Soft Vector Processor). Is it scalable?
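The 16x figure follows from a simple idealized model, sketched below in C. The model is an assumption made for illustration: a vector instruction over `vl` elements on `lanes` lanes takes ceil(vl / lanes) cycles, with memory stalls and pipeline overhead ignored. The function names are hypothetical.

```c
#include <assert.h>

/* Idealized cycle count for one vector instruction:
 * each cycle, every lane processes one element. */
unsigned vector_cycles(unsigned vl, unsigned lanes) {
    assert(lanes > 0);
    return (vl + lanes - 1) / lanes;  /* ceil(vl / lanes) */
}

/* Ideal speedup of a multi-lane configuration over a single lane. */
double ideal_speedup(unsigned vl, unsigned lanes) {
    assert(vl > 0);
    return (double)vector_cycles(vl, 1) / (double)vector_cycles(vl, lanes);
}
```

Under this model, `vl = 16` takes 16 cycles on 1 lane and 1 cycle on 16 lanes. When `vl` is not a multiple of the lane count, the last cycle leaves some lanes idle, so the speedup falls below the lane count.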

Soft Vector Processor Scalability
7 configurations: 14x speed, 9x area => coarse-grained!

More Architectural Parameters

Description                     Symbol  Values
Processor Architecture
  Number of Lanes               L       1,2,4,8, …
  Memory Crossbar Lanes         M       1,2, …, L
  Multiplier Lanes              X
  Register Banks for Chaining   B       1,2,4, …
  ALU per Register Bank         APB     true/false
Instruction Set Architecture
  Maximum Vector Length         MVL     2,4,8, …
  Width of Lanes (in bits)      W       1-32
  Instruction Enable (each)     -       on/off
Memory System
  Data Cache Capacity           DD      any
  Data Cache Line Size          DW
  Data Prefetch Size            DPK     < DD
  Vector Data Prefetch Size     DPV     < DD/MVL
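One way to picture this parameter space is as a configuration record with a validity check, sketched below in C. The struct `svp_config` and the function `svp_config_valid` are hypothetical names, and the checks encode only the value ranges listed on the slide. X and DW have no values listed, so they are stored but not checked; the per-instruction enables are omitted; and the cache and prefetch sizes are assumed to share one unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of the slide's parameters (symbols as on the slide). */
typedef struct {
    unsigned L;    /* number of lanes: 1,2,4,8,...           */
    unsigned M;    /* memory crossbar lanes: 1..L            */
    unsigned X;    /* multiplier lanes (no values listed)    */
    unsigned B;    /* register banks for chaining: 1,2,4,... */
    bool     APB;  /* ALU per register bank                  */
    unsigned MVL;  /* maximum vector length: 2,4,8,...       */
    unsigned W;    /* lane width in bits: 1-32               */
    unsigned DD;   /* data cache capacity: any               */
    unsigned DW;   /* data cache line size (no values listed)*/
    unsigned DPK;  /* data prefetch size: < DD               */
    unsigned DPV;  /* vector data prefetch size: < DD/MVL    */
} svp_config;

static bool is_pow2(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

/* Returns true if the configuration respects the listed ranges. */
bool svp_config_valid(const svp_config *c) {
    assert(c != NULL);
    return is_pow2(c->L)
        && c->M >= 1 && c->M <= c->L
        && is_pow2(c->B)
        && is_pow2(c->MVL) && c->MVL >= 2
        && c->W >= 1 && c->W <= 32
        && c->DPK < c->DD
        && c->DPV < c->DD / c->MVL;  /* MVL >= 2 already checked above */
}
```

A design-space exploration tool could enumerate candidate records and discard those for which `svp_config_valid` returns false before synthesizing anything.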

Fine-Grained Trade-Off Space
[Figure; legend: Memory System: Weak, Moderate, Good]