Wavefront Skipping using BRAMs for Conditional Algorithms on Vector Processors
Aaron Severance, Joe Edwards, Guy G.F. Lemieux

VectorBlox MXP Soft Vector Processor (SVP)
- Software programmable
- 1 to 128 parallel ALUs (4 shown)
- Flat scratchpad memory
- Vectors at any address, any length
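To make the programming model concrete, here is a minimal sketch of an unmasked vector operation. It assumes MXP-style names (vbx_sp_malloc() as a scratchpad allocator and a vbx() intrinsic) that do not appear on these slides and may differ from the real API:

#include "vbx.h"  // assumed VectorBlox MXP header name

// Minimal sketch of the SVP programming model. vbx_sp_malloc() and the
// vbx() intrinsic are assumed MXP-style names, not confirmed by the slides.
void vector_add_sketch(int n)
{
    // Vectors live at arbitrary addresses in the flat scratchpad
    int *v_a = vbx_sp_malloc(n * sizeof(int));
    int *v_b = vbx_sp_malloc(n * sizeof(int));
    int *v_c = vbx_sp_malloc(n * sizeof(int));

    vbx_set_vl(n);                 // vectors may have any length
    vbx(VVW, ADD, v_c, v_a, v_b);  // element-wise c = a + b across the lanes
}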

Vectors in Scratchpad Memory

Early Exit Algorithms
(Figure: elements flow through a cascade of Stage 1, Stage 2, Stage 3)
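A scalar C sketch of the early-exit pattern may help; the stage tests below are placeholders for illustration. Each element drops out at the first failing stage, so later (typically more expensive) stages see only a shrinking subset of elements:

// Placeholder stage tests, for illustration only
static int stage1(int x) { return (x & 1) == 0; }
static int stage2(int x) { return x > 10; }
static int stage3(int x) { return x % 7 != 0; }

// Scalar model of an early-exit cascade: each element exits at the
// first failing stage, so few elements reach the final stage.
int run_cascade(const int *data, int n)
{
    int survivors = 0;
    for (int i = 0; i < n; i++) {
        if (!stage1(data[i])) continue;  // most elements exit here
        if (!stage2(data[i])) continue;  // fewer reach stage 2
        if (!stage3(data[i])) continue;  // very few reach the final stage
        survivors++;
    }
    return survivors;
}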

Divergent Control Flow on SVPs
- Predication/CMOV: process the entire vector
- Vector Compress: load inputs and compress; expand outputs and store
(Figure: mask vector with 1's marking active elements and 0's disabled; wavefronts issue 1 per cycle across the vector lanes)
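As a scalar model of the predication/CMOV strategy (illustrative only, not the actual hardware): every element is computed, and the mask merely selects whether the result is written back, so masked-off elements still cost full execution time.

// Scalar model of predication/CMOV: all elements are computed; the mask
// only selects whether the new value is kept.
void predicated_double(int *v, const unsigned char *mask, int n)
{
    for (int i = 0; i < n; i++) {
        int result = v[i] * 2;           // computed unconditionally
        v[i] = mask[i] ? result : v[i];  // conditional move on the mask
    }
}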

Vector Compress: 5x1 Sliding Window
- N*M compress operations (time)
- N*M temporary vectors (space)

Divergent Control Flow on SVPs
- Predication/CMOV: process the entire vector
- Vector Compress: load inputs and compress; expand outputs and store
- Compress + Scatter/Gather: compress addresses, then load/store indexed
- Density-time: scan the mask for the next valid wavefront
(Figure: mask vector with 1's marking active elements and 0's disabled; wavefronts issue 1 per cycle across the vector lanes)
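A scalar model of the density-time strategy (illustrative only, with an assumed lane count): wavefronts of LANES elements issue one per cycle, and the mask is scanned at runtime to find the next wavefront with any active element. Note that the scan itself still visits every wavefront:

#define LANES 4  // illustrative lane count

// Model of density-time masking: inactive wavefronts are not issued,
// but scanning the mask to find them still costs time.
void density_time_model(int *v, const unsigned char *mask, int n_wavefronts)
{
    for (int w = 0; w < n_wavefronts; w++) {
        int active = 0;
        for (int lane = 0; lane < LANES; lane++)
            active |= mask[w * LANES + lane];
        if (!active)
            continue;  // skip issuing, but the scan already cost a cycle
        for (int lane = 0; lane < LANES; lane++)  // issue this wavefront
            if (mask[w * LANES + lane])
                v[w * LANES + lane] *= 2;
    }
}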

Wavefront Skipping using BRAMs
- A mask_setup() comparison populates the mask BRAM
- Only wavefronts with active elements are stored
- Masked execution adds offsets from the mask BRAM to the base address
(Figure: mask vector mapped to wavefront numbers (offsets) stored in the mask BRAM)
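The following is a software model of the mechanism the slide describes, assuming an illustrative lane count and memory layout: setup scans the mask one wavefront at a time and records only the offsets of wavefronts with at least one active element; masked execution then iterates over those stored offsets instead of over all wavefronts.

#define LANES 4  // illustrative lane count

// Model of mask_setup(): record the offset of each wavefront (group of
// LANES elements) containing at least one active mask bit. Returns the
// number of stored offsets. (Illustrative model of the mask BRAM.)
int mask_setup_model(const unsigned char *mask, int n_wavefronts,
                     int *offset_bram)
{
    int stored = 0;
    for (int w = 0; w < n_wavefronts; w++) {
        int active = 0;
        for (int lane = 0; lane < LANES; lane++)
            active |= mask[w * LANES + lane];
        if (active)
            offset_bram[stored++] = w * LANES;  // keep active wavefronts only
    }
    return stored;
}

// Masked execution model: only stored offsets are visited, so fully
// inactive wavefronts cost zero cycles.
void masked_double_model(int *v, const int *offset_bram, int stored,
                         const unsigned char *mask)
{
    for (int k = 0; k < stored; k++) {
        int base = offset_bram[k];
        for (int lane = 0; lane < LANES; lane++)
            if (mask[base + lane])
                v[base + lane] *= 2;
    }
}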

Code Example

#define STRIDE 5

// Toy function to double every fifth element of the vector v_a
void double_every_fifth_element( int *v_a, int *v_temp, int vector_length )
{
    int i;

    // Initialize every fifth element of v_temp to zero
    for( i = 0; i < vector_length; i++ ) {
        v_temp[i] = i % STRIDE;
    }

    // Set the vector length which will be processed
    vbx_set_vl( vector_length );

    // Set the mask for every element equal to zero
    vbx_setup_mask( CMOV_Z, v_temp );

    // Perform the wavefront skipping operation: multiply all non-masked-off
    // elements of v_a by 2 and store the result back in v_a
    vbx_masked( SVW, MUL, v_a, 2, v_a );
}
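A call site for the function above might look like the following; since vectors must live in the scratchpad, this assumes a scratchpad allocator named vbx_sp_malloc(), which is not confirmed by these slides:

// Hypothetical call site; vbx_sp_malloc() is an assumed allocator name.
int n = 1000;
int *v_a    = vbx_sp_malloc(n * sizeof(int));
int *v_temp = vbx_sp_malloc(n * sizeof(int));
// ... fill v_a with input data, e.g. by DMA from main memory ...
double_every_fifth_element(v_a, v_temp, n);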

Partial Wavefront Skipping
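The slide's figure is not reproduced in the transcript. As a rough, assumption-based model (not taken from the slides): the lanes are split into partitions, each with its own offset list, so a partition can skip wavefronts that are inactive for its own lanes even when other partitions still have work there.

#define LANES 4
#define PARTITIONS 2
#define LANES_PER_PART (LANES / PARTITIONS)

// Rough model of partial wavefront skipping (assumption-based): each
// partition of lanes stores its own offsets, so wavefronts inactive in
// this partition are skipped even if other partitions are active there.
int partition_setup_model(const unsigned char *mask, int n_wavefronts,
                          int part, int *offsets)
{
    int stored = 0;
    for (int w = 0; w < n_wavefronts; w++) {
        int active = 0;
        for (int l = 0; l < LANES_PER_PART; l++)
            active |= mask[w * LANES + part * LANES_PER_PART + l];
        if (active)
            offsets[stored++] = w;  // entry in this partition's BRAM slice
    }
    return stored;
}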

BRAM Usage for Varying MMVL
Increasing the Maximum Masked Vector Length (MMVL):
- More wavefronts, and more speedup with sparse masks
- Deeper BRAMs needed
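As a quick sizing illustration (assuming one offset entry per wavefront, which these slides do not state explicitly), the mask BRAM depth grows linearly with MMVL:

// Illustrative sizing model, assuming one offset entry per wavefront:
// depth = MMVL / LANES, so doubling MMVL doubles the required BRAM depth.
// e.g. 16 lanes: MMVL = 4096 -> 256 entries; MMVL = 8192 -> 512 entries.
int mask_bram_depth(int mmvl, int lanes)
{
    return mmvl / lanes;
}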

BRAM Usage for Multiple Partitions

Work Done for Different Strategies

Different Strategies (Mandelbrot Set)

Face Detect Results

Speedup per Area (eALM)

Varying MMVL (Face Detect)

Limitations and Future Work
- Large area cost for multiple partitions: area overhead on V4 is 4% for 1 partition but 26% for 4 partitions
- Single mask: requires extra instructions to store/restore masks

Conclusions
- BRAMs used to implement wavefront skipping
- Wide configuration space: full wavefront skipping uses <5% more eALMs; multiple partitions and longer MMVL can use 100's of BRAMs
- 3x higher performance per area on early-exit algorithms

Thank You!