Zhiduo Liu. Supervisor: Guy Lemieux. Sep. 28, 2012. Accelerator Compiler for the VENICE Vector Processor.

Presentation transcript:


Outline: Motivation, Background, Implementation, Results, Conclusion


Motivation: a proliferation of parallel hardware targets (multi-core, GPU, FPGA, many-core, computer clusters, vector processors, ...), each with its own programming models: CUDA, SystemVerilog, VHDL, OpenCL, Erlang, OpenMP, MPI, Pthreads, OpenHMPP, Verilog, Bluespec, Cilk, X10, OpenGL, Sh, Java, ParC, Fortress, Chapel, StreamIt, Sponge, SSE. Simplification is needed.

Motivation: a single description that can target all of these devices.

Contributions
- The compiler serves as a new back-end of a single-description, multiple-device language.
- The compiler makes VENICE easier to program and debug.
- The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, "Accelerator Compiler for the VENICE Vector Processor," in FPGA 2012.
[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant, G. Lemieux, "VEGAS: Soft Vector Processor with Scratchpad Memory," in FPGA 2011.

Outline: Motivation, Background, Implementation, Results, Conclusion

VENICE is complicated to program by hand. (Slide figure: the VENICE pipeline, with RD, ALIGN, EX1, EX2, ACCUM, ALIGN, and WR stages.)

Program in VENICE assembly:

#include "vector.h"
int main() {
    int A[] = {1,2,3,4,5,6,7,8};
    const int data_len = sizeof( A );
    int *va = (int *) vector_malloc( data_len );   // allocate vectors in scratchpad
    vector_dma_to_vector( va, A, data_len );       // move data from main memory to scratchpad
    vector_wait_for_dma();                         // wait for DMA transaction to complete
    vector_set_vl( data_len / sizeof(int) );       // set up for vector instructions
    vector( SVW, VADD, va, 42, va );               // perform vector computations
    vector_instr_sync();                           // wait for vector operations to complete
    vector_dma_to_host( A, va, data_len );         // move data from scratchpad to main memory
    vector_wait_for_dma();                         // wait for DMA transaction to complete
    vector_free();                                 // deallocate memory from scratchpad
}
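For reference, the VENICE program above simply adds 42 to every element of the array. A plain scalar model of the same computation (an illustrative sketch, not part of the original slides):

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of vector(SVW, VADD, va, 42, va):
// add the scalar s to every element of v.
void add_scalar(int *v, std::size_t n, int s) {
    for (std::size_t i = 0; i < n; i++)
        v[i] += s;
}
```

Everything else in the VENICE program (malloc, DMA, sync) is bookkeeping to move the data through the scratchpad.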

Program in Accelerator:

#include "Accelerator.h"
using namespace ParallelArrays;
using namespace MicrosoftTargets;
int main() {
    int A[] = {1,2,3,4,5,6,7,8};
    Target *tgt = CreateVectorTarget();            // create a Target
    IPA b = IPA( A, sizeof(A)/sizeof(int) );       // create Parallel Array objects
    IPA c = b + 42;                                // write expressions
    tgt->ToArray( c, A, sizeof(A)/sizeof(int) );   // call ToArray to evaluate expressions
    tgt->Delete();                                 // delete Target object
}

The same source targets other devices by swapping the Target:
    Target *tgt = CreateDX9Target();
    Target *tgt = CreateMulticoreTarget();

Assembly programming cycle: Write assembly → Compile with gcc → Download to board → Get result. If it doesn't compile, or the result is incorrect, the whole cycle repeats on the board.

Accelerator programming cycle: Write in Accelerator → Compile with Microsoft Visual Studio → Get result. If it doesn't compile, or the result is incorrect, fix and re-run on the host; only once correct: Compile with gcc → Download to board.

Assembly programming:
1. Hard to program
2. Long debug cycle
3. Not portable
4. Manual: not always optimal or correct (wysiwyg)

Accelerator programming:
1. Easy to program
2. Easy to debug
3. Can also target other devices
4. Automated compiler optimizations

Outline: Motivation, Background, Implementation, Results, Conclusion

#include "Accelerator.h"
using namespace ParallelArrays;
using namespace MicrosoftTargets;
int main() {
    Target *tgtVector = CreateVectorTarget();
    const int length = 8192;
    int a[] = {1,2,3,4, …, 8192};
    int d[length];
    IPA A = IPA( a, length );
    IPA B = Evaluate( Rotate(A, [1]) + 1 );
    IPA C = Evaluate( Abs( A + 2 ) );
    IPA D = ( A + B ) * C;
    tgtVector->ToArray( D, d, length * sizeof(int) );
    tgtVector->Delete();
}

(Slide figure: the expression graph for D = (A + B) × C, with B = Rotate(A, [1]) + 1 and C = Abs(A + 2).)
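A scalar reference for what this Accelerator program computes (a sketch; modeling Rotate(A, [1]) as a circular shift by one element is an assumption about its exact semantics):

```cpp
#include <cassert>
#include <cstdlib>

// Scalar reference for the Accelerator example:
//   B = Rotate(A, [1]) + 1   (assumed: circular shift by one)
//   C = Abs(A + 2)
//   D = (A + B) * C
void reference(const int *a, int *d, int n) {
    for (int i = 0; i < n; i++) {
        int b = a[(i + 1) % n] + 1;
        int c = std::abs(a[i] + 2);
        d[i] = (a[i] + b) * c;
    }
}
```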

The compiler first folds memory transforms: Rotate(A) becomes an offset access A(rot). The graph is then split at each Evaluate() call, so B = A(rot) + 1 and C = Abs(A + 2) become separate sub-graphs feeding D = (A + B) × C. Finally, adjacent operations are combined where VENICE supports fused instructions: Abs(A + 2) collapses into a single absolute-add |+| operation.
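Operation combining can be sketched as a peephole pass over the expression tree: when an Abs node's only child is an Add, the pair collapses into one fused absolute-add node, matching VENICE's ability to apply absolute value to an ALU result in the same instruction. (Illustrative sketch; the node names are invented, not the compiler's actual IR.)

```cpp
#include <cassert>
#include <memory>
#include <string>

// Tiny illustrative expression IR (names invented for this sketch).
struct Node {
    std::string op;                  // "add", "abs", "absadd", "leaf", ...
    std::shared_ptr<Node> lhs, rhs;
};

// Bottom-up pass: collapse Abs(Add(x, y)) into a fused AbsAdd(x, y).
std::shared_ptr<Node> combine(std::shared_ptr<Node> n) {
    if (!n) return n;
    n->lhs = combine(n->lhs);
    n->rhs = combine(n->rhs);
    if (n->op == "abs" && n->lhs && n->lhs->op == "add") {
        auto fused = std::make_shared<Node>();
        fused->op  = "absadd";
        fused->lhs = n->lhs->lhs;
        fused->rhs = n->lhs->rhs;
        return fused;       // one instruction instead of two
    }
    return n;
}
```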

The scratchpad memory is managed as a "virtual vector register file".

"Virtual vector register file": the compiler must decide the number of vector registers and the vector register size.

Evaluation order and buffer counting. The sub-graphs are evaluated in order (B, then C, then D = (A + B) × C, with B = A(rot) + 1 and C = |A + 2|). To size the virtual register file, the compiler walks the graph counting virtual vector registers. Each value carries a reference count (A: 3, B: 1, C: 1), decremented at every use; a buffer becomes inactive when its count reaches zero. During the walk the compiler tracks which buffers are active, the number of loaded inputs (numLoads), the number of live temporaries (numTemps), their sum (numTotal), and the high-water mark (maxTotal). For this example the walk proceeds:

Compute B = A(rot) + 1:   numLoads = 1, numTemps = 1, numTotal = 2, maxTotal = 2
Compute C = |A + 2|:      numLoads = 2, numTemps = 1, numTotal = 3, maxTotal = 3
Compute D = (A + B) × C:  numLoads = 3, numTemps = 0, numTotal = 3, maxTotal = 3

The walk ends with maxTotal = 3, so three virtual vector registers are required.
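The buffer-counting walk amounts to reference counting with a high-water mark. A minimal sketch of the bookkeeping (variable names follow the slides; the traversal is simplified into a precomputed list of steps, which is an assumption about the real compiler's structure):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One step of the graph walk: which input buffers first become live,
// how the count of live temporaries changes, and which buffers die.
struct Step {
    std::vector<std::string> loads;     // inputs first touched here
    int tempDelta;                      // change in live temporaries
    std::vector<std::string> releases;  // buffers whose last use is here
};

// Returns maxTotal: the number of virtual vector registers required.
int countRegisters(const std::vector<Step> &steps) {
    int numLoads = 0, numTemps = 0, maxTotal = 0;
    std::map<std::string, bool> active;
    for (const Step &s : steps) {
        for (const auto &in : s.loads)
            if (!active[in]) { active[in] = true; numLoads++; }
        numTemps += s.tempDelta;
        maxTotal = std::max(maxTotal, numLoads + numTemps);
        for (const auto &in : s.releases)
            active[in] = false;
    }
    return maxTotal;
}
```

For the running example (three uses of A, one temporary that dies when D is computed), the walk reports 3 registers, matching the trace above.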

"Virtual vector register file": number of vector registers = 3; vector register size = scratchpad capacity / 3.

Convert to LIR. Each combined sub-graph becomes a linear IR operation:
Result:B = A(rot) + 1
Result:C = |A + 2|   (fused absolute-add)
Result:D = (A + B) × C

Code generation. The LIR operations are emitted as VENICE instructions inside a double-buffered loop over 1024-element chunks:

#include "vector.h"
int main() {
    int A[8192] = {1,2,3,4, … 8192};
    int *va = (int *) vector_malloc( );
    int *vb = (int *) vector_malloc( );
    int *vc = (int *) vector_malloc( );
    int *vd = (int *) vector_malloc( );
    int *vtemp = va;
    int i;
    vector_dma_to_vector( va, A, );
    for (i = 0; i < 4; i++) {
        vector_set_vl( 1024 );
        vtemp = va; va = vd; vd = vtemp;               // rotate buffers
        vector_wait_for_dma();
        if (i < 3)
            vector_dma_to_vector( va, A+(i+1)*1024, ); // prefetch next chunk
        if (i > 0) {
            vector_instr_sync();
            vector_dma_to_host( A+(i-1)*1024, vc, );   // write back previous chunk
        }
        vector( SVW, VADD, vb, 1, va+1 );              // Result:B = A(rot) + 1
        vector_abs( SVW, VADD, vc, 2, va );            // Result:C = |A + 2|
        vector( VVW, VADD, vb, vb, va );               // A + B
        vector( VVW, VADD, vc, vc, vb );               // Result:D
    }
    vector_instr_sync();
    vector_dma_to_host( A+(i-1)*1024, vc, );           // write back last chunk
    vector_wait_for_dma();
    vector_free();
}

Compiler flow:
Expression graph
→ Convert to IR: constant folding, CSE, move bounds to leaves
→ IR: combine memory transforms, combine operations, sub-divide IR
→ Convert to LIR: evaluation ordering, buffer counting, calculate register size, decide whether double buffering is needed
→ LIR → VENICE code generation: initialize, allocate memory, transfer data to scratchpad, set VL, write vector instructions, transfer result to host

Outline: Motivation, Background, Implementation, Results, Conclusion

370x Speedups: Compiler vs. Human

       fir     2Dfir   life    imgblend  median  motest
V1     1.04x   0.97x   1.01x   1.00x     0.99x   0.81x
V4     1.01x   1.12x   1.10x   1.02x     1.07x   1.01x
V16    -       1.12x   1.38x   0.90x     0.96x   1.01x
V64    -       1.42x   2.24x   0.92x     0.81x   1.04x

(First-column values for V16 and V64 did not survive transcription.)

Compare to Intel CPU. Benchmark runtimes (ms) on a Xeon E5540 (2.53 GHz) vs. VENICE (V64, 100 MHz); speedup of VENICE over the Xeon:

fir 1.0x, 2Dfir 1.5x, life 2.3x, imgblend 0.4x, median 3.2x, motest 1.1x

Compile time (ms) per benchmark (fir, 2Dfir, life, imgblend, median, motest, geomean): values did not survive transcription.

Using smaller data types (speedup relative to word-sized data):

Speedup using bytes:      V1:  3.93x, 4.36x, 4.07x, 4.12x
                          V4:  3.54x, 3.83x, 4.03x, 3.79x
                          V16: 2.90x, 3.22x, 4.00x, 3.34x

Speedup using halfwords:  V1:  1.96x, 1.54x, 1.74x
                          V4:  2.00x, 1.46x, 1.71x
                          V16: 1.97x, 1.83x, 1.90x

(Benchmarks: fir, 2Dfir, life, imgblend, median, motest, geomean; per-column benchmark labels did not survive transcription.)

Outline: Motivation, Background, Implementation, Results, Conclusion

Conclusions:
- The compiler greatly improves the programming and debugging experience for VENICE.
- The compiler produces highly optimized VENICE code, achieving performance close to, or better than, hand-optimized code.
- The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator with pluggable third-party back-ends, as a sustainable solution for emerging hardware.

Thank you!

Backup: optimal VL for V16, chosen via a look-up table indexed by input data size (in words) and instruction count.

“Virtual Vector Register File” Number of vector registers = 4 Vector register size = 1024

Combine operations for motion estimation. Speedup: V4 1.49x, V16 1.48x, V64 1.43x. (Before/after runtimes in ms did not survive transcription.)

Performance degradation on median: compare-and-swap.

Human-written compare-and-swap (four vector instructions plus pointer setup):

int *v_min = v_input1;
int *v_max = v_input2;
vector( VVW, VOR,      v_tmp, v_min, v_min );   // copy old min into v_tmp
vector( VVW, VSUB,     v_sub, v_max, v_min );
vector( VVW, VCMV_LTZ, v_min, v_max, v_sub );   // if sub < 0: min = max
vector( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );   // if sub < 0: max = old min

Compiler-generated compare-and-swap (five vector instructions):

vector( VVW, VSUB,      v_sub, v_input1, v_input2 );
vector( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
vector( VVW, VCMV_LTZ,  v_min, v_input1, v_sub );
vector( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );
vector( VVW, VCMV_LTZ,  v_max, v_input2, v_sub );
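The conditional-move (VCMV_GTEZ / VCMV_LTZ) idiom maps to this branchless scalar pattern: compute the difference once, then select by its sign (an illustrative sketch, not the compiler's output):

```cpp
#include <cassert>

// Branchless compare-and-swap: after the call, *lo <= *hi.
// Mirrors the vector code: sub = a - b, then conditional moves
// on the sign of sub (VCMV_GTEZ / VCMV_LTZ).
void compare_swap(int *lo, int *hi) {
    int a = *lo, b = *hi;
    int sub = a - b;
    *lo = (sub >= 0) ? b : a;   // GTEZ: lo <- b ; LTZ: lo <- a
    *hi = (sub >= 0) ? a : b;   // GTEZ: hi <- a ; LTZ: hi <- b
}
```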

Double Buffering
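Double buffering overlaps DMA with computation: while the vector engine works on one scratchpad buffer, the DMA engine fills the next. A minimal sketch of the pattern using hypothetical dma_in/compute/dma_out helpers standing in for the real VENICE calls (vector_dma_to_vector, the vector instructions, vector_dma_to_host):

```cpp
#include <cassert>
#include <cstring>

enum { CHUNK = 4, CHUNKS = 3 };
static int memory[CHUNK * CHUNKS] = {1,2,3,4,5,6,7,8,9,10,11,12};

// Hypothetical stand-ins for DMA and vector compute on one chunk.
static void dma_in(int *buf, int chunk)  { std::memcpy(buf, memory + chunk * CHUNK, CHUNK * sizeof(int)); }
static void compute(int *buf)            { for (int i = 0; i < CHUNK; i++) buf[i] += 42; }
static void dma_out(int *buf, int chunk) { std::memcpy(memory + chunk * CHUNK, buf, CHUNK * sizeof(int)); }

void process_all() {
    int bufA[CHUNK], bufB[CHUNK];
    int *cur = bufA, *next = bufB;
    dma_in(cur, 0);                                  // prefetch first chunk
    for (int c = 0; c < CHUNKS; c++) {
        if (c + 1 < CHUNKS) dma_in(next, c + 1);     // fetch next chunk...
        compute(cur);                                // ...while computing on current
        dma_out(cur, c);                             // write result back
        int *t = cur; cur = next; next = t;          // swap buffers
    }
}
```

On real hardware the dma_in call would return immediately and run concurrently with compute; here the copies are sequential, but the buffer rotation is the same.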