Zhiduo Liu. Supervisor: Guy Lemieux. Sep. 28th, 2012. Accelerator Compiler for the VENICE Vector Processor.

Outline: Motivation, Background, Implementation, Results, Conclusion

Motivation
Platforms: Multi-core, Many-core, GPU, FPGA, Vector Processor, Computer clusters, …
Languages and tools: CUDA, OpenCL, OpenMP, MPI, Pthread, OpenHMPP, OpenGL, SystemVerilog, Verilog, VHDL, Bluespec, Cilk, X10, Sh, aJava, ParC, Fortress, Chapel, Erlang, StreamIt, Sponge, SSE
Simplification

Motivation … Single Description

Contributions:
- The compiler serves as a new back-end of a single-description, multiple-device language.
- The compiler makes VENICE easier to program and debug.
- The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, "Accelerator Compiler for the VENICE Vector Processor," in FPGA 2012.
[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant and G. Lemieux, "VEGAS: Soft Vector Processor with Scratchpad Memory," in FPGA 2011.

Complicated
[Figure: VENICE architecture diagram. Block labels: ALIGN, WR, RD, ALIGN, EX1, EX2, ACCUM]

Program in VENICE assembly:

    #include "vector.h"

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};
        const int data_len = sizeof ( A );

        // Allocate vectors in scratchpad
        int *va = ( int *) vector_malloc ( data_len );

        // Move data from main memory to scratchpad
        vector_dma_to_vector ( va, A, data_len );
        // Wait for DMA transaction to be completed
        vector_wait_for_dma ();

        // Setup for vector instructions
        vector_set_vl ( data_len / sizeof (int) );
        // Perform vector computations
        vector ( SVW, VADD, va, 42, va );
        // Wait for vector operations to be completed
        vector_instr_sync ();

        // Move data from scratchpad to main memory
        vector_dma_to_host ( A, va, data_len );
        // Wait for DMA transaction to be completed
        vector_wait_for_dma ();

        // Deallocate memory from scratchpad
        vector_free ();
    }

Program in Accelerator:

    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};

        // Create a Target
        // (alternatives: CreateDX9Target(), CreateMulticoreTarget())
        Target *tgt = CreateVectorTarget();

        // Create Parallel Array objects
        IPA b = IPA( A, sizeof (A)/sizeof (int));

        // Write expressions
        IPA c = b + 42;

        // Call ToArray to evaluate expressions
        tgt->ToArray( c, A, sizeof (A)/sizeof (int));

        // Delete Target object
        tgt->Delete();
    }
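The "write expressions, then call ToArray" pattern can be read as deferred evaluation: operators only record work, and nothing is computed until a result is requested. The sketch below illustrates that idea in plain C++. The names `LazyArray`, `Node`, `makeArray`, and `toArray` are hypothetical and are not part of the Accelerator API; how Accelerator represents its graph internally is not shown in the slides.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical expression-graph node: a leaf holding data,
// a scalar constant, or an addition of two sub-expressions.
struct Node {
    enum Kind { Leaf, Const, Add } kind;
    std::vector<int> data;              // used by Leaf
    int value = 0;                      // used by Const
    std::shared_ptr<Node> lhs, rhs;     // used by Add
};

// Hypothetical stand-in for a parallel-array handle.
struct LazyArray {
    std::shared_ptr<Node> node;
};

LazyArray makeArray(const int* src, std::size_t n) {
    auto node = std::make_shared<Node>();
    node->kind = Node::Leaf;
    node->data.assign(src, src + n);
    return LazyArray{node};
}

// operator+ only records the operation; no arithmetic happens here.
LazyArray operator+(const LazyArray& a, int scalar) {
    auto c = std::make_shared<Node>();
    c->kind = Node::Const;
    c->value = scalar;
    auto add = std::make_shared<Node>();
    add->kind = Node::Add;
    add->lhs = a.node;
    add->rhs = c;
    return LazyArray{add};
}

// Evaluate one element of the recorded graph.
int evalAt(const Node& n, std::size_t i) {
    switch (n.kind) {
        case Node::Leaf:  return n.data[i];
        case Node::Const: return n.value;
        case Node::Add:   return evalAt(*n.lhs, i) + evalAt(*n.rhs, i);
    }
    return 0;
}

// Analogue of ToArray: walk the graph and write the results out.
void toArray(const LazyArray& expr, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = evalAt(*expr.node, i);
}
```

Because the whole graph is available before anything runs, a back-end is free to reorder, fuse, or retarget the operations, which is what the compiler stages on the following slides rely on.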

Assembly programming: write assembly -> compile with gcc -> download to board -> get result. Doesn't compile? Result incorrect? Go back and repeat the whole cycle.

Accelerator programming: write in Accelerator -> compile with Microsoft Visual Studio (doesn't compile, or result incorrect? go back) -> compile with gcc -> download to board -> get result.

Assembly programming:
1. Hard to program
2. Long debug cycle
3. Not portable
4. Manual: not always optimal or correct (WYSIWYG)

Accelerator programming:
1. Easy to program
2. Easy to debug
3. Can also target other devices
4. Automated compiler optimizations

    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        Target *tgtVector = CreateVectorTarget();
        const int length = 8192;
        int a[] = {1,2,3,4, …, 8192};
        int d[length];

        IPA A = IPA( a, length);
        IPA B = Evaluate( Rotate(A, [1]) + 1 );
        IPA C = Evaluate( Abs( A + 2 ));
        IPA D = ( A + B ) * C ;

        tgtVector->ToArray( D, d, length * sizeof(int));
        tgtVector->Delete();
    }

[Figure: expression graph for D = (A + B) × C, where B = Rot(A) + 1 and C = Abs(A + 2)]

[Figure: the same expression graph, with the Rot node sitting directly above its operand A]

[Figure: the Rot node is folded into the leaf, which becomes A (rot)]

[Figure: the graph is split into three sub-graphs: B = A (rot) + 1, C = Abs(A + 2), D = (A + B) × C]

Combine Operations
[Figure: in sub-graph C, Abs and + are combined into a single operation, written |+|, so that C = A |+| 2]
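The "Combine Operations" step can be pictured as a pattern rewrite on the expression tree: an Abs node whose child is an add is replaced by one fused node. The following is a minimal sketch under that reading; the `Expr` type, the string opcodes, and the `combine` function are hypothetical and do not come from the compiler's source.

```cpp
#include <memory>
#include <string>

// Hypothetical tree node. op is one of:
// "leaf", "const", "add", "abs", "absadd" (the fused |+|).
struct Expr {
    std::string op;
    std::string name;                 // label for leaves and constants
    std::shared_ptr<Expr> lhs, rhs;   // rhs is unused for "abs"
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr leaf(const std::string& n) {
    return std::make_shared<Expr>(Expr{"leaf", n, nullptr, nullptr});
}
ExprPtr constant(const std::string& n) {
    return std::make_shared<Expr>(Expr{"const", n, nullptr, nullptr});
}
ExprPtr add(ExprPtr a, ExprPtr b) {
    return std::make_shared<Expr>(Expr{"add", "", a, b});
}
ExprPtr absOf(ExprPtr a) {
    return std::make_shared<Expr>(Expr{"abs", "", a, nullptr});
}

// Bottom-up rewrite: Abs(Add(x, y)) becomes one "absadd" node.
ExprPtr combine(const ExprPtr& e) {
    if (!e) return nullptr;
    ExprPtr l = combine(e->lhs);
    ExprPtr r = combine(e->rhs);
    if (e->op == "abs" && l && l->op == "add") {
        return std::make_shared<Expr>(Expr{"absadd", "", l->lhs, l->rhs});
    }
    return std::make_shared<Expr>(Expr{e->op, e->name, l, r});
}

// Print the tree using the notation from the slides.
std::string show(const ExprPtr& e) {
    if (e->op == "leaf" || e->op == "const") return e->name;
    if (e->op == "abs") return "Abs(" + show(e->lhs) + ")";
    std::string sym = (e->op == "absadd") ? " |+| " : " + ";
    return "(" + show(e->lhs) + sym + show(e->rhs) + ")";
}
```

Applied to `absOf(add(leaf("A"), constant("2")))`, the rewrite turns two operations into one, which matches the single `vector_abs` instruction emitted for C later in the deck.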

Scratchpad memory as a "Virtual Vector Register File"
Number of vector registers = ?
Vector register size = ?

Evaluation Order
[Figure: the B, C and D sub-graphs with a numeric label on every node; the operations are then numbered in the order in which they will be evaluated]

Count number of virtual vector registers
[Figure: the B, C and D sub-graphs, visited in evaluation order while the counters below are updated]

Each row is one animation step. Ref count and Active are listed as A/B/C; "-" means the slide shows no value.

| Step | Ref count (A/B/C) | Active (A/B/C) | numLoads | numTemps | numTotal | maxTotal |
|------|-------------------|----------------|----------|----------|----------|----------|
| 1 | 3/-/- | - | - | - | - | - |
| 2 | 3/1/- | - | - | - | - | - |
| 3 | 3/1/1 | - | - | - | - | - |
| 4 | 3/1/1 | Yes/No/- | - | - | - | - |
| 5 | 3/1/1 | Yes/No/- | 1 | - | - | - |
| 6 | 2/1/1 | Yes/No/- | 1 | - | - | - |
| 7 | 2/1/1 | Yes/No/- | 1 | 1 | - | - |
| 8 | 2/1/1 | Yes/No/- | 1 | 1 | 2 | 2 |
| 9 | 2/1/1 | Yes/-/No | 2 | 1 | - | - |
| 10 | 1/1/1 | Yes/-/No | 2 | 1 | 3 | 3 |
| 11 | 1/1/1 | Yes/-/- | 3 | 0 | - | - |
| 12 | 0/1/1 | No/Yes/- | 3 | 0 | - | - |
| 13 | 0/0/1 | No/-/Yes | 3 | 0 | - | - |
| 14 | 0/0/0 | No/-/- | 3 | 0 | - | - |
| 15 | 0/0/0 | No/-/- | 3 | 0 | 3 | 3 |
| 16 | 0/0/0 | No/-/- | 0 | 0 | 0 | 3 |
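One way to see why the example needs three buffers is a liveness count over the flattened operations. The sketch below is a simplified stand-in, not the reference-counting procedure animated on the slides: it assumes each value occupies one buffer from its first appearance to its last use, and that a result may reuse the buffer of an operand that is read for the last time by the same operation. The `Op` type and `countBuffers` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// One flattened operation: dest = f(sources). Names are illustrative.
struct Op {
    std::string dest;
    std::vector<std::string> sources;
};

// Peak number of buffers that must exist at the same time.
int countBuffers(const std::vector<Op>& ops) {
    // Index of the last operation that reads each value.
    std::map<std::string, std::size_t> lastUse;
    for (std::size_t i = 0; i < ops.size(); ++i)
        for (const auto& s : ops[i].sources) lastUse[s] = i;

    std::set<std::string> live;
    std::size_t peak = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        // Operands must be resident while the operation runs.
        for (const auto& s : ops[i].sources) live.insert(s);

        // The result can be written in place if some operand dies here.
        bool reuse = false;
        for (const auto& s : ops[i].sources)
            if (lastUse[s] == i) reuse = true;
        peak = std::max(peak, live.size() + (reuse ? 0 : 1));

        // Release operands that are never read again; keep the result.
        for (const auto& s : ops[i].sources)
            if (lastUse[s] == i) live.erase(s);
        live.insert(ops[i].dest);
    }
    return static_cast<int>(std::max(peak, live.size()));
}
```

For the running example, written as B = f(A), C = g(A), T = A + B, D = T × C, this count peaks at 3 while C is being computed, because A, B and C must all be resident at that point.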

"Virtual Vector Register File"
Number of vector registers = 3
Vector register size = Capacity/3
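The register-size rule is a plain division of the scratchpad among the counted registers. A small sketch follows, assuming sizes are measured in bytes and that each register is rounded down to a whole number of elements; both the rounding and the function name `registerSizeBytes` are assumptions made for illustration.

```cpp
#include <cstddef>

// Split capacityBytes evenly into numRegisters buffers, rounding each
// buffer down to a multiple of elementBytes. Returns 0 on invalid input.
std::size_t registerSizeBytes(std::size_t capacityBytes,
                              std::size_t numRegisters,
                              std::size_t elementBytes) {
    if (numRegisters == 0 || elementBytes == 0) return 0;
    std::size_t raw = capacityBytes / numRegisters;
    return raw - (raw % elementBytes);
}
```

With an illustrative capacity of 65536 bytes, three registers and 4-byte words, each register gets 21844 bytes (5461 words). The capacity figure is only an example value, not a VENICE specification.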

Convert to LIR
[Figure: the B, C and D sub-graphs, numbered in evaluation order, are flattened into LIR]

    Result B = A(rot) + 1
    Result C = A |+| 2
    Result D = (A + B) × C

Code Generation

LIR operations and the vector instructions generated for them:

    Result B = A(rot) + 1    ->  vector ( SVW, VADD, vb, 1, va+1 );
    Result C = A |+| 2       ->  vector_abs ( SVW, VADD, vc, 2, va );
    Result D = (A + B) × C   ->  vector ( VVW, VADD, vb, vb, va );
                                 vector ( VVW, VMUL, vc, vc, vb );

Input buffer: 1 2 3 4 ... 8192, followed by one extra element (1) so that the rotated read at va+1 wraps around.

Generated code:

    #include "vector.h"

    int main()
    {
        int A[8192] = {1,2,3,4, … 8192};

        // Allocate scratchpad buffers; va and vd carry one extra word
        // for the rotated read.
        int *va = ( int *) vector_malloc ( 32772 );
        int *vb = ( int *) vector_malloc ( 32768 );
        int *vc = ( int *) vector_malloc ( 32768 );
        int *vd = ( int *) vector_malloc ( 32772 );
        int *vtemp = va;
        int i;   // declared outside the loop: it is used again after it

        // Start loading the first block.
        vector_dma_to_vector ( va, A, 32772 );

        for (i = 0; i < 4; i++) {
            vector_set_vl ( 1024 );

            // Double buffering: swap the two input buffers.
            vtemp = va; va = vd; vd = vtemp;

            // Wait for the current block, then prefetch the next one.
            vector_wait_for_dma ();
            if (i < 3)
                vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );

            // Write back the result of the previous iteration.
            if (i > 0) {
                vector_instr_sync ();
                vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
            }

            vector ( SVW, VADD, vb, 1, va+1 );    // B = A(rot) + 1
            vector_abs ( SVW, VADD, vc, 2, va );  // C = A |+| 2
            vector ( VVW, VADD, vb, vb, va );     // B = A + B
            vector ( VVW, VMUL, vc, vc, vb );     // D = (A + B) × C
        }

        // Write back the result of the last iteration.
        vector_instr_sync ();
        vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
        vector_wait_for_dma ();
        vector_free ();
    }
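To check what the generated code should produce, it helps to have a scalar reference for the same expression. The sketch below computes D = (A + B) × C element by element on the host. It assumes the rotation reads element i+1 with wrap-around, which is what the `va+1` operand and the extra trailing element in the input buffer suggest; `referenceD` is a hypothetical helper, not part of either API.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Scalar reference for the running example.
std::vector<int> referenceD(const std::vector<int>& a) {
    const std::size_t n = a.size();
    std::vector<int> d(n);
    for (std::size_t i = 0; i < n; ++i) {
        int b = a[(i + 1) % n] + 1;   // B = Rotate(A) + 1 (assumed direction)
        int c = std::abs(a[i] + 2);   // C = Abs(A + 2)
        d[i] = (a[i] + b) * c;        // D = (A + B) * C
    }
    return d;
}
```

For the input {1, 2, 3, 4} this gives {12, 24, 40, 36}: for example, the last element is (4 + (1 + 1)) × |4 + 2| = 36, where the 1 comes from the wrapped-around first element.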

Compiler flow: Expression Graph -> Convert to IR -> IR -> Convert to LIR -> LIR -> VENICE Code

Convert to IR: Sub-divide IR, Constant folding, CSE, Move Bounds to Leaves

Convert to LIR: Combine Memory transforms, Combine Operations, Evaluation Ordering, Buffer Counting, Calculate Register Size, Need Double buffering?

VENICE Code: Initialize Memory, Allocate Memory, Transfer Data To Scratchpad, Set VL, Write Vector Instructions, Transfer Result To Host

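370x Speedups

Compiler vs. Human (speedup of compiler-generated code over hand-written code; values above 1 favor the compiler)

|     | fir   | 2Dfir | life  | imgblend | median | motest |
|-----|-------|-------|-------|----------|--------|--------|
| V1  | 1.04x | 0.97x | 1.01x | 1.00x    | 0.99x  | 0.81x  |
| V4  | 1.01x | 1.12x | 1.10x | 1.02x    | 1.07x  | 1.01x  |
| V16 | 1.09x | 1.12x | 1.38x | 0.90x    | 0.96x  | 1.01x  |
| V64 | 1.30x | 1.42x | 2.24x | 0.92x    | 0.81x  | 1.04x  |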

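Compare to Intel CPU

Benchmark runtime (ms):

| CPU                   | fir   | 2Dfir | life  | imgblend | median | motest |
|-----------------------|-------|-------|-------|----------|--------|--------|
| Xeon E5540 (2.53 GHz) | 0.07  | 0.44  | 0.53  | 0.12     | 9.97   | 0.24   |
| VENICE (V64, 100 MHz) | 0.07  | 0.29  | 0.23  | 0.33     | 3.11   | 0.22   |
| Speedup               | 1.0x  | 1.5x  | 2.3x  | 0.4x     | 3.2x   | 1.1x   |

Compile Time

|                   | fir  | 2Dfir | life | imgblend | median | motest | geomean |
|-------------------|------|-------|------|----------|--------|--------|---------|
| Compile time (ms) | 4.74 | 5.05  | 4.49 | 4.44     | 92.72  | 24.27  | 10.12   |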

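Using smaller data types

| Benchmark | fir  | 2Dfir    | life | imgblend | median | motest |
|-----------|------|----------|------|----------|--------|--------|
| Data type | byte | halfword | byte | halfword | byte   | word   |

Speedup using bytes:

|     | fir   | life  | median | geomean |
|-----|-------|-------|--------|---------|
| V1  | 3.93x | 4.36x | 4.07x  | 4.12x   |
| V4  | 3.54x | 3.83x | 4.03x  | 3.79x   |
| V16 | 2.90x | 3.22x | 4.00x  | 3.34x   |

Speedup using halfwords:

|     | 2Dfir | imgblend | geomean |
|-----|-------|----------|---------|
| V1  | 1.96x | 1.54x    | 1.74x   |
| V4  | 2.00x | 1.46x    | 1.71x   |
| V16 | 1.97x | 1.83x    | 1.90x   |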

Conclusions:
- The compiler greatly improves the programming and debugging experience for VENICE.
- The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.
- The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable 3rd-party back-end support to provide a sustainable solution for future emerging hardware.

Thank you!

Optimal VL for V16 (look-up table)

Columns: input data sizes (words) 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576. Rows: instruction count. Each row lists its distinct VL values in order of increasing input size; the column boundaries between values are not recoverable from the slide text.

| Instruction count | VL values              |
|-------------------|------------------------|
| 1                 | 4096, 8192             |
| 2                 | 4096, 8192             |
| 3                 | 2048, 4096, 8192       |
| 4 to 16           | 1024, 2048, 4096, 8192 |

“Virtual Vector Register File” Number of vector registers = 4 Vector register size = 1024

Combine Operators for Motion Estimation

|             | V4    | V16   | V64   |
|-------------|-------|-------|-------|
| Before (ms) | 2.03  | 0.55  | 0.30  |
| After (ms)  | 1.36  | 0.37  | 0.21  |
| Speedup     | 1.49x | 1.48x | 1.43x |

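Performance Degradation on median

Human-written compare-and-swap (in place, 4 instructions):

    int *v_min = v_input1;
    int *v_max = v_input2;
    vector ( VVW, VOR,      v_tmp, v_min, v_min );   // v_tmp = v_min
    vector ( VVW, VSUB,     v_sub, v_max, v_min );   // v_sub = v_max - v_min
    vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );   // if v_sub < 0: v_min = v_max
    vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );   // if v_sub < 0: v_max = v_tmp

Compiler-generated compare-and-swap (separate outputs, 5 instructions):

    vector ( VVW, VSUB,      v_sub, v_input1, v_input2 );   // v_sub = v_input1 - v_input2
    vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );      // if v_sub >= 0: v_min = v_input2
    vector ( VVW, VCMV_LTZ,  v_min, v_input1, v_sub );      // if v_sub < 0:  v_min = v_input1
    vector ( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );      // if v_sub >= 0: v_max = v_input1
    vector ( VVW, VCMV_LTZ,  v_max, v_input2, v_sub );      // if v_sub < 0:  v_max = v_input2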
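The two compare-and-swap sequences can be modeled on the host to confirm that they compute the same minimum and maximum. The sketch below is a scalar model only. It assumes the subtract computes first operand minus second operand and that a conditional move copies its source wherever the condition vector satisfies the test. The helper names (`cmvLtz`, `cmvGtez`, `swapInPlace`, `swapOutOfPlace`) are hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<int>;

// dst[i] = src[i] wherever cond[i] < 0.
void cmvLtz(Vec& dst, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        if (cond[i] < 0) dst[i] = src[i];
}

// dst[i] = src[i] wherever cond[i] >= 0.
void cmvGtez(Vec& dst, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        if (cond[i] >= 0) dst[i] = src[i];
}

// Element-wise a - b.
Vec sub(const Vec& a, const Vec& b) {
    Vec r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}

// In-place variant: outputs overwrite the inputs (4 vector operations).
void swapInPlace(Vec& vmin, Vec& vmax) {
    Vec tmp = vmin;               // copy of the first input
    Vec diff = sub(vmax, vmin);   // negative where the pair is out of order
    cmvLtz(vmin, vmax, diff);
    cmvLtz(vmax, tmp, diff);
}

// Out-of-place variant: separate outputs (5 vector operations).
std::pair<Vec, Vec> swapOutOfPlace(const Vec& in1, const Vec& in2) {
    Vec vmin(in1.size()), vmax(in1.size());
    Vec diff = sub(in1, in2);
    cmvGtez(vmin, in2, diff);
    cmvLtz(vmin, in1, diff);
    cmvGtez(vmax, in1, diff);
    cmvLtz(vmax, in2, diff);
    return {vmin, vmax};
}
```

Under these assumptions both variants return identical results; the out-of-place form simply spends one more vector operation per compare-and-swap, which is consistent with the slowdown reported for median.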

Double Buffering
