Accelerator Compiler for the VENICE Vector Processor
Zhiduo Liu
Supervisor: Guy Lemieux
Sep. 28th, 2012
Outline: Motivation Background Implementation Results Conclusion
Motivation
Devices: Multi-core, Many-core, GPU, FPGA, Vector Processor, Computer clusters, …
Programming models: CUDA, OpenCL, OpenMP, MPI, Pthread, OpenHMPP, OpenGL, SystemVerilog, Verilog, VHDL, Bluespec, Erlang, Cilk, X10, Sh, aJava, ParC, Fortress, Chapel, StreamIt, Sponge, SSE
Simplification
Motivation … Single Description
Contributions
The compiler serves as a new back-end of a single-description multiple-device language.
The compiler makes VENICE easier to program and debug.
The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA
[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant and G. Lemieux, “VEGAS: soft vector processor with scratchpad memory,” in FPGA 2011.
Complicated
[Figure: VENICE pipeline diagram with stages labelled ALIGN, WR, RD, ALIGN, EX1, EX2, ACCUM]
Program in VENICE assembly

    #include "vector.h"

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};
        const int data_len = sizeof ( A );

        /* Allocate vectors in scratchpad */
        int *va = ( int *) vector_malloc ( data_len );

        /* Move data from main memory to scratchpad */
        vector_dma_to_vector ( va, A, data_len );
        /* Wait for DMA transaction to be completed */
        vector_wait_for_dma ();

        /* Setup for vector instructions */
        vector_set_vl ( data_len / sizeof (int) );
        /* Perform vector computations */
        vector ( SVW, VADD, va, 42, va );
        /* Wait for vector operations to be completed */
        vector_instr_sync ();

        /* Move data from scratchpad to main memory */
        vector_dma_to_host ( A, va, data_len );
        /* Wait for DMA transaction to be completed */
        vector_wait_for_dma ();

        /* Deallocate memory from scratchpad */
        vector_free ();
    }
Program in Accelerator

    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};

        // Create a Target
        // (other targets: CreateDX9Target(), CreateMulticoreTarget())
        Target *tgt = CreateVectorTarget();

        // Create Parallel Array objects
        IPA b = IPA( A, sizeof (A)/sizeof (int));
        // Write expressions
        IPA c = b + 42;

        // Call ToArray to evaluate expressions
        tgt->ToArray( c, A, sizeof (A)/sizeof (int));

        // Delete Target object
        tgt->Delete();
    }
Assembly Programming:
Write Assembly → Compile with Gcc → Download to board → Get Result
(Doesn’t compile? Result incorrect? Repeat.)

Accelerator Programming:
Write in Accelerator → Compile with Microsoft Visual Studio → Compile with Gcc → Download to board → Get Result
(Doesn’t compile? Or result incorrect? Repeat.)
Assembly Programming:
1. Hard to program
2. Long debug cycle
3. Not portable
4. Manual – Not always optimal or correct (wysiwyg)

Accelerator Programming:
1. Easy to program
2. Easy to debug
3. Can also target other devices
4. Automated compiler optimizations
    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        Target *tgtVector = CreateVectorTarget();
        const int length = 8192;
        int a[] = {1,2,3,4, …, 8192};
        int d[length];
        IPA A = IPA( a, length);
        IPA B = Evaluate( Rotate(A, [1]) + 1 );
        IPA C = Evaluate( Abs( A + 2 ));
        IPA D = ( A + B ) * C ;
        tgtVector->ToArray( D, d, length * sizeof(int));
        tgtVector->Delete();
    }

[Expression graph: D = (A + B) × C, where B = Rot(A) + 1 and C = Abs(A + 2)]
[Expression graph: Rot applied directly to leaf A]
[Expression graph: Rot merged into the leaf, shown as A (rot)]
[Expression graph sub-divided: D = (A + B) × C; B = A (rot) + 1; C = Abs(A + 2)]
Combine Operations
[Expression graph after combining: D = (A + B) × C; B = A (rot) + 1; C = A |+| 2]
Scratchpad Memory: “Virtual Vector Register File”
Number of vector registers = ?
Vector register size = ?
Evaluation Order
[Expression graphs for B, C, and D]
Count number of virtual vector registers
[Expression graphs for B, C, and D, traversed in evaluation order while the table below is updated]

    Ref Count    Active           numLoads  numTemps  numTotal  maxTotal
    A  B  C      A    B    C
    3  1  1      Yes  No          1
    2  1  1      Yes  No          1         1         2         2
    2  1  1      Yes       No     2         0
    1  1  1      Yes       No     2         1         3         3
    1  1  1      Yes              3         0
    0  1  1      No   Yes         3         0
    0  0  1      No        Yes    3         0
    0  0  0      No               3         0         3         3
    0  0  0      No               0         0         0         3
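The counting walk-through above can be approximated by a small liveness count. The C sketch below is an illustration only, not the compiler's actual implementation: the `Op` struct, `count_virtual_registers`, `example_register_count`, and the integer ids are hypothetical names, and the sketch assumes that an operation may write its result into the buffer of an operand whose reference count has just reached zero.

```c
#include <assert.h>

#define MAX_VALUES   16   /* hypothetical limit on distinct arrays */
#define MAX_OPERANDS 2

/* One vector operation in evaluation order (hypothetical representation). */
typedef struct {
    int result;                  /* id of the array this operation writes */
    int operands[MAX_OPERANDS];  /* ids of arrays it reads; -1 = scalar or unused */
} Op;

/* Returns the peak number of arrays that must be resident at the same time. */
int count_virtual_registers(const Op *ops, int num_ops, int num_values)
{
    int ref_count[MAX_VALUES] = {0};
    int active[MAX_VALUES] = {0};
    int live = 0, max_live = 0;

    assert(num_values > 0 && num_values <= MAX_VALUES);

    /* Pass 1: reference count = number of times each array is read. */
    for (int i = 0; i < num_ops; i++) {
        for (int k = 0; k < MAX_OPERANDS; k++) {
            int v = ops[i].operands[k];
            if (v >= 0) {
                assert(v < num_values);
                ref_count[v]++;
            }
        }
    }

    /* Pass 2: walk the operations in evaluation order. */
    for (int i = 0; i < num_ops; i++) {
        /* Operands that are not resident yet are loaded on first use. */
        for (int k = 0; k < MAX_OPERANDS; k++) {
            int v = ops[i].operands[k];
            if (v >= 0 && !active[v]) { active[v] = 1; live++; }
        }
        if (live > max_live) max_live = live;

        /* Each read uses up one reference; the last read frees the buffer. */
        for (int k = 0; k < MAX_OPERANDS; k++) {
            int v = ops[i].operands[k];
            if (v >= 0 && --ref_count[v] == 0) { active[v] = 0; live--; }
        }

        /* The result takes a buffer, possibly the one that was just freed. */
        int r = ops[i].result;
        assert(r >= 0 && r < num_values);
        if (!active[r]) { active[r] = 1; live++; }
        if (live > max_live) max_live = live;
    }
    return max_live;
}

/* The example from the slides, with ids A=0, B=1, C=2, T=3 (A + B), D=4. */
int example_register_count(void)
{
    const Op ops[] = {
        { 1, { 0, -1 } },   /* B = A(rot) + 1 */
        { 2, { 0, -1 } },   /* C = A |+| 2    */
        { 3, { 0,  1 } },   /* T = A + B      */
        { 4, { 3,  2 } },   /* D = T * C      */
    };
    return count_virtual_registers(ops, 4, 5);
}
```

Calling `example_register_count()` walks B, C, and D in that order and reports a peak of three resident arrays, which agrees with the maxTotal reached on the slide.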
“Virtual Vector Register File” Number of vector registers = 3 Vector register size = Capacity/3
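The register-size rule (capacity divided by the number of registers) can be written out as plain arithmetic. This is a minimal sketch under stated assumptions: the function names, the idea of expressing the size in words, and the 12288-byte capacity used in the usage note are hypothetical and are not taken from the VENICE toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one virtual vector register in words, when the scratchpad is
   split evenly between num_registers buffers (hypothetical helper). */
size_t vector_register_words(size_t scratchpad_bytes,
                             size_t num_registers,
                             size_t word_bytes)
{
    assert(num_registers > 0 && word_bytes > 0);
    return scratchpad_bytes / num_registers / word_bytes;
}

/* Number of passes needed to cover total_words when each pass handles
   at most register_words elements (rounds up). */
size_t num_chunks(size_t total_words, size_t register_words)
{
    assert(register_words > 0);
    return (total_words + register_words - 1) / register_words;
}
```

For example, with a hypothetical 12288-byte scratchpad, three registers, and 4-byte words, `vector_register_words(12288, 3, 4)` gives 1024 words per register, and `num_chunks(8192, 1024)` gives 8 passes over an 8192-element array.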
Convert to LIR
[Expression graphs: D = (A + B) × C; B = A (rot) + 1; C = A |+| 2]

    Result:B   A(rot) 1 +
    Result:C   A 2 |+|
    Result:D   A B + C ×
Code Generation

    Result:B   A(rot) 1 +
    Result:C   A 2 |+|
    Result:D   A B + C ×

    #include "vector.h"

    int main(){
        int A[8192] = {1,2,3,4, … 8192};
        int *va = ( int *) vector_malloc ( );
        int *vb = ( int *) vector_malloc ( );
        int *vc = ( int *) vector_malloc ( );
        int *vd = ( int *) vector_malloc ( );
        int *vtemp = va;
        vector_dma_to_vector ( va, A, );
        for(int i=0; i<4; i++){
            vector_set_vl ( 1024 );
            vtemp = va; va = vd; vd = vtemp;
            vector_wait_for_dma ();
            if(i<3)
                vector_dma_to_vector ( va, A+(i+1)*1024, );
            if(i>0){
                vector_instr_sync ();
                vector_dma_to_host ( A+(i-1)*1024, vc, );
            }
            vector ( SVW, VADD, vb, 1, va+1 );      /* Result:B */
            vector_abs ( SVW, VADD, vc, 2, va );    /* Result:C */
            vector ( VVW, VADD, vb, vb, va );       /* Result:D */
            vector ( VVW, VADD, vc, vc, vb );
        }
        vector_instr_sync ();
        vector_dma_to_host ( A+(i-1)*1024, vc, );
        vector_wait_for_dma ();
        vector_free ();
    }
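The loop structure of the generated code (prefetch the next chunk, compute on the current one, write the result back) can be tried out on a host machine. The sketch below is not VENICE code and not compiler output: `memcpy` stands in for the DMA calls, the function name and the `CHUNK` size are hypothetical, and the copies run sequentially here, whereas on the real hardware the transfer is meant to overlap with the computation.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4   /* hypothetical chunk length in elements */

/* Adds `scalar` to data[0..n) one chunk at a time, alternating between
   two staging buffers. n must be a multiple of CHUNK. */
void add_scalar_double_buffered(int *data, int n, int scalar)
{
    int buf[2][CHUNK];
    int num_chunks = n / CHUNK;

    assert(n >= 0 && n % CHUNK == 0);
    if (num_chunks == 0)
        return;

    /* Prefetch chunk 0 before entering the loop. */
    memcpy(buf[0], data, sizeof buf[0]);

    for (int i = 0; i < num_chunks; i++) {
        int *cur  = buf[i % 2];        /* chunk to compute on now */
        int *next = buf[(i + 1) % 2];  /* buffer for the following chunk */

        /* Start fetching the next chunk while the current one is processed. */
        if (i + 1 < num_chunks)
            memcpy(next, data + (i + 1) * CHUNK, sizeof buf[0]);

        /* "Vector instruction": scalar add over the current chunk. */
        for (int j = 0; j < CHUNK; j++)
            cur[j] += scalar;

        /* Write the finished chunk back to main memory. */
        memcpy(data + i * CHUNK, cur, sizeof buf[0]);
    }
}
```

Two buffers are enough here because a chunk is written back before its buffer is reused two iterations later; the generated code above keeps separate input and result buffers for the same reason.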
Compiler flow

Expression Graph → IR: Convert to IR; Constant folding; CSE; Move Bounds to Leaves; Sub-divide IR
IR → LIR: Combine Memory transforms; Combine Operations; Evaluation Ordering; Buffer Counting; Calculate Register Size; Need Double buffering?; Convert To LIR
LIR → VENICE Code: Allocate Memory; Initialize Memory; Transfer Data To Scratchpad; Set VL; Write Vector Instructions; Transfer Result To Host
370x Speedups

Compiler vs. Human

          fir     2Dfir   life    imgblend  median  motest
    V1    1.04x   0.97x   1.01x   1.00x     0.99x   0.81x
    V4    1.01x   1.12x   1.10x   1.02x     1.07x   1.01x
    V     …       1.12x   1.38x   0.90x     0.96x   1.01x
    V     …       1.42x   2.24x   0.92x     0.81x   1.04x
Compare to Intel CPU

    CPU (Benchmark Runtime, ms)   fir     2Dfir   life    imgblend  median  motest
    Xeon E5540 (2.53GHz)
    VENICE (V64, 100MHz)
    Speedup                       1.0x    1.5x    2.3x    0.4x      3.2x    1.1x

Compile Time

                         fir     2Dfir   life    imgblend  median  motest  geomean
    Compile time (ms)
Using smaller data types

    Benchmark   fir    2Dfir      life   imgblend   median   motest
    Data type   byte   halfword   byte   halfword   byte     word

Speedup using bytes

           fir     life    median   geomean
    V1     3.93x   4.36x   4.07x    4.12x
    V4     3.54x   3.83x   4.03x    3.79x
    V16    2.90x   3.22x   4.00x    3.34x

Speedup using halfwords

           2Dfir   imgblend   geomean
    V1     1.96x   1.54x      1.74x
    V4     2.00x   1.46x      1.71x
    V16    1.97x   1.83x      1.90x
Conclusions:
The compiler greatly improves the programming and debugging experience for VENICE.
The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.
The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable 3rd-party back-end support to provide a sustainable solution for emerging hardware.

Thank you!
Optimal VL for V16
[Look-up Table: optimal VL indexed by Input Data Sizes (words) and Instruction Count]
“Virtual Vector Register File” Number of vector registers = 4 Vector register size = 1024
Combine Operators for Motion Estimation

                   V4      V16     V64
    Before (ms)
    After (ms)
    Speedup        1.49x   1.48x   1.43x
Performance Degradation on median

Human-written compare-and-swap:

    int *v_min = v_input1;
    int *v_max = v_input2;
    vector ( VVW, VOR, v_tmp, v_min, v_min );
    vector ( VVW, VSUB, v_sub, v_max, v_min );
    vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

Compiler-generated compare-and-swap:

    vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
    vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
    vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
    vector ( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );
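The two compare-and-swap variants can be modelled on the host with scalar loops to check that they produce the same minimum and maximum. This sketch rests on assumptions read off the listings rather than taken from VENICE documentation: VSUB is taken to compute the first source minus the second, and VCMV_LTZ / VCMV_GTEZ are taken to copy the first source into the destination where the second source is negative / non-negative. The function names are hypothetical.

```c
#include <assert.h>

/* Model of the human-written version: swaps in place, so the inputs
   are overwritten. One copy, one subtract, two conditional moves. */
void cas_in_place(int *v_min, int *v_max, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int tmp = v_min[i];              /* VOR  v_tmp = v_min         */
        int sub = v_max[i] - v_min[i];   /* VSUB v_sub = v_max - v_min */
        if (sub < 0) v_min[i] = v_max[i];  /* VCMV_LTZ */
        if (sub < 0) v_max[i] = tmp;       /* VCMV_LTZ */
    }
}

/* Model of the compiler-generated version: keeps the inputs intact and
   fills separate outputs. One subtract, four conditional moves. */
void cas_out_of_place(const int *in1, const int *in2,
                      int *v_min, int *v_max, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int sub = in1[i] - in2[i];         /* VSUB v_sub = in1 - in2 */
        if (sub >= 0) v_min[i] = in2[i];   /* VCMV_GTEZ */
        if (sub <  0) v_min[i] = in1[i];   /* VCMV_LTZ  */
        if (sub >= 0) v_max[i] = in1[i];   /* VCMV_GTEZ */
        if (sub <  0) v_max[i] = in2[i];   /* VCMV_LTZ  */
    }
}
```

The out-of-place form issues one more vector instruction per compare-and-swap than the in-place form, which is consistent with the slowdown reported for median. Both models assume the subtraction does not overflow.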
Double Buffering