Published by Jeffery Ramsey. Modified over 9 years ago.
1
Accelerator Compiler for the VENICE Vector Processor
Zhiduo Liu
Supervisor: Guy Lemieux
Sep. 28th, 2012
2
Outline: Motivation Background Implementation Results Conclusion
4
Motivation
Hardware targets: Multi-core, GPU, FPGA, Many-core, Computer clusters, Vector Processor, …
Languages, APIs, and extensions: CUDA, SystemVerilog, VHDL, OpenCL, Erlang, OpenMP, MPI, Pthread, OpenHMPP, Verilog, Bluespec, Cilk, X10, OpenGL, Sh, aJava, ParC, Fortress, Chapel, StreamIt, Sponge, SSE
5
Motivation [same hardware targets and languages as on the previous slide]
Simplification
6
Motivation … Single Description
7
Contributions
The compiler serves as a new back-end of a single-description multiple-device language.
The compiler makes VENICE easier to program and debug.
The compiler provides auto-parallelization and optimization.
[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.
[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant, G. Lemieux, “VEGAS: soft vector processor with scratchpad memory,” in FPGA 2011.
8
Outline: Motivation Background Implementation Results Conclusion
9
Complicated
[Figure: VENICE pipeline diagram; surviving stage labels: ALIGN, WR, RD, ALIGN, EX1, EX2, ACCUM]
10
Program in VENICE assembly

#include "vector.h"
int main() {
    int A[] = {1,2,3,4,5,6,7,8};
    const int data_len = sizeof ( A );
    int *va = ( int *) vector_malloc ( data_len );   // Allocate vectors in scratchpad
    vector_dma_to_vector ( va, A, data_len );        // Move data from main memory to scratchpad
    vector_wait_for_dma ();                          // Wait for DMA transaction to be completed
    vector_set_vl ( data_len / sizeof (int) );       // Setup for vector instructions
    vector ( SVW, VADD, va, 42, va );                // Perform vector computations
    vector_instr_sync ();                            // Wait for vector operations to be completed
    vector_dma_to_host ( A, va, data_len );          // Move data from scratchpad to main memory
    vector_wait_for_dma ();                          // Wait for DMA transaction to be completed
    vector_free ();                                  // Deallocate memory from scratchpad
}
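For readers without a VENICE board, the sequence on this slide can be mimicked in portable C. This is only a host-side sketch of the data movement and of the scalar-vector add: `memcpy` stands in for the DMA calls, and the names `scratchpad`, `emulate_svw_vadd`, `add_42_on_host`, and `add_42_example` are hypothetical, not part of the VENICE API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the scratchpad; a real VENICE build uses vector_malloc(). */
static int scratchpad[8];

/* Emulates vector(SVW, VADD, dst, scalar, src): dst[i] = scalar + src[i] for vl elements. */
static void emulate_svw_vadd(int *dst, int scalar, const int *src, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = scalar + src[i];
}

/* Mirrors the slide's steps on the host; returns 0 on success, -1 if n is too large. */
int add_42_on_host(int *a, size_t n)
{
    if (n > sizeof scratchpad / sizeof scratchpad[0])
        return -1;
    memcpy(scratchpad, a, n * sizeof(int));          /* main memory -> scratchpad */
    emulate_svw_vadd(scratchpad, 42, scratchpad, n); /* set VL, then the vector add */
    memcpy(a, scratchpad, n * sizeof(int));          /* scratchpad -> main memory */
    return 0;
}

/* Runs the slide's input {1,...,8} and returns first + last element of the result. */
int add_42_example(void)
{
    int a[] = {1, 2, 3, 4, 5, 6, 7, 8};
    add_42_on_host(a, 8);
    return a[0] + a[7];
}
```

Calling `add_42_example()` adds 42 to every element of the slide's input array and sums the first and last results.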
11
Program in Accelerator

#include "Accelerator.h"
using namespace ParallelArrays;
using namespace MicrosoftTargets;
int main() {
    int A[] = {1,2,3,4,5,6,7,8};
    Target *tgt = CreateVectorTarget();                  // Create a Target
    IPA b = IPA( A, sizeof (A)/sizeof (int));            // Create Parallel Array objects
    IPA c = b + 42;                                      // Write expressions
    tgt->ToArray( c, A, sizeof (A)/sizeof (int));        // Call ToArray to evaluate expressions
    tgt->Delete();                                       // Delete Target object
}

Alternative targets shown on the slide:
    Target *tgt = CreateDX9Target();
    Target *tgt = CreateMulticoreTarget();
12
Assembly Programming: Write Assembly → Compile with Gcc → Download to board → Get Result (loop back on: Doesn’t compile? Result incorrect?)
Accelerator Programming: Write in Accelerator → Compile with Microsoft Visual Studio (loop back on: Doesn’t compile? Or result incorrect?) → Compile with Gcc → Download to board → Get Result
13
Assembly Programming:
1. Hard to program
2. Long debug cycle
3. Not portable
4. Manual: not always optimal or correct (WYSIWYG)
Accelerator Programming:
1. Easy to program
2. Easy to debug
3. Can also target other devices
4. Automated compiler optimizations
14
Outline: Motivation Background Implementation Results Conclusion
17
#include "Accelerator.h"
using namespace ParallelArrays;
using namespace MicrosoftTargets;
int main() {
    Target *tgtVector = CreateVectorTarget();
    const int length = 8192;
    int a[] = {1,2,3,4, …, 8192};
    int d[length];
    IPA A = IPA( a, length);
    IPA B = Evaluate( Rotate(A, [1]) + 1 );
    IPA C = Evaluate( Abs( A + 2 ));
    IPA D = ( A + B ) * C ;
    tgtVector->ToArray( D, d, length * sizeof(int));
    tgtVector->Delete();
}

[Expression graph for D: × at the root, with children A + (Rot(A) + 1) and Abs(A + 2)]
18
[Expression graph: D = (A + (Rot(A) + 1)) × Abs(A + 2), with Rot applied directly to the leaf A]
19
[Expression graph: D = (A + (A(rot) + 1)) × Abs(A + 2), with the rotation folded into the leaf A (rot)]
20
[Expression graph split into sub-graphs: B = A(rot) + 1; C = Abs(A + 2); D = (A + B) × C]
23
Combine Operations
[Sub-graphs before combining: B = A(rot) + 1; C = Abs(A + 2); D = (A + B) × C]
24
Combine Operations
[Sub-graphs after combining: B = A(rot) + 1; C = A |+| 2, with Abs and + merged into a single |+| node; D = (A + B) × C]
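The effect of combining Abs and + into one |+| node can be illustrated on the host. The sketch below is not the compiler's code: it only contrasts a two-pass evaluation that needs a temporary vector with a fused single pass. All function names here are hypothetical.

```c
#include <stddef.h>

static int abs_int(int x) { return x < 0 ? -x : x; }

/* Unfused: one pass for A + k into a temporary, a second pass for Abs. */
void add_then_abs(const int *a, int k, int *tmp, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = a[i] + k;
    for (size_t i = 0; i < n; i++)
        out[i] = abs_int(tmp[i]);
}

/* Fused: a single pass computing |A + k|, with no temporary vector. */
void abs_add_fused(const int *a, int k, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = abs_int(a[i] + k);
}

/* Returns 1 when both variants agree on a small sample input, 0 otherwise. */
int fused_matches_unfused(void)
{
    const int a[] = {-7, -2, 0, 3, 9};
    int tmp[5], two_pass[5], one_pass[5];
    add_then_abs(a, 2, tmp, two_pass, 5);
    abs_add_fused(a, 2, one_pass, 5);
    for (size_t i = 0; i < 5; i++)
        if (two_pass[i] != one_pass[i])
            return 0;
    return 1;
}
```

The fused form issues one operation instead of two and frees the temporary, which is the kind of saving the slide's |+| node suggests.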
25
Scratchpad Memory: “Virtual Vector Register File”
27
Number of vector registers = ? Vector register size = ?
28
“Virtual Vector Register File” Number of vector registers = ? Vector register size = ?
29
Evaluation Order
[Sub-graphs for B, C, and D with a numeric label on each node; the label-to-node mapping is not recoverable from this transcript]
30
Count number of virtual vector registers
[Animation over slides 30-58 on the sub-graphs for B, C, and D. Each row gives the bookkeeping shown on those slides; "-" marks a field that is blank or not shown.]

Slides   Ref count (A B C)   Active (A B C)   numLoads  numTemps  numTotal  maxTotal
30-31    -  -  -             -    -    -      -         -         -         -
32-33    3  -  -             -    -    -      -         -         -         -
34-35    3  1  -             -    -    -      -         -         -         -
36       3  1  1             -    -    -      -         -         -         -
37       3  1  1             Yes  No   -      -         -         -         -
38-39    3  1  1             Yes  No   -      1         -         -         -
40-42    2  1  1             Yes  No   -      1         -         -         -
43       2  1  1             Yes  No   -      1         1         -         -
44       2  1  1             Yes  No   -      1         1         2         2
45       2  1  1             Yes  -    No     2         1         -         -
46       2  1  1             Yes  -    No     2         0         -         -
47-48    1  1  1             Yes  -    No     2         0         -         -
49       1  1  1             Yes  -    No     2         1         -         -
50       1  1  1             Yes  -    No     2         1         3         3
51       1  1  1             Yes  -    -      3         0         -         -
52       0  1  1             No   Yes  -      3         0         -         -
53-54    0  0  1             No   -    Yes    3         0         -         -
55-56    0  0  0             No   -    -      3         0         -         -
57       0  0  0             No   -    -      3         0         3         3
58       0  0  0             No   -    -      0         0         0         3
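One way to arrive at the same final count of 3 is a reference-count liveness pass over the sub-expressions in evaluation order. This is a minimal sketch under stated assumptions, not the compiler's actual bookkeeping (the slides track numLoads and numTemps separately): it assumes a result may reuse the buffer of an operand whose last use is the current step, and `Step`, `count_buffers`, and `slide_example_buffer_count` are hypothetical names.

```c
#include <stddef.h>

#define MAX_VALUES 16

/* One sub-expression: produces value `result` from `num_operands` input values. */
typedef struct {
    int result;
    int operands[4];
    int num_operands;
} Step;

/* Returns the peak number of buffers occupied at once while running the steps in order. */
int count_buffers(const Step *steps, size_t num_steps)
{
    int ref_count[MAX_VALUES] = {0};
    int live[MAX_VALUES] = {0};
    int num_live = 0, max_live = 0;

    /* Pass 1: count how many times each value is used as an operand. */
    for (size_t s = 0; s < num_steps; s++)
        for (int k = 0; k < steps[s].num_operands; k++)
            ref_count[steps[s].operands[k]]++;

    /* Pass 2: walk the steps, tracking which values occupy a buffer. */
    for (size_t s = 0; s < num_steps; s++) {
        int freed = 0;
        for (int k = 0; k < steps[s].num_operands; k++) {
            int v = steps[s].operands[k];
            if (!live[v]) { live[v] = 1; num_live++; }   /* load on first use */
        }
        for (int k = 0; k < steps[s].num_operands; k++) {
            int v = steps[s].operands[k];
            if (--ref_count[v] == 0) { live[v] = 0; num_live--; freed++; }
        }
        /* Buffers in use during this step: survivors, dying operands, and a
           fresh result buffer only when no dying operand can be overwritten. */
        int during = num_live + freed + (freed == 0 ? 1 : 0);
        if (during > max_live)
            max_live = during;
        if (ref_count[steps[s].result] > 0) {            /* result is used later */
            live[steps[s].result] = 1;
            num_live++;
        }
    }
    return max_live;
}

/* The slide's example: B = f(A), C = g(A), D = h(A, B, C), with A=0, B=1, C=2, D=3. */
int slide_example_buffer_count(void)
{
    const Step steps[] = {
        {1, {0}, 1},
        {2, {0}, 1},
        {3, {0, 1, 2}, 3},
    };
    return count_buffers(steps, 3);
}
```

For the example graph, the peak occurs while C is computed (A, B, and C all resident), matching the maxTotal of 3 shown on the slides.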
59
“Virtual Vector Register File” Number of vector registers = 3 Vector register size = ?
60
“Virtual Vector Register File” Number of vector registers = 3 Vector register size = Capacity/3
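The division on this slide can be written out as a small helper. This is a sketch only: the function name `vector_register_bytes` is hypothetical, and rounding down to a whole number of elements is an assumption made here so that a register never ends in a partial element.

```c
#include <stddef.h>

/* Splits a scratchpad of capacity_bytes evenly into num_registers virtual
   vector registers, rounded down to a multiple of element_bytes.
   Returns 0 when either divisor is 0. */
size_t vector_register_bytes(size_t capacity_bytes, size_t num_registers,
                             size_t element_bytes)
{
    if (num_registers == 0 || element_bytes == 0)
        return 0;
    size_t raw = capacity_bytes / num_registers;
    return raw - raw % element_bytes;
}
```

For instance, with an illustrative 32768-byte scratchpad, 3 registers, and 4-byte words, `vector_register_bytes(32768, 3, 4)` gives 10920 bytes, or 2730 words per register.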
61
Convert to LIR
Result:B  A(rot) 1 +
Result:C  A 2 |+|
Result:D  A B + C ×
[Sub-graphs for B, C (with the combined |+| node), and D, annotated with evaluation-order numbers]
62
Code Generation (slides 62-68)

LIR:
Result:B  A(rot) 1 +
Result:C  A 2 |+|
Result:D  A B + C ×

Input array A: 1 2 3 4 ... 8192 (slide 63), then with its first element, 1, appended (slide 64) so that the rotated operand A(rot) can be read at va+1.

Slides 65-68 emit one vector instruction per LIR operation. The complete generated program:

#include "vector.h"
int main(){
    int A[8192] = {1,2,3,4, … 8192};
    int *va = ( int *) vector_malloc ( 32772 );
    int *vb = ( int *) vector_malloc ( 32768 );
    int *vc = ( int *) vector_malloc ( 32768 );
    int *vd = ( int *) vector_malloc ( 32772 );
    int *vtemp = va;
    int i;    // declared outside the loop because it is used after it
    vector_dma_to_vector ( va, A, 32772 );
    for(i=0; i<4; i++){
        vector_set_vl ( 1024 );
        vtemp = va; va = vd; vd = vtemp;
        vector_wait_for_dma ();
        if(i<3) vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );
        if(i>0){
            vector_instr_sync ();
            vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
        }
        vector ( SVW, VADD, vb, 1, va+1 );      // slide 65: B = A(rot) + 1
        vector_abs ( SVW, VADD, vc, 2, va );    // slide 66: C = A |+| 2
        vector ( VVW, VADD, vb, vb, va );       // slide 67: A + B
        vector ( VVW, VMUL, vc, vc, vb );       // slide 68: D = (A + B) × C
    }
    vector_instr_sync ();
    vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
    vector_wait_for_dma ();
    vector_free ();
}
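The generated loop strip-mines the input into chunks and alternates between two input buffers. The host-only sketch below imitates that structure for the same expression, D = (A + (A(rot) + 1)) × |A + 2|. It is an illustration under assumptions, not the compiler's output: a plain copy loop stands in for DMA (so nothing actually overlaps), the chunk size of 1024 is chosen for the example, and every name in it is hypothetical.

```c
#include <stddef.h>

enum { N = 8192, CHUNK = 1024 };

static int abs_int(int x) { return x < 0 ? -x : x; }

/* Copies CHUNK + 1 elements starting at `start`, wrapping at N so that the
   rotated read of the last chunk sees a[0]. Stands in for a DMA transfer. */
static void load_chunk(int *buf, const int *a, size_t start)
{
    for (size_t k = 0; k <= CHUNK; k++)
        buf[k] = a[(start + k) % N];
}

/* Processes a[] chunk by chunk, ping-ponging between two input buffers. */
void compute_double_buffered(const int *a, int *d)
{
    static int buf[2][CHUNK + 1];
    int cur = 0;
    load_chunk(buf[cur], a, 0);                      /* prefetch the first chunk */
    for (size_t c = 0; c < N / CHUNK; c++) {
        int next = cur ^ 1;
        if (c + 1 < N / CHUNK)                       /* on hardware this load would */
            load_chunk(buf[next], a, (c + 1) * CHUNK); /* overlap with the compute */
        const int *va = buf[cur];
        for (size_t i = 0; i < CHUNK; i++) {
            int b = va[i + 1] + 1;                   /* B = A(rot) + 1 */
            int cc = abs_int(va[i] + 2);             /* C = |A + 2|    */
            d[c * CHUNK + i] = (va[i] + b) * cc;     /* D = (A + B) × C */
        }
        cur = next;                                  /* swap buffers */
    }
}

/* Returns 1 if the chunked result equals a direct element-wise evaluation. */
int double_buffered_matches_reference(void)
{
    static int a[N], d[N];
    for (size_t i = 0; i < N; i++)
        a[i] = (int)i + 1;
    compute_double_buffered(a, d);
    for (size_t i = 0; i < N; i++) {
        int b = a[(i + 1) % N] + 1;
        if (d[i] != (a[i] + b) * abs_int(a[i] + 2))
            return 0;
    }
    return 1;
}
```

Each buffer holds one extra element so that the rotated operand can be read at offset 1, similar to the 4-byte padding on the va and vd allocations in the slide.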
69
Expression Graph → IR: Convert to IR, Sub-divide IR, Constant folding, CSE, Move Bounds to Leaves
IR → LIR: Combine Memory transforms, Combine Operations, Evaluation Ordering, Buffer Counting, Calculate Register Size, Need Double buffering?, Convert To LIR
LIR → VENICE Code: Initialize Memory, Allocate Memory, Transfer Data To Scratchpad, Set VL, Write Vector Instructions, Transfer Result To Host
70
Outline: Motivation Background Implementation Results Conclusion
71
Speedups
[Chart; the only surviving label is a "370x" callout]

Compiler vs. Human
        fir     2Dfir   life    imgblend  median  motest
V1      1.04x   0.97x   1.01x   1.00x     0.99x   0.81x
V4      1.01x   1.12x   1.10x   1.02x     1.07x   1.01x
V16     1.09x   1.12x   1.38x   0.90x     0.96x   1.01x
V64     1.30x   1.42x   2.24x   0.92x     0.81x   1.04x
74
Compare to Intel CPU

Benchmark Runtime (ms)
CPU                     fir    2Dfir  life   imgblend  median  motest
Xeon E5540 (2.53GHz)    0.07   0.44   0.53   0.12      9.97    0.24
VENICE (V64, 100MHz)    0.07   0.29   0.23   0.33      3.11    0.22
Speedup                 1.0x   1.5x   2.3x   0.4x      3.2x    1.1x

Compile Time
                        fir    2Dfir  life   imgblend  median  motest  geomean
Compile time (ms)       4.74   5.05   4.49   4.44      92.72   24.27   10.12
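The Speedup row is the Xeon runtime divided by the VENICE runtime for each benchmark. The short sketch below recomputes it from the runtimes on this slide; the function names are hypothetical, and the arrays simply restate the two runtime rows.

```c
#include <stddef.h>

/* Speedup of the accelerated run relative to the baseline run. */
double speedup(double baseline_ms, double accelerated_ms)
{
    return baseline_ms / accelerated_ms;
}

/* Counts the benchmarks on which VENICE V64 is at least as fast as the Xeon,
   using the runtimes from the slide (fir, 2Dfir, life, imgblend, median, motest). */
int count_venice_wins(void)
{
    const double xeon_ms[]   = {0.07, 0.44, 0.53, 0.12, 9.97, 0.24};
    const double venice_ms[] = {0.07, 0.29, 0.23, 0.33, 3.11, 0.22};
    int wins = 0;
    for (size_t i = 0; i < 6; i++)
        if (speedup(xeon_ms[i], venice_ms[i]) >= 1.0)
            wins++;
    return wins;
}
```

With these numbers, imgblend is the only benchmark where the speedup falls below 1.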
75
Using smaller data types

             fir     2Dfir     life    imgblend  median  motest  geomean
Data type    byte    halfword  byte    halfword  byte    word

Speedup using bytes
V1           3.93x   -         4.36x   -         4.07x   -       4.12x
V4           3.54x   -         3.83x   -         4.03x   -       3.79x
V16          2.90x   -         3.22x   -         4.00x   -       3.34x

Speedup using halfwords
V1           -       1.96x     -       1.54x     -       -       1.74x
V4           -       2.00x     -       1.46x     -       -       1.71x
V16          -       1.97x     -       1.83x     -       -       1.90x
76
Outline: Motivation Background Implementation Results Conclusion
77
Conclusions:
The compiler greatly improves the programming and debugging experience for VENICE.
The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.
The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable 3rd-party back-end support to provide a sustainable solution for future emerging hardware.
78
Thank you!
79
Optimal VL for V16 (Look-up Table)
Input Data Sizes (words): 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576
Optimal VL by Instruction Count, listed in order of increasing input size (cells are merged on the slide, so the exact column boundaries are not recoverable):
1:     4096, 8192
2:     4096, 8192
3:     2048, 4096, 8192
4-16:  1024, 2048, 4096, 8192
82
“Virtual Vector Register File” Number of vector registers = 4 Vector register size = 1024
83
Combine Operators for Motion Estimation
              V4      V16     V64
Before (ms)   2.03    0.55    0.30
After (ms)    1.36    0.37    0.21
Speedup       1.49x   1.48x   1.43x
84
Performance Degradation on median

Human-written compare-and-swap:
    int *v_min = v_input1;
    int *v_max = v_input2;
    vector ( VVW, VOR, v_tmp, v_min, v_min );
    vector ( VVW, VSUB, v_sub, v_max, v_min );
    vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

Compiler-generated compare-and-swap:
    vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
    vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
    vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
    vector ( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );
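The two sequences can be compared on the host with a scalar emulation. This sketch rests on assumed instruction semantics that the slide does not spell out: VSUB computes first source minus second source, and VCMV_LTZ / VCMV_GTEZ copy the first source into the destination where the second source is negative / non-negative. It also assumes the compiler-generated sequence writes v_min in its first two conditional moves and v_max in its last two. All helper names are hypothetical.

```c
#include <stddef.h>

enum { VL = 4 };

/* Assumed semantics of the three opcodes, applied element-wise. */
static void vsub(int *d, const int *a, const int *b)
{
    for (size_t i = 0; i < VL; i++) d[i] = a[i] - b[i];
}
static void vcmv_ltz(int *d, const int *a, const int *b)
{
    for (size_t i = 0; i < VL; i++) if (b[i] < 0) d[i] = a[i];
}
static void vcmv_gtez(int *d, const int *a, const int *b)
{
    for (size_t i = 0; i < VL; i++) if (b[i] >= 0) d[i] = a[i];
}

/* Human-written form: 4 vector operations, swapping the inputs in place. */
void cas_in_place(int *v_min, int *v_max)
{
    int v_tmp[VL], v_sub[VL];
    for (size_t i = 0; i < VL; i++) v_tmp[i] = v_min[i]; /* VOR of v_min with itself */
    vsub(v_sub, v_max, v_min);
    vcmv_ltz(v_min, v_max, v_sub);
    vcmv_ltz(v_max, v_tmp, v_sub);
}

/* Compiler-generated form: 5 vector operations, writing separate outputs. */
void cas_out_of_place(const int *in1, const int *in2, int *v_min, int *v_max)
{
    int v_sub[VL];
    vsub(v_sub, in1, in2);
    vcmv_gtez(v_min, in2, v_sub);
    vcmv_ltz(v_min, in1, v_sub);
    vcmv_gtez(v_max, in1, v_sub);
    vcmv_ltz(v_max, in2, v_sub);
}

/* Returns 1 if both forms produce the same min and max vectors on a sample. */
int compare_and_swap_agree(void)
{
    const int in1[VL] = {5, -3, 7, 0};
    const int in2[VL] = {2, 4, 7, -1};
    int a[VL], b[VL], mn[VL], mx[VL];
    for (size_t i = 0; i < VL; i++) { a[i] = in1[i]; b[i] = in2[i]; }
    cas_in_place(a, b);
    cas_out_of_place(in1, in2, mn, mx);
    for (size_t i = 0; i < VL; i++)
        if (a[i] != mn[i] || b[i] != mx[i])
            return 0;
    return 1;
}
```

Both forms give the same result, but the out-of-place form needs one more vector instruction per compare-and-swap, which is consistent with the slowdown on median noted in the slide title.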
85
Double Buffering