Zhiduo Liu, Supervisor: Guy Lemieux, Sep. 28th, 2012: Accelerator Compiler for the VENICE Vector Processor

1 Zhiduo Liu, Supervisor: Guy Lemieux, Sep. 28th, 2012: Accelerator Compiler for the VENICE Vector Processor

2 Outline: Motivation, Background, Implementation, Results, Conclusion

3 Outline: Motivation, Background, Implementation, Results, Conclusion

4 Motivation. Hardware: Multi-core, GPU, FPGA, Many-core, Computer clusters, Vector Processor, … Languages and programming models: CUDA, SystemVerilog, VHDL, OpenCL, Erlang, OpenMP, MPI, Pthread, OpenHMPP, Verilog, Bluespec, Cilk, X10, OpenGL, Sh, aJava, ParC, Fortress, Chapel, StreamIt, Sponge, SSE

5 Motivation. Hardware: Multi-core, GPU, FPGA, Many-core, Computer clusters, Vector Processor, … Languages and programming models: CUDA, SystemVerilog, VHDL, OpenCL, Erlang, OpenMP, MPI, Pthread, OpenHMPP, Verilog, Bluespec, Cilk, X10, OpenGL, Sh, aJava, ParC, Fortress, Chapel, StreamIt, Sponge, SSE. Simplification

6 Motivation … Single Description

7 Contributions
    - The compiler serves as a new back-end of a single-description multiple-device language.
    - The compiler makes VENICE easier to program and debug.
    - The compiler provides auto-parallelization and optimization.
    [1] Z. Liu, A. Severance, S. Singh and G. Lemieux, "Accelerator Compiler for the VENICE Vector Processor," in FPGA 2012.
    [2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant, G. Lemieux, "VEGAS: soft vector processor with scratchpad memory," in FPGA 2011.

8 Outline: Motivation, Background, Implementation, Results, Conclusion

9 Complicated. [Figure: block diagram; stage labels: ALIGN, WR, RD, ALIGN, EX1, EX2, ACCUM]

10 Program in VENICE assembly

    #include "vector.h"

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};
        const int data_len = sizeof ( A );

        // Allocate vectors in scratchpad
        int *va = ( int *) vector_malloc ( data_len );

        // Move data from main memory to scratchpad
        vector_dma_to_vector ( va, A, data_len );
        // Wait for DMA transaction to be completed
        vector_wait_for_dma ();

        // Setup for vector instructions
        vector_set_vl ( data_len / sizeof (int) );
        // Perform vector computations
        vector ( SVW, VADD, va, 42, va );
        // Wait for vector operations to be completed
        vector_instr_sync ();

        // Move data from scratchpad to main memory
        vector_dma_to_host ( A, va, data_len );
        // Wait for DMA transaction to be completed
        vector_wait_for_dma ();

        // Deallocate memory from scratchpad
        vector_free ();
    }

11 Program in Accelerator

    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};

        // Create a Target
        // (alternatives shown on the slide:
        //    Target *tgt = CreateDX9Target();
        //    Target *tgt = CreateMulticoreTarget(); )
        Target *tgt = CreateVectorTarget();

        // Create Parallel Array objects
        IPA b = IPA( A, sizeof (A)/sizeof (int));

        // Write expressions
        IPA c = b + 42;

        // Call ToArray to evaluate expressions
        tgt->ToArray( c, A, sizeof (A)/sizeof (int));

        // Delete Target object
        tgt->Delete();
    }

12 [Flowcharts]
    Assembly Programming. Flowchart boxes: Write Assembly; Download to board; Compile with Gcc; Get Result. Loop back on: Doesn't compile? Result incorrect?
    Accelerator Programming. Flowchart boxes: Write in Accelerator; Download to board; Compile with Microsoft Visual Studio; Get Result; Compile with Gcc. Loop back on: Doesn't compile? Or result incorrect?

13 Assembly Programming:
    1. Hard to program
    2. Long debug cycle
    3. Not portable
    4. Manual: not always optimal or correct (WYSIWYG)
   Accelerator Programming:
    1. Easy to program
    2. Easy to debug
    3. Can also target other devices
    4. Automated compiler optimizations

14 Outline: Motivation, Background, Implementation, Results, Conclusion

15

16

17 Accelerator source and its expression graph

    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        Target *tgtVector = CreateVectorTarget();
        const int length = 8192;
        int a[] = {1,2,3,4, …, 8192};
        int d[length];
        IPA A = IPA( a, length);
        IPA B = Evaluate( Rotate(A, [1]) + 1 );
        IPA C = Evaluate( Abs( A + 2 ));
        IPA D = ( A + B ) * C ;
        tgtVector->ToArray( D, d, length * sizeof(int));
        tgtVector->Delete();
    }

    [Figure: expression graph for D = (A + B) × C, where B = Rot(A) + 1 and C = Abs(A + 2)]
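To make the idea of a deferred expression graph concrete, here is a minimal C++ sketch. It is not Accelerator's implementation: `Node`, `Expr`, `input`, `rotate`, `abs_of`, and `evaluate` are hypothetical names, and the rotate direction (element i reads element i+1) is an assumption chosen to match the `va+1` operand in the generated VENICE code later in the deck. Operators only build nodes; nothing is computed until `evaluate` walks the graph.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node kinds for illustration; these are not Accelerator's classes.
enum class Op { Input, Const, Add, Mul, Abs, Rotate };

struct Node {
    Op op;
    std::vector<int> data;            // payload for Input nodes
    int value;                        // constant value, or rotate offset
    std::shared_ptr<Node> lhs, rhs;   // operands (rhs unused by unary ops)
};
using Expr = std::shared_ptr<Node>;

Expr make(Op op, Expr l = nullptr, Expr r = nullptr, int value = 0) {
    auto n = std::make_shared<Node>();
    n->op = op;
    n->value = value;
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

// Building an expression only records nodes; no arithmetic happens here.
Expr input(std::vector<int> v) { Expr n = make(Op::Input); n->data = std::move(v); return n; }
Expr operator+(Expr a, Expr b) { return make(Op::Add, std::move(a), std::move(b)); }
Expr operator+(Expr a, int c)  { return make(Op::Add, std::move(a), make(Op::Const, nullptr, nullptr, c)); }
Expr operator*(Expr a, Expr b) { return make(Op::Mul, std::move(a), std::move(b)); }
Expr abs_of(Expr a)            { return make(Op::Abs, std::move(a)); }
Expr rotate(Expr a, int k)     { return make(Op::Rotate, std::move(a), nullptr, k); }

// Walk the graph and compute the result for arrays of length n.
// Shared sub-expressions are recomputed; a real compiler would apply CSE.
std::vector<int> evaluate(const Expr& e, std::size_t n) {
    std::vector<int> out(n);
    switch (e->op) {
    case Op::Input:
        assert(e->data.size() == n);
        return e->data;
    case Op::Const:
        return std::vector<int>(n, e->value);
    case Op::Rotate: {
        const std::vector<int> in = evaluate(e->lhs, n);
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[(i + static_cast<std::size_t>(e->value)) % n];  // assumed direction
        return out;
    }
    case Op::Abs: {
        const std::vector<int> in = evaluate(e->lhs, n);
        for (std::size_t i = 0; i < n; ++i) out[i] = std::abs(in[i]);
        return out;
    }
    case Op::Add:
    case Op::Mul: {
        const std::vector<int> l = evaluate(e->lhs, n);
        const std::vector<int> r = evaluate(e->rhs, n);
        for (std::size_t i = 0; i < n; ++i)
            out[i] = (e->op == Op::Add) ? l[i] + r[i] : l[i] * r[i];
        return out;
    }
    }
    return out;
}
```

With `A = input({1, 2, 3, 4})`, the slide's expressions become `B = rotate(A, 1) + 1`, `C = abs_of(A + 2)`, and `D = (A + B) * C`; calling `evaluate(D, 4)` is the point at which work is actually done, which is the role `ToArray` plays on the slide.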

18 [Figure: expression graph for D; node labels: ×, +, Abs, Rot, A, 1, 2]

19 [Figure: the same graph with the Rot node replaced by a rotated leaf, A (rot)]

20 [Figure: the graph from slide 19 next to its split form: separate sub-graphs producing B (from A (rot) and 1), C (from Abs, A, and 2), and D (from A, B, and C)]

21

22

23 Combine Operations. [Figure: the B, C, and D sub-graphs; C is still computed as a + node followed by an Abs node]

24 Combine Operations. [Figure: the same sub-graphs after combining; C is computed by a single |+| node on A and 2]
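The "Combine Operations" step on these two slides can be pictured as a small tree rewrite. The following C++ sketch is an illustration under assumptions, not the compiler's code: the `Node` layout and the string opcodes are hypothetical, and it only handles the one pattern the slides show, an `abs` applied directly to a `+`, which becomes a single fused `|+|` node. A real pass would also need to check that the intermediate sum has no other consumer.

```cpp
#include <memory>
#include <string>
#include <utility>

struct Node;
using NodePtr = std::shared_ptr<Node>;

// Hypothetical tree node: op is "leaf", "+", "abs", or the fused "|+|".
struct Node {
    std::string op;
    std::string name;   // label for leaves, empty otherwise
    NodePtr lhs, rhs;
};

NodePtr leaf(const std::string& name) {
    return std::make_shared<Node>(Node{"leaf", name, nullptr, nullptr});
}

NodePtr node(const std::string& op, NodePtr l, NodePtr r = nullptr) {
    return std::make_shared<Node>(Node{op, "", std::move(l), std::move(r)});
}

// Rebuild the tree bottom-up, replacing abs(+(x, y)) with |+|(x, y).
NodePtr combine(const NodePtr& n) {
    if (!n || n->op == "leaf") return n;
    NodePtr l = combine(n->lhs);
    NodePtr r = combine(n->rhs);
    if (n->op == "abs" && l && l->op == "+")
        return node("|+|", l->lhs, l->rhs);
    return node(n->op, l, r);
}

// Number of operation nodes, i.e. vector instructions in this simple model.
int count_ops(const NodePtr& n) {
    if (!n || n->op == "leaf") return 0;
    return 1 + count_ops(n->lhs) + count_ops(n->rhs);
}
```

For the sub-graph C on the slide, `combine(node("abs", node("+", leaf("A"), leaf("2"))))` yields one `|+|` node, so the instruction count in this model drops from two to one.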

25 Scratchpad Memory: “Virtual Vector Register File”

26

27 Number of vector registers = ? Vector register size = ?

28 “Virtual Vector Register File” Number of vector registers = ? Vector register size = ?

29 Evaluation Order. [Figure: the B, C, and D sub-graphs with per-node labels and evaluation-order numbers 1 to 5]

30 Count number of virtual vector registers. [Figure: the B, C, and D sub-graphs, repeated on slides 30 to 58 while the tables below are filled in step by step]

31 Count number of virtual vector registers.

32 Count number of virtual vector registers. Ref Count: A=3.

33 Count number of virtual vector registers. Ref Count: A=3.

34 Count number of virtual vector registers. Ref Count: A=3, B=1.

35 Count number of virtual vector registers. Ref Count: A=3, B=1.

36 Count number of virtual vector registers. Ref Count: A=3, B=1, C=1.

37 Count number of virtual vector registers. Ref Count: A=3, B=1, C=1. Active: A=Yes, B=No, C=(blank).

38 Count number of virtual vector registers. Ref Count: A=3, B=1, C=1. Active: A=Yes, B=No, C=(blank). numLoads = 1

39 Count number of virtual vector registers. Ref Count: A=3, B=1, C=1. Active: A=Yes, B=No, C=(blank). numLoads = 1

40 Count number of virtual vector registers. Ref Count: A=2, B=1, C=1. Active: A=Yes, B=No, C=(blank). numLoads = 1

41 Count number of virtual vector registers. Ref Count: A=2, B=1, C=1. Active: A=Yes, B=No, C=(blank). numLoads = 1

42 Count number of virtual vector registers. Ref Count: A=2, B=1, C=1. Active: A=Yes, B=No, C=(blank). numLoads = 1

43 Count number of virtual vector registers. Ref Count: A=2, B=1, C=1. Active: A=Yes, B=No, C=(blank). numLoads = 1, numTemps = 1

44 Count number of virtual vector registers. Ref Count: A=2, B=1, C=1. Active: A=Yes, B=No, C=(blank). numLoads = 1, numTemps = 1, numTotal = 2, maxTotal = 2

45 Count number of virtual vector registers. Ref Count: A=2, B=1, C=1. Active: A=Yes, B=(blank), C=No. numLoads = 2, numTemps = 1

46 Count number of virtual vector registers. Ref Count: A=2, B=1, C=1. Active: A=Yes, B=(blank), C=No. numLoads = 2, numTemps = 0

47 Count number of virtual vector registers. Ref Count: A=1, B=1, C=1. Active: A=Yes, B=(blank), C=No. numLoads = 2, numTemps = 0

48 Count number of virtual vector registers. Ref Count: A=1, B=1, C=1. Active: A=Yes, B=(blank), C=No. numLoads = 2, numTemps = 0

49 Count number of virtual vector registers. Ref Count: A=1, B=1, C=1. Active: A=Yes, B=(blank), C=No. numLoads = 2, numTemps = 1

50 Count number of virtual vector registers. Ref Count: A=1, B=1, C=1. Active: A=Yes, B=(blank), C=No. numLoads = 2, numTemps = 1, numTotal = 3, maxTotal = 3

51 Count number of virtual vector registers. Ref Count: A=1, B=1, C=1. Active: A=Yes, B=(blank), C=(blank). numLoads = 3, numTemps = 0

52 Count number of virtual vector registers. Ref Count: A=0, B=1, C=1. Active: A=No, B=Yes, C=(blank). numLoads = 3, numTemps = 0

53 Count number of virtual vector registers. Ref Count: A=0, B=0, C=1. Active: A=No, B=(blank), C=Yes. numLoads = 3, numTemps = 0

54 Count number of virtual vector registers. Ref Count: A=0, B=0, C=1. Active: A=No, B=(blank), C=Yes. numLoads = 3, numTemps = 0

55 Count number of virtual vector registers. Ref Count: A=0, B=0, C=0. Active: A=No, B=(blank), C=(blank). numLoads = 3, numTemps = 0

56 Count number of virtual vector registers. Ref Count: A=0, B=0, C=0. Active: A=No, B=(blank), C=(blank). numLoads = 3, numTemps = 0

57 Count number of virtual vector registers. Ref Count: A=0, B=0, C=0. Active: A=No, B=(blank), C=(blank). numLoads = 3, numTemps = 0, numTotal = 3, maxTotal = 3

58 Count number of virtual vector registers. Ref Count: A=0, B=0, C=0. Active: A=No, B=(blank), C=(blank). numLoads = 0, numTemps = 0, numTotal = 0, maxTotal = 3
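The walkthrough above ends with maxTotal = 3. As a rough illustration of how such a count can be obtained, the C++ sketch below uses a simplified liveness model; it is not the compiler's actual procedure, and it does not track numLoads and numTemps separately. `Instr` and `count_virtual_registers` are hypothetical names. The assumption is that a value occupies a register from its first use until its reference count reaches zero, and that a result needs a fresh register unless one of its operands dies in the same instruction.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical LIR-like instruction: one result computed from named operands.
struct Instr {
    std::string result;
    std::set<std::string> operands;
};

// Simplified liveness count: returns the peak number of registers needed
// when the instructions are executed in the given order.
int count_virtual_registers(const std::vector<Instr>& program) {
    // Reference count = number of instructions that read each value.
    std::map<std::string, int> refs;
    for (const Instr& in : program)
        for (const std::string& op : in.operands) ++refs[op];

    std::set<std::string> live;   // values currently holding a register
    int max_total = 0;
    for (const Instr& in : program) {
        bool operand_dies = false;
        for (const std::string& op : in.operands) {
            live.insert(op);                          // load on first use
            if (--refs[op] == 0) operand_dies = true; // last reader
        }
        // The result reuses a dying operand's register if there is one.
        const int total = static_cast<int>(live.size()) + (operand_dies ? 0 : 1);
        max_total = std::max(max_total, total);

        for (const std::string& op : in.operands)
            if (refs[op] == 0) live.erase(op);
        live.insert(in.result);
    }
    return max_total;
}
```

For the slide's three sub-graphs, B from {A}, C from {A}, and D from {A, B, C}, this model peaks at 2 registers after B, 3 after C, and stays at 3 for D, matching the deck's final maxTotal of 3.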

59 “Virtual Vector Register File” Number of vector registers = 3 Vector register size = ?

60 “Virtual Vector Register File” Number of vector registers = 3 Vector register size = Capacity/3
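The register-size rule on this slide is plain division of the scratchpad among the virtual registers. A small C++ sketch of that arithmetic follows. The function names, the 16 KiB capacity used in the example, and the optional rounding down to a power of two are assumptions for illustration; the rounding is only suggested by the power-of-two VL values in the look-up table on slide 79.

```cpp
#include <cassert>
#include <cstddef>

// Largest power of two that is <= x (x must be positive).
std::size_t round_down_pow2(std::size_t x) {
    assert(x > 0);
    std::size_t p = 1;
    while (p * 2 <= x) p *= 2;
    return p;
}

// Elements per virtual vector register when the scratchpad is split evenly.
// capacity_bytes is a hypothetical parameter; the deck does not state it.
std::size_t register_elements(std::size_t capacity_bytes,
                              std::size_t num_registers,
                              std::size_t elem_bytes,
                              bool power_of_two) {
    assert(num_registers > 0 && elem_bytes > 0);
    const std::size_t n = capacity_bytes / (num_registers * elem_bytes);
    return (power_of_two && n > 0) ? round_down_pow2(n) : n;
}
```

As an example under these assumptions, a 16384-byte scratchpad split into 3 registers of 4-byte words gives 1365 elements each, or 1024 after rounding; split into 4 registers it gives exactly 1024.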

61 Convert to LIR. [Figure: the B, C, and D sub-graphs with evaluation-order numbers] LIR, in postfix form: Result:B = A(rot) 1 +; Result:C = A 2 |+|; Result:D = A B + C ×

62 Code Generation. LIR to translate: Result:B = A(rot) 1 +; Result:C = A 2 |+|; Result:D = A B + C ×

63 Code Generation. [Figure: input array 1, 2, 3, 4, ..., 8192]

64 Code Generation. [Figure: input array with its first element appended for the rotate: 1, 2, 3, 4, ..., 8192, 1]

65 Code Generation. After the setup code and loop header, Result:B is emitted: vector ( SVW, VADD, vb, 1, va+1 );

66 Code Generation. Result:C is emitted: vector_abs ( SVW, VADD, vc, 2, va );

67 Code Generation. The first half of Result:D is emitted: vector ( VVW, VADD, vb, vb, va );

68 Code Generation. Complete generated program:

    #include "vector.h"

    int main(){
        int A[8192] = {1,2,3,4, … 8192};
        int *va = ( int *) vector_malloc ( 32772 );
        int *vb = ( int *) vector_malloc ( 32768 );
        int *vc = ( int *) vector_malloc ( 32768 );
        int *vd = ( int *) vector_malloc ( 32772 );
        int *vtemp = va;
        int i;   /* declared outside the loop because it is used after it */

        vector_dma_to_vector ( va, A, 32772 );
        for(i=0; i<4; i++){
            vector_set_vl ( 1024 );
            /* swap the two input buffers (double buffering) */
            vtemp = va; va = vd; vd = vtemp;
            vector_wait_for_dma ();
            /* start fetching the next chunk */
            if(i<3) vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );
            /* write back the previous chunk's result */
            if(i>0){
                vector_instr_sync ();
                vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
            }
            vector ( SVW, VADD, vb, 1, va+1 );     /* B = A(rot) + 1 */
            vector_abs ( SVW, VADD, vc, 2, va );   /* C = |A + 2| */
            vector ( VVW, VADD, vb, vb, va );      /* A + B */
            /* D = (A + B) * C; the slide shows VADD here, although a multiply is presumably intended */
            vector ( VVW, VADD, vc, vc, vb );
        }
        vector_instr_sync ();
        vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
        vector_wait_for_dma ();
        vector_free ();
    }
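The generated program processes the array in chunks and alternates between two input buffers. The C++ sketch below imitates that structure on the host so the chunking logic can be checked without VENICE hardware. It is an illustration under assumptions: plain copies stand in for DMA, the function names are hypothetical, and the computation is the deck's example as written in the Accelerator source (D = (A + B) × C with B = Rot(A) + 1 and C = |A + 2|). The input is padded with its first element, as in the slide's array figure, so the rotated read never wraps inside a chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Straightforward whole-array version, used as the reference result.
std::vector<int> reference(const std::vector<int>& a) {
    const std::size_t n = a.size();
    std::vector<int> d(n);
    for (std::size_t i = 0; i < n; ++i) {
        const int b = a[(i + 1) % n] + 1;    // B = Rot(A) + 1
        const int c = std::abs(a[i] + 2);    // C = |A + 2|
        d[i] = (a[i] + b) * c;               // D = (A + B) * C
    }
    return d;
}

// Chunked version with two alternating input buffers of vl + 1 elements.
std::vector<int> double_buffered(const std::vector<int>& a, std::size_t vl) {
    assert(!a.empty() && vl > 0);
    const std::size_t n = a.size();

    std::vector<int> padded(a);
    padded.push_back(a[0]);                  // extra element for the rotate

    std::vector<int> d(n);
    std::vector<int> buf[2] = {std::vector<int>(vl + 1), std::vector<int>(vl + 1)};

    // Stand-in for a DMA transfer into one of the two buffers.
    auto load = [&](int which, std::size_t start) {
        const std::size_t len = std::min(vl, n - start);
        std::copy_n(padded.data() + start, len + 1, buf[which].data());
    };

    int cur = 0;
    load(cur, 0);                            // prime the first buffer
    for (std::size_t start = 0; start < n; start += vl) {
        const std::size_t len = std::min(vl, n - start);
        // On real hardware this transfer would overlap with the compute below.
        if (start + vl < n) load(1 - cur, start + vl);

        const std::vector<int>& va = buf[cur];
        for (std::size_t i = 0; i < len; ++i) {
            const int b = va[i + 1] + 1;
            const int c = std::abs(va[i] + 2);
            d[start + i] = (va[i] + b) * c;
        }
        cur = 1 - cur;                       // swap buffers
    }
    return d;
}
```

Because the next chunk is loaded into the buffer that is not being read, the compute loop never sees partially transferred data; that separation is what lets the transfer and the vector instructions run concurrently on the device.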

69 [Figure: compiler flow]
    Expression Graph
    Convert to IR: Sub-divide; Constant folding; CSE; Move Bounds to Leaves
    IR
    Convert To LIR: Combine Memory transforms; Combine Operations; Evaluation Ordering; Buffer Counting; Calculate Register Size; Need Double buffering?
    LIR
    Code generation: Initialize Memory; Transfer Data To Scratchpad; Set VL; Write Vector Instructions; Transfer Result To Host; Allocate Memory
    VENICE Code

70 Outline: Motivation, Background, Implementation, Results, Conclusion

71 370x Speedups. Compiler vs. Human:

           fir     2Dfir   life    imgblend  median  motest
    V1     1.04x   0.97x   1.01x   1.00x     0.99x   0.81x
    V4     1.01x   1.12x   1.10x   1.02x     1.07x   1.01x
    V16    1.09x   1.12x   1.38x   0.90x     0.96x   1.01x
    V64    1.30x   1.42x   2.24x   0.92x     0.81x   1.04x

72

73

74 Compare to Intel CPU. Benchmark runtime (ms):

    CPU                      fir    2Dfir  life   imgblend  median  motest
    Xeon E5540 (2.53GHz)     0.07   0.44   0.53   0.12      9.97    0.24
    VENICE (V64, 100MHz)     0.07   0.29   0.23   0.33      3.11    0.22
    Speedup                  1.0x   1.5x   2.3x   0.4x      3.2x    1.1x

   Compile Time:

                             fir    2Dfir  life   imgblend  median  motest  geomean
    Compile time (ms)        4.74   5.05   4.49   4.44      92.72   24.27   10.12
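The speedup row and the geomean column in these tables follow from two short formulas: speedup is baseline time divided by new time, and the geometric mean is the exponential of the average logarithm. A minimal C++ sketch, with hypothetical function names, that can be used to re-derive the slide's figures:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Speedup of a new runtime relative to a baseline runtime (same units).
double speedup(double baseline_ms, double new_ms) {
    assert(new_ms > 0.0);
    return baseline_ms / new_ms;
}

// Geometric mean of positive values, computed via logarithms
// to avoid overflow in the running product.
double geomean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double log_sum = 0.0;
    for (double x : xs) {
        assert(x > 0.0);
        log_sum += std::log(x);
    }
    return std::exp(log_sum / static_cast<double>(xs.size()));
}
```

For instance, `speedup(9.97, 3.11)` gives about 3.2 for median, and `geomean` over the six compile times 4.74, 5.05, 4.49, 4.44, 92.72, and 24.27 gives about 10.12, in line with the table.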

75 Using smaller data types

    Benchmark    fir     2Dfir     life    imgblend  median  motest  geomean
    Data type    byte    halfword  byte    halfword  byte    word

    Speedup using bytes
    V1           3.93x             4.36x             4.07x           4.12x
    V4           3.54x             3.83x             4.03x           3.79x
    V16          2.90x             3.22x             4.00x           3.34x

    Speedup using halfwords
    V1                   1.96x             1.54x                     1.74x
    V4                   2.00x             1.46x                     1.71x
    V16                  1.97x             1.83x                     1.90x

76 Outline: Motivation, Background, Implementation, Results, Conclusion

77 Conclusions:
    - The compiler greatly improves the programming and debugging experience for VENICE.
    - The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.
    - The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable 3rd-party back-end support to provide a sustainable solution for future emerging hardware.

78 Thank you!

79 Optimal VL for V16 (look-up table)
    Columns: input data sizes (words): 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576
    Rows: instruction count 1 to 16
    Optimal VL values per row, in order of increasing input size (cells are merged in the original; column spans are not preserved):
      Instruction count 1 and 2:  4096, 8192
      Instruction count 3:        2048, 4096, 8192
      Instruction count 4 to 16:  1024, 2048, 4096, 8192

80

81

82 “Virtual Vector Register File” Number of vector registers = 4 Vector register size = 1024

83 Combine Operators for Motion Estimation

                   V4      V16     V64
    Before (ms)    2.03    0.55    0.30
    After (ms)     1.36    0.37    0.21
    Speedup        1.49x   1.48x   1.43x

84 Performance Degradation on median

    Human-written compare-and-swap:

    int *v_min = v_input1;
    int *v_max = v_input2;
    vector ( VVW, VOR, v_tmp, v_min, v_min );
    vector ( VVW, VSUB, v_sub, v_max, v_min );
    vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

    Compiler-generated compare-and-swap:

    vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
    vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
    vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
    vector ( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );
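To see why the two compare-and-swap sequences are equivalent while differing in instruction count (four versus five), the C++ sketch below models them on the host. The operand semantics are assumptions read off the slide rather than taken from VENICE documentation: `vsub(dst, a, b)` computes a minus b, and the conditional moves copy `src` into `dst` only where the condition vector is negative (`LTZ`) or non-negative (`GTEZ`). The fourth compiler-generated move is taken to write `v_max`, which is what a min/max pair requires.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Assumed semantics: dst[i] = a[i] - b[i].
void vsub(Vec& dst, const Vec& a, const Vec& b) {
    assert(a.size() == b.size() && dst.size() == a.size());
    for (std::size_t i = 0; i < a.size(); ++i) dst[i] = a[i] - b[i];
}

// Assumed semantics: dst[i] = src[i] where cond[i] < 0.
void vcmv_ltz(Vec& dst, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        if (cond[i] < 0) dst[i] = src[i];
}

// Assumed semantics: dst[i] = src[i] where cond[i] >= 0.
void vcmv_gtez(Vec& dst, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        if (cond[i] >= 0) dst[i] = src[i];
}

// Human-written variant: works in place, so v_min and v_max start out
// holding the two inputs. One copy, one subtract, two conditional moves.
void cas_human(Vec& v_min, Vec& v_max) {
    Vec v_tmp = v_min;                  // VOR v_tmp, v_min, v_min (a copy)
    Vec v_sub(v_min.size());
    vsub(v_sub, v_max, v_min);
    vcmv_ltz(v_min, v_max, v_sub);      // where max < min, take max
    vcmv_ltz(v_max, v_tmp, v_sub);      // where max < min, take old min
}

// Compiler-generated variant: writes separate outputs, so every element of
// v_min and v_max must be set. One subtract, four conditional moves.
void cas_compiler(const Vec& in1, const Vec& in2, Vec& v_min, Vec& v_max) {
    Vec v_sub(in1.size());
    vsub(v_sub, in1, in2);
    vcmv_gtez(v_min, in2, v_sub);
    vcmv_ltz(v_min, in1, v_sub);
    vcmv_gtez(v_max, in1, v_sub);
    vcmv_ltz(v_max, in2, v_sub);
}
```

The in-place version saves a conditional move because elements already in the right order need no write at all; in this model that is the whole difference between the two sequences.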

85 Double Buffering

86

