1
Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
Manjunath Kudlur, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan, Electrical Engineering and Computer Science
2
Automated C to Gates Solution
SoC design
  – 10-100 Gops, 200 mW power budget
  – Low-level tools ineffective
Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market
[Figure: app.c synthesized into loop accelerators (LA)]
3
Streaming Applications
Data “streaming” through kernels
Kernels are tight loops
  – FIR, Viterbi, DCT
Coarse-grain dataflow between kernels
  – Sub-blocks of images, network packets
[Figure: H.264 Encoder (Image to Coded Image) with kernels Motion Estimator, Transform, Quantizer, Coder, Inverse Quantizer, Inverse Transform, Motion Predictor]
[Figure: W-CDMA Transmitter (Data in to Data out) with kernels CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter]
4
Software Overview
[Figure: four-step tool flow with stages labeled Whole Application, Frontend Analyses, Loop Graph, System Level Synthesis, and Accelerator Pipeline (Multifunction Accelerator, SRAM Buffers)]
5
Input Specification
Sequential C program
Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit

    row_trans(char inp[8][8], char out[8][8]) {
        for (i = 0; i < 8; i++) {
            for (j = 0; j < 8; j++) {
                ... = inp[i][j];
                out[i][j] = ...;
            }
        }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels

    dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: dataflow inp, row_trans, tmp1, col_trans, tmp2, zigzag_trans, out]
6
Performance Specification
High performance DCT
  – Process one 1024x768 image every 2 ms
  – Given 400 MHz clock
    One image every 800000 cycles
    One block every 64 cycles
Low performance DCT
  – Process one 1024x768 image every 4 ms
  – One block every 128 cycles
Performance goal: task throughput in number of cycles between tasks
[Figure: input image (1024 x 768) split into 8 x 8 blocks; each block is one task flowing through inp, row_trans, tmp1, col_trans, tmp2, zigzag_trans, out to produce output coeffs]
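The cycle budgets on this slide follow from simple arithmetic, sketched below in C. The function names are hypothetical and not part of Streamroller. Note that the exact quotients are 65 and 130 cycles per block; the slide quotes 64 and 128, which are slightly tighter and presumably rounded down to convenient values (an assumption, since the slide does not say).

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available per image. Clock in MHz times period in microseconds
   gives cycles directly, so no floating point is needed. */
uint64_t cycles_per_image(uint64_t clock_mhz, uint64_t period_us) {
    return clock_mhz * period_us;
}

/* Number of block_dim x block_dim blocks (tasks) in a width x height image. */
uint64_t blocks_per_image(uint64_t width, uint64_t height, uint64_t block_dim) {
    assert(block_dim > 0);
    assert(width % block_dim == 0 && height % block_dim == 0);
    return (width / block_dim) * (height / block_dim);
}

/* Cycle budget per task (one block), rounded down. */
uint64_t cycles_per_task(uint64_t clock_mhz, uint64_t period_us,
                         uint64_t width, uint64_t height, uint64_t block_dim) {
    return cycles_per_image(clock_mhz, period_us) /
           blocks_per_image(width, height, block_dim);
}
```

For the high performance case, `cycles_per_task(400, 2000, 1024, 768, 8)` divides 800000 cycles among 12288 blocks.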
7
Building Blocks
Multifunction Loop Accelerator [CODES/ISSS ’06]
SRAM buffers
[Figure: Kernel 1 to Kernel 4 executed on multifunction loop accelerators, communicating through SRAM buffers tmp1, tmp2, tmp3]
8
System Schema Overview
[Figure: Kernels 1-5 mapped onto LA 1, LA 2, and LA 3; a timeline of overlapping tasks, with the interval between successive tasks labeled "Task throughput"]
9
Cost Components
Cost of loop accelerator data path
  – Cost of FUs, shift registers, muxes, interconnect
Initiation interval (II)
  – Key parameter that decides LA cost
    Low II → high performance → high cost
  – Loop execution time ≈ (trip count) x II
  – Appropriate II chosen to satisfy task throughput
[Figure: kernels K1, K2, K3 with TC=100 each. High performance: II=1, timeline marks at 100, 200, 300, throughput = 1 task/100 cycles. Low performance: II=2, timeline marks at 200, 400, 600, throughput = 1 task/200 cycles]
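The relation between II, trip count, and task throughput on this slide can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes one loop per accelerator and ignores pipeline fill and drain, which the slide's "≈" also glosses over.

```c
#include <assert.h>

/* Approximate loop execution time from the slide: trip count x II. */
long loop_cycles(long trip_count, long ii) {
    assert(trip_count > 0 && ii > 0);
    return trip_count * ii;
}

/* Largest II whose loop time still fits in the per-task cycle budget.
   A larger II means a cheaper accelerator, so this is the cheapest
   feasible choice. Returns 0 if even II = 1 is too slow. */
long max_feasible_ii(long trip_count, long task_budget_cycles) {
    assert(trip_count > 0 && task_budget_cycles >= 0);
    return task_budget_cycles / trip_count;
}
```

With TC=100, a budget of 100 cycles per task forces II=1, while a budget of 200 cycles allows II=2, matching the two design points in the figure.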
10
Cost Components (contd.)
Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task
[Figure: kernels K1-K4 with TC=100 on LA 1, LA 2, LA 3; LA 1 occupied for 200 cycles; timeline marks at 100, 200, 300, 400; throughput = 1 task / 200 cycles]
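The effect of grouping can be sketched by summing TC x II over the loops mapped to each accelerator and taking the busiest accelerator as the bottleneck. The struct and function names below are hypothetical, and the model assumes an accelerator runs its loops back to back with no switching overhead.

```c
#include <assert.h>
#include <stddef.h>

/* One kernel loop: trip count, chosen II, and the LA it is mapped to. */
struct loop_cfg {
    long trip_count;
    long ii;
    size_t la;  /* index of the loop accelerator running this loop */
};

/* Cycles per task that accelerator `la` is busy: sum of TC x II
   over the loops mapped to it. */
long la_occupancy(const struct loop_cfg *loops, size_t n, size_t la) {
    long busy = 0;
    for (size_t i = 0; i < n; i++) {
        if (loops[i].la == la) {
            busy += loops[i].trip_count * loops[i].ii;
        }
    }
    return busy;
}

/* Cycles between successive tasks: the busiest accelerator sets the pace. */
long task_interval(const struct loop_cfg *loops, size_t n, size_t num_las) {
    assert(num_las > 0);
    long worst = 0;
    for (size_t la = 0; la < num_las; la++) {
        long busy = la_occupancy(loops, n, la);
        if (busy > worst) {
            worst = busy;
        }
    }
    return worst;
}
```

Four loops with TC=100 and II=1, two of them sharing one accelerator, give a task interval of 200 cycles, as in the figure.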
11
Cost Components (contd.)
Cost of SRAM buffers for intermediate arrays
More buffers → more task overlap → high performance
[Figure: K1, K2, K3 (II=1, TC=100) on LA 1, LA 2, LA 3, connected through tmp1 and tmp2; two timelines with marks at 100, 200, 300. In one, the tmp1 buffer is in use by LA 2; in the other, adjacent tasks use different buffers]
12
ILP Formulation
Variables
  – II for each loop
  – Which loops are combined into a single LA
  – Number of buffers for each temp array
Objective function
  – Cost of LAs + cost of buffers
Constraints
  – Overall task throughput should be achieved
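One way to write down a formulation with these variables is sketched below. The notation is assumed for illustration and is not taken from the slides: $x_{k,a} \in \{0,1\}$ says loop $k$ runs on accelerator $a$, $II_k$ is the initiation interval of loop $k$, $b_t$ is the number of buffers for temp array $t$, $M_t$ is the cost of one buffer, and $T$ is the required number of cycles between tasks. The product $x_{k,a} \cdot II_k$ is not linear as written and would need a linearization that the slides do not show; the constraints tying $b_t$ to task overlap are likewise omitted.

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{a} C_{LA}(a) \;+\; \sum_{t} M_t \, b_t \\
\text{subject to}\quad & \sum_{a} x_{k,a} = 1
                         && \text{for every loop } k \\
                       & \sum_{k} x_{k,a} \cdot TC_k \cdot II_k \le T
                         && \text{for every accelerator } a \\
                       & II_k \ge 1, \quad b_t \ge 1, \quad x_{k,a} \in \{0, 1\}
\end{align*}
```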
13
Non-linear LA Cost
[Figure: relative cost (0 to 1.0) plotted against initiation interval (1 to 20), with II_min and II_max marked]
$II = 1 \cdot II_1 + 2 \cdot II_2 + 3 \cdot II_3 + \dots + 14 \cdot II_{14}$, with $0 \le II_i \le 1$
$\mathrm{Cost}(II) = C_1 \cdot II_1 + C_2 \cdot II_2 + C_3 \cdot II_3 + \dots + C_{14} \cdot II_{14}$
$II_{min} \le II \le II_{max}$
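The encoding on this slide can be read as a table lookup expressed with selector variables: each $II_i$ weights candidate value $i$ and its cost $C_i$. The C sketch below evaluates both sums for a given selector vector. The function names are hypothetical, and the usage note assumes exactly one selector is set to 1 (a one-hot choice), which the slide does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* II = 1*sel[0] + 2*sel[1] + ... + n*sel[n-1], with each sel[i] in [0, 1]. */
double decode_ii(const double *sel, size_t n) {
    double ii = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(sel[i] >= 0.0 && sel[i] <= 1.0);
        ii += (double)(i + 1) * sel[i];
    }
    return ii;
}

/* Cost = cost[0]*sel[0] + cost[1]*sel[1] + ... + cost[n-1]*sel[n-1]. */
double decode_cost(const double *cost, const double *sel, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += cost[i] * sel[i];
    }
    return total;
}
```

With a one-hot selector that picks the second entry, `decode_ii` returns 2 and `decode_cost` returns the second entry of the cost table, so the non-linear cost curve is represented using only linear expressions.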
14
Multifunction Accelerator Cost
Worst case: no sharing, cost = sum
Realistic case: some sharing, cost between sum and max
Best case: full sharing, cost = max
Impractical to obtain accurate cost of all combinations
$C_{LA} = 0.5 \times \left( \sum_i C_{LA_i} + \max_i C_{LA_i} \right)$
[Figure: LA 1 to LA 4 combined under each of the three sharing cases]
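The cost estimate on this slide is the midpoint between the no-sharing bound (sum) and the full-sharing bound (max). A minimal C sketch, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Estimated cost of a multifunction accelerator that merges n
   single-function accelerators: halfway between the sum of their
   costs (no sharing) and the largest cost (full sharing). */
double multifunction_cost(const double *la_cost, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    double max = la_cost[0];
    for (size_t i = 0; i < n; i++) {
        sum += la_cost[i];
        if (la_cost[i] > max) {
            max = la_cost[i];
        }
    }
    return 0.5 * (sum + max);
}
```

For a single accelerator the sum and max coincide, so the estimate reduces to that accelerator's own cost.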
15
Case Study: “Simple” benchmark
[Figure: loop graph with TC=256 and the per-loop values chosen for several synthesized designs mapped onto between one and four LAs; designs are labeled 512, 1536, 1792, and 2048 cycles]
16
Beamformer
Beamformer: 10 loops
Memory cost: 60% to 70%
Up to 20% cost savings due to hardware sharing in multifunction accelerators
Systems at lower throughput have over-designed LAs
  – Not profitable to pick a lower performance LA
Memory buffer cost significant
  – High performance producer/consumer better than more buffers
17
Conclusions
Automated design realistic for system of loops
Designers can move up the abstraction hierarchy
Observations
  – Macro-level hardware sharing can achieve significant cost savings
  – Memory cost is significant: need to simultaneously optimize for datapath and memory cost
ILP formulation tractable
  – Solver took less than 1 minute for systems with 30 loops