
1 Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines
Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science

2 Automated C to Gates Solution
SoC design
–10-100 Gops, 200 mW power budget
–Low-level tools ineffective
Automated accelerator synthesis for whole application
–Correct by construction
–Increase designer productivity
–Faster time to market
[Figure labels: app.c, LA]

3 Streaming Applications
Data “streaming” through kernels
Kernels are tight loops
–FIR, Viterbi, DCT
Coarse grain dataflow between kernels
–Sub-blocks of images, network packets
[Figure: H.264 Encoder, from Image to Coded Image. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor]
[Figure: W-CDMA Transmitter, from Data in to Data out. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter]

4 System Schema Overview
[Figure: Kernel 1 to Kernel 5 and LA 1 to LA 3; successive tasks (Kernel 1, K2, K3, Kernel 4, Kernel 5) laid out along a time axis; label: Task throughput]

5 Input Specification
Sequential C program
Kernel specification
–Perfectly nested FOR loop
–Wrapped inside C function
–All data access made explicit
System specification
–Function with main input/output
–Local arrays to pass data
–Sequence of calls to kernels

    row_trans(char inp[8][8], char out[8][8]) {
      for(i=0; i<8; i++) {
        for(j=0; j<8; j++) {
          ... = inp[i][j];
          out[i][j] = ... ;
        }
      }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

    dct(char inp[8][8], char out[8][8]) {
      char tmp1[8][8], tmp2[8][8];
      row_trans(inp, tmp1);
      col_trans(tmp1, tmp2);
      zigzag_trans(tmp2, out);
    }

[Figure: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]
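A minimal compilable sketch of an input in this style. The slide elides what the three kernels compute, so the kernel bodies here are hypothetical placeholders (in particular, `zigzag_trans` is a plain transpose, not a real zigzag scan). Only the structure follows the slide: each kernel is a perfectly nested loop with explicit array accesses, and the system function passes local arrays through a sequence of kernel calls.

```c
/* Hypothetical kernel bodies: the slide elides the real computations. */

/* Kernel: perfectly nested loop, every data access written explicitly. */
void row_trans(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            char v = inp[i][j];
            out[i][j] = (char)(v + 1);   /* placeholder arithmetic */
        }
    }
}

void col_trans(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            char v = inp[i][j];
            out[i][j] = (char)(v * 2);   /* placeholder arithmetic */
        }
    }
}

void zigzag_trans(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            out[j][i] = inp[i][j];       /* placeholder: transpose only */
        }
    }
}

/* System: main input/output, local arrays, sequence of kernel calls. */
void dct(char inp[8][8], char out[8][8]) {
    char tmp1[8][8], tmp2[8][8];
    row_trans(inp, tmp1);
    col_trans(tmp1, tmp2);
    zigzag_trans(tmp2, out);
}
```

Because `tmp1` and `tmp2` are the only channels between kernels, a tool can read the kernel-to-kernel dataflow directly from the call sequence.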

6 System Level Decisions
Throughput of each LA
–Initiation Interval
Grouping of loops into a multifunction LA
–More loops in a single LA → LA occupied for longer time in current task
[Figure: kernels K1 to K4 (TC=100) mapped onto LA 1, LA 2, LA 3; LA 1 occupied for 200 cycles; time axis 100 to 400; Throughput = 1 task / 200 cycles]
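The arithmetic behind the figure can be sketched as follows, assuming each loop occupies its LA for roughly TC × II cycles (pipeline fill and drain latency ignored) and that task throughput is limited by the busiest LA. The function name `task_period` and the loop-to-LA mapping array are illustrative, not from the slides.

```c
#include <stddef.h>

/* Cycles between task starts for a given loop-to-LA grouping.
 * tc[k]    : trip count of loop k
 * ii[k]    : initiation interval chosen for loop k
 * la_of[k] : index of the LA that loop k is mapped to (0 .. n_las-1)
 * Assumption: an LA is busy for the sum of TC * II over its loops. */
long task_period(const long *tc, const long *ii, const int *la_of,
                 size_t n_loops, int n_las)
{
    long worst = 0;
    for (int la = 0; la < n_las; la++) {
        long busy = 0;
        for (size_t k = 0; k < n_loops; k++) {
            if (la_of[k] == la)
                busy += tc[k] * ii[k];
        }
        if (busy > worst)
            worst = busy;   /* the busiest LA sets the task period */
    }
    return worst;
}
```

With four loops of TC=100 at II=1, mapping two of them to the same LA gives a period of 200 cycles, matching the "1 task / 200 cycles" label. Which two loops share LA 1 in the original figure is an assumption here.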

7 System Decisions (Contd.)
Cost of SRAM buffers for intermediate arrays
More buffers → more task overlap → higher performance
[Figure: K1, K2, K3 (TC=100, II=1) on LA 1, LA 2, LA 3 with intermediate arrays tmp1 and tmp2; time axis 100, 200, 300. Single-buffer case: tmp1 buffer in use by LA2. Multi-buffer case: adjacent tasks use different buffers]
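A small timing model can illustrate the buffer tradeoff. This is a sketch under stated assumptions, not the tool's cost model: stages form a linear chain, each stage takes a fixed number of cycles per task, and every intermediate array has the same number of buffer copies. A stage may start a task only when it is idle, its input is ready, and the output buffer it is about to overwrite has been fully read by the next stage. All names are hypothetical.

```c
#define MAX_STAGES 8
#define MAX_TASKS  64

static long max2(long a, long b) { return a > b ? a : b; }

/* Steady-state cycles between consecutive task completions.
 * stage_cycles[s] : cycles stage s needs per task
 * n_buffers       : copies of each intermediate array (1 = no overlap)
 * Returns -1 on out-of-range arguments. */
long steady_state_period(const long *stage_cycles, int n_stages,
                         int n_buffers, int n_tasks)
{
    long end[MAX_STAGES][MAX_TASKS];

    if (n_stages < 1 || n_stages > MAX_STAGES ||
        n_tasks < 2 || n_tasks > MAX_TASKS || n_buffers < 1)
        return -1;

    for (int t = 0; t < n_tasks; t++) {
        for (int s = 0; s < n_stages; s++) {
            long start = 0;
            if (t > 0)                      /* LA still busy with previous task */
                start = max2(start, end[s][t - 1]);
            if (s > 0)                      /* input array not yet produced */
                start = max2(start, end[s - 1][t]);
            if (s + 1 < n_stages && t >= n_buffers)
                /* output buffer is reused from task t - n_buffers and must
                   have been consumed by the next stage first */
                start = max2(start, end[s + 1][t - n_buffers]);
            end[s][t] = start + stage_cycles[s];
        }
    }
    return end[n_stages - 1][n_tasks - 1] - end[n_stages - 1][n_tasks - 2];
}
```

For three stages of 100 cycles each, the model gives one task every 200 cycles with a single buffer per array and one task every 100 cycles with two, at the price of doubling the SRAM for `tmp1` and `tmp2`.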

8 Case Study: “Simple” benchmark
[Figure: loop graph (TC=256) with per-loop numeric labels under several groupings (LA 1 to LA 4; LA 1 and LA 2; a single LA 1); cycle counts shown: 512, 1536, 1792, and 2048 cycles]

9 Prescribed Throughput Accelerators
Traditional behavioral synthesis
–Directly translate C operators into gates
Our approach: Application-centric Architectures
–Achieve fixed throughput
–Maximize hardware sharing
[Figure labels: Application, Architecture; Operation graph, Datapath]

10 Loop Accelerator Template
Parameterized execution resources, storage, connectivity
Hardware realization of modulo scheduled loop

11 Loop Accelerator Design Flow
[Figure: design flow. Input: .c (C Code) and Performance (Throughput). Steps: FU Alloc, Modulo Schedule, Build Datapath, Instantiate Arch, Synthesize. Intermediate results: Abstract Arch, Scheduled Ops (Op1, Op2, Op3, … placed on FUs over time), Concrete Arch (RF, FU). Output: .v (Verilog, Control Signals), Loop Accelerator]
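Two standard modulo-scheduling ingredients that sit behind the FU Alloc and Modulo Schedule steps can be sketched as follows. This is a generic illustration, not the Streamroller implementation: it assumes every operation occupies a fully pipelined FU for exactly one cycle, and the names `min_fus`, `Mrt`, and `mrt_try_place` are hypothetical.

```c
#define MAX_II 64

/* Lower bound on FUs of one type for a target initiation interval:
 * n_ops operations must fit into ii issue slots per FU, so at least
 * ceil(n_ops / ii) FUs are needed. Returns -1 on invalid input. */
int min_fus(int n_ops, int ii)
{
    if (n_ops < 0 || ii <= 0)
        return -1;
    return (n_ops + ii - 1) / ii;
}

/* Modulo reservation table for one FU type: used[slot] counts how many
 * FUs are already taken in cycle (slot mod ii) of every iteration. */
typedef struct {
    int ii;              /* initiation interval, 1 .. MAX_II */
    int n_fus;           /* FUs of this type in the abstract arch */
    int used[MAX_II];
} Mrt;

/* Try to schedule one op at a non-negative cycle; returns 1 on success,
 * 0 if all FUs are taken in that modulo slot. */
int mrt_try_place(Mrt *m, int cycle)
{
    int slot = cycle % m->ii;
    if (m->used[slot] >= m->n_fus)
        return 0;
    m->used[slot]++;
    return 1;
}
```

A tighter throughput target (smaller II) raises `min_fus`, which is the link between the Performance (Throughput) input and the hardware cost of the resulting accelerator.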

12 Multifunction Accelerator
Map multiple loops to single accelerator
Improve hardware efficiency via reuse
Opportunities for sharing
–Disjoint stages (loops 2, 3)
–Pipeline slack (loops 4, 5)
[Figure: application (Loop 1, Frame Type?, Loop 2, Loop 3, Loop 4, …, Block 5) mapped to an accelerator pipeline of loop accelerators LA1 to LA5, and to a shorter accelerator pipeline LA1 to LA3 made of a Loop Accelerator and two Multifunction Loop Accelerators]

13 Union
43% average savings over sum of accelerators
Smart union within 3% of joint scheduling solution
[Figure labels: Loop 1, Loop 2, Cost Sensitive Modulo Scheduler, FU, Datapath, Union]
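To see why a union can cost less than the sum of separate accelerators, consider a deliberately naive model: each loop's datapath is reduced to a count of FUs per type, and the union keeps the per-type maximum. This is only an illustration. The slides do not describe the union algorithm, and this sketch ignores registers, interconnect, and the cost-sensitive scheduling that the deck credits for the reported savings. The FU classes and unit costs are made up.

```c
enum { N_FU_TYPES = 3 };   /* hypothetical classes: adder, multiplier, memory port */

/* Naive union: loops never run at the same time on a multifunction LA,
 * so per FU type it is enough to provide the larger of the two counts. */
void fu_union(const int a[N_FU_TYPES], const int b[N_FU_TYPES],
              int out[N_FU_TYPES])
{
    for (int t = 0; t < N_FU_TYPES; t++)
        out[t] = a[t] > b[t] ? a[t] : b[t];
}

/* Weighted FU cost of a datapath under made-up per-type unit costs. */
int fu_cost(const int counts[N_FU_TYPES], const int unit_cost[N_FU_TYPES])
{
    int total = 0;
    for (int t = 0; t < N_FU_TYPES; t++)
        total += counts[t] * unit_cost[t];
    return total;
}
```

For loops needing {2, 1, 1} and {1, 2, 1} FUs at unit costs {1, 4, 2}, the separate accelerators cost 8 + 11 = 19 while the union {2, 2, 1} costs 12.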

14 Challenges: Throughput Enabling Transformations
Algorithm-level pipeline retiming
–Splitting loops based on tiling
–Co-scheduling adjacent loops
[Figure labels: Loop 1, Loop 2, Loop 3, Loop 4; after transformation: Loop 1, Loop 2a, Loop 2b, Loop 3,4; Critical loop]

15 Challenges: Programmable Loop Accelerator
Support bug fixes, evolving standards
Accelerate loops not known at design time
Minimize additional control overhead
[Figure labels: Interconnect, FU, MEM, Local Mem, Control, II, Control signals]

16 Challenges: Timing Aware Synthesis
Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance
Strategies to eliminate long wires
–Preemptive: predict & prevent long wires
–Reactive: use feedback from floorplanner
[Figure: FU1, FU2, FU3]
- Insert flip flop on long path
- Reschedule with added latency
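The "insert flip flop, then reschedule" step can be illustrated with a back-of-the-envelope calculation. The sketch assumes a wire delay estimate (for example, from a floorplanner) in picoseconds, assumes the wire can be cut into segments that each fit in one clock period, and ignores FU delay and flip-flop setup time. The function name is hypothetical.

```c
/* Flip-flops to insert on a long wire so that every segment fits in
 * one clock period. Each inserted flip-flop adds one cycle of latency
 * that the modulo scheduler must then account for.
 * Returns -1 on invalid input. */
int flops_needed(long wire_delay_ps, long clock_ps)
{
    if (wire_delay_ps < 0 || clock_ps <= 0)
        return -1;
    if (wire_delay_ps <= clock_ps)
        return 0;                          /* wire already meets timing */
    /* ceil(delay / period) segments need one fewer flip-flops */
    return (int)((wire_delay_ps + clock_ps - 1) / clock_ps) - 1;
}
```

For a 1000 ps clock, a 1500 ps wire needs one flip-flop, so the consuming operation sees its operand one cycle later.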

17 Challenges: Adaptable Voltage/Frequency Levels
Allow voltage scaling beyond margins
Using shadow latches in loop accelerator
–Localized error detection
–Control is predefined: simple error recovery
[Figure labels: D, CLK, Q, error, flip-flop, shadow latch, delay; FU, Shadow latch, Extra queue entries]

18 For More Information
Visit http://cccp.eecs.umich.edu

