University of Michigan Electrical Engineering and Computer Science
Increasing Hardware Efficiency with Multifunction Loop Accelerators
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
October 25, 2006

Introduction
Emerging applications have high performance, cost, energy demands
–H.264, wireless, software radio, signal processing
– Gops required
–200 mW power budget
Applications dominated by tight loops processing large amounts of streaming data
[Figure: CPU and accelerators]

Loop Accelerators
Order-of-magnitude performance and efficiency wins
–Viterbi: 100x speedup vs. ARM9
Automated C to gates solution
Correct by construction
Close designer productivity gap
Achieve short time-to-market

Prescribed Throughput Accelerators
Traditional behavioral synthesis
–Directly translate C operators into gates
Our approach: Application-centric Architectures
–Achieve fixed throughput
–Maximize hardware sharing
[Figure: operation graph → datapath; application → architecture]

Outline
Loop accelerator schema and design flow
Cost sensitive scheduling
Designing multifunction accelerators
–Naïve
–Joint scheduling
–Datapath union
Synthesis results

Loop Accelerator Template
Parameterized execution resources, storage, connectivity
Hardware realization of modulo scheduled loop

Loop Accelerator Design Flow
[Figure: design flow. C code (.c) and performance (throughput) → FU allocation → abstract architecture → modulo schedule (Op1, Op2, Op3, … placed on FUs over time) → scheduled ops → build datapath (FUs, RFs) → concrete architecture → instantiate architecture → Verilog (.v) and control signals → synthesize → loop accelerator]
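
The FU allocation step at the start of the flow can be sketched in a few lines. This is a minimal illustration, not the authors' tool: it assumes the throughput target is expressed as an initiation interval `ii` (cycles between loop iterations) and that every FU is fully pipelined and accepts one operation per cycle. The function name `allocate_fus` and the operation mix are hypothetical.

```python
from collections import Counter
from math import ceil

def allocate_fus(op_types, ii):
    """Minimum number of FUs per operation type so that all operations
    of one loop iteration fit into `ii` issue slots per FU."""
    counts = Counter(op_types)
    return {op: ceil(n / ii) for op, n in counts.items()}

# Hypothetical loop body: operation types of one iteration.
loop_ops = ["load", "load", "add", "add", "add", "mul", "store"]

alloc = allocate_fus(loop_ops, ii=2)
print(alloc)
```

Relaxing the throughput target (a larger `ii`) lets each FU serve more operations, so fewer FUs are allocated.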

Datapath Derived from Schedule
Schedule to abstract architecture (FUs)
Determine register and interconnect requirements from schedule
[Figure: source code (r1 = Mem[r2], followed by an add), its schedule of LOAD and ADD on FU1 and FU2 (times 1 and 4), and the resulting datapath with MEM and + units]
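
As a rough sketch of how register and interconnect requirements might be read off a schedule: each producer-to-consumer dependence needs a wire between the two FUs, and the producer's output storage must hold the value until the consumer reads it. The assumptions here are mine, not the slide's: storage sits at each FU's output, and its depth is the longest wait (in cycles) of any value that FU produces. The times and FU names loosely follow the LOAD/ADD example and are illustrative.

```python
def derive_datapath(schedule, deps):
    """schedule: op -> (fu, time); deps: list of (producer, consumer).
    Returns the set of FU-to-FU wires and the storage depth per FU."""
    wires = set()
    depths = {}
    for producer, consumer in deps:
        p_fu, p_time = schedule[producer]
        c_fu, c_time = schedule[consumer]
        wires.add((p_fu, c_fu))          # value travels from p_fu to c_fu
        wait = c_time - p_time           # cycles the value must be held
        depths[p_fu] = max(depths.get(p_fu, 0), wait)
    return wires, depths

# Illustrative schedule: LOAD on FU1 at time 1, ADD on FU2 at time 4.
schedule = {"load": ("FU1", 1), "add": ("FU2", 4)}
deps = [("load", "add")]

wires, depths = derive_datapath(schedule, deps)
print(wires, depths)
```

Moving the ADD closer to the LOAD in the schedule would shrink the storage depth on FU1, which is one way the schedule determines hardware cost.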

Cost Sensitive Scheduling
Traditional scheduling is hardware unaware
Intelligent scheduling needed to reduce hardware cost
27% cost reduction with same performance [MICRO ’05]
[Figure: two schedules of the same add and load operations on FU1, FU2, FU3 over time]
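
A toy comparison can show why a hardware-unaware schedule may cost more at the same performance. The cost model is an assumption made for illustration only: each FU pays once for every distinct operation type bound to it, and the relative costs in `CAP_COST` are invented, not taken from the source.

```python
# Hypothetical relative gate cost of supporting each operation type.
CAP_COST = {"add": 1, "load": 3}

def datapath_cost(assignment):
    """assignment: FU name -> list of operation types bound to that FU.
    An FU pays once per distinct operation type it must support."""
    return sum(
        sum(CAP_COST[op] for op in set(ops))
        for ops in assignment.values()
    )

# Same four operations, same number of time slots, different binding.
unaware = {"FU1": ["add", "load"], "FU2": ["load", "add"]}
aware = {"FU1": ["add", "add"], "FU2": ["load", "load"]}

print(datapath_cost(unaware), datapath_cost(aware))
```

Both bindings issue two operations per FU, so throughput is unchanged; only the capabilities each FU must implement differ.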

Multifunction Accelerator
Map multiple loops to single accelerator
Improve hardware efficiency via reuse
Opportunities for sharing
–Disjoint stages (loops 2, 3)
–Pipeline slack (loops 4, 5)
[Figure: application with Loop 1, a "Frame Type?" branch selecting Loop 2 or Loop 3, Loop 4, …, Block 5; an accelerator pipeline of loop accelerators LA1 to LA5; and the same pipeline with multifunction loop accelerators]

Design Strategies
Naïve method: Design single function accelerators, place side by side
–Misses potential hardware sharing of FUs, storage, interconnect
[Figure: Loop 1 and Loop 2 each pass through the cost sensitive modulo scheduler; the resulting FUs are placed in one multifunction datapath]

Joint Scheduling
Loops are independent: # possible schedules exponential in # of loops!
Infeasible even for modest problems
[Figure: Loop 1 and Loop 2 feed a joint cost sensitive modulo scheduler, which produces a schedule for each loop on a shared set of FUs]
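
The growth can be made concrete with a small calculation. Because the loops are independent, any schedule of one loop can be combined with any schedule of another, so the joint search space is the product of the per-loop counts. The counts below are invented for illustration.

```python
from math import prod

def joint_search_space(schedules_per_loop):
    """Number of joint schedule combinations when every loop's schedule
    can be chosen independently of the others."""
    return prod(schedules_per_loop)

# Hypothetical: each loop alone has 50 candidate schedules.
for n_loops in (1, 2, 3, 4):
    print(n_loops, joint_search_space([50] * n_loops))
```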

Multifunction Gate Costs
43% average savings over sum of accelerators
[Figure: gate cost chart, categories A through J]

Datapath Union
[Figure: Loop 1 and Loop 2 are scheduled separately by the cost sensitive modulo scheduler; the resulting datapaths (FUs) are combined by datapath union]

Datapath Union
Combine similar components → better hardware sharing → lower cost
Trade off FU and register cost
–Combining dissimilar FUs can enable register cost savings
ILP formulation minimizes FU and register cost
[Figure: FUs of Accel 1 and Accel 2 (labeled +, -, *, /, M) merged into the FUs of one multifunction accelerator]
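
The FU-pairing part of the union can be sketched with a brute-force search. Note the differences from the slide: this is not the ILP formulation, it ignores register cost entirely, and it only works for a handful of FUs. The per-operation costs in `OP_COST` and both accelerators are hypothetical. Each FU of one accelerator is paired with at most one FU of the other, and a merged FU must support the union of both operation sets.

```python
from itertools import permutations

# Hypothetical relative gate cost of supporting each operation.
OP_COST = {"+": 1, "-": 1, "*": 4, "/": 6, "M": 3}

def fu_cost(ops):
    return sum(OP_COST[op] for op in ops)

def union_datapaths(accel1, accel2):
    """Try every one-to-one pairing of FUs and return (cost, merged FUs)
    for the cheapest one. Unpaired FUs are kept unchanged."""
    n = max(len(accel1), len(accel2))
    a = accel1 + [frozenset()] * (n - len(accel1))
    b = accel2 + [frozenset()] * (n - len(accel2))
    best = None
    for perm in permutations(b):
        merged = [x | y for x, y in zip(a, perm)]
        cost = sum(fu_cost(m) for m in merged)
        if best is None or cost < best[0]:
            best = (cost, merged)
    return best

accel1 = [frozenset("+-"), frozenset("M"), frozenset("*")]
accel2 = [frozenset("M"), frozenset("+"), frozenset("*/")]

side_by_side = sum(fu_cost(f) for f in accel1 + accel2)
cost, merged = union_datapaths(accel1, accel2)
print(side_by_side, cost)
```

Here the cheapest pairing matches similar FUs. With register cost included, as in the slide's ILP, pairing dissimilar FUs can sometimes win, which this sketch cannot capture.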

Multifunction Gate Costs
Smart union within 3% of joint scheduling solution
[Figure: gate cost chart, categories A through J]

Conclusion
Multifunction accelerators highly effective in exploiting coarse grained hardware sharing
Joint scheduling achieves 43% average cost savings, but is impractical
Smart union of independent accelerators achieves 40% average savings
Compile times of 5 minutes to 1 hour

Questions?