University of Michigan Electrical Engineering and Computer Science Compiler-directed Synthesis of Programmable Loop Accelerators Kevin Fan, Hyunchul Park, Scott Mahlke September 25, 2004 EDCEP Workshop

Loop Accelerators
Hardware implementation of a critical loop nest
–Hardwired state machine
–Digital camera application: 1000x vs. Pentium III
–Multiple accelerators hooked up in a pipeline
Loop accelerator vs. customized processor
–1 block of code vs. multiple blocks
–Trivial control flow vs. handling generic branches
–Traditionally state machine vs. instruction driven

Programmable Loop Accelerators
Goals
–Multifunction accelerators: accelerator hardware can handle multiple loops (re-use)
–Post-programmable: to a degree, allow changes to the application
–Use compiler as architecture synthesis tool
But …
–Don’t build a customized processor
–Maintain ASIC-level efficiency

NPA (Nonprogrammable Accelerator) Synthesis in PICO

PICO Frontend
Original loop nest:
    for i = 1 to ni
      for j = 1 to nj
        y[i] += w[j] * x[i+j]
After the frontend transformations:
    for jt = 1 to 100 step 10
      for t = 0 to 502
        for p = 0 to 1
          (i,j) = function of (t,p)
          if (i>1) W[t][p] = W[t-5][p] else w[jt+j]
          if (i>1 && j<bj) X[t][p] = X[t-4][p+1] else x[i+jt+j]
          Y[t][p] += W[t][p] * X[t][p]
Goals
–Exploit loop-level parallelism
–Map loop to abstract hardware
–Manage global memory BW
Steps
–Tiling
–Load/store elimination
–Iteration mapping
–Iteration scheduling
–Virtual processor clustering
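
The tiling step alone can be illustrated on the original loop nest. The sketch below is not the PICO transformation (it omits load/store elimination, iteration mapping, scheduling, and clustering). It only shows that splitting the j loop into tiles leaves the result unchanged. The function names, the 0-based indexing, and the `tile` parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop nest from the slide, rewritten with 0-based indices.
 * x must hold at least ni + nj - 1 elements. */
void fir_reference(int *y, const int *w, const int *x, size_t ni, size_t nj)
{
    for (size_t i = 0; i < ni; i++)
        for (size_t j = 0; j < nj; j++)
            y[i] += w[j] * x[i + j];
}

/* Same computation with the j loop split into tiles of `tile` iterations.
 * `tile` is a hypothetical parameter (the slide uses a step of 10). */
void fir_tiled(int *y, const int *w, const int *x,
               size_t ni, size_t nj, size_t tile)
{
    assert(tile > 0);
    for (size_t jt = 0; jt < nj; jt += tile) {
        /* Clamp the last tile when nj is not a multiple of tile. */
        size_t jend = (jt + tile < nj) ? jt + tile : nj;
        for (size_t i = 0; i < ni; i++)
            for (size_t j = jt; j < jend; j++)
                y[i] += w[j] * x[i + j];
    }
}
```

Because integer addition is associative and commutative, both functions accumulate the same sums into y. Each tile touches only `tile` coefficients of w at a time, which is what makes the working set small enough to keep in local storage.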

PICO Backend
Resource allocation (II, operation graph)
Synthesize machine description for “fake” fully connected processor with allocated resources

Reduced VLIW Processor after Modulo Scheduling

Data/control-path Synthesis
NPA

PICO Methodology: Why Does It Work?
Systematic design methodology
–1. Parameterized meta-architecture: all NPAs have the same general organization
–2. Performance/throughput is an input
–3. Abstract architecture: we know how to build compilers for this
–4. Mapping mechanism: determine architecture specifics from the schedule for the abstract architecture

Direct Generalization of PICO?
Programmability would require full interconnect between elements
Back to the meta-architecture!
Generalize connectivity to enable post-programmability
But stylize it

Programmable Loop Accelerator: Design Strategy
Compile for partially defined architecture
–Build long distance communication into schedule
–Limit global communication bandwidth
Proposed meta-architecture
–Multi-cluster VLIW
  Explicit inter-cluster transfers (varying latency/BW)
  Intra-cluster communication is complete
–Hardware partially defined: expensive units

Programmable Loop Accelerator Schema
[Figure: pipeline of tiled or clustered accelerators fed from DRAM/SRAM through stream buffers. Each accelerator has a control unit, a stream unit, and a datapath of clusters; each cluster holds FUs and MEM units with shift registers and intra-cluster communication, and clusters are connected through an inter-cluster register file.]

Flow Diagram
[Figure: design flow]
Input: assembly code, II
FU Alloc: # clusters, # expensive FUs
Partition: # cheap FUs, FUs assigned to clusters
Modulo Schedule: shift register depth, width, porting; intercluster bandwidth
Output: loop accelerator

Sobel Kernel
    for (i = 0; i < N1; i++) {
      for (j = 0; j < N2; j++) {
        int t00, t01, t02, t10, t12, t20, t21, t22;
        int e1, e2, e12, e22, e, tmp;

        /* Load the 3x3 neighborhood (the center pixel is not used). */
        t00 = x[i  ][j  ]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
        t10 = x[i+1][j  ];                    t12 = x[i+1][j+2];
        t20 = x[i+2][j  ]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];

        /* Gradient estimates: top row vs. bottom row, left column vs. right column. */
        e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
        e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));

        /* Squared gradient magnitude compared against the threshold. */
        e12 = e1*e1;
        e22 = e2*e2;
        e = e12 + e22;
        if (e > threshold) tmp = 1;
        else tmp = 0;
        edge[i][j] = tmp;
      }
    }

FU Allocation
Determine number of clusters
Determine number of expensive FUs
–MPY, DIV, memory
Sobel with II=4
–41 ops: 3 clusters
–2 MPY ops: 1 multiplier
–9 memory ops: 3 memory units
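
The Sobel numbers on this slide are consistent with a simple resource bound: a unit can issue at most II operations per iteration, and a cluster of 4 FUs (the figure used on the Partitioning and Test Cases slides) can hold at most 4 × II operations. The sketch below encodes that bound. It is an inference from the slide's numbers, not a formula stated in the talk, and the function names are hypothetical.

```c
#include <assert.h>

/* Ceiling division for a non-negative numerator and positive divisor. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Minimum number of units of one type, assuming each unit issues one
 * operation per cycle, i.e. at most II operations per iteration. */
int min_units(int num_ops, int ii)
{
    return ceil_div(num_ops, ii);
}

/* Minimum number of clusters, assuming fus_per_cluster FUs per cluster,
 * so each cluster holds at most fus_per_cluster * II operations. */
int min_clusters(int total_ops, int ii, int fus_per_cluster)
{
    assert(fus_per_cluster > 0);
    return ceil_div(total_ops, ii * fus_per_cluster);
}
```

For Sobel at II=4, `min_clusters(41, 4, 4)` gives 3, `min_units(2, 4)` gives 1 multiplier, and `min_units(9, 4)` gives 3 memory units, matching the slide.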

Partitioning
Multi-level approach consists of two phases
–Coarsening
–Refinement
Minimize inter-cluster communication
Load balance
–Max of 4 × II operations per cluster
Take FU allocation into account
–Restricted # of expensive units
–# of cheap units (ADD, logic) determined from partition

Coarsening
Group highly related operations together
–Pair operations together at each step
–Forces partitioner to consider several operations as a single unit
[Figure: coarsening a Sobel subgraph into 2 groups]
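
One pairing step can be sketched with greedy heavy-edge matching, a common heuristic in multilevel graph partitioners. The slides do not say how "highly related" is measured, so the edge weight here is an assumption (it could stand for the number or width of values passed between two operations), and the type and function names are hypothetical.

```c
#include <assert.h>

#define MAX_OPS 64

struct edge { int src, dst, weight; };

/* One coarsening step: pair operations along the heaviest remaining
 * edges. group[v] receives the coarse group id of operation v;
 * operations left unpaired get a group of their own.
 * Returns the number of groups. */
int coarsen_step(int num_ops, const struct edge *edges, int num_edges,
                 int *group)
{
    assert(num_ops > 0 && num_ops <= MAX_OPS);
    for (int v = 0; v < num_ops; v++)
        group[v] = -1;

    int next_group = 0;
    for (;;) {
        /* Find the heaviest edge whose endpoints are both still unpaired. */
        int best = -1;
        for (int e = 0; e < num_edges; e++) {
            int s = edges[e].src, d = edges[e].dst;
            if (s == d || group[s] != -1 || group[d] != -1)
                continue;
            if (best < 0 || edges[e].weight > edges[best].weight)
                best = e;
        }
        if (best < 0)
            break;
        group[edges[best].src] = next_group;
        group[edges[best].dst] = next_group;
        next_group++;
    }
    /* Singletons: operations with no unpaired neighbor left. */
    for (int v = 0; v < num_ops; v++)
        if (group[v] == -1)
            group[v] = next_group++;
    return next_group;
}
```

Applying the step repeatedly to the coarse graph roughly halves the node count each time, which is what lets the partitioner treat several operations as a single unit.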

Refinement
Move operations between clusters
Good moves:
–Reduce inter-cluster communication
–Improve load balance
–Reduce hardware cost
  Reduce number of expensive units to meet limit
  Collect similar bitwidth operations together
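
Two of the "good move" criteria can be sketched as small checks: the change in cut edges when one operation changes cluster, and whether the target cluster stays within the per-cluster operation limit. This is a simplified illustration, not the cost function used in the talk (hardware cost and bitwidth are left out), and all names are hypothetical.

```c
#include <assert.h>

struct dep_edge { int producer, consumer; };

/* Reduction in cut edges if operation `op` moves to cluster `target`.
 * A positive value means less inter-cluster communication. */
int move_gain(const struct dep_edge *deps, int num_deps,
              const int *cluster_of, int op, int target)
{
    int gain = 0;
    for (int e = 0; e < num_deps; e++) {
        int other;
        if (deps[e].producer == op)
            other = deps[e].consumer;
        else if (deps[e].consumer == op)
            other = deps[e].producer;
        else
            continue;
        if (other == op)
            continue;  /* a self edge never crosses clusters */
        int was_cut = cluster_of[other] != cluster_of[op];
        int now_cut = cluster_of[other] != target;
        gain += was_cut - now_cut;
    }
    return gain;
}

/* A move fits only if the target cluster stays within its operation
 * limit, taken here as fus_per_cluster * II. */
int move_fits(const int *cluster_load, int target, int ii, int fus_per_cluster)
{
    assert(ii > 0 && fus_per_cluster > 0);
    return cluster_load[target] + 1 <= ii * fus_per_cluster;
}
```

A refinement pass could repeatedly apply the fitting move with the largest positive gain until none remains.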

Partitioning Example
From sobel, II=4
Place MPYs together
Place each tree of ADD-LOAD-ADDs together
Cuts 6 edges
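
The cut size of a partition translates into inter-cluster traffic. The sketch below counts cut edges and converts them to moves per cycle, assuming one move per cut edge per iteration (a value consumed several times in the same remote cluster might need only one move, so this is an upper bound under that assumption). Names are hypothetical.

```c
#include <assert.h>

struct dep { int producer, consumer; };

/* Number of dependence edges whose endpoints are in different clusters. */
int count_cut_edges(const struct dep *deps, int num_deps, const int *cluster_of)
{
    int cut = 0;
    for (int e = 0; e < num_deps; e++)
        if (cluster_of[deps[e].producer] != cluster_of[deps[e].consumer])
            cut++;
    return cut;
}

/* Average inter-cluster moves per cycle, assuming one move per cut edge
 * per iteration and one iteration started every II cycles. */
double moves_per_cycle(int cut_edges, int ii)
{
    assert(ii > 0);
    return (double)cut_edges / ii;
}
```

With the 6 cut edges of this example and II=4, `moves_per_cycle(6, 4)` is 1.5, which agrees with the sobel figure on the Cross Compile Results slide.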

Modulo Scheduling
Determines shift register width, depth, and number of read ports
[Figure: Sobel II=4 modulo schedule, cycles by FU0 to FU3, showing LD and ADD operations]
[Table: FU, cycle, max result lifetime, required depth, required ports; values not recoverable]
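
Two ingredients of this step can be sketched in a few lines. The first is the standard modulo reservation check: an operation scheduled at cycle c occupies its FU in slot c mod II, and no two operations may share an (FU, slot) pair. The second is a depth estimate that assumes the shift register behind an FU must be as deep as the longest lifetime among that FU's results; the slide's table suggests this relation, but the exact rule is not recoverable from the text, so treat it as an assumption. All names are hypothetical.

```c
#include <assert.h>

#define MAX_FUS 16
#define MAX_II  64

struct sched_op { int fu; int cycle; };

/* Returns 1 if no two operations occupy the same FU in the same
 * modulo slot (cycle mod II), 0 otherwise. */
int modulo_schedule_is_legal(const struct sched_op *ops, int num_ops,
                             int num_fus, int ii)
{
    assert(num_fus > 0 && num_fus <= MAX_FUS);
    assert(ii > 0 && ii <= MAX_II);
    int busy[MAX_FUS][MAX_II] = {{0}};
    for (int k = 0; k < num_ops; k++) {
        assert(ops[k].fu >= 0 && ops[k].fu < num_fus && ops[k].cycle >= 0);
        int slot = ops[k].cycle % ii;
        if (busy[ops[k].fu][slot])
            return 0;
        busy[ops[k].fu][slot] = 1;
    }
    return 1;
}

/* Assumed depth rule: the longest result lifetime (in cycles) among the
 * values produced by one FU. */
int required_depth(const int *lifetimes, int n)
{
    int depth = 0;
    for (int k = 0; k < n; k++)
        if (lifetimes[k] > depth)
            depth = lifetimes[k];
    return depth;
}
```

For example, with II=4, operations at cycles 1 and 5 on the same FU collide because both map to slot 1.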

Test Cases
Sobel and fsed kernels, II=4 designs
Each machine has 4 clusters with 4 FUs per cluster
[Figure: FU mix of each cluster for the sobel and fsed machines]

Cross Compile Results
Computation is localized
–sobel: 1.5 moves/cycle
–fsed: 1 move/cycle
Cross compile
–Can still achieve II=4
–More inter-cluster communication
–May require more units
–sobel on fsed machine: ~2 moves/cycle
–fsed on sobel machine: ~3 moves/cycle

Concluding Remarks
Programmable loop accelerator design strategy
–Meta-architecture with stylized interconnect
–Systematic compiler-directed design flow
Costs of programmability:
–Interconnect, inter-cluster communication
–Control: “micro-instructions” are necessary
Just scratching the surface of this work
For more, see the CCCP group webpage