University of Michigan Electrical Engineering and Computer Science

Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability
Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan
April 8, 2008

Introduction
Emerging applications have high performance, cost, and energy demands
– H.264, wireless, software radio, signal processing
– Gops required
– 200 mW power budget
Applications are dominated by tight loops processing large amounts of streaming data
[Figure: iPhone board]

Loop Accelerators
[Figure: a loop in C code mapped to hardware; unit labels: LD, +/-, *]

Hardware Implementations
Customization gets order-of-magnitude performance and efficiency wins
– Viterbi: 100x speedup vs. ARM9
[Figure: design space with axes "Efficiency, Performance" and "Flexibility"; labeled points: General Purpose Processors, DSPs, FPGAs, CGRAs, Multifunction Loop Accelerators, Loop Accelerators/ASICs]

What About Programmability?
Software changes – bug fixes, evolving standards
dct_8x8() from H.264 reference implementation: Version 13.0, Version 13.1, Version 13.2

Version 13.0 (the excerpt is cut off before the closing brace of the for loop):

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++)
    {
      i=pos_scan[coeff_ctr][0];
      j=pos_scan[coeff_ctr][1];
      run++;
      ilev=0;
      if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
      {
        MCcoeff = MC(coeff_ctr);
        runs[MCcoeff]++;
      }
      m7 = &curr_res[block_y + j][block_x];
      level = iabs (m7[i]);
      if (img->AdaptiveRounding)
      {
        fadjust8x8[j][block_x+i] = 0;
      }
      if (level != 0)
      {
        nonzero = TRUE;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
        {
          *coeff_cost += MAX_VALUE;
          img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
          img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
          ++scan_pos;
          runs[MCcoeff]=-1;
        }
        else
        {
          *coeff_cost += MAX_VALUE;
          ACLevel[scan_pos  ] = isignab(level,m7[i]);
          ACRun  [scan_pos++] = run;
          run=-1; // reset zero level counter
        }
        level = isignab(level, m7[i]);
        ilev = level;
      }

Versions 13.1 and 13.2 (the two excerpts shown are identical to each other; they differ from the 13.0 excerpt only in the first index of cofAC):

          img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
          img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];

Programmable Loop Accelerator
Reusable hardware → reduced NRE costs
Generalize accelerator without losing efficiency
[Figure: design space with axes "Efficiency, Performance" and "Flexibility"; labeled points: General Purpose Processors, DSPs, FPGAs, CGRAs, Multifunction Loop Accelerators, Programmable Loop Accelerators, Loop Accelerators/ASICs]

Flexible Accelerators
Generalize accelerator architecture
Map new loops to existing hardware
[Figure: Loop 1 passes through the Synthesis System to produce Hardware; Loop 2 passes through the Compiler to target the same Hardware]

Loop Accelerator Architecture
Hardware realization of modulo scheduled loop
Parameterized execution resources, storage, connectivity
[Figure: accelerator datapath; labels: FUs (+, &, MEM), Point-to-point Connections, Local Mem, FSM, Control signals, CRF, BR]

Programmable Accelerator Architecture
Generalize architectural features that limit programmability
~50% area overhead vs. non-programmable accelerator
[Figure: programmable datapath; labels: FUs (+/-, &/|, MEM), Point-to-point Connections, Bus, Local Mem, Control Memory, Control signals, CRF, BR, RR, Literals]

Mapping Loops onto Hardware

                Processor                Accelerator
  FUs           General-purpose          Customized
  Storage       Central register file    Distributed registers
  Connectivity  Homogeneous              Point-to-point

[Figure: processor (ALU, CRF) next to accelerator (LD, +/-, *)]

Scheduling Example
[Figure: schedule table with resource columns ADDER1, ADDER2, MEM along a Time axis, II=2; entries: LD, LD, LD, ?]

Modulo Scheduling for LAs
Large search space, few solutions
Op-centric approaches are unable to find solutions
Satisfiability Modulo Theories (SMT) formulation solves linear and SAT constraints simultaneously
[Figure: compilation flow; inputs: Loop, Machine description; stages: Move Insertion, SMT Scheduling, Register Allocation; output: Control Signals; feedback step: Increment II]

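The "try an II, increment on failure" loop in the flow above can be sketched in a few lines. This is only an illustration under stated assumptions: an exhaustive search stands in for the SMT solver, point-to-point connectivity and register allocation are not modeled, and every name here (`find_schedule`, `schedule`, the example ops and FUs) is hypothetical rather than taken from the talk.

```python
from itertools import product


def find_schedule(ops, edges, fus, ii, max_stage=2):
    """Exhaustively search for a modulo schedule at a fixed II.

    ops:   {op_name: (opcode, latency)}
    edges: [(producer, consumer, iteration_distance)]
    fus:   {fu_name: set of opcodes that FU supports}
    Returns {op_name: (fu, stage, slot)} or None if no schedule exists.
    """
    names = list(ops)
    # Candidate placements per op: any FU supporting its opcode,
    # any stage in 0..max_stage, any time slot in 0..ii-1.
    choices = [
        [(f, s, t)
         for f, caps in fus.items() if ops[n][0] in caps
         for s in range(max_stage + 1)
         for t in range(ii)]
        for n in names
    ]
    for combo in product(*choices):
        # Resource constraint: at most one op per (FU, time slot).
        if len({(f, t) for f, _, t in combo}) < len(combo):
            continue
        place = dict(zip(names, combo))
        start = {n: s * ii + t for n, (_, s, t) in place.items()}
        # Dependence constraint for every edge i -> j.
        if all(start[j] >= start[i] + ops[i][1] - d * ii
               for i, j, d in edges):
            return place
    return None


def schedule(ops, edges, fus, max_ii=8):
    """Try II = 1, 2, ... and return (ii, placement) for the first success."""
    for ii in range(1, max_ii + 1):
        place = find_schedule(ops, edges, fus, ii)
        if place is not None:
            return ii, place
    return None


# Hypothetical loop body: two loads feeding one add.
ops = {"ld1": ("LD", 2), "ld2": ("LD", 2), "add": ("ADD", 1)}
edges = [("ld1", "add", 0), ("ld2", "add", 0)]
fus = {"MEM": {"LD"}, "ADDER1": {"ADD"}}

ii, place = schedule(ops, edges, fus)
print(ii, place)
```

In this example the search fails at II = 1, because both loads need the single MEM unit in the same time slot, and succeeds at II = 2. The exhaustive search grows exponentially with the number of operations, which is one reason a solver-based formulation is attractive for realistic loops.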
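SMT Formulation
Boolean variables $X_{i,f,t}$ are true if operation $i$ is scheduled on FU $f$ at time slot $t$.
Integer variables $S_i$ represent the stage of operation $i$.
Dependence constraint for an edge $i \to j$ with latency $\mathrm{lat}(i)$ and iteration distance $\mathrm{dist}(i,j)$:
$\mathrm{sched\_time}(j) \ge \mathrm{sched\_time}(i) + \mathrm{lat}(i) - \mathrm{dist}(i,j) \cdot II$
Expressed over the scheduling variables:
$(X_{i,f_i,t_i} \land X_{j,f_j,t_j}) \rightarrow (S_j \cdot II + t_j \ge S_i \cdot II + t_i + \mathrm{lat}(i) - \mathrm{dist}(i,j) \cdot II)$
[Figure: edge from operation i to operation j, annotated with lat(i) and dist(i,j)]
More details in paper
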
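A worked instance may make the dependence constraint concrete. The numbers below are illustrative assumptions, not values from the talk: $II = 2$, $\mathrm{lat}(i) = 2$, and operation $i$ placed at stage $S_i = 0$, slot $t_i = 1$, so that $\mathrm{sched\_time}(i) = S_i \cdot II + t_i = 1$.

```latex
% Same-iteration dependence, dist(i,j) = 0:
\mathrm{sched\_time}(j) \;\ge\; 1 + 2 - 0 \cdot 2 \;=\; 3
% e.g. S_j = 1, t_j = 1 gives 1 \cdot 2 + 1 = 3, which satisfies the bound.

% Loop-carried dependence, dist(i,j) = 1:
\mathrm{sched\_time}(j) \;\ge\; 1 + 2 - 1 \cdot 2 \;=\; 1
% e.g. S_j = 0, t_j = 1 gives 0 \cdot 2 + 1 = 1, which satisfies the bound;
% S_j = 0, t_j = 0 gives 0, which violates it.
```

Each unit of iteration distance relaxes the bound by one full II, since the consumer reads a value produced that many iterations earlier.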
Measuring Programmability
How well can different loops be mapped onto the same hardware?
Performance matters – how much does II increase?
Need a set of loops with different degrees of similarity
[Figure: a new loop mapped onto existing hardware FUs, marked "?"]

Graph Perturbation
Synthetically generated graphs
More perturbations → less similar to original graph
Iteratively apply random transformations:
– Add edge between existing operations
– Add edge with new producer
– Add edge with new consumer
– Remove edge

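The four transformations listed above could be applied along the following lines. This is a minimal sketch, not the authors' generator: it assumes each step picks one of the four transformations uniformly at random, it does not assign opcodes to new operations, and it does not check whether an added edge creates a cycle. The function name `perturb` and the `new<k>` naming scheme are hypothetical.

```python
import random


def perturb(nodes, edges, steps, seed=0):
    """Apply `steps` random transformations to a dataflow graph.

    nodes: iterable of operation names (must be non-empty)
    edges: iterable of (producer, consumer) pairs
    Returns new (nodes, edges) sets; the inputs are not modified.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    nodes, edges = set(nodes), set(edges)
    if not nodes:
        raise ValueError("graph must contain at least one operation")
    fresh = 0  # counter used to name newly created operations

    def new_op():
        nonlocal fresh
        while f"new{fresh}" in nodes:  # avoid clashing with existing names
            fresh += 1
        name = f"new{fresh}"
        nodes.add(name)
        return name

    kinds = ["add_edge", "new_producer", "new_consumer", "remove_edge"]
    for _ in range(steps):
        kind = rng.choice(kinds)
        existing = sorted(nodes)  # sorted for a deterministic choice order
        if kind == "add_edge" and len(existing) >= 2:
            src, dst = rng.sample(existing, 2)
            edges.add((src, dst))
        elif kind == "new_producer":
            consumer = rng.choice(existing)
            edges.add((new_op(), consumer))
        elif kind == "new_consumer":
            producer = rng.choice(existing)
            edges.add((producer, new_op()))
        elif kind == "remove_edge" and edges:
            edges.remove(rng.choice(sorted(edges)))
    return nodes, edges


base_nodes = {"ld", "add", "mul", "st"}
base_edges = {("ld", "add"), ("add", "mul"), ("mul", "st")}
print(perturb(base_nodes, base_edges, steps=5, seed=1))
```

Raising `steps` yields graphs that drift further from the original, which matches the slide's point that more perturbations mean less similarity.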
Results – Perturbed Graphs
[Figure: average II increase for perturbed graphs; benchmark groups: MPEG4, Signal processing, Image, Math; additional label: Base II]

Results – Restricted Datapath

Conclusion
Increase flexibility of customized hardware without sacrificing performance, efficiency
Successfully map loops to heterogeneous hardware
Compile times of 5 minutes – 1 hour
Software changing faster than hardware → patchable ASIC

Questions?

Results – Cross Compilation