University of Michigan Electrical Engineering and Computer Science
Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators
Kevin Fan (1), Manjunath Kudlur (2), Ganesh Dasika, Scott Mahlke
University of Michigan, Advanced Computer Architecture Laboratory
(1) Parakinetics, Inc.
(2) Nvidia

Introduction
Emerging applications have high performance, cost, and energy demands
–High-quality video
–Flash animation
Clear need for application- and domain-specific hardware
[Figure: MPEG-4 decoder; axes are frames/sec and cell-phone battery life (hours), with the 24 fps minimum marked]

Flexibility
Multiple instances of the same application
–E.g., multiple video codecs (Xvid, DivX, FFMpeg)
Software algorithms change over time
NRE (non-recurring engineering cost)
Time-to-market

ASIC Alternatives
[Figure: design space with axes "Efficiency, Performance" and "Flexibility"; design points shown:]
–Loop Accelerators, ASICs
–???: highly efficient, some programmability
–Domain-specific accelerators
–DSPs
–General Purpose Processors
–FPGAs

How much post-programmability is really required?
mdct.c in faad2

Version 1.39:
    for(k=0; k<N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img*sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
    }

Version 1.40 repeats the same loop and adds one conditional statement:
    if(b_scale) {
        Z1[k][0] = Z1[k][0] * scale;
    }

How much post-programmability is really required?
mdct.c in faad2

Version 1.33:
    for(k=0; k<N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[ n] = RE(x);
        X_out[N n] = -IM(x);
        X_out[N2 + n] = IM(x);
        X_out[N n] = -RE(x);
    }

Version 1.34 (same loop, with the sign of every output flipped):
    for(k=0; k<N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[ n] = -RE(x);
        X_out[N n] = IM(x);
        X_out[N2 + n] = -IM(x);
        X_out[N n] = RE(x);
    }

How much post-programmability is really required?
H.264 reference implementation

Version 13.0:
    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i = pos_scan[coeff_ctr][0];
        j = pos_scan[coeff_ctr][1];
        run++;
        ilev = 0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs(m7[i]);
        if (img->AdaptiveRounding) {
            fadjust8x8[j][block_x+i] = 0;
        }
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost += MAX_VALUE;
                img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
                img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff] = -1;
            } else {
                *coeff_cost += MAX_VALUE;
                ACLevel[scan_pos  ] = isignab(level,m7[i]);
                ACRun  [scan_pos++] = run;
                run = -1; // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }
        ...
    }

Versions 13.1 and 13.2 show the same loop; the only difference from 13.0 in this excerpt is the first index of the two img->cofAC assignments, [pl_off] instead of [b8+pl_off]:
    img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
    img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];

Mostly minor changes to loops
–Bug fixes
–Revisions
Possible to design custom HW with minor programmability extensions

Programmable Loop Accelerator
Generalize accelerator without losing efficiency
[Figure: the same design space (axes "Efficiency, Performance" and "Flexibility"), with Programmable Loop Accelerators placed at the "???" point, alongside Loop Accelerators/ASICs, Domain-specific accelerators, DSPs, General Purpose Processors and FPGAs]

Designing Loop Accelerators
[Figure: a loop in C code is turned into hardware: function units (+, &, *, <<, MEM) with local memories, joined by point-to-point connections, plus BR and CRF blocks]

Loop Accelerator Architecture
[Figure: LA datapath: function units (+, &, MEM) with Local Mem, point-to-point connections, CRF, BR, and an FSM that drives the control signals]
Hardware realization of modulo scheduled loop
Parameterized execution resources, storage, connectivity
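
In a modulo-scheduled loop, a new iteration starts every II cycles (the initiation interval), so each function unit is reused at the same slot modulo II. As a rough illustration only, and not the synthesis system's actual algorithm, the sketch below computes the textbook resource-constrained lower bound on II and the modulo slot of an operation. The names `res_mii` and `modulo_slot` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Resource-constrained lower bound on the initiation interval (II):
 * for each function-unit class, ceil(ops needing it / units available),
 * maximized over all classes. Hypothetical helper for illustration. */
int res_mii(const int *ops_per_class, const int *units_per_class,
            size_t num_classes)
{
    int mii = 1;
    for (size_t c = 0; c < num_classes; c++) {
        assert(units_per_class[c] > 0);
        /* ceiling division without floating point */
        int need = (ops_per_class[c] + units_per_class[c] - 1)
                   / units_per_class[c];
        if (need > mii)
            mii = need;
    }
    return mii;
}

/* An operation issued at absolute time sched_time occupies its function
 * unit in slot (sched_time mod II) of every II-cycle window. */
int modulo_slot(int sched_time, int ii)
{
    assert(ii > 0 && sched_time >= 0);
    return sched_time % ii;
}
```

For example, two memory operations sharing one memory unit force II to be at least 2, however many adders are available.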

LA Scheduling
[Figure: modulo schedule table (columns Time, FU0, FU1, FU2, FU3) for the FIR loop kernel on LA hardware with Mem, X, + and + units]
Mapping a different loop (LD, x, x, -) onto the same hardware fails (????):
–No subtract
–Paths missing
–Mult result has longer lifetime
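
The slides do not show the FIR source itself. As an assumed stand-in, a minimal FIR loop in C looks like the sketch below; its inner body is a load, a multiply and an accumulate, which is the kind of kernel the LA above is built around. The function name `fir` and the indexing convention are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

/* FIR filter sketch: y[n] = sum over k of coeff[k] * x[n + k].
 * The caller must supply at least num_out + num_taps - 1 input samples.
 * Each inner iteration is one load, one multiply and one add. */
void fir(const int *x, const int *coeff, int *y,
         size_t num_out, size_t num_taps)
{
    assert(x != NULL && coeff != NULL && y != NULL);
    for (size_t n = 0; n < num_out; n++) {
        int acc = 0;
        for (size_t k = 0; k < num_taps; k++)
            acc += coeff[k] * x[n + k];   /* LD, multiply, accumulate */
        y[n] = acc;
    }
}
```

A loop that needs a subtract, or that keeps a product alive for more cycles than the hardwired datapath allows, has no legal mapping onto hardware generated for this kernel alone.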

LA Datapath Restrictions
[Figure: axes are "Slow-Down" and "Graph Difference"]

Programmable Loop-Accelerator Architecture
[Figure: PLA datapath with function units +/-, &/| and MEM, Local Mem, RR, Literals, Bus, Control Memory, control signals, CRF and BR, shown against the LA's +, &, SRF and FSM]

               LA                       PLA
Functionality  Custom FU set            Generalized FUs + MOVs
Connectivity   Point-to-point           Bus + Port-swapping
Storage        Limited size, no addr.   Rotating Reg. Files
Control        Hardwired Control        Lit. Reg. File + Control Mem
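
The slides name rotating register files but do not describe their design. The sketch below shows the generic technique under assumed parameters: a logical register number is offset by a base that moves once per loop iteration, so a value written as register r in one iteration is read as register r+1 in the next, and long-lived values need no copy operations. `RRF_SIZE` and all function names are hypothetical.

```c
#include <assert.h>

#define RRF_SIZE 8   /* hypothetical number of registers */

typedef struct {
    int regs[RRF_SIZE];
    int base;        /* rotating base, moved once per loop iteration */
} rotating_rf;

/* Map a logical register number to a physical register index. */
int rrf_phys(const rotating_rf *rf, int logical)
{
    assert(logical >= 0 && logical < RRF_SIZE);
    return (rf->base + logical) % RRF_SIZE;
}

void rrf_write(rotating_rf *rf, int logical, int value)
{
    rf->regs[rrf_phys(rf, logical)] = value;
}

int rrf_read(const rotating_rf *rf, int logical)
{
    return rf->regs[rrf_phys(rf, logical)];
}

/* Rotate at the iteration boundary: the base steps back by one, so the
 * value previously named logical r is now named logical r + 1. */
void rrf_rotate(rotating_rf *rf)
{
    rf->base = (rf->base + RRF_SIZE - 1) % RRF_SIZE;
}
```

With this scheme each iteration can write the same logical register without overwriting the value produced by the previous iteration.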

Experimental Setup
Wide variety of benchmarks
–DSP
–Media
–Linear Algebra
Baseline LAs:
–Used LA synthesis system to generate HDL
– um
Comparisons:
–PLAs ( um)
–OR-1200 ( um)

Area
[Figure: area results, including OR-1200]

Power Consumption
[Figure: power consumption results, including OR-1200]

Power Consumption
[Figure: power consumption results, including OR-1200 and "OR-1200 equiv"]

Power Breakdown
[Figure: power breakdown]

Scheduling for PLAs
[Figure: Loop 1 goes through the Synthesis System to produce the Hardware; Loop 2 goes through the Compiler + SMT-solver and is mapped onto that Hardware]
Generalize accelerator architecture
Map new loops to existing hardware

PLA Scheduling
[Figure: LA Hardware (Mem, X, +, +) next to PLA Hardware (Mem, X, +/-, Bus)]
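
PLA Scheduling
[Figure: PLA Hardware (Mem, X, +/-, Bus) and a modulo schedule table with columns Time, FU0, FU1, FU2, FU3, Bus; the loop (LD, x, x, -) is scheduled with MOV operations routed over the Bus]
Dependence constraint for an edge from operation i to operation j:
$$\mathit{sched\_time}(j) \ge \mathit{sched\_time}(i) + \mathit{lat}(i) - \mathit{dist}(i,j) \times II$$
SMT formulation of the same constraint:
$$\left( X_{i,f_i,t_i} \wedge X_{j,f_j,t_j} \right) \Rightarrow \left( S_j \times II + t_j \ge S_i \times II + t_i + \mathit{lat}(i) - \mathit{dist}(i,j) \times II \right)$$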
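
The dependence inequality above can be checked directly for a candidate schedule. The sketch below does only that; it is not the deck's SMT formulation, which also has to choose function units and bus slots. The `dep_edge` type and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int src;    /* producer operation i */
    int dst;    /* consumer operation j */
    int lat;    /* latency of i, in cycles */
    int dist;   /* iteration distance of the dependence */
} dep_edge;

/* Return 1 if every edge satisfies
 *   sched_time[dst] >= sched_time[src] + lat - dist * ii
 * and 0 as soon as one edge violates it. */
int schedule_respects_deps(const int *sched_time, const dep_edge *edges,
                           size_t num_edges, int ii)
{
    assert(ii > 0);
    for (size_t e = 0; e < num_edges; e++) {
        int earliest = sched_time[edges[e].src] + edges[e].lat
                       - edges[e].dist * ii;
        if (sched_time[edges[e].dst] < earliest)
            return 0;
    }
    return 1;
}
```

A loop-carried edge (dist of 1 or more) relaxes the bound by whole multiples of II, which is why a larger II makes recurrences easier to satisfy.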

Programmability
[Figure: two groups of loops, labeled "Small, with simple communication" and "Small, with complex communication"]

Power Efficiency
[Figure: Performance (MIPS) against Power (mW)]
LA: 105 MIPS/mW
PLA: 24 MIPS/mW
Tensilica Diamond Core: 12 MIPS/mW
TI C6x: 5 MIPS/mW
ARM11: 3 MIPS/mW
OR1K: 2 MIPS/mW
Itanium2: 0.08 MIPS/mW
Faster than ARM11 AND 8x more efficient!
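
The "8x" claim is the ratio of the two MIPS/mW figures listed above (24 / 3). The sketch below reproduces that arithmetic for any listed design point; the table holds only the values from the slide, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    double mips_per_mw;
} design_point;

/* Power-efficiency figures as listed on the slide. */
static const design_point points[] = {
    {"LA", 105.0}, {"PLA", 24.0}, {"Tensilica Diamond Core", 12.0},
    {"TI C6x", 5.0}, {"ARM11", 3.0}, {"OR1K", 2.0}, {"Itanium2", 0.08},
};

/* Look up a design point by name; returns -1.0 if it is not listed. */
double efficiency_of(const char *name)
{
    for (size_t i = 0; i < sizeof points / sizeof points[0]; i++)
        if (strcmp(points[i].name, name) == 0)
            return points[i].mips_per_mw;
    return -1.0;
}

/* How many times more power-efficient the PLA is than the named design. */
double pla_advantage_over(const char *name)
{
    double other = efficiency_of(name);
    assert(other > 0.0);   /* name must be one of the listed points */
    return efficiency_of("PLA") / other;
}
```

The same ratio against the hardwired LA goes the other way: the LA figure is 105 / 24, a little over 4 times the PLA's, which is the efficiency given up for programmability.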

Conclusion
Programmable loop accelerators retain efficiency while being programmable
Loop accelerator datapath generalized in a cost-effective way
Significant benefits over GPP:
–4x-34x improved power efficiency
–30x improved area efficiency

Questions?