University of Michigan Electrical Engineering and Computer Science Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators.

Kevin Fan (1), Manjunath Kudlur (2), Ganesh Dasika, Scott Mahlke
University of Michigan, Advanced Computer Architecture Laboratory
(1) Parakinetics, Inc.
(2) Nvidia

Introduction
Emerging applications have high performance, cost, and energy demands
–High-quality video
–Flash animation
Clear need for application and domain-specific hardware
[Figure: MPEG-4 decoder; axes: frames/sec (24 fps min.) and cell-phone battery life (hours)]

Flexibility
Multiple instances of the same application
–E.g., multiple video codecs
Software algorithms change over time
NRE
Time-to-market
[Figure: Xvid, DivX, FFMpeg]

ASIC Alternatives
[Figure: Efficiency, Performance vs. Flexibility. Labels: Loop Accelerators, ASICs; ???; Domain-specific accelerators; DSPs; General Purpose Processors; FPGAs]
???: Highly efficient, some programmability

How much post-programmability is really required?
mdct.c in faad2

Version 1.39:

    for(k=0; k<N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img*sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
    }

Version 1.40:

    for(k=0; k<N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img*sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
    }
    if(b_scale) {
        Z1[k][0] = Z1[k][0] * scale;
    }

How much post-programmability is really required?
mdct.c in faad2

Version 1.33:

    for(k=0; k<N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[         n] =  RE(x);
        X_out[N2 - 1 - n] = -IM(x);
        X_out[N2 +     n] =  IM(x);
        X_out[N  - 1 - n] = -RE(x);
    }

Version 1.34:

    for(k=0; k<N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[         n] = -RE(x);
        X_out[N2 - 1 - n] =  IM(x);
        X_out[N2 +     n] = -IM(x);
        X_out[N  - 1 - n] =  RE(x);
    }

How much post-programmability is really required?
H.264 reference implementation

Version 13.0:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
      i = pos_scan[coeff_ctr][0];
      j = pos_scan[coeff_ctr][1];
      run++;
      ilev = 0;
      if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
        MCcoeff = MC(coeff_ctr);
        runs[MCcoeff]++;
      }
      m7 = &curr_res[block_y + j][block_x];
      level = iabs (m7[i]);
      if (img->AdaptiveRounding) {
        fadjust8x8[j][block_x+i] = 0;
      }
      if (level != 0) {
        nonzero = TRUE;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
          *coeff_cost += MAX_VALUE;
          img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
          img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
          ++scan_pos;
          runs[MCcoeff] = -1;
        } else {
          *coeff_cost += MAX_VALUE;
          ACLevel[scan_pos  ] = isignab(level,m7[i]);
          ACRun  [scan_pos++] = run;
          run = -1; // reset zero level counter
        }
        level = isignab(level, m7[i]);
        ilev = level;
      }

Version 13.1:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
      i = pos_scan[coeff_ctr][0];
      j = pos_scan[coeff_ctr][1];
      run++;
      ilev = 0;
      if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
        MCcoeff = MC(coeff_ctr);
        runs[MCcoeff]++;
      }
      m7 = &curr_res[block_y + j][block_x];
      level = iabs (m7[i]);
      if (img->AdaptiveRounding) {
        fadjust8x8[j][block_x+i] = 0;
      }
      if (level != 0) {
        nonzero = TRUE;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
          *coeff_cost += MAX_VALUE;
          img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
          img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
          ++scan_pos;
          runs[MCcoeff] = -1;
        } else {
          *coeff_cost += MAX_VALUE;
          ACLevel[scan_pos  ] = isignab(level,m7[i]);
          ACRun  [scan_pos++] = run;
          run = -1; // reset zero level counter
        }
        level = isignab(level, m7[i]);
        ilev = level;
      }

Version 13.2:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
      i = pos_scan[coeff_ctr][0];
      j = pos_scan[coeff_ctr][1];
      run++;
      ilev = 0;
      if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
        MCcoeff = MC(coeff_ctr);
        runs[MCcoeff]++;
      }
      m7 = &curr_res[block_y + j][block_x];
      level = iabs (m7[i]);
      if (img->AdaptiveRounding) {
        fadjust8x8[j][block_x+i] = 0;
      }
      if (level != 0) {
        nonzero = TRUE;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
          *coeff_cost += MAX_VALUE;
          img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
          img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
          ++scan_pos;
          runs[MCcoeff] = -1;
        } else {
          *coeff_cost += MAX_VALUE;
          ACLevel[scan_pos  ] = isignab(level,m7[i]);
          ACRun  [scan_pos++] = run;
          run = -1; // reset zero level counter
        }
        level = isignab(level, m7[i]);
        ilev = level;
      }

Mostly minor changes to loops
–Bug fixes
–Revisions
Possible to design custom HW with minor programmability extensions

Programmable Loop Accelerator
Generalize accelerator without losing efficiency
[Figure: Efficiency, Performance vs. Flexibility. Labels: Loop Accelerators, ASICs; Programmable Loop Accelerators; ???; Domain-specific accelerators; DSPs; General Purpose Processors; FPGAs]

Designing Loop Accelerators
[Figure: a C code loop is turned into hardware: function units (+, &, *, <<, MEM) with local memories, point-to-point connections, BR, and CRF]

Loop Accelerator Architecture
Hardware realization of modulo scheduled loop
Parameterized execution resources, storage, connectivity
[Figure: function units (+, &, MEM) with local memories and point-to-point connections; FSM issuing control signals; CRF; BR]

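LA Scheduling
FIR loop kernel (operations 1 to 6) scheduled on the LA function units (Mem, X, +, +). Table columns on the slide: Time, FU0, FU1, FU2, FU3.

Time 0: operations 1, 4, 5
Time 1: operations 2, 3, 6

A new loop cannot be mapped onto this hardware (????):
–No subtract
–Paths missing
–Mult result has longer lifetime
[Figure: dataflow graphs of the FIR kernel and of a new loop that contains a subtract]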
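The two-row schedule on this slide can be read as a modulo reservation table: with an initiation interval (II) of 2, each function unit may host at most one operation per slot, where slot = time mod II. The following is a minimal sketch of that resource check, not the authors' synthesis system. The slide does not say which FU each operation occupies, so the FU indices in `fir_schedule` are assumptions made for illustration, as are all type and function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUS 8   /* illustrative upper bounds, not from the slides */
#define MAX_II  16

/* One scheduled operation: which function unit it uses, and when. */
typedef struct {
    int id;    /* operation number as labelled on the slide */
    int fu;    /* function unit index (assumed assignment) */
    int time;  /* schedule time in cycles */
} ScheduledOp;

/* Returns true if no two ops need the same FU in the same slot (time mod II). */
bool mrt_is_conflict_free(const ScheduledOp *ops, size_t n, int num_fus, int ii)
{
    assert(num_fus > 0 && num_fus <= MAX_FUS);
    assert(ii > 0 && ii <= MAX_II);

    bool busy[MAX_FUS][MAX_II] = {{false}};
    for (size_t k = 0; k < n; k++) {
        assert(ops[k].fu >= 0 && ops[k].fu < num_fus);
        assert(ops[k].time >= 0);
        int slot = ops[k].time % ii;
        if (busy[ops[k].fu][slot]) {
            return false;  /* resource conflict in the steady-state kernel */
        }
        busy[ops[k].fu][slot] = true;
    }
    return true;
}

/* The six-op FIR kernel with II = 2; the FU column is hypothetical. */
static const ScheduledOp fir_schedule[] = {
    {1, 0, 0}, {4, 1, 0}, {5, 2, 0},
    {2, 0, 1}, {3, 1, 1}, {6, 2, 1},
};
```

Calling `mrt_is_conflict_free(fir_schedule, 6, 4, 2)` accepts the schedule, while two operations placed on the same FU at times 0 and 2 are rejected because both fall in slot 0.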

LA Datapath Restrictions
[Figure: slow-down vs. graph difference]

Programmable Loop-Accelerator Architecture

                LA                       PLA
Functionality   Custom FU set            Generalized FUs + MOVs
Storage         Limited size, no addr.   Rotating Reg. Files
Connectivity    Point-to-point           Bus + Port-swapping
Control         Hardwired Control        Lit. Reg. File + Control Mem

[Figure: generalized function units (+/-, &/|, MEM) with local memories, point-to-point connections plus a bus, RR, literals, control memory issuing control signals, CRF, BR; the LA's +, &, SRF, and FSM shown for comparison]
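The slide names rotating register files as the PLA's storage but does not describe how they are addressed. The sketch below assumes the conventional scheme used in modulo-scheduled machines: a logical register number is offset by a rotating base, and the base is decremented once per loop iteration, so a value written to logical register r in one iteration is read as r+1 in the next. The register count and all names here are hypothetical.

```c
#include <assert.h>

#define RRF_SIZE 8  /* hypothetical number of registers */

typedef struct {
    int regs[RRF_SIZE];
    int base;  /* rotating base, decremented once per loop iteration */
} RotatingRegFile;

/* Map a logical register number to a physical one. */
static int rrf_physical(const RotatingRegFile *rf, int logical)
{
    assert(logical >= 0 && logical < RRF_SIZE);
    return (logical + rf->base) % RRF_SIZE;
}

void rrf_init(RotatingRegFile *rf)
{
    for (int i = 0; i < RRF_SIZE; i++) {
        rf->regs[i] = 0;
    }
    rf->base = 0;
}

void rrf_write(RotatingRegFile *rf, int logical, int value)
{
    rf->regs[rrf_physical(rf, logical)] = value;
}

int rrf_read(const RotatingRegFile *rf, int logical)
{
    return rf->regs[rrf_physical(rf, logical)];
}

/* Called at each iteration boundary: every value "moves up" one logical name. */
void rrf_rotate(RotatingRegFile *rf)
{
    rf->base = (rf->base + RRF_SIZE - 1) % RRF_SIZE;
}
```

Under this assumption, a value whose lifetime spans several iterations (such as the longer-lived multiply result mentioned on the LA Scheduling slide) is not overwritten by the next iteration's write to the same logical register.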

Experimental Setup
Wide variety of benchmarks
–DSP
–Media
–Linear Algebra
Baseline LAs:
–Used LA synthesis system to generate HDL
–200 MHz @ 0.13 µm
Comparisons:
–PLAs (200 MHz @ 0.13 µm)
–OR-1200 (300 MHz @ 0.13 µm)

Area
[Figure: area comparison, including OR-1200]

Power Consumption
[Figure: power consumption comparison, including OR-1200]

Power Consumption
[Figure: power consumption comparison, including OR-1200 and OR-1200 equiv]

Power Breakdown
[Figure: power breakdown]

Scheduling for PLAs
Generalize accelerator architecture
Map new loops to existing hardware
[Figure: Loop 1 goes through the synthesis system to produce hardware; Loop 2 goes through the compiler + SMT-solver to run on that hardware]

PLA Scheduling
[Figure: LA hardware (Mem, X, +, +) next to PLA hardware (Mem, X, +/-, Bus)]

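PLA Scheduling
Schedule on the PLA hardware (Mem, X, +/-, Bus). Table columns on the slide: Time, FU0, FU1, FU2, FU3, Bus.

Time 0: 1, MOV, 5, X
Time 1: 3, 2, 4, 6

Dependence constraint:

$$\mathit{sched\_time}(j) \ge \mathit{sched\_time}(i) + \mathit{lat}(i) - \mathit{dist}(i,j) \times II$$

SMT formulation:

$$(X_{i,f_i,t_i} \land X_{j,f_j,t_j}) \rightarrow (S_j \times II + t_j \ge S_i \times II + t_i + \mathit{lat}(i) - \mathit{dist}(i,j) \times II)$$

[Figure: dataflow graph of the loop (+, +, LD, x, x, -) with operations 1 to 6 and an inserted MOV routed over the bus]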
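The dependence constraint on this slide says that a consumer j may start no earlier than its producer i finishes, relaxed by II cycles for each iteration the dependence crosses. The following is a minimal sketch that checks a candidate schedule against that inequality. It is a plain checker written for illustration, not the SMT encoding the authors use, and the `DepEdge` type and function name are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A dependence from producer op `src` (i) to consumer op `dst` (j). */
typedef struct {
    int src;   /* index of op i */
    int dst;   /* index of op j */
    int dist;  /* dist(i,j): number of loop iterations the dependence crosses */
} DepEdge;

/*
 * Returns true if every edge satisfies
 *   sched_time(j) >= sched_time(i) + lat(i) - dist(i,j) * II.
 */
bool deps_satisfied(const int *sched_time, const int *lat, size_t num_ops,
                    const DepEdge *edges, size_t num_edges, int ii)
{
    assert(ii > 0);
    for (size_t e = 0; e < num_edges; e++) {
        int i = edges[e].src;
        int j = edges[e].dst;
        assert(i >= 0 && (size_t)i < num_ops);
        assert(j >= 0 && (size_t)j < num_ops);

        int earliest = sched_time[i] + lat[i] - edges[e].dist * ii;
        if (sched_time[j] < earliest) {
            return false;  /* consumer would read the value too early */
        }
    }
    return true;
}
```

For example, with schedule times {0, 1, 3}, latencies {1, 2, 1}, and a loop-carried edge from op 2 back to op 0 with distance 1, the schedule is valid at II = 4 but not at II = 3, because the recurrence then requires op 0 to start at cycle 1 or later.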

Programmability
[Figure: two panels, "Small, with simple communication" and "Small, with complex communication"]

Power Efficiency
LA: 105 MIPS/mW
PLA: 24 MIPS/mW
OR1K: 2 MIPS/mW
TI C6x: 5 MIPS/mW
ARM11: 3 MIPS/mW
Itanium2: 0.08 MIPS/mW
Tensilica Diamond Core: 12 MIPS/mW
Faster than ARM11 AND 8x more efficient!
[Figure: performance (MIPS) vs. power (mW)]
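The "8x more efficient" remark follows directly from the MIPS/mW figures on this slide: 24 MIPS/mW for the PLA against 3 MIPS/mW for the ARM11. The following small sketch makes that arithmetic explicit; the numbers are copied from the slide, while the `Design` type and helper name are illustrative.

```c
#include <assert.h>

/* Power efficiency of one design, in MIPS per milliwatt (values from the slide). */
typedef struct {
    const char *name;
    double mips_per_mw;
} Design;

static const Design designs[] = {
    {"LA", 105.0},
    {"PLA", 24.0},
    {"OR1K", 2.0},
    {"TI C6x", 5.0},
    {"ARM11", 3.0},
    {"Itanium2", 0.08},
    {"Tensilica Diamond Core", 12.0},
};

/* How many times more efficient design a is than design b. */
double efficiency_ratio(double a_mips_per_mw, double b_mips_per_mw)
{
    assert(b_mips_per_mw > 0.0);
    return a_mips_per_mw / b_mips_per_mw;
}
```

With these figures, the PLA comes out 8 times as efficient as the ARM11 and 12 times as efficient as the OR1K, while the hardwired LA remains about 4.4 times as efficient as the PLA, which is the price of programmability in this comparison.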

Conclusion
Programmable loop accelerators retain efficiency while being programmable
Loop accelerator datapath generalized in a cost-effective way
Significant benefits over GPP:
–4x-34x improved power efficiency
–30x improved area efficiency

Questions?
http://cccp.eecs.umich.edu
