Published by Claude Byrd. Modified over 9 years ago.
1
University of Michigan Electrical Engineering and Computer Science
Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators
Kevin Fan (1), Manjunath Kudlur (2), Ganesh Dasika, Scott Mahlke
University of Michigan, Advanced Computer Architecture Laboratory
(1) Parakinetics, Inc. (2) Nvidia
2
Introduction
Emerging applications have high performance, cost, and energy demands
–High-quality video
–Flash animation
Clear need for application- and domain-specific hardware
[Charts: MPEG-4 decoder frames/sec (24 fps min.); cell-phone battery life (hours)]
3
Flexibility
Multiple instances of the same application
–E.g., multiple video codecs (Xvid, DivX, FFmpeg)
Software algorithms change over time
NRE
Time-to-market
4
ASIC Alternatives
[Figure: design space plotted as efficiency/performance against flexibility]
Design points shown:
–FPGAs
–General Purpose Processors
–DSPs
–Domain-specific accelerators
–Loop Accelerators, ASICs
–???: highly efficient, some programmability
5
How much post-programmability is really required?
mdct.c in faad2

Version 1.39:

    for (k = 0; k < N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
    }

Version 1.40:

    for (k = 0; k < N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
        /* the only change from 1.39: optional scaling of the result */
        if (b_scale) {
            Z1[k][0] = Z1[k][0] * scale;
        }
    }
6
How much post-programmability is really required?
mdct.c in faad2

Version 1.33:

    for (k = 0; k < N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[         n] =  RE(x);
        X_out[N2 - 1 - n] = -IM(x);
        X_out[N2 +     n] =  IM(x);
        X_out[N  - 1 - n] = -RE(x);
    }

Version 1.34 (the sign of every output is flipped):

    for (k = 0; k < N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[         n] = -RE(x);
        X_out[N2 - 1 - n] =  IM(x);
        X_out[N2 +     n] = -IM(x);
        X_out[N  - 1 - n] =  RE(x);
    }
7
How much post-programmability is really required?
H.264 reference implementation

Version 13.0:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i = pos_scan[coeff_ctr][0];
        j = pos_scan[coeff_ctr][1];
        run++;
        ilev = 0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs(m7[i]);
        if (img->AdaptiveRounding) {
            fadjust8x8[j][block_x + i] = 0;
        }
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost += MAX_VALUE;
                img->cofAC[b8 + pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
                img->cofAC[b8 + pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff] = -1;
            } else {
                *coeff_cost += MAX_VALUE;
                ACLevel[scan_pos] = isignab(level, m7[i]);
                ACRun[scan_pos++] = run;
                run = -1; // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }
        ...
    }

Versions 13.1 and 13.2: the same loop as transcribed, except that the two cofAC writes index with [pl_off] instead of [b8 + pl_off]:

    img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
    img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];

Mostly minor changes to loops
–Bug fixes
–Revisions
Possible to design custom HW with minor programmability extensions
8
Programmable Loop Accelerator
Generalize accelerator without losing efficiency
[Figure: design space plotted as efficiency/performance against flexibility]
Design points shown:
–FPGAs
–General Purpose Processors
–DSPs
–Domain-specific accelerators
–Loop Accelerators, ASICs
–Programmable Loop Accelerators (at the point marked ???)
9
Designing Loop Accelerators
C code, then loop, then hardware
[Figure: synthesized datapath with point-to-point connections, BR, CRF, function units (+, &, *, <<, MEM), and local memories]
10
Loop Accelerator Architecture
Hardware realization of a modulo-scheduled loop
Parameterized execution resources, storage, connectivity
[Figure: function units (+, &, MEM) joined by point-to-point connections, local memory, CRF, BR, and an FSM generating the control signals]
11
LA Scheduling
[Figure: FIR loop kernel dataflow graph (ops 1-6: LD, x, +) mapped onto LA hardware (Mem, X, +, +)]
Schedule over FU0-FU3 (column alignment lost in extraction):
–Time 0: ops 1, 4, 5
–Time 1: ops 2, 3, 6
Second loop (+, +, LD, x, x, -) on the same hardware: ????
–No subtract
–Paths missing
–Mult result has longer lifetime
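The schedule table on this slide is a modulo reservation table: each function unit can hold one operation per time slot, and slots repeat every II cycles (here II = 2). The following is a minimal sketch of that bookkeeping, not the authors' synthesis system; the type `Mrt`, the functions `mrt_init` and `mrt_reserve`, and the bounds `MAX_II` and `MAX_FU` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_II 8   /* hypothetical upper bound on the initiation interval */
#define MAX_FU 8   /* hypothetical upper bound on the number of function units */

/* Modulo reservation table: one cell per (time mod II, function unit). */
typedef struct {
    int ii;                  /* initiation interval */
    int num_fu;              /* number of function units */
    int op[MAX_II][MAX_FU];  /* id of the op occupying the cell, or -1 if free */
} Mrt;

/* Mark every cell as free. */
void mrt_init(Mrt *m, int ii, int num_fu) {
    assert(ii > 0 && ii <= MAX_II && num_fu > 0 && num_fu <= MAX_FU);
    m->ii = ii;
    m->num_fu = num_fu;
    for (int t = 0; t < ii; t++)
        for (int f = 0; f < num_fu; f++)
            m->op[t][f] = -1;
}

/* Try to place `op` on unit `fu` at absolute time `time`.
 * Returns false if the cell at (time mod II, fu) is already taken. */
bool mrt_reserve(Mrt *m, int op, int fu, int time) {
    assert(fu >= 0 && fu < m->num_fu && time >= 0);
    int slot = time % m->ii;
    if (m->op[slot][fu] != -1)
        return false;
    m->op[slot][fu] = op;
    return true;
}
```

With II = 2, an operation placed on FU0 at time 2 collides with one already placed on FU0 at time 0, because both fall in slot 0 of the repeating kernel.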
12
LA Datapath Restrictions
[Chart: slow-down versus graph difference]
13
Programmable Loop-Accelerator Architecture
[Figure: generalized function units (+/-, &/|, MEM) connected by a bus, with local memory, RR, literals, CRF, BR, and a control memory generating the control signals; the LA counterparts shown are +, &, SRF, and FSM]

                 LA                       PLA
Functionality    Custom FU set            Generalized FUs + MOVs
Connectivity     Point-to-point           Bus + port-swapping
Storage          Limited size, no addr.   Rotating reg. files
Control          Hardwired control        Lit. reg. file + control mem
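The slide lists rotating register files as the PLA's storage generalization. A minimal sketch of the usual rotating-register convention follows: a base offset is decremented once per loop iteration, so a value written to logical register r in one iteration is read as logical register r + 1 in the next. Whether the PLA uses exactly this convention is an assumption, and `RotRegFile`, `RRF_SIZE`, and the `rrf_*` functions are hypothetical names.

```c
#include <assert.h>

#define RRF_SIZE 8  /* hypothetical register file size */

typedef struct {
    int regs[RRF_SIZE];
    int base;  /* rotating base, decremented (mod RRF_SIZE) each iteration */
} RotRegFile;

/* Map a logical register number to its current physical index. */
static int rrf_phys(const RotRegFile *rf, int logical) {
    assert(logical >= 0 && logical < RRF_SIZE);
    return (logical + rf->base) % RRF_SIZE;
}

/* Clear all registers and reset the rotating base. */
void rrf_init(RotRegFile *rf) {
    for (int r = 0; r < RRF_SIZE; r++)
        rf->regs[r] = 0;
    rf->base = 0;
}

void rrf_write(RotRegFile *rf, int logical, int value) {
    rf->regs[rrf_phys(rf, logical)] = value;
}

int rrf_read(const RotRegFile *rf, int logical) {
    return rf->regs[rrf_phys(rf, logical)];
}

/* Advance to the next loop iteration: every live value shifts up by one
 * logical register number without any data being copied. */
void rrf_rotate(RotRegFile *rf) {
    rf->base = (rf->base + RRF_SIZE - 1) % RRF_SIZE;
}
```

This is what lets values with long lifetimes, such as the multiply result noted on the LA Scheduling slide, survive across overlapped iterations without each iteration overwriting the previous one's copy.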
14
Experimental Setup
Wide variety of benchmarks
–DSP
–Media
–Linear Algebra
Baseline LAs:
–Used LA synthesis system to generate HDL
–200 MHz @ 0.13 µm
Comparisons:
–PLAs (200 MHz @ 0.13 µm)
–OR-1200 (300 MHz @ 0.13 µm)
15
Area
[Chart: area, with OR-1200 shown for comparison]
16
Power Consumption
[Chart: power consumption, with OR-1200 shown for comparison]
17
Power Consumption
[Chart: power consumption, with OR-1200 and OR-1200 equiv shown for comparison]
18
Power Breakdown
[Chart: power breakdown]
19
Scheduling for PLAs
Generalize accelerator architecture
Map new loops to existing hardware
[Figure: Loop 1 passes through the synthesis system to produce the hardware; Loop 2 passes through the compiler + SMT solver to run on that hardware]
20
PLA Scheduling
[Figure: LA hardware (Mem, X, +, +) beside PLA hardware (Mem, X, +/-, bus)]
21
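PLA Scheduling
[Figure: loop dataflow graph (ops 1-6: LD, x, x, -, +, +) mapped onto PLA hardware (Mem, X, +/-, bus), with a MOV routed over the bus]
Schedule over FU0-FU3 and the bus (column alignment lost in extraction):
–Time 0: 1, MOV, 5, X
–Time 1: 3, 2, 4, 6
Dependence constraint:
$\mathrm{sched\_time}(j) \ge \mathrm{sched\_time}(i) + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II$
SMT formulation:
$(X_{i,f_i,t_i} \land X_{j,f_j,t_j}) \Rightarrow (S_j \times II + t_j \ge S_i \times II + t_i + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II)$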
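Reading the slide's dependence constraint in the usual modulo-scheduling form, sched_time(j) >= sched_time(i) + lat(i) - dist(i,j) * II, a candidate schedule can be checked edge by edge. The sketch below is a plain checker and not the SMT encoding itself; `DepEdge` and `deps_satisfied` are hypothetical names, and the arrays are assumed to be indexed by operation number.

```c
#include <assert.h>
#include <stdbool.h>

/* One dependence edge from producer op i to consumer op j. */
typedef struct {
    int src;   /* producer op index i */
    int dst;   /* consumer op index j */
    int dist;  /* iteration distance dist(i,j); 0 within one iteration */
} DepEdge;

/* Returns true when every edge satisfies
 *   sched_time[j] >= sched_time[i] + lat[i] - dist(i,j) * ii. */
bool deps_satisfied(const int *sched_time, const int *lat,
                    const DepEdge *edges, int num_edges, int ii) {
    assert(ii > 0);
    for (int e = 0; e < num_edges; e++) {
        int i = edges[e].src;
        int j = edges[e].dst;
        int earliest = sched_time[i] + lat[i] - edges[e].dist * ii;
        if (sched_time[j] < earliest)
            return false;
    }
    return true;
}
```

A loop-carried edge (dist > 0) relaxes the bound by dist * II cycles, which is why a schedule that is legal at II = 3 can become illegal when the same times are tried at II = 2.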
22
Programmability
[Chart panels: small, with simple communication; small, with complex communication]
23
Power Efficiency
[Chart: performance (MIPS) versus power (mW)]
–LA: 105 MIPS/mW
–PLA: 24 MIPS/mW
–OR1K: 2 MIPS/mW
–TI C6x: 5 MIPS/mW
–ARM11: 3 MIPS/mW
–Itanium2: 0.08 MIPS/mW
–Tensilica Diamond Core: 12 MIPS/mW
PLA: faster than ARM11 AND 8x more efficient!
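The "8x more efficient" figure is the ratio of the quoted PLA and ARM11 efficiencies (24 / 3). The short sketch below tabulates the MIPS/mW values from this slide and computes such ratios; the names `Design`, `DESIGNS`, `efficiency`, and `efficiency_ratio` are hypothetical, and only the numbers come from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Power efficiency figures quoted on the slide, in MIPS/mW. */
typedef struct {
    const char *name;
    double mips_per_mw;
} Design;

static const Design DESIGNS[] = {
    {"LA", 105.0},
    {"PLA", 24.0},
    {"OR1K", 2.0},
    {"TI C6x", 5.0},
    {"ARM11", 3.0},
    {"Itanium2", 0.08},
    {"Tensilica Diamond Core", 12.0},
};

/* Look up a design by name; returns a negative value if it is unknown. */
double efficiency(const char *name) {
    size_t n = sizeof DESIGNS / sizeof DESIGNS[0];
    for (size_t k = 0; k < n; k++) {
        if (strcmp(DESIGNS[k].name, name) == 0)
            return DESIGNS[k].mips_per_mw;
    }
    return -1.0;
}

/* How many times more power-efficient design `a` is than design `b`. */
double efficiency_ratio(const char *a, const char *b) {
    double ea = efficiency(a);
    double eb = efficiency(b);
    assert(ea > 0.0 && eb > 0.0);
    return ea / eb;
}
```

For example, `efficiency_ratio("PLA", "ARM11")` gives 8, and `efficiency_ratio("LA", "PLA")` gives 4.375, the efficiency given up in exchange for programmability.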
24
Conclusion
Programmable loop accelerators retain efficiency while being programmable
Loop accelerator datapath generalized in a cost-effective way
Significant benefits over GPP:
–4x-34x improved power efficiency
–30x improved area efficiency
25
Questions?
http://cccp.eecs.umich.edu