University of Michigan Electrical Engineering and Computer Science Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators.

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators
Kevin Fan (1), Manjunath Kudlur (2), Ganesh Dasika, Scott Mahlke
University of Michigan Advanced Computer Architecture Laboratory
(1) Parakinetics, Inc.
(2) Nvidia

Introduction
Emerging applications have high performance, cost, and energy demands
– High-quality video
– Flash animation
Clear need for application- and domain-specific hardware
[Figure: MPEG-4 decoder, frames/sec (24 fps min.) and cell-phone battery life (hours)]

Flexibility
Multiple instances of the same application
– E.g., multiple video codecs (Xvid, DivX, FFMpeg)
Software algorithms change over time
NRE
Time-to-market

ASIC Alternatives
[Figure: design points placed between the axes "Efficiency, Performance" and "Flexibility": Loop Accelerators and ASICs, Domain-specific accelerators, DSPs, General Purpose Processors, FPGAs]
???: Highly efficient, some programmability

How much post-programmability is really required?
mdct.c in faad2, Version 1.39 versus Version 1.40

Version 1.39:

    for(k=0; k<N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img*sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
    }

Version 1.40 (the same loop, with the scaling step below highlighted as the change):

    if(b_scale) {
        Z1[k][0] = Z1[k][0] * scale;
    }

How much post-programmability is really required?
mdct.c in faad2, Version 1.33 versus Version 1.34

Version 1.33:

    for(k=0; k<N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[ n] = RE(x);
        X_out[N n] = -IM(x);
        X_out[N2 + n] = IM(x);
        X_out[N n] = -RE(x);
    }

Version 1.34 (only the signs change):

    for(k=0; k<N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[ n] = -RE(x);
        X_out[N n] = IM(x);
        X_out[N2 + n] = -IM(x);
        X_out[N n] = RE(x);
    }

How much post-programmability is really required?
H.264 reference implementation, Version 13.0, Version 13.1, and Version 13.2

Version 13.0:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i=pos_scan[coeff_ctr][0];
        j=pos_scan[coeff_ctr][1];
        run++;
        ilev=0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs (m7[i]);
        if (img->AdaptiveRounding) {
            fadjust8x8[j][block_x+i] = 0;
        }
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost += MAX_VALUE;
                img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
                img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff]=-1;
            } else {
                *coeff_cost += MAX_VALUE;
                ACLevel[scan_pos  ] = isignab(level,m7[i]);
                ACRun  [scan_pos++] = run;
                run=-1; // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }

Versions 13.1 and 13.2 show the same loop; the only difference from Version 13.0 in this excerpt is the first index of cofAC:

    img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
    img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];

Mostly minor changes to loops
– Bug fixes
– Revisions
Possible to design custom HW with minor programmability extensions

Programmable Loop Accelerator
Generalize accelerator without losing efficiency
[Figure: design points placed between the axes "Efficiency, Performance" and "Flexibility": Loop Accelerators and ASICs, Programmable Loop Accelerators, Domain-specific accelerators, DSPs, General Purpose Processors, FPGAs]

Designing Loop Accelerators
[Figure: a C code loop is turned into hardware: function units (+, &, *, <<, MEM) with local memories, BR, and CRF, joined by point-to-point connections]

Loop Accelerator Architecture
Hardware realization of modulo scheduled loop
Parameterized execution resources, storage, connectivity
[Figure: function units (+, &, MEM) with local memory, CRF, and BR, joined by point-to-point connections; an FSM drives the control signals]

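LA Scheduling
[Figure: modulo schedule table (columns Time, FU0 to FU3) for a FIR loop kernel built from LD, x, and + operations on LA hardware with Mem, X, +, and + units]
Scheduling a different loop (LD, x, x, -) on the same hardware runs into problems:
– No subtract
– Paths missing
– Mult result has longer lifetime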

LA Datapath Restrictions
[Figure: slow-down plotted against graph difference]

Programmable Loop-Accelerator Architecture
[Figure: function units (+/-, &/|, MEM) with local memory, CRF, BR, RR, and literals, joined by a bus; a control memory drives the control signals]

                 LA                      PLA
Functionality    Custom FU set           Generalized FUs + MOVs
Storage          Limited size, no addr.  Rotating Reg. Files
Connectivity     Point-to-point          Bus + Port-swapping
Control          Hardwired Control       Lit. Reg. File + Control Mem
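The slide names rotating register files as the PLA's storage but does not describe them. As a rough illustration only, the sketch below shows the usual register-rotation idea used with software-pipelined loops: a base offset changes once per loop iteration, so a value written under one logical name is read under the next name an iteration later. The size `RRF_SIZE`, the type, and all function names are hypothetical and are not taken from the presentation.

```c
#include <assert.h>
#include <stdint.h>

#define RRF_SIZE 8 /* hypothetical size; the slides do not give one */

typedef struct {
    int32_t regs[RRF_SIZE];
    unsigned base; /* rotation offset, updated once per loop iteration */
} RotatingRegFile;

/* Map a logical register number to a physical slot. */
static unsigned rrf_physical(const RotatingRegFile *rf, unsigned logical)
{
    assert(logical < RRF_SIZE);
    return (rf->base + logical) % RRF_SIZE;
}

static void rrf_write(RotatingRegFile *rf, unsigned logical, int32_t value)
{
    rf->regs[rrf_physical(rf, logical)] = value;
}

static int32_t rrf_read(const RotatingRegFile *rf, unsigned logical)
{
    return rf->regs[rrf_physical(rf, logical)];
}

/* Rotate by one: a value written as logical r becomes visible as r + 1. */
static void rrf_rotate(RotatingRegFile *rf)
{
    rf->base = (rf->base + RRF_SIZE - 1) % RRF_SIZE;
}
```

With this arrangement each iteration can write its result to the same logical register without overwriting the value from the previous iteration, which is one plausible way to handle results whose lifetime is longer than one iteration.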

Experimental Setup
Wide variety of benchmarks
– DSP
– Media
– Linear Algebra
Baseline LAs:
– Used LA synthesis system to generate HDL
– um
Comparisons:
– PLAs ( um)
– OR-1200 ( um)

Area
[Figure: area results, with OR-1200 as a reference]

Power Consumption
[Figure: power consumption results, with OR-1200 as a reference]

Power Consumption
[Figure: power consumption results, with OR-1200 and "OR-1200 equiv" as references]

Power Breakdown
[Figure: power breakdown]

Scheduling for PLAs
Generalize accelerator architecture
Map new loops to existing hardware
[Figure: Loop 1 goes through the synthesis system to produce the hardware; Loop 2 goes through the compiler + SMT solver to run on that hardware]

PLA Scheduling
[Figure: LA hardware (Mem, X, +, +) beside PLA hardware (Mem, X, +/-, Bus)]

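PLA Scheduling
[Figure: schedule table (columns Time, FU0 to FU3, Bus) placing the loop's LD, x, x, and - operations on PLA hardware (Mem, X, +/-, Bus), with a MOV on the bus]

Dependence constraint for an edge from operation i to operation j:

$$\mathrm{sched\_time}(j) \ge \mathrm{sched\_time}(i) + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II$$

SMT formulation:

$$(X_{i,f_i,t_i} \wedge X_{j,f_j,t_j}) \rightarrow \big(S_j \times II + t_j \ge S_i \times II + t_i + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II\big)$$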
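The dependence constraint on this slide can be checked mechanically for a candidate schedule. The sketch below is a minimal illustration, assuming that an operation's schedule time is S x II + t (stage S, slot t within the kernel, as the S and t symbols on the slide suggest) and that the comparison in the constraint is "greater than or equal". The `SchedOp` type and the function names are hypothetical, and the code covers only the dependence check, not the SMT formulation.

```c
#include <assert.h>
#include <stdbool.h>

/* One scheduled operation (hypothetical representation). */
typedef struct {
    int stage;   /* S: which repetition of the kernel the op falls in */
    int slot;    /* t: cycle within the kernel, 0 <= t < II */
    int latency; /* lat: cycles until the op's result is available */
} SchedOp;

/* Absolute schedule time, assumed to be S * II + t. */
static int sched_time(const SchedOp *op, int ii)
{
    assert(ii > 0 && op->slot >= 0 && op->slot < ii);
    return op->stage * ii + op->slot;
}

/* True when the dependence i -> j with iteration distance dist is
   satisfied: sched_time(j) >= sched_time(i) + lat(i) - dist * II. */
static bool dependence_ok(const SchedOp *i, const SchedOp *j, int dist, int ii)
{
    return sched_time(j, ii) >= sched_time(i, ii) + i->latency - dist * ii;
}
```

For example, with II = 2, a load at stage 0, slot 0 with latency 2 can feed a multiply at stage 1, slot 0 (time 2), but not one at stage 0, slot 1 (time 1). A larger `dist` relaxes the constraint, because the consumer belongs to a later iteration.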

Programmability
[Figure: results grouped under the labels "Small, with simple communication" and "Small, with complex communication"]

Power Efficiency
LA: 105 MIPS/mW
PLA: 24 MIPS/mW
OR1K: 2 MIPS/mW
TI C6x: 5 MIPS/mW
ARM11: 3 MIPS/mW
Itanium2: 0.08 MIPS/mW
Tensilica Diamond Core: 12 MIPS/mW
Faster than ARM11 AND 8x more efficient!
[Figure: performance (MIPS) plotted against power (mW)]
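The "8x more efficient" remark matches the ratio of the PLA and ARM11 figures on this slide (24 / 3). The sketch below simply tabulates the slide's MIPS/mW numbers and computes such ratios. The struct, the array name, and the helper functions are hypothetical, and the numbers are copied from the slide rather than measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    double mips_per_mw; /* power efficiency as listed on the slide */
} Efficiency;

static const Efficiency kDesigns[] = {
    {"LA", 105.0}, {"PLA", 24.0}, {"OR1K", 2.0}, {"TI C6x", 5.0},
    {"ARM11", 3.0}, {"Itanium2", 0.08}, {"Tensilica Diamond Core", 12.0},
};

/* Look up a design by name; asserts if the name is not in the table. */
static double efficiency_of(const char *name)
{
    for (size_t n = 0; n < sizeof kDesigns / sizeof kDesigns[0]; n++) {
        if (strcmp(kDesigns[n].name, name) == 0)
            return kDesigns[n].mips_per_mw;
    }
    assert(!"unknown design name");
    return 0.0;
}

/* How many times more power-efficient design a is than design b. */
static double efficiency_ratio(const char *a, const char *b)
{
    return efficiency_of(a) / efficiency_of(b);
}
```

Calling `efficiency_ratio("PLA", "ARM11")` gives 8, and `efficiency_ratio("LA", "PLA")` gives 4.375, which is the efficiency given up in exchange for programmability according to these figures.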

Conclusion
Programmable loop accelerators retain efficiency while being programmable
Loop accelerator datapath generalized in a cost-effective way
Significant benefits over GPP:
– 4x-34x improved power efficiency
– 30x improved area efficiency

Questions?
