Increasing Hardware Efficiency with Multifunction Loop Accelerators

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Slide 1: Increasing Hardware Efficiency with Multifunction Loop Accelerators
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
October 25, 2006

Slide 2: Introduction
Emerging applications have high performance, cost, and energy demands
– H.264, wireless, software radio, signal processing
– Gops required
– 200 mW power budget
Applications are dominated by tight loops processing large amounts of streaming data
[Figure labels: CPU, Accelerators]

Slide 3: Loop Accelerators
Order-of-magnitude performance and efficiency wins
– Viterbi: 100x speedup vs. ARM9
Automated C → gates solution
Correct by construction
Close designer productivity gap
Achieve short time-to-market

Slide 4: Prescribed Throughput Accelerators
Traditional behavioral synthesis
– Directly translate C operators into gates
Our approach: application-centric architectures
– Achieve fixed throughput
– Maximize hardware sharing
[Figure labels: Operation graph, Datapath; Application, Architecture]

Slide 5: Outline
Loop accelerator schema and design flow
Cost sensitive scheduling
Designing multifunction accelerators
– Naïve
– Joint scheduling
– Datapath union
Synthesis results

Slide 6: Loop Accelerator Template
Parameterized execution resources, storage, connectivity
Hardware realization of a modulo scheduled loop

Slide 7: Loop Accelerator Design Flow
[Figure: design flow. Labels in reading order: C code and performance (throughput) input, FU allocation, abstract architecture, modulo schedule, scheduled ops (Op1, Op2, Op3, … placed on FUs over time), build datapath, concrete architecture (FUs and RFs), instantiate architecture, Verilog and control signals, synthesize, loop accelerator]
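The FU allocation step named in this flow can be illustrated with the standard resource bound from modulo scheduling: if a new iteration starts every II cycles (the initiation interval implied by the throughput target), one fully pipelined FU can host at most II operations of the loop body. The sketch below is only an illustration under that assumption. The function name, the operation classes, and the example loop are hypothetical and are not taken from the slides.

```python
import math
from collections import Counter

def min_fus(ops, ii):
    """Lower bound on FUs per operation class for initiation interval `ii`.

    Assumes each FU is fully pipelined and serves one operation class,
    so a single FU can host at most `ii` operations of the loop body.
    """
    if ii < 1:
        raise ValueError("initiation interval must be >= 1")
    counts = Counter(ops)
    return {kind: math.ceil(n / ii) for kind, n in counts.items()}

# Hypothetical loop body: 5 adds, 2 multiplies, 3 memory operations.
loop_ops = ["add"] * 5 + ["mul"] * 2 + ["mem"] * 3

print(min_fus(loop_ops, ii=2))  # {'add': 3, 'mul': 1, 'mem': 2}
print(min_fus(loop_ops, ii=4))  # {'add': 2, 'mul': 1, 'mem': 1}
```

Relaxing the throughput target (a larger II) lowers the bound, which is the lever a prescribed-throughput design can use to trade performance for hardware.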

Slide 8: Datapath Derived from Schedule
Schedule to abstract architecture (FUs)
Determine register and interconnect requirements from schedule
[Figure: source code (r1 = Mem[r2]; r3 = r…), a schedule placing LOAD and ADD on FU1 and FU2 over time, and the resulting datapath with MEM and + units]
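A minimal sketch of how register and interconnect requirements might be read off a schedule. The data layout, the latencies, and the assumption that the ADD consumes the loaded value r1 are hypothetical; the slide only shows a LOAD and an ADD bound to two FUs. Each producer-to-consumer pair implies a connection between their FUs, and the gap between when a value is ready and when it is last consumed implies how long it must be stored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduledOp:
    name: str      # operation label
    fu: str        # FU the scheduler bound the operation to
    time: int      # issue cycle
    latency: int   # cycles until the result is available (assumed values)
    dest: str      # value produced
    srcs: tuple    # values consumed

def derive_datapath(schedule):
    """Return (wires, storage) implied by a schedule.

    wires:   set of (producer FU, consumer FU) connections
    storage: cycles each value is held between being ready and its last use
    """
    producer = {op.dest: op for op in schedule}
    wires = set()
    storage = {}
    for op in schedule:
        for src in op.srcs:
            if src not in producer:
                continue  # live-in value, supplied from outside the loop body
            p = producer[src]
            ready = p.time + p.latency
            wires.add((p.fu, op.fu))
            storage[src] = max(storage.get(src, 0), op.time - ready)
    return wires, storage

# Hypothetical schedule: LOAD on FU1 at time 1, ADD on FU2 at time 4.
schedule = [
    ScheduledOp("LOAD", "FU1", time=1, latency=2, dest="r1", srcs=("r2",)),
    ScheduledOp("ADD", "FU2", time=4, latency=1, dest="r3", srcs=("r1",)),
]
wires, storage = derive_datapath(schedule)
print(wires)    # {('FU1', 'FU2')}
print(storage)  # {'r1': 1}
```

Because the storage and wiring fall out of the schedule, a different placement of the same operations yields a different datapath cost.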

Slide 9: Cost Sensitive Scheduling
Traditional scheduling is hardware unaware
Intelligent scheduling needed to reduce hardware cost
27% cost reduction with same performance [MICRO ’05]
[Figure: two alternative schedules of the same loads (LD) and +1 operations on FU1, FU2, and FU3, each shown with its resulting datapath]

Slide 10: Multifunction Accelerator
Map multiple loops to a single accelerator
Improve hardware efficiency via reuse
Opportunities for sharing
– Disjoint stages (loops 2, 3)
– Pipeline slack (loops 4, 5)
[Figure: application flow (Loop 1, a "Frame Type?" branch to Loop 2 or Loop 3, Loop 4, Block 5, …) mapped first to an accelerator pipeline of loop accelerators LA1 to LA5, then to a pipeline LA1 to LA3 made of one loop accelerator and two multifunction loop accelerators]

Slide 11: Design Strategies
Naïve method: design single-function accelerators, place them side by side
– Misses potential hardware sharing of FUs, storage, interconnect
[Figure: Loop 1 and Loop 2 each pass through the cost sensitive modulo scheduler; the resulting FUs sit side by side in the multifunction datapath]

Slide 12: Joint Scheduling
Loops are independent: the number of possible schedules is exponential in the number of loops
Infeasible even for modest problems
[Figure: Loop 1 and Loop 2 feed a joint cost sensitive modulo scheduler, which produces one schedule per loop (Op1, Op2, Op3, … on FUs over time) for a shared set of FUs]
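The exponential growth can be made concrete with a small count. Because the loops are independent, any schedule of one loop can be combined with any schedule of another, so the joint search space is the product of the per-loop counts. The figure of 1000 candidate schedules per loop is an arbitrary assumption chosen only for illustration.

```python
def joint_schedule_count(candidates_per_loop):
    """Size of the joint search space for independent loops."""
    total = 1
    for count in candidates_per_loop:
        total *= count
    return total

# Hypothetical: every loop has 1000 candidate schedules.
for n_loops in range(1, 6):
    print(n_loops, joint_schedule_count([1000] * n_loops))
```

With these assumed numbers, each added loop multiplies the space by 1000, so five loops already give 10^15 combinations.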

Slide 13: Multifunction Gate Costs
43% average savings over sum of accelerators
[Figure: gate cost chart for benchmarks A through J]

Slide 14: Datapath Union
[Figure: Loop 1 and Loop 2 each pass through the cost sensitive modulo scheduler; the two resulting FU datapaths are combined by datapath union]

Slide 15: Datapath Union
Combine similar components → better hardware sharing → lower cost
Trade off FU and register cost
– Combining dissimilar FUs can enable register cost savings
ILP formulation minimizes FU and register cost
[Figure: the FUs of Accel 1 and Accel 2 (+, -, *, /, and M units) merged into the FUs of a multifunction accelerator]
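A toy version of the union step, to show why pairing similar FUs lowers cost. This is not the ILP formulation from the slides: it substitutes a brute-force search over pairings, considers FU cost only (register cost is ignored), and uses made-up relative gate costs. The function names and the two example accelerators are hypothetical.

```python
from itertools import permutations

# Hypothetical relative gate costs per operation class (not from the slides).
OP_COST = {"+": 1, "-": 1, "*": 8, "/": 12, "M": 4}

def fu_cost(ops):
    """Cost of one FU that supports every operation class in `ops`."""
    return sum(OP_COST[o] for o in ops)

def union_datapaths(accel1, accel2):
    """Merge two FU lists into one multifunction FU list of minimum FU cost.

    Each FU is a frozenset of operation classes. Every FU of the smaller
    accelerator is merged into a distinct FU of the larger one. Brute force
    over all pairings stands in for the ILP and ignores register cost.
    """
    small, large = sorted((accel1, accel2), key=len)
    best = None
    for perm in permutations(range(len(large)), len(small)):
        merged = [set(fu) for fu in large]
        for s_idx, l_idx in enumerate(perm):
            merged[l_idx] |= small[s_idx]
        cost = sum(fu_cost(fu) for fu in merged)
        if best is None or cost < best[0]:
            best = (cost, merged)
    return best

# Hypothetical single-function accelerators.
accel1 = [frozenset({"+"}), frozenset({"-"}), frozenset({"M"}), frozenset({"M"})]
accel2 = [frozenset({"+"}), frozenset({"*"}), frozenset({"M"})]

naive = sum(fu_cost(fu) for fu in accel1 + accel2)
cost, merged = union_datapaths(accel1, accel2)
print(naive)  # 23: accelerators placed side by side
print(cost)   # 18: the adder and one memory unit are shared
```

Brute force is only workable for a handful of FUs; it is used here to keep the example short and exact.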

Slide 16: Multifunction Gate Costs
Smart union within 3% of joint scheduling solution
[Figure: gate cost chart for benchmarks A through J]

Slide 17: Conclusion
Multifunction accelerators are highly effective in exploiting coarse-grained hardware sharing
Joint scheduling achieves 43% average cost savings, but is impractical
Smart union of independent accelerators achieves 40% average savings
Compile times of 5 minutes to 1 hour

Slide 18: Questions?