University of Michigan Electrical Engineering and Computer Science 1 Compiler-directed Synthesis of Multifunction Loop Accelerators Kevin Fan, Manjunath.

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
1 Compiler-directed Synthesis of Multifunction Loop Accelerators
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan

2 Accelerating Streaming Applications
Streaming applications:
–Discrete transformations operating on data stream
–High performance
Map application to pipeline of accelerators
Multifunction accelerators reuse hardware
–Improve hardware efficiency
[Figure: an application (Frame Type?, Loop 1, Loop 2, Loop 3, Loop 4, Block 5) mapped to an accelerator pipeline (LA1, LA2, LA3, …) attached to DRAM; pipeline stages are loop accelerators or multifunction loop accelerators]

3 Loop Accelerator Schema
Hardwired state machine for one or more critical loops
Order-of-magnitude power and performance improvements over more general designs

4 Single Function Accelerator Design
Use compiler as architecture synthesis tool
–Parameterized meta-architecture: all loop accelerators have same general organization
–Performance/throughput is input
–Compiler analysis to understand computation and communication requirements
–Hardware-sensitive optimization to reduce cost

5 Flow Diagram
Loop, Desired II -> Allocate FUs -> Abstract Arch -> Modulo Schedule -> Scheduled Ops -> Build Datapath -> Concrete Arch -> Instantiate Arch -> Verilog, Control Signals -> Synthesize -> Loop Accelerator
[Figure labels: Application; scheduled ops Op1, Op2, Op3, … laid out over time and FUs; datapath of FUs and RF]

6 FU Allocation
Given operations in a loop and cost of hardware cells implementing those operations
Minimize total FU cost while supporting all operations
[Figure: example with 3 × ADD, 1 × SUB, 2 × LOAD; remaining labels "II =" and "MEM"]
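
The allocation step on this slide can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions, not the authors' allocator: the cell library `CELLS`, its costs, the `max_fus` bound, and the value II = 2 are all hypothetical (the II on the slide is not legible), and every operation is assumed to occupy one issue slot on one FU.

```python
from itertools import combinations, combinations_with_replacement

# Hypothetical cell library (not from the slides): cell -> (cost, opcodes it executes).
CELLS = {
    "ADD": (10, {"ADD"}),
    "SUB": (10, {"SUB"}),
    "ADDSUB": (13, {"ADD", "SUB"}),
    "MEM": (20, {"LOAD"}),
}

def feasible(fus, op_counts, ii):
    """True if every op can get its own (FU, slot) pair within II cycles.

    Each FU offers `ii` issue slots per iteration. By Hall's condition, an
    assignment exists iff every subset of opcodes demands no more slots than
    the FUs able to run at least one opcode of that subset can supply.
    """
    opcodes = list(op_counts)
    for r in range(1, len(opcodes) + 1):
        for subset in combinations(opcodes, r):
            wanted = set(subset)
            demand = sum(op_counts[op] for op in subset)
            supply = ii * sum(1 for fu in fus if CELLS[fu][1] & wanted)
            if demand > supply:
                return False
    return True

def allocate(op_counts, ii, max_fus=6):
    """Exhaustively pick the cheapest multiset of cells that supports all ops."""
    best = None
    for n in range(1, max_fus + 1):
        for fus in combinations_with_replacement(sorted(CELLS), n):
            if feasible(fus, op_counts, ii):
                cost = sum(CELLS[fu][0] for fu in fus)
                if best is None or cost < best[0]:
                    best = (cost, fus)
    return best

# Operation mix from the slide; II = 2 is an assumed value.
result = allocate({"ADD": 3, "SUB": 1, "LOAD": 2}, ii=2)
```

With these assumed costs, `result` is `(43, ("ADD", "ADDSUB", "MEM"))`: one memory unit covers both loads, and a plain adder plus a combined add/subtract unit cover the four arithmetic operations in two cycles.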

7 Modulo Scheduling and Datapath Derivation
Schedule to abstract architecture (FUs)
Determine register and interconnect requirements from schedule
[Figure: source code (r1 = Mem[r2]; r3 = r...), a schedule of LOAD and ADD on FU1 and FU2 with times 1 and 4 marked, and the derived datapath with MEM and + units]
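
The core bookkeeping of modulo scheduling is a reservation table indexed by time modulo II. The sketch below is a simplified, non-backtracking placement loop and not the scheduler used by the authors: it ignores loop-carried dependences, assumes operations arrive in topological order, and the operation names, latencies, and FU names are hypothetical.

```python
def modulo_schedule(ops, fus, ii):
    """Place each op at the earliest time whose modulo slot is free.

    ops: list of (name, opcode, latency, predecessor names), topologically ordered.
    fus: dict mapping FU name -> set of opcodes that FU can execute.
    Returns a dict mapping op name -> (issue time, FU name).
    """
    latency = {name: lat for name, _, lat, _ in ops}
    mrt = {}     # modulo reservation table: (FU, time % II) -> op name
    placed = {}  # op name -> (issue time, FU)
    for name, opcode, _, preds in ops:
        # An op may issue once all of its predecessors have produced results.
        earliest = max((placed[p][0] + latency[p] for p in preds), default=0)
        # II consecutive times cover every modulo slot exactly once.
        for t in range(earliest, earliest + ii):
            fu = next((f for f in sorted(fus)
                       if opcode in fus[f] and (f, t % ii) not in mrt), None)
            if fu is not None:
                mrt[(fu, t % ii)] = name
                placed[name] = (t, fu)
                break
        else:
            # A full iterative modulo scheduler would evict ops or raise II here.
            raise ValueError(f"no free slot for {name} at II={ii}")
    return placed

# A load feeding an add, loosely following the slide's example (latencies assumed).
example = modulo_schedule(
    [("ld", "LOAD", 3, []), ("add", "ADD", 1, ["ld"])],
    {"FU1": {"LOAD"}, "FU2": {"ADD"}},
    ii=2,
)
```

From a schedule like `example`, register requirements follow from how long each value stays live between its producer and consumer, and interconnect requirements follow from which FU produces a value and which FU consumes it.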

8 Multifunction Accelerator
Single hardware accelerator to run multiple loops
Could place single function accelerators side by side
Want to exploit potential hardware sharing between loops
–Function units
–Registers
–Interconnect

9 Multifunction Design Strategies
1. Union Method
2. Phase Ordered Method

10 Union Method
Goal: combine FUs and register files to improve hardware sharing.
Accel 1 FUs: +, -, M, M
Accel 2 FUs: +, +, *, M
Positional union (multifunction accel): +, +/-, M/*, M. Storage cost: 15
Smart union (multifunction accel): +, */-, M/+, M. Storage cost: 11

11 Union Method
Smart union formulated as ILP problem which minimizes FU and register cost
Benefit: Look at whole design at once
Limitation: Schedules are fixed prior to union phase
Fast runtime
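
The difference between the two unions can be sketched in a few lines. The slides formulate smart union as an ILP over FU and register cost; the version below swaps in a brute-force search over FU pairings and models FU cost only, so it illustrates the idea rather than the authors' formulation. The per-opcode costs in `OP_COST` and the helper names are hypothetical.

```python
from itertools import permutations, zip_longest

# Hypothetical per-opcode hardware costs (not from the slides).
OP_COST = {"+": 1, "-": 1, "*": 4, "M": 3}

def fu(*opcodes):
    """Build one function unit as the set of opcodes it supports."""
    return frozenset(opcodes)

def total_cost(fus):
    """Assumed cost model: an FU costs the sum of its distinct opcodes."""
    return sum(OP_COST[op] for unit in fus for op in unit)

def positional_union(accel_a, accel_b):
    """Merge FU i of one accelerator with FU i of the other."""
    return [a | b for a, b in zip_longest(accel_a, accel_b, fillvalue=frozenset())]

def smart_union(accel_a, accel_b):
    """Try every pairing of FUs and keep the cheapest merged accelerator."""
    n = max(len(accel_a), len(accel_b))
    a = list(accel_a) + [frozenset()] * (n - len(accel_a))
    b = list(accel_b) + [frozenset()] * (n - len(accel_b))
    return min((positional_union(a, list(p)) for p in permutations(b)),
               key=total_cost)

# FU lists as read from the Union Method slide.
accel1 = [fu("+"), fu("-"), fu("M"), fu("M")]
accel2 = [fu("+"), fu("+"), fu("*"), fu("M")]
slide_positional = positional_union(accel1, accel2)

# A second, made-up pair where FU order hides the sharing opportunity.
left = [fu("+"), fu("*"), fu("M")]
right = [fu("*"), fu("M"), fu("+")]
```

On `accel1` and `accel2`, `positional_union` reproduces the slide's +, +/-, M/*, M result. On the made-up pair, the positional result costs 16 under the assumed model, while `smart_union` finds the pairing that costs 8 by matching identical units.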

12 Cost of Union of Accelerators
[Chart: benchmark groups Image Processing, MPEG4, Signal Processing]
Worst union: 25% average savings
Positional union: 29% average savings
Best union: 33% average savings

13 Phase Ordered Method
Schedule loops in order
During scheduling, account for hardware from previous loop
Cost sensitive scheduler attempts to minimize hardware cost increase
[Figure: Loop 1 -> Accel 1; Loop 2 -> Accel 1+2]

14 Cost Sensitive Scheduling
Different valid scheduling alternatives are not equal
[Figure: two alternative schedules of the same + and LD operations over time on FU1, FU2, FU3]

15 Greedy Cost Sensitive Scheduler
Select scheduling alternative with minimum cost
Account for estimated cost of unscheduled ops
[Figure: for each loop, the modulo scheduler passes alternative Alt i to the hardware cost modeler and gets back Cost i; the modeler draws on the HW cost library, the partial hardware for scheduled ops, an estimate for unscheduled ops, and, for Loop 2, the Loop 1 hardware]
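
The greedy selection rule can be sketched as follows. This is a simplified illustration, not the authors' scheduler: it omits the estimate for unscheduled ops, takes each op's earliest issue time as given instead of tracking dependences, and the names `OP_COST`, `NEW_FU_OVERHEAD`, and `cost_sensitive_schedule` along with all cost values are hypothetical.

```python
# Hypothetical cost library: cost of adding support for an opcode to an FU.
OP_COST = {"ADD": 10, "MUL": 40, "LOAD": 20}
NEW_FU_OVERHEAD = 5  # assumed extra cost of instantiating a brand-new FU

def cost_sensitive_schedule(ops, prior_hardware, ii):
    """Greedily place each op on the (time, FU) alternative that adds least cost.

    ops: list of (name, opcode, earliest issue time).
    prior_hardware: list of opcode sets, one per FU kept from earlier loops.
    Returns (schedule, resulting hardware, total added cost).
    """
    hardware = [set(unit) for unit in prior_hardware]
    reserved = set()  # (FU index, time % II) slots used by this loop
    schedule = {}
    added_cost = 0
    for name, opcode, earliest in ops:
        alternatives = []
        for t in range(earliest, earliest + ii):
            for i, unit in enumerate(hardware):
                if (i, t % ii) not in reserved:
                    # Reusing an FU that already supports the opcode is free.
                    extra = 0 if opcode in unit else OP_COST[opcode]
                    alternatives.append((extra, t, i))
            # Instantiating a new FU is always possible, at a price.
            alternatives.append((OP_COST[opcode] + NEW_FU_OVERHEAD, t, len(hardware)))
        # Lowest cost wins; ties go to the earlier time, then the lower FU index.
        extra, t, i = min(alternatives)
        if i == len(hardware):
            hardware.append(set())
        hardware[i].add(opcode)
        reserved.add((i, t % ii))
        schedule[name] = (t, i)
        added_cost += extra
    return schedule, hardware, added_cost

# Loop 1 is assumed to have left an adder and a memory unit behind.
loop1_hw = [{"ADD"}, {"LOAD"}]
loop2 = [("a", "ADD", 0), ("b", "ADD", 0), ("c", "MUL", 1)]
schedule, hardware, added = cost_sensitive_schedule(loop2, loop1_hw, ii=2)
```

In this example the second add is delayed by one cycle so that it can reuse the existing adder, where a cost-blind scheduler would issue it at time 0 on the memory FU and pay for a second adder. Only the multiplier is new hardware.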

16 Phase Ordered Method
Extend conventional iterative modulo scheduler with hardware cost model
Benefits:
–Scheduler is aware of hardware for all previously scheduled loops
–Can adjust schedule to improve cost savings
Limitation: process is localized, greedy. Schedules of previous loops are fixed
Fast runtime

17 Cost Sensitive Scheduling Comparison
[Chart: benchmark groups Image Processing, MPEG4, Signal Processing]
Greedy scheduling: 41% average savings
ILP scheduling: 51% average savings

18 Union vs. Phase Ordered Methods
[Chart: benchmark groups Image Processing, MPEG4, Signal Processing]
Union method: 45% average savings
Phase ordered method: 41% average savings

19 Conclusion
Compiler-directed design system
Multifunction accelerator for hardware reuse
Two multifunction design methods
–Smart union of single-function accelerators: 45% average cost savings
–Phase ordered scheduling: 41% average cost savings
Overall, 20 to 61% hardware savings from sharing

20 Questions?