University of Michigan Electrical Engineering and Computer Science Dynamic Voltage/Frequency Scaling in Loop Accelerators using BLADES Ganesh Dasika 1,

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Dynamic Voltage/Frequency Scaling in Loop Accelerators using BLADES
Ganesh Dasika 1, Shidhartha Das 2, Kevin Fan 1, Scott Mahlke 1, David Bull 2
1 University of Michigan, Advanced Computer Architecture Laboratory, Ann Arbor, MI
2 ARM Ltd., Cambridge, United Kingdom

Introduction
[Austin, IEEE Computer, March 04]

Razor
Allows for voltage/frequency scaling beyond the first-failure point
Exploits the difference between design-time conditions (“slow”) and actual conditions (“typical”)
[Das, JSSC 2006]

Razor in General Purpose Processors
Requires detailed analysis of microarchitectural impact
  – Analyze what state should be stored
  – Lengthening the pipeline for stabilization increases the complexity of the forwarding logic
Unpredictable control and data flow
Difficult to determine worst-case vectors

BLADES: Better-than-worst-case Loop Accelerator Design
Incorporate DVFS into ASICs using Razor
  – Shave off some of the high NRE (non-recurring engineering) cost using HLS (high-level synthesis)
  – Develop a generic methodology for any application
  – Razor solution for a templated architecture
Create an ASIC design flow that is aware of Razor-ization costs

Loop Accelerator Template
Hardware realization of a modulo-scheduled loop
Parameterized execution resources, storage, and connectivity
Control is statically determined, simple, and not timing-critical
Opportunity to make application-specific optimizations

Razorized Loop Accelerator
[Figure: datapath with + and * function units; labels: Razor, extended register queues, added interconnect, “roll-back” muxes, R]
R is the number of extra entries required
It is a function of max pipeline depth and error-detection delay

Error “Life-Cycle”
[Figure: datapath with + and * function units; labels: Razor, Error, Reset, OR-tree, Error stabilization, Roll-back pipelining, Error processing, Control]

Issues with Razor
Area, added hold-fixing
[Figure: timing diagram; labels: CLK, D, t_spec]

[Figure: four example schedules of the operations Add0, Add1, Or0, and Or1: on FU 0 and FU 1 over Times 0–5; on FU 0 through FU 3 over Times 0–2; chained as Add-Or0 and Add-Or1 on FU 0 over Times 0–3; and on FU 0 and FU 1 over Times 0–2]
50% FU utilization removes the need for hold-fixing, but requires halving performance or doubling area
Use a hybrid scheme to execute >2 ops per FU
Opcode-chaining
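The 50% utilization trade-off above can be checked with a little arithmetic. The sketch below is only an illustration: the function name, the `enable_period` parameter, and the resource-only cost model (dependences and latencies are ignored) are assumptions, not part of the presentation.

```python
import math


def min_initiation_interval(n_ops: int, n_fus: int, enable_period: int = 1) -> int:
    """Resource-only lower bound on cycles per loop iteration.

    Assumption: every FU can start one op every `enable_period` cycles,
    so enable_period=2 models 50% FU utilization.
    """
    if n_ops < 0 or n_fus <= 0 or enable_period <= 0:
        raise ValueError("n_ops must be >= 0; n_fus and enable_period must be > 0")
    ops_per_fu = math.ceil(n_ops / n_fus)  # ops the busiest FU must execute
    return ops_per_fu * enable_period


# Four ops (Add0, Add1, Or0, Or1), as in the schedules on the slide.
baseline = min_initiation_interval(4, 2, enable_period=1)     # 2 FUs, full rate
half_rate = min_initiation_interval(4, 2, enable_period=2)    # 2 FUs, 50% utilization
double_area = min_initiation_interval(4, 4, enable_period=2)  # 4 FUs, 50% utilization

print(baseline, half_rate, double_area)  # 2 4 2
```

With the same two FUs the iteration takes twice as long; restoring the original rate takes twice as many FUs. Opcode-chaining is the slide's way of avoiding both extremes.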

Identifying Opcode Chains
Compiler identifies subgraphs of 3-4 input, 1 output instructions
  – All arithmetic ops supported
Greedy selection algorithm
[Figure: dataflow graph with LD, >>, <<, +, &, and ST operations]
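A greedy pass of this kind might look like the sketch below. It is not the authors' algorithm: it only fuses producer/consumer pairs, and the graph encoding, the `ARITH_OPS` set, the `select_chains` name, and the "fewest inputs first" ordering are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Assumption: opcodes treated as chainable arithmetic ops (loads/stores excluded).
ARITH_OPS = {"+", "-", "<<", ">>", "&", "|", "^"}


def select_chains(
    opcode: Dict[str, str],
    operands: Dict[str, List[str]],
    max_inputs: int = 4,
) -> List[Tuple[str, str]]:
    """Greedily pick non-overlapping (producer, consumer) pairs to fuse."""
    # Map every node to the set of nodes that read its result.
    consumers: Dict[str, Set[str]] = {node: set() for node in opcode}
    for node, inputs in operands.items():
        for src in inputs:
            if src in consumers:  # live-in values have no producer node
                consumers[src].add(node)

    candidates = []
    for producer, users in consumers.items():
        # One user only, so the fused op still has a single output.
        if opcode[producer] not in ARITH_OPS or len(users) != 1:
            continue
        (consumer,) = users
        if opcode[consumer] not in ARITH_OPS:
            continue
        # Inputs of the fused op: everything read from outside the pair.
        external = set(operands.get(producer, [])) | (
            set(operands.get(consumer, [])) - {producer}
        )
        if len(external) <= max_inputs:
            candidates.append((len(external), producer, consumer))

    chosen: List[Tuple[str, str]] = []
    used: Set[str] = set()
    for _, producer, consumer in sorted(candidates):  # fewest inputs first
        if producer in used or consumer in used:
            continue  # a node may belong to at most one chain
        used.update((producer, consumer))
        chosen.append((producer, consumer))
    return chosen


# Hypothetical loop body: ld -> shr -> and_ -> add -> st
opcode = {"ld": "LD", "shr": ">>", "and_": "&", "add": "+", "st": "ST"}
operands = {
    "ld": ["addr"],
    "shr": ["ld", "k"],
    "and_": ["shr", "mask"],
    "add": ["and_", "c"],
    "st": ["add", "addr2"],
}
print(select_chains(opcode, operands))  # [('and_', 'add')]
```

Both `shr`/`and_` and `and_`/`add` are legal three-input pairs here; the greedy pass takes the first in sorted order and rejects the other because `and_` is already used.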

Custom FUs
[Figure: dataflow graphs with LD, >>, <<, +, &, and ST operations, and a custom FU built from >>, +, and << operations; labels: Enabled every 2 cycles, Razor DFF]

Results
idct, sharp, and systolic_dct had multiple CFUs and a lower overall number of FUs
Viterbi and dequant had significant control flow that restricted opportunities for creating custom ops
22% reduction in hold-fixing overhead in sobel

Conclusion
Application-specific optimizations definitely help to mitigate Razor costs
  – 24% reduction in overhead
  – 33% energy savings overall
Can optimize Razor-ization with further input from the compiler
  – Critical-instruction analysis
  – Error impact analysis

Thank you!

Future Work
Errors in different FUs affect the system differently
  – Error “impact-analysis”
  – Data computation not necessarily error-sensitive
  – Address, branch target/direction critical to functionality
Razor-ization of arbitrary Verilog

Motivation
Using Razor has significant design overhead
  – Error-recovery system
  – Added “backup” state
  – Additional hold-time fixing
Each microarchitecture requires different modifications
Information about the workload cannot be used, since the design must preserve generality
