Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization


Presentation transcript:

University of Michigan Electrical Engineering and Computer Science 1 Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner* Advanced Computer Architecture Lab, University of Michigan *ARM Ltd.

2 A Case for Customization
General-purpose processors handle many applications fairly well, but…
► Each application has different requirements
► Need for efficient execution
Impressive design wins through customization
► Performance, power, area
► Up to 3.5x speedup [Hot Chips 16]

3 Instruction Set Customization
Computationally demanding parts of applications run on special hardware
New instructions use the special hardware
[Figure: dataflow graph of XOR, MPY, LD, SHR, MOV, and AND operations, with a CUSTOM instruction]

4 Traditional vs. Transparent Customization
Traditional: high Non-Recurring Engineering (NRE) costs
Transparent: “universal” accelerator, no ISA change
[Figure: CPU block diagrams labeled Traditional and Transparent, with a Compute Accelerator (CCA)]

5 Design of a Compute Accelerator
Goal: support important computation subgraphs
Array of function units
► Exploits subgraph parallelism
► Allows natural data propagation
[Figure: array of function units (FU) with inputs IN 1 and IN 2; pipeline stages Fetch, Issue, ALU and CCA, WB]

6 CCA Shape: 164.gzip
[Figure: subgraph shapes built from Or, And, and Mov operations]

7 CCA Shape: Blowfish
[Figure: subgraph shape built from And, Xor, Add, and Mov operations]

8 CCA Utilization
[Figure: dynamic % of subgraphs using FU]

9 CCA Operations
Dynamic opcodes in important subgraphs
Excluded mpy/div, load/store, branch
Two main categories – logicals, adds
Subgraphs rarely have more than 3 dependent adds
Opcode     %
Add        28.7
And        12.5
Move       11.7
Sext       10.4
Lshift      9.8
Or          8.7
Xor         5.1
Sub         4.8
Rshift      2.4
Compare     0.4

10 Proposed CCA Design
4 inputs/2 outputs
Two FU types
► Arith/logic
► Logic
Crossbar between rows
Captures > 99% of important subgraphs
[Figure: CCA with inputs I1–I4 and outputs O1, O2]
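The constraints on this slide can be read as a simple fit check: a candidate subgraph is accepted only if it needs at most 4 inputs and 2 outputs and each dependence level can be placed on a row of a suitable FU type. The Python sketch below is illustrative only. The row layout is borrowed from the depth-4 configuration 6A-4L-3A-2L on the synthesis slide, and the opcode classification, the function names, and the idea that a value may pass through a row unchanged are all assumptions rather than details confirmed by the slides.

```python
# Hypothetical model of a CCA fit check; not the authors' implementation.
ARITH_OPS = {"add", "sub", "compare"}      # assumed to need an arith/logic FU
LOGIC_OPS = {"and", "or", "xor", "move"}   # assumed to run on either FU type

# Assumed row layout: (number of FUs, kind); "A" = arith/logic, "L" = logic only.
ROWS = [(6, "A"), (4, "L"), (3, "A"), (2, "L")]
MAX_INPUTS, MAX_OUTPUTS = 4, 2             # from the slide: 4 inputs / 2 outputs


def row_can_hold(row, level):
    """True if every op of one dependence level fits on the given row."""
    count, kind = row
    if len(level) > count:
        return False
    if kind == "L":
        return all(op in LOGIC_OPS for op in level)
    return all(op in ARITH_OPS or op in LOGIC_OPS for op in level)


def fits_cca(levels, n_inputs, n_outputs, rows=ROWS):
    """Greedily place each dependence level on the earliest usable row."""
    if n_inputs > MAX_INPUTS or n_outputs > MAX_OUTPUTS:
        return False
    next_row = 0
    for level in levels:
        # Skip rows that cannot host this level (assumed pass-through).
        while next_row < len(rows) and not row_can_hold(rows[next_row], level):
            next_row += 1
        if next_row == len(rows):
            return False
        next_row += 1
    return True


# An add -> xor -> add -> xor chain fits; three dependent adds do not,
# because this assumed layout has only two arith/logic rows.
chain_ok = fits_cca([["add"], ["xor"], ["add"], ["xor"]], n_inputs=4, n_outputs=1)
three_adds = fits_cca([["add"], ["add"], ["add"]], n_inputs=2, n_outputs=1)
too_many_inputs = fits_cca([["and"]], n_inputs=5, n_outputs=1)
```

Placing each level on the earliest usable row is enough here because levels must occupy strictly increasing rows, so taking the first feasible row never rules out a later placement.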

11 Synthesis of CCA
Synopsys design tools, 130nm library
Table columns: Depth, Configuration, Control (bits), Delay (ns), Cell area (mm²), Subgraphs Supported (%)
(Only the Depth and Configuration values are present in this transcript.)
Depth   Configuration
7       6A-4L-4A-3L-2A-2L-1L
6       6A-4L-4A-3L-2A-1L
5       6A-4L-4A-2L-1L
4       6A-4L-3A-2L
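The configuration strings in this table appear to list one field per row of the CCA, giving a count followed by an FU type. The short Python sketch below parses such a string under that reading. Interpreting "A" as arith/logic and "L" as logic-only follows the two FU types named on the design slide but is an assumption, and `parse_cca_config` and `summarize` are hypothetical helper names.

```python
def parse_cca_config(config):
    """Split a string such as '6A-4L-3A-2L' into per-row (count, kind) pairs."""
    rows = []
    for field in config.split("-"):
        field = field.strip()                  # tolerate stray spaces
        count, kind = int(field[:-1]), field[-1]
        if kind not in ("A", "L"):
            raise ValueError(f"unknown FU kind: {kind!r}")
        rows.append((count, kind))
    return rows


def summarize(config):
    """Report depth and FU totals for one configuration string."""
    rows = parse_cca_config(config)
    return {
        "depth": len(rows),
        "arith_logic_fus": sum(n for n, k in rows if k == "A"),
        "logic_fus": sum(n for n, k in rows if k == "L"),
    }


deepest = summarize("6A-4L-4A-3L-2A-2L-1L")
shallowest = summarize("6A-4L-3A-2L")
```

Under this reading the row count of each string matches the Depth column (7 rows for the first entry, 4 for the last), which is some evidence that the interpretation is reasonable.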

12 CCA Utilization
Dynamic selection, dynamic realization
+ No ISA change
+ No recompile
– Simple selection
– Hardware complexity
Static selection, dynamic realization
+ Powerful selection
+ Simple hardware
– Some ISA change
– Recompile necessary
Static selection, static realization: ASIPs
– ISA change
– High NRE

13 Dynamic Selection – Dynamic Realization
Detect and replace subgraphs in fill unit of trace cache
Original trace:
  …
  ADD r4, r1, #1
  LSR r2, r2, #4
  XOR r5, r4, r2
  LD r3
  ADD r6, r5, r3
  XOR r7, r6, r8
  SHR …
After subgraph selection and insertion:
  …
  LSR r2, r2, #4
  LD r3
  CUSTOM
  SHR …
[Figure: pipeline with I-Cache, Trace Cache, Decode, Execute, and Retire, plus Trace Construction and Subgraph Selection and Insertion]
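The detect-and-replace step can be sketched in Python as a backward growth over the trace's dataflow, stopping when the 4-input/2-output limit would be exceeded. This is only one plausible heuristic, not the algorithm the authors used. The `Instr` type, the `ELIGIBLE` opcode set (shifts are left out purely so the result mirrors the slide), the `live_out` parameter, and all function names are assumptions, and the sketch skips the hazard checks a real fill unit would need before reordering instructions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ELIGIBLE = {"ADD", "SUB", "AND", "OR", "XOR"}   # assumed CCA-eligible opcodes
MAX_INPUTS, MAX_OUTPUTS = 4, 2                  # limits from the CCA design slide


@dataclass(frozen=True)
class Instr:
    op: str
    dst: Optional[str] = None
    srcs: Tuple[str, ...] = ()   # register sources only; immediates omitted


def analyze(trace):
    """Map each source register of each instruction to its producer's index."""
    producers, last_def = [], {}
    for i, ins in enumerate(trace):
        producers.append({s: last_def.get(s) for s in ins.srcs})
        if ins.dst is not None:
            last_def[ins.dst] = i
    return producers, last_def


def io_sets(trace, producers, last_def, chosen, live_out):
    """Return the external inputs and visible outputs of a candidate subgraph."""
    inputs = {(s, p) for i in chosen
              for s, p in producers[i].items() if p not in chosen}
    outputs = set()
    for i in chosen:
        d = trace[i].dst
        used_outside = any(producers[j].get(d) == i
                           for j in range(len(trace)) if j not in chosen)
        if used_outside or (d in live_out and last_def[d] == i):
            outputs.add(d)
    return inputs, outputs


def select_subgraph(trace, live_out):
    """Grow a subgraph backwards from the last eligible instruction."""
    producers, last_def = analyze(trace)
    seeds = [i for i, ins in enumerate(trace) if ins.op in ELIGIBLE]
    if not seeds:
        return set()
    chosen, work = {seeds[-1]}, [seeds[-1]]
    while work:
        i = work.pop()
        for p in producers[i].values():
            if p is None or p in chosen or trace[p].op not in ELIGIBLE:
                continue
            grown = chosen | {p}
            inputs, outputs = io_sets(trace, producers, last_def, grown, live_out)
            if len(inputs) <= MAX_INPUTS and len(outputs) <= MAX_OUTPUTS:
                chosen = grown
                work.append(p)
    return chosen


def replace_subgraph(trace, chosen):
    """Emit one CUSTOM op at the position of the last selected instruction."""
    if len(chosen) < 2:
        return list(trace)
    last = max(chosen)
    out = [ins for i, ins in enumerate(trace[:last]) if i not in chosen]
    out.append(Instr("CUSTOM"))
    out.extend(trace[last + 1:])
    return out


trace = [
    Instr("ADD", "r4", ("r1",)),         # ADD r4, r1, #1
    Instr("LSR", "r2", ("r2",)),         # LSR r2, r2, #4
    Instr("XOR", "r5", ("r4", "r2")),
    Instr("LD", "r3"),                   # address operands not shown on the slide
    Instr("ADD", "r6", ("r5", "r3")),
    Instr("XOR", "r7", ("r6", "r8")),
    Instr("SHR"),                        # operands elided on the slide
]
# Assumption: only r7 is needed after the subgraph.
chosen = select_subgraph(trace, live_out={"r7"})
new_ops = [ins.op for ins in replace_subgraph(trace, chosen)]
```

With these assumptions the four ADD/XOR instructions are collapsed and the remaining order is LSR, LD, CUSTOM, SHR, matching the slide. The selected subgraph reads r1, r2, r3, and r8, which is exactly the 4-input limit.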

14 Simulation
SimpleScalar – ARM instruction set
► 4-wide execution, 1 compute accelerator
► 128 RUU entries
► 32k inst. trace cache, 256 inst. traces
► 5000 cycle selection/insert latency
► L1 I-cache: 32k, 2 way, 2 cycle hit
► L1 D-cache: 32k, 4 way, 2 cycle hit

15 Varying CCA Latency
[Figure: speedup per benchmark for varying CCA latency. SPECint: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf. MediaBench: cjpeg, djpeg, epic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, unepic. Encryption: 3des, blowfish, rc4, sha. An Average bar is also shown.]

16 Static Selection – Dynamic Realization
Compiler selects subgraphs offline
Communicated to the hardware at load time
► Control bits stored in a table and inserted at decode
Original code:
  …
  ADD r4, r1, #1
  LSR r2, r2, #4
  XOR r5, r4, r2
  LD r3
  ADD r6, r5, r3
  XOR r7, r6, r8
  SHR …
Code with the subgraph marked by the compiler:
  …
  LSR r2, r2, #4
  LD r3
  CCA_Start #2
  ADD r4, r1, #1
  XOR r5, r4, r2
  ADD r6, r5, r3
  XOR r7, r6, r8
  CCA_End
  SHR …
[Figure: pipeline with I-Cache, Control Table, Decode, Execute, and Retire]
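A minimal Python sketch of the decode-time substitution described on this slide follows. It assumes that the operand of `CCA_Start` (the `#2` above) indexes the control table filled at load time; the slide does not state this outright. The `decode` function, the tuple format, and the control-bit value are hypothetical.

```python
def decode(stream, control_table):
    """Collapse each CCA_Start #n ... CCA_End region into one CCA operation."""
    out, region_id = [], None
    for text in stream:
        if text.startswith("CCA_Start"):
            region_id = int(text.split("#")[1])    # assumed: table index
        elif text == "CCA_End":
            if region_id is None:
                raise ValueError("CCA_End without a matching CCA_Start")
            out.append(("CCA", control_table[region_id]))
            region_id = None
        elif region_id is None:
            out.append((text, None))               # ordinary instruction
        # Instructions inside a marked region are dropped here:
        # the single CCA operation stands in for them.
    return out


# Hypothetical control bits; the real encoding is not given on the slides.
control_table = {2: 0b1011}
stream = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
]
decoded = decode(stream, control_table)
```

Keeping the original instructions between the markers in the binary means the compiler does the selection work, while the hardware only needs a table lookup at decode.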

17 Dynamic vs. Static Selection
[Figure: speedup per benchmark for Dynamic Selection and Static Selection. SPECint: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf. MediaBench: cjpeg, djpeg, epic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, unepic. Encryption: 3des, blowfish, rc4, sha. An Average bar is also shown.]

18 Summary
Transparent instruction set customization
► Benefits of customization without changing ISA
Presented design of a compute accelerator
► Handles majority of important computation subgraphs in many benchmarks
Developed ways to utilize the accelerator
► Table-based static selection – dynamic realization
► Trace cache based dynamic selection – dynamic realization