Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization


Slide 1: Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization
Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.

Slide 2: A Case for Customization
General purpose processors handle many applications fairly well, but…
  Each application has different requirements
  Need for efficient execution
Impressive design wins through customization
  Performance, power, area
  Up to 3.5x speedup [Hot Chips 16]

Slide 3: Instruction Set Customization
Computationally demanding parts of applications run on special hardware
New instructions use the special hardware
[Figure: dataflow graph of MPY, LD, SHR, XOR, AND, and MOV operations, with a subgraph replaced by a single CUSTOM instruction]

Slide 4: Traditional vs. Transparent Customization
Traditional
  High Non-Recurring Engineering costs (NRE)
    Reverification of core
    Refabricate lithography masks
    Retarget software tool chain
Transparent
  "Universal" accelerator
  No ISA change
[Figure: CPU block diagrams for both approaches; the transparent design adds a Compute Accelerator (CCA) to the CPU]

Slide 5: Design of a Compute Accelerator
Goal: support important computation subgraphs
Array of function units
  Exploits subgraph parallelism
  Allows natural data propagation
[Figure: pipeline with Fetch, Issue, and WB stages; the CCA, an array of FUs with inputs IN 1 and IN 2, sits alongside the ALUs]

Slide 6: CCA Shape (164.gzip)
[Figure: example subgraph from 164.gzip built from Or, And, and Mov operations]
Speaker note: An array of function units derived from the target ISA is the obvious structure for the computational accelerator, so the question is what shape it should take.

Slide 7: CCA Shape (Blowfish)
[Figure: example subgraph from Blowfish built from And, Xor, Add, and Mov operations]

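Slide 8: CCA Utilization
Dynamic % of subgraphs using FU
[Figure: grid of function units, each annotated with the dynamic percentage of subgraphs that use it. The grid layout did not survive extraction. Labels as extracted: 1 2 3 4 5 6 7 (and 8). Values in slide order: 100, 59.0, 22.9, 13.1, 6.5, 4.2, 0.3, 91.1, 50.6, 9.9, 4.1, 0.6, 0.2, 0.0, 57.4, 17.8, 6.3, 2.9, 0.1, 18.5, 8.3, 1.6, 8.7, 2.1, 1.2]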

Slide 9: CCA Operations
Dynamic opcodes in important subgraphs
Excluded mpy/div, load/store, branch
Two main categories – logicals, adds
Subgraphs rarely have more than 3 dependent adds

Opcode    %
Add       28.7
And       12.5
Move      11.7
Sext      10.4
Lshift     9.8
Or         8.7
Xor        5.1
Sub        4.8
Rshift     2.4
Compare    0.4

Slide 10: Proposed CCA Design
4 inputs/2 outputs
Two FU types
  Arith/logic
  Logic
Crossbar between rows
Captures > 99% of important subgraphs
[Figure: CCA block diagram with inputs I1 to I4 and outputs O1 and O2]

Slide 11: Synthesis of CCA
Synopsys design tools, 130nm library

Depth  Configuration          Control (bits)  Delay (ns)  Cell area (mm²)  Subgraphs supported
7      6A-4L-4A-3L-2A-2L-1L   245             5.62        0.48             99.3%
6      6A-4L-4A-3L-2A-1L      229             4.56        0.45             95.1%
5      6A-4L-4A-2L-1L         197             3.50        0.40             87.6%
4      6A-4L-3A-2L            172             3.19        0.38             81.8%
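The configuration strings in the synthesis table can be read as rows of function units, and a rough sketch of checking whether a subgraph fits such a configuration may help make that concrete. Everything below is illustrative rather than the authors' mapping algorithm: the names `parse_config` and `place`, the opcode sets, the reading of `A` as an arith/logic row and `L` as a logic-only row, and the assumption that a value can be forwarded to any later row are all assumptions. The placement is greedy, so it may reject subgraphs that a smarter mapper could place.

```python
from typing import Dict, List, Optional, Tuple

# Assumed opcode classes; the slides only name "Arith/logic" and "Logic" FUs.
ARITH = {"add", "sub", "cmp"}
LOGIC = {"and", "or", "xor", "mov"}


def parse_config(config: str) -> List[Tuple[int, str]]:
    """Turn '6A-4L-3A-2L' into [(6, 'A'), (4, 'L'), (3, 'A'), (2, 'L')]."""
    return [(int(part[:-1]), part[-1]) for part in config.split("-")]


def place(subgraph: Dict[str, Tuple[str, List[str]]],
          config: str) -> Optional[Dict[str, int]]:
    """Greedily assign each op to the earliest usable row, or return None.

    subgraph maps an op name to (opcode, predecessor names). Predecessors
    that are not ops in the subgraph are treated as external inputs.
    """
    rows = parse_config(config)
    free = [width for width, _ in rows]      # unused FUs left in each row
    placed: Dict[str, int] = {}
    pending = dict(subgraph)
    while pending:
        ready = [name for name, (_, preds) in pending.items()
                 if all(p in placed or p not in subgraph for p in preds)]
        if not ready:
            return None                      # cyclic dependence, not a DAG
        for name in ready:
            opcode, preds = pending.pop(name)
            # An op must sit in a row strictly below all of its producers.
            first = 1 + max((placed[p] for p in preds if p in placed),
                            default=-1)
            for row in range(first, len(rows)):
                kind = rows[row][1]
                usable = opcode in LOGIC or (opcode in ARITH and kind == "A")
                if usable and free[row] > 0:
                    free[row] -= 1
                    placed[name] = row
                    break
            else:
                return None                  # no row can host this op
    return placed


# A dependent chain ADD -> XOR -> ADD -> XOR on the depth-4 configuration.
chain = {
    "add1": ("add", ["r1"]),
    "xor1": ("xor", ["add1", "r2"]),
    "add2": ("add", ["xor1", "r3"]),
    "xor2": ("xor", ["add2", "r8"]),
}
fits = place(chain, "6A-4L-3A-2L")

# Three dependent adds need three arith/logic rows.
three_adds = {"a": ("add", []), "b": ("add", ["a"]), "c": ("add", ["b"])}
deep = place(three_adds, "6A-4L-4A-3L-2A-2L-1L")
shallow = place(three_adds, "6A-4L-3A-2L")
```

Under these assumptions the depth-7 configuration accepts a chain of three dependent adds (it has three `A` rows) while the depth-4 configuration does not, which is consistent with the observation that subgraphs rarely have more than 3 dependent adds.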

Slide 12: CCA Utilization
Design space: selection (static or dynamic) versus realization (static or dynamic)
Static selection, static realization: ASIPs
  – ISA change
  – High NRE
Static selection, dynamic realization
  + Powerful selection
  + Simple hardware
  – Some ISA change
  – Recompile necessary
Dynamic selection, dynamic realization
  + No ISA change
  + No recompile
  – Simple selection
  – Hardware complexity

Slide 13: Dynamic Selection – Dynamic Realization
Detect and replace subgraphs in fill unit of trace cache
[Figure: pipeline (I-Cache, Decode, Execute, Retire) with a Trace Cache that is filled through Trace Construction followed by Subgraph Selection and Insertion]
Original trace:
    …
    ADD r4, r1, #1
    LSR r2, r2, #4
    XOR r5, r4, r2
    LD r3
    ADD r6, r5, r3
    XOR r7, r6, r8
    SHR
Trace after subgraph replacement:
    …
    LSR r2, r2, #4
    LD r3
    CUSTOM
    SHR
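The replacement shown on this slide can be sketched in software to make the steps explicit: grow a subgraph backward along register dataflow, check that collapsing it is safe, and emit one CUSTOM instruction in its place. This is only a sketch under stated assumptions, not the fill-unit hardware from the talk. The `Inst` type, the function names, the `CCA_OPS` set (shifts and loads are assumed unsupported), the backward-growth heuristic, and the choice to expose only the last member's destination as the CUSTOM output are all hypothetical. Operands that the slide does not show (for `LD` and `SHR`) are left empty.

```python
from typing import List, NamedTuple, Set, Tuple


class Inst(NamedTuple):
    op: str
    dst: str                 # "" when the slide shows no destination
    srcs: Tuple[str, ...]    # registers, or immediates written as "#n"


CCA_OPS = {"ADD", "SUB", "AND", "OR", "XOR", "MOV"}   # assumed FU support


def producer(trace: List[Inst], i: int, reg: str):
    """Index of the most recent instruction before i that writes reg."""
    return next((j for j in range(i - 1, -1, -1) if trace[j].dst == reg), None)


def select_subgraph(trace: List[Inst], seed: int) -> Set[int]:
    """Grow a subgraph backward from trace[seed] along register dataflow."""
    members, work = {seed}, [seed]
    while work:
        i = work.pop()
        for src in trace[i].srcs:
            j = producer(trace, i, src)
            if j is not None and trace[j].op in CCA_OPS and j not in members:
                members.add(j)
                work.append(j)
    return members


def live_ins(trace: List[Inst], members: Set[int]) -> List[str]:
    """Registers read by the subgraph but produced outside of it."""
    ins: List[str] = []
    for i in sorted(members):
        for src in trace[i].srcs:
            if src.startswith("#"):
                continue                          # immediates are not inputs
            if producer(trace, i, src) not in members and src not in ins:
                ins.append(src)
    return ins


def is_legal(trace: List[Inst], members: Set[int]) -> bool:
    """Members sink to the last member's slot; skipped insts must not conflict."""
    for k in range(min(members), max(members)):
        if k in members:
            continue
        earlier = [trace[i] for i in members if i < k]
        reads_member = any(m.dst in trace[k].srcs for m in earlier)
        clobbers = any(trace[k].dst == m.dst or trace[k].dst in m.srcs
                       for m in earlier)
        if reads_member or clobbers:
            return False
    return True


def collapse(trace: List[Inst], members: Set[int]) -> List[Inst]:
    """Replace the members with one CUSTOM at the last member's position."""
    last = max(members)
    out: List[Inst] = []
    for i, inst in enumerate(trace):
        if i == last:
            out.append(Inst("CUSTOM", inst.dst, tuple(live_ins(trace, members))))
        elif i not in members:
            out.append(inst)
    return out


trace = [
    Inst("ADD", "r4", ("r1", "#1")),
    Inst("LSR", "r2", ("r2", "#4")),
    Inst("XOR", "r5", ("r4", "r2")),
    Inst("LD", "r3", ()),            # address operands not shown on the slide
    Inst("ADD", "r6", ("r5", "r3")),
    Inst("XOR", "r7", ("r6", "r8")),
    Inst("SHR", "", ()),             # operands not shown on the slide
]
members = select_subgraph(trace, seed=5)
fits_inputs = len(live_ins(trace, members)) <= 4   # CCA has 4 inputs
rewritten = collapse(trace, members) if is_legal(trace, members) else trace
```

With the slide's trace, the sketch selects the two ADDs and two XORs, finds four live-in registers, and leaves `LSR`, `LD`, `CUSTOM`, `SHR`, matching the rewritten trace on the slide. A real implementation would also need liveness information to decide which intermediate results must remain visible as outputs.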

Slide 14: Simulation
SimpleScalar – ARM instruction set
4-wide execution, 1 compute accelerator
128 RUU entries
32k inst. trace cache, 256 inst. traces
5000 cycle selection/insert latency
L1 I-cache: 32k, 2 way, 2 cycle hit
L1 D-cache: 32k, 4 way, 2 cycle hit

Slide 15: Varying CCA Latency
[Figure: bar chart of speedup (y-axis from 1.00 to 1.45) per benchmark, with one series per CCA latency (legend "Lat": 1, 2, 4, 6, 15). Benchmark groups: SPECint, MediaBench, Encryption. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, mesamipmap, rc4, sha, 3des, blowfish, Average]

Slide 16: Static Selection – Dynamic Realization
Compiler selects subgraphs offline
Communicated to the hardware at load time
Control bits stored in a table and inserted at decode
[Figure: pipeline (I-Cache, Decode, Execute, Retire) with a Control Table]
Original code:
    …
    ADD r4, r1, #1
    LSR r2, r2, #4
    XOR r5, r4, r2
    LD r3
    ADD r6, r5, r3
    XOR r7, r6, r8
    SHR
Compiler-annotated code (only these lines survive in the transcript):
    LD r3
    CCA_Start #2
    CCA_End
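A small sketch may clarify how a control table could be consulted at decode. It is a guess at the mechanism, not the documented design: it assumes the operand of `CCA_Start` (`#2` on the slide) indexes the control table, that the compiler has made the selected subgraph contiguous between `CCA_Start` and `CCA_End`, and that the decoder swaps the whole region for one CUSTOM operation carrying the table's control bits. The function name `decode`, the instruction stream layout, and the control-bit value `0x1A5` are hypothetical placeholders.

```python
from typing import Dict, List


def decode(stream: List[str], control_table: Dict[int, int]) -> List[str]:
    """Collapse each CCA_Start #n ... CCA_End region into one CUSTOM op."""
    out: List[str] = []
    i = 0
    while i < len(stream):
        inst = stream[i]
        if inst.startswith("CCA_Start"):
            entry = int(inst.split("#")[1])        # assumed table index
            end = stream.index("CCA_End", i)       # matching end marker
            out.append(f"CUSTOM ctrl={control_table[entry]:#x}")
            i = end + 1                            # skip the covered region
        else:
            out.append(inst)
            i += 1
    return out


# Control table filled at load time; the value is a made-up placeholder.
control_table = {2: 0x1A5}

# Assumed layout of the annotated code; the slide shows only part of it.
stream = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
    "SHR",
]
decoded = decode(stream, control_table)
```

Because selection happens in the compiler, the hardware side reduces to a table lookup, which is the "simple hardware" advantage listed for static selection.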

Slide 17: Dynamic vs. Static Selection
[Figure: bar chart of speedup (y-axis from 1.00 to 1.45) per benchmark, comparing Dynamic Selection with Static Selection. Benchmark groups: SPECint, MediaBench, Encryption. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, mesamipmap, rc4, sha, 3des, blowfish, Average]

Slide 18: Summary
Transparent instruction set customization
  Benefits of customization without changing ISA
Presented design of a compute accelerator
  Handle majority of important computation subgraphs in many benchmarks
Developed ways to utilize the accelerator
  Table-based static selection – dynamic realization
  Trace cache based dynamic selection – dynamic realization