An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Slide 1: An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors
Nathan Clark, Jason Blome, Michael Chu, Scott Mahlke, Stuart Biles*, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.

Slide 2: The Expression Gap
RISC ISAs are lowest common denominator
► Don’t match applications’ computation
► Don’t match hardware capabilities
Need efficient execution
Impressive design wins through customization
► Performance, power, etc.

Slide 3: Customization Gains: Performance
[Bar chart: speedup for DES, AES, Blowfish, MD5, RC4, and SHA; series: OptimoDE (5-issue VLIW, 333 MHz) and OptimoDE + Custom ISA]

Slide 4: Traditional ISA Customization
Demanding parts of applications run on special hardware
New instructions use the special hardware
[Diagram labels: instruction sequence XOR, MPY, LD, XOR, SHR, XOR, MOV, AND; instruction sequence CUSTOM, MPY, LD, SHR; blocks CPU and Custom Hardware]

Slide 5: Objectives of Transparent ISA Customization
Increase execution efficiency of processors
Architecture framework for subgraph acceleration
► Create a pipeline with fixed interface
► Design and verify once
Support Plug-and-Play style accelerators
CISC on Demand

Slide 6: Traditional vs. Transparent Customization
Traditional:
Significant ISA change
High NRE
► Verification
► Masks
Control placed in binary
► Software migration
No legacy codes
Transparent:
No ISA change
Baseline CPU unchanged
► Hardware generates control
► Eases software burden
Forward compatible

Slide 7: Architecture Framework
[Diagram labels: Application; Compiler; Instructions (… Subg. … Subg. …); Augments Instruction Stream; Standard Pipeline; Control Generation; Subgraph Execution Unit (Inputs, Outputs)]

Slide 8: Configurable Compute Array (CCA)
Array of function units
Two types of FUs: arith/logic, logic
82% of important subgraphs
Crossbar between rows
3.19 ns critical path
0.61 mm² in 0.13 µm
[Diagram labels: I1, I2, I3, I4 (inputs); O1, O2 (outputs)]
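
The row-structured array described on this slide can be sketched as a small feasibility check in Python. This is a rough model under stated assumptions, not the authors' design: the row count and widths in `ROWS`, the opcode sets, the rule that values may pass through intermediate rows, and the greedy first-fit placement are all hypothetical. Only the two FU types and the 4-input, 2-output labels come from the slide.

```python
# Assumed opcode classes; the slide only names the two FU types.
ARITH_OPS = {"ADD", "SUB"}
LOGIC_OPS = {"AND", "OR", "XOR"}

# Hypothetical array shape: (row kind, number of FUs in the row).
ROWS = [("arith/logic", 4), ("logic", 3), ("arith/logic", 2), ("logic", 1)]
MAX_INPUTS, MAX_OUTPUTS = 4, 2  # from the I1..I4 and O1, O2 labels


def row_supports(kind, op):
    """Logic ops run on either FU type; arithmetic needs an arith/logic FU."""
    return op in LOGIC_OPS or (kind == "arith/logic" and op in ARITH_OPS)


def fits_cca(nodes, n_inputs, n_outputs):
    """Greedy first-fit check that a subgraph maps onto the assumed array.

    nodes: dict mapping node name -> (opcode, [predecessor names]),
           listed in topological order.
    Being greedy, this may reject a subgraph that a smarter mapper could place.
    """
    if n_inputs > MAX_INPUTS or n_outputs > MAX_OUTPUTS:
        return False
    free = [width for _, width in ROWS]
    row_of = {}
    for name, (op, preds) in nodes.items():
        # An op must sit in a later row than every producer it depends on.
        start = max((row_of[p] + 1 for p in preds), default=0)
        for r in range(start, len(ROWS)):
            if free[r] > 0 and row_supports(ROWS[r][0], op):
                free[r] -= 1
                row_of[name] = r
                break
        else:
            return False  # no row can host this op
    return True
```

Under these assumptions, a chain AND, ADD, XOR fits, while three dependent ADDs do not, because only every other row of the assumed array can perform arithmetic.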

Slide 9: Architecture Framework
[Diagram repeated from slide 7]

Slide 10: Compiler
Identify and delineate subgraphs
“Procedural Abstraction” – used in compression
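
As a rough illustration of procedural abstraction, the Python sketch below moves a contiguous run of instructions into a separately labeled routine and leaves a call in its place. The function name `outline`, the `BL` call mnemonic, the surrounding `LDR`/`STR` instructions, and the string representation are assumptions made for illustration; the slide only states that subgraphs are identified and delineated.

```python
def outline(instructions, start, end, label="Subg"):
    """Move instructions[start:end] into a labeled routine ending in RET.

    Returns (main, routine): the main stream with a call replacing the
    outlined run, and the new routine. `BL` as the call is an assumption.
    """
    body = instructions[start:end]
    main = instructions[:start] + [f"BL {label}"] + instructions[end:]
    routine = [f"{label}:"] + body + ["RET"]
    return main, routine


# Hypothetical instruction stream; the middle four operations are the
# candidate subgraph.
program = [
    "LDR r1, [r0]",
    "AND r3, r1, #-4",
    "SEXT r2, r4",
    "AND r2, r2, #3",
    "OR r3, r3, r2",
    "STR r3, [r5]",
]
main, routine = outline(program, 1, 5)
```

A core without an accelerator could run the outlined routine as an ordinary call, which is one way to read the "forward compatible" claim on slide 6.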

Slide 11: Architecture Framework
[Diagram repeated from slide 7]

Slide 12: Control Generation
Subg:
    AND r3, r1, #-4
    SEXT r2, r4
    AND r2, r2, #3
    OR r3, r3, r2
    RET
[Diagram labels: I1, I2, I3, I4; O1, O2]
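
To make the example subgraph concrete, the following Python sketch derives which registers the subgraph reads before writing (its inputs) and which it writes, then evaluates it on sample values. The tuple encoding, the function names, and the 16-bit width assumed for `SEXT` are hypothetical; the slide does not define them. It models the subgraph's dataflow only, not the hardware control generator.

```python
# Hypothetical encoding of the slide's subgraph: (opcode, dest, sources).
# Sources are register names (str) or immediates (int).
SUBGRAPH = [
    ("AND", "r3", ("r1", -4)),
    ("SEXT", "r2", ("r4",)),
    ("AND", "r2", ("r2", 3)),
    ("OR", "r3", ("r3", "r2")),
]

MASK32 = 0xFFFFFFFF


def sign_extend(value, bits=16):
    """Sign-extend the low `bits` of value to 32 bits (width is an assumption)."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value & MASK32


def live_ins_and_defs(subgraph):
    """Return (registers read before being written, registers written)."""
    defined, live_in = [], []
    for _, dest, srcs in subgraph:
        for s in srcs:
            if isinstance(s, str) and s not in defined and s not in live_in:
                live_in.append(s)
        if dest not in defined:
            defined.append(dest)
    return live_in, defined


def execute(subgraph, regs):
    """Evaluate the subgraph on a copy of a register dictionary."""
    regs = dict(regs)
    for op, dest, srcs in subgraph:
        vals = [regs[s] if isinstance(s, str) else s & MASK32 for s in srcs]
        if op == "AND":
            regs[dest] = vals[0] & vals[1]
        elif op == "OR":
            regs[dest] = vals[0] | vals[1]
        elif op == "SEXT":
            regs[dest] = sign_extend(vals[0])
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return regs
```

Under this model the subgraph has two inputs (`r1`, `r4`), which matches the I1 and I2 labels on the slide. With `r1 = 0x1237` and `r4 = 6`, it produces `r3 = 0x1236`.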

Slide 13: Architecture Framework
[Diagram repeated from slide 7]

Slide 14: Pipeline Interface

Slide 15: Evaluation
Ported Trimaran compiler to ARM ISA
► Subgraph identification engine
Synthesized control generator and accelerator
SimpleScalar configured as ARM926EJ-S
► 5 stage pipe, 250 MHz
► 1 cycle 16k I/D caches
► Single issue
► 1 cycle subgraph execution latency
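
A quick arithmetic check relates the 250 MHz clock listed here to the 3.19 ns CCA critical path reported on slide 8. The Python sketch below compares only those two numbers; ignoring register setup time, wiring, and operand delivery is a simplifying assumption, and the variable names are made up for illustration.

```python
clock_mhz = 250.0            # simulated core clock (slide 15)
cca_critical_path_ns = 3.19  # CCA critical path (slide 8)

# One clock period in nanoseconds: 1000 / frequency in MHz.
cycle_ns = 1000.0 / clock_mhz
slack_ns = cycle_ns - cca_critical_path_ns
fits_in_one_cycle = cca_critical_path_ns <= cycle_ns

print(f"cycle = {cycle_ns:.2f} ns, slack = {slack_ns:.2f} ns, "
      f"fits in one cycle: {fits_in_one_cycle}")
```

The period works out to 4 ns, leaving about 0.81 ns of slack, which is consistent with the 1 cycle subgraph execution latency used in the evaluation.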

Slide 16: Performance Results
[Bar chart: speedup per benchmark; chart note: IPC on a single-issue core]
SPECint: gzip, 181.mcf, 197.parser, 256.bzip2, 300.twolf
MediaBench: cjpeg, djpeg, epic, unepic, g721encode, g721decode, gsmencode, gsmdecode, pegwitenc, pegwitdec, rawcaudio, rawdaudio
Encryption: blowfish, md5, rc4, Rijndael, sha

Slide 17: Plug-and-Play Benefits
Baseline Area: 0.61 mm²
Baseline Speedup: 1.8

Slide 18: Effect of CCA Pipelining
Average:

Slide 19: Conclusions
Expression gap between ISAs and computation
► Inherent inefficiency
Transparent ISA Customization
► Fixed core → low NRE
► Plug-and-Play accelerators
► Enables “CISC on demand”
1.8x speedup for 15% area overhead

Slide 20: Questions?
More info: