CHAINSAW: Von-Neumann Accelerators To Leverage Fused Instruction Chains

Presentation transcript:

CHAINSAW: Von-Neumann Accelerators To Leverage Fused Instruction Chains. Amirali Sharifian, Snehasish Kumar, Apala Guha, Arrvindh Shriraman

AXC Challenge 1: Idleness. [Figure: an application DFG mapped onto a spatial fabric.] As accelerator sizes keep growing, keeping all the nodes busy, e.g., through pipelining, becomes challenging, so making the fabric bigger leads to idleness and a static power problem. Larger fabric ⇒ larger dataflow graph ⇒ more dataflow dependencies ⇒ more idleness.

AXC Challenge 2: Data movement. [Figure: energy split on a spatial fabric: ~30% compute, ~70% communication.] Traditionally, moving data was free compared to computation, but that is no longer true: 70% of the energy goes to moving data across the spatial fabric.

Von-Neumann Features. [Figure: a DFG mapped temporally onto an instruction buffer, central register file, and ALU.] Temporal mapping = less idleness. The central register file is the core problem we can solve; we can also reduce the fetch and decode cost by specializing the architecture to only the acceleratable regions of the code.

Our Approach: Fused Instruction Chains. [Figure: a DFG chain (1→2→3) mapped onto a von-Neumann pipeline with a compiler-exposed bypass around the register file.] Temporal mapping = less idleness; bypass = internalized communication.
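
A minimal sketch of the chain idea in Python (not from the slides): inside a fused chain, each operation forwards its result straight to the next one through the compiler-exposed bypass, and only the chain's live-out value is written back to the central register file. The run_chain helper and the toy register names are hypothetical.

    # Toy model of a fused chain: results travel op-to-op through a bypass
    # latch; the register file is read only for chain inputs and written only
    # for the live-out.
    def run_chain(chain, regfile):
        """chain: list of (op, register_inputs, use_bypass) tuples."""
        bypass = None                         # value forwarded from the previous op
        for op, reg_in, use_bypass in chain:
            operands = [regfile[r] for r in reg_in]
            if use_bypass:
                operands.insert(0, bypass)    # forwarded value replaces a register read
            bypass = op(*operands)            # result stays inside the chain
        return bypass                         # only the live-out reaches the register file

    regfile = {"r1": 3, "r2": 4, "r3": 5}
    chain = [
        (lambda a, b: a + b, ("r1", "r2"), False),   # 3 + 4 = 7
        (lambda a, b: a * b, ("r3",),      True),    # 7 * 5 = 35 (the 7 arrives via bypass)
        (lambda a:    a - 1, (),           True),    # 35 - 1 = 34 (no register read at all)
    ]
    regfile["r4"] = run_chain(chain, regfile)        # one register write for three ops
    print(regfile["r4"])                             # 34

In this toy run, two of the three producer-to-consumer values never visit the register file, which is exactly the communication the bypass internalizes.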

Our Approach: Fused Instruction Chains (outline). Do chains exist in a DFG? How do we form the chains? What are the challenges? Modeling and evaluation.

Chains vs. VLIW: chains find dependent instructions (vertical fusion), while VLIW finds independent instructions (horizontal fusion).

Do chains exist in a DFG? 50–80% of the DFG is part of chains of 3 or more operations.

How to form chains? Approach 1: reduce communication. [Figure: DFG nodes 1–3 and 4–6 fused along their dependence edges into chains C1 and C2.] This internalizes communication but may fail to discover ILP.
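
The slides show the outcome but not the algorithm, so here is a hedged sketch of one plausible heuristic that matches this slide's intent: greedily follow dependence edges and grow each chain through a not-yet-chained consumer, so producer-to-consumer traffic stays inside the chain. The DFG encoding and function name are my own illustration, not the paper's compiler pass.

    # Assumed heuristic: walk dependence edges and fuse each node with one
    # unchained consumer; edges left between chains still use the register file.
    def chain_for_communication(dfg):
        """dfg: {node: [consumers]}; nodes assumed listed in topological order."""
        chained, chains = set(), []
        for node in dfg:
            if node in chained:
                continue
            chain, cur = [node], node
            chained.add(node)
            while True:
                nxt = next((c for c in dfg[cur] if c not in chained), None)
                if nxt is None:
                    break
                chain.append(nxt)
                chained.add(nxt)
                cur = nxt
            chains.append(chain)
        return chains

    # DFG from the slide: 1->2->3 and 4->5->6 become chains C1 and C2
    dfg = {1: [2], 2: [3], 3: [], 4: [5], 5: [6], 6: []}
    print(chain_for_communication(dfg))   # [[1, 2, 3], [4, 5, 6]]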

How to form chains? Approach 2: optimize for ILP. [Figure: the same DFG split into chains C1–C3, scheduled so independent operations can issue in parallel.] This keeps the same ILP as the program but increases communication.
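
To make the trade-off between the two approaches concrete, a small sketch that scores the two extremes on the same toy DFG: dependence-following chains internalize every edge, while one-op chains keep the program's full ILP but push every edge back through the register file. The edge-counting metric is my illustration, not the paper's cost model.

    # Count how many dependence edges stay inside chains vs. cross chains
    # (and therefore pay a register-file round trip).
    def count_edges(dfg, chains):
        where = {n: i for i, chain in enumerate(chains) for n in chain}
        total = sum(len(cs) for cs in dfg.values())
        inside = sum(1 for p, cs in dfg.items() for c in cs if where[p] == where[c])
        return inside, total - inside      # (internalized, via register file)

    dfg = {1: [2], 2: [3], 3: [], 4: [5], 5: [6], 6: []}
    print(count_edges(dfg, [[1, 2, 3], [4, 5, 6]]))   # (4, 0): all communication internal
    print(count_edges(dfg, [[n] for n in dfg]))       # (0, 4): full ILP, all communication external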

How much communication is within chains? 40–60% of the communication is localized within chains.

How to extract longer chains? The obstacle: control flow.

How to extract longer chains? Guard the control flow: larger superblocks/paths ⇒ larger chains.
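
A hedged sketch of the guard idea: if-converting a short branch into a data-dependent select lets both sides live in one superblock, so the chain runs straight through where control flow would otherwise have cut it. The functions below are a toy illustration, not Chainsaw's actual compiler transformation.

    # Branchy version: the chain breaks at the control-flow split.
    def branchy(a, b, c):
        t = a + b
        if t > 0:
            u = t * c
        else:
            u = t - c
        return u + 1

    # Guarded version: the predicate becomes data and a select merges the two
    # paths, so the whole function is one straight-line chain.
    def guarded(a, b, c):
        t = a + b
        g = t > 0                        # guard computed as a value
        u = (t * c) if g else (t - c)    # stands in for a hardware select
        return u + 1

    assert branchy(2, 3, 4) == guarded(2, 3, 4) == 21
    assert branchy(-5, 3, 4) == guarded(-5, 3, 4) == -5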

CHAINSAW is an Accelerator. [Figure: an OOO core and Chainsaw sharing the cache and memory hierarchy.] Chainsaw targets only hot paths that are control-free and fit its limited instruction buffer; the rest of the program runs on the main processor.

Multi-Lane CHAINSAW Execution. [Figure: a dataflow graph whose chains (C0, C1, C2, D1, D2) are distributed across two lanes; each lane has its own instruction buffer, and the lanes share a register file.]
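
A rough sketch of one possible lane-assignment policy (my assumption, ignoring inter-chain dependences for brevity): place each chain on the currently least-loaded lane, so the lanes stay busy while chains still execute sequentially inside a lane and communicate only through the shared register file.

    import heapq

    # Assumed policy: longest chains first, each onto the least-loaded lane.
    def assign_chains_to_lanes(chains, num_lanes=2):
        lanes = [[] for _ in range(num_lanes)]
        heap = [(0, i) for i in range(num_lanes)]       # (ops scheduled, lane id)
        for chain in sorted(chains, key=len, reverse=True):
            load, lane = heapq.heappop(heap)
            lanes[lane].append(chain)
            heapq.heappush(heap, (load + len(chain), lane))
        return lanes

    chains = [[1, 2, 3], [4, 5], [6]]                   # hypothetical chains C0..C2
    print(assign_chains_to_lanes(chains))               # [[[1, 2, 3]], [[4, 5], [6]]]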

Chainsaw – Fetch and Decode. [Figure: a chain's operations encoded with the instruction fields Op, IN, WR, FWD L/R, OUT.] Only 13 bits are needed to decode each operation.
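
The slide names the fields (Op, IN, WR, FWD L/R, OUT) and the 13-bit total but not the individual widths, so the widths below are my guess chosen only to sum to 13 bits; the packing helpers are illustrative, not Chainsaw's actual encoding.

    # Hypothetical 13-bit micro-op layout; field names are from the slide,
    # field widths are assumed.
    FIELDS = [("op", 6), ("in_sel", 2), ("wr", 1), ("fwd", 2), ("out_sel", 2)]  # 13 bits

    def encode(**values):
        word, shift = 0, 0
        for name, width in FIELDS:
            v = values.get(name, 0)
            assert 0 <= v < (1 << width), f"{name} out of range"
            word |= v << shift
            shift += width
        return word

    def decode(word):
        fields, shift = {}, 0
        for name, width in FIELDS:
            fields[name] = (word >> shift) & ((1 << width) - 1)
            shift += width
        return fields

    w = encode(op=5, in_sel=1, wr=1, fwd=2, out_sel=1)
    assert decode(w) == {"op": 5, "in_sel": 1, "wr": 1, "fwd": 2, "out_sel": 1}
    print(f"{w:013b}")    # the whole operation fits in 13 bits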

Evaluation – Dynamic Energy. [Chart: dynamic energy of Chainsaw vs. OOO-4 vs. CGRA 8x8.] Chainsaw adds a fetch/decode cost to the dynamic energy (F/D cost = 8%), while the CGRA's network overhead dominates its energy. Chainsaw uses 45% less dynamic energy than the 4-way OOO core and 14% less than the CGRA 8x8.

Evaluation – Data movement energy. [Chart vs. CGRA 8x8.] Chainsaw reduces data-movement energy by 40%.

Evaluation – Performance. [Chart vs. CGRA 8x8.] Chainsaw performs within 73% of the CGRA 8x8 and 20% better than the OOO core.

Chainsaw is a Von-Neumann accelerator that chains sequentially dependent operations. The Chainsaw accelerator exploits the lack of ILP, reduces communication energy, and reuses functional units. Energy < CGRA; performance ≃ CGRA.

Q&A. github.com/sfu-arch/chainsaw

AXC Challenge 2: Data movement. [Figure: compute vs. switch energy on a spatial fabric.] Traditionally, moving data was free compared to computation, but that is no longer true: data movement on the spatial fabric carries a 50% energy overhead.

Evaluation – Data movement energy. Energy is reduced in Chainsaw: it internalizes 50%+ of the communication.

Evaluation – Dynamic Energy. [Chart: Chainsaw F/D cost vs. OOO-4 vs. CGRA 8x8.] Chainsaw adds a fetch/decode cost to the dynamic energy, while the CGRA's network overhead dominates. Chainsaw uses 13% less dynamic energy than the CGRA and 45% less than the 4-way OOO core.