CHAINSAW: Von-Neumann Accelerators To Leverage Fused Instruction Chains

Presentation transcript:

CHAINSAW: Von-Neumann Accelerators To Leverage Fused Instruction Chains. Amirali Sharifian, Snehasish Kumar, Apala Guha, Arrvindh Shriraman

AXC Challenge 1: Idleness. [Figure: an application DFG mapped onto a spatial fabric.] As accelerator sizes keep growing, keeping all the nodes busy, e.g., through pipelining, becomes challenging, so making the fabric bigger leads to idleness and a static power problem. Larger fabric ⇒ larger dataflow graph ⇒ more dataflow dependencies ⇒ more idleness.

AXC Challenge 2: Data movement. [Figure: energy split on a spatial fabric: ~30% compute, ~70% communication.] Traditionally, moving data was free compared to computation, but that is no longer true: 70% of the energy goes to moving data across the spatial fabric.

Von-Neumann Features. [Figure: a DFG mapped temporally onto an instruction buffer, central register file, and ALU.] Temporal mapping = less idleness. The central register file is the core problem we can solve; we can also reduce the fetch and decode cost by specializing the architecture to only the acceleratable regions of the code.

Our Approach: Fused Instruction Chains. [Figure: a DFG chain (1→2→3) mapped onto a von-Neumann pipeline with a compiler-exposed bypass around the register file.] Temporal mapping = less idleness; bypass = internalized communication.
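
A minimal sketch of the chain idea in Python (not from the slides): inside a fused chain, each operation forwards its result straight to the next one through the compiler-exposed bypass, and only the chain's live-out value is written back to the central register file. The run_chain helper and the toy register names are hypothetical.

    # Toy model of a fused chain: results travel op-to-op through a bypass
    # latch; the register file is read only for chain inputs and written only
    # for the live-out.
    def run_chain(chain, regfile):
        """chain: list of (op, register_inputs, use_bypass) tuples."""
        bypass = None                         # value forwarded from the previous op
        for op, reg_in, use_bypass in chain:
            operands = [regfile[r] for r in reg_in]
            if use_bypass:
                operands.insert(0, bypass)    # forwarded value replaces a register read
            bypass = op(*operands)            # result stays inside the chain
        return bypass                         # only the live-out reaches the register file

    regfile = {"r1": 3, "r2": 4, "r3": 5}
    chain = [
        (lambda a, b: a + b, ("r1", "r2"), False),   # 3 + 4 = 7
        (lambda a, b: a * b, ("r3",),      True),    # 7 * 5 = 35 (the 7 arrives via bypass)
        (lambda a:    a - 1, (),           True),    # 35 - 1 = 34 (no register read at all)
    ]
    regfile["r4"] = run_chain(chain, regfile)        # one register write for three ops
    print(regfile["r4"])                             # 34

In this toy run, two of the three producer-to-consumer values never visit the register file, which is exactly the communication the bypass internalizes.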

Our Approach: Fused Instruction Chains (outline). Do chains exist in a DFG? How do we form the chains? What are the challenges? Modeling and evaluation.

Chains vs. VLIW: chains find dependent instructions (vertical fusion), while VLIW finds independent instructions (horizontal fusion).

Do chains exist in a DFG? 50–80% of the DFG is part of chains of 3 or more operations.

How to form chains? Approach 1: reduce communication. [Figure: DFG nodes 1–3 and 4–6 fused along their dependence edges into chains C1 and C2.] This internalizes communication but may fail to discover ILP.
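
The slides show the outcome but not the algorithm, so here is a hedged sketch of one plausible heuristic that matches this slide's intent: greedily follow dependence edges and grow each chain through a not-yet-chained consumer, so producer-to-consumer traffic stays inside the chain. The DFG encoding and function name are my own illustration, not the paper's compiler pass.

    # Assumed heuristic: walk dependence edges and fuse each node with one
    # unchained consumer; edges left between chains still use the register file.
    def chain_for_communication(dfg):
        """dfg: {node: [consumers]}; nodes assumed listed in topological order."""
        chained, chains = set(), []
        for node in dfg:
            if node in chained:
                continue
            chain, cur = [node], node
            chained.add(node)
            while True:
                nxt = next((c for c in dfg[cur] if c not in chained), None)
                if nxt is None:
                    break
                chain.append(nxt)
                chained.add(nxt)
                cur = nxt
            chains.append(chain)
        return chains

    # DFG from the slide: 1->2->3 and 4->5->6 become chains C1 and C2
    dfg = {1: [2], 2: [3], 3: [], 4: [5], 5: [6], 6: []}
    print(chain_for_communication(dfg))   # [[1, 2, 3], [4, 5, 6]]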

How to form chains? Approach 2: optimize for ILP. [Figure: the same DFG split into chains C1–C3, scheduled so independent operations can issue in parallel.] This keeps the same ILP as the program but increases communication.
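
To make the trade-off between the two approaches concrete, a small sketch that scores the two extremes on the same toy DFG: dependence-following chains internalize every edge, while one-op chains keep the program's full ILP but push every edge back through the register file. The edge-counting metric is my illustration, not the paper's cost model.

    # Count how many dependence edges stay inside chains vs. cross chains
    # (and therefore pay a register-file round trip).
    def count_edges(dfg, chains):
        where = {n: i for i, chain in enumerate(chains) for n in chain}
        total = sum(len(cs) for cs in dfg.values())
        inside = sum(1 for p, cs in dfg.items() for c in cs if where[p] == where[c])
        return inside, total - inside      # (internalized, via register file)

    dfg = {1: [2], 2: [3], 3: [], 4: [5], 5: [6], 6: []}
    print(count_edges(dfg, [[1, 2, 3], [4, 5, 6]]))   # (4, 0): all communication internal
    print(count_edges(dfg, [[n] for n in dfg]))       # (0, 4): full ILP, all communication external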

How much communication is within chains? 40–60% of the communication is localized within chains.

How to extract longer chains? The obstacle: control flow.

How to extract longer chains? Guard the control flow: larger superblocks/paths ⇒ larger chains.
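
A hedged sketch of the guard idea: if-converting a short branch into a data-dependent select lets both sides live in one superblock, so the chain runs straight through where control flow would otherwise have cut it. The functions below are a toy illustration, not Chainsaw's actual compiler transformation.

    # Branchy version: the chain breaks at the control-flow split.
    def branchy(a, b, c):
        t = a + b
        if t > 0:
            u = t * c
        else:
            u = t - c
        return u + 1

    # Guarded version: the predicate becomes data and a select merges the two
    # paths, so the whole function is one straight-line chain.
    def guarded(a, b, c):
        t = a + b
        g = t > 0                        # guard computed as a value
        u = (t * c) if g else (t - c)    # stands in for a hardware select
        return u + 1

    assert branchy(2, 3, 4) == guarded(2, 3, 4) == 21
    assert branchy(-5, 3, 4) == guarded(-5, 3, 4) == -5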

CHAINSAW is an Accelerator. [Figure: an OOO core and Chainsaw sharing the cache and memory hierarchy.] Chainsaw targets only hot paths that are control-free and fit its limited instruction buffer; the rest of the program runs on the main processor.

Multi-Lane CHAINSAW Execution. [Figure: a dataflow graph whose chains (C0, C1, C2, D1, D2) are distributed across two lanes; each lane has its own instruction buffer, and the lanes share a register file.]
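
A rough sketch of one possible lane-assignment policy (my assumption, ignoring inter-chain dependences for brevity): place each chain on the currently least-loaded lane, so the lanes stay busy while chains still execute sequentially inside a lane and communicate only through the shared register file.

    import heapq

    # Assumed policy: longest chains first, each onto the least-loaded lane.
    def assign_chains_to_lanes(chains, num_lanes=2):
        lanes = [[] for _ in range(num_lanes)]
        heap = [(0, i) for i in range(num_lanes)]       # (ops scheduled, lane id)
        for chain in sorted(chains, key=len, reverse=True):
            load, lane = heapq.heappop(heap)
            lanes[lane].append(chain)
            heapq.heappush(heap, (load + len(chain), lane))
        return lanes

    chains = [[1, 2, 3], [4, 5], [6]]                   # hypothetical chains C0..C2
    print(assign_chains_to_lanes(chains))               # [[[1, 2, 3]], [[4, 5], [6]]]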

Chainsaw – Fetch and Decode. [Figure: a chain's operations encoded with the instruction fields Op, IN, WR, FWD L/R, OUT.] Only 13 bits are needed to decode each operation.
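
The slide names the fields (Op, IN, WR, FWD L/R, OUT) and the 13-bit total but not the individual widths, so the widths below are my guess chosen only to sum to 13 bits; the packing helpers are illustrative, not Chainsaw's actual encoding.

    # Hypothetical 13-bit micro-op layout; field names are from the slide,
    # field widths are assumed.
    FIELDS = [("op", 6), ("in_sel", 2), ("wr", 1), ("fwd", 2), ("out_sel", 2)]  # 13 bits

    def encode(**values):
        word, shift = 0, 0
        for name, width in FIELDS:
            v = values.get(name, 0)
            assert 0 <= v < (1 << width), f"{name} out of range"
            word |= v << shift
            shift += width
        return word

    def decode(word):
        fields, shift = {}, 0
        for name, width in FIELDS:
            fields[name] = (word >> shift) & ((1 << width) - 1)
            shift += width
        return fields

    w = encode(op=5, in_sel=1, wr=1, fwd=2, out_sel=1)
    assert decode(w) == {"op": 5, "in_sel": 1, "wr": 1, "fwd": 2, "out_sel": 1}
    print(f"{w:013b}")    # the whole operation fits in 13 bits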

Evaluation – Dynamic Energy. [Chart: dynamic energy of Chainsaw vs. OOO-4 vs. CGRA 8x8.] Chainsaw adds a fetch/decode cost to the dynamic energy (F/D cost = 8%), while the CGRA's network overhead dominates its energy. Chainsaw uses 45% less dynamic energy than the 4-way OOO core and 14% less than the CGRA 8x8.

Evaluation – Data movement energy. [Chart vs. CGRA 8x8.] Chainsaw reduces data-movement energy by 40%.

Evaluation – Performance. [Chart vs. CGRA 8x8.] Chainsaw performs within 73% of the CGRA 8x8 and 20% better than the OOO core.

Chainsaw is a Von-Neumann accelerator that chains sequentially dependent operations. The Chainsaw accelerator exploits the lack of ILP, reduces communication energy, and reuses functional units. Energy < CGRA; performance ≃ CGRA.

Q&A. github.com/sfu-arch/chainsaw

AXC Challenge 2: Data movement. [Figure: compute vs. switch energy on a spatial fabric.] Traditionally, moving data was free compared to computation, but that is no longer true: data movement on the spatial fabric carries a 50% energy overhead.

Evaluation – Data movement energy. Energy is reduced in Chainsaw: it internalizes 50%+ of the communication.

Evaluation – Dynamic Energy. [Chart: Chainsaw F/D cost vs. OOO-4 vs. CGRA 8x8.] Chainsaw adds a fetch/decode cost to the dynamic energy, while the CGRA's network overhead dominates. Chainsaw uses 13% less dynamic energy than the CGRA and 45% less than the 4-way OOO core.