LASER: A Hardware/Software Approach to Accelerate Complicated Loops on CGRAs
Mahesh Balasubramanian (1), Shail Dave (1), Aviral Shrivastava (1), and Reiley Jeyapaul (2)
1. Arizona State University, Tempe, USA
2. ARM, Cambridge, GB

Accelerators Today
ASICs: poor usability.
FPGAs: reconfigurable, but less power-efficient (fine-grained).
SIMD/vector processors: hard to program, and inefficient if the parallelism is irregular.
GPUs: popular, but unable to efficiently accelerate loops that are non-parallel, contain conditionals, or have low trip counts.
We need an alternative that achieves more acceleration with much higher power efficiency.
11 January 2019

Coarse-Grained Reconfigurable Array (CGRA)
A processing element (PE) contains a functional unit that performs fixed-point arithmetic and logic operations. Intermediate values are stored in registers.
CGRAs are predominantly used to accelerate compute-intensive loops in streaming and multimedia applications (e.g., filtering and decoding in set-top boxes, TVs, and projectors).
[Figure: 4x4 CGRA with a 2-D mesh interconnect.]

Mapping Applications on CGRAs
Compute-intensive loops can be extracted from the application. The CGRA compiler generates a data dependency graph (DDG): each node represents an operation, and each edge represents a dependency between operations. The compiler then maps the DDG onto the CGRA.
Sample loop:
    for (i = 0; i < 100; i++) {
        a = b - 4;    /* A */
        b = a * 2;    /* B */
        c = b * d;    /* C */
        d = c - a;    /* D */
    }
[Figure: data dependency graph with nodes A, B, C, D, and a 1x2 CGRA architecture (PEs 1 and 2).]

Mapping Loops on CGRAs
Iterative modulo scheduling: each operation is executed every II cycles. The initiation interval (II) is the performance metric.
Software pipelining: operations from different loop iterations execute simultaneously.
[Figure: modulo schedule of the DDG (A, B, C, D) on the 1x2 CGRA (PE1, PE2) over time steps 1 to 5, with II = 2.]
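The relation between operation count, PE count, and II can be illustrated with a small C sketch. It computes only the resource-constrained lower bound on II (operations divided by PEs, rounded up); recurrence constraints and routing, which a real mapper must also honor, are ignored. The function name res_min_ii is hypothetical and does not come from the slides.

```c
/* Resource-constrained lower bound on the initiation interval (II).
 * Assumption: every operation occupies one PE for one cycle.
 * Returns -1 for invalid input. */
int res_min_ii(int num_ops, int num_pes)
{
    if (num_ops < 0 || num_pes <= 0)
        return -1;
    return (num_ops + num_pes - 1) / num_pes;   /* ceiling division */
}
```

For the four-operation DDG on the 1x2 CGRA, res_min_ii(4, 2) evaluates to 2, which agrees with the II = 2 shown in the schedule.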

The Challenge: Accelerating Complicated Loops
Perfectly and imperfectly nested loops: [1, 2] execute nested loops with a maximum depth of 2. We assume that nested loops can be flattened into a single loop with conditional statements; therefore, we focus on techniques to execute loops with complex conditionals on CGRAs.
Loops with nested conditionals (if-then-else): [3] accelerates a loop with a single if-then-else construct; [4] allows loops with arbitrary nesting of conditionals, but incurs high overhead.
This paper proposes an efficient way to execute loops with arbitrary nesting of conditionals on CGRAs.
    for (i = 0; i < 100; i++) {
        p = i * 4; q = i * 2;
        if (x > i) {
            r = p - 4; s = q - 2;
            if (y > p) a--;
            else a = 255 - a;
        } else {
            r = p + 4; s = q + 2;
            if (y > i) a++;
            else a = 255 & a;
        }
    }
[1] Y. Kim et al., "Improving performance of nested loops on reconfigurable array processors", TACO '12
[2] D. Liu et al., "Polyhedral Model based Mapping Optimization of Loop Nests for CGRAs", DAC '13
[3] S. RajendranRadhika et al., "Path selection based acceleration of conditionals in CGRAs", DATE '15
[4] H. Kyuseung et al., "Power-efficient predication techniques for acceleration of control flow execution on cgra", ACM TACO '13

Executing Loops with Control Flow on CGRA
Prior techniques to execute loops with an if-then-else are:
Partial predication [1]: executes operations from both the if and else paths simultaneously and selects the right value based on the evaluated condition.
Dual-issue scheme [2]: fetches 2 instructions and issues the right one to the CGRA PE based on the branch outcome.
Path-selection-based approach [3]: fetches 1 instruction selectively based on the branch outcome.
[2, 3] work only for loops with a single conditional statement. For complicated conditional nests, they rely on partial predication. So partial predication is the only viable approach to execute loops with complex control flow.
[1] H. Kyuseung et al., "Power-efficient predication techniques for acceleration of control flow execution on cgra", ACM TACO 2013
[2] M. Hamzeh et al., "Branch-aware loop mapping on CGRAs", DAC 2014
[3] S. RajendranRadhika et al., "Path selection based acceleration of conditionals in CGRAs", DATE 2015

Partial Predication
    for (i = 0; i < 100; i++) {
        a = a + 1; b = b + 1;
        c = a * b; d = b * 2;
        if (x > i) e = c + 1;
        else e = d + 1;
    }
[Figure: DDG with operations a, b, i, c, d, cmp, et, ef, and sel, and its schedule on a 2x2 CGRA (PEs 1 to 4) over time steps 1 to 4, with II = 3.]
9 operations on 4 PEs.
For a conditional (e.g., operation e), 3 operations must be executed instead of just 1.
PEs require predicate registers and muxes.
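As a rough software analogy of partial predication, the C sketch below computes the conditional value e the way the DDG does: both path values (et and ef) are always computed, and a select picks one of them. The function name and the ops_executed counter are hypothetical and serve only to make the overhead of 3 operations visible.

```c
/* Partial predication for the conditional in the example loop.
 * Both paths are computed, then a select keeps the live value.
 * ops_executed (hypothetical) reports the operations spent on e. */
int e_partial_predication(int x, int i, int c, int d, int *ops_executed)
{
    int pred = (x > i);        /* cmp: needed by any scheme, not counted */
    int et = c + 1;            /* true-path value, always computed  */
    int ef = d + 1;            /* false-path value, always computed */
    int e = pred ? et : ef;    /* sel */
    *ops_executed = 3;         /* et, ef, sel */
    return e;
}
```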

Key Idea: Selective Execution
    for (i = 0; i < 100; i++) {
        a = a + 1; b = b + 1;
        c = a * b; d = b * 2;
        if (x > i) e = c + 1;
        else e = d + 1;
    }
[Figure: DDG with partial predication (a, b, i, c, d, cmp, et, ef, sel) next to the DDG with selective execution (a, b, i, c, d, cmp, e).]
Semantics of e: if cmp evaluates to true, e gets its input from c; if cmp evaluates to false, e gets its input from d.
Compiler transformation: the operations in the conditional paths are replaced by a single operation e.
PEs selectively execute only useful computations, based on the branch outcome.
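A minimal C sketch of the same conditional under selective execution follows. It models the branch outcome as choosing which of two instructions is issued, so only one of them runs. The names instr_et, instr_ef, and e_selective are hypothetical, and a function pointer is only a software stand-in for the instruction the fetch hardware would issue.

```c
/* Selective execution: the branch outcome decides which single
 * instruction is issued for the fused operation e. */
typedef int (*instr_fn)(int c, int d);

static int instr_et(int c, int d) { (void)d; return c + 1; }  /* true path  */
static int instr_ef(int c, int d) { (void)c; return d + 1; }  /* false path */

int e_selective(int x, int i, int c, int d)
{
    instr_fn issued = (x > i) ? instr_et : instr_ef;  /* cmp picks one */
    return issued(c, d);                              /* 1 operation runs */
}
```

Compared with partial predication, the value of e is the same, but only one operation is executed for it instead of three.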

New Semantics of cmp
Format: cmp <operator>, <op1>, <op2>, <k>
Example: cmp gt, c, d, 1
If c gt d, issue the first k = 1 instructions to the PEs and skip the next k = 1 instructions.
If c ngt d, skip the first k = 1 instructions and issue the next k = 1 instructions to the PEs.
Instruction memory layout (columns PE1, PE2, PE3, PE4; k = 1):
Row 1: idle, c, d, cmp
Row 2: a, b, et, i (et is issued when cmp = true)
Row 3: ef (issued when cmp = false)
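The issue/skip rule of the extended cmp can be sketched in C as a small address computation. This is a simplified model of one instruction stream, not the actual fetch hardware; the fetch_window type and the select_window function are hypothetical names, and rows are numbered from 1 as in the layout.

```c
/* Rows issued after a cmp with skip distance k.
 * Assumption: the k true-path rows are stored first, directly
 * followed by the k false-path rows. */
typedef struct {
    int first_issued;   /* first row issued after the cmp          */
    int count;          /* number of rows issued (always k)        */
    int resume;         /* first row after both k-row groups       */
} fetch_window;

fetch_window select_window(int row_after_cmp, int cmp_true, int k)
{
    fetch_window w;
    /* true: issue the first k rows; false: skip them, issue the next k */
    w.first_issued = cmp_true ? row_after_cmp : row_after_cmp + k;
    w.count = k;
    w.resume = row_after_cmp + 2 * k;
    return w;
}
```

With the layout above (k = 1, cmp in row 1), a true outcome selects row 2 (et) and a false outcome selects row 3 (ef).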

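How does it work for more complex loops?
Instruction memory layout (columns PE1, PE2, PE3, PE4):
Row 1: q, h, p, a
Row 2: st, gt, rt, idle
Row 3: i, att
Row 4: atf
Row 5: sf, gf, rf
Row 6: aft
Row 7: aff
kh = 3: condition h selects between rows 2 to 4 (h = true) and rows 5 to 7 (h = false).
kg = 1: condition g selects between att (g = true) and atf (g = false), or between aft (g = true) and aff (g = false).
g, s, and r are executed based on condition h.
a is executed based on conditions h and g.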
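One way to read the nested layout is as two stacked row selections, sketched below in C. The sketch assumes that the variants of a are stored directly after the row that evaluates g, which matches the row numbers on the slide but is an interpretation rather than a documented rule. Both function names are hypothetical.

```c
/* Row holding the variant of an operation that depends only on h
 * (s, g, r): the true group starts at first_row, the false group
 * starts kh rows later. */
int row_for_h_dependent(int first_row, int h, int kh)
{
    return h ? first_row : first_row + kh;
}

/* Row holding the variant of a, which depends on both h and g. */
int row_for_a(int first_row, int h, int g, int kh, int kg)
{
    int group = row_for_h_dependent(first_row, h, kh); /* row with gt or gf */
    int first_variant = group + 1;                     /* att or aft */
    return g ? first_variant : first_variant + kg;     /* else atf or aff */
}
```

With first_row = 2, kh = 3, and kg = 1, the four outcomes of (h, g) select rows 3 (att), 4 (atf), 6 (aft), and 7 (aff).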

Instruction Fetch Unit Architecture
Once a PE evaluates a condition, the result <Predicate, k> is communicated to the instruction fetch unit (IFU).
The Conditional Look-aside Buffer (CLB) keeps track of the k values for the conditionals.
If the condition evaluates to false, the hardware looks up the needed k value and the IFU skips k instructions.
[Figure: the Instruction Fetch Unit, containing the PC, the fetch signal generator, and the CLB (storing k). It receives the branch outcome <Predicate, k> from the PEs, emits fetch signals, and interfaces with the core (code and data) and the MMU.]
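The lookup-and-skip behavior can be sketched in C as a small table. The slides do not say how the CLB is indexed or sized, so the per-conditional index cond_id, the 8-entry capacity, and all names below are assumptions made for illustration only.

```c
#define CLB_ENTRIES 8   /* hypothetical capacity */

/* Hypothetical CLB model: one k value per conditional. */
typedef struct {
    int valid[CLB_ENTRIES];
    int k[CLB_ENTRIES];
} clb;

/* Record the k value of a conditional; returns -1 on invalid input. */
int clb_store(clb *t, int cond_id, int k)
{
    if (cond_id < 0 || cond_id >= CLB_ENTRIES || k < 0)
        return -1;
    t->valid[cond_id] = 1;
    t->k[cond_id] = k;
    return 0;
}

/* Next row to fetch: on a false predicate, skip k rows.
 * An unknown conditional causes no skip. */
int next_fetch_row(const clb *t, int row, int cond_id, int predicate)
{
    if (cond_id < 0 || cond_id >= CLB_ENTRIES || !t->valid[cond_id])
        return row;
    return predicate ? row : row + t->k[cond_id];
}
```

For a conditional stored with k = 3, a false predicate at row 2 moves the fetch to row 5, while a true predicate continues at row 2.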

Experimental Setup
We extracted 12 compute-intensive loops from MiBench that are nested and/or contain nested conditionals.
The data dependency graph generation phase is modified to combine the operations inside the true and false paths of the nested conditionals.
LASER is compared with the partial predication scheme, the only viable approach to map loops with nested conditionals.
Performance is evaluated on a 4x4 CGRA and then on larger CGRAs to determine the scalability of LASER.

LASER Improves Performance by 42%
[Figure: performance of LASER versus partial predication; lower is better.]
On a 4x4 CGRA, LASER performs better for loops featuring complex nests of conditionals (up to a depth of 24).
The performance improvement of LASER can be attributed to the smaller number of operations executed on the PEs.

Partial Predication Incurs High Overhead
    for (i = 0; i < 100; i++) {
        p = i * 4; q = i * 2;
        if (x > i) {
            r = p - 4; s = q - 2;
            if (y > p) a--;
            else a = 255 - a;
        } else {
            r = p + 4; s = q + 2;
            if (y > i) a++;
            else a = 255 & a;
        }
    }
[Figure: DDG with partial predication, containing i, p, q, h, the true and false variants of each conditional operation (st/sf, rt/rf, gt/gf, att/atf, aft/aff), and the select operations (s, r, g, at, af, asel).]
20 operations on 4 PEs, i.e., II >= 5.
Useful operations: 9.

Combining the Operations inside Conditional Paths
    for (i = 0; i < 100; i++) {
        p = i * 4; q = i * 2;
        if (x > i) {
            r = p - 4; s = q - 2;
            if (y > p) a--;
            else a = 255 - a;
        } else {
            r = p + 4; s = q + 2;
            if (y > i) a++;
            else a = 255 & a;
        }
    }
[Figure: simplified DDG with i, p, q, h, g, a, and the fused operations <st, sf>, <rt, rf>, <gt, gf>, <att, atf>, and <aft, aff>.]
A fused operation indicates that only one of its operations will be executed, based on the branch outcomes.
Total operations: 20 --> 9.

Mapping of simplified DDG onto CGRA
[Figure: the simplified DDG (i, q, p, h, g, a, r, s, and the fused operations <st, sf>, <rt, rf>, <gt, gf>, <att, atf>, <aft, aff>) with edges labeled "To IFU", and its schedule on the CGRA over time steps 1 to 4, with II = 3.]

LASER Reduces Total Operations by 43%
[Figure: total operations for LASER versus partial predication; lower is better.]
The performance improvement of LASER is due to the reduction in the total number of operations executed. On average, about 43% fewer operations need to be mapped and executed on the CGRA PEs.

LASER is Scalable With Better Performance
[Figure: geomean performance (II) of LASER over partial predication; lower is better.]
The performance improvement of LASER over partial predication is consistent for the 8x8 and 16x16 CGRA configurations.

Summary
Nested loops and loops with nested conditionals are common and should be executed efficiently on CGRAs.
Previous techniques either accelerate only loops with simple control flow or impose a high performance overhead.
LASER is a hybrid approach: the compiler fuses the operations from the various paths of the conditionals, and the IFU architecture selectively issues the right instructions based on branch outcomes.
LASER improves performance by 40% over partial predication and reduces the total operations to be executed by 43%.

Thank you!