RAMP: Resource-Aware Mapping for CGRAs


RAMP: Resource-Aware Mapping for CGRAs
Shail Dave, Mahesh Balasubramanian, Aviral Shrivastava
Compiler Microarchitecture Lab, Arizona State University

Coarse-Grained Reconfigurable Array (CGRA)
An array of Processing Elements (PEs); each PE has an ALU-like functional unit that executes one operation every cycle. Array configurations vary in terms of:
- Array size
- Register file architecture
- Functional units
- Interconnect network
Quick facts: CGRAs can achieve power efficiency of several tens of GOps/sec per Watt.
- ADRES CGRA: up to 60 GOps/sec per Watt [IMEC, HiPEAC 2008]
- HyCUBE: about 63 MIPS/mW [Karunaratne M. et al., DAC 2017]
- Popular in embedded systems and multimedia [Samsung SRP processor]
5/19/2019

Mapping Loops on CGRAs
Sample loop:
B = 0;
for (i = 0; i < 1000; i++) {
  A = B - 4;
  B = A + L;
  C = A * 3;
  D = C + 7;
}
Its data dependence graph (DDG) has four operations: a, b, c, d.
Iterative Modulo Scheduling: a new loop iteration begins every II cycles [Bob Rau, MICRO 1994]. Software pipelining: operations from 2 different iterations execute simultaneously (e.g., i=0 and i=1 with II = 2).
4 operations to map on 2 PEs (a 1x2 CGRA) => the minimum initiation interval (MII) is 2 cycles.
The code generation battle:
- Performance (loop execution time) critically depends on the mapping obtained by the compiler.
- The mapping problem boils down to a routing problem: in the temporal-spatial solution space, if all dependent operations are placed on PEs that can directly communicate the resulting values in time, mapping (placement) is trivial.
- Routing is needed when dependent operations are scheduled at distant times, or when operations cannot be mapped due to resource constraints.
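The MII bound used on this slide can be sketched as a small helper. The function names and the recurrence-MII inputs are illustrative assumptions, not the papers' actual code:

```python
from math import ceil

def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource-constrained MII: each of the num_pes PEs issues one
    operation per cycle, so num_ops operations need at least
    ceil(num_ops / num_pes) cycles per iteration."""
    return ceil(num_ops / num_pes)

def rec_mii(cycles) -> int:
    """Recurrence-constrained MII: for each cycle in the DDG, total
    latency divided by total loop-carried distance; take the maximum.
    cycles is a list of (latency, distance) pairs, one per DDG cycle."""
    return max(ceil(lat / dist) for lat, dist in cycles) if cycles else 1

def mii(num_ops, num_pes, cycles=()):
    return max(res_mii(num_ops, num_pes), rec_mii(list(cycles)))
```

For the sample loop, 4 operations on a 1x2 CGRA give ResMII = ceil(4/2) = 2 cycles, matching the slide.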

What Are the Various Routing Strategies?

Routing Data Dependency via PEs
Modulo schedule (II = 3) on a 1x2 CGRA. To achieve a mapping, we need to route a → d. Insert routing operations (copies of a, e.g., ar) into the DDG and place them on empty PE slots (EMS [1], EPIMap [2]).
[1] H. Park et al., Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In PACT, 2008.
[2] M. Hamzeh et al., EPIMap: Using epimorphism to map applications on CGRAs. In DAC, 2012.
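EMS/EPIMap-style PE routing amounts to splitting a DDG edge with a copy node that occupies an otherwise empty PE slot. A minimal sketch, where the dict-of-sets DDG representation and node-naming scheme are assumptions for illustration:

```python
def route_via_pe(ddg, src, dst):
    """Split the dependency src -> dst by inserting a routing (copy)
    operation, so the value can hop through an intermediate PE.
    ddg: dict mapping each node to the set of its successors."""
    router = f"{src}_r{sum(1 for n in ddg if n.startswith(src + '_r'))}"
    ddg[src].discard(dst)   # remove the direct edge
    ddg[src].add(router)    # src -> router
    ddg[router] = {dst}     # router -> dst
    return router

# DDG shaped like the slides' example: a feeds b, c, and (distantly) d
ddg = {"a": {"b", "c", "d"}, "b": set(), "c": {"d"}, "d": set()}
r = route_via_pe(ddg, "a", "d")
```

After the call, a → d has become a → a_r0 → d; scheduling a_r0 one cycle after a lets the value hop through a neighboring PE.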

Routing Data Dependency via Registers
Modulo schedule (II = 2) on a 1x2 CGRA. The dependency is routed through a register file: the value of a is held in a register (R1/R2) across cycles until its consumer reads it (GraphMinor [3], REGIMap [4]).
[3] L. Chen et al., Graph minor approach for application mapping on CGRAs. ACM TRETS, 2014.
[4] M. Hamzeh et al., REGIMap: Register-aware application mapping on CGRAs. In DAC, 2013.
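Whether a dependency can be routed through a register file depends on the value's lifetime relative to II: under modulo scheduling a new copy of the value is produced every II cycles, so overlapping copies each need their own register. A simplified model (the real REGIMap formulation is clique-based; these helpers are illustrative):

```python
def registers_needed(def_cycle: int, use_cycle: int, ii: int) -> int:
    """A value defined at def_cycle and used at use_cycle is live for
    (use_cycle - def_cycle) cycles; with a fresh copy every II cycles,
    ceil(lifetime / II) registers must hold the overlapping copies."""
    lifetime = use_cycle - def_cycle
    return -(-lifetime // ii)  # ceil division

def can_route_via_rf(def_cycle, use_cycle, ii, free_regs):
    """True if the PE's register file has enough free registers."""
    return registers_needed(def_cycle, use_cycle, ii) <= free_regs
```

E.g., a value live for 4 cycles at II = 2 needs 2 registers, which matches the backup slide's observation that e → a is routable only if the PE has 2 registers.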

Routing Data Dependency via Memory
Modulo schedule (II = 3) on a 1x2 CGRA. The dependency a → d is routed through memory: a store operation (Sa) writes the value of a after it is produced, and a load operation (La) reads it back before d consumes it [5].
[5] S. Yin et al., Memory-aware loop mapping on coarse-grained reconfigurable architectures. IEEE TVLSI, 2016.
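Spilling replaces the direct edge with a store/load pair, plus an ordering constraint that the store is scheduled before the load. A sketch over the same assumed dict-of-sets DDG representation (names Sa/La rendered as S_a/L_a for illustration):

```python
def spill_to_memory(ddg, src, dst):
    """Route src -> dst through memory: replace the direct edge with
    src -> store -> load -> dst. The store->load edge doubles as the
    store-before-load scheduling constraint."""
    store, load = f"S_{src}", f"L_{src}"
    ddg[src].discard(dst)
    ddg[src].add(store)
    ddg[store] = {load}
    ddg[load] = {dst}
    return store, load

ddg = {"a": {"b", "c", "d"}, "b": set(), "c": {"d"}, "d": set()}
spill_to_memory(ddg, "a", "d")
```

The two extra memory operations now also need PE slots, which is why (as the backup slides note) spilling without re-scheduling can itself fail to map.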

Analyzing the Impact of Ad-hoc Routing Strategies
For the top performance-critical loops from 8 MiBench benchmarks:
- Previous techniques failed to obtain mappings for almost all loops when highly constrained by resources.
- The mapping quality obtained is far from the best possible mapping (II = MII), even when the target CGRA has more resources (including manual attempts to achieve a better mapping).

CGRA Code Generation Maze

Valid Mappings (with Nearly Optimal Quality)

RAMP: Resource-Aware Mapping
Mapping attempt (II = 4) on a 1x2 CGRA, DDG with operations a, b, c, d, e. Initially, RAMP routes all dependencies through direct PE communication. One dependency cannot be routed this way and still needs a route.

RAMP: Resource-Aware Mapping
Mapping attempt (II = 4) on a 1x2 CGRA. Initially, RAMP routes all dependencies through direct PE communication and via registers; e.g., the loop-carried value ei-1 is held in register R1 across cycles until it is consumed.

RAMP: Resource-Aware Mapping
Mapping attempt (II = 4) on a 1x2 CGRA. The dependency b → e might be routed by spilling to memory, via distributed registers, or via PEs. Here the DDG is modified by inserting a routing operation br on the b → e edge.

RAMP: Resource-Aware Mapping
Mapping attempt (II = 4) on a 1x2 CGRA. The loop-carried dependency di → ci-2 might likewise be routed by spilling to memory, via distributed registers, or via PEs; here it is spilled, inserting a store (Sd) and a load (Ld) into the DDG. The resulting mapping (II = 4) is a combination of:
- Routing via PEs (br)
- Routing via registers (R1)
- Spilling to memory (Sd/Ld)

Selecting a Routing Alternative — Failure Analysis
- Dependent operations are scheduled at distant times, and a value with a long lifetime cannot be managed in registers → route via PEs, or spill to memory/distributed RFs.
- The dependent operation is a live value that cannot be managed in a register → manage the live value in memory.
- Dependent operations are scheduled at consecutive times, but routing is not possible due to limited interconnect or unavailability of free PEs → re-compute, route via a PE, or re-schedule.
Each alternative implies a graph modification and rescheduling.
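The failure analysis on this slide is essentially a table from failure reason to candidate strategies. A hypothetical encoding (the reason labels and strategy names are assumptions, chosen to mirror the slide's three cases):

```python
# Map each reason a dependency failed to route to the routing
# alternatives worth trying, in the order the slide suggests.
STRATEGIES = {
    "long_lifetime":           ["route_via_pes", "spill_to_memory",
                                "spill_to_distributed_rfs"],
    "unmanageable_live_value": ["spill_to_memory"],
    "no_free_pe_or_link":      ["recompute", "route_via_pe",
                                "reschedule"],
}

def select_alternatives(failure_reason: str):
    """Return the ordered routing alternatives for a failed dependency;
    an unrecognized reason falls back to increasing II."""
    return STRATEGIES.get(failure_reason, ["increase_ii"])
```

Every alternative then triggers a DDG modification and a re-scheduling pass, as the slide notes.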

RAMP Enables Spilling to Distributed RFs
- Based on the schedule-time distance between dependent operations, RAMP determines the number of distributed registers required.
- Before the source operation e of the next iteration (ei+1) overwrites the value, an RF-read operation err is inserted.
- If the destination operation ai is scheduled later than ewr, an RF-write operation ewr is inserted.
- P&R must ensure that err and ewr are mapped onto PEs that do not share an RF.
(Example schedule: II = 5 on a 1x2 CGRA with local RFs.)
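The decision sketched on this slide — fall back to another PE's RF when the local RF cannot hold all live copies — can be modeled roughly as follows. The function, field names, and returned constraint string are illustrative assumptions, not RAMP's actual formulation:

```python
def plan_distributed_rf_route(def_cycle, use_cycle, ii, local_regs):
    """If the value's lifetime needs more registers than the producer's
    local RF offers, plan a move into another PE's RF via an RF-write
    (ewr) and a matching RF-read (err), inserted before the
    next-iteration producer overwrites the value."""
    lifetime = use_cycle - def_cycle
    needed = -(-lifetime // ii)  # ceil(lifetime / II) live copies
    if needed <= local_regs:
        return {"strategy": "local_rf", "regs": needed}
    return {"strategy": "distributed_rf",
            "regs": needed,
            "ops": ["ewr", "err"],
            "constraint": "ewr and err must map to PEs with different RFs"}
```

A value live for 4 cycles at II = 2 fits in a 2-register local RF; stretch the lifetime (or shrink the RF) and the plan switches to the distributed-RF route with the extra ewr/err operations.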

Experimental Setup
Benchmarks: MiBench suite [Guthaus et al., IEEE WWC 2001] (top performance-critical loops)
Compilation: CCF, the CGRA Compilation Framework (built on LLVM 4.0 [Lattner et al., CGO 2004]), optimization level 3
Simulation: cycle-accurate CCF simulator (based on gem5 [Binkert et al., SIGARCH Comp. Arch. News 2011])
- CGRA modeled as a separate core coupled with an ARM Cortex-like processor core
- PEs connected in a 2D torus, performing fixed-point computations
- CGRA accesses 4 kB data memory and 4 kB instruction memory
Techniques evaluated:
- Register-aware mapping: REGIMap [Hamzeh et al., DAC 2013]
- Memory-aware mapping: MEMMap [Yin et al., IEEE TVLSI 2016]
- Resource-aware mapping: RAMP

RAMP Improves CGRA's Acceleration Capability by 2.13×
- With systematic resource exploration, RAMP achieved better mappings, outperforming the state of the art. It spills to memory and/or exploits the distributed registers when resources are limited.
- The generated mappings combine various routing strategies to route different data dependencies.
- RAMP scaled well with the availability of different architectural resources, adapting to the needs of the application by flexibly exploring resources via the various routing strategies.
[Figure: total loops mapped and speedup for RAMP, REGIMap, and MEMMap across architecture configurations with increasing resources.]
RAMP accelerated loops by 23× compared to sequential execution, by 2.13× over REGIMap, and by 3.39× over MEMMap.

RAMP Achieves Nearly the Best Possible Mapping
[Figure: mapping quality (higher is better) for Config#7 (4x4 CGRA_LRF4) across gsm_short, gsm_long, susan_smooth, jpeg_enc, adpcm_enc, sha, bitcount, and adpcm_dec; geometric mean 0.91.]
- Since RAMP systematically explored resources, it spilled data to memory only after using the available registers, minimizing routing operations (e.g., jpeg encoding).
- The new routing strategy of utilizing distributed registers yielded better mappings for susan, adpcm, etc.
- To map unmapped operations, RAMP systematically explored the various routing strategies instead of implicitly performing P&R in an ad-hoc manner.
- RAMP achieved better mappings while inserting fewer routing/memory operations. With fewer operations to map, RAMP showed the same computational complexity as other clique-based mapping heuristics, e.g., REGIMap and MEMMap.

Summary
- The goodness of the obtained mapping critically depends on how efficiently the compiler can route the data dependencies.
- Existing mapping techniques make poor use of the routing resources: they first schedule the DDG and then attempt P&R, so routing is internal to P&R and carried out in an ad-hoc manner. Hence operations may not be mapped due to resource constraints, or the code quality is poor.
- RAMP models the various routing strategies explicitly. It systematically and flexibly explores the available architectural resources for routing data dependencies, enabling effective utilization of distributed registers and spilling to memory.
- Failure analysis allows RAMP to systematically choose among routing alternatives to map a data dependency. With a comprehensive problem formulation, RAMP exploits heterogeneous architectural resources.
- RAMP accelerated the top performance-critical loops of MiBench by 23× over sequential execution and by 2.13× over state-of-the-art techniques.

Thank you!

Additional Slides: Inefficiencies of the Mapping Heuristics with Ad-hoc Routing Strategies

Routing Data Dependency via PEs
Schedule (II = 3) on a 1x2 CGRA. To achieve a mapping, we need to route a → e. Insert a routing operation (ar) and place it on an empty PE slot (EMS [1], EPIMap [2]). Since ar is not rescheduled, a → e cannot be routed; rescheduling (plus path-sharing) can enable a mapping.
[1] H. Park et al., Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In PACT, 2008.
[2] M. Hamzeh et al., EPIMap: Using epimorphism to map applications on CGRAs. In DAC, 2012.

Routing Data Dependency via Registers
GraphMinor [3] and REGIMap [4] allow routing a dependency via registers. To route a data dependency via the register file of a PE, both dependent operations (e and a) must be placed on the same PE. Typically, for a CGRA with local/distributed registers, the target PE must have enough registers available to route the dependency; e.g., the dependency e → a is routed only if PE1 has 2 registers.
Challenge: routing dependencies by efficiently utilizing the distributed registers of different PEs.
[3] L. Chen et al., Graph minor approach for application mapping on CGRAs. ACM TRETS, 2014.
[4] M. Hamzeh et al., REGIMap: Register-aware application mapping on CGRAs. In DAC, 2013.

Routing Data Dependency via Memory
MEMMap [5] statically decides to keep values with long lifetimes in memory, avoiding the need to re-schedule the DDG.
Challenges:
- Routing a data dependency via memory (spilling) requires additional memory operations, and without re-scheduling the DDG they might not be mappable.
- With sufficient registers and PEs, such dependencies might be better managed via registers.
- For a resource-constrained CGRA target, a value may have to be spilled even though its lifetime is less than the pre-set threshold.
[5] S. Yin et al., Memory-aware loop mapping on coarse-grained reconfigurable architectures. IEEE TVLSI, 2016.

Additional Slides: A High-Level Overview of RAMP

RAMP: Resource-Aware Mapping
RAMP partitions mapping into 3 sub-problems: (1) re-schedule, (2) place & route, (3) failure analysis, which systematically explores the routing strategies (spill to memory, spill to other distributed RFs, re-compute, route via PE, change schedule time, load read-only data from memory).
- For example, we can choose to first map the DDG with routing via registers. Then, for any unmapped data dependency, explore the different routing options.
- For the selected routing strategy, the DDG is modified and re-scheduled. Additional constraints apply; e.g., for spilling data to memory, a store must occur before a load.
- Check whether the chosen strategy(ies) routed the targeted data dependency. If multiple strategies succeed, choose the one that maps more operations and requires the fewest PEs.
- Take the modified, mapped graph and try mapping the remaining dependencies at the target II. If no strategy can route a dependency, increase II by 1 and start over.
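The outer loop described on this slide — try a target II, generate modified DDGs per routing strategy when a dependency fails, and bump II when all alternatives are exhausted — can be sketched as a small driver. Everything here (function signatures, the callback shapes) is an illustrative assumption, not RAMP's actual implementation:

```python
def ramp_map(ddg, mii, max_ii, try_map, alternatives):
    """Outer mapping loop, starting at II = MII.
    try_map(ddg, ii) -> (mapping, failed_edge_or_None);
    alternatives(ddg, edge) -> modified DDGs, one per routing strategy
    (via PEs, via registers, spill to memory, re-compute)."""
    for ii in range(mii, max_ii + 1):
        candidates = [ddg]
        while candidates:
            g = candidates.pop()
            mapping, failed = try_map(g, ii)
            if failed is None:
                return ii, mapping          # all dependencies routed
            candidates.extend(alternatives(g, failed))
    return None                             # give up beyond max_ii

# Toy instantiation: pretend mapping only succeeds once II reaches 3.
ii, m = ramp_map(
    {"a": {"b"}}, mii=2, max_ii=4,
    try_map=lambda g, ii: (("ok", ii), None) if ii >= 3 else (None, ("a", "b")),
    alternatives=lambda g, edge: [],
)
```

In the toy run, II = 2 fails with no alternatives left, so the driver increases II and succeeds at II = 3, mirroring the slide's "increase II by 1 and start over" rule.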