URECA: A Compiler Solution to Manage Unified Register File for CGRAs

Presentation transcript:

URECA: A Compiler Solution to Manage Unified Register File for CGRAs
Shail Dave, Mahesh Balasubramanian, Aviral Shrivastava
Compiler Microarchitecture Lab, Arizona State University

Coarse Grained Reconfigurable Array (CGRA)
An array of Processing Elements (PEs); each PE has an ALU-like functional unit that works on one operation every cycle.
Array configurations vary in terms of:
• Array size
• Register file architectures
• Functional units
• Interconnect network
Quick facts: CGRAs can achieve a power efficiency of several tens of GOps/sec per Watt.
- ADRES CGRA: up to 60 GOps/sec per Watt (HiPEAC 2008)
- HyCUBE, M. Karunaratne et al.: about 63 MIPS/mW (DAC 2017)
Popular in embedded systems and multimedia (e.g., the Samsung SRP processor).
6 December 2018 Shail Dave / Arizona State University

Mapping Loops on CGRAs
Sample loop:
    for (i = 2; i < 1000; i++) {
        a: A[i] = L - 4;
        b: B[i] = A[i] + D[i-2];
        c: C = B[i] * 3;
        d: D[i] = C + 7;
    }
Iterative modulo scheduling: a new loop iteration is initiated every II (initiation interval) cycles. Here, II = 3.
Software pipelining: operations from 2 different loop iterations execute simultaneously.
[Figure: data dependence graph (DDG) with nodes L, a, b, c, d, and the modulo schedule of iterations i-1, i, and i+1 on a 1x2 CGRA over cycles t to t+4.]
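The slide gives II = 3 without showing where the number comes from. The sketch below computes the two textbook lower bounds used in modulo scheduling: a resource bound and a recurrence bound. It rests on assumptions not stated in the slides: every operation has unit latency, the DDG has 5 operations (L, a, b, c, d), and the only recurrence is b → c → d → b with an iteration distance of 2 (from `D[i-2]`). The function names are hypothetical.

```c
#include <assert.h>

/* Ceiling division for a non-negative numerator and positive denominator. */
int ceil_div(int num, int den) {
    assert(num >= 0 && den > 0);
    return (num + den - 1) / den;
}

/* Resource bound: each PE can issue at most one operation per cycle. */
int res_mii(int num_ops, int num_pes) {
    return ceil_div(num_ops, num_pes);
}

/* Recurrence bound for a single dependence cycle: total latency around
 * the cycle divided by the number of iterations the cycle spans. */
int rec_mii(int cycle_latency, int cycle_distance) {
    return ceil_div(cycle_latency, cycle_distance);
}

/* Lower bound on II: the larger of the two bounds (assumes one recurrence). */
int min_ii(int num_ops, int num_pes, int cycle_latency, int cycle_distance) {
    int res = res_mii(num_ops, num_pes);
    int rec = rec_mii(cycle_latency, cycle_distance);
    return res > rec ? res : rec;
}
```

Under these assumptions, `min_ii(5, 2, 3, 2)` gives a resource bound of 3 and a recurrence bound of 2, so the lower bound is 3, which matches the II on the slide for the 1x2 CGRA.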

Recurring Variables are Managed in Rotating RF
In a software-pipelined schedule, the live ranges of successive values of a variable can overlap, e.g., under a loop-carried dependence: b_i (operation b in the i-th iteration) needs d_{i-2}, so 2 different values must be stored in 2 registers.
Register reads and writes always use the same register index: b always reads from R2, and d always writes to R1.
Register values are rotated every II cycles to avoid overwrites.
[Figure: register contents (d_{i-3}, d_{i-2}, d_{i-1}, d_i) for iterations i-1, i, and i+1 over cycles t to t+4 with II = 3; the register values rotate between t+2 and t+3.]
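A small simulation can make the fixed-index behavior concrete. The sketch below is one common way to model a rotating register file, namely a base offset that shifts every II cycles; the slides do not say how the hardware implements rotation, so treat this as an assumption. The file size of 4 and all names are hypothetical. Within each II window the model lets d write R1, then lets b read R2, then rotates.

```c
#include <assert.h>

#define RRF_SIZE 4  /* hypothetical number of rotating registers */

typedef struct {
    int regs[RRF_SIZE];
    int base;  /* rotation offset, updated once every II cycles */
} RotatingRF;

void rrf_init(RotatingRF *rf) {
    for (int i = 0; i < RRF_SIZE; i++) rf->regs[i] = -1;  /* -1 = empty */
    rf->base = 0;
}

/* Map the register index used in the instruction to a physical register. */
int rrf_phys(const RotatingRF *rf, int logical) {
    assert(logical >= 0 && logical < RRF_SIZE);
    return (logical + rf->base) % RRF_SIZE;
}

void rrf_write(RotatingRF *rf, int logical, int value) {
    rf->regs[rrf_phys(rf, logical)] = value;
}

int rrf_read(const RotatingRF *rf, int logical) {
    return rf->regs[rrf_phys(rf, logical)];
}

/* After a rotation, the value that was visible as Rk is visible as Rk+1. */
void rrf_rotate(RotatingRF *rf) {
    rf->base = (rf->base + RRF_SIZE - 1) % RRF_SIZE;
}

/* Run `windows` II-cycle windows. In window w, d of iteration w writes R1
 * (the value is tagged with w), then b of iteration w+1 reads R2.
 * Returns the tag b saw in the last window. */
int rrf_run_windows(int windows) {
    RotatingRF rf;
    int seen = -1;
    rrf_init(&rf);
    for (int w = 0; w < windows; w++) {
        rrf_write(&rf, 1, w);
        seen = rrf_read(&rf, 2);
        rrf_rotate(&rf);
    }
    return seen;
}
```

In window w, b belongs to iteration w+1 and reads the value tagged w-1, i.e., the value d produced two iterations earlier, even though both instructions use constant register indices.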

Not All Variables are Recurring
Recurring (repeatedly written and read):
- Generated during kernel execution (for intermediate use)
- Stored in the rotating registers
Nonrecurring:
- Constants in the program (e.g., live-in values, including larger 32-bit values such as the base address of an array, or signed values)
- Immediates (constants in the instructions). Because CGRA instructions must encode more fields, the immediate field is limited to about 8 to 12 bits.
Example (the slide annotates its operands as recurring, immediate, and constant):
    for (i = 0; i < 1000; i++) {
        sum += series[i];
        count += 1;
        lsb16 = sum & a;
    }
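The immediate-width limit is what pushes larger constants into registers. As a minimal sketch, assuming a signed two's-complement immediate field (the slides give only the 8-to-12-bit range, not the encoding), the check below decides whether a constant can stay in the instruction or needs a nonrotating register. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    OPERAND_IMMEDIATE,       /* encoded directly in the instruction */
    OPERAND_NONROTATING_REG  /* pre-loaded into a nonrotating register */
} OperandKind;

/* True if `value` fits in a signed two's-complement field of `bits` bits. */
bool fits_signed_immediate(int64_t value, int bits) {
    assert(bits >= 1 && bits <= 32);
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return value >= lo && value <= hi;
}

/* Classify a compile-time constant for a given immediate width. */
OperandKind classify_constant(int64_t value, int imm_bits) {
    return fits_signed_immediate(value, imm_bits)
               ? OPERAND_IMMEDIATE
               : OPERAND_NONROTATING_REG;
}
```

With a 12-bit field, the constant 1 in `count += 1` stays an immediate, while a 16-bit mask such as 0xFFFF does not fit and would be treated as a nonrecurring variable.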

Prior Approaches of Managing Variables
Recurring variables (repeatedly read and written) are accessed frequently and are stored in the local rotating RF.
Nonrecurring variables (constants) can be accessed via:
1) On-chip memory
2) Global register file

Unified Register File
- A unified RF can contain both recurring and nonrecurring values.
- The separation is determined by the RF configuration.
- URECA targets a local unified RF*. Local RFs are smaller and scalable.
[Figure: local unified RF* inside a PE. A configuration boundary separates the rotating section (recurring values, e.g., sum0, sum1) from the nonrotating section (nonrecurring variables, e.g., totalCount, inputPtr).]
*M. Hamzeh, Compiler and Architecture Design for Coarse-Grained Programmable Accelerators. 2015
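The configuration boundary can be modeled as a single number that splits one register array into two sections. The sketch below is illustrative only: it assumes the rotating section occupies the low indices, that rotation is a base offset applied inside that section, and that the file has 8 registers. None of these details come from the slides, and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define URF_SIZE 8  /* hypothetical number of registers per PE */

typedef struct {
    int regs[URF_SIZE];
    int rot_size;  /* boundary: [0, rot_size) rotates, the rest is static */
    int base;      /* rotation offset within the rotating section */
} UnifiedRF;

/* Set the boundary for one mapping; fails if the demand exceeds the RF. */
bool urf_configure(UnifiedRF *rf, int rotating_needed, int nonrotating_needed) {
    if (rotating_needed < 0 || nonrotating_needed < 0) return false;
    if (rotating_needed + nonrotating_needed > URF_SIZE) return false;
    rf->rot_size = rotating_needed;
    rf->base = 0;
    return true;
}

/* Translate an instruction's register index to a physical register. */
int urf_phys(const UnifiedRF *rf, int logical) {
    assert(logical >= 0 && logical < URF_SIZE);
    if (logical < rf->rot_size)
        return (logical + rf->base) % rf->rot_size;  /* rotating section */
    return logical;                                  /* nonrotating section */
}

/* Rotate only the rotating section; nonrecurring values stay in place. */
void urf_rotate(UnifiedRF *rf) {
    if (rf->rot_size > 0)
        rf->base = (rf->base + rf->rot_size - 1) % rf->rot_size;
}
```

A loop that needs two rotating and two nonrotating registers would call `urf_configure(&rf, 2, 2)`; a different loop can reuse the same hardware with a different split, as long as the total fits.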

What Must the Compiler Do?
Register allocation is integrated with the place-and-route (P&R) phase of the compiler. To ensure a valid and efficient mapping, the compiler needs to:
- Analyze the register requirements of each operation being mapped and determine how many registers are required in the rotating and nonrotating sections of the RF.
- Keep track of the registers allocated to the operations already mapped.
- Determine the RF configurations for all PEs.
- Generate the code to pre-load nonrecurring variables.
- Generate machine instructions to dynamically configure the RFs of the PEs.

How Does URECA Analyze Register Requirements?
For each operation being mapped, the compiler determines the total registers required in each section.
Nonrotating registers (for nonrecurring variables):
- The compiler determines live-in/live-out values for each variable based on use-definition analysis.
- Any constant value (e.g., 0xFFFF) that is larger than the immediate field of a CGRA instruction is also a nonrecurring variable.
- During allocation, data reuse analysis avoids duplication when the same nonrecurring variable is required by multiple operations mapped on a PE.
Rotating registers (for recurring variables):
- Determined from the modulo schedule times of the dependent predecessor/successor operations.
- Must honor inter- and intra-iteration dependencies.
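One standard way to turn modulo schedule times into a rotating-register count is the lifetime rule from modulo scheduling: a value that stays live for more than II cycles overlaps with its own later instances, so it needs ceil(lifetime / II) registers. The slides do not spell out URECA's exact formula, so the sketch below is an assumption, with hypothetical function names. The example times (d at cycle 4, b at cycle 2 of their own iterations) are a reading of the sample-loop schedule shown earlier in the slides.

```c
#include <assert.h>

/* Cycles between producing a value and its last use by a consumer that
 * runs `distance` iterations later. Schedule times are relative to the
 * start of each operation's own iteration. */
int value_lifetime(int t_producer, int t_consumer, int distance, int ii) {
    assert(ii > 0 && distance >= 0);
    int lifetime = t_consumer + distance * ii - t_producer;
    assert(lifetime > 0);  /* the consumer must run after the producer */
    return lifetime;
}

/* A new instance of the value is produced every II cycles, so
 * ceil(lifetime / II) instances are live at the same time. */
int rotating_regs_needed(int t_producer, int t_consumer, int distance, int ii) {
    int lifetime = value_lifetime(t_producer, t_consumer, distance, ii);
    return (lifetime + ii - 1) / ii;
}
```

For the loop-carried dependence d → b with distance 2 and II = 3, the lifetime is 2 + 2*3 - 4 = 4 cycles, which gives 2 rotating registers, consistent with the two registers described on the rotating-RF slide.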

URECA Improves CGRA’s Acceleration Capability by 1.74x

Summary
- URECA efficiently manages all the variables in the unified RF.
- The compiler promotes variables from memory to CGRA registers.
  - It improves performance by 1.74x compared to a CGRA that accesses nonrecurring variables from the on-chip memory.
  - It reduces register requirements by about 39% and energy consumption by 32%.
- The compiler determines register requirements and allocates registers for both recurring and nonrecurring variables.
- It configures the boundary of the RF, enabling CGRA PEs to flexibly support different register requirements for different mappings of different loops, promoting general-purpose computing on CGRAs.

Thank you!