EPIMap: Using Epimorphism to Map Applications on CGRAs

Slides:



Advertisements
Similar presentations
CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.
Advertisements

4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.
CML CML Presented by: Aseem Gupta, UCI Deepa Kannan, Aviral Shrivastava, Sarvesh Bhardwaj, and Sarma Vrudhula Compiler and Microarchitecture Lab Department.
Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.
5th International Conference, HiPEAC 2010 MEMORY-AWARE APPLICATION MAPPING ON COARSE-GRAINED RECONFIGURABLE ARRAYS Yongjoo Kim, Jongeun Lee *, Aviral Shrivastava.
University of Michigan Electrical Engineering and Computer Science 1 Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution.
CML Enabling Multithreading on CGRAs Reiley Jeyapaul, Aviral Shrivastava 1, Jared Pager 1, Reiley Jeyapaul, 1 Mahdi Hamzeh 12, Sarma Vrudhula 2 Compiler.
University of Michigan Electrical Engineering and Computer Science 1 Reducing Control Power in CGRAs with Token Flow Hyunchul Park, Yongjun Park, and Scott.
LCTES 2010, Stockholm Sweden OPERATION AND DATA MAPPING FOR CGRA’S WITH MULTI-BANK MEMORY Yongjoo Kim, Jongeun Lee *, Aviral Shrivastava ** and Yunheung.
© ACES Labs, CECS, ICS, UCI. Energy Efficient Code Generation Using rISA * Aviral Shrivastava, Nikil Dutt
Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine.
Compilation Techniques for Energy Reduction in Horizontally Partitioned Cache Architectures Aviral Shrivastava, Ilya Issenin, Nikil Dutt Center For Embedded.
Synergy.cs.vt.edu Power and Performance Characterization of Computational Kernels on the GPU Yang Jiao, Heshan Lin, Pavan Balaji (ANL), Wu-chun Feng.
Exploring the Tradeoffs of Configurability and Heterogeneity in Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable.
An Energy-Efficient Reconfigurable Multiprocessor IC for DSP Applications Multiple programmable VLIW processors arranged in a ring topology –Balances its.
Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula School of Computing, Informatics, and Decision Systems Engineering Arizona State University June 2013.
Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science.
University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Archs, VHDL 3 Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.
Efficient Mapping onto Coarse-Grained Reconfigurable Architectures using Graph Drawing based Algorithm Jonghee Yoon, Aviral Shrivastava *, Minwook Ahn,
I2CRF: Incremental Interconnect Customization for Embedded Reconfigurable Fabrics Jonghee W. Yoon, Jongeun Lee*, Jaewan Jung, Sanghyun Park, Yongjoo Kim,
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
CML REGISTER FILE ORGANIZATION FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURES (CGRAs) Dipal Saluja Compiler Microarchitecture Lab, Arizona State University,
Bypass Aware Instruction Scheduling for Register File Power Reduction Sanghyun Park, Aviral Shrivastava Nikil Dutt, Alex Nicolau Yunheung Paek Eugene Earlie.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
CML SSDM: Smart Stack Data Management for Software Managed Multicores Jing Lu Ke Bai, and Aviral Shrivastava Compiler Microarchitecture Lab Arizona State.
Jason Li Jeremy Fowers 1. Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory.
DTM and Reliability High temperature greatly degrades reliability
Analysis of Cache Tuner Architectural Layouts for Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing.
OPTIMIZING DSP SCHEDULING VIA ADDRESS ASSIGNMENT WITH ARRAY AND LOOP TRANSFORMATION Chun Xue, Zili Shao, Ying Chen, Edwin H.-M. Sha Department of Computer.
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
University of Michigan Electrical Engineering and Computer Science 1 Compiler-directed Synthesis of Multifunction Loop Accelerators Kevin Fan, Manjunath.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
Hyunchul Park†, Kevin Fan†, Scott Mahlke†,
CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.
CML Branch Penalty Reduction by Software Branch Hinting Jing Lu Yooseong Kim, Aviral Shrivastava, and Chuan Huang Compiler Microarchitecture Lab Arizona.
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
1 Architecture of Datapath- oriented Coarse-grain Logic and Routing for FPGAs Andy Ye, Jonathan Rose, David Lewis Department of Electrical and Computer.
Optimizing Interconnection Complexity for Realizing Fixed Permutation in Data and Signal Processing Algorithms Ren Chen, Viktor K. Prasanna Ming Hsieh.
Computer Organization Exam Review CS345 David Monismith.
Hang Zhang1, Xuhao Chen1, Nong Xiao1,2, Fang Liu1
Scalable Register File Architectures for CGRA Accelerators
Floating-Point FPGA (FPFPGA)
Adaptive Cache Partitioning on a Composite Core
Ph.D. in Computer Science
Evaluating Register File Size
James Coole PhD student, University of Florida Aaron Landy Greg Stitt
SmartCell: A Coarse-Grained Reconfigurable Architecture for High Performance and Low Power Embedded Computing Xinming Huang Depart. Of Electrical and Computer.
Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang Cornell University
CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
CSE-591 Compilers for Embedded Systems Code transformations and compile time data management techniques for application mapping onto SIMD-style Coarse-grained.
Hyperthreading Technology
Verilog to Routing CAD Tool Optimization
Hyunchul Park, Kevin Fan, Manjunath Kudlur,Scott Mahlke
Computer Architecture
Summary of my Thesis Power-efficient performance is the need of the future CGRAs can be used as programmable accelerators Power-efficient performance can.
URECA: A Compiler Solution to Manage Unified Register File for CGRAs
Dynamic Code Mapping Techniques for Limited Local Memory Systems
Embedded Computer Architecture 5SIA0 Overview
Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt
Reiley Jeyapaul and Aviral Shrivastava Compiler-Microarchitecture Lab
1. Arizona State University, Tempe, USA
What time is it?. What time is it? Major Concepts: a data structure model: basic representation of data, such as integers, logic values, and characters.
Code Transformation for TLB Power Reduction
Duo Liu, Bei Hua, Xianghui Hu, and Xinan Tang
RAMP: Resource-Aware Mapping for CGRAs
CS295: Modern Systems: Application Case Study Neural Network Accelerator – 2 Sang-Woo Jun Spring 2019 Many slides adapted from Hyoukjun Kwon‘s Gatech.
What Are Performance Counters?
Presentation transcript:

EPIMap: Using Epimorphism to Map Applications on CGRAs Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula School of Computing, Informatics, and Decision Systems Engineering Arizona State University June 2012 This work was supported in part by NSF IUCRC Center for Embedded Systems under Grant DWS-0086, by the Science Foundation of Arizona (Grant SRG 0211-07), and by the Stardust Foundation.

Accelerators for Energy Efficiency Demand for performance Power consumption Technology scaling Processor Accelerator Shared Memory 50 100 150 200 250 1 10 DRESC CGRA Intel Core i7 NVIDIA Tesla™ c2050 Power (W) Giga Ops per Sec 60 GOpS/W 1.4 GOpS/W 4.3 GOpS/W

Coarse-grained Reconfigurable Architectures Time 1 2 3 4 1 2 3 4 2D array of Processing Elements (PEs) ALU + local register File -> PE Mesh interconnection Shared data bus PE inputs: 4 Neighboring PEs Local register file Data memory 1 2 3 4

II is the performance metric What to Map on CGRA and How? Time 4 b 1 2 3 4 1 2 3 4 a a a a a b b b b b b 1 2 3 4 c d 1 2 3 4 1 b II is the performance metric c c c c d d d d 2 1 2 3 4 1 2 3 4 f f f f e e e e 2 g g g g 1 2 3 4 1 2 3 4 3

Re-computation can lead to better mapping Re-computation is necessary 1 2 3 4 1 2 3 4 4 3 a a a a b 1 2 3 4 1 2 3 4 b 1 b b b b b b b Re-computation can lead to better mapping 1 2 3 4 2 1 2 3 4 c c c c c c c c d d d d d d d d d d 2 e e e e e e e e e e f f f f f f f f f f 1 2 3 4 1 2 3 4 3 Re-computation is necessary

Re-Computation and Routing 3 1 2 3 4 1 2 3 4 1 2 3 4 b b b b 1 2 3 4 b b b b a 1 2 b 1 2 3 4 d c e d b b 1 2 3 4 c 2 f d b d c c c d e f 1 2 3 4 e e 1 2 3 4 f f f e 3

Related Works and Contributions Several CGRAs architectures been designed XPP, PADDI, PipeBench, KressArray etc. Survey in [Harstentien 2001] Compilers for CGRA EMS [Park 2008], Semi-simulated annealing based [Mei 2004] , Simulated annealing based [Hatanaka 2007, Friedman 2009] Use routing to resolve resource limitation problem No techniques exist that exploit re-computation for mapping. Contributions of this work General problem formulation Re-computation, routing, or both for resource limitation problem Application mapping heuristic EPIMap More accurate MII extraction Resource aware routing Efficient placement (Maximum Common Subgraph problem) Use information from unsuccessful attempts for next mapping

Experimental Setup Loops from SPEC2006 and multimedia benchmarks 4 × 4 CGRA with enough instruction and data memory Shared data bus for each row Latency is 1 cycle and 2 registers at PEs EMS[Park 2006] and BCEMS (best among 500 runs)

EPIMap improves performance on average by 2.8X more than EMS Mapping Results The lower II, the better performance EPIMap improves performance on average by 2.8X more than EMS 2.2X less than BCEMS 2.8X less than EMS

EPIMap finds MII in 9 out of 14 loops Achieved II vs. Minimum II Minimum II may be not achievable Relative to MII= 𝑀𝐼𝐼 𝐴𝑐ℎ𝑖𝑒𝑣𝑒𝑑 𝐼𝐼 The higher the value, the closer to optimum II EPIMap finds MII in 9 out of 14 loops 92.9%

Reasonable Running time 252 38

Summary Accelerators for energy efficiency Coarse-grained reconfigurable architecture, a programmable accelerator Contributions Problem formulation Re-computation, routing, or both EPIMap Better mappings 2.8X performance improvement Optimum mapping in 9 out of 14 Reasonable compilation time