Efficient Mapping onto Coarse-Grained Reconfigurable Architectures using Graph Drawing based Algorithm
Jonghee Yoon, Aviral Shrivastava*, Minwook Ahn, Sanghyun Park, Doosan Cho and Yunheung Paek
SO&R Research Group, Seoul National University, Korea
* Compiler and Microarchitecture Lab, Arizona State University, USA

Reconfigurable Architectures
- Reconfigurable = hardware reconfigurable
  - Reuse of silicon real estate
  - Dynamically change the hardware functionality
  - Use only the optimal hardware to execute the application
- High computation throughput
  - Reduced overhead of instruction execution
  - Highly power-efficient execution
- Several kinds of reconfigurable architectures
  - Field Programmable Gate Arrays (FPGAs)
  - Instruction Set Extensions
  - Coarse-Grained Reconfigurable Architectures (CGRAs)

Coarse-Grained Reconfiguration
- FPGAs (Field-Programmable Gate Arrays)
  - Fine-grained reconfigurability
  - Limited application fields: slow clock speed and slow reconfiguration speed
  - S/W development is very difficult
- CGRAs (Coarse-Grained Reconfigurable Architectures)
  - Higher performance in more application fields
  - Operation-level granularity, word-level datapath
  - S/W development is easy

Outline
- Why Reconfigurable Architectures?
- Coarse-Grained Reconfigurable Architectures
- Problem Formulation
- Graph Drawing Algorithm
  - Split & Push
  - Matching-Cut
- Experimental Results
- Conclusion

CGRAs
- A set of processing elements (PEs) (a minimal model sketch follows below)
- PE (or reconfigurable cell, RC, in MorphoSys)
  - Light-weight processor
  - No control unit
  - Simple ALU operations
- Examples: MorphoSys, RSPA, ADRES, etc.
[Figures: MorphoSys RC array; PE structure of RSPA (MUX, ALU, shift logic, register file)]
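To make the PE-array picture concrete, here is a minimal sketch of how such an array might be modeled for mapping purposes, assuming a plain 4x4 2-D mesh; the real RSPA and MorphoSys interconnects add further links, as the later slides point out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PE:
    """A light-weight processing element: simple ALU operations, no control unit."""
    row: int
    col: int

def mesh_cgra(rows=4, cols=4):
    """Build a toy CGRA model: a rows x cols PE grid with 2-D mesh links.
    (Illustrative assumption only; real CGRAs add extra interconnect.)"""
    pes = {(r, c): PE(r, c) for r in range(rows) for c in range(cols)}
    links = set()
    for (r, c) in pes:
        for dr, dc in ((0, 1), (1, 0)):        # right and down neighbours
            if (r + dr, c + dc) in pes:
                links.add(((r, c), (r + dr, c + dc)))
    return pes, links

pes, links = mesh_cgra()
print(len(pes), "PEs,", len(links), "links")   # 16 PEs, 24 links
```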

Application Mapping onto CGRAs
- The compiler has a critical role for CGRAs
  - Analyze the application
  - Map the application to the CGRA
- Two main compiler issues in CGRAs
  - Parallelism: find more parallelism in the application
    - Better use of CGRA features, e.g., s/w pipelining
  - Resource minimization: reduce power consumption, increase throughput, and create more opportunities for further optimizations
    - e.g., power gating of PEs

CGRAs are becoming customized
- Processing element (PE) interconnection
  - A 2-D mesh structure is not enough for high performance
- Shared resources
  - To reduce cost, power, and complexity, multipliers and load/store units can be shared
- Routing PEs
  - In some CGRAs, a PE can be used for routing only
  - Needed to map a node with degree greater than the number of connections of a PE
[Figure: RSPA structure]

Existing compilers assume simple CGRAs
- Various compiler techniques for CGRAs
  - MorphoSys and XPP compilers: can only evaluate simple loops
  - DRESC for ADRES: too long mapping time, low utilization of PEs
  - These do not model complex CGRA designs (shared resources, irregular interconnections, row constraints, memory interface, etc.)
  - AHN et al. for RSPA: spatial mapping, shared multipliers and memory
    - Can only consider 2-D mesh PE interconnection; does not consider PEs as routing resources
- Our contribution
  - We propose a compiler technique that considers irregular PE interconnection, resource sharing, and routing resources

Problem Formulation
- Inputs
  - A kernel DAG K = (V, E) and a CGRA C = (P, L)
- Outputs
  - Mapping M1: V → P (of vertices to PEs)
  - Mapping M2: E → 2^L (of edges to paths)
- Objective: minimize
  - Routing PEs
  - Number of rows (more useful in practice)
- Constraints (see the sketch below)
  - Path existence: consecutive links of a path share a PE (a routing PE)
  - Simple paths (no loops in a path)
  - Uniqueness of routing PEs (a routing PE can route only one value)
  - No computation on routing PEs
  - Shared resource constraints
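The formulation above translates almost directly into data structures. The sketch below is a simplified illustration rather than the paper's implementation: it checks several of the listed constraints for a candidate mapping (path existence over CGRA links, simple paths, uniqueness of each routing PE, no computation on routing PEs). Shared-resource constraints are left out here and handled by the row bound shown later.

```python
def check_mapping(V, E, P, L, M1, M2):
    """Check core constraints of a kernel-DAG -> CGRA mapping (simplified).

    V, E : kernel DAG nodes and edges (E: set of (u, v) pairs)
    P, L : CGRA PEs and links (L: set of (p, q) pairs, treated as undirected)
    M1   : dict node -> PE           (computation placement)
    M2   : dict edge -> list of PEs  (path from M1[u] to M1[v])
    """
    # Every node is placed on its own PE, and every edge has a path.
    if set(M1) != set(V) or not set(M1.values()) <= set(P):
        return False
    if len(set(M1.values())) != len(M1) or set(M2) != set(E):
        return False

    undirected = set(L) | {(q, p) for (p, q) in L}
    compute_pes = set(M1.values())
    routed_value = {}                          # routing PE -> the value it carries

    for (u, v), path in M2.items():
        # Path existence: starts at M1[u], ends at M1[v], follows CGRA links.
        if path[0] != M1[u] or path[-1] != M1[v]:
            return False
        if any((a, b) not in undirected for a, b in zip(path, path[1:])):
            return False
        # Simple path: no PE repeated within one path.
        if len(set(path)) != len(path):
            return False
        # Interior PEs are routing PEs: no computation on them, and each
        # routing PE may carry only one value (the source node u).
        for pe in path[1:-1]:
            if pe in compute_pes:
                return False
            if routed_value.setdefault(pe, u) != u:
                return False
    return True
```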

Graph Drawing Problem (I)
- Split & Push algorithm [1]
  [Figure: split & push examples on a kernel DAG and a CGRA -- a good mapping using 2 columns vs. a bad mapping using 3 columns with dummy node insertion]
- A bad split decision incurs more resource use (2 vs. 3 columns)
- Forks happen when adjacent edges are cut by a split
  - Forks incur dummy nodes, which are unnecessary routing PEs (see the sketch below)
- How can forks be reduced?
[1] G. D. Battista et al. A split & push approach to 3D orthogonal drawing. In Graph Drawing.
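A small illustration of why forks matter, under the assumption that a fork is simply a node touched by more than one cut edge: each such node forces dummy (routing-only) nodes when the two sides of the split are pushed apart. This is a simplified reading of the split & push idea, not Di Battista et al.'s full algorithm.

```python
from collections import Counter

def forks_in_cut(cut_edges):
    """Return nodes incident to two or more cut edges; each such fork
    forces dummy routing nodes after a split-and-push step."""
    incidence = Counter()
    for u, v in cut_edges:
        incidence[u] += 1
        incidence[v] += 1
    return {node: count for node, count in incidence.items() if count > 1}

# Cutting two edges that both touch node 'a' creates a fork at 'a'.
print(forks_in_cut({('a', 'b'), ('a', 'c')}))   # {'a': 2}
# A cut whose edges share no nodes (a matching-cut) creates no forks.
print(forks_in_cut({('a', 'b'), ('c', 'd')}))   # {}
```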

Graph Drawing Problem (II)
- Matching-Cut [2]
  - Matching: a set of edges which do not share nodes
  - Cut: a set of edges whose removal disconnects the graph
  [Figure: a cut but not a matching; a matching but not a cut; a matching-cut]
- Forks can be avoided by finding a matching-cut in the DAG (see the sketch below)
  [Figure: a matching-cut needs 4 PEs and no routing PEs; a plain cut needs 6 PEs and 2 routing PEs]
[2] M. Patrignani and M. Pizzonia. The complexity of the matching-cut problem. In WG '01: Proceedings of the 27th International Workshop on Graph-Theoretic Concepts in Computer Science.
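The two properties can be checked directly, as in the sketch below: a matching shares no endpoints, and a cut disconnects the graph when removed. The brute-force search is only for illustration on tiny graphs; deciding whether a matching-cut exists is NP-complete in general (the complexity result established by the Patrignani and Pizzonia reference), which is why SPKM relies on a heuristic.

```python
from itertools import combinations

def is_matching(edges):
    """No two chosen edges share an endpoint."""
    nodes = [n for e in edges for n in e]
    return len(nodes) == len(set(nodes))

def is_cut(all_edges, nodes, removed):
    """Removing `removed` (treated as undirected edges) disconnects the graph."""
    adj = {n: set() for n in nodes}
    for u, v in all_edges:
        if (u, v) not in removed and (v, u) not in removed:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:                               # depth-first reachability
        for m in adj[stack.pop()]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return len(seen) < len(nodes)

def find_matching_cut(nodes, edges):
    """Brute-force search over edge subsets (exponential; illustration only)."""
    for k in range(1, len(edges) + 1):
        for subset in combinations(edges, k):
            if is_matching(subset) and is_cut(edges, nodes, set(subset)):
                return set(subset)
    return None

# 4-cycle a-b-c-d-a: {('a','b'), ('c','d')} is a matching-cut.
print(find_matching_cut({'a', 'b', 'c', 'd'},
                        [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'a')]))
```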

Split & Push Kernel Mapping
- Each PE is connected to at most 6 other PEs
- Per row, at most 2 load operations and 1 store operation can be scheduled
- Example kernel: # of nodes |V| = 10, # of loads L = 3, # of stores S = 1
  - Initial ROW_min = 3 (computed in the sketch below)
[Figure: mapping example -- legend (load, store, ALU, RPE, fork); initial position and row-wise scattering; matching-cut followed by split & push; when no matching cut exists, forks occur and RPEs are inserted; on a constraint violation, repeat with an increased ROW_min]
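The initial row bound in the example can be reproduced from the stated per-row limits, assuming 4 PEs per row (matching the 4x4 CGRA used in the experiments), 2 loads per row, and 1 store per row: ROW_min = max(⌈|V|/4⌉, ⌈L/2⌉, ⌈S/1⌉) = max(3, 2, 1) = 3. A small helper, with the per-row numbers as labeled assumptions:

```python
from math import ceil

def initial_row_min(num_nodes, num_loads, num_stores,
                    pes_per_row=4, loads_per_row=2, stores_per_row=1):
    """Lower bound on the number of rows needed for a kernel.
    (pes_per_row=4 is an assumption matching the 4x4 CGRA in the experiments;
    the load/store limits come from the slide.)"""
    return max(ceil(num_nodes / pes_per_row),
               ceil(num_loads / loads_per_row),
               ceil(num_stores / stores_per_row))

print(initial_row_min(10, 3, 1))   # 3, as on the slide
```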

Experimental Setup
- We test SPKM on a CGRA called RSPA
  - RSPA has orthogonal interconnection (irregular interconnection)
  - Each row has 2 shared multipliers, and each row can perform 2 loads and 1 store (shared resources)
  - A PE can be used for routing only (routing resource)
- Two sets of experiments
  - Synthetic benchmarks
  - Real benchmarks

SPKM for Synthetic Benchmarks
- 4x4 CGRA
- Random kernel DAG generator (see the sketch below)
  - First choose n (1-16), the number of nodes in the DAG (its cardinality)
  - Then randomly create non-cyclic edges between the nodes of the DAG
- 100 DAGs of each cardinality
- Run [AHN] and SPKM on them and compare
  - Map-ability
  - Number of routing resources (RRs)
  - Mapping time
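A sketch of the kind of random kernel-DAG generator described above: pick a cardinality n between 1 and 16, then add edges only from lower-numbered to higher-numbered nodes so no cycles can form. The edge probability is an illustrative parameter not specified on the slide.

```python
import random

def random_kernel_dag(max_nodes=16, edge_prob=0.3, seed=None):
    """Generate a random kernel DAG: n nodes (1..max_nodes) and acyclic edges.
    Edges only go from a lower index to a higher index, so no cycles arise.
    (edge_prob is an illustrative assumption.)"""
    rng = random.Random(seed)
    n = rng.randint(1, max_nodes)
    edges = [(u, v)
             for u in range(n)
             for v in range(u + 1, n)
             if rng.random() < edge_prob]
    return n, edges

# 100 DAGs of each cardinality were used in the experiments; for example:
n, edges = random_kernel_dag(seed=42)
print(n, "nodes,", len(edges), "edges")
```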

SPKM maps more applications
- SPKM can on average map 4.5X more applications than [AHN]
- For large applications, SPKM shows high map-ability since it considers routing PEs well
[Figure: number of applications each technique can map (Y axis) vs. number of nodes per application (X axis)]

SPKM generates better mappings
- For 62% of the applications, SPKM generates a better mapping than [AHN]
- For 99% of the applications, SPKM generates a mapping at least as good as [AHN]
[Figure: breakdown of applications where [AHN] uses fewer rows, both use an equal number of rows, or SPKM uses fewer rows]

No significant difference in mapping time
- SPKM has 8% less mapping time compared to [AHN]

SPKM for Real Benchmarks
- Benchmarks from Livermore loops, MultiMedia, and DSPStone

SPKM for Real Benchmarks
- SPKM can map more real benchmarks
- SPKM reduces power consumption by 10% on applications that both [AHN] and SPKM can map
[Figure: % reduction in power consumption per benchmark; [AHN] fails to map some of them]

Conclusion
- CGRAs are a promising platform
  - High-throughput, power-efficient computation
  - The applicability of CGRAs critically hinges on the compiler
- CGRAs are becoming complex
  - Irregular interconnect, shared resources, routing resources
  - Existing compilers do not consider these complexities and cannot map many applications
- We propose SPKM, a graph-drawing based heuristic that considers the architectural details of CGRAs and uses a split & push algorithm
  - Maps 4.5X more DAGs
  - Fewer rows in 62% of the DAGs
  - About the same mapping time

Thank you