I2CRF: Incremental Interconnect Customization for Embedded Reconfigurable Fabrics Jonghee W. Yoon, Jongeun Lee*, Jaewan Jung, Sanghyun Park, Yongjoo Kim,

Presentation transcript:

I2CRF: Incremental Interconnect Customization for Embedded Reconfigurable Fabrics Jonghee W. Yoon, Jongeun Lee*, Jaewan Jung, Sanghyun Park, Yongjoo Kim, Yunheung Paek and Doosan Cho** Seoul National University, Korea *UNIST, Korea **Sunchon National University, Korea

2 Udo Kebschull University of Heidelberg Outline
– CGRA & Augmentation
– Overall Design Flow
– Our Approach (I2CRF)
– Problem definition (Inexact graph matching)
– Mapping with A* search
– Experiment
– Conclusion

3 Reconfigurable Architecture
– Reconfigurable computing is emerging to meet the increasing need for flexible, high-speed computing fabrics
– CGRAs (Coarse-Grained Reconfigurable Architectures)
  – operation-level granularity
  – high performance
  – easy S/W development
– Examples: MorphoSys, ADRES

4 Augmentation
– General CGRA (mapping): CGRA Arch. + Applications → Configurations
– Application-specific CGRAs (synthesis): Applications → New Arch. + Configurations
– Augmentation: Base CGRA + Applications → New Arch. + Configurations
Customizable features
– The number of PEs
– The set of PE operations
– Heterogeneity or homogeneity
– Memory subsystem architecture
– Interconnection network
Reference: "Interconnect Exploration for Energy Versus Performance Tradeoffs for Coarse Grained Reconfigurable Architectures", TVLSI
Energy consumption: % (130nm) → 30% (45nm)

5 Overall design flow - I2CRF
[Flow diagram of I2CRF (Incremental Interconnect Customization for Reconfigurable Fabrics). Labels: Kernel; Base CGRA; Vertex Clustering; Mapping (A* Search for Minimum-Cost Edit Path); Arch Extension; Interconnections (Accum.); Evaluation; Not Satisfied; Application-Specific Reconfigurable Architecture]

6 I2CRF
Incremental architecture change by adding interconnections to the base architecture
Strengths
– Regularity is maintained through the base architecture
– Specialization is still provided for the target applications
– Fast specialization, with no limitation on the design space
– The architecture change occurs while the kernel is mapped

7 The difference compared with general mapping
– Existing application mapping for CGRA: find a graph X, a subgraph of C, that is isomorphic to K
– Augmentation and mapping: find a graph Y that is isomorphic to K and is a subgraph of C', the graph that is most similar to C
[Figure: kernel graph K and base CGRA graph C (PE 1 to PE 6), shown under general mapping and under augmentation and mapping]

8 Problem Definition - Inexact Graph Matching Problem
How to find the graph C' that is most similar to the base CGRA graph C: inexact graph matching
– Similarity between two graphs can be measured by calculating the cost of the graph edit path
– An edit path is the set of edit operations that transform a graph G1 into another graph G2
Edit operations
– Node (or edge) substitution: NS, ES (identical or non-identical)
– Node (or edge) insertion: NI, EI
– Node (or edge) deletion: ND, ED
– All the other edit operations are induced by node substitution
[Figure: example with NS 1→e, 2→a, 3→h, 4→d, 5→b, 6→g, 7→f, showing identical ES, non-identical ES & NI, ED, and EI]
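As a rough illustration of how a node substitution induces the edge operations listed on this slide, the sketch below classifies edges once every node of G1 has been substituted by a node of G2. The function name `induced_edge_ops`, the directed-edge representation, and the small example graphs are assumptions made for illustration; they are not taken from the slides.

```python
def induced_edge_ops(g1_edges, g2_edges, node_map):
    """Classify the edge operations induced by a node substitution.

    g1_edges, g2_edges: iterables of directed (src, dst) pairs.
    node_map: dict sending every G1 node to its substituted G2 node.
    """
    # Image of each G1 edge under the node substitution.
    mapped = {(node_map[u], node_map[v]) for u, v in g1_edges}
    target = set(g2_edges)
    substituted = sorted(mapped & target)  # edge exists in both graphs: ES
    deleted = sorted(mapped - target)      # G1 edge with no counterpart: ED
    inserted = sorted(target - mapped)     # G2 edge with no preimage: EI
    return substituted, deleted, inserted


# Hypothetical example: G1 is the path 1->2->3, G2 has edges a->b and a->c.
es, ed, ei = induced_edge_ops(
    [(1, 2), (2, 3)],
    [("a", "b"), ("a", "c")],
    {1: "a", 2: "b", 3: "c"},
)
```

Here only the node substitution is chosen by hand; the three edge lists follow from it, which is the sense in which the other operations are "induced".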

9 Graph Edit Cost Model
– C_e: the cost of edge deletion, i.e. the interconnection insertion cost
– C_v: the cost of node insertion, i.e. the routing PE insertion cost
– A routing PE can replace interconnection insertion when there are extra PEs
  – no augmentation is needed
  – this can reduce the amount of architecture extension
– C_v is much cheaper than C_e
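The cost model can be sketched as a weighted sum over the two kinds of edit operations. The numeric defaults below (C_e = 10, C_v = 1) are placeholder assumptions chosen only to respect "C_v is much cheaper than C_e"; the slides do not state these values, and the function names are hypothetical.

```python
def edit_path_cost(num_new_links, num_routing_pes, c_e=10.0, c_v=1.0):
    """Total cost of an edit path: C_e per new link plus C_v per routing PE."""
    return c_e * num_new_links + c_v * num_routing_pes


def resolve_unmapped_edge(spare_pes, c_e=10.0, c_v=1.0):
    """Pick the cheaper fix for a kernel edge with no matching interconnect.

    Assumes a single spare PE is enough to route the edge (a simplification).
    """
    if spare_pes > 0 and c_v < c_e:
        return "routing_pe", c_v   # no architecture extension needed
    return "new_link", c_e         # the base architecture must be augmented
```

With these placeholder costs, an edit path that adds 2 links and uses 3 routing PEs costs 10 * 2 + 1 * 3 = 23.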

10 A* Search for Min Cost Edit Path
The inexact graph matching problem is NP-complete → how to search the mapping space for the min-cost path: A* search algorithm
– Root: kernel graph
– Leaf: sub-CGRA graph
– s: current mapping state
– g(s): the sum of the costs (C_e, C_v) of the graph edit operations from the root to the current state s
– h(s): the estimated cost from the current state s to a leaf state
– Assessment of the partial mapping s: g(s) + h(s)
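A minimal sketch of A* over partial mappings, ranked by g(s) + h(s), is given below. It is a simplification and not the authors' implementation: routing PEs (C_v) are left out, interconnect links are treated as undirected, kernel vertices are placed in index order, and the default heuristic is zero. The function name `astar_map` and the 2x2 mesh example are assumptions.

```python
import heapq
from itertools import count


def astar_map(num_vertices, kernel_edges, num_pes, pe_links, c_e=1.0, h=None):
    """Map kernel vertices 0..num_vertices-1 onto distinct PEs with A*.

    A state is a tuple: state[v] is the PE chosen for kernel vertex v.
    g(s) charges c_e for every kernel edge whose endpoints land on PEs
    without a link (a link that would have to be added).
    Returns (cost, mapping), or None if no complete mapping exists.
    """
    h = h or (lambda state: 0.0)      # zero heuristic never overestimates
    links = {frozenset(link) for link in pe_links}  # undirected links
    tie = count()                     # tie-breaker so the heap never compares states
    frontier = [(h(()), next(tie), 0.0, ())]
    while frontier:
        _, _, g, state = heapq.heappop(frontier)
        v = len(state)                # next kernel vertex to place
        if v == num_vertices:
            return g, state           # optimal when h never overestimates
        for pe in range(num_pes):
            if pe in state:
                continue              # each PE hosts at most one vertex
            step = 0.0
            for a, b in kernel_edges:
                other = b if a == v else a if b == v else None
                if other is not None and other < v:
                    if frozenset((pe, state[other])) not in links:
                        step += c_e   # this edge needs a new interconnection
            child = state + (pe,)
            heapq.heappush(frontier, (g + step + h(child), next(tie), g + step, child))
    return None


# Hypothetical 2x2 mesh (PEs 0..3) and a triangle-shaped kernel.
mesh_2x2 = [(0, 1), (2, 3), (0, 2), (1, 3)]
cost, mapping = astar_map(3, [(0, 1), (1, 2), (0, 2)], 4, mesh_2x2)
```

The 2x2 mesh is a 4-cycle and contains no triangle, so any mapping of the triangle kernel leaves one kernel edge without a link; the search returns a cost of one C_e, meaning one added interconnection.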

11 Vertex Scattering
Make clusters of vertices and assign each cluster to a row
Strengths of vertex scattering
– Search space reduction
– Shared resource constraints are taken into account
[Figure: kernel → clustering & row assignment (Row 1, Row 2) → final mapping on PE 1 to PE 6]

12 h(s) & Vertex Scattering
Heuristic function h(s)
– guides the fast search of the mapping space
– needs cost estimation methods
Detecting difficult-to-map edges after vertex scattering
– Forks and over-length edges cannot be mapped to a mesh without a routing PE or a custom interconnection link
– h(s) is based on the number of forks and over-length edges (= N_r)
– An unroutable difficult-to-map edge (c_1) has more cost than a routable one (c_2)
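The count N_r could be computed along the following lines. The slides do not define forks or over-length edges precisely, so this sketch assumes the following: an over-length edge joins vertices assigned to rows more than one apart, and a fork is a vertex with more than one edge into the same adjacent row (a mesh PE has a single link in each vertical direction). The names `count_difficult_edges` and `h_estimate`, and the single per-edge cost, are likewise hypothetical.

```python
from collections import Counter


def count_difficult_edges(edges, row_of):
    """Return N_r: forks plus over-length edges under the assumed definitions.

    edges: iterable of (src, dst) kernel edges.
    row_of: dict mapping each kernel vertex to its assigned row index.
    """
    over_length = 0
    fan_out = Counter()  # (source vertex, adjacent row) -> number of edges
    for u, v in edges:
        distance = abs(row_of[u] - row_of[v])
        if distance > 1:
            over_length += 1            # spans more than one row
        elif distance == 1:
            fan_out[(u, row_of[v])] += 1
    # Every edge beyond the first into the same adjacent row counts as a fork.
    forks = sum(n - 1 for n in fan_out.values() if n > 1)
    return forks + over_length


def h_estimate(edges, row_of, cost_per_edge=1.0):
    """Heuristic sketch: one assumed cost per difficult-to-map edge."""
    return cost_per_edge * count_difficult_edges(edges, row_of)


# Hypothetical row assignment: vertex 1 forks into row 1, edge 2->6 skips a row.
example_edges = [(1, 4), (1, 5), (2, 6)]
example_rows = {1: 0, 2: 0, 4: 1, 5: 1, 6: 2}
n_r = count_difficult_edges(example_edges, example_rows)
```

In this example N_r is 2: one fork at vertex 1 and one over-length edge from vertex 2 to vertex 6. Distinguishing the costs c_1 and c_2 would need a routability check that is not modeled here.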

13 Example
[Figure: A* search tree for mapping a kernel onto PE 1 to PE 6, with c_1 = c_v = 1 and c_2 = c_e = (value not preserved); each state is annotated with g(s) + h(s)]
States:
– s=0 { }
– s=1 {(1→1)}
– s=2 {(1→2)}
– s=3 {(1→3)}
– s=4 {(4→2)}
– s=5 {(4→3), ($→2)}
– s=6 {(2→6)}
– s=7 {(2→5)}
– s=8 {(2→4)}
– s=9 {(3→3), ($→5)}
– s=10 {(3→5), ($→4)}

14 Experimental Setup
We test I2CRF on a CGRA called RSPA
– mesh base interconnection
– each row has 2 shared multipliers
– each row can perform 2 loads and 1 store
– PEs can be used for routing
Benchmarks from Livermore loops, MultiMedia, and DSPStone
Comparison to Mesh, 1-hop, Diagonal, and Mixed
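The per-row limits in this setup (2 shared multipliers, 2 loads, 1 store) are the kind of shared resource constraint a row assignment has to respect. The check below is a sketch under the assumption that all listed operations execute in the same cycle; the operation labels, the `ROW_LIMITS` table, and the function name `row_violations` are hypothetical.

```python
from collections import Counter

# Per-row limits taken from the setup above; the operation names are assumed.
ROW_LIMITS = {"mul": 2, "load": 2, "store": 1}


def row_violations(op_of, row_of, limits=ROW_LIMITS):
    """List (row, op, count) triples where a row exceeds a shared limit.

    op_of: dict mapping each kernel vertex to its operation name.
    row_of: dict mapping each kernel vertex to its assigned row index.
    """
    usage = Counter()
    for vertex, op in op_of.items():
        if op in limits:              # unconstrained operations are ignored
            usage[(row_of[vertex], op)] += 1
    return sorted(
        (row, op, n) for (row, op), n in usage.items() if n > limits[op]
    )


# Hypothetical assignment: three loads in row 0 exceed the limit of two.
ops = {1: "load", 2: "load", 3: "load", 4: "mul", 5: "store"}
rows = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1}
violations = row_violations(ops, rows)
```

An empty result means the row assignment fits within the shared multipliers and memory ports; here row 0 is reported because it holds three loads.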

15 Performance Improvement
– An IPC of 16 is equivalent to 100% utilization
– PE utilization and IPC increase by more than 70% on average compared to Mesh, and by 41% on average compared to Mixed

16 Customization Overhead
Through our interconnection increment,
– the number of new interconnection links is very small
– the overall mux complexity increases only marginally

17 Optimization Time
I2CRF finds a competitive custom interconnection architecture, together with its configuration, in reasonable time.

18 Conclusion
– We presented an interconnection customization method for CGRAs
– Our method exploits the similarity between the interconnection customization problem and inexact graph matching
– Non-homogeneous extensions to a base interconnection architecture may present some challenges, and possibly a penalty, in back-end VLSI design
– We plan to find out the extent of the difficulty due to the non-homogeneity, as well as novel ways to mitigate any impact if necessary

19 Thank you for your attention!