Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo*, and Yunheung Paek*
*Seoul National University  **UNIST (Ulsan National Institute of Science & Technology)
ARC 2012, March 21, 2012, Hong Kong

Reconfigurable Architecture
- Reconfigurable architectures offer:
  - High performance
  - Flexibility (cf. ASIC)
  - Energy efficiency (cf. GPU)
(Source: ChipDesignMag.com)

Coarse-Grained Reconfigurable Architecture
- Coarse-grained RA:
  - Word-level granularity
  - Dynamic reconfigurability
  - Simpler to compile for
- Execution model
  [Figure: main processor + CGRA with main memory and DMA controller; example CGRAs: MorphoSys, ADRES]

Application Mapping
- Place and route the DFG onto the PE-array mapping space
- Several constraints must be satisfied:
  - Nodes must be mapped to PEs with the right functionality
  - Data transfer between nodes must be guaranteed
  - Resource consumption should be minimized for performance
  [Figure: compilation flow — the front end produces IR from the application; a partitioner splits it into sequential code and loops; sequential code goes through conventional C compilation to assembly, while loops go through DFG generation and place & route (guided by architecture parameters) to produce the configuration; an extended assembler merges executable and configuration]

Software Pipelining
- Modulo scheduling-based mapping (see the sketch below)
  [Figure: DFG of C[i] = A[i] + B[i] mapped onto PE0-PE3 over time with II = 2 cycles; II: initiation interval]
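To make the II concrete, here is a minimal illustration (the cycle and PE assignments are mine, not from the slides) of how operations from different iterations of C[i] = A[i] + B[i] overlap once a new iteration starts every II = 2 cycles:

    /* Illustrative modulo schedule with II = 2; one iteration
     * completes every 2 cycles in steady state.
     *
     * cycle | PE0        PE1        PE2          PE3
     * ------+------------------------------------------------
     *   0   | load A[0]  load B[0]
     *   2   | load A[1]  load B[1]  A[0]+B[0]
     *   4   | load A[2]  load B[2]  A[1]+B[1]    store C[0]
     *   6   | load A[3]  load B[3]  A[2]+B[2]    store C[1]
     */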

Problem: Scalability
- Mapping suffers several problems on a large-scale CGRA:
  - Lack of parallelism: general applications have limited ILP
  - Configuration size grows (when unrolling is used)
  - Placement and routing must search a very large mapping space, so compilation time skyrockets
- As a result, CGRAs remain at 4x4 or, at most, 8x8

Overview
- Background
- SIMD Reconfigurable Architecture (SIMD RA)
- Mapping on SIMD RA
- Evaluation

SIMD Reconfigurable Architecture
- The PE array consists of multiple identical parts, called cores
  - Identical, so one configuration can be reused across cores
  - At least one load-store PE in each core
  [Figure: Core 1-Core 4 connected to Bank1-Bank4 through a crossbar switch]
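As a rough sketch of the organization (field names are mine, not from the slides), the design space can be captured by a few parameters:

    /* Hypothetical parameter block for a SIMD RA instance;
     * names are illustrative, not from the paper. */
    struct simd_ra_config {
        int pe_rows, pe_cols;     /* whole PE array, e.g. 4x4 or 8x4  */
        int core_rows, core_cols; /* one core, e.g. 2x2               */
        int ls_pes_per_core;      /* load-store PEs per core (>= 1)   */
        int num_banks;            /* memory banks behind the crossbar */
    };

    /* Identical cores => number of iterations that run in lockstep. */
    int num_cores(const struct simd_ra_config *c) {
        return (c->pe_rows / c->core_rows) * (c->pe_cols / c->core_cols);
    }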

Advantages of SIMD RA
- More iterations are executed in parallel, scaling with the PE array size
- Short compilation time, thanks to the small per-core mapping space
- Achieves a denser scheduled configuration, i.e., higher utilization and performance
- Constraint: the loop must not have a loop-carried dependence (example below)
  [Figure: one large core running iterations 0-5 sequentially vs. four cores running iterations 0-11 in parallel over the same time]
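A minimal example of that constraint (the loops are mine, modeled on the kernel used later in the talk):

    /* OK for SIMD RA: iterations are independent, so different
     * cores can execute different iterations in lockstep. */
    for (i = 0; i < 15; i++)
        B[i] = A[i] + B[i];

    /* Not OK: iteration i reads B[i-1], which iteration i-1 writes,
     * so iterations cannot be spread across cores. */
    for (i = 1; i < 15; i++)
        B[i] = A[i] + B[i-1];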

Overview
- Background
- SIMD Reconfigurable Architecture (SIMD RA)
- Bank Conflict Minimization in SIMD RA
- Evaluation

Problems of SIMD RA Mapping
- A new mapping problem: iteration-to-core mapping
- The iteration mapping affects performance:
  - It is tied to the data mapping
  - Together they determine the number of bank conflicts
  [Figure: the 15 iterations of "for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i]; }" distributed across Core 1-Core 4]

Mapping Schemes
For the loop "for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i]; }":
- Iteration-to-core mapping (code sketch below):
  - Sequential: Iter. 0-3 → Core 1, Iter. 4-7 → Core 2, Iter. 8-11 → Core 3, Iter. 12-14 → Core 4
  - Interleaved: Iter. 0,4,8,12 → Core 1, Iter. 1,5,9,13 → Core 2, Iter. 2,6,10,14 → Core 3, Iter. 3,7,11 → Core 4
- Data mapping:
  - Interleaved: consecutive elements spread across banks (Bank 1 holds A[0], A[4], A[8], A[12]; Bank 2 holds A[1], A[5], ...)
  - Sequential: arrays laid out contiguously, filling one bank before the next (A[0], A[1], A[2], ..., then B[0], B[1], ...)
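The two iteration-to-core schemes can be stated compactly (a sketch with my own names; n iterations over NUM_CORES cores):

    #define NUM_CORES 4

    /* Sequential (block) assignment: Core 0 runs iterations
     * 0..chunk-1, Core 1 the next chunk, and so on. */
    int core_sequential(int i, int n) {
        int chunk = (n + NUM_CORES - 1) / NUM_CORES;  /* ceil(n/4) */
        return i / chunk;
    }

    /* Interleaved (cyclic) assignment: iteration i runs on core i mod 4. */
    int core_interleaved(int i) {
        return i % NUM_CORES;
    }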

Interleaved Data Placement
- With interleaved data placement, interleaved iteration assignment is better than sequential iteration assignment
- But it is weak for strided accesses (e.g., a configuration issuing Load A[2i] instead of Load A[i]), which:
  - reduce the number of utilized banks
  - increase bank conflicts (see the sketch below)
  [Figure: under interleaved placement, unit-stride loads from the four cores hit four different banks; stride-2 loads concentrate on half of them]
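A sketch of why stride hurts (my formulation; word-interleaved placement across NUM_BANKS banks is assumed):

    #define NUM_BANKS 4

    /* Word-interleaved placement: the element index picks the bank. */
    int bank_of(int elem_index) {
        return elem_index % NUM_BANKS;
    }

    /* Stride 1: in one cycle the four cores access elements
     * i, i+1, i+2, i+3 -> four distinct banks, no conflict.
     * Stride 2: they access 2i, 2i+2, 2i+4, 2i+6 -> only two
     * distinct banks serve four requests, so two cores stall
     * on bank conflicts every access. */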

Sequential Data Placement
- Plain sequential placement cannot work well with SIMD mapping: it causes frequent bank conflicts
- Data tiling avoids this via:
  i) array base address modification, and
  ii) rearranging data in the local memory (sketch below)
- Sequential iteration assignment with data tiling suits SIMD mapping
  [Figure: after tiling, each core's slice lands in its own bank — Bank 1: A[0..3], B[0..3]; Bank 2: A[4..7], B[4..7]; Bank 3: A[8..11], B[8..11]; Bank 4: A[12..14], B[12..14]]
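A minimal sketch of the rearranging step (the layout and names are mine; the slides only name the technique):

    #define NUM_BANKS  4
    #define BANK_WORDS 1024   /* illustrative bank capacity in words */

    /* Copy a[0..n-1] into banked local memory so that the slice used
     * by core k lands entirely in bank k.  'lmem' models local memory,
     * with bank b occupying lmem[b*BANK_WORDS .. (b+1)*BANK_WORDS-1]. */
    void tile_array(const int *a, int n, int *lmem, int base_offset) {
        int chunk = (n + NUM_BANKS - 1) / NUM_BANKS;  /* slice per core */
        for (int i = 0; i < n; i++) {
            int bank = i / chunk;                     /* owning core/bank */
            int off  = i % chunk;                     /* offset in slice  */
            lmem[bank * BANK_WORDS + base_offset + off] = a[i];
        }
    }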

Summary of Mapping Combinations
- Two of the four combinations have strong advantages:
  - Interleaved iteration + interleaved data mapping: simple data management, but weak for strided accesses
  - Sequential iteration + sequential data mapping (with data tiling): more robust against bank conflicts, but incurs data-rearranging overhead

Experimental Setup
- Loop kernels from OpenCV, multimedia, and SPEC2000 benchmarks
- Target system:
  - Two CGRA sizes: 4x4 and 8x4
  - 2x2 cores, each with one load-store PE and one multiplier PE
  - Mesh + diagonal connections between PEs
  - Full crossbar switch between PEs and local memory banks
- Compared against non-SIMD mapping:
  - Original: previous non-SIMD mapping
  - SIMD: our approach (interleaved iteration + interleaved data mapping)

Configuration Size
[Chart: configuration size reduced by 61% on the 4x4 CGRA and by 79% on the 8x4 CGRA]

Runtime
[Chart: runtime improvements of 29% and 32%]

Conclusion
- Presented the SIMD reconfigurable architecture
  - Exploits data parallelism and instruction-level parallelism at the same time
- Advantages:
  - Scales well to large numbers of PEs
  - Alleviates the growth in compilation time
  - Increases performance and reduces configuration size

Thank you!

Core Size
- For a large loop, a small core might not be a good match
- Merge multiple cores ⇒ macrocore
- No hardware modification required
  [Figure: Core 1 + Core 2 merged into Macrocore 1 and Core 3 + Core 4 into Macrocore 2, behind the crossbar switch and Bank1-Bank4]

SIMD RA Mapping Flow
[Flow diagram: check the SIMD requirement (fall back to traditional mapping on failure) → select core size → iteration mapping → operation mapping via modulo scheduling → array placement (implicit for Int-Int; data tiling for Seq-Tiling). If scheduling fails, increase II and repeat; if scheduling fails and MaxII < II, increase the core size.]
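The flow can be summarized as a small loop (a compilable sketch of the diagram; the predicates are stubs standing in for the real compiler passes):

    #include <stdbool.h>

    bool meets_simd_requirements(void) { return true; }  /* e.g. no loop-carried deps */
    bool modulo_schedule(int core_size, int ii) { return ii >= 2; }  /* stub */

    /* Returns the achieved II, or -1 to fall back to traditional mapping. */
    int map_loop(int min_core, int max_core, int min_ii, int max_ii) {
        if (!meets_simd_requirements())
            return -1;                        /* traditional mapping */
        for (int core = min_core; core <= max_core; core *= 2) {
            /* iteration mapping chosen here: Int-Int or Seq-Tiling */
            for (int ii = min_ii; ii <= max_ii; ii++)
                if (modulo_schedule(core, ii))
                    return ii;                /* then array placement */
            /* failed up to MaxII: increase the core size and retry */
        }
        return -1;
    }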