Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula School of Computing, Informatics, and Decision Systems Engineering Arizona State University June 2013.

Slides:

Advertisements

Similar presentations

CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.

Advertisements

CML CML Presented by: Aseem Gupta, UCI Deepa Kannan, Aviral Shrivastava, Sarvesh Bhardwaj, and Sarma Vrudhula Compiler and Microarchitecture Lab Department.

Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.

5th International Conference, HiPEAC 2010 MEMORY-AWARE APPLICATION MAPPING ON COARSE-GRAINED RECONFIGURABLE ARRAYS Yongjoo Kim, Jongeun Lee *, Aviral Shrivastava.

Copyright © 2002 UCI ACES Laboratory A Design Space Exploration framework for rISA Design Ashok Halambi, Aviral Shrivastava,

University of Michigan Electrical Engineering and Computer Science 1 Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution.

CML Enabling Multithreading on CGRAs Reiley Jeyapaul, Aviral Shrivastava 1, Jared Pager 1, Reiley Jeyapaul, 1 Mahdi Hamzeh 12, Sarma Vrudhula 2 Compiler.

PipeRench: A Coprocessor for Streaming Multimedia Acceleration Seth Goldstein, Herman Schmit et al. Carnegie Mellon University.

University of Michigan Electrical Engineering and Computer Science 1 Reducing Control Power in CGRAs with Token Flow Hyunchul Park, Yongjun Park, and Scott.

LCTES 2010, Stockholm Sweden OPERATION AND DATA MAPPING FOR CGRA’S WITH MULTI-BANK MEMORY Yongjoo Kim, Jongeun Lee *, Aviral Shrivastava ** and Yunheung.

© ACES Labs, CECS, ICS, UCI. Energy Efficient Code Generation Using rISA * Aviral Shrivastava, Nikil Dutt

Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.

University of Michigan Electrical Engineering and Computer Science 1 Resource Recycling: Putting Idle Resources to Work on a Composable Accelerator Yongjun.

Split Compilation for Accelerator-based Multicores Panagiotis Theocharis Computer Systems Lab (CSL) Ghent University.

Compilation Techniques for Energy Reduction in Horizontally Partitioned Cache Architectures Aviral Shrivastava, Ilya Issenin, Nikil Dutt Center For Embedded.

COE 561 Digital System Design & Synthesis Resource Sharing and Binding Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum.

Trend towards Embedded Multiprocessors Popular Examples –Network processors (Intel, Motorola, etc.) –Graphics (NVIDIA) –Gaming (IBM, Sony, and Toshiba)

Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University.

CS1104 – Computer Organization PART 2: Computer Architecture Lecture 1 Introduction.

Synergy.cs.vt.edu Power and Performance Characterization of Computational Kernels on the GPU Yang Jiao, Heshan Lin, Pavan Balaji (ANL), Wu-chun Feng.

Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.

Exploring the Tradeoffs of Configurability and Heterogeneity in Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable.

Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science.

A Thermal-Aware Mapping Algorithm for Reducing Peak Temperature of an Accelerator Deployed in a 3D Stack A Thermal-Aware Mapping Algorithm for Reducing.

Adapting Convergent Scheduling Using Machine Learning Diego Puppin*, Mark Stephenson †, Una-May O’Reilly †, Martin Martin †, and Saman Amarasinghe † *

Computer Architecture and Organization Introduction.

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.

Automated Design of Custom Architecture Tulika Mitra

Efficient Mapping onto Coarse-Grained Reconfigurable Architectures using Graph Drawing based Algorithm Jonghee Yoon, Aviral Shrivastava *, Minwook Ahn,

High Performance Scalable Base-4 Fast Fourier Transform Mapping Greg Nash Centar 2003 High Performance Embedded Computing Workshop

A Dynamic Code Mapping Technique for Scratchpad Memories in Embedded Systems Amit Pabalkar Compiler and Micro-architecture Lab School of Computing and.

Computer architecture Lecture 11: Reduced Instruction Set Computers Piotr Bilski.

Energy/Reliability Trade-offs in Fault-Tolerant Event-Triggered Distributed Embedded Systems Junhe Gan, Flavius Gruian, Paul Pop, Jan Madsen.

Hardware/Software Co-design Design of Hardware/Software Systems A Class Presentation for VLSI Course by : Akbar Sharifi Based on the work presented in.

I2CRF: Incremental Interconnect Customization for Embedded Reconfigurable Fabrics Jonghee W. Yoon, Jongeun Lee*, Jaewan Jung, Sanghyun Park, Yongjoo Kim,

L11: Lower Power High Level Synthesis(2) 성균관대학교 조 준 동 교수

CML REGISTER FILE ORGANIZATION FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURES (CGRAs) Dipal Saluja Compiler Microarchitecture Lab, Arizona State University,

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.

Séminaire COSI-Roscoff’011 Séminaire COSI ’01 Power Driven Processor Array Partitionning for FPGA SoC S.Derrien, S. Rajopadhye.

Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University.

CML SSDM: Smart Stack Data Management for Software Managed Multicores Jing Lu Ke Bai, and Aviral Shrivastava Compiler Microarchitecture Lab Arizona State.

System-level power analysis and estimation September 20, 2006 Chong-Min Kyung.

Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.

Jason Li Jeremy Fowers 1. Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory.

Analysis of Cache Tuner Architectural Layouts for Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing.

OPTIMIZING DSP SCHEDULING VIA ADDRESS ASSIGNMENT WITH ARRAY AND LOOP TRANSFORMATION Chun Xue, Zili Shao, Ying Chen, Edwin H.-M. Sha Department of Computer.

Reduction of Register File Power Consumption Approach: Value Lifetime Characteristics - Pradnyesh Gudadhe.

Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke

University of Michigan Electrical Engineering and Computer Science Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators.

Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.

Hyunchul Park†, Kevin Fan†, Scott Mahlke†,

CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.

CML Branch Penalty Reduction by Software Branch Hinting Jing Lu Yooseong Kim, Aviral Shrivastava, and Chuan Huang Compiler Microarchitecture Lab Arizona.

Winter-Spring 2001Codesign of Embedded Systems1 Essential Issues in Codesign: Architectures Part of HW/SW Codesign of Embedded Systems Course (CE )

Ph.D. in Computer Science

CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal

CSE-591 Compilers for Embedded Systems Code transformations and compile time data management techniques for application mapping onto SIMD-style Coarse-grained.

EPIMap: Using Epimorphism to Map Applications on CGRAs

Hyunchul Park, Kevin Fan, Manjunath Kudlur,Scott Mahlke

Ann Gordon-Ross and Frank Vahid*

Summary of my Thesis Power-efficient performance is the need of the future CGRAs can be used as programmable accelerators Power-efficient performance can.

URECA: A Compiler Solution to Manage Unified Register File for CGRAs

Dynamic Code Mapping Techniques for Limited Local Memory Systems

Reiley Jeyapaul and Aviral Shrivastava Compiler-Microarchitecture Lab

1. Arizona State University, Tempe, USA

Spring 2008 CSE 591 Compilers for Embedded Systems

RAMP: Resource-Aware Mapping for CGRAs

CS295: Modern Systems: Application Case Study Neural Network Accelerator – 2 Sang-Woo Jun Spring 2019 Many slides adapted from Hyoukjun Kwon‘s Gatech.

Presentation transcript:

Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula School of Computing, Informatics, and Decision Systems Engineering Arizona State University June 2013 This work was supported in part by CSR-EHS , CCF , CCF , NSF IUCRC for Embedded Systems (IIP ), Center for Embedded Systems grant DWS-0086; Science Foundation Arizona grant SRG , Raytheon and by the Stardust Foundation.

ADRES[1] CGRA Intel Core i7 NVIDIA Tesla™ c2050 Power (W) Giga Ops per Sec 60 GOpS/W 1.4 GOpS/W 4.3 GOpS/W Demand for performance Power consumption Technology scaling 2 Core Accelerator Shared Cache Private cache [1] BOUWENS, F., BEREKOVIC, M., SUTTER, B. D., AND GAYDADJIEV, G. Architecture enhancements for the adres coarse-grained reconfigurable array. In Proc. HiPEAC (2008), pp. 66–81.

2D array of Processing Elements (PEs) ALU + Local register file → PE Mesh interconnection Shared data bus – Data memory PE inputs: – 4 Neighboring PEs – Local register file 3

Time P 1 2 Q 1 2

P 1 2 Q 1 2

Size of resource graph ≈ O(n) Partition the resources n+1 partitions Huge number of possible partitions (exponential) Assign operations to sets such that All operations are mapped Data dependency between operations are obeyed Intractable Existing techniques are Exploratory – Huge search space – If fail, start from scratch Adhoc register allocation 6

General Problem formulation Reduce search space – Partition the problem to Scheduling and integrated placement and register allocation – No register in resource graph Constructive search Integrated placement and register allocation REGIMap – Schedule DFG – Construct Resource graph – Construct a compatibility graph between DFG and resource graph – Model register requirement of operation in the weight of arcs in compatibility graph – Find a restricted maximal clique 7

P 1 2 Q 1 2 2

Loops from SPEC2006 and multimedia benchmarks 4 × 4 CGRA with enough instruction and data memory Shared data bus for each row Latency is 1 cycle Compared with register-aware DRESC [2] 9 [2] DE SUTTER, B., COENE, P., VANDER AA, T., AND MEI, B. Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays. In Proc. LCTES (2008), pp. 151–160.

10

11

Accelerators for energy efficiency Coarse-grained reconfigurable architecture, a programmable accelerator Contributions – Problem formulation – Search space reduction – Constructive search – Integrated register allocation – REGIMap Better mappings 1.8X performance improvement On average 56 times better compilation time Please join my poster presentation for more details 12