Hy-C A Compiler Retargetable for Single-Chip Heterogeneous Multiprocessors Philip Sweany 8/27/2010.

Presentation transcript:


No single architecture solves all power problems. Industry has debated the merits of each architecture for decades… A combination of all approaches optimizes power and performance. [Figure: hard-wired proxy, software-programmable DSP, and general-purpose processor, with 10X and 100X labels]

Retargetable Compilation Why? Rocket – C compiler, written in C++ – Retargetable for ILP computers – Single machine description file – Development GNU

Hybrid Computing Heterogeneous processors on single chip – “CPU” – FPGA – ASIC – N “CPU”s, M FPGAs, K ASICs Tradeoffs of performance, power, flexibility

Generic Hybrid Architecture [Figure: a multi-CPU group (CPU 1, CPU 2, …, CPU m) and a multi-FPGA group (FPGA 1, FPGA 2, …, FPGA n) with shared memory]

Generic Hy-C Tools [Figure: tool flow with blocks labeled Source Code, System Specification, Optimization Control Objectives/Constraints, Partitioning, CPU Compiler, FPGA Synthesis, CPU Power-Performance Model, FPGA Power-Performance Model]

Intermediate Representations – 3-address form – Control flow graph – SSA (static single assignment)

Control Flow Graph Nodes are Basic Blocks – Single entry, single exit – No branch except (possibly) at bottom Edges represent one possible flow of execution between two basic blocks Whole CFG represents a function
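
The basic-block definition above can be illustrated with a small sketch. It is not taken from Hy-C or Rocket; the `(label, opcode, target)` instruction tuples, the opcode names, and the function name `split_basic_blocks` are all assumptions made for illustration. It uses the usual "leader" rule: a block starts at the first instruction, at any branch target, and right after any branch.

```python
# Hypothetical opcodes that end a basic block (an assumption for this sketch).
BRANCH_OPCODES = {"br", "jmp", "ret"}

def split_basic_blocks(instrs):
    """Split a linear instruction list into basic blocks.

    Each instruction is a hypothetical (label, opcode, target) tuple.
    """
    if not instrs:
        return []
    # Map each label to the index of the instruction that carries it.
    label_index = {lab: i for i, (lab, _, _) in enumerate(instrs) if lab is not None}
    leaders = {0}  # the first instruction always starts a block
    for i, (_, opcode, target) in enumerate(instrs):
        if opcode in BRANCH_OPCODES:
            if target is not None:
                leaders.add(label_index[target])  # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)  # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [instrs[s:e] for s, e in zip(starts, ends)]

program = [
    (None, "mov", None),    # entry block
    ("loop", "add", None),  # loop body starts at the branch target
    (None, "br", "loop"),   # branch only at the bottom of its block
    (None, "ret", None),    # exit block
]
blocks = split_basic_blocks(program)
```

Here the four instructions fall into three blocks, and each block has a branch only as its last instruction, if at all.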

Static Single Assignment SSA: A program is in SSA form iff – Each variable is statically defined exactly once, and – Each use of a variable is dominated by that variable’s definition.
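
As a minimal sketch of the "defined exactly once" condition, the following renames straight-line 3-address code so that every assignment creates a fresh version of its destination. The `(dest, op, arg1, arg2)` tuple format and the function name are assumptions for illustration. Straight-line code has no control-flow merges, so this sketch needs no merge functions; a full SSA construction for arbitrary control flow is more involved.

```python
def to_ssa_straight_line(code):
    """Rename straight-line 3-address code into SSA form.

    code is a list of hypothetical (dest, op, arg1, arg2) tuples of strings.
    Names that are never assigned (inputs, constants) are left unversioned.
    """
    version = {}  # current version number of each assigned variable
    renamed = []

    def use(name):
        # A use refers to the most recent definition, if there is one.
        return f"{name}{version[name]}" if name in version else name

    for dest, op, arg1, arg2 in code:
        a, b = use(arg1), use(arg2)  # rename uses before the new definition
        version[dest] = version.get(dest, 0) + 1
        renamed.append((f"{dest}{version[dest]}", op, a, b))
    return renamed

code = [
    ("x", "+", "a", "b"),
    ("x", "*", "x", "c"),  # redefines x, so it becomes x2
    ("y", "-", "x", "a"),
]
ssa = to_ssa_straight_line(code)
```

After renaming, `x` is split into `x1` and `x2`, and the last statement reads `x2`, so each name has exactly one static definition.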

Example In general, how to transform an arbitrary program into SSA form? Does the definition of X2 dominate its use in the example? [Figure: control flow graph with definitions of X1, X2, X3, and X4 and a merge of (X1, X2)]
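
The dominance question can be checked mechanically. The sketch below uses the standard iterative data-flow formulation of dominators on a hypothetical diamond-shaped CFG; the node names and the graph are assumptions and do not reproduce the slide's figure.

```python
def dominators(succ, entry):
    """Return, for each node, the set of nodes that dominate it.

    succ maps every node to a list of its successors (every node must be a key).
    """
    nodes = set(succ)
    preds = {n: set() for n in nodes}
    for n, successors in succ.items():
        for s in successors:
            preds[s].add(n)
    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            new = {n} | set.intersection(*pred_doms) if pred_doms else {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# Hypothetical diamond: A branches to B or C, and both rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dom = dominators(cfg, "A")
```

In this diamond, `dom["D"]` contains only `A` and `D`. A definition placed in `B` therefore does not dominate a use in `D`, which is the situation where SSA needs a merge of the incoming versions.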

SSA: Motivation Provide a uniform basis of an IR to solve a wide range of classical dataflow problems Encode both dataflow and control flow information An SSA form can be constructed and maintained efficiently It's popular – GCC uses SSA

Software Pipelining Schedule operations from multiple iterations of a loop in parallel Hides latency Compiler “reorders” loop code to include: – Prelude – Kernel – Postlude
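
The prelude/kernel/postlude structure can be sketched on a two-stage loop body. This is only an illustration of the reordering: Python runs the statements one after another, so nothing executes in parallel here, and the two "stages" (a multiply followed by an add) and the function names are assumptions.

```python
def scale_sequential(values, k):
    """Original loop: both stages of an iteration run back to back."""
    out = []
    for x in values:
        t = x * k          # stage 1: multiply
        out.append(t + 1)  # stage 2: add, depends on stage 1
    return out

def scale_pipelined(values, k):
    """Same computation with stage 2 of iteration i-1 overlapped with
    stage 1 of iteration i, as a software-pipelined schedule would do."""
    n = len(values)
    if n == 0:
        return []
    out = []
    # Prelude: start the first iteration (stage 1 only).
    t = values[0] * k
    # Kernel: the two statements are independent of each other, so an ILP
    # machine could issue them in the same cycle.
    for i in range(1, n):
        t_next = values[i] * k  # stage 1 of iteration i
        out.append(t + 1)       # stage 2 of iteration i-1
        t = t_next
    # Postlude: finish the last iteration (stage 2 only).
    out.append(t + 1)
    return out

data = [1, 2, 3]
result = scale_pipelined(data, 2)
```

Both versions return the same list. The pipelined form exposes two independent operations per kernel iteration, which is how the latency of stage 1 is hidden.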

Software Pipeline Benefit for “Typical” Architecture and MMult “Typical” Architecture – 8-wide Instruction-Level Parallel (ILP) Assuming 3000 x 3000 matrices – Original requires 45 million cycles – Pipelined version requires 3 million + 15
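
Taking the two cycle counts on this slide at face value, the implied speedup is a one-line calculation. The variable names are illustrative only.

```python
# Cycle counts as stated on the slide.
original_cycles = 45_000_000
pipelined_cycles = 3_000_000 + 15

speedup = original_cycles / pipelined_cycles
print(f"speedup = {speedup:.2f}x")
```

The ratio is just under 15, so the pipelined version is roughly 15 times faster under the slide's assumptions.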

Current Compiler Projects Hy-C – Build tools – Partition algorithms – Retargetability and constraint specification – OMAP project Thread-level parallelism in imperative code – Limit study – Improved identification of threads Fast compiler-controlled memory

OMAP4 Sub-System Encapsulation [Figure: blocks labeled Application, Imaging, Video, Audio]

OMAP Resources [Figure: multi-CPU group of Chiron, Tesla, and Ducati with shared memory]

OMAP Processor Resources Chiron – 2 x 600 MHz (two symmetric processors, each at 600 MHz, with shared L2) – Power 600 µW/MHz Tesla – DSP Sub-System (C64x derivative); 400 MHz, 8-wide ILP – Power 200 µW/MHz Ducati – 200 MHz (targeted for control, low-latency code) – Power 100 µW/MHz
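
The per-MHz figures above can be turned into rough power numbers at full clock. This sketch assumes power scales linearly with clock frequency and that the Chiron figure applies to one of its two processors; neither assumption is stated on the slide, and the function name is illustrative.

```python
# (clock in MHz, power in microwatts per MHz), as listed on the slide.
resources = {
    "Chiron": (600, 600),  # assumed to be per processor
    "Tesla": (400, 200),
    "Ducati": (200, 100),
}

def full_clock_power_mw(clock_mhz, uw_per_mhz):
    """Power in milliwatts at full clock, assuming linear scaling."""
    return clock_mhz * uw_per_mhz / 1000.0

power_mw = {name: full_clock_power_mw(*spec) for name, spec in resources.items()}
for name, mw in power_mw.items():
    print(f"{name}: {mw:.0f} mW")
```

Under these assumptions the three resources come out at 360 mW, 80 mW, and 20 mW, which gives a sense of the power range a partitioner can trade against performance.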

Hy-C for OMAP [Figure: tool flow with blocks labeled Source Code, System Specification, Optimization Control Objectives/Constraints, Partitioning, Veyron, Tesla, Ducati]

OMAP Project, Current State Use gcc to generate “readable” SSA graphs for C programs Developing translator to convert SSA graphs to Hy-C internal Control, Data Dependence Graphs (CDDGs). Translator to Hy-C CDDGs successfully tested on small C programs

Partition Algorithm Examine Control Flow Graph (CFG) for a function – Identify software pipelining possibility – Build Dependence Graph (combining data and control dependence) Choose one of three resources for the function

Partition Algorithm (cont.) If software pipelining profitable, place function on C64 DSP resource Else examine Dependence Graph – if ( number of nodes / critical path length ) > 1.5, place on double-issue ARM – else place on single-issue ARM
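
The decision rule on these two slides can be sketched as follows. This is not Hy-C code: the dependence-graph representation (a dict from node to successor list), the unit-latency critical path measured in nodes, and the function names are assumptions. Whether software pipelining is profitable is taken as an input, since the slides do not say how it is decided.

```python
from functools import lru_cache

def critical_path_length(deps):
    """Longest chain of dependent nodes in an acyclic dependence graph.

    deps maps each node to the nodes that depend on it; every node is a key.
    Each node is assumed to take one time unit.
    """
    @lru_cache(maxsize=None)
    def longest_from(node):
        return 1 + max((longest_from(s) for s in deps[node]), default=0)

    return max((longest_from(n) for n in deps), default=0)

def choose_resource(pipelining_profitable, deps):
    """Pick one of the three resources for a function, following the slides."""
    if pipelining_profitable:
        return "C64 DSP"
    path = critical_path_length(deps)
    if path > 0 and len(deps) / path > 1.5:
        return "double-issue ARM"  # enough parallelism to use two issue slots
    return "single-issue ARM"

# Hypothetical graphs: a mostly serial chain and a wide, shallow one.
serial = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
wide = {"a": [], "b": [], "c": [], "d": ["e"], "e": []}
```

For `serial` the ratio is 4 / 3, below the 1.5 threshold, so it goes to the single-issue ARM. For `wide` the ratio is 5 / 2, so it goes to the double-issue ARM. The ratio of nodes to critical path length acts as a cheap estimate of the average available parallelism.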

Long-Term Future Automatic Code Generation (I don’t believe in software) Visual Programming of Components