Hy-C: A Compiler Retargetable for Single-Chip Heterogeneous Multiprocessors. Philip Sweany, 8/27/2010
No single architecture solves all power problems. [Figure: hard-wired proxy, general-purpose processor, and software-programmable DSP compared; scale labels 10X and 100X.] Industry has debated the merits of each architecture for decades. A combination of all approaches optimizes power and performance.
Retargetable Compilation: Why? Rocket – C compiler, written in C++ – Retargetable for ILP computers – Single machine description file – Development GNU
Hybrid Computing Heterogeneous processors on single chip – “CPU” – FPGA – ASIC – N “CPU”s, M FPGAs, K ASICs Tradeoffs of performance, power, flexibility
Generic Hybrid Architecture [Figure: a multi-CPU group (CPU 1, CPU 2, ..., CPU m) and a multi-FPGA group (FPGA 1, FPGA 2, ..., FPGA n) with shared memory.]
Generic Hy-C Tools [Figure: tool flow with boxes labeled System Specification, Source Code, Partitioning, Optimization Control, Objectives/Constraints, CPU Compiler, FPGA Synthesis, CPU Power-Performance Model, and FPGA Power-Performance Model.]
Intermediate Representations – 3-address form – Control flow graph – SSA (static single assignment)
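To make the 3-address form concrete, here is a minimal sketch that lowers an arithmetic expression into statements with at most one operator each. It uses Python's standard `ast` module as a stand-in front end; the function name `to_three_address` and the temporary names `t1`, `t2`, ... are illustrative assumptions, not part of Hy-C or Rocket.

```python
import ast
import itertools

# Map Python AST operator nodes to printable operator symbols.
_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def to_three_address(expr: str) -> list[str]:
    """Lower an arithmetic expression into three-address statements.

    Hypothetical helper for illustration only.
    """
    code: list[str] = []
    temps = (f"t{i}" for i in itertools.count(1))

    def lower(node: ast.expr) -> str:
        # Leaves are returned as operands; no statement is emitted for them.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        # Each binary operation becomes one statement with a fresh temporary.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            left = lower(node.left)
            right = lower(node.right)
            dest = next(temps)
            code.append(f"{dest} = {left} {_OPS[type(node.op)]} {right}")
            return dest
        raise ValueError(f"unsupported expression node: {ast.dump(node)}")

    lower(ast.parse(expr, mode="eval").body)
    return code


if __name__ == "__main__":
    for line in to_three_address("a + b * (c - 2)"):
        print(line)
```

Each emitted statement has one destination and at most two source operands, which is the property that later passes (CFG construction, SSA renaming) rely on.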
Control Flow Graph Nodes are Basic Blocks – Single entry, single exit – No branch except (possibly) at bottom Edges represent one possible flow of execution between two basic blocks Whole CFG represents a function
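The basic-block definition above can be sketched as a small "leader" scan over a linear instruction list. The textual instruction syntax used here (`label:`, `goto L`, `if ... goto L`, `return`) is an assumption made for the example and is not Hy-C's actual IR. Every label is conservatively treated as a block entry.

```python
def split_basic_blocks(instrs: list[str]) -> list[list[str]]:
    """Split a linear instruction list into basic blocks (illustrative sketch)."""

    def is_label(ins: str) -> bool:
        return ins.endswith(":")

    def is_branch(ins: str) -> bool:
        return ins.startswith(("goto ", "if ", "return"))

    # A leader starts a block: the first instruction, any label,
    # and any instruction that follows a branch.
    leaders = set()
    for i, ins in enumerate(instrs):
        if i == 0 or is_label(ins):
            leaders.add(i)
        if is_branch(ins) and i + 1 < len(instrs):
            leaders.add(i + 1)

    blocks: list[list[str]] = []
    for i, ins in enumerate(instrs):
        if i in leaders:
            blocks.append([])
        blocks[-1].append(ins)
    return blocks


def build_cfg(instrs: list[str]):
    """Return (blocks, edges); edges are pairs of block indices."""
    blocks = split_basic_blocks(instrs)
    label_to_block = {b[0][:-1]: i for i, b in enumerate(blocks) if b[0].endswith(":")}
    edges = set()
    for i, block in enumerate(blocks):
        last = block[-1]
        # Taken-branch edge to the block that starts with the target label.
        if "goto " in last:
            target = last.split("goto ", 1)[1].strip()
            edges.add((i, label_to_block[target]))
        # Fall-through edge unless the block ends in an unconditional jump or return.
        falls_through = not last.startswith(("goto ", "return"))
        if falls_through and i + 1 < len(blocks):
            edges.add((i, i + 1))
    return blocks, sorted(edges)


PROGRAM = [
    "i = 0",
    "loop:",
    "if i >= n goto done",
    "s = s + i",
    "i = i + 1",
    "goto loop",
    "done:",
    "return s",
]

if __name__ == "__main__":
    blocks, edges = build_cfg(PROGRAM)
    for index, block in enumerate(blocks):
        print(f"B{index}: {block}")
    print(edges)
```

In this example the only branches sit at the bottom of their blocks, and the back edge from the loop body to the loop header is what a software pipeliner would look for.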
1/26/2016 Static Single Assignment SSA: A program is in SSA form iff – Each variable is statically defined exactly once, and – Each use of a variable is dominated by that variable’s definition.
Example [Figure: control flow graph with definitions of X1, X2, X3, and X4, and a merge of (X1, X2).] In general, how to transform an arbitrary program into SSA form? Does the definition of X2 dominate its use in the example?
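A dominance question like the one on this slide can be answered mechanically. The sketch below computes dominator sets with the classic iterative dataflow formulation, on a hypothetical diamond-shaped CFG (the slide's own figure is not recoverable, so the node names `entry`, `left`, `right`, and `join` are assumptions). A definition placed in `left` does not dominate a use in `join`, which is the situation where SSA needs a merge of the incoming values.

```python
def dominators(succ: dict[str, list[str]], entry: str) -> dict[str, set[str]]:
    """Iteratively compute the set of dominators of every CFG node.

    `succ` maps each node to its successors; every node must appear as a key.
    """
    nodes = set(succ)
    preds: dict[str, list[str]] = {n: [] for n in nodes}
    for node, targets in succ.items():
        for target in targets:
            preds[target].append(node)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {entry}):
            incoming = [dom[p] for p in preds[n]]
            # A node is dominated by itself plus whatever dominates all predecessors.
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Hypothetical diamond CFG: entry branches to left/right, which rejoin at join.
DIAMOND = {
    "entry": ["left", "right"],
    "left": ["join"],
    "right": ["join"],
    "join": [],
}

if __name__ == "__main__":
    dom = dominators(DIAMOND, "entry")
    print("left dominates join:", "left" in dom["join"])
```

Production compilers typically use faster dominator-tree algorithms; the set-based version is shown only because it follows the definition directly.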
SSA: Motivation Provide a uniform basis of an IR to solve a wide range of classical dataflow problems Encode both dataflow and control flow information An SSA form can be constructed and maintained efficiently It’s popular: GCC uses SSA
Software Pipelining Schedule operations from multiple iterations of a loop in parallel Hides latency Compiler “reorders” loop code to include: – Prelude – Kernel – Postlude
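The prelude/kernel/postlude structure can be illustrated with a toy schedule generator. This is a sketch under simplifying assumptions that the slide does not state: every stage takes one cycle, a new iteration starts every cycle, and the machine has enough functional units to run all stages at once. The stage names are hypothetical.

```python
def pipeline_schedule(stages: list[str], n_iters: int) -> list[tuple[str, list[str]]]:
    """Return one (phase, operations) entry per cycle of a software-pipelined loop.

    Stage s of iteration i executes in cycle i + s.
    """
    depth = len(stages)
    if n_iters < depth:
        raise ValueError("sketch assumes at least as many iterations as stages")

    schedule = []
    for cycle in range(n_iters + depth - 1):
        ops = [
            f"{stage}[{cycle - s}]"
            for s, stage in enumerate(stages)
            if 0 <= cycle - s < n_iters
        ]
        if cycle < depth - 1:
            phase = "prelude"   # pipeline is still filling
        elif cycle < n_iters:
            phase = "kernel"    # steady state: every stage is busy
        else:
            phase = "postlude"  # pipeline is draining
        schedule.append((phase, ops))
    return schedule


if __name__ == "__main__":
    for cycle, (phase, ops) in enumerate(pipeline_schedule(["load", "mul", "store"], 4)):
        print(f"cycle {cycle} {phase:8s} {' '.join(ops)}")
```

With three stages and four iterations the pipelined loop finishes in 6 cycles, against 12 if each iteration ran to completion before the next started. The kernel rows are the ones that overlap operations from different iterations.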
Software Pipeline Benefit for “Typical” Architecture and MMult “Typical” Architecture – 8-wide Instruction-Level Parallel (ILP) Assuming 3000 x 3000 matrices – Original requires 45 million cycles – Pipelined version requires 3 million + 15
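The cycle counts above are consistent with a very simple cost model, sketched here. The slide does not state the iteration count, the per-iteration latency, or the initiation interval; the values 3 million iterations, 15 cycles, and an interval of 1 are assumptions chosen only because they reproduce the slide's totals.

```python
def sequential_cycles(iterations: int, body_latency: int) -> int:
    """Cycles when each iteration finishes before the next one starts."""
    return iterations * body_latency


def pipelined_cycles(iterations: int, body_latency: int, initiation_interval: int = 1) -> int:
    """Cycles when a new iteration starts every `initiation_interval` cycles.

    The fill-and-drain overhead is modelled as one body latency (an assumption).
    """
    return iterations * initiation_interval + body_latency


if __name__ == "__main__":
    # Assumed parameters, not taken from the slide.
    n, latency = 3_000_000, 15
    print(sequential_cycles(n, latency))
    print(pipelined_cycles(n, latency))
```

Under these assumed parameters the model gives 45,000,000 cycles for the original loop and 3,000,015 for the pipelined one, a speedup of roughly 15x.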
Current Compiler Projects Hy-C – Build tools – Partition algorithms – Retargetability and constraint specification – OMAP project Thread-level parallelism in imperative code – Limit study – Improved identification of threads Fast compiler-controlled memory
OMAP4 Sub-System Encapsulation [Figure: blocks labeled Application, Imaging, Video, and Audio.]
OMAP Resources [Figure: Chiron, Tesla, and Ducati in a multi-CPU group with shared memory.]
OMAP Processor Resources Chiron – 2 x 600 MHz (2 symmetric processors, each at 600 MHz, with shared L2) – Power 600 µW/MHz Tesla – DSP Sub-System (C64x derivative); 400 MHz, 8-wide ILP – Power 200 µW/MHz Ducati – 200 MHz (targeted for control, low-latency code) – Power 100 µW/MHz
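The power figures are given per MHz, so a rough active-power estimate at the listed clock is just a multiplication. The sketch below assumes power scales linearly with clock frequency and that each figure applies to one processor at its nominal clock; the slide does not say whether the Chiron figure is per core or for both cores.

```python
# (clock in MHz, power in uW per MHz), copied from the slide.
RESOURCES = {
    "Chiron": (600, 600),
    "Tesla": (400, 200),
    "Ducati": (200, 100),
}


def active_power_mw(clock_mhz: float, uw_per_mhz: float) -> float:
    """Estimated active power in milliwatts, assuming linear scaling with clock."""
    return clock_mhz * uw_per_mhz / 1000.0


if __name__ == "__main__":
    for name, (clock, rate) in RESOURCES.items():
        print(f"{name}: {active_power_mw(clock, rate):.0f} mW")
```

Under these assumptions the estimates are 360 mW for Chiron, 80 mW for Tesla, and 20 mW for Ducati, which shows the spread a power-aware partitioner has to work with.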
Hy-C for OMAP [Figure: tool flow with boxes labeled System Specification, Source Code, Partitioning, Optimization Control, Objectives/Constraints, Veyron, Ducati, and Tesla.]
OMAP Project, Current State Use gcc to generate “readable” SSA graphs for C programs Developing translator to convert SSA graphs to Hy-C internal Control, Data Dependence Graphs (CDDGs). Translator to Hy-C CDDGs successfully tested on small C programs
Partition Algorithm Examine Control Flow Graph (CFG) for a function – Identify software pipelining possibility – Build Dependence Graph (combining data and control dependence) Choose one of three resources for the function
Partition Algorithm (cont.) If software pipelining is profitable, place function on C64 DSP resource Else examine Dependence Graph – if ( number of nodes / critical path length ) > 1.5, place on double-issue ARM – else place on single-issue ARM
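The decision rule on these two slides can be sketched in a few lines. Several details are assumptions, because the slides leave them open: the dependence graph is represented as an acyclic dictionary from each node to its dependents, the critical path is measured in nodes (unit latencies), and whether software pipelining is profitable is supplied by the caller as a boolean. The function names are hypothetical and not Hy-C's.

```python
def critical_path_length(deps: dict[str, list[str]]) -> int:
    """Length, in nodes, of the longest path through an acyclic dependence graph."""
    memo: dict[str, int] = {}

    def longest_from(node: str) -> int:
        if node not in memo:
            memo[node] = 1 + max((longest_from(s) for s in deps[node]), default=0)
        return memo[node]

    return max((longest_from(n) for n in deps), default=0)


def choose_resource(
    deps: dict[str, list[str]],
    pipelining_profitable: bool,
    threshold: float = 1.5,
) -> str:
    """Pick one of the three resources named on the slide for a function."""
    if pipelining_profitable:
        return "C64 DSP"
    if not deps:
        raise ValueError("dependence graph is empty")
    # Nodes per critical-path step approximates the available parallelism.
    ratio = len(deps) / critical_path_length(deps)
    return "double-issue ARM" if ratio > threshold else "single-issue ARM"


# Two hypothetical dependence graphs (node -> nodes that depend on it).
NARROW = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}                     # 4 nodes, path 3
WIDE = {"a": ["e"], "b": ["e"], "c": ["e"], "d": ["e"], "e": []}           # 5 nodes, path 2

if __name__ == "__main__":
    print(choose_resource(NARROW, pipelining_profitable=False))
    print(choose_resource(WIDE, pipelining_profitable=False))
    print(choose_resource(WIDE, pipelining_profitable=True))
```

The ratio of node count to critical path length is a cheap estimate of average instruction-level parallelism: a ratio near 1 means a mostly serial chain that gains little from a second issue slot.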
Long-Term Future Automatic Code Generation (I don’t believe in software) Visual Programming of Components