A Code Refinement Methodology for Performance-Improved Synthesis from C Greg Stitt, Frank Vahid*, Walid Najjar Department of Computer Science and Engineering.


A Code Refinement Methodology for Performance-Improved Synthesis from C. Greg Stitt, Frank Vahid*, Walid Najjar. Department of Computer Science and Engineering, University of California, Riverside. *Also with the Center for Embedded Computer Systems, UC Irvine. This research is supported in part by the National Science Foundation and the Semiconductor Research Corporation.

2/23 Introduction

Previous work: an in-depth hw/sw partitioning study of an H.264 decoder, in collaboration with Freescale.

[Figure: H.264 functions (motionComp(), filterLuma(), filterChroma(), deblocking(), .....) are compiled to a uP, while selected critical regions such as motionComp() and deblocking() are synthesized to an FPGA.]

3/23 Introduction

Previous work: an in-depth hw/sw partitioning study of an H.264 decoder, in collaboration with Freescale. Obtained a 2.5x speedup, but a large gap remained between the ideal and actual speedup.

4/23 Introduction

- Noticed that coding constructs/practices limited hardware speedup
- Identified problematic coding constructs
- Developed simple coding guidelines
  - Dozens of lines of code; minutes per guideline
- Refined the critical regions using the guidelines

[Figure: the guidelines are applied to critical regions, e.g. motionComp() becomes motionComp'() and deblocking() becomes deblocking'(), before hw/sw partitioning.]

5/23 Introduction

- The simple guidelines increased the speedup to 6.5x
- Can simple coding guidelines show similar improvements on other applications?

6/23 Coding Guidelines

- Analyzed dozens of benchmarks
- Identified common problems related to synthesis
- Developed 10 guidelines to fix these problems
  - Although some are well known, analysis shows they are rarely applied
  - Automation is unlikely or impossible in many cases

The 10 guidelines:
- Conversion to Constants (CC)
- Conversion to Fixed Point (CF)
- Conversion to Explicit Data Flow (CEDF)
- Conversion to Explicit Memory Accesses (CEMA)
- Function Specialization (FS)
- Constant Input Enumeration (CIE)
- Loop Rerolling (LR)
- Conversion to Explicit Control Flow (CECF)
- Algorithmic Specialization (AS)
- Pass-By-Value Return (PVR)

7/23 Fast Refinement

- Only several performance-critical regions; several dozen lines of code provide most of the performance improvement
- Apply the guidelines to only the critical regions
- Refining takes minutes to hours

[Figure: profiling results for a sample application (Idct(), Memset(), FIR(), Sort(), Search(), ReadInput(), WriteOutput(), Matrix(), Brev(), Compress(), Quantize(), .....), with a few functions dominating execution time.]

8/23 Conversion to Constants (CC)

Problem: Arrays of constants are commonly not declared as constants; instead they are initialized at runtime.

Guideline: Use a constant wrapper function, which declares the array const for all functions called beneath it. Because the array then can't change, prefetching won't violate dependencies; it can also enable constant folding.

Automation: Difficult, requires global def-use/alias analysis.

Before:

    int coef[100];
    void initCoef() { /* initialize coef */ }
    void fir() { /* fir filter using coef */ }
    void f() {
        initCoef();
        // other code
        fir();
    }

After:

    int coef[100];
    void initCoef() { /* initialize coef */ }
    void fir(const int array[100]) { /* fir filter using const array */ }
    void constWrapper(const int array[100]) {
        prefetchArray(array);  // array can't change, so prefetching won't violate dependencies
        // misc code...
        fir(array);
    }
    void f() {
        initCoef();
        constWrapper(coef);
    }
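A compilable sketch of the constant-wrapper pattern above, using a hypothetical 4-tap FIR kernel (the tap count, coefficient values, and function names are illustrative, not from the paper). Once the array reaches fir() through a const-qualified parameter, a synthesis tool can treat the taps as constants and fold the multiplies.

```c
#include <assert.h>

#define TAPS 4

static int coef[TAPS];

/* Runtime initialization, as in the slide's initCoef() */
static void initCoef(void) {
    for (int i = 0; i < TAPS; i++)
        coef[i] = i + 1;
}

/* const-qualified parameter: the array cannot change past this point */
static int fir(const int c[TAPS], const int x[TAPS]) {
    int acc = 0;
    for (int i = 0; i < TAPS; i++)
        acc += c[i] * x[i];
    return acc;
}

/* Constant wrapper: every call below this frame sees a const array */
static int firConstWrapper(const int c[TAPS], const int x[TAPS]) {
    return fir(c, x);
}

int firDemo(const int x[TAPS]) {
    initCoef();
    return firConstWrapper(coef, x);
}
```

In software the wrapper costs nothing beyond one extra call; the point is that the const qualifier makes the no-modification property visible to analysis without global def-use information.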

9/23 Conversion to Explicit Data Flow (CEDF)

Problem: Global variables make determination of parallelism difficult; it requires global def-use/alias analysis.

Guideline: Replace globals with extra parameters. This makes the data flow explicit, so simpler analysis may expose parallelism.

Automation: Has been proposed [Lee01], but is difficult because of aliases.

Before (a(), b(), c() must execute sequentially because of global array dependencies):

    int array[100];
    void a() { for (i = 0; i < 100; i++) array[i] = .....; }
    void b() { for (i = 0; i < 100; i++) array[i] = array[i] + f(i); }
    int c() { for (i = 0; i < 100; i++) temp += array[i]; }
    void d() {
        for (.....) { a(); b(); c(); }
    }

After (a() and c() can execute in parallel after the 1st iteration):

    void a(int array[100]) { for (i = 0; i < 100; i++) array[i] = .....; }
    void b(int array1[100], int array2[100]) {
        for (i = 0; i < 100; i++) array2[i] = array1[i] + f(i);
    }
    int c(int array[100]) { for (i = 0; i < 100; i++) temp += array[i]; }
    void d() {
        int array1[100], array2[100];
        for (.....) { a(array1); b(array1, array2); c(array2); }
    }
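A self-contained sketch of the explicit-dataflow version of the slide's a()/b()/c() pipeline. The fill and transform logic (and the names produce/transform/reduce) are illustrative; the point is that each array flows through parameters, so a dependence analysis can see that produce() writes array1, transform() reads array1 and writes array2, and reduce() reads only array2.

```c
#include <assert.h>

#define N 100

/* Writes its output array only: no hidden global state */
static void produce(int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = i;
}

/* Reads one array, writes another: the dependence is explicit */
static void transform(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] + 1;
}

/* Reads only its input array */
static int reduce(const int in[N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += in[i];
    return sum;
}

int pipeline(void) {
    int array1[N], array2[N];
    produce(array1);
    transform(array1, array2);
    /* produce() of a next iteration could overlap with reduce() here */
    return reduce(array2);
}
```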

10/23 Constant Input Enumeration (CIE)

Problem: Function parameters may limit parallelism. With the loop bounds not known, the loops are hard to unroll, so hardware executes one iteration at a time.

Guideline: Create an enum of the possible values. Synthesis can then create specialized functions; here, specialized versions f(2,2), f(2,4), f(4,2), f(4,4), with iterations parallelized in each version.

Automation: In some cases, def-use analysis may identify all inputs; in general, difficult due to aliases.

Before:

    void f(int a, int b) {
        ....
        for (i = 0; i < a; i++) {
            for (j = 0; j < b; j++) {
                c[i][j] = i + j;
            }
        }
    }

After:

    enum PRM { VAL1=2, VAL2=4 };
    void f(enum PRM a, enum PRM b) {
        ....
        for (i = 0; i < a; i++) {
            for (j = 0; j < b; j++) {
                c[i][j] = i + j;
            }
        }
    }
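A compilable version of the CIE transformation above (the enum values 2 and 4 follow the slide; the grid size and accessor are illustrative). Restricting the bounds to an enum of known values lets synthesis generate one fully unrolled version per value, while in software the enum still behaves like an int.

```c
#include <assert.h>

enum PRM { VAL1 = 2, VAL2 = 4 };

/* Sized for the largest enum value */
static int c[4][4];

/* Bounds come from the enum, so only the enumerated values are possible */
void fillGrid(enum PRM a, enum PRM b) {
    for (int i = 0; i < (int)a; i++)
        for (int j = 0; j < (int)b; j++)
            c[i][j] = i + j;
}

int gridAt(int i, int j) { return c[i][j]; }
```

Note that C does not stop a caller from casting an arbitrary int to the enum; the guideline relies on the designer using only the enumerated values, which is what makes the specialization sound.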

11/23 Conversion to Explicit Control Flow (CECF)

Problem: Function pointers may prevent static control-flow analysis; synthesis is unlikely to determine the possible targets of a function pointer.

Guideline: Replace the function pointer with static calls selected by an if-else chain. This makes the possible targets explicit; in the synthesized hardware, a[i] is driven by a 3x1 mux over f1(i), f2(i), f3(i), selected by fp.

Automation: In general impossible; equivalent to the halting problem.

Before:

    void f(int (*fp)(int)) {
        .....
        for (i = 0; i < 10; i++) {
            a[i] = fp(i);
        }
    }

After:

    enum Target { FUNC1, FUNC2, FUNC3 };
    void f(enum Target fp) {
        .....
        for (i = 0; i < 10; i++) {
            if (fp == FUNC1) a[i] = f1(i);
            else if (fp == FUNC2) a[i] = f2(i);
            else a[i] = f3(i);
        }
    }
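A runnable sketch of the CECF rewrite above. The three targets f1/f2/f3 are placeholder functions (their bodies are illustrative); what matters is that every possible callee is now statically visible, so a synthesis tool can instantiate all three and select among them.

```c
#include <assert.h>

enum Target { FUNC1, FUNC2, FUNC3 };

/* Placeholder targets standing in for the slide's f1/f2/f3 */
static int f1(int i) { return i + 1; }
static int f2(int i) { return i * 2; }
static int f3(int i) { return -i; }

/* The if-else chain is the software form of a 3x1 mux over the targets */
void applyTarget(enum Target fp, int a[10]) {
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1)      a[i] = f1(i);
        else if (fp == FUNC2) a[i] = f2(i);
        else                  a[i] = f3(i);
    }
}
```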

12/23 Algorithmic Specialization (AS)

Problem: Algorithms targeting software may not be fast in hardware (sequential vs. parallel), and C code generally uses software algorithms.

Guideline: Specialize critical functions with hardware algorithms.

Automation: Requires a higher-level specification, e.g. intrinsics.

Software algorithm (sequential binary search):

    int search(int a[], int k, int l, int r) {
        while (l <= r) {
            mid = (l + r) / 2;
            if (k > a[mid]) l = mid + 1;
            else if (k < a[mid]) r = mid - 1;
            else return mid;
        }
        return -1;
    }

Hardware algorithm (linear search, which can be parallelized in hardware):

    int search(int a[], int k, const int s) {
        for (i = 0; i < s; i++) {
            if (a[i] == k) return i;
        }
        return -1;
    }
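Both searches from the slide, made compilable side by side (only local declarations and const-correctness added). The binary search's iterations form a sequential dependence chain, while the linear scan's comparisons are independent and can all run in parallel in hardware.

```c
#include <assert.h>

/* Sequential software algorithm: each iteration depends on the last */
int searchBinary(const int a[], int k, int l, int r) {
    while (l <= r) {
        int mid = (l + r) / 2;
        if (k > a[mid])      l = mid + 1;
        else if (k < a[mid]) r = mid - 1;
        else return mid;
    }
    return -1;
}

/* Hardware-friendly algorithm: every comparison is independent,
 * so a synthesis tool can evaluate them concurrently */
int searchLinear(const int a[], int k, int s) {
    for (int i = 0; i < s; i++)
        if (a[i] == k) return i;
    return -1;
}
```

The trade-off the slide implies: the linear version does O(s) work instead of O(log s), which is why the methodology only applies AS when the software overhead is acceptable.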

13/23 Pass-By-Value Return (PVR)

Problem: Array parameters cannot be prefetched due to potential aliases, even when the designer knows that no aliases exist. Below, array can't be prefetched for g() because it may be aliased.

Guideline: Use pass-by-value-return: copy the array into a local, compute on the local, and copy it back. The local array can't be aliased, so it can be prefetched.

Automation: Requires global alias analysis.

Before:

    void f(int *a, int *b, int array[16]) {
        … // unrelated computation
        g(array);
        … // unrelated computation
    }
    int g(int array[16]) { /* computation done on array */ }

After:

    void f(int *a, int *b, int array[16]) {
        int localArray[16];
        memcpy(localArray, array, 16 * sizeof(int));
        … // misc computation
        g(localArray);
        … // misc computation
        memcpy(array, localArray, 16 * sizeof(int));
    }
    int g(int array[16]) { /* computation done on array */ }
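A runnable sketch of the PVR pattern above. The kernel g() here just doubles each element (an illustrative body, not from the paper); f() copies the caller's array into a local that provably has no aliases, runs the kernel on the copy, and copies the result back.

```c
#include <string.h>
#include <assert.h>

#define LEN 16

/* Illustrative kernel standing in for the slide's g() */
static void g(int array[LEN]) {
    for (int i = 0; i < LEN; i++)
        array[i] *= 2;
}

void f(int array[LEN]) {
    int localArray[LEN];
    memcpy(localArray, array, sizeof localArray);  /* copy in */
    g(localArray);  /* localArray cannot be aliased, so it is safe to prefetch */
    memcpy(array, localArray, sizeof localArray);  /* copy back */
}
```

The two memcpy calls are the software overhead the methodology weighs before applying this guideline.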

14/23 Why Synthesis From C? Why not use an HDL?

- An HDL may yield better results, but:
  - C is a mainstream language
  - Performance is acceptable in many cases
  - Learning an HDL is a large overhead
- The approaches are orthogonal; this work focuses on improving the mainstream path
- The guidelines are common practice for HDL and can also be applied to algorithmic HDL

15/23 Software Overhead

- Refined regions may not be partitioned to hardware
  - The partitioner may select non-refined regions
  - The OS may select a software or hardware implementation based on the state of the FPGA
- Coding guidelines therefore have potential software overhead

[Figure: after hw/sw partitioning, refined code such as motionComp'() and deblocking'() may be mapped to the uP rather than the FPGA.]

16/23 Refinement Methodology

Considerations:
- Reduce software overhead
- Reduce refinement time

Methodology: profile, then apply iterative improvement. Repeat until performance is acceptable:
1. Profile and determine the critical region
2. Apply all guidelines except PVR/AS (minimal overhead): CC, CF, CEMA, CIE, CEDF, CECF, FS, LR
3. Apply PVR if the overhead of copying the array is acceptable
4. Apply AS if a suitable hardware algorithm exists and its software performance is acceptable

17/23 Experimental Setup

- Benchmark suite: MediaBench, Powerstone
- Manually applied guidelines: 1-2 hours; 23 additional lines per benchmark, on average
- Target architecture: Xilinx Virtex-II FPGA with an ARM9 uP
- Hardware/software partitioning: selects critical regions for hardware
- Synthesis: high-level synthesis tool (~30,000 lines of C code) outputs register-transfer-level (RTL) VHDL; RTL synthesis using Xilinx ISE
- Compilation: gcc with -O1 optimizations

[Figure: tool flow from benchmarks through manual refinement and hw/sw partitioning, with software compiled for the ARM9 and hardware synthesized to a Virtex-II bitfile.]

18/23 Speedups from Guidelines

- No guidelines: 2x speedup
- Conversion to Constants: 3.6x speedup (total time: 5 minutes)
- Explicit Data Flow + Algorithmic Specialization: 16.4x speedup (total time: 15 minutes)

19/23 Speedups from Guidelines

- No guidelines: 8.6x speedup
- Conversion to Constants: 14.4x speedup (total time: 10 minutes)
- Input Enumeration: 16.7x speedup (total time: 20 minutes)
- Algorithmic Specialization: 19x speedup (time: 30 minutes; sw overhead: 6000%)

20/23 Speedups from Guidelines

- Speedups range from 1x (no speedup) to 573x
- Original code: average 2.6x (excludes brev)
- Refined code with guidelines: average 8.4x (excludes brev)
- 3.5x average improvement compared to the original code

21/23 Speedups from Guidelines

- The guidelines move speedups closer to ideal; almost identical to ideal for mpeg2 and fir
- Several examples are still far from ideal, which may imply that new guidelines are needed

22/23 Guideline SW Overhead/Improvement

- Average sw performance overhead: -15.7% (an improvement); -1.1% excluding brev
- 3 examples improved
- Average sw size overhead (lines of C code): 8.4% excluding brev

23/23 Summary

- Simple coding guidelines significantly improve synthesis from C
  - 3.5x speedup compared to hw/sw synthesized from unrefined code
  - Major rewrites may not be necessary: refinement took between 1 and 2 hours
- The refinement methodology reduces software size/performance overhead; in some cases it yields an improvement
- Future work: test on commercial synthesis tools; new guidelines for different domains