1
A Code Refinement Methodology for Performance-Improved Synthesis from C
Greg Stitt, Frank Vahid*, Walid Najjar
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This research is supported in part by the National Science Foundation and the Semiconductor Research Corporation
2
2/23 Introduction
Previous work: in-depth hw/sw partitioning study of an H.264 decoder, in collaboration with Freescale
[Figure: the H.264 code (motionComp(), filterLuma(), filterChroma(), deblocking(), ...) is compiled for the uP; the selected critical regions motionComp() and deblocking() are synthesized to the FPGA]
3
3/23 Introduction
Previous work: in-depth hw/sw partitioning study of an H.264 decoder, in collaboration with Freescale
- Obtained a 2.5x speedup
- Large gap between ideal and actual speedup
4
4/23 Introduction
Noticed that coding constructs/practices limited hw speed
- Identified problematic coding constructs
- Developed simple coding guidelines
  - Dozens of lines of code
  - Minutes per guideline
- Refined critical regions using the guidelines
[Figure: motionComp(), filterLuma(), filterChroma(), deblocking(), ... -> Apply Guidelines -> motionComp'(), filterLuma(), filterChroma(), deblocking'(), ... -> Hw/Sw Partitioning]
5
5/23 Introduction
Noticed that coding constructs/practices limited hw speed
- Identified problematic coding constructs
- Developed simple coding guidelines
  - Dozens of lines of code
  - Minutes per guideline
- Refined critical regions using the guidelines
Simple guidelines increased the speedup to 6.5x
Can simple coding guidelines show similar improvements on other applications?
6
6/23 Coding Guidelines
- Analyzed dozens of benchmarks
- Identified common problems related to synthesis
- Developed 10 guidelines to fix the problems
  - Although some are well known, analysis shows they are rarely applied
  - Automation is unlikely or impossible in many cases
The 10 guidelines:
- Conversion to Constants (CC)
- Conversion to Fixed Point (CF)
- Conversion to Explicit Data Flow (CEDF)
- Conversion to Explicit Memory Accesses (CEMA)
- Function Specialization (FS)
- Constant Input Enumeration (CIE)
- Loop Rerolling (LR)
- Conversion to Explicit Control Flow (CECF)
- Algorithmic Specialization (AS)
- Pass-By-Value Return (PVR)
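The slides name Conversion to Fixed Point (CF) but give no example for it. The following is a minimal sketch of what such a refinement might look like, assuming a Q16.16 format; the type fix16, the helper names, and the smoothing filter itself are hypothetical and not taken from the presentation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Q16.16 format: 16 integer bits, 16 fractional bits. */
typedef int32_t fix16;
#define FIX_SHIFT 16
#define FIX_ONE   ((fix16)1 << FIX_SHIFT)

/* Convert a small integer to fixed point. */
fix16 fix_from_int(int v) {
    assert(v > -32768 && v < 32768);  /* must fit in the 16 integer bits */
    return (fix16)(v * FIX_ONE);
}

/* Multiply two Q16.16 values; widen to 64 bits so the product cannot
 * overflow. Right-shifting a negative value is implementation-defined in C;
 * common compilers use an arithmetic shift. */
fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * (int64_t)b) >> FIX_SHIFT);
}

/* Floating-point original: y = 0.75 * x + 0.25 * prev */
float smooth_float(float x, float prev) {
    return 0.75f * x + 0.25f * prev;
}

/* Fixed-point refinement of the same computation: integer multiplies and
 * shifts only, which are typically cheaper to synthesize than float units. */
fix16 smooth_fixed(fix16 x, fix16 prev) {
    const fix16 k_x    = (3 * FIX_ONE) / 4;  /* 0.75 */
    const fix16 k_prev = FIX_ONE / 4;        /* 0.25 */
    return fix_mul(k_x, x) + fix_mul(k_prev, prev);
}
```

The fractional width would in practice be chosen from the precision the application needs; 16 bits is only an illustrative choice.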
7
7/23 Fast Refinement
- Several dozen lines of code provide most of the performance improvement
- Refining takes minutes to hours
- Only a few regions are performance critical
- Apply guidelines to only the critical regions
[Figure: profiling results for a sample application with functions Idct(), Memset(), FIR(), Sort(), Search(), ReadInput(), WriteOutput(), Matrix(), Brev(), Compress(), Quantize(), ...]
8
8/23 Conversion to Constants (CC)
Problem: arrays of constants are commonly not specified as constants
- Initialized at runtime
Guideline: use a constant wrapper function
- Specifies the array as constant for all functions called through the wrapper
- Can also enable constant folding
- Array can't change, so prefetching won't violate dependencies
Automation: difficult, requires global def-use/alias analysis

Original code:
    int coef[100];
    void initCoef() {
        // initialize coef
    }
    void fir() {
        // fir filter using coef
    }
    void f() {
        initCoef();
        // other code
        fir();
    }

Refined code:
    int coef[100];
    void initCoef() {
        // initialize coef
    }
    void fir(const int array[100]) {
        // fir filter using const array
    }
    void constWrapper(const int array[100]) {
        prefetchArray(array);
        // misc code...
        fir(array);
    }
    void f() {
        initCoef();
        constWrapper(coef);
    }
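A compilable version of the constant-wrapper idea could look like the sketch below. The filter length TAPS, the coefficient values, and the input parameter x are assumptions added so the example runs; the slide elides those details. Note that in standard C a const parameter only forbids writes through that name, so treating the array as truly constant is a property the synthesis tool is assumed to exploit.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 4  /* hypothetical filter length; the slide uses 100 */

int coef[TAPS];  /* filled once at run time, never written afterwards */

void initCoef(void) {
    for (int i = 0; i < TAPS; i++)
        coef[i] = i + 1;  /* placeholder coefficient values */
}

/* The const qualifier tells every callee below the wrapper that the
 * coefficients are read-only within this call tree. */
int fir(const int c[TAPS], const int x[TAPS]) {
    int acc = 0;
    for (int i = 0; i < TAPS; i++)
        acc += c[i] * x[i];
    return acc;
}

/* Wrapper: the single point where the run-time array becomes "constant". */
int constWrapper(const int c[TAPS], const int x[TAPS]) {
    assert(c != NULL && x != NULL);
    return fir(c, x);
}

int f(const int x[TAPS]) {
    initCoef();                    /* run-time initialization happens first */
    return constWrapper(coef, x);  /* from here on coef is only read */
}
```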
9
9/23 Conversion to Explicit Data Flow (CEDF)
Problem: global variables make determination of parallelism difficult
- Requires global def-use/alias analysis
Guideline: replace globals with extra parameters
- Makes data flow explicit
- Simpler analysis may expose parallelism
Automation: has been proposed [Lee01], but is difficult because of aliases

Original code (a(), b(), c() must execute sequentially because of global array dependencies):
    int array[100];
    void a() {
        for (i=0; i < 100; i++) array[i] = .....
    }
    void b() {
        for (i=0; i < 100; i++) array[i] = array[i]+f(i);
    }
    int c() {
        for (i=0; i < 100; i++) temp += array[i];
    }
    void d() {
        for (.....) {
            a(); b(); c();
        }
    }

Refined code (a() and c() can execute in parallel after the 1st iteration):
    void a(int array[100]) {
        for (i=0; i < 100; i++) array[i] = .....
    }
    void b(int array1[100], int array2[100]) {
        for (i=0; i < 100; i++) array2[i] = array1[i]+f(i);
    }
    int c(int array[100]) {
        for (i=0; i < 100; i++) temp += array[i];
    }
    void d() {
        int array1[100], array2[100];
        for (.....) {
            a(array1); b(array1, array2); c(array2);
        }
    }
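The refined CEDF code can be made runnable as in the following sketch. The body of f(i), the values written by a(), and the iteration count passed to d() are hypothetical fill-ins for the parts the slide elides; only the parameter-passing structure follows the slide.

```c
#include <assert.h>

#define N 100

/* Hypothetical stand-in for the slide's f(i). */
int f(int i) { return 2 * i; }

/* Producer: writes only its own output array. */
void a(int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = i;  /* placeholder for the elided initialization */
}

/* Reads one array and writes another, so the dependency a -> b -> c
 * is visible in the signatures rather than hidden in a global. */
void b(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] + f(i);
}

/* Consumer: reads only. */
int c(const int in[N]) {
    int temp = 0;
    for (int i = 0; i < N; i++)
        temp += in[i];
    return temp;
}

int d(int iterations) {
    assert(iterations >= 0);
    int array1[N], array2[N];
    int total = 0;
    for (int it = 0; it < iterations; it++) {
        a(array1);          /* touches array1 only */
        b(array1, array2);
        total += c(array2); /* touches array2 only, so it could overlap
                               with the next iteration's a(array1) */
    }
    return total;
}
```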
10
10/23 Constant Input Enumeration (CIE)
Problem: function parameters may limit parallelism
- Bounds not known, hard to unroll
- One iteration at a time
Guideline: create an enum for the possible values
- Synthesis can create specialized functions
- Iterations can be parallelized in each version
Automation
- In some cases, def-use analysis may identify all inputs
- In general, difficult due to aliases

Original code:
    void f(int a, int b) {
        ....
        for (i=0; i < a; i++) {
            for (j=0; j < b; j++) {
                c[i][j]=i+j;
            }
        }
    }

Refined code:
    enum PRM { VAL1=2, VAL2=4 };
    void f(enum PRM a, enum PRM b) {
        ....
        for (i=0; i < a; i++) {
            for (j=0; j < b; j++) {
                c[i][j]=i+j;
            }
        }
    }

Specialized versions: f(2,2), f(2,4), f(4,2), f(4,4)
[Figure: original hardware computes one c[i][j] = i + j per iteration; each specialized version computes c[0][0] = 0 + 0, c[0][1] = 0 + 1, c[0][2] = 0 + 2, ... in parallel]
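To make the specialization concrete, the sketch below writes out by hand the four versions that a synthesis tool might derive from the enum. The macro DEFINE_FILL, the names f_2_2 through f_4_4, and the dispatch inside f() are illustrative assumptions; the slides only state that synthesis can create specialized functions.

```c
#include <assert.h>

enum PRM { VAL1 = 2, VAL2 = 4 };

#define MAX_DIM 4
int c[MAX_DIM][MAX_DIM];

/* Generates one specialized version with compile-time loop bounds, so
 * every iteration is known statically and can be fully unrolled. */
#define DEFINE_FILL(NAME, A, B)             \
    void NAME(void) {                       \
        for (int i = 0; i < (A); i++)       \
            for (int j = 0; j < (B); j++)   \
                c[i][j] = i + j;            \
    }

DEFINE_FILL(f_2_2, 2, 2)
DEFINE_FILL(f_2_4, 2, 4)
DEFINE_FILL(f_4_2, 4, 2)
DEFINE_FILL(f_4_4, 4, 4)

/* The enum restricts a and b to {2, 4}, so four versions cover all calls. */
void f(enum PRM a, enum PRM b) {
    assert((a == VAL1 || a == VAL2) && (b == VAL1 || b == VAL2));
    if (a == VAL1 && b == VAL1)      f_2_2();
    else if (a == VAL1 && b == VAL2) f_2_4();
    else if (a == VAL2 && b == VAL1) f_4_2();
    else                             f_4_4();
}
```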
11
11/23 Conversion to Explicit Control Flow (CECF)
Problem: function pointers may prevent static control flow analysis
- Synthesis is unlikely to determine the possible targets of a function pointer
Guideline: replace the function pointer with if-else and static calls
- Makes possible targets explicit
Automation: impossible in general
- Equivalent to the halting problem

Original code:
    void f( int (*fp) (int) ) {
        .....
        for (i=0; i < 10; i++) {
            a[i] = fp(i);
        }
    }

Refined code:
    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
        .....
        for (i=0; i < 10; i++) {
            if (fp == FUNC1) a[i] = f1(i);
            else if (fp == FUNC2) a[i] = f2(i);
            else a[i] = f3(i);
        }
    }

[Figure: with a function pointer, the synthesized hardware feeding a[i] is unknown; with explicit control flow, f1(i), f2(i), and f3(i) feed a 3x1 multiplexer selected by fp]
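A runnable form of the refined CECF code is sketched below. The bodies of f1, f2, and f3 are hypothetical placeholders, since the slide does not define them; the enum and the if-else chain follow the slide.

```c
#include <assert.h>

enum Target { FUNC1, FUNC2, FUNC3 };

/* Placeholder targets; the slide leaves their bodies unspecified. */
int f1(int i) { return i + 1; }
int f2(int i) { return 2 * i; }
int f3(int i) { return i * i; }

int a[10];

/* Every possible callee is a static call, so the set of targets is
 * visible without any pointer analysis. */
void f(enum Target fp) {
    assert(fp == FUNC1 || fp == FUNC2 || fp == FUNC3);
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1)      a[i] = f1(i);
        else if (fp == FUNC2) a[i] = f2(i);
        else                  a[i] = f3(i);
    }
}
```

A caller that previously passed f2 as a pointer would now pass FUNC2.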
12
12/23 Algorithmic Specialization (AS)
Problem: algorithms targeting sw may not be fast in hw
- Sequential vs. parallel
- C code generally uses sw algorithms
Guideline: specialize critical functions with hw algorithms
Automation
- Requires higher level specification
- Intrinsics

Original code (binary search):
    int search(int a[], int k, int l, int r) {
        while (l <= r) {
            mid = (l+r)/2;
            if (k > a[mid]) l = mid+1;
            else if (k < a[mid]) r = mid-1;
            else return mid;
        }
        return -1;
    }

Refined code (linear search, can be parallelized in hardware):
    int search(int a[], int k, const int s) {
        for (i=0; i < s; i++) {
            if (a[i] == k) return i;
        }
        return -1;
    }
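The two search variants can be compared with the sketch below. The names search_sw and search_hw, the fixed size S, and the choice to drop the early exit in the linear version are assumptions made here: without an early return, each comparison is independent of the others, which is one way the loop could map onto parallel comparators.

```c
#include <assert.h>
#include <stddef.h>

#define S 8  /* hypothetical fixed array size */

/* Software-oriented binary search: each step depends on the previous
 * comparison, so it is inherently sequential. Requires a sorted array. */
int search_sw(const int a[], int k, int l, int r) {
    while (l <= r) {
        int mid = l + (r - l) / 2;
        if (k > a[mid])      l = mid + 1;
        else if (k < a[mid]) r = mid - 1;
        else                 return mid;
    }
    return -1;
}

/* Hardware-oriented linear search with a constant bound: all S
 * comparisons are independent. Scanning from the top means the lowest
 * matching index is written last and therefore wins. */
int search_hw(const int a[S], int k) {
    assert(a != NULL);
    int found = -1;
    for (int i = S - 1; i >= 0; i--) {
        if (a[i] == k)
            found = i;
    }
    return found;
}
```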
13
13/23 Pass-By-Value Return (PVR)
Problem: array parameters cannot be prefetched due to potential aliases
- Designer may know aliases don't exist
Guideline: use pass-by-value-return
Automation: requires global alias analysis

Original code (can't prefetch array for g(), may be aliased):
    void f(int *a, int *b, int array[16]) {
        … // unrelated computation
        g(array);
        … // unrelated computation
    }
    int g(int array[16]) {
        // computation done on array
    }

Refined code (local array can't be aliased, can prefetch):
    void f(int *a, int *b, int array[16]) {
        int localArray[16];
        memcpy(localArray, array, 16*sizeof(int));
        … // misc computation
        g(localArray);
        … // misc computation
        memcpy(array, localArray, 16*sizeof(int));
    }
    int g(int array[16]) {
        // computation done on array
    }
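The pass-by-value-return pattern is sketched below in runnable form. The body of g() and the "unrelated" updates through a and b are hypothetical fill-ins for the parts the slide elides, and f() returns a value here only so the example can be checked. The refinement preserves behavior only if a and b really do not point into array, which is the designer knowledge the slide refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder computation: doubles each element and returns the sum. */
int g(int array[16]) {
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        array[i] *= 2;
        sum += array[i];
    }
    return sum;
}

int f(int *a, int *b, int array[16]) {
    assert(a != NULL && b != NULL && array != NULL);

    int localArray[16];
    /* Pass by value: copy in. The local copy cannot be reached through
     * a or b, so it can be prefetched safely. */
    memcpy(localArray, array, 16 * sizeof(int));

    *a += 1;                  /* stand-in for unrelated computation */
    int r = g(localArray);
    *b += r;                  /* stand-in for unrelated computation */

    /* Return: copy the results back to the caller's array. */
    memcpy(array, localArray, 16 * sizeof(int));
    return r;
}
```

The two memcpy calls are the software overhead that the methodology weighs before applying this guideline.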
14
14/23 Why Synthesis From C?
Why not use HDL?
- HDL may yield better results
- C is a mainstream language
  - Acceptable performance in many cases
  - Learning HDL is a large overhead
Approaches are orthogonal
- This work focuses on improving the mainstream approach
- Guidelines are common for HDL
- Can also be applied to algorithmic HDL
15
15/23 Software Overhead
Refined regions may not be partitioned to hardware
- Partitioner may select non-refined regions
- OS may select software or hardware implementation, based on state of FPGA
Coding guidelines have potential software overhead
Problem: refined code mapped to software
[Figure: motionComp'(), filterLuma(), filterChroma(), deblocking'(), ... -> Hw/Sw Partitioning -> uP and FPGA]
16
16/23 Refinement Methodology
Considerations
- Reduce software overhead
- Reduce refinement time
Methodology
- Profile, iterative improvement
- Determine critical region
- Apply all guidelines except PVR/AS (minimal overhead)
- Apply PVR if overhead acceptable
- Apply AS if known algorithm and overhead acceptable
Flow:
1. Profile
2. Determine critical region
3. Apply CC, CF, CEMA, CIE, CEDF, CECF, FS, LR
4. Is the overhead of copying the array acceptable? If yes, apply PVR
5. Does a suitable hw algorithm exist and have acceptable sw performance? If yes, apply AS
6. Repeat until performance is acceptable
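The decision part of the methodology, applied to one critical region, can be summarized as a small function. This is only an illustrative sketch: the flag names, the bitmask encoding, and selectGuidelines() itself are hypothetical, and the refinement in the slides is carried out manually by the designer rather than by code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bitmask of guideline groups. */
enum {
    G_BASE = 1,  /* CC, CF, CEMA, CIE, CEDF, CECF, FS, LR (minimal overhead) */
    G_PVR  = 2,  /* Pass-By-Value Return */
    G_AS   = 4   /* Algorithmic Specialization */
};

/* Decide which guideline groups to apply to one critical region. */
unsigned selectGuidelines(bool copyOverheadAcceptable,
                          bool hwAlgorithmExists,
                          bool swPerformanceAcceptable) {
    unsigned set = G_BASE;  /* always applied */

    if (copyOverheadAcceptable)
        set |= G_PVR;       /* array copies add software overhead */

    if (hwAlgorithmExists && swPerformanceAcceptable)
        set |= G_AS;        /* hw algorithm may run slowly in software */

    assert((set & G_BASE) != 0);  /* base group is never skipped */
    return set;
}
```

In the full flow this decision would be repeated, after re-profiling, until performance is acceptable.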
17
17/23 Experimental Setup
Benchmark suite: MediaBench, Powerstone
- Manually applied guidelines
  - 1-2 hours
  - 23 additional lines/benchmark, on average
Target architecture: Xilinx Virtex II FPGA with ARM9 uP
Hardware/software partitioning: selects critical regions for hardware
Synthesis
- High-level synthesis tool
  - ~30,000 lines of C code
  - Outputs register-transfer level (RTL) VHDL
- RTL synthesis using Xilinx ISE
Compilation: gcc with -O1 optimizations
[Figure: Benchmarks -> Manual Refinement -> Refined Code -> Hw/Sw Partitioning -> Compilation (Sw, ARM9) and Synthesis (Hw, Bitfile, Virtex II)]
18
18/23 Speedups from Guidelines
- No guidelines: speedup 2x
- Conversion to constants: speedup 3.6x, total time 5 minutes
- Explicit dataflow + algorithmic specialization: speedup 16.4x, total time 15 minutes
19
19/23 Speedups from Guidelines
- No guidelines: speedup 8.6x
- Conversion to constants: speedup 14.4x, total time 10 minutes
- Input enumeration: speedup 16.7x, total time 20 minutes
- Algorithmic specialization: speedup 19x, total time 30 minutes, sw overhead 6000%
20
20/23 Speedups from Guidelines
Original code
- Speedups range from 1x (no speedup) to 573x
- Average: 2.6x (excludes brev)
Refined code with guidelines
- Average: 8.4x (excludes brev)
- 3.5x average improvement compared to original code
21
21/23 Speedups from Guidelines
Guidelines move speedups closer to ideal
- Almost identical for mpeg2, fir
Several examples are still far from ideal
- May imply new guidelines are needed
22
22/23 Guideline SW Overhead/Improvement
Average sw performance overhead: -15.7% (improvement)
- -1.1% excluding brev
- 3 examples improved
Average sw size overhead (lines of C code)
- 8.4% excluding brev
[Figure: per-benchmark sw overhead/improvement]
23
23/23 Summary
Simple coding guidelines significantly improve synthesis from C
- 3.5x speedup compared to hw/sw synthesized from unrefined code
- Major rewrites may not be necessary
  - Refinement takes 1-2 hours
Refinement methodology
- Reduces software size/performance overhead
  - In some cases, an improvement
Future work
- Test on commercial synthesis tools
- New guidelines for different domains