C is for Circuits: Capturing FPGA Circuits as Sequential Code for Portability
Scott Sirowy*, Greg Stitt‡, Frank Vahid*†
*Department of Computer Science and Engineering, University of California, Riverside, {ssirowy,vahid}@cs.ucr.edu
‡Department of Electrical and Computer Engineering, University of Florida, gstitt@ece.ufl.edu
†Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation and the Semiconductor Research Corporation.
“C is for Circuits” vs. High Level Synthesis
Designer captures application with temporal algorithm:
    /* partition() is not shown on the slide; it is assumed to be defined elsewhere. */
    void quicksort(int array[], int left, int right) {
        if (right > left) {
            int pivot = array[left];
            int newpivot = partition(array, left, right, pivot);
            quicksort(array, left, newpivot - 1);
            quicksort(array, newpivot + 1, right);
        }
    }
Synthesis -> ?
Designer captures spatial algorithm as custom circuit
[Figure: N unsorted -> Split -> 1 sorted -> Merge -> 2 sorted -> 4 sorted -> …]
(Click mouse for animation) It is well known that many sequential algorithms captured in C (or other sequential languages) can be synthesized to a wide range of circuits, varying in performance, efficiency, etc. Capturing algorithms as C code also provides tremendous portability, since the code can be run on a microprocessor or be partially or fully synthesized to whatever FPGA is available on a given platform. Yet we observed that designers still often conceptualize and capture algorithms at the circuit level, using very clever spatial algorithms (involving pipelining, clever memory organization, parallelism, etc.). Possibly because synthesis tools are still young, it is unclear whether synthesizing existing sequential algorithms can capture the cleverness and ingenuity of a circuit created manually by an expert circuit designer.
“C is for Circuits” vs. High Level Synthesis
Designer captures application with temporal algorithm:
    /* partition() is not shown on the slide; it is assumed to be defined elsewhere. */
    void quicksort(int array[], int left, int right) {
        if (right > left) {
            int pivot = array[left];
            int newpivot = partition(array, left, right, pivot);
            quicksort(array, left, newpivot - 1);
            quicksort(array, newpivot + 1, right);
        }
    }
Designer captures spatial algorithm as custom circuit
[Figure: N unsorted -> Split -> 1 sorted -> Merge -> 2 sorted -> 4 sorted -> …]
Capture in temporal language (slide pseudocode):
    Queue 1_1, 1_2, 2_1, 2_2, 4_s, 4_us;
    Split(16_u.dequeue, 16_u.dequeue, 1_1, 1_2);
    stage1 = Merge(1_1.dequeue, 1_2.dequeue);
    Split(16_u.dequeue, 16_u.dequeue);
    stage1 += Merge(1_1.dequeue, 1_2.dequeue);
    Split(stage1, 2_1, 2_2);
    stage2 = Merge(2_1, 2_2);
    Split(stage1);
    stage2 += Merge(2_1, 2_2);
    Split(stage2, 4_1, 4_2);
    …
Synthesis
One approach to this problem is to keep improving high level synthesis techniques so that they generate better and faster circuits. However, there may be something inherent in the spatial capture of a circuit that a high level synthesis tool might never be able to achieve. (Click mouse for animation) Instead, the “C is for Circuits” work takes the opposite approach: it tries to capture the circuit in some temporal form such that a standard synthesis tool might be able to synthesize the original, clever circuit. Capturing circuits in C, or any other sequential language, provides portability benefits not seen with current circuit distribution formats (segue into next slide).
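To make the contrast concrete, the sketch below shows one way the Split/Merge network could be written as plain C. It is a minimal illustration, not the code from the study: the function names (merge_runs, merge_stage, sort16_spatial) and the fixed size N = 16 are assumptions chosen to match the 16-element figure. Each stage has fixed loop bounds and its own output buffer, so the structure of the code follows the stages of the circuit rather than the recursion of quicksort.

```c
#include <assert.h>

#define N 16  /* hypothetical input size, matching the 16-element figure */

/* Merge two adjacent sorted runs of length `run` from src into dst. */
static void merge_runs(const int *src, int *dst, int run)
{
    int i = 0, j = run, k = 0;
    while (i < run && j < 2 * run)
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i < run)
        dst[k++] = src[i++];
    while (j < 2 * run)
        dst[k++] = src[j++];
}

/* One "Merge" stage: every pair of runs is independent of the others,
   so a synthesis tool is free to unroll this loop into parallel mergers. */
static void merge_stage(const int *in, int *out, int run)
{
    assert(run > 0 && N % (2 * run) == 0);
    for (int base = 0; base < N; base += 2 * run)
        merge_runs(in + base, out + base, run);
}

/* Stages are chained through separate buffers, mirroring the
   1-sorted -> 2-sorted -> 4-sorted -> 8-sorted -> 16-sorted datapath. */
void sort16_spatial(const int in[N], int out[N])
{
    int s2[N], s4[N], s8[N];
    merge_stage(in, s2, 1);
    merge_stage(s2, s4, 2);
    merge_stage(s4, s8, 4);
    merge_stage(s8, out, 8);
}
```

Calling sort16_spatial(in, out) on a 16-element array leaves the sorted result in out. Whether a given tool would actually map each stage to its own hardware depends on the tool; the point of the sketch is only that the stage structure is visible in the source.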
Goal: Portable Circuit Distribution Format
Current circuit distribution method: bitstreams
  Tightly coupled to a specific device
[Figure: application conceptualized and captured as a circuit (16 unsorted -> Split -> 1 sorted -> 2 sorted -> 4 sorted -> 8 sorted -> Merge -> 16 sorted), distributed as a bitstream of ones and zeroes to an FPGA + Proc. platform; a second platform (FPGA with +, *, MEM blocks + Proc.) is marked “?”]
How are circuits currently distributed? (Click on mouse for animation) One means is the bitstream, a series of ones and zeroes used to program the lookup tables and other components within an FPGA. However, if our goal is a portable distribution format, the bitstream would be one of our last choices. A bitstream is very tightly coupled to a specific FPGA, meaning a distributor would have to maintain potentially hundreds of bitstreams to accommodate the growing number of FPGA platforms becoming available.
Goal: Portable Circuit Distribution Format
Current circuit distribution method: RTL
  Good across multiple FPGA devices
  But requires resynthesis/mapping
  May not use FPGA resources most effectively
    Loop unrolling, memory mapping, hard-core use, …
[Figure: application conceptualized and captured as a circuit (16 unsorted -> Split -> 1 sorted -> 2 sorted -> 4 sorted -> 8 sorted -> Merge -> 16 sorted), distributed as VHDL (Entity Circuit port( … ); Architecture of… Begin End arch;) to an FPGA + Proc. platform and to an FPGA with +, *, MEM blocks + Proc.]
Another means of circuit distribution is register transfer level (RTL) code. (Click slide for animation) RTL is a good format for use across multiple FPGA devices, at the cost of having to be resynthesized and mapped for each platform. Another disadvantage is that RTL may not make effective use of the available FPGA resources. Because RTL is written at a fairly low level, a synthesis tool may not be able to apply the memory mapping techniques, loop unrolling, and hard-core component use that the same tool could apply to a description at a higher level of abstraction.
Goal: Portable Circuit Distribution Format
Higher abstraction: C code (or any sequential language)
  Can yield more effective resource usage
  Could even run on platforms with no FPGA
  But also requires resynthesis/mapping
[Figure: application conceptualized and captured as a circuit (16 unsorted -> Split -> 1 sorted -> 2 sorted -> 4 sorted -> 8 sorted -> Merge -> 16 sorted), distributed as C code (#include <foo.h> int main(){ float pi = 3.141; while(1){ … } }) to an FPGA + Proc. platform, an FPGA with +, *, MEM blocks + Proc., and a processor alone]
Using C code as that higher level of abstraction gives a synthesis tool the opportunity to take advantage of the FPGA’s available resources, which may include on-chip memory, hard-core components, etc. Plus, distributing circuits as C code allows those circuits to run on platforms with no FPGA, making C much more portable than both the bitstream and the RTL methods. As with RTL, circuits distributed as C would also require synthesis and mapping for each platform, which could possibly be accomplished with on-chip tools… (subject of another talk).
Problem: Many FPGA Applications Captured “Spatially” as Circuits, not C
Circuits in FCCM | Year
3D Vector Normalization | 2001
Regular Expression | 2001
RC4 | 2002
Gaussian Noise Gen. | 2003
Molecular Dynamics | 2004
Particle Graphics | 2005
… |
Shortest Path | 2006
70 custom circuits in FCCM’01-’06 alone
Designer captures spatial algorithm as custom circuit for max performance
[Figure: N unsorted -> Split -> 1 sorted -> Merge -> 2 sorted -> 4 sorted -> …]
But, as I pointed out in the first slide, we’ve observed that designers are still capturing FPGA applications spatially as circuits, and not as C. Over the past six years, we identified 70 applications that were designed, captured as circuits, and published in the conference FCCM, in domains ranging from cryptography and math to data mining. Again, this could be because FPGA tools are still maturing, but it could also be because the spatial and temporal models are inherently different.
Capturing Circuit Level Designs in C
Can designers’ circuits be reverse-engineered to some form of C code?
  From which original circuit will be synthesized by “standard” synthesis tools
Slide pseudocode:
    Queue 1_1, 1_2, 2_1, 2_2, 4_s, 4_us;
    Split(16_u.dequeue, 16_u.dequeue, 1_1, 1_2);
    stage1 = Merge(1_1.dequeue, 1_2.dequeue);
    Split(16_u.dequeue, 16_u.dequeue);
    stage1 += Merge(1_1.dequeue, 1_2.dequeue);
    Split(stage1, 2_1, 2_2);
    stage2 = Merge(2_1, 2_2);
    Split(stage1);
    stage2 += Merge(2_1, 2_2);
    Split(stage2, 4_1, 4_2);
    …
Designer captures spatial algorithm as custom circuit for max performance
[Figure: N unsorted -> Split -> 1 sorted -> Merge -> 2 sorted -> 4 sorted -> …, connected to the pseudocode by a “Synthesis” arrow]
This leads us to the main question of the “C is for Circuits” study: Can these manually created, spatially oriented circuits be captured in “some” form of C code? And more importantly, can a standard synthesis tool recapture the original circuit from that sequential description?
Previous Work
Convert existing sequential algorithms to circuits
  Diniz, Eles, Frigo, Henkel, Najjar, Srinivasan, Stitt, etc.
Coding guidelines for synthesis
  Stitt, CODES/ISSS 2006
Reverse engineering techniques
  Doom, Hanson et al.
Languages that encapsulate spatial and temporal concepts
  SystemC, StreamsC, etc.
A number of works relate to our study, including extensive work in high level synthesis and partitioning by Diniz, Eles, Najjar, and others. In 2006, Stitt proposed several C coding guidelines that would aid synthesis in creating better circuits. Doom and Hanson developed reverse engineering techniques that concentrated on low level behavior models. And there have even been languages, including SystemC and StreamsC, that have tried to encapsulate both spatial and temporal concepts. None of these, however, has addressed the question of whether existing circuits can be captured in C such that the original circuit can be synthesized using known high level synthesis techniques.
Study Methodology
Chose pseudo-random subset of all applicable FPGA circuit designs from past six years of FCCM (Field Programmable Custom Computing Machines)
Attempted to capture circuit with high level C such that a “standard” synthesis tool would output the original circuit
To examine whether circuits could be captured in C, we chose a pseudo-random subset of all the applicable FPGA circuit designs from the past six years of FCCM, a forum for cleverly designed FPGA applications. We sorted all the designs by year and then alphabetically, and chose every other one. This was an effort to avoid biasing our choice toward circuits more amenable to C capture. We identified 70 applicable circuit designs, of which we picked 35 that we tried to capture in high level C such that a standard synthesis tool would output the original circuit. “Standard” is in quotes deliberately, and I will explain why in the next slide.
Study Methodology
1. Original circuit
2. Capture circuit in C code? (int main(){ float pi = 3.14; …; })
3. “Standard” HLS tool -> ?
  Manually performed
  Optimizations applied in same order for every application
Each circuit either
  Re-derivable from C
    Temporal C (the “natural” algorithm)
    Spatial C (reflecting the circuit)
  Not re-derivable from C
    Might still be possible
“Standard” Synthesis: CDFG creation, CDFG analysis, Optimizations/Scheduling, Resource Allocation, VHDL Creation
Optimizations: 1. Function Inlining 2. Loop Unrolling 3. Predication 4. Constant Propagation 5. Dead Code Elimination 6. Code Hoisting 7. Pipeline Analysis
Pictorially, here is our study methodology. We attempted to capture each circuit as C code, which we then ran through a standard synthesis process. The standard synthesis flow was performed manually, partly because current FPGA synthesis tools differ widely in features and functionality, partly because they are still maturing, and partly because we wanted to identify the specific optimizations that would work well on a variety of circuits captured in C. We used the following seven optimizations (name them), which we applied in the same order for every application. We then visually compared the output of our standard synthesis flow with the original circuit, looking for similar architectural features. We classified each circuit as either “re-derivable from C” or “not re-derivable from C”. We further classified the “re-derivable from C” circuits as derivable from temporal C, which might be the most natural way to capture the algorithm sequentially, or from spatial C, which more closely reflects the circuit architecture.
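As a small illustration of three of the listed optimizations (function inlining, loop unrolling, and predication), the sketch below shows a temporal C function next to the flattened form such a flow might reduce it to. The example is hypothetical and not taken from the study: the names clamp_nonneg, sum_clamped, sum_clamped_flat and the trip count TAPS are made up for this sketch.

```c
#include <assert.h>

#define TAPS 4  /* hypothetical trip count; fixed so the loop can be fully unrolled */

/* Helper that function inlining would fold into its caller. */
static int clamp_nonneg(int v)
{
    if (v < 0)
        return 0;
    return v;
}

/* Temporal form: a loop, a call, and a branch, as a programmer would write it. */
int sum_clamped(const int x[TAPS])
{
    int sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += clamp_nonneg(x[i]);
    return sum;
}

/* The same computation after inlining, full unrolling, and predication:
   four independent select operations feeding a two-level adder tree. */
int sum_clamped_flat(const int x[TAPS])
{
    assert(TAPS == 4);  /* the unrolled body below is written for four taps */
    int t0 = (x[0] < 0) ? 0 : x[0];
    int t1 = (x[1] < 0) ? 0 : x[1];
    int t2 = (x[2] < 0) ? 0 : x[2];
    int t3 = (x[3] < 0) ? 0 : x[3];
    return (t0 + t1) + (t2 + t3);
}
```

Both functions return the same value for every input. The flattened version has no control flow left, which is what lets it map directly onto comparators, multiplexers, and adders.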
Gaussian Noise Generator
FCCM 2003, Lee et al.
1. Original circuit
[Figure: four-stage pipeline. Stage1: Linear Feedback Shift Registers produce u1, u2. Stage2: f(u1), g1(u2), g2(u2) feed multipliers (*) producing x1, x2. Stage3: adders (+). Stage4: output.]
2. Capture circuit in C code? (int main(){ … })
[Sidebar: “Standard” synthesis steps and the seven optimizations, as on the Study Methodology slide]
I’ll go over one example from FCCM that highlights our study methodology. In 2003, Lee and his colleagues developed a circuit that outputs noise following a Gaussian curve. They created a four-stage pipeline that generates a noise point every cycle.
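For readers unfamiliar with the first pipeline stage, a linear feedback shift register can be sketched in a few lines of C. The slide does not give the register width or tap positions used by Lee et al., so the 16-bit width and the taps at bits 16, 14, 13, and 11 below are assumptions chosen only because that configuration is a well-known maximal-length example; the function names are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, 11
   (assumed for illustration; not the configuration from the original circuit). */
uint16_t lfsr16_step(uint16_t state)
{
    assert(state != 0);  /* the all-zero state would never change */
    uint16_t bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1u;
    return (uint16_t)((state >> 1) | (bit << 15));
}

/* Map a nonzero 16-bit state to a float strictly between 0 and 1,
   so that a later log(u) stays finite. */
float lfsr16_to_unit(uint16_t state)
{
    return ((float)state + 0.5f) / 65536.0f;
}
```

In hardware, one step is a shift register plus three XOR gates, which is why a circuit designer reaches for an LFSR where a C programmer would call rand(). Starting from any nonzero seed, this configuration visits all 65535 nonzero states before repeating.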
Gaussian Noise Generator
FCCM 2003, Lee et al.
    #include <math.h>
    #include <stdlib.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* The types, NUM_SAMPLES, and noise[] are not shown on the slide;
       these definitions are assumed so that the listing compiles. */
    #define NUM_SAMPLES 1024
    typedef struct { float u1, u2; } Stage1;
    typedef struct { float x1, x2; } Stage2;
    typedef struct { float x1, x2; } Stage3;
    static float noise[NUM_SAMPLES];

    /* Uniform sample between 0 and 1; stands in for the LFSR. */
    static inline float rand0_1(void) { return rand() / ((float) RAND_MAX + 1); }

    /* Stage 1: draw two uniform samples. */
    static inline Stage1 doStage1(void) {
        Stage1 result;
        result.u1 = rand0_1();
        result.u2 = rand0_1();
        return result;
    }

    /* Stage 2: evaluate f(u1), g1(u2), g2(u2) and multiply. */
    static inline Stage2 doStage2(float u1, float u2) {
        Stage2 result;
        float f_u1, g1_u2, g2_u2;
        f_u1 = sqrt(-log(u1));
        g1_u2 = sin(2 * M_PI * u2);
        g2_u2 = cos(2 * M_PI * u2);
        result.x1 = f_u1 * g1_u2;
        result.x2 = f_u1 * g2_u2;
        return result;
    }

    /* Stage 3: add each sample to the previous one held in acc1/acc2. */
    static inline Stage3 doStage3(float x1, float x2) {
        static float acc1 = 0.0, acc2 = 0.0;
        Stage3 result;
        result.x1 = acc1 + x1;
        result.x2 = acc2 + x2;
        acc1 = x1;
        acc2 = x2;
        return result;
    }

    /* Stage 4: write the two samples to the output buffer. */
    static inline void doStage4(int i, int j, float x1, float x2) {
        noise[i] = x1;
        noise[j] = x2;
    }

    int main() {
        Stage1 stage1;
        Stage2 stage2;
        Stage3 stage3;
        unsigned int i = 0;
        while (1) {  /* free-running, like the circuit */
            stage1 = doStage1();
            stage2 = doStage2(stage1.u1, stage1.u2);
            stage3 = doStage3(stage2.x1, stage2.x2);
            doStage4(i, (i + 1) % NUM_SAMPLES, stage3.x1, stage3.x2);
            i = (i + 2) % NUM_SAMPLES;
        }
        return 1;
    }
[Figure: original four-stage circuit, as on the previous slide. Sidebar: “Standard” synthesis steps and the seven optimizations.]
The first major step in our methodology (the second in the list) was to try to capture the circuit as C code. The circuit was captured straightforwardly by modeling each stage of the pipeline as a function, and then creating dependencies in the main function to establish the order of the pipeline.
Gaussian Noise Generator
FCCM 2003, Lee et al.
CDFG Creation/Analysis and Scheduling/Resource Allocation
[Figure: for each function, the control/data flow graph and the corresponding scheduled, resource-allocated graph. doStage1(): rand() producing u1, u2 becomes an LFSR producing u1, u2. doStage2(): u1, u2 feed f(u1), g1(u2), g2(u2) and multipliers (*) producing x1, x2. doStage3(): x1, x2 and acc1, acc2 feed adders (+). doStage4(): x1, x2 written to noise[i], noise[j]; allocated as a select (sel) on x1, x2 into noise[]. Sidebar: “Standard” synthesis steps and the seven optimizations.]
The next step was to run the C code through our defined standard synthesis process. The control and data flow graphs are shown on the screen, along with the scheduling and resource allocation graphs after we performed those optimizations.
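One way to picture what scheduling and pipeline analysis do to the stage functions is a cycle-level model in C, where a register sits between consecutive stages and every stage fires once per clock. The sketch below is an assumption-laden stand-in, not the study's flow: it uses integer arithmetic (multiply by 3, then add the previous value) in place of the floating-point stages, and the names Pipe, pipe_clock, seq_step, and PIPE_LATENCY are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define PIPE_LATENCY 3  /* cycles from input to matching output in this model */

typedef struct {
    int r1, r2, r3;  /* pipeline registers between stages */
    int acc;         /* stage-3 state, playing the role of acc1/acc2 */
} Pipe;

/* Sequential reference: what the un-pipelined C loop body computes. */
int seq_step(int *acc, int in)
{
    int y = 3 * in;      /* stand-in for stage 2 */
    int z = *acc + y;    /* stand-in for stage 3 */
    *acc = y;
    return z;
}

/* One clock cycle of the pipelined form. Stages are evaluated last to
   first so that each one reads the value its predecessor latched on the
   previous cycle, as registers in hardware would provide. */
int pipe_clock(Pipe *p, int in)
{
    assert(p != NULL);
    int out = p->r3;          /* stage 4: emit the latched result */
    p->r3 = p->acc + p->r2;   /* stage 3: add previous sample */
    p->acc = p->r2;
    p->r2 = 3 * p->r1;        /* stage 2: arithmetic on latched input */
    p->r1 = in;               /* stage 1: latch new input */
    return out;
}
```

With all registers initialized to zero, the value returned by pipe_clock on cycle t + PIPE_LATENCY equals what seq_step returns for the input supplied on cycle t. The pipelined form accepts a new input every cycle, which corresponds to the one-noise-point-per-cycle behavior described for the original circuit.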
Gaussian Noise Generator
FCCM 2003, Lee et al.
CDFG Creation/Analysis, Scheduling/Resource Allocation, and Circuit from “Standard” Synthesis
[Figure: the per-function graphs for doStage1() through doStage4(), as on the previous slide, combined into the control graph of main(): LFSR -> f(u1), g1(u2), g2(u2) -> * -> + with acc1, acc2 -> sel, spanning doStage1(), doStage2(), doStage3(), doStage4().]
The final synthesis step is to combine all the graphs shown on the right into a final circuit, which is shown by the control graph for the main function. VHDL creation is not shown.
Gaussian Noise Generator
FCCM 2003, Lee et al.
Circuit from “Standard” Synthesis vs. Original Circuit
[Figure: synthesized circuit (LFSR -> f(u1), g1(u2), g2(u2) -> * -> + with acc1, acc2 -> sel, across doStage1() through doStage4() in main()) beside the original circuit (Linear Feedback Shift Registers -> u1, u2 -> f(u1), g1(u2), g2(u2) -> * -> x1, x2 -> +, across Stage1 through Stage4).]
If (nearly) the same: “Re-derivable from C”
Finally, we compare the original circuit to the circuit created by our standard synthesis process. A quick inspection shows the two are nearly identical; the only difference is that our standard synthesis process added registers after each math function call. Interestingly, the additional registers may actually improve the circuit by raising the achievable clock frequency. Overall, capturing the Gaussian Noise Generator as C code and running that code through our standard synthesis process produced essentially the same circuit. Thus, we categorized this circuit as “re-derivable from spatial C”, since our code reflected the structure of the circuit and not necessarily the most natural way of writing the algorithm.
Results
82% of the circuit designs were re-derivable from C
Year of Publication | Design | Re-derivable from C | Method/Reason
2001 | 3D Vec. Normalization | Yes | Spatial, if online algorithms can be specified
2001 | Efficient CAM | No | Uses dynamic FPGA routing
2001 | Automated Sensor | Yes | Temporal, floating point -> fixed point
2001 | Regular Expression | Yes | Spatial, creative connections of one-bit flip flops
2002 | Hyperspectral Image | Yes | Spatial, data reordering
2002 | Machine Vision | Yes | Spatial, custom pipelining
2002 | RC4 | Yes | Temporal, straightforward implementation
2002 | Set Covering | Yes | Spatial, data structures for easy hw implementation
2002 | Template Matching | Yes | Spatial, heavy modifications to original algorithm
2002 | Triangle Mesh | Yes | Spatial, custom encoding scheme
2003 | Congruential Sieves | Yes | Temporal, straightforward translation
2003 | Content Scanning | Yes | Temporal
2003 | F.P and Square Root | Yes | Spatial
2003 | Gaussian Noise | Yes | Spatial, requires the use of spatial C constructs
2003 | TRNG | No | Requires sampling a high frequency clock for noise
2004 | 3D FDTD Method | Yes | Spatial
2004 | Deep Packet Filter | No | Requires knowledge of underlying FPGA
2004 | Online Floating Point | No | Online algorithm, variable length buffers
2004 | Molecular Dynamics | Yes | Spatial
2004 | Pattern Matching | Yes | Spatial
2004 | Seismic Migration | Yes | Spatial
2004 | Software Deceleration | No | Use a uP for its cache
2004 | V.M Window | No | Specific timing schemes implemented
2005 | Data Mining | Yes | Spatial
2005 | Cell Automata | Yes | Temporal
2005 | Particle Graphics | Yes | Spatial
2005 | Radiosity | Yes | Temporal
2005 | Transient Waves | Yes | Spatial
2005 | Road Traffic | Yes | Temporal
2006 | All Pairs Shortest Path | Yes | Spatial
2006 | Apriori Data Mining | Yes | Spatial
2006 | Molecular Dynamics | Yes | Spatial, define separate memories, custom pipeline
2006 | Gaussian Elimination | Yes | Spatial
2006 | Radiation Dose | Yes | Temporal
2006 | Random Variates | Yes | Spatial
We looked at 35 designs published over the last six years of the FCCM conference, out of a possible 70 we originally identified. We were surprised to find that 82% of the designs were re-derivable from some form of C. Of course, that leaves 18% that we couldn’t effectively capture in C, for reasons including the use of dynamic FPGA routing, the use of high frequency clock noise, and even very clever memory hierarchies, none of which is easily captured in C or handled by our standard synthesis process.
Results: Performance Comparison
[Chart: performance of custom vs. synthesized circuits, grouped into “Not re-derivable from C” and “Re-derivable from C”. Annotations on the not re-derivable group: “We couldn’t describe in C to re-derive same circuit” and “Used separate on-board memories”. Annotation on the re-derivable group: “Similar or identical performance”.]
Here we show performance comparisons for just a few of the original custom circuits against our circuits synthesized from C. Circuits that weren’t re-derivable from C failed to perform as well as their custom counterparts. But as expected, the circuits that were re-derivable showed performance similar or identical to that of their custom counterparts.
Results: Area Comparison
[Chart: area of custom vs. synthesized circuits. Annotation: “Extra area due to added multiplexors or registers, none of which significantly altered behavior of the circuit”.]
For the same examples, we compare the area of the published custom circuits to that of our circuits synthesized from C. While the measurements were very similar, some of our circuits had a slightly larger area due to added multiplexors and/or registers, none of which significantly altered the behavior of the circuit.
Conclusions
Designers continue to conceptualize/capture some FPGA applications “spatially” as circuits
  Despite increasing C-based synthesis tools
For 35 FCCM circuits studied, 82% were re-derivable from some form of C
Distributing a circuit using C code expands the range of target platforms and the longevity of an application
  Compared to a netlist or RTL distribution
Future work
  Using C as part of a standard binary for FPGA
Despite the increased presence of C-based synthesis tools, we observed that designers continue to conceptualize and capture some FPGA applications spatially as circuits. We looked at 35 of those circuits, published in FCCM, and found that we were able to capture 29 of them in some form of C such that the original circuit could be re-derived using a standard synthesis process. This is important because the ability to capture and distribute circuits as C code increases the usefulness of an application by expanding its range of target platforms, as well as increasing the longevity of the application. While not a perfect means of distributing FPGA applications, the use of C as part of a “standard binary for FPGAs” may prove useful, and is a subject of more analysis in future work.
Sponsors
This presentation brought to you by the letters N, S, F
And viewers like you…