Optimus: Efficient Realization of Streaming Applications on FPGAs
University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke
IBM Research: David Bacon, Rodric Rabbah
Introduction
- End of free ride from clock scaling
- Applications more demanding
- More applications on embedded platforms
- Evolution of new architectures
Why FPGAs?
- Customizable and reconfigurable
  - On-the-fly and in-the-field
  - Customizability yields performance and low power
- Many orders of magnitude more parallelism than existing multicores
  - Task-level parallelism
  - Pipeline parallelism
  - Bit-level parallelism
[Figure labels: Crypto, XML parser, Physics, GPU]
Liquid Metal Vision
- One unified language (Lime) for programming hardware (e.g., FPGAs) and heterogeneous architectures
- Liquid Metal VM: JIT the hardware!
[Figure: the LiquidMetal VM sits above GPU, Cell, (multicore) CPU, FPGA, and "???" targets. Caption: Program all with Lime]
Liquid Metal Tool Chain
[Figure: Streaming Languages -> Front-End Compiler -> Spatial IR, which feeds two back ends.
 FPGA path: Optimus Back-End Compiler (with FPGA Model) -> HDL -> Xilinx VHDL Compiler -> Xilinx bitfile -> Virtex5 FPGA (Streaming VM).
 Cell path: Crucible Back-End Compiler -> C -> Cell SDK -> Cell binary -> Cell BE (Streaming VM).]
Overview
- Spatial IR (SIR)
- Compilation Flow
- Scheduling
- Optimizations
- Results
Spatial Intermediate Representation
- Main constructs:
  - Filter: encapsulates computation.
  - Pipeline: expresses pipeline parallelism.
  - Splitjoin: expresses task-level parallelism.
  - Other constructs are not relevant here.
- Exposes different types of parallelism
  - Composable, hierarchical
- Some streaming languages can be easily lowered to SIR:
  - Lime, StreamIt
[Figure: pipeline, filter, and splitjoin constructs]
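To illustrate how the three constructs compose hierarchically, here is a minimal Python sketch. The class names, the `run` interface, and the example graph are hypothetical and chosen only for illustration; they are not the SIR's actual data structures. The sketch assumes every filter pops at least one item per firing and that all round-robin weights are positive.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Filter:
    """Leaf node: pops `pop` items and pushes the result of `work` per firing."""
    name: str
    pop: int
    work: Callable[[List[int]], List[int]]

    def run(self, items):
        out = []
        # Fire once for every complete group of `pop` input items.
        for k in range(0, len(items) - self.pop + 1, self.pop):
            out.extend(self.work(items[k:k + self.pop]))
        return out


@dataclass
class Pipeline:
    """Children run one after another: pipeline parallelism."""
    stages: list

    def run(self, items):
        for stage in self.stages:
            items = stage.run(items)
        return items


@dataclass
class SplitJoin:
    """Children run side by side between a round-robin splitter and joiner."""
    split_weights: List[int]
    branches: list
    join_weights: List[int]

    def run(self, items):
        # Round-robin split: hand each branch its weight's worth of items per round.
        queues = [[] for _ in self.branches]
        pos, round_size = 0, sum(self.split_weights)
        while pos + round_size <= len(items):
            for queue, weight in zip(queues, self.split_weights):
                queue.extend(items[pos:pos + weight])
                pos += weight
        outs = [branch.run(queue) for branch, queue in zip(self.branches, queues)]
        # Round-robin join: interleave branch outputs while every branch has enough.
        result, cursors = [], [0] * len(outs)
        while all(c + w <= len(o) for c, w, o in zip(cursors, self.join_weights, outs)):
            for idx, weight in enumerate(self.join_weights):
                result.extend(outs[idx][cursors[idx]:cursors[idx] + weight])
                cursors[idx] += weight
        return result


def adder(n):
    # Hypothetical filter: sums n consecutive inputs into one output.
    return Filter(f"add{n}", pop=n, work=lambda xs: [sum(xs)])


graph = Pipeline([
    SplitJoin([2, 2], [adder(2), adder(2)], [1, 1]),
    Filter("double", pop=1, work=lambda xs: [2 * xs[0]]),
])
result = graph.run([1, 2, 3, 4, 5, 6, 7, 8])
print(result)
```

Because `Pipeline` and `SplitJoin` accept any object with a `run` method, they can be nested to arbitrary depth, which is the composable, hierarchical property the slide refers to.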
Top Level Compilation
[Figure: a generic filter module with a Controller, Init and Work blocks, memories M0 ... Mn, inputs i0 ... ix, and outputs O0 ... Om. The example stream graph (Source, Round-Robin Splitter(8,8,8,8), parallel Filters, Round-Robin Joiner(1,1,1,1), Sink) is shown next to its compiled modules A through J, with the array a[ ] and index i.]
Filter Compilation
Example filter body, divided into basic blocks (bb1 to bb4 in the figure):
  sum = 0
  i = 0
  temp = pop( )
  sum = sum + temp
  i = i + 1
  Branch bb2 if i < 8
  push(sum)
[Figure: hardware template for a basic block. Labels: Basic Block, Register, mux, Control in, Control outs, Control Token, Ack, Live data ins, Live data outs, Memory/Queue ports, FIFO Read, FIFO Write]
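Operation Compilation
Example basic block:
  sum = sum + temp
  i = i + 1
  Branch bb2 if i < 8
[Figure: each operation maps to a functional unit (FU) with inputs i0 ... im, outputs o0 ... on, and a predicate. The example block is built from ADD and CMP units and registers for i, temp, and sum, with the constants 1 and 8, plus control-in and control-out signals.]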
Stream Scheduling
- Filters fire eagerly.
  - Blocking channel access.
  - Allows for potentially smaller channels.
- Results are produced with lower latency.
[Figure: Filter 1 (push 2) feeding Filter 2 (pop 3)]
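A small Python sketch can make the eager, blocking behavior concrete for the push 2 / pop 3 pair in the figure. The model is a deliberate simplification and an assumption of this example, not the Optimus hardware: one firing takes one step, both filters decide whether to fire from the channel state at the start of the step, and `simulate` is a hypothetical name.

```python
def simulate(push_rate, pop_rate, capacity, steps):
    """Eagerly fire one producer and one consumer over a bounded channel.

    Returns (producer_firings, consumer_firings, peak_occupancy).
    """
    occupancy = 0
    produced = consumed = peak = 0
    for _ in range(steps):
        # A filter fires only if its channel access would not block.
        can_push = occupancy + push_rate <= capacity
        can_pop = occupancy >= pop_rate
        if can_push:
            occupancy += push_rate
            produced += 1
            peak = max(peak, occupancy)
        if can_pop:
            occupancy -= pop_rate
            consumed += 1
    return produced, consumed, peak


# Filter 1 pushes 2 items per firing, Filter 2 pops 3 items per firing.
print(simulate(2, 3, capacity=4, steps=6))  # makes progress
print(simulate(2, 3, capacity=3, steps=6))  # both filters block forever
```

In this model a channel of 4 entries keeps both filters running, while 3 entries leaves the producer blocked on a full channel and the consumer blocked waiting for a third item, which is why the channel size has to be chosen with care.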
Optimizations
- Streaming optimizations (macro functional)
  - Channel allocation, channel access fusion, filter fission and fusion, etc.
  - Performing these optimizations requires global information about the stream graph.
  - Typically performed manually using existing tools
- Classic optimizations (micro functional)
  - Common subexpression elimination, constant folding, loop unrolling, etc.
  - Typically included in existing compilers and tools
Channel Allocation
- Larger channels:
  - More SRAM
  - More control logic
  - Fewer stalls
- Interlocking ensures that each filter either gets the right data or blocks.
- What is the right channel size?
Channel Allocation Algorithm
- Set the size of the channels to infinity.
- Warm up the queues.
- Record the steady-state instruction schedules for each producer-consumer pair.
- Unroll the schedules so that they have the same number of pushes and pops.
- Find the maximum number of overlapping lifetimes.
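The unrolling step can be illustrated with a short Python sketch. It assumes each filter performs a fixed number of pushes or pops per steady-state iteration; `unroll_factors` is a hypothetical helper written for this example, not part of Optimus.

```python
from math import gcd


def unroll_factors(pushes_per_iter, pops_per_iter):
    """Return how many times to unroll the producer and the consumer schedule
    so that both cover the same number of channel items."""
    common = pushes_per_iter * pops_per_iter // gcd(pushes_per_iter, pops_per_iter)
    return common // pushes_per_iter, common // pops_per_iter


# A producer pushing 2 items and a consumer popping 3 items per iteration:
# 3 producer iterations and 2 consumer iterations both cover 6 items.
print(unroll_factors(2, 3))
```

The least common multiple gives the smallest unrolled schedules in which every push can be matched with exactly one pop.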
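Channel Allocation Example
[Figure: producer and consumer instruction schedules drawn side by side for a channel in the pipeline Source, Filter 1, Filter 2, Sink. The producer column marks its push operations and the consumer column marks its pop operations.]
Max overlap = 3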
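The "maximum number of overlapping lifetimes" can be computed with a simple sweep over push and pop times. The sketch below is one way to do it in Python under stated assumptions: the channel is a FIFO, so the k-th push is matched with the k-th pop, and a pop in the same cycle as a push frees its slot first. The function name and the cycle numbers are hypothetical and are not taken from the figure.

```python
def channel_size(push_times, pop_times):
    """Largest number of values that are live in a FIFO channel at once."""
    if len(push_times) != len(pop_times):
        raise ValueError("unroll the schedules until pushes and pops match")
    events = []
    for push_t, pop_t in zip(sorted(push_times), sorted(pop_times)):
        if pop_t <= push_t:
            raise ValueError("a value cannot be popped before it is pushed")
        events.append((push_t, +1))  # the value becomes live
        events.append((pop_t, -1))   # the value is consumed
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a slot freed by a pop is reused by a push in the same cycle.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Hypothetical steady-state schedule: four pushes early, four pops later.
print(channel_size(push_times=[0, 1, 2, 3], pop_times=[2, 4, 5, 6]))
```

For this schedule the lifetimes are [0,2), [1,4), [2,5), and [3,6), so at most three values are live at the same time and a three-entry channel suffices.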
Channel Allocation
Channel Access Fusion
- Each channel access (push or pop) takes one cycle.
- High communication-to-computation ratio
- Longer critical path latency
- Limits task-level parallelism
Channel Access Fusion Algorithm
- Clustering channel access operations
  - Loop unrolling
  - Code motion
  - Balancing the groups
- Similar to vectorization
- Wide channels
[Figure: filters annotated with read (r) and write (w) accesses and their multiplicities: Write Mult. = 1, Read Mult. = 8; Write Mult. = 8, Read Mult. = 8; Write Mult. = 4, Read Mult. = 1]
Access Fusion Example
Original:
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  push(sum);
After fusion:
  int sum = 0;
  int t1, t2, t3, t4;
  for (int i = 0; i < 8; i++) {
    (t1, t2, t3, t4) = pop4();
    sum += t1 + t2 + t3 + t4;
  }
  push(sum);
Some caveats:
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  pop();
  push(sum);

  int sum = 0;
  for (int i = 0; i < 8; i++) {
    sum += pop();
  }
  pop();
  push(sum);
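The effect of fusion on channel traffic can be checked with a small Python model of the first two fragments. The `Channel` class and its `pop_n` method are hypothetical stand-ins for the narrow and wide channel accesses; the model only counts accesses, on the slide's premise that each access costs one cycle.

```python
from collections import deque


class Channel:
    """FIFO that counts accesses; each access is assumed to cost one cycle."""

    def __init__(self, items=()):
        self.fifo = deque(items)
        self.accesses = 0

    def pop(self):
        self.accesses += 1
        return self.fifo.popleft()

    def pop_n(self, width):
        # One wide access returns `width` values at once (like pop4 for width 4).
        self.accesses += 1
        return [self.fifo.popleft() for _ in range(width)]

    def push(self, value):
        self.accesses += 1
        self.fifo.append(value)


def work_original(inp, out):
    total = 0
    for _ in range(32):
        total += inp.pop()
    out.push(total)


def work_fused(inp, out):
    total = 0
    for _ in range(8):
        t1, t2, t3, t4 = inp.pop_n(4)
        total += t1 + t2 + t3 + t4
    out.push(total)


data = list(range(32))
in_a, out_a = Channel(data), Channel()
in_b, out_b = Channel(data), Channel()
work_original(in_a, out_a)
work_fused(in_b, out_b)
print(out_a.fifo[0], in_a.accesses)  # same sum, 32 input accesses
print(out_b.fifo[0], in_b.accesses)  # same sum, 8 input accesses
```

Both versions push the same sum, but the fused version reads its input in 8 wide accesses instead of 32 narrow ones.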
Access Fusion
Speedup (baseline = PowerPC)
Energy Consumption
Handel-C Comparison
- Compared DES and DCT with hand-optimized Handel-C implementations
- Performance
  - 5% faster before optimizations
  - 12x faster after optimizations
- Area
  - 66% larger before optimizations
  - 90% larger after optimizations
Conclusion
- Streaming language to program heterogeneous systems
- Hierarchical synthesis using Spatial IR
- Macro and micro functional optimizations
  - Channel Access Fusion: 2.4x speedup
  - Channel Allocation: 50% area saving
Thank you! Questions?
Static Stream Scheduling
- Resources have to be ready before a filter starts (pushes and pops are non-blocking).
- Double buffering for parallelism.
- Deadlock can be detected at compile time.
- Can be inefficient in the case of data-dependent behavior.
System Setup
[Figure: Streaming Languages -> Front-End Compiler -> SIR, which feeds two back ends.
 FPGA path: Optimus Back-End Compiler (with FPGA Model) -> HDL -> Xilinx VHDL Compiler -> Xilinx bitfile -> Virtex5 FPGA (Streaming VM).
 Cell path: Crucible Back-End Compiler -> C -> Cell SDK -> Cell binary -> Cell BE (Streaming VM).]
Stream Scheduling
- Activate all the filters at time 0.
- Blocking channel access.
- No restriction on the channel size.
- Results in the lowest latency.
[Figure: stream graph (Source, Round-Robin Splitter(8,8,8,8), Adder 1 to Adder 4, Round-Robin Joiner(1,1,1,1), Printer) next to its compiled modules A through J, with the Init, Work, and Controller blocks and the array a[ ] indexed by i]
StreamIt Example
[Figure: stream graph with Source, Round-Robin Splitter(8,8,8,8), Adder 1 to Adder 4, Round-Robin Joiner(1,1,1,1), and Printer, mapped to modules A through J]
  void->void pipeline Minimal {
    add Source();
    add AddSplitter(8, 4);
    add Printer();
  }

  int->int splitjoin AddSplitter(int addSize, int pFactor) {
    split roundrobin(pFactor);
    for (int i = 0; i < pFactor; i++)
      add AdderFilter(addSize);
    join roundrobin(1);
  }

  int->void filter Printer() {
    work pop 1 {
      println(pop());
    }
  }