Optimus: Efficient Realization of Streaming Applications on FPGAs University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke IBM Research: David.




Presentation transcript:

Optimus: Efficient Realization of Streaming Applications on FPGAs
University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke
IBM Research: David Bacon, Rodric Rabbah

Introduction
• End of free ride from clock scaling
• Applications more demanding
• More applications on embedded platforms
• Evolution of new architectures

Why FPGAs?
• Customizable and reconfigurable
  – On-the-fly and in-the-field
  – Customizability → performance and low power
• Many orders of magnitude more parallelism than existing multicores
  – Task-level parallelism
  – Pipeline parallelism
  – Bit-level parallelism
[Figure labels: Crypto, XML parser, Physics, GPU]

Liquid Metal Vision
• One unified language (Lime) for programming hardware (e.g., FPGAs) and heterogeneous architectures
• Liquid Metal VM: JIT the hardware!
[Figure: the Liquid Metal VM targeting GPU, Cell (multicore), CPU, FPGA, and "???"; caption: "Program all with Lime"]

Liquid Metal Tool Chain
[Figure: streaming languages → front-end compiler → Spatial IR, which feeds two back ends. Optimus back-end compiler (with an FPGA model) → HDL → Xilinx VHDL compiler → Xilinx bitfile → streaming VM on a Virtex5 FPGA. Crucible back-end compiler → C → Cell SDK → Cell binary → streaming VM on a Cell BE.]

Overview
• Spatial IR (SIR)
• Compilation Flow
• Scheduling
• Optimizations
• Results

Spatial Intermediate Representation
• Main constructs:
  – Filter → Encapsulate computation.
  – Pipeline → Expressing pipeline parallelism.
  – Splitjoin → Expressing task-level parallelism.
  – Other constructs not relevant here
• Exposes different types of parallelism
  – Composable, hierarchical
• Some streaming languages can be easily lowered to SIR:
  – Lime, StreamIt
[Figure labels: pipeline, filter, splitjoin]
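The composable, hierarchical structure of these constructs can be illustrated with a small sketch. This is not the SIR implementation: the class names `Filter`, `Pipeline`, and `SplitJoin`, their fields, and the helper `leaf_filters` are hypothetical, chosen only to show how filters nest inside pipelines and splitjoins.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Filter:
    """Leaf node: encapsulates computation with fixed rates (hypothetical fields)."""
    name: str
    pop: int   # items consumed per firing
    push: int  # items produced per firing


@dataclass
class Pipeline:
    """Children are connected in sequence (pipeline parallelism)."""
    children: List["Node"] = field(default_factory=list)


@dataclass
class SplitJoin:
    """Branches sit side by side between a splitter and a joiner (task-level parallelism)."""
    branches: List["Node"] = field(default_factory=list)


Node = Union[Filter, Pipeline, SplitJoin]


def leaf_filters(node: Node) -> List[str]:
    """Walk the hierarchy and return the names of its leaf filters in order."""
    if isinstance(node, Filter):
        return [node.name]
    kids = node.children if isinstance(node, Pipeline) else node.branches
    names: List[str] = []
    for kid in kids:
        names.extend(leaf_filters(kid))
    return names


# A pipeline containing a four-way splitjoin, similar in shape to the deck's example graph.
graph = Pipeline([
    Filter("Source", pop=0, push=1),
    SplitJoin([Filter(f"Adder{i}", pop=8, push=1) for i in range(1, 5)]),
    Filter("Printer", pop=1, push=0),
])
print(leaf_filters(graph))
```

Because every construct is itself a node, a splitjoin can hold pipelines and a pipeline can hold splitjoins to any depth.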

Top Level Compilation
[Figure: a stream graph (Source → Round-Robin Splitter(8,8,8,8) → parallel Filters → Round-Robin Joiner(1,1,1,1) → Sink) and its hardware realization as modules A through J. Each filter has a controller with Init and Work phases, modules M0 … Mn, inputs i0 … ix, and outputs O0 … Om. Other labels: a[ ], i.]

Filter Compilation
    bb1:  sum = 0
          i = 0
    bb2:  temp = pop()
    bb3:  sum = sum + temp
          i = i + 1
          Branch bb2 if i < 8
    bb4:  push(sum)
[Figure labels: Basic Block, Register, Control in, Control outs, Memory/Queue ports, Ack, Live data ins, Live data outs, Live out Data, mux, FIFO Read, FIFO Write, Control Token]
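A software analogy for the basic-block view may help: control passes from one block to the next like a token, and only the block holding the token runs. The sketch below assumes bb1 initializes, bb2 pops, bb3 accumulates and branches, and bb4 pushes. The function `run_filter` and the `regs` dictionary are hypothetical, and the generated hardware is not claimed to work this way internally.

```python
from collections import deque


def run_filter(channel_in):
    """Execute one firing of the summing filter as a basic-block state machine."""
    out = []
    regs = {"sum": 0, "i": 0, "temp": 0}  # stands in for live data registers
    block = "bb1"                          # which block currently holds control
    while block is not None:
        if block == "bb1":
            regs["sum"] = 0
            regs["i"] = 0
            block = "bb2"
        elif block == "bb2":
            regs["temp"] = channel_in.popleft()  # FIFO read
            block = "bb3"
        elif block == "bb3":
            regs["sum"] += regs["temp"]
            regs["i"] += 1
            block = "bb2" if regs["i"] < 8 else "bb4"  # Branch bb2 if i < 8
        else:  # bb4
            out.append(regs["sum"])              # FIFO write
            block = None
    return out


print(run_filter(deque(range(1, 9))))  # [36]
```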

Operation Compilation
[Figure: each operation maps to a functional unit (FU) with inputs i0 … im, outputs o0 … on, and a predicate. Example for the block "sum = sum + temp; i = i + 1; Branch bb2 if i < 8": ADD and CMP units, registers for i, temp, and sum, the constants 1 and 8, and control in / control out signals.]

Stream Scheduling
• Filters fire eagerly.
  – Blocking channel access.
  – Allows for potentially smaller channels
• Results produced with lower latency.
[Figure: Filter 1 (push 2) feeding Filter 2 (pop 3)]
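Eager firing with blocking channel access can be sketched as a tiny cycle-by-cycle simulation. The model is an assumption made for illustration: one producer, one consumer, at most one push and one pop per cycle, and no compute cycles in between. The function `simulate` and its parameters are hypothetical.

```python
from collections import deque


def simulate(capacity, cycles, push_rate=2, pop_rate=3):
    """Return (producer firings, consumer firings, peak channel occupancy)."""
    channel = deque()
    pushed = popped = peak = 0
    for _ in range(cycles):
        # Consumer: a pop blocks (does nothing this cycle) if the channel is empty.
        if channel:
            channel.popleft()
            popped += 1
        # Producer: a push blocks if the channel is full.
        if len(channel) < capacity:
            channel.append(pushed)
            pushed += 1
        peak = max(peak, len(channel))
    # A firing completes after push_rate pushes or pop_rate pops.
    return pushed // push_rate, popped // pop_rate, peak


print(simulate(capacity=1, cycles=12))
```

In this simplified model a one-entry channel is enough to keep both filters running, which is the intuition behind "potentially smaller channels". Real filters interleave computation with channel accesses, so the required size depends on the actual schedules.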

Optimizations
• Streaming optimizations (macro functional)
  – Channel allocation, channel access fusion, filter fission and fusion, etc.
  – Performing these optimizations requires global information about the stream graph
  – Typically performed manually using existing tools
• Classic optimizations (micro functional)
  – Common subexpression elimination, constant folding, loop unrolling, etc.
  – Typically included in existing compilers and tools

Channel Allocation
• Larger channels:
  – More SRAM
  – More control logic
  – Fewer stalls
• Interlocking makes sure that each filter gets the right data or blocks.
• What is the right channel size?

Channel Allocation Algorithm
1. Set the size of the channels to infinity.
2. Warm up the queues.
3. Record the steady-state instruction schedules for each producer-consumer pair.
4. Unroll the schedules to have the same number of pushes and pops.
5. Find the maximum number of overlapping lifetimes.
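The final step, finding the maximum number of overlapping lifetimes, can be sketched as a sweep over push and pop events. This is a minimal illustration, not the Optimus implementation. It assumes the k-th push pairs with the k-th pop (FIFO order) and that a slot freed in a cycle can be reused in that same cycle. The function `channel_size` and the cycle numbers in the example are hypothetical.

```python
def channel_size(push_times, pop_times):
    """Return the peak number of items live in a FIFO channel.

    push_times[k] and pop_times[k] are the cycles at which the k-th item is
    written and read; the item occupies a slot during [push, pop).
    """
    if len(push_times) != len(pop_times):
        raise ValueError("unroll the schedules until pushes and pops match")
    events = []
    for push, pop in zip(push_times, pop_times):
        if pop <= push:
            raise ValueError("an item must be pushed before it is popped")
        events.append((push, +1))  # slot becomes occupied
        events.append((pop, -1))   # slot is released
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a release is processed before an occupation in the same cycle.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Hypothetical steady-state schedule: pushes at cycles 0..3, pops at 2, 4, 6, 8.
print(channel_size([0, 1, 2, 3], [2, 4, 6, 8]))  # 3
```

With these timings the lifetimes are [0,2), [1,4), [2,6), and [3,8), and at most three of them overlap, so a three-entry channel would suffice for this schedule.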

Channel Allocation Example
[Figure: Source → Filter 1 → Filter 2 → Sink, with the producer's push schedule shown beside the consumer's pop schedule; max overlap = 3]

Channel Allocation

Channel Access Fusion
• Each channel access (push or pop) takes one cycle.
• Communication-to-computation ratio
• Longer critical path latency
• Limits task-level parallelism

Channel Access Fusion Algorithm
• Clustering channel access operations
  – Loop unrolling
  – Code motion
  – Balancing the groups
• Similar to vectorization
• Wide channels
[Figure: groups of read (r) and write (w) accesses with their multiplicities: Write Mult. = 1, Read Mult. = 8; Write Mult. = 8, Read Mult. = 8; Write Mult. = 4, Read Mult. = 1]

Access Fusion Example

    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += pop();
    push(sum);

    int sum = 0;
    int t1, t2, t3, t4;
    for (int i = 0; i < 8; i++) {
        (t1, t2, t3, t4) = pop4();
        sum += t1 + t2 + t3 + t4;
    }
    push(sum);

Some caveats

    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += pop();
    pop();
    push(sum);

    int sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += pop();
    }
    pop();
    push(sum);
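The effect of fusing four reads into one wide access can be checked with a short sketch that counts channel accesses. It models the channel as a Python `deque`. The helpers `pop`, `pop4`, `work_scalar`, and `work_fused` are hypothetical stand-ins for the slide's snippets, and the access counter is an illustration device rather than a cycle-accurate model.

```python
from collections import deque


def pop(channel):
    """Narrow read: one item per channel access."""
    return channel.popleft()


def pop4(channel):
    """Wide read: four items in a single channel access."""
    return channel.popleft(), channel.popleft(), channel.popleft(), channel.popleft()


def work_scalar(channel):
    """Sum 32 items with 32 separate accesses."""
    total = accesses = 0
    for _ in range(32):
        total += pop(channel)
        accesses += 1
    return total, accesses


def work_fused(channel):
    """Sum the same 32 items with 8 wide accesses."""
    total = accesses = 0
    for _ in range(8):
        t1, t2, t3, t4 = pop4(channel)
        accesses += 1
        total += t1 + t2 + t3 + t4
    return total, accesses


print(work_scalar(deque(range(32))))  # (496, 32)
print(work_fused(deque(range(32))))   # (496, 8)
```

Both versions compute the same sum, but the fused version touches the channel a quarter as often, which is where the saving comes from when each access costs one cycle.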

Access Fusion

Speedup (baseline = PowerPC)

Energy Consumption

Handel-C Comparison
• Compared DES and DCT with hand-optimized Handel-C implementations
• Performance
  – 5% faster before optimizations
  – 12x faster after optimizations
• Area
  – 66% larger before optimizations
  – 90% larger after optimizations

Conclusion
• Streaming language to program heterogeneous systems
• Hierarchical synthesis using Spatial IR
• Macro and micro functional optimizations
  – Channel Access Fusion: 2.4x speedup
  – Channel Allocation: 50% area saving

Thank you! Questions?

Static Stream Scheduling
• Resources have to be ready before a filter starts (pushes and pops are non-blocking).
• Double buffering for parallelism.
• Deadlock can be detected at compile time.
• Could be inefficient in case of data-dependent behavior.
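Static scheduling needs to know, at compile time, how often each filter fires so that every buffer is ready before a firing starts. For a single producer-consumer pair this follows from the usual balance condition of synchronous dataflow: firings times push rate must equal firings times pop rate. The sketch below is an illustration under that assumption. The function `steady_state` is hypothetical, and the double-buffer size is simply twice the items exchanged per steady state.

```python
from math import gcd


def steady_state(push_rate, pop_rate):
    """Return (producer firings, consumer firings, items per steady state)."""
    items = push_rate * pop_rate // gcd(push_rate, pop_rate)  # least common multiple
    return items // push_rate, items // pop_rate, items


producer_firings, consumer_firings, items = steady_state(push_rate=2, pop_rate=3)
double_buffer = 2 * items  # one half is filled while the other is drained
print(producer_firings, consumer_firings, items, double_buffer)  # 3 2 6 12
```

For the push 2 / pop 3 pair used earlier in the deck, the producer fires three times for every two consumer firings, and a double buffer would hold twelve items under this model.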

System Setup
[Figure: streaming languages → front-end compiler → SIR, which feeds two back ends. Optimus back-end compiler (with an FPGA model) → HDL → Xilinx VHDL compiler → Xilinx bitfile → streaming VM on a Virtex5 FPGA. Crucible back-end compiler → C → Cell SDK → Cell binary → streaming VM on a Cell BE.]

Stream Scheduling
• Activate all the filters at time 0.
• Blocking channel access.
• No restriction on the channel size.
• Results in the least latency.
[Figure: Source → Round-Robin Splitter(8,8,8,8) → Adder 1 … Adder 4 → Round-Robin Joiner(1,1,1,1) → Printer, realized as hardware modules A through J. Other labels: a[ ], i, Init, Work, Controller.]

StreamIt Example
[Figure: Source → Round-Robin Splitter(8,8,8,8) → Adder 1 … Adder 4 → Round-Robin Joiner(1,1,1,1) → Printer, mapped to modules A through J]

    void->void pipeline Minimal {
        add Source();
        add AddSplitter(8, 4);
        add Printer();
    }

    int->int splitjoin AddSplitter(int addSize, int pFactor) {
        split roundrobin(pFactor);
        for (int i = 0; i < pFactor; i++)
            add AdderFilter(addSize);
        join roundrobin(1);
    }

    int->void filter Printer() {
        work pop 1 {
            println(pop());
        }
    }
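The data movement through the splitjoin can be traced with a small functional sketch. It uses the splitter and joiner weights shown in the figure, (8,8,8,8) and (1,1,1,1), and assumes that each adder sums `add_size` consecutive items, since the body of `AdderFilter` is not shown in the transcript. All three helper functions are hypothetical.

```python
def round_robin_split(items, weights):
    """Deal items to branches in turn, weights[b] items at a time for branch b."""
    branches = [[] for _ in weights]
    pos = 0
    while pos < len(items):
        for branch, weight in zip(branches, weights):
            branch.extend(items[pos:pos + weight])
            pos += weight
    return branches


def adder(items, add_size):
    """Assumed AdderFilter behavior: pop add_size items, push their sum."""
    return [sum(items[k:k + add_size])
            for k in range(0, len(items) - add_size + 1, add_size)]


def round_robin_join(branches, weights):
    """Collect weights[b] items from each branch in turn while all have enough."""
    out = []
    cursors = [0] * len(branches)
    while all(c + w <= len(b) for c, b, w in zip(cursors, branches, weights)):
        for idx, weight in enumerate(weights):
            out.extend(branches[idx][cursors[idx]:cursors[idx] + weight])
            cursors[idx] += weight
    return out


source = list(range(64))                                  # stand-in for Source
split = round_robin_split(source, [8, 8, 8, 8])
summed = [adder(branch, add_size=8) for branch in split]  # four parallel adders
result = round_robin_join(summed, [1, 1, 1, 1])
print(result)  # [28, 92, 156, 220, 284, 348, 412, 476]
```

The joined output equals the sums of consecutive groups of eight inputs in their original order, so the four adders can run in parallel without changing the result a single adder would produce.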