Hossein Omidian, Guy Lemieux


JANUS: A Compilation System for Balancing Parallelism and Performance in OpenVX
Hossein Omidian, Guy Lemieux
hosseino@ece.ubc.ca
Vancouver, Canada

Which target for computer vision and similar applications?
The goal is to run CV applications faster while using less power.
Which hardware target? When? How?
Source: Altera (Intel) slides by Tom Spyrou

Motivation: Computer Vision on FPGAs and Many-core Systems
OpenCV: de facto API/toolkit
    while (operations remaining) {
        read frame;
        frame = Operate(frame);
        write frame;
    }
Frame size > on-chip memory -> low performance, high power
OpenVX: standards-based API/toolkit by Khronos (OpenCL)
    while (tiles remaining) {
        read tile;
        while (operations remaining)
            tile = Operate(tile);
        write tile;
    }
Small tile size -> pipelining, cache-friendly, low power
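
To make the frame-vs-tile argument concrete, here is a small sketch (not from the talk) comparing working-set sizes; the frame dimensions, tile size, and on-chip capacity are hypothetical round numbers:

```python
# Illustrative sketch: why tiling keeps the working set on-chip.
# All numbers are hypothetical: a 1080p 8-bit frame vs. a 64x64 tile,
# against a nominal 512 KiB of on-chip memory.
FRAME_W, FRAME_H = 1920, 1080   # full frame, 1 byte/pixel
TILE_W, TILE_H = 64, 64         # one tile, 1 byte/pixel
ON_CHIP_BYTES = 512 * 1024      # assumed on-chip buffer capacity

frame_bytes = FRAME_W * FRAME_H   # ~2 MiB: exceeds on-chip memory
tile_bytes = TILE_W * TILE_H      # 4 KiB: fits easily

print(frame_bytes > ON_CHIP_BYTES)   # True -> frame-at-a-time spills off-chip
print(tile_bytes <= ON_CHIP_BYTES)   # True -> tile-at-a-time stays on-chip
```

Because a whole frame cannot stay resident, the OpenCV-style loop pays off-chip bandwidth for every operation, while the tiled loop can chain many operations over one resident tile.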

OpenVX Programming Model
C library
Streaming compute graph: operations + local data buffering
Two-phase execution: (1) build and optimize the graph, (2) execute the graph
Example: Sobel (5 nodes, 1 input, 2 outputs)
Send image tiles through the pipeline
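
The two-phase model can be sketched as follows. This is a minimal Python mock, not the real OpenVX C API (which builds graphs with calls like vxCreateGraph/vxVerifyGraph/vxProcessGraph); the node functions here are toy stand-ins:

```python
# Minimal mock of OpenVX-style two-phase execution:
# phase 1 builds and verifies a compute graph, phase 2 streams tiles through it.
class Graph:
    def __init__(self):
        self.nodes = []          # (name, fn) pairs in topological order
        self.verified = False

    def add_node(self, name, fn):
        self.nodes.append((name, fn))

    def verify(self):
        # Phase 1: a real runtime would optimize/fuse nodes here.
        self.verified = True

    def process(self, tiles):
        # Phase 2: send each tile through the whole pipeline.
        assert self.verified, "graph must be verified before execution"
        out = []
        for tile in tiles:
            for _, fn in self.nodes:
                tile = fn(tile)
            out.append(tile)
        return out

# Toy two-node pipeline with made-up stand-in operations.
g = Graph()
g.add_node("blur", lambda t: [x // 2 for x in t])
g.add_node("grad", lambda t: [abs(x) for x in t])
g.verify()
print(g.process([[4, -6], [8, 2]]))  # [[2, 3], [4, 1]]
```

The key property JANUS relies on is that the whole graph is known before execution, so it can be analyzed and restructured offline.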

OpenVX on Different Targets
PROBLEMS:
How to "scale" to different target sizes while writing only one C program?
Area target, throughput target
How to break the image into tiles?
How to balance the throughput of the entire graph?
OUR SOLUTION:
JANUS: a tool to automatically balance area and throughput

Example: Streaming Compute Kernel (using HLS)

void compute_kernel() {
    for (int i = 0; i < N; i++) { // STREAM INPUTS m[i], px[i], py[i]
        gmm = M(ref_gmm, m[i]);
        dx = S(ref_px, px[i]);
        dy = S(ref_py, py[i]);
        dx2 = M(dx, dx);
        dy2 = M(dy, dy);
        r2 = A(dx2, dy2);
        r = SQRT(r2);
        rr = D(1, r);
        gmm_rr = M(rr, gmm);
        gmm_rr2 = M(rr, gmm_rr);
        gmm_rr3 = M(rr, gmm_rr2);
        dfx = M(dx, gmm_rr3);
        dfy = M(dy, gmm_rr3);
    }
}

Each compute kernel has a unique initiation interval (II):
Kernel      Initiation Interval
A and S     1
M           2
SQRT        4
D           8

Executed sequentially, one iteration takes 31 clock cycles.
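
For clarity, the pseudocode above can be transcribed to runnable Python, under the assumption (not stated on the slide) that M, S, A, D are multiply, subtract, add, divide and SQRT is square root:

```python
# Runnable transcription of the slide's kernel; M/S/A/D/SQRT are assumed to
# be multiply/subtract/add/divide/square-root.
import math

def compute_kernel(ref_gmm, ref_px, ref_py, m, px, py):
    out = []
    for i in range(len(m)):          # stream inputs m[i], px[i], py[i]
        gmm = ref_gmm * m[i]         # M
        dx = ref_px - px[i]          # S
        dy = ref_py - py[i]          # S
        dx2 = dx * dx                # M
        dy2 = dy * dy                # M
        r2 = dx2 + dy2               # A
        r = math.sqrt(r2)            # SQRT
        rr = 1 / r                   # D
        gmm_rr = rr * gmm            # M
        gmm_rr2 = rr * gmm_rr        # M
        gmm_rr3 = rr * gmm_rr2       # M
        dfx = dx * gmm_rr3           # M
        dfy = dy * gmm_rr3           # M
        out.append((dfx, dfy))
    return out

# One stream element: reference point (0,0), point (3,4), so r = 5.
print(compute_kernel(1.0, 0.0, 0.0, [1.0], [3.0], [4.0]))
# -> approximately [(-0.024, -0.032)]
```

Counting the operations (8 M, 2 S, 1 A, 1 SQRT, 1 D) against the II table gives 16 + 2 + 1 + 4 + 8 = 31 cycles per sequential iteration, matching the slide.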

Example: OpenVX Graph
Static time analysis / throughput analysis
[Figure: dataflow graph of the kernel, with S, M, A, Sqrt, and D nodes labeled with their initiation intervals, fed by streams x_ref, x_i, y_ref, y_i, M_ref, M_i]
Sequential execution: 31 cycles. Critical path with parallelism: 24 cycles.
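
Both numbers on this slide can be reconstructed from the kernel's dataflow graph: the sequential cost is the sum of all initiation intervals, and the pipelined critical path is the longest II-weighted path through the DAG. A sketch:

```python
# Reconstruct the slide's 31 and 24 from the kernel's dataflow graph.
II = {"A": 1, "S": 1, "M": 2, "SQRT": 4, "D": 8}

# node -> (op, predecessors), taken from the compute_kernel body
graph = {
    "gmm":     ("M", []),
    "dx":      ("S", []),
    "dy":      ("S", []),
    "dx2":     ("M", ["dx"]),
    "dy2":     ("M", ["dy"]),
    "r2":      ("A", ["dx2", "dy2"]),
    "r":       ("SQRT", ["r2"]),
    "rr":      ("D", ["r"]),
    "gmm_rr":  ("M", ["rr", "gmm"]),
    "gmm_rr2": ("M", ["rr", "gmm_rr"]),
    "gmm_rr3": ("M", ["rr", "gmm_rr2"]),
    "dfx":     ("M", ["dx", "gmm_rr3"]),
    "dfy":     ("M", ["dy", "gmm_rr3"]),
}

sequential = sum(II[op] for op, _ in graph.values())

def finish(node, memo={}):
    # longest II-weighted path from any input to 'node'
    if node not in memo:
        op, preds = graph[node]
        memo[node] = II[op] + max((finish(p) for p in preds), default=0)
    return memo[node]

critical_path = max(finish(n) for n in graph)
print(sequential, critical_path)  # 31 24
```

The longest path runs S -> M -> A -> SQRT -> D -> M -> M -> M -> M, i.e. 1+2+1+4+8+2+2+2+2 = 24.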

Example: OpenVX Graph
Pipelining:
Balancing
Sending/gathering data to/from the system
Scheduling or back pressure
[Figure: the pipelined graph; steady-state throughput is limited by the slowest node, the divider with II = 8]
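
Once the graph is pipelined, steady-state throughput is set by the slowest stage. A small sketch of the implied cost model (the fill latency of 24 is the critical path from the previous slide; the formula itself is an assumption of this sketch, not taken from the talk):

```python
# Pipelined throughput model: one result per bottleneck interval.
stage_ii = {"S": 1, "M": 2, "A": 1, "SQRT": 4, "D": 8}

bottleneck_ii = max(stage_ii.values())   # 8 cycles per result

def pipelined_cycles(n_items, fill_latency=24):
    # fill the pipeline once (critical path ~24 cycles),
    # then drain one result every bottleneck interval
    return fill_latency + (n_items - 1) * bottleneck_ii

print(bottleneck_ii)           # 8
print(pipelined_cycles(1000))  # 24 + 999*8 = 8016
```

This is why the slide highlights 8: without rebalancing, the divider throttles every other node in the pipeline.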

Example: OpenVX Graph
Replicating: send data to the replicas in round-robin order
[Figure: animation of replication; a fast A node (II = 1) feeds a slow Sqrt node (II = 4), and the slow node is replicated so the copies, fed in round-robin order, keep up with the fast producer]
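
The round-robin replication idea can be simulated at cycle level. This is an illustrative sketch, not JANUS's scheduler: each replica is busy for II cycles after accepting an input, and inputs are dispatched strictly round-robin:

```python
# Cycle-level sketch of round-robin dispatch to replicated nodes.
def simulate(ii, replicas, n_inputs):
    # returns the number of cycles needed to accept all inputs
    free_at = [0] * replicas         # cycle at which each replica frees up
    cycle, accepted = 0, 0
    while accepted < n_inputs:
        r = accepted % replicas      # strict round-robin target
        cycle = max(cycle, free_at[r])   # stall until that replica is free
        free_at[r] = cycle + ii          # replica busy for `ii` cycles
        accepted += 1
        cycle += 1
    return cycle

print(simulate(4, 1, 16))  # 61: one replica accepts one input every 4 cycles
print(simulate(4, 4, 16))  # 16: four replicas accept one input per cycle
```

Replicating a node with II = k by a factor of k makes the cluster look like an II = 1 node to its neighbours, at k times the area.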

Example: OpenVX Graph
Let's make it faster by expanding (replicating the slow nodes).
[Figure: the graph with per-node IIs (1, 2, 4, 8); sequential cost 31, critical path 24, bottleneck II 8]

Example: OpenVX Graph
Expanding: maximum throughput by using maximum area
[Figure: fully expanded graph of S, M, A, Sqrt, and D nodes; the effective initiation interval improves from 8 to 1]

Example: OpenVX Graph
Saving area (minimum area): decrease the throughput
[Figure: minimum-area graph with a single instance per node; effective II = 8]
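
The expanding and area-saving slides are two ends of one trade-off curve. Under a deliberately simplified cost model (area proportional to instance count; both the model and the node labels are assumptions of this sketch, not JANUS's actual cost model), each node of interval II needs ceil(II/t) replicas to hit a target interval t:

```python
# Simplified space/time trade-off: replicas needed per node for a target II.
from math import ceil

# the example graph's nodes and their initiation intervals
node_ii = {"S1": 1, "S2": 1, "M_gmm": 2, "M_dx2": 2, "M_dy2": 2, "A": 1,
           "SQRT": 4, "D": 8, "M1": 2, "M2": 2, "M3": 2, "Mfx": 2, "Mfy": 2}

def replicas_for_target(target_ii):
    # a node with interval II keeps up with target t using ceil(II/t) copies
    return {n: ceil(ii / target_ii) for n, ii in node_ii.items()}

for t in (8, 4, 2, 1):
    total = sum(replicas_for_target(t).values())
    print(f"target II {t}: {total} node instances")
```

At t = 8 the graph needs only 13 instances (one per node: the minimum-area point), while t = 1 needs 31 (the fully expanded, maximum-throughput point), mirroring the 8-to-1 range shown in the figures.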

Example: OpenVX Graph
What about many-core systems? Clustering (assume we have 5 Processing Elements (PEs))
[Figure: the graph's S, M, A, Sqrt, and D nodes grouped into 5 clusters, one per PE]

Example: OpenVX Graph
Implementing on 5 Processing Elements (PEs)
[Figure: static schedule across the 5 PEs, with slots for S(), SQRT(), D(), M(), M(), A() and NOPs]
17.5% of the issue slots run NOPs.
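
One simple way to see where idle slots come from is a greedy longest-processing-time clustering of the example's nodes onto 5 PEs. This is a stand-in heuristic with made-up node labels, not the mapping used by JANUS (which reaches 17.5% idle; this naive version lands slightly higher):

```python
# Greedy LPT clustering onto 5 PEs; idle fraction = 1 - busy/total slots.
node_ii = {"S1": 1, "S2": 1, "Mg": 2, "Mx2": 2, "My2": 2, "A": 1,
           "SQRT": 4, "D": 8, "M1": 2, "M2": 2, "M3": 2, "Mfx": 2, "Mfy": 2}

pes = [0] * 5                                    # accumulated work per PE
for _, ii in sorted(node_ii.items(), key=lambda kv: -kv[1]):
    i = pes.index(min(pes))                      # least-loaded PE first
    pes[i] += ii

makespan = max(pes)                              # schedule length in cycles
busy = sum(pes)                                  # useful work across all PEs
idle_fraction = 1 - busy / (makespan * len(pes))
print(pes, round(idle_fraction, 3))  # [8, 6, 6, 6, 5] 0.225
```

The D node (II = 8) pins the makespan, so every other PE necessarily idles part of the time; a better clustering (or splitting D) is what pushes the idle fraction down toward the slide's 17.5%.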

JANUS Tool Flow: 2 Phases

Phase 1: Pre-compute (library characterization)
CV kernels written in parameterized C++ for HLS
Different Implementations Generator (DIG): sweeps the parameter space for each kernel
Design Evaluator + Throughput Calculator: records throughput, area, and tile-width for each node
Result: a DB of implementations for each kernel (node), i.e. an area/throughput/tile-width correlation per node

Phase 2: Implementation
Input: the CV compute graph
Intra-Node Optimizer: replication vs. tile size, throughput analysis
Trade-off Finder: automated space/time trade-offs via Integer Linear Programming, or the JANUS heuristic with an Inter-Node Optimizer
Backends: FPGA fabric, many-core system, heterogeneous
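
The pre-compute phase essentially builds, per kernel, a set of Pareto-optimal (area, throughput) implementations. A sketch of that filtering step, with entirely made-up sweep numbers:

```python
# Sketch of DB construction: keep only Pareto-optimal implementations
# in (smaller area, higher throughput) from a per-kernel parameter sweep.
def pareto_front(points):
    front = []
    for area, tput in sorted(points):        # ascending area
        if not front or tput > front[-1][1]: # must beat best throughput so far
            front.append((area, tput))
    return front

# hypothetical sweep for one kernel: (area units, pixels/cycle)
sweep = [(10, 0.125), (18, 0.25), (20, 0.2), (35, 0.5), (60, 1.0), (70, 0.9)]
print(pareto_front(sweep))  # [(10, 0.125), (18, 0.25), (35, 0.5), (60, 1.0)]
```

The dominated points, e.g. (20, 0.2), are dropped: they cost more area than another implementation without delivering more throughput, so the trade-off finder never needs to consider them.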


Experimental Results (area target)
Sobel and Harris benchmarks
Area target: fill each Xilinx 7-series device
Average utilization: 95% of the chip area

Experimental Results (throughput target)
JANUS vs. automated ILP: on average, a 19% area reduction at a 2% throughput penalty

Run-time vs. ILP
The JANUS heuristic is 3.6x faster than the ILP solver on average.
Note: it also saves 19% area.

Area Efficiency on the ZedBoard
Achieved 5.5 gigapixels/sec for Sobel on a $125 SoC (Zynq-7020).

Conclusions / Summary
We studied the problem of automatically finding area/throughput trade-offs for CV applications.
We proposed JANUS (OpenVX-based), targeting FPGAs and programmable many-core systems.
It satisfies different area budgets: average utilization of 95% of the chip area.
It satisfies different throughput targets.
JANUS heuristic vs. automated ILP: 19% area reduction at a 2% throughput penalty, and 3.6x faster.
Achieved 5.5 gigapixels/sec on a small FPGA (the Zynq-7020 costs $125).