Fan Zhang, Yang Gao and Jason D. Bakos

Lucas Kanade Optical Flow Estimation on the TI C66x Digital Signal Processor
Fan Zhang, Yang Gao and Jason D. Bakos
2014 IEEE High Performance Extreme Computing Conference (HPEC '14)

What is Optical Flow?
- Evaluates pixel movement in a video stream
- Represented as a dense 2D field
- Many applications need real-time optical flow:
  - Robotic vision
  - Augmented reality
- Computationally intensive

TI C66x Multicore DSP
Unique architectural features:
- Eight cores
- Statically scheduled VLIW/SIMD instructions
- 2 register files and 8 functional units per core

                                 Tesla K20X GPU   Intel i7 Ivy Bridge   TI C6678 Keystone   NVIDIA Tegra K1
Class                            Server GPU       Server CPU            Embedded            Embedded
Peak single precision throughput 3.95 Tflops      448 Gflops            160 Gflops          327 Gflops
Peak power                       225 W            77 W                  <10 W               <20 W
DRAM bandwidth                   250 GB/s         25.6 GB/s             12.8 GB/s           17.1 GB/s

Why use a DSP?
- Low power consumption
- High-performance floating-point arithmetic
- Strong potential for computer vision tasks
- 1D vs. 2D (signal processing vs. image processing)

Gradient-based Optical Flow Solver
- A pixel at (x, y, t) in frame t moves to (x + Δx, y + Δy, t + Δt) in frame t + Δt
- Optical flow is evaluated through a first-order approximation
- The gradients of the pixel in the x, y, and t directions are known
- This yields one equation with two unknowns: the optical flow (u, v)
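The first-order approximation the slide refers to is the standard brightness-constancy derivation; writing I for the image intensity and (u, v) for the flow, it can be spelled out as:

```latex
% First-order Taylor expansion of the displaced intensity
I(x+\Delta x,\, y+\Delta y,\, t+\Delta t)
  \approx I(x,y,t) + I_x\,\Delta x + I_y\,\Delta y + I_t\,\Delta t
% Brightness constancy: the left-hand side equals I(x,y,t), hence
I_x\,\Delta x + I_y\,\Delta y + I_t\,\Delta t = 0
% Dividing by \Delta t, with u = \Delta x/\Delta t and v = \Delta y/\Delta t:
I_x\,u + I_y\,v + I_t = 0
```

This single equation in the two unknowns (u, v) is why the window-based least-squares step of the next slide is needed.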

Lucas Kanade Method
- Assume every pixel adjacent to the center of a window has the same optical flow as the center
- For a 3x3 window centered at (x, y), the pixels (x-1, y-1) through (x+1, y+1) each contribute one constraint equation, giving nine equations in the two unknowns (u, v)
- Solve the overdetermined system with the least square method
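The least-squares step reduces to solving a 2x2 system of normal equations. A minimal sketch in plain C (not the paper's code; the sums sxx = ΣDxDx, sxy = ΣDxDy, syy = ΣDyDy, sxt = ΣDxDt, syt = ΣDyDt are accumulated over the window, and the system is solved by Cramer's rule):

```c
#include <math.h>

/* Solve the Lucas-Kanade normal equations
 *   [sxx sxy][u]   [-sxt]
 *   [sxy syy][v] = [-syt]
 * for the flow (u, v). Returns -1 if the window lacks texture
 * (near-singular matrix), 0 on success. */
int lk_solve(float sxx, float sxy, float syy,
             float sxt, float syt, float *u, float *v)
{
    float det = sxx * syy - sxy * sxy;
    if (fabsf(det) < 1e-6f)
        return -1;                        /* untextured window: no unique flow */
    *u = (sxy * syt - syy * sxt) / det;   /* Cramer's rule, first column */
    *v = (sxy * sxt - sxx * syt) / det;   /* Cramer's rule, second column */
    return 0;
}
```

The singularity check matters in practice: flat image regions give a rank-deficient matrix and no usable flow estimate.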

Image Derivative Computation
For a 2x2 pixel block [A B; C D] in frame n:
- Dx = (A - B + C - D) / 2
- Dy = (A - C + B - D) / 2
- Dt = (pixel in frame n+1) - (same pixel in frame n)
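The 2x2-block derivative formulas can be written directly in C (a sketch with our own variable names; a, b, c, d are the block's pixels in frame n and a_next is the corresponding pixel in frame n+1):

```c
/* Spatial and temporal derivatives for a 2x2 block [a b; c d],
 * following the (A - B + C - D)/2 style formulas from the slide. */
void derivatives(float a, float b, float c, float d, float a_next,
                 float *dx, float *dy, float *dt)
{
    *dx = (a - b + c - d) * 0.5f;  /* horizontal difference, averaged over two rows */
    *dy = (a - c + b - d) * 0.5f;  /* vertical difference, averaged over two columns */
    *dt = a_next - a;              /* temporal difference at the same pixel */
}
```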

Lucas Kanade Method
Input: frame n, frame n+1, window size w
Output: optical flow field F
For each pixel (x, y):
  1. Compute the x, y, t derivative matrices Dx, Dy, Dt over its neighbor window
  2. Compute the optical flow by the least square method

Derivative Computation (Example)
(Slide figure: a worked numeric example deriving Dx, Dy, Dt values from pixel intensities in image n and image n+1.)

Least Square Method Example (Neighbor Window Size = 3)
(Slide figure: a worked numeric example of the least-squares solve over a 3x3 window of derivative values.)

Optimize Lucas Kanade Method on DSP
- Derivative computation:
  - SIMD arithmetic
  - Data interleaving
- Least square method:
  - Loop unrolling
  - SIMD loads & arithmetic

Derivative Computation
(Slide figure: the Dx/Dy derivative computation pipelined over three cycles, with the resulting values interleaved for the least-square stage.)

Least Square Method
Required computations per (Dx, Dy, Dt) triple:
- Multiplications (x5): DxDx, DxDy, DyDy, DxDt, DyDt
- Accumulations (x5)
Map to the device:
- A complex multiply, (a+bj)(c+dj) = (ac-bd) + (ad+bc)j, applied to (Dx, Dy) pairs yields the partial products DxDy, -DyDy, DyDx, DxDx
- A 2-way SIMD multiply of (Dx, Dy) by (Dt, Dt) yields DxDt and DyDt
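In scalar C, the five products the slide groups into one complex multiply and one 2-way SIMD multiply look like this (a hedged sketch emulating the grouping; the exact C66x intrinsic mapping is not shown in the transcript):

```c
/* Emulate the product grouping used on the C66x: the five per-pixel
 * products feeding the least-squares accumulators. On the DSP the first
 * three can come from one complex multiply's partial products and the
 * last two from one 2-way SIMD multiply; here we compute them in scalar C. */
void lk_products(float dx, float dy, float dt, float out[5])
{
    out[0] = dx * dx;  /* DxDx */
    out[1] = dx * dy;  /* DxDy */
    out[2] = dy * dy;  /* DyDy */
    out[3] = dx * dt;  /* DxDt */
    out[4] = dy * dt;  /* DyDt */
}
```

Accumulating these five values over every pixel of the window gives exactly the sums consumed by the 2x2 least-squares solve.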

Least Square Method, Unrolled 2x
- Group loads into SIMD operations: one load for (Dx, Dy), one load for Dt
- Complex MUL + 2-way SIMD MUL produce (DxDx, DxDy, DyDy, DxDt, DyDt)
- SIMD ADD accumulates the products across the unrolled iterations
(Slide figure: a worked numeric example of the unrolled data flow.)

Software Pipelining and Loop Flattening
- Software pipelining: a TI compiler technique that lets independent loop iterations execute in parallel; consecutive iterations are packed and executed together so that the arithmetic functional units are fully utilized
- The pipeline incurs prologue/epilogue overhead each time the inner loop is entered
- Loop flattening collapses the nested loops

  for (i = 0; i < m; ++i)
    for (j = 0; j < n; ++j)
      ...

  into a single loop

  for (k = 0; k < m * n; ++k) {
    ...
    Update i, j;
  }

  so the prologue/epilogue cost is paid only once
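A runnable version of the flattened loop (a sketch on a toy reduction, not the paper's kernel): the two loop counters are carried through a single trip-count of m * n and updated manually, so the compiler pipelines one loop instead of re-filling the pipeline n times.

```c
/* Sum an m x n image with a flattened loop: one software-pipeline
 * prologue/epilogue in total instead of one per row. */
float sum_flattened(const float *img, int m, int n)
{
    int i = 0, j = 0;
    float acc = 0.0f;
    for (int k = 0; k < m * n; ++k) {
        acc += img[i * n + j];
        if (++j == n) {    /* manual "Update i, j" from the slide */
            j = 0;
            ++i;
        }
    }
    return acc;
}
```

The payoff is largest when n is small, since the prologue/epilogue is then a significant fraction of each inner loop's execution, which matches the results slide later in the deck.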

Multicore Utilization
- Derivative computation is followed by the least square method on each core
- Rows are partitioned across the cores: row [0, k), [k, 2k), [2k, 3k), ..., with k = numRows / numCores
- Cache sync between the stages
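The row partitioning can be sketched as a small helper (our own function; the slide only gives k = numRows / numCores, so assigning the leftover rows to the last core is an assumption we add for completeness):

```c
/* Compute the half-open row range [row_begin, row_end) handled by one core.
 * k = num_rows / num_cores per the slide; the remainder rows are assumed
 * to go to the last core. */
void core_rows(int num_rows, int num_cores, int core,
               int *row_begin, int *row_end)
{
    int k = num_rows / num_cores;
    *row_begin = core * k;
    *row_end = (core == num_cores - 1) ? num_rows : *row_begin + k;
}
```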

Accuracy Improvement
- The method cannot capture movement larger than the window size → Gaussian pyramid:
  - A coarse-to-fine strategy for optical flow computation
  - Captures large movements
- The first-order approximation may not be accurate → iterative refinement:
  - Iteratively process and offset pixels until the computed optical flow converges
  - Introduces random data access

Experiment Setup

Platform        #Cores  Implementation      Power Measurement
TI C6678 DSP    8       Our implementation  TI GPIO-USB module
ARM Cortex A9   2       —                   YOKOGAWA WT500 power analyzer
Intel i7-2600   4       —                   Intel RAPL
Tesla K20 GPU   2688    OpenCV              NVIDIA SMI

Results and Analysis
- Highest performance: Tesla K20
- Lowest power: Cortex A9
- Best power efficiency: TI C66x DSP

Platform                      C66x   Cortex A9  Intel i7-2600  K20
Actual Gflops / Peak Gflops   12%    7%         4%             3%
Gflops                        15.4   0.7        17.1           108.6
Power (W)                     5.7    4.8        52.5           79.0
Gflops/W                      2.69   0.2        0.3            1.4

Results and Analysis
- We achieve linear scalability across the DSP's cores
- The power efficiency of the DSP is low when its cores are only partially utilized, due to static power consumption

Results and Analysis
- Performance is correlated with the window size
- Software pipeline performance: loop flattening improves performance significantly for small window sizes

Conclusion
- First research work on a DSP-accelerated Lucas Kanade method
- Achieves higher energy efficiency and device utilization than the GPU and CPU

Q & A

Kernels of the Pyramidal Lucas Kanade Method
(Window size = 16, pyramid levels = 4, iterations = 10)

Kernel                             Cost
Gaussian blur                      28 flop/pixel
Derivative computation             8 flop/pixel
Least square method                105 flop/pixel
Flow field bilinear interpolation  3 flop/pixel