Exploring the Design Space of a Parallel Object Recognition System
Bor-Yiing Su, Kurt Keutzer, and the PALLAS group

Presentation transcript:

1/24 Exploring the Design Space of a Parallel Object Recognition System Bor-Yiing Su, Kurt Keutzer, and the PALLAS group Parallel Computing Lab, University of California, Berkeley

2/24 Successes of the Pallas Group
- MRI: Murphy M, Keutzer K, Vasanawala S, Lustig M (2010) Clinically Feasible Reconstruction for L1-SPIRiT Parallel Imaging and Compressed Sensing MRI. ISMRM
- SVM: Catanzaro B, Sundaram N, Keutzer K (2008) Fast support vector machine training and classification on graphics processors. ICML 2008
- Contour: Catanzaro B, Su B, Sundaram N, Lee Y, Murphy M, Keutzer K (2009) Efficient, high quality image contour detector. ICCV 2009
- Object Recognition: Su B, Brutch T, Keutzer K (2011) A parallel region based object recognition system. WACV
- Optical Flow: Sundaram N, Brox T, Keutzer K (2010) Dense Point Trajectories by GPU-accelerated Large Displacement Optical Flow. ECCV
- Speech: Chong J, Gonina E, You K, Keutzer K (2010) Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms. International Speech Communication Association
- Value-at-risk: Dixon M, Chong J, Keutzer K (2009) Acceleration of market value-at-risk estimation. Workshop on High Performance Computing in Finance at Super Computing

3/24 Outline
- Design Space
- An Object Recognition System
- Exploring the Design Space of the Object Recognition System
- Conclusion

4/24 Three Layers of Design Space
- The design space of parallel applications is composed of three layers
- Algorithm Layer: using different ways to transform the same inputs into the same or similar outputs. Example: Lanczos solver (full, selective, or no re-orthogonalization with the CW test)
- Parallelization Strategy Layer: using different strategies to parallelize the same algorithm. Example: BFS graph traversal (graph partition or parallel task queue)
- Platform Layer: using specific hardware features to optimize the same parallelization strategy. Example: blocking dimensions (2 x 8, 4 x 4, or 8 x 2)

5/24 Statements
1. Exploring the design space is necessary to achieve high performance on a hardware platform of choice
2. Taking advantage of domain knowledge is necessary to understand the trade-offs among different parallelization methods and to achieve peak performance
[Figure: the design space, made up of the algorithm layer, the parallelization strategy layer, and the platform layer]

6/24 Outline
- Design Space
- An Object Recognition System
- Exploring the Design Space of the Object Recognition System
- Future Work

7/24 Object Recognition System
[Figure: image queries are fed to the object recognition system, which produces the outputs. Trained categories: bottles, swans, mugs, giraffes]

8/24 Target Object Recognition System
- C. Gu, J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009
[Figure: training flow and classification flow]

9/24 Outline
- Design Space
- An Object Recognition System
- Exploring the Design Space of the Object Recognition System
  - Lanczos Eigensolver
  - Breadth First Search (BFS) Graph Traversal Kernel
  - Pair-wise Distance Kernel
  - Overall Performance
  - Demo
- Conclusion

10/24 Lanczos Eigensolver
- The Lanczos eigensolver is the most time-consuming part of the contour detection computation
- By linking each pixel to its neighbors, we can construct an affinity matrix over the pixels
- We can find good contours by running the normalized cut spectral graph partitioning algorithm on the resulting pixel graph
- The normalized cut can be solved by finding the eigenvectors of the affinity matrix
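The link between the affinity matrix and a partition can be illustrated on a toy case. The sketch below is not the system's implementation: it assumes the standard Shi-Malik formulation, in which the normalized cut is relaxed to the generalized eigenproblem (D − W)v = λDv built from the affinity matrix W, and it uses a dense NumPy eigensolver instead of Lanczos. The six-pixel 1-D "image", the Gaussian affinity, and its width 0.1 are illustrative assumptions.

```python
import numpy as np

def normalized_cut_bipartition(W):
    """Split a graph in two using the second eigenvector of the normalized Laplacian."""
    d = W.sum(axis=1)                      # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    # Map the second eigenvector back to the generalized problem (D - W) v = lambda D v
    v = d_inv_sqrt * eigvecs[:, 1]
    return v > 0

# Toy 1-D "image" (hypothetical data): dark pixels on the left, bright on the right
intensities = np.array([0.10, 0.15, 0.12, 0.90, 0.95, 0.92])
n = len(intensities)
W = np.zeros((n, n))
for i in range(n - 1):
    # Link each pixel to its right-hand neighbor; similar intensities give high affinity
    w = np.exp(-((intensities[i] - intensities[i + 1]) ** 2) / 0.1)
    W[i, i + 1] = W[i + 1, i] = w

labels = normalized_cut_bipartition(W)
print(labels)
```

The sign pattern of the second eigenvector places the cut on the weak edge between the third and fourth pixels, which is where a contour would be reported.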

11/24 Exploring the Algorithm Layer
- Three different re-orthogonalization methods
  - Full re-orthogonalization
  - Selective re-orthogonalization
  - No re-orthogonalization with the Cullum-Willoughby test
[Figure: panels for full, selective, and no re-orthogonalization]
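To make the difference between these variants concrete, here is a minimal NumPy sketch of the Lanczos iteration with a switch between full re-orthogonalization and none. It is an illustration under stated assumptions, not the GPU solver from the talk: the function name `lanczos`, its parameters, and the random test matrix are hypothetical, and neither selective re-orthogonalization nor the Cullum-Willoughby test for discarding spurious eigenvalues is implemented.

```python
import numpy as np

def lanczos(A, k, reorthogonalize="full", seed=0):
    """Run k Lanczos steps on symmetric A and return the Ritz values (eigenvalues of T)."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    V = np.zeros((n, k))          # Lanczos vectors, one per column
    alphas = np.zeros(k)          # diagonal of the tridiagonal matrix T
    betas = np.zeros(k - 1)       # off-diagonal of T
    v = rng.standard_normal(n)
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = A @ V[:, j]
        alphas[j] = V[:, j] @ w
        # Three-term recurrence: remove the components along the last two vectors
        w -= alphas[j] * V[:, j]
        if j > 0:
            w -= betas[j - 1] * V[:, j - 1]
        if reorthogonalize == "full":
            # Full re-orthogonalization: project out every previous Lanczos vector
            w -= V[:, : j + 1] @ (V[:, : j + 1].T @ w)
        if j < k - 1:
            betas[j] = np.linalg.norm(w)
            V[:, j + 1] = w / betas[j]
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    return np.linalg.eigvalsh(T)

# Hypothetical test matrix: a random 40 x 40 symmetric matrix
rng = np.random.default_rng(1)
M = rng.standard_normal((40, 40))
A = (M + M.T) / 2
exact = np.linalg.eigvalsh(A)
ritz_full = lanczos(A, 40, reorthogonalize="full")
ritz_none = lanczos(A, 40, reorthogonalize="none")
print(np.max(np.abs(ritz_full - exact)))
```

Full re-orthogonalization costs an extra projection against all previous vectors at every step, which is why skipping it is so much cheaper. Without it the Lanczos vectors gradually lose orthogonality in floating point and duplicate or spurious Ritz values can appear, which is the problem a test such as Cullum-Willoughby is meant to filter out.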

12/24 Experimental Results

Method      Runtime (s)
Matlab      227
Trlan       170
ParaMat     151.2
Full        15.83
Selective   3.6
No          0.81

- Explored design space: three different re-orthogonalization methods
- Conclusion: use the no re-orthogonalization with Cullum-Willoughby test method on a GPU in our system

13/24 BFS Graph Traversal Kernel
- The image segmentation component relies heavily on the BFS graph traversal kernel
- Image graph:
  - Nodes represent image pixels
  - Edges represent neighboring relationships
- BFS graph traversal kernel: propagates information from some pixels to other pixels

14/24 Exploring the Algorithm Layer
- Direct algorithm: each source node propagates information to its neighbors
  - Traditional BFS graph traversal algorithm
- Reverse algorithm: each node checks whether it can be updated by one or more neighboring nodes
  - Structured grid computation
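The two algorithms can be contrasted on a small pixel grid. The sketch below is an assumption-laden illustration rather than the kernels from the talk: the "information" being propagated is simply the hop distance from a set of source pixels, the grid is 4-connected with no blocked edges, and the function names are hypothetical. The direct version pushes values outward through a queue; the reverse version sweeps the whole grid and lets every pixel pull the best value from its neighbors, which is the regular access pattern a structured grid computation has.

```python
from collections import deque
import numpy as np

def bfs_direct(sources, shape):
    """Queue-based BFS: each visited pixel pushes its distance to its 4-neighbors."""
    dist = np.full(shape, -1, dtype=int)   # -1 marks unvisited pixels
    queue = deque()
    for r, c in sources:
        dist[r, c] = 0
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < shape[0] and 0 <= nc < shape[1] and dist[nr, nc] < 0:
                dist[nr, nc] = dist[r, c] + 1
                queue.append((nr, nc))
    return dist

def bfs_reverse(sources, shape):
    """Structured-grid sweeps: every pixel pulls the smallest distance from its neighbors."""
    big = shape[0] * shape[1]              # larger than any distance on this grid
    dist = np.full(shape, big, dtype=int)
    for r, c in sources:
        dist[r, c] = 0
    while True:
        padded = np.pad(dist, 1, constant_values=big)
        neighbor_min = np.minimum.reduce([
            padded[:-2, 1:-1], padded[2:, 1:-1],   # up, down
            padded[1:-1, :-2], padded[1:-1, 2:],   # left, right
        ])
        updated = np.minimum(dist, neighbor_min + 1)
        if np.array_equal(updated, dist):  # no pixel changed: converged
            return dist
        dist = updated

sources = [(0, 0), (4, 6)]
direct = bfs_direct(sources, (5, 7))
reverse = bfs_reverse(sources, (5, 7))
print(np.array_equal(direct, reverse))
```

The reverse version touches every pixel on every sweep, so its cost grows with the number of sweeps, that is, with the maximum traversal distance. In exchange, each sweep is a uniform data-parallel update with no queue to contend over.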

15/24 Exploring the Parallelization Strategy Layer
- Two strategies can be used to parallelize the traditional BFS graph traversal algorithm
  - Graph partition
  - Parallel task queue
[Figure: graph partition versus task queue]
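One way to picture the difference between the two strategies is in how frontier nodes are assigned to workers. The sketch below is a hypothetical illustration, not the OpenMP code evaluated in the talk: it runs a level-synchronous BFS on a 4-connected grid with a Python thread pool, and the `strategy` parameter, function names, and row-strip partitioning are all assumptions. Under "partition" each worker owns a fixed strip of rows and expands only frontier nodes inside it; under "task_queue" frontier nodes are dealt out to workers regardless of location.

```python
from concurrent.futures import ThreadPoolExecutor

def expand(chunk, shape):
    """Return all in-bounds 4-neighbors of the nodes in one chunk of the frontier."""
    out = []
    for r, c in chunk:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < shape[0] and 0 <= nc < shape[1]:
                out.append((nr, nc))
    return out

def parallel_bfs(sources, shape, strategy="task_queue", workers=4):
    dist = {s: 0 for s in sources}
    frontier = list(dist)
    level = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            level += 1
            if strategy == "partition":
                # Graph partition: worker w owns a fixed strip of rows
                chunks = [[] for _ in range(workers)]
                for r, c in frontier:
                    chunks[r * workers // shape[0]].append((r, c))
            else:
                # Task queue: frontier nodes are dealt out round-robin as tasks
                chunks = [frontier[i::workers] for i in range(workers)]
            results = pool.map(lambda ch: expand(ch, shape), chunks)
            # Merge step (serial here): keep only nodes reached for the first time
            frontier = []
            for candidates in results:
                for node in candidates:
                    if node not in dist:
                        dist[node] = level
                        frontier.append(node)
    return dist

by_queue = parallel_bfs([(0, 0)], (4, 5), strategy="task_queue")
by_partition = parallel_bfs([(0, 0)], (4, 5), strategy="partition")
print(by_queue == by_partition)
```

With a partition, load balance depends on where the frontier happens to be: early in this example only the strip containing the source has any work. A task queue balances the load but gives up locality. Python threads will not show a real speedup here because of the global interpreter lock; the code only illustrates the work assignment.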

16/24 Associating Design Space Exploration with Input Data Properties
- Explored design space
  - Parallel task queue on Intel Core i7 using OpenMP with 8 threads
  - Graph partition on Intel Core i7 using OpenMP with 8 threads
  - Structured grid on Nvidia GTX 480
[Figure: computation time (s) versus maximum traversal distance]
- Conclusion: use the structured grid method on a GPU in our system

17/24 Pair-wise χ² Distance Kernel
- In both the training stage and the classification stage, we need to compute the pair-wise distances between two region sets
  - Similar regions have shorter distances
  - Different regions have longer distances
- The computation has the structure of a matrix-matrix multiplication, with the dot product replaced by the χ² distance
[Equation: definition of the χ² distance]
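As a point of reference, one common form of the χ² distance between two non-negative histograms x and y is 0.5 · Σ_k (x_k − y_k)² / (x_k + y_k). Whether the slide used exactly this form, including the factor 0.5, is an assumption. The NumPy sketch below computes all pair-wise distances between two sets of feature vectors; the function name `chi2_pairwise` and the `eps` guard for empty bins are illustrative choices.

```python
import numpy as np

def chi2_pairwise(X, Y, eps=1e-12):
    """Chi-squared distance between every row of X and every row of Y.

    X has shape (m, d), Y has shape (n, d); entries are assumed non-negative.
    The result has shape (m, n), like the matrix product X @ Y.T.
    """
    diff = X[:, None, :] - Y[None, :, :]       # shape (m, n, d)
    total = X[:, None, :] + Y[None, :, :]
    # Where both bins are empty the numerator is 0, so the eps guard yields 0
    return 0.5 * np.sum(diff ** 2 / np.maximum(total, eps), axis=2)

X = np.array([[1.0, 0.0], [0.5, 0.5]])
Y = np.array([[1.0, 0.0]])
D = chi2_pairwise(X, Y)
print(D)
```

The output has one entry per pair of rows, exactly like X @ Y.T, and each entry is a reduction over the feature dimension. Only the per-element operation differs from a dot product, which is why the kernel inherits the blocking and caching concerns of matrix multiplication.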

18/24 Exploring the Design Space
- Algorithm layer
  - Inner χ² distance
  - Outer χ² distance
- Platform layer: cache mechanisms
  - No cache
  - Hardware-controlled cache (texture memory on GPU)
  - Software-controlled cache (shared memory on GPU)
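The slide does not define "inner" and "outer". The sketch below assumes the names mirror the inner-product and outer-product formulations of matrix multiplication: the inner form finishes one output entry at a time by looping over the feature dimension, while the outer form loops over the feature dimension outermost and adds one bin's contribution to every output entry. Both function names, the 0.5 factor, and the random test data are assumptions.

```python
import numpy as np

def chi2_inner(X, Y):
    """Inner form: for each pair (i, j), reduce over the feature dimension k."""
    m, n, d = X.shape[0], Y.shape[0], X.shape[1]
    D = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for k in range(d):
                s = X[i, k] + Y[j, k]
                if s > 0:
                    acc += (X[i, k] - Y[j, k]) ** 2 / s
            D[i, j] = 0.5 * acc
    return D

def chi2_outer(X, Y):
    """Outer form: for each feature k, add its contribution to the whole (m, n) output."""
    m, n, d = X.shape[0], Y.shape[0], X.shape[1]
    D = np.zeros((m, n))
    for k in range(d):
        xk = X[:, k][:, None]                  # column of shape (m, 1)
        yk = Y[:, k][None, :]                  # row of shape (1, n)
        s = xk + yk
        safe = np.where(s > 0, s, 1.0)         # avoid dividing by zero on empty bins
        D += np.where(s > 0, (xk - yk) ** 2 / safe, 0.0)
    return 0.5 * D

rng = np.random.default_rng(0)
X = rng.random((5, 8))
Y = rng.random((3, 8))
print(np.allclose(chi2_inner(X, Y), chi2_outer(X, Y)))
```

Both orderings produce the same matrix. They differ in which data is reused between consecutive operations, so the better choice can depend on the shapes of the two region sets and on the cache mechanism in use.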

19/24 Associating Design Space Exploration with Input Data Properties
- Explored design space: the combination of two algorithms and three cache mechanisms on Nvidia GTX 480
- Conclusion:
  - Use the outer χ² distance method with software-controlled cache in the training stage
  - Use the inner χ² distance method with software-controlled cache in the classification stage

20/24 Overall Performance: Speedups
[Two tables, one for training and one for classification, each with columns Computation, Serial computation time (s), Parallel computation time (s), and Speedup. Rows of the first table: Contour, Segmentation, Feature, Hough Voting, Total. Rows of the second table: Feature, Distance, Weight, Total.]

21/24 Overall Performance: Detection Accuracy
[Figure: detection rate versus FPPI (false positives per image) for the serial and parallel implementations]

22/24 Demo

23/24 Conclusion
- Exploring the design space is necessary to achieve high performance on a hardware platform of choice
- Taking advantage of domain knowledge is necessary to understand the trade-offs among different parallelization methods and to achieve peak performance
- We have developed a parallel object recognition system with comparable detection accuracy while achieving a 110x-120x speedup
- Work presented at ICCV 2009 and WACV 2011
  - Bor-Yiing Su, Tasneem Brutch, Kurt Keutzer, "A Parallel Region Based Object Recognition System," in IEEE Workshop on Applications of Computer Vision (WACV 2011), Hawaii, January 2011
  - Bryan Catanzaro, Bor-Yiing Su, Narayanan Sundaram, Yunsup Lee, Mark Murphy, and Kurt Keutzer, "Efficient, High-Quality Image Contour Detection," in International Conference on Computer Vision (ICCV 2009), Kyoto, Japan, September 2009

24/24 Thank You