1/24 Exploring the Design Space of a Parallel Object Recognition System
Bor-Yiing Su, Kurt Keutzer, and the PALLAS group
Parallel Computing Lab, University of California, Berkeley
2/24 Successes of the Pallas Group
- MRI: Murphy M, Keutzer K, Vasanawala S, Lustig M (2010) Clinically Feasible Reconstruction for L1-SPIRiT Parallel Imaging and Compressed Sensing MRI. ISMRM.
- SVM: Catanzaro B, Sundaram N, Keutzer K (2008) Fast support vector machine training and classification on graphics processors. ICML 2008.
- Contour: Catanzaro B, Su B, Sundaram N, Lee Y, Murphy M, Keutzer K (2009) Efficient, high quality image contour detector. ICCV 2009.
- Object Recognition: Su B, Brutch T, Keutzer K (2011) A parallel region based object recognition system. WACV.
- Optical Flow: Sundaram N, Brox T, Keutzer K (2010) Dense Point Trajectories by GPU-accelerated Large Displacement Optical Flow. ECCV.
- Speech: Chong J, Gonina E, You K, Keutzer K (2010) Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms. International Speech Communication Association.
- Value-at-risk: Dixon M, Chong J, Keutzer K (2009) Acceleration of market value-at-risk estimation. Workshop on High Performance Computing in Finance at Super Computing.
3/24 Outline
- Design Space
- An Object Recognition System
- Exploring the Design Space of the Object Recognition System
- Conclusion
4/24 Three Layers of Design Space
The design space of parallel applications is composed of three layers:
- Algorithm Layer: using different ways to transform the same inputs into the same or similar outputs. Example: Lanczos solver (full, selective, or no re-orthogonalization with the CW test).
- Parallelization Strategy Layer: using different strategies to parallelize the same algorithm. Example: BFS graph traversal (graph partition or parallel task queue).
- Platform Layer: using specific hardware features to optimize the same parallelization strategy. Example: blocking dimensions (2 x 8, 4 x 4, or 8 x 2).
5/24 Statements
1. Exploring the design space is necessary to achieve high performance on a hardware platform of choice.
2. Taking advantage of domain knowledge is necessary to understand the trade-offs among different parallelization methods and to achieve peak performance.
[Figure: the design space, composed of the Algorithm Layer, the Parallelization Strategy Layer, and the Platform Layer]
6/24 Outline
- Design Space
- An Object Recognition System
- Exploring the Design Space of the Object Recognition System
- Future Work
7/24 Object Recognition System
[Figure: image queries are fed into the object recognition system, which produces the outputs. Trained categories: bottles, swans, mugs, giraffes.]
8/24 Targeting Object Recognition System
C. Gu, J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009.
[Figure: Training Flow and Classification Flow]
9/24 Outline
- Design Space
- An Object Recognition System
- Exploring the Design Space of the Object Recognition System
  - Lanczos Eigensolver
  - Breadth First Search (BFS) Graph Traversal Kernel
  - Pair-wise Distance Kernel
  - Overall Performance
- Demo
- Conclusion
10/24 Lanczos Eigensolver
- The Lanczos eigensolver is the most time-consuming part of the contour detection computation.
- By linking each pixel to its neighbors, we can construct an affinity matrix among the pixels.
- We can find good contours by running the normalized cut spectral graph partitioning algorithm on the resulting graph.
- The normalized cut can be solved by finding the eigenvectors of the affinity matrix.
11/24 Exploring the Algorithm Layer
Three different re-orthogonalization methods:
- Full re-orthogonalization
- Selective re-orthogonalization
- No re-orthogonalization with the Cullum-Willoughby test
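To make the first option concrete, here is a minimal sketch of the Lanczos iteration with an optional full re-orthogonalization step, in plain Python. It is an illustration under stated assumptions, not the GPU implementation from the talk: the function name `lanczos`, its parameters, and the dense list-of-lists matrix are all hypothetical, and the selective variant and the Cullum-Willoughby test are not shown.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def lanczos(A, v0, m, reorthogonalize=True):
    """Run up to m Lanczos steps on a symmetric matrix A.

    Returns (alphas, betas, Q): the diagonal and off-diagonal of the
    tridiagonal matrix T, and the list of Lanczos vectors.
    """
    norm = math.sqrt(dot(v0, v0))
    q = [x / norm for x in v0]
    Q, alphas, betas = [q], [], []
    q_prev, beta = [0.0] * len(v0), 0.0
    for _ in range(m):
        w = matvec(A, q)
        alpha = dot(w, q)
        # Three-term recurrence: remove the components along q and q_prev.
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        if reorthogonalize:
            # Full re-orthogonalization: project out every earlier Lanczos vector.
            for u in Q:
                c = dot(w, u)
                w = [wi - c * ui for wi, ui in zip(w, u)]
        alphas.append(alpha)
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12 or len(Q) == m:
            break
        betas.append(beta)
        q_prev, q = q, [wi / beta for wi in w]
        Q.append(q)
    return alphas, betas, Q
```

Full re-orthogonalization costs one extra projection per stored vector at every step, which is why skipping it (and filtering spurious eigenvalues afterwards) can be much cheaper when many iterations are needed.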
12/24 Experimental Results
Method      Runtime (s)
Matlab      227
Trlan       170
ParaMat     151.2
Full        15.83
Selective   3.6
No          0.81
Explored design space: three different re-orthogonalization methods.
Conclusion: use the no re-orthogonalization method with the Cullum-Willoughby test on a GPU in our system.
13/24 BFS Graph Traversal Kernel
- The image segmentation component relies heavily on the BFS graph traversal kernel.
- Image graph:
  - Nodes represent image pixels
  - Edges represent neighboring relationships
- BFS graph traversal kernel: propagates information from some pixels to other pixels.
14/24 Exploring the Algorithm Layer
- Direct algorithm: each source node propagates information to its neighbors.
  - Traditional BFS graph traversal algorithm
- Reverse algorithm: each node checks whether it can be updated by one or more neighboring nodes.
  - Structured grid computation
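The two formulations can be contrasted with a small plain-Python sketch on a 4-connected pixel grid. It assumes the propagated information is the hop distance to the nearest source pixel, which is only an illustrative choice; the function names `bfs_direct` and `bfs_reverse` are hypothetical and the code is serial, so it shows the access pattern of each algorithm rather than a parallel implementation.

```python
from collections import deque

INF = float("inf")

def neighbors(r, c, rows, cols):
    """Yield the 4-connected neighbors of pixel (r, c) inside the grid."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc

def bfs_direct(rows, cols, sources):
    """Direct algorithm: each frontier node pushes its label to its neighbors."""
    dist = [[INF] * cols for _ in range(rows)]
    queue = deque()
    for r, c in sources:
        dist[r][c] = 0
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in neighbors(r, c, rows, cols):
            if dist[nr][nc] == INF:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist

def bfs_reverse(rows, cols, sources):
    """Reverse algorithm: every node pulls from its neighbors on each sweep."""
    dist = [[INF] * cols for _ in range(rows)]
    for r, c in sources:
        dist[r][c] = 0
    changed = True
    while changed:
        changed = False
        new = [row[:] for row in dist]
        # Each cell reads only the previous sweep, so all cells are independent.
        for r in range(rows):
            for c in range(cols):
                best = min(
                    (dist[nr][nc] for nr, nc in neighbors(r, c, rows, cols)),
                    default=INF,
                ) + 1
                if best < new[r][c]:
                    new[r][c] = best
                    changed = True
        dist = new
    return dist
```

The reverse version does more total work, since every pixel is visited on every sweep, but each sweep is a regular, independent update over the whole grid. The number of sweeps grows with the maximum traversal distance.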
15/24 Exploring the Parallelization Strategy Layer
Two strategies can be used to parallelize the traditional BFS graph traversal algorithm:
- Graph partition
- Parallel task queue...
[Figure: Graph Partition and Task Queue]
16/24 Associating Design Space Exploration with Input Data Properties
Explored design space:
- Parallel task queue on Intel Core i7 using OpenMP with 8 threads
- Graph partition on Intel Core i7 using OpenMP with 8 threads
- Structured grid on Nvidia GTX 480
[Plot: computation time (s) versus maximum traversal distance]
Conclusion: use the structured grid method on a GPU in our system.
17/24 Pair-wise χ² Distance Kernel
- In both the training stage and the classification stage, we need to compute the pair-wise distances between two region sets.
  - Similar regions have a shorter distance
  - Different regions have a longer distance
- The computation has the structure of a matrix-matrix multiplication, with the dot product replaced by the χ² distance.
- Definition of the χ² distance:
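A minimal plain-Python sketch of the pair-wise computation described above. It assumes the commonly used form of the χ² distance between two histograms, χ²(x, y) = ½ Σ_i (x_i − y_i)² / (x_i + y_i), with bins whose sum is zero skipped; the exact definition used in the system may differ. The function names `chi2_distance` and `pairwise_chi2` are hypothetical.

```python
def chi2_distance(x, y):
    """Chi-squared distance between two equal-length non-negative histograms.

    Assumed form: 0.5 * sum((x_i - y_i)^2 / (x_i + y_i)), skipping empty bins.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        s = xi + yi
        if s > 0:
            total += (xi - yi) ** 2 / s
    return 0.5 * total

def pairwise_chi2(A, B):
    """Distance table D where D[i][j] = chi2_distance(A[i], B[j]).

    The loop nest mirrors a matrix-matrix multiplication in which the
    dot product of row i and row j is replaced by the chi-squared distance.
    """
    return [[chi2_distance(a, b) for b in B] for a in A]
```

For example, `pairwise_chi2(train_regions, query_regions)` (both hypothetical lists of feature histograms) yields one row per training region and one column per query region.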
18/24 Exploring the Design Space
- Algorithm Layer
  - Inner χ² distance
  - Outer χ² distance
- Platform Layer: cache mechanisms
  - No cache
  - Hardware controlled cache (texture memory on GPU)
  - Software controlled cache (shared memory on GPU)
19/24 Associating Design Space Exploration with Input Data Properties
Explored design space: the combination of two algorithms and three cache mechanisms on Nvidia GTX 480.
Conclusion:
- Use the outer χ² distance method with software controlled cache in the training stage.
- Use the inner χ² distance method with software controlled cache in the classification stage.
20/24 Overall Performance: Speedups
[Two tables, labeled Training and Classification, each with the columns Computation, Serial computation time (s), Parallel computation time (s), and Speedup. One table has the rows Contour, Segmentation, Feature, Hough Voting, and Total; the other has the rows Feature, Distance, Weight, and Total.]
21/24 Overall Performance: Detection Accuracy
[Plots: detection rate versus FPPI for the serial and the parallel implementations]
22/24 Demo
23/24 Conclusion
- Exploring the design space is necessary to achieve high performance on a hardware platform of choice.
- Taking advantage of domain knowledge is necessary to understand the trade-offs among different parallelization methods and to achieve peak performance.
- We have developed a parallel object recognition system that achieves a 110x-120x speedup with comparable detection accuracy.
- Work presented at ICCV 2009 and WACV 2011:
  - Bor-Yiing Su, Tasneem Brutch, Kurt Keutzer, "A Parallel Region Based Object Recognition System," in IEEE Workshop on Applications of Computer Vision (WACV 2011), Hawaii, January 2011.
  - Bryan Catanzaro, Bor-Yiing Su, Narayanan Sundaram, Yunsup Lee, Mark Murphy, and Kurt Keutzer, "Efficient, High-Quality Image Contour Detection," in International Conference on Computer Vision (ICCV-2009), Kyoto, Japan, September 2009.
24/24 Thank You