Empowering visual categorization with the GPU. Presented by 陳群元

Outline
- Introduction
- Overview of visual categorization
  - Image feature extraction
  - Category model learning
  - Test image classification
- GPU accelerated categorization
- Experimental setup
- Results

Introduction
- Use the GPU to accelerate the quantization and classification components of a visual categorization architecture.
- The algorithms and their implementations should push the state of the art in categorization accuracy.
- Visual categorization must be decomposable into components so that bottlenecks can be located.
- Given the same input, implementations of a component on different hardware architectures must give the same output.

Overview

Visual categorization system
- Image feature extraction
- Category model learning
- Test image classification

Visual categorization system
- Image feature extraction
  - Point sampling strategy
  - Descriptor computation
  - Bag-of-words
- Category model learning
- Test image classification

Point sampling strategy
- Dense sampling
  - Typically, around 10,000 points are sampled per image
- Salient point methods
  - Harris-Laplace salient point detector [29]
  - Difference-of-Gaussians detector [28]

Descriptors
- SIFT descriptor: 128 dimensions
  - 10 frames per second for 640x480 images (GPU)
- SURF descriptor
  - 100 frames per second for 640x480 images (GPU)
- ColorSIFT descriptor: 384 dimensions
  - Three times the length of SIFT

Bag-of-words
- Vector quantization is computationally the most expensive part of the bag-of-words model.
- Bag -> image set
- Words -> features

Bag-of-words
- n descriptors of length d in an image
- Codebook with m elements
- Brute-force assignment costs O(ndm) per image
- A tree-based codebook reduces this to O(nd log(m)), which is real-time on the GPU [25].
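The O(ndm) cost above can be made concrete with a minimal NumPy sketch of brute-force vector quantization. The function name `bag_of_words` and the toy data are illustrative assumptions, not taken from the slides; the sketch simply compares every one of the n descriptors with every one of the m codebook elements over d dimensions and counts how often each codeword is the nearest one.

```python
import numpy as np

def bag_of_words(descriptors, codebook):
    """Brute-force vector quantization: O(n*d*m) work for one image."""
    # Differences between every descriptor and every codeword: shape (n, m, d).
    diff = descriptors[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distances: shape (n, m).
    dist = np.sum(diff * diff, axis=2)
    # Index of the nearest codeword for each descriptor.
    nearest = np.argmin(dist, axis=1)
    # Histogram with one bin per codeword.
    return np.bincount(nearest, minlength=codebook.shape[0])

# Toy data (hypothetical): m = 3 codewords, n = 4 descriptors, d = 2.
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[0.1, 0.2], [9.0, 9.5], [0.3, -0.1], [1.0, 9.0]])
histogram = bag_of_words(descriptors, codebook)
print(histogram)
```

The explicit (n, m, d) array is only practical for small inputs; it is written this way to make the three nested dimensions of the cost visible.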

Category model learning
- Precompute kernel function values
- Kernel-based SVM algorithm

- Support Vector Machines
- Kernel Support Vector Machines

Test image classification

Outline
- Introduction
- Overview of visual categorization
  - Image feature extraction
  - Category model learning
  - Test image classification
- GPU accelerated categorization
  - Parallel programming on the GPU and CPU
  - GPU-accelerated vector quantization
  - GPU-accelerated kernel value precomputation
- Experimental setup
- Results

Parallel Programming on the GPU and CPU
- SIMD instructions perform the same operation on multiple data elements at the same time.

GPU-Accelerated Vector Quantization
- The most expensive computational step in vector quantization is the calculation of the n x m distance matrix.
- A: n x d matrix with all image descriptors as rows
- B: m x d matrix with all codebook elements as rows

GPU-Accelerated Vector Quantization (cont.)

GPU-Accelerated Vector Quantization (cont.)
- Compute the dot products between all rows of A and B (line 7).
- Matrix multiplications are the building block for many algorithms, so highly optimized BLAS linear algebra libraries containing this operation exist for both the CPU and the GPU.
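The algorithm listing that "line 7" refers to did not survive in this transcript. As a hedged sketch of why dot products are the core of the distance matrix, the standard expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b lets the whole n x m matrix be built from one matrix multiplication of A with the transpose of B. The function name `squared_distance_matrix` is an assumption for illustration, and NumPy on the CPU stands in for a GPU BLAS call.

```python
import numpy as np

def squared_distance_matrix(A, B):
    """Squared Euclidean distances between rows of A (n x d) and B (m x d)."""
    # Squared norms of every row, shaped for broadcasting to (n, m).
    a_norms = np.sum(A * A, axis=1)[:, None]
    b_norms = np.sum(B * B, axis=1)[None, :]
    # All dot products at once: a single matrix multiplication (a BLAS GEMM).
    dots = A @ B.T
    dist = a_norms + b_norms - 2.0 * dots
    # Rounding can produce tiny negative values; clamp them to zero.
    return np.maximum(dist, 0.0)

rng = np.random.default_rng(0)
A = rng.random((5, 8))   # 5 descriptors of length 8 (toy sizes)
B = rng.random((3, 8))   # 3 codebook elements
D = squared_distance_matrix(A, B)
nearest = np.argmin(D, axis=1)  # codeword assignment per descriptor
```

Compared with forming all pairwise differences explicitly, this formulation never materializes an (n, m, d) array, which is what makes it a good fit for an optimized matrix-multiplication routine.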

GPU-Accelerated Kernel Value Precomputation
- To compute kernel function values, we use a kernel function based on a distance:
  - the distance between feature vectors F and F'
  - the kernel function based on this distance
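The formulas for the distance and the kernel are missing from this transcript. The sketch below therefore assumes a chi-square distance with an exponential kernel, a common pairing for bag-of-words histograms; this choice, the `scale` parameter, and the function names are assumptions and may differ from what the slides showed.

```python
import numpy as np

def chi2_distance(f, g):
    """Assumed distance: 0.5 * sum((f_i - g_i)^2 / (f_i + g_i))."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    num = (f - g) ** 2
    den = f + g
    # Bins that are empty in both histograms contribute zero.
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return 0.5 * np.sum(terms)

def chi2_kernel(f, g, scale=1.0):
    """Assumed kernel: exp(-distance / scale); scale is a hypothetical normalizer."""
    return np.exp(-chi2_distance(f, g) / scale)

f = np.array([1.0, 0.0, 3.0])
g = np.array([1.0, 0.0, 1.0])
k_fg = chi2_kernel(f, g)
k_ff = chi2_kernel(f, f)
```

A kernel of this form equals 1 for identical feature vectors and decays toward 0 as the distance grows.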

GPU-Accelerated Kernel Value Precomputation (cont.)
- Multiple input features
- For kernel value precomputation, memory usage is an important problem.
- For a dataset with 50,000 images, the input data is 12 GB and the output data is 19 GB.
- To avoid holding all data in memory simultaneously, we divide the processing into evenly sized chunks (1024 x 1024).
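As a rough check on the output size, and assuming 8-byte values, a 50,000 x 50,000 kernel matrix takes 50,000 x 50,000 x 8 = 2 x 10^10 bytes, about 18.6 GiB, which is in line with the 19 GB figure. The chunking idea can be sketched as follows; the chi-square kernel, the function names, and the in-memory output array are assumptions for illustration (a real run would write each block to disk or to a memory-mapped file instead).

```python
import numpy as np

def pairwise_chi2(X, Y):
    """Assumed chi-square distances between rows of X (p x d) and Y (q x d)."""
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :]
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return 0.5 * terms.sum(axis=2)

def kernel_matrix_chunked(X, chunk=1024, scale=1.0):
    """Fill the kernel matrix one chunk x chunk block at a time."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(0, n, chunk):
        for j in range(0, n, chunk):
            # Only the two row blocks needed for this tile are touched.
            block = pairwise_chi2(X[i:i + chunk], X[j:j + chunk])
            K[i:i + chunk, j:j + chunk] = np.exp(-block / scale)
    return K

rng = np.random.default_rng(1)
X = rng.random((7, 4))                       # 7 toy feature vectors
K_small = kernel_matrix_chunked(X, chunk=3)  # several tiles, ragged edges
K_full = kernel_matrix_chunked(X, chunk=100) # a single tile
```

Because each tile depends only on two blocks of input rows, the working set stays bounded by the chunk size regardless of how many images are in the dataset.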

GPU-Accelerated Kernel Value Precomputation (cont.)

Experimental setup
- Experiment 1: Vector Quantization Speed
  - The CPU implementation is SIMD-optimized.
  - Codebook of size m = 4,000
  - 20,000 descriptors per image
  - Descriptor lengths of d = 128 (SIFT) and d = 384 (ColorSIFT)
- Experiment 2: Kernel Value Precomputation Speed
  - Uses the large Mediamill Challenge training set of 30,993 frames
- Experiment 3: Visual Categorization Throughput
  - Comparison between the quad-core Core i7 920 CPU (2.66 GHz) and the GeForce GTX260 GPU (27 cores)

Results
- Experiment 1: Vector Quantization Speed
- Experiment 2: Kernel Value Precomputation Speed
- Experiment 3: Visual Categorization Throughput

Vector Quantization Speed (SIFT)

Vector Quantization Speed (ColorSIFT)

Kernel Value Precomputation Speed

Visual Categorization Throughput

Other applications
- Application 1: k-means clustering
- Application 2: Bag-of-words model for text retrieval
- Application 3: Multi-frame processing for video retrieval
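To illustrate why k-means clustering can reuse the same machinery, here is a hedged sketch of one k-means iteration in which the assignment step is exactly a vector quantization against the current centers, computed through a matrix multiplication. The function name `kmeans_step` and the toy data are assumptions; the slides do not show how the application was implemented.

```python
import numpy as np

def kmeans_step(points, centers):
    """One k-means iteration: assign points to centers, then update centers."""
    # Assignment step: squared distances via the dot-product expansion.
    dots = points @ centers.T
    dist = ((points ** 2).sum(axis=1)[:, None]
            + (centers ** 2).sum(axis=1)[None, :]
            - 2.0 * dots)
    labels = np.argmin(dist, axis=1)
    # Update step: each center moves to the mean of its members.
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        members = points[labels == k]
        if len(members) > 0:  # leave empty clusters where they are
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centers = np.array([[1.0, 1.0], [9.0, 9.0]])
labels, new_centers = kmeans_step(points, centers)
```

The assignment step dominates the cost for large codebooks, so accelerating the distance matrix speeds up codebook construction as well as quantization.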

Conclusions
- This paper provides an efficiency analysis of a state-of-the-art visual categorization pipeline based on the bag-of-words model.
- Two large bottlenecks were identified: the vector quantization step in image feature extraction and the kernel value computation in category classification.
- Compared to a multi-threaded CPU implementation on a quad-core CPU, the GPU is 4.8 times faster.

The end
- Thank you!

Conclusions
- This paper provides an efficiency analysis of a state-of-the-art visual categorization pipeline based on the bag-of-words model.
- Two large bottlenecks were identified: the vector quantization step in image feature extraction and the kernel value computation in category classification.
- The GPU implementation of vector quantization is 3.9 times faster than the same computation on a modern quad-core CPU.
- Precomputing the kernel values on the GPU instead of a quad-core CPU accelerates this step by a factor of 10.

Conclusions (cont.)
- Overall, by using a parallel implementation on the GPU, classifying unseen images is 17 times faster than with a single-threaded CPU version.
- Compared to a multi-threaded CPU implementation on a quad-core CPU, the GPU is 4.8 times faster.

Kernel SVM
- machine-learning.html#kernel_method
- mIM-/article?mid=25
- vector-machines-for.html
