Evaluation of Multi-core Architectures for Image Processing Algorithms Masters Thesis Presentation by Trupti Patil July 22, 2009

Overview  Motivation  Contribution & scope  Background  Platforms  Algorithms  Experimental Results  Conclusion

Motivation  Fast processing response a major requirement in many image processing applications.  Image processing algorithms can be computationally expensive  Data needs to be processed in parallel, and optimized for real-time execution  Recent introduction of massively-parallel computer architectures promising significant acceleration.  Some architectures haven’t been actively explored yet.

Overview  Motivation  Contribution & scope  Background  Platforms  Algorithms  Experimental Results  Conclusion

Contribution & scope of the thesis  This thesis adapts and optimizes three image processing and computer vision algorithms for four multi-core architectures.  Execution timings are measured for each algorithm on each platform.  The obtained timings are compared against available previous work on the same architecture (intra-class) and across architecture types (inter-class).  Conclusions are drawn from these comparisons.

Overview  Motivation  Contribution & scope  Background  Platforms  Algorithms  Implementation  Conclusion

Background  Need for Parallelization  SIMD Optimization  The need for faster execution time  Related work  Canny edge detection on CellBE [Gupta et al.] and on GPU [Luo et al.]  KLT tracking implementation on GPU [Sinha et al., Zach et al.]

Overview  Motivation  Contribution & scope  Background  Platforms  Algorithms  Implementation  Experimental Results  Conclusion

Hardware & Software Platforms
 NetBurst microarchitecture: Intel Pentium 4 HT; Linux (Ubuntu), Intel C++ Compiler 11.1
 Core microarchitecture: Intel Core 2 Duo Mobile; Linux (Ubuntu), Intel C++ Compiler 11.1
 Cell Broadband Engine (CBE): Sony PlayStation 3; Linux (Fedora), Cell SDK 3.1
 Graphics Processing Unit (GPU): Nvidia GeForce 8 Series; Linux (Fedora), CUDA 2.1

Intel NetBurst & Core Microarchitectures
NetBurst (Pentium 4 HT):
 Can execute legacy IA-32 and SIMD applications at a higher clock rate
 Hyper-Threading (HT) allows simultaneous multithreading, with two logical processors on each physical processor
 Supports SSE up to SSE3
Core (Core 2 Duo):
 Improved performance/watt
 SSSE3 support for effective utilization of the XMM registers
 Supports SSE4
 Scales up to quad-core

Cell Broadband Engine (CBE)
[Structural diagram of the Cell Broadband Engine: the PPE (a PPU with L1 instruction and data caches and an L2 cache) and the SPEs (each an SPU with a Memory Flow Controller (MFC) and a Local Store (LS)) are connected by the Element Interconnect Bus (EIB) to main memory, the graphics device, and I/O devices.]

Cell processor overview
 One Power-based PPE with VMX: 32/32 kB L1 I/D caches and 512 kB L2; dual-issue, in-order PPU with 2 hardware threads
 Eight SPEs with up to 16-way SIMD: dual-issue, in-order SPU; 128 registers (128 bits wide); 256 kB local store (LS); 2x 16 B/cycle DMA with 16 outstanding requests
 Element Interconnect Bus (EIB): 4 rings, 16 B wide (running at half the processor clock); 96 B/cycle peak, 16 B/cycle to memory; 2x 16 B/cycle to BIF and I/O
 External communication: dual XDR memory controller (MIC); two configurable bus interfaces (BIC) offering a classical I/O interface and an SMP-coherent interface

Graphics Processing Unit (GPU)
[Data-flow diagram: the application feeds vertices to the vertex processor; primitives are assembled and rasterized; the fragment processor shades the resulting fragments, sampling textures as needed; framebuffer operations write the output to the framebuffer.]

Nvidia GeForce 8 Series GPU
[Figure: graphics pipeline in the NVIDIA GeForce 8 Series GPU.]

Compute Unified Device Architecture (CUDA)
 The computing engine in Nvidia GPUs
 Exposes the GPU as a highly multithreaded coprocessor (compute device)
 Provides both a low-level driver API and a higher-level runtime API
 Has several advantages over general-purpose GPU programming through graphics APIs (e.g., OpenGL)
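To make the runtime-API style concrete, here is a minimal sketch of a CUDA program in that style; the kernel, sizes, and names are illustrative, not taken from the thesis:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread brightens one pixel (saturating add).
__global__ void brighten(unsigned char *img, int n, int delta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        int v = img[i] + delta;
        img[i] = v > 255 ? 255 : v;
    }
}

int main() {
    const int n = 512 * 512;                  // e.g., one 512x512 grayscale image
    unsigned char *host = new unsigned char[n]();
    unsigned char *dev;
    cudaMalloc((void **)&dev, n);             // explicit device allocation
    cudaMemcpy(dev, host, n, cudaMemcpyHostToDevice);

    int threads = 256;                        // threads per block
    int blocks = (n + threads - 1) / threads; // enough blocks to cover n pixels
    brighten<<<blocks, threads>>>(dev, n, 40);

    cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    delete[] host;
    return 0;
}
```

The same allocate/copy/launch/copy-back steps recur in every runtime-API CUDA program; only the kernel body changes.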

Overview  Motivation  Contribution & scope  Background  Platforms  Algorithms  Experimental Results  Conclusion

Algorithm 1: Gaussian Smoothing  Gaussian smoothing is a filtering kernel  Removes small-scale texture and noise over a given spatial extent  The 1-D Gaussian kernel is written as: $G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}$  The 2-D Gaussian kernel is separable into two 1-D passes: $G(x,y) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2+y^2)/2\sigma^2} = G(x)\,G(y)$
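Separability means the 2-D smoothing can be done as a horizontal pass followed by a vertical one. Below is a hedged sketch of the horizontal pass in CUDA; the 5-tap weights, names, and row-major layout are assumptions for illustration, not the thesis code:

```cuda
#include <cuda_runtime.h>

#define RADIUS 2  // 5-tap kernel (the kernel-width limit noted in the conclusion)

// Precomputed, normalized 5-tap Gaussian weights (sigma near 1.0); illustrative values.
__constant__ float d_weights[2 * RADIUS + 1] = {0.0614f, 0.2448f, 0.3877f, 0.2448f, 0.0614f};

// Horizontal pass of the separable filter: one thread per output pixel.
// An analogous kernel sweeps vertically over this intermediate image.
__global__ void gaussianRow(const float *src, float *dst, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int xs = min(max(x + k, 0), width - 1);   // clamp at the image border
        sum += d_weights[k + RADIUS] * src[y * width + xs];
    }
    dst[y * width + x] = sum;
}
```

A 16x16 thread block with a grid of ceil(width/16) by ceil(height/16) blocks covers the image; the two-pass form costs about 2k multiplies per pixel instead of k^2 for a k-tap kernel.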

Gaussian Smoothing (example)

Algorithm 2: Canny Edge Detection  Edge detection is a common operation in image processing  Edges are discontinuities in image gray levels with strong intensity contrast  Canny edge detection is an optimal edge detector: it smooths the image, computes gradients, thins edges by non-maximum suppression, and links them by hysteresis thresholding (gradient stage sketched below)  Illustrated ahead with an example
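As one hedged illustration of the gradient stage just listed, here is a 3x3 Sobel magnitude kernel in CUDA; the names and layout are assumptions, and non-maximum suppression and hysteresis would follow in separate kernels:

```cuda
#include <cuda_runtime.h>

// Gradient stage of Canny (illustrative): 3x3 Sobel operator producing the
// per-pixel gradient magnitude.
__global__ void sobelMagnitude(const float *src, float *mag, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
        mag[y * width + x] = 0.0f;   // no full 3x3 neighborhood at the border
        return;
    }

    // Row pointers centered on (x, y) for the 3x3 neighborhood.
    const float *r0 = src + (y - 1) * width + x;
    const float *r1 = src + y * width + x;
    const float *r2 = src + (y + 1) * width + x;

    float gx = (r0[1] + 2.0f * r1[1] + r2[1]) - (r0[-1] + 2.0f * r1[-1] + r2[-1]);
    float gy = (r2[-1] + 2.0f * r2[0] + r2[1]) - (r0[-1] + 2.0f * r0[0] + r0[1]);
    mag[y * width + x] = sqrtf(gx * gx + gy * gy);
}
```

The gradient direction, needed by non-maximum suppression, would come from atan2f(gy, gx) in the same pass.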

Canny Edge Detection (example)

Algorithm 3: KLT Tracking  First proposed by Lucas and Kanade; extended by Tomasi and Kanade, and by Shi and Tomasi.  First, determine which feature(s) to track (feature selection).  Second, track the selected feature(s) across the image sequence.  Rests on three assumptions: brightness constancy, temporal persistence, and spatial coherence.
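For reference, the tracking step is commonly written as a 2x2 linear system solved per feature window $W$ for the displacement $d$ (this is the standard formulation, added here rather than transcribed from the slide):

\[
\begin{bmatrix} \sum_{W} I_x^2 & \sum_{W} I_x I_y \\ \sum_{W} I_x I_y & \sum_{W} I_y^2 \end{bmatrix}
\begin{bmatrix} d_x \\ d_y \end{bmatrix}
= -
\begin{bmatrix} \sum_{W} I_x I_t \\ \sum_{W} I_y I_t \end{bmatrix}
\]

where $I_x, I_y$ are the spatial image gradients and $I_t$ is the temporal difference. Shi and Tomasi's feature selection picks windows where the smaller eigenvalue of the left-hand matrix exceeds a threshold, which is the "good features to track" criterion.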

Algorithm 3: KLT Tracking

Overview  Motivation  Contribution & scope  Background  Platforms  Algorithms  Results  Conclusion

Gaussian Smoothing: Results
[Smoothed output for the Lenna and Mandrill test images.]

Results: Gaussian Smoothing

Canny Edge Detection: Results
[Edge-detection output for the Lenna and Mandrill test images.]

Results: Canny edge detection

Results: Canny Edge Detection
 Comparison with other implementations on Cell
 Comparison with other implementations on GPU

Results: KLT Tracking

Results: KLT Tracking
 Comparison with other implementations on GPU: no known implementations yet.

Overview  Motivation  Contribution & scope  Background  Platforms  Algorithms  Results  Conclusion & Extension

Conclusion & Future work  The GPU is still ahead of the other architectures and is the best suited to image processing applications.  Further optimization on the PS3 could improve its timings and narrow the gap to the GPU. Future work could provide:
 Support for a faster color Canny
 Support for kernel widths larger than 5
 Better management of GPU thread alignment when the image width is not a multiple of 16 (see the sketch after this list)
 Inclusion of Intel Xeon & Larrabee as additional architectures
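One hedged way to address the alignment point above is pitched allocation, which pads each row so row starts stay aligned and half-warp accesses remain coalesced even for odd widths; the helper below is illustrative, not thesis code:

```cuda
#include <cuda_runtime.h>

// Allocate a 2-D float image with padded rows. cudaMallocPitch chooses a pitch
// (row stride in bytes) that keeps every row start aligned, so the threads of a
// half-warp read consecutive, aligned addresses even when 'width' is not a
// multiple of 16.
float *allocImage(int width, int height, size_t *pitch) {
    float *dev = nullptr;
    cudaMallocPitch(reinterpret_cast<void **>(&dev), pitch,
                    width * sizeof(float), height);
    return dev;  // address row y as: (float *)((char *)dev + y * *pitch)
}
```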

Questions?

Additional Slides

CBE Architecture  Contains a traditional microprocessor, the PowerPC Processor Element (PPE), which controls tasks  64-bit PPC core: 32 KB L1 instruction cache, 32 KB L1 data cache, and 512 KB L2 cache  The PPE controls 8 synergistic processor elements (SPEs) operating as SIMD units for data-intensive tasks  Each SPE has an SPU and a memory flow controller (MFC)  The SPU is a RISC core with 128 registers of 128 bits each and a 256 KB local store (LS)  PPE, SPEs, MIC, and BIC are connected by the Element Interconnect Bus (EIB) for data movement: a ring bus of four 16-byte channels providing sustained bandwidth of about 205 GB/s  The MFC connection to Rambus XDR memory and the BIC interface to I/O devices (connected via RapidIO) provide 25.6 GB/s of data bandwidth

CBE: What makes it fast?  Huge inter-SPE bandwidth: 205 GB/s sustained  Fast main memory: 25.6 GB/s bandwidth from Rambus XDR memory  Predictable DMA latency and throughput  DMA traffic has negligible impact on SPE local store bandwidth  Easy to overlap data movement with computation  High-performance, low-power SPE cores

Nvidia GeForce (continued)
 The GPU has K multiprocessors (MPs); each MP has L scalar processors (SPs)
 Each MP performs block processing in batches; a block is processed by only one MP
 Each block is split into SIMD groups of threads called warps; a warp is executed physically in parallel
 A scheduler switches between warps
 A warp contains threads of increasing, consecutive thread IDs; currently the warp size is 32 threads
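A tiny illustration of the warp layout just described (illustrative kernel, not from the slides); note that device-side printf requires a newer GPU (compute capability 2.0+) than the GeForce 8 used in the thesis:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread derives its warp and lane: threads 0..31 of a block share warp 0,
// threads 32..63 share warp 1, and so on (warp size 32).
__global__ void whoAmI() {
    int linear = threadIdx.x;           // 1-D block for simplicity
    int warp = linear / warpSize;       // warpSize is a built-in variable (32)
    int lane = linear % warpSize;
    if (lane == 0)                      // one printout per warp
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp, linear);
}

int main() {
    whoAmI<<<2, 64>>>();                // 2 blocks of 64 threads = 2 warps each
    cudaDeviceSynchronize();            // wait so device printf output is flushed
    return 0;
}
```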

CUDA: Programming model
[Figure: a grid of thread blocks, Block (0,0) through Block (3,1); one block, Block (2,1), is expanded to show threads (0,0) through (5,7) grouped into warps (Warp 1, Warp 2).]
 A grid consists of thread blocks
 Each thread executes the kernel
 Grid and block dimensions are specified by the application, up to limits set by the GPU and its memory
 The grid layout can be 1-, 2-, or 3-D
 Thread and block IDs are unique
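A hedged sketch of how an application pins down these dimensions for a 2-D image; the 4x2 grid of 6x8-thread blocks mirrors the figure above, but any sizes within device limits work:

```cuda
#include <cuda_runtime.h>

__global__ void kernel2d(float *data, int width, int height) {
    // Unique 2-D coordinates derived from block and thread IDs.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;    // placeholder per-element work
}

void launch(float *devData, int width, int height) {
    dim3 block(6, 8);                   // 6x8 = 48 threads per block, as in the figure
    dim3 grid((width + block.x - 1) / block.x,   // ceil(width / 6) blocks across
              (height + block.y - 1) / block.y); // ceil(height / 8) blocks down
    kernel2d<<<grid, block>>>(devData, width, height);
}
```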

CUDA: Memory model
 Shared memory (R/W): for sharing data within a block
 Texture memory: spatially cached
 Constant memory: 64 KB, cached
 Global memory: not cached; accesses should be coalesced
 GPU memory is explicitly allocated and deallocated
 Copying between CPU and GPU memory is slow
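A hedged sketch tying the shared-memory and coalescing points together: a per-block sum reduction, with names and sizes chosen for illustration (this assumes blockDim.x is a power of two):

```cuda
#include <cuda_runtime.h>

// Block-wide sum using shared memory (the per-block R/W space noted above).
__global__ void blockSum(const float *in, float *out, int n) {
    extern __shared__ float buf[];              // shared memory, sized at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // coalesced global read
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];               // one partial sum per block
}
```

Launched as blockSum<<<blocks, 256, 256 * sizeof(float)>>>(in, out, n), the third argument supplies the dynamic shared-memory size per block.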