
JPEG-GPU: A GPGPU IMPLEMENTATION OF JPEG CORE CODING SYSTEMS Ang Li University of Wisconsin-Madison

Outline
Brief Introduction of Background
Implementation
Evaluation
Conclusion

Background
JPEG Encoding
Parallelism Seeking
    Pre-processing: Color Conversion
    Block Encoding/Decoding

Implementation
Step 1 – Find target functions
    Encode: encode_mcu_huff, encode_one_block, emit_bits_s
    Decode: decode_mcu_DC_first, decode_mcu_DC_refine
Profiling with GPROF to find other hot functions
    Encode: rgb_ycc_convert
    Decode: ycc_rgb_convert
    Both take somewhat less than half of the total encoding/decoding execution time
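To make the baseline GPU port of the color-conversion step concrete, here is a minimal sketch, assuming one thread per pixel and the plain JFIF RGB-to-YCbCr formulas; the kernel and variable names are hypothetical, and libjpeg's real rgb_ycc_convert uses precomputed fixed-point tables rather than this floating-point form.

    /* Minimal sketch of a baseline GPU color-conversion kernel: one thread
     * per pixel, plain JFIF RGB->YCbCr formulas.  Hypothetical names; the
     * real rgb_ycc_convert in libjpeg uses fixed-point lookup tables. */
    __global__ void rgb_to_ycc_kernel(const unsigned char *rgb,
                                      unsigned char *ycc,
                                      int num_pixels)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_pixels) return;

        float r = rgb[3 * i + 0];
        float g = rgb[3 * i + 1];
        float b = rgb[3 * i + 2];

        ycc[3 * i + 0] = (unsigned char)( 0.299f    * r + 0.587f    * g + 0.114f    * b);           /* Y  */
        ycc[3 * i + 1] = (unsigned char)(-0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f);  /* Cb */
        ycc[3 * i + 2] = (unsigned char)( 0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f);  /* Cr */
    }

    /* Host side, e.g. for the 1280 x 768 test image:
     *   int threads = 256;
     *   int blocks  = (num_pixels + threads - 1) / threads;
     *   rgb_to_ycc_kernel<<<blocks, threads>>>(d_rgb, d_ycc, num_pixels);
     */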

Implementation – Cont’d
Step 2 – Parallelize with CUDA
First, implement in OpenMP to make sure the understanding is correct
    E.g., in encode_one_block, emit_bits_s changes the state of the system => parallelizing it across multiple threads will produce incorrect results!

    for (k = 1; k <= Se; k++) {
      …
      if (! emit_bits_s(…)) return FALSE;
      …
      if (! emit_bits_s(…)) return FALSE;
      …
      if (! emit_bits_s(…)) return FALSE;
      …
    }

Second, make a baseline GPGPU implementation of all critical functions
Third, optimize the GPGPU implementations
    Using constant memory (see the sketch below)
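A minimal sketch of that constant-memory optimization follows, assuming the lookup tables read inside the kernels (e.g. per-symbol Huffman code and length tables) are what gets placed in __constant__ memory; the names below are hypothetical, not libjpeg's. The caveat, visible in the evaluation that follows, is that constant memory is fast only when all threads of a warp read the same address; divergent indices are serialized.

    /* Sketch of the constant-memory optimization, assuming the per-symbol
     * Huffman code and length tables are the data kept on chip.
     * c_huff_code / c_huff_size are hypothetical names. */
    __constant__ unsigned int  c_huff_code[256];   /* Huffman codes        */
    __constant__ unsigned char c_huff_size[256];   /* code lengths in bits */

    void upload_huffman_tables(const unsigned int *huff_code,
                               const unsigned char *huff_size)
    {
        cudaMemcpyToSymbol(c_huff_code, huff_code, 256 * sizeof(unsigned int));
        cudaMemcpyToSymbol(c_huff_size, huff_size, 256 * sizeof(unsigned char));
    }

    /* Kernels then read c_huff_code[s] / c_huff_size[s] instead of global
     * memory.  When threads in a warp look up different symbols, those reads
     * are serialized, which matches the slowdown of the "optimized" version
     * reported in the evaluation. */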

Evaluation
Evaluation Environment
    CPU: Intel Nehalem Xeon E GHz processor
    GPU: Tesla K20c
Picture used
    My favorite picture
    Compressing: 1280 x 768 pixels
    Decompressing: the output produced by the compression step
    Correctness checked by "diff"

Evaluation – Cont’d
Timing table: Compress and Decompress times for the Sequential, OpenMP, GPGPU Base, and GPGPU Optimized implementations
    Timings are in milliseconds, averaged over 10 executions
    Four threads are forked for the OpenMP implementation
    For both GPU implementations, configurations are tuned for best performance
Results discussion
    OpenMP is fastest. GPGPU basically degrades the performance, and the "optimized" version degrades it further (due to serialized constant memory accesses).
Observations after hacking the code:
    Each kernel launch deals with at most 250 elements, which is too fine-grained.
    Kernel launches are expensive (allocating and copying the data).
    Using OpenMP is always going to be better as long as there is enough parallelism and the loop iterations are not extremely trivial (see the sketch below).
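For contrast with the GPU versions, here is a minimal sketch of the OpenMP route that wins in the timings above, assuming the color-conversion loop is parallelized over image rows; function and variable names are hypothetical, and libjpeg's actual rgb_ycc_convert works on fixed-point tables and per-scanline buffers.

    /* Sketch of the OpenMP version: parallelize the color conversion over
     * image rows with a single pragma.  Hypothetical names. */
    #include <omp.h>

    void rgb_to_ycc_omp(const unsigned char *rgb, unsigned char *ycc,
                        int width, int height)
    {
        #pragma omp parallel for   /* four threads in the evaluation setup */
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int   i = y * width + x;
                float r = rgb[3*i + 0], g = rgb[3*i + 1], b = rgb[3*i + 2];
                ycc[3*i + 0] = (unsigned char)( 0.299f*r    + 0.587f*g    + 0.114f*b);
                ycc[3*i + 1] = (unsigned char)(-0.168736f*r - 0.331264f*g + 0.5f*b      + 128.0f);
                ycc[3*i + 2] = (unsigned char)( 0.5f*r      - 0.418688f*g - 0.081312f*b + 128.0f);
            }
        }
    }

    /* No device allocation, no host-to-device copies, no kernel launches:
     * the whole 1280 x 768 image is split across CPU threads in one loop. */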

Conclusion
For the JPEG encoding/decoding core system, GPGPU basically degrades the performance; coarser-grained parallelism would be required.
OpenMP acceleration can be applied easily to gain some performance.

Thank you.
Ang Li
3/20/2013, NVIDIA GTC 2013