Christopher Mitchell, CDA 6938, Spring 2009: The Discrete Cosine Transform

The Discrete Cosine Transform
- In the same family as the Fourier Transform.
- Converts data to the frequency domain.
- Represents data as a sum of cosine waves of varying frequency.
- Because it is a discrete transform, it suits problems formatted for computer analysis.
- Captures only the real (even) components of the function.
  - The Discrete Sine Transform (DST) captures the odd (imaginary) components → not as useful.
  - The Discrete Fourier Transform (DFT) captures both odd and even components → computationally intense.

Significance / Where is this used?
- Image Processing
  - Compression, e.g. JPEG
  - Scientific Analysis, e.g. Radio Telescope Data
- Audio Processing
  - Compression, e.g. MPEG Layer 3, a.k.a. MP3
- Scientific Computing / High Performance Computing (HPC)
  - Partial Differential Equation Solvers

Significance, Cont.
- Image Processing Example
  - Exhibits energy compaction
  - Drop small-amplitude coefficients
[Figures: Original Image; DCT Transformed Image]
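The "drop small-amplitude coefficients" step can be illustrated with a short thresholding kernel. This is only a sketch under stated assumptions: the kernel name `drop_small_coefficients`, the `threshold` parameter, and the sample values are all hypothetical and do not come from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Zero every coefficient whose magnitude is below `threshold`.
// Both the name and the threshold parameter are illustrative assumptions.
__global__ void drop_small_coefficients(float* coeffs, int count, float threshold)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count && fabsf(coeffs[i]) < threshold)
        coeffs[i] = 0.0f;
}

int main()
{
    // Made-up DCT coefficients: a few large values and several near zero.
    const int count = 8;
    float h[count] = {2.83f, -0.01f, 0.90f, 0.002f, -0.45f, 0.03f, 0.0f, -0.004f};

    float* d = nullptr;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    // One block of 32 threads is enough for 8 values.
    drop_small_coefficients<<<1, 32>>>(d, count, 0.05f);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < count; ++i)
        printf("%g ", h[i]);
    printf("\n");
    return 0;
}
```

Because energy compaction concentrates most of the signal in a few coefficients, the zeroed entries cost little accuracy while making the data far easier to compress.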

Implementation Platform
- NVIDIA CUDA Version 2.0

Implementation Platform, Cont.
- What happened to the Cell/BE?
  - Too many technical challenges for the time available before the deadline.
  - The algorithm is embarrassingly parallel.
    - Conducive to launching hundreds of threads → GPU
  - The algorithm requires too much data per pass compared to the local store size.
    - Would have to be creative with DMA, with no guarantee of mitigating the bottleneck.

Algorithm Walk Through
- Mathematical Basis
  - 1D Version:
  - Where:
  - 2D Version:
  - Where α(u) and α(v) are defined as shown in the 1D case.
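The equations on this slide were images and are not in the transcript. As an assumption, the forms below are the standard DCT-II definitions given in the cited report "The Discrete Cosine Transform (DCT): Theory and Application", which uses the same α(u), α(v) notation; the slide may have differed in detail.

```latex
% 1D DCT of a length-N sequence f(x), for u = 0, 1, ..., N-1
C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x)
       \cos\!\left[ \frac{\pi (2x+1) u}{2N} \right]

% Normalization factor
\alpha(u) =
\begin{cases}
  \sqrt{1/N}, & u = 0 \\
  \sqrt{2/N}, & u \neq 0
\end{cases}

% 2D DCT of an N x N block f(x, y), for u, v = 0, 1, ..., N-1
C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y)
          \cos\!\left[ \frac{\pi (2x+1) u}{2N} \right]
          \cos\!\left[ \frac{\pi (2y+1) v}{2N} \right]
```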

Algorithm Walk Through
- CPU Version – 1D DCT
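The code shown on this slide is not in the transcript. The following is a minimal sketch of a direct CPU 1D DCT, assuming the standard DCT-II definition with the α(u) normalization; the function name `dct1d_cpu` is hypothetical and this is not necessarily the author's code. It is host-only code, so it builds with nvcc or any C++ compiler.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// Direct DCT-II on the CPU: each of the n outputs sums over all n inputs,
// so the whole transform takes n * n multiply-adds.
void dct1d_cpu(const float* in, float* out, int n)
{
    const float pi = 3.14159265358979f;
    for (int u = 0; u < n; ++u) {
        // alpha(u) normalization: sqrt(1/n) for u == 0, sqrt(2/n) otherwise.
        float alpha = (u == 0) ? std::sqrt(1.0f / n) : std::sqrt(2.0f / n);
        float sum = 0.0f;
        for (int x = 0; x < n; ++x)
            sum += in[x] * std::cos(pi * (2 * x + 1) * u / (2.0f * n));
        out[u] = alpha * sum;
    }
}

int main()
{
    // Constant input: all energy should land in the u = 0 coefficient.
    const int n = 8;
    std::vector<float> in(n, 1.0f), out(n);
    dct1d_cpu(in.data(), out.data(), n);
    for (int u = 0; u < n; ++u)
        printf("C(%d) = %f\n", u, out[u]);
    return 0;
}
```

For a constant input the u = 0 coefficient equals sqrt(n) and the remaining coefficients are zero up to floating-point rounding, which makes this a convenient sanity check.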

Algorithm Walk Through
- CPU Version – 2D DCT
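As with the 1D case, the slide's code is not in the transcript. This sketch evaluates the 2D DCT-II directly from its definition on a square, row-major N x N input; the names `alpha_cpu` and `dct2d_cpu` are hypothetical. Evaluated this way, each of the N x N outputs sums over all N x N inputs. A row-column decomposition into 1D passes would need fewer operations, and the transcript does not say which form the original CPU code used.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// alpha(k) normalization shared by both dimensions.
static float alpha_cpu(int k, int n)
{
    return (k == 0) ? std::sqrt(1.0f / n) : std::sqrt(2.0f / n);
}

// Direct 2D DCT-II of an n x n row-major block.
void dct2d_cpu(const float* in, float* out, int n)
{
    const float pi = 3.14159265358979f;
    for (int u = 0; u < n; ++u) {
        for (int v = 0; v < n; ++v) {
            float sum = 0.0f;
            for (int x = 0; x < n; ++x) {
                float cx = std::cos(pi * (2 * x + 1) * u / (2.0f * n));
                for (int y = 0; y < n; ++y) {
                    float cy = std::cos(pi * (2 * y + 1) * v / (2.0f * n));
                    sum += in[x * n + y] * cx * cy;
                }
            }
            out[u * n + v] = alpha_cpu(u, n) * alpha_cpu(v, n) * sum;
        }
    }
}

int main()
{
    // Constant 4 x 4 block: only the (0, 0) coefficient should be non-zero.
    const int n = 4;
    std::vector<float> in(n * n, 1.0f), out(n * n);
    dct2d_cpu(in.data(), out.data(), n);
    for (int u = 0; u < n; ++u) {
        for (int v = 0; v < n; ++v)
            printf("%8.4f ", out[u * n + v]);
        printf("\n");
    }
    return 0;
}
```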

Algorithm Walk Through
- Problem
  - 1D DCT is O(n^2)
  - 2D DCT is O(n^3)
  - Additionally, the algorithm makes calls to calculate the cosine and square root.
    - Long-latency ALU operations

Algorithm Walk Through
- CUDA Version – 1D DCT
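The CUDA kernel from this slide is not in the transcript. A minimal sketch of one plausible mapping, assuming one thread per output coefficient and the standard DCT-II definition, looks like this. The kernel name `dct1d_kernel`, the 256-thread block size, and the test size are assumptions, not details from the slides.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread computes one output coefficient u, looping over all n inputs.
__global__ void dct1d_kernel(const float* in, float* out, int n)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n) return;  // guard threads past the end of the vector

    const float pi = 3.14159265358979f;
    float alpha = (u == 0) ? sqrtf(1.0f / n) : sqrtf(2.0f / n);
    float sum = 0.0f;
    for (int x = 0; x < n; ++x)
        sum += in[x] * cosf(pi * (2 * x + 1) * u / (2.0f * n));
    out[u] = alpha * sum;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n, 1.0f), h_out(n);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Enough blocks to cover every coefficient.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    dct1d_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    printf("C(0) = %f, C(1) = %f\n", h_out[0], h_out[1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The outer loop of the CPU version disappears into the thread index, so each thread runs only the inner loop over the n inputs.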

Algorithm Walk Through
- CUDA Version – 2D DCT
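The 2D kernel is likewise missing from the transcript. The sketch below assumes one thread per output coefficient (u, v) on a 2D grid, a square row-major input, and the standard DCT-II definition. The name `dct2d_kernel`, the 16 x 16 block shape, and the 64 x 64 test size are illustrative assumptions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread computes one output coefficient (u, v) of an n x n block,
// looping over all n * n input samples.
__global__ void dct2d_kernel(const float* in, float* out, int n)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int u = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    if (u >= n || v >= n) return;

    const float pi = 3.14159265358979f;
    float au = (u == 0) ? sqrtf(1.0f / n) : sqrtf(2.0f / n);
    float av = (v == 0) ? sqrtf(1.0f / n) : sqrtf(2.0f / n);

    float sum = 0.0f;
    for (int x = 0; x < n; ++x) {
        float cx = cosf(pi * (2 * x + 1) * u / (2.0f * n));
        for (int y = 0; y < n; ++y) {
            float cy = cosf(pi * (2 * y + 1) * v / (2.0f * n));
            sum += in[x * n + y] * cx * cy;
        }
    }
    out[u * n + v] = au * av * sum;
}

int main()
{
    const int n = 64;
    const size_t bytes = n * n * sizeof(float);
    std::vector<float> h_in(n * n, 1.0f), h_out(n * n);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // 16 x 16 threads per block, enough blocks to tile the whole matrix.
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    dct2d_kernel<<<grid, block>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    printf("C(0,0) = %f, C(0,1) = %f\n", h_out[0], h_out[1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Here the two outer loops of the CPU version become the thread's (u, v) coordinates, leaving each thread with the double loop over the input block.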

Algorithm Walk Through
- Solution
  - 1D DCT is now O(n) per thread
  - 2D DCT is now O(n^2) per thread
  - Parallelization is the key to success with this algorithm

Testing
- Platform
  - Intel Core 2 Duo, 2.66 GHz
  - Gigabyte GA-P35-DQ6 motherboard
  - 2 GB RAM
  - 2 NVIDIA GeForce 8600 GTS Superclocked GPUs
    - 720 MHz core clock
    - 256 MB GDDR3 memory
    - 4 multiprocessors → 32 streaming processors
  - Windows XP Professional (32-bit) w/ SP3 and NVIDIA ForceWare drivers

Testing - Overview

Vector Test Case      CPU Version       CUDA Version
Vector: …             … ms              … ms
Vector: …             … ms              … ms
Vector: …             … ms              … ms
Vector: …             … ms              … ms
Vector: …             … ms              … ms

Matrix Test Case      CPU Version       GPU Version
Matrix: 64 x 64       1,… ms            … ms
Matrix: 128 x 128     16,… ms           … ms
Matrix: 256 x …       …,… ms            … ms
Matrix: 512 x 512     4,007,… ms        … ms

Testing – 1D DCT

Testing – 2D DCT

Future Work
- Multiple GPU version
  - Have a dual-card setup to test this with.
  - Need to find an efficient way to split the problem between the two cards without incurring a large I/O penalty.
- Still interested in trying a Cell/BE version of the algorithm.
  - Need to improve at CBEA programming.
  - DMA and local store size are the limiting factors for this particular problem.

References
- NVIDIA CUDA Programming Guide, Version 2.1
  - uda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf
- The Discrete Cosine Transform (DCT): Theory and Application
  - CT_TR802.pdf
- CDA 6938 Lecture Notes and Slides