Published by Britton Mitchell. Modified over 9 years ago.
1
Christopher Mitchell CDA 6938, Spring 2009
2
The Discrete Cosine Transform
- In the same family as the Fourier Transform.
  - Converts data to the frequency domain.
- Represents data as a sum of cosine waves of varying frequency.
- Because it is a discrete transform, it is well suited to problems formatted for computer analysis.
- Captures only the real (even) components of the function.
  - The Discrete Sine Transform (DST) captures the odd (imaginary) components → not as useful.
  - The Discrete Fourier Transform (DFT) captures both odd and even components → computationally intense.
3
Significance / Where is this used?
- Image Processing
  - Compression, e.g. JPEG
  - Scientific Analysis, e.g. radio telescope data
- Audio Processing
  - Compression, e.g. MPEG Layer 3, a.k.a. MP3
- Scientific Computing / High Performance Computing (HPC)
  - Partial Differential Equation Solvers
4
Significance, Cont.
- Image Processing Example
  - Exhibits energy compaction
  - Drop small-amplitude coefficients
[Figure: original image alongside its DCT-transformed image]
5
Implementation Platform NVIDIA CUDA Version 2.0
6
Implementation Platform, Cont.
- What happened to the Cell/BE?
  - Too many technical challenges for the deadline.
  - The algorithm is embarrassingly parallel.
    - Conducive to launching hundreds of threads → GPU
  - The algorithm requires too much data per pass compared to the local store size.
    - Would have to be creative with DMA, with no guarantee of mitigating the bottleneck.
7
Algorithm Walk Through
- Mathematical Basis
  - 1D Version: [equation not preserved]
    - Where: [equation not preserved]
  - 2D Version: [equation not preserved]
    - Where α(u) and α(v) are defined as shown in the 1D case.
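For reference, the standard DCT-II definition is given here on the assumption that it is the form the slide showed. It uses the notation of the cited "Theory and Application" report, which matches the slide's α(u) and α(v); the symbols f, C, and N are taken from that convention rather than from the slide itself.

```latex
% 1D version, for u = 0, 1, ..., N-1
C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x)
       \cos\left[ \frac{\pi (2x+1) u}{2N} \right]

% Where:
\alpha(u) =
\begin{cases}
  \sqrt{1/N} & \text{for } u = 0 \\
  \sqrt{2/N} & \text{for } u \neq 0
\end{cases}

% 2D version, for u, v = 0, 1, ..., N-1
C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)
         \cos\left[ \frac{\pi (2x+1) u}{2N} \right]
         \cos\left[ \frac{\pi (2y+1) v}{2N} \right]
```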
8
Algorithm Walk Through CPU Version – 1D DCT
9
Algorithm Walk Through CPU Version – 2D DCT
10
Algorithm Walk Through
- Problem
  - 1D DCT is O(n^2)
  - 2D DCT is O(n^4) for an n x n matrix (n^2 outputs, each a sum over n^2 inputs)
  - Additionally, the algorithm calls the cosine and square root functions.
    - Long-latency ALU operations
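To make the cost concrete, the sketch below counts cosine evaluations, assuming a direct (non-separable, non-FFT) evaluation of the DCT sums with nothing precomputed. The function names are hypothetical.

```c
#include <stdint.h>

/* Direct 1D DCT of length n: n outputs, each a sum of n terms,
 * one cosine per term. */
uint64_t cos_calls_1d(uint64_t n)
{
    return n * n;
}

/* Direct 2D DCT of an n x n matrix: n*n outputs, each a sum of
 * n*n terms, two cosines per term. */
uint64_t cos_calls_2d(uint64_t n)
{
    return 2 * n * n * n * n;
}
```

Doubling n multiplies the 1D count by 4 and the 2D count by 16, which is in line with the growth of the CPU timings reported on the testing slides.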
11
Algorithm Walk Through CUDA Version – 1D DCT
12
Algorithm Walk Through CUDA Version – 2D DCT
13
Algorithm Walk Through
- Solution
  - 1D DCT is now O(n) per thread
  - 2D DCT is now O(n^2) per thread
  - Parallelization is key to success with this algorithm
14
Testing Platform
- Intel Core 2 Duo E6700 @ 2.66 GHz
- Gigabyte GA-P35-DQ6 motherboard
- 2 GB RAM
- 2 NVIDIA GeForce 8600 GTS Superclocked GPUs
  - 720 MHz core clock
  - 256 MB GDDR3 memory
  - 4 multiprocessors → 32 streaming processors
- Windows XP Professional (32-bit) w/ SP3 and NVIDIA ForceWare 178.24 drivers
15
Testing - Overview

Vector Test Case     CPU Version        CUDA Version
Vector: 256          3.00 ms            0.016930 ms
Vector: 512          14.67 ms           0.027778 ms
Vector: 1024         64.33 ms           0.015876 ms
Vector: 2048         246.33 ms          0.015213 ms
Vector: 4096         989.33 ms          0.015721 ms

Matrix Test Case     CPU Version        GPU Version
Matrix: 64 x 64      1,055.67 ms        0.009612 ms
Matrix: 128 x 128    16,205.33 ms       0.010277 ms
Matrix: 256 x 256    254,448.33 ms      0.009850 ms
Matrix: 512 x 512    4,007,952.00 ms    0.014130 ms
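The speedup implied by each row is the CPU time divided by the GPU time. A small helper for that arithmetic is sketched below; the function name `speedups` is hypothetical, and the caller supplies the timings from the table.

```c
#include <stddef.h>

/* Speedup of the GPU run over the CPU run for each test case:
 * out[i] = cpu_ms[i] / gpu_ms[i]. Both arrays hold milliseconds. */
void speedups(const double *cpu_ms, const double *gpu_ms,
              double *out, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = cpu_ms[i] / gpu_ms[i];
}
```

For the 256-element vector row, 3.00 ms / 0.016930 ms gives a speedup of roughly 177x.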
16
Testing – 1D DCT
18
Testing – 2D DCT
20
Future Work
- Multiple GPU version
  - Have a dual-card setup to test this with.
  - Need to find an efficient way to split the problem between the two cards without incurring a large I/O penalty.
- Still interested in trying a Cell/BE version of the algorithm.
  - Need to improve at CBEA programming.
  - DMA and local store size are the limiting factors for this particular problem.
21
References
- NVIDIA CUDA Programming Guide, Version 2.1. http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf
- The Discrete Cosine Transform (DCT): Theory and Application. http://www.egr.msu.edu/waves/people/Ali_files/DCT_TR802.pdf
- CDA 6938 Lecture Notes and Slides