Christopher Mitchell CDA 6938, Spring 2009
The Discrete Cosine Transform
- In the same family as the Fourier Transform.
  - Converts data to the frequency domain.
- Represents data via summation of variable-frequency cosine waves.
- Since it is a discrete version, it is conducive to problems formatted for computer analysis.
- Captures only real components of the function.
  - Discrete Sine Transform (DST) captures odd (imaginary) components → not as useful.
  - Discrete Fourier Transform (DFT) captures both odd and even components → computationally intense.
Significance / Where is this used?
- Image Processing
  - Compression, e.g. JPEG
  - Scientific Analysis, e.g. Radio Telescope Data
- Audio Processing
  - Compression, e.g. MPEG Layer 3, a.k.a. MP3
- Scientific Computing / High Performance Computing (HPC)
  - Partial Differential Equation Solvers
Significance, Cont.
Image Processing Example
- Exhibits Energy Compaction: most of the signal energy lands in a few coefficients.
- Drop small amplitude coefficients.
[Figure: Original Image | DCT Transformed Image]
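Dropping small-amplitude coefficients can be sketched as a simple thresholding pass over an array of DCT coefficients. This is a minimal illustration only: the function name `drop_small_coefficients` and the absolute-value threshold are assumptions made for the example, not the scheme used by JPEG or by the author.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: zero every coefficient whose magnitude is below
 * `threshold` and return how many coefficients were kept.
 * Because the DCT compacts energy into a few coefficients, most entries
 * of a smooth signal's transform fall below a modest threshold. */
size_t drop_small_coefficients(float *coeffs, size_t n, float threshold)
{
    assert(threshold >= 0.0f);
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        if (fabsf(coeffs[i]) < threshold) {
            coeffs[i] = 0.0f;   /* discard: contributes little energy */
        } else {
            ++kept;             /* keep: carries significant energy */
        }
    }
    return kept;
}
```

Only the kept coefficients (and their positions) would need to be stored, which is where the compression comes from.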
Implementation Platform NVIDIA CUDA Version 2.0
Implementation Platform, Cont.
What Happened to the Cell/BE?
- Too many technical challenges compared to the deadline.
- Algorithm is embarrassingly parallel.
  - Conducive to launching hundreds of threads → GPU
- Algorithm requires too much data per pass compared to local store size.
  - Would have to be creative with DMA, with no guarantee of bottleneck mitigation.
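Algorithm Walk Through
Mathematical Basis (standard DCT-II form, as in the referenced DCT report)
1D Version:
$$C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x) \cos\left[\frac{\pi (2x+1) u}{2N}\right], \quad u = 0, 1, \ldots, N-1$$
Where:
$$\alpha(u) = \begin{cases} \sqrt{1/N} & u = 0 \\ \sqrt{2/N} & u \neq 0 \end{cases}$$
2D Version:
$$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\left[\frac{\pi (2x+1) u}{2N}\right] \cos\left[\frac{\pi (2y+1) v}{2N}\right], \quad u, v = 0, 1, \ldots, N-1$$
Where α(u) and α(v) are defined as shown in the 1D case.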
Algorithm Walk Through CPU Version – 1D DCT
Algorithm Walk Through CPU Version – 2D DCT
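Algorithm Walk Through
Problem
- 1D DCT is O(n^2).
- 2D DCT is O(n^3) when an n × n input is processed as separable row and column passes; evaluating the 2D double sum directly costs O(n^4).
- Additionally, the algorithm calls the cosine and square root functions.
  - Long-latency ALU operations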
Algorithm Walk Through CUDA Version – 1D DCT
Algorithm Walk Through CUDA Version – 2D DCT
Algorithm Walk Through
Solution
- With the output coefficients computed in parallel:
  - 1D DCT is now O(n)
  - 2D DCT is now O(n^2)
- Parallelization is key to success with this algorithm.
Testing Platform
- Intel Core 2 Duo, 2.66 GHz
- Gigabyte GA-P35-DQ6 Motherboard
- 2 GB RAM
- 2 NVIDIA GeForce 8600 GTS Superclocked GPUs
  - 720 MHz Core Clock
  - 256 MB GDDR3 Memory
  - 4 Multiprocessors → 32 Streaming Processors
- Windows XP Professional (32-bit) with SP3 and NVIDIA ForceWare Drivers
Testing - Overview

Vector Test Case      CPU Version      CUDA Version
Vector: …             … ms             … ms
Vector: …             … ms             … ms
Vector: …             … ms             … ms
Vector: …             … ms             … ms
Vector: …             … ms             … ms

Matrix Test Case      CPU Version      GPU Version
Matrix: 64 x 64       1,… ms           … ms
Matrix: 128 x 128     16,… ms          … ms
Matrix: 256 x …       …,… ms           … ms
Matrix: 512 x 512     4,007,… ms       … ms
Testing – 1D DCT
Testing – 2D DCT
Future Work
- Multiple GPU version
  - Have a dual card setup to test this with.
  - Need to find an efficient way to split the problem between the two cards without incurring a large I/O penalty.
- Still interested in trying a Cell/BE version of the algorithm.
  - Need to improve at CBEA programming.
  - DMA and local store size are the limiting factors for this particular problem.
References
- NVIDIA CUDA Programming Guide, Version 2.1. uda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf
- The Discrete Cosine Transform (DCT): Theory and Application. CT_TR802.pdf
- CDA 6938 Lecture Notes and Slides