JPEG Compression Algorithm in CUDA
Group Members: Pranit Patel, Manisha Tatikonda, Jeff Wong, Jarek Marczewski
Date: April 14, 2009
Outline
- Motivation
- JPEG Algorithm
- Design Approach in CUDA
- Benchmark
- Conclusion
Motivation
- Growth of digital imaging applications
- Need for an effective algorithm for video compression applications
- Loss of data information must be minimal
- JPEG is a lossy compression algorithm that reduces file size with minimal perceptible loss of image quality
- It exploits the fact that we perceive small changes in brightness more readily than small changes in color
JPEG Algorithm
Step 1: Divide the sample image into 8x8 blocks
Step 2: Apply the DCT
- The DCT is applied to each block, transforming the block's color values into a matrix of frequency coefficients that is analyzed as a whole
- This step does not itself compress the file
In general: a simple color space model stores [R, G, B] per pixel; JPEG uses the [Y, Cb, Cr] model (Y = brightness, Cb = color blueness, Cr = color redness)
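The [Y, Cb, Cr] color model above can be illustrated with the standard JFIF/BT.601 conversion from RGB; this is a generic reference sketch, not the group's actual code, and `rgb_to_ycbcr` is an illustrative name.

```c
#include <math.h>

/* JFIF/BT.601-style RGB -> YCbCr conversion used by baseline JPEG.
   Y is brightness; Cb and Cr are the blue and red chrominance offsets,
   centered at 128 so a gray pixel has neutral color. */
static void rgb_to_ycbcr(double r, double g, double b,
                         double *y, double *cb, double *cr)
{
    *y  =         0.299    * r + 0.587    * g + 0.114    * b;
    *cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5      * b;
    *cr = 128.0 + 0.5      * r - 0.418688 * g - 0.081312 * b;
}
```

A mid-gray pixel (128, 128, 128) maps to Y = 128 with Cb = Cr = 128, i.e. full brightness information and a neutral color, which is why the chroma channels can be compressed more aggressively.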
JPEG Algorithm
Step 3: Quantization
- The first compression step
- Each DCT coefficient is divided by its corresponding constant in the quantization table and rounded off to the nearest integer
- As a result, smaller, unimportant coefficients are replaced by zeros and larger coefficients lose precision; it is this rounding-off that causes the loss in image quality
Step 4: Apply Huffman encoding
- Huffman encoding is applied to the quantized DCT coefficients to reduce the image size further
Step 5: Decoder
- The JPEG decoder consists of: Huffman decoding, de-quantization, IDCT
DCT and IDCT
Discrete Cosine Transform
Separable transform algorithm (1D applied twice to get the 2D result): the 2D DCT is performed in a two-pass approach, one pass for the horizontal direction and one for the vertical direction.
Discrete Cosine Transform
- The DCT is translated into a matrix cross-multiplication
- Pre-calculated cosine values are stored as a constant array
- The inverse DCT is calculated the same way, using the transposed cosine matrix
[Figure: 8x8 pixel matrix P (elements P00..P77) multiplied by the 8x8 cosine coefficient matrix C (elements C00..C77)]
DCT CUDA Implementation
- Each thread within each block performs the same number of calculations
- Each thread multiplies and accumulates eight elements
[Figure: the P x C matrix multiplication, highlighting the output element computed by the thread at threadIdx.x = 2, threadIdx.y = 3]
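The per-thread work described above (one eight-element multiply-accumulate per output coefficient) can be sketched in plain C, with the loop indices standing in for threadIdx.y and threadIdx.x; `per_thread_mac` is an illustrative name, not the actual kernel.

```c
/* CPU sketch of the per-thread work: thread (ty, tx) computes one output
   element as an 8-element multiply-accumulate of row ty of P with column
   tx of the constant cosine matrix C. In CUDA, ty and tx would be
   threadIdx.y and threadIdx.x within an 8x8 thread block. */
static void per_thread_mac(const double P[8][8], const double C[8][8],
                           double out[8][8])
{
    for (int ty = 0; ty < 8; ty++)        /* stands in for threadIdx.y */
        for (int tx = 0; tx < 8; tx++) {  /* stands in for threadIdx.x */
            double acc = 0.0;
            for (int k = 0; k < 8; k++)   /* eight multiply-accumulates */
                acc += P[ty][k] * C[k][tx];
            out[ty][tx] = acc;
        }
}
```

Every thread runs the same eight-iteration loop, so the warp stays free of divergence, which is what the slide means by "the same number of calculations".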
DCT Grid and Block
Two methods were tried:
- Each thread block processes 1 macroblock (64 threads)
- Each thread block processes 8 macroblocks (512 threads)
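The two launch configurations can be sketched as follows, assuming image dimensions that are multiples of 8; the struct and helper names are hypothetical.

```c
/* One macroblock = one 8x8 pixel block, one thread per element. */
typedef struct { int blocks; int threads_per_block; } launch_cfg;

/* Method A: each thread block handles 1 macroblock (8x8 = 64 threads). */
static launch_cfg cfg_one_mb_per_block(int width, int height)
{
    launch_cfg c = { (width / 8) * (height / 8), 64 };
    return c;
}

/* Method B: each thread block handles 8 macroblocks (8 * 64 = 512 threads). */
static launch_cfg cfg_eight_mb_per_block(int width, int height)
{
    launch_cfg c = { (width / 8) * (height / 8) / 8, 512 };
    return c;
}
```

The 512-thread variant launches 8x fewer blocks, trading occupancy granularity for fewer, larger blocks.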
DCT and IDCT GPU Results (image sizes: 512x512, 1024x768, 2048x2048)
DCT Results
IDCT Results
Quantization
- Break the image into 8x8 blocks
- An 8x8 quantization matrix is applied to each block
- Every DCT coefficient of the block is divided by its quantization value and rounded to the nearest integer; de-quantization multiplies the rounded value back
Quantization CUDA Programming
- Method 1: Exact implementation as on the CPU
- Method 2: Use shared memory to copy each 8x8 block of the image
- Method 3: Load the pre-divided values into shared memory
Quantization CUDA Results
Quantization CPU vs GPU Results
Tabulated Results for Quantization
- Method 2 and Method 3 have similar performance on small image sizes
- Method 3 might perform better on images bigger than 2048x2048
- Quantization is ~70x faster for Method 1, and more as resolution increases
- Quantization is ~180x faster for Methods 2 and 3, and more as resolution increases

Image size   Method 1   Method 2   Method 3   CPU     xCPU-1   xCPU-2   xCPU-3
512x512      0.102      0.039      -          7.37    72.25    188.97   -
1024x768     0.274      0.085      -          22      80.29    258.82   -
2048x2048    1.39       0.379      0.36       110     79.14    290.24   305.56
(times in ms; xCPU-n = CPU time divided by Method n time)
Huffman Encode/Decode
Huffman Encoding Basics
- Utilizes the frequency of each symbol
- Lossless compression
- Uses a VARIABLE-length code for each symbol
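Variable-length coding can be illustrated with a toy three-symbol code (a hypothetical alphabet, not JPEG's actual code tables): frequent symbols get short codes, so the number of output bits per symbol varies.

```c
#include <string.h>

/* Hypothetical variable-length code: the most frequent symbol 'A' gets a
   1-bit code, the rarer 'B' and 'C' get 2-bit codes. */
typedef struct { char sym; const char *bits; } code;
static const code enc_table[] = { {'A', "0"}, {'B', "10"}, {'C', "11"} };

/* Append the code bits for each input symbol to out (as '0'/'1' chars). */
static void encode(const char *in, char *out)
{
    out[0] = '\0';
    for (; *in; in++)
        for (size_t i = 0; i < sizeof enc_table / sizeof *enc_table; i++)
            if (enc_table[i].sym == *in) {
                strcat(out, enc_table[i].bits);
                break;
            }
}
```

Because each symbol's output position depends on the total length of everything encoded before it, the bit offsets are only known serially, which is exactly the parallelization problem the next slide describes.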
Challenges
- Encoding is an inherently serial process
- The variable length of the symbols is a problem
- Encoding: we don't know where a symbol's bits should be written until all preceding symbols have been encoded
- Decoding: we don't know where symbols start
ENCODING
Decoding
- We don't know where symbols start, so redundant calculation is needed
- Uses a decoding table rather than a tree: decode a symbol, then shift by n bits
Step 1: Divide the bitstream into overlapping 65-byte segments; run 8 threads on each segment, each with a different starting position
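The decode-then-shift step each thread performs can be sketched with a toy code (a hypothetical table, not JPEG's): look up the next bits in a decoding table, emit the symbol, then advance by that symbol's bit length n. A single-thread sketch, with bits as '0'/'1' chars for clarity:

```c
#include <string.h>

/* Hypothetical prefix-free decoding table: A="0", B="10", C="11". */
typedef struct { const char *bits; char sym; int len; } dentry;
static const dentry dec_table[] = { {"0", 'A', 1}, {"10", 'B', 2}, {"11", 'C', 2} };

/* Repeatedly match the next bits against the table, emit the symbol,
   then shift the read position by the matched code's length. */
static void decode(const char *bits, char *out)
{
    int pos = 0, n = (int)strlen(bits), o = 0;
    while (pos < n) {
        for (size_t i = 0; i < sizeof dec_table / sizeof *dec_table; i++) {
            int len = dec_table[i].len;
            if (pos + len <= n &&
                strncmp(bits + pos, dec_table[i].bits, len) == 0) {
                out[o++] = dec_table[i].sym;
                pos += len;   /* shift by n bits */
                break;
            }
        }
    }
    out[o] = '\0';
}
```

A thread that starts mid-symbol will still decode something, just the wrong thing, which is why Step 2 below must sort valid threads from invalid ones.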
Decoding
Step 2: Determine which threads are valid and throw away the others
Decoding Challenges
- Each segment consumes a fixed number of encoded bits but produces variable-length decoded output: 64 bits can result in up to 64 bytes of output (memory explosion)
- Memory addresses for the input do not advance in a fixed pattern relative to the output addresses (memory collisions)
- The decoding table doesn't fit into one cache line
- Combining the threads' outputs is serial
NOTE: to simplify the algorithm, the maximum symbol length was assumed to be 8 bits (it didn't help much)
Huffman Results
Encoding:
- Step one is very fast: ~100x speedup
- Step two: the algorithm is wrong, so no results
Decoding:
- 3x slower than the classic CPU method
- Using shared memory for the decoding table resolved only some conflicts (5x slower -> 4x slower); conflicts remain on the input bitstream, so either the input or the output data conflicts
- Moving 65-byte chunks to shared memory and sharing each chunk between 8 threads didn't help much (4x slower -> 3x slower)
ENCODING should be left to the CPU
Conclusion & Results
Results (times in ms)

CPU            512x512   1024x768   2048x2048
DCT            3.38      11.05      57.12
Quantization   5.74      17.16      75.97
IDCT           3.34      10.49      56.5

GPU            512x512   1024x768   2048x2048
DCT            0.191     0.47       2.7
Quantization   0.039     0.085      0.379
IDCT           0.171     0.436      2.145

Performance Gain   512x512   1024x768   2048x2048
DCT                17.70     23.51      21.16
Quantization       147.18    201.88     200.45
IDCT               19.53     24.06      26.34
Performance Gain
- DCT and IDCT are the major consumers of computation time
- Computation time increases with resolution
- Total processing time for the 2K image is 5.224 ms on the GPU versus 189.59 ms on the CPU => speedup of ~36x
GPU Performance
- DCT and IDCT still take up the major share of computation cycles, but are reduced by roughly 100x
- 2K-resolution processing time is ~7 ms using the GPU, compared to ~900 ms with the CPU
Conclusion
- The CUDA implementation of the transform and quantization is much faster than the CPU (~36x faster)
- The Huffman algorithm does not parallelize well; final results show it ~3x slower than the CPU
- The GPU architecture is well optimized for image- and video-related processing
- High-performance applications: inter-frame coding, HD-resolution/real-time video compression and decompression
Conclusion – Image Quality
Resolution: 1024x768 [Images: CPU output vs GPU output]
Conclusion – Image Quality
Resolution: 2048x2048
Conclusion – Image Quality
Resolution: 512x512 [Images: CPU output vs GPU output]