JPEG-GPU: A GPGPU IMPLEMENTATION OF JPEG CORE CODING SYSTEMS Ang Li University of Wisconsin-Madison
Outline
- Brief introduction to the background
- Implementation
- Evaluation
- Conclusion
3/20/2013 NVIDIA GTC 2013
Background
- JPEG encoding
- Seeking parallelism
  - Pre-processing: color conversion
  - Block encoding/decoding
Implementation
- Step 1: Find target functions
  - Encode: encode_mcu_huff, encode_one_block, emit_bits_s
  - Decode: decode_mcu_DC_first, decode_mcu_DC_refine
- Profiling with gprof to find other functions
  - Encode: rgb_ycc_convert
  - Decode: ycc_rgb_convert
  - Both take a little under half of the total encoding/decoding execution time
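The color-conversion hot spot found by profiling is a natural candidate for data parallelism, because each output pixel depends only on the matching input pixel. The following is a minimal floating-point sketch of that idea, assuming interleaved 8-bit RGB input and the JFIF full-range YCbCr formulas. The names `rgb_to_ycc` and `clamp_u8` are hypothetical, and this is not the library's rgb_ycc_convert routine, which is organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round to nearest and clamp to the 8-bit range [0, 255]. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0) return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Hypothetical helper: convert interleaved RGB to interleaved YCbCr
 * using the JFIF full-range formulas. Every iteration reads and writes
 * only its own pixel, so the loop can be split across threads. */
void rgb_to_ycc(const uint8_t *rgb, uint8_t *ycc, size_t num_pixels)
{
    assert(rgb != NULL && ycc != NULL);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 0; i < (long)num_pixels; i++) {
        double r = rgb[3 * i];
        double g = rgb[3 * i + 1];
        double b = rgb[3 * i + 2];
        ycc[3 * i]     = clamp_u8( 0.299 * r + 0.587 * g + 0.114 * b);
        ycc[3 * i + 1] = clamp_u8(-0.168736 * r - 0.331264 * g + 0.5 * b + 128.0);
        ycc[3 * i + 2] = clamp_u8( 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0);
    }
}
```

Built without OpenMP support, the loop simply runs sequentially and produces the same output, which makes it easy to compare the two versions.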
Implementation (cont'd)
- Step 2: Parallelize with CUDA
  - First, implement in OpenMP to verify our understanding of the code
    - E.g., in encode_one_block, emit_bits_s changes the state of the system, so running the loop below with multiple threads leads to incorrect results
  - Second, build a baseline GPGPU implementation of all critical functions
  - Third, optimize the GPGPU implementations
    - Using constant memory

    for (k = 1; k <= Se; k++) {
        …
        if (! emit_bits_s(…)) return FALSE;
        …
        if (! emit_bits_s(…)) return FALSE;
        …
        if (! emit_bits_s(…)) return FALSE;
        …
    }
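To illustrate why a stateful bit emitter resists naive parallelization, here is a minimal sketch of a bit writer. The `BitWriter` type and the `emit_bits` function are hypothetical stand-ins, not the library's emit_bits_s, and JPEG details such as byte stuffing and final padding are left out. The point is only that every call reads and updates the accumulator left by the previous call, so the output depends on call order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit writer; field names are illustrative only. */
typedef struct {
    uint8_t *out;   /* destination buffer */
    size_t   pos;   /* index of the next byte to write */
    uint32_t acc;   /* pending bits, newest bits in the low end */
    int      nbits; /* number of pending bits in acc (always < 8 between calls) */
} BitWriter;

/* Append the low `size` bits of `code` to the stream.
 * The call both reads and modifies w->acc, w->nbits and w->pos,
 * which is the shared state that makes concurrent calls unsafe. */
void emit_bits(BitWriter *w, uint32_t code, int size)
{
    assert(w != NULL && size > 0 && size <= 16);
    w->acc = (w->acc << size) | (code & ((1u << size) - 1u));
    w->nbits += size;
    while (w->nbits >= 8) {
        /* Flush the oldest 8 pending bits as one output byte. */
        w->out[w->pos++] = (uint8_t)(w->acc >> (w->nbits - 8));
        w->nbits -= 8;
    }
}
```

Emitting the 3-bit code 101 followed by the 5-bit code 11111 yields the byte 0xBF, while the reverse order yields 0xFD. Threads that interleave their calls unpredictably would therefore corrupt the stream even if each call were made atomic.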
Evaluation
- Evaluation environment
  - CPU: Intel Nehalem Xeon E GHz processor
  - GPU: Tesla K20c
- Picture used
  - My favorite picture
  - Compressing: 1280 x 768 pixels
  - Decompressing: the output of the compression step
- Correctness checked with "diff"
Evaluation (cont'd)
[Table: Compress and Decompress timings for Sequential, OpenMP, GPGPU Base, and GPGPU Optimized]
- Timings are in milliseconds, averaged over 10 executions
- Four threads are forked for the OpenMP implementation
- For both GPU implementations, launch configurations are tuned for best performance
- Results discussion
  - OpenMP is fastest. The baseline GPGPU version degrades performance, and the "optimized" version degrades it further (due to serialized constant memory accesses).
- Observations from working through the code
  - Each kernel launch deals with at most 250 elements, which is too fine-grained.
  - Kernel launch is expensive (allocating and copying the data).
  - OpenMP will be better off as long as there is enough parallelism and the loop iterations are not extremely trivial.
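The measurement procedure described on this slide (milliseconds, averaged over 10 executions) can be sketched as a small timing harness. This is an assumption about how such numbers might be collected, not the harness used for the talk. `average_ms` and `sum_workload` are hypothetical names, and `sum_workload` is only a stand-in for an encode or decode call. Wall-clock time is used rather than `clock()`, since CPU time would add up the work of all OpenMP threads.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Current wall-clock time in milliseconds (C11 timespec_get). */
static double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return 1000.0 * (double)ts.tv_sec + (double)ts.tv_nsec / 1.0e6;
}

/* Hypothetical harness: run `work(arg)` `runs` times and return the
 * average wall-clock time per run in milliseconds. */
double average_ms(void (*work)(void *), void *arg, int runs)
{
    assert(work != NULL && runs > 0);
    double start = now_ms();
    for (int i = 0; i < runs; i++)
        work(arg);
    return (now_ms() - start) / runs;
}

/* Stand-in workload: sums 0..n-1 into a volatile accumulator so the
 * compiler cannot remove the loop. Replace with the code under test. */
void sum_workload(void *arg)
{
    long n = *(long *)arg;
    volatile long total = 0;
    for (long i = 0; i < n; i++)
        total += i;
}
```

A call such as `average_ms(sum_workload, &n, 10)` mirrors the 10-run averaging. For GPU code, the timed region would also need to include allocation and data copies, since those are part of the launch cost noted in the observations.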
Conclusion
- For the JPEG encoding/decoding core system, the GPGPU implementations degrade performance.
- Coarser-grained parallelism is required.
- OpenMP acceleration can be applied easily to gain some performance.
Thank you. Ang Li 3/20/2013 NVIDIA GTC