Dongyue Mou and Zeng Xing

cujpeg: A Simple JPEG Encoder With CUDA Technology

Outline JPEG Algorithm Traditional Encoder What's new in cujpeg Benchmark Conclusion

JPEG Algorithm
JPEG is a commonly used method for image compression. The JPEG encoding algorithm consists of 7 steps:
1. Divide the image into 8x8 blocks
2. [R,G,B] to [Y,Cb,Cr] conversion
3. Downsampling (optional)
4. FDCT (Forward Discrete Cosine Transform)
5. Quantization
6. Serialization in zig-zag style
7. Entropy encoding (Run Length Coding & Huffman coding)

JPEG Algorithm -- Example This is an example

Divide into 8x8 blocks This is an example

RGB vs. YCC
For the human eye, the precision of colors matters less than the precision of contours (which are based on luminance). Color space conversion makes use of this.
Simple color space model: [R,G,B] per pixel.
JPEG uses the [Y, Cb, Cr] model: Y = brightness, Cb = color blueness, Cr = color redness.

Convert RGB to YCC
8x8 pixels; 1 pixel = 3 components. MCU with sampling factor (1, 1, 1).
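
A minimal sketch of the per-pixel conversion in plain C, assuming the JFIF (BT.601) weights and a -128 level shift on Y so that all three components are centred on zero. The type `YccPixel` and the function `rgb_to_ycc` are illustrative names, not part of cujpeg.

```c
/* One pixel in the [Y, Cb, Cr] model, stored as floats. */
typedef struct { float y, cb, cr; } YccPixel;

/* Convert one 8-bit RGB pixel using the JFIF (BT.601) weights.
   Y is shifted by -128 so that it is centred on 0 like Cb and Cr. */
YccPixel rgb_to_ycc(unsigned char r, unsigned char g, unsigned char b)
{
    YccPixel p;
    p.y  =  0.299f    * r + 0.587f    * g + 0.114f    * b - 128.0f;
    p.cb = -0.168736f * r - 0.331264f * g + 0.5f      * b;
    p.cr =  0.5f      * r - 0.418688f * g - 0.081312f * b;
    return p;
}
```

A grey pixel (r = g = b) should give Cb and Cr close to zero, since the chroma weights in each row sum to zero.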

Downsampling
Y is taken for every pixel, while Cb and Cr are taken once per block of 2x2 pixels.
MCU (minimum coded unit): the smallest group of data units that is coded.
With sampling factor (2, 1, 1), an MCU covers 16x16 pixels (4 Y blocks), and the data size is reduced to a half immediately.
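
One way to realise the (2, 1, 1) sampling is to average each 2x2 group of chroma samples. The slides do not say whether averaging or picking a single sample is intended, so the averaging below is an assumption, and `downsample_2x2` and `blocks_per_mcu` are hypothetical helper names. The second function shows where the "half" comes from.

```c
/* Average each 2x2 group of a 16x16 chroma plane (row-major, 256 floats)
   into one 8x8 block (row-major, 64 floats). */
void downsample_2x2(const float *src, float *dst)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            const float *p = src + (2 * y) * 16 + 2 * x;
            dst[y * 8 + x] = 0.25f * (p[0] + p[1] + p[16] + p[17]);
        }
}

/* Number of 8x8 blocks needed for one 16x16-pixel area. */
int blocks_per_mcu(int downsampled)
{
    int y_blocks = 4;                         /* 16x16 luma = four 8x8 blocks */
    int chroma_blocks = downsampled ? 1 : 4;  /* per chroma channel */
    return y_blocks + 2 * chroma_blocks;
}
```

Without downsampling the area needs 12 blocks; with it, 4 Y blocks plus 1 Cb and 1 Cr block, which is 6.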

Apply FDCT
2D DCT: the bottleneck; the complexity of the algorithm is O(n^4).
1D DCT: the 2-D transform is equivalent to the 1-D transform applied in each direction.
The kernel uses 1-D transforms.

Apply FDCT
Shift operation: from [0, 255] to [-128, 127].
DCT result: [figure showing the meaning of each position in the DCT result matrix]
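
The level shift and the row-column idea can be sketched in plain C as follows. This is a straightforward, unoptimised 1-D DCT applied first to rows and then to columns, not the fast butterfly used in the cujpeg kernel. The cosine values are tabulated to avoid a math-library dependency, and all function names are illustrative.

```c
/* cos(m * pi / 16) for m = 0..8. */
static const float COS_TABLE[9] = {
    1.0f, 0.98078528f, 0.92387953f, 0.83146961f, 0.70710678f,
    0.55557023f, 0.38268343f, 0.19509032f, 0.0f
};

/* cos(m * pi / 16) for any m >= 0, using the symmetries of the cosine. */
static float cos_pi16(int m)
{
    m %= 32;                                   /* period 2*pi            */
    if (m > 16) m = 32 - m;                    /* cos(2*pi - x) = cos(x) */
    return (m <= 8) ? COS_TABLE[m] : -COS_TABLE[16 - m];
}

/* Orthonormal 8-point DCT-II on 8 values spaced `step` floats apart. */
static void dct_1d(float *v, int step)
{
    float out[8];
    for (int k = 0; k < 8; k++) {
        float sum = 0.0f;
        for (int n = 0; n < 8; n++)
            sum += v[n * step] * cos_pi16((2 * n + 1) * k);
        out[k] = sum * (k == 0 ? 0.35355339f : 0.5f);
    }
    for (int k = 0; k < 8; k++) v[k * step] = out[k];
}

/* Shift an 8x8 block from [0, 255] to [-128, 127], then transform
   every row (step 1) and every column (step 8). */
void fdct_8x8(float block[64])
{
    for (int i = 0; i < 64; i++) block[i] -= 128.0f;
    for (int row = 0; row < 64; row += 8) dct_1d(block + row, 1);
    for (int col = 0; col < 8; col++)     dct_1d(block + col, 8);
}
```

For a block of constant value, only position 0 (the DC coefficient) should be non-zero after the transform.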

Quantization
Each value of the DCT result is divided by the corresponding entry of the quantization matrix (adjustable according to quality) and rounded, giving the quantization result.
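
A small sketch of quantization in C. It assumes the example luminance table from Annex K of the JPEG standard and the quality scaling rule used by libjpeg; whether cujpeg uses exactly this table and rule is not stated in the slides. `scaled_q` and `quantize` are hypothetical names.

```c
/* Example luminance quantization table (JPEG standard, Annex K). */
static const int BASE_Q[64] = {
    16, 11, 10, 16,  24,  40,  51,  61,
    12, 12, 14, 19,  26,  58,  60,  55,
    14, 13, 16, 24,  40,  57,  69,  56,
    14, 17, 22, 29,  51,  87,  80,  62,
    18, 22, 37, 56,  68, 109, 103,  77,
    24, 35, 55, 64,  81, 104, 113,  92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103,  99
};

/* Scale one table entry for a quality setting in 1..100
   (libjpeg-style rule, assumed here). */
int scaled_q(int base, int quality)
{
    int scale = (quality < 50) ? 5000 / quality : 200 - 2 * quality;
    int q = (base * scale + 50) / 100;
    if (q < 1)   q = 1;
    if (q > 255) q = 255;
    return q;
}

/* Divide each DCT coefficient by its table entry and round to nearest. */
void quantize(const float dct[64], int quality, int out[64])
{
    for (int i = 0; i < 64; i++) {
        float v = dct[i] / (float)scaled_q(BASE_Q[i], quality);
        out[i] = (int)(v < 0.0f ? v - 0.5f : v + 0.5f);
    }
}
```

Higher quality settings shrink the divisors, so fewer coefficients round to zero and the later coding stages have more data to store.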

Zigzag reordering / Run Length Coding
The quantization result is read in zig-zag order, and each non-zero value is stored as a pair [number of zeros before me, my value].
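
The pairing can be sketched in C as below. This is a simplification: it treats all 64 coefficients alike, whereas a full JPEG encoder codes the first (DC) coefficient separately and splits zero runs longer than 15. `RunPair` and `zigzag_rle` are illustrative names.

```c
/* Zig-zag scan order: ZIGZAG[i] is the row-major index of the
   i-th coefficient visited. */
static const int ZIGZAG[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

typedef struct { int zeros; int value; } RunPair;

/* Emit one [zeros before me, my value] pair per non-zero coefficient.
   Returns the number of pairs; trailing zeros are implied by EOB. */
int zigzag_rle(const int block[64], RunPair out[64])
{
    int count = 0, zeros = 0;
    for (int i = 0; i < 64; i++) {
        int v = block[ZIGZAG[i]];
        if (v == 0) { zeros++; continue; }
        out[count].zeros = zeros;
        out[count].value = v;
        count++;
        zeros = 0;
    }
    return count;
}
```

Because quantization zeroes most high-frequency coefficients, the zig-zag order tends to leave one long run of zeros at the end, which costs only a single end-of-block marker.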

Huffman encoding

Values                        G    Real saved values
-1, 1                         1    0, 1
-3, -2, 2, 3                  2    00, 01, 10, 11
-7, -6, -5, -4, 4, 5, 6, 7    3    000, 001, 010, 011, 100, 101, 110, 111
...                           4    ...
...                           5    ...
...
-32767..32767                 15   ...

RLC result: [0, -3] [0, 12] [0, 3] ... EOB
After group number added: [0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] ... EOB
First Huffman coding (e.g. for [0, 2, 00b]): [0, 2, 00b] => [100b, 00b] (looked up in e.g. the AC chrominance table)
Total input: 512 bits, output: 113 bits
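
The group number and the saved bits in the table follow a simple rule, sketched here in C. The function names `group_of` and `extra_bits` are hypothetical; the rule itself is the standard JPEG one, in which a negative value v in group g is stored as v + (2^g - 1).

```c
/* Group (size category): number of bits needed for |v|; 0 for v == 0. */
int group_of(int v)
{
    int mag = v < 0 ? -v : v;
    int g = 0;
    while (mag) { g++; mag >>= 1; }
    return g;
}

/* The g bits saved after the Huffman code: v itself when positive,
   v + (2^g - 1) when negative. */
unsigned extra_bits(int v)
{
    int g = group_of(v);
    return (unsigned)(v >= 0 ? v : v + (1 << g) - 1);
}
```

For the values in the example, -3 falls in group 2 with bits 00b, 12 in group 4 with bits 1100b, and 3 in group 2 with bits 11b.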

Traditional Encoder
All stages run on the CPU: Image -> Load image -> Color conversion -> DCT -> Quantization -> Zigzag reorder -> Encoding -> .jpg

Algorithm Analysis
1x full 2D DCT scan: O(N^4).
8x row 1D DCT scan + 8x column 1D DCT scan: O(N^3); 8 threads can work in parallel.
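
The gap between the two approaches can be made concrete by counting multiplications for straightforward (non-fast) transforms. The sketch below is only a counting aid with hypothetical function names.

```c
/* Direct 2-D DCT of an n x n block: n*n outputs, each a sum
   over n*n inputs. */
long direct_2d_mults(int n) { return (long)n * n * n * n; }

/* Row-column method: n row transforms plus n column transforms,
   each a 1-D DCT costing n*n multiplications. */
long row_column_mults(int n) { return 2L * n * n * n; }
```

For n = 8 this gives 4096 multiplications for the direct form against 1024 for the row-column form, and the 8 row (or column) transforms are independent of each other, which is what lets 8 threads share one block.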

DCT In Place

// C_a .. C_f and C_norm are the DCT cosine and normalization constants;
// their definitions are not shown on the slide.

// 8-point 1-D DCT, in place, on 8 values spaced Step floats apart.
__device__ void vectorDCTInPlace(float *Vect0, int Step)
{
    float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
    float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
    float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
    float *Vect7 = Vect6 + Step;

    // Butterfly stage: sums (P) and differences (M) of mirrored inputs.
    float X07P = (*Vect0) + (*Vect7);
    float X16P = (*Vect1) + (*Vect6);
    float X25P = (*Vect2) + (*Vect5);
    float X34P = (*Vect3) + (*Vect4);
    float X07M = (*Vect0) - (*Vect7);
    float X61M = (*Vect6) - (*Vect1);
    float X25M = (*Vect2) - (*Vect5);
    float X43M = (*Vect4) - (*Vect3);

    float X07P34PP = X07P + X34P;
    float X07P34PM = X07P - X34P;
    float X16P25PP = X16P + X25P;
    float X16P25PM = X16P - X25P;

    // Even-indexed outputs come from the sums, odd-indexed from the differences.
    (*Vect0) = C_norm * (X07P34PP + X16P25PP);
    (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
    (*Vect4) = C_norm * (X07P34PP - X16P25PP);
    (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);
    (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
    (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
    (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
    (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
}

// Full 8x8 DCT by a single thread: 8 row passes, then 8 column passes.
__device__ void blockDCTInPlace(float *block)
{
    for (int row = 0; row < 64; row += 8)
        vectorDCTInPlace(block + row, 1);
    for (int col = 0; col < 8; col++)
        vectorDCTInPlace(block + col, 8);   // column elements are 8 floats apart
}

// Same transform shared by 8 threads: each thread handles one row, then one column.
__device__ void parallelDCTInPlace(float *block)
{
    int col = threadIdx.x % 8;
    int row = col * 8;
    __syncthreads();                        // wait until the block is fully loaded
    vectorDCTInPlace(block + row, 1);
    __syncthreads();                        // all rows must be done before any column pass
    vectorDCTInPlace(block + col, 8);
}

Allocation
Desktop PC: CPU: 1 P4 core, 3.0 GHz; RAM: 2 GB
Graphics card: GPU: 16 cores at 575 MHz, 8 SPs per core at 1.35 GHz; RAM: 768 MB

Binding
Huffman encoding: many conditions/branches, intensive bit operations, less computing.
Color conversion, DCT, quantization: intensive computing, fewer conditions/branches.

Binding
Hardware: 16KB shared memory. Problem: 1 MCU contains 702 bytes of data. Result: at most 21 MCUs per CUDA block.
Hardware: 512 threads. Problem: 1 MCU contains 3 blocks, and 1 block needs 8 threads. Result: 1 MCU needs 24 threads.
1 CUDA block = 21 MCUs x 24 threads = 504 threads.
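
The budget above can be checked with a few lines of C. The limits and per-MCU figures are the ones quoted on the slide; the function and constant names are hypothetical.

```c
enum {
    MAX_THREADS_PER_CUDA_BLOCK = 512,       /* hardware limit from the slide */
    SHARED_MEM_BYTES           = 16 * 1024, /* 16KB shared memory            */
    BLOCKS_PER_MCU             = 3,         /* Y, Cb, Cr at sampling (1,1,1) */
    THREADS_PER_8X8_BLOCK      = 8          /* one thread per row and column */
};

int threads_per_mcu(void)
{
    return BLOCKS_PER_MCU * THREADS_PER_8X8_BLOCK;
}

/* MCUs that fit in one CUDA block: the tighter of the thread limit
   and the shared-memory limit. */
int mcus_per_cuda_block(int bytes_per_mcu)
{
    int by_threads = MAX_THREADS_PER_CUDA_BLOCK / threads_per_mcu();
    int by_memory  = SHARED_MEM_BYTES / bytes_per_mcu;
    return by_threads < by_memory ? by_threads : by_memory;
}
```

With 702 bytes per MCU, shared memory alone would allow 23 MCUs, so in this calculation it is the 512-thread limit that caps a CUDA block at 21 MCUs, which is 504 threads.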

cujpeg Encoder
CPU: Image -> Load image
GPU: Color conversion -> DCT -> Quantization -> Zigzag reorder
CPU: Encoding -> .jpg

cujpeg Encoder: data flow and code

Data flow: Image (host memory) -> load image into texture memory -> color conversion into shared memory -> in-place DCT -> quantization and reorder -> result in global memory -> copy to host memory -> encoding on the CPU -> .jpg

Host side, load the image into a texture and allocate the result buffer:

    cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
    cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride, imageWidth, imageHeight, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(TexSrc, textureCache, channel);
    cudaMalloc((void **)(&ResultDevice), ResultSize);

GPU, color conversion (texture memory to shared memory):

    int b = tex2D(TexSrc, TexPosX++, TexPosY);
    int g = tex2D(TexSrc, TexPosX++, TexPosY);
    int r = tex2D(TexSrc, TexPosX+=6, TexPosY);
    float y  =  0.299*r    + 0.587*g     + 0.114*b - 128.0 + 0.5;
    float cb = -0.168736*r - 0.331264*g  + 0.500*b + 0.5;
    float cr =  0.500*r    - 0.418688f*g - 0.081312*b + 0.5;
    myDCTLine[Offset + i]       = y;
    myDCTLine[Offset + 64 + i]  = cb;
    myDCTLine[Offset + 128 + i] = cr;   // third plane holds Cr

GPU, quantize and reorder (after the in-place DCT):

    for (int i = 0; i < BLOCK_WIDTH; i++)
        myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

Host side, copy the result back for encoding:

    cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);

Scheduling
For each MCU (24 threads):
1. 24 threads: convert 2 pixels each
2. 8 threads: convert the remaining 2 pixels each
3. Each thread: do 1x row vector DCT
4. Each thread: do 1x column vector DCT
5. Each thread: quantize 8x scalar values
Data flow (x24 threads at each stage): RGB data -> YCC block (Y, Cb, Cr) -> DCT block (Y, Cb, Cr) -> quantized/reordered data

GPU Occupancy
Threads per block: 504
Registers per thread: 16
Shared memory per block (bytes): 16128
Active thread blocks per multiprocessor: 1
Occupancy of each multiprocessor: 67%
The slide also lists active threads per multiprocessor, active warps per multiprocessor, and maximum simultaneous blocks per GPU; their values are missing here.
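
The 67% figure can be reproduced with a simplified occupancy estimate. The per-multiprocessor limits below (24 warps, 8192 registers, 16KB shared memory) are assumed values for a compute capability 1.0 device; the slides do not name the GPU model. The sketch also ignores allocation granularity, and all names are hypothetical.

```c
enum {
    WARP_SIZE        = 32,
    MAX_WARPS_PER_SM = 24,     /* assumed: compute capability 1.0 */
    REGS_PER_SM      = 8192,   /* assumed */
    SHARED_PER_SM    = 16384   /* 16KB, as on the Binding slide */
};

int warps_per_block(int threads)
{
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}

/* Blocks that can be resident on one multiprocessor at once. */
int blocks_per_sm(int threads, int regs_per_thread, int shared_bytes)
{
    int by_warps  = MAX_WARPS_PER_SM / warps_per_block(threads);
    int by_regs   = REGS_PER_SM / (threads * regs_per_thread);
    int by_shared = SHARED_PER_SM / shared_bytes;
    int n = by_warps;
    if (by_regs   < n) n = by_regs;
    if (by_shared < n) n = by_shared;
    return n;
}

/* Active warps as a rounded percentage of the maximum. */
int occupancy_percent(int threads, int regs_per_thread, int shared_bytes)
{
    int active = blocks_per_sm(threads, regs_per_thread, shared_bytes)
               * warps_per_block(threads);
    return (100 * active + MAX_WARPS_PER_SM / 2) / MAX_WARPS_PER_SM;
}
```

Under these assumptions, 504 threads form 16 warps, only one block fits per multiprocessor, and 16 of 24 warps gives about 67%.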

Benchmark (Q = 80, Sample = 1:1:1)

           512x512   1024x1024   2048x2048   4096x4096
cujpeg     0.321s    0.376s      0.560s      1.171s
libjpeg    0.121s    0.237s      0.804s      3.971s

Benchmark
Time consumption (4096x4096):

                Load     Transfer   Compute     Encode   Total
Quality = 100   0.132s   0.348s     0.043s      0.837s   1.523s
Quality = 80    0.121s   0.324s     (missing)   0.480s   1.123s
Quality = 50    0.130s   0.353s     0.044s      0.468s   1.167s

Each thread performs 240 operations, and 24 threads process 1 MCU.
A 4096x4096 image contains 262144 MCUs.
Total ops: 262144 * 24 * 240 = 1509949440 flops
Speed: (total ops) / 0.043 s = 35.12 Gflops
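
The throughput arithmetic can be reproduced in C as a quick check, using only the figures quoted on the slide. The function names are hypothetical.

```c
/* Number of 8x8 MCUs in an image at sampling (1, 1, 1). */
long long mcu_count(int width, int height)
{
    return (long long)(width / 8) * (height / 8);
}

/* Total floating-point operations for the compute stage. */
long long total_ops(long long mcus, int threads_per_mcu, int ops_per_thread)
{
    return mcus * threads_per_mcu * ops_per_thread;
}

/* Throughput in Gflops for a given compute time in seconds. */
double gflops(long long ops, double seconds)
{
    return (double)ops / seconds / 1e9;
}
```

For a 4096x4096 image this gives 262144 MCUs and 1509949440 operations; dividing by the 0.043 s compute time yields roughly 35.1 Gflops.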

Conclusion
CUDA can clearly accelerate JPEG compression.
The overall performance depends on the system speed. It would benefit from:
- More bandwidth
- A better encoding routine
- Support for downsampling