Published by Ezra McKenzie. Modified over 9 years ago.
1
JPEG Encoder Accelerator
Advanced Embedded Systems Architecture, EE-382N-4, Fall 2009
Anup P. Joshi, Chandra Bhushan Prakash, Karthick Santhanam, Pratap Ramanathan
2
Overview
- JPEG Encoding Process
- JPEG Encoder Accelerator
- Existing Architecture
- Proposed Architectures
- Implementation
- Results
- Conclusion
3
JPEG Overview
- A raw image takes a large number of bytes to represent
- JPEG is a standardized image compression mechanism
- Exploits known limitations of the human eye: small color changes are perceived less accurately than small changes in brightness
- Lossy method, but achieves much greater compression than GIF, BMP, etc.
- Stores 24-bit-per-pixel color data instead of 8-bit-per-pixel data
- 24 bits per pixel gives about 16 million colors, compared to 256 or fewer colors
- Disadvantage: repeated compression and decompression degrades image quality
4
Encoding scheme:
5
Step 1: Image Pixels
Figure panels: source image, division into 8x8 blocks, one 8x8 block
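The block division in Step 1 can be sketched as follows. This is a minimal illustration, not the project's code: the function name `split_into_blocks` and the flat row-major pixel list are assumptions, and the image dimensions are assumed to be multiples of 8.

```python
def split_into_blocks(pixels, width, height, block=8):
    """Yield block x block tiles in raster order (left to right, top to bottom).

    `pixels` is a flat, row-major list of length width * height.
    Assumption: width and height are multiples of `block`.
    """
    if width % block or height % block:
        raise ValueError("dimensions must be multiples of the block size")
    for by in range(0, height, block):
        for bx in range(0, width, block):
            # Each tile is itself flattened row by row (64 values for 8x8).
            yield [
                pixels[(by + y) * width + (bx + x)]
                for y in range(block)
                for x in range(block)
            ]


# A 16x16 test image whose pixel value equals its row-major index.
image = list(range(16 * 16))
blocks = list(split_into_blocks(image, 16, 16))
```

Each yielded tile holds 64 values in the same left-to-right, top-to-bottom order in which the encoder expects its pixels.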
6
Step 2: Color Space Transform
- The color of each pixel is a 3-D vector (R, G, B)
- There is significant correlation between these components
- A color space transform produces a new vector: luminance Y, plus blue and red chrominance, Cb and Cr
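A minimal sketch of the transform in Step 2. The slides do not give the coefficients; the values below are the full-range constants commonly used with JPEG/JFIF and are an assumption here. The hardware core may well use fixed-point approximations instead.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr) as floats.

    Assumption: JFIF-style full-range coefficients; not taken from the slides.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr


gray = rgb_to_ycbcr(128, 128, 128)  # neutral gray: chrominance sits at mid-scale
red = rgb_to_ycbcr(255, 0, 0)       # pure red: Cr above mid-scale, Cb below
```

For a gray pixel both chrominance values land on the 128 offset, which shows how the transform separates brightness from color.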
7
Step 3: DCT
- Use sequential DCT to transform each block into a set of 64 values (DCT coefficients)
- One DC coefficient: a measure of the block's average value
- 63 AC coefficients, corresponding to the non-zero spatial frequencies; the higher-frequency ones tend to be zero or near zero for most natural images
Step 4: Quantizer
- Each of the 64 coefficients is quantized using its corresponding value from a 64-entry quantization table
- Facilitates greater compression, but is lossy (most coefficients are rounded to zero)
Step 5: Encoder
- 'Huffman' encoder: most popular
- The previous block's quantized DC coefficient is used to predict the current DC coefficient; only the difference is encoded
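Steps 3 to 5 can be illustrated with a small software model. This is a sketch under stated assumptions: a naive floating-point DCT-II (not the pipelined hardware DCT), a hypothetical uniform quantization table, no level shift, and only the DC prediction part of the entropy coder. All function names are illustrative.

```python
import math

N = 8


def dct_2d(block):
    """Naive 2-D DCT-II of an 8x8 block given as 8 rows of 8 numbers."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out


def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round to an integer."""
    return [[round(coeffs[u][v] / table[u][v]) for v in range(N)]
            for u in range(N)]


def dc_differences(dc_values):
    """Replace each DC value by its difference from the previous block's DC."""
    prev = 0
    diffs = []
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


flat_block = [[100] * N for _ in range(N)]   # constant block: no AC energy
uniform_table = [[10] * N for _ in range(N)] # hypothetical table, every entry 10
q = quantize(dct_2d(flat_block), uniform_table)
diffs = dc_differences([80, 82, 79])
```

For the constant block, only the DC term survives quantization. The `prev` variable in `dc_differences` is the block-to-block dependency inside the entropy coder.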
8
Accelerator considerations
Hardware vs. software:
- A pure software implementation is generally slower than a hardware-based one
- A dedicated hardware functional unit (accelerator) is faster
Enhanced architectural options:
1. Pipelining the JPEG encoder: already done
2. Going for a different architecture/microarchitecture
3. Pipelining individual blocks in the encoder
We chose the 2nd option due to constraints in the design (more in the following slides)
9
Existing Pipelined Encoder: Open Source
- Design files acquired from Opencores.org
- Pipelined encoder: Verilog source files
- Existing architecture for encoding:
10
Existing Implementation Details
- Input to the encoder (data_in) is a 24-bit data bus with 8 bits each for the red, green and blue components of a pixel
- Follows the sequential DCT-based mode: inputs start with the top-left 8x8 block of the image, beginning with its top-left pixel, going to the right, then down to the second row, etc.
- Input data for the 1st 8x8 block of pixels is sent over 64 consecutive clocks
- After the data for the first block is sent, a delay of 33 clock cycles is incurred by the encoding process (Huffman) before the next block can be sent
- The Huffman encoder encodes values based on the previous block's output, which introduces the dependency and the delay
- A candidate for improvement
- Output: JPEG_bitstream, 32 bits produced by the Huffman encoder
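The input timing described above amounts to one block every 64 + 33 = 97 clocks. A minimal model of when each block may start entering the core (the constant and function names are illustrative, and any fixed pipeline latency is ignored):

```python
INPUT_CYCLES = 64    # one pixel per clock for an 8x8 block
HUFFMAN_DELAY = 33   # wait imposed before the next block may be sent


def existing_input_schedule(num_blocks):
    """Clock cycle at which each block starts entering the existing encoder."""
    period = INPUT_CYCLES + HUFFMAN_DELAY  # 97 cycles per block
    return [i * period for i in range(num_blocks)]


starts = existing_input_schedule(3)
```

About a third of every 97-cycle period is spent waiting rather than streaming pixels.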
11
Experimental architecture #1:
- Insert a buffer between the quantizer and the Huffman encoder so that the Huffman encoder's input does not change for 97 cycles
- But the quantizer output changes every 64 cycles
- Hence loss of data!
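Why a buffer alone loses data can be seen with a toy model. It assumes a single-entry buffer that latches the quantizer output once every 97 cycles while the quantizer emits a new block every 64 cycles; the actual buffer design is not described in the slides, so the names and the latching rule are assumptions.

```python
def captured_blocks(num_latches, produce_period=64, hold_period=97):
    """Index of the quantizer block visible at each buffer latch instant."""
    return [(j * hold_period) // produce_period for j in range(num_latches)]


captured = captured_blocks(5)
# Blocks produced up to the last latch that the buffer never saw.
missed = sorted(set(range(captured[-1] + 1)) - set(captured))
```

In this model roughly every third block is overwritten before the buffer samples it.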
12
Architecture #2:
- Split the image bitstream equally across 2 parallel paths with replicated functional units
- Equivalent to using 2 encoders, although the delay within each encoder still remains
- Gross over-usage of silicon area, with additional overhead on software too
13
Architecture #3: Two Huffman blocks
- Eliminates the bottleneck: removes the delay between feeding two blocks of data
- The individual Huffman blocks are driven alternately:
  - 1st Huffman block for every odd 8x8 pixel block
  - 2nd Huffman block for every even 8x8 pixel block
- Negligible loss in compression: the two Huffman blocks each start from a separate first input
- Timing: 64 cycles for accumulation, 97 cycles in each Huffman block, some cycles for synchronization
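The alternating scheme can be checked with a small scheduling model. It assumes each Huffman block is busy for 97 cycles per 8x8 block and that a new block is issued every 64 cycles plus a synchronization overhead. The value 8 is the overhead reported on the new-architecture simulation slide, and treating it as clock cycles is an assumption. All names are illustrative.

```python
INPUT_CYCLES = 64     # accumulation of one 8x8 block
HUFFMAN_CYCLES = 97   # assumed busy time of one Huffman block per 8x8 block
SYNC_CYCLES = 8       # synchronization overhead (assumed to be in cycles)


def dual_huffman_schedule(num_blocks):
    """Return (block index, Huffman unit, start cycle) for each 8x8 block."""
    period = INPUT_CYCLES + SYNC_CYCLES
    free_at = [0, 0]  # cycle at which each Huffman unit becomes idle
    schedule = []
    for i in range(num_blocks):
        unit = i % 2  # unit 0 takes the 1st, 3rd, ... block; unit 1 the rest
        start = i * period
        if start < free_at[unit]:
            raise RuntimeError(f"Huffman unit {unit} still busy at cycle {start}")
        free_at[unit] = start + HUFFMAN_CYCLES
        schedule.append((i, unit, start))
    return schedule


schedule = dual_huffman_schedule(4)
```

Each unit sees a new block only every 2 x 72 = 144 cycles, comfortably more than the 97 it needs, so no stall is raised.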
14
Implementation details
- Transform the source image into the required R, G, B bit stream for each pixel
- Process it in the design (hardware)
- Generate the encoded bit stream for every pixel
- Reconstruct the image from the output of the hardware implementation
15
Conversion of image to R, G, B bitstream
- In Matlab: generated the bit information using the imread() function
- Generates a text file 'bits.txt' containing 24-bit data for every pixel
- Properly formatted and supplied to the design via the testbench
- Source image: TIFF format (file size: 28 KB)
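The team did this step in Matlab; the sketch below shows the same idea in Python. The exact layout of 'bits.txt' is not given in the slides, so the format used here (one 24-bit binary string per pixel, red in the top byte) and the helper names are assumptions.

```python
def pack_rgb(r, g, b):
    """Pack three 8-bit components into one 24-bit word (R in the top byte)."""
    for component in (r, g, b):
        if not 0 <= component <= 255:
            raise ValueError("components must be 8-bit values")
    return (r << 16) | (g << 8) | b


def to_text_lines(pixels):
    """Format (r, g, b) tuples as 24-character binary strings, one per pixel."""
    return [format(pack_rgb(*p), "024b") for p in pixels]


lines = to_text_lines([(255, 0, 0), (0, 128, 1)])
```

Writing `"\n".join(lines)` to a file would give a testbench one 24-bit word per line to drive onto data_in.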
16
Simulation results (existing architecture):
- The 'enable' signal should be brought high when the data for the first pixel of the image is ready
- 'enable' needs to stay high while data is being input to the core
- Each 8x8 block of data must be input to the core on 64 consecutive clock cycles
- It takes an additional 33 clocks to produce the JPEG bitstream for the 64 pixels of one input block
- Overall clock consumption (for this example): 143,120,000 / 10,000 = 14312 clocks
17
Simulation results (new architecture):
- Alternates between the 2 Huffman encoder blocks
- Introduced 2 data_ready signals, one for each of the two JPEG bitstreams coming out of the 2 Huffman encoder blocks
- Overhead in synchronizing the two Huffman encoders: only eight!
- Overall clock consumption: 107,120,000 / 10,000 = 10712 clocks
18
Synthesis results:
19
Reconstructing the image
- Ideal reconstruction: implement a decoder, but this is functionally complex (excessive design time)
- Alternative way to verify functionality: software (Matlab)
- Reconstruct the image from the generated bitstream, giving us the much-anticipated "JPEG image"
- Image reconstruction performed in Matlab
- Verify against the input image (quality and compression)
20
Image reconstruction (software):
Figure labels: JPEG Bitstream_odd, JPEG Bitstream_even, Reconstruct, Merge, JPEG format
21
Original Image vs. JPEG Encoded Image
- Original image: size 28 KB, TIFF format
- JPEG encoded image: size 3 KB, JPEG format
22
Performance comparison of architectures
Existing:
- Frequency: ~68 MHz
- For the test image, total clocks consumed = 14312
- Total area = 1 374 028.8 sq. μm (based on Design Vision synthesis)
New:
- Frequency: ~68 MHz
- For the test image, total clocks consumed = 10712
- Total area = 1 634 796.8 sq. μm (based on Design Vision synthesis)
Result summary:
- Overall savings in clock cycles (acceleration): 3600
- Savings per 8x8 block = 3600 / 144 = 25 clock cycles
- Overall increase in area (in terms of NAND1 gates) = (1 634 796.8 / 1.8772) - (1 374 028.8 / 1.8772) = 138 913.275
- Change in power consumption: ???
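The figures in this summary can be re-derived with a few lines of arithmetic. The speedup and percentage area increase below are not stated on the slides; they are computed from the reported numbers. The per-block periods of 97 and 72 cycles (64 + 33 and 64 + 8) are an interpretation of the earlier slides, used here only as a consistency check.

```python
BLOCKS = 144  # 8x8 blocks in the test image, from the per-block savings figure

existing_clocks = 143_120_000 // 10_000
new_clocks = 107_120_000 // 10_000

saved = existing_clocks - new_clocks
saved_per_block = saved / BLOCKS
speedup = existing_clocks / new_clocks  # derived, not stated on the slides

NAND_AREA = 1.8772  # sq. um per NAND1 gate, as used on the slide
area_existing = 1_374_028.8
area_new = 1_634_796.8
extra_gates = (area_new - area_existing) / NAND_AREA
area_increase_pct = 100 * (area_new - area_existing) / area_existing  # derived

# Consistency check (interpretation): subtract the assumed per-block period
# from each total and compare what is left over.
fixed_existing = existing_clocks - BLOCKS * (64 + 33)
fixed_new = new_clocks - BLOCKS * (64 + 8)
```

Under these numbers the new design is about 1.34x faster for roughly 19% more area. Both totals leave the same remainder of 344 cycles, which is consistent with a fixed overhead common to both designs, though the slides do not say what that overhead is.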
23
Design trade-offs
- The existing implementation had a lot of dependency between functional blocks
- Re-designing or pipelining the internal blocks is cumbersome
- Adopted a revised "architectural" solution that uses multiple functional units
- Improves the speed of encoding
- Costs more area and higher instantaneous power
24
A second chance?
- Possibly look at pipelining individual blocks
- Re-design the Huffman block to reduce the internal dependency
- Reconstruct the image using a JPEG decoder
- Accelerate the decoding process as well
- Besides all that: start early
25
Questions??
26
Backup
- Mapping onto an FPGA wasn't successful: the design had too many cells and ran out of space
27
Breakdown of work performed:
Anup Joshi and Chandra Prakash
- Architecture with 2 encoders
- Architecture with buffer
- Synchronizing 2 Huffman blocks in proposed architecture
- Synthesis of encoder
Karthick Santhanam and Pratap Ramanathan
- Analysis of open source code
- Architecture with 2 Huffman blocks
- Matlab code for generating input bit stream
- Matlab code for combining bit stream outputs
28
Architecture #2:
- After the design, we realized that the Huffman encoder was the bottleneck
- No point in making the quantizer's output wait at the 'already slow' stage
- Lesson learnt: identify the initial bottlenecks first; do not waste time
29
Lossy, quantization factor 10: JPEG format, size 1 KB