High Speed Video Compression/Decompression Pipeline

Ariana Einsenstein, Yuan Cao

Objective: transfer video frames to the FPGA quickly
Video data has a high bit rate, so the data is compressed before transferring.
Block diagram labels: PC, some interface, compression modules (1 byte / 512 bits), DRAM wrapper, DDR3 DRAM, decompression modules (1 byte / 512 bits), 50 MHz.

Some standards
Let's first look at the industry standards.
JPEG: Discrete Cosine Transform (medium), Huffman encoder (medium)
JPEG (better): Discrete Wavelet Transform (complicated), entropy encoder (very complicated)
MPEG (even better): motion predictor (very complicated)

Discrete Wavelet Transform (DWT)
What is a DWT? The output of the transformation consists of low-pass coefficients and high-pass coefficients.
Figure labels: Low Pass, High Pass.
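To make the low-pass/high-pass split concrete, here is a minimal sketch of a one-level transform. It uses the Haar wavelet purely because it is the simplest case; the slides do not say Haar is the wavelet used in the project, and the function names are illustrative.

```python
import math

def haar_dwt_1level(signal):
    """One-level orthonormal Haar DWT of an even-length sequence.

    Returns (low, high): scaled pairwise sums and pairwise differences.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high

def haar_idwt_1level(low, high):
    """Inverse of haar_dwt_1level: rebuild the interleaved samples."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * s)
        out.append((l - h) * s)
    return out

samples = [10, 10, 12, 12, 50, 52, 8, 8]
low, high = haar_dwt_1level(samples)
restored = haar_idwt_1level(low, high)
```

Wherever neighbouring samples are equal, the high-pass coefficient is zero, which is what makes the high-pass half cheap to encode.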

DWT vs. FFT: keep the 40 largest components
How does the DWT compare to the FFT for compression? We take a signal, transform it with the FT and with the DWT, keep only the 40 largest components of each, and transform back to the time domain. The DWT handles sudden jumps in the original signal much better than the FT.
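The experiment can be reproduced in miniature. The sketch below is not the authors' test: it assumes a 64-sample step signal, keeps 8 coefficients instead of 40, uses a naive DFT in place of an FFT, and uses a multi-level Haar transform as a stand-in wavelet. All names are illustrative.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def haar_forward(x):
    """Multi-level orthonormal Haar DWT; the length must be a power of two."""
    coeffs = list(x)
    n = len(coeffs)
    s = 1.0 / math.sqrt(2.0)
    while n > 1:
        half = n // 2
        low = [(coeffs[2 * i] + coeffs[2 * i + 1]) * s for i in range(half)]
        high = [(coeffs[2 * i] - coeffs[2 * i + 1]) * s for i in range(half)]
        coeffs[:n] = low + high  # low half is transformed again next pass
        n = half
    return coeffs

def haar_inverse(c):
    coeffs = list(c)
    n = 1
    s = 1.0 / math.sqrt(2.0)
    while n < len(coeffs):
        merged = []
        for l, h in zip(coeffs[:n], coeffs[n:2 * n]):
            merged += [(l + h) * s, (l - h) * s]
        coeffs[:2 * n] = merged
        n *= 2
    return coeffs

def keep_largest(coeffs, k):
    """Zero every coefficient except the k with the largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0 for i, c in enumerate(coeffs)]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

N, K = 64, 8
signal = [0.0] * 20 + [1.0] * (N - 20)  # one sudden jump
ft_rec = [v.real for v in idft(keep_largest(dft(signal), K))]
dwt_rec = haar_inverse(keep_largest(haar_forward(signal), K))
err_ft = squared_error(signal, ft_rec)
err_dwt = squared_error(signal, dwt_rec)
```

For this step signal the wavelet transform has only a handful of non-zero coefficients, so the truncated DWT reconstructs it essentially exactly, while the truncated Fourier reconstruction rings around the jump.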

DWT 2-dimensional
This transformation can be performed on both dimensions. The transformed image consists of four regions: LL, LH, HL and HH. We can then take the LL region and transform it recursively.
Figure: original image next to its 4-level DWT. Subband labels: LL2, LH2, HL2, HH2, LL, LH, HL, HH. Annotations: non-zero (low-pass region), close to zero (high-pass regions).
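The recursion on the LL region can be sketched as follows. This is an illustration only: it assumes a square image whose side is divisible by 2 to the power of the level count, uses an unnormalized Haar step (averages and half-differences) instead of the project's lifting wavelet, and places the low-pass results in the top-left corner.

```python
def haar_pairs(values):
    """Unnormalized Haar step: pairwise averages first, half-differences after."""
    low = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return low + high

def dwt2d_level(image, size):
    """Transform the top-left size x size block in place: rows, then columns."""
    for r in range(size):
        image[r][:size] = haar_pairs(image[r][:size])
    for c in range(size):
        column = haar_pairs([image[r][c] for r in range(size)])
        for r in range(size):
            image[r][c] = column[r]

def dwt2d_multilevel(image, levels):
    out = [[float(v) for v in row] for row in image]
    size = len(out)
    for _ in range(levels):
        dwt2d_level(out, size)
        size //= 2  # the next level only sees the LL quadrant
    return out

# Smooth 8x8 ramp: pixel value = row + column
image = [[r + c for c in range(8)] for r in range(8)]
coeffs = dwt2d_multilevel(image, levels=2)
```

On a smooth image like this ramp, only the small LL block in the corner carries large values; every other region holds values of magnitude 1 or less.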

Lifting scheme
y = (x0 + x2) * a + x1
Fortunately, the math tells us this transformation can be implemented efficiently using this network, in which every node computes the multiply-add above.
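A minimal software sketch of such a lifting network, built entirely from the step y = (x0 + x2) * a + x1. The slides name the coefficients only as a, b, c, d and a scale; the numeric values below are the CDF 9/7 lifting coefficients used in JPEG 2000 and are an assumption here, as are the mirrored boundaries and the scaling convention.

```python
# Assumed values (CDF 9/7 as in JPEG 2000); the slides do not give numbers.
A, B, C, D = -1.586134342, -0.05298011854, 0.8829110762, 0.4435068522
SCALE = 1.149604398

def lift(x, start, coef):
    """Apply y = (x0 + x2) * a + x1 to every second sample from `start`.

    x1 is the sample being updated, x0 and x2 are its two neighbours,
    mirrored at the ends of the line. Neighbours are not modified within
    one call, so the update can be done in place.
    """
    n = len(x)
    for i in range(start, n, 2):
        left = x[i - 1] if i > 0 else x[i + 1]
        right = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] = (left + right) * coef + x[i]

def forward(signal):
    """Four lifting stages plus scaling; the length must be even."""
    x = [float(v) for v in signal]
    lift(x, 1, A)  # stage 1 updates odd samples
    lift(x, 0, B)  # stage 2 updates even samples
    lift(x, 1, C)  # stage 3
    lift(x, 0, D)  # stage 4
    low = [v * SCALE for v in x[0::2]]
    high = [v / SCALE for v in x[1::2]]
    return low, high

def inverse(low, high):
    """Undo the stages in reverse order with negated coefficients."""
    x = [0.0] * (len(low) + len(high))
    x[0::2] = [v / SCALE for v in low]
    x[1::2] = [v * SCALE for v in high]
    lift(x, 0, -D)
    lift(x, 1, -C)
    lift(x, 0, -B)
    lift(x, 1, -A)
    return x

low, high = forward([3, 7, 1, 8, 2, 9, 4, 6])
restored = inverse(low, high)
flat_low, flat_high = forward([100] * 8)
```

Whatever coefficients are chosen, each stage is trivially invertible, so the round trip is exact up to floating-point rounding. With the assumed 9/7 values, a constant input produces high-pass outputs that are numerically zero.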

How much parallelism?
There are some details to consider. Let's say each line of the image contains 1,024 samples.
Fully parallel: 1,024 samples per cycle, 2,048 multiplier/adders, 1,024 × 50 MHz = 51.2 GSPS (roughly 50 GSPS)
Serialized: 128 × 8 samples (8 samples per cycle), 16 multiplier/adders, 8 × 50 MHz = 400 MSPS
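The two design points follow from simple arithmetic. The sketch below assumes four lifting stages, each needing one multiplier/adder per pair of samples; that assumption is inferred from the slide's numbers, not stated on it.

```python
CLOCK_HZ = 50_000_000
LIFTING_STAGES = 4  # assumption: four multiply-add stages

def design_point(samples_per_cycle):
    """Return (multiplier/adder count, samples per second) for a given width."""
    # Each stage updates every second sample, so it needs width / 2 units.
    mult_adders = LIFTING_STAGES * samples_per_cycle // 2
    throughput = samples_per_cycle * CLOCK_HZ
    return mult_adders, throughput

full_width = design_point(1024)  # whole line per cycle
serialized = design_point(8)     # 8 samples per cycle, 128 cycles per line
```

Under these assumptions the full-width design needs 2,048 multiplier/adders for 51.2 GSPS, while the serialized one needs 16 for 400 MSPS.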

Dummy FIFOs
If we look carefully at the lifting scheme, each stage is data-dependent on the previous two stages, which makes decoupling and pipelining difficult.
Figure: the original lifting scheme (stages S1, S2, S3, S4, Sc; inputs x0, x1; outputs l0, h0) next to the version modified with dummy buffers (FIFOs).

DWT 1D micro-architecture
Here is the detailed 1D DWT module. As you can see, the different stages are decoupled and communicate only through FIFOs; therefore the module is fully pipelined and latency-insensitive.
Diagram labels: ififo (Vector#(B, WSample)), ififo_history; Stage_1 and Stage_2 (Coef_a, Coef_b, two Mult-Adders, s1xfifo, s2save, s1fifo, s2fifo); Stage_3 and Stage_4 (Coef_c, Coef_d, two Mult-Adders, s3xfifo, s3fifo, s4fifo); counters (=0, -1, +1); Scale (coef_scale and 1/coef_scale); ofifo (Vector#(B, WSample)).

DWT 2D
Similarly, a 2D transformation can be implemented based on the 1D transformation module. The only major difference is that large BRAM FIFOs are needed to store full lines of an image. The resulting module is still fully pipelined.
Diagram labels: DWT1D (Vector#(B,WSample) in and out); two Distributors; Stage1 to Stage4, each a Mult-Add with a coefficient signal (coef_a to coef_d) and a pair of FIFOs (s1fifo[0..1] to s4fifo[0..1]); s1save and s3save; Stage_sc (Scale by s and 1/s); Assembler; output_fifo (Vector#(B,WSample)). Large BRAM buffers store full lines.

Multi-level DWT
And finally we can assemble multiple 2D transformation modules into a multi-level transformation. The output from the DWT module is interleaved, meaning low-pass and high-pass coefficients come out in an alternating manner, as shown in the image, but the next level only needs the low-pass coefficients. Everything is pipelined! Throughput = B samples/cycle.
Diagram labels: DWT2D (N), DWT2D (N/2), DWT2D (N/4); ofifo (Vector#(B, WSample)) split into Low and High; BRAM FIFO, 16 lines capacity, 512 kbit (two instances).
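Selecting what the next level consumes can be sketched in a few lines. The sketch assumes low-pass coefficients sit at even positions and high-pass coefficients at odd positions, in both rows and columns; the slide only says the two alternate, and the function names are illustrative.

```python
def split_interleaved(line):
    """Separate an interleaved line [l0, h0, l1, h1, ...] into (low, high)."""
    return line[0::2], line[1::2]

def next_level_input(frame):
    """Keep only samples that are low-pass in both directions (the LL band)."""
    return [row[0::2] for row in frame[0::2]]

frame = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
low, high = split_interleaved(frame[0])
ll = next_level_input(frame)
```

Each level therefore hands a quarter of its samples to the next one, which matches the N, N/2, N/4 line widths of the chained DWT2D modules.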

Histogram of Samples - Output from 1-level DWT
Most Samples are around zero

Huffman Encoding Tree and Table
Value: code
0: 10
1: 1100
-1: 1101
2: 11100
-2: 11101
3: 111100
-3: 111101
-4: 111110
x: 111111
Figure: Huffman tree with leaves 0, 1, -1, 2, -2, 3, -3, -4.
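A minimal sketch of encoding and decoding with this table, using strings of '0' and '1' for readability. Two things are assumptions: the entry for -2 is taken as 11101, which keeps the code prefix-free, and the code x (111111) is treated as an escape followed by a hypothetical 8-bit two's-complement raw value, a payload format the slides do not specify.

```python
CODES = {0: "10", 1: "1100", -1: "1101", 2: "11100", -2: "11101",
         3: "111100", -3: "111101", -4: "111110"}
ESCAPE = "111111"
RAW_BITS = 8  # hypothetical payload width after the escape code
DECODE = {code: value for value, code in CODES.items()}

def encode(values):
    bits = []
    for v in values:
        if v in CODES:
            bits.append(CODES[v])
        else:
            if not -(1 << (RAW_BITS - 1)) <= v < (1 << (RAW_BITS - 1)):
                raise ValueError("value does not fit the raw payload")
            raw = v & ((1 << RAW_BITS) - 1)  # two's complement
            bits.append(ESCAPE + format(raw, f"0{RAW_BITS}b"))
    return "".join(bits)

def decode(bits):
    values, pos = [], 0
    while pos < len(bits):
        for length in range(2, 7):  # codes are 2 to 6 bits long
            word = bits[pos:pos + length]
            if word in DECODE:
                values.append(DECODE[word])
                pos += length
                break
            if word == ESCAPE:
                raw = int(bits[pos + 6:pos + 6 + RAW_BITS], 2)
                if raw >= 1 << (RAW_BITS - 1):
                    raw -= 1 << RAW_BITS
                values.append(raw)
                pos += 6 + RAW_BITS
                break
        else:
            raise ValueError("invalid code at bit %d" % pos)
    return values

coefficients = [0, 0, 1, -1, 0, 2, 17, -4, -5, 0]
stream = encode(coefficients)
decoded = decode(stream)
```

Because zero is by far the most common value after the DWT, giving it the shortest code is where most of the savings come from.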

Encoder Architecture
Diagram labels: Coeff In Vector FIFO, Encoding Table, Encoded Value Vector FIFO, Circle Buffer (EHR) with Write Index and Read Index (priority to writes), Byte Out FIFO.
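A software sketch of the circular buffer's role: variable-length codes are written at the write index, and whole bytes are read out at the read index. The class name, the capacity and the MSB-first bit order are assumptions, and the hardware's same-cycle read/write behaviour (the EHR, with priority to writes) is not modelled.

```python
class BitRing:
    """Circular bit buffer with separate write and read indices."""

    def __init__(self, capacity_bits=64):
        self.bits = [0] * capacity_bits
        self.write_index = 0
        self.read_index = 0
        self.count = 0  # bits currently stored

    def write(self, code):
        """Append a code given as a string of '0'/'1' characters."""
        if self.count + len(code) > len(self.bits):
            raise BufferError("ring full")
        for ch in code:
            self.bits[self.write_index] = int(ch)
            self.write_index = (self.write_index + 1) % len(self.bits)
        self.count += len(code)

    def read_byte(self):
        """Return the next 8 bits as an int, or None if fewer are stored."""
        if self.count < 8:
            return None
        value = 0
        for _ in range(8):
            value = (value << 1) | self.bits[self.read_index]
            self.read_index = (self.read_index + 1) % len(self.bits)
        self.count -= 8
        return value

ring = BitRing(capacity_bits=16)
for code in ("10", "1100", "1101"):  # codes for 0, 1, -1
    ring.write(code)
first = ring.read_byte()   # bits 10110011
empty = ring.read_byte()   # only 2 bits left
ring.write("111110")       # code for -4; write index wraps to 0
second = ring.read_byte()  # bits 01111110
```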

Decoder Architecture
Diagram labels: Byte In FIFO, Circle Buffer (EHR) with Write Index, Read Index and Read Index + 6, Encoding Table, Coeff Out FIFO.
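The "Read Index + 6" label suggests the decoder looks at a 6-bit window, the longest code length, and advances by the length of the code it finds. The sketch below shows one way to do that with a lookup table indexed by the window. It is a guess at the idea rather than the authors' design, it takes the code for -2 as 11101 so that the table is prefix-free, and it leaves the escape code (111111) unhandled.

```python
CODES = {0: "10", 1: "1100", -1: "1101", 2: "11100", -2: "11101",
         3: "111100", -3: "111101", -4: "111110"}
WINDOW = 6  # longest code length in bits

def build_window_table():
    """Map each 6-bit window to (value, code_length) of the code it starts with."""
    table = {}
    for window in range(1 << WINDOW):
        bits = format(window, f"0{WINDOW}b")
        for value, code in CODES.items():
            if bits.startswith(code):
                table[window] = (value, len(code))
    return table

TABLE = build_window_table()

def decode(bits, count):
    """Decode `count` values from a string of '0'/'1' characters."""
    values, read_index = [], 0
    for _ in range(count):
        # Peek at the next 6 bits, zero-padded at the end of the stream.
        window = bits[read_index:read_index + WINDOW].ljust(WINDOW, "0")
        entry = TABLE.get(int(window, 2))
        if entry is None:
            raise ValueError("escape or invalid code at bit %d" % read_index)
        value, length = entry
        values.append(value)
        read_index += length  # consume only the bits of the matched code
    return values

decoded = decode("1011001101", 3)
```

A single table lookup per coefficient replaces a bit-by-bit walk of the tree, which suits a design that must produce one sample per cycle.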

FPGA Results - 1024x1024 Monochrome
Figure panels: initial image (padded to 1024x1024); MATLAB compressed image; MATLAB and FPGA compressed image.

FPGA Results - 512x512 Color
Figure panels: initial image (padded to 512x512); MATLAB compressed image; MATLAB and FPGA compressed image.
3-level DWT compression, 0.23 compression ratio (lossy)

Performance
Throughput: 50 MHz × 1 sample/cycle (fully pipelined) = 50 MSPS, i.e. 1280x720 at 18 FPS
Compression ratio: down to 0.23 (5.5 bit/pixel) for lossy compression
FPGA utilization after synthesis: 88k LUTs, 41k registers, 300 Block RAM tiles
Code: 2,620 lines of BSV, 126 lines of C++ and 391 lines of MATLAB
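The frame rate and bits-per-pixel figures can be checked with a short calculation. It assumes three 8-bit colour samples per pixel, which is what makes both quoted numbers come out; the slides do not state the pixel format.

```python
CLOCK_HZ = 50_000_000
SAMPLES_PER_CYCLE = 1
WIDTH, HEIGHT = 1280, 720
CHANNELS = 3         # assumption: three colour samples per pixel
BITS_PER_SAMPLE = 8  # assumption: 8 bits per sample
COMPRESSION_RATIO = 0.23

samples_per_second = CLOCK_HZ * SAMPLES_PER_CYCLE
frames_per_second = samples_per_second / (WIDTH * HEIGHT * CHANNELS)
bits_per_pixel = COMPRESSION_RATIO * CHANNELS * BITS_PER_SAMPLE
```

Under these assumptions the pipeline sustains just over 18 frames per second at 1280x720, and a ratio of 0.23 corresponds to about 5.5 bits per pixel.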

Thank you!

DWT 1D module w/ serialization & dummy FIFOs
Diagram labels, one group of 8 input samples per row:
x0~x7: L0~L3, H0~H3
x8~x15: L4~L7, H4~H7
x16~x23: L8~L11, H8~H11
x24~x31: L12~L15, H12~H15
