VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder
“Human beings are great programmers, Computers are poor actors”
Serene Banerjee, Hamid R. Sheikh, Lizy K. John, Brian L. Evans, Alan C. Bovik
Department of Electrical and Computer Engineering, The University of Texas at Austin
November 1st, 2000
serene@ece.utexas.edu

Baseline H.263 Video Encoding
  I (intra) frame: the Discrete Cosine Transform (DCT) is used to reduce spatial redundancy within a frame.
  P (predicted) frame: motion compensated prediction (MCP) is used to reduce temporal redundancy. The DCT is used to reduce spatial redundancy in the prediction error.
  [Figure: frame sequence I, P, …]

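The prediction error mentioned for P frames can be illustrated with a minimal sketch in C. It assumes 8-bit pixels, 8x8 blocks, and a prediction block that motion compensation has already produced; the function name `prediction_error` and its parameters are hypothetical and are not taken from the UBC codec.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 8  /* assumed block size: the DCT operates on 8x8 blocks */

/* Prediction error for one 8x8 block: residual = current - prediction.
 * For a P frame, this residual (not the raw pixels) is what the DCT
 * transforms. `stride` is the row pitch of both source images, assumed
 * to be equal for the current and predicted frames. */
void prediction_error(const uint8_t *cur, const uint8_t *pred,
                      int stride, int16_t residual[BLOCK * BLOCK])
{
    assert(stride >= BLOCK);
    for (int y = 0; y < BLOCK; y++)
        for (int x = 0; x < BLOCK; x++)
            residual[y * BLOCK + x] =
                (int16_t)(cur[y * stride + x] - pred[y * stride + x]);
}
```

The residual needs a signed type wider than 8 bits, since the difference of two 8-bit pixels ranges from -255 to 255.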
Baseline H.263 Encoder
  [Block diagram: video input, coding control, 2-D DCT, Q, Q-1, IDCT, ME, MCP, VLC; outputs: quantizer index for transform coefficient, control info, motion vectors]
  VLC: Variable Length Coding
  ME: Motion Estimation

H.263 Encoder
  Goals: baseline H.263 encoder only
    Evaluate performance of compiled C code on Very Long Instruction Word (VLIW) Digital Signal Processors (DSPs) and superscalar processors
    Hand optimize H.263 video encoder on VLIW DSP
  University of British Columbia (UBC) H.263 Version 2 (H.263+) video codec
    By Prof. Faouzi Kossentini’s group: http://spmg.ece.ubc.ca
    23000 lines (720 kbytes) of C code targeted for PCs
    Baseline H.263 and many optional H.263+ modes
    Primarily for research purposes

TMS320C6701 Processor
  Up to eight 32-bit instructions are executed in order in one instruction cycle
  Two 32-bit data paths, with 16 32-bit registers and 16 16-bit data memory banks
  [Block diagram: TMS320C6701 CPU core with program fetch, instruction dispatch, instruction decode, A register file, B register file, functional units L1, S1, M1, D1, L2, S2, M2, D2, control registers, control logic, test/emulation, interrupts control]

TMS320C6701 EVM
  TMS320C6701 processor
    11 - 17 stages of pipeline, depending on instruction
  External memory
    256 kB of 133 MHz synchronous burst static random-access memory (SBSRAM)
    8 MB of 100 MHz synchronous dynamic RAM (SDRAM) in two 16-bit RAM banks
    100 MHz clock speed due to SDRAM
  Development environment
    Code Composer: interactive real-time debugging
    Simulator: does not report pipeline stalls

SimpleScalar Simulator
  A superscalar processor reorders sequential instructions based on data dependencies for parallel (out-of-order) execution
  SimpleScalar is a configurable superscalar simulator: http://www.simplescalar.org
  Six pipeline stages for out-of-order simulation
  [Block diagram: Fetch, Dispatch, Scheduler, Execute, Writeback, Commit, Memory, instruction cache, data cache, data TLB, virtual memory]
  TLB: Translation lookaside buffer

Comparison of Processors

Encoder Profile for VLIW DSP (with level two C optimization only)
  1476 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
  [Figure: cycle profile by routine, with SAD labeled]

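Full-search motion estimation evaluates the sum of absolute differences (SAD) at every candidate displacement, which helps explain why SAD features in the profile. The following is a minimal sketch in plain C, not the UBC encoder's code: the names `sad16x16`, `full_search`, and `MotionVector`, the 16x16 macroblock size, the integer-pixel search, and the policy of skipping candidates outside the reference frame are all assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define MB 16  /* assumed macroblock size: 16x16 luminance pixels */

/* Sum of absolute differences between a macroblock and a candidate block.
 * Both pointers address the top-left pixel; `stride` is the row pitch. */
unsigned sad16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++)
            sum += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

typedef struct { int dx, dy; unsigned sad; } MotionVector;

/* Full search: test every integer displacement in [-range, +range]^2 and
 * keep the one with the smallest SAD. (mbx, mby) is the top-left corner of
 * the macroblock in the current frame. Candidates that fall outside the
 * reference frame are skipped (an assumption of this sketch). */
MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                         int width, int height, int mbx, int mby, int range)
{
    MotionVector best = { 0, 0, UINT_MAX };
    assert(range >= 0 && mbx >= 0 && mby >= 0);
    assert(mbx + MB <= width && mby + MB <= height);
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = mbx + dx, ry = mby + dy;
            if (rx < 0 || ry < 0 || rx + MB > width || ry + MB > height)
                continue;
            unsigned s = sad16x16(cur + mby * width + mbx,
                                  ref + ry * width + rx, width);
            if (s < best.sad) { best.dx = dx; best.dy = dy; best.sad = s; }
        }
    }
    return best;
}
```

With a search range of r, each macroblock costs up to (2r + 1)^2 SAD evaluations of 256 pixel differences each, so the inner loop of `sad16x16` is the natural target for hand optimization.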
Encoder Profile for SuperScalar (1-way with level two C optimization)
  196 Mcycles/frame for 128 x 96 resolution with full-search motion estimation

H.263 Encoder Comparison (with level 2 C optimization only)
  Frame resolution: 128 x 96 (Sub-QCIF)
  Full search motion estimation
  Clock speed: 100 MHz

VLIW DSP Memory Optimizations
  Internal program memory holds
    Computationally intensive routines
    Commonly used runtime support functions from TI libraries (memcpy, memcmp and memset)
  Internal data memory holds
    Macroblocks and search area for motion estimation
    Macroblocks for DCT, quantization, coding, reconstruction
    Local data for computationally intensive routines
    Stack
  Speedup: 29 times over level two optimization

VLIW DSP Code Optimizations
  Compiler intrinsics gave little improvement
  Wrote assembly routines
    Parallel assembly: SAD, Clip_MB (clips overflowing values)
    Linear assembly: Interpolate, FillMBData (pack copy of pixel data into macroblock structures)
  Rewriting the C code
    Unroll loops and pipeline computations
    Use 32-bit packed data I/O to slower external RAM
    Avoid pipeline stalls due to memory bank conflicts
  Speedup: 4 times over level two C optimization

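Two of the ideas above, clipping overflowing values and moving pixels as packed 32-bit words with an unrolled loop, can be sketched in portable C. This is an illustration only and does not reproduce the authors' assembly routines: `clip_block` and `copy_packed32` are hypothetical names, the valid pixel range is assumed to be 0 to 255, and `memcpy` stands in for the word loads and stores a DSP compiler would emit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Clamp reconstructed samples to the assumed 8-bit pixel range [0, 255]. */
void clip_block(const int16_t *in, uint8_t *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int v = in[i];
        out[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

/* Copy n pixels as packed 32-bit words: four pixels per memory access,
 * with the loop unrolled by two so each iteration moves eight pixels.
 * n must be a multiple of 8 in this sketch. memcpy of 4 bytes is the
 * portable way to express an unaligned word access. */
void copy_packed32(uint8_t *dst, const uint8_t *src, int n)
{
    assert(n >= 0 && n % 8 == 0);
    for (int i = 0; i < n; i += 8) {
        uint32_t w0, w1;
        memcpy(&w0, src + i, 4);      /* load pixels i .. i+3 */
        memcpy(&w1, src + i + 4, 4);  /* load pixels i+4 .. i+7 */
        memcpy(dst + i, &w0, 4);
        memcpy(dst + i + 4, &w1, 4);
    }
}
```

Issuing both loads before both stores gives a scheduler two independent memory operations per iteration, which is the kind of parallelism loop unrolling is meant to expose.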
VLIW DSP Optimizations (assembly routines only)

VLIW DSP Encoder Profile (after all C6701 optimizations)
  24 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
  [Figure: cycle profile by routine, with SAD labeled]

Superscalar Encoder Profile (256-way SimpleScalar processor)
  28 Mcycles/frame for 128 x 96 resolution with full-search motion estimation

Subroutine Comparisons

H.263 Encoder Comparison
  Frame resolution: 128 x 96 (Sub-QCIF)
  Full search motion estimation
  Clock speed: 100 MHz

Conclusions
  With level 2 optimization only
    One-way superscalar is 7.5x faster than VLIW DSP
    Four-way to one-way issue speedup is 2.88x
    256-way to four-way speedup is 2.4x
    Variable length coding much faster on superscalar
  VLIW DSP hand optimization produces 61x speedup vs. level two C optimization
    Placement of often-used data and code on-chip
    Hand coded SAD, interpolation, and reconstruction
    14% faster than 256-way superscalar version
  http://www.ece.utexas.edu/~sheikh/h263

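The headline ratios can be cross-checked against the cycle counts reported in the encoder profiles (1476, 196, 24, and 28 Mcycles/frame). The sketch below assumes that "Mcycles" means millions of cycles and that both processors run at the stated 100 MHz; the helper names `speedup` and `frames_per_second` are hypothetical.

```c
#include <assert.h>

/* Speedup of a new implementation over an old one, from cycle counts. */
double speedup(double old_cycles, double new_cycles)
{
    assert(old_cycles > 0 && new_cycles > 0);
    return old_cycles / new_cycles;
}

/* Achievable frame rate for a given clock and per-frame cycle cost. */
double frames_per_second(double clock_hz, double cycles_per_frame)
{
    assert(clock_hz > 0 && cycles_per_frame > 0);
    return clock_hz / cycles_per_frame;
}
```

Under these assumptions, `speedup(1476, 196)` is about 7.5 (one-way superscalar vs. VLIW DSP at level two optimization), `speedup(1476, 24)` is 61.5 (hand optimization on the VLIW DSP), and 1 - 24/28 is about 0.14, matching the 14% figure. `frames_per_second(100e6, 24e6)` gives roughly 4.2 frames per second for the optimized VLIW DSP encoder at Sub-QCIF resolution.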