Understanding the Sources of Inefficiency in General-Purpose Chips
Motivation
General-purpose processors serve a wide class of applications.
- Pros: quick recovery of Non-Recurring Engineering (NRE) costs
- Cons: low energy efficiency and poor performance
Specific applications (cell phones, video cameras) have strict performance and energy needs.
Video encoding is used as the representative application by the authors.
H.264 Encoding Format
Pipeline: Input → Prediction (Inter Prediction: IME, FME; Intra Prediction) → Transform / Quantize → Entropy Encode (CABAC)
H.264 Encoding Format
- IME (Integer Motion Estimation) – finds the closest match for an image block in the previous image and computes a vector representing the observed motion
- FME (Fractional Motion Estimation) – refines the match to quarter-pixel resolution
H.264 Encoding Format
- IP (Intra Prediction) – uses previously encoded image blocks within the current image to form a prediction of the current block
- DCT / Quantization – transforms and quantizes the difference between the current and predicted image block
- CABAC – entropy encodes the resulting coefficients and syntax elements
H.264 Algorithm
Pipeline: Input → Prediction (Inter Prediction: IME, FME; Intra Prediction) → Transform / Quantize → Entropy Encode (CABAC)
The prediction and transform stages are data parallel; the CABAC entropy encoder is sequential.
Percentage of Execution Time (H.264)
- IME + FME account for 92% of the execution time
- CABAC is a small fraction but sequential – it becomes the bottleneck once the parallel stages are accelerated
What Exactly Is the Problem? (H.264)
- A 2.8 GHz Pentium 4 is 500x worse in energy than a dedicated H.264 ASIC
- A four-processor Tensilica-based CMP is also 500x worse in energy
ASIC – Application-Specific Integrated Circuit
ASIC – Feasibility
Is an ASIC for every application feasible?
- Inflexible
- Increased manufacturing and design time
- High Non-Recurring Engineering (NRE) costs
- Expensive to build for every different application
General Idea
- Is there an incremental way of going from a general-purpose processor to an ASIC?
- What is the nature of the overheads?
- Goal: a solution that has the benefits of both general-purpose processors and ASICs
- Provide flexibility for application experts to build customized, energy-efficient solutions
- Case study – transform a conventional CMP into a customized processor that is an efficient H.264 encoder
- Use the Tensilica toolchain to create the optimized processors
Baseline H.264 Implementation
- The H.264 video encoding path is long and sequential
- The five major algorithmic blocks are mapped onto a macroblock pipeline running on a four-processor CMP system
- Each processor has 16 KB 2-way set-associative I and D caches
SIMD and ILP
Exploiting VLIW and SIMD.
- SIMD – Single Instruction, Multiple Data: one instruction operates on a whole vector, e.g. [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]
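The vector add in the example above maps onto a single SIMD instruction. A minimal sketch using SSE2 intrinsics (the ISA choice is purely illustrative; the paper customizes Tensilica cores, not x86):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Four lanes per operand: one instruction adds A0..A3 to B0..B3. */
    int32_t a[4] = { 1, 2, 3, 4 };      /* A0 A1 A2 A3 */
    int32_t b[4] = { 10, 20, 30, 40 };  /* B0 B1 B2 B3 */
    int32_t c[4];                       /* C0 C1 C2 C3 */

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi32(va, vb); /* one SIMD add across all four lanes */
    _mm_storeu_si128((__m128i *)c, vc);

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```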
SIMD and ILP
- VLIW – Very Long Instruction Word: the compiler packs multiple independent operations into one wide instruction, exploiting instruction-level parallelism
SIMD and ILP – processor energy breakdown (figure)
Operation Fusion
- Operation fusion – fusing a complex instruction subgraph into a single operation
- Reduces instruction count and register file accesses
- Intermediate results are consumed within the fused op
Example (pixel up-sampling): x_n = x_{-2} - 5*x_{-1} + 20*x_0 + 20*x_1 - 5*x_2 + x_3
After fusion:
  acc = 0;
  acc = AddShft(acc, x_0, x_1, 20);
  acc = AddShft(acc, x_{-1}, x_2, -5);
  acc = AddShft(acc, x_{-2}, x_3, 1);
  x_n = Sat(acc);
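A C sketch of the up-sampling filter before and after fusion. AddShft and Sat model the fused pseudo-ops from the slide; this is an illustrative software model, not Tensilica TIE code.

```c
#include <stdint.h>
#include <stdio.h>

/* Six-tap filter as plain scalar code: many adds/multiplies and
   register-file accesses per output pixel.
   x[0..5] hold x_{-2}, x_{-1}, x_0, x_1, x_2, x_3. */
static int upsample_plain(const int16_t x[6]) {
    return x[0] - 5 * x[1] + 20 * x[2] + 20 * x[3] - 5 * x[4] + x[5];
}

/* Model of the fused AddShft op: acc + (a + b) * coeff as one "instruction",
   so intermediate sums never go back through the register file. */
static inline int32_t AddShft(int32_t acc, int16_t a, int16_t b, int32_t coeff) {
    return acc + (a + b) * coeff;
}

/* Sat: clamp the accumulator to the 8-bit pixel range. */
static inline uint8_t Sat(int32_t acc) {
    if (acc < 0)   return 0;
    if (acc > 255) return 255;
    return (uint8_t)acc;
}

static uint8_t upsample_fused(const int16_t x[6]) {
    int32_t acc = 0;
    acc = AddShft(acc, x[2], x[3], 20);  /* 20*(x_0 + x_1)    */
    acc = AddShft(acc, x[1], x[4], -5);  /* -5*(x_{-1} + x_2) */
    acc = AddShft(acc, x[0], x[5], 1);   /*     x_{-2} + x_3  */
    return Sat(acc);
}

int main(void) {
    const int16_t x[6] = { 1, 2, 3, 4, 5, 6 };  /* x_{-2} .. x_3 */
    printf("plain: %d  fused: %d\n", upsample_plain(x), upsample_fused(x));
    return 0;
}
```

With the fused form, each output pixel needs three AddShft operations plus one Sat instead of a longer sequence of scalar multiplies, adds, and register transfers.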
Operation Fusion
- The compiler can find profitable instruction subgraphs to merge; Tensilica's XPRES tool tries to do this automatically
- The authors created the fused instructions manually
- They found roughly 20 fusion instructions across the 4 algorithmic blocks
Operation fusion result – not a big gain
Not Good Enough
- The problem remains: about 90% of the energy still goes into instruction overheads
- Need a much higher ratio of compute to overhead
- Need to aggregate work into large chunks to create highly optimized functional units (FUs)
Magic Instructions
- Achieve a large amount of computation at very low cost
- Done by creating instructions that are tightly coupled to custom data-storage elements with algorithm-specific communication links
IME Strategy
- SAD – Sum of Absolute Differences
- Hundreds of SAD calculations are needed to evaluate one image-block match
- The data used by successive calculations is nearly the same
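For reference, a scalar C sketch of the SAD kernel and an exhaustive integer search over candidate positions (the 16x16 block size and search range are illustrative assumptions, not the authors' exact configuration):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 block of the current frame
   and a candidate block of the reference frame. `stride` is the frame
   width in pixels. */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride) {
    unsigned sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
        }
    }
    return sad;
}

/* Exhaustive integer motion search over a +/- `range` pixel window:
   hundreds of SAD evaluations per block, with candidate windows that
   overlap almost completely; this is the data reuse a custom SAD array
   exploits. Assumes `ref` points into a frame padded by at least `range`
   pixels on each side. */
void ime_search(const uint8_t *cur, const uint8_t *ref, int stride,
                int range, int *best_dx, int *best_dy) {
    unsigned best = ~0u;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            unsigned s = sad_16x16(cur, ref + dy * stride + dx, stride);
            if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
        }
    }
}
```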
FME Strategy
- Pixel up-sampling example: x_n = x_{-2} - 5*x_{-1} + 20*x_0 + 20*x_1 - 5*x_2 + x_3
- A normal register file requires five register transfers per step
- Augment it with a six-entry, 8-bit-wide shift register structure
- Works like a FIFO – when a new entry arrives, all entries shift
FME Strategy
- Create a six-input multiplier/adder
- For 2-D up-sampling, build a shift register that stores horizontally up-sampled data and feeds its output to the vertical up-sampling units
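A software model of the shift-register plus six-input multiplier/adder datapath described above (the structure and names are illustrative, not the authors' hardware):

```c
#include <stdint.h>
#include <stdio.h>

/* Six-entry shift register holding x_{-2} .. x_3. */
typedef struct {
    int16_t entry[6];
} shift_reg6;

/* FIFO-like behaviour: pushing a new pixel shifts every entry by one. */
static void shift_in(shift_reg6 *sr, int16_t new_pixel) {
    for (int i = 0; i < 5; i++) {
        sr->entry[i] = sr->entry[i + 1];
    }
    sr->entry[5] = new_pixel;
}

/* Six-input multiplier/adder: one fused operation evaluates the whole
   six-tap filter x_{-2} - 5*x_{-1} + 20*x_0 + 20*x_1 - 5*x_2 + x_3,
   instead of five separate register transfers per step. */
static int32_t six_tap(const shift_reg6 *sr) {
    static const int32_t coeff[6] = { 1, -5, 20, 20, -5, 1 };
    int32_t acc = 0;
    for (int i = 0; i < 6; i++) {
        acc += coeff[i] * sr->entry[i];
    }
    return acc;
}

/* Up-sample one row: each new input pixel produces one filtered output
   once the register is full. */
static void upsample_row(const uint8_t *in, int n, int32_t *out) {
    shift_reg6 sr = { { 0, 0, 0, 0, 0, 0 } };
    for (int i = 0; i < n; i++) {
        shift_in(&sr, in[i]);
        if (i >= 5) {
            out[i - 5] = six_tap(&sr);
        }
    }
}

int main(void) {
    const uint8_t row[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int32_t out[3];
    upsample_row(row, 8, out);
    printf("%d %d %d\n", (int)out[0], (int)out[1], (int)out[2]);
    return 0;
}
```

For 2-D up-sampling, a second shift register holding the horizontally up-sampled values would feed the same kind of six-input unit for the vertical pass.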
Other Magic Instructions
- DCT – matrix transpose support; operation fusion with no limitation on the number of operands
- Intra Prediction – customized interconnections for the different prediction modes
- CABAC – FIFO structures in the binarization module; fundamentally different computations fused with no restrictions
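As a concrete reference for the DCT transpose, this is the 4x4 data rearrangement needed between the row and column passes of H.264's integer transform; a plain-C sketch of what a dedicated transpose structure does in a single step:

```c
#include <stdint.h>

/* In-place transpose of a 4x4 coefficient block. Software needs a loop
   nest and temporaries; a dedicated transpose register file performs the
   same rearrangement as one operation. */
void transpose4x4(int16_t m[4][4]) {
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int16_t tmp = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = tmp;
        }
    }
}
```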
Magic Instructions – Energy (within 3x of the ASIC)
Magic Instructions – Performance
- Over 35% of the energy is now consumed in the ALUs
- Most of the executed code uses magic instructions
Summary
- Many of the operations are very simple and consume little energy on their own
- SIMD / vector units parallelize the code well, but overheads still dominate
- To perform hundreds of operations per instruction, specialized hardware and memory structures are needed
- The authors put emphasis on making chip customization feasible
- The focus should be on designing chip generators, not individual chips
Discussion Points
- How will their architecture designs scale across multiple applications?
- Their general-purpose CMP comparison baseline is invalid; they should compare against designs with similar functional units
- For very varied applications with specific requirements, this might just boil down to designing ASICs
- They do not evaluate video quality (and both encode time and power vary with quality)