 Understanding the Sources of Inefficiency in General-Purpose Chips.


Motivation
- General-purpose processors serve a wide class of applications
  - Pros: quick recovery of non-recurring engineering (NRE) costs
  - Cons: low energy efficiency and poor performance
- Specific applications (cell phones, video cameras) have strict needs
- The authors use video encoding as the representative application

H.264 Encoding Format (block diagram): Input -> Prediction [Inter Prediction (IME, FME), Intra Prediction] -> Transform / Quantize -> Entropy Encode (CABAC)

H.264 Encoding Format
- IME (Integer Motion Estimation): finds the closest match for an image block in the previous image and computes a vector representing the observed motion
- FME (Fractional Motion Estimation): refines the match to quarter-pixel resolution
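The integer motion estimation step described above can be sketched as an exhaustive block search. This is a minimal illustration, not the encoder's actual search strategy: the function names, the list-of-lists frame layout, and the sum-of-absolute-differences cost are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def integer_motion_search(cur, ref, bx, by, n, search_range):
    """Try every integer displacement within +/- search_range and return
    (dx, dy, cost) for the best match, or None if no candidate fits."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidate blocks that fall outside the reference frame.
            inside_x = 0 <= bx + dx and bx + dx + n <= width
            inside_y = 0 <= by + dy and by + dy + n <= height
            if not (inside_x and inside_y):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The returned `(dx, dy)` pair plays the role of the motion vector; FME would then refine it at sub-pixel positions.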

H.264 Encoding Format
- IP (Intra Prediction): uses previously encoded image blocks within the current image to form a prediction of the current image block
- DCT/Quantization: transforms and quantizes the difference between the current and predicted image block
- CABAC: entropy-encodes the coefficients and other elements

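H.264 algorithm (block diagram): Input -> Prediction [Inter Prediction (IME, FME), Intra Prediction] -> Transform / Quantize -> Entropy Encode (CABAC). Prediction and Transform / Quantize are data parallel; Entropy Encode (CABAC) is sequential.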

Percentage execution time (H.264)
- IME + FME account for 92% of the execution time
- CABAC is small but sequential, so it becomes the bottleneck
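Why a small sequential stage becomes the bottleneck can be seen with Amdahl's law. The sketch below uses the 92% figure from the slide and assumes, purely for illustration, that only IME + FME are accelerated while the remaining 8% (which includes more than CABAC) runs at its original speed. The function name is hypothetical.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: whole-program speedup when `accelerated_fraction` of the
    run time is sped up by `factor` and the rest is left unchanged."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


# Speeding up the 92% spent in IME + FME by ever larger factors.
for factor in (10, 100, 1000):
    print(factor, round(overall_speedup(0.92, factor), 2))
```

However large the factor, the overall speedup stays below 1 / 0.08 = 12.5, so the untouched stages end up dominating the run time.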

What exactly is the problem? (H.264)
- A 2.8 GHz Pentium 4 is 500x worse in energy than an ASIC
- A four-processor Tensilica-based CMP is also 500x worse in energy

ASIC: Application-Specific Integrated Circuits

ASIC: Feasibility
- Is it feasible?
  - Inflexible
  - Increased manufacturing and design time
  - Non-recurring engineering (NRE) costs?
  - Expensive to make for every different application

General Idea
- Is there an incremental way of going from a general-purpose (GP) processor to an ASIC?
- What is the nature of the overheads?
- Goal: a solution that has the benefits of both GP and ASIC
- Provide flexibility for application experts to build customized solutions for future energy efficiency
- Case study: transform a conventional CMP into a customized processor that is an efficient H.264 encoder
- Use Tensilica tools to create the optimized processors

Baseline H.264 Implementation
- The H.264 video encoding path is long and sequential
- Map the five major algorithmic blocks onto a four-stage macroblock pipeline
- Map the four pipeline stages onto a 4-processor CMP system
- Each processor has 16 KB, 2-way set-associative instruction and data caches


SIMD and ILP
- Exploiting VLIW and SIMD
- SIMD (Single Instruction, Multiple Data): one instruction operates on several data elements at once (figure: lanes A0-A3 and B0-B3 combine into C0-C3)
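A toy model of the SIMD idea, assuming the figure's operation is an element-wise add (the slide does not say which operation it shows). The function names and the 4-lane width are illustrative assumptions, not details from the paper.

```python
def simd_add(a, b):
    """Model of one SIMD add: the same operation is applied to every lane."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


def simd_instruction_count(elements, lanes):
    """Instructions needed to process `elements` values, `lanes` at a time
    (ceiling division); a scalar machine corresponds to lanes=1."""
    return -(-elements // lanes)


a = [1, 2, 3, 4]       # A0..A3
b = [10, 20, 30, 40]   # B0..B3
c = simd_add(a, b)     # C0..C3, produced by a single modeled instruction
```

With 4 lanes, 16 elements need 4 modeled instructions instead of 16, which is where the reduction in per-instruction overhead comes from.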

SIMD and ILP
- VLIW (Very Long Instruction Word): several independent operations are packed into one instruction word and issued together

SIMD and ILP: processor energy breakdown (figure)

Operation Fusion
- Operation fusion: fuses a complex instruction subgraph into a single operation
- Reduces instruction count and register file accesses
- Intermediate results are consumed within the fused op
- Example (pixel upsampling): x_n = x_{-2} - 5x_{-1} + 20x_0 + 20x_1 - 5x_2 + x_3
- After fusion:
    acc = 0;
    acc = AddShft(acc, x_0, x_1, 20);
    acc = AddShft(acc, x_{-1}, x_2, -5);
    acc = AddShft(acc, x_{-2}, x_3, 1);
    x_n = Sat(acc);
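The fused sequence above can be modeled in software to check that it computes the same six-tap filter. The semantics of `AddShft` and `Sat` are not spelled out on the slide, so this sketch assumes `AddShft(acc, a, b, k)` returns `acc + k * (a + b)` and that `Sat` rounds, divides by 32, and clamps to 8 bits, as H.264 half-sample interpolation does. Both are assumptions, and the Python names are hypothetical.

```python
WEIGHTS = (1, -5, 20, 20, -5, 1)  # taps for x_{-2} .. x_3


def add_shft(acc, a, b, k):
    """Assumed model of the fused op: add two pixels, scale by k, accumulate."""
    return acc + k * (a + b)


def sat(acc):
    """Assumed model of Sat: round, divide by 32, clamp to the range 0..255."""
    return max(0, min(255, (acc + 16) >> 5))


def upsample_fused(x):
    """x holds the six neighbours x_{-2}, x_{-1}, x_0, x_1, x_2, x_3 in order."""
    xm2, xm1, x0, x1, x2, x3 = x
    acc = 0
    acc = add_shft(acc, x0, x1, 20)
    acc = add_shft(acc, xm1, x2, -5)
    acc = add_shft(acc, xm2, x3, 1)
    return sat(acc)


def upsample_unfused(x):
    """Reference version: one multiply-accumulate per tap, then the same Sat."""
    return sat(sum(w * p for w, p in zip(WEIGHTS, x)))
```

The fused form issues three accumulate operations where the unfused form needs six multiplies and five adds, and the intermediate sums never travel through the register file.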

Operation Fusion
- A compiler can find interesting instructions to merge
- Tensilica's Xpress tries to do this automatically
- The authors created the fused instructions manually
- They found ~20 fusion instructions across 4 algorithmic blocks

Not a big gain

Not good enough
- The problem remains that 90% of the energy goes to overhead instructions
- Need a higher ratio of compute to overhead
- Need to aggregate work into large chunks to create highly optimized functional units (FUs)

Magic Instructions
- Can perform a large amount of computation at very low cost
- Achieved by creating instructions that are tightly coupled to custom data storage elements with algorithm-specific communication links

IME Strategy
- SAD: sum of absolute differences
- Hundreds of SAD calculations are needed to get one image difference
- The data for each calculation is nearly the same
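How much the SAD calculations overlap can be estimated by counting pixel loads. The sketch below compares reloading the reference block for every candidate against loading each reference pixel in the search window once. The 16x16 block size and the +/-8 search range are hypothetical values chosen for illustration, not figures from the paper.

```python
def sad_search_loads(block, search_range):
    """Return (candidates, naive_loads, unique_pixels) for a full search.

    candidates:    number of displaced positions tried
    naive_loads:   reference pixels fetched if every candidate reloads its block
    unique_pixels: distinct reference pixels inside the search window
    """
    candidates = (2 * search_range + 1) ** 2
    naive_loads = candidates * block * block
    unique_pixels = (block + 2 * search_range) ** 2
    return candidates, naive_loads, unique_pixels


candidates, naive_loads, unique_pixels = sad_search_loads(16, 8)
reuse_factor = naive_loads / unique_pixels
```

Under these assumptions each reference pixel is reused dozens of times, which is the motivation for keeping the window in custom storage next to the SAD units rather than refetching it through the register file.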


FME Strategy
- Pixel upsampling example: x_n = x_{-2} - 5x_{-1} + 20x_0 + 20x_1 - 5x_2 + x_3
- Normal register files require five register transfers per step
- Augment them with a six-entry, 8-bit-wide shift register structure
- Works like a FIFO: when a new entry arrives, all existing entries shift
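The shift-register idea can be modeled with a fixed-length FIFO: each step writes one new pixel, the other five entries shift, and the six-tap sum is read directly from the entries. This is a behavioral sketch only; the class and function names are hypothetical, and rounding and saturation are left out to match the equation as written on the slide.

```python
from collections import deque

WEIGHTS = (1, -5, 20, 20, -5, 1)  # taps for x_{-2} .. x_3


class TapShiftRegister:
    """Six-entry, 8-bit-wide shift register feeding a six-input adder."""

    def __init__(self, entries=6):
        self.regs = deque([0] * entries, maxlen=entries)

    def shift_in(self, pixel):
        # One write per step; the oldest entry falls off the other end.
        self.regs.append(pixel & 0xFF)

    def filter_output(self):
        # All six entries are read in place, with no register-file transfers.
        return sum(w * p for w, p in zip(WEIGHTS, self.regs))


def upsample_row(pixels):
    """Stream a row through the register; emit one output once it is full."""
    reg = TapShiftRegister()
    outputs = []
    for index, pixel in enumerate(pixels):
        reg.shift_in(pixel)
        if index >= 5:
            outputs.append(reg.filter_output())
    return outputs
```

For example, `upsample_row([1, 2, 3, 4, 5, 6])` yields a single unnormalized output, and every further input pixel yields one more.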

FME Strategy
- Create a six-input multiplier/adder
- For 2-D upsampling, build a shift register that stores horizontally upsampled data and feeds its output to the vertical upsampling units
- (Figure: shift register entries labeled X4, X3, X2, X1, X-1, X-2, X-3)

Other magic instructions
- DCT
  - Matrix transpose
  - Operation fusion with no limitation on the number of operands
- Intra Prediction
  - Customized interconnections for different prediction modes
- CABAC
  - FIFO structures in the binarization module
  - Fundamentally different computations fused with no restrictions

Magic Instructions: Energy (within 3x of ASIC)

Magic Instructions: Performance
- Over 35% of the energy is now used in the ALUs
- Most of the code involves magic instructions

Summary
- Many operations are very simple and consume little energy
- SIMD/vector units parallelize well, but overheads still dominate
- To get on the order of 100 operations per instruction, specialized hardware and memory are needed
- The authors put emphasis on making chip customization feasible
- The focus should be on designing chip generators, not chips

Discussion Points
- How will their architecture designs scale across multiple applications?
- Their comparison baseline of a general-purpose CMP is invalid; they should compare against designs with similar units
- For very varied applications with specific requirements, this might just boil down to designing ASICs
- They do not evaluate the quality of the video (and both encode time and power vary with quality)
