 Understanding the Sources of Inefficiency in General-Purpose Chips.


1  Understanding the Sources of Inefficiency in General-Purpose Chips

2 Motivation  General-purpose processors serve a wide class of applications  Pros:  Quick recovery of non-recurring engineering (NRE) costs  Cons:  Low energy efficiency and poor performance  Specific applications (cell phones, video cameras) have strict needs  The authors use video encoding as the representative application

3 H.264 Encoding Format [Diagram: Input → Prediction (Inter Prediction: IME, FME; Intra Prediction) → Transform / Quantize → Entropy Encode (CABAC)]

4 H.264 Encoding Format  IME (Integer Motion Estimation) – Finds the closest match for an image block in the previous image and computes a vector to represent the observed motion  FME (Fractional Motion Estimation) – Finds a match at quarter-pixel resolution

5 H.264 Encoding Format  IP (Intra Prediction) – Uses previously encoded image blocks within the current image to form a prediction of the current image block  DCT/Quantization – Transforms and quantizes the difference between the current and predicted image blocks  CABAC – Entropy-encodes the coefficients and other elements

6 H.264 algorithm [Diagram: Input → Prediction (Inter Prediction: IME, FME; Intra Prediction) → Transform / Quantize → Entropy Encode (CABAC), with stages annotated as "Data Parallel" or "Sequential"]

7 Percentage execution time H.264  IME + FME account for 92% of the execution time!  CABAC is small but sequential – becomes bottleneck

8 What exactly is the problem? (H.264)  A 2.8 GHz Pentium 4 is 500x worse in energy than an ASIC  A four-processor Tensilica-based CMP is also 500x worse in energy

9 ASIC – Application-Specific Integrated Circuit

10 ASIC - Feasibility  Is it feasible?  Inflexible  Increased manufacturing and design time  Non-recurring engineering (NRE) costs?  Expensive to make for every different application

11 General Idea  Is there an incremental way of going from GP to ASIC?  What is the nature of the overheads?  A solution that has the benefits of both GP and ASIC  Provide flexibility for application experts to build customized solutions for future energy efficiency  Case study – transform a conventional CMP into a customizable processor that is an efficient H.264 encoder  Use Tensilica to create optimized processors

12 Baseline H.264 Implementation  The H.264 video encoding path is long and sequential  Map five major algorithmic blocks to macro blocks  Map four macro blocks to a 4-processor CMP system  Each processor has 16 KB 2-way set-associative I & D caches

13 Baseline H.264 Implementation

14 SIMD and ILP  Exploiting VLIW and SIMD  SIMD – Single Instruction Multiple Data [Figure: four-lane vectors A0..A3, B0..B3, and C0..C3]
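
As a rough illustration of the SIMD idea on this slide, the sketch below applies one operation to four lanes at once. The lane names A, B, and C follow the slide's figure; the choice of addition as the operation and the helper name `simd_add` are assumptions made for illustration, and plain Python lists only model the lanes rather than real vector hardware.

```python
def simd_add(a, b):
    """Model one SIMD add: the same operation is applied to every lane."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


A = [1, 2, 3, 4]        # lanes A0..A3
B = [10, 20, 30, 40]    # lanes B0..B3
C = simd_add(A, B)      # lanes C0..C3, produced by a single "instruction"
```

One instruction fetch and decode is shared by all four lane operations, which is where the energy saving over four scalar adds would come from.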

15 SIMD and ILP  VLIW – Very Long Instruction Word

16 SIMD and ILP

17

18 SIMD and ILP  Processor energy breakdown

19 Operation Fusion  Operation fusion – Fusion of a complex instruction subgraph  Reduces instruction count and register file accesses  Intermediate results are consumed within the op
Eg: x_n = x_{-2} - 5x_{-1} + 20x_0 + 20x_1 - 5x_2 + x_3 (Pixel upsampling)
After fusion:
    acc = 0;
    acc = AddShft(acc, x_0, x_1, 20);
    acc = AddShft(acc, x_{-1}, x_2, -5);
    acc = AddShft(acc, x_{-2}, x_3, 1);
    x_n = Sat(acc);
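
The fused sequence on this slide can be sketched in Python to check that three fused operations reproduce the six-tap formula. The semantics given to `add_shft` (add two pixels, scale by a constant, accumulate) and to `sat` (round, divide by 32, clamp to 8 bits, as in H.264 half-pixel interpolation) are assumptions; the slide names the operations but does not define them.

```python
def add_shft(acc, a, b, coeff):
    """Hypothetical fused op: add two pixels, scale by a constant, accumulate."""
    return acc + coeff * (a + b)


def sat(acc):
    """Assumed normalization: round, divide by 32, clamp to the 8-bit range."""
    return max(0, min(255, (acc + 16) >> 5))


def upsample_fused(x):
    """x holds the six neighbours in order x[-2], x[-1], x[0], x[1], x[2], x[3]."""
    xm2, xm1, x0, x1, x2, x3 = x
    acc = 0
    acc = add_shft(acc, x0, x1, 20)
    acc = add_shft(acc, xm1, x2, -5)
    acc = add_shft(acc, xm2, x3, 1)
    return sat(acc)


def upsample_direct(x):
    """Reference: evaluate the six-tap formula term by term."""
    xm2, xm1, x0, x1, x2, x3 = x
    return sat(xm2 - 5 * xm1 + 20 * x0 + 20 * x1 - 5 * x2 + x3)
```

The fused version issues three operations instead of roughly ten separate multiplies and adds, and the intermediate sums never travel through the register file.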

20 Operation Fusion  A compiler can find interesting instructions to merge  Tensilica’s Xpress tries to do this automatically  The authors created theirs manually  Found ~20 fusion instructions across 4 algorithmic blocks

21 Not a big gain

22 Not good enough  The problem remains that 90% of the energy goes to overhead instructions  Need a higher ratio of compute to overhead  Need to aggregate work into large chunks to create highly optimized functional units (FUs)
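
The compute-to-overhead argument can be made concrete with a toy energy model. This is a sketch under assumed numbers, not data from the presentation: each instruction is assumed to pay a fixed overhead (fetch, decode, register access) plus a fixed energy per useful operation, and the defaults are chosen only so that one operation per instruction yields the 90% overhead figure quoted on the slide. The function name `compute_fraction` is hypothetical.

```python
def compute_fraction(ops_per_instruction, op_energy=1.0, overhead_energy=9.0):
    """Fraction of per-instruction energy spent on useful operations.

    Toy model (an assumption): a fixed overhead per instruction plus
    op_energy for each operation the instruction performs.
    """
    useful = ops_per_instruction * op_energy
    return useful / (useful + overhead_energy)
```

With these illustrative numbers, one operation per instruction spends 10% of the energy on useful work, while packing nine operations into one instruction raises that to 50%. That is the motivation for aggregating work into larger chunks.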

23 Magic Instructions  Can achieve a large amount of computation at very low cost  Achieved by creating instructions that are tightly connected to custom data storage elements with algorithm-specific communication links

24 IME Strategy  SAD – Sum of absolute differences  Hundreds of SAD calculations to get one image difference  Data for each calculation is nearly the same
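
A minimal sketch of the SAD search described on this slide, assuming frames are lists of pixel rows and an exhaustive search over every candidate position. The helper names `sad`, `crop`, and `best_match` are hypothetical, and the function returns the absolute position of the best match rather than a motion vector relative to the current block.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D pixel blocks."""
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))


def crop(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(current_block, reference_frame, size):
    """Exhaustive search: return (sad, top, left) for the best-matching position."""
    rows = len(reference_frame) - size + 1
    cols = len(reference_frame[0]) - size + 1
    return min((sad(current_block, crop(reference_frame, t, l, size)), t, l)
               for t in range(rows) for l in range(cols))
```

Neighbouring candidate positions share most of their pixels, which is the data reuse the slide points to: a general-purpose processor reloads nearly the same data for each SAD, whereas custom storage can keep it in place.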

25 IME Strategy

26 FME Strategy  Pixel up-sampling example, e.g. x_n = x_{-2} - 5x_{-1} + 20x_0 + 20x_1 - 5x_2 + x_3  Normal register files require five register transfers per step  Augment them with a six-entry, 8-bit-wide shift register structure  Works like a FIFO – when a new entry comes in, all entries shift
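
The shift-register idea can be sketched with a fixed-length queue: each step loads one new pixel, and the other five filter inputs are already in place. This is a software model under assumptions, not the authors' hardware; `TAPS` and `upsample_row` are hypothetical names, and the sums are left unnormalized to match the formula as written on the slide.

```python
from collections import deque

TAPS = (1, -5, 20, 20, -5, 1)  # coefficients for x[-2] .. x[3]


def upsample_row(pixels):
    """Yield one unnormalized six-tap sum per step once the window is full."""
    window = deque(maxlen=6)   # models the six-entry shift register
    for p in pixels:
        window.append(p)       # new entry shifts in; the oldest drops out
        if len(window) == 6:
            yield sum(c * x for c, x in zip(TAPS, window))
```

For example, `list(upsample_row([10, 20, 30, 40, 50, 60]))` produces a single sum, and each additional input pixel produces one more output with only one new value loaded.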

27 FME Strategy  Create a six-input multiplier/adder  For 2-D up-sampling, build a shift register that stores horizontally up-sampled data and feeds its output to the vertical up-sampling units [Figure: incoming pixel X4]

28 FME Strategy [Figure: shift register entries X3, X2, X1, X-1, X-2, X-3]

29 Other magic instructions  DCT: matrix transpose; operation fusion with no limit on the number of operands  Intra Prediction: customized interconnections for different prediction modes  CABAC: FIFO structures in the binarization module; fundamentally different computations fused with no restrictions

30 Magic Instructions  Energy (within 3x of ASIC)

31 Magic Instructions Performance  Over 35% of the energy is now used in the ALUs  Most of the code involves magic instructions

32 Summary  Many operations are very simple and consume little energy  SIMD/vector units parallelize well, but overheads dominate  To get hundreds of operations per instruction, specialized hardware and memory are needed  The authors put emphasis on making chip customization feasible  The focus should be on designing chip generators, not chips

33 Discussion Points  How will their architecture designs scale across multiple applications?  Their comparison baseline of a general-purpose CMP is invalid. They should compare against designs with similar units  For widely varied applications with specific requirements, this might just boil down to designing ASICs  They do not evaluate the quality of the video (and both encode time and power vary with quality)

