Published by Dominick McCoy. Modified over 9 years ago.
1
Understanding the Sources of Inefficiency in General-Purpose Chips
2
Motivation
General-purpose processors serve a wide class of applications
Pros: quick recovery of non-recurring engineering (NRE) costs
Cons: low energy efficiency and poor performance
Specific applications (cell phones, video cameras) have strict needs
The authors use video encoding as the representative application
3
H.264 Encoding Format
[Block diagram: Input -> Prediction (Inter Prediction: IME, FME; Intra Prediction) -> Transform / Quantize -> Entropy Encode (CABAC)]
4
H.264 Encoding Format
IME (Integer Motion Estimation) – finds the closest match for an image block in the previous image and computes a vector to represent the observed motion
FME (Fractional Motion Estimation) – refines the match to quarter-pixel resolution
5
H.264 Encoding Format
IP (Intra Prediction) – uses previously encoded image blocks within the current image to form a prediction of the current block
DCT/Quantization – transforms and quantizes the difference between the current and predicted image blocks
CABAC – entropy-encodes the coefficients and other syntax elements
6
H.264 algorithm
[Block diagram: Input -> Prediction (Inter Prediction: IME, FME; Intra Prediction) -> Transform / Quantize -> Entropy Encode (CABAC); blocks are annotated as Data Parallel or Sequential]
7
Percentage execution time (H.264)
IME + FME account for 92% of the execution time
CABAC is a small share of the time but sequential, so it becomes the bottleneck once the other stages are sped up
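The bottleneck claim follows from Amdahl's law. The sketch below is a worked illustration only: it assumes that the 92% spent in IME + FME is the sole part that gets accelerated and that the remaining 8% runs at its original speed. The function name and the speedup values are hypothetical and do not come from the slides.

```python
def overall_speedup(accelerated_fraction: float, speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the run time is accelerated."""
    remaining = 1.0 - accelerated_fraction  # share of time that is not accelerated
    return 1.0 / (remaining + accelerated_fraction / speedup)


if __name__ == "__main__":
    # Assumed speedups for the IME + FME portion, for illustration only.
    for s in (10, 100, 1000):
        print(f"IME+FME sped up {s}x -> overall {overall_speedup(0.92, s):.2f}x")
```

Under these assumptions the overall gain can never exceed 1 / 0.08 = 12.5x, however fast IME and FME become, which is why the small sequential remainder ends up dominating.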
8
What exactly is the problem? (H.264)
A 2.8 GHz Pentium 4 is 500x worse in energy than an ASIC
A four-processor Tensilica-based CMP is also 500x worse in energy
9
ASIC – Application-Specific Integrated Circuit
10
ASIC – Feasibility
Is it feasible?
Inflexible
Increased manufacturing and design time
Non-recurring engineering costs? Expensive to make for every different application
11
General Idea
Is there an incremental way of going from GP to ASIC?
What is the nature of the overheads?
Goal: a solution that has the benefits of both GP and ASIC
Provide flexibility for application experts to build customized solutions for future energy efficiency
Case study: transform a conventional CMP into a customizable processor that is an efficient H.264 encoder
Use Tensilica tools to create the optimized processors
12
Baseline H.264 Implementation
The H.264 video encoding path is long and sequential
Map the five major algorithmic blocks onto a four-stage macroblock pipeline
Map the four pipeline stages onto a 4-processor CMP system
Each processor has 16 KB 2-way set-associative I and D caches
13
Baseline H.264 Implementation
14
SIMD and ILP
Exploiting VLIW and SIMD
SIMD – Single Instruction, Multiple Data
[Figure: one instruction combines lanes A0..A3 with B0..B3 to produce C0..C3]
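A small emulation can make the SIMD idea concrete. This is only a sketch: the slide's figure does not say which operation combines the A and B lanes, so addition is assumed here, and the four-lane width and function names are illustrative. Python runs the lanes one after another; the point is the count of issued instructions, not actual parallel hardware.

```python
LANES = 4  # assumed lane count, matching A0..A3 in the figure


def scalar_add(a, b):
    """One add instruction per element. Returns (results, instructions issued)."""
    out = []
    issued = 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def simd_add(a, b):
    """Emulated SIMD add: one instruction covers up to LANES elements."""
    out = []
    issued = 0
    for i in range(0, len(a), LANES):
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
        issued += 1
    return out, issued


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [10, 20, 30, 40]
    print(scalar_add(a, b))  # four instructions for four results
    print(simd_add(a, b))    # one instruction for the same four results
```

Both versions do the same arithmetic; the SIMD version pays the per-instruction cost (fetch, decode, control) once per group of lanes instead of once per element.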
15
SIMD and ILP VLIW – Very Long Instruction Word
16
SIMD and ILP
18
SIMD and ILP
[Figure: processor energy breakdown]
19
Operation Fusion
Operation fusion – fusion of a complex instruction subgraph into a single operation
Reduces instruction count and register file accesses
Intermediate results are consumed within the op
E.g. (pixel upsampling): x_n = x_{-2} - 5x_{-1} + 20x_0 + 20x_1 - 5x_2 + x_3
After fusion:
acc = 0;
acc = AddShft(acc, x_0, x_1, 20);
acc = AddShft(acc, x_{-1}, x_2, -5);
acc = AddShft(acc, x_{-2}, x_3, 1);
x_n = Sat(acc);
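The fused sequence can be modelled in software to check that it computes the same result as the unfused filter. This is a sketch under stated assumptions: the slide does not define AddShft or Sat, so `add_shft` is assumed to compute `acc + coeff * (a + b)` in one step, and `sat` is assumed to include the rounding and divide-by-32 that H.264 half-sample interpolation applies before clamping to 8 bits (the slide's formula leaves that normalization out).

```python
def sat(acc: int) -> int:
    """Assumed Sat: round, divide by 32, then clamp to the 8-bit pixel range."""
    return max(0, min(255, (acc + 16) >> 5))


def add_shft(acc: int, a: int, b: int, coeff: int) -> int:
    """Hypothetical model of the fused AddShft op: acc + coeff * (a + b)."""
    return acc + coeff * (a + b)


def upsample_unfused(x):
    """Six-tap filter as separate multiplies and adds, each with its own temporary."""
    xm2, xm1, x0, x1, x2, x3 = x  # x_{-2} .. x_3
    t0 = 20 * x0
    t1 = 20 * x1
    t2 = 5 * xm1
    t3 = 5 * x2
    acc = xm2 + x3
    acc = acc + t0
    acc = acc + t1
    acc = acc - t2
    acc = acc - t3
    return sat(acc)


def upsample_fused(x):
    """Same filter as three fused ops; intermediates never leave the accumulator."""
    xm2, xm1, x0, x1, x2, x3 = x
    acc = 0
    acc = add_shft(acc, x0, x1, 20)
    acc = add_shft(acc, xm1, x2, -5)
    acc = add_shft(acc, xm2, x3, 1)
    return sat(acc)


if __name__ == "__main__":
    pixels = [10, 20, 30, 40, 50, 60]
    print(upsample_unfused(pixels), upsample_fused(pixels))
```

The unfused version issues nine arithmetic operations with four named temporaries, while the fused version issues three, which is the reduction in instruction count and register file traffic that the slide describes.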
20
Operation Fusion
A compiler can find interesting instructions to merge
Tensilica's XPRES tries to do this automatically
The authors created the fused instructions manually
Found ~20 fusion instructions across 4 algorithmic blocks
21
Not a big gain
22
Not good enough
The problem remains that 90% of the energy goes to overhead instructions
Need a higher ratio of compute to overhead
Need to aggregate work into large chunks to create highly optimized functional units (FUs)
23
Magic Instructions
Can perform a large amount of computation at very low cost
Achieved by creating instructions that are tightly coupled to custom data storage elements with algorithm-specific communication links
24
IME Strategy
SAD – Sum of Absolute Differences
Hundreds of SAD calculations are needed to get one image difference
The data for each calculation is nearly the same
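A plain software version of the SAD search shows where the repeated work comes from. This is a minimal full-search sketch, not the authors' hardware design: the function names, the block size `n`, and the `search_range` parameter are illustrative assumptions, and frames are plain lists of rows.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between an n x n block of `cur` at (bx, by)
    and an n x n block of `ref` at (rx, ry)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Try every integer offset within +/- search_range and keep the lowest SAD.
    Returns (cost, dx, dy), or None if no candidate fits inside the frame."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block would fall outside the reference frame
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


if __name__ == "__main__":
    # Synthetic reference frame; the current frame is the reference shifted by (2, 1).
    ref = [[3 * x * x + 7 * y * y + x * y for x in range(8)] for y in range(8)]
    cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]
    print(full_search(cur, ref, bx=2, by=2, n=2, search_range=2))
```

Two candidates that differ by one pixel horizontally share all but one column of reference pixels, so a general-purpose loop like this reloads almost the same data for every SAD. That overlap is the reuse the slide points to.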
25
IME Strategy
26
FME Strategy
Pixel up-sampling example: x_n = x_{-2} - 5x_{-1} + 20x_0 + 20x_1 - 5x_2 + x_3
Normal register files require five register transfers per step
Augment them with a six-entry, 8-bit-wide shift register structure
Works like a FIFO: when a new entry arrives, all existing entries shift
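The shift-register idea can be sketched with a fixed-length queue: each step shifts in one new pixel and the six-tap sum is taken directly from the held entries. The class and function names are hypothetical, the software queue only mimics the behaviour of the hardware structure, and the sums are left unnormalized (no rounding, division, or clamping).

```python
from collections import deque

TAPS = (1, -5, 20, 20, -5, 1)  # six-tap up-sampling coefficients from the slide


class ShiftRegister6:
    """Six-entry FIFO: pushing a new pixel shifts the others and drops the oldest."""

    def __init__(self):
        self.entries = deque(maxlen=len(TAPS))

    def push(self, pixel):
        """Shift in one pixel. Returns the filter sum once six entries are held,
        otherwise None."""
        self.entries.append(pixel)
        if len(self.entries) < len(TAPS):
            return None
        return sum(c * p for c, p in zip(TAPS, self.entries))


def filter_row(pixels):
    """Run a row of pixels through the register, one new pixel per step."""
    reg = ShiftRegister6()
    out = []
    for p in pixels:
        acc = reg.push(p)
        if acc is not None:
            out.append(acc)
    return out


if __name__ == "__main__":
    print(filter_row([10, 20, 30, 40, 50, 60, 70, 80]))
```

After the register fills, every output costs a single new input. The other five operands are already in place, which is what replaces the five register transfers per step mentioned on the slide.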
27
FME Strategy
[Figure label: X4]
Create a six-input multiplier/adder
For 2-D up-sampling, build a shift register that stores horizontally up-sampled data and feeds its output to the vertical up-sampling units
28
FME Strategy
[Figure: shift register entries labelled X3, X2, X1, X-1, X-2, X-3]
29
Other magic instructions
DCT: matrix transpose; operation fusion with no limitation on the number of operands
Intra Prediction: customized interconnections for different prediction modes
CABAC: FIFO structures in the binarization module; fundamentally different computations fused with no restrictions
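The role of a transpose in the DCT stage can be illustrated with the H.264 4x4 forward core transform, Y = C X C^T. The sketch below computes it as a row transform, a transpose, a second row transform, and a final transpose. The slides do not describe how the authors' hardware arranges these steps, so this only shows why a fast transpose matters; the function names are illustrative, and the scaling that H.264 folds into quantization is omitted.

```python
# H.264 4x4 forward core transform matrix (integer approximation of the DCT).
C = (
    (1, 1, 1, 1),
    (2, 1, -1, -2),
    (1, -1, -1, 1),
    (1, -2, 2, -1),
)


def transform_rows(block):
    """Apply the 1-D transform to every row, i.e. compute block * C^T."""
    return [[sum(c * v for c, v in zip(coeffs, row)) for coeffs in C] for row in block]


def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]


def forward_transform_4x4(block):
    """Compute C * X * C^T using only row transforms and transposes."""
    step = transform_rows(block)   # X C^T
    step = transpose(step)         # C X^T
    step = transform_rows(step)    # C X^T C^T
    return transpose(step)         # C X C^T


if __name__ == "__main__":
    flat = [[5] * 4 for _ in range(4)]
    for row in forward_transform_4x4(flat):
        print(row)
```

A flat block ends up with all of its energy in the top-left (DC) coefficient. The two transposes are pure data movement, which is costly through a normal register file and cheap with dedicated wiring.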
30
Magic Instructions
Energy (within 3x of ASIC)
31
Magic Instructions
Performance
Over 35% of the energy is now used in the ALUs
Most of the code involves magic instructions
32
Summary
Many operations are very simple and low in energy
SIMD/vector units parallelize well, but overheads dominate
To get about 100 ops per instruction, specialized hardware and memory are needed
The authors put emphasis on making chip customization feasible
The focus should be on designing chip generators, not chips
33
Discussion Points
How will their architecture designs scale across multiple applications?
Their comparison baseline of a general-purpose CMP is invalid; they should compare against designs with similar units
For widely varied applications with specific requirements, this might just boil down to designing ASICs
They do not evaluate the quality of the video (and both encode time and power vary with quality)