Understanding Sources of Inefficiency in General-Purpose Chips
R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. Lee, S. Richardson, C. Kozyrakis, M. Horowitz. 10/3/14
Overview
1. Discuss the motivation for this paper
2. Explore sources of performance and energy inefficiency overheads in a general-purpose CMP system
3. Investigate methods to eliminate these overheads in a customized system
4. Summarize key takeaways
Motivation
Computing systems are power limited; what is the best way to address this issue?
Power scaling?
–No longer yields significant gains
Other architectural optimizations?
–Improvements of roughly 10x in performance and 7x in energy
ASIC?
–Pros: great performance and energy efficiency
–Cons: costly to design and inflexible
Solution: try to achieve ASIC-like efficiency by addressing overheads in general-purpose CMP systems.
Motivation (Cont.)
Overarching strategy: try to achieve ASIC-like efficiency by addressing overheads.
Understand the causes of overheads, then use that understanding to motivate design choices:
–Focus on exploiting instruction- and data-level parallelism
–Customize instructions by fusing frequently occurring instruction subgraphs into complex instructions
–Create application-specific data storage with fused functional units
By understanding the root causes of overhead, future systems can better determine the nature and degree of customizability required.
2.1 Related Work in Efficient Computing
Customizing general-purpose processors to improve efficiency for specific applications is not a new idea:
SIMD
–Used for multimedia and data-parallel applications
DSPs
–Signal processing
ELM and AnySP
–Embedded/mobile signal-processing applications
–Broad-spectrum applications, but can include special ops for critical (or frequently used) instructions
Customizable processors
–Allow for customizable instructions and for the creation of new instructions
Brief Aside – Extensible Processors
Definition: an extensible processor (e.g., Tensilica Xtensa) is a processor with a base design that can be extended with custom instructions and datapath units. Extensions can be specified through the ISA manually or generated with automated tools.
Taken from: http://intranet.deei.fct.ualg.pt/IHS/Papers/Gonzalez00.pdf
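As a loose illustration (not Tensilica's actual API — the class and method names here are invented), an extensible processor can be modeled as a base instruction table into which custom instructions are registered:

```python
# Toy model of an extensible processor: a base ALU whose instruction
# table can be extended with application-specific instructions, in the
# spirit of Tensilica Xtensa. Names here are illustrative only.
class ExtensibleCPU:
    def __init__(self):
        # the fixed base ISA every configuration ships with
        self.isa = {"add": lambda a, b: a + b,
                    "sub": lambda a, b: a - b}

    def extend(self, name, fn):
        """Register a custom instruction alongside the base ISA."""
        self.isa[name] = fn

    def execute(self, op, a, b):
        return self.isa[op](a, b)

cpu = ExtensibleCPU()
# a custom instruction: absolute difference, a motion-estimation primitive
cpu.extend("absdiff", lambda a, b: abs(a - b))
print(cpu.execute("absdiff", 3, 10))  # 7
```

The point of the model: the base design stays fixed, and only the instruction table (and, in real hardware, the matching datapath) grows per application.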
2.2 H.264 Algorithm and Computational Motifs
Five functions dominate 99% of the total execution time in the base CMP implementation of H.264:
1) IME: Integer Motion Estimation (56% of total encoder time, 52% of total energy)
2) FME: Fractional Motion Estimation (36% of total encoder time, 40% of total energy)
3) IP: Intra Prediction
4) DCT/Quant: Transform and Quantization (along with IP, 7% of total encoder time, 6% of total energy)
5) CABAC: Context-Adaptive Binary Arithmetic Coding (1.6% of total encoder time, 1.7% of total energy)
Taken from http://en.wikipedia.org/wiki/H.264/MPEG-4_AVC
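The motion-estimation stages above are dominated by the sum of absolute differences (SAD) between a block of the current frame and a candidate block in the reference frame. A minimal sketch of the 4x4 SAD kernel (plain Python, not the paper's hardware):

```python
# Sum of absolute differences (SAD): the inner kernel of integer
# motion estimation. Compare a 4x4 block of the current frame
# against a candidate 4x4 block in the reference frame.
def sad_4x4(cur, ref):
    """cur, ref: 4x4 nested lists of pixel values (0-255)."""
    return sum(abs(c - r)
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))

cur = [[10, 20, 30, 40]] * 4
ref = [[12, 18, 30, 44]] * 4
print(sad_4x4(cur, ref))  # 4 rows * (2 + 2 + 0 + 4) = 32
```

The encoder evaluates this kernel over a large search window of candidate positions, which is why IME dominates the time and energy budget.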
2.3 Current H.264 Implementations
So what do we know about current H.264 implementations?
Typically implemented as ASICs. T.-C. Chen et al. demonstrated that low power and area cost are possible for HD H.264 encoding. Other implementations exist that trade SNR for reduced energy and area.
Previous work uses efficient hardware architectures but does not explain what causes the increase in efficiency.
Sparse search plus algorithmic modifications of the motion-estimation steps (IME and FME) increase performance 10x.
3.1 Baseline H.264 Implementation – Mapping Algorithm to MB
3.1 Baseline H.264 Implementation – Macroblock Breakdown
3.1 Baseline H.264 Implementation – MB/Datapath Energy Breakdown
3.2 Customization Strategies
4.1 SIMD and VLIW Enhancements
Using TIE, SIMD execution units are added to the base processor with vector register files of custom depths and widths.
IME/FME use 16- and 18-way SIMD
–Speedups of 10x and 14x
INTRA uses 8-way SIMD
–Speedup of 6x
Concurrent operations barely increase the energy share of the functional units (which comprise 10% of total energy); register file energy decreases by 4-6x
VLIW using 2 slots
–1.5x performance improvement
CABAC's code size increased; the resulting growth in cache size/accesses neutralizes the energy gains for this stage
Combined SIMD and VLIW
–Speedup of 10x
–Instruction-fetch energy down by 10x (the share of energy going to instruction fetch doesn't change)
–CABAC becomes a major contributor to power dissipation
Takeaway: SIMD is good for data-parallel applications; VLIW helps all H.264 sub-algorithms achieve speedups (2-slot offers more gain).
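A toy model of why wide SIMD shrinks instruction count for these kernels: one vector operation covers a whole 16-pixel row that would otherwise take 16 scalar operations. The operation counts below are illustrative, not measurements from the paper:

```python
# Illustrative instruction-count model: scalar vs. 16-way SIMD
# absolute-difference over one 16-pixel row.
def scalar_abs_diff(cur_row, ref_row):
    out, ops = [], 0
    for c, r in zip(cur_row, ref_row):
        out.append(abs(c - r))  # one scalar op per pixel
        ops += 1
    return out, ops

def simd_abs_diff(cur_row, ref_row):
    # one "vector instruction" processes all 16 lanes at once
    out = [abs(c - r) for c, r in zip(cur_row, ref_row)]
    return out, 1

cur = list(range(16))
ref = [x + 2 for x in cur]
scalar_out, scalar_ops = scalar_abs_diff(cur, ref)
simd_out, simd_ops = simd_abs_diff(cur, ref)
print(scalar_ops, simd_ops)  # 16 1
```

Fewer instructions fetched and decoded per pixel is exactly where the 10x instruction-fetch energy reduction on this slide comes from: the per-instruction overhead is amortized over 16 lanes of useful work.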
4.1 SIMD and VLIW Enhancements – Graphs
4.1 SIMD and VLIW Enhancements – Graphs
4.2 Operation Fusion
The second customization strategy, operation fusion, is added on top of the first.
–Operation fusion can be targeted by automatic tools
–Reduces both instruction count and register file accesses: fused operations feed directly into each other, eliminating intermediate storage in the register file
Example: pixel-upsampling instructions
–Fuse addition/subtraction with multiplication
–Implicitly use an accumulator through the fused op
–Reduce register file accesses; forwarding results reduces energy
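The register-file savings from fusion can be sketched with a toy cost model: a fused multiply-accumulate keeps the running sum in an implicit accumulator, while the unfused sequence pays register-file traffic for every intermediate. The access counts are illustrative assumptions, not the paper's numbers:

```python
# Toy cost model of operation fusion: fused multiply-accumulate vs.
# separate multiply and add instructions.
class FusedMAC:
    def __init__(self):
        self.acc = 0               # implicit accumulator, never spilled
        self.regfile_accesses = 0

    def mac(self, a, b):
        self.regfile_accesses += 2  # read a, read b
        self.acc += a * b           # product forwarded directly

def unfused_dot(xs, ys):
    total, accesses = 0, 0
    for a, b in zip(xs, ys):
        p = a * b
        accesses += 3  # read a, read b, write product to reg file
        total += p
        accesses += 3  # read product, read total, write total
    return total, accesses

xs, ys = [1, 2, 3], [4, 5, 6]
mac = FusedMAC()
for a, b in zip(xs, ys):
    mac.mac(a, b)
print(mac.acc, mac.regfile_accesses)  # 32 6
print(unfused_dot(xs, ys))            # (32, 18)
```

Same result, a third of the register-file traffic in this model: that traffic reduction is the energy mechanism this slide describes.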
4.2 Operation Fusion – Graphs
4.2 Operation Fusion – Graphs
4.3 Magic Instructions
"Magic" instructions can execute hundreds of operations in one instruction. They require custom data storage elements with algorithm-specific communication links to large amounts of data, and are highly specific to the algorithm being optimized.
FME strategy (diagram on next slide):
–Reduce instruction fetches and register file transfers; augment processor registers
–Add a six-input multiplier/adder implemented using carry-save addition
–Upsampling is done in 2-D using shift registers that store horizontally upsampled data and feed their outputs vertically
IME strategy (shown in a couple of slides):
–Custom datapath elements accelerate the 4x4 sum of absolute differences; a 16-way SIMD 16x16 SAD unit
–Register files are replaced with state registers with parallel access
CABAC strategy (shown in a couple of slides):
–Arithmetic coding uses a simple pipeline; the reference code's binary encoding of each symbol is limited to five instructions
–The encoding loop is reduced to a single constant-time instruction plus a smaller loop
–Non-binary DCT coefficients are converted to binary codes (a 16-entry LIFO structure stores DCT coefficients plus a 1-bit flag for zero values)
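The six-input multiplier/adder in the FME strategy corresponds to H.264's 6-tap half-pel interpolation filter (taps 1, -5, 20, 20, -5, 1). A sketch of the tap-sum the magic datapath evaluates in a single step, written here as plain Python rather than carry-save hardware:

```python
# Sketch of the six-input filter sum behind FME upsampling: H.264
# half-pel interpolation applies the 6-tap filter (1,-5,20,20,-5,1)
# across six neighboring integer-pel samples. In the magic datapath
# this whole sum is one operation; here it is ordinary arithmetic.
TAPS = (1, -5, 20, 20, -5, 1)

def half_pel(pixels):
    """pixels: six neighboring integer-pel samples (0-255)."""
    s = sum(t * p for t, p in zip(TAPS, pixels))
    # normalize by 32 with rounding, then clip to the 8-bit range
    return min(255, max(0, (s + 16) >> 5))

print(half_pel([100, 100, 100, 100, 100, 100]))  # 100
```

On a general-purpose datapath this sum costs a chain of multiplies, adds, a shift, and clips per output pixel; fusing it into one six-input unit removes all of the intermediate instruction fetches and register-file round trips.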
4.3.1 FME Strategy
4.3.2 IME Strategy
4.3.3 CABAC Strategy
One more time…
4.4 Area Efficiency
5 Energy-Efficient Computing / Wrap-Up
So where does this leave us? We know an ASIC can be vastly superior to a CMP system. How do we achieve ASIC-like energy efficiency and performance?
–Use SIMD/VLIW
–Add in operation fusion
–Customize hardware and implement "magic" instructions
But to do the above, we need a deep understanding of:
1) The algorithm
2) The hardware
3) The overheads
By understanding the sources of overhead first, we can:
–Iterate faster
–Use a general-purpose processor (which gives flexibility for future algorithmic modifications)
Downside: we still need to customize, which includes designing and validating the custom hardware.