Understanding the Sources of Inefficiency in General-Purpose Chips
(Paper: R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. Lee, S. Richardson, C. Kozyrakis, et al.)

Motivation  General-purpose processors serve a wide class of applications  Pros: quick recovery of non-recurring engineering (NRE) costs  Cons: low energy efficiency and poor performance  Specific applications (cell phones, video cameras) have strict needs  The authors use video encoding as the representative application

H.264 Encoding Format  Input → Prediction (Inter Prediction: IME, FME; Intra Prediction) → Transform / Quantize → Entropy Encode (CABAC)

H.264 Encoding Format  IME (Integer Motion Estimation) – finds the closest match for an image block in the previous image and computes a vector representing the observed motion  FME (Fractional Motion Estimation) – refines the match to quarter-pixel resolution

H.264 Encoding Format  IP (intra prediction) – uses previously encoded image blocks within the current image to form a prediction of the current block  DCT/Quantization – transforms and quantizes the difference between the current and predicted image blocks  CABAC – entropy encodes the coefficients and syntax elements
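The transform/quantize step above can be sketched as follows. This is illustrative only — H.264 actually uses a 4x4 integer DCT approximation and per-position quantization matrices, which this sketch replaces with a plain residual and a uniform step size:

```python
def residual(cur, pred):
    # Difference between the current block and its prediction.
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(cur, pred)]

def quantize(block, qstep):
    # Uniform round-toward-zero quantization, standing in for
    # H.264's quantization matrices.
    return [[int(v / qstep) for v in row] for row in block]
```

Small residuals quantize to zero, which is what makes the entropy-coding stage effective.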

H.264 algorithm  Input → Prediction (Inter: IME, FME; Intra) → Transform / Quantize → Entropy Encode (CABAC)  The prediction and transform stages are data parallel; CABAC is sequential

Percentage execution time in H.264  IME + FME account for 92% of the execution time!  CABAC is small but sequential – it becomes the bottleneck

What exactly is the problem? (H.264)  A 2.8 GHz Pentium 4 is ~500x worse in energy  A four-processor Tensilica-based CMP is also ~500x worse in energy

ASIC – Application-Specific Integrated Circuit

ASIC – Feasibility  Is it feasible?  Inflexible  Increased manufacturing and design time  High non-recurring engineering (NRE) costs  Expensive to build for every different application

General Idea  Is there an incremental way of going from a general-purpose processor to an ASIC?  What is the nature of the overheads?  Goal: a solution that has the benefits of both GP and ASIC  Provide flexibility for application experts to build customized solutions for future energy efficiency  Case study – transform a conventional CMP into a customizable processor that is an efficient H.264 encoder  Tensilica tooling is used to create the optimized processors

Baseline H.264 Implementation  The H.264 video encoding path is long and sequential  The five major algorithmic blocks operate macroblock by macroblock  These blocks are mapped onto a four-processor CMP system  Each processor has 16 KB 2-way set-associative I and D caches

Baseline H.264 Implementation

SIMD and ILP  Exploiting VLIW and SIMD  SIMD – Single Instruction, Multiple Data: one instruction applies the same operation to every lane (A0..A3 + B0..B3 → C0..C3)
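A minimal sketch of the lane-wise idea, with scalar Python standing in for real vector hardware; the four-lane width mirrors the slide's A0..A3:

```python
VECTOR_WIDTH = 4  # lanes per register, as in the slide's A0..A3

def simd_add(a, b):
    # One "instruction": the same add applied to every lane at once.
    assert len(a) == len(b) == VECTOR_WIDTH
    return [x + y for x, y in zip(a, b)]
```

The point of the slide is amortization: the fetch/decode overhead of one instruction is shared across all four lane operations.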

SIMD and ILP  VLIW – Very Long Instruction Word

SIMD and ILP

SIMD and ILP  Processor energy breakdown

Operation Fusion  Operation fusion – fusing a complex instruction subgraph into a single instruction  Reduces instruction count and register file accesses  Intermediate results are consumed within the op  E.g.: x[n] = x[-2] - 5x[-1] + 20x[0] + 20x[1] - 5x[2] + x[3] (pixel up-sampling)  After fusion: acc = 0; acc = AddShft(acc, x[0], x[1], 20); acc = AddShft(acc, x[-1], x[2], -5); acc = AddShft(acc, x[-2], x[3], 1); x[n] = Sat(acc);
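The fused sequence above can be sketched as follows. The semantics AddShft(acc, a, b, c) = acc + c*(a + b) and the 0..255 saturation range are assumptions based on the slide's example; in hardware the constant multiplies would be built from shifts and adds:

```python
def add_shft(acc, a, b, coeff):
    # Assumed semantics: accumulate coeff * (a + b); e.g. 20x would be
    # realized as 16x + 4x with shifts and adds in hardware.
    return acc + coeff * (a + b)

def sat(v, lo=0, hi=255):
    # Clamp to the 8-bit pixel range.
    return max(lo, min(hi, v))

def upsample(xm2, xm1, x0, x1, x2, x3):
    # x[n] = x[-2] - 5x[-1] + 20x[0] + 20x[1] - 5x[2] + x[3]
    acc = 0
    acc = add_shft(acc, x0, x1, 20)
    acc = add_shft(acc, xm1, x2, -5)
    acc = add_shft(acc, xm2, x3, 1)
    return sat(acc)
```

Note that the three AddShft calls exploit the filter's symmetric coefficients: each pairs the two taps that share a weight, so three fused ops replace six multiplies and five adds.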

Operation Fusion  A compiler can find interesting instructions to merge  Tensilica's XPRES tries to do this automatically  The authors created the fused instructions manually  Found ~20 fusion instructions across the 4 algorithmic blocks

Not a big gain

Not good enough  The problem remains that ~90% of the energy goes to overhead instructions  Need a much higher compute-to-overhead ratio  Need to aggregate work into large chunks to create highly optimized functional units

Magic Instructions  Achieve a large amount of computation at very low cost  Achieved by creating instructions that are tightly connected to custom data-storage elements, with algorithm-specific communication links

IME Strategy  SAD – Sum of Absolute Differences  Hundreds of SAD calculations are needed to match one image block  The data for each calculation is nearly the same
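A sketch of the SAD kernel and the exhaustive integer search built on it. Frame layout, block size, and search radius here are illustrative assumptions, not the paper's configuration:

```python
def sad(ref, cur, rx, ry, cx, cy, n=4):
    # Sum of absolute differences between the n x n reference block at
    # (rx, ry) and the current block at (cx, cy).
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(n) for i in range(n))

def integer_motion_search(ref, cur, cx, cy, n=4, radius=2):
    # Exhaustive search: one SAD per candidate offset in the window.
    best_mv, best_cost = None, float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx and 0 <= ry and rx + n <= len(ref[0]) and ry + n <= len(ref):
                cost = sad(ref, cur, rx, ry, cx, cy, n)
                if cost < best_cost:
                    best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

Adjacent candidate blocks overlap in almost all of their pixels — exactly the data reuse that the custom storage elements exploit.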

IME Strategy

FME Strategy  Pixel up-sampling example: x[n] = x[-2] - 5x[-1] + 20x[0] + 20x[1] - 5x[2] + x[3]  A normal register file requires five register transfers per step  Augment it with a six-entry, 8-bit-wide shift register structure  Works like a FIFO – when a new entry arrives, all entries shift

FME Strategy  Create a six-input multiplier/adder  For 2-D up-sampling, build a shift register that stores horizontally up-sampled data and feeds its output to the vertical up-sampling units
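The shift-register structure can be sketched as a streaming filter: one new pixel enters per step, all entries shift along, and the six taps feed the up-sampler directly. The FIFO behavior is from the slide; the tap ordering is an assumption:

```python
from collections import deque

def stream_upsample(pixels):
    taps = deque([0] * 6, maxlen=6)   # six 8-bit-wide entries
    weights = (1, -5, 20, 20, -5, 1)  # x[-2], x[-1], x[0], x[1], x[2], x[3]
    out = []
    for p in pixels:
        taps.append(p)  # one new entry arrives; every older entry shifts
        out.append(sum(w * t for w, t in zip(weights, taps)))
    return out
```

Each output now costs one pixel transfer instead of five register moves, which is the point of the custom storage.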

FME Strategy

Other magic instructions  DCT: matrix transpose; operation fusion with no limitation on the number of operands  Intra Prediction: customized interconnections for the different prediction modes  CABAC: FIFO structures in the binarization module; fundamentally different computation, fused with no restrictions
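The transpose support for the DCT can be sketched as follows. A 2-D transform applies a 1-D transform to rows, transposes, then applies it to columns; a dedicated transpose structure avoids funneling the matrix through the register file. The 4x4 size matches H.264's transform block:

```python
def transpose4(m):
    # Swap rows and columns of a 4x4 block.
    assert len(m) == 4 and all(len(row) == 4 for row in m)
    return [list(col) for col in zip(*m)]
```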

Magic Instructions Energy (within 3x of the ASIC)

Magic Instructions Performance  Over 35% of the energy is now used in the ALUs  Most of the executed code consists of magic instructions

Summary  Many operations are very simple and consume little energy  SIMD/vector units parallelize well, but overheads dominate  To reach hundreds of ops per cycle, specialized hardware and memory structures are needed  The authors put emphasis on making chip customization feasible  The focus should be on designing chip generators, not chips

Discussion Points  How will their architecture designs scale across multiple applications?  Their comparison baseline for a general-purpose CMP is invalid – they should compare against designs with similar functional units  For very varied applications with specific requirements, this might just boil down to designing ASICs  They do not evaluate the quality of the encoded video (both encode time and power vary with quality)