Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Gheewala, A.; Peir, J.-K.; Yen-Kuang Chen; Lai, K.; IEEE.

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Gheewala, A.; Peir, J.-K.; Yen-Kuang Chen; Lai, K.; IEEE International Workshop on Workload Characterization Pages: Nov. 2002

Abstract

• The growing popularity of multimedia applications has prompted microprocessor designers to include media-enhancement instructions. This paper describes a methodology for estimating the performance improvement that a new set of media instructions brings to emerging applications, based on workload characterization and measurement. Application programs are characterized into a sequential segment, a vectorizable segment, and the extra data moves needed to use the SIMD capability of the new media instructions.
• Techniques based on benchmarking and measurement on existing systems are used to estimate the execution time of each segment. From the measurement results, the speedup and the additional data moves incurred by the new media instructions can be estimated, helping processor architects and designers evaluate design tradeoffs.

Outline

• What’s the problem
• Introduction
• Methodology foundation and analysis
• Proposed performance estimation methodology
• Experimental results and evaluation
• Conclusions

What’s the Problem

• Traditional performance evaluation of a new set of media instructions is a time-consuming process
  - It requires detailed processor models that handle both regular and new SIMD media instructions
  - It needs executable binary code for the new media-extension instructions to drive the simulator
• It is essential to quickly estimate the speedup of applications that use a few additional media instructions, so that the tradeoffs of those instructions can be assessed

Introduction

• The proposed methodology
  - Is based on timing measurements on existing systems, where the new SIMD instructions are not available
• The execution time of the following segments can be derived
  - Sequential segment
  - Vectorized segment: the code segment that can be vectorized by a set of new SIMD instructions
  - Data move segment: the explicit data-move code needed to use the new SIMD instructions
• The execution time of an application with SIMD instructions can be estimated from the three segments
  - Only existing hardware is needed
  - No cycle-accurate simulator is required

Estimating Speedup for MMX

• Amdahl’s law can estimate the speedup of an application
  - f is the fraction of the program that can be vectorized
  - n is the ideal speedup of f
• Amdahl’s law is modified to accommodate MMX technology
  - O is the portion of the code in the vectorizable segment that cannot be replaced by MMX instructions, such as loop control and procedure calls
  - D is the fraction of data-move instructions, that is, explicit data moves to and from the MMX registers
  - m is the speedup of the data moves
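The slide's equations did not survive in the transcript. As a rough sketch, the classic form of Amdahl's law and a modified form assembled from the symbol definitions above (and from the timing terms listed on the later projection slide) could be written as follows. The function names are hypothetical, and the exact placement of D/m in the denominator is an assumption inferred from the definition of m, not a quotation from the paper.

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Classic Amdahl's law: a fraction f of the program is sped up n times."""
    return 1.0 / ((1.0 - f) + f / n)


def mmx_speedup(f: float, n: float, o: float, d: float, m: float = 1.0) -> float:
    """Modified form, reconstructed from the slide's symbol definitions.

    Assumed form: 1 / ((1 - f) + O + (f - O)/n + D/m)
      f: vectorizable fraction        n: speedup of the SIMD-replaceable part
      o: part of f that stays scalar  d: added data-move fraction
      m: speedup of the data moves (1.0 means no improvement)
    """
    if not 0.0 <= o <= f <= 1.0:
        raise ValueError("expected 0 <= O <= f <= 1")
    return 1.0 / ((1.0 - f) + o + (f - o) / n + d / m)


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the paper.
    print(amdahl_speedup(0.8, 4.0))
    print(mmx_speedup(0.8, 4.0, o=0.1, d=0.05))
```

With O = 0 and D = 0 the modified form reduces to the classic one, which is a quick sanity check on the reconstruction.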

SIMD with Data Rearrangement

• Data arrangement in registers for matrix multiplication
• Packed Multiply-and-Add (PMADDWD)
  - Performs four 16-bit multiplications and two 32-bit additions
• Packed Add (PADDD)
  - Performs two 32-bit additions
• [Figure: register layout with 16-bit and 32-bit elements]
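To make the two instructions concrete, here is a small Python emulation of their arithmetic on 64-bit MMX registers, modeled as lists of lanes with the low lane first. This is only an illustrative sketch: the helper names and the `dot8` usage are hypothetical and are not taken from the slides.

```python
def _wrap32(x: int) -> int:
    """Wrap an integer to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def pmaddwd(a, b):
    """Emulate PMADDWD: four signed 16-bit multiplies, then two 32-bit adds.

    a and b each hold four signed 16-bit lanes; the result holds two
    signed 32-bit lanes: [a0*b0 + a1*b1, a2*b2 + a3*b3].
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("PMADDWD operates on four 16-bit lanes")
    p = [x * y for x, y in zip(a, b)]
    return [_wrap32(p[0] + p[1]), _wrap32(p[2] + p[3])]


def paddd(a, b):
    """Emulate PADDD: two lane-wise 32-bit additions with wraparound."""
    return [_wrap32(a[0] + b[0]), _wrap32(a[1] + b[1])]


def dot8(row, col):
    """Dot product of two 8-element vectors built from the two instructions."""
    lo = pmaddwd(row[:4], col[:4])
    hi = pmaddwd(row[4:], col[4:])
    s = paddd(lo, hi)
    # The two partial sums still sit in separate lanes and need one more add.
    return _wrap32(s[0] + s[1])


if __name__ == "__main__":
    print(pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))
    print(dot8(list(range(1, 9)), [1] * 8))
```

The final scalar add in `dot8` shows why data arrangement matters: with this layout the partial sums end up in different lanes of the same register.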

SIMD with Data Rearrangement (cont.)

• Another way of arranging data in registers
  - A more natural data arrangement
  - A new PADDD variant is invented to support it: it adds the high-order and low-order 32 bits of each of the two source registers
• [Figure: register layout with 16-bit and 32-bit elements]
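A minimal sketch of what such a variant might compute is given below. The name `paddd_horizontal` and the order of the two result lanes are assumptions made for illustration; the slide only states that the high-order and low-order 32 bits of each source register are added.

```python
def _wrap32(x: int) -> int:
    """Wrap an integer to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def paddd_horizontal(a, b):
    """Hypothetical new PADDD: add the two 32-bit halves within each source.

    a and b each hold two signed 32-bit lanes (low lane first).
    Assumed result layout: [a_low + a_high, b_low + b_high].
    """
    if len(a) != 2 or len(b) != 2:
        raise ValueError("expected two 32-bit lanes per register")
    return [_wrap32(a[0] + a[1]), _wrap32(b[0] + b[1])]


if __name__ == "__main__":
    # Two registers of PMADDWD-style partial sums collapse into two
    # finished sums with a single instruction.
    print(paddd_horizontal([17, 53], [10, 20]))
```

Under this assumed layout, each source register contributes one finished sum, so no separate lane-shuffling data move is needed before the addition.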

Workload Characterization and Measurement

• Four types of code
  - Equivalent C-code (executable on existing systems): the application program written in C
  - MMX-code (not executable on existing systems): developed with the new SIMD and data-move instructions
  - Pseudo MMX-code (executable on existing systems): replaces the new SIMD instructions with equivalent MMX-like C instructions and includes all the data moves of the MMX-code
  - Cripple code (executable on existing systems): removes the new SIMD instructions from the MMX-code without replacement
• Important assumption
  - Four SIMD computation instructions are assumed to be new to the current MMX ISA: PMADDWD, PADDD, PSUBD, PSRAD

Workload Characterization and Measurement

• The pseudo MMX-code replaces the corresponding new SIMD instructions with equivalent C instructions
• It keeps all the data-move instructions of the original MMX-code
• [Figure: portion of the MMX-code and its equivalent pseudo MMX-code from IDCT]

Timing Components of Four Types of Code

• Sequential segment (1-f)
• Vectorizable portion of the C-code (f-O)
  - Main target for improvement with new SIMD instructions
• Unvectorizable portion (O)
• Data-move segment (D)
• The execution time of the individual components can be derived, except for the new SIMD instructions

Performance Projection and Verification

• Derivation of the individual timing components
  - Data-move segment (D): difference in execution time between the equivalent C-code and the pseudo MMX-code
  - Vectorizable portion of the C-code (f-O): difference in execution time between the Cripple code and the pseudo MMX-code
  - Unvectorizable portion (O): difference in execution time between the original vectorizable segment (f) and the vectorizable portion of the C-code (f-O)
• Total execution time and speedup estimation: the estimated total is the sum of
  - Sequential segment execution time: (1-f)
  - Unvectorizable portion execution time: O
  - Execution time spent on the new SIMD instructions: (f-O) / n
  - Data-move segment execution time: D
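The derivation above can be sketched as a few subtractions over measured run times. Everything named here (`Measurements`, its fields, the sample numbers) is hypothetical. The sketch also assumes that the time of the original vectorizable segment (f) is measured directly in the equivalent C-code, which the slide does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Measurements:
    t_c: float            # equivalent C-code, whole application
    t_pseudo: float       # pseudo MMX-code (C equivalents plus data moves)
    t_cripple: float      # Cripple code (new SIMD removed, data moves kept)
    t_vec_segment: float  # original vectorizable segment (f) in the C-code


def derive_components(m: Measurements) -> dict:
    """Split measured times into the four timing components."""
    data_move = m.t_pseudo - m.t_c                 # D
    vectorizable = m.t_pseudo - m.t_cripple        # f - O
    unvectorizable = m.t_vec_segment - vectorizable  # O
    sequential = m.t_c - m.t_vec_segment           # 1 - f
    return {
        "sequential": sequential,
        "unvectorizable": unvectorizable,
        "vectorizable": vectorizable,
        "data_move": data_move,
    }


def estimate(m: Measurements, n: float):
    """Return (estimated time with new SIMD, speedup over the C-code)."""
    c = derive_components(m)
    total = (c["sequential"] + c["unvectorizable"]
             + c["vectorizable"] / n + c["data_move"])
    return total, m.t_c / total


if __name__ == "__main__":
    # Made-up timings in arbitrary units, for illustration only.
    sample = Measurements(t_c=100.0, t_pseudo=108.0,
                          t_cripple=24.0, t_vec_segment=90.0)
    print(derive_components(sample))
    print(estimate(sample, n=8.0))
```

Only the term (f-O) / n depends on the unavailable hardware, which is why the method needs an independent estimate of n.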

Performance Projection and Verification

• Steps for estimating the speedup factor (n) of the new SIMD instructions
• Step 1: The assembly code is examined for each new SIMD instruction
• [Figure: assembly code examined for PMADDWD, including the explicit data-move instructions]

Performance Projection and Verification

• Step 2: Estimate the execution latency of the assembly code
  - The execution latency of each assembly instruction is specified in the architecture manual
  - This yields the estimated speedup factor (n)
• Step 3: Repeat the above steps for the other new SIMD instructions
  - This gives the respective speedup of each new SIMD instruction
• Step 4: Calculate the weighted average speedup
  - The weights follow the number of occurrences of each new SIMD instruction in the application
  - The time spent on all the new SIMD instructions can then be estimated as (f-O) / n
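Steps 2 to 4 amount to a latency ratio per instruction followed by an occurrence-weighted average. The sketch below assumes a plain arithmetic weighted mean and uses made-up latencies and counts; the function names are hypothetical and none of the numbers come from the paper or from any architecture manual.

```python
def instruction_speedup(c_latencies, simd_latency: float) -> float:
    """Speedup of one new SIMD instruction over its equivalent C assembly.

    c_latencies: latencies (cycles) of the assembly instructions that
    implement the equivalent C code; simd_latency: assumed latency of the
    new SIMD instruction.
    """
    if simd_latency <= 0:
        raise ValueError("SIMD latency must be positive")
    return sum(c_latencies) / simd_latency


def weighted_average_speedup(speedups: dict, counts: dict) -> float:
    """Average per-instruction speedups, weighted by occurrence counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no SIMD instruction occurrences")
    return sum(speedups[name] * counts[name] for name in counts) / total


if __name__ == "__main__":
    # Placeholder latencies and occurrence counts, for illustration only.
    speedups = {
        "PMADDWD": instruction_speedup([1, 1, 4, 4, 1, 1], simd_latency=3),
        "PADDD": instruction_speedup([1, 1], simd_latency=1),
    }
    counts = {"PMADDWD": 30, "PADDD": 10}
    print(weighted_average_speedup(speedups, counts))
```

Other weightings (for example by time rather than by count) are possible; the slide specifies only that occurrence counts are used.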

IDCT Case Study Results

• Estimated speedup factor (n) for the new SIMD instructions
  - n = (equivalent C code) / (new SIMD computation instruction) = 8.09

IDCT Case Study Results (cont.)

• IDCT performance measurement and projection
• [Table: projected execution time = Sequential + Unvectorizable + New MMX + Data moves]

Experimental Results and Evaluation

• Overall speedup is close to 1.5 when the new SIMD instructions deliver a 2x performance improvement
• Overall speedup exceeds 2.5 when the new SIMD instructions deliver a 10x improvement
• [Figure: overall speedup and execution time]

Experimental Results and Evaluation (cont.)

• Overall speedup drops from 2.9 to 2.7 with 30% more data-move overhead
• Overall speedup rises from 2.9 to 3.1 if the data-move overhead can be reduced by 30%
• [Figure: execution time and overall speedup]
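The two sensitivity studies (varying the SIMD improvement factor, then varying the data-move overhead) can be mimicked with a short sweep. The component fractions below are placeholders chosen for illustration, normalized so that the C-code time is 1; they are not the paper's measured values, and `overall_speedup` is a hypothetical helper.

```python
def overall_speedup(fixed: float, vectorizable: float, data_move: float,
                    n: float, d_scale: float = 1.0) -> float:
    """Overall speedup relative to a C-code time of fixed + vectorizable.

    fixed: sequential plus unvectorizable time, (1 - f) + O
    vectorizable: time replaced by new SIMD instructions, f - O
    data_move: added data-move time D; d_scale scales it up or down
    n: assumed speedup of the new SIMD instructions
    """
    baseline = fixed + vectorizable
    projected = fixed + vectorizable / n + data_move * d_scale
    return baseline / projected


if __name__ == "__main__":
    # Placeholder fractions; only the shape of the trend is meaningful.
    FIXED, VEC, DMOVE = 0.16, 0.84, 0.08

    print("SIMD improvement sweep:")
    for n in (2, 4, 6, 8, 10):
        print(f"  n={n:>2}: {overall_speedup(FIXED, VEC, DMOVE, n):.2f}")

    print("Data-move overhead sweep at n=8:")
    for scale in (0.7, 1.0, 1.3):
        s = overall_speedup(FIXED, VEC, DMOVE, 8, d_scale=scale)
        print(f"  D x {scale}: {s:.2f}")
```

Because the fixed and data-move terms do not shrink with n, the curve flattens as n grows, which matches the qualitative trend on the slides.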

Conclusions

• Presents a performance estimation method for new media instructions
• The method is based on characterizing media workloads through benchmarking and measurement on existing systems
• No cycle-accurate simulator is required
• Given a range of performance improvements for the new media instructions, the proposed method can estimate a range of overall speedups