Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.



Outline
- Fritts: compiler studies.
- Lv: compiler studies.
- Memory system optimizations.

Basic Characteristics
- Comparison of operation frequencies with SPEC
  - (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
  - Lower frequency of memory and floating-point operations
  - More arithmetic operations
  - Larger variation in memory usage
- Basic block statistics
  - Average of 5.5 operations per basic block
  - Need global scheduling techniques to extract ILP
- Static branch prediction
  - Average of 89.5% static branch prediction accuracy on training input
  - Average of 85.9% static branch prediction accuracy on evaluation input
- Data types and sizes
  - Nearly 70% of all instructions require only 8- or 16-bit data types

Breakdown of Data Types by Media Type

Memory Statistics
- Working set size
  - Cache regression: cache sizes 1 KB to 4 MB
  - Assumed line size of 64 bytes
  - Measured read and write miss ratios
- Spatial locality
  - Cache regression: line sizes 8 to 1024 bytes
  - Assumed cache size of 64 KB
  - Measured read and write miss ratios
- Memory results
  - Data memory: 32 KB working set and 60.8% spatial locality (up to 128 bytes)
  - Instruction memory: 8 KB working set and 84.8% spatial locality (up to 256 bytes)

Data Spatial Locality

Multimedia Looping Characteristics
- Highly loop-centric
  - Nearly 95% of execution time spent within the two innermost loop levels
- Large number of iterations
  - Significant processing regularity
  - About 10 iterations per loop on average
- Path ratio indicates intra-loop complexity
  - Computed as the ratio of the average number of instructions executed per loop invocation to the total number of instructions in the loop
  - Average path ratio of 78%
  - Indicates greater control complexity than expected

Average Iterations per Loop and Path Ratio
- Average number of loop iterations
- Average path ratio

Instruction Level Parallelism
- Instruction-level parallelism
  - Base model: single issue using classical optimizations only
  - Parallel model: 8-issue
- Explores only parallel scheduling performance
  - Assumes an ideal processor model
  - No performance penalties from branches, cache misses, etc.

Workload Evaluation Conclusions
- Operation characteristics
  - More arithmetic operations; less memory and floating-point usage
  - Large variation in memory usage
  - (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
- Good static branch prediction
  - Multimedia: 10-15% average miss ratio
  - General-purpose: 20-30% average miss ratio
  - Similar basic block sizes (5 instructions per basic block)
- Primarily small data types (8 or 16 bits)
  - Nearly 70% of instructions require 16-bit or smaller data types
  - Significant opportunity for subword parallelism or narrower datapaths
- Memory
  - Typically small data and instruction working set sizes
  - High data and instruction spatial locality
- Loop-centric
  - Majority of execution time spent in the two innermost loops
  - Average of 10 iterations per loop invocation
  - Path ratio indicates greater control complexity than expected

Architecture Evaluation

- Determine fundamental architecture style
  - Statically scheduled => Very Long Instruction Word (VLIW)
    - Allows wider issue
    - Simple hardware => potentially higher frequencies
  - Dynamically scheduled => Superscalar
    - Allows decoupled data memory accesses
    - Effective at reducing penalties from stalls
- Examine a variety of architecture parameters
  - Fundamental architecture style
  - Instruction fetch architecture
  - High frequency effects
  - Cache memory hierarchy
- Related work
  - [Lee98] "Media Architecture: General Purpose vs. Multiple Application-Specific Programmable Processors," DAC-35.
  - [PChang91] "Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors," MICRO-24.
  - [ZWu99] "Architecture Evaluation of Multi-Cluster Wide-Issue Video Signal Processors," Ph.D. Thesis, Princeton University.
  - [DZucker95] "A comparison of hardware prefetching techniques for multimedia benchmarks," Technical Report CSL-TR, Stanford University, 1995.

Fundamental Architecture Evaluation
- Fundamental architecture evaluation included:
  - Static vs. dynamic scheduling
  - Issue width
- Focused on non-memory-limited applications
  - Determine the impact of datapath features independent of memory
  - Assume memory techniques can solve the memory bottleneck
- Architecture model
  - 8-issue processor
  - Operation latencies targeted for 500 MHz to 1 GHz
  - 64 integer and floating-point registers
  - Pipeline: 1 fetch, 2 decode, 1 write-back, variable execute stages
  - 32 KB direct-mapped L1 data cache with 64-byte lines
  - 16 KB direct-mapped L1 instruction cache with 256-byte lines
  - 256 KB 4-way set-associative on-chip L2 cache
  - 4:1 processor-to-external-bus frequency ratio

Static versus Dynamic Scheduling
- Static versus dynamic scheduling for various compiler methods
- Result of increasing issue width for the given architecture and compiler methods

Instruction Fetch Architecture
- Aggressive versus conservative fetch methods
- Comparison of dynamic branch prediction schemes

Experimental Configuration - Single-issue processor
- SimpleScalar sim-outorder
- Single-issue configuration
- RISC

Experimental Configuration - Benchmarks
- Selected from different areas of MediaBench
- Additional real-world applications

Baseline benchmark characteristics
- Measured on the single-issue processor
- Execution time closely related to dynamic instruction count

VLIW vs. Single Issue
- Static code size
- Dynamic operation count
- Execution speed
- Basic block size

Static Code Size Results

Static Code Size Analysis
- Similar static code size
- On average, the TM1300 requires 17% more space

Dynamic Operation Count Results

Dynamic Operation Count Analysis
- Dynamic instruction counts are similar for the two types of processors
- On average, the TM1300 needs 20% more operations
- The effect of the ISA difference on execution time is small

Execution Speed Results

Execution Speed Analysis
- The TM1300 executes all benchmarks faster than the single-issue processor
- On average, the speedup is 3.4x
  - Partly the result of the wide-issue capability
  - Partly the result of other architecture features

Unoptimized Basic Block Size Results

Unoptimized Basic Block Size Analysis
- The TriMedia compiler produces code with larger basic blocks
- On average, a basic block on the TM1300 is twice as large

Exploiting Special Features
- Methods
  - Using custom instructions
  - Loop transformations
- Metrics
  - Execution speed
  - Memory access count
  - Basic block size
  - Operation-level parallelism

Execution Time Results

Execution Time Analysis
- 1.5x average speedup
- Data-transfer-intensive, floating-point-intensive, and table-lookup-intensive applications show less speedup

Memory Access Count Results

Memory Access Count Analysis
- Reduced average memory access count
- Memory access can be a bottleneck (MPEG)

Optimized Basic Block Size

Optimized Basic Block Size Analysis
- A significant change in basic block size yields a performance gain (Region)

Operation Level Parallelism Results

Operation Level Parallelism Analysis
- Operations per cycle (OPC) close to 2
- Memory access can be a bottleneck
- Possible improvements: wider bus, super-word parallelism

Overall Performance Change Results

Overall Performance Change Analysis
- The TM1300 exhibits a significant performance gain over the single-issue processor
- 5x speedup on average
- 10x speedup in the best case

What type of memory system?
- Cache:
  - Size, number of sets, block size.
- On-chip main memory:
  - Amount, type, banking, network to PEs.
- Off-chip main memory:
  - Type, organization.

Memory system optimizations
- Strictly software:
  - Effectively using the cache and partitioned memory.
- Hardware + software:
  - Scratch-pad memories.
  - Custom memory hierarchies.

Taxonomy of memory optimizations (Wolf/Kandemir)
- Data vs. code.
- Array/buffer vs. non-array.
- Cache/scratch pad vs. main memory.
- Code size vs. data size.
- Program vs. process.
- Languages.

Software performance analysis
- Worst-case execution time (WCET) analysis (Li/Malik):
  - Find the longest path through the CDFG.
  - Can use annotations of branch probabilities.
  - Can be mapped onto cache lines.
  - Difficult in practice---must analyze optimized code.
- Trace-driven analysis:
  - Well understood.
  - Requires code and input vectors.

Software energy/power analysis
- Analytical models of caches (Su/Despain, Kamble/Ghose, etc.):
  - Decoding, memory core, I/O path, etc.
- System-level models (Li/Henkel).
- Power simulators (Vijaykrishnan et al., Brooks et al.).

Power-optimizing transformations
- Kandemir et al.:
  - Most energy is consumed by the memory system, not the CPU core.
  - Performance-oriented optimizations reduce memory system energy but increase datapath energy consumption.
  - Larger caches increase cache energy consumption but reduce overall memory system energy.

Scratch pad memories
- Explicitly managed local memory.
- Panda et al. used a static management scheme.
  - Data structures assigned to off-chip memory or scratch pad at compile time.
  - Put scalars in the scratch pad, arrays in main memory.
- May want to manage the scratch pad at run time.

Reconfigurable caches
- Use the compiler to determine the best cache configuration for various program regions.
  - Must be able to quickly reconfigure the cache.
  - Must be able to identify where program behavior changes.

Software methods for cache placement
- McFarling analyzed inter-function dependencies.
- Tomiyama and Yasuura used ILP.
- Li and Wolf used a process-level model.
- Kirovski et al. use profiling information plus a graph model.
- Dwyer/Fernando use bit vectors to construct bounds in instruction caches.
- Parameswaran and Henkel use heuristics.

Addressing optimizations
- Addressing can be expensive:
  - 55% of DSP56000 instructions performed addressing operations in MediaBench.
- Utilize specialized addressing registers, pre/post-increment/decrement, etc.
  - Place variables in the proper order in memory so that simpler operations can be used to calculate the next address from the previous one.

Hardware methods for cache optimization
- Kirk and Strosnider divided the cache into sections and allocated timing-critical code to its own section.