Architectural Enhancements for Efficient Operand Transport in Multimedia Systems ECE7102 Class Presentation Date: 2006. 4. 13 Hongkyu Kim

Presentation transcript:

2/40 Overview Introduction Characterization and modeling of operand usage and transport Dynamic execution technique exploiting regular operand transport patterns in multimedia –Instruction cluster mapping on the inter-ALU network for general-purpose domain –Dynamic SIMDization for application-specific domain Summary

3/40 Interconnect Complexity Exponential increase of chip capacity → More devices; Exponential decrease of feature size → Interconnect limitation. J.D. Meindl, "Interconnect Opportunities for Gigascale Integration," IEEE MICRO, vol. 23, no. 4, pp. 28-35, May/June 2003.

4/40 Interconnect Bottleneck Disparity between wire delay and gate delay [Figure: relative delay versus process technology node (nm); source: ITRS 2002 Documents]

5/40 Problem Statement High-performance interconnect –Interconnect organizations –Interconnect technologies Why are architectural responses limited? –Compatibility with old ISAs: sequentially specified operations; restricted register file-based operand namespace –ILP mechanisms: operand bypass network, register renaming, and instruction scheduling; poorly scaling broadcast buses

6/40 Research Objective and Approach Objective Reduce latency of operand transport for multimedia –Development of dynamic execution techniques –Development of low-cost operand bypass networks Approach summary

7/40 Overview Introduction Characterization and modeling of operand usage and transport Dynamic execution technique exploiting regular operand transport patterns in multimedia –Instruction cluster mapping on the inter-ALU network for general-purpose domain –Dynamic SIMDization for application-specific domain Summary

8/40 Motivation and Approach Motivation –Shift of microarchitectural design focus: operand computation → operand communication –Recognizing and understanding operand usage and transport properties → efficiently controlling operand traffic Approach summary –Operand usage characteristics: how often operands are used → examine temporal property; where operands are used → examine spatial property –Operand transport properties: what accounts for the majority of communication needs → explore the impact of architectural techniques on operand transport

9/40 Operand Usage Analysis General terms –Operands: values in registers, memory locations, or memory addresses –Operand transport: buffering and delivery of operands to FUs Operands' temporal characteristics –Which instructions consume operands after they are produced –Metrics: degree of use, age, lifetime Operands' spatial characteristics –From/to which FU operands are moved in the execution model –Metrics: degree of functionality, transport pattern
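
The slide names the temporal metrics but does not define them. The sketch below is one plausible reading, not the author's tool: it assumes degree of use is the number of reads of a value before its register is overwritten, age is the instruction distance from production to first use, and lifetime is the distance from production to last use. The `Inst` record and the trace format are hypothetical.

```python
from collections import namedtuple

# Hypothetical trace record: destination register (or None) and source registers.
Inst = namedtuple("Inst", ["dest", "srcs"])

def operand_metrics(trace):
    """Return one dict per produced value with assumed temporal metrics."""
    live = {}    # register name -> index into `values` of its current value
    values = []  # one record per produced value, in production order
    for i, inst in enumerate(trace):
        # Sources are read before the destination is written.
        for reg in inst.srcs:
            if reg in live:
                values[live[reg]]["uses"].append(i)
        if inst.dest is not None:
            live[inst.dest] = len(values)
            values.append({"reg": inst.dest, "born": i, "uses": []})
    out = []
    for v in values:
        uses = v["uses"]
        out.append({
            "reg": v["reg"],
            "degree_of_use": len(uses),                       # assumed definition
            "age": uses[0] - v["born"] if uses else None,       # assumed definition
            "lifetime": uses[-1] - v["born"] if uses else None,  # assumed definition
        })
    return out

trace = [
    Inst("r1", ()),            # 0: produces r1
    Inst("r2", ("r1",)),       # 1: reads r1
    Inst("r3", ("r1", "r2")),  # 2: reads r1 and r2
    Inst("r1", ("r3",)),       # 3: reads r3, overwrites r1
]
metrics = operand_metrics(trace)
```

Under these assumed definitions, the first `r1` value has degree of use 2, age 1, and lifetime 2, while the second `r1` value is never read in this short trace.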

10/40 Operand Transport Analysis Operand transport model

11/40 Preliminary Results Operand usage properties (MediaBench average) [Figure: distribution charts with bins 0, 1, 2, 3, >3; 1, 2, 3~5, >5; 1, 2, 3~5, 6~10, >10; and 0, 1 (same), 1 (different), >1] H. Kim, D. Wills, and L. Wills, "Empirical analysis of operand usage and transport in multimedia applications," Proc. of the International Workshop on System-on-Chip for Real-Time Applications, pp , July

12/40 Preliminary Results (cntd.) Operand transport pattern (MediaBench average): integer → integer 43.0%; integer → branch 14.9%; integer → ld/st 13.6%; ld/st → integer 13.8%; ld/st → ld/st 6.6%; others 8.1%

13/40 Preliminary Results (cntd.) Effective architectural techniques on operand transport –Storage hierarchy: local buffering –Dedicated transport network –Lifetime detection: compile-time/run-time –Smart instruction steering

14/40 Overview Introduction Characterization and modeling of operand usage and transport Dynamic execution technique exploiting regular operand transport patterns in multimedia –Instruction cluster mapping on the inter-ALU network for general-purpose domain –Dynamic SIMDization for application-specific domain Summary

15/40 Motivation and Approach Motivation: multimedia applications –Operand movement is highly regular –Most operands are short-lived, transient operands → Develop a dynamic execution technique exploiting regular operand distribution patterns and local properties Approach summary –Instruction clustering: dynamic instruction grouping –Recognition of regular operand transport pattern –Efficient execution unit: reduce transport latency

16/40 Related Work Solutions for multimedia processing –Multimedia-specific ISA extensions: exploit data-level parallelism at subword level. General-purpose domain: Intel's MMX and SSE, AMD's 3DNow!, Sun's VIS, IBM's AltiVec. Application-specific signal processing domain: Analog Devices' TigerSHARC, TriMedia –Vectorization and retargeting: manual assembly coding; hand-optimization: in-lined assembly code, library routines; automatic vectorization: compiler/retargeting technology

17/40 Related Work (cntd.) Solutions for reducing operand transport complexity –Communication-aware execution: network-connected tile architecture: RAW, GPA; transport triggered architecture: MOVE –Resource partitioning: clustered architectures. Heterogeneous: decoupled architecture. Commercial: DEC Alpha 21264. Academia: Multicluster, Palacharla's, PEWs, ILDP, CTCP –Dynamic optimizations: fill unit: reform instructions in H/W, and cache them; small-scale dependence collapsing: combine dependences among multiple instructions → macro instruction

18/40 Related Research Landscape

19/40 Research Methodology

20/40 Dynamic Instruction Clustering Instruction cluster –A connected subgraph of instructions joined by local operands –Dataflow graph → Dependence edge classification → Instruction grouping Dependence edge types –External: produced/consumed by previous/next blocks –Non-clusterable: operands from/to memory –Local: produced and consumed within the same block
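
The grouping step on this slide can be sketched in software as edge classification followed by connected-component search. This is an illustrative model only, not the hardware mechanism from the talk. It assumes a source with no producer inside the block is external, any edge touching a load or store is non-clusterable, and everything else is local; live-out detection is left out. The `Inst` record and function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction record for one basic block.
Inst = namedtuple("Inst", ["dest", "srcs", "is_mem"])

def classify_edges(block):
    """Return (producer index or None, consumer index, kind) for every source operand."""
    last_writer = {}
    edges = []
    for i, inst in enumerate(block):
        for reg in inst.srcs:
            p = last_writer.get(reg)
            if p is None:
                kind = "external"          # produced by an earlier block
            elif block[p].is_mem or inst.is_mem:
                kind = "non-clusterable"   # operand from/to memory
            else:
                kind = "local"             # produced and consumed in this block
            edges.append((p, i, kind))
        if inst.dest is not None:
            last_writer[inst.dest] = i
    return edges

def form_clusters(block):
    """Group instructions connected by local edges (union-find); drop singletons."""
    parent = list(range(len(block)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p, c, kind in classify_edges(block):
        if kind == "local":
            parent[find(p)] = find(c)
    groups = {}
    for i in range(len(block)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)

block = [
    Inst("r1", ("r10",), True),        # 0: load
    Inst("r2", ("r1", "r11"), False),  # 1: multiply
    Inst("r3", ("r2", "r12"), False),  # 2: add
    Inst("r4", ("r2", "r13"), False),  # 3: subtract
    Inst(None, ("r3", "r10"), True),   # 4: store
    Inst("r5", ("r14", "r15"), False), # 5: unrelated add
]
clusters = form_clusters(block)
kinds = [k for _, _, k in classify_edges(block)]
```

In this example only instructions 1, 2, and 3 are joined by local operands, so they form the single cluster; the load and store stay outside because their edges are non-clusterable.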

21/40 Instruction Clustering Example Color conversion block in JPEG encoder

22/40 Overview Introduction Characterization and modeling of operand usage and transport Dynamic execution technique exploiting regular operand transport patterns in multimedia –Instruction cluster mapping on the inter-ALU network for general-purpose domain –Dynamic SIMDization for application-specific domain Summary

23/40 Implementation Example - I Raw cluster execution on inter-ALU network –Focus on intermediate, short-lived operands. Local operands: inter-ALU dedicated bypass network. Others: traditional global bypass network –Organization: instruction cluster formation; cluster queue and scheduling; cluster execution: inter-ALU network. H. Kim, D. Wills, and L. Wills, "Reducing operand communication overhead using instruction clustering for multimedia applications," Proc. of 7th International Symposium on Multimedia, December

24/40 Cluster Queue and Scheduling Organization of cluster queue –Single entry per cluster (2D) –Ready flags for local operands are always set –Issue pointer for each entry, in-order issue
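
A rough behavioral sketch of one cluster-queue entry follows. It is an assumption-laden model, not the design from the talk: local operands need no wakeup because in-order issue within the entry guarantees their producer has already issued, so each instruction waits only on its non-local source tags. The class and method names are hypothetical.

```python
class ClusterEntry:
    """One cluster-queue entry: instructions in program order plus an issue pointer."""

    def __init__(self, insts):
        # insts: list of (name, set of non-local source tags to wait for).
        # Local sources are omitted: their ready flags are treated as always set.
        self.insts = insts
        self.issue_ptr = 0

    def done(self):
        return self.issue_ptr == len(self.insts)

    def try_issue(self, ready_tags):
        """Issue the instruction at the pointer if its non-local sources are ready."""
        if self.done():
            return None
        name, waits = self.insts[self.issue_ptr]
        if waits <= ready_tags:   # subset test: every awaited tag is ready
            self.issue_ptr += 1
            return name
        return None               # in-order: younger instructions must wait too

entry = ClusterEntry([
    ("mul", {"r1", "r11"}),
    ("add", {"r12"}),
    ("sub", set()),
])
ready = {"r1", "r11"}
issued = [entry.try_issue(ready), entry.try_issue(ready)]
ready.add("r12")
issued += [entry.try_issue(ready), entry.try_issue(ready)]
```

Here `sub` has no outstanding non-local sources, yet it cannot issue until `add` has, which is the in-order constraint the slide describes.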

25/40 Cluster Execution Unit Cluster mapping on inter-ALU network –Local operands: dedicated bypass network –Others: traditional global bypass network

26/40 Experimental Setup Simulation environment –SimpleScalar sim-outorder simulator –MediaBench application programs Processor configurations
Configuration | 8-way | 16-way
Queues | 24 instruction queue, 8 cluster queue, 16 load/store queue | 48 instruction queue, 16 cluster queue, 32 load/store queue
FU resources | 4 integer ALUs, 1 (4x4) network ALU, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports | 8 integer ALUs, 2 (4x4) network ALUs, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports
Operand bypass (latency) | Local (0), pass-through (1), Global (1) | Local (0), pass-through (1), Global (max 3)

27/40 Experimental Result Dynamic instruction coverage

28/40 Experimental Result (cntd.) Operand transport types [Figure: two charts of operand transport type shares: 29.5% / 11.0% / 59.5% and 31.5% / 10.6% / 57.8%]

29/40 Experimental Result (cntd.) IPC speedup

30/40 Summary Summary of approach –Dynamically group dependent instructions into clusters –Store regular operand transport patterns –Execute them on the inter-ALU network, where intermediate values are propagated among ALUs without using global buses Summary of results (MediaBench average) –Dynamic instruction coverage –Shortest transport rate –IPC speedup [Results table residue: 256 entry cluster cache; 30%, 32%; 8-way, 16-way; 16.2%, 35.2%]

31/40 Overview Introduction Characterization and modeling of operand usage and transport Dynamic execution technique exploiting regular operand transport patterns in multimedia –Instruction cluster mapping on the inter-ALU network for general-purpose domain –Dynamic SIMDization for application-specific domain Summary

32/40 Implementation Example - II Data parallel execution using dynamic SIMDization –Observation (image processing applications): operand movement within a loop iteration is highly regular; a small number of inner loops covers most of the execution time –Focus on regular operand transport pattern between iterations of the innermost loop. Stride prediction: break loop-carried dependences → data-parallel execution. Operand lifetime detection → operand traffic control –Organization: instruction cluster formation; SIMD instruction queue and scheduling; SIMD PE array

33/40 Dynamic Instruction Clustering External dependence edge types –External-input: serving only as input –External-output: serving only as output –External-updated: serving as both input and output Parallel and non-parallel region detection –p-cluster: producing no external-updated output and not having unpredicted external-updated input –np-cluster
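
The p-cluster rule on this slide can be written down directly. The sketch below takes the rule literally and treats registers as plain names. The way live-in and live-out sets are obtained is not specified in the slides, so they are simply given as inputs here, and all function and variable names are hypothetical.

```python
def classify_external(live_in, live_out):
    """Label each external register of a loop body by how the body uses it."""
    kinds = {}
    for reg in live_in | live_out:
        if reg in live_in and reg in live_out:
            kinds[reg] = "external-updated"   # both input and output
        elif reg in live_in:
            kinds[reg] = "external-input"     # input only
        else:
            kinds[reg] = "external-output"    # output only
    return kinds

def is_p_cluster(cluster_inputs, cluster_outputs, kinds, predicted):
    """True if the cluster produces no external-updated output and has no
    unpredicted external-updated input."""
    updated = {r for r, k in kinds.items() if k == "external-updated"}
    if cluster_outputs & updated:
        return False
    return not ((cluster_inputs & updated) - predicted)

# Hypothetical loop body: `i` and `acc` are carried across iterations.
kinds = classify_external(live_in={"i", "acc", "coef"},
                          live_out={"i", "acc", "out"})

addr_cluster_ok = is_p_cluster({"i", "coef"}, {"out"}, kinds, predicted={"i"})
addr_cluster_unpredicted = is_p_cluster({"i", "coef"}, {"out"}, kinds, predicted=set())
acc_cluster = is_p_cluster({"acc"}, {"acc"}, kinds, predicted={"i"})
```

With `i` predicted, the cluster that reads `i` and `coef` qualifies as a p-cluster; the accumulator cluster does not, because it writes an external-updated register.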

34/40 Instruction Clustering Example Image convolution code in TI’s IMGLIB

35/40 SIMD Execution Unit Cluster scheduling on SIMD PE array

36/40 SIMD Execution Unit (cntd.) Operand transport model

37/40 Summary of Approach Dynamic parallelization –Detect regular operand transport pattern on external-updated operands –Compute stride → predict external-updated values Optimizing operand transport –Identify the lifetime of operands –Remove needless communication → localize transport Execute the clusters on 1-D mesh SIMD PE array
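
The stride step can be illustrated with a small predictor. This is a minimal sketch under stated assumptions, not the mechanism from the talk: it assumes a stride counts as regular only after three consistent observations, and that each PE in the array receives the predicted input for one future iteration. Function names and the three-observation threshold are hypothetical.

```python
def detect_stride(history):
    """Return the constant stride of `history`, or None if it is not regular."""
    if len(history) < 3:          # assumed confidence threshold
        return None
    stride = history[1] - history[0]
    for prev, cur in zip(history[1:], history[2:]):
        if cur - prev != stride:
            return None
    return stride

def predict_inputs(history, num_pes):
    """Predict the external-updated value seen by each of the next `num_pes` iterations."""
    stride = detect_stride(history)
    if stride is None:
        return None               # fall back to sequential execution
    last = history[-1]
    return [last + stride * (k + 1) for k in range(num_pes)]

regular = predict_inputs([0, 4, 8], num_pes=4)
irregular = predict_inputs([1, 2, 4], num_pes=4)
```

Once each iteration has its own predicted input, the loop-carried dependence on that operand no longer serializes the iterations, which is what allows them to be spread across the PE array.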

38/40 Overview Introduction Characterization and modeling of operand usage and transport Dynamic execution technique exploiting regular operand transport patterns in multimedia –Instruction cluster mapping on the inter-ALU network for general-purpose domain –Dynamic SIMDization for application-specific domain Summary

39/40 Summary Characterization and modeling of operand usage and transport –Examine the operand usage properties –Explore the impact of architectural techniques on operand transport Development of a dynamic execution technique –Instruction clustering –Recognition of regular operand transport pattern –Efficient execution unit

40/40 Thank you. Any questions?