Architectural Enhancements for Efficient Operand Transport in Multimedia Systems ECE7102 Class Presentation Date: 2006. 4. 13 Hongkyu Kim hongkim@ece.gatech.edu

2/40 Overview
Introduction
Characterization and modeling of operand usage and transport
Dynamic execution technique exploiting regular operand transport patterns in multimedia
  –Instruction cluster mapping on the inter-ALU network for general-purpose domain
  –Dynamic SIMDization for application-specific domain
Summary

3/40 Interconnect Complexity
Exponential increase of chip capacity → More devices
Exponential decrease of feature size → Interconnect limitation
J.D. Meindl, Interconnect Opportunities for Gigascale Integration, IEEE MICRO, vol. 23, no. 4, pp. 28-35, May/June 2003.

4/40 Interconnect Bottleneck
[Figure: Relative Delay (log scale) versus Process Technology Node (nm), showing the disparity between wire delay and gate delay]
ITRS 2002 Documents, http://public.itrs.net/Files/2002Update/Home.pdf.

5/40 Problem Statement
High-performance interconnect
  –Interconnect organizations
  –Interconnect technologies
Why are architectural responses limited?
  –Compatibility with old ISAs
      Sequentially-specified operations
      Restricted register file-based operand namespace
  –ILP mechanisms
      Operand bypass network, register renaming, and instruction scheduling
      Poorly scaling broadcast buses

6/40 Research Objective and Approach
Objective: Reduce latency of operand transport for multimedia
  –Development of dynamic execution techniques
  –Development of low-cost operand bypass networks
Approach summary

8/40 Motivation and Approach
Motivation
  –Shift of microarchitectural design focus
      Operand computation → Operand communication
  –Recognizing and understanding operand usage and transport properties → Efficiently controlling operand traffic
Approach summary
  –Operand usage characteristics
      How often operands are used → Examine temporal property
      Where operands are used → Examine spatial property
  –Operand transport properties
      What accounts for the majority of communication needs → Explore the impact of architectural techniques on the operand transport

9/40 Operand Usage Analysis
General terms
  –Operands: values in registers, memory locations, or memory addresses
  –Operand transport: buffering and delivery of operands to FUs
Operands’ temporal characteristics
  –Which instructions consume operands after they are produced
  –Metrics: Degree of use, Age, Lifetime
Operands’ spatial characteristics
  –From/to which FUs operands are moved in the execution model
  –Metrics: Degree of functionality, Transport pattern
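The temporal metrics named on this slide can be illustrated with a small trace walk. This is only a sketch: the slide does not define the metrics precisely, so the code assumes that degree of use is the number of consumers of a produced value, age is the distance (in dynamic instructions) from the producer to the first consumer, and lifetime is the distance to the last consumer. The `Inst` record and the `operand_usage` helper are hypothetical names, not part of the presented work.

```python
from collections import namedtuple

# Hypothetical trace record: one destination register (or None) and source registers.
Inst = namedtuple("Inst", ["dest", "srcs"])


def operand_usage(trace):
    """Return (producer_index, degree_of_use, age, lifetime) for each produced value."""
    current = {}   # register name -> index in `values` of the value it holds now
    values = []    # one [producer_index, [consumer indices]] entry per produced value
    for i, inst in enumerate(trace):
        # Sources are read before the destination is written.
        for reg in inst.srcs:
            if reg in current:
                values[current[reg]][1].append(i)
        if inst.dest is not None:
            current[inst.dest] = len(values)
            values.append([i, []])

    stats = []
    for producer, uses in values:
        degree = len(uses)                             # assumed: number of consumers
        age = uses[0] - producer if uses else None     # assumed: distance to first use
        lifetime = uses[-1] - producer if uses else 0  # assumed: distance to last use
        stats.append((producer, degree, age, lifetime))
    return stats


trace = [
    Inst("r1", ()),            # 0: produce r1
    Inst("r2", ("r1",)),       # 1: first use of r1
    Inst("r3", ("r1", "r2")),  # 2: last use of r1, only use of r2
    Inst("r1", ("r3",)),       # 3: r1 is overwritten with a new value
    Inst(None, ("r1",)),       # 4: e.g. a branch reading the new r1
]
stats = operand_usage(trace)
```

Tracking values rather than register names keeps the overwritten `r1` in instruction 3 separate from the value produced by instruction 0, which is what a per-operand histogram would need.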

10/40 Operand Transport Analysis Operand transport model

11/40 Preliminary Results
Operand usage properties (MediaBench average)
[Figures: distributions of the operand usage metrics, with bins 0, 1, 2, 3, >3; 1, 2, 3~5, >5; 1, 2, 3~5, 6~10, >10; and 0, 1 (same), 1 (different), >1]
H. Kim, D. Wills, and L. Wills, “Empirical analysis of operand usage and transport in multimedia applications,” Proc. of the International Workshop on System-on-Chip for Real-Time Applications, pp. 168-171, July 2004.

12/40 Preliminary Results (cntd.)
Operand transport pattern (MediaBench average)
  integer → integer: 43.0%
  integer → branch: 14.9%
  integer → ld/st: 13.6%
  ld/st → integer: 13.8%
  ld/st → ld/st: 6.6%
  Others: 8.1%

13/40 Preliminary Results (cntd.)
Architectural techniques effective for operand transport
  –Storage hierarchy: local buffering
  –Dedicated transport network
  –Lifetime detection: compile-time/run-time
  –Smart instruction steering

15/40 Motivation and Approach
Motivation
Multimedia applications
  –Operand movement is highly regular
  –Most operands are short-lived, transient operands
→ Develop a dynamic execution technique exploiting regular operand distribution patterns and local properties
Approach summary
  –Instruction clustering: dynamic instruction grouping
  –Recognition of regular operand transport pattern
  –Efficient execution unit: reduce transport latency

16/40 Related Work
Solutions for multimedia processing
  –Multimedia-specific ISA extensions
      Exploit data-level parallelism at subword level
      General-purpose domain: Intel’s MMX and SSE, AMD’s 3DNow!, Sun’s VIS, IBM’s AltiVec
      Application-specific signal processing domain: Analog Devices’ TigerSHARC, TriMedia
  –Vectorization and retargeting
      Manual assembly coding
      Hand-optimization: in-lined assembly code, library routines
      Automatic vectorization: compiler/retargeting technology

17/40 Related Work (cntd.)
Solutions for reducing operand transport complexity
  –Communication-aware execution
      Network-connected tile architecture: RAW, GPA
      Transport triggered architecture: MOVE
  –Resource partitioning: Clustered architectures
      Heterogeneous: decoupled architecture
      Commercial: DEC Alpha 21264
      Academia: Multicluster, Palacharla’s, PEWs, ILDP, CTCP
  –Dynamic optimizations
      Fill unit: reform instructions in H/W, and cache them
      Small-scale dependence collapsing: combine dependences among multiple instructions → macro instruction

18/40 Related Research Landscape

19/40 Research Methodology

20/40 Dynamic Instruction Clustering
Instruction Cluster
  –A connected subgraph of instructions joined by local operands
  –Dataflow graph → Dependence edge classification → Instruction grouping
Dependence edge types
  –External: produced/consumed by previous/next blocks
  –Non-clusterable: operands from/to memory
  –Local: produced and consumed within the same block
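The flow from dataflow graph to edge classification to grouping can be sketched as follows. This is a minimal illustration under stated assumptions, not the presented hardware mechanism: an edge is treated as non-clusterable when its producer or consumer is a load or store, only external inputs (not external outputs) are modeled, and clusters are formed with a union-find over local edges. The tuple layout `(op, dest, srcs)`, the `MEM_OPS` set, and the function name are hypothetical.

```python
MEM_OPS = {"ld", "st"}  # assumed opcodes that move operands from/to memory


def classify_and_cluster(block):
    """block: list of (op, dest, srcs). Returns (edges, clusters)."""
    parent = list(range(len(block)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    last_writer = {}  # register -> index of the instruction that last wrote it
    edges = []        # (producer or None, consumer, register, kind)
    for i, (op, dest, srcs) in enumerate(block):
        for reg in srcs:
            if reg not in last_writer:
                # Value comes from a previous block.
                edges.append((None, i, reg, "external"))
                continue
            p = last_writer[reg]
            if op in MEM_OPS or block[p][0] in MEM_OPS:
                kind = "non-clusterable"
            else:
                kind = "local"
                parent[find(p)] = find(i)  # join producer and consumer
            edges.append((p, i, reg, kind))
        if dest is not None:
            last_writer[dest] = i

    groups = {}
    for i in range(len(block)):
        groups.setdefault(find(i), []).append(i)
    # Keep only groups that actually connect two or more instructions.
    clusters = sorted(g for g in groups.values() if len(g) > 1)
    return edges, clusters


block = [
    ("ld", "r1", ("r10",)),        # 0
    ("mul", "r2", ("r1", "r11")),  # 1
    ("add", "r3", ("r2", "r12")),  # 2
    ("shr", "r4", ("r3",)),        # 3
    ("st", None, ("r4", "r13")),   # 4
]
edges, clusters = classify_and_cluster(block)
```

In this example the `mul`, `add`, and `shr` instructions are joined by two local edges and form one cluster, while the load and store stay outside it.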

21/40 Instruction Clustering Example Color conversion block in JPEG encoder

23/40 Implementation Example - I
Raw cluster execution on inter-ALU network
  –Focus on intermediate, short-lived operands
      Local operands: inter-ALU dedicated bypass network
      Others: traditional global bypass network
  –Organization
      Instruction cluster formation
      Cluster queue and scheduling
      Cluster execution: inter-ALU network
H. Kim, D. Wills, and L. Wills, “Reducing operand communication overhead using instruction clustering for multimedia applications,” Proc. of 7th International Symposium on Multimedia, December 2005.

24/40 Cluster Queue and Scheduling
Organization of cluster queue
  –Single entry per cluster (2D)
  –Ready flags for local operands are always set
  –Issue pointer for each entry, in-order issue
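A cluster queue entry with these properties might behave roughly like the sketch below. It assumes one issue attempt per call and models only external operands, since local operands are treated as always ready. The class name, the `(name, external_srcs)` layout, and the `ready_regs` argument are hypothetical.

```python
class ClusterEntry:
    """One queue entry holding a whole cluster, issued in program order."""

    def __init__(self, insts):
        # insts: list of (name, set of external source registers).
        # Local operands are not listed: their ready flags are always set.
        self.insts = insts
        self.issue_ptr = 0  # index of the next instruction to issue

    def done(self):
        return self.issue_ptr == len(self.insts)

    def try_issue(self, ready_regs):
        """Issue the instruction at the issue pointer if its external operands are ready."""
        if self.done():
            return None
        name, external_srcs = self.insts[self.issue_ptr]
        if all(reg in ready_regs for reg in external_srcs):
            self.issue_ptr += 1
            return name
        return None  # in-order: a stalled instruction blocks the ones behind it


entry = ClusterEntry([("mul", {"r1"}), ("add", {"r12"}), ("shr", set())])
issued = [
    entry.try_issue({"r1"}),         # mul issues
    entry.try_issue({"r1"}),         # add stalls on r12; shr cannot bypass it
    entry.try_issue({"r1", "r12"}),  # add issues
    entry.try_issue({"r1", "r12"}),  # shr issues (no external operands)
]
```

Keeping a single pointer per entry avoids per-instruction wakeup logic inside a cluster, at the cost of stalling later instructions behind one that waits on an external operand.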

25/40 Cluster Execution Unit
Cluster mapping on inter-ALU network
  –Local operands: dedicated bypass network
  –Others: traditional global bypass network

26/40 Experimental Setup
Simulation Environment
  –SimpleScalar sim-outorder simulator
  –MediaBench application programs
Processor Configurations
  8-way
      Queues: 24 instruction queue, 8 cluster queue, 16 load/store queue
      FU resources: 4 integer ALUs, 1 (4x4) network ALU, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports
      Operand bypass (latency): Local (0), pass-through (1), Global (1)
  16-way
      Queues: 48 instruction queue, 16 cluster queue, 32 load/store queue
      FU resources: 8 integer ALUs, 2 (4x4) network ALUs, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports
      Operand bypass (latency): Local (0), pass-through (1), Global (max 3)

27/40 Experimental Result Dynamic instruction coverage

28/40 Experimental Result (cntd.) Operand transport types 29.5% 11.0% 59.5% 31.5% 10.6% 57.8%

29/40 Experimental Result (cntd.) IPC speedup

30/40 Summary
Summary of approach
  –Dynamically group dependent instructions into clusters
  –Store regular operand transport patterns
  –Execute them on inter-ALU network where intermediate values are propagated among ALUs without using global buses
Summary of results (MediaBench average)
  –Dynamic instruction coverage: 57.3% @ 256 entry cluster cache
  –Shortest transport rate: 30% to 32% across the 8-way and 16-way configurations
  –IPC speedup: 16.2% to 35.2% across the 8-way and 16-way configurations

31/40 Overview
Introduction
Characterization and modeling of operand usage and transport
Dynamic execution technique exploiting regular operand transport patterns in multimedia
  –Instruction cluster mapping on the inter-ALU network for general-purpose domain
  –Dynamic SIMDization for application-specific domain
Summary

32/40 Implementation Example - II
Data parallel execution using dynamic SIMDization
  –Observation (Image processing applications)
      Operand movement within a loop iteration is highly regular
      Small number of inner loops covers most of execution time
  –Focus on regular operand transport pattern between iterations of innermost loop
      Stride prediction: break loop-carried dependences → data-parallel execution
      Operand lifetime detection → operand traffic control
  –Organization
      Instruction cluster formation
      SIMD instruction queue and scheduling
      SIMD PE array

33/40 Dynamic Instruction Clustering
External dependence edge types
  –External-input: serving only as input
  –External-output: serving only as output
  –External-updated: serving as both input and output
Parallel and non-parallel region detection
  –p-cluster: producing no external-updated output and not having unpredicted external-updated input
  –np-cluster
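The three external edge types can be illustrated by classifying the registers of one loop-body iteration. This sketch covers only the edge-type classification, not the p-cluster/np-cluster decision. It assumes that a register read before it is written in the iteration is an input, that a register both read first and later written is external-updated, and that a caller-supplied `live_out` set says which written registers are needed outside the cluster. All names here are hypothetical.

```python
def classify_external(body, live_out):
    """body: list of (dest, srcs) for one iteration. Returns (inputs, outputs, updated)."""
    read_first = set()  # registers read before any write in this iteration
    written = set()     # registers written in this iteration
    for dest, srcs in body:
        for reg in srcs:
            if reg not in written:
                read_first.add(reg)
        if dest is not None:
            written.add(dest)

    updated = read_first & written               # both input and output
    inputs = read_first - written                # only input
    outputs = (written - read_first) & live_out  # only output, and needed outside
    return inputs, outputs, updated


# One iteration of a multiply-accumulate loop (register names are illustrative).
body = [
    ("r2", ("r1",)),       # load a sample through pointer r1
    ("r3", ("r2", "r5")),  # multiply by a coefficient held in r5
    ("r4", ("r4", "r3")),  # accumulate into r4
    ("r1", ("r1",)),       # advance the pointer
]
inputs, outputs, updated = classify_external(body, live_out={"r3", "r4"})
```

Here the pointer `r1` and the accumulator `r4` come out as external-updated, which are the loop-carried values that a stride predictor would have to cover before iterations can run in parallel.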

34/40 Instruction Clustering Example Image convolution code in TI’s IMGLIB

35/40 SIMD Execution Unit Cluster scheduling on SIMD PE array

36/40 SIMD Execution Unit (cntd.) Operand transport model

37/40 Summary of Approach
Dynamic parallelization
  –Detect regular operand transport pattern on external-updated operands
  –Compute stride → predict external-updated values
Optimizing operand transport
  –Identify the lifetime of operands
  –Remove needless communication → localize transport
Execute the clusters on 1-D mesh SIMD PE array
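The stride step can be sketched as a simple constant-difference check over the values an external-updated operand took in past iterations. This is an assumption-laden illustration: it requires at least three observations, accepts only a constant stride, and hands one predicted value to each of `lanes` processing elements. The function names and the `lanes` parameter are hypothetical.

```python
def detect_stride(history):
    """Return the constant stride of `history`, or None if there is none."""
    if len(history) < 3:  # assumed minimum number of observations
        return None
    diffs = [b - a for a, b in zip(history, history[1:])]
    return diffs[0] if all(d == diffs[0] for d in diffs) else None


def predict_values(history, lanes):
    """Predict the next `lanes` values so that iterations can start in parallel."""
    stride = detect_stride(history)
    if stride is None:
        return None  # unpredicted: the iterations stay sequential
    return [history[-1] + stride * k for k in range(1, lanes + 1)]


pointer_history = [100, 104, 108]  # e.g. a pointer advanced by 4 per iteration
predicted = predict_values(pointer_history, lanes=4)
irregular = predict_values([1, 2, 4], lanes=4)
```

With the predicted values available up front, each PE can begin its iteration without waiting for the previous one to produce the updated operand.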

39/40 Summary
Characterization and modeling of operand usage and transport
  –Examine the operand usage properties
  –Explore the impact of architectural techniques on the operand transport
Development of a dynamic execution technique
  –Instruction clustering
  –Recognition of regular operand transport pattern
  –Efficient execution unit

40/40 Thank you. Any questions?
