Published by Lorraine Booker. Modified over 9 years ago.
1
Architectural Enhancements for Efficient Operand Transport in Multimedia Systems
ECE7102 Class Presentation
Date: 2006. 4. 13
Hongkyu Kim
hongkim@ece.gatech.edu
2
2/40 Overview
Introduction
Characterization and modeling of operand usage and transport
Dynamic execution technique exploiting regular operand transport patterns in multimedia
–Instruction cluster mapping on the inter-ALU network for general-purpose domain
–Dynamic SIMDization for application-specific domain
Summary
3
3/40 Interconnect Complexity
Exponential increase of chip capacity → More devices
Exponential decrease of feature size → Interconnect limitation
J.D. Meindl, Interconnect Opportunities for Gigascale Integration, IEEE MICRO, vol. 23, no. 4, pp. 28-35, May/June 2003.
4
4/40 Interconnect Bottleneck
[Figure: relative delay versus process technology node (nm), showing the disparity between wire delay and gate delay.]
ITRS 2002 Documents, http://public.itrs.net/Files/2002Update/Home.pdf.
5
5/40 Problem Statement
High-performance interconnect
–Interconnect organizations
–Interconnect technologies
Why are architectural responses limited?
–Compatibility with old ISAs
  Sequentially-specified operations
  Restricted register file-based operand namespace
–ILP mechanisms
  Operand bypass network, register renaming, and instruction scheduling
  Poorly scaling broadcast buses
6
6/40 Research Objective and Approach
Objective
Reduce latency of operand transport for multimedia
–Development of dynamic execution techniques
–Development of low-cost operand bypass networks
Approach summary
7
7/40 Overview
Introduction
Characterization and modeling of operand usage and transport
Dynamic execution technique exploiting regular operand transport patterns in multimedia
–Instruction cluster mapping on the inter-ALU network for general-purpose domain
–Dynamic SIMDization for application-specific domain
Summary
8
8/40 Motivation and Approach
Motivation
–Shift of microarchitectural design focus
  Operand computation → Operand communication
–Recognizing and understanding operand usage and transport properties
  Efficiently controlling operand traffic
Approach summary
–Operand usage characteristics
  How often operands are used → Examine temporal property
  Where operands are used → Examine spatial property
–Operand transport properties
  What accounts for the majority of communication needs
  Explore the impact of architectural techniques on the operand transport
9
9/40 Operand Usage Analysis
General terms
–Operands: values in registers, memory locations, or memory addresses
–Operand transport: buffering and delivery of operands to functional units (FUs)
Operands’ temporal characteristics
–Which instructions consume operands after they are produced
–Metrics: Degree of use, Age, Lifetime
Operands’ spatial characteristics
–From/to which FU operands are moved in the execution model
–Metrics: Degree of functionality, Transport pattern
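The slide names the temporal metrics without defining them. The sketch below shows one way they could be measured on an instruction trace. The definitions are assumptions for illustration, not taken from the slides: degree of use is the number of reads of a value before its register is overwritten, age is the distance in instructions from the producer to the first consumer, and lifetime is the distance to the last consumer. The trace format and the names `ValueStats` and `analyze` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ValueStats:
    producer: int                              # index of the producing instruction
    uses: list = field(default_factory=list)   # indices of consuming instructions

    @property
    def degree_of_use(self):
        # Assumed definition: number of reads before the register is overwritten.
        return len(self.uses)

    @property
    def age(self):
        # Assumed definition: distance to the first consumer (None if never read).
        return self.uses[0] - self.producer if self.uses else None

    @property
    def lifetime(self):
        # Assumed definition: distance to the last consumer (None if never read).
        return self.uses[-1] - self.producer if self.uses else None


def analyze(trace):
    """trace: list of (dest_register or None, [source registers])."""
    live = {}    # register -> ValueStats of the value it currently holds
    done = []    # values whose register has been overwritten
    for i, (dest, srcs) in enumerate(trace):
        for reg in srcs:
            if reg in live:
                live[reg].uses.append(i)
        if dest is not None:
            if dest in live:
                done.append(live[dest])
            live[dest] = ValueStats(producer=i)
    done.extend(live.values())
    return sorted(done, key=lambda v: v.producer)


trace = [
    ("r1", []),            # 0: r1 = constant
    ("r2", ["r1"]),        # 1: r2 = f(r1)
    ("r3", ["r1", "r2"]),  # 2: r3 = f(r1, r2)
    ("r1", ["r3"]),        # 3: r1 = f(r3), overwrites the first r1
]
stats = analyze(trace)
```

In this trace the first value of `r1` is read twice, so it has degree of use 2, age 1, and lifetime 2 under the assumed definitions.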
10
10/40 Operand Transport Analysis
Operand transport model
11
11/40 Preliminary Results
Operand usage properties (MediaBench average)
[Figure: four distributions with bins 0, 1, 2, 3, >3; 1, 2, 3~5, >5; 1, 2, 3~5, 6~10, >10; and 0, 1 (same), 1 (different), >1.]
H. Kim, D. Wills, and L. Wills, “Empirical analysis of operand usage and transport in multimedia applications,” Proc. of the International Workshop on System-on-Chip for Real-Time Applications, pp. 168-171, July 2004.
12
12/40 Preliminary Results (cntd.)
Operand transport pattern (MediaBench average)
integer → integer: 43.0%
integer → branch: 14.9%
integer → ld/st: 13.6%
ld/st → integer: 13.8%
ld/st → ld/st: 6.6%
Others: 8.1%
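A breakdown like the one on this slide could be produced by tallying producer and consumer functional-unit classes over all dependence edges. The following is a minimal sketch under that assumption. The edge representation and the function name `transport_pattern` are hypothetical, and the sample edges are made up for illustration, not MediaBench data.

```python
from collections import Counter


def transport_pattern(edges):
    """edges: iterable of (producer_fu_class, consumer_fu_class) pairs.

    Returns the share of each pair as a percentage of all edges.
    """
    counts = Counter(edges)
    total = sum(counts.values())
    return {pair: 100.0 * n / total for pair, n in counts.items()}


# Illustrative input only: three integer-to-integer edges and one load-to-integer edge.
sample = [("integer", "integer")] * 3 + [("ld/st", "integer")]
shares = transport_pattern(sample)
```

As a consistency check on the slide itself, the six reported shares (43.0, 14.9, 13.6, 13.8, 6.6, 8.1) add up to 100.0%.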
13
13/40 Preliminary Results (cntd.)
Architectural techniques effective for operand transport
–Storage hierarchy: local buffering
–Dedicated transport network
–Lifetime detection: compile-time/run-time
–Smart instruction steering
14
14/40 Overview
Introduction
Characterization and modeling of operand usage and transport
Dynamic execution technique exploiting regular operand transport patterns in multimedia
–Instruction cluster mapping on the inter-ALU network for general-purpose domain
–Dynamic SIMDization for application-specific domain
Summary
15
15/40 Motivation and Approach
Motivation
Multimedia applications
–Operand movement is highly regular
–Most operands are short-lived, transient operands
Develop a dynamic execution technique exploiting regular operand distribution patterns and local properties
Approach summary
–Instruction clustering: dynamic instruction grouping
–Recognition of regular operand transport pattern
–Efficient execution unit: reduce transport latency
16
16/40 Related Work
Solutions for multimedia processing
–Multimedia-specific ISA extensions
  Exploit data-level parallelism at subword level
  General-purpose domain: Intel’s MMX and SSE, AMD’s 3DNow!, Sun’s VIS, IBM’s AltiVec
  Application-specific signal processing domain: Analog Devices’ TigerSHARC, TriMedia
–Vectorization and retargeting
  Manual assembly coding
  Hand-optimization: in-lined assembly code, library routines
  Automatic vectorization: compiler/retargeting technology
17
17/40 Related Work (cntd.)
Solutions for reducing operand transport complexity
–Communication-aware execution
  Network-connected tile architecture: RAW, GPA
  Transport triggered architecture: MOVE
–Resource partitioning: Clustered architectures
  Heterogeneous: decoupled architecture
  Commercial: DEC Alpha 21264
  Academia: Multicluster, Palacharla’s, PEWs, ILDP, CTCP
–Dynamic optimizations
  Fill unit: reform instructions in H/W, and cache them
  Small-scale dependence collapsing: combine dependences among multiple instructions → macro instruction
18
18/40 Related Research Landscape
19
19/40 Research Methodology
20
20/40 Dynamic Instruction Clustering
Instruction Cluster
–A connected subgraph of instructions joined by local operands
–Dataflow graph → Dependence edge classification → Instruction grouping
Dependence edge types
–External: produced/consumed by previous/next blocks
–Non-clusterable: operands from/to memory
–Local: produced and consumed within the same block
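The clustering step described above can be sketched as edge classification followed by a union-find over local edges. This is an illustrative sketch, not the hardware mechanism from the talk. The block representation, the `mem` flag, and the rule that any edge touching a load or store is non-clusterable are assumptions, and only external inputs are tracked (external outputs would need liveness information from later blocks).

```python
def classify_and_cluster(block):
    """block: list of dicts with keys 'dest', 'srcs', and 'mem' (True for ld/st).

    Returns (edges, clusters). Each edge is (producer or None, consumer, kind).
    Clusters are groups of instruction indices connected by local edges.
    """
    last_writer = {}                   # register -> index of its latest producer
    edges = []
    parent = list(range(len(block)))   # union-find forest over instructions

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, ins in enumerate(block):
        for reg in ins["srcs"]:
            p = last_writer.get(reg)
            if p is None:
                # Produced outside this block.
                edges.append((None, i, "external"))
            elif block[p]["mem"] or ins["mem"]:
                # Assumption: operands from/to memory instructions stay unclustered.
                edges.append((p, i, "non-clusterable"))
            else:
                edges.append((p, i, "local"))
                parent[find(p)] = find(i)
        if ins["dest"] is not None:
            last_writer[ins["dest"]] = i

    groups = {}
    for i in range(len(block)):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) > 1]
    return edges, clusters


block = [
    {"dest": "r1", "srcs": ["r9"], "mem": True},        # 0: load  r1 <- [r9]
    {"dest": "r2", "srcs": ["r1", "r8"], "mem": False}, # 1: mul   r2 <- r1, r8
    {"dest": "r3", "srcs": ["r2", "r8"], "mem": False}, # 2: add   r3 <- r2, r8
    {"dest": "r4", "srcs": ["r3"], "mem": False},       # 3: shift r4 <- r3
    {"dest": None, "srcs": ["r4", "r9"], "mem": True},  # 4: store [r9] <- r4
]
edges, clusters = classify_and_cluster(block)
```

Here the multiply, add, and shift form one cluster joined by two local edges, while the load and store stay outside it.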
21
21/40 Instruction Clustering Example
Color conversion block in JPEG encoder
22
22/40 Overview
Introduction
Characterization and modeling of operand usage and transport
Dynamic execution technique exploiting regular operand transport patterns in multimedia
–Instruction cluster mapping on the inter-ALU network for general-purpose domain
–Dynamic SIMDization for application-specific domain
Summary
23
23/40 Implementation Example - I
Raw cluster execution on inter-ALU network
–Focus on intermediate, short-lived operands
  Local operands: inter-ALU dedicated bypass network
  Others: traditional global bypass network
–Organization
  Instruction cluster formation
  Cluster queue and scheduling
  Cluster execution: inter-ALU network
H. Kim, D. Wills, and L. Wills, “Reducing operand communication overhead using instruction clustering for multimedia applications,” Proc. of 7th International Symposium on Multimedia, December 2005.
24
24/40 Cluster Queue and Scheduling
Organization of cluster queue
–Single entry per cluster (2D)
–Ready flags for local operands are always set
–Issue pointer for each entry, in-order issue
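One way to read this organization is that each cluster occupies a single queue entry, only non-local sources need wakeup, and an issue pointer walks the entry in program order. The sketch below models that reading in software. The class name `ClusterEntry`, the tag-based readiness check, and the limit of one issue per call are assumptions, not details from the slides.

```python
class ClusterEntry:
    """One cluster-queue entry holding a whole cluster (hypothetical model)."""

    def __init__(self, external_srcs):
        # external_srcs[k]: non-local source tags instruction k waits on.
        # Local operands are not listed because their ready flags are always set.
        self.external_srcs = [frozenset(s) for s in external_srcs]
        self.issue_ptr = 0   # next instruction to issue, in program order

    def done(self):
        return self.issue_ptr == len(self.external_srcs)

    def try_issue(self, ready):
        """Issue the instruction at the issue pointer if its non-local sources
        are in `ready`. Returns its index, or None if the entry stalls.
        Assumption: at most one instruction issues per call."""
        if self.done():
            return None
        if not self.external_srcs[self.issue_ptr] <= set(ready):
            return None   # in-order issue: younger instructions cannot bypass
        idx = self.issue_ptr
        self.issue_ptr += 1
        return idx


entry = ClusterEntry([{"r8"}, set(), {"r9"}])
log = [
    entry.try_issue(set()),          # stalls on r8
    entry.try_issue({"r8"}),         # issues instruction 0
    entry.try_issue({"r8"}),         # instruction 1 has only local sources
    entry.try_issue({"r8"}),         # stalls on r9
    entry.try_issue({"r8", "r9"}),   # issues instruction 2
]
```

Because instruction 1 depends only on local operands, it issues as soon as the pointer reaches it, which is the effect of the always-set ready flags.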
25
25/40 Cluster Execution Unit
Cluster mapping on inter-ALU network
–Local operands: dedicated bypass network
–Others: traditional global bypass network
26
26/40 Experimental Setup
Simulation environment
–SimpleScalar sim-outorder simulator
–MediaBench application programs
Processor configurations
8-way
–Queues: 24-entry instruction queue, 8-entry cluster queue, 16-entry load/store queue
–FU resources: 4 integer ALUs, 1 (4x4) network ALU, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports
–Operand bypass (latency): Local (0), pass-through (1), Global (1)
16-way
–Queues: 48-entry instruction queue, 16-entry cluster queue, 32-entry load/store queue
–FU resources: 8 integer ALUs, 2 (4x4) network ALUs, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports
–Operand bypass (latency): Local (0), pass-through (1), Global (max 3)
27
27/40 Experimental Result
Dynamic instruction coverage
28
28/40 Experimental Result (cntd.)
Operand transport types
[Figure: operand transport type breakdown, two sets of values: 29.5% / 11.0% / 59.5% and 31.5% / 10.6% / 57.8%.]
29
29/40 Experimental Result (cntd.)
IPC speedup
30
30/40 Summary
Summary of approach
–Dynamically group dependent instructions into clusters
–Store regular operand transport patterns
–Execute them on inter-ALU network where intermediate values are propagated among ALUs without using global buses
Summary of results (MediaBench average)
–Dynamic instruction coverage: 57.3% @ 256-entry cluster cache
–Shortest transport rate: 30% and 32% across the 8-way and 16-way configurations
–IPC speedup: 16.2% and 35.2% across the 8-way and 16-way configurations
31
31/40 Overview
Introduction
Characterization and modeling of operand usage and transport
Dynamic execution technique exploiting regular operand transport patterns in multimedia
–Instruction cluster mapping on the inter-ALU network for general-purpose domain
–Dynamic SIMDization for application-specific domain
Summary
32
32/40 Implementation Example - II
Data parallel execution using dynamic SIMDization
–Observation (Image processing applications)
  Operand movement within a loop iteration is highly regular
  A small number of inner loops covers most of the execution time
–Focus on regular operand transport pattern between iterations of innermost loop
  Stride prediction: break loop-carried dependences → data-parallel execution
  Operand lifetime detection → operand traffic control
–Organization
  Instruction cluster formation
  SIMD instruction queue and scheduling
  SIMD PE array
33
33/40 Dynamic Instruction Clustering
External dependence edge types
–External-input: serving only as input
–External-output: serving only as output
–External-updated: serving as both input and output
Parallel and non-parallel region detection
–p-cluster: producing no external-updated output and not having unpredicted external-updated input
–np-cluster
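The edge types and the p-cluster condition above translate directly into set operations. The sketch below is one possible reading. It assumes external edges are classified per loop body from the registers it reads and writes, and that a set of already-predicted registers is available. The function names and the register names are hypothetical.

```python
def classify_external(reads, writes):
    """Classify a loop body's external registers by how they are used."""
    reads, writes = set(reads), set(writes)
    return {
        "external-input": reads - writes,     # serving only as input
        "external-output": writes - reads,    # serving only as output
        "external-updated": reads & writes,   # serving as both input and output
    }


def is_p_cluster(cluster_inputs, cluster_outputs, updated, predicted):
    """A cluster is parallel (p-cluster) if it produces no external-updated
    output and every external-updated input it reads has been predicted."""
    produces_updated = set(cluster_outputs) & set(updated)
    unpredicted = (set(cluster_inputs) & set(updated)) - set(predicted)
    return not produces_updated and not unpredicted


# Illustrative loop body: index i and accumulator acc are read and written.
kinds = classify_external(reads={"i", "base", "acc"}, writes={"i", "acc", "out"})
updated = kinds["external-updated"]

addr_cluster_ok = is_p_cluster({"i", "base"}, {"out"}, updated, predicted={"i"})
addr_cluster_unpredicted = is_p_cluster({"i", "base"}, {"out"}, updated, predicted=set())
acc_cluster = is_p_cluster({"acc"}, {"acc"}, updated, predicted={"acc"})
```

Under these assumptions the cluster that only reads a predicted index is a p-cluster, while the accumulator cluster is an np-cluster because it produces an external-updated value.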
34
34/40 Instruction Clustering Example
Image convolution code in TI’s IMGLIB
35
35/40 SIMD Execution Unit
Cluster scheduling on SIMD PE array
36
36/40 SIMD Execution Unit (cntd.)
Operand transport model
37
37/40 Summary of Approach
Dynamic parallelization
–Detect regular operand transport pattern on external-updated edges
–Compute stride → predict external-updated values
Optimizing operand transport
–Identify the lifetime of operands
–Remove needless communication → localize transport
Execute the clusters on 1-D mesh SIMD PE array
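The stride step can be illustrated with a small software model. It observes an external-updated value over a few iterations, confirms a constant stride, and then hands each PE its predicted input so the iterations no longer wait on one another. The three-sample confirmation rule and the function names are assumptions for illustration, not details from the slides.

```python
def detect_stride(history):
    """Return the constant stride of the observed values, or None if the
    sequence is too short or irregular.
    Assumption: at least three samples are required to confirm a stride."""
    if len(history) < 3:
        return None
    stride = history[1] - history[0]
    for prev, cur in zip(history[1:], history[2:]):
        if cur - prev != stride:
            return None
    return stride


def predict_inputs(history, num_pes):
    """Predict the external-updated input for the next num_pes iterations,
    one per PE, or return None if no regular stride was detected."""
    stride = detect_stride(history)
    if stride is None:
        return None
    last = history[-1]
    return [last + stride * (k + 1) for k in range(num_pes)]


# A pointer advancing by 4 bytes per iteration, dispatched to a 4-PE array.
per_pe_inputs = predict_inputs([0, 4, 8], num_pes=4)
```

With the inputs predicted up front, the four iterations can start on separate PEs at once. An irregular history returns None, which corresponds to falling back to non-parallel execution.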
38
38/40 Overview
Introduction
Characterization and modeling of operand usage and transport
Dynamic execution technique exploiting regular operand transport patterns in multimedia
–Instruction cluster mapping on the inter-ALU network for general-purpose domain
–Dynamic SIMDization for application-specific domain
Summary
39
39/40 Summary
Characterization and modeling of operand usage and transport
–Examine the operand usage properties
–Explore the impact of architectural techniques on the operand transport
Development of a dynamic execution technique
–Instruction clustering
–Recognition of regular operand transport pattern
–Efficient execution unit
40
40/40 Thank you. Any questions?