1
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
University of Utah, February 14th, 2005
2
Overview/Motivation
Wire delays are costly for performance and power.
  Latencies of 30 cycles to reach the ends of a chip.
  50% of dynamic power is in interconnect switching (Magen et al., SLIP '04).
Abundant number of metal layers.
3
Wire Characteristics
Wire resistance and capacitance per unit length are set by wire width and spacing.
Width and spacing therefore determine delay (delay ∝ RC) and bandwidth.
[Table: effect of width and spacing on resistance, capacitance, and bandwidth]
4
Design Space Exploration: Tuning wire width and spacing
[Figure: B wires (d) versus L wires (2d), annotated with resistance, capacitance, and bandwidth]
5
Transmission Lines
Allow extremely low delay.
High implementation complexity and overhead:
  Large width
  Large spacing between wires
  Design of sensing circuit
  Shielding power and ground lines adjacent to each line
Implemented in test CMOS chips.
Not employed in this study.
6
Design Space Exploration: Tuning repeater size and spacing
Traditional wires: large repeaters, optimum spacing.
Power-optimal wires: smaller repeaters, increased spacing.
[Figure labels: Delay, Power]
7
Design Space Exploration
B wires: base case
W wires: bandwidth optimized
P wires: power optimized
PW wires: power and bandwidth optimized
L wires: fast, low bandwidth
8
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
9
Evaluation Platform
Centralized front-end
  I-Cache & D-Cache
  LSQ
  Branch predictor
Clustered back-end
[Figure labels: L1 D cache, cluster]
10
Cache Pipeline
Baseline: effective address transfer (10 cycles), memory dependence resolution (5 cycles), cache access (5 cycles); data returns at cycle 20.
With L wires: effective address transfer (10 cycles), partial memory dependence resolution (3 cycles), cache access (5 cycles), 8-bit transfer (5 cycles); data returns at cycle 14.
[Figure: functional unit, LSQ, and L1 D cache]
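As a quick check of the arithmetic on this slide, the baseline stage latencies can be summed and compared with the L-wire figure. This is only a sketch: the dictionary and function names are illustrative, and the 14-cycle value is taken from the slide as given rather than derived, since the slide does not spell out how the stages overlap.

```python
# Stage latencies in cycles, as listed on the slide for the baseline pipeline.
BASELINE_STAGES = {
    "effective address transfer": 10,
    "memory dependence resolution": 5,
    "cache access": 5,
}

# Data-return cycle with L wires, quoted from the slide (not derived here).
L_WIRE_DATA_RETURN = 14


def baseline_data_return(stages=BASELINE_STAGES):
    """Baseline stages are serialized, so the return cycle is their sum."""
    return sum(stages.values())


def cycles_saved(baseline=None, accelerated=L_WIRE_DATA_RETURN):
    """Cycles saved per load when the L-wire path is used."""
    if baseline is None:
        baseline = baseline_data_return()
    return baseline - accelerated


if __name__ == "__main__":
    print("baseline data return:", baseline_data_return())
    print("cycles saved with L wires:", cycles_saved())
```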
11
L wires: Accelerating cache access
Transmit the least significant bits (LSBs) of the effective address through L wires.
Faster memory disambiguation
  Partial comparison of loads and stores in the LSQ
  Introduces false dependences (< 9%)
Indexing data and tag RAM arrays
  LSBs can prefetch data out of L1$
  Reduces access latency of loads
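The partial comparison above can be sketched as follows. This is an illustration, not the authors' implementation: the function names are hypothetical, the 8-bit width is an assumption borrowed from the "8-bit transfer" on the cache pipeline slide, and addresses are compared as whole values without modeling access size.

```python
def low_bits(addr, n_bits=8):
    """Keep only the n_bits least significant bits of an address."""
    return addr & ((1 << n_bits) - 1)


def may_alias_partial(load_addr, store_addr, n_bits=8):
    """Conservative early check using only the bits sent on L wires.

    True means the load must be treated as dependent on the store until
    the full addresses arrive; False means the two cannot match.
    """
    return low_bits(load_addr, n_bits) == low_bits(store_addr, n_bits)


def classify(load_addr, store_addr, n_bits=8):
    """Label the outcome once the full addresses are known."""
    if not may_alias_partial(load_addr, store_addr, n_bits):
        return "independent"
    if load_addr == store_addr:
        return "true dependence"
    # Low bits match but the full addresses differ.
    return "false dependence"


if __name__ == "__main__":
    print(classify(0x1234, 0x1234))
    print(classify(0x1234, 0x5634))
    print(classify(0x1234, 0x1235))
```

A mismatch in the low bits proves independence early, so the load can proceed before the full address arrives. A match is only a hint, which is where the false dependences on the slide come from.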
12
L wires: Narrow Bit-Width Operands
PowerPC: data bit-width determines FU latency.
Transfer of 10-bit integers on L wires
  Can introduce scheduling difficulties
  A predictor table of saturating counters
  Accuracy of 98%
Reduction in branch mispredict penalty
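A predictor table of saturating counters could look roughly like the sketch below. The slide gives only the idea and the 98% accuracy figure, so everything else here is an assumption: the class name, the 1024-entry table, the 2-bit counters, indexing by instruction PC, and the increment/decrement update rule.

```python
NARROW_BITS = 10  # operand width sent on L wires, from the slide


def is_narrow(value, n_bits=NARROW_BITS):
    """True if a non-negative integer fits in n_bits."""
    return 0 <= value < (1 << n_bits)


class NarrowWidthPredictor:
    """Table of saturating counters indexed by PC (hypothetical design)."""

    def __init__(self, entries=1024, counter_bits=2):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = (self.max_count + 1) // 2
        self.table = [0] * entries

    def _index(self, pc):
        return pc % self.entries

    def predict_narrow(self, pc):
        """Predict that the instruction at pc produces a narrow result."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, result):
        """Train on the actual result: count up if narrow, down if wide."""
        i = self._index(pc)
        if is_narrow(result):
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)


if __name__ == "__main__":
    p = NarrowWidthPredictor()
    for value in (5, 17, 900):
        p.update(0x40, value)
    print("predict narrow:", p.predict_narrow(0x40))
```

Predicting the width ahead of time lets the scheduler plan for the faster L-wire transfer, which is one way to address the scheduling difficulty the slide mentions.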
13
Power Efficient Wires
B wires: base case
PW wires: power and bandwidth optimized
Idea: steer non-critical data through the energy-efficient PW interconnect.
14
PW wires: Power/Bandwidth Efficient
Ready register operands
  Transfer of data at instruction dispatch
  Transfer of input operands to remote register file
  Covered by long dispatch-to-issue latency
Store data
  Could stall commit process
  Delay dependent loads
[Figure: rename and dispatch stage feeding four clusters, each with an IQ, regfile, and FU. Example: the operand is ready at cycle 90 and the consumer instruction is dispatched at cycle 100.]
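The steering idea can be sketched as a small decision function. This is a simplified illustration under assumptions: the function name, the string labels, and the fallback to B wires for everything else are hypothetical, and a real design would also weigh the stall risks listed above for store data.

```python
def choose_wire(kind, ready_cycle=None, dispatch_cycle=None):
    """Pick an interconnect for a transfer (hypothetical policy).

    kind: "register_operand" or "store_data"; anything else uses B wires.
    ready_cycle: cycle at which the operand value became available.
    dispatch_cycle: cycle at which the consumer instruction is dispatched.
    """
    if kind == "store_data":
        # Store data is treated as non-critical in this sketch.
        return "PW"
    if (
        kind == "register_operand"
        and ready_cycle is not None
        and dispatch_cycle is not None
        and ready_cycle <= dispatch_cycle
    ):
        # Operand already ready at dispatch: the slower PW transfer is
        # hidden by the dispatch-to-issue latency.
        return "PW"
    return "B"


if __name__ == "__main__":
    # Slide example: operand ready at cycle 90, consumer dispatched at 100.
    print(choose_wire("register_operand", ready_cycle=90, dispatch_cycle=100))
    print(choose_wire("register_operand", ready_cycle=105, dispatch_cycle=100))
```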
15
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
16
Evaluation Methodology
Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model.
Crossbar interconnects (L, B, and PW wires):
  B wires: 2 cycles
  L wires: 1 cycle
  PW wires: 3 cycles
[Figure labels: L1 D cache, cluster]
17
Heterogeneous Interconnects
Intercluster global interconnect
72 B wires (64 data bits and 8 control bits)
  Repeaters sized and spaced for optimum delay
18 L wires
  Wide wires and large spacing
  Occupies more area
  Low latencies
144 PW wires
  Poor delay
  High bandwidth
  Low power
18
Analytical Model
RC model of the wire:
$C = C_a + W_s C_b + C_c / W_s$
  Term 1 ($C_a$): fringing capacitance
  Term 2 ($W_s C_b$): capacitance between different layers of wires
  Term 3 ($C_c / W_s$): capacitance between wires of the same metal layer
Total power = short-circuit power + switching power + leakage power
19
Evaluation Methodology
Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model.
Ring latencies:
  B wires: 4 cycles
  PW wires: 6 cycles
  L wires: 2 cycles
[Figure labels: I-cache, D-cache, LSQ, cluster, crossbar, ring interconnect]
20
IPC improvements: L wires
L wires improve performance by 4.2% on a four-cluster system and by 7.1% on a sixteen-cluster system.
21
Four Cluster System: ED² Improvements

Link            Relative metal area   IPC    Relative processor energy (10%)   Relative ED² (10%)   Relative ED² (20%)
144 B           1.0                   0.95   100                               100                  100
288 PW          1.0                   0.92   97                                103.4                100.2
288 PW, 36 L    2.0                   0.97   99                                94.4                 93.2
144 B, 36 L     2.0                   0.99   101                               93.3                 94.5
288 B           2.0                   0.98   103                               96.6                 99.2
144 PW, 36 L    1.5                   0.96   97                                95.0                 92.1
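The relative metal area column can be reproduced with a small calculation. The per-wire area factors below are an assumption inferred from the heterogeneous interconnect slide, where 72 B wires, 18 L wires, and 144 PW wires are listed side by side (suggesting one L wire costs about four B wires and one PW wire about half of one). The names are illustrative.

```python
# Area of one wire relative to one B wire (ASSUMPTION inferred from the
# 72 B / 18 L / 144 PW groupings on the heterogeneous interconnect slide).
RELATIVE_AREA = {"B": 1.0, "L": 4.0, "PW": 0.5}

# The baseline link in the table is 144 B wires.
BASELINE_B_WIRES = 144


def relative_metal_area(link):
    """Metal area of a link, normalized to the 144 B-wire baseline.

    link maps a wire type ("B", "L", "PW") to a wire count.
    """
    total = sum(RELATIVE_AREA[kind] * count for kind, count in link.items())
    return total / (BASELINE_B_WIRES * RELATIVE_AREA["B"])


if __name__ == "__main__":
    links = {
        "144 B": {"B": 144},
        "288 PW": {"PW": 288},
        "288 PW, 36 L": {"PW": 288, "L": 36},
        "144 B, 36 L": {"B": 144, "L": 36},
        "288 B": {"B": 288},
        "144 PW, 36 L": {"PW": 144, "L": 36},
    }
    for name, link in links.items():
        print(f"{name}: {relative_metal_area(link):.1f}")
```

Under these factors every row of the area column in the table is matched, which supports the reading that the configurations were chosen to occupy 1.0, 1.5, or 2.0 times the baseline area.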
22
Sixteen Cluster System: ED² Gains

Link            IPC    Relative processor energy (20%)   Relative ED² (20%)
144 B           1.11   100                               100
144 PW, 36 L    1.05   94                                105.3
144 B, 36 L     1.19   102                               88.7
288 B, 36 L     1.22   107                               88.7
288 B           1.18   105                               93.1
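The ED² column can be approximated from the other two columns. The sketch below assumes that ED² is energy times delay squared and that delay scales as 1/IPC (same instruction count and clock rate); both are common conventions but are not stated on the slide. Because the table's inputs are rounded, the results agree only to within a few tenths.

```python
def relative_ed2(energy_pct, ipc, baseline_ipc, baseline_energy_pct=100.0):
    """Relative ED^2 in percent of the baseline.

    Assumes delay is proportional to 1/IPC, so the delay ratio versus the
    baseline is baseline_ipc / ipc.
    """
    delay_ratio = baseline_ipc / ipc
    return 100.0 * (energy_pct / baseline_energy_pct) * delay_ratio ** 2


# (link, IPC, relative processor energy, ED^2 reported on the slide)
SIXTEEN_CLUSTER = [
    ("144 PW, 36 L", 1.05, 94, 105.3),
    ("144 B, 36 L", 1.19, 102, 88.7),
    ("288 B, 36 L", 1.22, 107, 88.7),
    ("288 B", 1.18, 105, 93.1),
]

BASELINE_IPC = 1.11  # the 144 B row

if __name__ == "__main__":
    for link, ipc, energy, reported in SIXTEEN_CLUSTER:
        estimate = relative_ed2(energy, ipc, BASELINE_IPC)
        print(f"{link}: estimated {estimate:.1f}, reported {reported}")
```

The squared delay term is why the 144 B, 36 L configuration comes out ahead even though it uses slightly more energy than the baseline: its IPC gain is weighted twice.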
23
Conclusions
Exposing the wire design space to the architecture: a case for microarchitectural wire management!
A low-latency, low-bandwidth network alone improves performance by up to 7%.
ED² improvements of about 11% compared to a baseline processor with a homogeneous interconnect.
Entails hardware complexity.
24
Future work
3-D wire model for the interconnects
Design of heterogeneous clusters
Interconnects for cache coherence and L2$
25
Questions and Comments?
Thank you!
26
Backup
27
L wires: Accelerating cache access
TLB access for page lookup
  Transmit a few bits of the virtual page number on L wires
  Prefetch data out of L1$ and TLB
18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits)

Wire type   Crossbar delay (cycles)   Ring hop delay (cycles)
PW wires    3                         6
B wires     2                         4
L wires     1                         2
28
Model parameters
Simplescalar-3.0 with separate integer and floating-point queues
32 KB 2-way instruction cache
32 KB 4-way data cache
128-entry 8-way I and D TLB
29
Overview/Motivation
Three wire implementations employed in this study:
B wires: traditional
  Optimal delay
  Huge power consumption
L wires
  Faster than B wires
  Lower bandwidth
PW wires
  Reduced power consumption
  Higher bandwidth compared to B wires
  Increased delay through the wires