Feb 14th 2005, University of Utah. Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
Overview/Motivation: Wire delays are costly for performance and power. Latencies of 30 cycles to reach the ends of a chip. 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04). Abundant number of metal layers.
Wire Characteristics: Wire resistance and capacitance per unit length depend on width and spacing. These determine delay (as delay ∝ RC) and bandwidth. [Table: effect of width and spacing on resistance, capacitance, and bandwidth; entries not recoverable.]
Design Space Exploration: Tuning wire width and spacing. [Figure: B wires and L wires drawn with dimensions d and 2d, annotated with resistance, capacitance, and bandwidth.]
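The bandwidth side of this trade-off can be sketched with simple arithmetic: wider wires with larger spacing have a larger pitch, so fewer of them fit in a fixed metal width. The sketch below is illustrative only. The function name, the track width, and the 4x and 0.5x scale factors are assumptions, chosen so that the counts match the 72 B, 18 L, and 144 PW wires listed on the Heterogeneous Interconnects slide if each class is given the same metal width. The d and 2d dimensions in the figure are not modeled.

```python
def wires_in_track(track_width, wire_width, spacing):
    """Count how many wires fit side by side in a fixed metal width.

    Each wire occupies one pitch (its own width plus one spacing).
    """
    pitch = wire_width + spacing
    return int(track_width // pitch)


# Hypothetical units: a B wire has width 1.0 and spacing 1.0.
TRACK_WIDTH = 144.0  # assumed metal width, chosen to hold 72 B wires

b_wires = wires_in_track(TRACK_WIDTH, 1.0, 1.0)    # baseline pitch
l_wires = wires_in_track(TRACK_WIDTH, 4.0, 4.0)    # assumed 4x width and spacing
pw_wires = wires_in_track(TRACK_WIDTH, 0.5, 0.5)   # assumed 0.5x width and spacing

print(b_wires, l_wires, pw_wires)
```

Under these assumptions, a fourfold increase in pitch cuts the wire count, and hence the bandwidth, by a factor of four.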
Transmission Lines: Allow extremely low delay. High implementation complexity and overhead: large width; large spacing between wires; design of sensing circuit; shielding power and ground lines adjacent to each line. Implemented in test CMOS chips. Not employed in this study.
Design Space Exploration: Tuning repeater size and spacing. Traditional wires: large repeaters, optimum spacing. Power-optimal wires: smaller repeaters, increased spacing. [Figure annotations: delay, power.]
Design Space Exploration: Base case: B wires. Bandwidth optimized: W wires. Power optimized: P wires. Power and B/W optimized: PW wires. Fast, low bandwidth: L wires.
Outline: Overview. Wire Design Space Exploration. Employing L wires for Performance. PW wires: The Power Optimizers. Results. Conclusions.
Evaluation Platform: Centralized front-end (I-Cache & D-Cache, LSQ, branch predictor). Clustered back-end. [Figure: clusters and L1 D cache.]
Cache Pipeline (functional unit, LSQ, L1 D cache). Baseline: effective address transfer (10 cycles), memory dependence resolution (5 cycles), cache access (5 cycles); data return at 20 cycles. With L wires: effective address transfer (10 cycles), partial memory dependence resolution (3 cycles), cache access (5 cycles), 8-bit transfer (5 cycles); data return at 14 cycles.
L wires: Accelerating cache access. Transmit the least significant bits (LSBs) of the effective address through L wires. Faster memory disambiguation: partial comparison of loads and stores in the LSQ; introduces false dependences (< 9%). Indexing data and tag RAM arrays: the LSBs can prefetch data out of L1$; reduces access latency of loads.
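The partial comparison and the false dependences it introduces can be illustrated with a small sketch. The 8-bit width and the function names are assumptions for illustration, and access sizes and alignment are ignored. The actual LSQ logic is not described on the slide.

```python
LSB_BITS = 8  # assumed number of low-order address bits sent on L wires


def low_bits(addr, bits=LSB_BITS):
    """Keep only the low-order bits of an address."""
    return addr & ((1 << bits) - 1)


def classify(load_addr, store_addr, bits=LSB_BITS):
    """Classify a load/store pair using a partial address comparison.

    If the low bits differ, the full addresses must differ, so the load
    can be declared independent early. If the low bits match, the load
    must conservatively wait; the full comparison later reveals whether
    the dependence was true or false.
    """
    if low_bits(load_addr, bits) != low_bits(store_addr, bits):
        return "independent"
    if load_addr == store_addr:
        return "true dependence"
    return "false dependence"


print(classify(0x1234, 0x5678))  # low bytes 0x34 and 0x78 differ
print(classify(0x1234, 0x1234))  # identical addresses
print(classify(0x1234, 0xAB34))  # low bytes match, upper bits differ
```

A mismatch in the low bits is always safe to act on, which is why the partial comparison can only add dependences and never miss a real one.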
L wires: Narrow Bit Width Operands. PowerPC: data bit-width determines FU latency. Transfer of 10-bit integers on L wires. Can introduce scheduling difficulties. A predictor table of saturating counters: accuracy of 98%. Reduction in branch mispredict penalty.
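A predictor table of saturating counters could look roughly like the sketch below. The table size, the 2-bit counter range, the threshold, the PC-modulo indexing, and the treatment of "10-bit integer" as an unsigned value are all assumptions made for illustration. The slide does not give the predictor's actual organization.

```python
def is_narrow(value, bits=10):
    """Assumed definition: a non-negative value that fits in `bits` bits."""
    return 0 <= value < (1 << bits)


class NarrowWidthPredictor:
    """Table of saturating counters indexed by instruction address."""

    def __init__(self, entries=1024, max_count=3, threshold=2):
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        self.counters = [0] * entries

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        """Predict that the instruction at `pc` produces a narrow result."""
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, was_narrow):
        """Move the counter toward the observed outcome, saturating at the ends."""
        i = self._index(pc)
        if was_narrow:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = NarrowWidthPredictor()
for result in (5, 17, 900):  # three narrow results from the same instruction
    predictor.update(0x40, is_narrow(result))
print(predictor.predict(0x40))
```

The hysteresis of the counter means a single wide result does not immediately flip a well-established narrow prediction.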
Power Efficient Wires. Base case: B wires. Power and B/W optimized: PW wires. Idea: steer non-critical data through the energy-efficient PW interconnect.
PW wires: Power/Bandwidth Efficient. Ready register operands: transfer of data at instruction dispatch; transfer of input operands to remote register file; covered by long dispatch-to-issue latency. Store data: could stall commit process; could delay dependent loads. [Figure: Rename & Dispatch feeding four clusters, each with IQ, Regfile, and FU. Example: operand is ready at cycle 90; consumer instruction dispatched at cycle 100.]
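The steering rule for ready register operands can be sketched as a simple check at dispatch time. The function names and the fallback to B wires are assumptions for illustration. The latencies are the crossbar values quoted on the Evaluation Methodology slide (B wires 2 cycles, PW wires 3 cycles).

```python
B_LATENCY = 2   # cycles on B wires (crossbar value from the evaluation slide)
PW_LATENCY = 3  # cycles on PW wires (crossbar value from the evaluation slide)


def choose_wires(operand_ready_cycle, consumer_dispatch_cycle):
    """Pick a wire class for one operand transfer.

    If the operand was already ready when its consumer was dispatched,
    the transfer is treated as non-critical and steered to PW wires.
    Otherwise it uses the faster B wires (assumed default).
    """
    if operand_ready_cycle <= consumer_dispatch_cycle:
        return "PW"
    return "B"


def arrival_cycle(operand_ready_cycle, consumer_dispatch_cycle):
    """Cycle at which the operand reaches the consumer's cluster."""
    if choose_wires(operand_ready_cycle, consumer_dispatch_cycle) == "PW":
        # The transfer starts at dispatch of the consumer.
        return consumer_dispatch_cycle + PW_LATENCY
    # The transfer starts as soon as the operand is produced.
    return operand_ready_cycle + B_LATENCY


# Example from the slide: operand ready at cycle 90, consumer dispatched at 100.
print(choose_wires(90, 100), arrival_cycle(90, 100))
```

In the slide's example the extra PW cycle is harmless as long as the consumer spends at least three cycles between dispatch and issue.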
Evaluation Methodology: Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model. Crossbar interconnects (L, B and PW wires): B wires (2 cycles), L wires (1 cycle), PW wires (3 cycles). [Figure: clusters and L1 D cache.]
Heterogeneous Interconnects: Intercluster global interconnect. 72 B wires (64 data bits and 8 control bits): repeaters sized and spaced for optimum delay. 18 L wires: wide wires and large spacing; occupy more area; low latencies. 144 PW wires: poor delay; high bandwidth; low power.
Analytical Model. RC model of the wire: $C = C_a + W_s C_b + C_c / W_s$, where term 1 ($C_a$) is the fringing capacitance, term 2 ($W_s C_b$) is the capacitance between different layers of wires, and term 3 ($C_c / W_s$) is the capacitance between wires of the same metal layer. Total Power = Short-Circuit Power + Switching Power + Leakage Power.
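The two formulas on this slide can be evaluated directly. The sketch below takes the capacitance expression exactly as written on the slide and treats W_s as a positive scaling parameter, which is an assumption since the slide does not define it. All numeric values are hypothetical and carry no units.

```python
def wire_capacitance(c_a, c_b, c_c, w_s):
    """Capacitance per unit length: C = C_a + W_s * C_b + C_c / W_s.

    c_a: fringing term
    c_b: coefficient of the term for capacitance to other metal layers
    c_c: coefficient of the term for capacitance to same-layer neighbors
    w_s: scaling parameter from the slide (assumed positive)
    """
    if w_s <= 0:
        raise ValueError("w_s must be positive")
    return c_a + w_s * c_b + c_c / w_s


def total_power(short_circuit, switching, leakage):
    """Total power as the sum of the three components on the slide."""
    return short_circuit + switching + leakage


# Hypothetical coefficients, for illustration only.
print(wire_capacitance(c_a=1.0, c_b=2.0, c_c=4.0, w_s=2.0))
print(total_power(short_circuit=0.5, switching=3.0, leakage=1.5))
```

With these inputs the second term grows and the third shrinks as w_s increases, which is the tension the design space exploration is navigating.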
Evaluation Methodology: Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model. Ring latencies: B wires (4 cycles), PW wires (6 cycles), L wires (2 cycles). [Figure: I-Cache, D-cache, LSQ, clusters, crossbar, ring interconnect.]
IPC improvements: L wires. L wires improve performance by 4.2% on a four-cluster system and by 7.1% on a sixteen-cluster system.
Four Cluster System: ED^2 Improvements. [Table: columns are Link, Relative metal area, IPC, Relative processor energy (10%), Relative ED^2 (10%), Relative ED^2 (20%); rows are link configurations combining B, PW, and L wires; values not recoverable.]
Sixteen Cluster System: ED^2 gains. [Table: columns are Link, IPC, Relative Processor Energy (20%), Relative ED^2 (20%); rows are link configurations combining B, PW, and L wires; values not recoverable.]
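The relative ED^2 columns can be related to energy and IPC with a short calculation. The sketch below assumes that ED^2 means energy times delay squared, that delay is inversely proportional to IPC at a fixed clock and instruction count, and that the percentage in parentheses is the share of processor energy spent in the interconnect. The function names and input values are hypothetical and are not results from the talk.

```python
def relative_processor_energy(rel_interconnect_energy, interconnect_fraction):
    """Processor energy relative to the baseline.

    interconnect_fraction: assumed share of baseline processor energy
        spent in the interconnect (for example 0.10 or 0.20).
    rel_interconnect_energy: interconnect energy relative to the baseline.
    The rest of the processor is assumed to be unchanged.
    """
    rest = 1.0 - interconnect_fraction
    return rest + interconnect_fraction * rel_interconnect_energy


def relative_ed2(rel_energy, rel_ipc):
    """Relative energy x delay^2, with relative delay taken as 1 / relative IPC."""
    return rel_energy / (rel_ipc ** 2)


# Hypothetical inputs: interconnect energy halved, IPC unchanged.
energy = relative_processor_energy(rel_interconnect_energy=0.5,
                                   interconnect_fraction=0.20)
print(energy, relative_ed2(energy, rel_ipc=1.0))
```

Because delay enters squared, a small IPC gain can outweigh a comparable energy increase in this metric, and a small IPC loss can cancel an energy saving.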
Conclusions: Exposing the wire design space to the architecture. A case for microarchitectural wire management! A low-latency, low-bandwidth network alone helps improve performance by up to 7%. ED^2 improvements of about 11% compared to a baseline processor with homogeneous interconnect. Entails hardware complexity.
Future work: 3-D wire model for the interconnects. Design of heterogeneous clusters. Interconnects for cache coherence and L2$.
Questions and Comments? Thank you!
Backup
L wires: Accelerating cache access. TLB access for page lookup: transmit a few bits of the virtual page number on L wires; prefetch data out of L1$ and TLB. 18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits).

Wire type   Crossbar delay   Ring hop delay
PW wires    3 cycles         6 cycles
B wires     2 cycles         4 cycles
L wires     1 cycle          2 cycles
Model parameters: Simplescalar-3.0 with separate integer and floating-point queues. 32 KB 2-way instruction cache. 32 KB 4-way data cache. 128-entry 8-way I and D TLB.
Overview/Motivation: Three wire implementations employed in this study. B wires: traditional; optimal delay; huge power consumption. L wires: faster than B wires; lower bandwidth. PW wires: reduced power consumption; higher bandwidth compared to B wires; increased delay through the wires.