Managing Wire Delay in Large CMP Caches
Bradford M. Beckmann and David A. Wood
Multifacet Project, University of Wisconsin-Madison
MICRO /8/04
Overview
Managing wire delay in shared CMP caches
Three techniques extended to CMPs:
1. On-chip Strided Prefetching (not in talk – see paper)
   – Scientific workloads: 10% average reduction
   – Commercial workloads: 3% average reduction
2. Cache Block Migration (e.g. D-NUCA)
   – Block sharing limits average reduction to 3%
   – Depends on a difficult-to-implement smart search
3. On-chip Transmission Lines (e.g. TLC)
   – Reduce runtime by 8% on average
   – Bandwidth contention accounts for 26% of L2 hit latency
Combining techniques
   + Potentially alleviates isolated deficiencies
   – Up to 19% reduction vs. baseline
   – Implementation complexity
Current CMP: IBM Power 5
(Die diagram: 2 CPUs, each with L1 I$ and L1 D$, sharing 3 L2 cache banks)
CMP Trends
(Figure: the reachable distance per cycle shrinks from 2004 technology to 2010 technology; a 2-CPU CMP grows to an 8-CPU CMP, each CPU with its own L1 I$ and L1 D$, sharing a large L2)
Baseline: CMP-SNUCA
(Figure: CPUs 0–7, each with L1 I$ and L1 D$, placed around the shared, banked L2 in a static NUCA layout)
Outline
Global interconnect and CMP trends
Latency Management Techniques
Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
Block Migration: CMP-DNUCA
(Figure: the CMP-SNUCA layout with blocks A and B migrating through the L2 banks toward their requesting CPUs)
On-chip Transmission Lines
Similar to contemporary off-chip communication
Provides a different latency / bandwidth tradeoff
Wires behave more "transmission-line" like as frequency increases
– Utilize transmission line qualities to our advantage
– No repeaters – route directly over large structures
– ~10x lower latency across long distances
Limitations
– Requires thick wires and dielectric spacing
– Increases manufacturing cost
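The latency contrast above can be sketched with a toy first-order model: a repeated RC wire's delay grows linearly with distance at one rate, while a transmission line propagates near the speed of light in the dielectric. The per-mm delay constants below are illustrative assumptions chosen only to reflect the slide's ~10x claim, not measured values.

```python
# Toy latency model: conventional repeated RC wire vs. on-chip transmission
# line. Delay-per-mm figures are assumed for illustration (10x apart, per
# the slide's claim), not taken from the paper.

def repeated_wire_delay_ps(length_mm, delay_per_mm_ps=50.0):
    """Repeated RC wire: delay grows linearly with distance."""
    return length_mm * delay_per_mm_ps

def transmission_line_delay_ps(length_mm, delay_per_mm_ps=5.0):
    """Transmission line: signal travels near the speed of light in the
    dielectric; assumed ~10x faster per mm here."""
    return length_mm * delay_per_mm_ps

for length in (5, 10, 20):  # mm across the die
    rc = repeated_wire_delay_ps(length)
    tl = transmission_line_delay_ps(length)
    print(f"{length} mm: repeated wire {rc:.0f} ps, transmission line {tl:.0f} ps")
```

Under these assumed constants, the gap widens linearly with distance, which is why transmission lines pay off most on the longest cross-chip routes.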
Transmission Lines: CMP-TLC
(Figure: CPUs 0–7, each with L1 I$ and L1 D$, connected to the centrally located L2 banks by transmission-line byte links)
Combination: CMP-Hybrid
(Figure: the CMP-SNUCA layout augmented with transmission-line byte links from the CPUs to the center banks)
Outline
Global interconnect and CMP trends
Latency Management Techniques
Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
Methodology
Full system simulation
– Simics
– Timing model extensions
  Out-of-order processor
  Memory system
Workloads
– Commercial: apache, jbb, oltp, zeus
– Scientific
  Splash: barnes & ocean
  SpecOMP: apsi & fma3d
System Parameters

Memory System
– L1 I & D caches: 64 KB, 2-way, 3 cycles
– Unified L2 cache: 16 MB, 256 x 64 KB, 16-way, 6 cycle bank access
– L1 / L2 cache block size: 64 bytes
– Memory latency: 260 cycles
– Memory bandwidth: 320 GB/s
– Memory size: 4 GB of DRAM
– Outstanding memory requests / CPU: 16

Dynamically Scheduled Processor
– Clock frequency: 10 GHz
– Reorder buffer / scheduler: 128 / 64 entries
– Pipeline width: 4-wide fetch & issue
– Pipeline stages: 30
– Direct branch predictor: 3.5 KB YAGS
– Return address stack: 64 entries
– Indirect branch predictor: 256 entries (cascaded)
Outline
Global interconnect and CMP trends
Latency Management Techniques
Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
CMP-DNUCA: Organization
(Figure: L2 banks grouped into bankclusters – local, intermediate, and center – surrounded by CPUs 0–7)
Hit Distribution: Grayscale Shading
(Figure: CPUs 0–7 around the L2 banks; darker shading indicates a greater % of L2 hits)
CMP-DNUCA: Migration
Migration policy
– Gradual movement
– Increases local hits and reduces distant hits
Migration path: other bankclusters → my center bankcluster → my inter. bankcluster → my local bankcluster
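The gradual-movement policy above can be sketched as a one-step promotion on each hit: a block advances one position along the chain toward the requesting CPU's local bankcluster. The list-based model and function names below are illustrative, not the hardware mechanism.

```python
# Sketch of gradual block migration: on each L2 hit, the block moves one
# bankcluster closer to the requesting CPU. Chain names follow the slide;
# the dict/list model itself is an illustrative assumption.

MIGRATION_CHAIN = ["other", "my_center", "my_inter", "my_local"]

def migrate_on_hit(block_location):
    """Return the block's bankcluster after one hit (one step closer)."""
    idx = MIGRATION_CHAIN.index(block_location)
    if idx + 1 < len(MIGRATION_CHAIN):
        return MIGRATION_CHAIN[idx + 1]
    return block_location  # already in the requester's local bankcluster

loc = "other"
for _ in range(3):  # three hits by the same CPU
    loc = migrate_on_hit(loc)
# other -> my_center -> my_inter -> my_local
print(loc)
```

Moving one cluster per hit (rather than jumping straight to the local banks) is what makes the policy "gradual": blocks hit by many CPUs tend to settle in the middle instead of ping-ponging between distant local bankclusters.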
CMP-DNUCA: Hit Distribution – Ocean, per CPU
(Figure: per-CPU L2 hit distributions for CPUs 0–7)
CMP-DNUCA: Hit Distribution – Ocean, all CPUs
Block migration successfully separates the data sets
CMP-DNUCA: Hit Distribution – OLTP, all CPUs
CMP-DNUCA: Hit Distribution – OLTP, per CPU
Hit clustering: most L2 hits are satisfied by the center banks
(Figure: per-CPU L2 hit distributions for CPUs 0–7)
CMP-DNUCA: Search
Search policy
– Uniprocessor DNUCA solution: partial tags
  Quick summary of the L2 tag state at the CPU
  No known practical implementation for CMPs
  – Size impact of multiple partial tags
  – Coherence between block migrations and partial tag state
– CMP-DNUCA solution: two-phase search
  1st phase: CPU's local, inter., & 4 center banks
  2nd phase: remaining 10 banks
  Slow 2nd-phase hits and L2 misses
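The two-phase search above can be sketched as two sequential probe rounds: the near banks (local, intermediate, and the 4 center bankclusters) first, the remaining 10 banks only on a first-phase miss. Bank names and the dict-of-sets model are illustrative assumptions.

```python
# Sketch of CMP-DNUCA's two-phase search. A miss must traverse both phases
# before going off-chip, which is why 2nd-phase hits and L2 misses are slow.

def two_phase_search(addr_tag, banks, phase1_banks, phase2_banks):
    """banks maps bank name -> set of resident tags.
    Returns (phase, bank) on a hit, or (None, None) on an L2 miss."""
    for bank in phase1_banks:   # fast: the requester's near bankclusters
        if addr_tag in banks[bank]:
            return (1, bank)
    for bank in phase2_banks:   # slow: the remaining distant banks
        if addr_tag in banks[bank]:
            return (2, bank)
    return (None, None)         # L2 miss, discovered only after both phases

banks = {"my_local": {0xA}, "center0": {0xB}, "far3": {0xC}}
phase1 = ["my_local", "center0"]
phase2 = ["far3"]
print(two_phase_search(0xC, banks, phase1, phase2))  # hit only in phase 2
```

The sketch makes the cost structure explicit: a phase-1 hit probes few banks, but a phase-2 hit or a miss pays for probing every bank, serially, before it can resolve.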
CMP-DNUCA: L2 Hit Latency
CMP-DNUCA Summary
Limited success
– Ocean successfully splits
  Regular scientific workload – little sharing
– OLTP congregates in the center
  Commercial workload – significant sharing
Smart search mechanism
– Necessary for performance improvement
– No known implementations
– Upper bound: perfect search
Outline
Global interconnect and CMP trends
Latency Management Techniques
Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
L2 Hit Latency
Bars labeled: D = CMP-DNUCA, T = CMP-TLC, H = CMP-Hybrid
Overall Performance
Transmission lines improve L2 hit and L2 miss latency
Conclusions
Individual latency management techniques
– Strided prefetching: subset of misses
– Cache block migration: sharing impedes migration
– On-chip transmission lines: limited bandwidth
Combination: CMP-Hybrid
– Potentially alleviates bottlenecks
– Disadvantages
  Relies on smart-search mechanism
  Manufacturing cost of transmission lines
Backup Slides
Strided Prefetching
Utilize repeatable memory access patterns
– Subset of misses
– Tolerates latency within the memory hierarchy
Our implementation
– Similar to Power4
– Unit and non-unit stride misses
  L1 – L2
  L2 – Mem
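The stride-detection idea above can be sketched as: remember the last miss address, and once the same stride between misses repeats enough times, prefetch the next block(s) ahead. This is a minimal single-stream sketch in the spirit of a Power4-like prefetcher; the table layout, confidence threshold, and prefetch degree are all illustrative assumptions, not the paper's implementation.

```python
# Minimal stride-detection sketch (single stream). A real prefetcher tracks
# many streams in a table; threshold and degree here are assumed values.

class StridePrefetcher:
    def __init__(self, threshold=2, degree=1):
        self.last_addr = None
        self.last_stride = None
        self.confidence = 0
        self.threshold = threshold  # stride repeats needed before prefetching
        self.degree = degree        # how many blocks to prefetch ahead

    def on_miss(self, addr):
        """Observe a miss address; return the list of prefetch addresses."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                self.confidence += 1
            else:
                self.confidence = 0
            self.last_stride = stride
            if self.confidence >= self.threshold:
                prefetches = [addr + stride * (i + 1)
                              for i in range(self.degree)]
        self.last_addr = addr
        return prefetches

pf = StridePrefetcher()
for a in (100, 164, 228, 292):  # unit-stride misses over 64-byte blocks
    issued = pf.on_miss(a)
print(issued)  # once the stride repeats, the next block (356) is prefetched
```

The same detector covers both unit and non-unit strides, since only the repeated difference between miss addresses matters; it inherently helps only the subset of misses with a repeatable pattern.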
On- and Off-chip Prefetching
(Figure: results across the commercial and scientific benchmarks)
CMP Sharing Patterns
CMP Request Distribution
CMP-DNUCA: Search Strategy
(Figure: bankclusters – local, inter., and center – around CPUs 0–7; the 1st search phase covers the near banks, the 2nd search phase covers the rest)
Uniprocessor DNUCA: partial tag array for smart searches
Significant implementation complexity for CMP-DNUCA
CMP-DNUCA: Migration Strategy
(Figure: bankclusters – local, inter., and center – around CPUs 0–7)
Migration path: other local → other inter. → other center → my center → my inter. → my local
Uncontended Latency Comparison
CMP-DNUCA: L2 Hit Distribution
(Figure: per-benchmark results)
CMP-DNUCA: L2 Hit Latency
CMP-DNUCA: Runtime
CMP-DNUCA Problems
Hit clustering
– Shared blocks move within the center
– Equally far from all processors
Search complexity
– 16 separate clusters
– Partial tags impractical
  Distributed information
  Synchronization complexity
CMP-TLC: L2 Hit Latency
Bars labeled: D = CMP-DNUCA, T = CMP-TLC
Runtime: Isolated Techniques
CMP-Hybrid: Performance
Energy Efficiency