Managing Wire Delay in Large CMP Caches
Bradford M. Beckmann and David A. Wood
Multifacet Project, University of Wisconsin-Madison
MICRO /8/04
Overview
Managing wire delay in shared CMP caches
Three techniques extended to CMPs:
1. On-chip Strided Prefetching (not in talk – see paper)
   – Scientific workloads: 10% average reduction
   – Commercial workloads: 3% average reduction
2. Cache Block Migration (e.g. D-NUCA)
   – Block sharing limits average reduction to 3%
   – Depends on a difficult-to-implement smart search
3. On-chip Transmission Lines (e.g. TLC)
   – Reduce runtime by 8% on average
   – Bandwidth contention accounts for 26% of L2 hit latency
Combining techniques
   + Potentially alleviates isolated deficiencies
   – Up to 19% reduction vs. baseline
   – Implementation complexity
Current CMP: IBM Power 5
(Die diagram: 2 CPUs, each with L1 I$ and L1 D$, sharing 3 L2 cache banks)
CMP Trends
(Figure: the reachable distance per cycle shrinks from 2004 technology to 2010 technology; a 2-CPU CMP grows to an 8-CPU CMP, each CPU with its own L1 I$ and L1 D$, sharing a large L2)
Baseline: CMP-SNUCA
(Figure: CPUs 0–7, each with L1 I$ and L1 D$, placed around the shared, banked L2 in a static NUCA layout)
Outline
Global interconnect and CMP trends
Latency Management Techniques
Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
Block Migration: CMP-DNUCA
(Figure: the CMP-SNUCA layout with blocks A and B migrating through the L2 banks toward their requesting CPUs)
On-chip Transmission Lines
Similar to contemporary off-chip communication
Provides a different latency / bandwidth tradeoff
Wires behave more "transmission-line" like as frequency increases
– Utilize transmission line qualities to our advantage
– No repeaters – route directly over large structures
– ~10x lower latency across long distances
Limitations
– Requires thick wires and dielectric spacing
– Increases manufacturing cost
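The latency contrast above can be sketched with a toy first-order model: a repeated RC wire's delay grows linearly with distance at one rate, while a transmission line propagates near the speed of light in the dielectric. The per-mm delay constants below are illustrative assumptions chosen only to reflect the slide's ~10x claim, not measured values.

```python
# Toy latency model: conventional repeated RC wire vs. on-chip transmission
# line. Delay-per-mm figures are assumed for illustration (10x apart, per
# the slide's claim), not taken from the paper.

def repeated_wire_delay_ps(length_mm, delay_per_mm_ps=50.0):
    """Repeated RC wire: delay grows linearly with distance."""
    return length_mm * delay_per_mm_ps

def transmission_line_delay_ps(length_mm, delay_per_mm_ps=5.0):
    """Transmission line: signal travels near the speed of light in the
    dielectric; assumed ~10x faster per mm here."""
    return length_mm * delay_per_mm_ps

for length in (5, 10, 20):  # mm across the die
    rc = repeated_wire_delay_ps(length)
    tl = transmission_line_delay_ps(length)
    print(f"{length} mm: repeated wire {rc:.0f} ps, transmission line {tl:.0f} ps")
```

Under these assumed constants, the gap widens linearly with distance, which is why transmission lines pay off most on the longest cross-chip routes.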
Transmission Lines: CMP-TLC
(Figure: CPUs 0–7, each with L1 I$ and L1 D$, connected to the centrally located L2 banks by transmission-line byte links)
Combination: CMP-Hybrid
(Figure: the CMP-SNUCA layout augmented with transmission-line byte links from the CPUs to the center banks)
Outline
Global interconnect and CMP trends
Latency Management Techniques
Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
Methodology
Full system simulation
– Simics
– Timing model extensions
  Out-of-order processor
  Memory system
Workloads
– Commercial: apache, jbb, oltp, zeus
– Scientific
  Splash: barnes & ocean
  SpecOMP: apsi & fma3d
System Parameters

Memory System
– L1 I & D caches: 64 KB, 2-way, 3 cycles
– Unified L2 cache: 16 MB, 256 x 64 KB, 16-way, 6 cycle bank access
– L1 / L2 cache block size: 64 bytes
– Memory latency: 260 cycles
– Memory bandwidth: 320 GB/s
– Memory size: 4 GB of DRAM
– Outstanding memory requests / CPU: 16

Dynamically Scheduled Processor
– Clock frequency: 10 GHz
– Reorder buffer / scheduler: 128 / 64 entries
– Pipeline width: 4-wide fetch & issue
– Pipeline stages: 30
– Direct branch predictor: 3.5 KB YAGS
– Return address stack: 64 entries
– Indirect branch predictor: 256 entries (cascaded)
Outline
Global interconnect and CMP trends
Latency Management Techniques
Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
CMP-DNUCA: Organization
(Figure: L2 banks grouped into bankclusters – local, intermediate, and center – surrounded by CPUs 0–7)
Hit Distribution: Grayscale Shading
(Figure: CPUs 0–7 around the L2 banks; darker shading indicates a greater % of L2 hits)
CMP-DNUCA: Migration
Migration policy
– Gradual movement
– Increases local hits and reduces distant hits
Migration path: other bankclusters → my center bankcluster → my inter. bankcluster → my local bankcluster
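The gradual-movement policy above can be sketched as a one-step promotion on each hit: a block advances one position along the chain toward the requesting CPU's local bankcluster. The list-based model and function names below are illustrative, not the hardware mechanism.

```python
# Sketch of gradual block migration: on each L2 hit, the block moves one
# bankcluster closer to the requesting CPU. Chain names follow the slide;
# the dict/list model itself is an illustrative assumption.

MIGRATION_CHAIN = ["other", "my_center", "my_inter", "my_local"]

def migrate_on_hit(block_location):
    """Return the block's bankcluster after one hit (one step closer)."""
    idx = MIGRATION_CHAIN.index(block_location)
    if idx + 1 < len(MIGRATION_CHAIN):
        return MIGRATION_CHAIN[idx + 1]
    return block_location  # already in the requester's local bankcluster

loc = "other"
for _ in range(3):  # three hits by the same CPU
    loc = migrate_on_hit(loc)
# other -> my_center -> my_inter -> my_local
print(loc)
```

Moving one cluster per hit (rather than jumping straight to the local banks) is what makes the policy "gradual": blocks hit by many CPUs tend to settle in the middle instead of ping-ponging between distant local bankclusters.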
CMP-DNUCA: Hit Distribution – Ocean, per CPU
(Figure: per-CPU L2 hit distributions for CPUs 0–7)
CMP-DNUCA: Hit Distribution – Ocean, all CPUs
Block migration successfully separates the data sets
CMP-DNUCA: Hit Distribution – OLTP, all CPUs
CMP-DNUCA: Hit Distribution – OLTP, per CPU
Hit clustering: most L2 hits are satisfied by the center banks
(Figure: per-CPU L2 hit distributions for CPUs 0–7)
CMP-DNUCA: Search
Search policy
– Uniprocessor DNUCA solution: partial tags
  Quick summary of the L2 tag state at the CPU
  No known practical implementation for CMPs
  – Size impact of multiple partial tags
  – Coherence between block migrations and partial tag state
– CMP-DNUCA solution: two-phase search
  1st phase: CPU's local, inter., & 4 center banks
  2nd phase: remaining 10 banks
  Slow 2nd-phase hits and L2 misses
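The two-phase search above can be sketched as two sequential probe rounds: the near banks (local, intermediate, and the 4 center bankclusters) first, the remaining 10 banks only on a first-phase miss. Bank names and the dict-of-sets model are illustrative assumptions.

```python
# Sketch of CMP-DNUCA's two-phase search. A miss must traverse both phases
# before going off-chip, which is why 2nd-phase hits and L2 misses are slow.

def two_phase_search(addr_tag, banks, phase1_banks, phase2_banks):
    """banks maps bank name -> set of resident tags.
    Returns (phase, bank) on a hit, or (None, None) on an L2 miss."""
    for bank in phase1_banks:   # fast: the requester's near bankclusters
        if addr_tag in banks[bank]:
            return (1, bank)
    for bank in phase2_banks:   # slow: the remaining distant banks
        if addr_tag in banks[bank]:
            return (2, bank)
    return (None, None)         # L2 miss, discovered only after both phases

banks = {"my_local": {0xA}, "center0": {0xB}, "far3": {0xC}}
phase1 = ["my_local", "center0"]
phase2 = ["far3"]
print(two_phase_search(0xC, banks, phase1, phase2))  # hit only in phase 2
```

The sketch makes the cost structure explicit: a phase-1 hit probes few banks, but a phase-2 hit or a miss pays for probing every bank, serially, before it can resolve.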
CMP-DNUCA: L2 Hit Latency
CMP-DNUCA Summary
Limited success
– Ocean successfully splits
  Regular scientific workload – little sharing
– OLTP congregates in the center
  Commercial workload – significant sharing
Smart search mechanism
– Necessary for performance improvement
– No known implementations
– Upper bound: perfect search
Outline
Global interconnect and CMP trends
Latency Management Techniques
Evaluation
– Methodology
– Block Migration: CMP-DNUCA
– Transmission Lines: CMP-TLC
– Combination: CMP-Hybrid
L2 Hit Latency
Bars labeled: D = CMP-DNUCA, T = CMP-TLC, H = CMP-Hybrid
Overall Performance
Transmission lines improve L2 hit and L2 miss latency
Conclusions
Individual latency management techniques
– Strided prefetching: subset of misses
– Cache block migration: sharing impedes migration
– On-chip transmission lines: limited bandwidth
Combination: CMP-Hybrid
– Potentially alleviates bottlenecks
– Disadvantages
  Relies on smart-search mechanism
  Manufacturing cost of transmission lines
Backup Slides
Strided Prefetching
Utilize repeatable memory access patterns
– Subset of misses
– Tolerates latency within the memory hierarchy
Our implementation
– Similar to Power4
– Unit and non-unit stride misses
  L1 – L2
  L2 – Mem
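The stride-detection idea above can be sketched as: remember the last miss address, and once the same stride between misses repeats enough times, prefetch the next block(s) ahead. This is a minimal single-stream sketch in the spirit of a Power4-like prefetcher; the table layout, confidence threshold, and prefetch degree are all illustrative assumptions, not the paper's implementation.

```python
# Minimal stride-detection sketch (single stream). A real prefetcher tracks
# many streams in a table; threshold and degree here are assumed values.

class StridePrefetcher:
    def __init__(self, threshold=2, degree=1):
        self.last_addr = None
        self.last_stride = None
        self.confidence = 0
        self.threshold = threshold  # stride repeats needed before prefetching
        self.degree = degree        # how many blocks to prefetch ahead

    def on_miss(self, addr):
        """Observe a miss address; return the list of prefetch addresses."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                self.confidence += 1
            else:
                self.confidence = 0
            self.last_stride = stride
            if self.confidence >= self.threshold:
                prefetches = [addr + stride * (i + 1)
                              for i in range(self.degree)]
        self.last_addr = addr
        return prefetches

pf = StridePrefetcher()
for a in (100, 164, 228, 292):  # unit-stride misses over 64-byte blocks
    issued = pf.on_miss(a)
print(issued)  # once the stride repeats, the next block (356) is prefetched
```

The same detector covers both unit and non-unit strides, since only the repeated difference between miss addresses matters; it inherently helps only the subset of misses with a repeatable pattern.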
On- and Off-chip Prefetching
(Figure: results across the commercial and scientific benchmarks)
CMP Sharing Patterns
CMP Request Distribution
CMP-DNUCA: Search Strategy
(Figure: bankclusters – local, inter., and center – around CPUs 0–7; the 1st search phase covers the near banks, the 2nd search phase covers the rest)
Uniprocessor DNUCA: partial tag array for smart searches
Significant implementation complexity for CMP-DNUCA
CMP-DNUCA: Migration Strategy
(Figure: bankclusters – local, inter., and center – around CPUs 0–7)
Migration path: other local → other inter. → other center → my center → my inter. → my local
Uncontended Latency Comparison
CMP-DNUCA: L2 Hit Distribution
(Figure: per-benchmark results)
CMP-DNUCA: L2 Hit Latency
CMP-DNUCA: Runtime
CMP-DNUCA Problems
Hit clustering
– Shared blocks move within the center
– Equally far from all processors
Search complexity
– 16 separate clusters
– Partial tags impractical
  Distributed information
  Synchronization complexity
CMP-TLC: L2 Hit Latency
Bars labeled: D = CMP-DNUCA, T = CMP-TLC
Runtime: Isolated Techniques
CMP-Hybrid: Performance
Energy Efficiency