Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors

Milo Martin, Pacia Harper, Dan Sorin§, Mark Hill, and David Wood
University of Wisconsin   §Duke University
Wisconsin Multifacet Project: http://www.cs.wisc.edu/multifacet/
Cache-Coherence: A Latency/Bandwidth Tradeoff

[Figure: average miss latency (y-axis) vs. bandwidth usage, i.e., cost (x-axis); Broadcast Snooping sits at low latency but high bandwidth]

Broadcast snooping: burns bandwidth for low latency
- Evolved from bus-based to switched interconnects
- Direct data response (no indirection)
- Extra traffic (due to broadcast)
Cache-Coherence: A Latency/Bandwidth Tradeoff

[Figure: same axes; the Directory Protocol sits at low bandwidth but higher latency than Broadcast Snooping]

Directory protocols: sacrifice latency for a scalability/bandwidth improvement
- Send requests to the directory, which forwards them as needed
- Avoids broadcast
- Adds latency to cache-to-cache misses
Cache-Coherence: A Latency/Bandwidth Tradeoff

[Figure: same axes, with an "Ideal" design point at both low latency and low bandwidth]

Goal: move toward the "ideal" design point

Approach: destination-set prediction
- The destination set is the set of processors to which a coherence request is sent
- Learn from past coherence behavior to predict the destination set of each request
- Correct predictions avoid both indirection and broadcast

The right choice depends on available bandwidth and other factors. We present four predictor types and explore indexing schemes, so there is something for everyone:
- If you like directory protocols, we can reduce average miss latency at the cost of modest extra bandwidth
- If you like snooping protocols, we can significantly reduce traffic at the cost of a modest increase in average miss latency
System Model

[Diagram: processor/memory nodes connected by an interconnect; each node contains a processor (P), caches ($), a network interface with a destination-set predictor, and memory (M) with a directory controller holding directory state]

Similar to current systems; we highlight only the differences:
- From a directory protocol: an ordered interconnect
- From a snooping system: directory state at memory
- From both: a simple bandwidth-adaptive mechanism
Outline
- Introduction
- Quantifying potential benefit
- Predictors (description and evaluation combined)
- Conclusions
Potential Benefit: Two Questions

[Figure: latency/bandwidth tradeoff with Broadcast Snooping, Directory Protocol, and Ideal design points]

1. Is there a significant performance difference? (y-axis)
2. Is there a significant bandwidth difference? (x-axis)
Potential Benefit: Question 1 of 2

[Figure: same tradeoff, highlighting the latency gap between Directory Protocol and Broadcast Snooping]

Is there a significant performance difference? Yes!
- Frequent cache-to-cache L2 misses (35%-96%)
- The directory lookup sits on the critical path
- Larger caches don't help; they make the percentage worse
Potential Benefit: Question 2 of 2

[Figure: same tradeoff, highlighting the bandwidth gap between Broadcast Snooping and Directory Protocol]

Is there a significant bandwidth difference? Yes!
- Only ~10% of requests must contact more than one other processor
- Broadcast is overkill
- The gap will grow with more processors
Outline
- Introduction
- Quantifying potential benefit
- Predictors
- Conclusions
Protocols for Destination-Set Prediction

[Figure: latency/bandwidth tradeoff; destination-set prediction spans the space between Directory Protocol and Broadcast Snooping, bending toward Ideal]

One protocol unites directory protocols, snooping protocols, and the predictive design points in between:
- Predicting the "minimal set" = directory protocol
- Predicting "broadcast" = broadcast snooping
- "Correct" predictions allow direct responses
- Random predictions would fall along the line connecting the two; exploiting sharing patterns moves the design toward the ideal
Predictor Design Space

Training information:
1. Responses to the processor's own requests
2. Coherence requests from other processors (reads & writes)

Basic organization (sketched below):
- One predictor at each processor's L2 cache
- Accessed in parallel with the L2 tags
- No modifications to the processor core
- All predictors are simple cache-like (tagged) structures
- Indexed with the data block address
- Single-level predictor

Prediction:
- On a tag miss, send to the minimal set (directory & self)
- Otherwise, generate a prediction (as described next)
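To make the organization concrete, here is a minimal software sketch of such a cache-like, tagged predictor. The class and method names are illustrative, not from the paper, and the table is left unbounded as in the trace-based evaluation:

```python
# Software model of the tagged, cache-like predictor organization.
# A hardware predictor would be a fixed-size, set-associative SRAM
# accessed in parallel with the L2 tags.

BLOCK_BITS = 6  # 64 B cache blocks: drop the low 6 address bits


class TaggedPredictor:
    def __init__(self):
        self.table = {}  # unbounded here; finite sizes are evaluated later

    def index(self, addr):
        return addr >> BLOCK_BITS  # index/tag by data-block address

    def make_prediction(self, entry, minimal_set):
        raise NotImplementedError  # each predictor supplies its own policy

    def predict(self, addr, minimal_set):
        entry = self.table.get(self.index(addr))
        if entry is None:
            return set(minimal_set)  # tag miss: send to directory & self only
        return self.make_prediction(entry, minimal_set)
```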
Evaluation Methods

Six multiprocessor workloads:
- Online transaction processing (OLTP)
- Java middleware (SPECjbb)
- Static and dynamic web serving (Apache & Slash)
- Scientific applications (Barnes & Ocean)

Simulation environment:
- Full-system simulation using Simics
- 16-processor SPARC MOSI multiprocessor
- Many parameters (see paper)

Traces (for design-space exploration) & timing simulation (for runtime results)
Trace-based Predictor Evaluation

[Figure: OLTP workload; indirection rate (y-axis, corresponds to latency) vs. traffic (x-axis, corresponds to bandwidth), with Snooping and Directory as the two endpoints]

- The rate of indirection can be read as a measure of prediction accuracy
- Traces let us quickly explore the design space
Predictor #1: Broadcast-if-shared

Goal: the performance of snooping with fewer broadcasts
- Broadcast for "shared" data; minimal set for "private" data

Each entry: a valid bit and a 2-bit counter
- Decrement on data from memory
- Increment on data from another processor
- Increment on another processor's request

Prediction (sketched below):
- If counter > 1, broadcast
- Otherwise, send only to the minimal set
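A sketch of this policy in the same modeling style (event names such as "data_from_memory" are illustrative assumptions; the counter > 1 threshold follows the slide):

```python
# Sketch of the broadcast-if-shared predictor: each entry is a valid bit
# plus a 2-bit saturating counter (an absent entry models an invalid one).

class BroadcastIfShared:
    def __init__(self, all_processors):
        self.all = set(all_processors)
        self.counter = {}  # block -> 2-bit counter

    def train(self, block, event):
        c = self.counter.get(block, 0)
        if event == "data_from_memory":
            c = max(c - 1, 0)    # memory response: block looks private
        elif event in ("data_from_processor", "other_request"):
            c = min(c + 1, 3)    # cache-to-cache data or another
        self.counter[block] = c  # processor's request: block looks shared

    def predict(self, block, minimal_set):
        if self.counter.get(block, 0) > 1:
            return set(self.all)   # "shared": broadcast
        return set(minimal_set)    # "private": minimal set only
```

The 2-bit counter adds hysteresis: a block must look shared more than once before the predictor commits to broadcasting for it.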
Predictor #1: Broadcast-if-shared

[Figure: OLTP workload; Broadcast-if-shared lands near Snooping's indirection rate at lower traffic]

Unbounded predictor, indexed by data block (64 B): the performance of snooping with less traffic
Predictor #2: Owner

Goal: traffic similar to directory with fewer indirections
- Predict one extra processor (the "owner")
- Captures pairwise sharing and the write part of migratory sharing

Each entry: a valid bit and a predicted owner ID
- Set "owner" on data from another processor
- Set "owner" on another processor's request to write
- Unset "owner" on a response from memory

Prediction (sketched below):
- If valid, predict "owner" + minimal set
- Otherwise, send only to the minimal set
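A sketch under the same assumptions (event names illustrative):

```python
# Sketch of the owner predictor: each entry is a valid bit plus a predicted
# owner ID (an absent entry models an invalid one).

class OwnerPredictor:
    def __init__(self):
        self.owner = {}  # block -> predicted owner ID

    def train(self, block, event, proc=None):
        if event in ("data_from_processor", "other_write_request"):
            self.owner[block] = proc     # `proc` likely owns the block now
        elif event == "data_from_memory":
            self.owner.pop(block, None)  # memory responded: unset the owner

    def predict(self, block, minimal_set):
        dest = set(minimal_set)
        if block in self.owner:
            dest.add(self.owner[block])  # minimal set + one extra processor
        return dest
```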
Predictor #2: Owner

[Figure: OLTP workload; Owner lands near Directory's traffic with a lower indirection rate]

The traffic of a directory protocol with higher performance
Predictor #3: Group

Goal: approach the ideal bandwidth/latency point
- Detect groups of sharers, e.g., temporary groups or logical partitions (LPARs)

Each entry: N 2-bit counters, one per processor
- On a response or request from another processor, increment the corresponding counter
- Train down by occasionally decrementing all counters (every 2N increments)

Prediction (sketched below):
- Begin with the minimal set
- For each processor whose counter is > 1, add it to the predicted set
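A sketch of this predictor; the slide does not say whether the "every 2N increments" decay is per entry or global, so per-entry decay is assumed here:

```python
# Sketch of the group predictor: each entry holds N 2-bit counters, one per
# processor, trained up on traffic and decayed so stale sharers age out.

class GroupPredictor:
    def __init__(self, n_procs=16):
        self.n = n_procs
        self.entries = {}  # block -> [per-processor counters, increment count]

    def train(self, block, proc):
        # response or request from processor `proc`: bump its counter
        entry = self.entries.setdefault(block, [[0] * self.n, 0])
        counters = entry[0]
        counters[proc] = min(counters[proc] + 1, 3)
        entry[1] += 1
        if entry[1] >= 2 * self.n:  # every 2N increments, train down:
            entry[0] = [max(c - 1, 0) for c in counters]
            entry[1] = 0

    def predict(self, block, minimal_set):
        dest = set(minimal_set)     # begin with the minimal set
        entry = self.entries.get(block)
        if entry is not None:
            for proc, c in enumerate(entry[0]):
                if c > 1:           # trained processor: add to predicted set
                    dest.add(proc)
        return dest
```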
Predictor #3: Group

[Figure: OLTP workload; Group lands between Directory and Snooping on both axes]

A design point between directory and snooping protocols
Predictor #4: Owner/Group

A hybrid (sketched below):
- Group predictor for exclusive requests
- Owner predictor for shared requests
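A sketch of the hybrid dispatch, reusing the OwnerPredictor and GroupPredictor sketches above and treating shared requests as reads and exclusive requests as writes/upgrades:

```python
# Sketch of the Owner/Group hybrid: route each request to the predictor
# suited to it, per the split on the slide.

class OwnerGroupPredictor:
    def __init__(self, owner, group):
        self.owner = owner  # handles shared (read) requests
        self.group = group  # handles exclusive (write) requests

    def predict(self, block, minimal_set, is_exclusive):
        if is_exclusive:
            return self.group.predict(block, minimal_set)
        return self.owner.predict(block, minimal_set)
```

The intuition: a read needs only the current owner to respond, while a write must reach every member of the sharing group.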
Predictor #4: Owner/Group

[Figure: OLTP workload; Owner/Group lands between Owner and Group]

A hybrid design between Owner and Group
Indexing Design Space
- Index by cache block (64 B), as shown so far
- Index by program counter (PC): simple schemes are not as effective with PCs
- Index by macroblock (256 B or 1024 B): spatially related blocks share an entry, exploiting the spatial predictability of sharing misses (see the sketch below)
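A sketch of the three indexing functions (shift amounts follow the slide's sizes; function names are illustrative):

```python
# Indexing options for the predictor table.

def block_index(addr):
    return addr >> 6            # 64 B data block, as used so far

def macroblock_index(addr, mblock_bits=10):
    return addr >> mblock_bits  # 1024 B macroblock (use 8 for 256 B):
                                # spatially related blocks share one entry

def pc_index(pc):
    return pc                   # index by the requesting instruction's PC
```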
PC Indexing

[Figure: OLTP workload; PC-indexed predictors vs. data-block-indexed predictors]

Data-block indexing yields better predictions
Macroblock Indexing

[Figure: OLTP workload; macroblock-indexed predictors shift toward the ideal]

Macroblock indexing is an improvement; the Group predictor improves substantially (30% → 10%)
Finite Size Predictors
- 8192 entries per predictor
- 32 kB to 64 kB of predictor state
Runtime Results

Which point in the design space should we simulate?
- As available bandwidth → infinite, snooping performs best (no indirection)
- As available bandwidth → 0, a directory protocol performs best (bandwidth efficient)

Bandwidth/latency is a cost/performance tradeoff:
- Cost is difficult to quantify (the cost of chip bandwidth)
- Other associated costs (snoop bandwidth, power use)
- Under-designing bandwidth will reduce performance

In the paper: we measure runtime & traffic, simulating plentiful (but limited) bandwidth. This is a bandwidth-rich configuration; with less available bandwidth, the snooping system would not always be the highest-performing choice. Like other architecture papers that estimate die-area overheads, we provide an indirect measure of cost and show performance.
Runtime Results: OLTP

[Figure: runtime and traffic for OLTP across protocols; traffic is mostly data traffic]

Group: half the runtime of directory, two-thirds the traffic of snooping (in this bandwidth-rich configuration)
Conclusions
- Destination-set prediction provides better bandwidth/latency tradeoffs (not just the extremes of snooping and directory)
- Significant benefit from macroblock indexing
- An 8192-entry predictor requires only 64 kB
- Result summary: 90% of the performance of snooping with only 15% more bandwidth than a directory protocol

Reference: http://www.cs.wisc.edu/multifacet/