Effects of wrong path mem. ref. in CC MP Systems Gökay Burak AKKUŞ Cmpe 511 – Computer Architecture.

About the papers
• R. Sendag, A. Yilmazer, J.J. Yi, and Augustus K. Uht. Quantifying and Reducing the Effects of Wrong-Path Memory References in Cache-Coherent Multiprocessor Systems. IPDPS 2006, 2006.
• O. Mutlu, H. Kim, D. Armstrong, and Y. Patt. Cache filtering techniques to reduce the negative impact of useless speculative memory references on processor performance. Symposium on Computer Architecture and High Performance Computing.
• O. Mutlu, H. Kim, D. Armstrong, and Y. Patt. Understanding the effects of wrong-path memory references on processor performance. Workshop on Memory Performance Issues, 2004.

What is it all about?
• How wrong-path memory accesses affect cache coherence traffic, state transitions, and resource utilization
• The papers propose a filtering mechanism and a replacement policy

Subjects
• SMPs: Shared-memory MultiProcessor systems
• Cache coherence
• Branch prediction and prefetching
• Wrong paths

Cache Coherence Solutions
• Snooping solution (snoopy bus):
  - Send all requests for data to all processors
  - Processors snoop to see if they have a copy and respond accordingly
  - Requires broadcast, since caching information is at the processors
  - Works well with a bus (natural broadcast medium)
  - Dominates for small-scale machines (most of the market)
• Directory-based schemes:
  - Keep track of what is being shared in one (logically) centralized place
  - Distributed memory => distributed directory for scalability (avoids bottlenecks)
  - Send point-to-point requests to processors via the network
  - Scale better than snooping
  - Actually existed BEFORE snooping-based schemes
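
To make the broadcast versus point-to-point contrast concrete, here is a rough message-count sketch. It is an idealized model, not taken from the papers: it counts only the request messages for a single miss and ignores responses, retries, and bus arbitration. The function names are hypothetical.

```python
def snooping_messages(num_processors: int) -> int:
    """Idealized count: a broadcast request is snooped by every other processor."""
    return num_processors - 1


def directory_messages(num_sharers: int) -> int:
    """Idealized count: one request to the home directory,
    plus one point-to-point message per cache that holds a copy."""
    return 1 + num_sharers


# Example: a 16-processor system where two caches hold the requested block.
print(snooping_messages(16))   # 15
print(directory_messages(2))   # 3
```

In this simplified model the snooping cost grows with the number of processors, while the directory cost grows only with the number of sharers, which is the scalability argument made on the slide.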

Cache Coherence Protocols
• MSI (Modified, Shared, Invalid)
• MESI (Modified, Exclusive, Shared, Invalid)
• MOESI (Modified, Owned, Exclusive, Shared, Invalid)

Wrong-path effects
• Replacements
• Writebacks
• Invalidations
• Cache block state transitions
• Data/bus traffic and coherence transactions
• Power consumption
• Resource contention

Replacements
• Cause: a speculatively executed load instruction on a mispredicted path brings a cache block into the data cache
• One of the resident cache blocks is replaced by the new one

Writebacks
• When a wrong-path reference causes a replacement:
  - The evicted cache block may be in state M (exclusive, dirty) or O (shared, dirty)
  - Before this block is removed from the cache, a writeback occurs
• For MSI and MESI, if a requested cache block is in state M:
  - It is written back to memory before it is sent to the requestor
  - Its state is then set to S in the original owner’s cache
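
A minimal sketch of the eviction rule above, assuming the MOESI state letters used on these slides. The `State` enum and the `needs_writeback` helper are hypothetical names introduced for illustration only.

```python
from enum import Enum


class State(Enum):
    M = "Modified"   # exclusive, dirty
    O = "Owned"      # shared, dirty
    E = "Exclusive"  # exclusive, clean
    S = "Shared"     # shared, clean
    I = "Invalid"


# States in which the cached data differs from memory.
DIRTY_STATES = {State.M, State.O}


def needs_writeback(victim_state: State) -> bool:
    """Return True if evicting a block in this state forces a writeback."""
    return victim_state in DIRTY_STATES


print(needs_writeback(State.M))  # True
print(needs_writeback(State.S))  # False
```

A wrong-path fill that evicts a block in M or O therefore costs a writeback that the correct path would not have needed at that point.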

Invalidations
• Assume the MOESI protocol:
  - A wrong-path load instruction accesses a cache block that has been modified by another processor
  - The owner sets the block's state to O
  - The requestor gets the block in state S
• If the owner later needs to write to that block:
  - It changes the state from O to M
  - It then invalidates all other copies

Cache Block State Transitions
• Two extra state transitions occur in the owner’s cache:
  - When a modified block is requested, the state changes from M to O
  - When the owner modifies that block again, the state returns to M
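
The invalidation and state-transition sequences described above can be traced with a small model of the owner's cache line. This is a simplified sketch, not a full MOESI implementation: it models one block and only the two events the slides mention, and the class and method names are hypothetical.

```python
class OwnerCacheLine:
    """Tracks one block in the owner's cache under a simplified MOESI model."""

    def __init__(self):
        self.state = "M"             # the owner starts with a modified block
        self.transitions = []        # record of (old_state, new_state) changes
        self.invalidations_sent = 0

    def _set(self, new_state):
        if new_state != self.state:
            self.transitions.append((self.state, new_state))
            self.state = new_state

    def remote_read(self):
        """Another processor (possibly on a wrong path) loads the block."""
        if self.state == "M":
            self._set("O")           # owner keeps the dirty copy; requestor gets S

    def local_write(self):
        """The owner writes the block again."""
        if self.state == "O":
            self.invalidations_sent += 1   # invalidate the other copies
            self._set("M")


line = OwnerCacheLine()
line.remote_read()   # wrong-path load from another processor
line.local_write()   # owner writes again
print(line.transitions)         # [('M', 'O'), ('O', 'M')]
print(line.invalidations_sent)  # 1
```

Without the wrong-path load, the owner's write would have hit a block still in M, with no transitions and no invalidation.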

Data/Bus Traffic and Coherence Transactions
• Due to wrong-path L1 and L2 cache accesses:
  - Extra replacements, writebacks, invalidations, and state transitions occur
  - Data and bus traffic increases as a result
• Additional snoop or directory requests also increase traffic

Power Consumption
• Unnecessary snoops add:
  - Traffic overhead
  - State transition overhead
• As a result, power consumption increases
• Example: filtering unnecessary snoops may reduce L2 cache power by 30% (see Moshovos et al.)

Resource Contention
• Wrong-path memory accesses compete with correct-path memory accesses for the multiprocessor’s resources
• Additional cache coherence transactions may cause service buffers to fill up more often
• Result: an increased chance of deadlock

Simulation
• SPLASH-2 benchmark suite
• em3d simulation benchmark
• MOSI and MOESI protocols used
• 16-processor SPARC v9

Statement based on experiments
• Mispredicted branches are resolved before 94% of wrong-path L2 misses complete.
• Therefore, whether an L2 cache miss is speculative is usually known before the block is placed into the L2 cache. [REF2]

Reducing Cache Pollution
• Filtering, applied to the L2 cache
• Observation:
  - If a speculatively fetched cache block is not used while it resides in the L1 cache, it is likely that the block will not be used at all, or will not be used before it is evicted from the L2 cache
• Mechanism:
  - All memory references made by wrong-path instructions or the prefetcher are fetched only into the first-level cache
  - The processor monitors whether they are referenced by non-speculative (correct-path) instructions
  - Based on the observation above, the processor may choose not to write the block into the L2 cache, or may adopt a policy that gives lower priority to the unused speculatively fetched block
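
A minimal sketch of the filtering decision, assuming the "do not write into L2" variant and assuming the speculative status of a block is already known when it is evicted from L1. The `L1Block` record and both function names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class L1Block:
    addr: int
    speculative: bool                   # fetched by a wrong-path instruction or the prefetcher
    used_by_correct_path: bool = False  # set once a correct-path instruction touches it


def on_correct_path_access(block: L1Block) -> None:
    """Record that a non-speculative instruction referenced the block."""
    block.used_by_correct_path = True


def should_install_in_l2(block: L1Block) -> bool:
    """Decide at L1 eviction time whether the block is written into L2."""
    if block.speculative and not block.used_by_correct_path:
        return False   # unused speculative block: filter it out of L2
    return True


blk = L1Block(addr=0x40, speculative=True)
print(should_install_in_l2(blk))   # False
on_correct_path_access(blk)
print(should_install_in_l2(blk))   # True
```

The lower-priority variant mentioned on the slide would install the block anyway but mark it as a preferred eviction candidate instead of returning False.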

Wrong-Path-Aware Replacement Policy
• When a block is brought into the cache, it is marked as being on either the correct path or the wrong path
• When a block needs to be evicted, wrong-path blocks are evicted first, on an LRU basis if there are multiple wrong-path blocks
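
The victim selection above could be sketched as follows. The fallback to plain LRU when the set holds no wrong-path blocks is an assumption, since the slide does not say what happens in that case. The `Block` record and `choose_victim` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    wrong_path: bool   # marked when the block is brought into the cache
    last_used: int     # timestamp of the last access; smaller means older


def choose_victim(cache_set):
    """Pick the block to evict from one cache set.

    Wrong-path blocks go first (least recently used among them).
    Assumption: with no wrong-path blocks, fall back to plain LRU.
    """
    wrong = [b for b in cache_set if b.wrong_path]
    candidates = wrong if wrong else cache_set
    return min(candidates, key=lambda b: b.last_used)


cache_set = [
    Block(tag=1, wrong_path=False, last_used=5),
    Block(tag=2, wrong_path=True, last_used=9),
    Block(tag=3, wrong_path=True, last_used=7),
    Block(tag=4, wrong_path=False, last_used=1),
]
print(choose_victim(cache_set).tag)   # 3
```

Block 4 is the oldest overall, but block 3 is evicted because it is the least recently used of the wrong-path blocks.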

Performance Evaluation

Conclusions & Critique
• IPC (instructions per cycle) can be used as the metric
• In some cases wrong-path execution positively affects overall performance: mcf, parser, and perlbmk
• In some cases it has a significantly negative effect: vpr and gcc
• To model or not to model: wrong-path references matter especially for future systems with longer memory interconnect latencies and processors with larger instruction windows
• The real effect: cache pollution
• In the SMP case especially:
  - For a workload with many cache-to-cache transfers, wrong-path memory references can significantly affect the coherence actions
• The proposed solutions have not yet been studied in depth

References
[REF1] R. Sendag, A. Yilmazer, J.J. Yi, and Augustus K. Uht. Quantifying and Reducing the Effects of Wrong-Path Memory References in Cache-Coherent Multiprocessor Systems. IPDPS 2006, 2006.
[REF2] O. Mutlu, H. Kim, D. Armstrong, and Y. Patt. Cache filtering techniques to reduce the negative impact of useless speculative memory references on processor performance. Symposium on Computer Architecture and High Performance Computing.
[REF3] O. Mutlu, H. Kim, D. Armstrong, and Y. Patt. Understanding the effects of wrong-path memory references on processor performance. Workshop on Memory Performance Issues, 2004.