Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors. Karin Strauss, Xiaowei Shen*, Josep Torrellas. University of Illinois at Urbana-Champaign, *IBM Research.

Presentation transcript:

Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors Karin Strauss, Xiaowei Shen*, Josep Torrellas University of Illinois at Urbana-Champaign *IBM Research

Slide 2: Motivation
- CMPs are becoming standard components
- cheaper to build medium size machines
  - 32 to 128 cores (multi-CMP)
- shared memory, cache coherent
  - easier to program, easier to manage
- supporting cache coherence is difficult

Slide 3: Cache coherence solutions
strategy                   ordered network?   pros       cons
directory-based protocol   no                 scalable   indirection, extra hardware
snoopy broadcast bus       yes                simple     difficult to scale
snoopy embedded ring       no                 simple     long latencies
other proposals (e.g. token coherence)

Slide 4: Contributions
- family of adaptive coherence protocols for rings
- two were chosen as best options
  - high performance scheme: Superset Aggressive
  - energy conscious scheme: Superset Conservative
- each is compared to the fastest state-of-the-art scheme on performance and energy consumption

Slide 5: Multi-CMP multiprocessor
[Diagram labels: CMP (Proc + L1 + L2), memory, local network]
- coherence protocol used: only one supplier if line is cached

Slide 6: Ring in action
[Diagram: three rings labeled Lazy, Eager and Oracle, each with nodes marked R and S; legend: cmp, supplier predictor, snoop request, response, data]

Slide 7: Ring in action
[Diagram: rings labeled Lazy, Eager and Oracle with nodes marked R and S, compared on latency, snoops and messages]
- goal: adaptive schemes that approximate Oracle's behavior
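The comparison of Lazy, Eager and Oracle on latency, snoops and messages can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the authors' simulator: the function name `service_miss`, the `MissCost` record, the unit hop and snoop costs, and the exact counting rules (for example, that Eager sends a separate trailing reply on every hop but the first) are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class MissCost:
    snoops: int    # number of nodes that perform a snoop
    messages: int  # number of ring-link traversals by any message
    latency: int   # time until the response is back at the requester


def service_miss(scheme, num_nodes, supplier, hop=1, snoop=1):
    """Toy cost of one read miss on a unidirectional ring.

    Assumptions (not from the slides): the requester is node 0, the single
    supplier sits `supplier` hops downstream, and every message travels the
    full ring back to the requester.
    """
    if not 1 <= supplier < num_nodes:
        raise ValueError("supplier must be a node other than the requester")
    hops = num_nodes  # one full trip around the ring

    if scheme == "lazy":
        # Each node up to and including the supplier snoops, then forwards.
        return MissCost(supplier, hops, hops * hop + supplier * snoop)
    if scheme == "eager":
        # Each node forwards first and snoops in parallel; the request and a
        # trailing reply travel as separate messages (assumed 2 * hops - 1).
        return MissCost(num_nodes - 1, 2 * hops - 1, hops * hop + snoop)
    if scheme == "oracle":
        # Only the supplier snoops; all other nodes just forward.
        return MissCost(1, hops, hops * hop + snoop)
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for name in ("lazy", "eager", "oracle"):
        print(name, service_miss(name, num_nodes=8, supplier=4))
```

In this model Lazy pays for its snoops in latency, Eager pays in snoops and messages, and Oracle pays for neither, which is the behavior the adaptive schemes try to approximate.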

Slide 8: Primitive snooping actions
- snoop and then forward
  + fewer messages
- forward and then snoop
  + shorter latency
- forward only
  + fewer snoops
  + shorter latency
  - false negative predictions not allowed

Slide 9: Predictors and algorithms
[Diagram: set of addresses the node can supply vs. set of addresses in the predictor]
predictor / algorithm   action on positive prediction   action on negative prediction
Subset                  snoop                           forward then snoop
Superset Con            snoop then forward              forward
Superset Agg            forward then snoop              forward
Exact                   snoop                           forward

Slide 10: Algorithms
[Figure: per miss service, number of snoops vs. snoop message latency vs. number of messages; points: Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, Exact]
algorithm      negative             positive
Subset         forward then snoop   snoop
Superset Con   forward              snoop then forward
Superset Agg   forward              forward then snoop
Exact          forward              snoop
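The algorithm table above amounts to a lookup from (algorithm, prediction) to a primitive snooping action. A minimal sketch follows; the table contents come from the slides, while the names `Action`, `ACTIONS` and `choose_action` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Action(Enum):
    SNOOP = "snoop"
    SNOOP_THEN_FORWARD = "snoop then forward"
    FORWARD_THEN_SNOOP = "forward then snoop"
    FORWARD = "forward"


# Key: algorithm name. Inner key: True for a positive prediction
# (the predictor says this node can supply the line), False for negative.
ACTIONS = {
    "Subset": {True: Action.SNOOP, False: Action.FORWARD_THEN_SNOOP},
    "SupersetCon": {True: Action.SNOOP_THEN_FORWARD, False: Action.FORWARD},
    "SupersetAgg": {True: Action.FORWARD_THEN_SNOOP, False: Action.FORWARD},
    "Exact": {True: Action.SNOOP, False: Action.FORWARD},
}


def choose_action(algorithm, predicts_supplier):
    """Return the primitive action a node takes for an incoming snoop request."""
    return ACTIONS[algorithm][predicts_supplier]


if __name__ == "__main__":
    for algorithm in ACTIONS:
        for prediction in (True, False):
            print(algorithm, prediction, choose_action(algorithm, prediction).value)
```

Only Subset snoops on a negative prediction; the Superset and Exact predictors can simply forward in that case.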

Slide 11: Predictor implementation
- Subset
  - associative table: subset of addresses that can be supplied by node
- Superset
  - bloom filter: superset of addresses that can be supplied by node
  - associative table (exclude cache): addresses that recently suffered false positives
- Exact
  - associative table: all addresses that can be supplied by node
  - downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded
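The Superset predictor's key property is that a Bloom filter never gives a false negative, so a negative answer is always safe to act on. A minimal sketch follows, assuming hypothetical names (`SupersetPredictor`, `insert`, `may_supply`), SHA-256 based hashing and arbitrary sizes. It omits the exclude cache and does not model removing addresses when a line leaves the node, both of which a real design would need.

```python
import hashlib


class SupersetPredictor:
    """Bloom filter over line addresses: may report false positives,
    never false negatives. Sizes and hash choice are illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, address):
        # Derive num_hashes bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, address):
        """Record that this node can now supply `address`."""
        for pos in self._positions(address):
            self.bits[pos] = 1

    def may_supply(self, address):
        """False means the node definitely cannot supply the line."""
        return all(self.bits[pos] for pos in self._positions(address))


if __name__ == "__main__":
    predictor = SupersetPredictor()
    for addr in (0x1000, 0x2040, 0x3F80):
        predictor.insert(addr)
    print(predictor.may_supply(0x2040))  # an inserted address is always positive
```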

Slide 12: Downgrading
[Diagram labels: A, B, E, S]
Negative effects:
- writes by this node need to snoop other nodes
- reads and writes by other nodes need to fetch line from memory
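Downgrading keeps the Exact predictor exact: when its table is full and an address must be evicted, the node gives up its supplier status for that line. A minimal sketch follows, assuming a hypothetical `ExactPredictor` class, an LRU replacement policy (the slides do not state the policy) and a caller-provided `downgrade` callback standing in for the cache-side state change.

```python
from collections import OrderedDict


class ExactPredictor:
    """Fixed-capacity table of every address this node can supply.

    LRU replacement is an assumption made for this sketch.
    """

    def __init__(self, capacity, downgrade):
        self.capacity = capacity
        self.downgrade = downgrade  # called with each evicted address
        self.table = OrderedDict()

    def add(self, address):
        """Record that this node can supply `address`."""
        if address in self.table:
            self.table.move_to_end(address)  # refresh LRU position
            return
        if len(self.table) >= self.capacity:
            victim, _ = self.table.popitem(last=False)  # evict oldest entry
            self.downgrade(victim)  # the node may no longer supply this line
        self.table[address] = True

    def can_supply(self, address):
        return address in self.table


if __name__ == "__main__":
    downgraded = []
    predictor = ExactPredictor(capacity=2, downgrade=downgraded.append)
    for addr in (0xA, 0xB, 0xC):
        predictor.add(addr)
    print(downgraded, predictor.can_supply(0xA), predictor.can_supply(0xC))
```

After a downgrade the table and the node's supplier set still match, at the cost of the negative effects listed on the slide.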

Slide 13: Experiments
- 8 CMPs, 4 out-of-order cores each = 32 cores
  - private L2 caches
- on-chip bus interconnect
- off-chip 2D torus interconnect with embedded unidirectional ring
- per node predictors: latency of 3 processor cycles
- sesc simulator (sesc.sourceforge.net)
- SPLASH-2, SPECjbb, SPECweb

Slide 14: Execution time
[Figure: normalized execution time for SPLASH-2, SPECjbb and SPECweb; bars: Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, Exact]
- the fastest of all algorithms is SupersetAgg
- performance of most flexible snooping algorithms is similar to Eager

Slide 15: Miss service energy
[Figure: normalized energy consumption for SPLASH-2, SPECjbb and SPECweb; bars: Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, Exact; one value label reads 3.22]
- SupersetCon is least energy-hungry algorithm
- algorithms that eagerly forward messages use more energy

Slide 16: Most cost-effective algorithms
[Figures: normalized execution time and normalized energy consumption for SPLASH-2, SPECjbb and SPECweb; bars: Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, Exact; one energy value label reads 3.22]
- high performance: Superset Aggressive
  - faster than Eager at lower energy consumption
- energy conscious: Superset Conservative
  - slightly slower than Eager at much lower energy consumption

Slide 17: Most cost-effective algorithms
- high performance scheme: Superset Aggressive
- energy conscious scheme: Superset Conservative
- each is compared to the fastest state-of-the-art scheme (Eager) on performance and energy consumption
- can be combined by only changing forwarding policy

Slide 18: Conclusions
- proposed flexible snooping, a family of adaptive protocols for embedded rings
- two chosen protocols
  - high performance: Superset Aggressive
  - energy conservation: Superset Conservative
  - can be selected dynamically
- embedded-ring protocols more attractive

Slide 19: Arch map
- Google: architecture conference map (1st hit)
