Building Expressive, Area-Efficient Coherence Directories
Lei Fang, Peng Liu, and Qi Hu (Zhejiang University)
Michael C. Huang (University of Rochester)
Guofan Jiang (IBM)

Motivation
- Technology scaling has steadily increased the number of cores in a mainstream CMP.
- Snoop-based protocols generate too much traffic, which degrades performance.
- Directory-based approaches will increasingly be seen as serious candidates for on-chip coherence.
- The directory occupies significant area, which grows as the number of processors increases.

2-D array
[1] A. Agarwal, “An Evaluation of Directory Schemes for Cache Coherence,” ISCA 1988.
[2] A. Gupta, “Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes,” ICPP 1990.
[3] D. Sanchez, “SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding,” HPCA 2012.
[4] B. Cuesta, “Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks,” ISCA 2011.
[5] A. Moshovos, “RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence,” ISCA 2005.

Outline
- Motivation
- Hybrid representation (HR)
- Multi-granular tracking (MG)
- Experimental analysis
- Conclusion

Hybrid representation
- Prior work has observed that most cache lines have only a small number of sharers.
- A subtle but important difference: many directory entries track only one sharer.
- The simulation uses a 16-way CMP with an 8-way set-associative directory cache. About 99% of sets have two or fewer entries tracking multiple sharers.

Implementation of hybrid representation
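The storage argument behind hybrid representation can be sketched with a little arithmetic. The snippet below is only an illustration under stated assumptions: a vector entry is taken to cost one bit per core, a pointer entry is taken to cost ceil(log2(cores)) bits, and tag and state bits are ignored. The function names are hypothetical and do not come from the slides.

```python
from math import ceil, log2


def sharer_bits_per_set(cores, ways, vector_ways):
    """Sharer-field bits in one directory set (assumed cost model).

    vector_ways entries hold a full bit vector (one bit per core);
    the remaining entries hold a single pointer to one sharer.
    """
    pointer_bits = ceil(log2(cores))
    return vector_ways * cores + (ways - vector_ways) * pointer_bits


def sharer_field_reduction(cores, ways=8, vector_ways=2):
    """Ratio of all-vector sharer storage to hybrid sharer storage."""
    all_vector_bits = ways * cores
    return all_vector_bits / sharer_bits_per_set(cores, ways, vector_ways)


if __name__ == "__main__":
    for cores in (16, 64):
        bits = sharer_bits_per_set(cores, 8, 2)
        print(cores, bits, round(sharer_field_reduction(cores), 2))
```

With 2 vector ways in an 8-way set, this counts 56 sharer bits per set for 16 cores and 164 for 64 cores, against 128 and 512 for all-vector sets. The ratios cover the sharer field only, so they are not expected to match the area reductions reported in the results, which presumably also account for tag and state storage.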

Multi-granular tracking
- Prior work identifies region-level sharing patterns and avoids tracking private or read-only regions.
- We exploit a consequence of private pages and similar patterns: consecutive blocks often have the same access pattern.
- We use a single region entry to track an entire region.

Implementation of multi-granular tracking
- Region entry: blocks with a similar pattern.
- Line entry: exceptional blocks.
- Simple implementation:
  - Start with a region entry;
  - Use a line entry for each exceptional block.

Hardware support
- A grain-size bit distinguishes region entries from line entries.
- Line entries are indexed so that they align with their region entry.
- The region entry and the line entries for the same region reside in the same set.
- When both are found, the line entry takes priority.
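The lookup rule described on this slide can be sketched as follows. This is a minimal software model, not the authors' hardware: the `Entry` type, the `set_index` and `lookup` helpers, and the choice to index by region number are assumptions made for illustration. The constants reuse the 128-set slice and the region size of 16 that appear elsewhere in the talk; the 8-way capacity limit and replacement are not modeled.

```python
from dataclasses import dataclass, field

BLOCKS_PER_REGION = 16  # assumed: region size of 16 blocks
NUM_SETS = 128          # assumed: one 128-set directory slice


@dataclass
class Entry:
    is_region: bool     # grain-size bit: True = region entry, False = line entry
    tag: int            # region number for region entries, block number otherwise
    sharers: set = field(default_factory=set)


def set_index(block):
    # Index by region number so a region entry and the line entries
    # of the same region always fall into the same set.
    return (block // BLOCKS_PER_REGION) % NUM_SETS


def lookup(directory, block):
    """Return the entry tracking `block`, preferring a line entry."""
    region = block // BLOCKS_PER_REGION
    region_hit = None
    for entry in directory[set_index(block)]:
        if not entry.is_region and entry.tag == block:
            return entry            # line entry takes priority
        if entry.is_region and entry.tag == region:
            region_hit = entry      # remember it, keep scanning for a line entry
    return region_hit


if __name__ == "__main__":
    directory = [[] for _ in range(NUM_SETS)]
    directory[set_index(48)].append(Entry(True, 3, {0}))       # region 3, private to core 0
    directory[set_index(53)].append(Entry(False, 53, {0, 2}))  # block 53 is the exception
    print(lookup(directory, 48).sharers, lookup(directory, 53).sharers)
```

Blocks 48 and 53 both belong to region 3, so they map to the same set; block 53 resolves to its line entry while block 48 falls back to the region entry.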

Sizing of regions
- A larger region size creates more compact tracking when the region is homogeneous.
- It can waste more space when the actual extent of a homogeneous sharing pattern is smaller than the region.
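The trade-off above can be illustrated by counting entries for a toy trace. The sketch below is hypothetical: it assumes every aligned region gets one region entry holding its most common sharing pattern and that each block deviating from that pattern needs its own line entry. The slides do not specify how the region pattern is chosen, so the majority rule is an assumption, as is the function name.

```python
from collections import Counter


def entries_needed(patterns, region_size):
    """Directory entries needed to track `patterns` (one label per block).

    Assumed policy: one region entry per aligned region, holding the
    region's most common pattern, plus one line entry per block that
    deviates from it.
    """
    total = 0
    for start in range(0, len(patterns), region_size):
        region = patterns[start:start + region_size]
        _, common_count = Counter(region).most_common(1)[0]
        total += 1 + (len(region) - common_count)
    return total


if __name__ == "__main__":
    homogeneous = ["A"] * 32
    alternating = (["A"] * 4 + ["B"] * 4) * 4  # pattern changes every 4 blocks
    for size in (4, 16):
        print(size, entries_needed(homogeneous, size), entries_needed(alternating, size))
```

For 32 blocks with a single pattern, a region size of 16 needs 2 entries against 8 for a region size of 4. When the pattern changes every 4 blocks, the region size of 16 needs 18 entries against 8, which is the space waste the second bullet describes.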

System setup
Processor core
  Fetch/Decode/Commit: 4 / 4 / 4
  ROB: 64
  Issue Q/Reg. (int, fp): (32, 32) / (64, 64)
  LSQ (LQ, SQ): 32 (16, 16), 2 search ports
  Branch predictor: Bimodal + Gshare
    Gshare: 8K entries, 13-bit history
    Bimodal/Meta/BTB: 4K / 8K / 4K (4-way) entries
  Br. mispred. penalty: at least 7 cycles
Memory hierarchy
  L1 D cache (private): 16KB, 2-way, 64B, 2 cycles, 2 ports
  L1 I cache (private): 32KB, 2-way, 64B, 2 cycles
  L2 cache (shared): 256KB slice, 8-way, 64B, 15 cycles, 2 ports
  Directory cache: 128-set slice, 8-way, 15 cycles, 2 ports
  Intra-node fabric delay: 3 cycles
  Main memory: at least 250 cycles, 8 memory controllers
  Network packets: flit size 72 bits; data: 5 flits, meta: 1 flit
  NoC interconnect: 4 VCs; 2-cycle router; buffer: 5×12 flits; wire delay: 1 cycle per hop
- The simulator is based on SimpleScalar with extensive modifications.
- The directory protocol models all stable and transient states.
- Multi-threaded applications include SPLASH-2, PARSEC, em3d, jacobi, mp3d, shallow, and tsp.

Experimental results of hybrid representation
- Ratio of vector entries: giving 25% of the entries a vector increases cache misses by 0.4%.
- The figure shows normalized performance with 2 vector entries per 8-way set in a 16-way CMP. The area reduction is 1.3X. The average degradation is less than 0.5%.
- For a 64-way CMP, the area reduction becomes 2X with little performance impact.

Comparison for hybrid representation
HR is compared with other schemes in a 64-way CMP. Table columns: area reduction, increment of network packets (%), increment of execution time (%). Only the area-reduction values are present in the transcript.
  Scheme     Area reduction
  HR         2X
  LP [1]     1.8X
  LP+HR      2.5X
  CV [2]     1.8X
  CV+HR      2.5X
  SCD [3]    2.1X
  SCD+HR     2.6X
- HR outperforms other schemes and causes negligible degradation.
- HR is orthogonal to other schemes.

Experimental results of multi-granular tracking
- Sizing of regions: a region size of 16 achieves the best performance.
- Impact on performance as the size of the directory shrinks (figure labels: 2.4%, 1.6%, 5.9%).

Comparison for multi-granular tracking
- Page-bypassing:
  - Identify the pages with the aid of the TLB and OS;
  - Avoid tracking private or read-only pages.
- Impact of page-bypassing, MG, and page-bypassing + MG.

Combination of HR and MG
- Since the two techniques work on different dimensions, they can be combined in a straightforward manner.
- In a directory cache with multi-granular tracking, the sharer list can be implemented in either pointer or vector format, as in hybrid representation.
- We implement the combination of HR and MG in a 16-way CMP. The area reduction is 10X and the performance impact is about 1.2%.

Conclusion
- We have proposed an expressive, area-efficient directory.
- Two techniques:
  - HR: reduces the size of each directory entry.
  - MG: reduces the number of directory entries.
- Simple hardware support, without any OS or software support.
- When the two techniques are combined, directory storage can be reduced by more than an order of magnitude with almost negligible performance impact.
