
Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures
Lluc Álvarez, Lluís Vilanova, Miquel Moretó, Marc Casas, Marc Gonzàlez, Xavier Martorell, Nacho Navarro, Eduard Ayguadé, Mateo Valero
Discussion Lead By: Vijay Thiruvengadam

Introduction 2/23
Traditional hardware caching is a problem in future manycores
Hybrid memory hierarchies, as in HPC accelerators (e.g., GPUs), trade off power efficiency, scalability and programmability
– Caches: good programmability, but poor power efficiency and scalability
– Scratchpad memories: good power efficiency and scalability, but poor programmability

Introduction 3/23
Our goal
– Introduce SPMs alongside the L1 cache: advantages in power and scalability
– Combine compiler, runtime and hardware techniques to manage the SPMs and the coherence between the SPMs and regular memory, with minimal programmer involvement
Overview of the solution
– The compiler transforms code based on programmer annotations and compiler analysis
– Memory accesses are split into regular (GM) accesses, scratchpad (SPM) accesses, and potentially aliased (guarded GM) accesses
– Hardware intercepts guarded accesses and redirects them to the SPM when they alias with SPM data
[Figure: clustered manycore, each core with an L1 cache and an SPM, connected through a cluster interconnect to a shared L2 and DRAM]

Hybrid Memory Hierarchy 4/23
Code transformation
– Programmer annotates code suitable for transformation
– Compiler applies tiling transformations: strided accesses are mapped to the SPM, random accesses are mapped to the cache hierarchy

Original loop (strided: a, b; random: c, ptr):

for (i=0; i<N; i++) {
  a[i] = b[i];
  c[b[i]] = 0;
  ptr[a[i]]++;
}

Transformed loop (control / work / synchronization):

i = 0;
while (i < N) {
  MAP(&a[i], _a, iters, tags);
  MAP(&b[i], _b, iters, tags);
  n = (i+iters > N) ? N : i+iters;
  SYNCH(tags);
  for (_i = 0; i < n; _i++, i++) {
    _a[_i] = _b[_i];
    c[_b[_i]] = 0;
    ptr[_a[_i]]++;
  }
}

[Figure: core with L1 and SPM; a and b are served from the SPM, c and ptr from the cache hierarchy]

Coherence Problem 5/23
SPMs are not coherent with the cache hierarchy
– Invalid results if strided and random accesses alias (see the sketch below)
Very challenging problem for the compiler
– Relies on alias analysis
Restrictive or inefficient prior solutions
– Programmer identifies private data and applies the code transformations by hand
– Prior work by the authors proposes a similar solution, but restricts each core to accessing only its own SPM
Our solution removes these restrictions
– The compiler generates code and marks accesses that might be aliased
– Hardware intercepts these potentially aliased accesses and resolves them
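To make the hazard concrete, here is a minimal C sketch (not from the slides) of how an aliased random access can observe stale data once a strided array has been copied into the SPM; only the array names follow the running example, everything else is illustrative.

/* Running example from the slides: a[] and b[] are strided and get copied
 * into the SPM, while ptr may point anywhere. If ptr happens to alias a[],
 * the cached access and the SPM copy diverge, which is the coherence hazard
 * that guarded accesses are meant to catch.                                 */
void kernel(int *a, int *b, int *ptr, int N) {
    for (int i = 0; i < N; i++) {
        a[i] = b[i];        /* after tiling: this store goes to the SPM copy _a */
        ptr[a[i]]++;        /* if ptr aliases a, this cached read-modify-write  */
                            /* can see a stale value of a[] unless the access   */
                            /* is diverted to the SPM copy                      */
    }
}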

Outline 6/23
Introduction
– Hybrid memory hierarchy
– Coherence problem
Hardware-software coherence protocol
– Compiler support
– Hardware design
Evaluation
– Experimental framework
– Comparison with cache hierarchies
Conclusions

Hardware-Software Coherence Protocol 7/23
Basic idea
– Avoid maintaining two coherent copies of the data
– Ensure the valid copy is always accessed
The compiler detects potentially incoherent accesses
– Emits guarded memory instructions
Hardware diverts them to the valid copy of the data
– A distributed hardware directory tracks the contents of the SPMs
– A hierarchy of filters tracks data not mapped to any SPM on any core

Compiler Support 8/23
Step 1: Classification of memory references
– Strided accesses: a, b
– Random accesses: c
– Potentially incoherent accesses: ptr

for (i=0; i<N; i++) {
  a[i] = b[i];
  c[b[i]] = 0;
  ptr[a[i]]++;
}

[Figure: a and b mapped to the SPM, c served through the L1 cache, ptr checked against the directory]

Compiler Support 9/23
Step 2: Code transformation
– Only for strided accesses: apply tiling and change the memory references (a sketch of possible MAP/SYNCH primitives follows the code)

Original loop:

for (i=0; i<N; i++) {
  a[i] = b[i];
  c[b[i]] = 0;
  ptr[a[i]]++;
}

Transformed loop (control / work / synchronization):

i = 0;
while (i < N) {
  MAP(&a[i], _a, iters, tags);
  MAP(&b[i], _b, iters, tags);
  n = (i+iters > N) ? N : i+iters;
  SYNCH(tags);
  for (_i = 0; i < n; _i++, i++) {
    _a[_i] = _b[_i];
    c[_b[_i]] = 0;
    ptr[_a[_i]]++;
  }
}
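The slides use MAP and SYNCH without defining them. Below is a minimal sketch of what such runtime primitives could look like, assuming a simple asynchronous DMA interface (dma_copy_async, dma_wait); these names and signatures are assumptions, not part of the slides.

#include <stddef.h>

/* Hypothetical DMA interface; names are assumptions, not from the slides.   */
extern int  dma_copy_async(void *dst, const void *src, size_t bytes); /* returns a tag bit        */
extern void dma_wait(int tags);                                       /* waits for tagged copies  */

/* MAP: start an asynchronous copy of one tile (iters elements) of a global
 * array into a per-core SPM buffer, recording the DMA tag for a later SYNCH.
 * A real implementation would also update the core's SPMDir and invalidate
 * the filters (see slide 14); that part is omitted here.                     */
static void MAP(void *global_addr, void *spm_buf, int iters, int *tags) {
    *tags |= dma_copy_async(spm_buf, global_addr, (size_t)iters * sizeof(int));
}

/* SYNCH: block until all transfers recorded in 'tags' have completed. */
static void SYNCH(int *tags) {
    dma_wait(*tags);
    *tags = 0;
}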

Compiler Support 10/23
Step 3: Code generation
– Guarded memory instructions (gld/gst) are emitted for the potentially incoherent accesses (a C-level view with hypothetical intrinsics follows the assembly)

i = 0;
while (i < N) {
  MAP(&a[i], _a, iters, tags);
  MAP(&b[i], _b, iters, tags);
  n = (i+iters > N) ? N : i+iters;
  SYNCH(tags);
  for (_i = 0; i < n; _i++, i++) {
    // _a[_i] = _b[_i];
    ld  _b(,_i,4), r1
    st  r1, _a(,_i,4)
    // c[_b[_i]] = 0;
    mv  #0, r3
    st  r3, c(,r1,4)
    // ptr[_a[_i]]++;
    gld ptr(,r1,4), r2
    inc r2, r2
    gst r2, ptr(,r1,4)
  }
}
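As a source-level illustration of the same step, the guarded accesses can be pictured as compiler-inserted intrinsics around the unresolved reference. __guarded_load and __guarded_store below are hypothetical names used only for illustration; in the evaluated system the guarded accesses are encoded with an unused x86 instruction prefix (see the experimental framework slide).

/* Hypothetical intrinsics, for illustration only.                           */
extern int  __guarded_load(int *addr);
extern void __guarded_store(int *addr, int value);

/* Conceptual C view of one tile of the generated work loop: regular accesses
 * for the SPM-mapped (_a, _b) and cache-mapped (c) data, guarded accesses
 * for ptr, which the alias analysis could not resolve.                       */
static void work_tile(int *_a, int *_b, int *c, int *ptr, int tile_len) {
    for (int _i = 0; _i < tile_len; _i++) {
        int idx = _b[_i];                   /* regular SPM load               */
        _a[_i] = idx;                       /* regular SPM store              */
        c[idx] = 0;                         /* regular cached store           */
        int t = __guarded_load(&ptr[idx]);  /* diverted to an SPM if aliased  */
        __guarded_store(&ptr[idx], t + 1);
    }
}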

Hardware Design 11/23
Distributed hardware directory (SPMDir)
– One directory CAM per core
– Each core tracks the contents of its own SPM
– Maps a regular (global) address to an SPM address (an illustrative entry layout is sketched below)
Filter
– One filter CAM per core
– Tracks data known not to be mapped to any SPM
Directory of filters (FilterDir)
– Tracks the contents of all filters
– Located at the L2 cache shared by all cores
– Given a regular address, returns the list of cores that hold it in their local filters
[Figure: core with TLB, L1D, SPM, SPMDir and Filter; the FilterDir sits next to the cache directory at the shared L2]
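A minimal data-structure sketch of what the per-core SPMDir and filter entries could hold, based only on the descriptions above; field names and widths are assumptions, not taken from the slides.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative layout of one SPMDir entry: translates a mapped global-memory
 * region to its location in the local SPM.                                   */
typedef struct {
    bool     valid;
    uint64_t global_base;   /* base address of the mapped tile in global memory */
    uint64_t size;          /* size of the mapped tile in bytes                  */
    uint32_t spm_offset;    /* where the tile lives inside the local SPM         */
} spmdir_entry_t;

/* Illustrative layout of one filter entry: records that a region is known NOT
 * to be mapped to any SPM, so guarded accesses to it can go straight to the
 * cache hierarchy without asking the FilterDir.                               */
typedef struct {
    bool     valid;
    uint64_t global_base;
    uint64_t size;
} filter_entry_t;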

Hardware Design 12/23
Strided accesses
[Figure: strided accesses are served directly by the local SPM, without involving the Filter or FilterDir]

Hardware Design 13/23
Random accesses
[Figure: random accesses are served by the regular cache hierarchy (TLB and L1D), bypassing the SPM structures]

Hardware Design 14/23
Potentially incoherent accesses, four cases (see the sketch below)
– No mapping in any SPM (filter hit): served by the cache hierarchy
– Mapping in the local SPM (SPMDir hit): redirected to the local SPM
– Mapping in a remote SPM: redirected to the remote SPM
– No mapping, but filter miss: the FilterDir is consulted and the local filter is updated
When data is mapped to some SPM
– Update the SPMDir
– Invalidate the filters
[Figure: lookup flow through the local SPMDir and Filter, with hits and misses escalating to the FilterDir and, if needed, to a remote core's SPM]
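A pseudo-C sketch of how a guarded load could be resolved along the four cases above. The lookup order (local SPMDir, then local filter, then the shared L2-level structures) comes from the slides; the helper names, types, and the details of the remote resolution are assumptions for illustration only.

#include <stdint.h>
#include <stdbool.h>

typedef struct core core_t;                                     /* opaque, hypothetical */
typedef struct { bool mapped; int owner; uint64_t spm_addr; } remote_t;

/* Hypothetical lookup helpers standing in for the hardware structures. */
extern bool     spmdir_lookup(core_t *c, uint64_t gaddr, uint64_t *spm_addr);
extern bool     filter_lookup(core_t *c, uint64_t gaddr);
extern void     filter_insert(core_t *c, uint64_t gaddr);
extern remote_t l2_resolve(uint64_t gaddr);
extern uint64_t spm_read(core_t *c, uint64_t spm_addr);
extern uint64_t remote_spm_read(int owner, uint64_t spm_addr);
extern uint64_t cache_read(core_t *c, uint64_t gaddr);

/* Resolution of a guarded load, following the four cases listed above. */
uint64_t guarded_load(core_t *core, uint64_t gaddr) {
    uint64_t spm_addr;

    /* Mapping in the local SPM: SPMDir hit, read the SPM copy. */
    if (spmdir_lookup(core, gaddr, &spm_addr))
        return spm_read(core, spm_addr);

    /* No mapping in any SPM: filter hit, use the regular cache hierarchy. */
    if (filter_lookup(core, gaddr))
        return cache_read(core, gaddr);

    /* Filter miss: consult the shared L2-level structures. If the data is
     * mapped in a remote SPM the access is forwarded there; otherwise it is
     * served by the cache hierarchy and the local filter is updated.        */
    remote_t r = l2_resolve(gaddr);
    if (r.mapped)
        return remote_spm_read(r.owner, r.spm_addr);
    filter_insert(core, gaddr);
    return cache_read(core, gaddr);
}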

Outline 15/23
Introduction
– Hybrid memory hierarchy
– Coherence problem
Hardware-software coherence protocol
– Compiler support
– Hardware design
Evaluation
– Experimental framework
– Comparison with cache hierarchies
Conclusions

Experimental Framework 16/23
gem5
– x86, 64 OoO cores
– L1 32KB, SPM 32KB, L2 256KB per core
McPAT
– 22nm, clock gating
NAS benchmarks
– CG, EP, FT, IS, MG, SP
– Tiling transformations applied by hand
Potentially incoherent accesses
– Identified from the GCC alias analysis report
– Marked with an unused x86 instruction prefix
Simulated system parameters:
– Cores: 64 cores, OoO, 6-instruction wide, 2 GHz
– Pipeline front end: 13 cycles. 4-way BTB, 4K entries; RAS, 32 entries. Branch predictor: 4K selector, 4K G-share, 4K bimodal
– Execution: ROB 160 entries. IQ 96 entries. LQ/SQ 48/32 entries. 3 INT ALU, 3 FP ALU, 3 LD/ST units. 256/256 INT/FP register file. Full bypass
– L1 I-cache: 2 cycles, 32KB, 4-way, pseudoLRU
– L1 D-cache: 2 cycles, 32KB, 4-way, pseudoLRU, stride prefetcher
– L2 cache: shared unified NUCA, 16MB, sliced 256KB/core, 15 cycles, 16-way, pseudoLRU
– Cache coherence: MOESI. Distributed 4-way cache directory, 64K entries
– NoC: mesh. Link 1 cycle, router 1 cycle
– SPM: 2 cycles, 32KB
– DMAC: command queue 32 entries in-order; bus request queue 512 entries in-order
– SPMDir: 32 entries
– Filter: 48 entries, fully associative, pseudoLRU
– FilterDir: distributed, 4K entries, fully associative, pseudoLRU

Comparison with Cache Hierarchies 17/23
Performance
– Hybrid memory hierarchy (32KB L1 + 32KB SPM)
– Cache hierarchy (64KB L1)
[Figure: speedup of the hybrid memory hierarchy over the cache hierarchy; annotated values: 14%, 12%, 22%, 3%]

Comparison with Cache Hierarchies 18/23
NoC traffic
– Hybrid memory hierarchy (32KB L1 + 32KB SPM)
– Cache hierarchy (64KB L1)
[Figure: NoC traffic reduction of the hybrid memory hierarchy over the cache hierarchy; annotated values: 23%, 20%, 34%, 2%]

Comparison with Cache Hierarchies 19/23
Energy consumption
– Hybrid memory hierarchy (32KB L1 + 32KB SPM)
– Cache hierarchy (64KB L1)
[Figure: energy consumption reduction of the hybrid memory hierarchy over the cache hierarchy; annotated values: 15%, 13%, 24%, -3%]

Outline 20/23
Introduction
– Hybrid memory hierarchy
– Coherence problem
Hardware-software coherence protocol
– Compiler support
– Hardware design
Evaluation
– Experimental framework
– Comparison with cache hierarchies
Conclusions

Conclusions 21/23
Hybrid memory hierarchy
– Attractive solution for future manycores
– But introduces a coherence problem
Hardware-software coherence protocol
– Straightforward compiler support
– Simple hardware design with low overheads
– The hybrid memory hierarchy can be programmed with shared memory programming models
The hybrid memory hierarchy outperforms cache hierarchies
– Average speedup of 14%
– Average NoC traffic reduction of 23%
– Average energy consumption reduction of 15%

Thanks for your attention! Questions? Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures