Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs (Javier Lira, Carlos Molina, Antonio González)

Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs Javier Lira ψ Carlos Molina ф Antonio González λ λ Intel Barcelona Research Center Intel Labs - UPC Barcelona, Spain ф Dept. Enginyeria Informàtica Universitat Rovira i Virgili Tarragona, Spain ψ Dept. Arquitectura de Computadors Universitat Politècnica de Catalunya Barcelona, Spain Euro-Par 2009, Delft (The Netherlands) - August 27, 2009

Outline
- Introduction
- Methodology
- Last Bank
- Characterization of replacements in NUCA
- Last Bank Optimizations
- Conclusions

Introduction
CMPs have emerged as a dominant paradigm in system design:
1. They sustain performance improvements while reducing power consumption.
2. They exploit thread-level parallelism.
Commercial CMPs are currently available, and they incorporate larger, shared last-level caches, in which wire delay is a key constraint.

NUCA
Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1]. NUCA divides a large cache into smaller, faster banks. Banks close to the cache controller have lower latencies than more distant banks.
[1] C. Kim, D. Burger and S.W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02
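To make the latency asymmetry concrete, here is a minimal sketch (not from the paper or its simulator) that models one-way NUCA access time as bank latency plus per-hop network delay, using the timing parameters from the methodology slide; the mesh coordinates and Manhattan-distance routing are illustrative assumptions.

```python
# Illustrative NUCA latency model (a sketch, not the authors' simulator).
# Timing parameters follow the methodology slide: 2-cycle bank access,
# 1-cycle router delay, 1-cycle on-chip wire delay per hop.
BANK_LATENCY = 2   # cycles to access a bank
ROUTER_DELAY = 1   # cycles per router traversal
WIRE_DELAY = 1     # cycles per wire segment

def nuca_access_latency(controller_xy, bank_xy):
    """One-way cycles to reach a bank in a mesh and read it.

    Assumes dimension-ordered routing, so the hop count is the
    Manhattan distance between controller and bank coordinates.
    """
    hops = (abs(bank_xy[0] - controller_xy[0])
            + abs(bank_xy[1] - controller_xy[1]))
    return hops * (ROUTER_DELAY + WIRE_DELAY) + BANK_LATENCY

# A bank next to the controller is far cheaper than a distant corner
# of a hypothetical 16x16 mesh:
near = nuca_access_latency((0, 0), (0, 1))    # 4 cycles
far = nuca_access_latency((0, 0), (15, 15))   # 62 cycles
```

This asymmetry is why NUCA placement and movement policies try to keep frequently accessed data in banks near the requesting core.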


Methodology
Simulation tools: Simics + GEMS, CACTI v6.0, PARSEC Benchmark Suite
- Number of cores: 8
- Core processor: out-of-order SPARCv9
- Main memory size: 4 GBytes
- Memory bandwidth: 512 Bytes/cycle
- L1 cache latency: 3 cycles
- NUCA bank latency: 2 cycles
- Router delay: 1 cycle
- On-chip wire delay: 1 cycle
- Main memory latency: 350 cycles (from core)
- Private L1 data caches: 8 KBytes
- Private L1 instruction caches: 8 KBytes
- Shared L2 NUCA cache: 1 MByte, 256 banks

Baseline NUCA cache architecture [2]: 8 cores, 256 banks.
[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04


Last Bank
Data movements concentrate the most accessed data in a few banks, so data replacements in these hot banks are unfair.

Last Bank
An extra bank is included in the NUCA cache. It acts as a victim cache, but it is not fully associative. It gives evicted data a second chance to stay in the NUCA.
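The Last Bank can be sketched as a small set-associative victim structure that catches NUCA evictions and is probed on NUCA misses. The set count, associativity, and LRU bookkeeping below are illustrative assumptions; the slides only specify that the extra bank is not fully associative.

```python
# Hypothetical sketch of the Last Bank idea: a single extra
# set-associative bank that catches blocks evicted from the NUCA,
# giving them a second chance before they fall back to main memory.
from collections import OrderedDict

class LastBank:
    def __init__(self, num_sets=128, ways=8):   # sizes are illustrative
        self.num_sets = num_sets
        self.ways = ways
        # one LRU-ordered mapping of {address: block} per set
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _index(self, addr):
        return addr % self.num_sets

    def insert(self, addr, block):
        """Called when the NUCA evicts a block."""
        s = self.sets[self._index(addr)]
        if addr in s:
            s.move_to_end(addr)           # refresh LRU position
            return
        if len(s) >= self.ways:
            s.popitem(last=False)         # LRU victim falls to memory
        s[addr] = block

    def lookup(self, addr):
        """On a NUCA miss: a hit here reinserts the block into the NUCA."""
        s = self.sets[self._index(addr)]
        if addr in s:
            return s.pop(addr)            # block moves back into the NUCA
        return None
```

Because the bank is set-associative rather than fully associative, conflicting evicted blocks can displace each other even when other sets have free ways, which is cheaper in hardware than a fully associative victim cache.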

Performance benefits are restricted by the Last Bank size, but the performance potential is significant. We analyse reused addresses to find improvement points.


Characterization of replacements in NUCA
- How many evicted addresses are later reused?
- How many cycles does a reused address usually spend out of the NUCA before being reinserted?
- Where were reused addresses located within the NUCA just before being evicted?
- What action motivated the eviction of reused addresses from the NUCA?

Reused address statistics
Nearly 70% of evicted addresses return to the NUCA cache. Most of the reused addresses return to the NUCA at least twice.

Time between Eviction and Reinsertion
Nearly 30% of evicted addresses return in less than 100,000 cycles. In blackscholes, almost 50% of reused addresses return to the NUCA in less than 1,000 cycles.
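Gaps like these can be collected with simple bookkeeping in a simulator. This sketch is illustrative: the hook names on_eviction/on_reinsertion are hypothetical, but the cycle thresholds match the bins quoted on the slide.

```python
# Sketch of measuring the eviction-to-reinsertion gap per address.
# Record the cycle at which each address leaves the NUCA; if it comes
# back, bin the elapsed cycles (thresholds follow the slide's bins).
evicted_at = {}
gap_histogram = {"<1K": 0, "<100K": 0, ">=100K": 0}

def on_eviction(addr, cycle):
    """Hypothetical simulator hook: a block leaves the NUCA."""
    evicted_at[addr] = cycle

def on_reinsertion(addr, cycle):
    """Hypothetical simulator hook: a previously evicted block returns."""
    if addr in evicted_at:
        gap = cycle - evicted_at.pop(addr)
        if gap < 1_000:
            gap_histogram["<1K"] += 1
        elif gap < 100_000:
            gap_histogram["<100K"] += 1
        else:
            gap_histogram[">=100K"] += 1
```

A large population in the small-gap bins is exactly the behaviour a victim structure like the Last Bank can exploit.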

Last location within the NUCA
Most reused addresses were evicted from local banks. Most addresses replaced from central banks are not reused later.


Selective Last Bank
Target: reduce pollution in the Last Bank. This mechanism selects which evicted data blocks are stored in the Last Bank. The implemented Selective Last Bank stores data blocks only if they were evicted from a local bank; otherwise, it sends them back to main memory.
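The selective filter reduces to a single check on the eviction path. In this sketch, on_nuca_eviction and the dict-based last_bank/memory stand-ins are hypothetical names, but the policy follows the slide: only blocks evicted from a local bank get a second chance, since the characterization showed those are the ones most likely to be reused.

```python
# Illustrative sketch of Selective Last Bank (not the authors' code).
def on_nuca_eviction(addr, block, evicted_from_local_bank,
                     last_bank, memory):
    """Route a block evicted from the NUCA, filtering Last Bank inserts."""
    if evicted_from_local_bank:
        last_bank[addr] = block    # second chance in the Last Bank
    else:
        memory[addr] = block       # bypass: straight back to main memory

last_bank, memory = {}, {}
on_nuca_eviction(0x100, "A", True, last_bank, memory)
on_nuca_eviction(0x200, "B", False, last_bank, memory)
# last_bank now holds only the local-bank victim 0x100;
# the central-bank victim 0x200 went directly to memory.
```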

LRU Prioritising Last Bank
Target: keep reused addresses in the NUCA cache. This modifies the data eviction policy of the NUCA banks: lines that come from the Last Bank are prioritised during the data replacement process. (Slide diagram: each cache line carries a priority field P, initially set to 0.)
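Victim selection under this policy can be sketched as LRU with a per-line priority bit. The list-based set representation and the fallback to plain LRU when every line is prioritised are illustrative assumptions, not details given on the slides.

```python
# Sketch of LRU-prioritising replacement in one NUCA bank set.
def choose_victim(lines):
    """Pick the index of the line to evict from one cache set.

    `lines` is ordered LRU-first; each entry is (tag, priority_bit),
    where the bit is set for lines reinserted from the Last Bank.
    """
    # Prefer the least recently used line that did NOT come from the
    # Last Bank, so prioritised lines survive longer in the NUCA.
    for i, (_tag, prioritised) in enumerate(lines):
        if not prioritised:
            return i
    # Every line is prioritised: fall back to plain LRU (assumption).
    return 0

# The LRU line "a" is protected by its priority bit, so "b" is evicted:
victim = choose_victim([("a", 1), ("b", 0), ("c", 0)])   # 1
```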

Results
Both optimizations increase the performance benefits of Last Bank, but there is still room for improvement. Adaptive filters will be analysed in future work.


Conclusions
Data movements cause unfair replacements in hot banks. Last Bank reduces the access latency of promptly reused addresses and has a large performance potential. Two optimizations are proposed:
- Selective Last Bank: reduces pollution in the Last Bank.
- LRU Prioritising Last Bank: keeps reused addresses in the NUCA cache.

Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs Questions?