The Migration Prefetcher
Javier Lira (Intel-UPC, Spain), Timothy M. Jones (U. of Cambridge, UK), Carlos Molina (URV, Spain), Antonio González (Intel-UPC, Spain)
HiPEAC 2012, Paris (France) – January 23, 2012

• CMPs have become the dominant paradigm.
• They incorporate large shared last-level caches, e.g. 24 MBytes in the Intel® Nehalem, 32 MBytes in the IBM® POWER7 and 32 MBytes in the Tilera® Tile-GX.
• Access latency in large caches is dominated by wire delays.

• NUCA divides a large cache into smaller, faster banks [1].
• The cache access latency is the sum of the routing latency and the bank access latency.
• Banks close to the cache controller have smaller latencies than banks further away.
[1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. ASPLOS’02
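To make the non-uniform latency concrete, here is a minimal sketch (not taken from the slides) that models a NUCA access as routing plus bank access. The per-hop and bank costs reuse the methodology numbers that appear later in the deck (1-cycle router, 1-cycle wire, 4-cycle bank); the mesh layout and Manhattan-distance routing are illustrative assumptions.

```python
# Illustrative only: one NUCA access = request routing + bank access + reply routing.
ROUTER_DELAY = 1   # cycles per router traversal (from the methodology slide)
WIRE_DELAY = 1     # cycles per on-chip link (from the methodology slide)
BANK_LATENCY = 4   # cycles to access one NUCA bank (from the methodology slide)

def nuca_access_latency(controller_xy, bank_xy):
    """Latency seen by the cache controller when accessing a given bank."""
    hops = abs(controller_xy[0] - bank_xy[0]) + abs(controller_xy[1] - bank_xy[1])
    routing = hops * (ROUTER_DELAY + WIRE_DELAY)
    return 2 * routing + BANK_LATENCY   # request and response both cross the network

print(nuca_access_latency((0, 0), (0, 1)))    # a nearby bank: 8 cycles
print(nuca_access_latency((0, 0), (7, 15)))   # far corner of an 8 x 16 bank mesh: 92 cycles
```

The spread between the two calls is exactly the point the slide makes: banks close to the controller are much cheaper to reach than distant ones.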

• In a dynamic NUCA (D-NUCA), data can be mapped to multiple banks; a static NUCA (S-NUCA) maps each address to a single bank.
• Migration allows data to adapt to the application’s behaviour.
• Migration movements are effective, but about 50% of hits still happen in non-optimal banks.

• Introduction
• Methodology
• The Migration Prefetcher
• Analysis of results
• Conclusions

Baseline D-NUCA configuration (8-core CMP) [2]:
• Placement: 16 positions per data.
• Access: partitioned multicast.
• Migration: gradual promotion.
• Replacement: LRU + zero-copy.
[2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO’04
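As a rough sketch of the gradual-promotion policy named above (not the authors' code: the bankset is modeled as a Python list ordered from closest to farthest of the 16 positions a block can occupy, and the helper name is hypothetical), a hit simply swaps the line one position closer to the requesting core:

```python
def gradual_promotion(bankset, hit_index):
    """On a hit, move the block one bank closer to the requesting core.

    bankset: the 16 banks a block may live in, ordered from closest (index 0)
    to farthest; each element holds the block currently stored at that position.
    hit_index: position where the requested block was found.
    Returns the new position of the block.
    """
    if hit_index == 0:
        return 0                                   # already in the optimal bank
    closer = hit_index - 1
    bankset[closer], bankset[hit_index] = bankset[hit_index], bankset[closer]
    return closer
```

Repeated hits therefore drag frequently used blocks towards the requester one step at a time.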

Simulation infrastructure: Simics modelling 8 x UltraSPARC IIIi cores running Solaris 10, with GEMS (Ruby), Garnet and Orion; workloads from PARSEC and SPEC CPU.
Simulation parameters:
• Number of cores: 8 (UltraSPARC IIIi)
• Frequency: 1.5 GHz
• Main memory size: 4 GBytes
• Memory bandwidth: 512 Bytes/cycle
• Private L1 caches: 8 x 32 KBytes, 2-way
• Shared L2 NUCA cache: 8 MBytes, 128 banks
• NUCA bank: 64 KBytes, 8-way
• L1 cache latency: 3 cycles
• NUCA bank latency: 4 cycles
• Router delay: 1 cycle
• On-chip wire delay: 1 cycle
• Main memory latency: 250 cycles (from core)
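Purely as a consistency check of the table above (the derivation is ours, the values are the slide's):

```python
L2_SIZE_BYTES = 8 * 1024 * 1024        # 8 MBytes shared L2 NUCA
BANK_SIZE_BYTES = 64 * 1024            # 64 KBytes per NUCA bank
NUM_BANKS = L2_SIZE_BYTES // BANK_SIZE_BYTES
assert NUM_BANKS == 128                # matches the "128 banks" entry

L1_TOTAL_KBYTES = 8 * 32               # eight private 32-KByte L1 caches
print(NUM_BANKS, L1_TOTAL_KBYTES)      # 128 banks, 256 KBytes of private L1 in total
```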

 Introduction  Methodology  The Migration Prefetcher  Analysis of results  Conclusions 8

• The Migration Prefetcher applies prefetching principles to data migration.
• It is not a traditional prefetcher:
◦ It does not bring data from main memory.
◦ The potential benefits are therefore more limited.
• It requires only simple data correlation.

Working example (figure): the Next Address Table (NAT) records which block was requested after each block; on an access to block A it predicts the next block B, which is then prefetched (migrated) towards the requesting core.
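The working example can be sketched in a few lines. This is a hedged reconstruction rather than the authors' implementation: the class and field names are hypothetical, the 1-bit confidence and the 12 addressable bits come from the next slides, and the learning rule (remember which block followed which) follows the "next block" idea in the example.

```python
ADDRESSABLE_BITS = 12                 # NAT index width, per the configuration slide
NAT_ENTRIES = 1 << ADDRESSABLE_BITS   # 4096 entries per table

class MigrationPrefetcher:
    """Sketch of a Next Address Table (NAT) driven migration prefetcher."""

    def __init__(self):
        # Each entry keeps the predicted next block and a 1-bit confidence.
        self.nat = [{"next": None, "conf": 0} for _ in range(NAT_ENTRIES)]
        self.last_block = None

    def _index(self, block_addr):
        # Indexing with only 12 bits means different blocks can alias to one entry.
        return block_addr & (NAT_ENTRIES - 1)

    def on_nuca_access(self, block_addr, migrate_to_local_bank):
        # 1) Learn: correlate the previous access with the current one.
        if self.last_block is not None:
            entry = self.nat[self._index(self.last_block)]
            if entry["next"] == block_addr:
                entry["conf"] = 1                      # pattern seen again: set confidence
            else:
                entry["next"], entry["conf"] = block_addr, 0
        self.last_block = block_addr

        # 2) Predict: migrate the expected next block towards the requesting core.
        entry = self.nat[self._index(block_addr)]
        if entry["next"] is not None and entry["conf"]:
            migrate_to_local_bank(entry["next"])
```

In this sketch a prefetch only fires once the A → B pattern has been observed twice (the confidence bit is set); whether the real design also acts on low-confidence entries is not stated on the slides.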

• Fraction of prefetch requests that ended up being useful: a single confidence bit is effective; more than one bit is not worthwhile.

• Percentage of prefetch requests submitted with another address’s information (NAT aliasing): with too few addressable bits, about 25% of requests use erroneous information. A NAT with 12 addressable bits takes 232 KBytes in total.

• Percentage of prefetch requests whose target is found in the NUCA cache: predicting the data location based on its last appearance (last responder) provides 50% accuracy; accuracy increases when the local bank is also accessed.
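A minimal sketch of that search scheme, with hypothetical helper names: before migrating the predicted block, the prefetcher checks the bank that answered for it last time and the requesting core's local bank, and otherwise gives up rather than searching the whole NUCA.

```python
def locate_predicted_block(block_addr, last_responder_bank, local_bank, bank_contains):
    """Return the bank believed to hold block_addr, or None if both guesses fail."""
    for bank in (last_responder_bank, local_bank):
        if bank is not None and bank_contains(bank, block_addr):
            return bank
    return None   # prediction failed; how this case is handled is not specified on the slides
```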

• The realistic Migration Prefetcher uses:
◦ 1-bit confidence for data patterns.
◦ A NAT with 12 addressable bits (29 KBytes/table).
◦ Last responder + local bank as the search scheme.
• Total hardware overhead is 264 KBytes.
• Latency: 2 cycles.
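A back-of-the-envelope check of the storage figures above. The per-entry size is inferred from the quoted 29 KBytes per table rather than stated explicitly, and "one table per core" is an assumption based on the 8-core configuration.

```python
ENTRIES_PER_TABLE = 1 << 12                   # 12 addressable bits -> 4096 NAT entries
TABLE_BYTES = 29 * 1024                       # 29 KBytes per table (slide figure)

entry_bits = TABLE_BYTES * 8 // ENTRIES_PER_TABLE
nat_total_kbytes = 8 * TABLE_BYTES // 1024    # assuming one table per core (8 cores)

print(entry_bits)         # 58 bits per entry (inferred, not stated on the slides)
print(nat_total_kbytes)   # 232 KBytes of NAT storage; the 264 KBytes total presumably
                          # includes additional prefetcher state not detailed here
```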

• Introduction
• Methodology
• The Migration Prefetcher
• Analysis of results
• Conclusions


• The Migration Prefetcher achieves overall performance improvements of 4% on average, and up to 17%.
• The NUCA cache is up to 25% faster with the Migration Prefetcher.
• It reduces the NUCA cache latency by 15%, on average.

• This technique does not increase energy consumption.
• The prefetcher introduces extra traffic into the network, but when a prefetch hits it reduces the number of messages significantly.

• Introduction
• Methodology
• The Migration Prefetcher
• Analysis of results
• Conclusions

• Existing migration techniques effectively concentrate the most frequently accessed data in banks that are close to the cores, but about 50% of NUCA hits still occur in non-optimal banks.
• The Migration Prefetcher anticipates data migrations based on past access patterns.
• It reduces the average NUCA latency by 15%, outperforms the baseline configuration by 4% on average, and does not increase energy consumption.

Questions? HiPEAC 2012, Paris (France) – January 23, 2012