
Prefetching Challenges in Distributed Memories for CMPs Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech

Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 2

Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 3

Prefetching Reduces memory latency Brings the data the CPU will require next into a nearer cache Increases the hit ratio Implemented in most commercial processors Erroneous prefetching may produce – Cache pollution – Resource consumption (queues, bandwidth, etc.) – Power consumption 4
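
For illustration, a minimal C++ sketch of a one-block-lookahead (Tagged) prefetcher of the kind referenced later in the talk, using a toy in-memory cache model; the structure and names here are assumptions for illustration, not the exact mechanism evaluated on the slides.

```cpp
#include <cstdint>
#include <unordered_set>

// Sketch of a Tagged (one-block-lookahead) prefetcher on a toy cache model
// (no capacity or eviction modeled).
constexpr uint64_t kBlockBytes = 64;

struct ToyCache {
    std::unordered_set<uint64_t> blocks;  // block numbers currently cached
    std::unordered_set<uint64_t> tagged;  // blocks brought in by a prefetch

    void insertPrefetch(uint64_t blk) { blocks.insert(blk); tagged.insert(blk); }

    // Called on every demand access: prefetch the next block on a miss, or on
    // the first hit to a block that was itself prefetched (the "tag").
    void demandAccess(uint64_t addr) {
        const uint64_t blk = addr / kBlockBytes;
        const bool hit = blocks.count(blk) != 0;
        const bool wasTagged = tagged.erase(blk) != 0;
        if (!hit) blocks.insert(blk);
        if (!hit || wasTagged) {
            const uint64_t next = blk + 1;
            if (!blocks.count(next))
                insertPrefetch(next);  // useful if the CPU touches it soon;
                                       // pollution and wasted energy otherwise
        }
    }
};
```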

Motivation The number of cores on a single chip grows every year Nehalem 4~6 cores Tilera 64~100 cores Intel Polaris 80 cores Nvidia GeForce up to 256 cores 5

Prefetch in CMPs Useful prefetches imply more performance – Avoid network latency – Reduce memory access latency Useless prefetches imply less performance – More power consumption – More NoC congestion – Interference with other cores' requests 6

Prefetch adverse behaviors 7 M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.

Distributed memories 8 [Figure: distribution of the memory address @ across the tiles]

Distributed memories 9 [Figure: eight tiles, TILE 00 through TILE 07; distribution of the memory address @+14]
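
The distribution sketched in the two figures above can be illustrated with a block-interleaved home-tile mapping; the concrete mapping below (low-order block-address bits selecting the tile) is an assumption, not necessarily the one used in the evaluated baseline.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kBlockBytes = 64;
constexpr uint64_t kNumTiles   = 8;   // TILE 00 .. TILE 07 in the figure

// Home tile of an address under simple block interleaving.
uint64_t homeTile(uint64_t addr) {
    return (addr / kBlockBytes) % kNumTiles;
}

int main() {
    // Eight consecutive blocks land on eight different tiles, so each L2
    // bank (and any prefetcher attached to it) sees only a slice of the
    // core's access stream.
    for (uint64_t addr = 0; addr < 8 * kBlockBytes; addr += kBlockBytes)
        std::printf("block @%#llx -> tile %llu\n",
                    (unsigned long long)addr,
                    (unsigned long long)homeTile(addr));
    return 0;
}
```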

Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 10

Prefetch in Distributed Memory Systems: Analysis phase 11 [Figure: distributed L2; an L1 miss reaches one bank, producing distributed access patterns]

Pattern Detection Challenge The memory stream is distributed Each prefetcher observes only a part of the stream Access patterns and correlations are harder to detect Not all prefetchers are affected – Correlation prefetchers are affected: GHB – One-Block-Lookahead prefetchers are not: Tagged 12
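
A rough C++ sketch of GHB-style delta correlation helps show why correlation prefetchers suffer; the simplified structure below (a per-key history instead of the real index table plus circular buffer) is an assumption for illustration.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of GHB-style delta correlation (in the spirit of Nesbit & Smith).
// The key could be a PC or a memory region; the history is unbounded here
// for brevity. With a distributed L2, each bank's instance only records the
// accesses homed on that bank, so the delta pairs it learns differ from the
// pattern of the full per-core stream.
struct GhbPrefetcher {
    std::unordered_map<uint64_t, std::vector<int64_t>> history;  // key -> block addresses

    // Record an access; return a predicted next block, or -1 if no earlier
    // occurrence of the two most recent deltas is found.
    int64_t observe(uint64_t key, int64_t blockAddr) {
        std::vector<int64_t>& h = history[key];
        h.push_back(blockAddr);
        const size_t n = h.size();
        if (n < 4) return -1;
        const int64_t d1 = h[n - 1] - h[n - 2];  // most recent delta
        const int64_t d2 = h[n - 2] - h[n - 3];  // the one before it
        for (size_t i = n - 2; i >= 2; --i) {    // search older history
            if (h[i] - h[i - 1] == d1 && h[i - 1] - h[i - 2] == d2)
                return blockAddr + (h[i + 1] - h[i]);  // replay what followed
        }
        return -1;
    }
};
```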

Prefetch in Distributed Memory Systems: Request generation phase 13 [Figure: distributed L2 and memory; prefetch queues per tile, motivating queue filtering]

Prefetch Queue Filtering Challenge Prefetch requests are queued in distributed queues Independent engines generate the requests Repeated requests can be queued In a centralized queue they would be merged Adverse effects: – Power consumption – Network contention 14
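
A sketch of the filtering a single shared queue can perform, under the assumption of a simple block-address lookup; with one independent queue per tile, the duplicate check below has nothing to match against, so repeated requests travel the NoC anyway.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>

// Sketch of duplicate filtering in a prefetch request queue. A single shared
// queue can merge a new request with a pending one for the same block; with
// one independent queue per tile, the lookup only covers local entries and
// cross-tile duplicates slip through to the network and the memory.
struct PrefetchQueue {
    std::deque<uint64_t> pending;          // block addresses in issue order
    std::unordered_set<uint64_t> inQueue;  // fast duplicate lookup

    // Returns true if enqueued, false if merged with a pending request.
    bool push(uint64_t blockAddr) {
        if (!inQueue.insert(blockAddr).second)
            return false;                  // duplicate filtered: no extra traffic
        pending.push_back(blockAddr);
        return true;
    }

    bool pop(uint64_t& blockAddr) {
        if (pending.empty()) return false;
        blockAddr = pending.front();
        pending.pop_front();
        inQueue.erase(blockAddr);
        return true;
    }
};
```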

Prefetch in Distributed Memory Systems: Evaluation phase 15 [Figure: distributed L2 and memory; an L1 miss (+2) feeds dynamic profiling]

Dynamic Profiling Challenge Prefetch requests are generated in one tile Dynamic profiling information is collected in another tile Profiling in the local tile becomes erroneous Techniques that use this information may misbehave – Filtering – Throttling – Specific prefetching engines 16
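
A sketch of the kind of feedback counters such techniques rely on; the accuracy thresholds and degree range below are illustrative assumptions, and the comments mark where distribution skews the feedback.

```cpp
#include <cstdint>

// Sketch of feedback-directed throttling built on per-tile profiling
// counters; thresholds and the degree range are illustrative values.
struct PrefetchProfile {
    uint64_t issued = 0;  // prefetches sent by this tile's engine
    uint64_t useful = 0;  // prefetched blocks later hit by a demand access

    double accuracy() const { return issued ? double(useful) / issued : 0.0; }

    // In a distributed L2 the demand hit that proves a prefetch useful is
    // often observed on a different tile than the one that issued it, so
    // 'useful' is under- or mis-counted and this decision can be wrong.
    int nextDegree(int degree) const {
        if (accuracy() > 0.75) return degree < 4 ? degree + 1 : degree;
        if (accuracy() < 0.40) return degree > 1 ? degree - 1 : degree;
        return degree;
    }
};
```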

Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 17

Challenge evaluation methodology Three environments to test the challenges Pattern Detection Challenge: Ideal Prefetcher – A prefetcher aware of the entire memory stream – No extra network contention added to the system – No extra power consumed – Requests classified by core identifier to preserve each core's original stream Prefetcher used to test: Global History Buffer 18
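
A small sketch of how such an ideal environment can be modeled: every access is mirrored to the prefetcher out of band and classified by core identifier; the names below are illustrative, not the simulator's actual interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the ideal-prefetcher environment: every L1 miss in the system is
// observed at zero modeled cost (no NoC traffic or energy) and classified by
// core identifier, so each core's original stream is preserved exactly as a
// monolithic L2 would see it.
struct Access { uint64_t coreId; uint64_t blockAddr; };

struct IdealStreamFeeder {
    std::vector<std::vector<uint64_t>> perCoreStream;

    explicit IdealStreamFeeder(std::size_t numCores) : perCoreStream(numCores) {}

    void observe(const Access& a) {
        // A GHB indexed per core (or per PC) can then be trained on the full
        // stream, isolating the pattern detection challenge from the other
        // effects of distribution.
        perCoreStream[a.coreId].push_back(a.blockAddr);
    }
};
```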

Pattern Detection Challenge 19

Challenge evaluation methodology Three environments to test the challenges Prefetch Queue Filtering: Centralized queue – All requests are sent to a centralized queue – Repeated requests are merged and not issued – No extra network contention added to the system – No extra power consumed Prefetcher used to test: Tagged prefetcher 20

Prefetch Queue Filtering Challenge 21

Challenge evaluation methodology 22

Dynamic Profiling Challenge 23

Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 24

Experimental framework gem5 – 64 x86 CPUs – Ruby memory system – L2 prefetchers – MOESI coherence protocol – Garnet network simulator PARSEC benchmarks

Simulation environment 26

Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 27

Pattern Detection Challenge 28

Prefetch Queue Filtering Challenge 29

Dynamic Profiling Challenge 30

Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 31

Facing the challenges 32 There are two main options – Redesign the entire prefetching philosophy – Adapt current techniques to work with DSMs Moreover, there are two main directions – Centralize the information (handicap: increased communication) – Distribute the prefetcher (handicap: distributing it smartly)

Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 33

Conclusions 34 Three challenges when prefetching in DSMs – Pattern Detection Challenge – Prefetch Queue Filtering Challenge – Dynamic Profiling Challenge Directions for future researchers There are no evident solutions for them yet Leaving them unsolved limits prefetch performance

Q & A 35

Prefetching Challenges in Distributed Memories for CMPs Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech