An Accurate and Detailed Prefetching Simulation Framework for gem5 Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture.

Presentation transcript:

An Accurate and Detailed Prefetching Simulation Framework for gem5
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech

Outline
Introduction
Current Ruby prefetching solution
Our solution
Implementation details
Case study
Conclusions

Prefetching
Reduces memory latency
Brings data required by the CPU into a nearer cache
Increases the hit ratio
Implemented in many commercial processors
Erroneous prefetching may cause slowdown
Simulation tools should include this capability

Current Ruby prefetching solution

Our solution

Implementation details
Receives command-line options
Creates and initializes the prefetch engine
Manages communication:
– Controller -> Prefetcher
– Prefetcher -> Prefetch Queue -> Controller
Collects statistics

Implementation details
Accumulates the statistics:
– observed_misses - numMissesObserved
– Cancelled - numDroppedPrefetches
– completed - numPrefetchAccepted
– hit
– in_cache
– late - numPartialHits/numHits
– overflowed
– page_faults - numPagesCrossed
– total - numPrefetchRequested
– unuseful
– useful
– queue_merged_requests
– generated_prefetches_per_train - streams
– numMissedPrefetchBlocks MISS
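Counters like these are typically combined into a few derived metrics. The Python sketch below is illustrative only: the `PrefetchStats` class is hypothetical, its fields mirror a subset of the names on the slide, and the `accuracy` and `late_fraction` formulas are common textbook definitions assumed here, not formulas taken from the framework.

```python
from dataclasses import dataclass


@dataclass
class PrefetchStats:
    """Hypothetical container for a subset of the counters named on the slide."""
    observed_misses: int = 0
    total: int = 0        # prefetches requested
    completed: int = 0    # prefetches accepted
    cancelled: int = 0    # prefetches dropped
    useful: int = 0       # prefetched blocks later used by a demand access
    unuseful: int = 0     # prefetched blocks evicted without being used
    late: int = 0         # useful prefetches still in flight when demanded (assumed)

    def accuracy(self) -> float:
        """Useful prefetches over all resolved prefetches (assumed definition)."""
        resolved = self.useful + self.unuseful
        return self.useful / resolved if resolved else 0.0

    def late_fraction(self) -> float:
        """Share of useful prefetches that arrived late (assumed definition)."""
        return self.late / self.useful if self.useful else 0.0


stats = PrefetchStats(observed_misses=1000, total=400, completed=360,
                      cancelled=40, useful=270, unuseful=90, late=27)
print(f"accuracy={stats.accuracy():.2f} late={stats.late_fraction():.2f}")
```

Both helpers guard against division by zero so that an engine that never issued a prefetch reports 0.0 rather than raising.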

Implementation details
Collects the requests generated by the engine
Checks for page fault errors
Merges repeated requests
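The three duties listed on this slide can be sketched as a small bounded queue. This is a minimal Python illustration under stated assumptions, not the framework's code: the `PrefetchQueue` class, the 4 KiB page size, the 64-byte block size, and the reading of the page check as "reject candidates that leave the triggering access's page" are all assumptions.

```python
from collections import deque

PAGE_SIZE = 4096   # assumed page size in bytes
BLOCK_SIZE = 64    # assumed cache block size in bytes


class PrefetchQueue:
    """Hypothetical bounded FIFO of block-aligned prefetch addresses."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.pending = deque()
        self.merged = 0         # duplicates folded into an existing entry
        self.page_crossed = 0   # candidates rejected for leaving the page
        self.overflowed = 0     # candidates rejected because the queue was full

    def push(self, trigger_addr: int, prefetch_addr: int) -> bool:
        """Try to enqueue a candidate; return True if it was queued."""
        if trigger_addr // PAGE_SIZE != prefetch_addr // PAGE_SIZE:
            self.page_crossed += 1
            return False
        block = prefetch_addr - prefetch_addr % BLOCK_SIZE
        if block in self.pending:
            self.merged += 1
            return False
        if len(self.pending) >= self.capacity:
            self.overflowed += 1
            return False
        self.pending.append(block)
        return True

    def pop(self):
        """Return the oldest queued block address, or None if empty."""
        return self.pending.popleft() if self.pending else None


queue = PrefetchQueue(capacity=2)
queue.push(0x1000, 0x1040)   # queued
queue.push(0x1000, 0x1050)   # same block as 0x1040, merged
queue.push(0x1FC0, 0x2000)   # crosses into the next page, rejected
```

Counting each rejection separately matches the spirit of the per-cause statistics on the previous slide (merged, page-crossing, overflowed).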

Implementation details
Works as an abstract class
Must be inherited by the specific prefetcher
Virtual functions must be overridden:
– Init: initialization function (prefetcher size, distance, aggressiveness, etc.)
– Observe request: called on each cache access (hit/miss, prefetch/no_prefetch, accessed address, etc.)
– Allocate: called when data is allocated in the cache (same as observe request)
– Deallocate: called when a block is evicted from the cache (only the evicted address)
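The interface described above can be sketched as follows. gem5's Ruby components are written in C++, so this Python version only illustrates the shape of the contract; the class names, method signatures, and the 64-byte block size are assumptions, and `NextLineEngine` is a simplified tagged-style engine written for this example.

```python
from abc import ABC, abstractmethod

BLOCK_SIZE = 64  # assumed cache block size in bytes


class PrefetchEngine(ABC):
    """Hypothetical base class mirroring the four hooks named on the slide."""

    @abstractmethod
    def init(self, degree: int) -> None:
        """Configure the engine (size, distance, aggressiveness, ...)."""

    @abstractmethod
    def observe_request(self, addr: int, hit: bool, was_prefetched: bool) -> list:
        """Called on each cache access; returns block addresses to prefetch."""

    @abstractmethod
    def allocate(self, addr: int, hit: bool, was_prefetched: bool) -> None:
        """Called when data is allocated in the cache."""

    @abstractmethod
    def deallocate(self, addr: int) -> None:
        """Called when a block is evicted; receives only its address."""


class NextLineEngine(PrefetchEngine):
    """Triggers on a miss or on a hit to a block that was brought in by a prefetch."""

    def init(self, degree: int) -> None:
        self.degree = degree

    def observe_request(self, addr, hit, was_prefetched):
        if hit and not was_prefetched:
            return []  # ordinary hit: nothing to do
        base = addr - addr % BLOCK_SIZE
        return [base + i * BLOCK_SIZE for i in range(1, self.degree + 1)]

    def allocate(self, addr, hit, was_prefetched):
        pass  # this simple engine keeps no per-block state

    def deallocate(self, addr):
        pass


engine = NextLineEngine()
engine.init(degree=2)
print([hex(a) for a in engine.observe_request(0x1008, hit=False, was_prefetched=False)])
```

Because every hook is abstract, a subclass that forgets to override one cannot be instantiated, which is the behaviour the slide asks of specific prefetchers.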

Implementation details
Notifies the wrapper of:
– Cache accesses
– Cache allocations
– Cache evictions
Reads from the Prefetch Queue
A prefetch is issued when there are no loads/stores
Protocol modification is similar to a load operation
Very similar to the current solution
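The rule that a prefetch is issued only when there are no loads or stores amounts to strict priority between two queues. The Python sketch below shows that arbitration in isolation; the function name `next_request` and the two-queue model are assumptions made for illustration and do not reproduce the protocol's actual state machine.

```python
from collections import deque


def next_request(demand_queue: deque, prefetch_queue: deque):
    """Pick the next request: demand accesses always win over prefetches."""
    if demand_queue:
        return ("demand", demand_queue.popleft())
    if prefetch_queue:
        return ("prefetch", prefetch_queue.popleft())
    return None  # nothing to issue this cycle


demand = deque([0x100, 0x140])
prefetch = deque([0x180])
order = []
while (req := next_request(demand, prefetch)) is not None:
    order.append(req)
print(order)
```

With this policy a prefetch can never delay a demand access at the issue point, at the cost of starving the prefetch queue while demand traffic is continuous.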

Implementation details
Same as in the L1, but:
L2 local hits
– Some data is invalid in L2 but locally allocated
L1_GETS does not store in L2
– Protocol modified to store prefetch requests in L2
The prefetch queue may generate a request for another tile
– The request is forwarded to the corresponding tile
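Forwarding a request to "the corresponding tile" presupposes a mapping from address to home tile. The Python sketch below assumes a static block-interleaved mapping across 16 tiles with 64-byte blocks; that mapping, and the names `home_tile` and `route_prefetch`, are assumptions for illustration and are not stated in the slides.

```python
BLOCK_BITS = 6   # assumed 64-byte blocks
NUM_TILES = 16   # matches the 16-tile configuration of the case study


def home_tile(addr: int) -> int:
    """Map a physical address to its home tile (assumed block interleaving)."""
    return (addr >> BLOCK_BITS) % NUM_TILES


def route_prefetch(local_tile: int, addr: int):
    """Return ("local", tile) or ("forward", tile) for a queued prefetch."""
    target = home_tile(addr)
    return ("local" if target == local_tile else "forward", target)


print(route_prefetch(local_tile=0, addr=0x40))
```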

Case study: NoC-aware prefetch performance evaluation
We tested 3 classical prefetch engines:
– Tagged Prefetcher
– Reference Prediction Table (RPT)
– Global History Buffer (GHB)
With the gem5 simulator, using:
– 16 tiled x86 CPUs
– L1 prefetchers
– Ruby memory system
– MOESI coherence protocol
– Garnet network simulator
PARSEC 2.1 benchmarks
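To give a flavour of the engines listed above, here is a minimal stride detector in the spirit of a Reference Prediction Table. It is a deliberate simplification and an assumption of this sketch: the classical RPT tracks each load with a multi-state machine, whereas this version predicts as soon as the same non-zero stride is seen twice in a row. The `StrideTable` class is hypothetical.

```python
class StrideTable:
    """Simplified RPT-like table with one entry per load PC."""

    def __init__(self):
        self.entries = {}  # pc -> (last_addr, stride)

    def observe(self, pc: int, addr: int):
        """Record an access; return a predicted address or None."""
        entry = self.entries.get(pc)
        if entry is None:
            self.entries[pc] = (addr, 0)
            return None
        last_addr, stride = entry
        new_stride = addr - last_addr
        self.entries[pc] = (addr, new_stride)
        # Predict only when the stride repeats and is non-zero.
        if new_stride == stride and stride != 0:
            return addr + new_stride
        return None


table = StrideTable()
predictions = [table.observe(0x400, a) for a in (100, 164, 228, 292, 1000)]
print(predictions)
```

The first two accesses only train the entry; predictions start at the third access and stop again as soon as the stride is broken.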

Case study: Results
M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modelling Practice and Theory (SIMPAT), 2014.

Conclusions
Prefetching is important and must be simulated
The current solution is OK
Our solution goes one step further:
– Easy to change/add new prefetch engines
– Detailed statistics about prefetching
– Garnet can identify prefetching traffic
– Useful for statistics or traffic manipulation
The current tool can easily be included in the new solution
The current solution is OK for non-prefetch researchers
Our tool is better for prefetch-related research
