1 Presenter: Chien-Chih Chen Proceedings of the 2002 workshop on Memory system performance.

Presentation transcript:

1 Presenter: Chien-Chih Chen Proceedings of the 2002 workshop on Memory system performance

Analyzing the memory access behavior of applications, an essential step for successful cache optimization, is a complex task. It needs to be supported by appropriate tools and monitoring facilities. Currently, however, users can rely only on simulation-based approaches, which deliver a large degree of detail but are restricted in their applicability, or on hardware counters embedded in processors, which can track only a few, mostly global events and hence provide only limited data. 2

This work proposes a novel hardware monitoring facility that combines the detail of traditional simulations with the low overhead of hardware counters. Like hardware counters, it is targeted at an implementation within the processor for direct and non-intrusive access to caches and memory buses. Unlike traditional counters, however, it delivers a detailed picture of the complete memory access behavior of applications. This is achieved by generating so-called memory access histograms, which show access frequencies in relation to the application's address space. Such spatial memory access information can then be used for efficient program optimization by focusing on the code and data segments that were found to exhibit poor cache behavior. 3
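As a rough software illustration of what a memory access histogram contains, the sketch below counts accesses per fixed-size block of the address space. The function name `access_histogram`, the 32-byte default block size, and the sample trace are assumptions made for this example; the proposed monitor would gather these counts in hardware rather than from a software trace.

```python
from collections import Counter

def access_histogram(addresses, granularity=32):
    """Count accesses per block of `granularity` bytes (hypothetical helper)."""
    hist = Counter()
    for addr in addresses:
        # Round the address down to the base of its block.
        block_base = addr // granularity * granularity
        hist[block_base] += 1
    return dict(hist)

# Hypothetical trace: three accesses fall into the block at 0x1000.
trace = [0x1000, 0x1004, 0x1020, 0x1008, 0x2000]
for base, count in sorted(access_histogram(trace).items()):
    print(f"{base:#x}: {count}")
```

Blocks with high counts and poor cache behavior are the ones an optimization effort would look at first.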

Existing approaches to analyzing the memory access behavior of applications can be roughly divided into two classes: simulation and hardware counters. Tracing a large application can significantly slow down simulation because of the huge number of memory operations. Counter-based approaches are restricted to very few individual counters in each processor. 4

5 [5] [10]: cache performance simulators; [3] [8] [13]: hardware counters inside the processor; [1] [2]: DCPI, a sampling-based system, and its extension, which use the information available from hardware counters to build histograms; proposed: hardware cache monitoring architecture

6 Cache-like counter organization; user-definable granularity; multilevel monitoring; adding temporal information; integration aspects

7

8 Filter each event using an address range of interest. Mask all monitored events, which removes unimportant lower bits from any address. Aggregate neighboring events during the monitoring process.
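The filter, mask, and aggregate steps can be sketched in software as follows. This is a minimal sketch under stated assumptions: the name `monitor_events`, the half-open range `[lo, hi)`, and the parameter `mask_bits` are hypothetical and are not taken from the presented hardware design.

```python
def monitor_events(addresses, lo, hi, mask_bits):
    """Filter, mask, and aggregate address events (hypothetical helper).

    lo, hi    -- address range of interest, treated as [lo, hi)
    mask_bits -- number of low-order address bits to discard
    """
    counters = {}
    for addr in addresses:
        if not (lo <= addr < hi):
            continue  # filter: event lies outside the range of interest
        key = (addr >> mask_bits) << mask_bits  # mask: clear the lower bits
        counters[key] = counters.get(key, 0) + 1  # aggregate: neighbors share a counter
    return counters

# With mask_bits=6, all addresses within one 64-byte block share a counter.
events = [0x100, 0x104, 0x13F, 0x140, 0x500]
print(monitor_events(events, lo=0x100, hi=0x200, mask_bits=6))
```

In this sketch a larger `mask_bits` value corresponds to a coarser user-defined granularity: fewer counters are needed, at the cost of spatial resolution.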

9 Used to compute hit or miss rates for any cache with respect to a given address region.
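A small sketch of how per-region miss rates might be derived once hit and miss events are counted separately for each address region. The event format `(address, hit)` and the name `miss_rates` are assumptions for illustration; in the proposed design the counts would come from monitors attached to each cache level.

```python
def miss_rates(events, granularity):
    """Return the miss rate per address region (hypothetical helper).

    events      -- iterable of (address, hit) pairs, hit being True or False
    granularity -- region size in bytes
    """
    totals, misses = {}, {}
    for addr, hit in events:
        region = addr // granularity * granularity
        totals[region] = totals.get(region, 0) + 1
        if not hit:
            misses[region] = misses.get(region, 0) + 1
    # Regions without any miss get a rate of 0.0.
    return {region: misses.get(region, 0) / totals[region] for region in totals}

# Hypothetical events observed at one cache level.
l1_events = [(0x00, True), (0x10, False), (0x100, False), (0x104, False)]
print(miss_rates(l1_events, granularity=256))
```

Running the same computation on the events seen at each level gives one miss-rate histogram per level of the hierarchy.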

10 The main drawback of this approach is that it provides no temporal information about memory access behavior; it contains only the aggregated memory behavior of the whole monitoring period. Solution: monitor barriers, which trigger a complete swap-out of all partial results.
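The effect of monitor barriers can be sketched as splitting one trace into per-phase histograms. In this sketch the barrier is modeled as a sentinel value in the trace, which is an assumption for illustration; the names `BARRIER` and `phase_histograms` are hypothetical.

```python
BARRIER = None  # hypothetical marker for a monitor barrier in the trace

def phase_histograms(trace, granularity=32):
    """Split a trace into one histogram per phase (hypothetical helper)."""
    phases, current = [], {}
    for item in trace:
        if item is BARRIER:
            # A barrier swaps out all partial results and starts a new phase.
            phases.append(current)
            current = {}
            continue
        block = item // granularity * granularity
        current[block] = current.get(block, 0) + 1
    if current:
        phases.append(current)  # results collected after the last barrier
    return phases

# Two phases: the first touches block 0, the second touches block 64.
print(phase_histograms([0, 4, BARRIER, 64, 64]))
```

Comparing the histograms of successive phases shows how the access pattern changes over time, which a single aggregated histogram cannot show.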

11 How can counter values be swapped out into the ring buffer without influencing the monitor's own measurements? Mask these accesses through appropriate filter mechanisms, or establish separate store pipelines. The monitor observes physical addresses, which need to be translated into the virtual address space, using the correct page table, for evaluation at the application level.
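The translation of observed physical addresses back into the virtual address space can be sketched as a reverse page-table lookup. This is a simplified sketch: the 4 KB page size, the dictionary-based page table, and the assumption that each physical frame is mapped by at most one virtual page are illustrative choices, not details from the presentation.

```python
PAGE_SIZE = 4096  # assumed page size; not specified in the slides

def to_virtual(phys_addr, page_table):
    """Map a physical address back to a virtual one (hypothetical helper).

    page_table -- dict mapping virtual page number -> physical frame number;
                  assumes each frame is mapped by at most one page
    """
    frame_to_page = {frame: page for page, frame in page_table.items()}
    frame, offset = divmod(phys_addr, PAGE_SIZE)
    page = frame_to_page.get(frame)
    if page is None:
        return None  # frame does not belong to this address space
    return page * PAGE_SIZE + offset

# Hypothetical page table: virtual page 0x11 lives in physical frame 0x7.
table = {0x10: 0x3, 0x11: 0x7}
print(hex(to_virtual(0x7123, table)))
```

A real evaluation tool would also have to handle shared frames and page-table changes that occur during the monitoring run.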

12 A general tool designed for the evaluation of NUMA shared-memory machines, including simulation of the complete memory hierarchy. Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessors in which the memory access time depends on the memory location relative to a processor. Simulated configuration: a single CPU with a two-level memory hierarchy, consisting of an 8 KB direct-mapped L1 cache and a 256 KB two-way associative L2 cache, each with 32-byte cache lines. Applications: LU decomposition, RADIX sort, and WATER.
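The cache parameters above determine how many lines and sets each cache has, and therefore how many distinct counters a cache-line-granularity histogram could need. The short calculation below works this out; `cache_geometry` is a hypothetical helper and it assumes all sizes are powers of two.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Derive line and set counts for a cache (sizes assumed powers of two)."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }

# Configuration from the slide: 8 KB direct-mapped L1, 256 KB two-way L2.
print("L1:", cache_geometry(8 * 1024, 32, ways=1))
print("L2:", cache_geometry(256 * 1024, 32, ways=2))
```

The L1 cache thus holds 256 lines in 256 sets, and the L2 cache holds 8192 lines in 4096 sets.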

13 Shows cache and memory access behavior. Shows cache miss rates for each level of the memory hierarchy.

14 Shows partial memory access behavior. Phases 3 and 5 show high locality on very few memory blocks. Phases 4 and 5 access a large range of blocks.

15 All histograms are shown without any filtering and at the highest possible accuracy, assuming cache-line granularity. This leads to runtime overhead due to the on-line delivery of memory access histograms. Reducing the granularity and increasing the number of counters are two options for compensating for this overhead and keeping the monitor's influence minimal. Reducing the granularity reduces accuracy. Increasing the number of counters increases hardware complexity.

16 Coarsened granularity reduces the high runtime overhead. 32 or 64 counters are a reasonable number and hence keep the hardware complexity reasonable. [Table columns: Granularity, Number of Counters, Swap-out events]
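The trade-off between granularity, number of counters, and swap-out events can be explored with a small software model of a cache-like counter array backed by a ring buffer. This is only a sketch: the LRU replacement policy, the ring buffer size, the class name `CounterArray`, and the synthetic streaming trace are assumptions, and the resulting numbers are not the measurements from the presentation.

```python
from collections import OrderedDict, deque

class CounterArray:
    """Software model of a cache-like counter array (hypothetical design)."""

    def __init__(self, num_counters, granularity, ring_size=1024):
        self.num_counters = num_counters
        self.granularity = granularity
        self.counters = OrderedDict()        # block base -> count, in LRU order
        self.ring = deque(maxlen=ring_size)  # oldest entries drop when full
        self.swap_outs = 0

    def record(self, addr):
        block = addr // self.granularity * self.granularity
        if block in self.counters:
            self.counters[block] += 1
            self.counters.move_to_end(block)  # mark as most recently used
            return
        if len(self.counters) >= self.num_counters:
            # No free counter: evict the least recently used one (assumed policy).
            self.ring.append(self.counters.popitem(last=False))
            self.swap_outs += 1
        self.counters[block] = 1

def count_swap_outs(trace, num_counters, granularity):
    array = CounterArray(num_counters, granularity)
    for addr in trace:
        array.record(addr)
    return array.swap_outs

# Synthetic streaming trace: one access to each of 128 consecutive 32-byte lines.
trace = range(0, 4096, 32)
for counters, gran in [(32, 32), (64, 32), (32, 128)]:
    print(counters, gran, count_swap_outs(trace, counters, gran))
```

In this model, both adding counters and coarsening the granularity lower the number of swap-out events, which mirrors the two options discussed above.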

17 Novel architecture without major overhead in hardware complexity. Experiments have shown that the precision of these access histograms is close to that of a full simulation. This monitoring system is capable of observing any kind of event that can be related to an address.

18 Multilevel memory monitoring architecture. Partial and temporal cache behavior aggregation technique. Counter array and ring buffer technique.