Cache Simulations and Application Performance Christopher Kerr Philip Mucci Jeff Brown Los Alamos, Sandia.

Cache Simulations and Application Performance Christopher Kerr Philip Mucci Jeff Brown Los Alamos, Sandia National Laboratories

Goal
To optimize large, numerically intensive applications with poor cache utilization. By taking advantage of the memory hierarchy, we can often achieve the greatest performance improvement for the time invested.

Philosophy
By simulating the cache hierarchy, we wish to understand how the application’s data maps to a specific cache architecture. In addition, we wish to understand the application’s reference pattern and its relationship to that mapping. This information can then be used to obtain performance improvements algorithmically.

Cache Simulator
Consists of:
– Instrumentation assistant (Perl)
– Header files
– Run-time library to be linked with the application (C)
Works with:
– C, C++, Fortran 77, Fortran 90

How it works
– The cache simulator is called on memory (array) references.
– The cache simulator reads a configuration file containing an architectural description of the memory hierarchy for multiple machines.
– Environment variables enable different options.

How it works (cont)
Each call to the simulator provides as input:
– Address of the reference
– Size of the datum being accessed
– Symbolic name consisting of the reference, line number and file (for example, A(i,j):7:stdin)
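The call interface described above can be sketched in a few lines of Python. This is an illustration only, not the authors' C run-time library: the class name `CacheSim`, the LRU replacement policy, and the decision to count a split access once per line touched are all assumptions. The default geometry (32 kB, 32 B lines, 2-way) is taken from the Summary slide.

```python
from collections import Counter, OrderedDict


class CacheSim:
    """One cache level with LRU replacement (the policy is an assumption)."""

    def __init__(self, size=32 * 1024, line_size=32, assoc=2):
        self.line_size = line_size
        self.assoc = assoc
        self.num_sets = size // (line_size * assoc)
        # One ordered dict per set; keys are tags, least recently used first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0
        self.split = 0
        self.misses_by_name = Counter()

    def _touch_line(self, line_no, name):
        index = line_no % self.num_sets
        tag = line_no // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1
        self.misses_by_name[name] += 1
        if len(ways) >= self.assoc:
            ways.popitem(last=False)  # evict the least recently used line
        ways[tag] = True

    def cache_sim(self, address, size, name):
        """Record one reference: byte address, datum size, symbolic name."""
        first = address // self.line_size
        last = (address + size - 1) // self.line_size
        if last != first:
            self.split += 1  # the datum straddles a line boundary
        for line_no in range(first, last + 1):
            self._touch_line(line_no, name)


sim = CacheSim()
sim.cache_sim(0x32D60, 8, "R(i,j):13:stencil.F")  # first touch of the line
sim.cache_sim(0x32D68, 8, "R(i,j):13:stencil.F")  # same 32-byte line
print(sim.hits, sim.misses)
```

Keeping the per-name miss counter inside the simulator is what makes a "misses by name" report possible without a second pass over the trace.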

Instrumentation (before)
subroutine kji(A, ii, jj, lda, B, kk, ldb, C, ldc)
dimension A(lda,lda), B(ldb,ldb), C(ldc,ldc)
do k = 1, kk
  do j = 1, jj
    do i = 1, ii
      A(i,j) = A(i,j) + B(i,k) * C(k,j)
    enddo
  enddo
enddo
return
end

Instrumentation (after)
...
do i = 1, ii
  call cache_sim(A(i,j),KIND(A(i,j)), &
                 'A(i,j):7:stdin\0')
  call cache_sim(A(i,j),KIND(A(i,j)), &
                 'A(i,j):7:stdin\0')
  call cache_sim(B(i,k),KIND(B(i,k)), &
                 'B(i,k):7:stdin\0')
  call cache_sim(C(k,j),KIND(C(k,j)), &
                 'C(k,j):7:stdin\0')
  A(i,j) = A(i,j) + B(i,k) * C(k,j)
enddo
...
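To see what stream of references the instrumented loop hands to the simulator, the sketch below generates the same sequence in Python. It assumes column-major storage with 1-based indices (standard Fortran), 4-byte elements (the arrays are implicitly typed default reals), and made-up base addresses. Reading the two `A(i,j)` calls as one load and one store is also an interpretation, not something the slides state.

```python
def column_major_address(base, i, j, ld, elem_size=4):
    """Byte address of X(i,j) in a 1-based, column-major array with leading dimension ld."""
    return base + ((j - 1) * ld + (i - 1)) * elem_size


def trace_kji(ii, jj, kk, lda, ldb, ldc, base_a, base_b, base_c):
    """Yield (address, size, name) in the order the instrumented loop calls cache_sim."""
    for k in range(1, kk + 1):
        for j in range(1, jj + 1):
            for i in range(1, ii + 1):
                a = column_major_address(base_a, i, j, lda)
                yield a, 4, "A(i,j):7:stdin"  # assumed: load of A(i,j)
                yield a, 4, "A(i,j):7:stdin"  # assumed: store to A(i,j)
                yield column_major_address(base_b, i, k, ldb), 4, "B(i,k):7:stdin"
                yield column_major_address(base_c, k, j, ldc), 4, "C(k,j):7:stdin"


# Hypothetical base addresses; a 2x1x1 problem produces eight references.
for record in trace_kji(2, 1, 1, 4, 4, 4, 0, 1000, 2000):
    print(record)
```

With `i` innermost, `A` and `B` are walked with unit stride while `C(k,j)` stays fixed, which is the access pattern the kji ordering is meant to produce.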

Output
– Summary
– Misses by name
– Misses by address
– Conflict matrix
– Address trace

Summary
Machine 1: test-machine
Cache level 1: size 32kB, line size 32B, associativity 2, 1024 lines total
Total mem accesses:
Total cache misses:
Total cache hits:
Total hit rate:
Num split accesses: 0.00
Cold cache misses:
Real misses:
Real hit rate: 94.41
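The geometry line of the summary can be checked with simple arithmetic, sketched below. The rate formulas are assumptions about how the report might define its fields: "real misses" is taken here to mean total misses minus cold (first-touch) misses, which the slides do not confirm. The numbers passed to `summary_rates` are invented for illustration.

```python
def cache_geometry(size_bytes, line_bytes, associativity):
    """Return (total lines, number of sets) for one cache level."""
    total_lines = size_bytes // line_bytes
    num_sets = total_lines // associativity
    return total_lines, num_sets


def summary_rates(accesses, misses, cold_misses):
    """Hit rates in percent; 'real' excludes cold misses (assumed definition)."""
    real_misses = misses - cold_misses
    return {
        "hit_rate": 100.0 * (accesses - misses) / accesses,
        "real_misses": real_misses,
        "real_hit_rate": 100.0 * (accesses - real_misses) / accesses,
    }


print(cache_geometry(32 * 1024, 32, 2))  # 32 kB, 32 B lines, 2-way
print(summary_rates(1000, 100, 40))      # invented counts, not from the slides
```

A 32 kB cache with 32 B lines holds 1024 lines, matching the summary, and at associativity 2 those lines form 512 sets.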

Misses by name
Name Trace: miss rate for references >= 0 percent.
Percentage  Real misses  Line:File:Reference
X(i,j):26:stencil.F
X(i,j-1):26:stencil.F
R(i,j):26:stencil.F
X(i+1,j-1):26:stencil.F
X(i+1,j+1):26:stencil.F
AN(i,j):26:stencil.F
ANE(i,j):17:stencil.F
AN(i,j):15:stencil.F
ANE(i,j-1):26:stencil.F
ANE(i,j):26:stencil.F
...

Misses by address
Address Trace: miss rate for references >= 0 percent.
Percentage  Real misses  Address  Line:File:Reference
x32d60 R(i,j):13:stencil.F
x32d80 R(i,j):13:stencil.F
x32da0 R(i,j):13:stencil.F
x32dc0 R(i,j):13:stencil.F
x32de0 R(i,j):13:stencil.F
x32e00 R(i,j):13:stencil.F
...
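The addresses in this listing advance by 0x20 (32 bytes), one cache line at a time. The sketch below splits each one into tag, set index and byte offset for the 32 kB, 2-way, 32 B-line cache from the Summary slide. It assumes the listed values are hexadecimal byte addresses (read here as 0x32d60 and so on) and that the set index is taken from the low-order bits of the line number, which is the conventional scheme but is not stated in the slides.

```python
LINE_SIZE = 32   # bytes per line, from the Summary slide
NUM_SETS = 512   # 1024 lines / associativity 2


def decompose(address):
    """Split a byte address into (tag, set index, byte offset within the line)."""
    offset = address % LINE_SIZE
    line_no = address // LINE_SIZE
    return line_no // NUM_SETS, line_no % NUM_SETS, offset


addresses = [0x32D60, 0x32D80, 0x32DA0, 0x32DC0, 0x32DE0, 0x32E00]
for a in addresses:
    tag, index, offset = decompose(a)
    print(f"{a:#x}: tag {tag}, set {index}, offset {offset}")
```

Under these assumptions the six addresses fall in six consecutive sets with the same tag, which is what a unit-stride sweep through `R` would look like.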

Conflict matrix
– Each axis represents the different arrays.
– The X axis is the replacer; the Y axis is the replacee.
– Each element is the number of times an element of one array replaced an element of another.
– The goal is to algorithmically determine optimal layout, placement, padding and blocking. This is a minimization problem.

Conflict matrix (cont)
Num  Name
0    A(i,j):7:kji.f
1    B(i,k):7:kji.f
2    C(k,j):7:kji.f
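One way such a matrix could be accumulated is sketched below. For brevity it models a direct-mapped cache rather than the 2-way cache in the summary, and the function name and the `(address, name)` trace format are hypothetical. Each eviction increments the cell indexed by the replacing reference and the replaced one.

```python
from collections import defaultdict


def conflict_matrix(trace, size=32 * 1024, line_size=32):
    """Count evictions as matrix[replacer][replacee] for a direct-mapped cache."""
    num_lines = size // line_size
    resident = {}  # slot -> (line number, name that brought the line in)
    matrix = defaultdict(lambda: defaultdict(int))
    for address, name in trace:
        line_no = address // line_size
        slot = line_no % num_lines
        held = resident.get(slot)
        if held is not None and held[0] == line_no:
            continue  # hit: nothing is replaced
        if held is not None:
            matrix[name][held[1]] += 1  # name replaces the previous owner
        resident[slot] = (line_no, name)
    return matrix


# Two arrays exactly one cache size apart evict each other on every access.
trace = [(0, "A"), (32768, "B"), (0, "A"), (4, "A")]
m = conflict_matrix(trace)
print(dict(m["A"]), dict(m["B"]))
```

Large off-diagonal counts point to pairs of arrays whose layouts collide, which is the information padding or blocking decisions would be based on.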

Address dump
– Symbolic name
– Virtual address
– Cache line
The goal is to use this for replay.

Future
– Handle nonblocking caches, replacement policies, write strategies and buffering.
– Add an output file with the starting address and extents of each array.
– Add a facility to replay the simulator using the address dump together with data on padding, blocking and alignment. This will eliminate the need for additional runs.
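The replay idea can be illustrated with a small sketch: take a recorded dump, shift each array by a candidate padding, and count misses again without rerunning the application. Everything here is hypothetical, including the `(array, address)` record format, the `padding` dictionary, and the direct-mapped cache model used to keep the example short.

```python
def replay_with_padding(dump, padding, size=32 * 1024, line_size=32):
    """Replay (array, address) records with per-array byte padding; return the miss count."""
    num_lines = size // line_size
    resident = {}  # slot -> line number currently held
    misses = 0
    for array, address in dump:
        shifted = address + padding.get(array, 0)
        line_no = shifted // line_size
        slot = line_no % num_lines
        if resident.get(slot) != line_no:
            misses += 1
            resident[slot] = line_no
    return misses


# Arrays A and B start exactly one cache size apart, so they thrash one slot.
dump = [("A", 0), ("B", 32768)] * 4
print(replay_with_padding(dump, {}))         # no padding
print(replay_with_padding(dump, {"B": 32}))  # pad B by one cache line
```

In this toy trace, padding `B` by a single line moves it to a different slot, so only the two first-touch misses remain.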

Future (cont)
– Categorize cold misses for repeatedly accessed data items.
– Provide cost metrics to analyze approximate performance loss due to poor locality.
– Full Lex/Yacc based parser.
– Perl/Tk GUI for finer control of instrumentation.

Future (cont)
– MPI and thread awareness
– Reduction in run-time requirements
– MUT integration
– Tools to compare data sets