Replicating Memory Behavior for Performance Skeletons. Aditya Toomula, PC-Doctor Inc., Reno, NV; Jaspal Subhlok, University of Houston, Houston, TX.



Resource Selection for Grid Applications
[Figure: an application shown as a graph of components (Data, Pre, Sim 1, Model, GUI, Stream) to be mapped onto a network; where is the best performance?]

Motivation
Estimating the performance of an application in a dynamically changing grid environment.
- Estimation based on generic system probes (such as NWS) is expensive and error prone.
Estimating performance for micro-architectural simulations.
- Executing the full application is prohibitively expensive.

Our Approach
Predict application performance by running a small program representative of the actual distributed application.
[Figure: the application's component graph (Data, Pre, Sim 1, Model, GUI, Stream) alongside its skeleton.]

Performance Skeletons
A synthetically generated, short-running program. The skeleton reflects the performance of the application it represents in any execution scenario, e.g., skeleton execution time is always 1/1000th of application execution time. For this to hold, an application and its skeleton should have similar execution activities:
- Communication activity
- CPU activity
- Memory access pattern

Memory Skeleton
Given an executable application, construct a short-running skeleton program whose memory access behavior is representative of the application. An application and its memory skeleton should have similar cache performance for any cache hierarchy. Solution approach: create a program that recreates the memory accesses in a sequence of representative slices of the executing program.

Challenges in Creating a Memory Skeleton
A memory trace is prohibitively large even for a few minutes of execution.
- Solution approach: sampling and compression, lossy if necessary.
Recreating memory accesses from a trace is difficult: the cache is corrupted by management code, and recreation has substantial overhead, since several instructions must be executed to issue each memory access request.
- Solution approach: avoid cache corruption, and allow reordering that minimizes the overhead per access.

Memory Access Behavior of Applications
Two types of locality:
- Spatial locality: if one memory location is accessed, nearby memory locations are also likely to be accessed.
- Temporal locality: if a location is accessed once, it is likely to be accessed again soon.
These locality principles should be preserved in the memory skeleton.

Automatic Skeleton Construction Framework
- Collect data address trace samples of the application.
- Summarize the trace samples.
- Generate the memory skeleton.
[Figure: application and skeleton component graphs (Data, Pre, Sim 1, Model, GUI, Stream) connected by a "create skeleton" step.]


Address Trace Collection
Link the application executable with the Valgrind tool, which generates the address trace of the application.
- Access to source code is not required.
Issue: unacceptable storage space and time overhead. Hence, sampling of the address trace must be done during trace collection itself; collecting full traces of applications is prohibitively expensive.

Address Trace Collection (contd.): Trace Sampling
- Divide the trace into trace slices: sets of consecutive memory references.
- The tool can be periodically switched on and off to capture these slices.
- Slices can be collected at random or at uniform intervals.
- The slice size should be at least one order of magnitude greater than the largest expected cache, in order to capture temporal locality.
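The on/off sampling above can be sketched as a small filter over a reference stream. This is a minimal illustration, not the paper's actual Valgrind instrumentation; the function names and the period/slice parameters are assumptions.

```c
/* Sketch of periodic trace-slice sampling: out of every `period`
 * references, the first `slice_len` are kept (one slice) and the rest
 * are dropped, yielding uniformly spaced slices. */
#include <assert.h>
#include <stddef.h>

/* Returns 1 if reference number i (0-based) falls inside a slice. */
int in_slice(size_t i, size_t period, size_t slice_len)
{
    return (i % period) < slice_len;
}

/* Copies the sampled references from trace[] into out[]; returns count. */
size_t sample_trace(const unsigned long *trace, size_t n,
                    size_t period, size_t slice_len, unsigned long *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (in_slice(i, period, slice_len))
            out[kept++] = trace[i];
    return kept;
}
```

With `slice_len`/`period` = 1/10 this corresponds to the 10% sampling ratio used in the experiments later in the talk.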


Trace Compaction
The recorded trace is still large and expensive to recreate. Compress the trace using the following two ideas:
- The exact address in a trace is not critical; a nearby address will work (may affect spatial locality).
- Slight reordering of the address trace does not affect performance (may affect temporal locality).
This is lossy compression, but the impact on locality can be made negligible.

Trace Compaction (contd.)
- Divide the address space into lines of a size comparable to a typical cache line, and record only the line number, not the full address; the impact on spatial locality should be minimal.
- Divide the temporal sequence of line numbers into clusters; reordering within a cluster is allowed. The cluster size should be much smaller than the smallest expected cache size, so that temporal locality is not affected by reordering.
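The two compaction steps above can be sketched as follows: map each address to a cache-line number, then summarize one cluster as (line, frequency) pairs, discarding the order within the cluster. The 64-byte line size and the quadratic dedup loop are illustrative assumptions, not the paper's implementation.

```c
/* Sketch of the trace compaction: address -> line number, then
 * per-cluster (line, frequency) summarization. */
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64   /* roughly a typical cache line (assumption) */

unsigned long line_of(unsigned long addr) { return addr / LINE_SIZE; }

/* Summarize one cluster of line numbers as (line, freq) pairs.
 * Order within the cluster is deliberately discarded.
 * Returns the number of distinct pairs written. */
size_t compact_cluster(const unsigned long *lines, size_t n,
                       unsigned long *out_line, unsigned *out_freq)
{
    size_t npairs = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < npairs; j++)
            if (out_line[j] == lines[i]) { out_freq[j]++; break; }
        if (j == npairs) {
            out_line[npairs] = lines[i];
            out_freq[npairs++] = 1;
        }
    }
    return npairs;
}
```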


Memory Skeleton Generation
Create a C program that synthetically generates the sequence of memory references recorded in the previous step.
Challenge: minimizing extraneous address references.
- Any executing program has memory accesses of its own.
- Generate a loop structure for each cluster.
- Reading a line-number/frequency pair once yields a series of actual memory references from the trace without intervening address reads.
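The loop-per-cluster idea above can be sketched as a replay routine: one read of each (line, frequency) pair drives `freq` replayed accesses, so the skeleton's own bookkeeping references stay small relative to the replayed ones. The flat `mem` buffer is a simplifying assumption; the real skeleton allocates memory on demand, as described later.

```c
/* Sketch of the skeleton's replay loop over one cluster's
 * (line, frequency) pairs. */
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64

/* volatile sink keeps the compiler from optimizing the accesses away */
static volatile unsigned char sink;

unsigned long replay_cluster(unsigned char *mem,
                             const unsigned long *line,
                             const unsigned *freq, size_t npairs)
{
    unsigned long issued = 0;
    for (size_t p = 0; p < npairs; p++) {
        unsigned char *addr = mem + line[p] * LINE_SIZE;
        for (unsigned k = 0; k < freq[p]; k++) {
            sink = *addr;   /* one replayed memory reference */
            issued++;
        }
    }
    return issued;
}
```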

Skeleton Generation (contd.)
Eliminating cache corruption:
- Reading trace data from disk perturbs the memory simulation.
- Solution: use a second machine; read the data on one machine and send it through sockets to the machine where the simulation runs.
- The socket buffer is kept very small.
- This also reduces overhead on the main simulation machine.
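The small-socket-buffer idea can be illustrated with a local socket pair standing in for the two machines (an assumption for demonstration; the paper uses a real second machine). The function name and the 512-byte buffer size are illustrative.

```c
/* Sketch: a trace channel with deliberately tiny socket buffers, so
 * trace data trickles to the simulator instead of sitting in large
 * kernel buffers that would pollute the cache under study. */
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* fds[0]: feeder end, fds[1]: simulator end. Returns 0 on success. */
int make_trace_channel(int fds[2])
{
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;
    int tiny = 512;  /* very small buffer (illustrative size) */
    setsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &tiny, sizeof tiny);
    setsockopt(fds[1], SOL_SOCKET, SO_RCVBUF, &tiny, sizeof tiny);
    return 0;
}
```

The feeder process would write (line, frequency) pairs into `fds[0]` while the skeleton reads and replays them from `fds[1]`.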

Skeleton Generation (contd.)
Allocating memory:
- The regions of virtual memory that will actually be used are not known prior to simulation.
- Dynamic block allocation: a substantial block of memory is allocated when an address reference is made to a location that is not yet allocated.
- A sparse index table is maintained to access the blocks.
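The dynamic block allocation above can be sketched as a table from block number to lazily allocated storage. The 1 MB block size, the table size, and the function names are illustrative assumptions, not the paper's parameters.

```c
/* Sketch of dynamic block allocation with a sparse index table:
 * the address space is split into large blocks, a block is allocated
 * on first touch, and the table maps block number -> storage. */
#include <assert.h>
#include <stdlib.h>

#define BLOCK_SHIFT 20                   /* 1 MB blocks (assumption) */
#define BLOCK_SIZE  (1UL << BLOCK_SHIFT)
#define NBLOCKS     4096                 /* covers 4 GB of trace addresses */

static unsigned char *block_table[NBLOCKS];  /* sparse index table */

/* Translate a trace address to real storage, allocating on first touch.
 * Returns NULL if the address is outside the covered range. */
unsigned char *resolve(unsigned long addr)
{
    unsigned long b = addr >> BLOCK_SHIFT;
    if (b >= NBLOCKS)
        return NULL;
    if (!block_table[b])
        block_table[b] = calloc(BLOCK_SIZE, 1);
    return block_table[b] + (addr & (BLOCK_SIZE - 1));
}
```

Only blocks that the replayed trace actually touches are ever allocated, which keeps the skeleton's footprint close to the application's.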

Experiments and Results
Skeletons were constructed for the Class-W NAS serial benchmarks (and the Class-A IS benchmark). Experiments were conducted on Intel Xeon dual-CPU 1.7 GHz machines with a 256 KB 8-way set-associative L2 cache and 64-byte lines, running Linux.
Objectives:
- Prediction of the cache miss ratios of the corresponding applications.
- Predictions across different memory hierarchies.
Trace slices were picked uniformly throughout the trace for all experiments.

Prediction of Cache Miss Ratios
Comparison of the data cache miss rates of the benchmarks and their corresponding memory skeletons. Average error < 5%; the IS application is an exception.
- Trace sampling ratio = 10%
- Trace slice size > 10 million references
- Number of slices picked > 10

Impact of Trace Slice Selection
Data cache miss rates for different sets of trace slices in skeletons. The traces for the IS, BT and MG benchmarks were divided into 100 uniform slices, and 10 different versions of skeletons were generated, each using a different set of 10 uniformly spaced trace slices.
Actual data cache miss rates: IS 3.9%, BT 2.76%, MG 1.57%.
- MG and BT skeletons have similar cache miss rates in all cases.
- IS shows significant variation in cache miss prediction across different sets of slices.
Reason: IS execution goes through different phases with different memory access behavior, unlike MG and BT.

Impact of Trace Slice Selection (contd.)
Data cache miss rates of IS skeletons with different sets of slices and different numbers of slices, compared against the true cache miss ratio.
- The greater the number of trace slices taken from an application trace, the smaller the size of each slice.
- A large number of slices captures the multi-phase behavior of applications.

Impact of Trace Slice Size
Error in cache miss prediction with memory skeletons for different trace slice sizes for MG.
- The cache miss ratio prediction error increases rapidly when slice sizes are reduced below a certain point.

Prediction Across Hardware Platforms
Cache miss comparison of the CG benchmark and its skeleton across different memory hierarchies. Cache miss ratios were predicted fairly accurately, with error < 5% across all machines.

Conclusion and Discussions
Presents a methodology to build memory skeletons for predicting application cache miss ratios across hardware platforms; a step toward building good performance skeletons. Extends our group's previous work on skeletons to memory characteristics.
Major contribution: low-overhead generation of memory accesses from a trace.

Conclusion and Discussions (contd.)
Limitations:
- Instruction references
- Space and time overhead
- Timing accuracy
- Integration with communication and CPU events