Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment


The memory hierarchy is one of the major bottlenecks to good program performance. This has motivated the search for ways to capture the memory performance of an application/machine pair that are practical in terms of time and space, yet detailed enough to yield useful and relevant information. The strategy that we endorse periodically samples events during program execution, producing an event trace that is both manageable and informative. We also developed a fast and flexible performance evaluation framework with which to analyze and understand the performance data contained in the sampled event traces.

We have shown the potential of this methodology by using it to analyze a disparate set of performance issues for large, complex applications running on a multiprocessor system, including memory access performance, process migration, compulsory and conflict misses, and false sharing. To date, we have studied the memory subsystem performance of several complex applications, including the TPC-C and SPECsfs benchmarks, executing on different configurations of the IBM eServer pSeries 690.

We have also begun to investigate the effectiveness of our performance evaluation framework for studying memory subsystem performance in a virtualized environment. Virtualization allows multiple execution environments to time-share the same physical hardware in an effort to increase machine utilization. However, sharing a fixed set of hardware resources carries an inherent performance overhead. The goal of our work is to identify and analyze that overhead using our performance evaluation framework.
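The sampling strategy described above (record only a periodic occurrence of a monitored event, one trace per CPU) can be illustrated with a minimal Python sketch. This is not the authors' collection tool: the `EventRecord` layout mirrors the event record fields listed under Data Collection, while the `cpu` field, the `sample_events` and `synthetic_events` helpers, and the sampling period are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EventRecord:
    """One monitored event. The cpu field is an assumption used to pick the trace."""
    cpu: int
    pid: int
    tid: int
    timestamp: int
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def sample_events(events: Iterable[EventRecord],
                  period: int) -> Dict[int, List[EventRecord]]:
    """Keep every `period`-th occurrence of the event on each CPU.

    Returns one sampled event trace per CPU, keyed by CPU number.
    """
    if period < 1:
        raise ValueError("period must be >= 1")
    counters: Dict[int, int] = {}
    traces: Dict[int, List[EventRecord]] = {}
    for ev in events:
        counters[ev.cpu] = counters.get(ev.cpu, 0) + 1
        if counters[ev.cpu] % period == 0:
            traces.setdefault(ev.cpu, []).append(ev)
    return traces


def synthetic_events(n: int, cpus: int = 4) -> List[EventRecord]:
    """Generate a made-up event stream (hypothetical data, demonstration only)."""
    return [
        EventRecord(cpu=i % cpus, pid=100 + i % 3, tid=1000 + i % 5,
                    timestamp=i, instr_addr=0x10000000 + 4 * (i % 64),
                    data_addr=0x20000000 + 128 * i)
        for i in range(n)
    ]


if __name__ == "__main__":
    traces = sample_events(synthetic_events(4000), period=100)
    for cpu, trace in sorted(traces.items()):
        print(f"CPU {cpu}: {len(trace)} sampled records")
```

A larger period yields a smaller trace at the cost of detail, which is the time/space versus information trade-off the abstract refers to.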
To date, we have studied the memory subsystem performance of TRADE3, an on-line stock brokerage application, executing on different configurations of the IBM eServer p5 570, a commercial server designed to support virtualization.

Department of Computer Science
Diana Villa, Ph.D. Candidate
Mitesh Meswani, Ph.D. Candidate
Dr. Patricia Teller, Professor

Bret Olszewski
Mala Anand
Carole Gottlieb
Austin, TX

1. Data Collection

Environment
- IBM eServer p5 570 (p570) architecture
- 1.65 GHz POWER5 processor
- 4-processor configuration

Workload
- TRADE3
- On-line stock brokerage application
- Three-tier configuration: WebSphere, DB2, application code

Data
- Collected via event-based sampling (records a periodic occurrence of the monitored event)
- Organized as sampled event traces (one per CPU)
- Event record fields: PID, TID, Timestamp, Effective Instruction Address, Effective Data Address

2. Events Profiled

- L2-cache data load misses: require the CPU to access off-chip memory to be resolved
- Classified according to the level at which they are resolved and the state of the requested block

[Figure: 4-processor configuration of the p570. DCM 0 and DCM 1, each with P P, L2, L3, and MEM. Labeled resolution sites: L2.75 (different DCM), L3, L3.75 (different DCM), LMEM, LMEM (different DCM).]

[Table: Load Latencies of 4-processor Configuration. Columns: L2-Cache Access Resolution Site, Load Latency. Sites listed: L3.75 cache, L2.75 cache, L3 cache, L2 cache, LMEM, RMEM. Latencies listed: 14, 91, 121, 205, 281, and 307 cycles. The pairing of sites to latencies is not preserved in this transcript.]

3. Performance Framework

- MySQL databases catalog and store sampled event traces
- Java tools interface with the databases to load sampled event traces and run queries

[Figure: framework components. Data collection environment (TRADE3, p570), sampled event traces, database load (Java tool), DB, report generator (Java tool), reports, graphs. Record fields: PID, TID, Timestamp, InstrAddr, DataAddr. Address regions: BufferPool; Data, BSS, Heap; Kernel.]

4. Virtualization

- Virtualize resources to facilitate time-sharing of the hardware by different execution environments
- Emergence of virtualization technology in new environments (e.g., newer architectures, open source)
- The POWER Hypervisor facilitates resource sharing and supports as many as 254 active partitions

5. Data Analysis and Results

- Performance overhead associated with virtualization is due to sharing a fixed set of hardware resources
- Goal: observe differences in data-load behavior that could represent the performance overhead
- Compared executions of TRADE3 in non-virtualized (1P) and virtualized (5P) environments
- Observed an increased locality of reference for 5P data loads in memory
- This indicates a possible increase in capacity/conflict misses in the 5P case due to contention for hardware resources

[Figure: partitions APP 1/OS 1, APP 2/OS 2, APP 3/OS 3, APP 4/OS 4, ... APP N/OS N running on the POWER Hypervisor, above DCM 0 and DCM 1 (each with P P, L2, L3, MEM).]

Publications

- Villa, D., Meswani, M., Teller, P.J., and Olszewski, B., "Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment", to appear in the Proceedings of the 1st International Workshop on Operating System Interference in High Performance Applications, September 2004, St. Louis, MO.
- Portillo, R., Villa, D., Teller, P.J., and Olszewski, B., "Mining Performance Data from Sampled Event Traces", Proceedings of the 6th Annual Austin Center for Advanced Studies (CAS) Conference, February 2005, Austin, TX.
- Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "Memory Performance Profiling via Sampled Performance Monitor Event Traces", Proceedings of the 5th Annual Los Alamos Computer Science Institute Symposium (LACSI), October 2004, Santa Fe, NM.
- Portillo, R., Villa, D., Teller, P.J., and Olszewski, B., "Mining Performance Data from Sampled Event Traces", Proceedings of the 12th Annual Meeting of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, Volendam, The Netherlands.
- Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "A Framework for Profiling Multiprocessor Memory Performance", Proceedings of the 10th International Conference on Parallel and Distributed Systems (ICPADS), July 2004, Long Beach, CA.
- Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "Memory Performance Profiling via Sampled Performance Monitor Event Traces", Proceedings of the 5th Annual Austin Center for Advanced Studies (CAS) Conference, February 2004, Austin, TX.
- Villa, D. (2003). Using Sampled Performance Monitor Event Traces to Characterize Application Behavior. Unpublished master's thesis, The University of Texas at El Paso, El Paso, TX.
- Morgan, T., Villa, D., Teller, P.J., Olszewski, B., and Acosta, J., "L2 Miss Profiling on the p690 for a Large-scale Database Application", Proceedings of the 4th Annual Austin Center for Advanced Studies (CAS) Conference, February 2003, Austin, TX.
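
Supplementary sketch for the Performance Framework panel: the poster describes MySQL databases that store sampled event traces and Java tools that load them and run queries. The Python sketch below shows the same load-then-query idea in miniature. It swaps in the standard-library `sqlite3` module for MySQL and Python for the Java tools, so it is not the authors' framework. The `samples` table, its column names, the `cpu` column, the 128-byte line size, and the demo records are assumptions made for illustration, and counting distinct cache lines is only one crude, hypothetical way to look at locality of reference.

```python
import sqlite3
from typing import Iterable, List, Tuple

# (cpu, pid, tid, timestamp, instr_addr, data_addr); layout is an assumption
Record = Tuple[int, int, int, int, int, int]

SCHEMA = """
CREATE TABLE samples (
    cpu INTEGER, pid INTEGER, tid INTEGER, timestamp INTEGER,
    instr_addr INTEGER, data_addr INTEGER
)
"""

# Made-up sampled L2 data-load miss records (hypothetical values).
DEMO_RECORDS: List[Record] = [
    (0, 101, 1, 10, 0x1000, 0x20000000),
    (0, 101, 1, 20, 0x1004, 0x20000040),  # same 128-byte line as the row above
    (0, 102, 2, 30, 0x2000, 0x20000080),
    (1, 101, 3, 15, 0x1000, 0x30000000),
    (1, 103, 4, 25, 0x3000, 0x30000100),
]


def load_trace(conn: sqlite3.Connection, records: Iterable[Record]) -> None:
    """Create the samples table and bulk-insert the sampled event records."""
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)", records)
    conn.commit()


def misses_per_pid(conn: sqlite3.Connection) -> List[Tuple[int, int]]:
    """Count sampled misses per process, most frequent first."""
    cur = conn.execute(
        "SELECT pid, COUNT(*) AS n FROM samples "
        "GROUP BY pid ORDER BY n DESC, pid"
    )
    return cur.fetchall()


def distinct_lines_per_cpu(conn: sqlite3.Connection,
                           line_bytes: int = 128) -> List[Tuple[int, int]]:
    """Count distinct cache lines touched by the sampled misses on each CPU.

    Integer division of the data address by the line size yields a line
    number; fewer distinct lines for the same number of samples suggests
    higher locality of reference.
    """
    cur = conn.execute(
        "SELECT cpu, COUNT(DISTINCT data_addr / ?) FROM samples "
        "GROUP BY cpu ORDER BY cpu",
        (line_bytes,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_trace(conn, DEMO_RECORDS)
    print("misses per PID:", misses_per_pid(conn))
    print("distinct lines per CPU:", distinct_lines_per_cpu(conn))
```

Keeping the traces in a relational store means each new question (per process, per CPU, per address region) becomes another query rather than another pass over raw trace files.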