Memory Performance Profiling via Sampled Performance Monitor Event Traces Diana Villa, Patricia J. Teller, and Jaime Acosta The University of Texas at.

Presentation transcript:

Memory Performance Profiling via Sampled Performance Monitor Event Traces
Diana Villa, Patricia J. Teller, and Jaime Acosta, The University of Texas at El Paso, Department of Computer Science
Trevor Morgan, Exxon/Mobil
Bret Olszewski, IBM Corporation-Austin
5th Annual IBM Austin CAS Conference – 20 February 2004

Outline
- Motivation
- Data
  - Events Profiled
  - Information Collected
- Analysis
  - Approach
  - Performance Evaluation Framework
- Results
- Conclusions and Future Work

Motivation
- Overall research goal: general workload characterization model
- Project goal
  - Develop a performance evaluation framework to facilitate analysis of large sampled event traces
  - Study load access patterns of key applications
  - Identify and remedy performance impediments

Data Collection Environment
- IBM eserver p-Series 690 architecture
- 8- and 32-processor configurations
- TPC-C benchmark
- Data collected via event trace sampling:
  - Timestamp
  - Effective instruction and data addresses
  - CPU id
  - Process id
  - Thread id
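
The per-sample fields listed on the Data Collection Environment slide could be modeled as a small record type. The following is a minimal Python sketch, not the authors' tooling: the slides do not give the trace file format, so the whitespace-separated text layout, the field order, and the names `Sample` and `parse_sample` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One sampled performance monitor event (field set taken from the slide)."""
    timestamp: int
    cpu_id: int
    pid: int
    tid: int
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def parse_sample(line: str) -> Sample:
    """Parse one trace line.

    ASSUMPTION: a hypothetical layout of six whitespace-separated fields,
    'timestamp cpu pid tid instr_addr data_addr', with addresses in hex.
    """
    ts, cpu, pid, tid, iaddr, daddr = line.split()
    return Sample(int(ts), int(cpu), int(pid), int(tid),
                  int(iaddr, 16), int(daddr, 16))


if __name__ == "__main__":
    # Made-up example line, not real trace data.
    print(parse_sample("1000 3 4242 17 0x10001a2c 0x30004f80"))
```

Keeping the addresses as integers makes later grouping by segment, page, or cache line a matter of shifts and masks.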

Platform - 1
[Figure: 8-processor p690 configuration, showing MCM 0 and MCM 1 with their processors and the L2 and L3 caches]

Platform - 2
[Figure: 32-processor p690 configuration, showing MCM 0 through MCM 3 with their processors and the L2 and L3 caches]

Events
Resolution of L2-cache data-load misses
- L2.5
  - L2.5 shared
  - L2.5 modified
- L2.75
  - L2.75 shared
  - L2.75 modified
- L3
- L3.5

L2.5
Penalty: 73 cycles
[Figure: 8-processor p690 diagram, MCM 0 and MCM 1, with L2 and L3 caches]

L2.75
Penalty: 96 cycles
[Figure: 8-processor p690 diagram, MCM 0 and MCM 1, with L2 and L3 caches]

L3
Penalty: 112 cycles
[Figure: 8-processor p690 diagram, MCM 0 and MCM 1, with L2 and L3 caches]

L3.5
Penalty: 143 cycles
[Figure: 8-processor p690 diagram, MCM 0 and MCM 1, with L2 and L3 caches]
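
The four penalties on the preceding slides (73, 96, 112, and 143 cycles) can be combined with per-source sample counts to estimate where load-miss cycles go. The sketch below is an illustration only: the function name `penalty_summary` and the example counts are hypothetical, and the shared and modified variants of L2.5 and L2.75 are assumed to carry the same penalty as their parent level, since the slides give only one figure per level.

```python
# Penalties in cycles, as stated on the slides.
PENALTY_CYCLES = {"L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def penalty_summary(sample_counts):
    """Return (total penalty cycles, mean penalty per sampled miss).

    sample_counts maps a resolution source ("L2.5", "L2.75", "L3", "L3.5")
    to the number of sampled L2 data-load misses resolved there.
    """
    total_samples = sum(sample_counts.values())
    total_cycles = sum(PENALTY_CYCLES[src] * n
                       for src, n in sample_counts.items())
    mean = total_cycles / total_samples if total_samples else 0.0
    return total_cycles, mean


if __name__ == "__main__":
    # Hypothetical counts for illustration; not measured data.
    counts = {"L2.5": 10, "L2.75": 5, "L3": 20, "L3.5": 5}
    print(penalty_summary(counts))  # (4165, 104.125)
```

Because the counts come from sampling, the totals are estimates proportional to the sampling rate rather than exact cycle counts.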

Analysis
Identify application-specific sources of performance degradation associated with data references
[Figure: drill-down hierarchy. Level of memory hierarchy -> address space region (kernel, …, text, buffer pool, data/bss/heap, …) -> segment -> page -> page offset / cache line]
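
The drill-down from segment to page to cache line amounts to slicing each sampled data address into bit fields. The following Python sketch shows one way to do that. The slides do not state the sizes involved, so the 256 MB segment, 4 KB page, and 128-byte cache line used here are assumptions, as is the function name `decompose`.

```python
SEGMENT_BITS = 28  # ASSUMPTION: 256 MB segments
PAGE_BITS = 12     # ASSUMPTION: 4 KB pages
LINE_BITS = 7      # ASSUMPTION: 128-byte cache lines


def decompose(addr):
    """Split an effective address into (segment, page, line, page_offset).

    segment     : segment number
    page        : page index within the segment
    line        : cache-line index within the page
    page_offset : byte offset within the page
    """
    segment = addr >> SEGMENT_BITS
    pages_per_segment_mask = (1 << (SEGMENT_BITS - PAGE_BITS)) - 1
    page = (addr >> PAGE_BITS) & pages_per_segment_mask
    page_offset = addr & ((1 << PAGE_BITS) - 1)
    line = page_offset >> LINE_BITS
    return segment, page, line, page_offset


if __name__ == "__main__":
    # Made-up address for illustration.
    print(decompose(0x30004F80))  # (3, 4, 31, 3968)
```

Counting samples per `(segment, page)` or per `(segment, page, line)` tuple then gives the kind of per-page and per-cache-line breakdown the Results slides are titled after.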

Performance Evaluation Framework
[Figure: data flow. Data collection environment (p690, TPC-C) -> sampled event traces (PID, TID, timestamp, instruction address, data address) -> Load DB Java tool -> database -> Report Generation Java tool -> reports and graphs (buffer pool; data, BSS, heap; kernel)]
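
The load-then-report flow in the framework diagram can be illustrated with a small database round trip. The authors' framework uses Java tools and an unspecified database; the sketch below substitutes Python and an in-memory SQLite database purely for illustration. The table schema, the pre-assigned `region` label, and the function names `load_samples` and `region_report` are hypothetical.

```python
import sqlite3


def load_samples(conn, rows):
    """Load sampled events into a table (stand-in for the 'Load DB' step).

    Each row is (pid, tid, timestamp, instr_addr, data_addr, region).
    ASSUMPTION: region has already been assigned from the data address.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "pid INTEGER, tid INTEGER, ts INTEGER, "
        "instr_addr INTEGER, data_addr INTEGER, region TEXT)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def region_report(conn):
    """Count samples per address-space region (stand-in for report generation)."""
    cur = conn.execute(
        "SELECT region, COUNT(*) AS n FROM samples "
        "GROUP BY region ORDER BY n DESC, region"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Made-up rows for illustration; not measured data.
    load_samples(conn, [
        (4242, 17, 1000, 0x10001A2C, 0x30004F80, "buffer pool"),
        (4242, 17, 1010, 0x10001A30, 0x30005000, "buffer pool"),
        (4243, 21, 1020, 0x00012340, 0x00200010, "kernel"),
    ])
    print(region_report(conn))  # [('buffer pool', 2), ('kernel', 1)]
```

Storing the raw samples once and deriving each report with a query keeps large traces out of application memory, which matches the stated goal of facilitating analysis of large sampled event traces.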

Results

Results - Memory Regions

Results - L3 Cache

Results - Segment

Results - Pages

Results - Cache Lines

Results - Instructions

Lock Operations      Atomic Operations
simple_lock          fetch_and_add
simple_lock_ppc      fetch_and_add_h
simple_unlock        fetch_and_addlp
disable_lock         fetch_and_or
unlock_enable        fetch_and_orlp
simple_unlock_mem    fetch_and_and
unlock_enable_mem    fetch_and_andlp

Conclusions
- Targets for performance improvement of TPC-C are associated mainly with two regions of the address space:
  - buffer pool
  - data, bss, heap
- TPC-C lock instructions are not key to performance degradation
- 8- and 32-processor data have the same reference pattern; thus, a model of TPC-C memory access may be possible

Future Work
- Suggest ways to improve performance of applications executed on p690
- Enhance performance evaluation framework
- Quantify representativeness of sampled event traces
- Expand study of application data load behavior
  - Process characterization
  - Process migration
  - Other performance issues
    - Compulsory vs. capacity/conflict misses
    - False sharing
    - Contention for resources
- Develop synthetic applications that mimic the behavior of key p690 applications; use these to study application behavior and experiment with modifications to applications that may affect performance

Questions?