Mining Performance Data from Sampled Event Traces
Bret Olszewski, IBM Corporation, Austin, TX
Ricardo Portillo, Diana Villa, Patricia J. Teller, The University of Texas at El Paso, Department of Computer Science

Outline
- Motivation
- Data Collection Environment
  - Workload & Platform
  - Monitored Events
- Data Analysis & Results
- Conclusions and Future Work

Motivation
- Capturing event traces
  - System simulation: overhead penalty is too high
  - Real-time metrics: capture every event during actual execution
- Problem: the growing size of full event traces is becoming unmanageable
- Goal: use sampled event traces to analyze execution behavior

Data Collection Environment
- Workload: TPC-C benchmark
  - Commercial
  - OLTP
- Platform: IBM eServer pSeries 690 architecture (p690)
  - 8- and 32-processor configurations

Platform: 8-processor p690 configuration
[Diagram: two MCMs (MCM 0 and MCM 1), showing processors (P), positions marked X, L2 caches, and L3]

Platform: 32-processor p690 configuration
[Diagram: four MCMs (MCM 0 to MCM 3), each showing processors (P), L2 caches, and L3]

Monitored Events
- L2-cache data-load misses
  - L2.5
  - L2.75
  - L3
  - L3.5
  - MEM

Where Is an L2 Miss Resolved?
[Diagram sequence over the 8-processor configuration (MCM 0 and MCM 1), highlighting in turn: L2, L2.5 event, L2.75 event, L3 event, L3.5 event]

Data Collection
- Performance Monitoring Unit (PMU)
  - Special-purpose registers
  - Programming interface
  - Kernel extension
- eprof
  - PMU configuration
  - Event-based sampling

Sampled Event Trace
- 10-minute observation interval
  - Record periodic occurrences of an event
  - 100 events/sec/CPU
- Event record: A8C | PID | TID | Timestamp | Effective Instruction Address | Effective Data Address
- Average number of samples collected per event
  - 238,448 for 8-processor data
  - 212,396 for 32-processor data
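The record layout above can be pictured as a small data structure. The following is a minimal sketch only: the text format (whitespace-separated fields, hexadecimal addresses), the class name `EventSample`, and the helpers `parse_sample` and `max_samples` are all hypothetical and are not the eprof output format. The last helper works through the arithmetic implied by the slide if the 100 events/sec/CPU figure is read as a per-CPU ceiling over the 10-minute interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSample:
    """One sampled event record (field names are illustrative)."""
    pid: int
    tid: int
    timestamp: int
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def parse_sample(line: str) -> EventSample:
    """Parse a hypothetical text record: 'PID TID TIMESTAMP EIA EDA'.

    Addresses are assumed to be hexadecimal; this is not the real eprof format.
    """
    pid, tid, ts, eia, eda = line.split()
    return EventSample(int(pid), int(tid), int(ts), int(eia, 16), int(eda, 16))


def max_samples(rate_per_sec_per_cpu: int, seconds: int, cpus: int) -> int:
    """Ceiling on samples per event if the rate is a per-CPU maximum."""
    return rate_per_sec_per_cpu * seconds * cpus


sample = parse_sample("4242 17 1000 0x10003f40 0x7000a080")
upper_8 = max_samples(100, 10 * 60, 8)    # 8-processor configuration
upper_32 = max_samples(100, 10 * 60, 32)  # 32-processor configuration
```

Under that reading, the reported averages (238,448 and 212,396 samples per event) sit below the ceilings of 480,000 and 1,920,000 respectively.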

Analysis
- Memory Hotspots
- Individual Address Region
- Process Migration

Memory Hotspots
- L3 and memory are the most active memory levels
- Counted the total number of L3 hits
- Counted the number of L3 hits per address region
- Counted the number of unique cache lines referenced per region
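The three counts listed for the hotspot analysis could be computed along the following lines. This is a sketch under stated assumptions, not the authors' tooling: the region boundaries, the `LOCK_TABLE` region name, and the 128-byte cache-line size are hypothetical values chosen for illustration, and the input is simply a list of effective data addresses from the sampled L3 trace.

```python
from collections import Counter, defaultdict

CACHE_LINE_BYTES = 128  # assumption: line size used to group addresses

# Hypothetical region map: name -> (start, end), end exclusive.
REGIONS = {
    "BUFFER_POOL": (0x7000_0000, 0x7800_0000),
    "LOCK_TABLE": (0x7800_0000, 0x7810_0000),
}


def region_of(addr: int) -> str:
    """Return the name of the region containing addr, or 'OTHER'."""
    for name, (start, end) in REGIONS.items():
        if start <= addr < end:
            return name
    return "OTHER"


def hotspot_summary(data_addrs):
    """Map each region to (sampled hits, unique cache lines referenced)."""
    hits = Counter()
    lines = defaultdict(set)
    for addr in data_addrs:
        region = region_of(addr)
        hits[region] += 1
        lines[region].add(addr // CACHE_LINE_BYTES)
    return {region: (hits[region], len(lines[region])) for region in hits}


# Five sampled L3 hits: three in the buffer pool (two share a cache line).
l3_samples = [0x7000_0000, 0x7000_0040, 0x7000_0080, 0x7800_0010, 0x1000]
summary = hotspot_summary(l3_samples)
total_l3_hits = sum(count for count, _ in summary.values())
```

A region with many hits but few unique cache lines is a candidate hotspot, since a small amount of data accounts for most of its references.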

[Figure: Memory Hotspots]

Individual Address Region
- We can look at an address region in more detail
- Looked at the Buffer Pool region
  - Counted the number of references per memory level
  - Counted the number of unique cache lines referenced per memory level
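For a single region, the per-level counts could be derived from one sampled trace per monitored event. The sketch below assumes the traces have already been reduced to lists of effective data addresses keyed by event name; the function name `per_level_profile`, the region bounds, and the 128-byte line size are illustrative assumptions.

```python
def per_level_profile(traces, start, end, line_bytes=128):
    """For each memory level, count sampled references and unique cache
    lines that fall inside the address range [start, end).

    traces: dict mapping an event name (e.g. "L3") to a list of effective
    data addresses sampled for that event.
    """
    profile = {}
    for level, addrs in traces.items():
        in_region = [a for a in addrs if start <= a < end]
        unique_lines = {a // line_bytes for a in in_region}
        profile[level] = (len(in_region), len(unique_lines))
    return profile


# Hypothetical sampled addresses for two of the monitored events.
traces = {
    "L3": [0x7000_0000] * 4 + [0x7000_0100],
    "MEM": [0x7000_0200, 0x9000_0000],
}
profile = per_level_profile(traces, 0x7000_0000, 0x7800_0000)
```

Comparing the two numbers per level shows how concentrated the references are: in this made-up input, five L3 references touch only two cache lines.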

Individual Address Region
[Figure: Distribution of Data Load Hits: BUFFER_POOL. Horizontal axis: Event Name (L2, L2.5 MOD, L2.75 MOD, L3, L3.5, MEM). Series: DataLoadHits, UniqueCacheLines]

Process Migration
- Process migration from one chip to another can degrade performance when all or part of the process's working set must follow, via L2-cache misses
- Looked at 885 threads
  - Counted the number of migrations per thread
  - Counted the number of L2.5 hits per thread
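Counting migrations per thread from a sampled trace could look like the sketch below. It relies on assumptions that the slides do not confirm: that each sample can be tagged with the CPU it was taken on (the listed record fields do not include a CPU identifier), and that CPUs map to chips by integer division with a hypothetical `cpus_per_chip` parameter.

```python
def chip_migrations(samples, cpus_per_chip=2):
    """Count chip-to-chip migrations per thread.

    samples: iterable of (timestamp, tid, cpu) tuples; the cpu tag and the
    cpu -> chip mapping (cpu // cpus_per_chip) are assumptions.
    Returns a dict mapping tid to the number of observed chip changes.
    """
    last_chip = {}
    migrations = {}
    for _ts, tid, cpu in sorted(samples):  # process in timestamp order
        chip = cpu // cpus_per_chip
        migrations.setdefault(tid, 0)
        if tid in last_chip and last_chip[tid] != chip:
            migrations[tid] += 1
        last_chip[tid] = chip
    return migrations


samples = [
    (3, 7, 2),  # thread 7 later seen on CPU 2 (chip 1)
    (1, 7, 0),  # thread 7 first seen on CPU 0 (chip 0)
    (2, 7, 1),  # still chip 0
    (1, 9, 4),  # thread 9 stays on chip 2
    (2, 9, 5),
]
migrations = chip_migrations(samples)
```

Because a thread is only observed when one of its events is sampled, counts obtained this way are a lower bound on the true number of migrations. Pairing each thread's count with its number of L2.5 hits gives the per-thread comparison described on the slide.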

[Figure: Process Migration]

Conclusions
- Only a few addresses in the Buffer Pool region cause most of its L3 hits
- For the Buffer Pool, heavily referenced shared data is constantly resolved outside an MCM
- Process migration is not a source of performance degradation

Future Work
- Quantify the representativeness of sampled event traces
- Suggest more ways to improve p690 application performance
- Study sampled event traces for other workloads
- In-depth study of process characterization

Thank You!