Mining Performance Data from Sampled Event Traces. Department of Computer Science, 6th Annual Austin CAS Conference – 24 February 2005. Ricardo Portillo, Diana Villa, Patricia J. Teller, The University of Texas at El Paso.

Presentation transcript:

Mining Performance Data from Sampled Event Traces
Ricardo Portillo, Diana Villa, Patricia J. Teller
The University of Texas at El Paso, Department of Computer Science
Bret Olszewski
IBM Corporation – Austin, TX
6th Annual Austin CAS Conference – 24 February 2005

Outline
• Motivation
• Data Collection Environment
  - Workload & Platform
  - Monitored Events
• Sampled Event Traces
• Data Analysis & Results
• Conclusions & Future Work

Motivation
• Capturing event traces
  - System simulation: high overhead
  - Real-time measurement: capture information about every event
• Problem
  - Unmanageable size of full event traces
• Goal
  - Use sampled event traces to analyze application behavior

Data Collection Environment
• Workload
  - TPC-C benchmark
  - Commercial, OLTP application
  - Oracle
• Platform
  - IBM eServer pSeries 690 architecture (p690)
  - 8- and 32-processor configurations

Platform: 8-processor p690 configuration
[Diagram: MCM 0 and MCM 1, each with processor positions marked P or X, L2, and L3]

Platform: 32-processor p690 configuration
[Diagram: MCM 0 through MCM 3, each with processors (P), L2, and L3]

Monitored Events
• L2-Cache Data Load Misses
  - L2.5, L2.75, L3, L3.5, MEM
• L1-Cache Data Load Misses
  - L2

Load Latencies
[Diagram: 8-processor p690 configuration (MCM 0, MCM 1), highlighting each resolution site in turn: L2, L2.5, L2.75, L3, L3.5, MEM]

Resolution site    Load latency
L2                 12 cycles
L2.5               [missing] cycles
L2.75              [missing] cycles
L3                 112 cycles
L3.5               [missing] cycles
MEM                320 cycles
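The latency figures can be turned into a rough penalty estimate by weighting each resolution site's sample count by its load latency. The following is a minimal sketch, not the authors' tool: the function name and the sample counts are hypothetical, and only the latencies that survive in the slide text (L2, L3, MEM) are used. Sites with missing latencies are reported back instead of being guessed.

```python
# Load latencies (in cycles) that survive in the slide text. The values for
# L2.5, L2.75 and L3.5 are missing from the transcript, so they stay None.
LATENCY_CYCLES = {
    "L2": 12,
    "L2.5": None,
    "L2.75": None,
    "L3": 112,
    "L3.5": None,
    "MEM": 320,
}


def estimated_penalty(sample_counts):
    """Return (total_cycles, skipped_sites) for a dict of site -> sample count.

    Sites whose latency is unknown are not guessed; they are returned in
    skipped_sites so the caller can see what the estimate leaves out.
    """
    total = 0
    skipped = []
    for site, count in sample_counts.items():
        latency = LATENCY_CYCLES.get(site)
        if latency is None:
            skipped.append(site)
        else:
            total += latency * count
    return total, sorted(skipped)


if __name__ == "__main__":
    # Hypothetical sample counts, for illustration only (not measured data).
    example_counts = {"L2": 1000, "L3": 100, "MEM": 10, "L2.5": 50}
    print(estimated_penalty(example_counts))
```

Because the traces are sampled, a total like this is only a relative indicator of where load penalty concentrates, not an absolute cycle count.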

Data Collection
• 10-minute observation interval
• Performance Monitoring Unit (PMU)
  - Special-purpose registers
  - Programming interface
  - Kernel extension
• eprof
  - PMU configuration
  - Event-based sampling

Sampled Event Traces
• Sampling
  - Record periodic occurrences of an event
  - 100 events/sec/CPU
• Event record
  - PID, TID, Timestamp, Effective Instruction Address, Effective Data Address
• Average number of samples collected/event
  - 238,448 for 8-processor data
  - 212,396 for 32-processor data
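The event record can be modeled as a small immutable structure. The sketch below is illustrative only: the whitespace-separated text layout, the hexadecimal address encoding, and the names `EventRecord`, `parse_record`, and `max_samples` are assumptions, since the slides give the field names but not the on-disk format. The `max_samples` helper computes an upper bound on samples per event, assuming the 100 events/sec/CPU rate acts as a cap over the 10-minute observation interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    """One sampled event, with the five fields named on the slide."""
    pid: int
    tid: int
    timestamp: int
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def parse_record(line):
    """Parse one record from a hypothetical whitespace-separated text layout.

    Assumed layout: PID TID TIMESTAMP INSTR_ADDR DATA_ADDR, with both
    addresses written in hexadecimal.
    """
    pid, tid, timestamp, instr_addr, data_addr = line.split()
    return EventRecord(
        pid=int(pid),
        tid=int(tid),
        timestamp=int(timestamp),
        instr_addr=int(instr_addr, 16),
        data_addr=int(data_addr, 16),
    )


def max_samples(rate_per_sec_per_cpu, seconds, cpus):
    """Upper bound on samples per event if the sampling rate is a cap."""
    return rate_per_sec_per_cpu * seconds * cpus


if __name__ == "__main__":
    print(parse_record("4242 7 123456 10002000 7000A8C0"))
    print(max_samples(100, 10 * 60, 8))   # 8-processor configuration
    print(max_samples(100, 10 * 60, 32))  # 32-processor configuration
```

Under that assumption the bounds are 480,000 samples for 8 processors and 1,920,000 for 32, so both reported averages fall below them.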

Performance Framework
[Diagram: Data Collection Environment (p690, TPC-C) produces Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.); a Java tool performs the Database Load into the DB; a Java tool performs Report Generation, producing Reports and Graphs. Address regions shown: BufferPool; Data, BSS, Heap; Kernel]

Data Analysis & Results
• Locality of reference at high-penalty resolution sites
• Characterization of differences between shared and private data loads
• Cost of process migration
• False sharing

Data Analysis & Results
Goal 1: Identify sources of application performance degradation
• Identify concentrated areas of locality of reference at high-penalty miss resolution sites
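One simple way to look for concentrated areas of locality is to bucket the sampled effective data addresses into fixed-size blocks and rank the blocks by sample count. This is a minimal sketch under stated assumptions: `hot_blocks` is a hypothetical helper, the 4096-byte block size is an arbitrary choice rather than a p690 parameter, and the slides do not say how the authors' tool performs this step.

```python
from collections import Counter


def hot_blocks(data_addrs, block_bytes=4096, top=3):
    """Rank fixed-size address blocks by how many samples fall into them.

    Each address is rounded down to the start of its block; the result is a
    list of (block_start_address, sample_count), most-sampled block first.
    """
    counts = Counter((addr // block_bytes) * block_bytes for addr in data_addrs)
    return counts.most_common(top)


if __name__ == "__main__":
    # Hypothetical addresses from a trace of one high-penalty event.
    addrs = [0x1000, 0x1008, 0x1FF0, 0x5000, 0x5010, 0x9000]
    for start, count in hot_blocks(addrs):
        print(hex(start), count)
```

Running the same ranking on the trace for each high-penalty resolution site (for example L3.5 or MEM) would show whether a few blocks account for most of the samples.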


Data Analysis & Results
Goal 2: Study effectiveness of design and policies associated with p690 memory hierarchy w.r.t. workload demands
• Characterize behavioral difference between private and shared data loads

[Figure: Private: distribution of data load hits (Data, BSS, Heap)]
[Figure: Shared: distribution of data load hits (Buffer Pool)]

Observations highlighted on the figures:
• Good application/architecture match
• Possible performance impediment
• MEM data load hits primarily due to compulsory misses
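The private/shared comparison amounts to classifying each sample by the address region of its effective data address and then computing, per region, the fraction of samples resolved at each site. The sketch below illustrates that idea only: the address ranges in `REGIONS` are made up, and `region_of` and `site_distribution` are hypothetical names, not part of the authors' framework.

```python
from collections import Counter, defaultdict

# Hypothetical address ranges (start inclusive, end exclusive). Real ranges
# would have to come from the process address-space layout on the system.
REGIONS = {
    "buffer pool": (0x7000_0000, 0x8000_0000),    # shared data
    "data,bss,heap": (0x2000_0000, 0x3000_0000),  # private data
}


def region_of(addr):
    """Return the name of the region containing addr, or 'other'."""
    for name, (low, high) in REGIONS.items():
        if low <= addr < high:
            return name
    return "other"


def site_distribution(samples):
    """Compute, per region, the fraction of samples resolved at each site.

    samples is an iterable of (resolution_site, effective_data_address).
    """
    counts = defaultdict(Counter)
    for site, addr in samples:
        counts[region_of(addr)][site] += 1
    return {
        region: {site: n / sum(c.values()) for site, n in c.items()}
        for region, c in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical samples, for illustration only (not measured data).
    samples = [
        ("L2", 0x2000_0010), ("L2", 0x2000_0020),
        ("L3", 0x2000_0030), ("MEM", 0x2000_0040),
        ("L3.5", 0x7000_0000), ("MEM", 0x7000_0100),
    ]
    for region, dist in site_distribution(samples).items():
        print(region, dist)
```

Comparing the two resulting distributions side by side mirrors the private and shared figures on the slide.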

Data Analysis & Results
Goal 3: Study “cost” of intra-MCM migrations
• Intra-MCM process migration overhead in terms of L2.5 data load hit events

Conclusions - 1
• Targets for performance improvement of TPC-C are associated mainly with two regions of the address space:
  - Buffer pool
  - Data, bss, heap
• References for private data
  - Satisfied within the MCM
  - Good application/architecture match
• References for shared data
  - Referenced outside the MCM
  - Target for performance improvement

Conclusions - 2
• Main memory accesses primarily associated with compulsory misses
• Intra-MCM process migration not a possible source of performance degradation
• Model of TPC-C memory access may be possible
• Similar reference patterns observed:
  - 8- and 32-processor TPC-C data
  - 8-processor TPC-C/Oracle and TPC-C/Sybase data

Future Work
• Suggest ways to improve p690 application performance
• Quantify representativeness of sampled event traces
• Expand study of application data load behavior, e.g., process characterization
• Develop synthetic applications
  - Mimic the behavior of key p690 applications
  - Use these to study application behavior
  - Experiment with modifications that may affect performance
• Enhance performance evaluation framework
  - Virtualization: study performance issues related to POWER5 virtualization

Thank You. Questions?