Mining Performance Data from Sampled Event Traces. Ricardo Portillo, Diana Villa, Patricia J. Teller (The University of Texas at El Paso, Department of Computer Science); Bret Olszewski (IBM Corporation – Austin, TX). 6th Annual Austin CAS Conference – 24 February 2005
Outline: Motivation; Data Collection Environment; Workload & Platform; Monitored Events; Sampled Event Traces; Data Analysis & Results; Conclusions & Future Work
Motivation. Capturing event traces: system simulation has high overhead; real-time measurement captures information about every event. Problem: full event traces grow to an unmanageable size. Goal: use sampled event traces to analyze application behavior.
Data Collection Environment. Workload: TPC-C benchmark (commercial OLTP application) on Oracle. Platform: IBM eServer pSeries 690 (p690) architecture, 8- and 32-processor configurations.
Platform: 8-processor p690 configuration. [Diagram: MCM 0 and MCM 1, showing processors (P), L2 caches, and L3 caches]
Platform: 32-processor p690 configuration. [Diagram: MCM 0, MCM 1, MCM 2, and MCM 3, showing processors (P), L2 caches, and L3 caches]
Monitored Events. L1-cache data load misses resolved in: L2. L2-cache data load misses resolved in: L2.5, L2.75, L3, L3.5, MEM.
Load Latencies (shown on the 8-processor p690 diagram, one resolution site at a time). L2: 12 cycles; L2.5: [value missing] cycles; L2.75: [value missing] cycles; L3: 112 cycles; L3.5: [value missing] cycles; MEM: 320 cycles.
Data Collection. 10-minute observation interval. Performance Monitoring Unit (PMU): special-purpose registers; programming interface. Kernel extension eprof: PMU configuration; event-based sampling.
Sampled Event Traces. Sampling: record periodic occurrences of an event, 100 events/sec/CPU. Event record fields: PID, TID, Timestamp, Effective Instruction Address, Effective Data Address. Average number of samples collected per event: 238,448 for 8-processor data; 212,396 for 32-processor data.
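The event record and the idea of keeping only periodic occurrences of an event can be sketched as follows. This is a minimal illustration, not the eprof implementation: the class name `EventRecord`, the helper `sample_every_nth`, the field types, and the sampling period are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class EventRecord:
    """One sampled event, with the five fields listed on the slide."""
    pid: int         # process ID
    tid: int         # thread ID
    timestamp: int   # time of the sample (unit is an assumption)
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def sample_every_nth(events: Iterable[EventRecord], period: int) -> Iterator[EventRecord]:
    """Keep one record per `period` occurrences (hypothetical sampling rule)."""
    if period <= 0:
        raise ValueError("period must be positive")
    for count, event in enumerate(events, start=1):
        if count % period == 0:
            yield event


# Made-up full trace of ten events; only every fourth one is kept.
full_trace = [
    EventRecord(pid=100, tid=1, timestamp=t,
                instr_addr=0x1000 + 4 * t, data_addr=0x7000_0000 + 128 * t)
    for t in range(10)
]
sampled = list(sample_every_nth(full_trace, period=4))
```

With a period of 4, the sketch keeps the 4th and 8th events and drops the rest, which is what keeps the trace size manageable.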
Performance Framework. [Diagram: the Data Collection Environment (p690, TPC-C) produces Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.); a Java tool (Load DB) loads them into a Database; a Java tool (Report Generation) produces Reports and Graphs. Address-space regions shown: Buffer Pool; Data, BSS, Heap; Kernel]
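The load-and-report flow in the framework can be sketched with an in-memory database. The framework on the slide uses Java tools; this Python sketch only illustrates the idea. The address ranges in `REGIONS`, the table layout, and the function names are hypothetical, since the slides do not give the real segment boundaries or schema.

```python
import sqlite3

# Hypothetical address ranges; the real segment boundaries are not in the slides.
REGIONS = [
    ("kernel", 0x0000_0000, 0x1000_0000),
    ("data_bss_heap", 0x2000_0000, 0x3000_0000),
    ("buffer_pool", 0x7000_0000, 0x8000_0000),
]


def classify(data_addr):
    """Map an effective data address to an address-space region name."""
    for name, start, end in REGIONS:
        if start <= data_addr < end:
            return name
    return "other"


def load_trace(conn, records):
    """Insert (pid, tid, ts, instr_addr, data_addr) tuples, tagging each with its region."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "pid INTEGER, tid INTEGER, ts INTEGER, "
        "instr_addr INTEGER, data_addr INTEGER, region TEXT)"
    )
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)",
        [(pid, tid, ts, ia, da, classify(da)) for pid, tid, ts, ia, da in records],
    )
    conn.commit()


def region_report(conn):
    """Return {region: sample count}."""
    cursor = conn.execute(
        "SELECT region, COUNT(*) FROM samples GROUP BY region ORDER BY region"
    )
    return dict(cursor.fetchall())


# Made-up records for illustration only.
records = [
    (100, 1, 10, 0x1000, 0x7000_0040),
    (100, 1, 20, 0x1004, 0x2000_0100),
    (101, 2, 30, 0x1008, 0x7000_0080),
    (101, 2, 40, 0x100C, 0x0000_2000),
]
conn = sqlite3.connect(":memory:")
load_trace(conn, records)
report = region_report(conn)
```

Tagging each sample with its region at load time keeps later reports to simple grouped queries.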
Data Analysis & Results: locality of reference at high-penalty resolution sites; characterization of differences between shared and private data loads; cost of process migration; false sharing.
Data Analysis & Results. Goal 1: Identify sources of application performance degradation. Identify concentrated areas of locality of reference at high-penalty miss resolution sites.
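One way to look for concentrated areas of locality is to bucket the sampled data addresses and rank the buckets by sample count. The sketch below is an assumption about how such a report could be built, not the authors' method; the 4 KiB bucket size, the function name `hot_pages`, and the sample addresses are hypothetical.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed bucket size; any power of two would work


def hot_pages(data_addrs, top=3):
    """Return the `top` buckets as (base address, samples, share of all samples)."""
    counts = Counter(addr // PAGE_SIZE * PAGE_SIZE for addr in data_addrs)
    total = sum(counts.values())
    return [(page, n, n / total) for page, n in counts.most_common(top)]


# Made-up data addresses of samples resolved at one high-penalty site.
addrs = [0x7000_0010, 0x7000_0FF0, 0x7000_0800,
         0x7000_1000, 0x7000_2008, 0x7000_0020]
hottest = hot_pages(addrs, top=1)
```

Here four of the six made-up samples fall in the bucket starting at 0x70000000, so that bucket would be the first candidate to inspect.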
Data Analysis & Results. Goal 2: Study the effectiveness of the design and policies of the p690 memory hierarchy with respect to workload demands. Characterize the behavioral difference between private and shared data loads. [Figures: Distribution of Data Load Hits for private data (Data, BSS, Heap) and for shared data (Buffer Pool)] Callouts: good application/architecture match; possible performance impediment; MEM data load hits are primarily due to compulsory misses.
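A distribution of data load hits per resolution site, split by private and shared data, could be computed along these lines. This is a sketch under stated assumptions: the input format (pairs of region label and resolution site), the function name `hit_distribution`, and all sample values are invented for illustration and are not the measured results.

```python
from collections import Counter, defaultdict

SITES = ["L2", "L2.5", "L2.75", "L3", "L3.5", "MEM"]


def hit_distribution(samples):
    """samples: iterable of (region, site). Returns {region: {site: fraction}}."""
    counts = defaultdict(Counter)
    for region, site in samples:
        counts[region][site] += 1
    dist = {}
    for region, per_site in counts.items():
        total = sum(per_site.values())
        dist[region] = {site: per_site[site] / total for site in SITES}
    return dist


# Made-up samples, not measured data.
samples = (
    [("private", "L2")] * 3 + [("private", "L3")]
    + [("shared", "L2.75")] * 2 + [("shared", "L3.5"), ("shared", "MEM")]
)
dist = hit_distribution(samples)
```

Normalizing within each region makes the private and shared distributions directly comparable even when the two regions have different sample counts.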
Data Analysis & Results. Goal 3: Study the "cost" of intra-MCM migrations. Intra-MCM process migration overhead is measured in terms of L2.5 data load hit events.
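Before migration cost can be attributed to L2.5 data load hits, migrations have to be found in the trace. The sketch below shows only that detection step, and it assumes things the slides do not state: that each sample can be tagged with the CPU it was collected on, that CPUs are numbered so that integer division by `cpus_per_mcm` gives the MCM, and that 4 CPUs per MCM applies. The function name and sample data are hypothetical.

```python
def find_migrations(samples, cpus_per_mcm=4):
    """samples: iterable of (timestamp, tid, cpu).

    Returns (tid, timestamp, from_cpu, to_cpu, kind) for each CPU change,
    where kind is "intra-MCM" or "inter-MCM".
    """
    last_cpu = {}
    migrations = []
    for ts, tid, cpu in sorted(samples):  # process in timestamp order
        prev = last_cpu.get(tid)
        if prev is not None and prev != cpu:
            same_mcm = prev // cpus_per_mcm == cpu // cpus_per_mcm
            kind = "intra-MCM" if same_mcm else "inter-MCM"
            migrations.append((tid, ts, prev, cpu, kind))
        last_cpu[tid] = cpu
    return migrations


# Made-up samples: thread 1 moves CPU 0 -> 1 -> 5, thread 2 stays on CPU 2.
samples = [(10, 1, 0), (20, 1, 1), (30, 1, 1), (40, 1, 5), (15, 2, 2), (25, 2, 2)]
migrations = find_migrations(samples)
```

Counting the L2.5 data load hit samples that follow each detected intra-MCM migration would then give a rough per-migration cost, though with sampled traces a migration is only visible once a sample lands on the new CPU.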
Conclusions - 1. Targets for performance improvement of TPC-C are associated mainly with two regions of the address space: the buffer pool, and data, BSS, and heap. References to private data: satisfied within the MCM; good application/architecture match. References to shared data: resolved outside the MCM; target for performance improvement.
Conclusions - 2. Main memory accesses are primarily associated with compulsory misses. Intra-MCM process migration is not a possible source of performance degradation. A model of TPC-C memory access may be possible; similar reference patterns were observed in: 8- and 32-processor TPC-C data; 8-processor TPC-C/Oracle and TPC-C/Sybase data.
Future Work. Suggest ways to improve p690 application performance. Quantify representativeness of sampled event traces. Expand study of application data load behavior, e.g., process characterization. Develop synthetic applications: mimic the behavior of key p690 applications; use these to study application behavior; experiment with modifications that may affect performance. Enhance performance evaluation framework. Virtualization: study performance issues related to POWER5 virtualization.
Thank You. Questions?