ICPADS ’04, Session 7C, July 9, 2004
A Framework for Profiling Multiprocessor Memory Performance
Diana Villa, Jaime Acosta, Patricia J. Teller
The University of Texas at El Paso, Department of Computer Science
Bret Olszewski
IBM Corporation – Austin, TX
Outline
  Motivation
  Data Collection Environment
    Workload & Platform
    Monitored Events
    Sampled Event Traces
  Performance Evaluation Framework
  Data Analysis & Results
  Conclusions and Future Work
Motivation
  Modern systems
    Performance governed by memory subsystem
  SMPs
    Deeper and larger memory hierarchies
  Performance analysis considerations
    Time to results and size of data set
  Goal
    Develop a new performance analysis methodology
Data Collection Environment
  Workload
    TPC-C benchmark
    Commercial OLTP
  Platform
    IBM eServer pSeries 690 architecture (p690)
    8- and 32-processor configurations
Platform
  [Figure: 8-processor p690 configuration, showing MCM 0 and MCM 1 with processors (P), L2, and L3]
Platform
  [Figure: 32-processor p690 configuration, showing MCM 0 through MCM 3 with processors (P), L2, and L3]
Monitored Events
  L1-cache data-load miss
    L2
  L2-cache data-load misses
    L2.5
    L2.75
    L3
    L3.5
    MEM
L2
  Penalty: 12 cycles
L2.5
  Penalty: 73 cycles
L2.75
  Penalty: 96 cycles
L3
  Penalty: 112 cycles
L3.5
  Penalty: 143 cycles
  [Figure on each of these slides: 8-processor p690 configuration, MCM 0 and MCM 1, with processors (P), L2, and L3]
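The per-level penalties listed above can be combined with per-level sample counts to estimate a mean data-load penalty. The following is a minimal sketch, not the authors' tooling: the function name `mean_penalty` and the example counts are hypothetical, and MEM is left out because the slides give no penalty for it.

```python
# Penalties in cycles, as listed on the slides. MEM is omitted because
# the slides do not state a penalty for it.
PENALTY_CYCLES = {"L2": 12, "L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def mean_penalty(counts):
    """Return the sample-weighted mean penalty in cycles.

    `counts` maps a level name to a number of samples. Levels without a
    known penalty (for example MEM) are ignored.
    """
    known = {lvl: n for lvl, n in counts.items() if lvl in PENALTY_CYCLES}
    total = sum(known.values())
    if total == 0:
        raise ValueError("no samples with a known penalty")
    return sum(PENALTY_CYCLES[lvl] * n for lvl, n in known.items()) / total


# Hypothetical counts, for illustration only (not measured data).
example_counts = {"L2": 50, "L3": 50}
print(mean_penalty(example_counts))
```

Weighting by sample count assumes the samples are representative of all data loads at each level, which the talk lists as something still to be quantified.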
Data Collection
  10-minute observation interval
  Performance Monitoring Unit (PMU)
    Special-purpose registers
  Programming interface
    Kernel extension
  eprof
    PMU configuration
    Event-based sampling
Sampled Event Traces
  Sampling
    Record periodic occurrences of an event
    100 events/sec/CPU
  Event record
    A8C, PID, TID, Timestamp, Effective Instruction Address, Effective Data Address
  Average number of samples collected/event
    238,448 for 8-processor data
    212,396 for 32-processor data
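The idea of recording periodic occurrences of an event can be illustrated in software. This is only a sketch under assumptions: the real sampling is done by the PMU and eprof, whose interfaces are not shown on the slides, and the names `EventRecord` and `sample_every_nth` are hypothetical. The record fields mirror those listed on the slide.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class EventRecord:
    """One sampled occurrence of a monitored event (hypothetical layout)."""
    pid: int
    tid: int
    timestamp: int
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def sample_every_nth(events: Iterable[EventRecord], period: int) -> Iterator[EventRecord]:
    """Yield every `period`-th occurrence, emulating event-based sampling."""
    if period < 1:
        raise ValueError("period must be at least 1")
    for i, ev in enumerate(events, start=1):
        if i % period == 0:
            yield ev


# Ten synthetic occurrences with timestamps 1..10, sampled every third one.
occurrences = [EventRecord(1, 1, t, 0x1000, 0x2000) for t in range(1, 11)]
sampled = list(sample_every_nth(occurrences, 3))
print([ev.timestamp for ev in sampled])
```

Keeping only every N-th occurrence is what keeps the trace size manageable; the cost is that conclusions depend on how representative the retained samples are.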
Performance Framework
  [Figure: data flow through the framework]
  Data Collection Environment (p690, TPC-C)
  Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.)
  Database Load (Java tool)
  DB
  Report Generation (Java tool)
  Reports and Graphs (Buffer Pool; Data,BSS,Heap; Kernel)
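The flow on this slide (sampled event traces loaded into a database, then queried to produce reports) can be sketched briefly. The framework itself uses MySQL and Java tools; the sketch below swaps in Python's built-in `sqlite3` module so that it runs without a server. The table layout, the function names, and the sample rows are all hypothetical.

```python
import sqlite3


def load_samples(conn, rows):
    """Create the (hypothetical) samples table if needed and insert rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "event TEXT, pid INTEGER, tid INTEGER, timestamp INTEGER, "
        "instr_addr INTEGER, data_addr INTEGER)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def samples_per_event(conn):
    """Report: number of samples per monitored event, ordered by event name."""
    cur = conn.execute(
        "SELECT event, COUNT(*) FROM samples GROUP BY event ORDER BY event"
    )
    return cur.fetchall()


# Synthetic rows for illustration only (not measured data).
conn = sqlite3.connect(":memory:")
rows = [
    ("L2.5", 101, 1, 1000, 0x1000, 0x20000000),
    ("L3", 101, 1, 1010, 0x1004, 0x20000040),
    ("L3", 102, 7, 1020, 0x2000, 0x70000000),
]
load_samples(conn, rows)
print(samples_per_event(conn))
```

Putting the traces in a relational database means each new report is one more query rather than a new pass over the raw trace files.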
Data Analysis - 1
  Overall goal
    Study effectiveness of p690 memory hierarchy
    Characterize differences between private and shared data loads
    Track missing L2-cache lines across levels of the p690 memory hierarchy
  Studied address regions
    Referenced by 90% of L2-cache data-load misses
    Private: Data,BSS,Heap
    Shared: Buffer Pool
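Attributing a sampled data address to one of the studied regions, and measuring what share of samples those regions cover, might look like the sketch below. The address ranges are invented for illustration, since the slides do not give the actual region boundaries, and the names `classify` and `coverage` are hypothetical.

```python
# Hypothetical, non-overlapping address ranges (start inclusive, end exclusive).
# The real region boundaries are not given on the slides.
REGIONS = [
    ("Data,BSS,Heap", 0x2000_0000, 0x3000_0000, "private"),
    ("Buffer Pool", 0x7000_0000, 0x8000_0000, "shared"),
]


def classify(data_addr):
    """Return (region name, sharing class) for an effective data address."""
    for name, start, end, sharing in REGIONS:
        if start <= data_addr < end:
            return name, sharing
    return "other", "unknown"


def coverage(data_addrs):
    """Fraction of sampled addresses that fall inside a studied region."""
    addrs = list(data_addrs)
    if not addrs:
        raise ValueError("no addresses given")
    inside = sum(1 for a in addrs if classify(a)[0] != "other")
    return inside / len(addrs)


# Synthetic addresses for illustration only.
print(classify(0x2000_0010))
print(coverage([0x2000_0000, 0x7000_0000, 0x10, 0x7FFF_FFFF]))
```

A coverage value computed this way over the real traces is the kind of figure behind the statement that the studied regions are referenced by 90% of L2-cache data-load misses.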
Data Analysis - 2
  Private data loads
    Accessible only to owner process
    Examples: process's return stack, local variables
    Ideal: Remain close to executing processor
  Shared data loads
    Accessible by every TPC-C process
    Examples: application code, global variables
    Ideal: Remain in higher levels of memory hierarchy
Results: 32-Processor Data
  [Charts on each results slide: Private and Shared panels]
  Good Application/Architecture Match
  Possible Performance Impediment
  Shared Data References More Localized than Private Data References
  MEM Data Load Hits Primarily Due To Compulsory Misses
Conclusions - 1
  Developed new performance evaluation framework
    Applicable to large SMP systems
    Sampled performance monitor event traces
      Manageable, collected in real time
    Core: database management system (MySQL), Java tools
  Applied methodology to study memory-subsystem behavior
    TPC-C executing on p690
    Evaluated differences between private and shared data loads
Conclusions - 2
  References for private data
    Satisfied within the MCM
    Good application/architecture match
  References for shared data
    Referenced outside the MCM
    Increased locality of reference
    Target for performance improvement
  Main memory accesses primarily associated with compulsory misses
Future Work
  Quantify representativeness of sampled event traces
  Enhance performance evaluation framework
  Expand study of application data load behavior
    e.g., process characterization
  Suggest ways to improve performance of TPC-C executing on p690
    Improved memory management of Buffer Pool resulting in performance improvements
    Track performance impediments to actual code and/or data structures
Thank You. Questions?