Session 7C, July 9, 2004, ICPADS ’04: A Framework for Profiling Multiprocessor Memory Performance. Diana Villa, Jaime Acosta, Patricia J. Teller, The University of Texas at El Paso.

Presentation transcript:

Session 7C, July 9, 2004, ICPADS ’04
A Framework for Profiling Multiprocessor Memory Performance
Diana Villa, Jaime Acosta, Patricia J. Teller
The University of Texas at El Paso, Department of Computer Science
Bret Olszewski
IBM Corporation – Austin, TX

Outline
- Motivation
- Data Collection Environment
  - Workload & Platform
  - Monitored Events
- Sampled Event Traces
- Performance Evaluation Framework
- Data Analysis & Results
- Conclusions and Future Work

Motivation
- Modern systems
  - Performance governed by the memory subsystem
- SMPs
  - Deeper and larger memory hierarchies
  - Performance analysis considerations: time to results and size of data set
- Goal
  - Develop a new performance analysis methodology

Data Collection Environment
- Workload
  - TPC-C benchmark: commercial OLTP
- Platform
  - IBM eServer pSeries 690 architecture (p690): 8- and 32-processor configurations

Platform
[Figure: 8-processor p690 configuration, showing processors (P), L2, and L3 on MCM 0 and MCM 1]

Platform
[Figure: 32-processor p690 configuration, showing processors (P), L2, and L3 on MCM 0, MCM 1, MCM 2, and MCM 3]

Monitored Events
- L2-cache data-load misses: L2.5, L2.75, L3, L3.5, MEM
- L1-cache data-load miss: L2

Penalty per monitored level (one slide per level; each repeats the p690 diagram with MCM 0 and MCM 1)

Level    Penalty (cycles)
L2       12
L2.5     73
L2.75    96
L3       112
L3.5     143
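The per-level penalties on the preceding slides can be combined with event counts to estimate an average data-load penalty. The following is a minimal sketch, not part of the original framework: the function name `mean_penalty` and the example counts are hypothetical, and MEM is left out because the slides give no penalty for it.

```python
# Penalties in cycles, as listed on the slides. MEM is omitted because
# the slides do not state a penalty for it.
PENALTY_CYCLES = {"L2": 12, "L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def mean_penalty(counts):
    """Return the count-weighted mean penalty in cycles.

    counts maps a level name (a key of PENALTY_CYCLES) to the number of
    sampled data loads attributed to that level.
    """
    unknown = set(counts) - set(PENALTY_CYCLES)
    if unknown:
        raise KeyError(f"no penalty listed for: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one event")
    weighted = sum(PENALTY_CYCLES[level] * n for level, n in counts.items())
    return weighted / total


# Hypothetical counts, for illustration only (not measured data).
example_counts = {"L2.5": 50, "L3": 50}
print(mean_penalty(example_counts))
```

A weighted mean of this kind only ranks levels by their contribution to stall time; it does not account for overlap between outstanding misses.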

Data Collection
- 10-minute observation interval
- Performance Monitoring Unit (PMU)
  - Special-purpose registers
  - Programming interface
- Kernel extension eprof
  - PMU configuration
  - Event-based sampling

Sampled Event Traces
- Sampling
  - Record periodic occurrences of an event
  - 100 events/sec/CPU
- Event record
  - PID, TID, Timestamp, Effective Instruction Address, Effective Data Address
- Average number of samples collected per event
  - 238,448 for 8-processor data
  - 212,396 for 32-processor data
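The event record described above can be modeled as a small immutable structure. This is a sketch under stated assumptions: the slides name the five fields but not their encoding, so the whitespace-separated text layout, the hexadecimal addresses, and the names `EventRecord` and `parse_record` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    """One sampled event, with the five fields named on the slide."""
    pid: int
    tid: int
    timestamp: int
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def parse_record(line: str) -> EventRecord:
    """Parse 'PID TID TIMESTAMP INSTR_ADDR DATA_ADDR' (assumed layout).

    PID, TID, and timestamp are read as decimal; both addresses are
    read as hexadecimal, with or without a 0x prefix.
    """
    pid, tid, timestamp, instr_addr, data_addr = line.split()
    return EventRecord(
        pid=int(pid),
        tid=int(tid),
        timestamp=int(timestamp),
        instr_addr=int(instr_addr, 16),
        data_addr=int(data_addr, 16),
    )


# Made-up sample line, for illustration only.
record = parse_record("4242 7 1000 0x10001a2c 0x7000beef")
print(record)
```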

Performance Framework
[Figure: framework data flow. Data Collection Environment (TPC-C on p690) -> Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.) -> Load DB (Java tool) -> Database -> Report Generation (Java tool) -> Reports and Graphs. Address-region labels in the figure: BufferPool; Data, BSS, Heap; Kernel.]
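The load-then-report flow of the framework can be sketched in a few lines. The slides describe Java tools and a MySQL database; the version below swaps in Python's standard `sqlite3` module purely as a stand-in, and the table layout, column names, and function names are assumptions rather than the authors' schema.

```python
import sqlite3


def load_samples(conn, rows):
    """Create the samples table if needed and bulk-insert the rows.

    Each row is (event, pid, tid, ts, instr_addr, data_addr).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "event TEXT, pid INTEGER, tid INTEGER, ts INTEGER, "
        "instr_addr INTEGER, data_addr INTEGER)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def samples_per_event(conn):
    """Return {event name: number of samples}, a minimal 'report'."""
    cursor = conn.execute(
        "SELECT event, COUNT(*) FROM samples GROUP BY event ORDER BY event"
    )
    return dict(cursor.fetchall())


# Made-up rows, for illustration only.
connection = sqlite3.connect(":memory:")
load_samples(connection, [
    ("L3", 1, 1, 10, 0x1000, 0x20000000),
    ("L3", 1, 2, 20, 0x1004, 0x20000080),
    ("MEM", 2, 1, 30, 0x2000, 0x70000000),
])
print(samples_per_event(connection))
```

Keeping the traces in a relational store means each new report is one more query, which fits the slide's separation of the load step from report generation.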

Data Analysis - 1
- Overall goal
  - Study effectiveness of the p690 memory hierarchy
    - Characterize differences between private and shared data loads
    - Track missing L2-cache lines across levels of the p690 memory hierarchy
- Studied address regions
  - Referenced by 90% of L2-cache data-load misses
  - Private: Data, BSS, Heap
  - Shared: Buffer Pool
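Attributing each sampled data address to a private or shared region can be sketched as a range lookup. The address boundaries below are invented for illustration (real ones would come from the process memory map), and the names `REGIONS`, `classify`, and `fraction_by_kind` are hypothetical.

```python
from collections import Counter

# Hypothetical half-open address ranges [start, end). The region names
# follow the slide; the numeric boundaries are made up.
REGIONS = [
    (0x1000_0000, 0x2000_0000, "Data,BSS,Heap", "private"),
    (0x7000_0000, 0x8000_0000, "Buffer Pool", "shared"),
]


def classify(data_addr):
    """Return (region name, kind) for an effective data address."""
    for start, end, name, kind in REGIONS:
        if start <= data_addr < end:
            return name, kind
    return "other", "unclassified"


def fraction_by_kind(data_addrs):
    """Return {kind: fraction of the sampled addresses of that kind}."""
    counts = Counter(classify(addr)[1] for addr in data_addrs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: n / total for kind, n in counts.items()}


# Made-up addresses, for illustration only.
sampled = [0x1000_0040, 0x7000_0000, 0x7000_0100, 0x9000_0000]
print(fraction_by_kind(sampled))
```

The "unclassified" share shows how many sampled misses fall outside the studied regions, which is the quantity behind a coverage figure such as the 90% on the slide.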

Data Analysis - 2
- Private data loads
  - Accessible only to owner process
  - Examples: process's return stack, local variables
  - Ideal: remain close to executing processor
- Shared data loads
  - Accessible by every TPC-C process
  - Examples: application code, global variables
  - Ideal: remain in higher levels of memory hierarchy

Results: 32-Processor Data
[Five chart slides comparing shared and private data loads; the charts are not in the transcript. Slide annotations, in order:]
- (no annotation)
- Good Application/Architecture Match
- Possible Performance Impediment
- Shared Data References More Localized than Private Data References
- MEM Data Load Hits Primarily Due To Compulsory Misses

Conclusions - 1
- Developed new performance evaluation framework
  - Applicable to large SMP systems
  - Sampled performance monitor event traces: manageable, collected in real time
  - Core: database management system (MySQL), Java tools
- Applied methodology to study memory-subsystem behavior
  - TPC-C executing on p690
  - Evaluated differences between private and shared data loads

Conclusions - 2
- References for private data
  - Satisfied within the MCM
  - Good application/architecture match
- References for shared data
  - Referenced outside the MCM
  - Increased locality of reference
  - Target for performance improvement
- Main memory accesses primarily associated with compulsory misses

Future Work
- Quantify representativeness of sampled event traces
- Enhance performance evaluation framework
- Expand study of application data load behavior, e.g., process characterization
- Suggest ways to improve performance of TPC-C executing on p690
  - Improved memory management of Buffer Pool resulting in performance improvements
  - Track performance impediments to actual code and/or data structures

Thank You. Questions?