PMaC Performance Modeling and Characterization
A Static Binary Instrumentation Threading Model for Fast Memory Trace Collection
Michael Laurenzano 1, Joshua Peraza 1, Laura Carrington 1, Ananta Tiwari 1, William A. Ward 2, Roy Campbell 2
1 Performance Modeling and Characterization (PMaC) Laboratory, San Diego Supercomputer Center
2 High Performance Computing Modernization Program (HPCMP), United States Department of Defense

PMaC Performance Modeling and Characterization
Memory-driven HPC
• Many HPC applications are memory bound
  – Understanding an application requires understanding its memory behavior
• Measurement? (e.g., timers or hardware counters)
  – Measuring at fine grain, with reasonable overhead and transparently, is HARD
  – How to get sufficient detail (e.g., reuse distance)?
• Binary instrumentation
  – Obtains low-level details of the address stream
  – Details are attached to specific structures within the application

PMaC Performance Modeling and Characterization
Convolution Methods map Application Signatures to Machine Profiles to produce a performance prediction
• Machine Profile – characterization of the rates at which an HPC target system can carry out fundamental operations
  – Measured or projected via simple benchmarks on 1-2 nodes of the system
• Application Signature – detailed summary of the fundamental operations to be carried out by the HPC application
  – Collected via trace tools
• PMaC Performance/Energy Models combine the two to predict the performance of the application on the target system
• Performance Model – a calculable expression of the runtime, efficiency, memory use, etc. of an HPC program on some machine
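As a rough, hypothetical illustration of the idea (not PMaC's actual convolution, whose inputs are far more detailed), a prediction can be formed by weighting each basic block's operation counts from the application signature by the rates in the machine profile. All structure and field names below are invented for the sketch.

#include <stddef.h>

/* Hypothetical, simplified structures: a real application signature and
 * machine profile carry much more detail (e.g., per-block cache hit rates). */
typedef struct {
    double flops;         /* floating-point operations in this block  */
    double mem_bytes;     /* bytes moved to/from memory in this block */
    double exec_count;    /* how many times the block executes        */
} BlockSignature;

typedef struct {
    double flop_rate;     /* sustained FLOP/s measured by benchmarks      */
    double mem_bandwidth; /* sustained memory bandwidth (bytes/s)         */
} MachineProfile;

/* Convolve an application signature with a machine profile to get a crude
 * predicted runtime: each block contributes the larger of its compute time
 * and its memory time. */
double predict_runtime(const BlockSignature *blocks, size_t nblocks,
                       const MachineProfile *machine)
{
    double total = 0.0;
    for (size_t i = 0; i < nblocks; i++) {
        double compute = blocks[i].flops / machine->flop_rate;
        double memory  = blocks[i].mem_bytes / machine->mem_bandwidth;
        total += blocks[i].exec_count * (compute > memory ? compute : memory);
    }
    return total;
}

The memory trace collection described in the rest of the talk exists to supply the memory side of such a signature in far greater detail than a single bandwidth number.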

PMaC Performance Modeling and Characterization
Runtime Overhead is a Big Deal
• Real HPC applications
  – Relatively long runtimes: minutes, hours, days?
  – Lots of CPUs: O(10^7) in the largest supercomputers
  – High slowdowns create problems
    • Too long for the queue
    • Unsympathetic administrators/managers
    • Inconvenience
    • Unnecessary use of resources
• PEBIL = PMaC's Efficient Binary Instrumentation for x86/Linux

PMaC Performance Modeling and Characterization
What's New in PEBIL?
• It can instrument multithreaded code
  – Developers use OpenMP and pthreads!
  – x86_64 only
  – Provides access to thread-local instrumentation data at runtime
• Supports turning instrumentation on/off
  – Very lightweight operation
  – Swaps nops with inserted instrumentation code at runtime
  – Overhead close to zero when all instrumentation is removed

PMaC Performance Modeling and Characterization
Binary Instrumentation in HPC
• Tuning and Analysis Utilities (TAU) – Dyninst and PEBIL
• HPCToolkit – Dyninst
• Open SpeedShop – Dyninst
• Intel Parallel Studio – Pin
• Memcheck memory bug detector – Valgrind (valgrind --leak-check=yes ...)
• Understanding performance and energy
• Many research projects (not just HPC)
  – Binary instrumentation has been used in papers throughout the last 15 years

PMaC Performance Modeling and Characterization
Binary Instrumentation Basics: Memory Address Tracing

Original:
0000c000 :
  c000: mov %rdi,-0x8(%rbp)
  c004: pop %rsi
  c005: jne 0xc004
  c007: leaveq
  c008: retq

Instrumented:
0000c000 :
  c000: // compute -0x8(%rbp) and copy it to a buffer
  c008: mov %rdi,-0x8(%rbp)
  c00c: // compute (%rsp) and copy it to a buffer
  c014: pop %rsi
  c015: jne 0xc00c
  c017: leaveq
  c018: retq
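In C-like terms, the inserted stubs compute each effective address and append it to a buffer that is handed off to analysis code when full. The sketch below uses invented names (record_address, process_buffer) rather than PEBIL's actual runtime interface.

#include <stdint.h>
#include <stddef.h>

#define BUFFER_CAPACITY 4096

/* Per-process address buffer; the multithreaded case on the next slides
 * uses one buffer per thread instead. */
static uint64_t addr_buffer[BUFFER_CAPACITY];
static size_t   addr_count = 0;

/* Analysis hook: in a real tool this might update reuse-distance histograms
 * or simulate caches; here it simply discards the addresses. */
static void process_buffer(const uint64_t *addrs, size_t n)
{
    (void)addrs;
    (void)n;
}

/* Called (conceptually) by the stub inserted before each memory instruction,
 * with the effective address that stub just computed. */
void record_address(uint64_t effective_address)
{
    addr_buffer[addr_count++] = effective_address;
    if (addr_count == BUFFER_CAPACITY) {
        process_buffer(addr_buffer, addr_count);
        addr_count = 0;
    }
}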

PMaC Performance Modeling and Characterization
Enter Multithreaded Apps
• All threads use a single buffer?
  – Don't need to know which thread is executing
• A buffer for each thread?
  – Faster: no concurrency operations needed
  – More interesting: per-thread behavior != average thread behavior
• PEBIL uses the latter
  – Fast method for computing the location of thread-local data
  – Caches that location in a register when possible

PMaC Performance Modeling and Characterization
Thread-local Instrumentation Data in PEBIL
• Provide a large table to each process (2M)
  – Each entry is a small pool of memory (32 bytes)
• Must be VERY fast
  – Get thread id (1 instruction)
  – Simple hash of thread id (2 instructions)
  – Index table with hashed id (1 instruction)
• Assume no collisions (so far so good)
[Figure: the hash function maps each thread's id to its own thread-local memory pool]
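A minimal C sketch of that lookup path, assuming x86_64 Linux where the thread pointer can be read through %fs; the hash and table sizes below are toy stand-ins, not PEBIL's actual choices.

#include <stdint.h>

/* Sizes from the slide: a ~2MB table per process, 32-byte pools per entry. */
#define POOL_SIZE   32
#define TABLE_SLOTS (2 * 1024 * 1024 / POOL_SIZE)   /* 65536 slots */

static unsigned char thread_pools[TABLE_SLOTS][POOL_SIZE];

/* "Get thread id" in one instruction: on x86_64 Linux with glibc, %fs:0
 * holds the thread pointer, which is unique per thread. */
static inline uint64_t get_thread_id(void)
{
    uint64_t tid;
    __asm__("movq %%fs:0, %0" : "=r"(tid));
    return tid;
}

/* The fast path from the slide: read the thread id, hash it with a couple of
 * ALU operations, and index the table. Collisions are assumed not to occur;
 * the hash here is a toy, not PEBIL's actual function. */
static inline void *thread_local_pool(void)
{
    uint64_t tid  = get_thread_id();
    uint64_t slot = (tid >> 12) & (TABLE_SLOTS - 1);
    return thread_pools[slot];
}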

PMaC Performance Modeling and Characterization
Caching Thread-local Data
• Cache the address of thread-local data in a register
  – Dead registers are known at instrumentation time
  – Is there one register in the function that is dead everywhere?
• Compute the thread-local data address only at function [re]entry
• Should use smaller scopes! (loops, blocks)
  – Significant reductions possible
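A sketch of the instrumentation-time decision this implies. The instrumenter-side types and helpers (reg_dead_everywhere, emit_pool_lookup, and so on) are hypothetical stand-ins declared for illustration only; PEBIL's real liveness analysis and code-generation interfaces differ.

#include <stdbool.h>
#include <stddef.h>

typedef struct Function Function;
typedef struct InsnPoint InsnPoint;
typedef int Reg;

/* Hypothetical instrumenter API. */
extern bool reg_dead_everywhere(const Function *f, Reg r);   /* liveness query   */
extern InsnPoint *function_entry(Function *f);
extern InsnPoint *next_memop(Function *f, InsnPoint *prev);  /* iterate memops   */
extern void emit_pool_lookup(InsnPoint *p, Reg dst);         /* 4-insn fast path */
extern void emit_record_address(InsnPoint *p, Reg pool_reg);

/* Decide where the thread-local pool address gets computed for one function. */
void plan_pool_caching(Function *f, Reg candidate)
{
    if (reg_dead_everywhere(f, candidate)) {
        /* Best case: compute the pool address once per function [re]entry
         * and keep it cached in the dead register for every memop. */
        emit_pool_lookup(function_entry(f), candidate);
        for (InsnPoint *p = next_memop(f, NULL); p; p = next_memop(f, p))
            emit_record_address(p, candidate);
    } else {
        /* Fallback: recompute the pool address at each instrumentation point
         * (still only ~4 instructions with the hashed table lookup). */
        for (InsnPoint *p = next_memop(f, NULL); p; p = next_memop(f, p)) {
            emit_pool_lookup(p, candidate);
            emit_record_address(p, candidate);
        }
    }
}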

PMaC Performance Modeling and Characterization
Other x86/Linux Binary Instrumentation

Tool: Pin [1]
  Static or dynamic: Dynamic
  Thread-local data access: register stolen from the program; the program is JIT-compiled around that lost register
  Threading overhead: very low
  Runtime overhead: medium

Tool: Dyninst [2]
  Static or dynamic: Either
  Thread-local data access: compute the thread ID (layered function call) at every instrumentation point
  Threading overhead: high
  Runtime overhead: varies

Tool: PEBIL [3]
  Static or dynamic: Static
  Thread-local data access: table + fast hash function (4 instructions); result cached in dead registers
  Threading overhead: low
  Runtime overhead: low

[1] Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. Luk, C., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V.J. and Hazelwood, K. ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005.
[2] An API for Runtime Code Patching. Buck, B. and Hollingsworth, J. International Journal of High Performance Computing Applications, 2000.
[3] PEBIL: Efficient Static Binary Instrumentation for Linux. Laurenzano, M., Tikir, M., Carrington, L. and Snavely, A. International Symposium on the Performance Analysis of Systems and Software, 2010.

PMaC Performance Modeling and Characterization
Runtime Overhead Experiments
• Basic block counting
  – Classic test in the binary instrumentation literature
  – Increment a counter each time a basic block is executed (see the sketch after this list)
  – Per-block, per-process, per-thread counters
• Memory address tracing
  – Fill a process/thread-local buffer with memory addresses, then discard those addresses
  – Interval-based sampling
    • Take the first 10% of each billion memory accesses
    • Toggle instrumentation on/off when moving between sampling and non-sampling
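A minimal sketch of what the basic block counting instrumentation amounts to, assuming (hypothetically) that each thread's 32-byte pool from the earlier lookup holds a pointer to that thread's own counter array.

#include <stdint.h>

/* Hypothetical per-thread state: each thread's pool is assumed to hold a
 * pointer to its own array of basic-block counters. */
typedef struct {
    uint64_t *block_counts;   /* one counter per instrumented basic block */
} ThreadPool;

/* The fast thread-local lookup sketched earlier (declaration only here). */
extern ThreadPool *thread_local_pool(void);

/* Conceptual body of the code inserted at the top of basic block `block_id`:
 * a single per-thread counter increment, so no atomic operations are needed. */
void count_basic_block(uint32_t block_id)
{
    thread_local_pool()->block_counts[block_id]++;
}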

PMaC Performance Modeling and Characterization
Methodology
• 2 quad-core Xeon X3450 processors, 2.67 GHz
  – 32K L1 and 256K L2 cache per core, 8M L3 per processor
• NAS Parallel Benchmarks
  – 2 sets: OpenMP and MPI, gcc/GOMP and gcc/mpich
  – 8 threads/processes: CG, DC (OpenMP only), EP, FT, IS, LU, MG
  – 4 threads/processes: BT, SP
• Dyninst 7.0 (dynamic)
  – Timing started when the instrumented app began running
• Pin 2.12
• PEBIL 2.0

PMaC Performance Modeling and Characterization
Basic Block Counting (MPI)
• All results are the average of 3 runs
• Slowdown relative to the un-instrumented run
  – 1 == no slowdown

PMaC Performance Modeling and Characterization
Basic Block Counting (OpenMP)
• Y-axis = log-scale slowdown factor
• Dyninst performs a thread ID lookup at every basic block

PMaC Performance Modeling and Characterization
Threading Support Overhead (BB Counting)

PMaC Performance Modeling and Characterization
Memory Tracing (MPI)
• Slowdown relative to the un-instrumented application
[Table: per-benchmark slowdowns for PEBIL, Pin and Dyninst over BT, CG, EP, FT, IS, LU, MG, SP and their mean; numeric values not recoverable from the transcript]

PMaC Performance Modeling and Characterization
Memory Tracing (OpenMP)
• Instrumentation code inserted at every memory instruction
  – Dyninst computes the thread ID at every memop
  – Pin runtime-optimizes the instrumented code (lots of opportunity to optimize)
[Table: per-benchmark slowdowns for PEBIL, Pin and Dyninst over BT, CG, DC, EP, FT, IS, LU, MG, SP and their mean; numeric values not recoverable from the transcript apart from Dyninst's 975.90x** entry (** 30s → 7h45m) and runs marked ???]

PMaC Performance Modeling and Characterization
Interval-based Sampling
• Extract useful information from a subset of the memory address stream
  – Simple approach: the first 10% of every billion addresses
• In practice we use a window 100x as small
  – Obvious: avoid processing addresses (e.g., just collect and throw away)
  – Not so obvious: avoid collecting addresses
• Instrumentation tools can disable/re-enable instrumentation
  – PEBIL: binary on/off. Very lightweight, but limited
  – Pin and Dyninst: arbitrary removal/re-instrumentation. Heavyweight, but versatile
  – Sampling only requires on/off functionality
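A sketch of the sampling control logic, assuming a hypothetical set_instrumentation_enabled() toggle standing in for PEBIL's nop-swapping on/off switch, and assuming a cheap running count of memory accesses remains available while tracing is off (e.g., from block counters).

#include <stdint.h>
#include <stdbool.h>

/* Sampling parameters from the slide: trace the first 10% of every
 * billion memory accesses. */
#define INTERVAL_SIZE   1000000000ULL
#define SAMPLE_FRACTION 10              /* sample 1/10 of each interval */

/* Hypothetical toggle exposed by the instrumentation runtime. */
extern void set_instrumentation_enabled(bool on);

static uint64_t accesses_seen = 0;

/* Called periodically (e.g., whenever a trace buffer is flushed) with the
 * number of memory accesses observed since the last call. */
void sampling_update(uint64_t new_accesses)
{
    accesses_seen += new_accesses;
    uint64_t pos = accesses_seen % INTERVAL_SIZE;
    /* On during the first 10% of each interval, off for the remaining 90%. */
    set_instrumentation_enabled(pos < INTERVAL_SIZE / SAMPLE_FRACTION);
}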

PMaC Performance Modeling and Characterization
Sampled Memory Tracing (MPI)
• PEBIL always improves, and significantly
• Pin usually, but not always, improves
  – The amount and complexity of code re-instrumented during each interval probably drives this
• Dyninst never improves
[Table: slowdowns for PEBIL Full, PEBIL 10%, Pin Full, Pin 10%, Pin Best, and Dyninst Full (= Best) over BT, CG, EP, FT, IS, LU, MG, SP and their mean; numeric values not recoverable from the transcript]

PMaC Performance Modeling and Characterization
Sampled Memory Tracing (OpenMP)
[Table: slowdowns for PEBIL Full, PEBIL 10%, Pin Full, Pin 10%, Pin Best, and Dyninst Full (= Best*) over BT, CG, DC, EP, FT, IS, LU, MG, SP and their mean; numeric values not recoverable from the transcript apart from Dyninst's 975.90x** entry and runs marked ???]

PMaC Performance Modeling and Characterization
Conclusions
• New PEBIL features
  – Instrument multithreaded binaries
  – Turn instrumentation on/off
• Fast access to a per-thread memory pool to support per-thread data collection
  – Reasonable overheads
• Cache the memory pool location in a register
  – Currently done at function level
  – Future work: smaller scopes
• PEBIL is useful for practical memory address stream collection
  – Message passing or threaded

PMaC Performance Modeling and Characterization
Questions?