Analyzing parallel programs with Pin

Presentation transcript:

Analyzing parallel programs with Pin
Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi-Keung Luk, Gail Lyons, Harish Patil, and Ady Tal, Intel Corp.
Meng Ying, GNU OSLab

Contents
- Introduction
- Pintool
- Instrumentation modes
- Instrumenting parallel programs
- Performance considerations
- Multithreaded versus multiprocess instrumentation
- Example tools
- Experimental results
- Conclusion

Introduction
- Instrumentation: a technique for collecting the information needed to understand programs.
- Instrumentation-based tools: typically insert extra code into a program to record events during execution.
- Pin: a software system that performs runtime binary instrumentation of Linux and Microsoft Windows applications.
  - Pin's goal is to provide an instrumentation platform for building a wide variety of program analysis tools.

Pintool
- Instrumentation routines: inspect the application's instructions and insert calls to analysis routines.
- Analysis routines: are called when the program executes an instrumented instruction, and often perform ancillary tasks.
- Callback routines: are called when specific conditions are met or a certain event has occurred, for example when the program is about to exit.

Pintool: a simple pintool
[Figure: listing of a simple pintool, with its instrumentation, analysis, and callback routines labeled]
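
The three kinds of routine can be illustrated with a small model. The sketch below is not Pin's API (real pintools are written against Pin's own interface); the toy instruction list and every function name are hypothetical, chosen only to show which routine runs when.

```python
from collections import Counter

# Toy "program": (opcode, operand) pairs standing in for machine instructions.
# Hypothetical data; nothing here comes from Pin itself.
PROGRAM = [("load", 0x10), ("add", 1), ("store", 0x10), ("load", 0x14), ("exit", 0)]


def instrument(instruction, counts):
    """Instrumentation routine: inspects one instruction and returns the
    analysis calls (possibly none) to attach to it."""
    opcode, _ = instruction

    def count_memory_access():
        # Analysis routine: runs every time the instrumented instruction executes.
        counts[opcode] += 1

    return [count_memory_access] if opcode in ("load", "store") else []


def on_exit(counts):
    """Callback routine: invoked once, when the program is about to exit."""
    return dict(counts)


def run(program):
    counts = Counter()
    # Instrument first, then execute with the attached analysis calls.
    instrumented = [(ins, instrument(ins, counts)) for ins in program]
    for (opcode, _), analysis_calls in instrumented:
        for call in analysis_calls:
            call()
        if opcode == "exit":
            break
    return on_exit(counts)
```

Calling `run(PROGRAM)` reports how many loads and stores executed. The point of the split is that `instrument` decides where to observe, while the analysis routine does the observing.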

Instrumentation modes
- JIT mode: supports all features of Pin.
  - Uses a just-in-time compiler to recompile all program code and insert instrumentation.
- Probe mode: supports a limited feature set but is far faster, adding almost no overhead to program running time.
  - Pin overwrites the entry point of procedures with jumps (called probes) to dynamically generated instrumentation.

Instrumentation modes: probe mode example
[Figure: probe mode example]
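
The idea behind a probe is to redirect a procedure's entry point through instrumentation before the original code runs. It can be imitated at a much higher level by swapping an entry in a function table. This is only an analogy, assuming a hypothetical `procedures` table; Pin itself patches machine code, which this sketch does not attempt.

```python
import functools


def compute(x):
    """Stand-in for an application procedure."""
    return x * 2


# Hypothetical procedure table; the application always calls through it.
procedures = {"compute": compute}
call_log = []


def insert_probe(table, name, before):
    """Replace the table entry with a wrapper that runs `before`, then the
    original procedure. Returns the original so the probe can be removed."""
    original = table[name]

    @functools.wraps(original)
    def probe(*args, **kwargs):
        before(name, args)                 # instrumentation runs first
        return original(*args, **kwargs)   # then the untouched procedure body

    table[name] = probe
    return original


insert_probe(procedures, "compute", lambda name, args: call_log.append((name, args)))
result = procedures["compute"](21)
```

Only the entry is redirected; the procedure body runs unmodified. That is why this style of instrumentation adds little overhead, and also why it can observe only procedure boundaries.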

Instrumenting parallel programs
- Pintool author: must take special care.
  - When a pintool instruments a parallel program, the application threads execute the calls to analysis functions.
  - The analysis functions must therefore be thread-safe, using locks to synchronize references to shared data with other threads (figure 1).
- Pin: provides callbacks when a new thread or new process is created.
  - Analysis routines can be passed a thread ID.
  - Pin provides APIs for allocating and addressing thread-local storage.
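
Because the application's own threads run the analysis code, any shared tool state needs synchronization. Here is a minimal sketch of a lock-protected counter, with ordinary Python threads standing in for application threads. `SharedCounter` and `run_threads` are illustrative names, not part of Pin.

```python
import threading


class SharedCounter:
    """Tool-wide counter updated from many threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def record(self, n=1):
        # The lock makes the read-modify-write of `value` atomic across threads.
        with self._lock:
            self.value += n


def run_threads(num_threads, events_per_thread):
    counter = SharedCounter()

    def application_thread():
        # Each "application thread" invokes the analysis routine itself.
        for _ in range(events_per_thread):
            counter.record()

    threads = [threading.Thread(target=application_thread) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, `run_threads(4, 1000)` always returns the full count. The cost is that every event from every thread passes through the same lock.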

Performance considerations
- A highly contended lock serializes execution and leads to poor CPU utilization.
- Pintool authors must employ standard parallel programming techniques to avoid excessive serialization:
  - Use thread-local storage to avoid the need to lock global storage.
  - Use fine-grained locks instead of a single monolithic lock for a data structure.
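
Both techniques can be sketched briefly. The first function gives each thread its own slot, indexed by a thread ID in the way an analysis routine might receive one, and merges the slots at the end, so no lock is needed. The class splits one table into several independently locked stripes. All names and the stripe count are assumptions for illustration.

```python
import threading


def count_per_thread(num_threads, events_per_thread):
    """Per-thread storage: each thread writes only its own slot, so the hot
    path takes no lock; the slots are summed once all threads have finished."""
    slots = [0] * num_threads

    def worker(tid):
        for _ in range(events_per_thread):
            slots[tid] += 1  # only thread `tid` ever touches slots[tid]

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(slots)


class StripedCounts:
    """Fine-grained locking: keys are spread over several small tables, each
    with its own lock, so threads updating different stripes do not contend."""

    def __init__(self, num_stripes=8):
        self._locks = [threading.Lock() for _ in range(num_stripes)]
        self._tables = [{} for _ in range(num_stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)

    def add(self, key, n=1):
        i = self._stripe(key)
        with self._locks[i]:
            self._tables[i][key] = self._tables[i].get(key, 0) + n

    def total(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._tables[i].get(key, 0)
```

Per-thread slots suit data that is only combined at the end of a run. Striping suits a table that must stay shared while the program is running.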

Performance considerations
- False sharing occurs when multiple threads access different parts of the same cache line and at least one of the accesses is a write.
- Developers can eliminate false sharing by padding critical data structures to the size of a cache line or rearranging the structures' data layout.
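
The effect of padding on layout can be checked with `ctypes`. This sketch assumes a 64-byte cache line and an array of per-thread counters that starts on a cache-line boundary; both are assumptions, and the structure names are made up. It shows only the layout, not the performance difference.

```python
import ctypes

CACHE_LINE = 64  # assumption: 64-byte cache lines


class PackedCounter(ctypes.Structure):
    """One 8-byte counter per thread, packed back to back."""
    _fields_ = [("count", ctypes.c_uint64)]


class PaddedCounter(ctypes.Structure):
    """The same counter, padded so each one fills a whole cache line."""
    _fields_ = [
        ("count", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
    ]


def cache_line_of_each(counter_type, num_threads):
    """Cache-line index holding each thread's counter, for an array of
    `counter_type` that starts on a cache-line boundary."""
    stride = ctypes.sizeof(counter_type)
    return [(tid * stride) // CACHE_LINE for tid in range(num_threads)]
```

With eight threads, the packed layout puts all eight counters on line 0, so every write invalidates the line for the other threads. The padded layout gives each counter its own line, at the cost of eight times the memory.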

Multithreaded versus multiprocess instrumentation

                                                     Multithreaded   Multiprocess
Pin notifies the pintool when a new one is created.  No              Yes
The pintool must reinstrument.                       No              Yes
Sharing pintool data                                 Yes             No
Accessing the same data                              Yes             No

Example tools
- Intel Parallel Inspector: finds memory and threading errors (memory leaks, references to uninitialized data, data races, and deadlocks). It uses Pin to instrument all machine instructions in the program that reference memory and records the effective addresses.
- Intel Trace Analyzer and Collector: provides information critical to understanding and optimizing cluster performance by quickly finding performance bottlenecks in Message Passing Interface (MPI) communication. The tool presents a timeline showing when MPI messages are sent and received.

Example tools
- Intel Parallel Amplifier (three types of analysis)
  - Hotspots attributes time to source lines and call stacks, identifying the parts of the program that would benefit from tuning and parallelism.
  - Concurrency measures the CPUs' utilization, giving whole-program and per-function summaries.
  - Locks and waits measures the time multithreaded programs spend waiting on locks, attributing time to synchronization objects and source lines.

Example tools
- CMP$im: a cache profiler that uses Pin to collect the memory addresses of multithreaded and multiprocessor programs, and uses a software model of the memory system to analyze program behavior.
- PinPlay: enables the capture and deterministic replay of multithreaded program runs under Pin. Capturing a run helps developers overcome the non-determinism inherent in multithreading.
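
To make "a software model of the memory system" concrete, here is a deliberately tiny one: a direct-mapped cache fed with a list of addresses. It is not CMP$im's model; the cache geometry, the function name, and the sample trace are all assumptions chosen to keep the example small.

```python
def simulate_direct_mapped(addresses, num_lines=4, line_size=64):
    """Replay an address trace through a direct-mapped cache model and
    return (hits, misses)."""
    tags = [None] * num_lines  # tag currently stored in each cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size   # which memory block the address falls in
        index = block % num_lines   # the one cache line that block may occupy
        tag = block // num_lines    # distinguishes blocks that share that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag       # evict whatever was there
    return hits, misses


# Hypothetical trace: addresses 0 and 256 map to the same cache line,
# so the final access to 0 misses again after 256 evicts it.
TRACE = [0, 8, 64, 0, 256, 0]
```

In a real profiler the trace would come from instrumented memory instructions rather than a literal list, and the model would be far more detailed.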

Example tools
- Prospector: discovers potential parallelism in serial programs by loop and data-dependence profiling.
- Intel Software Development Emulator: a user-level functional emulator, built on Pin, for new instructions in the Intel 64 instruction set.
  - Uses Pin to alter the program while it is running.
  - Is primarily a tool for debugging programs that use new instructions before the corresponding hardware exists.
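
Data-dependence profiling of a loop can be sketched as follows: record what each iteration reads and writes, then flag any location that one iteration writes and another iteration reads or writes. This is a generic illustration, not Prospector's algorithm; the trace format and names are assumptions.

```python
def loop_carried_dependences(iterations):
    """`iterations` is a list of (reads, writes) sets, one pair per loop
    iteration. Returns the locations involved in a dependence between
    different iterations (read-after-write, write-after-read, or
    write-after-write)."""
    seen_reads, seen_writes = set(), set()
    conflicts = set()
    for reads, writes in iterations:
        conflicts |= reads & seen_writes    # read after an earlier write
        conflicts |= writes & seen_writes   # write after an earlier write
        conflicts |= writes & seen_reads    # write after an earlier read
        seen_reads |= reads
        seen_writes |= writes
    return conflicts


# Hypothetical traces for i = 1..3.
# a[i] = b[i] * 2: every iteration touches its own elements.
INDEPENDENT = [({f"b{i}"}, {f"a{i}"}) for i in range(1, 4)]
# a[i] = a[i-1] + 1: each iteration reads what the previous one wrote.
DEPENDENT = [({f"a{i - 1}"}, {f"a{i}"}) for i in range(1, 4)]
```

An empty result for a profiled run suggests the loop is a candidate for parallelization on that input. It does not prove the loop is safe for all inputs.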

Experimental results: runtime overhead
[Figure: runtime overhead]

Experimental results: scalable workload performance
[Figure: scalable workload performance]

Conclusion
- Pin was originally conceived as a tool for computer architecture analysis.
- A user community latched on to its flexible API and high performance, taking the tool into unforeseen domains like security, emulation, and parallel program analysis.
- We can build tools based on Pin to gain a deeper understanding of our software's behavior.

Thank you!