Benefits of sampling in tracefiles. Harald Servat. Program Development for Extreme-Scale Computing, May 3rd, 2010.


Program Development for Extreme-Scale Computing, May 3rd, 2010

Slide 2: Outline
- Instrumentation and sampling
- Folding
- Summarized traces
- Some results
- Current work

Slide 3: Instrumentation
- Performance tools based on instrumentation
- Granularity of the results depends on the application structure
- Data gathered includes: performance counters, callstack, message size, ...

Slide 4: Sampling
- Sampling can reach any point of the application, at a given interval
- Easily tunable frequency
- Gathers performance counters and callstack
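The mechanism can be sketched with an interval timer: at every tick, a handler records a timestamp and the current call stack. This is a minimal, hypothetical Python illustration of the idea; the actual tools do this in C, and also read hardware performance counters inside the handler.

```python
import signal
import time
import traceback

samples = []  # (timestamp, call stack) pairs collected by the sampler

def take_sample(signum, frame):
    # At each timer tick, record when we are and where we are (the callstack);
    # a real tracer would also read hardware performance counters here.
    samples.append((time.time(), traceback.extract_stack(frame)))

signal.signal(signal.SIGALRM, take_sample)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # sample every 10 ms

def compute(n):
    # Stand-in for the application's computation phase
    s = 0.0
    for i in range(n):
        s += i * 0.5
    return s

end = time.time() + 0.3
while time.time() < end:  # the "application" being observed
    compute(10_000)

signal.setitimer(signal.ITIMER_REAL, 0, 0)  # disarm the timer
```

Because the timer fires independently of the application's control flow, samples land anywhere in `compute`, which is exactly what instrumentation alone cannot provide.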

Slide 5: Main objective
- Combine both mechanisms for deeper performance details
- Implemented using PAPI_overflow(..)
- ... what about the frequency trade-off?
  - Not so high that it disrupts the performance data
  - Not so low that it fails to yield useful information

Slide 6: Work done: Folding
Reference: Harald Servat, Germán Llort, Judit Giménez, Jesús Labarta: "Detailed performance analysis using coarse grain sampling", PROPER.
- Objective: get detailed metrics with few samples
- Benefits from both high and low frequencies!
- Takes advantage of the stationary behavior of scientific applications
- Builds a synthetic region from scattered samples
- Reintroduces it into the tracefile at a chosen ratio

Slide 7: Folding: moving samples
- Main idea: move samples to the target iteration, preserving their original relative time within their own iteration
- The steps are illustrated graphically on the slide
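The relocation step can be sketched as follows; this is a hypothetical helper, not the tool's actual code. Each sample is mapped into a single synthetic iteration at the same relative offset it had in the iteration where it was taken.

```python
def fold_samples(samples, iter_starts, iter_len):
    """Fold timestamped samples gathered across many iterations onto one
    synthetic iteration, preserving each sample's relative time.
    samples: list of (timestamp, counter_value) pairs
    iter_starts: start time of each instrumented iteration
    iter_len: common iteration length (stationary behavior assumed)
    Illustrative only; names and signature are assumptions."""
    folded = []
    for t, value in samples:
        for start in iter_starts:
            if start <= t < start + iter_len:
                # relative position within its own iteration, in [0, 1)
                folded.append(((t - start) / iter_len, value))
                break
    return sorted(folded)
```

For example, with iterations starting at t = 0, 10 and 20 (length 10), samples taken at t = 2, 13 and 27 fold to relative positions 0.2, 0.3 and 0.7 of the synthetic iteration, turning three sparse samples into a denser picture of one iteration.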

Slide 8: Folding: interpolation
- Example: evolution of the instructions counter for routine copy_faces of NAS MPI BT class B
- No instrumentation points within the routine, yet we obtain details
- Red crosses: the folded samples, showing completed instructions since the start of the routine
- Green line: curve fitting of the folded samples, used to reintroduce the values into the tracefile
- Blue line: derivative of the fitted curve
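A simplified stand-in for the fitting step, assuming linear interpolation between folded samples rather than the tool's actual curve fitting; all names are hypothetical. The interpolated value plays the role of the green line, and the local slope plays the role of the blue (derivative) line.

```python
def fit_and_rate(folded):
    """Given folded samples (relative phase in [0, 1], cumulative counter
    value), return two functions: the counter value at any phase (linear
    interpolation) and the local rate between neighboring samples.
    Illustrative stand-in for the curve fitting used on the slide."""
    pts = sorted(folded)

    def value_at(phase):
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= phase <= x1:
                w = (phase - x0) / (x1 - x0)
                return y0 + w * (y1 - y0)
        raise ValueError("phase outside sampled range")

    def rate_at(phase):
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= phase <= x1:
                return (y1 - y0) / (x1 - x0)
        raise ValueError("phase outside sampled range")

    return value_at, rate_at
```

The rate curve is what exposes structure inside an uninstrumented routine: flat stretches and steep stretches of the cumulative counter become visibly different instruction rates.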

Slide 9: Folding areas
Folding is applied to delimited regions:
- Previously instrumented:
  - User function
  - Iteration
- Automatically obtained from the gathered results:
  - Clusters of computation bursts (Juan González, Judit Giménez, Jesús Labarta, "Automatic detection of parallel applications computation phases", IPDPS 2009)
  - Delimited time regions (Marc Casas, Rosa M. Badia, Jesús Labarta, "Automatic Structure Extraction from MPI Applications Tracefiles", Euro-Par 2007)

Slide 10: Impact of the sampling frequency
- The more samples are folded, the more detailed the results:
  - Longer executions
  - Increased frequency
  - Do we reach stability?
- Example: NAS BT class B, copy_faces, showing from 10 to 200 iterations, 20 samples per ..., on an SGI Altix

Slide 11: Impact of the sampling frequency (cont.)
- Choosing a sampling frequency is important
- The sampling frequency can couple with the application's own frequency
- Choose frequencies based on prime factors
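Why prime factors matter can be sketched with a toy model (a hypothetical helper, assuming integer-microsecond periods): when the sampling period shares a large common factor with the iteration period, periodic samples keep landing on the same few relative positions, so folding learns nothing new.

```python
from math import gcd

def distinct_positions(iter_period_us, sample_period_us):
    """Number of distinct relative positions within an iteration that a
    strictly periodic sampler will ever hit, for integer periods in
    microseconds. A large shared factor means the sampler 'couples' with
    the application. Illustrative model only."""
    return iter_period_us // gcd(iter_period_us, sample_period_us)
```

With a 1000 us iteration, sampling every 500 us hits only 2 positions forever, while sampling every 333 us (coprime with 1000) eventually visits all 1000 positions; hence the advice to pick frequencies whose prime factors differ from the application's.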

Slide 12: Outline
- Instrumentation and sampling
- Folding
- Summarized traces
- Some results
- Current work

Slide 13: Dealing with large scale traces
Reference: Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: "Scalability of tracing and visualization tools", PARCO.
- An application's behavior can be divided into:
  - Communication phases
  - Intensive computation phases
- An instrumentation library identifies the relevant computation phases

Slide 14: Dealing with large scale traces (cont.)
Information emitted at each phase change:
- Punctual (callstack)
- Aggregated:
  - Hardware counters
  - Software counters:
    - Number of point-to-point and collective operations
    - Number of bytes transferred
    - Time in MPI
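The aggregation idea can be sketched as a per-phase accumulator: instead of logging every MPI event, the library keeps running software counters and emits a single summary record at each phase change. This is an illustrative sketch; the class and method names are assumptions, not the library's API.

```python
class PhaseSummary:
    """Accumulate per-phase software counters instead of emitting one
    trace record per MPI event (sketch of the summarization idea)."""

    def __init__(self):
        self.p2p_ops = 0         # point-to-point operations in this phase
        self.collective_ops = 0  # collective operations in this phase
        self.bytes = 0           # bytes transferred in this phase
        self.mpi_time = 0.0      # seconds spent inside MPI in this phase

    def record(self, kind, nbytes, elapsed):
        # Called from (hypothetical) wrappers around each MPI call
        if kind == "p2p":
            self.p2p_ops += 1
        elif kind == "collective":
            self.collective_ops += 1
        self.bytes += nbytes
        self.mpi_time += elapsed

    def flush(self):
        """Emit the aggregate at a phase change and reset the counters."""
        out = (self.p2p_ops, self.collective_ops, self.bytes, self.mpi_time)
        self.__init__()
        return out
```

One summary tuple per phase replaces potentially thousands of per-event records, which is where the trace size reductions on the next slides come from.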

Slide 15: Example: PEPC tasks on Jaguar
- Duration of the computation bursts
- Number of MPI collective operations

Slide 16: Benefits of summarized tracefiles
Important trace size reduction:
- Gadget2 (128 tasks): 10 GB down to 428 MB
- PEPC (16k tasks): 19 GB down to 400 MB
- PFLOTRAN (16k tasks): over 250 GB down to 6 GB
Enables whole-execution analysis

Slide 17: Working with large traces?
- We are dealing with large scale executions
- Maintain the scalability of tracing + sampling... by adding more data?
  - Use folding to reduce the data
- Example (Gadget2 using 128 tasks):
  - 100 iterations, 5 samples/s during 90 minutes: ~236 MB
  - Folding on ... samples/s: ~64 MB

Slide 18: Outline
- Instrumentation and sampling
- Folding
- Summarized traces
- Combining mechanisms
- Some results
- Current work

Slide 19: Gadget2 analysis, 128 tasks
- Main computation regions, accounting for 32%, 16%, 13% and 8% of the time
- Source locations shown: force_tree.c / gravity_tree.c +167; gravity_tree.c / density.c +167; force_tree.c / hydra.c +246; predict.c / pm_periodic.c +385

Slide 20: PEPC analysis, 32 tasks
- Main computation regions, accounting for 45%, 37%, 5% and 3% of the time
- Source locations shown: tree_aswalk.f / tree_aswalk.f; tree_domains.f / tree_branches.f; tree_branches.f / tree_properties.f; tree_aswalk.f / tree_aswalk.f

Slide 21: Current directions
We are working on:
- Is there an optimal sampling frequency?
- Quantifying correctness and validating the results
- Callstack analysis

Slide 22: Thank you!