Performance Debugging Techniques For HPC Applications David Skinner CS267 Feb 17 2015.

Presentation transcript:

Performance Debugging Techniques For HPC Applications David Skinner CS267 Feb 17 2015

Today’s Topics
Principles
– Topics in performance scalability
– Examples of areas where tools can help
Practice
– Where to find tools
– Specifics to NERSC’s Hopper/Edison…
Scope & Audience:
– The budding simulation scientist: "I want to compute."
– The compiler/middleware dev: "I want to code."

Overview of an HPC Facility
Serving all of DOE Office of Science
– domain breadth
– range of scales
Lots of users
– ~6K active
– ~500 logged in
– ~450 projects
Science driven
– sustained performance on real apps
Architecture aware
– system procurements driven by workload needs

Big Picture of Performance and Scalability

[Diagram: the research lifecycle: formulate research problem → algorithms → debug → perf debug → jobs → queue wait → data? UQ/VV → understand & publish!]
Performance, more than a single number
– Plan where to put effort
– Optimization in one area can de-optimize another
– Timings come from timers, and also from your calendar: time spent coding counts
– Sometimes a slower algorithm is simpler to verify for correctness

Performance is Relative
To your goals
– Time to solution, T_q + T_wall, …
– Your research agenda
– Efficient use of allocation
To the
– application code
– input deck
– machine type/state
Suggestion: Focus on specific use cases as opposed to making everything perform well. Bottlenecks can shift.

Specific Facets of Performance
Serial
– Leverage ILP on the processor
– Feed the pipelines
– Reuse data in cache
– Exploit data locality
Parallel
– Exposing task level concurrency
– Minimizing latency effects
– Maximizing work vs. communication
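
To make "reuse data in cache" concrete, here is a minimal C sketch (ours, not from the slides): C stores 2-D arrays row-major, so traversal order alone decides whether each cache line is reused or evicted. On typical hardware the unit-stride loop runs markedly faster.

  #include <stdio.h>
  #include <stdlib.h>
  #define N 2048

  /* Sum a matrix two ways. a[i*N+j] is row-major, so the i-outer/
     j-inner loop walks memory contiguously and reuses each cache
     line; swapping the loops strides by N doubles per access and
     misses in cache far more often. */
  static double sum_rowmajor(const double *a) {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[(size_t)i * N + j];   /* unit stride */
      return s;
  }

  static double sum_colmajor(const double *a) {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[(size_t)i * N + j];   /* stride N */
      return s;
  }

  int main(void) {
      double *a = malloc((size_t)N * N * sizeof *a);
      if (!a) return 1;
      for (size_t k = 0; k < (size_t)N * N; k++) a[k] = 1.0;
      printf("%f %f\n", sum_rowmajor(a), sum_colmajor(a));
      free(a);
      return 0;
  }

(Fortran is the mirror image: column-major, so the i loop should be innermost.)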

Performance is Hierarchical
– Registers: instructions & operands
– Caches: lines
– Local Memory: pages
– Remote Memory: messages
– Disk / Filesystem: blocks, files

…on to specifics about HPC tools
Mostly at NERSC, but fairly general

Tools are Hierarchical
[Diagram: the same hierarchy (Registers, Caches, Local Memory, Remote Memory, Disk / Filesystem) annotated with the tools and interfaces that observe each level: PAPI, valgrind, CrayPat, IPM, Tau, …; PMPI hooks at the message level, POSIX at the filesystem level]

HPC Perf Tool Mechanisms (the how part)
Sampling
– Regularly interrupt the program and record where it is
– Build up a statistical profile
Tracing / Instrumenting
– Insert hooks into the program to record and time events
Use Hardware Event Counters
– Special registers count events on the processor
– E.g. floating point instructions
– Many possible events
– Only a few (~4) counters can be active at a time
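
A minimal sketch of the hardware-counter mechanism via PAPI's low-level C API. The preset events and the toy do_work kernel are illustrative; which presets exist varies by processor.

  #include <stdio.h>
  #include <stdlib.h>
  #include <papi.h>

  static void do_work(void) {          /* stand-in compute kernel */
      volatile double s = 0.0;
      for (int i = 0; i < 10000000; i++) s += i * 1.000001;
  }

  int main(void) {
      int evset = PAPI_NULL;
      long long counts[2];

      /* Initialize PAPI and build an event set with two preset
         events; only a handful can be counted simultaneously. */
      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
          exit(1);
      if (PAPI_create_eventset(&evset) != PAPI_OK) exit(1);
      PAPI_add_event(evset, PAPI_FP_OPS);   /* floating-point ops */
      PAPI_add_event(evset, PAPI_TOT_CYC);  /* total cycles       */

      PAPI_start(evset);
      do_work();                            /* region under study */
      PAPI_stop(evset, counts);

      printf("FP ops: %lld  cycles: %lld\n", counts[0], counts[1]);
      return 0;
  }

Link with -lpapi; the papi_avail utility lists which presets your processor supports.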

Things HPC tools may ask you to do
– Modify your code with macros, API calls, timers
– Re-compile your code
– Transform your binary for profiling/tracing
– Run the transformed binary
  – A data file is produced
– Interpret the results with another tool

Performance Tools
NERSC vendor tools:
– CrayPat
Community tools:
– TAU (U. Oregon, via ACTS)
– PAPI (Performance Application Programming Interface)
– gprof, and many more
Center tools:
– IPM (Integrated Performance Monitoring)

What can HPC tools tell us?
CPU and memory usage
– FLOP rate
– Memory high-water mark
OpenMP
– OMP overhead
– OMP scalability (finding the right # of threads)
MPI
– Detecting load imbalance
– % wall time in communication
– Analyzing message sizes

Using the right tool
Tools can add overhead to code execution
– What level can you tolerate?
Tools can add overhead to scientists
– What level can you tolerate?
Scenarios:
– Debugging a code that is "slow"
– Detailed performance debugging
– Performance monitoring in production

One quick tool example: IPM
Integrated Performance Monitoring
– MPI profiling, hardware counter metrics, POSIX IO profiling
– IPM requires no code modification & no instrumented binary
  – Only a "module load ipm" before running your program, on systems that support dynamic libraries
  – Else link with the IPM library
– IPM uses hooks already in the MPI library to intercept your MPI calls and wrap them with timers and counters
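
For the curious, a rough sketch of the hook mechanism (illustrative only, not IPM's actual source): the MPI standard guarantees every MPI_X is also callable as PMPI_X, so a profiling library can define its own MPI_Send, do its bookkeeping, and forward to PMPI_Send.

  #include <mpi.h>
  #include <stdio.h>

  /* Linking this object ahead of the MPI library (or LD_PRELOADing
     it) interposes the wrappers with no source changes to the app. */
  static double send_time  = 0.0;
  static long   send_calls = 0;

  int MPI_Send(const void *buf, int count, MPI_Datatype dt,
               int dest, int tag, MPI_Comm comm) {
      double t0 = PMPI_Wtime();
      int rc = PMPI_Send(buf, count, dt, dest, tag, comm);
      send_time  += PMPI_Wtime() - t0;
      send_calls += 1;
      return rc;
  }

  int MPI_Finalize(void) {
      /* Report per-process totals before the library shuts down. */
      printf("MPI_Send: %ld calls, %.3f s\n", send_calls, send_time);
      return PMPI_Finalize();
  }

This is the same interface the slide refers to, reduced to a single wrapped call.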

IPM: Let’s See
1) Do "module load ipm", link with $IPM, then run normally
2) Upon completion you get a summary. Maybe that’s enough. If so you’re done. Have a nice day.

  ##IPM2v0.xx##################################################
  #
  # command   : ./fish -n
  # start     : Tue Feb 08 11:05:    host      : nid06027
  # stop      : Tue Feb 08 11:08:    wallclock :
  # mpi_tasks : 25 on 2 nodes        %comm     : 1.62
  # mem [GB]  : 0.24                 gflop/sec : 5.06
  …

IPM: IPM_PROFILE=full (sample full-profile output; most numeric values did not survive transcription)

  # host   : s05601/ C00_AIX       mpi_tasks : 32 on 2 nodes
  # start  : 11/30/04/14:35:34     wallclock : sec
  # stop   : 11/30/04/14:36:00     %comm     :
  # gbytes : e-01 total            gflop/sec : e+00 total
  #
  #                [total]  min  max
  # wallclock
  # user
  # system
  # mpi
  # %comm
  # gflop/sec
  # gbytes
  # PM_FPU0_CMPL   e  e  e  e+08
  # PM_FPU1_CMPL   e  e  e  e+08
  # PM_FPU_FMA     e  e  e  e+08
  # PM_INST_CMPL   e  e  e  e+09
  # PM_LD_CMPL     e  e  e  e+09
  # PM_ST_CMPL     e  e  e  e+09
  # PM_TLB_MISS    e  e  e  e+07
  # PM_CYC         e  e  e  e+09
  #                [time]  [calls]
  # MPI_Send
  # MPI_Wait
  # MPI_Irecv
  # MPI_Barrier
  # MPI_Reduce
  # MPI_Comm_rank
  # MPI_Comm_size

Advice: Develop (some) portable approaches to performance
– There is a tradeoff between vendor-specific and vendor-neutral tools
  – Each has its role; vendor tools can often dive deeper
– Portable approaches allow apples-to-apples comparisons
  – Events, counters, and metrics may be incomparable across vendors
– You can find printf most places
  – Put a few timers in your code?
printf? really? Yes really.
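
In that spirit, a minimal sketch (ours, not from the slides) of the timers-plus-printf approach using POSIX clock_gettime; the phase names are placeholders. In an MPI code, MPI_Wtime() works the same way.

  #include <stdio.h>
  #include <time.h>

  /* Portable wall-clock timer: nothing vendor-specific, just
     clock_gettime and printf around the phases you care about. */
  static double now(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + 1e-9 * ts.tv_nsec;
  }

  int main(void) {
      double t0 = now();
      /* ... setup phase ... */
      double t1 = now();
      /* ... solve phase ... */
      double t2 = now();
      printf("setup %.3f s  solve %.3f s\n", t1 - t0, t2 - t1);
      return 0;
  }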

Performance Principles in HPC Tools

Scaling: definitions
Scaling studies involve changing the degree of parallelism.
– Will we be changing the problem also?
Strong scaling
– Fixed problem size
Weak scaling
– Problem size grows with additional resources
Speedup: S(n) = T_s / T_p(n)
Efficiency: E(n) = T_s / (n * T_p(n))
where T_s is the serial time and T_p(n) the parallel time on n processors.
Be aware there are multiple definitions for these terms.
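
A quick worked example with invented numbers: if the serial run takes T_s = 1000 s and the 64-way run takes T_p(64) = 25 s, then S(64) = 1000/25 = 40 and E(64) = 40/64 ≈ 0.63. The job finishes 40× sooner, but each of the 64 processors is only ~63% utilized relative to ideal scaling.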

[Figure from Supercomputing 101, R. Nealy (good read)]

MPI Scalability: Point to Point (Ferreira, Kurt B., et al., SC14)

MPI Scalability: Disseminate (Ferreira, Kurt B., et al., SC14)

MPI Scalability: Synchronization (Ferreira, Kurt B., et al., SC14)

Let’s look at a parallel algorithm
– With a particular goal in mind, we systematically vary concurrency and/or problem size
– Example: How large a 3D (n^3) FFT can I efficiently run on 1024 CPUs?
– Looks good? Watch out for variability: cross-job contention, OS jitter, performance "weather"

Let’s look a little deeper…

Performance in a 3D box (Navier-Stokes)
– Simple stencil, simple grid
– Transpose/FFT is key to wallclock performance
– What if the problem size or core count changes?
[Figure: one timestep, one node; 61% of time in the FFT]

The FFT(W) scalability landscape
Whoa! Why so bumpy?
– Algorithm complexity or switching
– Communication protocol switching
– Inter-job contention
– ~bugs in vendor software
Don’t assume performance is smooth → do a scaling study

Scaling is not always so tricky
[Figure: main loop in jacobi_omp.f90; ngrid=6144 and maxiter=20]
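
The slide’s loop is Fortran (jacobi_omp.f90); as a rough C analogue (grid size and names are ours), a Jacobi sweep like this scales cleanly with a single OpenMP directive because every write targets the new array.

  #include <omp.h>

  /* One Jacobi sweep over an n x n grid: each interior point becomes
     the average of its four neighbors. Iterations of the i loop are
     independent (reads come only from u, writes go only to unew), so
     the parallel-for needs no synchronization beyond its implicit
     barrier. The caller swaps u and unew between sweeps, maxiter
     times. */
  void jacobi_sweep(int n, double u[n][n], double unew[n][n]) {
      #pragma omp parallel for
      for (int i = 1; i < n - 1; i++)
          for (int j = 1; j < n - 1; j++)
              unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                   u[i][j-1] + u[i][j+1]);
  }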

Weak Scaling and Communication

Load Imbalance: Pitfall 101
[Figure: MPI ranks sorted by total communication time; 64 tasks show ~200 s of communication, 960 tasks show ~230 s]
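
A cheap way to quantify imbalance without a full tool chain (our sketch, not from the slides): time each rank’s compute phase and reduce the extremes. A large max/min gap is time the fast ranks spend waiting in the next collective.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double t0 = MPI_Wtime();
      /* ... per-rank compute phase goes here ... */
      double dt = MPI_Wtime() - t0;

      /* Min and max compute time across all ranks. */
      double tmin, tmax;
      MPI_Reduce(&dt, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
      MPI_Reduce(&dt, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("compute: min %.2f s  max %.2f s  imbalance %.1f%%\n",
                 tmin, tmax, 100.0 * (tmax - tmin) / tmax);
      MPI_Finalize();
      return 0;
  }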

Load Balance: cartoon
[Diagram: a universal app timeline, unbalanced vs. balanced; the difference in end-to-end time is the time saved by load balance]

Watch out for the little stuff
– Even “trivial” MPI (or any function call) can add up
– Where does your code spend time?

Communication Topology
– Where are the bottlenecks in the code & machine?
– Node boundaries, P2P, collective


Communication Topology: as maps of data movement
[Figure: communication topology maps for MILC, PARATEC, IMPACT-T, CAM, MAESTRO, GTC]

Cactus Communication: PDE solvers on block-structured grids

PARATEC Communication: 3D FFT

Time to solution? Don’t forget the batch queue.
Not all performance is inside the app.
[Diagram: the research lifecycle again, including the Queue Wait step: formulate research problem → algorithms → debug → perf debug → jobs → queue wait → data? UQ/VV → understand & publish!]

A few notes on queue optimization
Consider your schedule
– Charge factor: regular vs. low
– Scavenger queues, when you can tolerate interruption
– Xfer queues
– Downshift concurrency
Consider the queue constraints
– Run limit: how many jobs running at once
– Queue limit: how many jobs queued
– Wall limit
  – Soft (can you checkpoint?)
  – Hard (game over)
BTW, jobs can submit other jobs.

Marshalling your own workflow
Lots of choices in general
– PBS, Hadoop, CondorG, MySGE
On Hopper it’s easy:

  #PBS -l mppwidth=4096
  aprun -n 512 ./cmd &
  …
  aprun -n 512 ./cmd &
  wait

Or, keeping nodes busy as work becomes available (pseudocode; work_left and nodes_avail are placeholder tests):

  #PBS -l mppwidth=4096
  while (work_left) {
    if (nodes_avail) {
      aprun -n X next_job &
    }
    wait
  }

Scientific Workflows More Generally
Tigres: design templates for scientific workflows
– Explicitly supports Sequence, Parallel, Split, Merge
Fireworks: high-throughput job scheduler
– Runs on HPC systems
[Diagram: Tigres template examples: a Split and a Sequence composed of tasks (Task 1 → Task 2; Task3, Task4)]
L. Ramakrishnan, V. Hendrix, D. Gunter, G. Pastorello, R. Rodriguez, A. Essari, D. Agarwal
Ask about NERSC DAS

Performance is Time-to-knowledge
[Diagram: timeline from t_start to t_stop]

Thanks!
Contacts:
[Diagram: the research lifecycle once more: formulate research problem → algorithms → debug → perf debug → jobs → queue wait → data? UQ/VV → understand & publish!]