Geant4 Computing Performance Task with Open|Speedshop Soon Yung Jun, Krzysztof Genser, Philippe Canal (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara,

Slides:



Advertisements
Similar presentations
Profiling your application with Intel VTune at NERSC
Advertisements

Intel® performance analyze tools Nikita Panov Idrisov Renat.
Performance Analysis of Multiprocessor Architectures
11/18/2013 SC2013 Tutorial How to Analyze the Performance of Parallel Codes 101 A case study with Open|SpeedShop Section 2 Introduction into Tools and.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 3: Processes.
Fall 2001CS 4471 Chapter 2: Performance CS 447 Jason Bakos.
Process Concept An operating system executes a variety of programs
Lecture 3: Computer Performance
1 Presenter: Chien-Chih Chen Proceedings of the 2002 workshop on Memory system performance.
1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.
Measuring zSeries System Performance Dr. Chu J. Jong School of Information Technology Illinois State University 06/11/2012 Sponsored in part by Deer &
1 Parallel Performance Analysis with Open|SpeedShop Trilab Tools-Workshop Martin Schulz, LLNL/CASC LLNL-PRES
Process Management. Processes Process Concept Process Scheduling Operations on Processes Interprocess Communication Examples of IPC Systems Communication.
04/27/2011 Paradyn Week Open|SpeedShop & Component Based Tool Framework (CBTF) project status and news Jim Galarowicz, Don Maghrak The Krell Institute.
SC 2012 © LLNL / JSC 1 HPCToolkit / Rice University Performance Analysis through callpath sampling  Designed for low overhead  Hot path analysis  Recovery.
Application performance and communication profiles of M3DC1_3D on NERSC babbage KNC with 16 MPI Ranks Thanh Phung, Intel TCAR Woo-Sun Yang, NERSC.
Offline Coordinators  CMSSW_7_1_0 release: 17 June 2014  Usage:  Generation and Simulation samples for run 2 startup  Limited digitization and reconstruction.
FNAL Geant4 Performance Group Issues and Progress Daniel Elvira for M. Fischler, J. Kowalkowski, M. Paterno.
Martin Schulz Center for Applied Scientific Computing Lawrence Livermore National Laboratory Lawrence Livermore National Laboratory, P. O. Box 808, Livermore,
Performance Monitoring Tools on TCS Roberto Gomez and Raghu Reddy Pittsburgh Supercomputing Center David O’Neal National Center for Supercomputing Applications.
Detector Simulation on Modern Processors Vectorization of Physics Models Philippe Canal, Soon Yung Jun (FNAL) John Apostolakis, Mihaly Novak, Sandro Wenzel.
CSC 7600 Lecture 28 : Final Exam Review Spring 2010 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS FINAL EXAM REVIEW Daniel Kogler, Chirag Dekate.
Sep 08, 2009 SPEEDUP – Optimization and Porting of Path Integral MC Code to New Computing Architectures V. Slavnić, A. Balaž, D. Stojiljković, A. Belić,
Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.
Chapter 3 System Performance and Models Introduction A system is the part of the real world under study. Composed of a set of entities interacting.
Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version] Adam Leko HCS Research Laboratory University of Florida.
Morgan Kaufmann Publishers
Processor Architecture
CSC Multiprocessor Programming, Spring, 2012 Chapter 11 – Performance and Scalability Dr. Dale E. Parson, week 12.
Software Performance Monitoring Daniele Francesco Kruse July 2010.
Computing Performance Recommendations #10, #11, #12, #15, #16, #17.
4. Performance 4.1 Introduction 4.2 CPU Performance and Its Factors
Threads. Readings r Silberschatz et al : Chapter 4.
Lecture 5: 9/10/2002CS170 Fall CS170 Computer Organization and Architecture I Ayman Abdel-Hamid Department of Computer Science Old Dominion University.
PAPI on Blue Gene L Using network performance counters to layout tasks for improved performance.
Krzysztof Genser/Fermilab For the Fermilab Geant4 Performance Team.
1 How to do Multithreading First step: Sampling and Hotspot hunting Myongji University Sugwon Hong 1.
Chapter 4: Threads. 4.2 Silberschatz, Galvin and Gagne ©2005 Operating System Concepts – 7 th edition, Jan 23, 2005 Typical Program Processor utilization?
ATLAS Distributed Computing perspectives for Run-2 Simone Campana CERN-IT/SDC on behalf of ADC.
Performance profiling of Experiments’ Geant4 Simulations Geant4 Technical Forum Ryszard Jurga.
General Introduction and prospect Makoto Asai (SLAC PPA/SCA)
CMSC 611: Advanced Computer Architecture Performance & Benchmarks Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some.
Introduction to HPC Debugging with Allinea DDT Nick Forrington
Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,
Lecture 5. Example for periority The average waiting time : = 41/5= 8.2.
ATLAS – statements of interest (1) A degree of hierarchy between the different computing facilities, with distinct roles at each level –Event filter Online.
SQL Database Management
Two notions of performance
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Introduction to threads
Improving the support for ARM in IgProf
Low-Cost High-Performance Computing Via Consumer GPUs
Threads vs. Events SEDA – An Event Model 5204 – Operating Systems.
Defining Performance Which airplane has the best performance?
The Mach System Sri Ramkrishna.
Processes and Threads Processes and their scheduling
Geant4 MT Performance Soon Yung Jun (Fermilab)
MCTS Guide to Microsoft Windows 7
Kilohertz Decision Making on Petabytes
Performance profiling and benchmark for medical physics
Chapter 4: Multithreaded Programming
Process Management Presented By Aditya Gupta Assistant Professor
Geant4 profiling performance for medical physics
Morgan Kaufmann Publishers
From Open|SpeedShop to a Component Based Tool Framework
Capriccio – A Thread Model
Chapter 4: Threads.
Chapter 2: Performance CS 447 Jason Bakos Fall 2001 CS 447.
CSC Multiprocessor Programming, Spring, 2011
CS Introduction to Operating Systems
Presentation transcript:

Geant4 Computing Performance Task with Open|Speedshop Soon Yung Jun, Krzysztof Genser, Philippe Canal (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept , 2016

Geant4 Computing Performance (Review 1/4) Progressive improvements over the last five years (more information at 2 ~30%

Geant4 Computing Performance (Review 2/4) Changes in linked objects (libraries): from 9.5 to 10.2 Also many changes in hot functions from 9.5 to Library /10.2 libG4processes libG4geometry libm.so libG4tracking

Geant4 Computing Performance (Review 3/4) The current hot spot: G4PhysicsVector::Value (~6%) 4 73% Top 4-Callees

Geant4 Computing Performance (Review 4/4) 10.2 cycle: CPU/mem under control while improving physics Geant Multi-threading: good scalability and memory reduction 5

Current Profiling Tools and Primary Metrics Tools –FAST (CPU Profiling) –IgProf (Memory profiling) –Open|Speepshop (Geant4 MT profiling) Metrics (Observables) –Timing Event Function (exclusive and inclusive) Library –Memory Footprint (live, total) statm (top) –Multi-threaded applications Event throughput and memory reduction 6

Challenges and New Requirements Flat profile –No obviously reducible hot spots or bottlenecks –Need more specific metrics Measurement uncertainty –Fixed rate for sampling and call path collections (100Hz) –Different optimization paths when recompiling Geant4 libs ? –Rerun reference for each release (4 nodes x 32 jobs) Limited extensibility –Not all tools are capable for profiling multi-threaded applications –Not capable for profiling codes on new hardware architectures Identify the better profiling tool(s) for G4CPT –Pick a right (integrated) tool to meet all new requirements –Surveyed several tools and selected Open|Speedshop 7

Introduction to Open|Speedshop Open source (the Krell institute, ) Sampling Experiments (Light weighted) Support for Call Stack Analysis Hardware Performance Counters Memory Function Tracing MPI Profiling and Tracing I/O Profiling and Tracing Floating Point Exception Analysis POSIX Thread Function Tracing (Multithreading-capable) Flexible analysis options (GUI, command line, online) Support Intel MIC architecture and NVIDIA CUDA Well documented and supported 8

Sampling Experiments in OSS pcsamp (periodic sampling the program counters) –Low overhead overview of time distribution usertime (call path profiling) –Inclusive and exclusive timing data –Call paths, caller and callee relationships hwcsamp (periodic sampling hardware counters) –Profile of hardware counter events (PAPI events) mem (memory tracing) –Call paths for memory related function call events –Aggregate and individual rank, thread, or processing timings pthreads (POSIX thread tracing) io (I/O tracing) Many other useful experiments 9

Measurement Overheads and Output Size pcsamp: exclusive time - insensitive to sampling frequency (default 100Hz, set to Hz for G4CPT) usertime: inclusive time and call paths – large overhead (default 35, set to 100Hz): similar overhead for hwcsamp 10

Migration Plan and Status: Replace SimpleProfiler (FAST) with Open|Speedshop (OSS) All information provided by old tools are reproduced with OSS and results have been posted since 10.2.r06 11

Results: Average CPU Time Triple-cross-checks (2 sets x 128 AMD nodes, 1set x 48 Intel nodes) - example: H  ZZ (SimplifiedCalo) 12 Total Time Event Time

OSS pcsamp (program counter sampling) Experiment OSS GUI example: Function view Other views: linked object (libraries), function paths and so on 13  g4cpt web plot/table

Comparison between FAST and OSS: Top Ten Functions Exclusive time fraction (10.2.r06, H  ZZ, SimplifiedCalo) Also consistent profiling results for exclusive timing (call paths) and linked objects 14 FAST (100 Hz) OSS (10 KHz)

Open|Speedshop: Experiments with Hardware Counters Periodic sampling hardware counters (hwcsamp) Supports both derived and non-derived PAPI presets A list of some possible hardware counter combinations 15

OSS hwcsamp (hardware counters) Experiment hwcsamp example: TOT_CYS, TOT_INC, FP_OPS, LD_INS (10.2.r06, H  ZZ, SimplifiedCalo) 16

Code Performance by Hardware Counter Metrics Derivatives: examples Example: Higgs  ZZ (10.2.r06) with SimplifiedCalo –IPC = 0.74 (relatively small) –FMO = 0.30 S.Y. Jun17 Hardware Counter Metrics Derivatives Performance IPC (Instruction/Cycle) Large values suggest good balance with minimal stalls. FPC (FLOPS/Cycle) Large values for floating point intensive codes suggests efficient CPU utilization FMO (FLOPS/Memory Ops) Good data locality, Computational Intensity LPC (Loads/Cycle) Useful for calculating FMO, may indicate good stride through arrays. SPC (Stores/Cycle) Useful for calculating FMO, may indicate good stride through arrays.

More OSS Post Analysis Profiling information at the function statement (line) level –Will provide the top-100 list from 10.3 Multi-threading capability (see the next talk) Compare two experiments (osscompare): examples to use –two releases –Two experiments with the different numbers of threads Call path analysis based on DB 18

Work Plan Migration to Open|speedshop (partially done) –Have been tested with recent releases since 10.3.beta –Reproduced all performance data that we used to measure –More information are ready to be added Current set of experiments –pcsamp (exclusive time) –usersamp (inclusive time and call path) –hwcsamp (PAPI hardware counters) Extension –mem (memory tracing) –pthreads (POSIX thread tracing) –io (I/O tracking) 19

Summary Reviewed Geant4 computing performance for recent releases Challenges and new requirement in G4CPT have been addressed Open|speedshop is selected as the default profiler for the future computing performance task Migration is done and computing performance profiling and benchmark are ready to be extended 20

Backup Slides

Geant4 Computing Performance: Hot Functions Changes in hot functions from 9.5 to

Results: OSS linked Objects OSS GUI example: Linked Objects view 23  g4cpt web plot