Statistical Performance Analysis for Scientific Applications
Presentation at the XSEDE14 Conference, Atlanta, GA
Fei Xing, Haihang You, Charng-Da Lu
July 15, 2014
2 Running Time Analysis
Causes of slow runs on a supercomputer
– Improper memory usage
– Poor parallelism
– Too much I/O
– Program not optimized efficiently
– …
Examine the user's code: profiling tools
Profiling = a physical exam for applications
– Communication: Fast Profiling library for MPI (FPMPI)
– Processor & memory: Performance Application Programming Interface (PAPI)
– Overall performance & optimization opportunities: CrayPat
3 Profiling Reports
Profiling tools produce comprehensive reports covering a wide spectrum of application performance
Imagine that, as a scientist and supercomputer user, you see…
– I/O read time, I/O write time, MPI communication time, MPI synchronization time, MPI calls, Level 1 cache misses, memory usage, TLB misses, L1 cache accesses, MPI imbalance, MPI communication imbalance
– More are coming!
Question: how do you make sense of all this information in the report?
– Meaning of the variables
– What the numbers indicate
4 Research Framework
How about…
– Select an HPC benchmark to create baseline kernels
– Use profiling tools to capture the peak performance
– Apply a statistical approach to extract synthetic features that are easy to interpret
– Run real applications, and compare their performance with the “role models”
[Figure courtesy of C.-D. Lu]
5 Gears for the Experiment
Benchmarks: HPC Challenge (HPCC)
– Gauges supercomputers toward peak performance
– 7 representative kernels: DGEMM, FFT, HPL, Random Access, PTRANS, Latency/Bandwidth, STREAM
  – HPL is used in the TOP500 ranking
– 3 parallelism regimes
  – Serial / single processor
  – Embarrassingly parallel
  – MPI parallel
Profiling tools: FPMPI and PAPI
Testing environment: Kraken (Cray XT5)
6 HPCC
Mode: 1 means serial/single processor, * means embarrassingly parallel, M means MPI parallel
7 Training Set Design
2,954 observations
– Various kernels, a wide range of matrix sizes, different numbers of compute nodes
11 performance metrics, gathered from FPMPI and PAPI
– MPI communication time, MPI synchronization time, MPI calls, total MPI bytes, memory, FLOPS, total instructions, L2 data cache access, L1 data cache access, synchronization imbalance, communication imbalance
Data preprocessing
– Convert some metrics to unit-less rates: divide by wall time
– Normalization
[Figure: data matrix. Rows are observations (HPL_1000, *_FFT_2000, …, M_RA_300,000); columns are performance metrics (FLOPS, Memory, …, MPI calls)]
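The two preprocessing steps can be sketched in Python with NumPy. The slides do not say which normalization was used, so the z-score below is an assumption, and the function name, column choices, and numbers are made up for illustration.

```python
import numpy as np

def preprocess(raw, wall_time, rate_columns):
    """Divide the chosen columns by wall time, then z-score every column.

    The z-score is an assumed normalization; the slides only say "Normalization".
    """
    X = np.array(raw, dtype=float)        # copy, so the caller's data is untouched
    t = np.asarray(wall_time, dtype=float)
    X[:, rate_columns] /= t[:, None]      # per-run rate: metric / wall time
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                   # constant columns simply become zero
    return (X - mean) / std

# Made-up observations: MPI communication time (s), MPI calls, memory (GB).
raw = [[2.0, 100.0, 1.5],
       [6.0, 900.0, 2.5],
       [1.0,  50.0, 4.0],
       [8.0, 400.0, 3.0]]
wall_time = [10.0, 30.0, 5.0, 20.0]       # seconds, also made up
Z = preprocess(raw, wall_time, rate_columns=[0, 1])
print(Z.round(2))
```

After this step every metric has zero mean and unit variance, so no single metric dominates the later clustering and PCA purely because of its units.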
8 Extract Synthetic Features
Extract synthetic and accessible Performance Indices (PIs)
Solution: variable clustering + Principal Component Analysis (PCA)
– PCA decorrelates the data
– Problem with using PCA alone: variables with small loadings may unduly influence the PC score
– Standardization and modified PCA do not work well
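A minimal NumPy sketch of what "PCA decorrelates the data" means: the principal component scores have a diagonal covariance matrix. The data here is synthetic and the helper name `pca` is hypothetical; it is plain PCA, not the modified PCA mentioned on the slide.

```python
import numpy as np

def pca(X):
    """Plain PCA via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                    # coordinates along the principal axes
    variances = s**2 / (len(X) - 1)       # variance explained by each component
    return scores, Vt.T, variances

# Synthetic data: two strongly correlated columns plus one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

scores, loadings, variances = pca(X)
cov = np.cov(scores, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max())             # numerically zero: scores are uncorrelated
```

Every original variable contributes to every score through its loading, which is why a PC score alone can be hard to interpret and why the slides pair PCA with variable clustering.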
9 Variable Clustering
Given a partition of X: P_k = (C_1, …, C_k)
Centroid of cluster C_i: y_i
– r is the Pearson correlation
– y_i is the 1st principal component of C_i
Homogeneity of C_i: H(C_i)
Quality of the clustering: H(P_k) = H(C_1) + … + H(C_k)
Optimal partition: the P_k with the highest quality
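The definitions on this slide can be written out as follows. This is a sketch that assumes the usual ClustOfVar-style definitions for standardized quantitative variables. The symbols y_i (centroid) and λ_1^{(i)} (largest eigenvalue of the correlation matrix of the variables in C_i) are notation introduced here, not taken from the slides.

```latex
\begin{align}
y_i &= \text{first principal component of the variables in } C_i \\
H(C_i) &= \sum_{x_j \in C_i} r^2(x_j, y_i) = \lambda_1^{(i)} \\
H(P_k) &= \sum_{i=1}^{k} H(C_i) \\
P_k^{*} &= \arg\max_{P_k} H(P_k)
\end{align}
```

Under these assumptions a cluster is homogeneous when all of its variables are strongly correlated, positively or negatively, with the cluster's own first principal component.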
10 Variable Clustering – Visualize This!
Given a partition: P_4 = (C_1, …, C_4)
Centroid of C_k: 1st PC of C_k
Quality of P_4: H(P_4) = H(C_1) + H(C_2) + H(C_3) + H(C_4)
Optimal partition: the P_4 with the highest quality
11 Implementation
Finding the theoretical optimum is computationally expensive
Agglomerative hierarchical clustering
– Start with each variable as its own cluster
– At each step, merge the closest pair of clusters, until only one cluster is left
The result can be visualized as a dendrogram
Implemented with ClustOfVar in R
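The agglomerative procedure can be sketched in Python with NumPy. This is not the ClustOfVar code. It assumes that homogeneity is the largest eigenvalue of a cluster's correlation matrix and that "closest pair" means the pair whose merge loses the least homogeneity. The function names and test data are hypothetical.

```python
import numpy as np

def homogeneity(R, cluster):
    """Largest eigenvalue of the correlation sub-matrix of one cluster."""
    idx = list(cluster)
    return float(np.linalg.eigvalsh(R[np.ix_(idx, idx)])[-1])

def cluster_variables(X, n_clusters):
    """Greedy bottom-up clustering of the columns of X."""
    R = np.corrcoef(X, rowvar=False)
    clusters = [(j,) for j in range(R.shape[1])]   # each variable starts alone
    merges = []                                    # merge history, dendrogram-style
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Assumed dissimilarity: homogeneity lost by merging a and b.
                loss = (homogeneity(R, clusters[a]) + homogeneity(R, clusters[b])
                        - homogeneity(R, clusters[a] + clusters[b]))
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        loss, a, b = best
        merges.append((clusters[a], clusters[b], loss))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merges

# Synthetic data: columns 0-1 follow one latent factor, columns 2-3 another.
rng = np.random.default_rng(0)
n = 300
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1, f1, f2, -f2]) + 0.2 * rng.normal(size=(n, 4))

clusters, merges = cluster_variables(X, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Column 3 is negatively correlated with column 2 yet still joins its cluster, because the criterion depends on the strength of the correlation rather than its sign. Running with `n_clusters=1` records the full merge history that a dendrogram would display.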
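12 Simulation Output
[Figure: variable clusters and loadings for the three Performance Indices. PI1: Communication, PI2: Memory, PI3: Computation. Loadings shown in the figure (variable labels not preserved): 0.53*, 0.52*, 0.49*, 0.46*, 1.00*, -0.15*, -0.07*, 0.81*, -0.30*, 0.45*, -0.14*]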
13 PIs for Baseline Kernels
14 PI1 vs PI2
2 distinct strata on memory
– Upper: multiple-node runs, which need extra memory buffers
– Lower: single-node runs, which use shared memory
High PI2 for HPL
[Figure axes: PI1 (Communication), PI2 (Memory)]
15 PI1 vs PI3
Similar PI3 pattern for HPL and DGEMM
– Both are computation intensive
– HPL uses the DGEMM routine extensively
All PIs are similar for STREAM and Random Access
[Figure axes: PI1 (Communication), PI3 (Computation)]
16 Courtesy of C.-D. Lu
17 Applications
9 real-world scientific applications in areas such as weather forecasting, molecular dynamics, and quantum physics
– Amber: molecular dynamics
– ExaML: phylogenetic analysis of molecular sequences
– GADGET: cosmology
– Gromacs: molecular dynamics
– HOMME: climate modeling
– LAMMPS: molecular dynamics
– MILC: quantum chromodynamics
– NAMD: molecular dynamics
– WRF: weather research and forecasting
[Figure: Voronoi diagram. Axes: PI1 (Communication), PI3 (Computation)]
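Reading a Voronoi diagram amounts to a nearest-neighbor lookup: a point lies in the cell of the site closest to it. The sketch below assumes the sites are baseline-kernel positions in the (PI1, PI3) plane, which the slides suggest but do not spell out. The kernel labels and coordinates are made up and are not results from the study.

```python
import numpy as np

def nearest_kernel(app_point, kernel_points, kernel_names):
    """Return the name of the site whose Voronoi cell contains app_point."""
    distances = np.linalg.norm(kernel_points - app_point, axis=1)
    return kernel_names[int(np.argmin(distances))]

# Hypothetical sites in the (PI1, PI3) plane; not measured values.
kernel_names = ["kernel_A", "kernel_B", "kernel_C"]
kernel_points = np.array([[0.0, 0.0],
                          [1.0, 0.0],
                          [0.0, 1.0]])

app_point = np.array([0.9, 0.2])          # hypothetical application scores
role_model = nearest_kernel(app_point, kernel_points, kernel_names)
print(role_model)
```

The returned site plays the part of the application's "role model": the baseline kernel whose performance profile it most resembles in that plane.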
18 Conclusion and Future Work
We have
– Proposed a statistical approach that gives users better insight into massive performance datasets
– Created a performance scoring system that uses 3 PIs to capture the high-dimensional performance space
– Given users accessible performance implications and improvement hints
We will
– Test the method on other machines and systems
– Define and develop a set of baseline kernels that better represent HPC workloads
– Construct a user-friendly system that incorporates statistical techniques to drive more advanced performance analysis for non-experts
19 Thanks for your attention! Questions?