Statistical Performance Analysis for Scientific Applications
Presentation at the XSEDE14 Conference, Atlanta, GA
Fei Xing, Haihang You, Charng-Da Lu
July 15, 2014

2 Running Time Analysis
Causes of slow runs on a supercomputer:
– Improper memory usage
– Poor parallelism
– Too much I/O
– Program not optimized efficiently
– …
To examine the user's code: profiling tools. Profiling = a physical exam for applications:
– Communication: Fast Profiling library for MPI (FPMPI)
– Processor and memory: Performance Application Programming Interface (PAPI)
– Overall performance and optimization opportunities: CrayPat

3 Profiling Reports
Profiling tools produce comprehensive reports covering a wide spectrum of application performance. Imagine, as a scientist and supercomputer user, you see: I/O read time, I/O write time, MPI communication time, MPI synchronization time, MPI calls, Level 1 cache misses, memory usage, TLB misses, L1 cache accesses, MPI imbalance, MPI communication imbalance… and more are coming!
Question: how to make sense of this information from the report?
– Meaning of the variables
– Indication of the numbers

4 Research Framework
– Select an HPC benchmark to create baseline kernels
– Use profiling tools to capture their peak performance
– Apply a statistical approach to extract synthetic features that are easy to interpret
– Run real applications, and compare their performance with the "role models"
How about… [Diagram courtesy of C.-D. Lu]

5 Gears for the Experiment
Benchmark: HPC Challenge (HPCC)
– Gauges supercomputers toward peak performance
– 7 representative kernels: DGEMM, FFT, HPL, RandomAccess, PTRANS, Latency/Bandwidth, STREAM (HPL is used in the TOP500 ranking)
– 3 parallelism regimes: Serial/Single Processor, Embarrassingly Parallel, MPI Parallel
Profiling tools: FPMPI and PAPI
Testing environment: Kraken (Cray XT5)

6 HPCC Modes
[Table of HPCC kernels by parallelism mode: "1" means serial/single processor, "*" means embarrassingly parallel, "M" means MPI parallel]

7 Training Set Design
2,954 observations
– Various kernels, a wide range of matrix sizes, different compute nodes
11 performance metrics, gathered from FPMPI and PAPI
– MPI communication time, MPI synchronization time, MPI calls, total MPI bytes, memory, FLOPS, total instructions, L2 data cache accesses, L1 data cache accesses, synchronization imbalance, communication imbalance
Data preprocessing
– Convert some metrics to unit-less rates: divide by wall time
– Normalization
[Table sketch: observations (e.g., HPL_1000, *_FFT_2000, …, M_RA_300,000) as rows, the performance metrics (FLOPS, memory, …, MPI calls) as columns]
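As a hedged illustration of the preprocessing just described (not the authors' code; the DataFrame layout, the column names, and the exact set of rate-converted metrics are my assumptions), the step might look like this in Python/pandas:

```python
# Minimal preprocessing sketch, assuming a DataFrame `raw` with one row per
# observation, the 11 metric columns, and a "wall_time" column.
import pandas as pd

RATE_METRICS = ["mpi_comm_time", "mpi_sync_time", "mpi_calls",
                "total_mpi_bytes", "flops", "total_instructions",
                "l2_dcache_access", "l1_dcache_access"]  # hypothetical column names

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    X = raw.copy()
    # Convert cumulative counts/times to unit-less rates per second of wall time
    X[RATE_METRICS] = X[RATE_METRICS].div(X["wall_time"], axis=0)
    X = X.drop(columns=["wall_time"])
    # Normalize each metric to zero mean and unit variance across observations
    return (X - X.mean()) / X.std()
```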

8 Extract Synthetic Features
Goal: extract synthetic and accessible Performance Indices (PIs)
Solution: Variable Clustering + Principal Component Analysis (PCA)
– PCA decorrelates the data
– Problem with using PCA alone: variables with small loadings may over-influence the PC score
– Standardization and modified PCA do not work well
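As a hedged, self-contained illustration of the "PCA alone" option (random placeholder data below, not the real 2,954 x 11 training matrix), the sketch shows why raw principal components are hard to read: every component mixes all 11 metrics.

```python
# PCA on a placeholder metrics matrix; each component loads on every metric,
# so the component scores lack a direct performance interpretation.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(2954, 11))  # stand-in for the preprocessed metrics
pca = PCA(n_components=3).fit(X)
scores = pca.transform(X)                  # decorrelated scores, one row per observation
print(pca.components_.round(2))            # every component mixes all 11 metrics
print(pca.explained_variance_ratio_)
```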

9 Variable Clustering
Given a partition of the variables $X = \{x_1, \dots, x_p\}$ into $P_K = (C_1, \dots, C_K)$:
– Centroid of cluster $C_k$: the synthetic variable $y_k$, the 1st principal component of the variables in $C_k$
– Homogeneity of $C_k$: $H(C_k) = \sum_{x_j \in C_k} r^2(x_j, y_k)$, where $r$ is the Pearson correlation
– Quality of the clustering: $H(P_K) = \sum_{k=1}^{K} H(C_k)$
– Optimal partition: the $P_K$ that maximizes $H(P_K)$
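A short numpy sketch of these two quantities (my own illustration, assuming each cluster is given as a column matrix of at least two quantitative metrics): $H(C_k)$ equals the leading eigenvalue of the cluster's correlation matrix, and $H(P_K)$ sums the cluster homogeneities.

```python
import numpy as np

def homogeneity(cluster: np.ndarray) -> float:
    """H(C_k) for a cluster given as an (n_observations, n_variables) array,
    n_variables >= 2. Equals the sum of squared Pearson correlations between
    each variable and the cluster's 1st principal component, i.e. the leading
    eigenvalue of the cluster's correlation matrix."""
    corr = np.corrcoef(cluster, rowvar=False)    # variable-by-variable correlations
    return float(np.linalg.eigvalsh(corr)[-1])   # eigvalsh sorts ascending; take the largest

def quality(partition: list[np.ndarray]) -> float:
    """H(P_K): quality of a partition, the sum of its cluster homogeneities."""
    return sum(homogeneity(c) for c in partition)
```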

10 Variable Clustering – Visualize This!
[Diagram: for a partition $P_4 = (C_1, \dots, C_4)$, the centroid of each cluster $C_k$ is its 1st principal component, and the quality of the partition is the sum of the cluster homogeneities, $H(P_4) = H(C_1) + H(C_2) + H(C_3) + H(C_4)$; the optimal partition maximizes this sum.]

11 Implementation
Finding the theoretical optimum is computationally complex, so we use agglomerative hierarchical clustering (see the sketch below):
– Start with the points (variables) as individual clusters
– At each step, merge the closest pair of clusters, until only one cluster remains
The result can be visualized as a dendrogram. Implemented with the ClustOfVar package in R.
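A hedged Python analogue of this clustering step (the talk itself used ClustOfVar in R; the 1 - r^2 dissimilarity and the average linkage below are my assumptions, chosen so that highly correlated metrics merge first):

```python
# Agglomerative hierarchical clustering of the metrics (columns of X).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import squareform

def cluster_variables(X: np.ndarray, n_clusters: int = 3):
    """X: (n_observations, n_variables) matrix of preprocessed metrics."""
    r = np.corrcoef(X, rowvar=False)               # correlations between variables
    dissim = 1.0 - r ** 2                          # weakly correlated variables are far apart
    np.fill_diagonal(dissim, 0.0)
    tree = linkage(squareform(dissim, checks=False), method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels

# usage: tree, labels = cluster_variables(X); dendrogram(tree) draws the tree
```

Cutting the resulting dendrogram at three clusters mirrors the three Performance Indices used in the talk.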

12 Simulation Output
[Figure: the clustering yields three Performance Indices: PI1 (Communication), PI2 (Memory), and PI3 (Computation). The original figure annotates the metrics with coefficients (0.53, 0.52, 0.49, 0.46, 1.00, -0.15, -0.07, 0.81, -0.30, 0.45, -0.14).]

13 PIs for Baseline Kernels

14 PI1 vs. PI2
Two distinct strata on memory:
– Upper: multiple-node runs, which need extra memory buffers
– Lower: single-node runs, shared memory
High PI2 for HPL.
[Scatter plot: x-axis PI1 (Communication), y-axis PI2 (Memory)]

15 PI1 vs. PI3
Similar PI3 pattern for HPL and DGEMM:
– Both are computation intensive
– HPL utilizes the DGEMM routine extensively
Similar values on all PIs for STREAM and RandomAccess.
[Scatter plot: x-axis PI1 (Communication), y-axis PI3 (Computation)]

16 [Figure courtesy of C.-D. Lu]

17 Applications
9 real-world scientific applications in weather forecasting, molecular dynamics, and quantum physics:
– Amber: molecular dynamics
– ExaML: molecular sequencing
– GADGET: cosmology
– Gromacs: molecular dynamics
– HOMME: climate modeling
– LAMMPS: molecular dynamics
– MILC: quantum chromodynamics
– NAMD: molecular dynamics
– WRF: weather research
[Voronoi diagram over the baseline kernels in the PI1 (Communication) vs. PI3 (Computation) plane; a sketch of the nearest-kernel assignment follows the list.]
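A hedged reading of the Voronoi diagram (my own sketch, not the authors' code, and the kernel coordinates below are placeholders rather than measured values): each application, located at its PI coordinates, is matched to the baseline kernel whose Voronoi cell contains it, i.e. its nearest "role model".

```python
# Nearest-kernel assignment in PI space; the Voronoi cell of a kernel is the
# set of points closer to it than to any other kernel.
import numpy as np

kernel_pis = {                 # hypothetical (PI1, PI3) positions of baseline kernels
    "HPL": (0.2, 2.5), "DGEMM": (0.1, 2.4), "FFT": (1.0, 0.8),
    "STREAM": (0.3, -0.5), "RandomAccess": (0.4, -0.6), "PTRANS": (1.8, 0.2),
}

def nearest_kernel(app_pi: tuple[float, float]) -> str:
    """Return the baseline kernel whose Voronoi cell contains the application."""
    names = list(kernel_pis)
    centers = np.array([kernel_pis[n] for n in names])
    dists = np.linalg.norm(centers - np.asarray(app_pi), axis=1)
    return names[int(np.argmin(dists))]

# e.g., an application profiled at PI1 = 0.15, PI3 = 2.3 lands in the HPL/DGEMM region
print(nearest_kernel((0.15, 2.3)))
```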

18 Conclusion and Future Work
We have:
– Proposed a statistical approach that gives users better insight into massive performance datasets;
– Created a performance scoring system that uses 3 PIs to capture the high-dimensional performance space;
– Given users accessible performance implications and improvement hints.
We will:
– Test the method on other machines and systems;
– Define and develop a set of baseline kernels that better represent HPC workloads;
– Construct a user-friendly system incorporating statistical techniques to drive more advanced performance analysis for non-experts.

19 Thanks for your attention! Questions?