PHANTOM: Predicting Performance of Parallel Applications on Large-scale Parallel Machines Using a Single Node
Jidong Zhai, Wenguang Chen, Weimin Zheng
Tsinghua University
November 20, 2018

Motivation
Large-scale parallel computers:
- Cost millions of dollars
- Take many years to design and implement
For designers of HPC systems:
- What is the performance of application X on a parallel machine Y with 10,000 nodes connected by network Z?
- Performance prediction enables designers to evaluate various design alternatives

Motivation
Performance prediction of parallel applications is important in the HPC area:
- Application optimization
- System tuning
- System procurement
- ...

Performance of Parallel Applications
Accurate performance prediction is difficult. The execution time of a parallel application depends on:
- Sequential computation time
- Communication time
- Their convolution: overlap between computation and communication, and synchronization overhead

Focus of Our Work
In this work, we focus on how to acquire accurate sequential computation time (for applications written with MPI).
- Existing network simulators: BigNetSim (UIUC), DIMEMAS (UPC), SIM-MPI (Tsinghua)
- Current bottleneck: how to acquire sequential computation time accurately and efficiently without the full target parallel platform

Previous Work
Previous approaches for acquiring sequential computation time:
- Model-based: build an application signature (memory access pattern, number of INT and FP instructions, etc.) and combine it with target platform parameters. Limitation: difficult to build an accurate model for complex program behavior, such as cache and bus contention on multi-core platforms.
- Measurement-based: measure the sequential computation time of weak-scaling applications. Limitation: fails to deal with strong-scaling applications.
- Regression-based: extrapolate the sequential computation time. Limitation: not applicable to some real applications due to non-linear behavior.

Our Contributions
- Employ the deterministic replay technique to measure sequential computation time for strong-scaling applications without a full-scale target platform
- Propose representative replay to reduce message-log size and measurement time

Outline
- Prediction Framework
- Basic Idea
- Improvement
- Evaluation

Prediction Framework
We use a trace-driven simulation approach for performance prediction:
- Collect computation and communication traces: separate computation and communication in parallel applications (CPU_Burst, msg_size, msg_type, source, dest., etc.); the FACT technique (SC-2009) collects MPI traces for large-scale applications on small-scale systems
- Acquire the sequential computation time of each process: with deterministic replay, using a single node of the target platform
- Use a trace-driven simulator to convolve communication and computation performance: the SIM-MPI simulator

Prediction Framework
An example Fortran MPI program:

    real A(MAX,MAX), B(MAX,MAX), C(MAX,MAX), buf(MAX,MAX)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid...)
    DO iter=1, N
      if (myid .gt. 0) then
        call MPI_RECV(buf(1,1), num, MPI_REAL, myid-1, ...)
      endif
      DO i=1, MAX
        DO j=1, MAX
          A(i,j) = B(i,j)*C(i,j) + buf(i,j)
        END DO
      END DO
      if (myid .lt. numprocs-1) then
        call MPI_SEND(A(1,1), num, MPI_REAL, myid+1, ...)
      endif
    END DO
    call MPI_FINALIZE(rc)

The SIM-MPI (trace-driven) simulator combines three inputs to produce the predicted performance of the parallel application:
- MPI traces (process id=0): MPI_Init, MPI_Rank(id=0), CPU_Burst(id,0), MPI_Send(id+1,size), CPU_Burst(id,1), CPU_Burst(id,2), ...
- Sequential computation time (id=0): CPU_Burst(id,0)=2sec, CPU_Burst(id,1)=3sec, CPU_Burst(id,2)=4sec, CPU_Burst(id,3)=2sec, ...
- Network parameters: latency = 1.6 usec, bandwidth = 1.5 GB/sec, topology = 2D mesh, ...
A toy sketch of how a trace-driven simulator can convolve these inputs follows.
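The following C sketch is only an illustration under assumed event names and a simple latency-plus-bandwidth cost model; it is not SIM-MPI itself, which additionally models contention, overlap, and synchronization.

    /* Toy trace-driven convolution: replay CPU_Burst times from the
     * computation trace and charge a simple latency + size/bandwidth
     * cost for communication events. All names are illustrative. */
    #include <stdio.h>

    typedef enum { EV_COMP, EV_SEND, EV_RECV } ev_kind_t;

    typedef struct {
        ev_kind_t kind;
        double    cpu_burst;  /* seconds, used for EV_COMP (from replay) */
        long      msg_size;   /* bytes, used for EV_SEND / EV_RECV       */
        int       peer;       /* source or destination rank              */
    } trace_event_t;

    /* Predict the elapsed time of one process from its event list. */
    static double simulate_process(const trace_event_t *ev, int n,
                                   double latency, double bandwidth)
    {
        double t = 0.0;
        for (int i = 0; i < n; i++) {
            switch (ev[i].kind) {
            case EV_COMP: t += ev[i].cpu_burst; break;
            case EV_RECV: t += latency + ev[i].msg_size / bandwidth; break;
            case EV_SEND: t += latency; break;  /* assume an eager send */
            }
        }
        return t;
    }

    int main(void)
    {
        /* CPU_Burst values come from replay; message sizes from MPI traces. */
        trace_event_t p0[] = {
            { EV_COMP, 2.0, 0,       0 },
            { EV_SEND, 0.0, 1 << 20, 1 },
            { EV_COMP, 3.0, 0,       0 },
        };
        double t = simulate_process(p0, 3, 1.6e-6, 1.5e9);
        printf("predicted time of process 0: %.6f s\n", t);
        return 0;
    }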

Two Basic Definitions
Execution model (example code):

    MPI_Init()
    c0  // computation
    MPI_Barrier
    for(i=0; i<N; i++){
      if(myid%2 == 1){
        MPI_Recv(..., myid-1, ...)
        c1(i)
        MPI_Send(..., myid-1, ...)
      }
      c2(i)
      if(myid%2 == 0){
        MPI_Send(..., myid+1, ...)
        c3(i)
        MPI_Recv(..., myid+1, ...)
      }
    }
    MPI_Final()

DEF 1: Communication Sequence (C.S.): records the message type of each communication operation in temporal order.
    C(P0) = {Init, Barrier, [Send, Recv], Fina}
    C(P1) = {Init, Barrier, [Recv, Send], Fina}
DEF 2: Sequential Computation Vector (S.C.V.): records the sequential computation performance of each process; each element of the vector is the elapsed time of the corresponding computation unit.
    C0 = [c0, c2(0), c3(0), c2(1), c3(1), ..., c3(N-1)]
    C1 = [c0, c1(0), c2(0), c1(1), c2(1), ..., c2(N-1)]
A minimal sketch of how such computation units could be timed follows.
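To make the S.C.V. concrete, the sketch below shows one way a computation unit could be timed: with the MPI profiling (PMPI) interface, the wall-clock time between the return of one MPI call and the entry of the next is stored as one vector element. Only MPI_Barrier is wrapped here, and the fixed-size buffer is an illustrative assumption; this is not PHANTOM's actual instrumentation, which would wrap every MPI entry point the same way.

    /* Each element of the Sequential Computation Vector is the elapsed
     * time between two consecutive MPI calls of the same process. */
    #include <mpi.h>

    #define MAX_UNITS 1048576
    static double scv[MAX_UNITS];    /* sequential computation vector       */
    static int    scv_len = 0;       /* number of computation units so far  */
    static double last_exit = -1.0;  /* time the previous MPI call returned */

    static void end_computation_unit(void)
    {
        if (last_exit >= 0.0 && scv_len < MAX_UNITS)
            scv[scv_len++] = MPI_Wtime() - last_exit;  /* one c_k element */
    }

    int MPI_Barrier(MPI_Comm comm)
    {
        end_computation_unit();       /* close the unit before the call   */
        int rc = PMPI_Barrier(comm);  /* forward to the real MPI routine  */
        last_exit = MPI_Wtime();      /* the next computation unit starts */
        return rc;
    }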

Deterministic Replay
Deterministic replay is a technique originally developed for debugging parallel applications. Replay tools work in two phases:
- Record phase: log irreproducible information (return values, incoming messages, etc.)
- Replay phase: replay the faulty process to any state of the recorded execution
Data replay executes any single process in isolation during the replay phase, rather than having to execute the entire parallel application.

Acquire Sequential Computation Time
- Build a message-log database: on a host platform, as in data replay, store the message logs in a database
- Replay each process separately: using a single node of the target platform, measure the sequential computation time
(Concurrent replay, which replays several processes at once, is covered in the backup slides.)

Acquire Sequential Computation Time
Collect message logs (record phase):

    int MPI_Recv(buf, count, type, src, tag, comm, status) {
      int retVal = PMPI_Recv(buf, count, type, src, tag, comm, status)
      Write retVal to log
      Write buf to log
      Write status to log
      return retVal
    }

Replay and record time-stamps (replay phase):

    int MPI_Recv(buf, count, type, src, tag, comm, status) {
      Record time-stamp(Bk)
      Read log to retVal
      Read log to buf
      Read log to status
      Record time-stamp(Ek)
      return retVal
    }

A fuller C sketch of the record-phase wrapper follows.
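For concreteness, here is a hedged C sketch of the record-phase MPI_Recv wrapper above, written against the PMPI profiling interface. The per-rank log file name, the binary layout, and the assumption of a contiguous datatype are illustrative choices, not PHANTOM's actual log format; the replay-phase wrapper would read the same records back and bracket them with time-stamps as shown in the pseudocode.

    /* Record-phase wrapper: perform the real receive, then log everything
     * a later replay needs to reproduce the call without the sender. */
    #include <mpi.h>
    #include <stdio.h>

    static FILE *msg_log = NULL;

    static FILE *open_log(void)
    {
        if (!msg_log) {
            int rank;
            char name[64];
            PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
            snprintf(name, sizeof(name), "msglog.%d", rank); /* one log per process */
            msg_log = fopen(name, "wb");
            if (!msg_log) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }
        }
        return msg_log;
    }

    int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
                 MPI_Comm comm, MPI_Status *status)
    {
        int retVal = PMPI_Recv(buf, count, type, src, tag, comm, status);

        FILE *log = open_log();
        int size;
        PMPI_Type_size(type, &size);             /* assumes a contiguous datatype */
        fwrite(&retVal, sizeof(retVal), 1, log);
        fwrite(buf, (size_t)size, (size_t)count, log);  /* message payload */
        fwrite(status, sizeof(MPI_Status), 1, log);
        return retVal;
    }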

Challenges
Two challenges arise when processing large-scale applications:
- Large time overhead: with n processes and a per-process replay time of T, replaying one process at a time takes nT to acquire all the sequential computation time; this is impractical for an application with thousands of processes, each executing for several hours (e.g., 1,000 processes that each replay for two hours would need about 2,000 machine-hours serially)
- Huge log size: data replay requires recording all incoming messages of every process, so the log size grows rapidly with the number of processes

Computation Similarity
Observation: the computation behavior of the processes in an MPI-based parallel application can be clustered into a few groups, and the processes within each group have similar computation behavior.

Computation Similarity
An example: the NPB MG program (CLASS=C, NPROCS=16)
- Group 1: P0-P3, P8-P11
- Group 2: P4-P7, P12-P15
(Figures: computation behavior of Process 10 vs. Process 11, and Process 11 vs. Process 13.)

Representative Replay
Our approach:
- Partition the processes into a number of groups so that the computation behavior of processes within a group is as similar as possible
- Choose a representative process (R.P.) from each group to record and replay
- Use the sequential computation time of the R.P. for the other processes in its group

Select Representative Processes
- Distance between S.C.V.s: Manhattan distance
- Clustering techniques: K-means clustering (requires the number of classes a priori) or hierarchical clustering with complete linkage
(Figure: dendrogram for the NPB MG program, CLASS=C, NPROCS=16.)
A sketch of complete-linkage grouping over S.C.V.s is given below.
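As an illustration of the grouping step, the sketch below clusters processes by the Manhattan distance between their S.C.V.s, using a naive agglomerative scheme with complete linkage that merges groups until the closest pair exceeds a distance threshold. The threshold-based stopping rule (rather than cutting the dendrogram by inspection) is an assumption of this sketch, not necessarily PHANTOM's exact procedure; the lowest-ranked member of each resulting group can serve as its representative process.

    /* Group processes whose Sequential Computation Vectors are close in
     * Manhattan distance, using naive complete-linkage agglomeration.
     * The nested loops are deliberately simple: this is an illustration. */

    static double manhattan(const double *a, const double *b, int len)
    {
        double d = 0.0;
        for (int k = 0; k < len; k++)
            d += (a[k] > b[k]) ? a[k] - b[k] : b[k] - a[k];
        return d;
    }

    /* scv[p] is the S.C.V. of process p; group[p] receives its group label.
     * Returns the number of groups. */
    static int cluster(double **scv, int nproc, int len,
                       double threshold, int *group)
    {
        for (int p = 0; p < nproc; p++)   /* start with one group per process */
            group[p] = p;

        for (;;) {
            double best = -1.0;
            int ga = -1, gb = -1;
            /* Complete linkage: the distance between two groups is the
             * maximum pairwise distance between their members. */
            for (int a = 0; a < nproc; a++)
                for (int b = a + 1; b < nproc; b++) {
                    double worst = -1.0;
                    for (int p = 0; p < nproc; p++) {
                        if (group[p] != a) continue;
                        for (int q = 0; q < nproc; q++) {
                            if (group[q] != b) continue;
                            double d = manhattan(scv[p], scv[q], len);
                            if (d > worst) worst = d;
                        }
                    }
                    if (worst < 0.0) continue;           /* empty group label */
                    if (best < 0.0 || worst < best) { best = worst; ga = a; gb = b; }
                }
            if (best < 0.0 || best > threshold)  /* nothing close enough left */
                break;
            for (int p = 0; p < nproc; p++)      /* merge group gb into ga */
                if (group[p] == gb) group[p] = ga;
        }

        int ngroups = 0;                         /* count surviving labels */
        for (int g = 0; g < nproc; g++)
            for (int p = 0; p < nproc; p++)
                if (group[p] == g) { ngroups++; break; }
        return ngroups;
    }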

Implementation: PHANTOM
PHANTOM is a performance prediction framework based on representative replay (related frameworks include PERC (SC-2002) and macro-level simulation (SC-2008)). It is an automatic tool-chain with three main modules:
- CompAna module
- CommAna module
- NetSim module
(Figure: overview of PHANTOM.)

Evaluation
Platforms: Dawning, DeepComp-F, and DeepComp-B (detailed in a table on the slide).
Benchmarks:
- BT, CG, EP, LU, MG, and SP from NPB, strong-scaling (CLASS=C)
- Sweep3D, strong-scaling (512*512*200) and weak-scaling (100*100*100)

Grouping Results
The number of process groups with similar computation behavior:
- BT, CG, EP, and SP: all processes have nearly identical computation behavior
- LU and Sweep3D: the number of groups stays constant
- MG: the number of groups increases with the number of processes
Observation: an identical communication sequence implies similar computation behavior.

Methodology
- The sequential computation time of each R.P. is acquired using a single node of the target platform
- Predicted time: produced by PHANTOM; real time: measured on the target platform
- Comparison: the time predicted by PHANTOM is compared with a regression-based method (Barnes et al. [1])
[1] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz. A regression-based approach to scalability prediction. In ICS'08, pages 368-377, 2008.

Prediction Results
(Table: prediction errors (%) with PHANTOM (P.T.) vs. the regression-based approach (R.B.) on the Dawning platform.)
The prediction error with PHANTOM is less than 8% on average for all the programs.

Predicted Results for Sweep3D
(Figure: performance prediction for Sweep3D on the Dawning, DeepComp-F, and DeepComp-B platforms.)
- Average prediction error: 2.67% on Dawning, 1.30% on DeepComp-F, 2.34% on DeepComp-B
- Maximum error: 6.54%, on the Dawning platform
- For 1024 processes: PHANTOM error 4.53% vs. regression error 23.67%
- Among the three platforms, DeepComp-B delivers the best performance for Sweep3D

Limitations and Discussion
- Problem size: limited by the scale of the host platform (grid systems and SSD devices could help)
- Node of the target platform: could be replaced by a hardware simulator of a single node
- I/O operations: future work

Conclusion
- Use deterministic replay to acquire accurate sequential computation time
- Propose representative replay, based on computation similarity, to reduce measurement time

Thank you!

Backup Slides

Concurrent Replay
Application performance can be affected significantly by resource contention, such as cache contention and bus contention. Concurrent replay therefore replays multiple processes simultaneously on one node, as in the sketch below.
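The minimal, Linux-specific sketch below launches several replay instances at once on a single node and pins each to its own core, so that cache and bus contention during replay resembles the original execution. The replay_process() stub and the core-numbering scheme are hypothetical placeholders, not the tool's actual replay entry point.

    /* Concurrent replay sketch: fork one replay child per core. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void replay_process(int rank)
    {
        /* Placeholder: a real tool would re-execute rank's computation here,
         * feeding its MPI receives from the recorded message log. */
        printf("replaying rank %d\n", rank);
    }

    int main(int argc, char **argv)
    {
        int nreplay = (argc > 1) ? atoi(argv[1]) : 4;  /* processes per node */

        for (int i = 0; i < nreplay; i++) {
            pid_t pid = fork();
            if (pid == 0) {                 /* child: pin to core i, then replay */
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(i, &set);
                if (sched_setaffinity(0, sizeof(set), &set) != 0)
                    perror("sched_setaffinity");
                replay_process(i);
                _exit(0);
            } else if (pid < 0) {
                perror("fork");
                return 1;
            }
        }
        for (int i = 0; i < nreplay; i++)   /* wait for all replay children */
            wait(NULL);
        return 0;
    }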

Accuracy of Sequential Computation Performance
(Figure: real sequential computation performance vs. that acquired with representative replay, for process 0 of Sweep3D-S with 256 processes on the Dawning platform.)

Breakdown of Predicted Time
(Figure: breakdown of the predicted time of process 0 for each program with 256 processes on the Dawning platform; comp = computation time, comm = communication time, syn = synchronization cost.)
Synchronization cost accounts for a large proportion of the execution time for most programs.

Message-Log Size
Message logs are recorded only for the representative processes, so the message-log size remains reasonable.
(Table: message-log size, in gigabytes, except EP in kilobytes.)

Replay Overhead
Incoming messages are read from the log database, so little synchronization overhead is introduced during replay.
(Figure: elapsed time of replay-based execution compared with normal execution for each program with 256 processes.)

Performance of the SIM-MPI Simulator
- Trace-driven simulation with high efficiency
- Available at www.hpctest.org.cn/resources/sim-mpi.tgz
- Simulation platform: 2-way quad-core Xeon E5504 (2.0 GHz), 12 GB memory
(Table: performance of the SIM-MPI simulator, in seconds.)

Limitations
- Problem size: grid systems, SSD devices, and cross-platform prediction could help
- Node of the target platform: could be replaced by a single-node simulator
- I/O operations: future work
- Non-deterministic applications: for well-behaved applications, non-deterministic behavior does not have a significant impact on performance