PHANTOM: Predicting Performance of Parallel Applications on Large-scale Parallel Machines Using a Single Node
Jidong Zhai, Wenguang Chen, Weimin Zheng
Tsinghua University
November 20, 2018
Motivation
- Large-scale parallel computers:
  - Cost millions of dollars
  - Take many years to design and implement
- For designers of HPC systems:
  - What is the performance of application X on a parallel machine Y with 10,000 nodes connected by network Z?
  - Performance prediction enables designers to evaluate various design alternatives
Motivation
Performance prediction of parallel applications is important in the HPC area:
- Application optimization
- System tuning
- System procurement
- ...
Performance of Parallel Applications
Accurate performance prediction is difficult. The execution time of a parallel application is determined by:
- Sequential computation time
- Communication time
- Their convolution:
  - Overlap between computation and communication
  - Synchronization overhead
Focus of Our Work
In this work, we focus on how to acquire accurate sequential computation time. (We focus on applications written with MPI.)
- Existing network simulators:
  - BigNetSim (UIUC)
  - DIMEMAS (UPC)
  - SIM-MPI (Tsinghua)
- Current bottleneck: how to acquire sequential computation time accurately and efficiently without the target parallel platform
Previous Work
Previous approaches for acquiring sequential computation time:
- Model-based method:
  - Application signature: memory access pattern, number of INT and FP instructions, etc.
  - Target platform parameters
  - Limitation: difficult to build an accurate model for complex program behavior (e.g., cache and bus contention on multi-core platforms)
- Measurement-based method:
  - Measure sequential computation time for weak-scaling applications
  - Limitation: fails to deal with strong-scaling applications
- Regression-based method:
  - Extrapolate sequential computation time
  - Limitation: not applicable to some real applications due to non-linear behavior
Our Contributions
- Employ deterministic replay to measure the sequential computation time of strong-scaling applications without the full-scale target platform
- Propose representative replay to reduce message-log size and measurement time
Outline
- Prediction Framework
- Basic Idea
- Improvement
- Evaluation
Prediction Framework
We use a trace-driven simulation approach for performance prediction:
1. Collect computation and communication traces
   - Separate computation and communication in parallel applications: CPU_Burst, msg_size, msg_type, source and destination, etc.
   - FACT techniques (SC-2009): collect MPI traces of large-scale applications on small-scale systems
2. Acquire the sequential computation time of each process
   - With deterministic replay, using a single node of the target platform
3. Use a trace-driven simulator to convolve communication and computation performance
   - SIM-MPI simulator
Prediction Framework
An example Fortran MPI program:

      real A(MAX,MAX), B(MAX,MAX), C(MAX,MAX), buf(MAX,MAX)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ...)
      DO iter=1, N
        if (myid .gt. 0) then
          call MPI_RECV(buf(1,1), num, MPI_REAL, myid-1, ...)
        endif
        DO i=1, MAX
          DO j=1, MAX
            A(i,j) = B(i,j)*C(i,j) + buf(i,j)
          END DO
        END DO
        if (myid .lt. numprocs-1) then
          call MPI_SEND(A(1,1), num, MPI_REAL, myid+1, ...)
        endif
      END DO
      call MPI_FINALIZE(rc)

The SIM-MPI trace-driven simulator combines three inputs to produce the predicted performance of the parallel application:
- MPI traces, e.g., for process id=0: MPI_Init, MPI_Rank(id=0), CPU_Burst(id,0), MPI_Send(id+1,size), CPU_Burst(id,1), CPU_Burst(id,2), ...
- Sequential computation time, e.g., CPU_Burst(id,0)=2 sec, CPU_Burst(id,1)=3 sec, CPU_Burst(id,2)=4 sec, CPU_Burst(id,3)=2 sec, ...
- Network parameters, e.g., latency = 1.6 usec, bandwidth = 1.5 GB/sec, topology = 2D mesh, ...
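The convolution step can be illustrated with a toy trace-driven simulator. This is only a sketch of the idea, not SIM-MPI itself: the trace format, eager-send assumption, and simple latency/bandwidth message cost are all made up for illustration, and real simulators model contention and topology as well.

```python
LATENCY = 1.6e-6    # seconds, from the example network parameters
BANDWIDTH = 1.5e9   # bytes/second

def msg_cost(nbytes):
    """Simple latency/bandwidth model for one point-to-point message."""
    return LATENCY + nbytes / BANDWIDTH

def simulate(traces):
    """Convolve per-process traces of ("comp", seconds),
    ("send", dst, nbytes), and ("recv", src, nbytes) events.
    Sends are eager (cost the sender nothing); a recv blocks until
    the matching message has arrived. Returns each process's finish time."""
    n = len(traces)
    clock = [0.0] * n
    pos = [0] * n
    arrivals = {}  # (src, dst) -> FIFO list of message arrival times
    progress = True
    while progress:
        progress = False
        for p in range(n):
            while pos[p] < len(traces[p]):
                ev = traces[p][pos[p]]
                if ev[0] == "comp":
                    clock[p] += ev[1]
                elif ev[0] == "send":
                    _, dst, nbytes = ev
                    arrivals.setdefault((p, dst), []).append(
                        clock[p] + msg_cost(nbytes))
                else:  # "recv"
                    _, src, _ = ev
                    queue = arrivals.get((src, p), [])
                    if not queue:
                        break  # matching send not posted yet; retry later
                    clock[p] = max(clock[p], queue.pop(0))
                pos[p] += 1
                progress = True
    return clock

# Two-process pipeline: P0 computes 2 s, then sends 600 MB to P1,
# which computes 3 s after the message arrives.
traces = [
    [("comp", 2.0), ("send", 1, 600_000_000)],
    [("recv", 0, 600_000_000), ("comp", 3.0)],
]
finish = simulate(traces)
```

Here P1 finishes at 2.0 + (1.6e-6 + 0.4) + 3.0 ≈ 5.4 seconds; the message cost is visible directly in the predicted time.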
Two Basic Definitions
Execution model:

    MPI_Init()
    c0                 /* c_i denotes computation */
    MPI_Barrier
    for (i = 0; i < N; i++) {
        if (myid % 2 == 1) {
            MPI_Recv(..., myid-1, ...)
            c1(i)
            MPI_Send(..., myid-1, ...)
        }
        c2(i)
        if (myid % 2 == 0) {
            MPI_Send(..., myid+1, ...)
            c3(i)
            MPI_Recv(..., myid+1, ...)
        }
    }
    MPI_Finalize()

DEF 1: Communication Sequence (C.S.): records the message type of each communication operation in temporal sequence.
    C(P0) = {Init, Barrier, [Send, Recv], Final}
    C(P1) = {Init, Barrier, [Recv, Send], Final}
DEF 2: Sequential Computation Vector (S.C.V.): records the sequential computation performance of each process. Each element of the vector is the elapsed time of the corresponding computation unit.
    C0 = [c0, c2(0), c3(0), c2(1), c3(1), ..., c3(N-1)]
    C1 = [c0, c1(0), c2(0), c1(1), c2(1), ..., c2(N-1)]
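The two definitions can be sketched with a small helper. The trace format here (tagged tuples) is hypothetical; it just makes the split between the communication sequence and the sequential computation vector concrete.

```python
def split_trace(events):
    """Split one process's event stream into its communication sequence
    (DEF 1) and its sequential computation vector (DEF 2).
    Events are ("comm", op_name) or ("comp", elapsed_seconds);
    this trace format is invented for illustration."""
    comm_seq = [name for kind, name in events if kind == "comm"]
    scv = [val for kind, val in events if kind == "comp"]
    return comm_seq, scv

# Process P1 from the execution model with N = 1:
# Init, c0, Barrier, Recv, c1(0), Send, c2(0), Final
p1 = [("comm", "Init"), ("comp", 1.2),    # c0
      ("comm", "Barrier"),
      ("comm", "Recv"), ("comp", 0.8),    # c1(0)
      ("comm", "Send"), ("comp", 0.5),    # c2(0)
      ("comm", "Final")]
cs, scv = split_trace(p1)
# cs  matches C(P1) = {Init, Barrier, [Recv, Send], Final}
# scv matches C1   = [c0, c1(0), c2(0)]
```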
Deterministic Replay
- A technique originally developed for debugging parallel applications
- Replay tools include two phases:
  - Record phase: log irreproducible information (return values, incoming messages, etc.)
  - Replay phase: replay the faulty process to any state of the recorded execution
- Data replay: execute any single process in isolation, rather than having to execute the entire parallel application during the replay phase
Acquire Sequential Computation Time
1. Build a message-log database:
   - On a host platform, same as data replay
   - Store the message logs into a database
2. Replay each process separately:
   - Using a single node of the target platform
   - Measure the sequential computation time
Acquire Sequential Computation Time
Collect message log (record phase):

    int MPI_Recv(buf, count, type, src, tag, comm, status)
    {
        int retVal = PMPI_Recv(buf, count, type, src, tag, comm, status);
        Write retVal to log
        Write buf to log
        Write status to log
        return retVal;
    }

Replay and record time-stamps (replay phase):

    int MPI_Recv(buf, count, type, src, tag, comm, status)
    {
        Record time-stamp (B_k)
        Read log into retVal
        Read log into buf
        Read log into status
        Record time-stamp (E_k)
        return retVal;
    }
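The effect of these wrappers can be mimicked in a few lines. This is an illustrative sketch only, not the real PMPI instrumentation or log database: incoming messages are appended to an in-memory log during recording, and during replay each receive is satisfied from the log and time-stamped, so the gap between E_k and B_{k+1} is the sequential computation time of the unit between two communications.

```python
import time

class ReplayLog:
    """Sketch of data replay's record and replay phases (hypothetical
    class; PHANTOM's actual wrappers intercept MPI via PMPI)."""
    def __init__(self):
        self.log = []     # recorded (retval, buf, status) tuples, FIFO
        self.stamps = []  # (B_k, E_k) pairs taken during replay

    def record_recv(self, retval, buf, status):
        # Record phase: called on the host platform after the real
        # PMPI_Recv; saves everything the replay will need.
        self.log.append((retval, buf, status))

    def replay_recv(self):
        # Replay phase: no real communication happens; the logged
        # message is read back and the call is time-stamped.
        b_k = time.perf_counter()
        retval, buf, status = self.log.pop(0)
        e_k = time.perf_counter()
        self.stamps.append((b_k, e_k))
        return retval, buf, status

    def computation_times(self):
        # Elapsed time between consecutive communications: E_k .. B_{k+1}.
        return [self.stamps[i + 1][0] - self.stamps[i][1]
                for i in range(len(self.stamps) - 1)]

log = ReplayLog()
log.record_recv(0, b"hello", "status0")   # record phase
log.record_recv(0, b"world", "status1")
r1 = log.replay_recv()                    # replay phase
r2 = log.replay_recv()
```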
Challenges
Two challenges arise when processing large-scale applications:
- Large time overhead
  - Assume n processes, each taking time T to replay
  - Replaying one process at a time takes nT to acquire all sequential computation times
  - This is impractical for an application with thousands of processes that each execute for several hours
- Huge log size
  - Data replay requires recording all the incoming messages of each process
  - The log size becomes huge as the number of processes grows
Computation Similarity
Observation: the computation behavior of processes in MPI-based parallel applications can be clustered into a few groups, where the processes in each group have similar computation behavior.
Computation Similarity
An example: NPB MG program (CLASS=C, NPROCS=16)
- Group 1: P0-P3, P8-P11
- Group 2: P4-P7, P12-P15
(Figures: computation-time comparison of Process 10 vs. Process 11, and Process 11 vs. Process 13)
Representative Replay
Our approach:
1. Partition processes into a number of groups such that the computation behavior of processes in the same group is as similar as possible
2. Choose a representative process (R.P.) from each group to record and replay
3. Use the sequential computation time of each R.P. for the other processes in its group
Select Representative Processes
- Distance between S.C.V.s: Manhattan distance
- Clustering techniques:
  - K-means clustering: requires the number of clusters a priori
  - Hierarchical clustering with complete linkage
(Figure: dendrogram for the NPB MG program, CLASS=C, NPROCS=16)
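The selection step can be sketched end to end: Manhattan distance between sequential computation vectors, a naive agglomerative clustering with complete linkage (a toy stand-in for a real hierarchical-clustering library, with a hypothetical distance threshold as the cut), and a medoid-style pick of one representative per group.

```python
def manhattan(u, v):
    """Manhattan distance between two sequential computation vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def complete_linkage_groups(scvs, threshold):
    """Naive agglomerative clustering: repeatedly merge the two closest
    groups under complete linkage (distance of the farthest member pair)
    until the smallest inter-group distance exceeds `threshold`."""
    groups = [[i] for i in range(len(scvs))]

    def dist(g1, g2):
        return max(manhattan(scvs[i], scvs[j]) for i in g1 for j in g2)

    while len(groups) > 1:
        d, a, b = min((dist(groups[a], groups[b]), a, b)
                      for a in range(len(groups))
                      for b in range(a + 1, len(groups)))
        if d > threshold:
            break
        groups[a] += groups.pop(b)
    return groups

def representatives(scvs, groups):
    """Pick each group's medoid: the member closest to all the others."""
    return [min(g, key=lambda i: sum(manhattan(scvs[i], scvs[j]) for j in g))
            for g in groups]

# Four processes with made-up S.C.V.s: two clear behavior groups.
scvs = [[1.0, 1.0], [1.1, 1.0], [5.0, 5.0], [5.2, 5.1]]
groups = complete_linkage_groups(scvs, threshold=1.0)
reps = representatives(scvs, groups)
```

Only the processes in `reps` would be recorded and replayed; their measured times stand in for the rest of their groups.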
Implementation
PHANTOM:
- A performance prediction framework based on representative replay
  - Related frameworks: PERC (SC-2002), macro-level simulation (SC-2008)
- An automatic tool-chain
- Three main modules:
  - CompAna module
  - CommAna module
  - NetSim module
(Figure: overview of PHANTOM)
Evaluation
Platforms: Dawning, DeepComp-F, and DeepComp-B
Benchmarks:
- BT, CG, EP, LU, MG, and SP from NPB: strong-scaling (CLASS=C)
- Sweep3D: strong-scaling (512*512*200) and weak-scaling (100*100*100)
Grouping Results
- BT, CG, EP, and SP: all processes have almost the same computation behavior
- LU and Sweep3D: the number of groups stays constant as the process count grows
- MG: the number of groups increases with the number of processes
Observation: an identical communication sequence implies similar computation behavior.
(Figure: the number of process groups that have similar computation behavior)
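The observation above suggests a cheap pre-grouping step, sketched here under the assumption that each process's communication sequence is available as a list of operation names: processes with identical communication sequences become candidates for the same group, so only one representative per group needs to be replayed.

```python
def group_by_comm_sequence(comm_seqs):
    """Group process ranks whose communication sequences are identical.
    Per the observation, such processes tend to have similar computation
    behavior, so each group needs only one representative replay."""
    groups = {}
    for rank, seq in enumerate(comm_seqs):
        groups.setdefault(tuple(seq), []).append(rank)
    return list(groups.values())

# Made-up sequences: even ranks send first, odd ranks receive first.
seqs = [["Init", "Barrier", "Send", "Recv", "Final"],
        ["Init", "Barrier", "Recv", "Send", "Final"],
        ["Init", "Barrier", "Send", "Recv", "Final"],
        ["Init", "Barrier", "Recv", "Send", "Final"]]
groups = group_by_comm_sequence(seqs)
# Two groups: the even ranks and the odd ranks.
```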
Methodology
- The sequential computation time of each R.P. is acquired using a single node of the target platform
- The predicted time from PHANTOM is compared with the real execution time on the target platform
- We also compare PHANTOM with a regression-based method (Barnes et al. [1])
[1] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz. A regression-based approach to scalability prediction. In ICS'08, pages 368-377, 2008.
Prediction Results
The prediction error of PHANTOM is less than 8% on average for all the programs.
(Table: prediction errors (%) with PHANTOM (P.T.) vs. the regression-based approach (R.B.) on the Dawning platform)
Predicted Results for Sweep3D
- Average prediction error: 2.67% on Dawning, 1.30% on DeepComp-F, 2.34% on DeepComp-B
- Maximum error: 6.54%, on the Dawning platform
- For 1024 processes: PHANTOM's error is 4.53%, versus 23.67% for the regression-based method
- Among the three platforms, DeepComp-B delivers the best performance for Sweep3D
(Figure: performance prediction for Sweep3D on the Dawning, DeepComp-F, and DeepComp-B platforms)
Limitations and Discussions
- Problem size: limited by the scale of the host platforms (possible remedies: grid systems, SSD devices)
- Node of the target platform: could be replaced by a hardware simulator of a single node
- I/O operations: future work
Conclusion
- Use deterministic replay to acquire accurate sequential computation time
- Propose representative replay, based on computation similarity, to reduce measurement time
Thank you!
Backup
Concurrent Replay
Application performance can be affected significantly by resource contention:
- Cache contention
- Bus contention
Concurrent replay: replay multiple processes simultaneously on one node, reproducing this contention.
Accuracy of Sequential Computation Performance
(Figure: real sequential computation performance vs. that acquired with representative replay, for process 0 of Sweep3D-S with 256 processes on the Dawning platform)
Breakdown of Predicted Time
- comp: computation time
- comm: communication time
- syn: synchronization cost
Synchronization cost accounts for a large proportion of the execution time of most programs.
(Figure: breakdown of the predicted time of process 0 for each program with 256 processes on the Dawning platform)
Message-Log Size
- Only the message logs of representative processes are recorded
- The resulting message-log size is reasonable
(Table: message-log size, in gigabytes, except EP in kilobytes)
Replay Overhead
- Incoming messages are read from the log database
- Little synchronization overhead is introduced
(Figure: elapsed time of replay-based execution compared with normal execution for each program with 256 processes)
Performance of SIM-MPI Simulator
- Trace-driven simulation with high efficiency
- Available at www.hpctest.org.cn/resources/sim-mpi.tgz
- Simulation platform: 2-way quad-core Xeon E5504 (2.0 GHz), 12 GB memory
(Table: performance of the SIM-MPI simulator, in seconds)
Limitations
- Problem size: grid systems, SSD devices; cross-platform prediction
- Node of the target platform: a single-node simulator
- I/O operations: future work
- Non-deterministic applications: for well-behaved applications, non-deterministic behavior does not cause a significant impact on performance