Jidong Zhai, Wenguang Chen, Weimin Zheng


1 PHANTOM: Predicting Performance of Parallel Applications on Large-scale Parallel Machines Using a Single Node
Jidong Zhai, Wenguang Chen, Weimin Zheng
Tsinghua University
November 20, 2018

2 Motivation
Large-scale parallel computers:
- Cost millions of dollars
- Take many years to design and implement
For designers of HPC systems:
- What is the performance of application X on a parallel machine Y with 10,000 nodes connected by network Z?
- Accurate prediction enables designers to evaluate various design alternatives

3 Motivation
Performance prediction of parallel applications is important in the HPC area:
- Application optimization
- System tuning
- System procurement

4 Performance of Parallel Applications
Accurate performance prediction is difficult. The execution time of a parallel application is determined by:
- Sequential computation time
- Communication time
- Their convolution: the overlap between computation and communication, and synchronization overhead
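As a rough sketch (a simplification of the convolution described above; real overlap and synchronization accounting is more involved), the per-process time can be written as:

$$
T_p \approx T_p^{\mathrm{comp}} + T_p^{\mathrm{comm}} + T_p^{\mathrm{sync}} - T_p^{\mathrm{overlap}}, \qquad T = \max_p T_p
$$

where $T_p^{\mathrm{comp}}$ is the sequential computation time this talk focuses on, and the overall time $T$ is bounded by the slowest process.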

5 Focus of Our Work
In this work, we focus on how to acquire accurate sequential computation time.
Existing network simulators:
- BigNetSim (UIUC)
- DIMEMAS (UPC)
- SIM-MPI (Tsinghua)
Current bottleneck: how to acquire sequential computation time accurately and efficiently without the target parallel platform.
Note: we focus on applications written with MPI in this work.

6 Previous Work
Previous approaches for acquiring sequential computation time:
Model-based method:
- Builds a model from an application signature (memory access pattern, number of INT and FP instructions, etc.) and target platform parameters
- Limitation: difficult to build an accurate model for complex program behavior, e.g., cache and bus contention on multi-core platforms
Measurement-based method:
- Measures sequential computation time for weak-scaling applications
- Limitation: fails to deal with strong-scaling applications
Regression-based method:
- Extrapolates sequential computation time from smaller runs
- Limitation: not applicable for some real applications due to non-linear behavior

7 Our Contributions
- Employ deterministic replay to measure sequential computation time for strong-scaling applications without the full-scale target platform
- Propose representative replay to reduce message-log size and measurement time

8 Outline
- Prediction Framework
- Basic Idea
- Improvement
- Evaluation

9 Prediction Framework
We use a trace-driven simulation approach for performance prediction:
1. Collect computation and communication traces
   - Separate computation and communication in parallel applications: CPU_Burst, msg_size, msg_type, source and dest, etc.
   - FACT technique (SC-2009): collect MPI traces for large-scale applications on small-scale systems
2. Acquire sequential computation time for each process
   - With deterministic replay, using a single node of the target platform
3. Use a trace-driven simulator to convolve communication and computation performance
   - SIM-MPI simulator
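The convolution step can be sketched as a per-process clock that replays trace records. This is a toy sketch, not SIM-MPI's actual implementation: the trace record layout, field names, and the simple latency + size/bandwidth cost model are all illustrative assumptions.

```c
/* Hypothetical trace record, loosely modeled on the trace fields above
 * (CPU_Burst, msg_size, source/dest); not the actual SIM-MPI format. */
typedef enum { CPU_BURST, SEND, RECV } op_t;
typedef struct { op_t op; double burst; long bytes; int peer; } rec_t;

#define LATENCY   1.6e-6   /* seconds, illustrative network parameter */
#define BANDWIDTH 1.5e9    /* bytes/second, illustrative */

/* Replay one process's trace, convolving computation and communication.
 * recv_ready[i] holds the time the i-th incoming message is available
 * (in a full simulator this would come from the sender's clock). */
double simulate(const rec_t *trace, int n, const double *recv_ready) {
    double clock = 0.0;
    int r = 0;
    for (int i = 0; i < n; i++) {
        switch (trace[i].op) {
        case CPU_BURST:
            clock += trace[i].burst;   /* measured sequential computation */
            break;
        case SEND:
            clock += LATENCY + trace[i].bytes / BANDWIDTH;
            break;
        case RECV:
            if (recv_ready[r] > clock) clock = recv_ready[r]; /* wait */
            r++;
            break;
        }
    }
    return clock;
}
```

A full simulator would also model the network topology and match sends to receives across processes; here the receive-ready times stand in for that coupling.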

10 Prediction Framework
An example Fortran MPI program:

      real A(MAX,MAX), B(MAX,MAX), C(MAX,MAX), buf(MAX,MAX)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ...)
      DO iter=1, N
        if (myid .gt. 0) then
          call MPI_RECV(buf(1,1), num, MPI_REAL, myid-1, ...)
        endif
        DO i=1, MAX
          DO j=1, MAX
            A(i,j) = B(i,j)*C(i,j) + buf(i,j)
          END DO
        END DO
        if (myid .lt. numprocs-1) then
          call MPI_SEND(A(1,1), num, MPI_REAL, myid+1, ...)
        endif
      END DO
      call MPI_FINALIZE(rc)

MPI traces (process id=0):
  MPI_Init
  MPI_Rank(id=0)
  CPU_Burst(id,0)
  MPI_Send(id+1,size)
  CPU_Burst(id,1)
  ...
  CPU_Burst(id,2)

Sequential computation time (id=0):
  CPU_Burst(id,0)=2sec, CPU_Burst(id,1)=3sec, CPU_Burst(id,2)=4sec, CPU_Burst(id,3)=2sec

Network parameters:
  Latency = 1.6 usec, Bandwidth = 1.5 GB/sec, Topology = 2D Mesh

These feed the SIM-MPI (trace-driven) simulator, which outputs the predicted performance of the parallel application.

11 Two Basic Definitions 1 MPI_Init() 2 c0 // Means computation 3 MPI_Barrier 4 for(i=0; i<N; i++){ 5 if(myid%2 == 1){ 6 MPI_Recv(..., myid-1, ...) 7 c1(i) 8 MPI_Send(..., myid-1, ...) 9 } 10 c2(i) 11 if(myid%2 == 0){ 12 MPI_Send(..., myid+1, ...) 13 c3(i) 14 MPI_Recv(..., myid+1, ...) 15 } 16 } 17 MPI_Final() DEF1: Communication Sequence (C.S.): Record message type of each comm. operation in temporal sequence C(P0) = {Init, Barrier, [Send, Recv], Fina} C(P1) = {Init, Barrier, [Recv, Send], Fina} DEF2: Sequential Computation Vector (S.C.V.): Record sequential computation performance for each process C0 = [c0, c2(0), c3(0), c2(1), c3(1), ..., c3(N − 1)] C1 = [c0, c1(0), c2(0), c1(1), c2(1), ..., c2(N − 1)] Execution Model Each element of the vector is the elapsed time of corresponding computation unit

12 Deterministic Replay
Deterministic replay is a technique for debugging parallel applications. Replay tools include two phases:
- Record phase: log irreproducible information (return values, incoming messages, etc.)
- Replay phase: replay the faulty process to any state of the recorded execution
Data replay: execute any single process, rather than the entire parallel application, during the replay phase.

13 Acquire Sequential Computation Time
1. Build a message-log database
   - On a host platform, as in data replay
   - Store the message logs into a database
2. Replay each process separately
   - Using a single node of the target platform
   - Measure the sequential computation time

14 Acquire Sequential Computation Time
Record phase: collect the message log (a PMPI wrapper around each receive):

  int MPI_Recv(buf, count, type, src, tag, comm, status) {
      int retVal = PMPI_Recv(buf, count, type, src, tag, comm, status);
      Write retVal to log;
      Write buf to log;
      Write status to log;
      return retVal;
  }

Replay phase: read the log back and record time-stamps:

  int MPI_Recv(buf, count, type, src, tag, comm, status) {
      Record time-stamp(Bk);
      Read log to retVal;
      Read log to buf;
      Read log to status;
      Record time-stamp(Ek);
      return retVal;
  }
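From the time-stamps Bk (entry) and Ek (exit) recorded in the replay wrapper, the k-th sequential computation unit is the gap between the exit of MPI call k-1 and the entry of MPI call k. A minimal sketch of that bookkeeping, assuming timestamps are already collected into arrays (PHANTOM's actual accounting may differ in detail):

```c
/* Derive computation-unit lengths from replay-phase timestamps:
 * B[k] = entry time of the k-th MPI call, E[k] = its exit time.
 * The k-th computation unit spans E[k-1] .. B[k], so there are
 * ncalls-1 burst entries. */
void compute_bursts(const double *B, const double *E, int ncalls,
                    double *burst /* ncalls-1 entries */) {
    for (int k = 1; k < ncalls; k++)
        burst[k - 1] = B[k] - E[k - 1];
}
```

Because the replay feeds receives from the log instead of the network, these gaps measure pure computation on the target node, without communication delay.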

15 Challenges
Two challenges when processing large-scale applications:
Large time overhead:
- Assume n processes and a per-process replay time of T
- Replaying one process at a time takes nT to acquire all the sequential computation time
- This is impractical for an application with thousands of processes, each executing for several hours
Huge log size:
- Data replay requires recording all incoming messages for each process
- The log size grows rapidly with the number of processes

16 Computation Similarity
Observation: the computation behavior of processes in MPI-based parallel applications can be clustered into a few groups, and the processes in each group have similar computation behavior.

17 Computation Similarity
An example: NPB MG (CLASS=C, NPROCS=16)
- Group 1: P0-P3, P8-P11
- Group 2: P4-P7, P12-P15
(Figures: Process 10 vs. Process 11; Process 11 vs. Process 13)

18 Representative Replay
Our approach:
- Partition processes into a number of groups, so that the computation behavior of processes in the same group is as similar as possible
- Choose a representative process (R.P.) from each group to record and replay
- Use the sequential computation time of each R.P. for the other processes in its group

19 Select Representative Processes
Distance of S.C.V Manhattan distance Clustering technique K-means clustering Require a priori number of class Hierarchical clustering Complete linkage Dendrogram for NPB MG program (CLASS=C, NPROCS=16)
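The distances above can be sketched in C: the Manhattan distance between two sequential computation vectors, and the complete-linkage distance between two candidate groups (the largest pairwise vector distance). The flat vector layout and index-list group representation are illustrative choices, not PHANTOM's implementation.

```c
#include <math.h>

/* Manhattan distance between two sequential computation vectors. */
double manhattan(const double *x, const double *y, int n) {
    double d = 0.0;
    for (int i = 0; i < n; i++) d += fabs(x[i] - y[i]);
    return d;
}

/* Complete-linkage distance between two groups of processes: the
 * largest pairwise distance between members.  vecs holds one
 * dim-length S.C.V. per process, stored contiguously; ga/gb are
 * lists of process indices in each group. */
double complete_linkage(const double *vecs, int dim,
                        const int *ga, int na, const int *gb, int nb) {
    double worst = 0.0;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++) {
            double d = manhattan(vecs + ga[i] * dim,
                                 vecs + gb[j] * dim, dim);
            if (d > worst) worst = d;
        }
    return worst;
}
```

Agglomerative clustering would repeatedly merge the two groups with the smallest complete-linkage distance, yielding a dendrogram like the one on this slide.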

20 Implementation
PHANTOM:
- A performance prediction framework based on representative replay (cf. PERC (SC-2002), macro-level simulation (SC-2008))
- An automatic tool-chain
Three main modules:
- CompAna Module
- CommAna Module
- NetSim Module
(Figure: overview of PHANTOM)

21 Evaluation
Platforms: (table of evaluation platforms)
Benchmarks:
- BT, CG, EP, LU, MG and SP from NPB; strong-scaling (CLASS=C)
- Sweep3D
  - strong-scaling: 512*512*200
  - weak-scaling: *100*100

22 Grouping Result
Grouping results (number of process groups with similar computation behavior):
- BT, CG, EP and SP: all processes have almost similar computation behavior
- LU and Sweep3D: the number of groups stays constant
- MG: the number of groups increases with the number of processes
Observation: identical communication sequence -> similar computation behavior

23 Methodology
- Sequential computation time of each R.P. is acquired using a single node of the target platform
- Predicted time: PHANTOM; real time: measured on the target platform
- Comparison: the predicted time from PHANTOM vs. a regression-based method (Barnes et al. [1])

[1] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz. A regression-based approach to scalability prediction. In ICS'08, pages 368-377, 2008.

24 Prediction Results
Prediction errors (%) with PHANTOM (P.T.) vs. the regression-based approach (R.B.) on the Dawning platform (table).
The prediction error with PHANTOM is less than 8% on average for all the programs.

25 Predicted Result for Sweep3D
Performance prediction for Sweep3D on the Dawning, DeepComp-F and DeepComp-B platforms (figure).
Average prediction error:
- Dawning: 2.67%
- DeepComp-F: 1.30%
- DeepComp-B: 2.34%
Maximum error: 6.54%, on the Dawning platform.
For 1024 processes:
- Error of PHANTOM: 4.53%
- Error of regression: 23.67%
The DeepComp-B platform shows the best performance for Sweep3D among the three platforms.

26 Limitations and Discussions
- Problem size: limited by the scale of host platforms (possible remedies: grid systems, SSD devices)
- Node of target platform: could be replaced by a hardware simulator of a single node
- I/O operations: future work

27 Conclusion
- Use deterministic replay to acquire accurate sequential computation time
- Propose representative replay, based on computation similarity, to reduce measurement time

28 Thank you!

29 Backup

30 Concurrent Replay
Application performance can be affected significantly by resource contention:
- Cache contention
- Bus contention
Concurrent replay: replay multiple processes simultaneously on one node, so that this contention is reproduced during measurement.

31 Accuracy of Sequential Computation Performance
(Figure: the real sequential computation performance vs. that acquired with representative replay, for process 0 of Sweep3D-S with 256 processes on the Dawning platform.)

32 Breakdown of Predicted Time
- comp: computation time
- comm: communication time
- syn: synchronization cost
Synchronization cost accounts for a large proportion of the execution time for most of the programs.
(Figure: breakdown of the predicted time of process 0 for each program with 256 processes on the Dawning platform.)

33 Message-Log Size
- Only message logs for representative processes are recorded
- The resulting message-log size is reasonable
(Table: message-log size, in gigabytes, except EP in kilobytes.)

34 Replay Overhead
- Incoming messages are read from the log database, so little synchronization overhead is introduced
(Figure: elapsed time of replay-based execution compared with normal execution for each program with 256 processes.)

35 Performance of SIM-MPI Simulator
Trace-driven simulation offers high efficiency.
Simulation platform:
- 2-way quad-core Xeon E5504 processor (2.0 GHz)
- 12GB memory
(Table: performance of the SIM-MPI simulator, in seconds.)

36 Limitations
- Problem size: limited by the host platform (possible remedies: grid systems, SSD devices, cross-platform prediction)
- Node of target platform: a single-node simulator could be used instead
- I/O operations: future work
- Non-deterministic applications: for well-behaved applications, non-deterministic behavior does not cause a significant impact on performance

