Profile Guided MPI Protocol Selection for Point-to-Point Communication Calls

Aniruddha Marathe, David K. Lowenthal
Department of Computer Science, The University of Arizona, Tucson, AZ
{amarathe,dkl}@cs.arizona.edu

Zheng Gu, Matthew Small, Xin Yuan
Department of Computer Science, Florida State University, Tallahassee, FL
{zgu,small,xyuan}@cs.fsu.edu
2
Motivation - Need for an on-line protocol selection scheme: Optimal protocol for a communication routine: application and architecture specific Existing approaches o Off-line: Protocol selection at program compilation time o Static: One protocol per application Difficult to adapt to program’s runtime characteristics 5/9/112
Contributions

- On-line protocol selection algorithm
- Protocol cost model: employed by the on-line protocol selection algorithm to estimate the total execution time per protocol
- Sender-initiated Post-copy protocol: a novel protocol that complements the existing set of protocols
On-line Protocol Selection Algorithm

- Dynamically selects the optimal communication protocol for each communication phase.
- The protocol selection algorithm is split into two phases:
  - Phase 1: execution time estimation per protocol
  - Phase 2 (optimization): buffer usage profiling
- The system works with four protocols.
On-line Protocol Selection Algorithm: Phase 1 (Estimating Execution Times)

- Example: execution of phase 1 on a sample application with n tasks (Rank 1, Rank 2, Rank 3, ..., Rank n) and m MPI calls per task.
- Between the start and the end of the phase, each task times its MPI calls (MPI Call 1, MPI Call 2, ..., MPI Call m) and accumulates an estimated execution time t per candidate protocol.
- Protocol selection at the end of the phase: Optimal Protocol = min(t) over the candidate protocols.
- The execution time of the algorithm is linear in the number of MPI calls per phase.
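A minimal sketch of this phase-1 bookkeeping, assuming a hypothetical runtime with NUM_PROTOCOLS candidates and a per-call estimator estimate_call_time() derived from the cost model described later (the names and structure are illustrative, not the authors' implementation):

#include <stddef.h>

#define NUM_PROTOCOLS 4   /* Pre-copy, Post-copy, Sender- and Receiver-initiated Rendezvous */

/* Estimated execution time accumulated per protocol for the current phase. */
static double phase_time[NUM_PROTOCOLS];

/* Hypothetical per-call estimator built from the cost-model terms. */
extern double estimate_call_time(int protocol, size_t msg_size, int arrival_pattern);

void phase_start(void) {
    for (int p = 0; p < NUM_PROTOCOLS; p++)
        phase_time[p] = 0.0;
}

/* Called once for each MPI call observed during phase 1. */
void record_call(size_t msg_size, int arrival_pattern) {
    for (int p = 0; p < NUM_PROTOCOLS; p++)
        phase_time[p] += estimate_call_time(p, msg_size, arrival_pattern);
}

/* End of phase: the optimal protocol is the one with the minimum estimate. */
int phase_end(void) {
    int best = 0;
    for (int p = 1; p < NUM_PROTOCOLS; p++)
        if (phase_time[p] < phase_time[best])
            best = p;
    return best;   /* used for subsequent executions of this phase */
}

Because record_call() does constant work per protocol, the total bookkeeping stays linear in the number of MPI calls in the phase, matching the claim above.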
Point-to-Point Protocols

- Our system uses the following protocols:
  - Existing protocols (Yuan et al. 2009): Pre-copy, Sender-initiated Rendezvous, Receiver-initiated Rendezvous
  - New protocol: Post-copy
- Protocols are categorized based on:
  - Message size
  - Arrival patterns of the communicating tasks
Pre-copy Protocol

Timeline between Sender and Receiver (each step is an MPI call or a data operation):
- Sender: MPI_Send copies the message into a local buffer, then issues an RDMA Write of the request to the receiver and proceeds to its next call (MPI_Barrier in the diagram).
- Receiver: on MPI_Recv, it performs an RDMA Read of the data from the sender, then an RDMA Write of an ACK back to the sender.
- The sender remains idle until the receiver's ACK arrives.
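A rough sketch of the sender side of this flow; all helper functions are hypothetical stand-ins for the runtime's RDMA layer, not a real API:

#include <stddef.h>

/* Hypothetical stand-ins for the runtime's RDMA layer. */
extern void *copy_to_local_buffer(const void *src, size_t len);
extern void  rdma_write_request(int dest, void *staging, size_t len);
extern void  wait_for_ack(int dest);
extern void  release_local_buffer(void *staging);

typedef struct { void *staging; int dest; } precopy_req_t;

/* Sender side of Pre-copy: stage a local copy and post the request. */
precopy_req_t precopy_send(const void *user_buf, size_t len, int dest) {
    precopy_req_t r = { copy_to_local_buffer(user_buf, len), dest };
    rdma_write_request(dest, r.staging, len);   /* RDMA Write of the request */
    return r;                                   /* MPI_Send can return here  */
}

/* The receiver, inside MPI_Recv, RDMA-Reads the data and RDMA-Writes an ACK.
   The sender idles here until that ACK arrives and the staging buffer is free. */
void precopy_complete(precopy_req_t *r) {
    wait_for_ack(r->dest);
    release_local_buffer(r->staging);
}

The staging copy is what lets MPI_Send return before the receiver arrives, at the cost of the later idle wait for the ACK.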
Post-copy Protocol

Timeline between Sender and Receiver:
- Sender: MPI_Send issues a single RDMA Write that carries the request together with the data, then proceeds to its next call (MPI_Barrier in the diagram).
- Receiver: on MPI_Recv, it copies the delivered data out of its local buffer and returns an ACK to the sender.
- The sender spends significantly less idle time compared to Pre-copy.
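A matching sender/receiver sketch for Post-copy, again with hypothetical helper functions; the key difference from Pre-copy is that the request and data travel in one RDMA Write and the sender performs no staging copy:

#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the runtime's RDMA layer. */
extern void  register_buffer(const void *buf, size_t len);                       /* t_memreg */
extern void  rdma_write_request_and_data(int dest, const void *buf, size_t len);
extern void *posted_data_for(int src, size_t *len);                              /* landing buffer */
extern void  send_ack(int src);

/* Sender side: register the user buffer and push request + data in one write. */
void postcopy_send(const void *user_buf, size_t len, int dest) {
    register_buffer(user_buf, len);
    rdma_write_request_and_data(dest, user_buf, len);
    /* No staging copy and no blocking wait here, which is why the sender
       idles far less than in Pre-copy. */
}

/* Receiver side: copy out of the landing buffer, then ACK the sender. */
void postcopy_recv(void *user_buf, int src) {
    size_t len;
    void *landing = posted_data_for(src, &len);   /* data already delivered by RDMA */
    memcpy(user_buf, landing, len);               /* local buffer copy (t_memcopy)  */
    send_ack(src);
}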
Protocol Cost Model

- Supports five basic MPI operations: MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Wait
- Important terms:
  - t_memreg: buffer registration time
  - t_memcopy: buffer memory copy time
  - t_rdma_read: buffer RDMA Read time
  - t_rdma_write: buffer RDMA Write time
  - t_func_delay: constant book-keeping time
Post-copy Protocol Cost Model

Sender Early (the sender's MPI_Isend arrives before the receiver's MPI_Irecv):
- Sender total time = t_memreg + t_rdma_write + 2 x t_func_delay
- Receiver total time = t_memcopy + 2 x t_func_delay

Receiver Early (the receiver posts MPI_Irecv first and incurs t_wait_delay waiting for the data):
- Sender total time = t_memreg + t_rdma_write + 2 x t_func_delay
- Receiver total time = t_wait_delay + t_memcopy + 2 x t_func_delay
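These per-call estimates follow directly from the cost-model terms; a minimal sketch (the struct and function names are illustrative, the formulas are the Post-copy ones shown above):

/* Cost-model terms from the slide above (all times in seconds). */
typedef struct {
    double t_memreg;      /* buffer registration time       */
    double t_memcopy;     /* buffer memory copy time        */
    double t_rdma_read;   /* buffer RDMA Read time          */
    double t_rdma_write;  /* buffer RDMA Write time         */
    double t_func_delay;  /* constant book-keeping time     */
    double t_wait_delay;  /* time spent waiting in MPI_Wait */
} cost_terms_t;

/* Post-copy, sender side: identical in both arrival cases. */
double postcopy_sender_time(const cost_terms_t *c) {
    return c->t_memreg + c->t_rdma_write + 2.0 * c->t_func_delay;
}

/* Post-copy, receiver side: t_wait_delay only appears when the receiver is early. */
double postcopy_receiver_time(const cost_terms_t *c, int receiver_early) {
    double t = c->t_memcopy + 2.0 * c->t_func_delay;
    if (receiver_early)
        t += c->t_wait_delay;
    return t;
}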
Optimization: Buffer Usage Profiling

- Example code snippet:

  ...
  MPI_Send(buff1, ...);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Recv(buff1, ...);
  ...
Phase 2 (Buffer Usage Profiling)

- During phase 2, each task records the sequence of MPI calls and the buffer each call uses: MPI_Send(Buff 1), MPI_Recv(Buff 2), MPI_Send(Buff 3), MPI_Recv(Buff 1).
- The result is a buffer usage profile that maps each buffer (Buff 1, Buff 2, Buff 3) to the MPI calls that access it.
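One plausible shape for such a profile, assuming a hypothetical per-phase table keyed by buffer address (not the authors' actual data structure):

#include <stddef.h>

#define MAX_CALLS_PER_BUFFER 64

/* One entry per distinct buffer seen in the phase; zero-initialize before use. */
typedef struct {
    void *buffer;                          /* user buffer address (e.g., buff1) */
    int   call_ids[MAX_CALLS_PER_BUFFER];  /* indices of MPI calls that use it  */
    int   num_calls;
} buffer_profile_t;

/* Record that MPI call 'call_id' in this phase used 'buffer'. */
void profile_record(buffer_profile_t *entry, int call_id, void *buffer) {
    entry->buffer = buffer;
    if (entry->num_calls < MAX_CALLS_PER_BUFFER)
        entry->call_ids[entry->num_calls++] = call_id;
}

/* Next call after 'call_id' that touches the same buffer, or -1 if none;
   this is where an MPI_Wait for a converted MPI_Isend would have to go. */
int profile_next_use(const buffer_profile_t *entry, int call_id) {
    for (int i = 0; i < entry->num_calls; i++)
        if (entry->call_ids[i] > call_id)
            return entry->call_ids[i];
    return -1;
}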
Optimization: Conversion of Synchronous Calls to Asynchronous Calls

Using the buffer usage profile, the original code:

  ...
  MPI_Send(buff1, ...);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Recv(buff1, ...);
  ...

is transformed into:

  ...
  MPI_Isend(buff1, ..., req1);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Wait(req1, ...);
  MPI_Recv(buff1, ...);
  ...

The MPI_Wait for req1 is placed just before the next call that reuses buff1, so the send can overlap with the intervening communication.
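For concreteness, a small self-contained MPI program using the transformed pattern (buffer names, sizes, and tags are made up for illustration; run with an even number of ranks):

#include <mpi.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buff1 = malloc(N * sizeof(double));
    double *buff2 = malloc(N * sizeof(double));
    double *buff3 = malloc(N * sizeof(double));
    int peer = rank ^ 1;                 /* pair ranks 0-1, 2-3, ... */
    MPI_Request req1;

    if (rank % 2 == 0) {
        /* Transformed side: the Isend of buff1 overlaps the calls below. */
        MPI_Isend(buff1, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req1);
        MPI_Recv(buff2, N, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buff3, N, MPI_DOUBLE, peer, 2, MPI_COMM_WORLD);
        MPI_Wait(&req1, MPI_STATUS_IGNORE);   /* placed just before buff1 is reused */
        MPI_Recv(buff1, N, MPI_DOUBLE, peer, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        /* Peer side, matching the tags above. */
        MPI_Recv(buff1, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buff2, N, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD);
        MPI_Recv(buff3, N, MPI_DOUBLE, peer, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buff1, N, MPI_DOUBLE, peer, 3, MPI_COMM_WORLD);
    }

    free(buff1); free(buff2); free(buff3);
    MPI_Finalize();
    return 0;
}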
Performance Evaluation

- Test cluster:
  - 16 nodes, each with 8-core 2.33 GHz Intel Xeon (64-bit) processors and 8 GB of system memory
  - InfiniBand interconnect
- Software: MVAPICH2
- Benchmarks: Sparse Matrix, CG, Sweep3D, microbenchmarks
Performance Evaluation (Results)

- Each application has a single communication phase.
- The system chose the optimal protocol for each phase dynamically.
- Real vs. modeled execution times for the Sparse Matrix application:
  - Modeling accuracy: 95% to 99%
  - Modeling overhead: less than 1% of total execution time
Summary

- Our system for on-line protocol selection was successfully tested on real applications and microbenchmarks.
- The protocol cost model achieves high accuracy with negligible overhead.
- The sender-initiated Post-copy protocol was successfully implemented.
Questions?

Thank You!