Profile Guided MPI Protocol Selection for Point-to-Point Communication Calls

Aniruddha Marathe, David K. Lowenthal
Department of Computer Science, The University of Arizona, Tucson, AZ
{amarathe,dkl}@cs.arizona.edu

Zheng Gu, Matthew Small, Xin Yuan
Department of Computer Science, Florida State University, Tallahassee, FL
{zgu,small,xyuan}@cs.fsu.edu
2
Motivation - Need for an on-line protocol selection scheme: Optimal protocol for a communication routine: application and architecture specific Existing approaches o Off-line: Protocol selection at program compilation time o Static: One protocol per application Difficult to adapt to program’s runtime characteristics 5/9/112
Contributions

- On-line protocol selection algorithm
- Protocol cost model: employed by the on-line protocol selection algorithm to estimate the total execution time per protocol
- Sender-initiated Post-copy protocol: a novel protocol that complements the existing set of protocols
On-line Protocol Selection Algorithm

- Dynamically selects the optimal communication protocol for each communication phase.
- The protocol selection algorithm is split into two phases:
  - Phase 1: execution time estimation per protocol
  - Phase 2 (optimization): buffer usage profiling
- The system works with four protocols.
On-line Protocol Selection Algorithm: Phase 1 (Estimating Execution Times)

- Example: execution of phase 1 on a sample application with n tasks (Rank 1, Rank 2, Rank 3, ..., Rank n) and m MPI calls per task.
- Between the start and the end of the phase, each task times its MPI calls (MPI Call 1, MPI Call 2, ..., MPI Call m) and accumulates an estimated execution time t per candidate protocol.
- Protocol selection at the end of the phase: Optimal Protocol = min(t) over the candidate protocols.
- The execution time of the algorithm is linear in the number of MPI calls per phase.
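A minimal sketch of this phase-1 bookkeeping, assuming a hypothetical runtime with NUM_PROTOCOLS candidates and a per-call estimator estimate_call_time() derived from the cost model described later (the names and structure are illustrative, not the authors' implementation):

#include <stddef.h>

#define NUM_PROTOCOLS 4   /* Pre-copy, Post-copy, Sender- and Receiver-initiated Rendezvous */

/* Estimated execution time accumulated per protocol for the current phase. */
static double phase_time[NUM_PROTOCOLS];

/* Hypothetical per-call estimator built from the cost-model terms. */
extern double estimate_call_time(int protocol, size_t msg_size, int arrival_pattern);

void phase_start(void) {
    for (int p = 0; p < NUM_PROTOCOLS; p++)
        phase_time[p] = 0.0;
}

/* Called once for each MPI call observed during phase 1. */
void record_call(size_t msg_size, int arrival_pattern) {
    for (int p = 0; p < NUM_PROTOCOLS; p++)
        phase_time[p] += estimate_call_time(p, msg_size, arrival_pattern);
}

/* End of phase: the optimal protocol is the one with the minimum estimate. */
int phase_end(void) {
    int best = 0;
    for (int p = 1; p < NUM_PROTOCOLS; p++)
        if (phase_time[p] < phase_time[best])
            best = p;
    return best;   /* used for subsequent executions of this phase */
}

Because record_call() does constant work per protocol, the total bookkeeping stays linear in the number of MPI calls in the phase, matching the claim above.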
Point-to-Point Protocols

- Our system uses the following protocols:
  - Existing protocols (Yuan et al. 2009): Pre-copy, Sender-initiated Rendezvous, Receiver-initiated Rendezvous
  - New protocol: Post-copy
- Protocols are categorized based on:
  - Message size
  - Arrival patterns of the communicating tasks
Pre-copy Protocol

Timeline between Sender and Receiver (each step is an MPI call or a data operation):
- Sender: MPI_Send copies the message into a local buffer, then issues an RDMA Write of the request to the receiver and proceeds to its next call (MPI_Barrier in the diagram).
- Receiver: on MPI_Recv, it performs an RDMA Read of the data from the sender, then an RDMA Write of an ACK back to the sender.
- The sender remains idle until the receiver's ACK arrives.
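A rough sketch of the sender side of this flow; all helper functions are hypothetical stand-ins for the runtime's RDMA layer, not a real API:

#include <stddef.h>

/* Hypothetical stand-ins for the runtime's RDMA layer. */
extern void *copy_to_local_buffer(const void *src, size_t len);
extern void  rdma_write_request(int dest, void *staging, size_t len);
extern void  wait_for_ack(int dest);
extern void  release_local_buffer(void *staging);

typedef struct { void *staging; int dest; } precopy_req_t;

/* Sender side of Pre-copy: stage a local copy and post the request. */
precopy_req_t precopy_send(const void *user_buf, size_t len, int dest) {
    precopy_req_t r = { copy_to_local_buffer(user_buf, len), dest };
    rdma_write_request(dest, r.staging, len);   /* RDMA Write of the request */
    return r;                                   /* MPI_Send can return here  */
}

/* The receiver, inside MPI_Recv, RDMA-Reads the data and RDMA-Writes an ACK.
   The sender idles here until that ACK arrives and the staging buffer is free. */
void precopy_complete(precopy_req_t *r) {
    wait_for_ack(r->dest);
    release_local_buffer(r->staging);
}

The staging copy is what lets MPI_Send return before the receiver arrives, at the cost of the later idle wait for the ACK.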
Post-copy Protocol

Timeline between Sender and Receiver:
- Sender: MPI_Send issues a single RDMA Write that carries the request together with the data, then proceeds to its next call (MPI_Barrier in the diagram).
- Receiver: on MPI_Recv, it copies the delivered data out of its local buffer and returns an ACK to the sender.
- The sender spends significantly less idle time compared to Pre-copy.
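A matching sender/receiver sketch for Post-copy, again with hypothetical helper functions; the key difference from Pre-copy is that the request and data travel in one RDMA Write and the sender performs no staging copy:

#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the runtime's RDMA layer. */
extern void  register_buffer(const void *buf, size_t len);                       /* t_memreg */
extern void  rdma_write_request_and_data(int dest, const void *buf, size_t len);
extern void *posted_data_for(int src, size_t *len);                              /* landing buffer */
extern void  send_ack(int src);

/* Sender side: register the user buffer and push request + data in one write. */
void postcopy_send(const void *user_buf, size_t len, int dest) {
    register_buffer(user_buf, len);
    rdma_write_request_and_data(dest, user_buf, len);
    /* No staging copy and no blocking wait here, which is why the sender
       idles far less than in Pre-copy. */
}

/* Receiver side: copy out of the landing buffer, then ACK the sender. */
void postcopy_recv(void *user_buf, int src) {
    size_t len;
    void *landing = posted_data_for(src, &len);   /* data already delivered by RDMA */
    memcpy(user_buf, landing, len);               /* local buffer copy (t_memcopy)  */
    send_ack(src);
}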
Protocol Cost Model

- Supports five basic MPI operations: MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Wait
- Important terms:
  - t_memreg: buffer registration time
  - t_memcopy: buffer memory copy time
  - t_rdma_read: buffer RDMA Read time
  - t_rdma_write: buffer RDMA Write time
  - t_func_delay: constant book-keeping time
Post-copy Protocol Cost Model

Sender Early (the sender's MPI_Isend arrives before the receiver's MPI_Irecv):
- Sender total time = t_memreg + t_rdma_write + 2 x t_func_delay
- Receiver total time = t_memcopy + 2 x t_func_delay

Receiver Early (the receiver posts MPI_Irecv first and incurs t_wait_delay waiting for the data):
- Sender total time = t_memreg + t_rdma_write + 2 x t_func_delay
- Receiver total time = t_wait_delay + t_memcopy + 2 x t_func_delay
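These per-call estimates follow directly from the cost-model terms; a minimal sketch (the struct and function names are illustrative, the formulas are the Post-copy ones shown above):

/* Cost-model terms from the slide above (all times in seconds). */
typedef struct {
    double t_memreg;      /* buffer registration time       */
    double t_memcopy;     /* buffer memory copy time        */
    double t_rdma_read;   /* buffer RDMA Read time          */
    double t_rdma_write;  /* buffer RDMA Write time         */
    double t_func_delay;  /* constant book-keeping time     */
    double t_wait_delay;  /* time spent waiting in MPI_Wait */
} cost_terms_t;

/* Post-copy, sender side: identical in both arrival cases. */
double postcopy_sender_time(const cost_terms_t *c) {
    return c->t_memreg + c->t_rdma_write + 2.0 * c->t_func_delay;
}

/* Post-copy, receiver side: t_wait_delay only appears when the receiver is early. */
double postcopy_receiver_time(const cost_terms_t *c, int receiver_early) {
    double t = c->t_memcopy + 2.0 * c->t_func_delay;
    if (receiver_early)
        t += c->t_wait_delay;
    return t;
}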
Optimization: Buffer Usage Profiling

- Example code snippet:

  ...
  MPI_Send(buff1, ...);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Recv(buff1, ...);
  ...
Phase 2 (Buffer Usage Profiling)

- During phase 2, each task records the sequence of MPI calls and the buffer each call uses: MPI_Send(Buff 1), MPI_Recv(Buff 2), MPI_Send(Buff 3), MPI_Recv(Buff 1).
- The result is a buffer usage profile that maps each buffer (Buff 1, Buff 2, Buff 3) to the MPI calls that access it.
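One plausible shape for such a profile, assuming a hypothetical per-phase table keyed by buffer address (not the authors' actual data structure):

#include <stddef.h>

#define MAX_CALLS_PER_BUFFER 64

/* One entry per distinct buffer seen in the phase; zero-initialize before use. */
typedef struct {
    void *buffer;                          /* user buffer address (e.g., buff1) */
    int   call_ids[MAX_CALLS_PER_BUFFER];  /* indices of MPI calls that use it  */
    int   num_calls;
} buffer_profile_t;

/* Record that MPI call 'call_id' in this phase used 'buffer'. */
void profile_record(buffer_profile_t *entry, int call_id, void *buffer) {
    entry->buffer = buffer;
    if (entry->num_calls < MAX_CALLS_PER_BUFFER)
        entry->call_ids[entry->num_calls++] = call_id;
}

/* Next call after 'call_id' that touches the same buffer, or -1 if none;
   this is where an MPI_Wait for a converted MPI_Isend would have to go. */
int profile_next_use(const buffer_profile_t *entry, int call_id) {
    for (int i = 0; i < entry->num_calls; i++)
        if (entry->call_ids[i] > call_id)
            return entry->call_ids[i];
    return -1;
}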
Optimization: Conversion of Synchronous Calls to Asynchronous Calls

Using the buffer usage profile, the original code:

  ...
  MPI_Send(buff1, ...);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Recv(buff1, ...);
  ...

is transformed into:

  ...
  MPI_Isend(buff1, ..., req1);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Wait(req1, ...);
  MPI_Recv(buff1, ...);
  ...

The MPI_Wait for req1 is placed just before the next call that reuses buff1, so the send can overlap with the intervening communication.
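For concreteness, a small self-contained MPI program using the transformed pattern (buffer names, sizes, and tags are made up for illustration; run with an even number of ranks):

#include <mpi.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buff1 = malloc(N * sizeof(double));
    double *buff2 = malloc(N * sizeof(double));
    double *buff3 = malloc(N * sizeof(double));
    int peer = rank ^ 1;                 /* pair ranks 0-1, 2-3, ... */
    MPI_Request req1;

    if (rank % 2 == 0) {
        /* Transformed side: the Isend of buff1 overlaps the calls below. */
        MPI_Isend(buff1, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req1);
        MPI_Recv(buff2, N, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buff3, N, MPI_DOUBLE, peer, 2, MPI_COMM_WORLD);
        MPI_Wait(&req1, MPI_STATUS_IGNORE);   /* placed just before buff1 is reused */
        MPI_Recv(buff1, N, MPI_DOUBLE, peer, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        /* Peer side, matching the tags above. */
        MPI_Recv(buff1, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buff2, N, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD);
        MPI_Recv(buff3, N, MPI_DOUBLE, peer, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buff1, N, MPI_DOUBLE, peer, 3, MPI_COMM_WORLD);
    }

    free(buff1); free(buff2); free(buff3);
    MPI_Finalize();
    return 0;
}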
Performance Evaluation

- Test cluster:
  - 16 nodes, each with 8-core 2.33 GHz Intel Xeon (64-bit) processors and 8 GB of system memory
  - InfiniBand interconnect
- Software: MVAPICH2
- Benchmarks: Sparse Matrix, CG, Sweep3D, microbenchmarks
Performance Evaluation (Results)

- Each application has a single communication phase.
- The system chose the optimal protocol for each phase dynamically.
- Real vs. modeled execution times for the Sparse Matrix application:
  - Modeling accuracy: 95% to 99%
  - Modeling overhead: less than 1% of total execution time
Summary

- Our system for on-line protocol selection was successfully tested on real applications and microbenchmarks.
- The protocol cost model achieves high accuracy with negligible overhead.
- The sender-initiated Post-copy protocol was successfully implemented.
Questions?

Thank You!