1 GridMPI: Grid-Enabled MPI. Yutaka Ishikawa, University of Tokyo and AIST. http://www.gridmpi.org

2 Motivation
- MPI has been widely used to program parallel applications.
- Users want to run such applications over the Grid environment without any modification of the program.
- However, the performance of existing MPI implementations does not scale up on the Grid environment.
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resource sites A and B connected by a wide-area network.]

3 Motivation
- Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps and up to about 500 miles (less than 10 ms one-way latency).
  - Internet bandwidth in the Grid vs. interconnect bandwidth in a cluster: 10 Gbps vs. 1 Gbps today, 100 Gbps vs. 10 Gbps in the future.
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resource sites A and B connected by a wide-area network.]

4 Motivation
- Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps and up to about 500 miles (less than 10 ms one-way latency).
  - We have already demonstrated, using an emulated WAN environment, that the performance of the NAS Parallel Benchmark programs scales up if the one-way latency is smaller than 10 ms.
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resource sites A and B connected by a wide-area network.]
Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGRID2003, 2003.

5 Issues
- High-performance communication facilities for MPI on long and fat networks
  - TCP vs. MPI communication patterns: TCP is designed for streams, whereas MPI programs repeat computation and communication phases, producing burst traffic that changes with the communication pattern.
  - Network topology, latency, and bandwidth
- Interoperability: most MPI library implementations use their own network protocol.
- Fault tolerance and migration: to survive a site failure.
- Security
[Figure: traffic observed while repeating a 10 MB data transfer at two-second intervals, and a close-up of one 10 MB transfer; in the slow-start phase the window size is set to 1, and the silent periods result from the burst traffic.]
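
The bursty pattern described above arises because MPI programs alternate computation with bulk message exchange, leaving the TCP connection idle in between so that it can fall back into slow start. The sketch below is illustrative only (not taken from the GridMPI sources); the 10 MB message size and two-second pause simply mirror the measurement on this slide, and it assumes exactly two processes on a TCP-based MPI transport.

```c
/* Minimal illustrative sketch (not from the GridMPI sources) of the
 * bursty MPI traffic pattern: compute for a while, then push one large
 * message, leaving the TCP connection idle between bursts.
 * Assumes exactly two MPI processes on a TCP-based transport. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

#define MSG_BYTES (10 * 1024 * 1024)   /* 10 MB, as in the measurement */
#define PHASES    10

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;                /* the other of the two processes */

    for (int i = 0; i < PHASES; i++) {
        sleep(2);                       /* "computation" phase: link idle,
                                           TCP may drop back to slow start */
        if (rank == 0)                  /* communication phase: one burst */
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```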

6 Issues (continued)
- Same issue list as slide 5.
[Figure: traffic when one-to-one communication is started at time 0, immediately after an all-to-all exchange.]

7 Issues (continued)
- Same issue list as slide 5.
[Figure: network topology, latency, and bandwidth across the Internet.]

8 Issues (continued)
- Same issue list as slide 5, with interoperability restated: there are many MPI library implementations, and most of them use their own network protocol.
[Figure: four clusters connected over the Internet, each using a different vendor's MPI library (vendors A, B, C, and D).]

9 GridMPI Features
- MPI-2 implementation: YAMPII, developed at the University of Tokyo, is used as the core implementation.
- Intra-cluster communication by YAMPII (TCP/IP, SCore).
- Inter-cluster communication by IMPI (Interoperable MPI) over TCP, with protocol extensions for the Grid: MPI-2 and new collective protocols.
- Integration of vendor MPIs: IBM Regatta MPI, MPICH2, Solaris MPI, Fujitsu MPI, (NEC SX MPI).
- Incremental checkpoint.
- High-performance TCP/IP implementation.
- LAC (Latency-Aware Collectives): bcast/allreduce algorithms have been developed (to appear at the Cluster 2006 conference).
[Figure: the GridMPI software stack (MPI API; Request Layer and Request Interface; IMPI and LAC Layer (Collectives); P2P Interface over TCP/IP, PMv2, MX, O2G, and vendor MPI; RPIM Interface to ssh, rsh, SCore, Globus, and vendor MPI), plus two clusters, one running YAMPII and one running a vendor's MPI, connected over the Internet via IMPI/TCP.]
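
Because GridMPI sits behind the standard MPI API, an ordinary MPI program needs no source changes to span two sites; which transport carries each message (YAMPII inside a cluster, IMPI/TCP between clusters) is decided inside the library. The sketch below is illustrative only and is not taken from the GridMPI distribution.

```c
/* Illustrative only: a plain MPI program that runs unmodified under
 * GridMPI.  Nothing in the source selects a transport; the library
 * routes intra-cluster traffic over YAMPII and inter-cluster traffic
 * over IMPI/TCP. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double)rank;               /* some per-process value */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %.0f\n", size, sum);

    MPI_Finalize();
    return 0;
}
```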

10 High-Performance Communication Mechanisms in the Long and Fat Network
- Modifications of TCP behavior:
  - M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
- Precise software pacing:
  - R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet2005, 2005.
- Collective communication algorithms with respect to network latency and bandwidth:
  - M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks," to appear at IEEE Cluster 2006.
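
The latency-aware collectives themselves are described in the papers cited above. Purely as a rough illustration of the general idea, and not of the published algorithms, a broadcast across two clusters can cross the high-latency wide-area link once and then fan out locally. The sketch below assumes a caller-supplied `site` value (0 or 1) identifying the cluster, a broadcast rooted at global rank 0, and that rank 0 is local rank 0 within its own cluster.

```c
/* Rough illustration of a latency-aware broadcast (not the published
 * GridMPI algorithm): cross the wide-area link once, then fan out
 * inside each cluster over the low-latency local network.
 * Assumptions: `site` is 0 or 1 depending on the cluster, the root is
 * global rank 0, and global rank 0 is local rank 0 in its cluster. */
#include <mpi.h>

void two_level_bcast(void *buf, int count, MPI_Datatype type, int site,
                     MPI_Comm comm)
{
    int rank, local_rank;
    MPI_Comm local, leaders;

    MPI_Comm_rank(comm, &rank);

    /* Group processes by cluster. */
    MPI_Comm_split(comm, site, rank, &local);
    MPI_Comm_rank(local, &local_rank);

    /* One leader (local rank 0) per cluster; others get MPI_COMM_NULL. */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    /* Step 1: the data crosses the high-latency WAN link once, from the
     * root's leader to the other cluster's leader. */
    if (leaders != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leaders);
        MPI_Comm_free(&leaders);
    }

    /* Step 2: each leader fans the data out inside its own cluster. */
    MPI_Bcast(buf, count, type, 0, local);
    MPI_Comm_free(&local);
}
```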

11 Evaluation
- It is almost impossible to reproduce the communication behavior of a wide-area network, so a WAN emulator, GtrcNET-1, is used to scientifically examine implementations, protocols, communication algorithms, etc.
- GtrcNET-1 is developed at AIST (http://www.gtrc.aist.go.jp/gnet/):
  - injection of delay, jitter, errors, ...
  - traffic monitoring and frame capture
  - four 1000Base-SX ports and one USB port for the host PC
  - FPGA (XC2V6000)

12 Experimental Environment
- Two clusters of 8 PCs each (Node0-Node7 and Node8-Node15), each behind a Catalyst 3750 switch and connected through the GtrcNET-1 WAN emulator.
- Each PC: CPU Pentium 4 2.4 GHz, memory DDR400 512 MB, NIC Intel PRO/1000 (82547EI), OS Linux 2.6.9-1.6 (Fedora Core 2), socket buffer size 20 MB.
- Emulated link: bandwidth 1 Gbps, delay 0 ms to 10 ms.
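
The 20 MB socket buffer is what lets TCP keep such a link busy: at 1 Gbps with up to a 20 ms round trip, the bandwidth-delay product is about 2.5 MB, so the send and receive windows must be at least that large. The sketch below shows ordinary Sockets API usage for requesting such buffers, not GridMPI's own configuration mechanism; on Linux the kernel limits (net.core.rmem_max, net.core.wmem_max) must also permit buffers of this size.

```c
/* Minimal sketch: requesting large socket buffers on a TCP socket so the
 * window can cover the bandwidth-delay product of a long fat network.
 * Plain Sockets API usage; how GridMPI configures its own sockets is not
 * shown here. */
#include <sys/socket.h>
#include <stdio.h>

int make_fat_socket(void)
{
    int buf_bytes = 20 * 1024 * 1024;   /* 20 MB, as in the experiment */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   &buf_bytes, sizeof(buf_bytes)) != 0)
        perror("SO_SNDBUF");
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   &buf_bytes, sizeof(buf_bytes)) != 0)
        perror("SO_RCVBUF");

    return sock;
}
```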

13 GridMPI vs. MPICH-G2 (1/4)
- FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance vs. one-way delay (msec).]

14 GridMPI vs. MPICH-G2 (2/4)
- IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance vs. one-way delay (msec).]

15 GridMPI vs. MPICH-G2 (3/4)
- LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance vs. one-way delay (msec).]

16 GridMPI vs. MPICH-G2 (4/4)
- NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes.
- No parameters were tuned in GridMPI.
[Figure: relative performance vs. one-way delay (msec).]

17 GridMPI on an Actual Network
- The NAS Parallel Benchmarks are run on 16 nodes: an 8-node (Pentium 4 2.4 GHz) cluster at Tsukuba and an 8-node (Pentium 4 2.8 GHz) cluster at Akihabara, 60 km (40 mi.) apart, each internally connected by 1G Ethernet and linked by the JGN2 network (10 Gbps bandwidth, 1.5 msec RTT).
- The performance is compared with the results on a single 16-node (2.4 GHz) cluster and on a single 16-node (2.8 GHz) cluster.
[Figure: relative performance per benchmark.]

18 GridMPI Now and Future
- GridMPI version 1.0 has been released.
  - Conformance tests: MPICH Test Suite 0/142 (fails/tests), Intel Test Suite 0/493 (fails/tests).
  - GridMPI is integrated into the NaReGI package.
- Extension of the IMPI specification:
  - Refine the current extensions.
  - Collective communication and checkpoint algorithms could not be fixed in the specification. The current idea is to specify a mechanism for dynamic algorithm selection and for dynamic algorithm shipment and loading, using a virtual machine to implement the algorithms.

19 Dynamic Algorithm Shipment
- A collective communication algorithm is implemented in the virtual machine.
- The code is shipped to all MPI processes.
- The MPI runtime library interprets the algorithm to perform the inter-cluster collective communication, as illustrated by the sketch below.
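
The slides describe this mechanism only at a high level. Purely as a hypothetical illustration of what an interpreted, shipped collective description could look like (the opcodes and structures below are invented for this sketch and are not part of GridMPI or the IMPI extension), the runtime on each process might walk a small instruction list:

```c
/* Purely hypothetical sketch of the "shipped algorithm" idea: a tiny
 * instruction list describing a collective, interpreted by every MPI
 * process.  The opcodes and structures are invented for illustration
 * and are not part of GridMPI or the IMPI extension. */
#include <mpi.h>

typedef enum { OP_SEND, OP_RECV, OP_DONE } coll_op_t;

typedef struct {
    coll_op_t op;
    int       peer;     /* global rank to exchange with */
} coll_insn_t;

/* Interpret a shipped instruction list on the calling process. */
static void run_shipped_collective(const coll_insn_t *prog,
                                   void *buf, int count, MPI_Datatype type,
                                   MPI_Comm comm)
{
    for (; prog->op != OP_DONE; prog++) {
        if (prog->op == OP_SEND)
            MPI_Send(buf, count, type, prog->peer, 0, comm);
        else
            MPI_Recv(buf, count, type, prog->peer, 0, comm,
                     MPI_STATUS_IGNORE);
    }
}

/* Example program shipped to rank 1 for a two-process "broadcast":
 * a single receive from rank 0, then stop. */
static const coll_insn_t rank1_bcast[] = {
    { OP_RECV, 0 },
    { OP_DONE, 0 },
};
```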

20 Concluding Remarks
- Our main concern is the metropolitan-area, high-bandwidth environment: 10 Gbps and up to about 500 miles (less than 10 ms one-way latency).
- Overseas distances (around 100 milliseconds):
  - Applications must be aware of the communication latency.
  - Data movement using MPI-IO?
- Collaborations: we would like to ask people who are interested in this work for collaboration.

