GridMPI: Grid-Enabled MPI
Yutaka Ishikawa, University of Tokyo and AIST

Motivation
MPI has been widely used to program parallel applications.
Users want to run such applications over the Grid environment without modifying the program.
However, the performance of existing MPI implementations does not scale up in the Grid environment.
[Figure: a single (monolithic) MPI application running over a wide-area network spanning computing resource sites A and B]

Motivation
Focus on a metropolitan-area, high-bandwidth environment: about 10 Gbps bandwidth within about 500 miles (less than 10 ms one-way latency).
– Internet bandwidth in the Grid vs. interconnect bandwidth in a cluster: 10 Gbps vs. 1 Gbps, or 100 Gbps vs. 10 Gbps.
[Figure: a single (monolithic) MPI application running over a wide-area network spanning computing resource sites A and B]
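
A rough sanity check of the 500-mile / 10 ms figure (assuming signal propagation in optical fiber at roughly two thirds of the speed of light, about 200,000 km/s): 500 miles is roughly 800 km, so propagation alone costs about 800 / 200,000 s = 4 ms one way, which leaves headroom for routing and queueing within the 10 ms target.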

Motivation
Focus on a metropolitan-area, high-bandwidth environment: about 10 Gbps bandwidth within about 500 miles (less than 10 ms one-way latency).
– We have already demonstrated, using an emulated WAN environment, that the NAS Parallel Benchmark programs scale up if the one-way latency is smaller than 10 ms.
[Figure: a single (monolithic) MPI application running over a wide-area network spanning computing resource sites A and B]
Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGrid 2003.

Issues
High-performance communication facilities for MPI on long and fat networks
– TCP vs. MPI communication patterns
– Network topology: latency and bandwidth
Interoperability
– Most MPI library implementations use their own network protocol.
Fault tolerance and migration
– To survive a site failure
Security

TCP: designed for streams.
MPI: burst traffic; applications repeat computation and communication phases, and the traffic changes with the communication pattern.

[Figure: repeating a 10 MB data transfer with two-second intervals, and throughput observed during one 10 MB transfer; the slow-start phase window size is set to 1, and the silent periods result from the burst traffic]
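
A minimal sketch (not the authors' code) of the kind of micro-benchmark described above: rank 0 repeatedly sends a 10 MB message and then idles for two seconds, mimicking alternating computation and communication phases. This is the pattern under which TCP's slow start after idle throttles bursty MPI traffic.

```c
/* Sketch of the bursty pattern described on this slide: rank 0 repeatedly
 * sends a 10 MB message to rank 1, then idles for two seconds, mimicking
 * alternating computation and communication phases. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MSG_BYTES (10 * 1024 * 1024)   /* 10 MB per burst */
#define ITERATIONS 10
#define IDLE_SECONDS 2

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < ITERATIONS; i++) {
        double t0 = MPI_Wtime();
        if (rank == 0)
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        if (rank == 0)
            printf("burst %d: %.3f s\n", i, MPI_Wtime() - t0);
        sleep(IDLE_SECONDS);   /* idle phase; the TCP congestion window may decay here */
    }

    MPI_Finalize();
    free(buf);
    return 0;
}
```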

Issues
High-performance communication facilities for MPI on long and fat networks
– TCP vs. MPI communication patterns
– Network topology: latency and bandwidth
Interoperability
– Most MPI library implementations use their own network protocol.
Fault tolerance and migration
– To survive a site failure
Security

TCP: designed for streams.
MPI: burst traffic; applications repeat computation and communication phases, and the traffic changes with the communication pattern.

[Figure: one-to-one communication started at time 0, immediately after an all-to-all]

Issues
High-performance communication facilities for MPI on long and fat networks
– TCP vs. MPI communication patterns
– Network topology: latency and bandwidth
Interoperability
– Most MPI library implementations use their own network protocol.
Fault tolerance and migration
– To survive a site failure
Security

TCP: designed for streams.
MPI: burst traffic; applications repeat computation and communication phases, and the traffic changes with the communication pattern.

[Figure: the Internet connecting the sites; topology determines latency and bandwidth]

Issues
High-performance communication facilities for MPI on long and fat networks
– TCP vs. MPI communication patterns
– Network topology: latency and bandwidth
Interoperability
– Many MPI library implementations exist, and most use their own network protocol.
Fault tolerance and migration
– To survive a site failure
Security

TCP: designed for streams.
MPI: burst traffic; applications repeat computation and communication phases, and the traffic changes with the communication pattern.

[Figure: sites connected across the Internet, each using a different vendor's MPI library (vendors A, B, C, and D)]

GridMPI Features
[Architecture diagram: MPI API on top; Request Layer and Request Interface; P2P Interface over TCP/IP, PMv2, MX, O2G, and vendor MPI; IMPI with the LAC Layer (Collectives); RPIM Interface over ssh, rsh, SCore, Globus, and vendor MPI]
MPI-2 implementation
YAMPII, developed at the University of Tokyo, is used as the core implementation.
Intra-cluster communication by YAMPII (TCP/IP, SCore).
Inter-cluster communication by IMPI (Interoperable MPI) protocol, with extensions for the Grid:
– MPI-2
– New collective protocols
Integration of vendor MPIs
– IBM Regatta MPI, MPICH2, Solaris MPI, Fujitsu MPI, (NEC SX MPI)
Incremental checkpointing
High-performance TCP/IP implementation
[Figure: clusters running vendor MPIs and YAMPII, connected across the Internet via IMPI/TCP]
LAC: Latency-Aware Collectives; bcast/allreduce algorithms have been developed (to appear at the Cluster 2006 conference).
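
For illustration, a minimal sketch of one common latency-aware broadcast strategy: cross the wide-area link only once per cluster, to a leader rank, then broadcast inside each cluster over the fast local network. This is an assumed structure, not GridMPI's actual LAC code, and the cluster_id argument is a hypothetical input (e.g., derived from a hostname map).

```c
/* Sketch of a two-level, latency-aware broadcast from global rank 0:
 * the message crosses the wide-area link once per cluster (to each
 * cluster's leader), then is broadcast inside each cluster over the
 * local network. Illustrative only; not GridMPI's actual LAC code.
 * Assumes the root is global rank 0. */
#include <mpi.h>

void wan_aware_bcast(void *buf, int count, MPI_Datatype type,
                     MPI_Comm comm, int cluster_id)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Ranks in the same cluster (keyed by global rank, so global rank 0
     * becomes local rank 0 of its cluster). */
    MPI_Comm local;
    MPI_Comm_split(comm, cluster_id, rank, &local);

    int local_rank;
    MPI_Comm_rank(local, &local_rank);

    /* One leader per cluster: the lowest global rank in each cluster. */
    MPI_Comm leaders;
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    /* Step 1: leaders exchange the data over the WAN (the root is leader 0). */
    if (leaders != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leaders);
        MPI_Comm_free(&leaders);
    }

    /* Step 2: each leader broadcasts within its cluster over the LAN. */
    MPI_Bcast(buf, count, type, 0, local);
    MPI_Comm_free(&local);
}
```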

High-Performance Communication Mechanisms for Long and Fat Networks
Modifications of TCP behavior
– M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005.
Precise software pacing
– R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005.
Collective communication algorithms that take network latency and bandwidth into account
– M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks," to appear at IEEE Cluster 2006.
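
As a rough illustration of the pacing idea only: the mechanism in the PFLDnet 2005 paper works at the driver level (gap packets), whereas the sketch below is a naive user-level version that merely spaces out writes so the average rate stays at a target value.

```c
/* Naive user-level pacing sketch: space out fixed-size writes so the
 * average send rate stays near target_gbps. The published mechanism paces
 * at the driver level with gap packets; this only shows the gap arithmetic. */
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

static void sleep_ns(long ns)
{
    struct timespec ts = { ns / 1000000000L, ns % 1000000000L };
    nanosleep(&ts, NULL);
}

/* Send len bytes over socket fd in chunk-sized pieces at roughly target_gbps. */
void paced_send(int fd, const char *buf, size_t len,
                size_t chunk, double target_gbps)
{
    /* Time budget per chunk: bits / (Gbit/s) conveniently yields nanoseconds. */
    long gap_ns = (long)((double)chunk * 8.0 / target_gbps);

    for (size_t off = 0; off < len; ) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        ssize_t w = write(fd, buf + off, n);
        if (w <= 0)
            break;               /* error handling omitted in this sketch */
        off += (size_t)w;
        sleep_ns(gap_ns);        /* wait before the next burst of bytes */
    }
}
```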

Evaluation
It is almost impossible to reproduce the communication behavior of a wide-area network, so a WAN emulator, GtrcNET-1, is used to examine implementations, protocols, communication algorithms, etc. in a controlled, repeatable way.
GtrcNET-1 (developed at AIST):
– injection of delay, jitter, errors, …
– traffic monitoring and frame capture
– four 1000Base-SX ports and one USB port for the host PC
– FPGA (XC2V6000)
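
A small sketch (an assumed test harness, not part of the AIST tooling) of the kind of check one can run through the emulator: an MPI ping-pong whose measured round-trip time should track the delay injected by GtrcNET-1.

```c
/* Sketch: MPI ping-pong between rank 0 and rank 1 across the emulated WAN
 * link; the measured round-trip time should track the injected delay plus
 * real network and software overhead. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { REPS = 1000 };
    char byte = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("average RTT: %.3f ms\n", (MPI_Wtime() - t0) * 1000.0 / REPS);

    MPI_Finalize();
    return 0;
}
```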

Experimental Environment
Two clusters of 8 PCs each (nodes 0-7 and 8-15), each connected to a Catalyst 3750 switch; the two switches are linked through the GtrcNET-1 WAN emulator (bandwidth 1 Gbps, delay 0 ms to 10 ms).
Per node:
– CPU: Pentium 4 / 2.4 GHz
– Memory: DDR
– NIC: Intel PRO/1000 (82547EI)
– OS: Linux (Fedora Core 2)
– Socket buffer size: 20 MB
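
The 20 MB socket buffer comfortably exceeds the bandwidth-delay product of this setup (1 Gbps x 20 ms RTT = 2.5 MB). A minimal sketch (an assumed helper, not GridMPI code) of sizing socket buffers from the BDP:

```c
/* Sketch: size TCP socket buffers from the bandwidth-delay product so a
 * single stream can fill the long, fat path. With 1 Gbps and a 20 ms RTT
 * (10 ms one-way), the BDP is 1e9 * 0.020 / 8 = 2.5 MB; the testbed's
 * 20 MB setting leaves ample headroom. Illustrative, not GridMPI code. */
#include <stdio.h>
#include <sys/socket.h>

int set_buffers_for_path(int fd, double gbps, double rtt_ms)
{
    /* bytes = (bits/s) * seconds / 8 */
    int bdp = (int)(gbps * 1e9 * (rtt_ms / 1000.0) / 8.0);
    int size = 8 * bdp;   /* generous multiple, mirroring the 20 MB choice */

    /* Note: the kernel may clamp these to net.core.rmem_max / wmem_max. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        return -1;

    printf("BDP = %d bytes, requested buffer = %d bytes\n", bdp, size);
    return 0;
}
```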

GridMPI vs. MPICH-G2 (1/4)
FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance vs. one-way delay (msec)]

GridMPI vs. MPICH-G2 (2/4)
IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance vs. one-way delay (msec)]

GridMPI vs. MPICH-G2 (3/4)
LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance vs. one-way delay (msec)]

GridMPI vs. MPICH-G2 (4/4)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes.
No parameters were tuned in GridMPI.
[Figure: relative performance vs. one-way delay (msec)]

GridMPI on an Actual Network
NAS Parallel Benchmarks run on an 8-node (2.4 GHz) cluster at Tsukuba and an 8-node (2.8 GHz) cluster at Akihabara, i.e., 16 nodes in total.
The performance is compared with:
– the result using 16 nodes (2.4 GHz)
– the result using 16 nodes (2.8 GHz)
[Figure: Pentium 4 2.4 GHz x 8 at Tsukuba and Pentium 4 2.8 GHz x 8 at Akihabara, each cluster connected by 1 Gb Ethernet, linked over the JGN2 network (10 Gbps bandwidth, 1.5 ms RTT) across 60 km (40 mi); chart of relative performance per benchmark]

GridMPI Now and Future
GridMPI version 1.0 has been released.
– Conformance tests: MPICH Test Suite 0/142 (fails/tests), Intel Test Suite 0/493 (fails/tests)
– GridMPI is integrated into the NaReGI package.
Extension of the IMPI specification
– Refine the current extensions.
– Collective communication and checkpoint algorithms cannot be fixed in the specification; the current idea is to specify the mechanism of:
  – dynamic algorithm selection
  – dynamic algorithm shipment and loading
    » a virtual machine to implement the algorithms

Dynamic Algorithm Shipment
A collective communication algorithm is implemented for the virtual machine.
The code is shipped to all MPI processes.
The MPI runtime library interprets the algorithm to perform the inter-cluster collective communication.
[Figure: the algorithm code being distributed to MPI processes at sites connected across the Internet]
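
A toy sketch of the idea (purely illustrative; the slides do not define the virtual machine, so the instruction format below is invented for the example): the algorithm is encoded as a flat list of send/receive steps, broadcast to every process, and then interpreted locally by each rank.

```c
/* Toy sketch of "dynamic algorithm shipment": a collective algorithm is
 * encoded as a flat list of steps, shipped to every process with MPI_Bcast,
 * and interpreted locally. The instruction format is invented for
 * illustration and is not GridMPI's actual virtual machine. */
#include <mpi.h>

#define MAX_STEPS 64

enum { OP_SEND, OP_RECV, OP_DONE };

typedef struct {
    int op;     /* OP_SEND, OP_RECV, or OP_DONE     */
    int rank;   /* which rank this step applies to  */
    int peer;   /* partner rank for the transfer    */
} Step;

/* Every rank passes a prog array of MAX_STEPS entries; only the root's
 * contents matter, since the program is broadcast before interpretation. */
void run_shipped_collective(Step prog[MAX_STEPS], int nsteps, void *buf,
                            int count, MPI_Datatype type, int root,
                            MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Ship the algorithm itself from the root to all processes. */
    MPI_Bcast(&nsteps, 1, MPI_INT, root, comm);
    MPI_Bcast(prog, MAX_STEPS * (int)sizeof(Step), MPI_BYTE, root, comm);

    /* Interpret the shipped program, executing only this rank's steps. */
    for (int i = 0; i < nsteps && prog[i].op != OP_DONE; i++) {
        if (prog[i].rank != rank)
            continue;
        if (prog[i].op == OP_SEND)
            MPI_Send(buf, count, type, prog[i].peer, 0, comm);
        else
            MPI_Recv(buf, count, type, prog[i].peer, 0, comm,
                     MPI_STATUS_IGNORE);
    }
}
```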

Concluding Remarks
Our main concern is the metropolitan-area network
– a high-bandwidth environment: about 10 Gbps, within about 500 miles (less than 10 ms one-way latency)
Overseas (about 100 ms latency):
– Applications must be aware of the communication latency.
– Data movement using MPI-IO?
Collaborations
– We would like to invite people who are interested in this work to collaborate with us.