CC-MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters. Amit Karwande, Xin Yuan, Department of Computer Science, Florida State University.


CC-MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters Amit Karwande, Xin Yuan Department of Computer Science, Florida State University David K. Lowenthal Department of Computer Science, University of Georgia

Motivation
Related work
CC-MPI
–One-to-many(all) communications
–Many(all)-to-many(all) communications
Performance study
Conclusion

Traditional communication libraries (e.g., MPI) hide network details and provide a simple API.
Advantage: user friendly.
Limitation: communication optimization opportunity is limited.
–Optimizations can be done either in the compiler or in the library.
»Architecture-independent optimizations in the compiler.
»Architecture-dependent optimizations in the library, but such optimizations can only be done for a single routine.

Compiled Communication:
–At compile time, use both the application communication information and the network architecture information to perform communication optimizations.
Static management of network resources.
Compiler-directed architecture-dependent optimizations.
Architecture-dependent optimizations across patterns.
–To apply the compiled communication technique to MPI programs:
The library must closely match the MPI library.
The library must be able to support the optimizations in compiled communication:
–Expose network details.
–Provide different implementations for each routine so that the user can choose the best one.
–This work focuses on the compiled communication capable communication library.

Related Work:
–Compiler-directed architecture-dependent optimization [Hinrichs94]
–Compiled communication [Bromley91, Cappello95, Kumar92, Yuan03]
–MPI optimizations [Ogawa96, Lauria97, Tang00, Kielmann99]

CC-MPI:
–Optimizes one-to-all, one-to-many, all-to-all, and many-to-many communications.
–Targets Ethernet switched clusters.
–Basic idea:
Separate network control routines from data transmission routines.
Provide multiple implementations for each MPI routine.

One-to-many(all) communications:
–Multicast-based implementations.
Reliable multicast (IP multicast is unreliable):
–Use a simple ACK-based protocol.
Group management:
–A group must be created before any communication can be performed.
–There are 2^n potential groups for n members.
–The hardware limits the number of simultaneous groups.
–CC-MPI supports three group management schemes: static, dynamic, and compiler-assisted.
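The slides describe the ACK-based protocol only at this level of detail. As an illustrative sketch (not CC-MPI's actual implementation, and with `lose` a hypothetical loss model supplied by the caller), the sender's retransmission loop can be simulated like this:

```python
def reliable_multicast(receivers, lose):
    """Simulate an ACK-based reliable multicast: the sender retransmits
    the message each round to every receiver that has not yet ACKed,
    until all receivers have ACKed. Returns the number of rounds used.
    lose(rnd, r) -> True means the packet to receiver r is lost in round rnd."""
    acked = set()
    rounds = 0
    while len(acked) < len(receivers):
        for r in receivers:
            if r not in acked and not lose(rounds, r):
                acked.add(r)  # receiver got the packet and sends an ACK
        rounds += 1
    return rounds
```

With no loss the multicast completes in one round; each lost packet only forces a retransmission to the receivers that missed it.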

Static group management:
–Associate a multicast group with a communicator statically.
MPI_Bcast: send a reliable multicast message to the group.
MPI_Scatter: aggregate all messages to the different nodes and send the aggregated message to the group; each receiver extracts its portion.
MPI_Scatterv: two MPI_Bcasts, one for the layout of the data and one for the data itself.
–Problem: for one-to-many communications, nodes that are not in the communication must also participate in the reliable multicast process.
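The aggregate-and-extract idea behind the multicast-based MPI_Scatter can be shown with a minimal sketch (plain Python standing in for the C/MPI code; the function name is illustrative): the root concatenates the per-rank blocks into one message for the whole group, and each receiver keeps only its own slice.

```python
def scatter_via_bcast(blocks, rank):
    """Broadcast-based scatter with uniform block size (as in MPI_Scatter):
    the root multicasts the concatenation of all per-rank blocks, and each
    receiver extracts the portion addressed to it by offset."""
    aggregate = b"".join(blocks)   # what the root sends to the whole group
    size = len(blocks[0])          # uniform block size known to every rank
    return aggregate[rank * size:(rank + 1) * size]
```

For MPI_Scatterv the block sizes vary, which is why a first broadcast of the layout (counts and displacements) is needed before the data broadcast.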

Dynamic group management:
–Dynamically creates a new group for each one-to-many communication.
–May introduce too much group management overhead.
Compiler-assisted group management:
–Extend the MPI API to allow users to directly manage multicast groups. For example, for MPI_Scatterv there are three routines:
–MPI_Scatterv_open_group
–MPI_Scatterv_data_movement
–MPI_Scatterv_close_group
–Users may move, merge, and delete the control routines when additional information is available.

An example:
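The figure on this slide is not preserved in the transcript. A hypothetical sketch of the idea it illustrates (Python stubs standing in for the CC-MPI C routines named above): when the same scatter pattern is reused across loop iterations, the open/close control routines can be hoisted out of the loop so that group management happens once rather than on every call.

```python
calls = []

# Stubs standing in for the CC-MPI routines; they only record the call
# sequence so the effect of hoisting is visible.
def MPI_Scatterv_open_group(group):
    calls.append("open")

def MPI_Scatterv_data_movement(group):
    calls.append("move")

def MPI_Scatterv_close_group(group):
    calls.append("close")

def scatterv_loop_hoisted(iterations):
    """Group management hoisted out of the loop: one open, one close,
    and only the data-movement routine inside the loop body."""
    MPI_Scatterv_open_group("g")
    for _ in range(iterations):
        MPI_Scatterv_data_movement("g")
    MPI_Scatterv_close_group("g")
```

Without the split API, each MPI_Scatterv call would pay the group open/close cost every iteration.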

All(many)-to-all(many) communications:
–MPI_Alltoall, MPI_Alltoallv, MPI_Allgather, etc.
–Multicast-based implementations may not be efficient.
–We need to distinguish between communications with small messages and with large messages:
Small messages: each node sends as fast as it can.
Large messages: use some mechanism to reduce contention.
–Phase communication [Hinrichs94]:
»Partition the all-to-all communication into phases such that there is no network contention within each phase.
»Use barriers to separate phases so that different phases do not interfere with each other.
–Phase communication for all-to-all communications is well studied for many topologies.
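A standard contention-free phase schedule for all-to-all (a sketch of the technique; the slides do not specify CC-MPI's exact schedule): in phase p, node i sends to node (i + p) mod n. Each phase is then a fixed-point-free permutation, so every node sends exactly one message and receives exactly one message per phase and no switch output port is contended.

```python
def alltoall_phases(n):
    """Return, for each of the n-1 phases, the destination of every node:
    in phase p (1 <= p < n), node i sends to (i + p) % n.
    Each phase is a fixed-point-free permutation, so within a phase no
    node receives two messages (no output-port contention)."""
    return [[(i + p) % n for i in range(n)] for p in range(1, n)]
```

In the actual implementation a barrier would be run between consecutive phases, as the slide describes, so slow phases cannot bleed into fast ones.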

Phase communication for many-to-many communications (MPI_Alltoallv):
–All nodes must know the communication pattern:
Use MPI_Allgather before anything is done, or
Assume the compiler has the information and stores it in local data structures.
–Communication scheduling:
Greedy scheduling
All-to-all based scheduling
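A minimal sketch of greedy scheduling (an illustration of the general approach, not necessarily CC-MPI's exact algorithm): place each (src, dst) message into the first phase in which neither its sender nor its receiver already appears, opening a new phase when no existing phase fits.

```python
def greedy_schedule(messages):
    """Greedily pack (src, dst) messages into phases such that within each
    phase no node appears twice as a sender or twice as a receiver, i.e.
    each phase is contention-free on a single switch."""
    phases = []
    for src, dst in messages:
        for phase in phases:
            if all(s != src and d != dst for s, d in phase):
                phase.append((src, dst))
                break
        else:
            phases.append([(src, dst)])   # no conflict-free phase: open one
    return phases
```

All-to-all based scheduling would instead embed the irregular pattern into the regular all-to-all phase structure, trading possibly more phases for a precomputed, well-understood schedule.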

CC-MPI supports four methods for MPI_Alltoallv:
–All nodes send as fast as possible.
–Phased communication, level 1:
MPI_Allgather for pattern information
Communication scheduling
Actual phase communication
–Phased communication, level 2 (pattern is known):
Communication scheduling
Actual phase communication
–Phased communication, level 3 (phases are known):
Actual phase communication

Performance Study:
–Environment: 29 P3-650 nodes, 100Mbps Ethernet switch.
–LAM/MPI with c2c mode.
–MPICH with the ch_p4 device.

Evaluation of individual routines:

MPI_Bcast:

MPI_Scatter:

MPI_Scatterv (5 to 5 out of 29 nodes):

MPI_Allgather (16 nodes):

MPI_Alltoall (16 nodes):

MPI_Alltoallv (alltoall pattern on 16 nodes):

MPI_Alltoallv (random pattern):

Benchmark Program (IS):

Benchmark Program (FT):

CC-MPI for software DSM (a synthetic application):

Conclusion:
–We develop a compiled communication capable MPI prototype.
–We demonstrate that by allowing users more control over communications, significant performance improvement can be obtained.
–Compiler support is needed for this model to be successful.