Programming Environment & Training (PET) 14-15 Sep 99: Parallel I/O for Distributed Applications
Dr Graham E Fagg
Innovative Computing Laboratory, University of Tennessee, Knoxville, TN

Programming Environment & Training (PET) Sep 99: Overview
Aims of Project
Background on MPI_Connect (PVMPI)
Year 4 work
– Changes to MPI_Connect: PVM; SNIPE and RCDS; single-threaded comms; multi-threaded comms

Programming Environment & Training (PET) Sep 99: Overview (cont)
Parallel IO
– Subsystems: ROMIO (MPICH), high performance platforms
File handling and management
– Pre-caching
– IBP and SNIPE-Lite
– Experimental system

Programming Environment & Training (PET) Sep 99: Overview (cont)
Future Work
– File handling and exemplars
DoD benefits
Changes to milestones
Additional Comments
Conclusions

Programming Environment & Training (PET) Sep 99: Aims of Project
Continue development of MPI_Connect
– Fix bugs, such as the async non-blocking message problems
– Enhance features as requested by users
– Support Parallel IO (as in MPI-2 Parallel IO)
– Support complex file management across systems and sites: as we already support the computation, why not the input and result files as well?

Programming Environment & Training (PET) Sep 99: Aims of Project (cont)
Support better scheduling of application runs across sites and systems
– i.e. gang scheduling of processors together with pre-fetching of data (logistical scheduling)
Support the CFD and CWO CTAs
Training and outreach
HPC challenges (SC98, SC99, SC2000, ...)

Programming Environment & Training (PET) Sep 99: Background on MPI_Connect
What is MPI_Connect?
– A system that allows two or more high performance MPI applications to inter-operate across systems/sites.
– It allows each application to use the tuned, vendor-supplied MPI implementation without forcing the loss of local performance that occurs with systems like MPICH (p2) and Global MPI (Nexus MPI).

Programming Environment & Training (PET) Sep 99: MPI_Connect Coupled Model Example
[Diagram: two MPI applications, an Ocean Model and an Air Model, each with its own MPI_COMM_WORLD, joined by a global inter-communicator (air_comm on one side, ocean_comm on the other).]

Programming Environment & Training (PET) Sep 99: MPI_Connect
The application developer adds just three extra calls to an application to allow it to inter-operate with any other application:
– MPI_Conn_register, MPI_Conn_intercomm_create, MPI_Conn_remove
Once these calls are added, normal MPI point-to-point calls can be used to send messages between systems.
– The only requirements are that both applications can access a common name service (usually via IP) and that the MPI implementation has a profiling layer available.
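To make the calling pattern concrete, here is a minimal sketch of how the air side of the coupled example above might use these calls. The slides only name the MPI_Conn_* functions; their argument lists, the application names "air" and "ocean", and the message tag are assumptions for illustration.

#include <mpi.h>

/* MPI_Connect entry points: names from the slides, prototypes assumed. */
int MPI_Conn_register(const char *name, MPI_Comm local);
int MPI_Conn_intercomm_create(const char *peer, MPI_Comm local, MPI_Comm *inter);
int MPI_Conn_remove(const char *name);

int main(int argc, char **argv)
{
    MPI_Comm ocean_comm;          /* inter-communicator to the peer application */
    double field[1024];           /* hypothetical exchange buffer */

    MPI_Init(&argc, &argv);

    /* Register this application ("air") with the common name service. */
    MPI_Conn_register("air", MPI_COMM_WORLD);

    /* Build a global inter-communicator to the peer application ("ocean"). */
    MPI_Conn_intercomm_create("ocean", MPI_COMM_WORLD, &ocean_comm);

    /* From here on, ordinary MPI point-to-point calls cross the systems. */
    /* ... compute field values here ... */
    MPI_Send(field, 1024, MPI_DOUBLE, 0, 99, ocean_comm);
    MPI_Recv(field, 1024, MPI_DOUBLE, 0, 99, ocean_comm, MPI_STATUS_IGNORE);

    /* Tear down the inter-application connection before finalizing. */
    MPI_Conn_remove("air");
    MPI_Finalize();
    return 0;
}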

Programming Environment & Training (PET) Sep 99: MPI_Connect
[Diagram: the user's code calls an MPI_ function, which enters the intercomm library. The library looks up the communicator: if it is a true MPI intra-communicator it uses the profiled MPI call (the PMPI_ function); otherwise it translates the addressing into SNIPE/PVM form and uses the SNIPE/PVM functions of the other library. In either case it works out the correct return code to hand back to the user's code.]
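As a rough illustration of the interception shown in that diagram, the sketch below wraps MPI_Send through the standard MPI profiling interface. The helpers is_local_communicator() and snipe_send() are hypothetical stand-ins for MPI_Connect internals, and the MPI-1/2 style prototype (non-const buffer) matches the era of the deck.

#include <mpi.h>

/* Hypothetical MPI_Connect internals, declared only to illustrate the dispatch. */
int is_local_communicator(MPI_Comm comm);
int snipe_send(void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm);

/* Profiling-layer wrapper: user code calls MPI_Send as usual. */
int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    if (is_local_communicator(comm)) {
        /* True MPI intra-communicator: forward to the vendor MPI via PMPI. */
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }

    /* Otherwise translate the addressing and send over SNIPE/PVM,
     * mapping the result onto the correct MPI return code. */
    return snipe_send(buf, count, type, dest, tag, comm) == 0
               ? MPI_SUCCESS : MPI_ERR_OTHER;
}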

Programming Environment & Training (PET) Sep 99: Year 4 Work
Changes to MPI_Connect (PVM)
– Worked well for the SC98 High Performance Computing Challenge demo, BUT
– Not everything worked as well as it should:
– PVM got in the way of the IBM POE job control system.
– Async messages were not non-blocking/asynchronous, as discovered by the SPLICE team.

Programming Environment & Training (PET) Sep 99: MPI_Connect and SNIPE
SNIPE (Scalable Networked Information Processing Environment)
– Was seen as a replacement for PVM:
– No central point of failure
– High speed, reliable communications with limited QoS
– Powerful, secure metadata service based on RCDS

Programming Environment & Training (PET) Sep 99: MPI_Connect and SNIPE
SNIPE used RCDS for its name service
– This worked on the SGI Origin and IBM SP systems but did not, and still does not, work on the Cray T3E (jim).
Solution
– Kept the communications (SNIPE_Lite) and dropped RCDS in favour of a custom name service daemon (more on this later).

Programming Environment & Training (PET) Sep 99: MPI_Connect and Single-Threaded Communications
The SNIPE_Lite communications library was by default single-threaded
– Note: single-threaded Nexus is also called NexusLite.
This meant that asynchronous non-blocking calls became merely non-blocking: no progress could be made outside of an MPI call (just as in the PVM case when using direct IP sockets).

Programming Environment & Training (PET) Sep 99: Single-Threaded Communications
A message sent with MPI_Isend() between different systems:
– Each time an MPI call is made, the sender can check the outgoing socket and force some more data through it.
– The socket should be marked non-blocking so that the MPI application cannot be deadlocked by the actions of an external system, i.e. when that system does not make progress.
– When the user's application calls MPI_Wait(), the communication is forced through to completion.
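A minimal sketch of that single-threaded progress idea, assuming a pending-send record kept by the library; the types and names are illustrative, not MPI_Connect's actual internals.

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Illustrative pending-send record; not MPI_Connect's actual structure. */
struct pending_send {
    int         fd;      /* socket marked O_NONBLOCK */
    const char *buf;     /* message data */
    size_t      len;     /* total length */
    size_t      done;    /* bytes already pushed */
};

/* Called from inside every intercepted MPI call: push whatever the
 * non-blocking socket will accept right now, then return immediately. */
static void try_progress(struct pending_send *ps)
{
    while (ps->done < ps->len) {
        ssize_t n = send(ps->fd, ps->buf + ps->done, ps->len - ps->done, 0);
        if (n > 0)
            ps->done += (size_t)n;
        else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return;            /* socket full: try again at the next MPI call */
        else
            return;            /* error handling omitted in this sketch */
    }
}

/* An MPI_Wait() on such a request would instead loop (or select/poll)
 * until ps->done == ps->len, forcing the transfer through to completion. */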

Programming Environment & Training (PET) Sep 99: Multi-Threaded Communications
The solution was to use multi-threaded communications for the external traffic.
– 3 threads in the initial implementation:
– 1 send thread
– 1 receive thread
– 1 control thread that handles name service requests and sets up connections to external applications

Programming Environment & Training (PET) Sep 99: Multi-Threaded Communications
How does it work?
– Sends put message descriptions onto a send queue
– Receive operations put requests onto a receive queue
– If the operation is blocking, the caller is suspended until a condition arises that wakes it up (using condition variables)
While the main thread continues after 'posting' a non-blocking operation, the threading library steals cycles to send/receive the message (see the sketch below).
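A minimal sketch of the send-queue pattern described above, using POSIX threads; it illustrates the general technique (queue plus condition variable) rather than MPI_Connect's actual code, and the receive queue would follow the same shape.

#include <pthread.h>
#include <stddef.h>

/* Illustrative message descriptor queued for the send thread. */
struct msg {
    void       *buf;
    size_t      len;
    int         done;          /* set by the send thread when the transfer completes */
    struct msg *next;
};

static struct msg     *send_queue = NULL;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* Non-blocking post: the main thread just enqueues a description and returns. */
void post_send(struct msg *m)
{
    pthread_mutex_lock(&q_lock);
    m->done = 0;
    m->next = send_queue;
    send_queue = m;
    pthread_cond_broadcast(&q_cond);       /* wake the send thread */
    pthread_mutex_unlock(&q_lock);
}

/* Blocking wait: the caller sleeps on the condition variable until done. */
void wait_send(struct msg *m)
{
    pthread_mutex_lock(&q_lock);
    while (!m->done)
        pthread_cond_wait(&q_cond, &q_lock);
    pthread_mutex_unlock(&q_lock);
}

/* Send thread: dequeues descriptors, pushes the data over the external
 * connection (transfer omitted here), marks them done and wakes any waiters. */
void *send_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (send_queue == NULL)
            pthread_cond_wait(&q_cond, &q_lock);
        struct msg *m = send_queue;
        send_queue = m->next;
        pthread_mutex_unlock(&q_lock);

        /* ... write m->buf / m->len to the external connection here ... */

        pthread_mutex_lock(&q_lock);
        m->done = 1;
        pthread_cond_broadcast(&q_cond);   /* wake anyone blocked in wait_send */
        pthread_mutex_unlock(&q_lock);
    }
    return NULL;
}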

Programming Environment & Training (PET) Sep 99: Multi-Threaded Communications Performance
The test was done by posting a non-blocking send and measuring the number of operations the main thread could perform while waiting for the 'send' to complete. The system switched to non-blocking TCP sockets when more than one external connection was open.
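A sketch of that style of overlap measurement, written here in plain MPI terms between two ranks; the payload size, work loop and polling interval are illustrative choices, not the original benchmark.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    static double buf[1 << 20];      /* 8 MB payload (assumed size) */
    int rank, done = 0;
    long ops = 0;
    volatile double sink = 0.0;      /* stops the compiler removing the work */
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(buf, 1 << 20, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        while (!done) {
            sink += 1.0;             /* one unit of "main thread" work */
            ops++;
            if (ops % 1000 == 0)     /* poll occasionally so MPI can progress */
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
        printf("operations overlapped with the send: %ld\n", ops);
    } else if (rank == 1) {
        MPI_Recv(buf, 1 << 20, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}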

Programming Environment & Training (PET) Sep 99: Parallel IO
Parallel IO gives parallel user applications access to large volumes of data in such a way that, by avoiding sharing, throughput can be increased through optimisations at the OS and hardware architecture levels. MPI-2 provides an API for accessing high performance Parallel IO subsystems.
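For reference, a minimal MPI-2 Parallel IO example in which each rank writes its own contiguous block of one shared file; the file name and block size are arbitrary choices for the sketch.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i;
    int block[1024];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 1024; i++)
        block[i] = rank;                    /* dummy data */

    MPI_File_open(MPI_COMM_WORLD, "results.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank r owns bytes [r*blocksize, (r+1)*blocksize) of the shared file. */
    offset = (MPI_Offset)rank * 1024 * sizeof(int);
    MPI_File_write_at_all(fh, offset, block, 1024, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}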

Programming Environment & Training (PET) Sep 99: Parallel IO
Most Parallel IO implementations are built from ROMIO, a model implementation supplied with MPICH.
The SGI Origin at CEWES runs MPT; the version needed is MPT 1.3.
The Cray T3E (jim) runs MPT, but 1.3 and a patched version are reported by the system.

Programming Environment & Training (PET) Sep 99: File Handling and Management
MPI_Connect handles the communication between separate MPI applications, BUT it does not handle the files that they work on or produce. The aim of the second half of the project is to give users of MPI_Connect the ability to share files across multiple systems and sites in a way that complements their application execution and the use of MPI-2 Parallel IO.

Programming Environment & Training (PET) Sep 99: File Handling and Management
This project should produce tools that allow applications to share whole files or parts of files, and allow these files to be accessed by running applications no matter where they execute.
– Whether an application runs at CEWES or ASC, at the beginning of the run the input file should be in a single location, and at the end of the run the result file should also be in a single location, regardless of where the application executed.

Programming Environment & Training (PET) Sep 99: File Handling and Management
Two systems were considered:
– Internet Backplane Protocol (IBP): part of the Internet2 Distributed Storage Infrastructure (I2DSI) project. Code developed at UTK; tested on the I2 system.
– SNIPE_Lite store-and-forward daemon (SFD): SNIPE_Lite is already used by MPI_Connect. Code developed at UTK.

Programming Environment & Training (PET) Sep 99: File Handling and Management
The system adds five extra calls:
MPI_Conn_getfile
MPI_Conn_getfile_view
MPI_Conn_putfile
MPI_Conn_putfile_view
MPI_Conn_releasefile
Getfile fetches a file from a central location into the local 'parallel' filesystem. Putfile puts a file from a local filesystem into a central location. The _view versions work on subsets of files.

Programming Environment & Training (PET) Sep 99: File Handling and Management
Example code:

MPI_Init(&argc, &argv);
/* ... */
MPI_Conn_getfile_view(mydata, myworkdata, me, num_of_apps, &size);
/* Get my part of the file called mydata and call it myworkdata */
/* ... */
MPI_File_open(MCW, myworkdata, ...);
/* file is now available via MPI-2 IO */
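Filling in the elided parts, a fuller sketch of the intended pattern might look like the following. The MPI_Conn_getfile_view argument list is taken from the slide's example, while the prototypes, the application index, and the closing putfile step (by analogy with the putfile description on the previous slide) are assumptions.

#include <mpi.h>

/* Prototypes assumed from the calling sequence shown on the slide. */
int MPI_Conn_getfile_view(const char *central, const char *local,
                          int me, int num_of_apps, int *size);
int MPI_Conn_putfile_view(const char *central, const char *local,
                          int me, int num_of_apps, int *size);

int main(int argc, char **argv)
{
    int me = 0;               /* this application's index among the cooperating apps (assumed) */
    int num_of_apps = 2;      /* assumed: two cooperating applications */
    int size;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* Fetch my part of the central input file "mydata" into the local
     * parallel filesystem under the name "myworkdata". */
    MPI_Conn_getfile_view("mydata", "myworkdata", me, num_of_apps, &size);

    /* The local copy is now an ordinary file for MPI-2 IO. */
    MPI_File_open(MPI_COMM_WORLD, "myworkdata", MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);
    /* ... compute, reading and writing via MPI-2 IO ... */
    MPI_File_close(&fh);

    /* Assumed by analogy: push the local result file back to the
     * central location when the run is finished (argument order assumed). */
    MPI_Conn_putfile_view("myresults", "myworkresults", me, num_of_apps, &size);

    MPI_Finalize();
    return 0;
}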

Programming Environment & Training (PET) Sep 99: Test-Bed
Between two clusters of Sun UltraSparc systems:
– MPI applications are implemented using MPICH.
– MPI Parallel IO is implemented using ROMIO.
– The system is tested with both IBP and SNIPE_Lite, as the user API is the same.

Programming Environment & Training (PET) Sep 99: Example
[Diagram: an input file held by the file support daemon (IBP or the SNIPE_Lite SFD), alongside two applications, MPI_App 1 and MPI_App 2.]

Programming Environment & Training (PET) Sep 99: Example
[Diagram: MPI_App 1 and MPI_App 2 each issue a request to the file support daemon (IBP or SLSFD): Getfile(0,2..) and Getfile(1,2..) respectively.]

Programming Environment & Training (PET) Sep 99: Example
[Diagram: the file support daemon serves the Getfile(0,2..) and Getfile(1,2..) requests. The file data is passed in a block-wise fashion so that it does not overload the daemon.]
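To illustrate the block-wise transfer idea, here is a small sketch of a client pulling a file from the daemon in fixed-size blocks; the block size, framing and socket setup are assumptions, not the IBP or SLSFD wire protocol.

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

#define BLOCK_SIZE 65536    /* assumed block size, not the real protocol's value */

/* Pull 'total' bytes of a remote file from the daemon socket 'fd' into a
 * local file, one block at a time, so neither side buffers the whole file. */
int get_file_blocks(int fd, const char *local_path, long total)
{
    char block[BLOCK_SIZE];
    FILE *out = fopen(local_path, "wb");
    long received = 0;

    if (out == NULL)
        return -1;

    while (received < total) {
        long want = total - received;
        if (want > BLOCK_SIZE)
            want = BLOCK_SIZE;

        ssize_t n = recv(fd, block, (size_t)want, 0);   /* one block at a time */
        if (n <= 0) {                                   /* error or early close */
            fclose(out);
            return -1;
        }
        fwrite(block, 1, (size_t)n, out);
        received += n;
    }

    fclose(out);
    return 0;
}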

Programming Environment & Training (PET) Sep 99: Example
[Diagram: each application now holds its own local copy of the input file data; the files are ready to be opened with the MPI-2 MPI_File_open function.]

Programming Environment & Training (PET) Sep 99: Future Work
File handling (next half)
– Change the system so that shared files can be combined and new views (access modes) created for the files on the fly.
File handling (next year)
– Change the system so that we can perform true prefetching of large data sets before the applications even start to execute. This would be done through co-operation between the file handling daemons and site-wide batch queue systems such as a modified version of NASA's PBS.

Programming Environment & Training (PET) Sep 99: DoD Benefits
MPI_Connect
– Still allows independently developed MPI applications to interoperate with little or no loss of local communication performance, allowing larger problems to be solved than is possible on individual systems.
MPI_Connect communication mode changes
– These allow applications spread across systems to make more flexible use of non-blocking messaging, which increases overall communication performance, without application developers having to code around potential effects of using MPI_Connect.

Programming Environment & Training (PET) Sep 99: DoD Benefits
File handling and management
– Allows simplified handling of files/data sets when executing across multiple systems. Allows users to keep central repositories of data without having to collect files from multiple locations when running at non-local sites.
MPI-2 IO support
– Together with the file handling utilities, this allows users to access the new parallel IO subsystems without hindrance, i.e. just because you use MPI_Connect does not mean you should not use parallel IO!

Programming Environment & Training (PET) Sep 99: Changes
Moved away from RCDS to a custom name service.
Pre-fetching is only possible if the user runs a script that uploads file and partition information into the file handling daemon in advance of queuing the computational jobs.

Programming Environment & Training (PET) Sep 99: Additional Comments
External MPI performance
– Last year's review reported very poor bandwidth between MSRC sites when using MPI_Connect. Recent tests (August 1999) show this is no longer the case. All tests were performed between origin.wes.hpc.mil and other SGI Origins at ARL or ASC.
– CEWES to ASC (hpc03.asc.hpc.mil): Mbytes/sec
– CEWES to ARL (adele1.arl.hpc.mil): Mbytes/sec
Internal MPI performance under MPI_Connect
– CEWES (origin): 77.2 Mbytes/sec
– ASC (hpc03): 77.4 Mbytes/sec
– ARL (adele1): 98.0 Mbytes/sec (faster, newer machine)

Programming Environment & Training (PET) Sep 99: Additional Comments
Other than its use as part of the DoD MSRC CEWES HPC challenge at SC98, MPI_Connect was recently used for a Challenge project at a DOE ASCI site. The computation involved over 5800 CPUs on a system of linear equations, accounting for almost 35,000 CPU hours in a single run, thus proving its stability and low performance overheads compared with other competing meta-computing middleware.

Programming Environment & Training (PET) Sep 99: Additional Comments
Need help from on-site leads with locating users who actually use Parallel IO from within MPI applications.
– Many users agree that Parallel IO is important, but few actually use it explicitly in their applications.

Programming Environment & Training (PET) Sep 99: Conclusions
MPI_Connect is still on target to meet its milestones.
MPI_Connect has improved the communication between external systems and no longer needs users to run PVM.
The file management tools and new calls allow very simple handling of files across systems, in a natural way, accessible directly from within applications.
This support also complements Parallel IO rather than hindering its adoption and use.