Building Coupled Parallel and Distributed Scientific Simulations with InterComm
Alan Sussman
Department of Computer Science & Institute for Advanced Computer Studies

A Simple Example (MxN coupling)
- A 2-D wave equation solved by two coupled parallel applications, with InterComm handling the data exchange at the borders (both transfer and control)
- Parallel application LEFTSIDE (Fortran90, MPI-based) on M=4 processors (p1-p4)
- Parallel application RIGHTSIDE (C++, PVM-based) on N=2 processors (p1, p2)
- Results are also sent to a visualization station
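For reference, a minimal serial sketch in C of the kind of computation each side performs: one explicit time step of the 2-D wave equation. The slide does not include the actual model code; the grid size, coefficient, initial condition, and boundary handling below are illustrative assumptions only.

    #include <stdio.h>

    #define NX 64   /* illustrative grid size */
    #define NY 64

    /* One explicit time step of the 2-D wave equation on the grid interior.
       u_prev, u_curr, u_next hold the field at times t-1, t, t+1; c2dt2 folds
       together the wave speed, time step, and grid spacing. */
    static void wave_step(double u_prev[NX][NY], double u_curr[NX][NY],
                          double u_next[NX][NY], double c2dt2)
    {
        for (int i = 1; i < NX - 1; i++) {
            for (int j = 1; j < NY - 1; j++) {
                double lap = u_curr[i-1][j] + u_curr[i+1][j]
                           + u_curr[i][j-1] + u_curr[i][j+1]
                           - 4.0 * u_curr[i][j];
                u_next[i][j] = 2.0 * u_curr[i][j] - u_prev[i][j] + c2dt2 * lap;
            }
        }
    }

    int main(void)
    {
        static double prev[NX][NY], curr[NX][NY], next[NX][NY];
        curr[NX/2][NY/2] = 1.0;                 /* an initial disturbance */
        for (int t = 0; t < 100; t++) {
            wave_step(prev, curr, next, 0.1);
            /* rotate time levels: prev <- curr, curr <- next */
            for (int i = 0; i < NX; i++)
                for (int j = 0; j < NY; j++) {
                    prev[i][j] = curr[i][j];
                    curr[i][j] = next[i][j];
                }
        }
        printf("center value after 100 steps: %g\n", curr[NX/2][NY/2]);
        return 0;
    }

When each side runs this kind of update on only its own portion of the grid, the rows or columns along the shared border are the array sections that InterComm transfers between LEFTSIDE and RIGHTSIDE.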

The Problem
- Coupling codes (components), not models
- Codes written in different languages: Fortran (77, 95), C, C++/P++, …
- Both parallel (shared or distributed memory) and sequential codes
- Codes may be run on the same or different resources
  - One or more parallel machines or clusters (the Grid); we concentrate on this today

Space Weather Prediction
- Our driving application: production of an ever-improving series of comprehensive scientific models of the Solar-Terrestrial environment
- Codes model both large-scale and microscale structures and dynamics of the Sun-Earth system

What is InterComm?
- A programming environment and runtime library:
  - For performing efficient, direct data transfers between data structures (multi-dimensional arrays) in different programs
  - For controlling when data transfers occur
  - For deploying multiple coupled programs in a Grid environment
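To make the data-transfer bullet concrete, here is a small standalone C sketch of the kind of low-level operation such a library performs: gathering a rectangular section of one multi-dimensional array into a contiguous buffer and scattering it into a differently shaped array on the other side. All names, sizes, and the row-major layout are illustrative assumptions; the real library additionally handles distributed arrays, the inter-program communication itself, and planning which process sends what to whom.

    #include <stdio.h>
    #include <string.h>

    /* A rectangular section of a 2-D array: rows [r0, r1) and columns [c0, c1). */
    typedef struct { int r0, r1, c0, c1; } section_t;

    /* Gather a section of a row-major array with ncols columns into a contiguous
       buffer (the kind of packing done before sending a section). */
    static void pack_section(const double *a, int ncols, section_t s, double *buf)
    {
        int k = 0;
        for (int i = s.r0; i < s.r1; i++)
            for (int j = s.c0; j < s.c1; j++)
                buf[k++] = a[i * ncols + j];
    }

    /* Scatter a contiguous buffer into a section of another array
       (what the receiving side does after the transfer). */
    static void unpack_section(double *a, int ncols, section_t s, const double *buf)
    {
        int k = 0;
        for (int i = s.r0; i < s.r1; i++)
            for (int j = s.c0; j < s.c1; j++)
                a[i * ncols + j] = buf[k++];
    }

    int main(void)
    {
        double src[8 * 8], dst[4 * 16], buf[2 * 4];
        for (int i = 0; i < 64; i++) src[i] = i;
        memset(dst, 0, sizeof dst);

        /* Move a 2x4 section from an 8x8 "exporting" array into a 4x16
           "importing" array that uses a different shape and placement. */
        section_t from = { 3, 5, 4, 8 };   /* rows 3-4, cols 4-7 of the 8x8 array  */
        section_t to   = { 0, 2, 0, 4 };   /* rows 0-1, cols 0-3 of the 4x16 array */
        pack_section(src, 8, from, buf);
        unpack_section(dst, 16, to, buf);

        printf("dst[0][0] = %g (was src[3][4] = %g)\n", dst[0], src[3 * 8 + 4]);
        return 0;
    }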

Deploying Components
- Infrastructure for deploying programs and managing interactions between them:
  - Starting each of the models on the desired Grid resources
  - Connecting the models together via the InterComm framework
  - Models communicate via the import and export calls

Motivation
- The developer has to deal with:
  - Multiple logons
  - Manual resource discovery and allocation
  - Application run-time requirements
- The process for launching complex applications with multiple components is:
  - Repetitive
  - Time-consuming
  - Error-prone

Deploying Components – HPCALE
- A single environment for running coupled applications in the high-performance, distributed, heterogeneous Grid environment
- We must provide:
  - Resource discovery: find resources that can run the job, and automate how each model code finds the other model codes it should be coupled to
  - Resource allocation: schedule the jobs to run on the resources, without the user dealing with each one directly
  - Application execution: start every component appropriately and monitor its execution
- Built on top of basic Web and Grid services (XML, SOAP, Globus, PBS, LoadLeveler, LSF, etc.)

Resource Discovery
[Diagram: an SSL-enabled web server fronts a resource information repository. The clusters (Cluster1, Cluster2, Cluster3) register their resources and software and update the resource information; clients discover resources through the same service. The exchanged documents include XSD, XRD, and XJD files.]

Resource Allocation
[Diagram: the resource allocation server receives an XJD description, works with Cluster1, Cluster2, and Cluster3, and reports success or failure.]
1. Assign a job ID
2. Create scratch directories
3. Create/manage application-specific runtime information
4. Allocate requests (with retry):
   - Send a script to the local resource manager
   - Lock file creation

Application Launching
[Diagram: the launch server takes an XJD description, coordinates with Cluster1, Cluster2, Cluster3 and a file transfer service, and reports success or failure.]
1. Transfer input files to the resources (put)
2. Deal with the runtime environment (e.g., PVM)
3. Split a single resource across components
4. Launch using the appropriate launching method
5. Transfer created files back to the user's machine (get)

Current HPCALE implementation
- Discovery: simple resource discovery service implemented
  - Resources register themselves with the service, including what codes they can run
- Resource allocation: scripts implemented to parse XJD files and start applications on the desired resources
  - Works with local schedulers, e.g., PBS, LoadLeveler
- Job execution: startup service implemented, both for machines without schedulers and, with schedulers, for PBS (Linux clusters) and LoadLeveler (IBM SP, p Series)
  - Works via ssh and remote execution; can start up from anywhere, such as your home workstation
  - Need better authentication; looking at GSI certificates
- Building connections is automated through the startup service passing configuration parameters that are interpreted by the InterComm initialization code

Current status
- First InterComm release (data movement) in April 2005
  - Source code and documentation included
  - Release includes a test suite and sample applications for multiple platforms (Linux, AIX, Solaris, …)
  - Support for F77, F95, C, C++/P++
- InterComm 1.5 released in Summer 2006
  - New version of the library with export/import support and broadcast of scalars/arrays between programs
- HPCALE adds support for automated launching and resource allocation, plus a tool for building XJD files for application runs
  - An HPCALE release is imminent (documentation still not complete)

The team at Maryland
- Il-Chul Yoon (graduate student): HPCALE
- Norman Lo (staff programmer): InterComm testing, debugging, and packaging; ESMF integration
- Shang-Chieh Wu (graduate student): InterComm control design and implementation

End of Slides

Data Transfers in InterComm
- Interact with data-parallel (SPMD) code used in separate programs (including MPI)
- Exchange data between separate (sequential or parallel) programs running on different resources (parallel machines or clusters)
- Manage data transfers between different data structures in the same application
- The CCA Forum refers to this as the MxN problem
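A toy illustration of the MxN problem for a 1-D block-distributed array, in C: with the array split into M blocks on the source side and N blocks on the destination side, each source process must send the overlap of its block with every destination block. The sketch below only computes that communication plan (who sends which index range to whom) for the simplest possible distribution; the array length, block distribution, and exact-overlap rule are assumptions for illustration, not how InterComm represents distributions internally.

    #include <stdio.h>

    /* Index range [lo, hi) owned by process `rank` when `n` elements are split
       into `nprocs` nearly equal consecutive blocks. */
    static void block_range(int n, int nprocs, int rank, int *lo, int *hi)
    {
        int base = n / nprocs, rem = n % nprocs;
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        const int n = 100, M = 4, N = 2;   /* mirrors the M=4, N=2 example */

        /* For every (source, destination) pair, the overlap of their blocks is
           the index range the source process must send to that destination. */
        for (int src = 0; src < M; src++) {
            int slo, shi;
            block_range(n, M, src, &slo, &shi);
            for (int dst = 0; dst < N; dst++) {
                int dlo, dhi;
                block_range(n, N, dst, &dlo, &dhi);
                int lo = slo > dlo ? slo : dlo;
                int hi = shi < dhi ? shi : dhi;
                if (lo < hi)
                    printf("source %d sends [%d, %d) to destination %d\n",
                           src, lo, hi, dst);
            }
        }
        return 0;
    }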

InterComm Goals
- One main goal is minimal modification to existing programs
  - In scientific computing there is plenty of legacy code
  - Computational scientists want to solve their problem, not worry about plumbing
- The other main goal is low overhead and efficient data transfers
  - Low overhead in planning the data transfers
  - Efficient data transfers via customized all-to-all message passing between source and destination processes

Coupling OUTSIDE components
- Separate the coupling information from the participating components
  - Maintainability: components can be developed/upgraded individually
  - Flexibility: participants/components can be changed easily
  - Functionality: supports numerical algorithms or visualizations that work on variable-sized time intervals
- Matching information is specified separately by the application integrator
- Matches are made at runtime via simulation time stamps

Controlling Data Transfers
- A flexible method for specifying when data should be moved
  - Based on matching export and import calls in different programs via timestamps
  - Transfer decisions take place based on a separate coordination specification
- The coordination specification can also be used to deploy model codes and grid translation/interpolation routines (how many codes to run, and where to run them)

Example
- The simulation exports every time step; the visualization imports every 2nd time step
- [Diagram: a simulation cluster feeding a visualization station]
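A single-process toy in C that mimics this example: the "simulation" records an export at every time step, while the "visualization" only imports every 2nd time step and looks for a matching export. In InterComm the two sides are separate programs and the matching behavior comes from the coordination specification; the exact-match rule, buffer, and names below are illustrative assumptions only.

    #include <stdio.h>

    #define NSTEPS 10

    int main(void)
    {
        int exported[NSTEPS], n_exported = 0;

        /* "Simulation": exports (here, just records) a timestamp every step. */
        for (int t = 1; t <= NSTEPS; t++)
            exported[n_exported++] = t;

        /* "Visualization": imports only every 2nd time step and checks whether
           an export with that timestamp exists (exact match in this toy). */
        for (int t = 2; t <= NSTEPS; t += 2) {
            int matched = 0;
            for (int k = 0; k < n_exported; k++)
                if (exported[k] == t) { matched = 1; break; }
            printf("import at t=%d: %s\n", t, matched ? "matched an export" : "no match");
        }
        return 0;
    }

Exports at the odd time steps simply have no matching import; because this policy lives in the separate coordination specification rather than in either code, it can be changed without touching the simulation or the visualization.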

Separate codes from matching

Exporter Ap0:
    define region Sr12
    define region Sr4
    define region Sr5
    ...
    Do t = 1, N, Step0
      ...             // computation jobs
      export(Sr12, t)
      export(Sr4, t)
      export(Sr5, t)
    EndDo

Importer Ap1:
    define region Sr0
    ...
    Do t = 1, M, Step1
      import(Sr0, t)
      ...             // computation jobs
    EndDo

[Diagram: exported regions Ap0.Sr12, Ap0.Sr4, and Ap0.Sr5 are matched to imported regions Ap1.Sr0, Ap2.Sr0, and Ap4.Sr0.]

Configuration file:
    #
    Ap0 cluster0 /bin/Ap
    Ap1 cluster1 /bin/Ap
    Ap2 cluster2 /bin/Ap
    Ap4 cluster4 /bin/Ap4 4
    #
    Ap0.Sr12 Ap1.Sr0 REGL 0.05
    Ap0.Sr12 Ap2.Sr0 REGU 0.1
    Ap0.Sr4 Ap4.Sr0 REG 1.0
    #

Plumbing
- Bindings for C, C++/P++, Fortran77, Fortran95
- Currently, external message passing and program interconnection are via PVM
- Each model/program can do whatever it wants internally (MPI, pthreads, sockets, …) and can start up by whatever mechanism it wants