Joshua Goehner and Dorian Arnold, Department of Computer Science, University of New Mexico (in collaboration with LLNL and U. of Wisconsin)

A Little Preaching to the Choir
• Key statistics from the Top500 list (Nov. 2011):
◦ 149/500 systems (30%) have more than 8,192 cores; in 2006 this number was 12/500 (2.4%)
◦ 7 systems: more than 64K cores
◦ 6 systems: between 64K and 128K cores
◦ 3 systems: more than 128K cores
• Later this year: LLNL's Sequoia with 1.6M cores
• 2018?: exascale systems with 10^7 or 10^8 cores
Software systems must scale as well!

Software Infrastructure Bootstrapping
Given a node allocation, start the infrastructure's composite processes and propagate the necessary startup information (sketched below).
• Before bootstrapping:
◦ Program image on a storage device
◦ Set of (allocated) compute nodes
• After bootstrapping:
◦ Application processes started on the compute nodes
◦ Application's configuration complete; ready for primary operation
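
To make the before/after picture concrete, the sketch below (hypothetical C types, not part of LIBI, MRNet, or any real API) restates what a bootstrapper is given and what it must produce:

    /* Hypothetical types illustrating the bootstrapping contract; they
     * only restate the slide above and belong to no real library. */
    #include <stddef.h>
    #include <sys/types.h>

    struct allocation {             /* before bootstrapping                  */
        const char  *image_path;    /* program image on a storage device     */
        const char **hosts;         /* the allocated compute nodes           */
        size_t       num_hosts;
    };

    struct bootstrapped_infra {     /* after bootstrapping                   */
        pid_t  *pids;               /* processes started on the compute nodes */
        size_t  num_procs;
        int     config_complete;    /* startup info propagated; ready for
                                       primary operation                      */
    };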

“I gotta get my application up and running!”

(1) “First, I start all the processes on the appropriate nodes”

(2) “Next, I must disseminate some initialization information”

Bootstrapping is complete when the infrastructure is ready for steady-state usage.

Basic Bootstrapping
• Sequential instantiation (see the sketch below)
◦ e.g., rsh- or ssh-based
• Point-to-point propagation
70 seconds to start 2K processes!
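
As a point of reference, the sequential rsh/ssh scheme boils down to a loop like the one below. This is only a sketch: the hosts array, binary path, and the --callback flag are assumptions, not any real launcher's interface.

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of sequential, ssh-based bootstrapping: one ssh per target
     * node, issued one after another from a single front-end process. */
    static void launch_sequentially(const char **hosts, size_t nhosts,
                                    const char *exe, const char *fe_addr)
    {
        char cmd[1024];
        for (size_t i = 0; i < nhosts; i++) {
            /* each new process later contacts fe_addr point-to-point for
             * its initialization information (hypothetical --callback flag) */
            snprintf(cmd, sizeof(cmd), "ssh %s %s --callback %s &",
                     hosts[i], exe, fe_addr);
            if (system(cmd) != 0)
                fprintf(stderr, "launch failed on %s\n", hosts[i]);
        }
    }

Startup time grows linearly with the number of processes, which is why 2K processes already take on the order of a minute.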

More Scalable Approaches
• Infrastructure-specific, scalable mechanisms
◦ Still limited by sequential operations
◦ Not portable to other infrastructures
• Using high-performance resource managers
◦ Myriad interfaces
◦ No communication facilities
• Generic bootstrapping infrastructures
◦ LaunchMON: targets tools, with a wrapper for existing resource managers
◦ ScELA: sequentially launches agents that create local processes

MRNet Sequential-ish Bootstrapping
• Parent creates children (sketched below)
◦ Local: fork()/exec()
◦ Remote: rsh-based mechanism
• Integrated instantiation and information propagation
• MRNet's "standard" launch mechanism
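
A rough sketch of that parent-creates-children pattern follows; create_child() is a hypothetical helper (the actual MRNet code differs, and the remote branch omits argument forwarding for brevity):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Create one child process: fork()/exec() when the child belongs on
     * this host, an ssh command otherwise.  Sketch only. */
    static void create_child(const char *child_host, const char *local_host,
                             const char *exe, char *const argv[])
    {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return;
        }
        if (pid > 0)
            return;                              /* parent continues launching */

        if (strcmp(child_host, local_host) == 0)
            execv(exe, argv);                    /* local child */
        else
            execlp("ssh", "ssh", child_host, exe, (char *)NULL);  /* remote child */

        perror("exec");
        _exit(1);
    }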

MRNet XT (Hybrid) Process Launch
• Bulk-launch one process per node
• That process launches its collocated processes (sketched below)
• Information is propagated in a similar fashion
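
In the same spirit, a minimal sketch of the node-level side of a hybrid launch; run_as_node_agent() and procs_per_node are assumptions, not MRNet or LIBI API:

    #include <stdio.h>
    #include <unistd.h>

    /* The bulk launcher (e.g. SLURM) starts exactly one of these agents
     * per node; the agent then creates its collocated processes locally. */
    static void run_as_node_agent(const char *exe, char *const argv[],
                                  int procs_per_node)
    {
        for (int i = 0; i < procs_per_node; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                execv(exe, argv);        /* collocated process */
                perror("exec");
                _exit(1);
            } else if (pid < 0) {
                perror("fork");
            }
        }
        /* startup information then flows through the per-node agents in a
         * similar two-level fashion */
    }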

LIBI Approach
• LIBI: Lightweight Infrastructure-Bootstrapping Infrastructure
◦ Name is no longer a work in progress
◦ Generic service for scalable distributed software infrastructure bootstrapping
• Process launch
• Scalable, low-level collectives
[Architecture diagram: LIBI sits between large-scale distributed software (debuggers, system monitors, applications, performance analyzers, overlay networks, MPI) and the underlying job launchers (rsh/ssh, SLURM, OpenRTE, ALPS, LaunchMON) and communication services (COBO).]

LIBI API
• session: the set of processes (to be) deployed
◦ The session master manages the other members
◦ The session front-end interacts with the session master
◦ LIBI currently supports only master/member communication
• host-distribution: where to create processes
• process-distribution: how/where to create processes

LIBI API (cont'd)
• launch(process-distribution-array)
◦ Instantiate processes according to the input distributions
• [send|recv]UsrData(session-id, msg)
◦ Communicate between the front-end and the session master
• broadcast(), scatter(), gather(), barrier()
◦ Communicate between the session master and the members

Example LIBI Front-end

    front_end( ) {
        LIBI_fe_init();
        LIBI_fe_createSession(sess);

        proc_dist_req_t pd;
        pd.sessionHandle = sess;
        pd.proc_path = get_ExePath();
        pd.proc_argv = get_ProgArgs();
        pd.hd = get_HostDistribution();
        LIBI_fe_launch(pd);

        // test broadcast and barrier
        LIBI_fe_sendUsrData(sess, msg, len);
        LIBI_fe_recvUsrData(sess, msg, len);

        // test scatter and gather
        LIBI_fe_sendUsrData(sess, msg, len);
        LIBI_fe_recvUsrData(sess, msg, len);

        return 0;
    }

Example LIBI-launched Application

    session_member() {
        LIBI_init();

        // test broadcast and barrier
        LIBI_recvUsrData(msg, msg_length);
        LIBI_broadcast(msg, msg_length);
        LIBI_barrier();
        LIBI_sendUsrData(msg, msg_length);

        // test scatter and gather
        LIBI_recvUsrData(msg, msg_length);
        LIBI_scatter(msg, sizeof(rcvmsg), rcvmsg);
        LIBI_gather(sndmsg, sizeof(sndmsg), msg);
        LIBI_sendUsrData(msg, msg_length);

        LIBI_finalize();
    }

LIBI Implementation Status
• LaunchMON-based runtime
◦ Tested SLURM- and rsh-based launching
◦ COBO PMGR service
[Architecture diagram repeated from the "LIBI Approach" slide.]

LIBI Early Evaluation

LIBI Microbenchmark Results

MRNet/LIBI Integration
• MRNet uses LIBI to launch all MRNet processes
◦ Parse the topology file, then set up and call LIBI_launch()
• The session front-end gathers/scatters startup information (sketched below)
◦ Parent listening socket (IP/port)
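
Pieced together from the LIBI API slides above, the front-end side of that integration plausibly looks like the sketch below. Only the LIBI_fe_* calls shown earlier are used; parse_topology(), pack_parent_sockets(), and the libi_sess_handle_t type name are assumptions for illustration.

    /* Sketch of an MRNet-over-LIBI front-end; helpers and the
     * session-handle type name are hypothetical. */
    int mrnet_libi_launch(const char *topology_file)
    {
        libi_sess_handle_t sess;
        proc_dist_req_t pd;
        char *msg;
        int len;

        LIBI_fe_init();
        LIBI_fe_createSession(sess);

        /* describe the processes named by the topology file */
        pd.sessionHandle = sess;
        pd.proc_path = get_ExePath();
        pd.proc_argv = get_ProgArgs();
        pd.hd = parse_topology(topology_file);   /* hypothetical helper */
        LIBI_fe_launch(pd);

        /* scatter each child's parent listening socket (IP/port) ... */
        pack_parent_sockets(&msg, &len);          /* hypothetical helper */
        LIBI_fe_sendUsrData(sess, msg, len);

        /* ... and gather back what the session members report */
        LIBI_fe_recvUsrData(sess, msg, len);
        return 0;
    }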

Some Future Work
• STAT-over-MRNet-over-LIBI performance results
• Basic, scalable rsh-based mechanism
• Hybrid (XT-like) launch approach
• Mechanisms to alleviate filesystem contention
◦ Like our scalable binary relocation service
• More flexible process and host distributions
◦ Instantiating different images in the same session
◦ Integrating allocation and launching
• Integrated scalable communication infrastructure