GridShell/Condor: A Virtual Login Shell for the NSF TeraGrid (How do you run a million jobs on the NSF TeraGrid?) The University of Texas at Austin.

The NSF TeraGrid links nine resource provider sites via a high-speed 10/30 Gbps network backbone, providing, in aggregate, 20 teraflops of computing power, 1 petabyte of storage capacity, and high-end facilities for remote visualization of computing results. Its compute resources are a heterogeneous mix of clusters with different architectures and operating systems, running different workload management systems with different local configurations, different queues, and different job submission and run-time limits.

What is GridShell? GridShell extends TCSH and BASH to invoke interposition agents that perform explicit, as well as implicit, actions on behalf of the user. An agent persists for the lifetime of the action, or of the GridShell login session.
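
The slide's illustration is not reproduced in the transcript. Purely as a conceptual sketch (all names below are invented for illustration, and this is not GridShell's actual syntax), an interposition agent can be pictured as a wrapper that intercepts a user action and hands it to a background helper that outlives the single command:

    #!/bin/bash
    # Conceptual sketch only, not GridShell syntax: a wrapper intercepts the
    # user's action and forwards it to a background "agent" that keeps running
    # for the rest of the session. All names here are illustrative.

    AGENT_PIPE="${TMPDIR:-/tmp}/agent_pipe.$$"

    start_session_agent() {
        mkfifo "$AGENT_PIPE"
        # The agent loops for the whole session, acting on each request it receives.
        while read -r request < "$AGENT_PIPE"; do
            echo "[agent] acting on behalf of the user: $request"
        done &
    }

    run_via_agent() {
        # What the user effectively invokes; the real work is delegated to the
        # agent, which persists after this wrapper returns.
        echo "$*" > "$AGENT_PIPE"
    }

    start_session_agent
    run_via_agent ./my_simulation --input data.in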

Example – A site-independent script
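
The script itself appears only as an image in the original slides. As a hypothetical illustration of what site independence means in practice (no hard-coded site names, schedulers, queues, or absolute paths; the variable and file names below are assumptions, not taken from the slide), such a script might look like this:

    #!/bin/bash
    # Hypothetical sketch of a site-independent job script: nothing below names
    # a particular TeraGrid site, batch system, queue, or site-specific path,
    # so the same script can be shipped unchanged to any participating cluster.

    WORKDIR="${SCRATCH_DIR:-$PWD}/run.$$"   # SCRATCH_DIR: assumed, provided by the site environment
    NPROCS="${JOB_NPROCS:-16}"              # JOB_NPROCS: assumed, set by the login environment

    mkdir -p "$WORKDIR"
    cd "$WORKDIR" || exit 1

    # Stage the application and input relative to $HOME, never to an absolute,
    # site-specific path.
    cp "$HOME/apps/my_sim" "$HOME/inputs/params.dat" .

    ./my_sim --nprocs "$NPROCS" --input params.dat > my_sim.log 2>&1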

Motivation – Parametric Sweep Jobs are Common on the TeraGrid
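
As a brief illustration of this workload class (the executable and file names are placeholders), a parametric sweep is typically a single job description stamped out many times with a varying parameter. In standard Condor submit-description syntax, generated and submitted here from a shell script, that might look like:

    #!/bin/bash
    # Illustrative parametric sweep: one Condor submit description, many jobs.

    printf '%s\n' \
        'universe   = vanilla' \
        'executable = my_sim' \
        'arguments  = --param $(Process)' \
        'output     = out.$(Process)' \
        'error      = err.$(Process)' \
        'log        = sweep.log' \
        'queue 1000' > sweep.sub

    condor_submit sweep.sub   # 1000 independent jobs, one per value of $(Process)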

Motivation – Providing a persistent and robust environment across the Grid

Example – A virtual login session from a client
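
The session capture is only in the slide image. Purely as a hypothetical sketch of the workflow it depicts (the `vlogin` command name and the host argument are invented for illustration; `grid-proxy-init` and the condor_* tools are the real Globus and Condor commands):

    # Hypothetical sketch of a virtual login session from a client machine.
    grid-proxy-init                     # obtain a Grid proxy credential (Globus)
    vlogin example.site.teragrid.org    # hypothetical command starting the virtual login session
    # Inside the session, the usual Condor tools operate on the cross-site pool:
    condor_status
    condor_submit sweep.sub
    condor_q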

Features: GridShell/Condor on the TeraGrid automatically throttles condor_startd job submissions based on configurable local site policies; tolerates and recovers from transient faults in the WAN and on login nodes; balances condor_startd jobs between sites based on queue wait times; shunts condor_startd jobs away from temporarily faulty sites; and automatically renews grid proxy credentials across sites within the command environment. Current users: Caltech CMS and NVO. Over 70,000 jobs have been submitted through TeraGrid clusters to date.
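
The slides list the throttling feature but not its mechanism. As a rough sketch only (assuming a PBS-style local scheduler and a hypothetical glidein.sh wrapper that launches condor_startd on the nodes it is allocated), throttling amounts to never keeping more pilot submissions in the local queue than site policy allows:

    #!/bin/bash
    # Rough sketch of submission throttling, not the actual GridShell/Condor
    # agent code. Assumes PBS-style qstat/qsub and a hypothetical glidein.sh
    # batch script that starts condor_startd on its compute nodes.

    MAX_PENDING="${MAX_PENDING:-32}"   # configurable local site policy (assumed knob)

    pending_glideins() {
        # Count this user's queued (not yet running) pilot jobs at this site.
        qstat -u "$USER" 2>/dev/null | grep -c ' Q '
    }

    while true; do
        if [ "$(pending_glideins)" -lt "$MAX_PENDING" ]; then
            qsub glidein.sh            # submit one more condor_startd pilot
        fi
        sleep 60                       # re-check the local queue periodically
    done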

GridShell/Condor Process Architecture
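
The architecture diagram itself is not in the transcript. From the surrounding slides (agents on each front-end node maintain condor_startd submissions locally), the general glide-in pattern can be sketched as the batch job such an agent would submit; the resource request, paths, and configuration details below are assumptions:

    #!/bin/bash
    #PBS -l nodes=1
    #PBS -l walltime=12:00:00
    # Sketch (assumed details) of the batch job a front-end agent submits:
    # it starts a Condor daemon on the compute node so the node joins the
    # user's cross-site Condor pool for the duration of the allocation.

    # CONDOR_CONFIG points at a per-user configuration whose CONDOR_HOST names
    # the pool's central manager (file location assumed for this sketch).
    export CONDOR_CONFIG="$HOME/gridshell/condor_config"

    # Run in the foreground so the batch system accounts for the daemon's
    # lifetime; condor_master starts and supervises the condor_startd.
    exec condor_master -f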

Why GridShell/Condor? Scalability – the actual parametric job submission is done directly to the compute nodes rather than through the cluster's front-end node. Fault tolerance – agents at each cluster's front-end node maintain the condor_startd submissions locally, allowing transient WAN outages and periodic front-end node reboots to be handled independently, in isolation from the rest of the system. Usability – the entire Condor submission, monitoring, and control infrastructure is leveraged as a common job management environment for the user.
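
Concretely, leveraging the Condor submission, monitoring, and control infrastructure means the user keeps the standard Condor command set, for example:

    condor_submit sweep.sub   # submit work to the (now cross-site) pool
    condor_q                  # monitor the user's own jobs
    condor_status             # inspect the slots contributed by each site
    condor_rm 42              # remove job cluster 42 (example id)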

Condor Pool created using NCSA, SDSC, CACR and TACC resources on the TeraGrid

Condor pool created using NCSA, SDSC, CACR and TACC resources with pending job “balancing”