Pilot Factory - a Condor-based System for Scalable Pilot Job Generation in the Panda WMS Framework
Po-Hsiang Chiu, Maxim Potekhin
Brookhaven National Laboratory, Upton, NY 11973, USA

[Poster diagrams. Simplified view of the core Panda architecture: job submission interfaces (regional production, distributed analysis), task buffer (job queue), brokerage (policy, priority), job dispatcher, pilot scheduler, site capability service, and remote sites A/B/C, each with a CE where pilots are launched via Condor, PBS, LSF, etc. Pilot Factory components: tarball server (for Condor installation), site headnode running a schedd glidein (a special-purpose Condor pool), worker nodes, and the Panda server with its database.]

Panda's Pilot Framework for Workload Management
Workload jobs are assigned to successfully activated and validated pilot jobs (lightweight processes which probe the environment and act as a 'smart wrapper' for payload jobs), based on Panda-managed brokerage criteria. This 'late binding' of workload jobs to processing slots prevents latencies and failure modes in slot acquisition from impacting the jobs, and maximizes the flexibility of job allocation to resources based on the dynamic status of processing facilities and job priorities. The pilot also encapsulates the complex, heterogeneous environments and interfaces of the grids and facilities on which Panda operates.

Job Submission
Jobs are submitted via a client interface (written in Python), through which job sets, associated datasets, input/output files, etc. can be defined. This client interface was used to implement Panda front ends for both production and distributed analysis. Jobs received by the Panda server are placed in the job queue, and the brokerage module operates on this queue to prioritize and assign work based on job type, available resources, data location and other criteria.

Submission of Pilot Jobs
Panda makes extensive use of Condor (particularly Condor-G) as a pilot job submission infrastructure of proven capability and reliability. Other submission methods (such as PBS) are also supported.

Other principal design features
- Through a system-wide job database, a comprehensive and coherent view of the system and of job execution is afforded to the users.
- Integrated data management is based on the concept of 'datasets' (collections of files); movement of datasets for processing and archiving is built into the Panda workflow.
- Asynchronous and automated pre-staging of input data minimizes data transport latencies.
- Integration of site resources is made easy, with minimum site requirements consisting of being able to accept pilot jobs and to allow outbound HTTP traffic from the worker nodes.
- Panda system security is based on standard grid tools and approaches, such as X.509 certificate proxy-based authentication and authorization.

Motivations for the Pilot Factory
Pilot job submission via Condor-G is no different from submission of any other kind of job. With the increasing demand for production and distributed analysis jobs, more pilot jobs need to be submitted, which leads to heavier traffic through the GRAM gatekeeper. This presents a potential scalability problem for the target site's headnode. One way of circumventing it is to implement local pilot submission on the headnode itself, thus significantly reducing the load on the gatekeeper.

Design considerations
To operate efficiently, the personnel running the Pilot Scheduler must have a way to monitor the submitted pilots and their output.
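To make the 'late binding' described under Panda's Pilot Framework concrete, the sketch below shows the skeleton of a pilot acting as a smart wrapper: it probes its environment, asks a dispatcher for a payload job over HTTP, and runs whatever it receives. This is a minimal illustration only; the dispatcher URL and the request/reply field names are hypothetical placeholders, not Panda's actual protocol.

# Minimal, illustrative sketch of a late-binding pilot ("smart wrapper").
# The dispatcher URL and the request/reply field names are hypothetical
# placeholders, not Panda's actual dispatcher protocol.
import json
import platform
import shutil
import subprocess
import urllib.parse
import urllib.request

DISPATCHER = "https://panda.example.org/server/panda/getJob"  # hypothetical endpoint


def probe_environment():
    """Collect basic facts about the worker node before asking for work."""
    return {
        "node": platform.node(),
        "arch": platform.machine(),
        "free_disk_gb": shutil.disk_usage("/tmp").free // 2**30,
    }


def request_job(site):
    """Ask the dispatcher for a workload job; return None if nothing is queued."""
    payload = {"siteName": site, **probe_environment()}
    data = urllib.parse.urlencode(payload).encode()
    with urllib.request.urlopen(DISPATCHER, data=data, timeout=60) as resp:
        reply = json.loads(resp.read())
    return reply if reply.get("StatusCode") == 0 else None  # assumed reply format


def run_payload(job):
    """Run the payload command shipped with the job description."""
    return subprocess.call(job["command"], shell=True)  # assumed field name


if __name__ == "__main__":
    job = request_job(site="EXAMPLE_SITE")
    if job is None:
        print("no work assigned; pilot exits without holding the slot")
    else:
        print("payload finished with return code", run_payload(job))

If no job is assigned, the pilot simply exits, which is why latencies and failures in acquiring processing slots never attach to a particular workload job.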
The new submission approach that satisfies this monitoring requirement was inspired by Condor-C, an alternative job submission mechanism that allows jobs to be "forwarded" from one Condor schedd (i.e. a job request agent) to another. In order to use Condor-C, we would still need a communication gateway on the headnode itself to relay jobs directly to the native Condor system, and this gateway also needs to be flexible enough to interface with other batch systems at sites that do not employ Condor for job management. The schedd-based glidein was conceived to support these features. Using schedd-based glideins, the Pilot Factory removes the need to install a full pilot submission system on the headnode, leaving only a thin layer - the glidein schedd - to redirect pilots to the native batch system. Once a glidein schedd is installed and running, it can be used exactly like a local schedd; pilots submitted through it behave almost identically to those submitted within a local Condor pool, resembling Condor's vanilla universe jobs.

The Pilot Factory is thus a new pilot-generating system that uses a dynamic job queue as a gateway to dispatch pilots from within the remote site's perimeter. The factory first deploys glidein schedds to the head nodes of the participating sites; the Pilot Generator then submits pilots directly to these glideins, from which the pilots are forwarded to the native batch system. The Pilot Factory consists of three major components: (1) a glidein launcher, responsible for the dynamic deployment of glideins to eligible sites; (2) a glidein monitor that detects occasional failures due to a site's temporary downtime and, when one occurs, invokes the glidein launcher to perform another glidein deployment; and (3) a pilot generator that disseminates pilots through the schedd glideins running on remote resources.

[Poster diagram: the Pilot Generator (scheduler) with its own schedd, receiving external requests and submitting pilots through a Condor interface to the local batch system.]

Pilot Factory Components and Operation
The current implementation of the Pilot Factory consists of three major components:
1. A glidein launcher that dynamically installs and configures a set of schedd-related Condor daemons on a Globus-controlled headnode.
2. A glidein monitor. Both it and the glidein launcher are coupled to the schedd glidein script, which is based on the existing condor_glidein script and has very similar usage and functionality, but adds a few options to aid remote installation.
3. A pilot submitter that performs the following tasks:
   a) configure a proper Condor submit file for pilot jobs;
   b) submit pilot jobs with regulated control of the submission rate;
   c) obtain the outputs and/or other byproducts of pilot jobs upon their completion.

More information and examples can be found here:

Pilot Factory deployment status
Over the past few months, an instance of the Pilot Factory has been run in test mode in conjunction with the BNL instance of the Panda server.
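As an illustration of the pilot submitter's tasks (a)-(c), the following is a minimal sketch assuming Condor-C style forwarding: pilots are submitted to the local schedd as grid-universe jobs whose grid_resource points at a remote glidein schedd, in small batches paced by a fixed delay. The schedd and collector names, the pilot executable, and the rate values are placeholders, not the Pilot Factory's actual configuration.

# Sketch of the pilot submitter's tasks (a)-(c): write a grid-universe
# (Condor-C) submit file that forwards pilots to a remote glidein schedd,
# submit in small batches, and pace the submissions.  The schedd/collector
# names, pilot executable, and rate values are placeholders.
import subprocess
import textwrap
import time

GLIDEIN_SCHEDD = "schedd_glidein@gridgk01.racf.bnl.gov"  # placeholder schedd name
COLLECTOR = "gridgk10.racf.bnl.gov:9660"                 # placeholder collector


def write_submit_file(path, n_pilots):
    """(a) Produce a Condor submit description for pilot jobs."""
    submit = textwrap.dedent(f"""\
        universe                = grid
        grid_resource           = condor {GLIDEIN_SCHEDD} {COLLECTOR}
        executable              = myPilot
        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        output                  = pilot_$(Cluster).$(Process).out
        error                   = pilot_$(Cluster).$(Process).err
        log                     = pilot_factory.log
        queue {n_pilots}
        """)
    with open(path, "w") as f:
        f.write(submit)


def submit_pilots(batches, batch_size, pause_seconds):
    """(b) Submit pilots at a regulated rate; Condor brings the pilot
    outputs (c) back alongside the submit file as each pilot completes."""
    for _ in range(batches):
        write_submit_file("pilot.sub", batch_size)
        subprocess.check_call(["condor_submit", "pilot.sub"])
        time.sleep(pause_seconds)


if __name__ == "__main__":
    submit_pilots(batches=5, batch_size=2, pause_seconds=60)

The grid_resource line is what effects the forwarding: the local schedd hands each pilot to the glidein schedd named there, which in turn passes it on to the site's native batch system.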
The status of the pilots generated by the Pilot Factory can be checked using the Panda Monitor.

Examples of commands used with the Pilot Factory

Starting a number of glideins on a few sites:
condor_glidein -count 1 -arch i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk01.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
condor_glidein -count 1 -arch i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk02.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
condor_glidein -count 3 -arch i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork nostos.cs.wisc.edu/jobmanager-fork -type schedd -forcesetup

Check the list of available glideins:
condor_status -schedd -c 'is_glidein=?=true'

Monitor glideins using the Pilot Scheduler:
./pilotScheduler.py --glidein-monitor

Submit pilots using the Pilot Scheduler:
./pilotScheduler.py --queue=bnl-glidein-cc --nqueue=5 --pilot=myPilot --pandasite=mySite --server-schedd=gridgk10.racf.bnl.gov --server-collector=gridgk10.racf.bnl.gov:9660
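The glidein monitor (component 2 above) can be pictured as a loop around the same two commands: poll condor_status for schedd glideins and re-run condor_glidein for any site whose glidein has dropped out. The sketch below is illustrative only; the site list, the polling interval, and the recovery logic are assumptions, not the monitor's actual implementation.

# Sketch of the glidein monitor's loop: poll the collector for schedd
# glideins (as in the condor_status example above) and re-deploy a glidein
# for any site whose schedd has disappeared.  The site list and polling
# interval are illustrative.
import subprocess
import time

SITES = {
    "gridgk01.racf.bnl.gov": "gridgk01.racf.bnl.gov/jobmanager-fork",
    "gridgk02.racf.bnl.gov": "gridgk02.racf.bnl.gov/jobmanager-fork",
}


def running_glideins():
    """Return the advertised names of schedds that identify as glideins."""
    out = subprocess.run(
        ["condor_status", "-schedd", "-constraint", "is_glidein=?=true",
         "-format", "%s\n", "Name"],
        capture_output=True, text=True, check=True,
    ).stdout
    return set(out.split())


def relaunch(contact):
    """Re-deploy a schedd glidein (arguments follow the examples above)."""
    subprocess.check_call(
        ["condor_glidein", "-count", "1", "-arch", "i686-pc-Linux-2.4",
         "-setup_jobmanager=jobmanager-fork", contact, "-type", "schedd"])


def monitor(poll_seconds=300):
    """Loop forever, re-launching glideins for sites that dropped out."""
    while True:
        alive = running_glideins()
        for host, contact in SITES.items():
            if not any(host in name for name in alive):
                relaunch(contact)
        time.sleep(poll_seconds)


if __name__ == "__main__":
    monitor()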