Arlington, Dec 7th 2006 Glidein Based WMS 1 A pilot-based (PULL) approach to the Grid An overview by Igor Sfiligoi.

Slides:



Advertisements
Similar presentations
Jaime Frey Computer Sciences Department University of Wisconsin-Madison OGF 19 Condor Software Forum Routing.
Advertisements

Current methods for negotiating firewalls for the Condor ® system Bruce Beckles (University of Cambridge Computing Service) Se-Chang Son (University of.
Building a secure Condor ® pool in an open academic environment Bruce Beckles University of Cambridge Computing Service.
Dan Bradley Computer Sciences Department University of Wisconsin-Madison Schedd On The Side.
Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.
First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova
Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova.
Resource Management Reading: “A Resource Management Architecture for Metacomputing Systems”
SCD FIFE Workshop - GlideinWMS Overview GlideinWMS Overview FIFE Workshop (June 04, 2013) - Parag Mhashilkar Why GlideinWMS? GlideinWMS Architecture Summary.
Workload Management WP Status and next steps Massimo Sgaravatto INFN Padova.
OSG Site Provide one or more of the following capabilities: – access to local computational resources using a batch queue – interactive access to local.
BaBar MC production BaBar MC production software VU (Amsterdam University) A lot of computers EDG testbed (NIKHEF) Jobs Results The simple question:
1 1 Vulnerability Assessment of Grid Software Jim Kupsch Associate Researcher, Dept. of Computer Sciences University of Wisconsin-Madison Condor Week 2006.
Campus Grids Report OSG Area Coordinator’s Meeting Dec 15, 2010 Dan Fraser (Derek Weitzel, Brian Bockelman)
Use of Condor on the Open Science Grid Chris Green, OSG User Group / FNAL Condor Week, April
Tarball server (for Condor installation) Site Headnode Worker Nodes Schedd glidein - special purpose Condor pool master DB Panda Server Pilot Factory -
Derek Wright Computer Sciences Department University of Wisconsin-Madison Condor and MPI Paradyn/Condor.
Condor Week 2004 The use of Condor at the CDF Analysis Farm Presented by Sfiligoi Igor on behalf of the CAF group.
GLIDEINWMS - PARAG MHASHILKAR Department Meeting, August 07, 2013.
Pilot Factory using Schedd Glidein Barnett Chiu BNL
Ian D. Alderman Computer Sciences Department University of Wisconsin-Madison Condor Week 2008 End-to-end.
OSG Site Admin Workshop - Mar 2008Using gLExec to improve security1 OSG Site Administrators Workshop Using gLExec to improve security of Grid jobs by Alain.
Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside.
Dan Bradley Condor Project CS and Physics Departments University of Wisconsin-Madison CCB The Condor Connection Broker.
HTCondor Security Basics HTCondor Week, Madison 2016 Zach Miller Center for High Throughput Computing Department of Computer Sciences.
Campus Grid Technology Derek Weitzel University of Nebraska – Lincoln Holland Computing Center (HCC) Home of the 2012 OSG AHM!
Job submission overview Marco Mambelli – August OSG Summer Workshop TTU - Lubbock, TX THE UNIVERSITY OF CHICAGO.
Why you should care about glexec OSG Site Administrator’s Meeting Written by Igor Sfiligoi Presented by Alain Roy Hint: It’s about security.
Panda Monitoring, Job Information, Performance Collection Kaushik De (UT Arlington), Torre Wenaus (BNL) OSG All Hands Consortium Meeting March 3, 2008.
3 Compute Elements are manageable By hand 2 ? We need middleware – specifically a Workload Management System (and more specifically, “glideinWMS”) 3.
Introduction to Distributed HTC and overlay systems Tuesday morning, 9:00am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University of California.
Condor Week 2007Glidein Factories - by I. Sfiligoi1 Condor Week 2007 Glidein Factories (and in particular, the glideinWMS) by Igor Sfiligoi.
Introduction to the Grid and the glideinWMS architecture Tuesday morning, 11:15am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University.
UCS D OSG Summer School 2011 Intro to DHTC OSG Summer School An introduction to Distributed High-Throughput Computing with emphasis on Grid computing.
OSG Consortium Meeting - March 6th 2007Evaluation of WMS for OSG - by I. Sfiligoi1 OSG Consortium Meeting Evaluation of Workload Management Systems for.
UCS D OSG Summer School 2011 Overlay systems OSG Summer School An introduction to Overlay systems Also known as Pilot systems by Igor Sfiligoi University.
European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi1 European Condor Week 2006 Using Condor Glide-Ins and GCB to run.
Condor Week 2006, University of Wisconsin 1 Matthew Norman Using Condor Glide-ins and GCB to run in a grid environment Elliot Lipeles, Matthew Norman,
Madison, Apr 2010Igor Sfiligoi1 Condor World 2010 Condor-G – A few lessons learned by Igor UCSD.
Condor Week 09Condor WAN scalability improvements1 Condor Week 2009 Condor WAN scalability improvements A needed evolution to support the CMS compute model.
Condor Week May 2012No user requirements1 Condor Week 2012 An argument for moving the requirements out of user hands - The CMS experience presented.
Honolulu - Oct 31st, 2007 Using Glideins to Maximize Scientific Output 1 IEEE NSS 2007 Making Science in the Grid World - Using Glideins to Maximize Scientific.
FermiGrid - PRIMA, VOMS, GUMS & SAZ Keith Chadwick Fermilab
HTCondor Networking Concepts
HTCondor Networking Concepts
HTCondor Security Basics
Quick Architecture Overview INFN HTCondor Workshop Oct 2016
Dynamic Deployment of VO Specific Condor Scheduler using GT4
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
Outline Expand via Flocking Grid Universe in HTCondor ("Condor-G")
Workload Management System
BDII Performance Tests
Glidein Factory Operations
High Availability in HTCondor
The CMS use of glideinWMS by Igor Sfiligoi (UCSD)
Evolution of the distributed computing model The case of CMS
Monitoring HTCondor with Ganglia
Building Grids with Condor
The Scheduling Strategy and Experience of IHEP HTCondor Cluster
HTCondor Security Basics HTCondor Week, Madison 2016
Condor Glidein: Condor Daemons On-The-Fly
Basic Grid Projects – Condor (Part I)
Upgrading Condor Best Practices
From Prototype to Production Grid
The Condor JobRouter.
Condor: Firewall Mirroring
GRID Workload Management System for CMS fall production
Condor-G Making Condor Grid Enabled
GLOW A Campus Grid within OSG
Condor-G: An Update.
Presentation transcript:

Arlington, Dec 7th 2006 Glidein Based WMS 1 A pilot-based (PULL) approach to the Grid An overview by Igor Sfiligoi

Arlington, Dec 7th 2006 Glidein Based WMS 2 Why use pilots on the Grid? ● Grid based on a loose aggregation of resources ● Users have no say on how the resources are managed ● The probability of at least one resource being broken at any given time is very high Pilots can verify the resource before pulling a user job ● Difficult to have complete and reliable information about the resources ● Managing priorities inside the resource pools difficult if not impossible Pilots can do full matchmaking based on complete info

Arlington, Dec 7th 2006 Glidein Based WMS 3 Why using Condor for pilots? ● Condor is a very widely used system – Mature and well known ● Has an active development team –... that listen to their users – A new condor release every few months ● Condor architecture distributed by design – Started as a project to gather unused CPU cycles – A perfect fit for the Grid world

Arlington, Dec 7th 2006 Glidein Based WMS 4 What are the “glideins”? ● Glideins are just properly configured, regular Condor daemons submitted as batch jobs ● From the user point of view, they look like regular Condor worker nodes Grid site Collector Schedd User jobs Gatekeeper Running Grid jobs Dedicated node Startd User job Startd User job StartdStartd StartdStartd 1a 2a 3a 4a 1b 2b 3b 4b 5b 6b Negotiator

Arlington, Dec 7th 2006 Glidein Based WMS 5 How do we create a WMS out of it? ● Glideins should be submitted on demand – The WMS must know about the characteristics of the user jobs ● Different Grid sites may need different configurations – The WMS must know the details of the Grid sites

Arlington, Dec 7th 2006 Glidein Based WMS 6 Dividi et impera ● Split the task between different processes Collector VO frontend Request glideins VO frontend Glidein Factory Grid site Glidein Factory Grid site Submit glideins Submit glideins Publish available glidein entries Publish available glidein entries Request glideins VO frontend Publish available glidein entries Request glideins Job Schedd Grid site Submit glideins Glidein Factory VO Frontend knows about user jobs Glidein Factory knows about Grid site(s) Job Schedd

Arlington, Dec 7th 2006 Glidein Based WMS 7 Dividi et impera (2) ● VO frontend – User/VO specific – A reasonably generic implementation provided for simple use cases ● Customizable – Looks only for a known subset of jobs – Can process custom attributes ● Glidein Factory – Completely generic – The default implementation should suit most tasks – The same factory could in principle serve several VOs

Arlington, Dec 7th 2006 Glidein Based WMS 8 Comunication between processes ● Using standard Condor Class-Ads – The Collector defines the WMS – This Collector is usually different than the glidein Collector Collector VO frontend Publish available glidein entries Request glideins Job Schedd Grid site Submit glideins Glidein Factory 1a 2 3 1b Startd Collector Negotiator

Arlington, Dec 7th 2006 Glidein Based WMS 9 Glidein factory ClassAd MyType=”glidefactory” FactoryName=”factory” GlideinName=”entry” Attribute1=”...”... AttributeN=”...” GlideinParamXXX=”val1”... GlideinParamYYY=”valN” Published classad Attributes describe the glidein like: ARCH=”INTEL”, MaxHours=72, Site=”Florida” Due to Condor limitations, define also GlideinMyType Parameters set glidein parameter defaults like: CONDOR_HOST=”UNDEFINED”,SEC_DEFAULT_ENCRYPTION=OPTIONAL MinDisk=16G, CheckFilesExist=”/tmp/CMS,$DATA/OSG”

Arlington, Dec 7th 2006 Glidein Based WMS 10 Frontend ClassAd MyType=”glideclient” ClientName=”client” ReqName=”reqX” ReqIdleGlideins=nr ReqMaxRun=nr ReqMaxSubmitXHour=nr GlideinParamWWW=”val1”... GlideinParamZZZ=”valY” Published classad GlideParamXXX must match the names published by the factory Due to Condor limitations, define also GlideinMyType Trying to keep a steady stream of glideins starting Limits on the amount of glideins submitted envisioned but not yet implemented

Arlington, Dec 7th 2006 Glidein Based WMS 11 Implementation details

Arlington, Dec 7th 2006 Glidein Based WMS 12 Frontend logic Query Schedd(s) Query WMS Collector Match and count ● Essentially a Matchmaker ● Will match user job attributes to factory attributes ● A scaled back count is then published as a request Jobs Attributes Factories Attributes Nr jobs x Factory Publish requests

Arlington, Dec 7th 2006 Glidein Based WMS 13 Frontend in the big picture Schedd User jobs VO Collector VO Frontend WMS Collector Frontend requests Factory Ads Grid site Startd User job Glidein Factory Grid site 2b 2a

Arlington, Dec 7th 2006 Glidein Based WMS 14 Glidein Factory Logic ● Essentially a publish-read-submit loop – Blindly execute orders – Security implemented at Collector level Grid site Collector Gatekeeper Non-glide Grid jobs Startd StartdStartd 6 1 Frontend requests Factory ads Glidein Factory Schedd Startd jobs 2 4 Grid site Gatekeepe r StartdStartd Grid site Gatekeepe r StartdStartd 3 5 7

Arlington, Dec 7th 2006 Glidein Based WMS 15 Glidein Factory Internals ● Made of multiple Entry Points – Each Entry Point an independent process – Implements the publish-read-submit loop ● Entry Points do the real job – The Glidein Factory is just a logical object – Management tools that act on the whole factory (i.e. several entry points at a time) envisioned, but not yet implemented

Arlington, Dec 7th 2006 Glidein Based WMS 16 Entry points at a glance Factory Daemon Schedd Startd jobs /mydir/glidein_e1 job.cond or glidein_s tartup.sh params.c fg condor_submit Grid site Gatekeepe r StartdStartd Grid site Gatekeepe r StartdStartd Grid site Gatekeepe r StartdStartd /mydir/glidein_e2 job.cond or glidein_s tartup.sh params.c fg /mydir/glidein_e3 job.cond or glidein_s tartup.sh params.c fg Factory Daemon Frontend requests Collector Entry Point Ad Frontend requests Entry Point Ad Frontend requests Entry Point Ad The status of the Entry Point is held on disk One Factory Daemon per Entry Point

Arlington, Dec 7th 2006 Glidein Based WMS 17 Glidein implementation ● Dummy startup script – Just loads other files and execute the ones marked as executable ● File transfer implemented using HTTP – Easy cacheable, standard tools available (Squid) – Proven to scale, widely used in Industry ● All sensitive file transfers signed (sha1) – Prevent tampering, as HTTP travels in clear

Arlington, Dec 7th 2006 Glidein Based WMS 18 Glidein implementation (2) ● Standard sanity checks provided – Disk space constraints – Node blacklisting ● Generic Condor configure and startup script provided, too ● Factory admins can easily add their own customization scripts (both for checks and configs) – Allowing Frontends to add custom scripts envisioned, but not yet implemented

Arlington, Dec 7th 2006 Glidein Based WMS 19 Glidein at a glance /mydir/glidein_e1 job.cond or glidein_s tartup.sh params.c fg condor_submit /var/www/html/glidein_e1 condor_b in.tgz user_setu p.sh condor_s tartup.sh condor_c onfig nodes_bl acklist files_list. lst signatur e.sha1 Schedd Glidein jobs Grid site Gatekeeper StartdStartd Worker Node Web Server Web Cache glidein_startup load file list execute scripts Erro rs? Startd Yes No Check the running environment, configure dynamic options or fail if problems found This batch slot would not be able to run a user job load files Only the startup script transferred by Globus. All files are signed to prevent tampering. Factory Daemon

Arlington, Dec 7th 2006 Glidein Based WMS 20 Advanced topics

Arlington, Dec 7th 2006 Glidein Based WMS 21 Security ● Traffic between the Collector and Schedd, and the Grid sites cannot be trusted – WAN is insecure by definition ● Strong authentication needs to be used The startd has access to the service X509 proxy, so that is used for authentication Supported natively by Condor

Arlington, Dec 7th 2006 Glidein Based WMS 22 Grid site Private networks and firewalls ● Condor developed with LAN in mind – All daemons require bi-directional networking ● Default mode fails with firewalls and private networks – Incoming traffic on WNs almost never allowed Schedd User jobs VO Collector Startd User job

Arlington, Dec 7th 2006 Glidein Based WMS 23 Using Condor GCB ● Adding Condor GCB to the mix solves the problem – Startd keeps a permanent TCP connection with GCB – Only outgoing connectivity used from WN ● Adds complexity to the system, but is needed – One GCB server can handle ~1k glideins Grid site Schedd User jobs VO Collector Startd User job GCB 1 5

Arlington, Dec 7th 2006 Glidein Based WMS 24 Integration with Grid-local security infrastructure ● Default pilot operation mode breaks Grid security rules – The real user job is pulled in without local consent – Grid site knows only about the glideins ● The default mode also a security risk for the VO – User jobs run under the same UID as the condor daemons Users have access to the service X509 proxy Gatekeeper User queue startd User Job GUMS Glidein UID Glidein X509 Glide in User Job User X509 Glidein UID Worker node Grid site Glidein X509

Arlington, Dec 7th 2006 Glidein Based WMS 25 Condor integration with gLExec ● gLExec started to be deployed on OSG WNs – Will change UID based on X509 proxy ● Condor natively supports it as of v6.9.0 Glidein factory have the necessary config script ● Solves both problems at once Gatekeeper User queue Pilot User Job gLExec GUMS User UID Pilot User Job User DN User UID Worker node Grid site Glidein X509 User X509 Glidein UID Glidein X509 Glidein UID

Arlington, Dec 7th 2006 Glidein Based WMS 26 Future work ● Need to implement several pieces that have been mentioned before – Waiting for some real users to bump up priorities – Current set satisfies the needs of CMS test harness ● With real users will certainly come new requests – I do not dare forecast what they might be ● As a wish, I would not mind if Condor natively implemented most of the functionality

Arlington, Dec 7th 2006 Glidein Based WMS 27 An example of possible future work ● Hierarchies of glidein factories Collector VO frontend Publish available glidein entries Request glideins Job Schedd Grid site Glidein Factory 1a 3 4 2b Startd Collector Negotiator Collector Submit glideins Glidein Factory 5a Glidein Factory Grid site Submit glideins Request glideins Publish available glidein entries Publish available glidein entries 2a 1b 5b