Condor Week 2012 - May 2012 - No user requirements: An argument for moving the requirements out of user hands - The CMS experience

Presentation transcript:

Condor Week 2012 (May 2012): No user requirements
An argument for moving the requirements out of user hands - The CMS experience
Presented by Igor Sfiligoi, UC San Diego

The basics
● Condor architecture clearly separates resource providers from resource consumers
● Each has a daemon process to represent it
  ● Startd for resource providers
  ● Schedd for resource consumers
● A central service connects them all
  ● Managed by a Collector/Negotiator pair
[Diagram: Startds on machines (aka worker nodes: CPUs, memory, IO, ...) and Schedds on job queues (aka submit nodes: jobs submitted by users), connected through the Collector/Negotiator pair]


Matchmaking
● In order for a job to start running on a resource
  ● The job requirements must evaluate to True
  ● The machine requirements must evaluate to True
● There is also the ranking, but that's a 2nd level optimization
● Most manuals focus on job reqs
● Machine reqs are usually deemed to be for handling the Owner state only
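The two-sided rule above can be sketched in a few lines of Python. This is only an illustrative model, not Condor code: ads are plain dicts, and the `requirements` and `rank` entries are hypothetical callables standing in for ClassAd expressions.

```python
# Minimal model of two-sided matchmaking (illustrative only).
# Each ad is a dict; "requirements" and "rank" are hypothetical callables
# taking (my_ad, target_ad), standing in for ClassAd expressions.

def matches(job_ad, machine_ad):
    """A job may start only if BOTH requirements evaluate to True."""
    job_ok = job_ad["requirements"](job_ad, machine_ad) is True
    machine_ok = machine_ad["requirements"](machine_ad, job_ad) is True
    return job_ok and machine_ok


def best_machine(job_ad, machine_ads):
    """Rank only orders machines that already passed both requirements."""
    candidates = [m for m in machine_ads if matches(job_ad, m)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job_ad["rank"](job_ad, m))


# A job that accepts anything and prefers more memory.
job = {
    "requirements": lambda me, target: True,
    "rank": lambda me, target: target["Memory"],
}
small = {"Name": "small", "Memory": 1024,
         "requirements": lambda me, target: True}
big = {"Name": "big", "Memory": 4096,
       "requirements": lambda me, target: True}
# The largest machine refuses every job, so rank never gets to consider it.
closed = {"Name": "closed", "Memory": 8192,
          "requirements": lambda me, target: False}
```

Calling `best_machine(job, [small, big, closed])` picks `big`: the `closed` machine has the most memory, but ranking is applied only after both requirements have passed.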

My argument
● Get rid of Job Requirements
● Put all logic in the Machine Requirements

Some background
● I am a big glideinWMS user
● And glideinWMS has 2-level matchmaking
  ● One at VO Frontend level – where to send glideins
  ● One at Negotiator level – which job to start in a glidein
[Diagram: Frontend, Factory, glidein, Collector, Negotiator, Schedd, and the Startd with its Job on the execution node]

Matchmaking problem
● Both levels must be in sync, or you either
  ● Ask for glideins which never match any jobs
    – Job matched to site in VO Frontend, but not to the site's machine in the Negotiator

Matchmaking problem
● Both levels must be in sync, or you either
  ● Ask for glideins which never match any jobs, or
  ● Have jobs waiting in the queue when a site is available
    – Job doesn't match to site in VO Frontend, but would match the site's machine in the Negotiator if glideins were requested

Matchmaking problem
● Both levels must be in sync, or you either
  ● Ask for glideins which never match any jobs, or
  ● Have jobs waiting in the queue when a site is available
● But site and machine ads have different attributes
  ● Not all machines on a site are exactly the same
● So I need 2 different requirements
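To make the two failure modes concrete, here is a small Python sketch. It is a toy model under stated assumptions, not glideinWMS code: the dict layout, the `sites` and `mem` job keys, and all function names are hypothetical; only `GLIDEIN_Site`, `GLIDEIN_Min_Mem` and `Memory` are borrowed from the slides.

```python
# Toy model of two-level matchmaking getting out of sync (illustrative only).

def loose_site_req(job, site):
    """Frontend-level rule that looks at the site name only."""
    return site["GLIDEIN_Site"] in job["sites"]


def strict_site_req(job, site):
    """Frontend-level rule that also checks the site's advertised minimum memory."""
    return loose_site_req(job, site) and site["GLIDEIN_Min_Mem"] > job["mem"]


def machine_req(job, machine):
    """Negotiator-level rule, evaluated against each individual machine."""
    return machine["Memory"] > job["mem"]


def sync_status(job, site, machines, site_req):
    at_site = site_req(job, site)
    at_machine = any(machine_req(job, m) for m in machines)
    if at_site and not at_machine:
        return "glideins requested but never matched"
    if at_machine and not at_site:
        return "job idle although the site could run it"
    return "in sync"


# One site whose machines are not all the same.
site = {"GLIDEIN_Site": "UCSD", "GLIDEIN_Min_Mem": 1024}
machines = [{"Memory": 1024}, {"Memory": 2048}]

wasted = sync_status({"sites": ["UCSD"], "mem": 4096}, site, machines, loose_site_req)
idle = sync_status({"sites": ["UCSD"], "mem": 1500}, site, machines, strict_site_req)
fine = sync_status({"sites": ["UCSD"], "mem": 512}, site, machines, strict_site_req)
```

The first job is accepted at the site level but fits no machine, so glideins are wasted. The second fits the larger machine but is rejected by the site-level rule, so it waits in the queue. The third matches at both levels.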


The usual dilemma
● Where do I define these requirements?
  ● In user job ClassAd?
    – Potentially O(1k) users
    – And do you really trust your users to do the right thing?
  ● Or in the Resource ClassAds?
    – And we have 2 different resource types here
    – The 2 resources are handled by the same admin (in VO Frontend); typically O(1)


Moving reqs to the resources
● So we went for setting the requirements in the resources themselves
● But users still need a way to select resources!
  ● How do they do it?
● They express their needs through attributes!

Fixed schema
● The resource provider (a.k.a. startd + VO Frontend) defines the requirements
  ● The VO Frontend admin in our case
● Those requirements look for well-defined user-provided attributes

Example:
  entry_req = stringListMember(GLIDEIN_Site,DESIRED_Sites) &&
              ((GLIDEIN_Min_Mem>DESIRED_Mem)=!=False)
  Start = stringListMember(GLIDEIN_Site,DESIRED_Sites) &&
          ((Memory>DESIRED_Mem)=!=False)
[Legend on the slide: site attribute, machine attribute, job attribute]
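The `=!=False` idiom in the Start expression is what makes `DESIRED_Mem` optional: a comparison against a missing attribute is UNDEFINED, and UNDEFINED is not identical to False. The Python sketch below models that behavior. It is an approximation under assumptions: `None` stands in for UNDEFINED, the helper names are hypothetical, the string list is split on commas only, and a job without `DESIRED_Sites` is treated as not matching.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def string_list_member(item, string_list):
    """Rough model of stringListMember(); UNDEFINED if either argument is missing."""
    if item is UNDEFINED or string_list is UNDEFINED:
        return UNDEFINED
    return item in [s.strip() for s in string_list.split(",")]


def greater_than(a, b):
    """Comparison that propagates UNDEFINED instead of raising."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a > b


def is_not_false(value):
    """Models `expr =!= False`: True unless expr is exactly False."""
    return value is not False


def start(machine_ad, job_ad):
    """Model of the Start expression from the slide; ads are plain dicts."""
    site_ok = string_list_member(machine_ad.get("GLIDEIN_Site"),
                                 job_ad.get("DESIRED_Sites"))
    mem_ok = is_not_false(greater_than(machine_ad.get("Memory"),
                                       job_ad.get("DESIRED_Mem")))
    # Assumption of this sketch: anything other than a clean True is no match.
    return site_ok is True and mem_ok


machine = {"GLIDEIN_Site": "UCSD", "Memory": 2048}
no_mem_request = start(machine, {"DESIRED_Sites": "UCSD,Nebraska"})
too_much_mem = start(machine, {"DESIRED_Sites": "UCSD,Nebraska", "DESIRED_Mem": 4096})
wrong_site = start(machine, {"DESIRED_Sites": "Nebraska"})
```

A job that sets only `DESIRED_Sites` matches, so the memory clause acts as a default that users can tighten by adding `DESIRED_Mem`.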

Simple user job submit file
● No complex requirements to write
● Very little user training
● Low error rate

a.submit:
  Executable = a.sh
  Output = a.out
  +DESIRED_Sites = "UCSD,Nebraska"
  Requirements = True
  queue


How well does it work?
● Advantages
  ● Easy to keep the two levels in sync
  ● Easy to define reasonable defaults
  ● Easy on the users
  ● And more... (wait for later slides)
● Disadvantages
  ● Rigid schema
    – In our experience, it does not need to change more than a couple of times a year


● Disadvantages
  ● Rigid schema
● Advantages
  ● Easy to keep the two levels in sync
  ● Easy to define reasonable defaults
  ● Easy on the users
Is this glideinWMS specific?
● Don't think so! The ease of use is there for any Condor setup
● I would argue it should become standard practice

And there is still more to it


Side effect
● Discovered one unexpected nice side effect
  ● We can outsmart our users!

The overflow use case
● Normally, CMS jobs run near the data
  ● So users provide a whitelist of sites to run on
  ● And we have the appropriate glideinWMS expression

Example:
  entry_req = stringListMember(GLIDEIN_Site,DESIRED_Sites) &&
              ((GLIDEIN_Min_Mem>DESIRED_Mem)=!=False)
  Start = stringListMember(GLIDEIN_Site,DESIRED_Sites) &&
          ((Memory>DESIRED_Mem)=!=False)

a.submit:
  Executable = a.sh
  Output = a.out
  +DESIRED_Sites = "UCSD,Nebraska"
  Requirements = True
  queue

The overflow use case
● Normally, CMS jobs run near the data
● But some jobs could run over the WAN
  ● It is just slightly less efficient
  ● Xrootd based WAN access, if you are interested

The overflow use case
● Normally, CMS jobs run near the data
● Jobs could run over the WAN
  ● It is just slightly less efficient
● But if some CPUs are idle due to low demand
  ● A low efficiency job is still better than no job!
  ● As long as it results in users getting their results sooner
    – i.e. only if "optimal resources" are not available
● We call this "overflowing"

The overflow use case
● Normally, CMS jobs run near the data
● Jobs could run over the WAN
  ● It is just slightly less efficient
● But if some CPUs are idle due to low demand
  ● A low efficiency job is still better than no job!
  ● As long as it results in users getting their results sooner
    – i.e. only if "optimal resources" are not available
● So, how do we implement this?


Overflow configuration
● Essentially, we change the rules!
  ● Without involving the users
● We write the requirements based on where the data is, not where the CPUs are
  ● Since not all sites have xrootd installed

Example:
  entry_req = ((CurrentTime-QDate)>21600) && (Country=?="US") &&
              stringListsIntersect(DESIRED_Sites,"UCSD,Wisc") &&
              ((GLIDEIN_Min_Mem>DESIRED_Mem)=!=False)
  Start = ((CurrentTime-QDate)>21600) &&
          stringListsIntersect(DESIRED_Sites,"UCSD,Wisc") &&
          ((Memory>DESIRED_Mem)=!=False)

a.submit (no change to the user submit file!):
  Executable = a.sh
  Output = a.out
  +DESIRED_Sites = "UCSD,Nebraska"
  Requirements = True
  queue

● And we can decide where to overflow from (the "UCSD,Wisc" list) hours after the jobs were submitted
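The overflow Start expression can be read as three conditions: the job has waited more than 21600 seconds (6 hours), its data lives at one of the sites we are willing to overflow from, and the memory check passes. A rough Python model, assuming dict-based ads and hypothetical helper names, with string lists split on commas only:

```python
OVERFLOW_WAIT_SECONDS = 21600   # 6 hours, the threshold used in the slide
OVERFLOW_FROM = "UCSD,Wisc"     # sites whose data we overflow from, per the slide


def string_lists_intersect(list_a, list_b):
    """Rough model of stringListsIntersect(): True if the lists share an item."""
    set_a = {s.strip() for s in list_a.split(",")}
    set_b = {s.strip() for s in list_b.split(",")}
    return bool(set_a & set_b)


def overflow_start(machine_ad, job_ad, current_time):
    """Model of the overflow Start expression; times are in seconds."""
    waited_long_enough = (current_time - job_ad["QDate"]) > OVERFLOW_WAIT_SECONDS
    data_reachable = string_lists_intersect(job_ad["DESIRED_Sites"], OVERFLOW_FROM)
    desired_mem = job_ad.get("DESIRED_Mem")
    # Mirrors (Memory>DESIRED_Mem)=!=False: a missing DESIRED_Mem passes.
    mem_ok = desired_mem is None or machine_ad["Memory"] > desired_mem
    return waited_long_enough and data_reachable and mem_ok


machine = {"Memory": 2048}
job = {"QDate": 0, "DESIRED_Sites": "UCSD,Nebraska"}

after_one_hour = overflow_start(machine, job, current_time=3600)
after_seven_hours = overflow_start(machine, job, current_time=7 * 3600)
```

The job ad is exactly what the unchanged submit file produces. Only the admin-side rule differs, so the same job becomes eligible for the overflow resources once it has been queued for more than six hours.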


Looking at the future
● The attribute schema is now a fixed one
  ● At least regarding matchmaking
● Opens up new interesting possibilities


Looking at the future
● The attribute schema is now a fixed one
  ● At least regarding matchmaking
● Can consider RDBMS techniques
  ● Example uses
    – RDBMS driven matchmaking
    – Tracking of job and/or machine history in a DB
  ● Will likely need code changes in Condor
    – UCSD committed to try it out in the near future
    – If the results end up as good as hoped for, will push for official adoption
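As a purely speculative illustration of what RDBMS driven matchmaking could look like once the schema is fixed, the sketch below stores jobs and machines in an in-memory SQLite database and expresses the earlier Start rule as a join. The table layout, column names and sample rows are all assumptions made for this example; nothing here reflects an actual Condor or glideinWMS design.

```python
import sqlite3

# Hypothetical fixed schema: one row per machine, one per job,
# and one row per (job, desired site) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE machines (name TEXT, glidein_site TEXT, memory INTEGER);
CREATE TABLE jobs (job_id INTEGER, desired_mem INTEGER);
CREATE TABLE job_sites (job_id INTEGER, site TEXT);
""")
conn.executemany("INSERT INTO machines VALUES (?, ?, ?)", [
    ("slot1@ucsd", "UCSD", 2048),
    ("slot1@wisc", "Wisc", 4096),
])
conn.executemany("INSERT INTO jobs VALUES (?, ?)", [
    (1, None),   # no DESIRED_Mem given
    (2, 3000),
])
conn.executemany("INSERT INTO job_sites VALUES (?, ?)", [
    (1, "UCSD"), (1, "Nebraska"),
    (2, "UCSD"), (2, "Wisc"),
])


def find_matches(connection):
    """Site whitelist as a join; a NULL desired_mem plays the role of =!=False."""
    return connection.execute("""
        SELECT j.job_id, m.name
        FROM jobs j
        JOIN job_sites s ON s.job_id = j.job_id
        JOIN machines m ON m.glidein_site = s.site
        WHERE j.desired_mem IS NULL OR m.memory > j.desired_mem
        ORDER BY j.job_id, m.name
    """).fetchall()


matches = find_matches(conn)
```

With this sample data, job 1 matches the UCSD slot and job 2 matches only the Wisc slot, because the UCSD slot does not have enough memory for it.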

Conclusions
● Matchmaking is made of requirements
  ● But there are two places where they can be defined
● Usually, users are expected to set requirements
  ● The CMS glideinWMS setup moves them completely into the resource domain
● Experience with the no-user-req setup is very positive
  ● Advocating that this should be the recommended way for all Condor deployments
  ● Also opens up interesting new avenues

Acknowledgments
● This work is partially sponsored by
  ● the US National Science Foundation under Grants No. PHY (AAA) and PHY (CMS Maintenance & Operations)
  ● and the US Department of Energy under Grant No. DE-FC02-06ER41436 subcontract No. 647F290 (OSG).