The CMS use of glideinWMS
by Igor Sfiligoi (UCSD)
WLCG TEG Meeting, CERN, Nov 3rd 2011
Outline
- Requirements
- Why glideins
- Glidein overview
- GlideinWMS overview
- Lessons learned
Requirements (1)
- Job description
  - All jobs are single-node (no multi-node MPI)
  - Jobs last anywhere from 10 minutes to 2 days, with a median of a couple of hours
  - Each job has a whitelist of Grid sites it can use, based on the software and data available at each site
- User expectations
  - O(1k) users with O(10k) jobs each in the queue at any time
  - Users measure run time by “last job finished”
Requirements (2)
- CMS expectations
  - Being able to schedule all jobs across all available resources in a single system
    - Decide scheduling policy on a global scale
  - Having the flexibility to define complex policies
    - Recent example: job overflow in case of a site oversubscription
Problems with non-pilot systems
- Slow, high overhead
  - Cannot really handle 10 min jobs
- Priority issues
  - Each site has its own queue, with its own priorities
  - Impossible for CMS to set global priorities
- Failure handling
  - Black hole nodes can eat hundreds of jobs before being fixed
Available pilot systems
- General purpose
  - Condor glideins (used by CDF in HEP)
- VO specific
  - DIRAC (I know it is sometimes portrayed as general purpose, but all presentations are LHCb-centric)
  - AliEn
  - (PanDA came later)
- CMS went with the general-purpose solution
Why Condor?
- Condor is a widely used product, not a CMS product
  - Both in academia and industry
  - Mature, yet still under active development
- Condor architecture lends itself well to overlays
  - It was meant to be “a scavenger” from the start
- Condor provides rich semantics
  - Allows for complex policies, if desired
- Many add-on tools (DAGMan, monitoring, etc.)
Quick architectural overview
- A “glidein” is just a Condor daemon run as a job
[Diagram: a policy node (Negotiator, Collector), several submit nodes (Schedd), and Grid/Cloud worker nodes running glideins (Startd). The legend distinguishes fixed from dynamic components.]
Condor glidein properties (1)
- Single policy domain
  - All glideins, from all resources, are managed by a single entity
  - CMS can easily decide which user job gets the next available resource
- Powerful matchmaking options
  - Each glidein advertises the actual resource status
- Good scalability
  - Demonstrated 90k glideins in a single system
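The single-policy-domain matchmaking described above can be illustrated with a minimal sketch. This is not the Condor ClassAd language or the Negotiator's actual algorithm (which also involves fair-share user priorities); the class names, the attributes `site`, `free_memory_mb`, `site_whitelist`, `request_memory_mb`, and the plain integer `priority` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GlideinAd:
    """What a glidein advertises about the resource it landed on (hypothetical fields)."""
    name: str
    site: str
    free_memory_mb: int


@dataclass
class Job:
    """A queued user job with its site whitelist (hypothetical fields)."""
    owner: str
    priority: int
    site_whitelist: frozenset
    request_memory_mb: int = 0


def matches(job, ad):
    """A job matches a glidein if the site is whitelisted and the resource is large enough."""
    return ad.site in job.site_whitelist and ad.free_memory_mb >= job.request_memory_mb


def pick_job(ad, queue):
    """Central policy decision: of all matching jobs, the highest priority one runs next."""
    candidates = [job for job in queue if matches(job, ad)]
    return max(candidates, key=lambda job: job.priority, default=None)
```

Because every glidein reports to the same pool, a function like `pick_job` sees the whole queue at once, which is what lets the VO, rather than each site, decide which job gets the next available resource.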
Condor glidein properties (2)
- Very fast, low overhead
  - Demonstrated 10 Hz job startup on a single schedd
  - Can have multiple schedds within a single policy domain
  - Not Grid limited; running many user jobs per glidein
- WAN friendly
  - Allows for strong security (CMS uses GSI)
    - Authentication, integrity and/or encryption, plus proxy delegation
  - Glideins require outgoing TCP only
  - Can handle high latencies between nodes
Condor glidein properties (3)
- Fault tolerant
  - Central processes can be restarted without job penalty
  - Policy node can be replicated in hot-spare mode for high availability
  - If a glidein dies, the running job will be rescheduled on another one
- Can use external tools for UID switching when not running as root
  - Supports gLExec
What is glideinWMS
- Someone has to configure and submit the glideins
  - This is what glideinWMS is for
- GlideinWMS is a thin layer on top of Condor
  - Logic to decide when and where to submit glideins
  - Powerful but easy-to-use tools for glidein configuration
  - Automated glidein submission
  - Monitoring and debugging tools
- Developed and maintained by CMS (see also acknowledgements)
GlideinWMS logic (1)
- Trust but verify
  - List of sites is not automatically updated
    - Although we have tools to query the information systems
    - Verify site functionality before enabling mass submissions
  - Validate the node before glidein startup
    - If validation fails, Condor is not started, so a user job never ends up there
    - Avoids most black hole problems
    - Based on plug-ins
  - Site attributes are also not auto-updated
    - And custom attributes can be added
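The validate-before-startup idea can be sketched as follows. This is an illustration only, not the glideinWMS startup script: the function names `first_failed_check` and `start_glidein`, and the representation of plug-ins as `(name, callable)` pairs, are assumptions.

```python
def first_failed_check(checks):
    """Run validation plug-ins in order; return the name of the first failure, or None.

    `checks` is a list of (name, callable) pairs. A plug-in that raises
    is treated the same as one that returns a false value.
    """
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            return name
    return None


def start_glidein(checks, start_condor):
    """Start the Condor daemons only if every validation plug-in passed."""
    failed = first_failed_check(checks)
    if failed is not None:
        # Condor is never started, so no user job can be matched to this node.
        return f"validation failed: {failed}"
    start_condor()
    return "started"
```

The point of the ordering is that a broken node fails while it is still only a pilot, so it never advertises itself to the pool and cannot turn into a black hole for user jobs.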
GlideinWMS logic (2)
- Glideins are sent to sites only if/when there are jobs waiting that can run there
  - Matches jobs to sites, counts how many where
    - A job counts as 1/N toward each of its N matching sites
  - But glideins are not limited to the job(s) that triggered the submission
- Glideins have a limited lifetime
  - Will self-terminate by a defined deadline
    - Typically, the length of the Grid queue lease
  - Will also die if nobody uses them for 20 mins (20 mins to give ample time for the matchmaking)
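The 1/N counting rule and the two lifetime conditions can be sketched in a few lines. This is a minimal illustration under assumed names (`glidein_demand`, `should_terminate`, `IDLE_LIMIT_S`), with times given as plain seconds; it is not the Frontend's actual code.

```python
from collections import defaultdict

IDLE_LIMIT_S = 20 * 60  # the 20 minute idle limit mentioned above


def glidein_demand(job_site_lists):
    """Estimate per-site demand: each job adds 1/N to each of its N matching sites.

    `job_site_lists` holds, for every waiting job, the sites it could run on.
    """
    demand = defaultdict(float)
    for site_list in job_site_lists:
        sites = set(site_list)
        if not sites:
            continue  # a job that matches no site creates no demand
        share = 1.0 / len(sites)
        for site in sites:
            demand[site] += share
    return dict(demand)


def should_terminate(now, deadline, last_used, idle_limit=IDLE_LIMIT_S):
    """A glidein exits at its deadline, or once it has been unused for idle_limit seconds."""
    return now >= deadline or (now - last_used) >= idle_limit
```

For three waiting jobs that match sites `["A", "B"]`, `["A"]` and `["A", "B", "C", "D"]`, the demand is 1.75 for A, 0.75 for B and 0.25 each for C and D. The total equals the number of jobs, so a job that can run at many sites does not trigger a full glidein at every one of them.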
Operational considerations
- Splits matching logic from Grid submission
  - Logic = Frontend, Grid submission = Factory
- Independent services, run by different people
- N-to-M relationship
  - Factory can be shared by different VOs
    - OSG runs one that serves 10 VOs
    - CMS is one of the supported Frontends
- All Grid ops happen in the Factory
Architectural overview (1)
[Diagram: the policy node (Negotiator, Collector), submit nodes (Schedd), and Grid/Cloud worker nodes running glideins (Startd), now joined by a Frontend node (Frontend) and a Factory node (Factory, Condor-G).]
Architectural overview (2)
- Summary overview of current situation
[Diagram listing:
  Frontends (each with Condor): CMS MC, CMS Reco, CMS AnaOps, Engage VO, HCC VO, GLOW VO, GlueX VO
  Factories: CERN, FNAL, Indiana, UCSD
  Resources: CMS T1s, CMS T2s, OSG (non-CMS), Teragrid]
Operational experience
- 3 years of production experience
  - Both Reconstruction and AnaOps
- Generally worked well
  - Very few glidein-related problems
- Ops load mostly on the Factory side
  - Due to things breaking at Grid sites
  - Have to monitor for failures and keep opening tickets to keep efficiency high
Typical glidein problems
- CE not accepting glideins
- Glideins sitting in queues forever (or lost)
- WN problems
  - Validation finds problems (e.g. broken NFS)
  - Networking issues (e.g. NAT overload)
  - Security issues (e.g. missing CAs)
- Glideins being prematurely killed (policy misunderstandings, either time or memory)
Known site complaints
- Don't know who is actually running
  - gLExec was supposed to solve that
- Pilots run too long
  - Can be tuned, but affects efficiency
  - Missing feature: a standard way for a site to tell the pilot it should go away at the next job boundary
    - Glideins actually have this: condor_off -peaceful
    - But the site must know how to use it
Beyond the Grid (1)
- Factory controls the glidein configuration
  - Can easily wrap it in a VM
- We have a working prototype that can submit to EC2
Beyond the Grid (2)
- New Grid concepts:
  - Stream CEs
    - I like the idea, but I am not sure how we would integrate it
    - Should not be too difficult, though
  - Resource allocation instead of “jobs”
    - I like this idea, too
    - But not sure how this would work yet
Acknowledgements
- GlideinWMS is a CMS-led project, with
  - A major contribution from the Fermilab Computing Division
  - Contributions from UCSD and ISI
Backup slides
Who uses Condor
- Many US Grid sites
- Top 500
- Cycle
What tech. is used in the Factory
- All Grid and Cloud submissions are done by Condor-G
  - Single API to support
  - Condor has been very good at adding support for all major Grid and Cloud APIs
Graphs (1): CMS AnaOps Frontend+Condor
Graphs (2): UCSD Factory
Graphs (3): Indiana Factory
Graphs (4): FNAL Factory