The CMS use of glideinWMS by Igor Sfiligoi (UCSD)

WLCG TEG Meeting, CERN, Nov 3rd 2011

Outline
- Requirements
- Why glideins
- Glidein overview
- GlideinWMS overview
- Lessons learned

Requirements (1)
- Job description
  - All jobs are single-node (no multi-node MPI)
  - Jobs last anywhere from 10 minutes to 2 days, with a median of a couple of hours
  - Each job carries a whitelist of Grid sites it can use, based on the software and data available at each site (see the sketch below)
- User expectations
  - O(1k) users with O(10k) jobs each in the queue at any time
  - Users measure run time by the "last job finished"
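As an illustration of the job description above, here is a minimal sketch of such a single-node job using the HTCondor Python bindings (recent versions). The attribute names DESIRED_Sites and GLIDEIN_CMSSite, the payload name and the site list are assumptions for the example, not taken from the slides; the point is the per-job site whitelist expressed in the requirements.

```python
import htcondor

# Sketch of a single-node CMS-style job with a per-job whitelist of Grid sites.
# DESIRED_Sites / GLIDEIN_CMSSite and the payload name are illustrative only.
job = htcondor.Submit({
    "universe": "vanilla",
    "executable": "run_analysis.sh",   # hypothetical user payload
    "request_cpus": "1",               # all jobs are single node
    "+DESIRED_Sites": '"T2_US_UCSD,T2_DE_DESY"',
    # match only glideins that advertise one of the whitelisted sites
    "requirements": "stringListMember(TARGET.GLIDEIN_CMSSite, MY.DESIRED_Sites)",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

htcondor.Schedd().submit(job, count=1)  # one of the O(10k) jobs a user may queue
```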

Requirements (2)
- CMS expectations
  - Schedule all jobs across all available resources in a single system
  - Decide scheduling policy on a global scale
  - Have the flexibility to define complex policies
    - Recent example: job overflow when a site is oversubscribed

Problems with non-pilot systems
- Slow, with high overhead
  - Cannot really handle 10-minute jobs
- Priority issues
  - Each site has its own queue with its own priorities
  - Impossible for CMS to set global priorities
- Failure handling
  - Black-hole nodes can eat hundreds of jobs before being fixed

Available pilot systems
- CMS went with the general-purpose solution
- General purpose
  - Condor glideins, used by CDF in HEP
- VO specific
  - DIRAC: I know it is sometimes portrayed as general purpose, but all presentations are LHCb-centric
  - AliEn (PanDA came later)

Why Condor?
- Not a CMS product
- Condor is a widely used product, both in academia and industry
  - Mature, yet still under active development
- The Condor architecture lends itself well to overlays
  - It was meant to be "a scavenger" from the start
- Condor provides rich semantics
  - Allows for complex policies, if desired
  - Many add-on tools (DAGMan, monitoring, etc.)

Quick architectural overview
- A "glidein" is just a Condor daemon submitted as a job (see the sketch below)
- [Diagram: a fixed policy node (negotiator + collector) and fixed submit nodes (schedds), plus dynamic glideins (startds) running on Grid/Cloud worker nodes]
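To make the "Condor daemon as a job" idea concrete, here is a heavily simplified sketch of what a glidein pilot does once it lands on a worker node: generate a throw-away Condor configuration pointing at the central collector and start a condor_master/startd under it. The collector host, the GLIDEIN_Site value and the overall shape are assumptions for illustration; the real glideinWMS startup script produces a far more complete configuration (GSI security, lifetime limits, validation, and so on).

```python
import os
import subprocess
import tempfile

# Simplified sketch of a glidein pilot: write a minimal condor_config and
# start a condor_master that runs a startd reporting to the central collector.
# Host names and attribute values are illustrative; a real glidein config is
# much more complete (security, lifetime, resource limits, ...).
workdir = tempfile.mkdtemp(prefix="glidein_")
for sub in ("log", "execute", "spool"):
    os.makedirs(os.path.join(workdir, sub))

config = """
CONDOR_HOST = collector.example.cern.ch
DAEMON_LIST = MASTER, STARTD
LOCAL_DIR   = {workdir}
LOG         = $(LOCAL_DIR)/log
EXECUTE     = $(LOCAL_DIR)/execute
SPOOL       = $(LOCAL_DIR)/spool
# advertise where this glidein runs so the negotiator can match jobs to it
GLIDEIN_Site = "T2_EXAMPLE"
STARTD_ATTRS = $(STARTD_ATTRS) GLIDEIN_Site
""".format(workdir=workdir)

cfg_path = os.path.join(workdir, "condor_config")
with open(cfg_path, "w") as f:
    f.write(config)

env = dict(os.environ, CONDOR_CONFIG=cfg_path)
subprocess.Popen(["condor_master", "-f"], env=env)  # -f: run in the foreground
```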

Condor glidein properties (1)
- Single policy domain
  - All glideins, from all resources, are managed by a single entity
  - CMS can easily decide which user job gets the next available resource
- Powerful matchmaking options
  - Each glidein advertises the actual resource status (see the query sketch below)
- Good scalability
  - Demonstrated 90k glideins in a single system
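Because every glidein advertises its status to the central collector, the pool can be inspected like any other Condor pool. A small sketch using the HTCondor Python bindings; the collector host and the GLIDEIN_Site attribute name are assumptions for the example.

```python
import htcondor

# Sketch: ask the central collector what the glideins currently advertise.
# The collector host and GLIDEIN_Site are illustrative.
coll = htcondor.Collector("collector.example.cern.ch")
ads = coll.query(
    htcondor.AdTypes.Startd,
    projection=["Name", "GLIDEIN_Site", "State", "Activity", "Memory"],
)
for ad in ads:
    print(ad.get("GLIDEIN_Site"), ad.get("Name"), ad.get("State"), ad.get("Memory"))
```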

Condor glidein properties (2)
- Very fast, low overhead
  - Demonstrated 10 Hz job startup on a single schedd
  - Can have multiple schedds within a single policy domain
  - Not Grid-limited; many user jobs run per glidein
- WAN friendly
  - Allows for strong security (CMS uses GSI): authentication, integrity and/or encryption, plus proxy delegation
  - Glideins require outgoing TCP only
  - Can handle high latencies between nodes

Condor glidein properties (3)
- Fault tolerant
  - Central processes can be restarted without job penalty
  - The policy node can be replicated in hot-spare mode for high availability
  - If a glidein dies, the running job is rescheduled on another one
- Can use external tools for UID switching when not running as root
  - Supports gLExec

What is glideinWMS
- Someone has to configure and submit the glideins; this is what glideinWMS is for
- GlideinWMS is a thin layer on top of Condor
  - Logic to decide when and where to submit glideins
  - Powerful but easy-to-use tools for glidein configuration
  - Automated glidein submission
  - Monitoring and debugging tools
- Developed and maintained by CMS (see also acknowledgements)

GlideinWMS logic (1)
- Trust but verify
  - The list of sites is not automatically updated, although we have tools to query the information systems
  - Site attributes are also not auto-updated, and custom attributes can be added
  - Site functionality is verified before mass submissions are enabled
- Validate the node before glidein startup (see the sketch below)
  - If validation fails, Condor is not started, so user jobs never end up there
  - Avoids most black-hole problems
  - Based on plug-ins
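The validation plug-ins mentioned above are ordinary executables run before Condor is started; if any of them fails, the glidein gives up and no user job ever lands on the node. The following is a generic sketch of the kind of check such a plug-in performs, not the actual glideinWMS plug-in interface (which passes a glidein configuration file and expects structured status reporting); paths and thresholds are assumptions.

```python
#!/usr/bin/env python3
# Generic sketch of a pre-start validation check: exit non-zero so that
# Condor is never started on a broken worker node.  Paths and thresholds are
# illustrative; the real glideinWMS plug-in interface is richer than this.
import os
import sys

def enough_scratch(path="/tmp", min_free_gb=5):
    st = os.statvfs(path)
    return (st.f_bavail * st.f_frsize) / 1e9 >= min_free_gb

def cms_software_visible(path="/cvmfs/cms.cern.ch"):
    return os.path.isdir(path)

if not (enough_scratch() and cms_software_visible()):
    sys.stderr.write("node validation failed, not starting Condor\n")
    sys.exit(1)

sys.exit(0)
```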

GlideinWMS logic (2)
- Glideins are sent to sites only if/when there are jobs waiting that can run there
  - Matches jobs to sites and counts how many match where; a job counts 1/N toward each of its N matching sites (see the sketch below)
  - But glideins are not limited to the job(s) that triggered the submission
- Glideins have a limited lifetime
  - Will self-terminate by a defined deadline, typically the length of the Grid queue lease
  - Will also die if nobody uses them for 20 minutes (20 minutes to give ample time for the matchmaking)
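The "1/N per matching site" counting is easy to express directly. Below is a minimal sketch of that pressure calculation under the stated rule; function and field names are illustrative, not the glideinWMS internals.

```python
from collections import defaultdict

def site_pressure(idle_jobs):
    """Each idle job contributes 1/N to each of the N sites it matches."""
    pressure = defaultdict(float)
    for job in idle_jobs:
        sites = job["sites"]
        if not sites:
            continue  # a job matching no site requests no glideins
        for site in sites:
            pressure[site] += 1.0 / len(sites)
    return dict(pressure)

idle_jobs = [
    {"id": 1, "sites": ["T2_US_UCSD", "T2_DE_DESY"]},
    {"id": 2, "sites": ["T2_US_UCSD"]},
]
print(site_pressure(idle_jobs))  # {'T2_US_UCSD': 1.5, 'T2_DE_DESY': 0.5}
```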

Operational considerations
- Splits the matching logic from the Grid submission
  - Logic = Frontend, Grid submission = Factory
  - Independent services, run by different people
  - N-to-M relationship
- A Factory can be shared by different VOs
  - OSG runs one that serves 10 VOs; CMS is one of the supported Frontends
  - All Grid operations happen at the Factory

Architectural overview (1)
- [Diagram: the Frontend node watches the schedds and the collector on the submit/policy nodes and asks the Factory node to submit glideins via Condor-G; the glideins (startds) then run on Grid/Cloud worker nodes and join the same collector]

Architectural overview (2)
- Summary overview of the current situation
- [Diagram: Frontends (each with its own Condor pool): CMS MC, CMS Reco, CMS AnaOps, plus the non-CMS Engage VO, HCC VO, GLOW VO and GlueX VO; Factories: CERN, FNAL, Indiana, UCSD; resources: CMS T1s, CMS T2s, OSG (non-CMS), TeraGrid]

Operational experience
- 3 years of production experience, both Reconstruction and AnaOps
- Generally worked well
  - Very few glidein-related problems
- Ops load mostly on the factory side
  - Due to things breaking at Grid sites
  - Have to monitor for failures and keep opening tickets to keep efficiency high

Typical glidein problems
- CE not accepting glideins
- Glideins sitting in queues forever (or lost)
- WN problems
  - Validation finds problems (e.g. broken NFS)
  - Networking issues (e.g. NAT overload)
  - Security issues (e.g. missing CAs)
- Glideins being prematurely killed (policy misunderstandings, on either time or memory)

Known site complaints
- Sites don't know who is actually running
  - gLExec was supposed to solve that
- Pilots run too long
  - Can be tuned, but this affects efficiency
- Missing feature: a standard way for a site to tell the pilot it should go away at the next job boundary
  - Glideins actually have this: condor_off -peaceful
  - But the site must know how to use it

Beyond the Grid (1)
- The Factory controls the glidein configuration
  - Can easily wrap it in a VM
  - We have a working prototype that can submit to EC2

Beyond the Grid (2)
- New Grid concepts
  - Stream CEs: I like the idea, but am not sure how we would integrate it; it should not be too difficult, though
  - Resource allocation instead of "jobs": I like this idea too, but am not sure how this would work yet

Acknowledgements
- GlideinWMS is a CMS-led project, with a major contribution from the Fermilab Computing Division and contributions from UCSD and ISI

Backup slides

Who uses Condor
- Many US Grid sites
- Top 500
- Cycle

What technology is used in the Factory
- All Grid and Cloud submissions are done by Condor-G
  - A single API to support (see the sketch below)
  - Condor has been very good at adding support for all major Grid and Cloud APIs
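The "single API" point can be illustrated with the grid universe: the same kind of submit description works for different back-ends by changing only grid_resource. A sketch using the HTCondor Python bindings; the gatekeeper address, the pilot script name and the EC2 details are assumptions for the example.

```python
import htcondor

# Sketch of Condor-G style submissions: one interface (the grid universe),
# with the back-end selected via grid_resource.  Names are illustrative.
to_gram_ce = htcondor.Submit({
    "universe": "grid",
    "grid_resource": "gt2 ce.example.edu/jobmanager-condor",  # a Globus GRAM CE
    "executable": "glidein_pilot.sh",
    "output": "pilot.out",
    "error": "pilot.err",
    "log": "pilot.log",
})

to_ec2 = htcondor.Submit({
    "universe": "grid",
    "grid_resource": "ec2 https://ec2.amazonaws.com/",  # cloud case: pilot wrapped in a VM
    "ec2_ami_id": "ami-00000000",                        # plus credentials, not shown here
    "executable": "glidein_pilot.sh",
})

htcondor.Schedd().submit(to_gram_ce, count=1)  # the EC2 one would need credentials first
```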

Graphs (1): CMS AnaOps Frontend + Condor

Graphs (2): UCSD Factory

Graphs (3): Indiana Factory

Graphs (4): FNAL Factory