Presentation transcript:

glideinWMS: Quick Facts

- glideinWMS is an open-source Fermilab Computing Sector product driven by CMS
- Heavy reliance on HTCondor from UW Madison, and we work closely with them
- Contributors include: Krista Larson (FNAL/Corral), Parag Mhashilkar (FNAL/Corral), Mats Rynge (ISI/USC/Corral), Igor Sfiligoi (UCSD), Doug Strain (FNAL), Anthony Tiradani (FNAL/CMS), John Weigand (CMS), Derek Weitzel (UNL), Burt Holzman (FNAL)

glideinWMS: Overview

[Architecture diagram: the VO infrastructure runs a Condor Scheduler holding user Jobs, a glideinWMS Glidein Factory with its WMS Pool, a VO Frontend, and a Condor Central Manager; the factory submits glideins to Grid Site Worker Nodes, where each glidein starts a Condor Startd that joins the pool.]

glideinWMS: version timeline

- 2.4.x: privilege separation, aggregate monitoring, glexec control, glidein lifetime control
- 2.5.x: HTCondor TCP bulk updates, efficiency improvements, factory limits per frontend, excess glidein removal, shared ports, better user pool matchmaking
- 2.6.x: better multislot support, ARC CE, more glidein lifetime controls, factory limits per frontend security class
- 2.7.x: refactoring for factory scaling, performance fixes, partitionable slot support
- 3.x: cloud support, CorralWMS frontend support

Production and development branches are maintained in parallel; in 2013, v3 becomes production.

glideinWMS: OSG Usage

[Usage plot: 2 factories, ~50K pilots at peak.]

glideinWMS: CMS Usage

[Usage plot: ~23K slots claimed.]

glideinWMS: Grid vs Cloud

[Architecture diagram: the VO infrastructure (Condor Scheduler with Jobs, glideinWMS Glidein Factory/WMS Pool, VO Frontend, Condor Central Manager) is unchanged; on a Grid Site the factory submits a glidein that starts a Condor Startd on the Worker Node, while on a Cloud Site it requests a glideinWMS VM whose bootstrap process starts the Condor Startd.]

OpenStack

Why OpenStack? [1]
- If you want Amazon and VMware alternatives, OpenStack is open
- OpenStack matures at a rapid pace
- Its fundamental architecture is very sound
- It's driven by a diverse, huge community

Note: We are not affiliated with OpenStack. We just use OpenStack.

[1] hive/2013/03/28/7-reasons-why-openstack-matters.aspx

OpenStack: For Fun and Profit (1)

"People Get Pissed Off About OpenStack And That's Why It Will Survive" [1]

[1] thats-why-it-will-survive/

OpenStack: For Fun and Profit (2)

- Very, very easy to overload the cloud controller
  - 1 node, 1 Condor instance, 10 VM requests == DoS
- OpenStack doesn't completely follow EC2 "standards"
  - Adds new VM states
  - Ignores the "terminate on shutdown" flag
  - User data size limit is undocumented and much smaller than Amazon's
- OpenStack can be fragile internally
  - Many "broken pipe" messages in the nova api log

OpenStack: Improvements

- Submitted a patch to OpenStack that brings the default user data size limit in line with Amazon's
  - Should be in the Grizzly release
- Thanks to Adam Huffman for discovering the issue and assisting in the troubleshooting effort

glideinWMS: Adaptations

- Do not release any held jobs
  - For grid, we know some hold reasons are transient, so we release jobs held with particular hold reason codes
  - For cloud, specifically OpenStack, all hold reasons are fatal to the job
- Throttle the polling frequency
  - Did I mention that it is very easy to overload the OpenStack controller?
  - We modify the factory HTCondor configs using existing knobs (see the sketch below)
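As a rough illustration only (the hold reason code and intervals below are placeholders, not the factory's actual policy), these adaptations map onto existing HTCondor configuration knobs along these lines:

```
# Sketch with placeholder values, not the factory's real settings.
# Auto-release grid-universe jobs held for a transient reason (code 16
# is illustrative), but never cloud (ec2) jobs, where holds are fatal.
SYSTEM_PERIODIC_RELEASE = (JobUniverse == 9) && \
    !regexp("^ec2", GridResource) && \
    (HoldReasonCode =?= 16) && (NumSystemHolds < 5)

# Throttle status polling so the gridmanager does not overload the
# OpenStack controller (intervals in seconds, values illustrative).
GRIDMANAGER_JOB_PROBE_INTERVAL = 600
GRIDMANAGER_MAX_PENDING_REQUESTS = 50
```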

HTCondor: Adaptations

- Batch update requests
  - HTCondor now performs a single DescribeInstances call with no filter to return all requested VMs, and parses the state internally
  - Note: OpenStack will return ALL VMs, regardless of who requested them (see the sketch below)
- Handle the SHUTOFF state
  - OpenStack adds a new terminal state for VMs
- OpenStack ignores the "terminate on shutdown" flag (InstanceInitiatedShutdownBehavior)
  - OpenStack will never terminate requested VMs automatically
  - HTCondor will explicitly terminate these VMs
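HTCondor's EC2 handling lives in its GAHP, so the following Python/boto3 sketch is only an illustration of the polling pattern (the endpoint URL and client tokens are hypothetical): one unfiltered DescribeInstances, client-side matching against our own requests, and explicit termination of VMs left in a shutoff/stopped state.

```python
# Illustrative sketch of the polling pattern described above; this is
# not how HTCondor itself is implemented.
import boto3

# Assumption: an EC2-compatible endpoint such as OpenStack's EC2 layer.
ec2 = boto3.client(
    "ec2", endpoint_url="https://openstack.example.org:8773/services/Cloud"
)

# Client tokens of the VMs *we* requested; OpenStack's unfiltered
# DescribeInstances returns everyone's VMs, so we filter ourselves.
our_tokens = {"glidein-0001", "glidein-0002"}

resp = ec2.describe_instances()  # one batch call, no filters
for reservation in resp["Reservations"]:
    for inst in reservation["Instances"]:
        if inst.get("ClientToken") not in our_tokens:
            continue  # someone else's VM
        state = inst["State"]["Name"]
        if state in ("stopped", "shutoff"):
            # OpenStack ignores InstanceInitiatedShutdownBehavior and
            # never reaps these itself, so terminate explicitly.
            ec2.terminate_instances(InstanceIds=[inst["InstanceId"]])
```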

HTCondor: Unexpected Features

- All SSH keys are deleted when HTCondor is restarted
  - We are currently running a pre-release
  - Fixed in the official release? (Have to test)
- condor_status -xml
  - The escaping rules for "\" changed
  - glideinWMS now uses a more robust regex

HTCondor: Enhancement Requests

- Batch VM requests: request N VMs in one call
  - When a workflow starts up there are few (if any) VMs running, and we need to ramp up quickly
  - Currently we can only request one VM at a time (see the sketch below)
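The underlying EC2 API already supports this through the MinCount/MaxCount parameters of RunInstances; a boto3 sketch of the kind of batched call we would like HTCondor to issue (all identifiers below are placeholders):

```python
# Sketch of a batched provisioning request: N VMs in a single call.
import boto3

ec2 = boto3.client(
    "ec2", endpoint_url="https://openstack.example.org:8773/services/Cloud"
)

resp = ec2.run_instances(
    ImageId="ami-00000001",  # glidein VM image (placeholder)
    InstanceType="m1.large",
    MinCount=50,             # fail unless at least 50 can start
    MaxCount=100,            # ramp up with as many as 100 at once
)
print("started", len(resp["Instances"]), "VMs in one call")
```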

glideinWMS: Enhancement Requests

- Refactor how pilots handle user data
  - Currently all user data must be in a glideinWMS-specific format
  - The pilot bootstrap service controls all startup and user data processing
  - We want to use CloudInit and make the bootstrap a plugin to CloudInit (see the sketch below)
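For illustration, a minimal #cloud-config user-data sketch in which cloud-init, rather than a custom pilot format, drives the boot and runs the glidein bootstrap as one configured step (the URL and arguments are hypothetical):

```
#cloud-config
# Hypothetical sketch: cloud-init owns the boot sequence and the
# glidein bootstrap becomes just another configured step.
runcmd:
  - curl -sSf -o /tmp/glidein_startup.sh https://factory.example.org/glidein_startup.sh
  - chmod +x /tmp/glidein_startup.sh
  - /tmp/glidein_startup.sh ...   # factory-supplied arguments elided
```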

Other Considerations

- Must architect your network properly
  - CMS is limited to slightly under 1K jobs due to the current network architecture
  - We have ~13K cores, but can't use them all
  - A network reconfiguration will happen this week
- Tune your controller database carefully
  - One early bottleneck for the OpenStack cloud controller was a poorly tuned MySQL database (see the sketch below)
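As an illustration only, the kind of my.cnf knobs that matter for a loaded controller database; the values are placeholders, not our production tuning:

```
[mysqld]
# Placeholder values for illustration; size to your controller host.
innodb_buffer_pool_size        = 4G     # keep hot tables in memory
max_connections                = 1024   # many API workers connect at once
innodb_flush_log_at_trx_commit = 2      # trade durability for throughput
```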

VM Image

Acknowledgements

- HTCondor Team: Jaime Frey, Todd Miller, Todd Tannenbaum, Timothy St. Clair
- Adam Huffman (Imperial College)

Questions?