Glidein Factory Operations

Presentation transcript:

glideinWMS training
Glidein Factory Operations, i.e. how do we spend our time?
by Igor Sfiligoi (UCSD)

G.Factory Operation Categories
- Factory node operations
- Serving VO Frontend Admin requests
- Keeping up with changes in the Grid
- Debugging Grid problems

G.Factory Operation Ongoing Costs
- Factory node operations: pretty much runs itself; unexpected problems take less than 1 day/month
- Serving VO Frontend Admin requests: highly variable, averaging a few hours/week
- Keeping up with changes in the Grid: variable, currently O(10 hours)/week
- Debugging Grid problems: more than we have effort for! Better tools could drastically reduce this

Factory node ops: O(hours) / month
- The factory mostly just runs
- Occasional SW upgrades are needed, but they are typically fast and painless
- Most effort goes into investigating unexpected behavior, e.g.
  - High load
  - Weird problems after a reboot/OS upgrade
- Of course, installing a new node can take significant time
  - But that is a very rare event

VO FE Admin requests: O(hours) / week
- Adding a new VO FE can be expensive
  - Apart from the config changes, we also have to help them start running
  - However, it is relatively rare to have new VOs
- In steady state, VOs may request
  - New sites
  - New attributes
- G.Factory operators must also assist with debugging FE config changes
  - Error logs come only to the G.Factory (currently)

Following changes in the Grid: O(10 hours) / week
- The G.Factory operational principle is trust-but-verify
  - G.Factory admins must approve any change in the G.Factory config
- The Grid is a very dynamic place
  - At least one site makes a change every single day
  - Discovery is mostly complaint driven; we have no good tools to automate change discovery
- G.Factory admins thus must change the G.Factory config often
  - Currently mostly a manual process
  - Better tools would be welcome
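
The trust-but-verify workflow above could be supported by a small change-discovery helper. The following is only a minimal sketch, not part of glideinWMS: the function names, the `Change` record and the attribute names in the usage note are all hypothetical, and it assumes both the factory config and the site-advertised values are available as plain `{site: {attribute: value}}` dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One proposed difference (hypothetical record, for illustration only)."""
    site: str
    attribute: str
    configured: object  # value currently in the factory config (None if absent)
    observed: object    # value the site currently advertises (None if absent)


def discover_changes(configured, observed):
    """Compare two {site: {attribute: value}} mappings and list every difference.

    Nothing is applied here: the result is only a proposal for an admin to review.
    """
    changes = []
    for site in sorted(set(configured) | set(observed)):
        old = configured.get(site, {})
        new = observed.get(site, {})
        for attr in sorted(set(old) | set(new)):
            if old.get(attr) != new.get(attr):
                changes.append(Change(site, attr, old.get(attr), new.get(attr)))
    return changes


def apply_approved(configured, changes, approved):
    """Return a new config containing only the changes an admin approved.

    `approved` is a set of (site, attribute) pairs; the input config is not modified.
    """
    result = {site: dict(attrs) for site, attrs in configured.items()}
    for change in changes:
        if (change.site, change.attribute) not in approved:
            continue  # unapproved changes never reach the config
        attrs = result.setdefault(change.site, {})
        if change.observed is None:
            attrs.pop(change.attribute, None)
        else:
            attrs[change.attribute] = change.observed
    return result
```

For example, calling `discover_changes` with a config that lists `"gatekeeper": "ce1.example.org"` for a site that now advertises `"ce2.example.org"` yields a single `Change`, which only reaches the config once its `(site, attribute)` pair is passed in `approved`. Keeping discovery and approval as separate steps is the point of the sketch: automation finds the differences, while a human still signs off on each one.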

Grid debugging 1/2
- With O(50k) glideins running at any time, we always find something broken somewhere
- Full spectrum of errors
  - Broken worker nodes (validation errors)
  - Broken CEs (authentication/startup/monitor errors)
  - Network problems (glideins not registering)
- Mostly we cannot directly solve the problem(s)
  - i.e. we have to notify the remote admins
  - But we have to discover the root cause to get it solved
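
As an illustration of how the error families above might be triaged, here is a minimal sketch that groups glidein failures by site and by suspected component. It is not an existing glideinWMS tool: the stage names, the `(site, stage)` record format and the threshold are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from the stage at which a glidein failed to the
# error families named on the slide.
STAGE_TO_CATEGORY = {
    "authentication": "broken CE",
    "startup": "broken CE",
    "monitor": "broken CE",
    "validation": "broken worker node",
    "registration": "network problem",
}


def summarize_failures(records):
    """Count failures per site and category.

    `records` is an iterable of (site, stage) pairs; unknown stages are
    kept under "unclassified" rather than dropped.
    """
    summary = defaultdict(Counter)
    for site, stage in records:
        summary[site][STAGE_TO_CATEGORY.get(stage, "unclassified")] += 1
    return summary


def sites_to_notify(summary, threshold):
    """List (site, category, count) triples that reach the threshold, worst first."""
    hits = [
        (site, category, count)
        for site, counts in summary.items()
        for category, count in counts.items()
        if count >= threshold
    ]
    return sorted(hits, key=lambda hit: (-hit[2], hit[0], hit[1]))
```

A summary like this only narrows down where to look. Finding the root cause still requires reading the glidein logs and contacting the remote admins.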

Grid debugging 2/2: Many FTEs DC, if we had them
- The Grid is a difficult place to debug
  - Most sites are black boxes for us
- Luckily, glideins provide lots of info in the logs
  - When we get them... a broken site may not return anything useful, or anything at all
- Prodding the black box is often needed
  - Which is hard!
- And some problems may be VO specific, too

What else do we do?
- In order to make our life easier, we also
  - Host a test glideinWMS instance
  - Develop new helper tools
- The test glideinWMS instance allows us to discover problems early, thus both
  - Increasing user satisfaction
  - Reducing the time needed for debugging errors
- We create helper tools to suit our needs
  - And anything major we contribute back to glideinWMS

The test glideinWMS Instance
- The test glideinWMS instance contains both a G.Factory and a VO Frontend
  - This allows us to do end-to-end testing
- Major focus on the G.Factory, to test before deploying in production
  - New SW releases
  - New sites
  - New services on existing sites

Summary
- Operating a G.Factory is much more than keeping the G.Factory service alive
  - Indeed, this part takes an almost negligible amount of time
- Most effort goes into debugging Grid-related problems
  - At O(50k) CPUs, something is always broken somewhere
- Finally, providing expertise to help VO FE Admins is also an essential part of the job
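
The claim that something is always broken at this scale can be made concrete with a back-of-the-envelope calculation. The sketch below assumes glideins fail independently with the same probability `p_broken`; both the independence and any specific value of `p_broken` are illustrative assumptions, not measurements from the factory.

```python
import math


def prob_any_broken(n_glideins, p_broken):
    """Probability that at least one of n independent glideins is broken."""
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_glideins * math.log1p(-p_broken))


def expected_broken(n_glideins, p_broken):
    """Expected number of broken glideins at any one time."""
    return n_glideins * p_broken
```

With 50,000 glideins and an assumed per-glidein failure probability of 1 in 10,000, the expected number of broken glideins is about 5 and the chance of seeing at least one is about 0.993. In practice failures cluster by site rather than being independent, so this is only a rough argument for why the debugging load never drops to zero.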

Acknowledgments
This document was sponsored by grants from the US NSF and US DOE, and by the UC system