USCMS T2 Site Admin Toolkit
  Samir Cury
  MTF Meeting, May 26th, 2011
How it began
  OSG All Hands Meeting 2010, Fermilab
  Yearly T2 Workshop
    Gathering of site admins
    A lot of ideas/comments
    Some code: scripts
About site admins
  Frontline of site management
  On a daily basis they have:
    Many requests
    Many issues
    Many workarounds. What happens with these?
  Relevant feedback for CMS
    Lack of features in existing software
    Lack of monitoring in existing systems
      May lead to operating them blindly
  Is there always someone to listen?
    Thanks, Monitoring Task Force!
Workarounds
  From the previous slide: this toolkit is all about workarounds.
  Complaining is not always the best way
    It may never be implemented
    Not everyone will see the benefits/cost
    Different needs
  Developers do not always think about all user/ops needs
    Scripts are written to cover these needs
  These scripts can give ops a different approach
    Monitoring tools focused on admins' needs
    Can improve response time and error/waste detection
      Example: GridFTP Spy
      JobView / CPU Efficiency on T1s
  Not essential, but normally saves some time.
The goal
  What is really missing
    An official place for unofficial code
    People get encouraged to share
  Call for tools
    Get the generic ones -> package into RPM
    Get the specific ones -> turn into generic, then package into RPM
  Standard place (repository)
  Standard deploy procedure
    If it's not quick, no one tries -> RPMs
  Helping us to help ourselves.
What it is
  Full documentation/reference available :
    Where we document each tool included in the toolkit, future plans, etc.
  A gathering of scripts that may need some work to get working
    We try to avoid that by having RPMs with all dependencies included, either in the packages or in the repos.
  A free-time task for everyone involved
    We normally don't have schedules, but we do have a plan.
  Shameless “coders”: that's what we need!
    We don't care how “badly written” it is, as long as it works
What it certainly is not
  Something maintained by a lot of people
    Just a few who contribute tools
    One dependency-solver / packager (me)
      Would appreciate some help
  Something that will solve all the problems
    That is not the goal; the goal is just to put together specific tools
  Something that has “professional quality”
    The people involved are very capable, but proportionally time-constrained
What we can learn
  “Sites” can also generate some useful code
    They will probably write it for themselves, so don't expect
      High-quality code
      Something without a lot of dependencies
    Expect
      Tools that you can adapt for your site with little effort
      To contribute and make it better instead of complaining
  “Sites” should be shameless enough to publish (and send us) tools they find useful.
    Ken Bloom gave me a slot at a USCMS T2 support meeting to present the proposal; after that, some tools showed up. (Thanks, Ken!)
  T2 Coordinators could inform us when they see something useful in their support meetings, and also remind these sites that the toolkit is there
What I did learn
  Getting from a script to an RPM takes more work than I thought: many details, dependencies, etc.
  We will live better if we have a step before this : Toolkit
    People can download/edit from there; it is a shortcut for those who really want to spend some time understanding and deploying the tools that don't have an RPM yet.
    It helped me to patch Stale Data, improving the CLI
Tools we have right now
  CondorView (Caltech) - RPM ready
  GridFTP Spy (Caltech) - RPM ready
  Condor4Web (UERJ) - RPM ready
  Stale Data (Nebraska) - tested, needs packaging
  Condor Extract Mail (Nebraska) - to be tested
  dCache tools (Wisconsin) - to be tested
  Your tool here
CondorView
  GUI for managing Condor
  Lists every single job
  Can list ALL ClassAds for a given job
  Can do what you see in the menu
  Run from the cluster frontend
    Can SSH to the node, directly into the running job's temp dir
  Run from the site's CE
    Can kill/release/restart jobs
GridFTP Spy
  Shows active GridFTP transfers in near real time
  Very useful for optimizing link usage / server settings
  Somewhat tricky to deploy
    Needs a shared FS for harvesting logs
  It works by reading the logs in real time and gathering interesting info
  Never tested it myself: testers are welcome!
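The "read the logs in real time and gather interesting info" idea can be sketched as follows. This is a minimal illustration, not GridFTP Spy's actual code: the `KEY=value` line layout and the `DEST` and `NBYTES` field names are assumptions made for the example, and `follow`, `parse_line` and `bytes_per_destination` are hypothetical helper names.

```python
import time
from collections import defaultdict


def parse_line(line):
    """Split a 'KEY=value KEY=value' log line into a dict.

    The key=value layout is an assumption for this sketch.
    """
    fields = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def follow(path, poll_seconds=1.0):
    """Yield lines as they are appended to `path`, similar to `tail -f`."""
    with open(path) as handle:
        handle.seek(0, 2)  # jump to the end of the file; only new lines matter
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)


def bytes_per_destination(lines):
    """Sum transferred bytes per destination over an iterable of log lines."""
    totals = defaultdict(int)
    for line in lines:
        fields = parse_line(line)
        # DEST and NBYTES are hypothetical field names; lines without them are skipped.
        if "DEST" in fields and "NBYTES" in fields:
            totals[fields["DEST"]] += int(fields["NBYTES"])
    return dict(totals)
```

On a shared filesystem, one process per server log could feed `follow(path)` into a display loop, while `bytes_per_destination` works on any finite batch of lines already collected.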
Condor4Web
  Real-time batch system monitoring
  Visible from any corner of the world
  Your users like it
    They know what's going on with their jobs after the CE
  MC people like it
    For the same reason.
  Live demos :
  If you don't use Condor, try JobView : isOpsT2Monitoring
Stale Data
  Looks like the (un)popularity data service
  Shows which datasets nobody has run a single job against
  Tested and works fine; has a lot of dependencies, which should be included in the RPM
  Sample output:
    date = , Starting Date =
    Getting json
    Datasets idle since
    /JetMET/Run2010A-Dec4ReReco_v1/AOD, GB, Owned by AnalysisOps
    /G2Jets_Pt-20to60_TuneZ2_7TeV-alpgen/Fall10-START38_V12-v1/AODSIM, GB, Owned by top
    /W2Jets_ptW-0to100_TuneZ2_7TeV-alpgen-tauola/Fall10-START38_V12-v1/GEN, GB, Owned by DataOps
    /QCD6Jets_Pt120to280-alpgen/Spring10-START3X_V26_S09-v1/GEN-SIM-RECO, GB, Owned by top
    /W1Jets_ptW-800to1600_TuneD6T_7TeV-alpgen-tauola/Fall10-START38_V12-v1/AODSIM, GB, Owned by top
    (Suppressed)
    Space taken by stale datasets = TB
    Broken down by group:
      tracker-dpg =>
      top =>
      AnalysisOps =>
      undef =>
      FacOps =>
      b-tagging =>
      local =>
      DataOps =>
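The kind of report Stale Data produces (datasets idle since a date, then space broken down by owning group) can be sketched roughly as below. This is not the Nebraska script: the record layout (`name`, `size_gb`, `group`, `last_access`), the helper names, and all sample values are hypothetical and only illustrate the aggregation.

```python
from collections import defaultdict
from datetime import date


def stale_datasets(records, cutoff):
    """Return the records with no access on or after `cutoff`.

    A `last_access` of None is treated as "never accessed" (assumption).
    """
    return [
        r for r in records
        if r["last_access"] is None or r["last_access"] < cutoff
    ]


def space_by_group(records):
    """Sum dataset sizes (GB) per owning group; a missing group counts as 'undef'."""
    totals = defaultdict(float)
    for r in records:
        totals[r.get("group") or "undef"] += r["size_gb"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data, not real CMS datasets or sizes.
    records = [
        {"name": "/Example/A/AOD", "size_gb": 100.0, "group": "top",
         "last_access": date(2011, 1, 10)},
        {"name": "/Example/B/AODSIM", "size_gb": 250.0, "group": None,
         "last_access": None},
        {"name": "/Example/C/GEN", "size_gb": 50.0, "group": "top",
         "last_access": date(2011, 5, 1)},
    ]
    stale = stale_datasets(records, cutoff=date(2011, 3, 1))
    for r in stale:
        print(f"{r['name']}, {r['size_gb']} GB, Owned by {r['group'] or 'undef'}")
    for group, size in space_by_group(stale).items():
        print(f"{group} => {size} GB")
```

Keeping the filtering and the per-group sum as separate functions makes it easy to change the staleness rule without touching the report.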
“Condor Extract Mail”
  Fetches, from the grid proxies on your CEs, the email addresses of the users running jobs in your cluster
    ~]# ~bbockelm/extract_ "Bockelman"
How CMS can profit
  Better than the code: the ideas
    Usability: you may find here potential features for existing real software
    Adapt ideas or tools that deserve it into CMS central monitoring, like cmsweb
  Gives an overview of site admin needs and what they would like to see in the software they use.
    Some become patches, like Brian Bockelman's script
  The model / idea of a free software community is a good example to follow
    Small patches from many people turn small things into great ones.
  Share!
Thanks to all involved
  Ken Bloom, Michael Thomas: initial effort to set up and make everything public
  Authors that submitted tools :
    Caltech - Michael Thomas
      CondorView
      GridFTP Spy
    Nebraska - Carl Lundsted and Brian Bockelman
      Condor Extract Mail
      Stale Data
    Wisconsin - Will
      dCache Tools
    UERJ - Samir
      Condor4Web
Feel free to send :
  Tools
  Suggestions
  Help
  But first, we recommend some (small) reading here :
For the future
  2 trainees interested in helping (UERJ)
  Migrate YUM repos to CERN webservers
  Finish testing/packaging the tools we already have.
Contacts
Recommended toolkit
Thanks!